Utilizing the Logistic Regression Model in Analyzing the Categorical Data of Economic Effects

Abstract: Categorical data play a significant role in representing binary statistical variables. Such data are analysed by grouping the response variable into ordered categories, so that the dependent variable becomes a binary qualitative variable. Data on the financial position of the world's countries fall within this class. This work studies the economic effects of several individual-level factors on the classification of a selected population of countries as rich or poor, and builds a logistic regression model to estimate that classification. As the research sample, categorical data on the financial status of 20 Arab countries were drawn from the website of the World Bank (WB). For comparison, a similar set of categorical data was generated with MATLAB. The paper rests on two hypotheses: first, that well-known regression methods such as ordinary least squares or maximum likelihood are not accurate when the dependent variable is a binary qualitative variable; second, that the logistic regression model can serve as an alternative for achieving the paper's goal. For both the WB data and the MATLAB data, the results demonstrate the ability of the logistic regression model to handle categorical data and to estimate the coefficients of the corresponding regression models.


Introduction
Qualitative variables of binary value (0 or 1, yes or no) mostly follow from the nature of the variable itself (e.g., eye colour, black or blue; gender, male or female). Regression models for such variables cannot be estimated accurately with conventional regression methods such as Ordinary Least Squares (OLS), because these methods encounter several problems when the dependent variable is qualitative. The problems can be summarised as multicollinearity, autocorrelation and non-homogeneous variance [1][2][3B][4][5][6].
Alternatively, the binary-response logistic regression model is regarded as the most suitable model for overcoming these obstacles. In logistic regression, the predicted dependent variable is expressed as a function of the probability that a certain event falls into one of two categories, commonly labelled true or false (one or zero). In practice, a conventional regression model cannot be fitted directly to binary data. The logistic regression model (LGM) therefore offers a mathematical solution by using a logarithmic function called the "logit". This function acts as a transfer function that maps the probability of a binary event onto non-binary regression values [16][17][18][19][20]:
$$Y = \operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$
where $p$ is the probability that the logistic regression value is at logic "one", meaning that a certain event is true. In contrast, $1 - p$ is the probability that the logistic regression value is "zero", i.e., that the event is false. Accordingly, the value of the dependent variable "Y" ranges from negative infinity, as p approaches 0, to positive infinity, as p approaches 1. It then becomes predictable by conventional regression methods such as OLS or maximum likelihood (ML) [2][8][9][10][11][12]. So the logit can be written as a linear function of the predictors:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_5 X_5$$
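As a minimal numerical sketch of the logit transfer function described above, the following Python fragment maps a probability to its log-odds and back. The function names are illustrative assumptions and are not taken from the paper or from any statistical package.

```python
import math

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(y):
    """Map a log-odds value y back to a probability between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |y|.
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)

# The closer p is to 0 or 1, the larger the magnitude of the logit.
for p in (0.001, 0.5, 0.999):
    y = logit(p)
    print(f"p={p:<6} logit={y:+.3f} back={inverse_logit(y):.3f}")
```

The loop illustrates the range argument made in the text: probabilities near zero give large negative logits, probabilities near one give large positive logits, and p = 0.5 corresponds to a logit of zero.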

Description of the data
The economic data used in this work were drawn from the World Bank (WB) website for the year 2019. The WB publishes on its website a per capita Gross Domestic Product (GDP) matrix, which breaks down the domestic economic output of the world's countries per person, relative to each country's population [7]. Twenty Arab countries of the Middle East were selected from this GDP matrix for the study.
The data under study have a binary dependent variable, "Economic Status", which equals 1 if a citizen of the country has an annual income of more than 15 thousand USD and 0 otherwise. There are five predictors, X1, X2, X3, X4 and X5, which describe, respectively, the person's annual income, life rate, school life, unemployment condition and continental location (1 for Asia and 0 for Africa). The predictors X2 through X5 explicitly affect the value of X1, which was taken in this work as the basis for defining the status of the dependent variable. Figure 1 is a descriptive block diagram illustrating the stages of the proposed logistic regression model. The predictors X1 through X4, which take continuous values, are fed to the logistic decision block. The predictor X5 (labelled "cont." and represented by the binary values "0" and "1") is also fed to this block. The output of the logistic decision block is the dependent variable in its binary form, "0" or "1". This output is the input of the "log function" block, which widens the range of the dependent variable "Y" to (−∞, +∞). After this range transformation, the values of "Y" are ready to be processed by the likelihood estimation block, whose output is the required logistic regression model.
Fig. 1 The proposed logistic regression model
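The coding of the binary response and of the continental predictor described in this section can be sketched as follows. The threshold and the Asia/Africa coding come from the text; the function names and the three sample rows are hypothetical and are not taken from the World Bank data in the appendices.

```python
INCOME_THRESHOLD_USD = 15_000  # threshold stated in the text

def economic_status(annual_income_usd):
    """Return 1 ("rich") if income exceeds the threshold, else 0 ("poor")."""
    return 1 if annual_income_usd > INCOME_THRESHOLD_USD else 0

def continent_code(continent):
    """Encode the X5 predictor: 1 for Asia, 0 for Africa."""
    codes = {"Asia": 1, "Africa": 0}
    return codes[continent]

# Hypothetical rows, NOT taken from the data in the appendices.
rows = [
    {"income": 42_000, "continent": "Asia"},
    {"income": 3_500, "continent": "Africa"},
    {"income": 15_000, "continent": "Asia"},  # exactly at the threshold: coded 0
]
for row in rows:
    status = economic_status(row["income"])
    x5 = continent_code(row["continent"])
    print(row["income"], x5, status)
```

Because the rule is "more than 15 thousand USD", an income exactly at the threshold is coded as 0 in this sketch.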

Results and Discussion
The data used in this study are given in appendices 1 and 2. They were processed with the statistical software SPSS. The SPSS output for the data of appendix 1 is shown in the tables of figures 2 through 6 and discussed below.
Figure 2 shows the sample size used in this work. It indicates that the data of all twenty countries concerned in this study were processed and that no input data are missing.
Figure 5 shows the iteration history of the estimation of the predictor coefficients. The proposed model was constructed by an iterative maximum likelihood (ML) procedure. The initial values of the regression coefficients, βs, were chosen arbitrarily. In each iteration, SPSS produced new, more accurate values of the regression coefficients, so that the likelihood of the observed data became greater under the new coefficients. The iterations continued until the model converged, meaning that the differences between the previous and current coefficient values became negligible. The iteration history table shows that the coefficient estimation proceeded for 20 steps. The table also shows the deviance statistic (-2LL), obtained by multiplying the natural logarithm of the likelihood by -2. It is a criterion of how good the coefficient estimates are and, correspondingly, of how well the logistic regression model fits the data: the smaller the value of this statistic, the better the estimation of the predictor coefficients [1][13][14][15].
The estimated model, equation (3), can be summarised by the following points:
• The negative coefficients of the predictors X2, X3 and X4 mean that they reduce the tendency of the output towards state "1" ("richness"). Increasing X2, X3 or X4 by one unit reduces the logit by 2.571, 0.474 and 2.511, respectively.
• In contrast, the predictors X1 and X5 have a positive effect on bringing the output up to state "1". Increasing each of these predictors by one unit raises the logit by 0.002 and 1.695, respectively.
• From the above two points, it is clear that the person's life period and unemployment condition strongly reduce the output of the logistic model. Conversely, the output rises with the person's annual income and with an Asian geographic position of the country.
• Because the countries were categorised exactly into poor ("0") and rich ("1"), without any percentage of error, the Wald test statistic equals zero for all estimated coefficients.
The table in figure 7, in turn, shows the estimated coefficients of the logistic regression model for the data generated by MATLAB simulation.
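The iterative maximum likelihood estimation and the deviance statistic (-2LL) discussed in this section can be illustrated with a small sketch. It uses Newton-Raphson updates for a model with a single predictor and an intercept, which is one common way of carrying out such an iteration and is not claimed to reproduce the exact SPSS procedure. The data are synthetic and all names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Inverse logit, written to avoid overflow in math.exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def deviance(b0, b1, xs, ys):
    """-2 * log-likelihood of the model logit(p) = b0 + b1 * x."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * ll

def fit_logistic(xs, ys, max_iter=20, tol=1e-8):
    """Newton-Raphson maximum likelihood for one predictor plus intercept."""
    b0, b1 = 0.0, 0.0  # arbitrary starting values
    history = [deviance(b0, b1, xs, ys)]
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w             # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det  # solve the 2x2 Newton system
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        history.append(deviance(b0, b1, xs, ys))
        if abs(d0) < tol and abs(d1) < tol:
            break  # coefficient changes are negligible: converged
    return b0, b1, history

# Synthetic, deliberately overlapping data (not the paper's data).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, history = fit_logistic(xs, ys)
print("intercept:", b0, "slope:", b1)
print("-2LL by iteration:", [round(d, 4) for d in history])
```

The two classes in the synthetic data overlap on purpose. When a predictor separates the two classes perfectly, the likelihood has no finite maximum, so an iteration of this kind keeps enlarging the coefficients until it reaches the step limit instead of converging.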

Verification
To verify the validity of the logistic regression equation given in (3), the data of the first country, for instance, are substituted into it, which yields the logit value for that country. Taking the inverse of the logit function then yields the corresponding probability, equation (6). The result of equation (6) holds if and only if the probability (p) that the country is rich is very close to zero. This means that the country whose data were substituted into the logistic regression equation (equation 3) is more likely to be a poor country. The regression result therefore fits the data of the first country given in appendix 1 well.
Similarly, if the data of the second country are substituted into the logistic regression equation, a value of (p) very close to one is obtained. This country is therefore more likely to be rich, which is consistent with its data.
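The substitution check described above can be sketched in Python. The five slope coefficients are those quoted in the Results section. The intercept, the two sets of predictor values and the 0.5 classification cutoff are hypothetical placeholders chosen only to make the sketch runnable; they are not the values of equation (3) or of appendix 1.

```python
import math

# Slope coefficients for X1..X5 as quoted in the Results section.
SLOPES = (0.002, -2.571, -0.474, -2.511, 1.695)
# Placeholder intercept: the constant of equation (3) is not given in the
# text, so this value is purely hypothetical.
HYPOTHETICAL_INTERCEPT = 150.0
CUTOFF = 0.5  # assumed classification cutoff

def linear_predictor(x, intercept=HYPOTHETICAL_INTERCEPT, slopes=SLOPES):
    """Logit value: intercept plus the sum of slope * predictor."""
    return intercept + sum(b * v for b, v in zip(slopes, x))

def probability_rich(x):
    """Inverse logit of the linear predictor."""
    y = linear_predictor(x)
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)

def classify(x):
    """Return 1 ("rich") if the probability exceeds the cutoff, else 0."""
    return 1 if probability_rich(x) > CUTOFF else 0

# Hypothetical predictor values (X1..X5), NOT taken from appendix 1.
country_a = (3000.0, 70.0, 10.0, 10.0, 0)
country_b = (45000.0, 75.0, 12.0, 3.0, 1)

for name, x in (("A", country_a), ("B", country_b)):
    print(name, linear_predictor(x), probability_rich(x), classify(x))
```

A strongly negative logit gives a probability close to zero and a "poor" classification, while a strongly positive logit gives a probability close to one and a "rich" classification, which mirrors the two checks described in the text.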

Conclusions
The paper sheds light on the difficulties that researchers may encounter when they apply traditional regression tools to binary data. In addition, its results confirm the ability of the logistic regression model to deal with binary qualitative variables and to estimate the coefficients of the regression model accurately. It can be concluded that the logistic regression model is convenient for modelling binary data because of its simplicity and its high explanatory power. Comparing the regression models given by equations 3 and 4 shows that the values of the estimated coefficients, and their effects, differ according to the data being processed.