A Study of Multicollinearity Detection and Rectification under Missing Values

_______________________________________________________________________
Abstract
In this paper, the consequences of missing observations on data-based multicollinearity are analyzed. Different missing values have different effects on multicollinearity in a multiple regression model. To ascertain the relationship between multicollinearity and missing values, collinear effects were studied on two types of missingness: monotone and arbitrary. The response of multicollinearity to each missing-value pattern was then compared on the same data. It was found that tolerance and the variance inflation factor fluctuate as information is lost from the sample at different percentages of missing values. The more values are missing from a sample, whether obtained from population statistics or from a survey, the more multicollinearity is found in the multiple regression system: as missingness increases, the tolerance level decreases sharply for both the monotone and arbitrary types, as observed in the analysis.
______________________________________________________________________


Introduction
Today missing data is more challenging than ever before, owing to rapid advances in computational technology coupled with current statistical techniques that enhance the analysis of variance. The growing demand from governments, industries, and non-governmental organizations for accuracy and reliability in implementing and executing their policies makes this a topic that demands attention, and it calls for further study of missingness and multicollinearity so that errors can be minimized (Peugh and Enders, 2004). The absence of complete or partial information, whether from nonresponse or from experimental research, is problematic in statistics because it affects the result and can render it invalid. Nearly all standard statistical procedures require complete data to operate at high accuracy, yet in most cases they make no provision for handling missing information; missing observations therefore shrink the sample size, which in turn directly increases the standard error of the analysis (Soley-Bori, 2013). Missingness likewise affects the precision of confidence intervals and commonly leads to Type I error in the analysis (SAS Institute, 2005). Missing information is a serious problem in both manual and electronic databases and is sometimes responsible for making statistical packages fail (Salgado et al., 2016).
Missing values can occur at any time in any experiment, survey, or population study, no matter how well it is designed and implemented. Missing data undermines the efficiency and precision of research investigations and introduces instability and unreliability into the processing of, and final inference from, any statistical investigation (Soley-Bori, 2013). The statistical power of a model is compromised by the absence of information needed to reflect all the realities involved in the analysis. Similarly, missing values produce biased estimates and hence invalid interpretations, owing to errors originating from incomplete recording during data collection in the survey field (Korean J. Anesthesiol., 2013). Missing information occurs when no value is recorded for a variable in an observation. It is a common phenomenon in statistics and has a significant effect on the conclusions of an analysis and on its final precision (Acock, 2005). Imputation procedures are normally used to correct for missing information, but they require a careful study of the data pattern, the nature of the missingness, and the sequence in which the available information arose; where possible, it is also important to trace the reasons for the missingness before attempting to correct it (Pourahmadi, 1989). Researchers in empirical fields are now putting serious effort into treating the problems arising from missing data, because it is clear that during a statistical survey some selected respondents may voluntarily refuse to give out information, especially private information, and such nonresponse always affects the survey negatively (Graham, 2009).
The best remedy for missingness is not to allow it to happen at all: a surveyor should make a careful design and ensure good execution of the entire research procedure in the field, because none of the statistical adjustments or imputations applied once missingness occurs will ever be as good as the original observations (Allison, 2001).

Multicollinearity
Multicollinearity is a state of high inter-correlation among the explanatory variables in a multiple regression equation, due to the existence of linear relationships among the variables in the model (Farrar and Glauber, 1981). To provide more reliable inferences when results are disseminated, this study carefully examines the effect of missing values, which reduce the representativeness of the sample drawn from the population and, through the shrinkage of the sample size, have a direct influence on multicollinearity (Daoud, 2017; O'Hagan and McCabe, 1975).
Farrar and Glauber (1969) studied the severity of multicollinearity, categorizing the linear relationships among explanatory variables as non-harmful, medium, or severe. Haitovsky (1969), in his paper "Multicollinearity in Regression Analysis," examined the existence or non-existence of multicollinearity in a system of multiple regression analysis. In 1975, Farrar and Glauber proposed criteria for detecting the presence of multicollinearity, identifying which regressor variables are collinear and characterizing the nature of the multicollinearity using chi-square, F-tests, and t-tests respectively. O'Hagan and McCabe (1975) studied tests for the severity of multicollinearity in regression analysis. In this paper we study the implications of multicollinearity in the presence of missing values, as a function of the percentage of missing information. It was observed that larger proportions of missing values are associated with higher multicollinearity in the multiple regression system.
The paper is organized as follows. Section 1 explains the concepts of multicollinearity and missing values together. Section 2 discusses missing-value mechanisms, the types and classes of missing values, and the reasons data may be missing from either the dependent or the independent variables. Section 3 deals with the principles of obtaining the dependent variable, the explanatory variables, and the coefficients of the independent variables in a complete multiple regression analysis. Section 4 provides the statement of the problem, the pattern and class of the missing values, and the imputation techniques used to correct the missing information. Section 5 detects multicollinearity under monotone and arbitrary types of missing data at different levels and percentages of missingness, and presents graphs visualizing the joint effects of multicollinearity and missing data by percentage. Section 6 concludes the study, finding that multicollinearity and missing values together have a negative, doubly damaging effect on the multiple regression system.

Missing-Value Mechanisms
Data can go missing for different reasons at the time of investigation or during data mining and processing. Whatever the reason, it must be handled scientifically to avoid the adverse effects of any missing information that occurs along the way (Soley-Bori, 2013). Scientists working in data processing and visualization take care that missing values are properly addressed so that the sample to be visualized or analyzed is not depleted. Many methods exist to replace missing values, but the choice always depends on the nature and pattern of the missingness.

2.2 Missing Value Completely at Random
An explanatory variable is missing completely at random (MCAR) only if every explanatory and response variable in the model has the same, equal chance of being missed along the process. In this case, deleting or ignoring the affected rows or columns does not bias the final inference drawn from the analysis, and as the sample size grows the least-squares coefficients remain consistent and unbiased. Even so, efficiency, which measures the optimality of the estimator, still suffers from the loss of information (Allison, 2001; Briggs et al., 2003).

2.3 Missing Value at Random
Here the probability that a response or predictor value is missing depends only on the information that was actually observed. This can be linked to a logistic regression in which an indicator takes the value 1 or 0 according to whether a value is available or missing (Pourahmadi, 1989). When an explanatory variable is missing at random (MAR), it is acceptable to exclude the missing cases, by row or by column, from the investigation, provided the multiple regression model controls for all variables that affect the probability of the missing observation. MAR is a much more realistic assumption under which to study the performance and accuracy of recovery procedures, because it defines the probability of missingness in terms of the set of observed responses.
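The MAR mechanism and its link to a 0/1 logistic indicator can be sketched as follows. This is an illustrative simulation on synthetic data, not the paper's survey: the probability that x2 is missing depends only on the observed value of x1, never on x2 itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sketch of a MAR mechanism (hypothetical data, not from the paper).
n = 1000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)

# P(x2 missing) rises with the *observed* x1 via a logistic model; it never
# depends on x2 itself -- this is exactly the MAR condition described above.
p_missing = 1.0 / (1.0 + np.exp(-(x1 - 1.0)))
missing = rng.uniform(size=n) < p_missing          # the 0/1 missingness indicator
x2_observed = np.where(missing, np.nan, x2)

print(np.isnan(x2_observed).sum(), "of", n, "x2 values are missing at random")
```

Because missingness is fully determined by the observed x1, a model that conditions on x1 can exclude the missing cases without bias, as the text notes.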

2.4 Missing Value Due to Unobserved Predictors
Here the process is no longer random: missingness depends on information that was not recorded but that would predict the missing observation (Kang, 2013). If the missing information is not at random, the process must be modeled explicitly, or else bias in the interpretation is unavoidable (Graham, 2009).

2.5 Missing Value Depending on the Missingness Itself
This is where the missing of one piece of information leads to further missing information in some response and predictor variables (von Hippel, 2004). A difficult situation normally arises when the probability of missingness depends on the potentially missing value itself, i.e., on the very variable that was supposed to be in the sample.

2.6 Reasons for Missingness of Data
Considering the main reasons for missing data gives a preview of the best mechanism to adopt for systematic recovery of the missing information. Different cases have different reasons for missingness, arising anywhere from the design of a survey to data mining activities, the recording phase, analysis, and interpretation. A value may be missing because it was forgotten or lost, because it was not applicable at that instance, because of lack of interest at the point of recording, or because the variable was measured but not recorded due to identified or unidentified technical errors in the database (e.g., disconnection of sensors, errors in communicating the value, accidental human omission, electricity failures) (Young, Weckman, and Holland, 2011).
A well-established distinction is drawn between data missing for identifiable reasons and data missing for unidentifiable reasons; this redefines the status of the missing information as recoverable or not recoverable. When information is missed for unidentified reasons, the missingness is normally assumed to be random and unintended, and such missingness is classified as recoverable; conversely, when the reasons for the missingness are identified, it is usually not recoverable. The nature of the missingness and the assumptions made about it are often used to determine which methods should be employed to recover the missing information.

Materials and Methods
The standard linear regression model involves a dependent variable (Y), independent variables (Xi), coefficients of the explanatory variables (βi), and a statistical error term. For two predictors, the regression coefficients β1 and β2 are calculated from the pairwise correlations and standard deviations as follows:

β1 = [(ry,x1 − ry,x2 · rx1,x2) / (1 − (rx1,x2)²)] · (SDy / SDx1)
β2 = [(ry,x2 − ry,x1 · rx1,x2) / (1 − (rx1,x2)²)] · (SDy / SDx2)

where:
ry,x1 = correlation between blood pressure and age x1;
ry,x2 = correlation between blood pressure and weight x2;
rx1,x2 = correlation between age x1 and weight x2;
(rx1,x2)² = the coefficient of determination (r squared) between age x1 and weight x2;
SDy = standard deviation of the dependent variable Y (and SDx1, SDx2 those of x1 and x2).
To find the coefficient of determination for the two-predictor model we have:

R² = [(ry,x1)² + (ry,x2)² − 2 · ry,x1 · ry,x2 · rx1,x2] / (1 − (rx1,x2)²)

where ry,x1 is again the correlation between blood pressure and age x1.
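As a worked illustration of the two-predictor formulas above, the following sketch uses hypothetical correlations and standard deviations (the paper's blood pressure, age, and weight data are not reproduced here):

```python
# Hypothetical inputs (illustrative only, not the paper's dataset).
r_y_x1, r_y_x2, r_x1_x2 = 0.70, 0.60, 0.50   # pairwise correlations
sd_y, sd_x1, sd_x2 = 15.0, 10.0, 20.0        # standard deviations

denom = 1.0 - r_x1_x2 ** 2                   # 1 - (r_{x1,x2})^2

# Regression coefficients beta1 and beta2 from the formulas above:
beta1 = (r_y_x1 - r_y_x2 * r_x1_x2) / denom * (sd_y / sd_x1)
beta2 = (r_y_x2 - r_y_x1 * r_x1_x2) / denom * (sd_y / sd_x2)

# Coefficient of determination for the two-predictor model:
r_squared = (r_y_x1**2 + r_y_x2**2 - 2*r_y_x1*r_y_x2*r_x1_x2) / denom

print(beta1, beta2, round(r_squared, 4))  # beta1 = 0.8, beta2 = 0.25
```

Note that R² here also equals the sum of each standardized coefficient times its correlation with y, which is a quick consistency check on the two formulas.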

Variance Inflation Factor (VIF)
The variance inflation factor is computed to explore the degree of linear relationship that exists among the variables (Johnston 1972; Green 1990; Kroll et al. 2004). Each explanatory variable is regressed against the remaining explanatory variables, and the VIF is calculated as:

VIFj = 1 / (1 − Rj²)

Where Rj² is the coefficient of determination of that auxiliary regression model (Rawlings et al. 1998). A VIF greater than 10 is a common threshold for detecting severe multicollinearity. The variance inflation factor measures how much the sample variance of a coefficient is inflated by the dependence among the variables in the system. Tolerance is one minus the proportion of variance an explanatory variable shares with the other independent variables, i.e., the share of its variance not accounted for by the other predictors; it is the measure of multicollinearity reported by statistical packages such as SPSS. A variable's tolerance is given by:

Tolerancej = 1 − Rj²     (7)

Investigators and researchers, especially those working from a statistical point of view, do not doubt the challenges posed by the joint presence of multicollinearity and missing values in any statistical project, be it a survey, a population study, a statistical analysis, or any other area where statistical evaluation is demanded. Accuracy and precision are in ever higher demand from governments, industries, and non-governmental organizations at national, international, and global levels, for the achievement and delivery of their aims and the execution of their policies. Unfortunately, these two phenomena bring instability and raise the standard error, which harms estimation and prediction and finally undermines the reliability of the inference drawn from the statistical procedures.
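The tolerance and VIF computation described above can be sketched as follows. This is an illustrative implementation on synthetic data, not the paper's dataset: each predictor is regressed on the remaining predictors, and that auxiliary R² gives tolerance = 1 − R² and VIF = 1/tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic predictors (hypothetical): x2 is deliberately correlated with x1,
# while x3 is independent, so x1 and x2 should show the larger VIFs.
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif_and_tolerance(X):
    """For each column j: regress X[:, j] on the other columns (plus an
    intercept), then return (tolerance_j, VIF_j) = (1 - R_j^2, 1/(1 - R_j^2))."""
    n, p = X.shape
    results = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])          # intercept + others
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        tol = 1.0 - r2
        results.append((tol, 1.0 / tol))
    return results

for j, (tol, vif) in enumerate(vif_and_tolerance(X), start=1):
    print(f"x{j}: tolerance={tol:.3f}  VIF={vif:.3f}")
```

The independent predictor x3 should show tolerance near 1 and VIF near 1, while the correlated pair x1, x2 shows the inflation the text describes.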

Statement of the problem and the data to be analyzed
The study is aimed at investigating how multicollinearity is related either directly

3.3 Patterns of Missingness
The nature of the missing pattern influences the stability of the analysis, because some missing values are recoverable and some are not, depending on the type of sample available. Predictions and estimates are most stable when there is no missing value at all, leaving the power of the prediction mechanism unaltered. The missing-value pattern has its own influence on the sample size, and multicollinearity in turn responds to the missingness, because missingness has a direct effect on the number of observations available for each variable.

3.4 Monotone Type of Missing Values
This pattern is generated by a sequential, univariate mechanism and normally occurs column-wise, leaving a section of some columns incomplete, especially toward the end. The missing information follows a fixed series of positions (Rubin, 1987b), and this type of missing data typically occurs in longitudinal studies with drop-out of a section of the information (SAS Institute, 2005).

Obs   X1   X2    X3    X4
 1    85   79   140   140
 2    45   62   120   103
 3    60   54   190    94
 4    92   88   104
 5    44   76   235   125
 6    85   79   140     -
 7    45   62     -     -
 8    60   54     -     -
 9    92   88     -     -
10    44    -     -     -
11    45    -     -     -
12     -    -     -     -

Sample of the monotone missingness of data
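A monotone pattern can be checked programmatically: once a value is missing in a row, every later column in that row must be missing too. The sketch below encodes a few rows in the spirit of the sample table above, with NaN standing in for "-" (an illustration, not a full reproduction of the table).

```python
import numpy as np

nan = np.nan
# Rows adapted from the monotone sample above ("-" encoded as NaN).
data = np.array([
    [85, 79, 140, 140],
    [45, 62, 120, 103],
    [60, 54, 190,  94],
    [44, 76, 235, 125],
    [85, 79, 140, nan],
    [45, 62, nan, nan],
    [60, 54, nan, nan],
    [92, 88, nan, nan],
    [44, nan, nan, nan],
    [45, nan, nan, nan],
    [nan, nan, nan, nan],
])

def is_monotone(data):
    """True if, within every row, the missing flags are non-decreasing left to
    right -- i.e. once a value is missing, all later columns are missing too."""
    miss = np.isnan(data)
    return bool(np.all(miss[:, :-1] <= miss[:, 1:]))

print(is_monotone(data))
```

An arbitrary pattern, by contrast, would fail this check, because its NaNs can appear in any position.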

4.0 Remedies for Missing Values
Many procedures are used to handle missingness, such as deletion, ignoring, imputation, and model-based methods (regression, multiple imputation, k-nearest neighbors) (Salgado et al., 2016). In many cases, if the amount of missingness in the data to be analyzed is small, very roughly less than 5% of the total number of response or explanatory values, and all the missing values occur at random, that is, the missing information does not depend on another value, then listwise deletion is relatively "safe": the whole row or column affected by the missing information is simply deleted or ignored. But if the number of missing values in a large data sample is large, the best resolution is the imputation principle, which is based on prediction and estimation from the information available in the original data (Susanti and Aziza, 2017). Single imputation methods involve techniques such as mean/mode substitution, linear interpolation, hot deck, and cold deck (Soley-Bori, 2013). This process completes the information on each of the variables involved, and because the effective sample size is improved it automatically reduces the effect of multicollinearity in the data (Pourahmadi, 1989).
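The two simplest remedies named above, listwise deletion and single (mean) imputation, can be sketched in a few lines. This is a toy column with one missing entry, illustrative only, not the paper's data.

```python
import numpy as np

# Toy column with one missing value (hypothetical data).
x = np.array([85.0, 45.0, 60.0, np.nan, 44.0])

# Listwise deletion: drop every case with a missing value.
deleted = x[~np.isnan(x)]

# Single (mean) imputation: replace the missing cell with the observed mean.
mean_observed = np.nanmean(x)                       # (85+45+60+44)/4 = 58.5
imputed = np.where(np.isnan(x), mean_observed, x)

print(deleted.size, mean_observed, imputed.tolist())
```

Deletion shrinks the sample from 5 to 4 cases, while mean imputation keeps all 5 but, as Section 4.2 notes, tends to bias variances and covariances.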
When a cell or cells are missing, that is, a particular value of a variable is not available at all, excluding or removing that unit from the entire analysis is the most common way to avoid the negative consequences of the missing variable. This is usually the default of statistical packages and procedures (Briggs et al., 2003).

4.1 Two Main Imputation Techniques
Imputation criteria include neural network methods, Bayesian networks, and regression processes; the choice always depends on the nature of the missing value and the sample data available (Nakai and Weiming, 2011; Soley-Bori, 2013).

4.2 Marginal Mean Imputation Criteria
This always involves taking the marginal values around the missing value and computing the arithmetic mean to cover the missing information. It normally leads to biased estimates of the variances and covariances, which is generally undesirable because it affects the inference during analysis.

4.3 Conditional Mean Imputation
This procedure is used when particular values are missing from among the explanatory variables to be analyzed: the available variables are used to estimate the missing value via a multiple regression model, recovering it with higher accuracy and precision (Briggs et al., 2003). The procedure maintains higher fidelity to the source of the data and minimizes the chance of Type II error in the system (Allison, 2001).
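Conditional mean (regression) imputation, as just described, can be sketched as follows: fit a regression on the complete cases, then predict the missing value from its observed covariate. The data here are toy numbers assuming an approximately linear relationship, not the paper's dataset.

```python
import numpy as np

# Toy data (hypothetical): y has one missing value at x = 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, np.nan, 10.1])

# Fit y ~ x on the complete cases only.
complete = ~np.isnan(y)
A = np.column_stack([np.ones(complete.sum()), x[complete]])   # intercept + x
coef, *_ = np.linalg.lstsq(A, y[complete], rcond=None)        # [intercept, slope]

# Predict the missing y from its observed x (the conditional mean).
y_imputed = y.copy()
y_imputed[~complete] = coef[0] + coef[1] * x[~complete]
print(y_imputed)
```

Unlike marginal mean imputation, the imputed value here respects the linear relationship between x and y rather than pulling it toward the overall mean.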

5.0 Detecting Multicollinearity
Among the measures used for detecting multicollinearity arising from missing values, we consider tolerance and the variance inflation factor; both indicators were computed for the regression parameters of all response and predictor variables present in the regression model. Collinear relationships were revealed at different levels of missing values of both the monotone and arbitrary types, and we closely monitored the level of deterioration as the percentage of missing values increased.
Across the tables and percentages of missing values, the variance inflation factors show steadily inflated parameter variances, the tolerance level falls as the percentage of missing values accelerates, and the standard errors of all estimates rise accordingly. As discussed earlier, the monotone type of missing information occurs at the end of the table in a particular pattern, unlike the arbitrary type, which occurs at random with no specific pattern. Table 01 indicates which explanatory variable is most redundant, namely X1: its low tolerance of 0.805 measures the share of variability in X1 not accounted for by the other predictors in the system, so X1 is affected by multicollinearity more than any other explanatory variable in the model, and it equally has the highest variance inflation factor, 1.243, indicating how much its variance is inflated.

Since multicollinearity measures linear relationships among response and explanatory variables that are moderately or highly correlated, whether from data-based or structural sources, Table 01, based on the t-statistics and the history of the correlations, suggests eliminating X1 to remove the existing multicollinearity from the system. In Table 02 we introduced 5% missingness of the monotone type; due to the sudden loss of about 5% of the data, the tolerance fluctuates and deteriorates among all the predictor variables: the tolerance of X1 changes from 0.805 to 0.793, meaning about 1.5% of the tolerance was lost to 5% monotone missingness. Meanwhile the VIF changes from 1.243 to 1.261, a 1.4% increase in the variance of the estimated regression coefficient, raising the moderately observed collinear effect among the predictor variables. The results continue to change from one step to the next: Table 04 shows at least a 1% change resulting from the move from 10% to 15% missing values, with the variance inflation values increasing by a little under 1%. In Table 05, variable X6's tolerance changes from 0.855 at 0% missing values, indicating low multicollinearity, to 0.806, a lower tolerance giving way to higher multicollinearity than before, once missing values reach 20%; the variance inflation factor follows the same trend.
Table 07 shows that multicollinearity is contributed to more by the explanatory variable X1 than by the other predictors in the system, owing to X1's small tolerance of about 0.805 and, in the same vein, its variance inflation factor of up to 1.243, the highest in the model, indicating more variability than the others. Overall, Table 07 shows promising tolerance values, from 0.805 up to 0.979, falling in the high-tolerance range and allowing only a slight presence of collinearity, because there are 0% missing values in the system. In Table 11, variable X6's tolerance changes from 0.855 at 0% missing values, indicating low multicollinearity, to 0.677, a lower tolerance giving way to higher multicollinearity than before, once arbitrary missing values reach 20%; the variance inflation factor follows the same trend.
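The kind of experiment the tables report can be sketched on synthetic data (the paper's Tables 01–13 are not reproduced here): compute the tolerance of one predictor on the full sample, then again after listwise deletion of roughly 20% of the rows, and compare.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic predictors (hypothetical): x2 is correlated with x1.
n = 300
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + 0.7 * rng.normal(size=n)
x3 = rng.normal(size=n)

def tolerance_of_x1(x1, x2, x3):
    # Regress x1 on the other predictors (with intercept); tolerance = 1 - R^2.
    A = np.column_stack([np.ones(x1.size), x2, x3])
    coef, *_ = np.linalg.lstsq(A, x1, rcond=None)
    resid = x1 - A @ coef
    r2 = 1.0 - resid.var() / x1.var()
    return 1.0 - r2

# Listwise-delete a random ~20% of the cases and recompute the tolerance.
keep = rng.uniform(size=n) >= 0.20
tol_full = tolerance_of_x1(x1, x2, x3)
tol_reduced = tolerance_of_x1(x1[keep], x2[keep], x3[keep])
print(round(tol_full, 3), round(tol_reduced, 3))
```

On any one random draw the reduced-sample tolerance may move in either direction; the paper's claim concerns the trend it observed across its own tables at increasing missingness percentages.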

Fig. 03 / Table 14: Variance inflation factor at different percentages of the arbitrary type of missing values
As Figure 03 shows vividly, the higher the proportion of missing information, the more the results deteriorate on the graph. It was observed that information missing at random has a greater effect on variance inflation factors than the same information missing not at random, even though both are now easier to handle thanks to advances in computation and the widespread use of statistical software. There is some difficulty in defining the nature and potential impact of the missing information, so it cannot be ruled out entirely; generally, one must make an assumption by checking proper references to other studies that were done in practice. For instance, an extensive follow-up in a survey, carried out to ascertain the real earnings of a respondent who was absent on a previous visit, can cover the shortcoming of the missing information on that respondent. In such a survey, nonresponse to the earnings question typically depends on characteristics such as education, race, religion, and gender, and none of this is consistent with the assumption that the nonresponse probability is constant.

Conclusion and Discussion
Multicollinearity and missing values both have a great influence on the linear relationships that exist between the response and explanatory variables in a well-balanced linear regression model. The findings of this study show that missing values affect the correlations, and higher correlation directly indicates a stronger presence of multicollinearity.
Both multicollinearity and missing values are affected by the mode of recording of the data, by human error, and by the heterogeneity of the sample taken during a survey. When dealing with such variables in data mining and cleaning, a data scientist is therefore advised to use robust evaluation techniques when selecting an imputation method to handle the missing information. This paper shows categorically that no amount of missing values is too small: any missingness can change the correlations, the tolerance, and the variance inflation factors, which in turn affect the linear relationships among the response and predictor variables and can end up producing severe multicollinearity. It was also established that multicollinearity and missing values of whatever type have a direct effect on the linear relationships between variables, such that as they increase they always add to the error of the estimated statistics.
To take proper care of the shortcomings brought about by multicollinearity and missing values, imputation of the missing values, via a linear combination of the existing values that predicts the missingness, is essential. When analyzing data with missing values that are expected to exhibit multicollinearity, be it small, moderate, or severe, one has to study the pattern, nature, and causes of the missing values in order to handle them effectively; otherwise the double tragedy of collinearity and missingness together will affect the reliability of all the statistics involved. Other obstacles, including the over- and underestimation that produce biased results, must always be taken care of to ensure maximum accuracy and reliability of the estimated statistics.
The tables above show that missing values lower the tolerance level and increase the level of multicollinearity in the system: the more data are missed, the greater the chance of an increase in the collinear relationships. It is therefore established that missing values cause multicollinearity, and the larger the proportion of missing values, the higher the presence of multicollinearity in the system. For the monotone missing values, the average tolerance level is 0.809 at 0% missingness, while at 25% missingness it changes to 0.84; in the same vein, for the arbitrary missing values the average tolerance is 0.89 at 0%, while at 25% missingness the average tolerance recorded is 0.108. The change recorded in the average tolerances in both cases indicates how multicollinearity increases with increasing levels of missing values at different percentages.