Statistical Evaluation of Item Nonresponse Methods Using the World Bank’s 2015 Philippines Enterprise Survey

The main objective of the study was to evaluate item nonresponse procedures through a simulation study at different nonresponse levels, or missing rates. The simulation explored how estimates behave at each nonresponse rate under a variety of circumstances, and investigated the performance of procedures suggested for item nonresponse under various conditions and variable trends. The imputation methods considered were cell mean imputation, random hotdeck, nearest neighbor, and simple regression. Specifically, the researcher evaluated methods for imputing missing data for the number of workers and the total cost of labor per establishment from the World Bank’s 2015 Enterprise Survey for the Philippines. These variables are some of the major indicators for measuring productive labor and decent work in the country.
The performance of the imputation techniques for item nonresponse was evaluated in terms of bias (accuracy) and coefficient of variation (precision). Based on the results, cell mean imputation was found to be the most appropriate for imputing missing values for the total number of workers and total cost of labor per establishment. Since the study was limited to the variables cited, it is recommended to explore other labor indicators. Exploring other clustering groups is also highly recommended, as the choice of clustering groups greatly affects the resulting imputed estimates. It is also recommended to explore other imputation techniques, such as multiple regression, and other parametric models for nonresponse, such as the Bayes estimation method. For regression-based imputation, since the study used only the cluster groupings as predictors, it is highly recommended to use other variables that might be related to the variable of interest to verify the results of this study.


1.Introduction
One major challenge in conducting surveys is nonresponse. It has been shown repeatedly that nonresponse can have large effects on the results of a survey. Nonresponse, also referred to as missing or incomplete data, is a common occurrence in surveys, even when great care is taken before and during data collection. Missing data, whether unit or item, create potential for bias in estimates derived from survey data (Lohr, 2010).
This study aimed to evaluate item nonresponse procedures through a simulation study at different nonresponse levels, or missing rates, using the World Bank’s 2015 Philippines Enterprise Survey. A simulation study was conducted to explore how estimates behave at each nonresponse rate under a variety of circumstances. The performance of procedures suggested for item nonresponse was also investigated under various conditions and variable trends from the survey.

2.1.Sources of Data
The study was conducted to compare imputation methods and identify those best suited to both discrete and continuous variables. For the purpose of this study, data from the World Bank’s 2015 Philippines Enterprise Survey were used. The data are not publicly available; access must be requested from the World Bank.

2.2.Statistical Treatment of Data
The data were examined and analyzed using the statistical software R. The researcher employed the following statistical processes and procedures to attain the objectives of the study:

1. Created a database file using R and MS Excel;
2. Computed the characteristics of the selected discrete and continuous variables, such as means and variances;
3. Performed simulation under different levels of nonresponse using the Bootstrap resampling method;
4. Evaluated and compared the characteristics of estimates for the different nonresponse rates with the pseudo-population estimates;
5. Imputed missing values using selected procedures for item nonresponse; and
6. Evaluated the procedures by comparing estimates using bias and variances.

2.2.1.Selection of Variables
The following important indicators from the Enterprise Survey, one discrete and one continuous, were used in the study: Discrete: total number of workers per establishment; and Continuous: average cost of labor per establishment.

2.2.2.Characteristics of the Pseudo-Population
The full sample data for the total number of workers and average cost of labor per establishment were treated as the pseudo-population. The characteristics of this pseudo-population were therefore described in terms of means and variances.

2.2.3.Simulation Using Different Levels of Nonresponse
The simulation experiments were done to evaluate the procedures across different percentages of nonresponse: 5%, 10%, and 20%.
1. Given the database of all responding sampling units of the Enterprise Survey for the selected variables, 1,000 samples were drawn without replacement. The Bootstrap, one of the popular resampling methods discussed in Lohr (2010), was applied by drawing each sample through simple random sampling without replacement of size n, which reproduces properties of the whole population.
2. To simulate nonresponse, a share of the values of the selected variables equal to the nonresponse level (5%, 10%, or 20%) was dropped at random.
3. Using the database with the values dropped in step 2, the statistics of interest were calculated for each sample using the different methods for item nonresponse.
4. Finally, the characteristics of the estimates for the different nonresponse rates were compared with the pseudo-population estimates.
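The sampling and deletion steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study itself used R, the pseudo-population below is randomly generated rather than taken from the Enterprise Survey, and the function names `draw_srswor` and `drop_at_random` are hypothetical.

```python
import random

def draw_srswor(population, n, rng):
    """Draw one simple random sample of size n without replacement."""
    return rng.sample(population, n)

def drop_at_random(sample, rate, rng):
    """Return a copy of `sample` with round(rate * n) values replaced by None."""
    n_missing = round(rate * len(sample))
    missing_idx = set(rng.sample(range(len(sample)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(sample)]

rng = random.Random(2015)  # fixed seed so the run is repeatable

# Hypothetical pseudo-population of 1,000 worker counts (not survey data).
pseudo_population = [rng.randint(5, 500) for _ in range(1000)]

replicates = []
for _ in range(1000):  # 1,000 replicate samples of size n = 200
    sample = draw_srswor(pseudo_population, 200, rng)
    # The same sample is reused for every nonresponse level.
    incomplete = {rate: drop_at_random(sample, rate, rng)
                  for rate in (0.05, 0.10, 0.20)}
    replicates.append((sample, incomplete))
```

Each entry of `replicates` holds the complete sample together with its three incomplete versions, so any imputation method can be applied to identical inputs.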
The following options were explored for both labor cost and number of workers during simulation:

3.Evaluation of Methods
The imputation methods used are the cell mean imputation, random hotdeck imputation, nearest neighbor imputation, and regression imputation. For the purpose of this study, the following methods for imputing missing data for the total number of workers and average cost of labor per establishment were evaluated.

3.1.Cell Mean Imputation
This method assumes that missing values within the cells are missing completely at random. First, respondents are grouped into classes, or cells, according to known variables. Each missing value is then replaced by the average of all responding establishments in the same class or cell.
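As a minimal sketch of the procedure just described, the Python function below fills each missing value with the mean of the respondents in the same cell. The record layout, the field names `size` and `workers`, and the function name `cell_mean_impute` are assumptions made for illustration; the values are invented.

```python
from collections import defaultdict
from statistics import mean

def cell_mean_impute(records, cell_key, target):
    """Replace missing `target` values (None) with the respondent mean of the cell."""
    by_cell = defaultdict(list)
    for r in records:
        if r[target] is not None:
            by_cell[r[cell_key]].append(r[target])
    cell_means = {cell: mean(vals) for cell, vals in by_cell.items()}

    imputed = []
    for r in records:
        r = dict(r)  # copy so the input records are left unchanged
        if r[target] is None and r[cell_key] in cell_means:
            r[target] = cell_means[r[cell_key]]
        imputed.append(r)
    return imputed

# Invented example records, with establishment size as the cell variable.
records = [
    {"size": "small", "workers": 10},
    {"size": "small", "workers": 14},
    {"size": "small", "workers": None},
    {"size": "large", "workers": 300},
    {"size": "large", "workers": None},
]
filled = cell_mean_impute(records, "size", "workers")
```

A missing value whose cell has no respondents is left as `None` in this sketch rather than being filled from another cell.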

3.2.Random HotDeck Imputation
A donor is randomly chosen from the establishments in the same cell that have information on all missing items. To preserve multivariate relationships, values from the same donor are usually used for all missing items of an establishment.
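The donor selection described above might look like the following Python sketch, in which one randomly chosen complete donor from the same cell supplies every missing item of a recipient. The function name `random_hotdeck_impute`, the field names, and the example values are hypothetical.

```python
import random

def random_hotdeck_impute(records, cell_key, items, rng):
    """Fill missing items from one randomly chosen complete donor in the same cell."""
    donors = {}
    for r in records:
        if all(r[i] is not None for i in items):
            donors.setdefault(r[cell_key], []).append(r)

    imputed = []
    for r in records:
        r = dict(r)  # copy so the input records are left unchanged
        missing = [i for i in items if r[i] is None]
        pool = donors.get(r[cell_key])
        if missing and pool:
            donor = rng.choice(pool)  # a single donor supplies all missing items
            for i in missing:
                r[i] = donor[i]
        imputed.append(r)
    return imputed

rng = random.Random(7)
# Invented example records, with establishment size as the cell variable.
records = [
    {"size": "small", "workers": 12, "labor_cost": 150.0},
    {"size": "small", "workers": 20, "labor_cost": 260.0},
    {"size": "small", "workers": None, "labor_cost": None},
    {"size": "large", "workers": None, "labor_cost": 4100.0},
]
filled = random_hotdeck_impute(records, "size", ["workers", "labor_cost"], rng)
```

In this example the third record receives both values from the same donor, while the fourth stays incomplete because its cell contains no complete donor.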

3.3.Nearest-Neighbor HotDeck Imputation
This method works by defining a distance measure based on one or more clustering variables then by imputing the missing value of a unit using the non-missing value of a unit nearest to it (the nearest neighbor) based on the distance measure.
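A minimal Python sketch of this idea follows. It assumes the clustering variables have been coded numerically and uses Euclidean distance; the source does not state which distance measure or coding was used, so both are assumptions, as are the function name `nearest_neighbor_impute` and the example values.

```python
import math

def nearest_neighbor_impute(records, aux_vars, target):
    """Fill missing `target` values with the value of the closest respondent."""
    donors = [r for r in records if r[target] is not None]
    imputed = []
    for r in records:
        r = dict(r)  # copy so the input records are left unchanged
        if r[target] is None and donors:
            point = [r[v] for v in aux_vars]
            # Euclidean distance over the coded variables (an assumption);
            # ties go to the first donor encountered.
            nearest = min(
                donors,
                key=lambda d: math.dist([d[v] for v in aux_vars], point),
            )
            r[target] = nearest[target]
        imputed.append(r)
    return imputed

# Invented example with numerically coded size and region.
records = [
    {"size_code": 1, "region_code": 1, "labor_cost": 120.0},
    {"size_code": 3, "region_code": 2, "labor_cost": 900.0},
    {"size_code": 1, "region_code": 2, "labor_cost": None},
    {"size_code": 3, "region_code": 1, "labor_cost": None},
]
filled = nearest_neighbor_impute(records, ["size_code", "region_code"], "labor_cost")
```

With purely categorical clustering variables many donors can be equally close, so the tie-breaking rule matters more than it would with continuous auxiliary variables.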

3.4.Regression Imputation
Regression imputation predicts the missing value by using a regression of the item of interest on variables observed for all cases. A variation is stochastic regression imputation, in which the missing value is replaced by the predicted value from the regression model, plus a randomly generated error term.
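The following regression models were used for regression imputation. For labor cost, an ordinary least squares (OLS) model on the natural logarithm of the variable:

$$\ln(y_i) = \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta} + \varepsilon_i$$

For the number of workers, a Negative Binomial generalized linear model, written here with a log link for the expected count $\mu_i$:

$$\ln(\mu_i) = \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta}$$

Where:
$\mathbf{x}_i$ = vector of 1's and 0's as indicators of the clustering variables
$\boldsymbol{\beta}$ = vector of corresponding coefficients
$\beta_0$ = intercept term
$\varepsilon_i \sim N(0, \sigma^2)$ = normal error term for the OLS model

The natural logarithm of labor cost was used since the variable is highly skewed and does not scale linearly. The Negative Binomial generalized linear model was used for the number of workers since the variable is discrete and exhibits overdispersion: the variance (74483.18) is much larger than the mean (111.2349), which violates the Poisson assumption that the mean and variance are equal.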
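The log-scale OLS model for labor cost can be illustrated with a short Python sketch. It is restricted to a single clustering variable, an assumption made so that no matrix algebra is needed: with an intercept and dummies for one factor, the OLS fitted value for a cell equals the mean of the logged responses in that cell. The function name `log_regression_impute` and the example values are hypothetical, the back-transformation is a plain exponential without any bias correction, and the Negative Binomial model for the number of workers is not covered here.

```python
import math
import random
from collections import defaultdict

def log_regression_impute(records, cell_key, target, rng=None):
    """Impute `target` from an OLS fit of ln(target) on dummies for one factor.

    If `rng` is given, a normal residual is added to the prediction
    (stochastic regression imputation); otherwise the prediction is used as is.
    """
    logs = defaultdict(list)
    for r in records:
        if r[target] is not None:
            logs[r[cell_key]].append(math.log(r[target]))

    # OLS fitted values for a one-factor dummy model are the cell means of ln(y).
    fitted = {c: sum(v) / len(v) for c, v in logs.items()}

    # Residual standard deviation with n - (number of cells) degrees of freedom.
    n_obs = sum(len(v) for v in logs.values())
    sse = sum((x - fitted[c]) ** 2 for c, v in logs.items() for x in v)
    dof = n_obs - len(fitted)
    sigma = math.sqrt(sse / dof) if dof > 0 else 0.0

    imputed = []
    for r in records:
        r = dict(r)  # copy so the input records are left unchanged
        if r[target] is None and r[cell_key] in fitted:
            noise = rng.gauss(0.0, sigma) if rng is not None else 0.0
            r[target] = math.exp(fitted[r[cell_key]] + noise)  # back to original scale
        imputed.append(r)
    return imputed

# Invented example records, with establishment size as the only regressor.
records = [
    {"size": "small", "labor_cost": 100.0},
    {"size": "small", "labor_cost": 400.0},
    {"size": "small", "labor_cost": None},
    {"size": "large", "labor_cost": 5000.0},
]
deterministic = log_regression_impute(records, "size", "labor_cost")
stochastic = log_regression_impute(records, "size", "labor_cost", rng=random.Random(1))
```

Because the fit is on the log scale, the deterministic prediction for a cell is the geometric mean of its respondents, which is smaller than the arithmetic mean whenever the values differ.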

3.5.Comparison of the Estimates/Assessment of the Performance of the Techniques
The estimates obtained from the methods were compared using a set of criteria for selecting the better procedure to compensate for missing data on the total number of workers and average cost of labor per establishment. The criteria used in assessing the estimates include measures of accuracy and precision.
To mitigate the effects of sampling error, 1,000 simulated simple random samples of size n, drawn without replacement, were used to obtain the expected accuracy (average percent bias and average absolute percent bias) and precision (average CV of the sample mean) of the sample mean per scenario. Comparability was ensured by using the same set of 1,000 simulated samples per scenario. A bias value near zero indicates a better estimator. An estimate is considered precise if its coefficient of variation is below 10%.
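The accuracy and precision measures named above could be computed as in the following Python sketch. The source does not give the formulas, so the definitions used here are assumptions: percent bias is taken as 100 × (sample mean − true mean) / true mean, and the CV of the sample mean uses the usual standard error for simple random sampling without replacement, with a finite population correction. The function name `evaluate_replicates` and the example numbers are hypothetical.

```python
import math
from statistics import mean, variance

def evaluate_replicates(samples, true_mean, pop_size):
    """Average percent bias, average absolute percent bias, and average CV (%)."""
    pct_bias, cvs = [], []
    for s in samples:
        n = len(s)
        ybar = mean(s)
        pct_bias.append(100.0 * (ybar - true_mean) / true_mean)
        fpc = 1.0 - n / pop_size              # finite population correction
        se = math.sqrt(fpc * variance(s) / n)  # standard error of the sample mean
        cvs.append(100.0 * se / ybar)
    return {
        "avg_pct_bias": mean(pct_bias),
        "avg_abs_pct_bias": mean(abs(b) for b in pct_bias),
        "avg_cv": mean(cvs),
    }

# Two tiny invented replicates, with an assumed true mean of 100 and N = 1,000.
result = evaluate_replicates([[90, 110], [100, 120]], true_mean=100, pop_size=1000)
```

Reporting both the signed and the absolute percent bias is useful because positive and negative errors across replicates can cancel in the signed average.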

4.1.Simulation of Samples and Clusters
For both number of workers and labor cost, sample sizes of n=100 and n=200 were generated during simulation using the Bootstrap method of resampling. Expected accuracy (average percent bias and average absolute percent bias) and precision (average coefficient of variation, or CV, of the sample mean) of the sample mean per scenario were obtained. For both the number of workers and labor cost, the sample size of n=200 shows more accurate and precise estimates than n=100 (Table 1). The accuracy and precision of the simulated samples can also be visualized in the boxplot presented in Figure 1.
To decide which clustering group to use in the analysis, the different combinations of clusters were compared in terms of accuracy and precision. There are seven possible clustering groups: (i) size, (ii) region, (iii) sector, (iv) size-region, (v) size-sector, (vi) region-sector, and (vii) size-region-sector. The results show that single clusters such as size, region, and sector performed better in terms of accuracy and precision, exhibiting lower CVs and bias closer to zero for the number of workers than the pairwise combinations of the stratification variables. In terms of accuracy, the three-way cluster group size-region-sector also showed a bias near zero. However, narrower groupings can limit the source of donor values within clusters during imputation: a missing value may not be filled if its cluster contains no non-missing donor values in the sample. Hence, for a given sample, clustering based on all grouping variables is the least likely to impute all missing values. The same results were obtained for labor cost (Figure 4 and Figure 5).
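The seven clustering groups, and the donor problem that arises with finer groupings, can be illustrated with a short Python sketch. The helper `cells_without_donor`, the field names, and the category labels are hypothetical and the records are invented.

```python
from itertools import combinations

variables = ("size", "region", "sector")

# All non-empty combinations: 3 single, 3 pairwise, 1 three-way.
cluster_groups = [
    combo
    for k in range(1, len(variables) + 1)
    for combo in combinations(variables, k)
]

def cells_without_donor(records, group, target):
    """Return cells (under `group`) that have missing values but no respondent."""
    respondents = {tuple(r[v] for v in group) for r in records if r[target] is not None}
    needing = {tuple(r[v] for v in group) for r in records if r[target] is None}
    return needing - respondents

# Invented records with made-up category labels.
records = [
    {"size": "small", "region": "R1", "sector": "retail", "workers": 12},
    {"size": "small", "region": "R2", "sector": "food", "workers": 18},
    {"size": "small", "region": "R1", "sector": "food", "workers": None},
]
coarse = cells_without_donor(records, ("size",), "workers")
fine = cells_without_donor(records, ("size", "region", "sector"), "workers")
```

Here the grouping by size alone leaves no cell without a donor, whereas the three-way grouping leaves the cell of the incomplete record empty, which is the situation described in the paragraph above.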

4.2.Levels of Nonresponses
The following options for nonresponse rates were explored for both number of workers and labor cost during simulation: 5%, 10%, and 20%. To artificially create nonresponse in the samples of size n=200, values of the selected variables were dropped at random in proportions equal to the nonresponse levels. A simple analysis of the resulting databases for the number of workers and labor cost with 5%, 10%, and 20% missing values reveals that as the level of missing items increases, the estimates become less accurate and less precise (Tables 2 and 3).

4.3.Evaluation of Imputation Methods
Results on the evaluation of imputation methods for the "number of workers" using the simulated samples of "n=200" and cluster group "size" showed that the cell mean and regression imputation methods have the same performance in terms of precision at 5%, 10%, and 20% nonresponse. Three methods of imputation (random hotdeck, cell mean, and regression) have the same performance in terms of the accuracy of estimates at the 5% level of nonresponse. Cell mean imputation outperformed the other methods in terms of accuracy at the 10% nonresponse rate, while cell mean and regression gave accurate estimates at 20% missingness (Table 4).
Table 5 shows an evaluation of the imputation methods for the "number of workers" using the simulated samples of "n=200" and cluster group "region". As displayed in Table 5, the cell mean and regression imputation methods have the same performance in terms of accuracy and precision of estimates at 5%, 10%, and 20% nonresponse.
Table 6 shows an evaluation of the imputation methods for the "number of workers" using the simulated samples of "n=200" and cluster group "sector". As displayed in Table 6, the cell mean and regression imputation methods have the same performance in terms of precision at 5%, 10%, and 20% nonresponse. In terms of the accuracy of the estimates, cell mean and regression performed best among all methods at 5% and 10% nonresponse, while random hotdeck performed best at the 20% missing rate.
On the other hand, for "labor cost (Php '000)" using the simulated samples of "n=200" and cluster group "size", random hotdeck imputation is the best performer in terms of accuracy of estimates only at 10% and 20% nonresponse, while regression performed best at the 5% missing rate. In terms of the precision of estimates, cell mean imputation outperformed the other methods at all levels of nonresponse (Table 7).
Table 8 shows that the cell mean imputation method outperformed all other techniques in terms of precision at all levels of nonresponse for "labor cost (Php '000)" using the simulated samples of "n=200" and cluster group "region". In terms of the accuracy of the estimates, the nearest neighbor imputation technique performed best among all methods at the 5%, 10%, and 20% levels of nonresponse.
Table 9 shows that the cell mean imputation method outperformed all other techniques in terms of precision at the 5%, 10%, and 20% levels of nonresponse for "labor cost (Php '000)" using the simulated samples of "n=200" and cluster group "sector". In terms of the accuracy of the estimates, the nearest neighbor imputation technique performed best among all methods at all levels of nonresponse.
Boxplots of the performance of the four imputation techniques were compared at the sample size of "n=200" and cluster group "size", regardless of the level of nonresponse, for the number of workers and labor cost.
For the number of workers, cell mean imputation outperformed all other techniques in terms of producing precise estimates, exhibiting the lowest CV among all methods. This was followed by the random hotdeck and nearest neighbor imputation methods. Regression imputation yielded the highest CV (Figure 6). Moreover, the cell mean and random hotdeck imputation methods have almost the same performance in terms of producing accurate estimates, with biases close to zero for the number of workers. This was followed by the nearest neighbor imputation method. Regression imputation yielded more biased estimates (Figure 7).
For labor cost, cell mean imputation outperformed all other techniques in terms of producing precise estimates, exhibiting the lowest CV among all methods. This was followed by the random hotdeck and nearest neighbor imputation methods. Regression imputation yielded the highest CV (Figure 8). Moreover, the cell mean and random hotdeck imputation methods have almost the same performance in terms of producing accurate estimates, with biases close to zero for labor cost. This was followed by the nearest neighbor imputation method. Regression imputation yielded more biased estimates (Figure 9).

5.1.Summary of Findings
The findings of the study are summarized as follows:
1. The newly created databases for the number of workers and labor cost across the different levels of nonresponse or missingness (5%, 10%, and 20%) reveal that the means of the database with 5% missingness are closest to the true population means (number of workers: 111; labor cost: Php 71.615 million). The lowest percentage bias is found at 5% missingness and the highest at 20% missingness, both for the number of workers and for labor cost. The same pattern is observed in terms of variances. Moreover, for the number of workers and labor cost, the sample size of n=200 shows more accurate and precise estimates than n=100. Single clusters or classes such as size, region, and sector performed better in terms of accuracy and precision, exhibiting lower CVs and bias closer to zero for the number of workers and labor cost, compared to the pairwise combinations of the stratification variables.
2. Results on the evaluation of imputation methods for the "number of workers" using the simulated samples of "n=200" and cluster group "size" showed that the cell mean and regression imputation methods have the same performance in terms of precision at 5%, 10%, and 20% nonresponse. Three methods of imputation (random hotdeck, cell mean, and regression) have the same performance in terms of the accuracy of estimates at the 5% level of nonresponse. Cell mean imputation outperformed the other methods in terms of accuracy at the 10% nonresponse rate, while cell mean and regression gave accurate estimates at 20% missingness. Evaluation of the imputation methods for the "number of workers" using the simulated samples of "n=200" and cluster group "region" showed that the cell mean and regression imputation methods have the same performance in terms of accuracy and precision of estimates at 5%, 10%, and 20% nonresponse. Evaluation of the imputation methods for the "number of workers" using the simulated samples of "n=200" and cluster group "sector" showed that the cell mean and regression imputation methods have the same performance in terms of precision at 5%, 10%, and 20% nonresponse. In terms of the accuracy of the estimates, cell mean and regression performed best among all methods at 5% and 10% nonresponse, while random hotdeck performed best at the 20% missing rate.
3. On the other hand, evaluation of imputation methods for "labor cost (Php '000)" using the simulated samples of "n=200" and cluster group "size" showed that random hotdeck imputation is the best performer in terms of accuracy of estimates only at 10% and 20% nonresponse, while regression performed best at the 5% missing rate. In terms of the precision of estimates, cell mean imputation outperformed the other methods at all levels of nonresponse. Evaluation of the imputation methods for "labor cost (Php '000)" using the simulated samples of "n=200" and cluster group "region" showed that the cell mean imputation method outperformed all other techniques in terms of precision at all levels of nonresponse. In terms of the accuracy of the estimates, the nearest neighbor imputation technique performed best among all methods at the 5%, 10%, and 20% levels of nonresponse. Evaluation of the imputation methods for "labor cost (Php '000)" using the simulated samples of "n=200" and cluster group "sector" showed that the cell mean imputation method outperformed all other techniques in terms of precision at the 5%, 10%, and 20% levels of nonresponse. In terms of the accuracy of the estimates, the nearest neighbor imputation technique performed best among all methods at all levels of nonresponse.
4. The most appropriate imputation techniques for estimating item nonresponse for the number of workers are cell mean imputation and regression imputation, at all levels of missingness (5%, 10%, 20%) and for all cluster groups (size, region, sector). For labor cost using the clustering group size, cell mean imputation and regression imputation emerged as the superior techniques for estimating item nonresponse at 5% missingness, while cell mean imputation and random hotdeck imputation showed superiority at 10% and 20% missingness. For the clustering group region, the best method for estimating item nonresponse for labor cost is nearest neighbor imputation at all levels of nonresponse in terms of the accuracy of estimates (bias). In terms of the precision of estimates (CVs), cell mean imputation is the most appropriate technique for imputing missing items at all levels of missingness. Likewise, for the clustering group sector, the best method for estimating item nonresponse for labor cost is nearest neighbor imputation at all levels of nonresponse in terms of the accuracy of estimates (bias). In terms of the precision of estimates (CVs), cell mean imputation is the most appropriate technique for imputing missing items at all levels of missingness.

6.Conclusions
The following conclusions were drawn based on the findings of the study:
1. The newly created databases for the number of workers and labor cost with missing values at 5%, 10%, and 20% reveal that as the level of missing items increases, the estimates become less accurate and less precise. Therefore, it is best to treat the data using appropriate techniques when missingness occurs. Moreover, larger sample sizes provide better estimates in terms of accuracy and precision. Simple clusters or classes can be used to select the donor value for a missing item. Clustering based on all grouping variables is the least likely to impute all missing values.
2. Imputed estimates for the number of workers using the clustering groups of size, region, and sector showed accurate and precise estimates at all levels of missingness for the cell mean and regression imputation techniques.
3. For labor cost, imputed estimates using the clustering groups of region and sector showed that the cell mean imputation and nearest neighbor provided more accurate and precise estimates at all levels of nonresponses. For cluster group size, cell mean imputation and random hotdeck provided better estimates at 10% and 20% missingness while cell mean imputation and regression imputation gave more accurate and precise estimates at 5% missingness.
4. Overall, the cell mean imputation method provided the best estimates for both the discrete and continuous variables (number of workers and labor cost) at the different levels of nonresponse (5%, 10%, 20%), in terms of both accuracy and precision.

7.Recommendations
The following recommendations are offered based on the derived conclusions:
1. For regression-based imputation, since the study used only the cluster groupings as predictors, it is highly recommended to use other variables that might be related to the variable of interest to verify the results of this study.
2. Explore other clustering groups. The choice of clustering groups greatly affects the resulting imputed estimates.
3. Explore the multiple imputation method with different models for nonresponse, where each missing value is imputed m (>=2) different times.
Also, explore the use of other parametric models for nonresponse by fitting a superpopulation model, such as the Bayes estimation method.