A Novel Approach of Ensemble Learning with Feature Reduction for Classification of Binary and Multiclass IoT Data

Mr. Vijay M. Khadse, Dr. Parikshit N. Mahalle, Dr. Gitanjali R. Shinde Assistant Professor, College Of Engineering Pune (COEP), Pune, India, Email: vmk.comp@coep.ac.in Senior Member IEEE, Professor & Head, Department of Computer Engineering, Smt. Kashibai Navale College of Engineering, Pune India. PostDoc Researcher, Center for Communication, Media and Information Technologies (CMI), Aalborg University, Copenhagen, Denmark, Email: aalborg.pnm@gmail.com Assistant Professor, Smt. Kashibai Navale College of Engineering, Pune India, Email : gr83gita@gmail.com


Introduction
The Internet of Things (IoT) is one of the fastest-spreading fields, touching every aspect of human life (Singh and Singh, 2015). IoT systems are integrated into many applications, such as home automation, smart cities, manufacturing, aviation, health care, transport, network security and self-driving automobiles, to mention a few (Atzori et al., 2010). IoT devices support a number of applications: smart cameras and smoke detectors for security, smart light bulbs for home and industry, sockets that facilitate power savings, and so forth (Meidan et al., 2017).
The application of machine learning (ML) in IoT systems is expanding rapidly, especially with the emergence of fast mobile devices that also have access to cloud computing (Ularu et al., 2013). IoT devices generate huge amounts of data in every field of application. Data generated by IoT systems mostly consists of continuous values. Such data has an advantage over categorical data, as it can be naturally ordered and similarity and distance functions can be defined on it (Boriah et al., 2008; Wilson and Martinez, 1997). Raw data generated by IoT devices needs to be abstracted, and analytics should be performed to find patterns and useful inferences. ML is widely applied in IoT for knowledge extraction. Ensemble models are widely used in ML and pattern recognition applications because they can significantly improve accuracy compared to base algorithms.
Ensemble Learning (EL) is the state of the art for many ML problems. In EL, a group of base learners (on average 5 to 6), that is, a group of models, processes the data. The main aim of EL is to combine these models into one strong learner, so that the obtained result is much better than that of a single base learner (Atzori et al., 2010).
A common situation is that sufficient historical data from an IoT application is not available for learning. One of the major challenges in applying ML to IoT systems is to identify an optimal learning algorithm for classification that can be applied across diverse IoT domains.
The objectives of this research are: (1) to identify an optimal ensemble learning technique suitable for diverse IoT domains; (2) to identify a suitable feature reduction technique for effective performance; (3) to identify the suitability of a learning algorithm based on the number of class labels, i.e. binary or multiclass data, separately; and (4) to study and compare the proposed hybrid ensemble model with bagging, boosting and stacking ensemble models over diverse IoT domain data. This research pursues these objectives by taking datasets of varying size and varying number of features from different IoT application domains, and by comparing ML ensemble techniques for classification based on their performance.
The paper is organized as follows: Section 2 contains a literature survey of previous and related work by other authors. Section 3 analyses the gaps in the previous work. Section 4 describes the proposed work and methodology. Section 5 contains the experimentation and results. Finally, Section 6 closes the study with observations and conclusions.

Literature Survey
In learning techniques, the number of component classifiers in an ensemble model and the number of components extracted from the original features by the feature reduction technique both have a great impact on performance. Junior et al. (2020) compared feature selection and dimensionality reduction techniques on gesture recognition sensor data to increase performance. They used eight dimensionality reduction techniques, including Linear Discriminant Analysis, Manifold Charting, Autoencoder, t-distributed Stochastic Neighbor Embedding, Principal Component Analysis, Large Margin Nearest Neighbor (LMNN) and Isomap, along with seven different classifiers. They observed that feature selection achieved 87% to 90% accuracy with ELM, SVM and RBF classifiers, and that dimensionality reduction achieved 95% accuracy with a combination of LMNN and SVM. Their study showed that dimensionality reduction improves performance on the hand gesture dataset. Ribeiro proposed a stacking ensemble methodology using a Logistic Model Tree and three well-known ensembles, namely extra trees, random forest and gradient boosting. They concluded that the stacking methodology performed remarkably better than the individual classifiers.
Suganthi and Karunakaran (2019) used the Cuttlefish optimization algorithm for data point reduction. The optimally extracted subset of data points, together with a reduced set of features provided by PCA, gave almost the same accuracy and false positive rate as the original dataset. Wan and Yang (2013) compared four popular ensemble methods, namely Bagging, Boosting, Stacking and Random Forest, on 31 UCI datasets and showed that accuracy varies depending on the dataset domain. Therefore, no single EL method was the winner.
Ye and Suganthan (2012) discussed four bagging-based ensemble classifiers, namely the ensemble ANFIS, the ensemble SVM, the ensemble ELM and random forest. The ensemble classifiers were evaluated on thirteen UCI binary datasets with different bagging numbers (20, 50 and 80). Of the four, the ensemble SVM was identified as the most favourable ensemble classifier and random forest as the second most favourable.
Syarif et al. (2012) investigated network intrusion detection systems by applying three ensemble methods (bagging, boosting and stacking). Their results show that only the stacking method was able to reduce the false positive rate compared to the other ensemble methods. Among the four classifiers (J48, naive Bayes, JRip and IBk nearest neighbour), J48 performed better than the other three, achieving the highest accuracy and the lowest false positive rate.
Wang (2012) compared Bagging, Boosting and Stacking ensemble techniques with Decision tree, Artificial neural network, Support vector machine and logistic regression as base learners on credit scoring datasets. In this study, accuracy, type I error and type II error were used to measure model performance. The study concluded that Stacking and Bagging with decision trees performed better than all other ensemble models on these three measures. Graczyk et al. (2010) examined six distinct ML classifiers with three ensemble techniques, i.e. Bagging, Stacking and Additive Regression. Models produced by stacking had the lowest prediction error. The bagging approach was found to be more stable but performed worse than stacking and additive regression.

Gap analysis
Based on the literature survey, it is observed that researchers who compare ensemble methods on multi-domain datasets do not consider feature reduction techniques. Similarly, those who compare feature reduction techniques on multi-domain datasets do not use ensemble methods.
Unlike existing studies, this study compares the performance not only of Bagging, Boosting and Stacking models but also of the proposed hybrid ensemble model. It considers PCA, LDA and Isomap as feature reduction techniques to improve the performance of the models on diverse multi-domain binary and multiclass IoT datasets. Figure 1 shows the proposed methodology for this study. It involves the EL of individual classifiers and the feature reduction techniques mentioned above. The methodology is divided into five stages. The following stages make up the comparative study of feature reduction techniques and EL methods.

Dataset
This study collected ten binary and ten multiclass IoT sensor datasets from different domains from the UCI ML repository (Asuncion et al., 2007) and Kaggle. Some of the datasets are high-dimensional and some low-dimensional, to reduce any favourable or unfavourable impact on the performance of the algorithms. Table 1 contains information about the features, classes, class types and instances of the datasets.

Data pre-processing
Out-of-range values, missing values, impossible data combinations and the like have an undesirable effect on an ML prediction model. The data gathered from these sources is not in a standardized form and contains many null and missing values. This study removes the null or missing values from the data. Large numerical values require the data to be normalized before feature reduction. A column normalization technique is applied to the datasets, rescaling data values to the range 0 to 1. The data is also checked for positive or negative infinity.
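The pre-processing steps described above can be sketched as follows. This is a minimal illustration, assuming pandas and scikit-learn's MinMaxScaler for the column normalization; the function name `preprocess`, the column names and the sample readings are hypothetical and do not come from the study's datasets. Rows containing infinity are dropped here as one possible way of handling them.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing or infinite values, then rescale each column to [0, 1]."""
    # Treat +/- infinity as missing so that one dropna() call handles both cases.
    cleaned = df.replace([np.inf, -np.inf], np.nan).dropna()
    # Column-wise min-max normalization: (x - min) / (max - min).
    scaled = MinMaxScaler().fit_transform(cleaned)
    return pd.DataFrame(scaled, columns=cleaned.columns, index=cleaned.index)


# Hypothetical sensor readings, for illustration only.
raw = pd.DataFrame({
    "temperature": [20.0, 25.0, np.nan, 30.0, 22.0],
    "humidity": [40.0, np.inf, 55.0, 60.0, 50.0],
})
clean = preprocess(raw)
```

In practice the class label column would be separated from the features before scaling.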

Feature reduction
Because of the large number of features in a dataset, the data becomes complex to visualize. Many of the features are correlated and therefore redundant. Feature reduction techniques convert the higher dimensions of a dataset into a new set of synthetic dimensions and extract lower dimensions, to avoid overfitting and improve the performance of the model.
This study used two linear and two nonlinear feature reduction techniques. Principal component analysis (PCA) is a linear technique that reduces dimensionality while minimizing information loss. It creates new uncorrelated features that retain as much of the variance as possible. Linear discriminant analysis (LDA) is also a linear technique; it reduces dimensionality based on the classes of the dataset. Using this technique, the study finds the dimensions that maximize the separability between classes so that the data can be classified well. Isomap is a nonlinear technique that computes a low-dimensional embedding of high-dimensional data and is also used for visualization. Its number of neighbours depends on the number of instances in the dataset. Isomap is efficient and is used for data with a high number of dimensions. For visualization of the data, the study applied the t-distributed stochastic neighbour embedding (t-SNE) technique. To observe how the data is separated, the study visualizes the data in 2 dimensions.
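As a rough illustration of how these reducers are applied, the sketch below projects a dataset to two dimensions with PCA, LDA and t-SNE in scikit-learn. The synthetic data from `make_classification` is a stand-in for an IoT dataset, and the choice of two output dimensions is an assumption made for the example; Isomap follows the same `fit_transform` pattern.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a multiclass sensor dataset (assumption, not study data).
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X_std = StandardScaler().fit_transform(X)

# PCA is unsupervised: it only looks at the feature matrix.
X_pca = PCA(n_components=2).fit_transform(X_std)

# LDA is supervised: it needs the class labels to maximize class separability.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_std, y)

# t-SNE is used here only to obtain a 2-D embedding for plotting.
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_std)
```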

Ensemble method
This work considers Bagging (Breiman, 1996), Boosting (Freund and Schapire, 1996), Stacking (Wolpert, 1992) ensemble techniques and proposes a new "hybrid" model for classification. The proposed methodology used a decision tree classifier as a base learner for bagging and boosting. Also, Adaboost is used as a meta classifier for boosting.
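A minimal sketch of bagging and AdaBoost with a decision tree base learner is given below, using scikit-learn. The number of estimators, the tree depth for boosting, the train/test split and the synthetic data are all assumptions made for illustration; the source does not state the settings used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data as a stand-in for an IoT dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Bagging: each tree is trained on a bootstrap sample and the votes are combined.
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
)

# Boosting: AdaBoost reweights the samples so later trees focus on earlier mistakes.
# Shallow trees (depth 1) are a common choice, assumed here.
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1, random_state=0),
    n_estimators=50,
    random_state=0,
)

bagging_acc = bagging.fit(X_train, y_train).score(X_test, y_test)
boosting_acc = boosting.fit(X_train, y_train).score(X_test, y_test)
```

The base learner is passed positionally because the name of that keyword argument differs between scikit-learn releases.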
Stacking combines multiple models, builds training data from their predictions and applies a meta classifier to that training data (Wolpert, 1992). This approach improves the performance of the model. For the first layer of the stacking model, eight classifiers were selected from several families of algorithms: the Random forest classifier from the tree family (Amasyali and Ersoy, 2008), the Multilayer perceptron (MLP) classifier from the neural network family (Wilamowski, 2009), the Gradient boosting classifier from the ensemble family (Mason et al., 2000), the Bernoulli and Gaussian classifiers from the Naïve Bayes family, the K-nearest neighbour (KNN) classifier from the instance-based family (Aha et al., 1991), the Logistic regression classifier from the regression family (Gay and Welsch, 1988) and the Support vector machine (SVM) classifier from the generalized linear classifier family (Tyagi and Manry, 2019). Out of the eight classifiers, the three best-performing classifiers were selected for binary as well as multiclass datasets to build the stack. There are two ways to build a stack of classifiers in a stacking ensemble model: using the predicted values of the classifiers, or using the predicted probabilities of the classifiers. This work uses the predicted probabilities for both binary and multiclass datasets, which helps to boost the performance of the stacking model. For the second layer of the stacking model, since its input data is derived from the original dataset through a complex transformation, it is not necessary to select a complex classifier for the output layer. Logistic regression is a good choice and also helps prevent over-fitting. That is why Logistic Regression is selected as the meta classifier.
This study merged the predictions of the bagging, boosting and stacking ensemble models into a new "Hybrid Ensemble Model" (HEM). For binary datasets, it combined the prediction values (values predicted on the training dataset) of the bagging, boosting and stacking ensemble models to produce new predicted values. For multiclass datasets, it combined the prediction values as well as the prediction probabilities (probabilities of the predicted values made on the training dataset) of the bagging, boosting and stacking ensemble models to produce the new predicted values. Finally, the study compares the predicted values with the test dataset (which contains the actual values for the inputs) to measure performance. The HEM aims to improve the stability of the EL model, so that the prediction does not change even if the training data is slightly modified.

Performance metrics
For this comparative study, the performance of the models is compared using three metrics. Accuracy is the most common and essential measure of the prediction rate of a model. For multiclass classification problems, the Area Under the ROC Curve (AUC) is also essential. It shows how well a model can distinguish between labels: the higher the AUC score, the better the classes are predicted. The F1 score shows the balance between precision and recall. However, it does not take true negatives into account.
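The three metrics can be computed with scikit-learn as in the small worked example below. The labels and scores are made up for illustration, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up ground truth and predicted probabilities of class 1 (illustration only).
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.7, 0.4, 0.2])
y_pred = (y_score >= 0.5).astype(int)  # hard labels: [0, 1, 1, 1, 0, 0]

# Accuracy: 4 of 6 predictions are correct.
accuracy = accuracy_score(y_true, y_pred)   # 4/6

# F1: TP=2, FP=1, FN=1, so precision = recall = 2/3; true negatives play no role.
f1 = f1_score(y_true, y_pred)               # 2/3

# AUC: fraction of (positive, negative) pairs ranked correctly, here 8 of 9.
auc = roc_auc_score(y_true, y_score)        # 8/9
```

For multiclass data, `roc_auc_score` accepts a matrix of class probabilities together with `multi_class="ovr"`, and `f1_score` needs an `average` setting such as `"macro"`; which averaging the study used is not stated.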

Experimentation and results
Scikit-learn is a useful library for ML (Pedregosa, 2011). It offers a number of supervised and unsupervised learning algorithms via a simple Python interface. In scikit-learn, PCA, LDA and Isomap share one common parameter, "n_components".
This parameter indicates the number of features to be returned for further processing. To determine its value, a corresponding function of each feature reduction technique is used. Before applying the feature reduction technique, the data is converted into standardized form using the "StandardScaler" transformer.

Selection of "n" Features
In PCA, the "explained variance ratio" is used to create "n" new features from the original features. The cumulative sum of the variance ratios of the components is computed, giving values in ascending order, and the cumulative variance is plotted. The plot shows how many "n_components" are required to cover a given share of the variance. For this study, the variance threshold is set at 95%.
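The selection of "n" for PCA can be sketched as follows. The helper name `n_components_for_variance` and the synthetic data are hypothetical; only the 95% threshold and the use of the explained variance ratio come from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def n_components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold`."""
    X_std = StandardScaler().fit_transform(X)
    ratios = PCA().fit(X_std).explained_variance_ratio_
    cumulative = np.cumsum(ratios)  # non-decreasing, ends at about 1.0
    # Index of the first cumulative value >= threshold, converted to a count.
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(cumulative))


# Synthetic data with redundant features (assumption, for illustration).
X, _ = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_redundant=10, random_state=0
)
n_pca = n_components_for_variance(X, threshold=0.95)
X_reduced = PCA(n_components=n_pca).fit_transform(StandardScaler().fit_transform(X))
```

scikit-learn's PCA also accepts a fraction such as `n_components=0.95` directly, which selects the component count by the same criterion.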
LDA creates its own new components based on the labels of the dataset. If a dataset has 'c' classes, LDA creates at most 'c-1' new features (and never more than the number of original features). Because of this, the study observes that the reduced dimensions of LDA are far fewer than those of PCA and Isomap. To extract the new feature set, the "explained variance ratio" is applied in the same way as for PCA. A function is built that consecutively adds the explained variance of the features until the threshold is reached and no more features fit. Finally, it returns the number of features added. The threshold is set at 95%.
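One possible reading of this counting function is sketched below. The function name `lda_components_for_variance` is hypothetical, the data is synthetic, and the stopping rule (stop as soon as the running total reaches the threshold) is an assumption, since the text does not spell it out.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def lda_components_for_variance(X, y, threshold=0.95):
    """Count how many discriminant components are needed before the running
    sum of explained variance ratios reaches `threshold`."""
    # With the default 'svd' solver, LDA exposes explained_variance_ratio_
    # for at most min(n_classes - 1, n_features) components.
    lda = LinearDiscriminantAnalysis().fit(X, y)
    total, count = 0.0, 0
    for ratio in lda.explained_variance_ratio_:
        if total >= threshold:
            break
        total += ratio
        count += 1
    return count


# Synthetic 4-class data (assumption): LDA can produce at most 3 components.
X, y = make_classification(
    n_samples=300, n_features=15, n_informative=6, n_classes=4, random_state=0
)
n_lda = lda_components_for_variance(X, y, threshold=0.95)
X_lda = LinearDiscriminantAnalysis(n_components=n_lda).fit_transform(X, y)
```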
Isomap requires the number of neighbours to be specified. Selecting a large number of neighbours makes it computationally expensive. Because of this, the number of neighbours for Isomap is initialized to the square root of the total number of instances in the dataset. To choose the number of features, the "reconstruction error" function of Isomap is applied. This function measures the distance between the original data points and their projections onto a lower-dimensional space. The study computes the error for each number of features. As the number of features increases, the reconstruction error decreases. After a certain number of features, the reconstruction error stabilizes. The number of features was chosen at the point where the reconstruction error stabilizes. Table 2 contains details about the total features and reduced features of each IoT binary and multiclass dataset.
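The Isomap procedure can be sketched as below. The helper names, the relative tolerance used to decide that the error has "stabilized" and the swiss-roll stand-in data are all assumptions; the text only states that the neighbour count is the square root of the number of instances and that the reconstruction error is inspected.

```python
import math

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap


def isomap_reconstruction_errors(X, max_components):
    """Reconstruction error of Isomap for 1..max_components output dimensions."""
    n_neighbors = max(2, int(math.sqrt(X.shape[0])))  # sqrt(n) neighbours
    errors = []
    for k in range(1, max_components + 1):
        iso = Isomap(n_neighbors=n_neighbors, n_components=k).fit(X)
        errors.append(iso.reconstruction_error())
    return errors


def first_stable_k(errors, tol=0.05):
    """Smallest k after which adding one more component improves the error
    by no more than `tol` times the initial error (hypothetical criterion)."""
    for i in range(1, len(errors)):
        if errors[i - 1] - errors[i] <= tol * errors[0]:
            return i  # errors[i - 1] belongs to k = i components
    return len(errors)


# Stand-in data: a 3-D swiss roll (assumption, not one of the study's datasets).
X, _ = make_swiss_roll(n_samples=300, random_state=0)
errors = isomap_reconstruction_errors(X, max_components=3)
n_iso = first_stable_k(errors)
```

In the study the stabilization point was read from the error curve; the tolerance rule here is only one way to automate that reading.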

Results
To evaluate the efficiency of the model and reduce overfitting and underfitting problems, the cross-validation technique is used. The datasets used do not have enough instances to construct an optimal model, and results fluctuate for different splits of the data. For these reasons, K-fold cross-validation is applied. In K-fold cross-validation, there is a bias-variance trade-off associated with the choice of K (James et al., 2013). Generally, given these considerations, one performs K-fold cross-validation with K=5 or K=10, since these values have been shown experimentally to provide test error rate estimates that suffer neither from excessively high bias nor from very high variance. For performance evaluation, this study used 5-fold cross-validation.
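A minimal sketch of 5-fold cross-validation with the three metrics is shown below, using scikit-learn's `cross_validate`. The pipeline (standardization, PCA at 95% variance, a single decision tree) and the synthetic binary data are assumptions for illustration; placing the reducer inside the pipeline, so that it is refitted on each training fold, is a design choice of this sketch and not something the text specifies.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data as a stand-in for an IoT dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # keep components covering 95% of the variance
    DecisionTreeClassifier(random_state=0),
)

# cv=5 gives stratified 5-fold splits for classifiers.
scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "roc_auc", "f1"])
mean_scores = {
    name: scores[f"test_{name}"].mean() for name in ("accuracy", "roc_auc", "f1")
}
```

For multiclass data the scorer names would change, for example to `"roc_auc_ovr"` and `"f1_macro"`.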
First, a bagging model with a decision tree classifier as the base learner was evaluated on each dataset for Accuracy, AUC and F1 score with PCA, LDA and Isomap. The results are averaged for PCA, LDA and Isomap over the binary and multiclass datasets.
Secondly, the AdaBoost model with a decision tree classifier as the base learner was evaluated on each dataset for Accuracy, AUC and F1 score with PCA, LDA and Isomap. The results are averaged for PCA, LDA and Isomap over the binary and multiclass datasets. Due to space constraints, individual results on each dataset for bagging and boosting are not shown in the paper.
Next, for the stacking ensemble model, this study applied Random forest (RF), SVM, KNN, Bernoulli Naïve Bayes (BNB), Gaussian Naïve Bayes (GNB), Gradient boosting (GBM), MLP and Logistic regression (LR) to each dataset. The data is visualized to see how the predictions of the eight models differ. For this, the t-SNE technique is used to create a scatter plot showing the predictions of the different models. A heat plot is then created to compare the correlation of their predictions. Finally, the frequencies of the predicted classes of all classifiers are visualized using a count plot. Detailed experimentation is carried out to calculate the accuracy of the eight classifiers of the stacking model with PCA, LDA and Isomap. Due to space constraints, individual values are not shown here in the paper.
To select the three best-performing classifiers among the eight, accuracy is averaged across all feature reduction techniques for both binary and multiclass datasets. Table 3 shows the average accuracy of all eight classifiers on binary and multiclass datasets. KNN, SVM and GBM are the three best performers for both binary and multiclass datasets and are selected for the further process. Next, the study builds a level-1 prediction set for the stacking classifier. The level-1 training dataset is created using 5-fold cross-validation. The level-1 test dataset is created by training the selected models on the complete original training dataset and predicting on the test dataset. Finally, LR is trained as a meta-classifier on the level-1 training data and predicts on the level-1 test dataset. Results are obtained by averaging the values over the binary and multiclass datasets for Accuracy, AUC and F1 score with PCA, LDA and Isomap.
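The level-1 construction described above can be sketched by hand as follows. The hyperparameters, the train/test split and the synthetic binary data are assumptions; the use of KNN, SVM and GBM as base models, predicted probabilities as level-1 features, 5-fold cross-validation and LR as meta-classifier follow the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary data as a stand-in for an IoT dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

base_models = [
    KNeighborsClassifier(),
    SVC(probability=True, random_state=0),  # probability=True enables predict_proba
    GradientBoostingClassifier(random_state=0),
]

# Level-1 training set: out-of-fold class probabilities from 5-fold CV,
# so no base model predicts on rows it was trained on.
level1_train = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")
    for m in base_models
])

# Level-1 test set: each base model is refitted on the full training set
# and its class probabilities on the test set are collected.
level1_test = np.hstack([
    m.fit(X_train, y_train).predict_proba(X_test) for m in base_models
])

# Meta-classifier: logistic regression on the stacked probabilities.
meta = LogisticRegression(max_iter=1000).fit(level1_train, y_train)
stack_pred = meta.predict(level1_test)
stack_acc = meta.score(level1_test, y_test)
```

scikit-learn's `StackingClassifier` with `stack_method="predict_proba"` and `cv=5` packages broadly the same procedure in one estimator.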
Next, for the hybrid ensemble model, the predictions of the bagging, boosting and stacking models were merged to create the predicted classes. The concept of majority voting is used: if 'yes' is predicted more times than 'no', then 'yes' is selected, and vice versa. In binary datasets, a majority vote for either class 0 or class 1 is certain. In multiclass datasets, which have 3 or more classes, the 3 ensemble models can each predict a different class. In this scenario there is no clear winner, so the class with the highest probability value is selected. Finally, the merged predictions are tested against the original test dataset.
Table 4 contains Accuracy, AUC and Prediction of the hybrid ensemble model with PCA, LDA and Isomap. To measure the performance of the models, the average value of Accuracy, AUC and F1 score for PCA, LDA and Isomap is calculated with all ensemble models on all binary and multiclass datasets. Table 5 shows the average accuracy rate of PCA, LDA and Isomap with the Bagging, Boosting, Stacking and hybrid ensemble models on all binary and multiclass IoT datasets. Table 6 shows the average AUC score of PCA, LDA and Isomap with the Bagging, Boosting, Stacking and hybrid ensemble models on the binary and multiclass IoT datasets. For better understanding, the average values of Accuracy, AUC and F1 score are visualized. Figure 2 visualizes Table 5: the average accuracy rate of PCA, LDA and Isomap with the Bagging, Boosting, Stacking and Hybrid ensemble models on all binary and multiclass IoT datasets. Figure 3 visualizes Table 6: the average AUC score of PCA, LDA and Isomap with the Bagging, Boosting, Stacking and Hybrid ensemble models on all binary and multiclass IoT datasets. Figure 4 visualizes Table 7: the average F1 score of PCA, LDA and Isomap with the Bagging, Boosting, Stacking and Hybrid ensemble models on the binary and multiclass IoT datasets.
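The voting rule of the hybrid ensemble model described in this section can be sketched as below. The function name `hybrid_predict` and the example probabilities are hypothetical. The tie-break is implemented as "take the class with the single highest probability reported by any of the three models", which is one interpretation of the text.

```python
import numpy as np


def hybrid_predict(predictions, probabilities):
    """Combine model outputs by majority vote, falling back to the highest
    class probability when no class has a clear majority.

    predictions:   (n_models, n_samples) integer class labels
    probabilities: (n_models, n_samples, n_classes) class probabilities
    """
    predictions = np.asarray(predictions)
    probabilities = np.asarray(probabilities)
    n_models, n_samples = predictions.shape
    n_classes = probabilities.shape[2]
    final = np.empty(n_samples, dtype=int)
    for i in range(n_samples):
        votes = np.bincount(predictions[:, i], minlength=n_classes)
        winners = np.flatnonzero(votes == votes.max())
        if len(winners) == 1:
            final[i] = winners[0]  # clear majority
        else:
            # No clear winner: pick the class holding the largest probability
            # over all models (assumed tie-break rule).
            flat_idx = np.argmax(probabilities[:, i, :])
            _, class_idx = np.unravel_index(flat_idx, (n_models, n_classes))
            final[i] = class_idx
    return final


# Made-up probabilities for 3 models (bagging, boosting, stacking),
# 3 samples and 3 classes, for illustration only.
probabilities = np.array([
    [[0.80, 0.10, 0.10], [0.20, 0.70, 0.10], [0.10, 0.20, 0.70]],  # bagging
    [[0.60, 0.30, 0.10], [0.30, 0.60, 0.10], [0.50, 0.30, 0.20]],  # boosting
    [[0.30, 0.60, 0.10], [0.10, 0.80, 0.10], [0.05, 0.90, 0.05]],  # stacking
])
predictions = probabilities.argmax(axis=2)  # each model's hard labels
hybrid = hybrid_predict(predictions, probabilities)  # array([0, 1, 1])
```

In the third sample the three models disagree completely, so the stacking model's 0.90 probability for class 1 decides the outcome.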

Observations
It is observed from Table 5 that, for binary datasets, the hybrid model with PCA achieved the highest accuracy, 94.468%, and boosting with LDA achieved the lowest, 77.029%. The average accuracy scores of all ensemble models with LDA are considerably lower than those with PCA and Isomap. For multiclass datasets, stacking with PCA achieved the top score of 91.712%, while boosting with Isomap received the lowest score of 80.269%.
From Table 6, it is seen that, for binary datasets, the hybrid model with PCA obtained the best average AUC score of 0.927 and boosting with LDA earned the lowest average AUC score of 0.683. Compared to the LDA and Isomap average AUC scores of all ensemble models, the PCA average scores of all ensemble models are relatively high. For multiclass datasets, stacking with LDA performed best with an average AUC score of 0.950, while boosting with Isomap received the lowest average AUC score of 0.855. From Table 7, for binary datasets, the hybrid model with PCA performed excellently and obtained an average F1 score of 0.917, while boosting with LDA performed very poorly and obtained an average F1 score of 0.683. For multiclass datasets, stacking with PCA received the highest mean F1 score of 0.916 and boosting with Isomap obtained the lowest mean F1 score of 0.802.

Research Article, Vol. 12 No. 6 (2021), 2072-2083

Conclusions
This comparative study investigated the possibility of applying bagging, boosting, stacking and hybrid ensemble algorithms with PCA, LDA and Isomap to improve performance on IoT sensor datasets. In both the binary and multiclass cases, PCA works best with all ensemble models compared to LDA and Isomap. For binary datasets, Hybrid with PCA works best against the other models, and Boosting with LDA performed ineffectively compared to the other models. For multiclass datasets, Stacking with PCA performed better than the other models in question, with Hybrid with PCA as a close runner-up. Boosting with Isomap worked very poorly on multiclass datasets. Bagging showed average performance on binary as well as multiclass datasets.