Comparing the Performance of Algorithm with Relevant Features for Histological Categorization of Lung Cancer

Lung cancer has long been a major topic of research because the number of new cases continues to rise across the globe. Many researchers have therefore used prediction or classification algorithms to identify the factors that contribute to this deadly disease. In this work, two models were built, namely WRF and RF. The RF model provides the results for features selected by a predominant feature selection method, whereas the WRF model provides the results for all features without any selection process. The two models are compared to show the importance of feature selection for classification or prediction algorithms. The accuracy provided by the RF model is higher than that of the WRF model, which highlights the importance of feature selection for classification algorithms.


Introduction
Around the globe, lung cancer (LC) is the most frequently diagnosed cancer in 37 countries, and it is responsible for a high death rate in males [1]. Unlike patients with other cancers, lung cancer patients have a higher survival rate once the disease is detected early. There are many histological categories of lung cancer cells [2]. Based on the size of the cancer cells, they are classified into several types [3]. Certain types of cancer cells are found more frequently in heavy smokers than in non-smokers, while the progression of particular types of lung cancer cells is higher in non-smokers [4]. Although many parameters contribute to the development of lung cancer, the exact cause is not known. Therefore, many prediction and classification algorithms are used to find the features that contribute to this deadly disease.
The aims of this paper are as follows:
• This work identifies appropriate features that are related to the histological categorization of cancer cells.
• This work creates two models. One model provides the results for features selected by a predominant feature selection method, and the other provides the results for all features without any selection process. The two are compared to show the importance of feature selection for classification or prediction algorithms.
• The performance of these two models is evaluated to determine the better model.

Related Works
Wail A.H Mousa et al. [5] used an SVM classifier that provided a sensitivity of 87.5%. Swati P. Tidke et al. [6] developed a model to classify cancer cells: the input image is preprocessed, segmented using thresholding and passed through further operations, giving an accuracy of 92.5%. Elmar Rendon-Gonzalez et al. [7] employed the SVM algorithm for classification. Their model includes a preprocessing step, segmentation of the lung parenchyma and nodule identification, and produced 78.08% accuracy.
Dmitriy Zinovev et al. [8] evaluated an algorithm using the area under the curve (AUC) as the performance metric, obtaining 69%. Dmitriy Zinovev et al. [9] built classifiers for lung nodule interpretation using a learning approach in which probabilistic labels are learned and then used to form the classifiers. M H Hasna et al. [10] created a classifier that gave an accuracy of 80%. Sarah Soltaninejad et al. [11] built a classifier for nodule detection.
Sakshi Wasnik et al. [12] made use of a k-nearest neighbors (KNN) classifier, which provided an accuracy of 96.25%. P. Bhuvaneswari et al. [13] carried out a three-stage implementation and obtained an accuracy of 90%. S.L.A. Lee et al. [14] reported 100% true positives and 1.27 false positives per scan using random forest. Subrato Bharati et al. [15] reported high accuracy for texture and spiculation. Jose et al. [16] proposed medical image classification in which random forest performed well and produced 92% accuracy.
Table 2 provides the feature descriptions of the dataset used in the following steps.

Data Visualization and Feature Selection
Data is managed, prepared and cleaned to make it available for visualization. Many data exploration techniques are available for understanding the data and drawing conclusions based on the requirements. Some of the visualization tools used are scikit-learn, Tableau, QlikView, FusionCharts and Highcharts. Data visualization uses presentation to gain added understanding of the information within the data. Scikit-learn is used for the implementation. Figure 2 depicts the bivariate distributions in the dataset. Figures 4, 5 and 6 show the distribution of data in the dataset, and Figures 7, 8 and 9 show the relationships between features.
The primary goal of feature selection is to recognize those attributes or features that are associated with the output values, where the values depend on particular information gathered by applying some useful test. The correlations commonly used in statistics are the Pearson correlation (PC), the Kendall rank correlation, the Spearman correlation and the point-biserial correlation. In our work we use PC to measure the strength of the linear association between variables. The correctness and effectiveness of the histological categorization of lung cancer depend on selecting the right features. Selection of the dominant features by PC has been done by Animesh Hazra et al. [20] in predicting survivability on a lung cancer patient dataset. Features that are negatively correlated with histology are considered irrelevant, whereas all positively correlated features are considered important. The features selected are age, height and weight, while BMI and tumor size are considered irrelevant by PC. Figure 4 shows the feature selection process of PC.
The WRF and RF models each comprise the following machine learning algorithms: support vector machine (SVM), logistic regression (LR), decision tree (DT), k-nearest neighbor (KNN) and random forest (RAF). The features selected by PC are given as input to the RF model, whereas all the features, without feature selection by PC, are given as input to the WRF model. The accuracies produced by the RF and WRF models are compared. The SVM and random forest (RAF) algorithms with feature selection produced an accuracy of 73.529%, greater than that of the other algorithms. Figure 10 provides the comparison of the algorithms with and without feature selection.
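The selection rule described above (keep a feature when its Pearson correlation with histology is positive) can be illustrated with a minimal sketch. The helper names `pearson_r` and `select_positive_features`, the integer encoding of the histology labels and the toy records are all assumptions made for illustration; they are not the study's code or data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear association
    return cov / math.sqrt(var_x * var_y)

def select_positive_features(columns, target):
    """Keep the features whose correlation with the target is positive."""
    scores = {name: pearson_r(values, target) for name, values in columns.items()}
    selected = [name for name, r in scores.items() if r > 0]
    return selected, scores

# Hypothetical toy records, NOT the study's dataset.
columns = {
    "age": [50, 55, 60, 65, 70, 75],
    "weight": [60, 62, 68, 70, 75, 80],
    "tumor_size": [5.0, 4.5, 4.0, 3.0, 2.5, 2.0],
}
histology = [0, 0, 1, 1, 2, 2]  # assumed integer encoding of histology classes

selected, scores = select_positive_features(columns, histology)
print(selected)  # features with positive correlation to histology
```

In this toy data, age and weight rise with the encoded histology value and are kept, while tumor size falls and is dropped. The result depends on how the histology classes are encoded as numbers, which the sketch simply assumes.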

Summary of Current Work
In this section, we summarize our current research work as follows:
1. Input data collected from cancerimagingarchive.net undergoes a cleaning process to eliminate missing values.
2. Data visualization is done with scikit-learn.
3. Features are selected using PC.
4. Two models, WRF and RF, are created for histological classification. The dataset with all features is loaded into WRF, and the dataset with the features selected by PC is loaded into RF.
5. The accuracy of the WRF model algorithms is compared with that of the RF model algorithms. The WRF and RF models comprise machine learning algorithms. The SVM and random forest algorithms with feature selection produced an accuracy of 73.529%, greater than that of the other algorithms.
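The steps summarized above can be sketched end to end as follows. This is only an illustration under stated assumptions: a small hand-written k-nearest neighbor classifier stands in for the five algorithms of the study, the helper names (`drop_missing`, `knn_predict`, `holdout_accuracy`) and the records are hypothetical, and the printed accuracies say nothing about the reported 73.529%.

```python
import math
from collections import Counter

def drop_missing(rows):
    """Step 1: remove records that contain any missing (None) value."""
    return [row for row in rows if all(v is not None for v in row.values())]

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def holdout_accuracy(rows, features, target, n_test, k=3):
    """Train on all but the last n_test rows, then score on the last n_test rows."""
    X = [[row[f] for f in features] for row in rows]
    y = [row[target] for row in rows]
    train_X, test_X = X[:-n_test], X[-n_test:]
    train_y, test_y = y[:-n_test], y[-n_test:]
    hits = sum(knn_predict(train_X, train_y, x, k) == label
               for x, label in zip(test_X, test_y))
    return hits / n_test

# Hypothetical records, NOT the study's dataset.
rows = [
    {"age": 52, "height": 160, "weight": 58, "bmi": 22.7, "tumor_size": 3.1, "histology": 0},
    {"age": 55, "height": 162, "weight": 60, "bmi": 22.9, "tumor_size": 2.8, "histology": 0},
    {"age": 57, "height": 161, "weight": None, "bmi": 23.0, "tumor_size": 3.0, "histology": 0},
    {"age": 66, "height": 171, "weight": 72, "bmi": 24.6, "tumor_size": 2.9, "histology": 1},
    {"age": 68, "height": 173, "weight": 75, "bmi": 25.1, "tumor_size": 3.3, "histology": 1},
    {"age": 54, "height": 159, "weight": 57, "bmi": 22.5, "tumor_size": 3.4, "histology": 0},
    {"age": 70, "height": 174, "weight": 77, "bmi": 25.4, "tumor_size": 2.7, "histology": 1},
    {"age": 53, "height": 161, "weight": 59, "bmi": 22.8, "tumor_size": 3.2, "histology": 0},
    {"age": 69, "height": 172, "weight": 74, "bmi": 25.0, "tumor_size": 3.0, "histology": 1},
]

clean = drop_missing(rows)                                      # step 1
all_features = ["age", "height", "weight", "bmi", "tumor_size"]  # input to WRF
selected = ["age", "height", "weight"]                          # step 3: PC-selected, input to RF

# Steps 4 and 5: same classifier, two feature sets, accuracies compared.
acc_wrf = holdout_accuracy(clean, all_features, "histology", n_test=2)
acc_rf = holdout_accuracy(clean, selected, "histology", n_test=2)
print(f"WRF (all features): {acc_wrf:.3f}  RF (selected features): {acc_rf:.3f}")
```

Keeping the classifier and the split fixed while changing only the feature set is what makes the two accuracies comparable; in practice the same comparison would be repeated for each algorithm in the models.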

Conclusion
In this paper, we have created two models, WRF and RF, each of which comprises machine learning algorithms. The dataset with all features is loaded into the WRF model, and the dataset with the features selected by PC is loaded into the RF model. The accuracies provided by the WRF model algorithms and the RF model algorithms are compared. The SVM and random forest algorithms with feature selection produced an accuracy of 73.529%, greater than that of the other algorithms. This shows the need for feature selection when predicting a deadly disease such as lung cancer. In future, further research can be carried out to improve the accuracy of the classification or prediction algorithms.