Elimination and Backward Selection of Features (P-Value Technique) In Prediction of Heart Disease by Using Machine Learning Algorithms

Background: Early prediction of cardiovascular disease can help identify lifestyle changes for high-risk patients, thereby reducing complications. We propose a coronary heart disease data set analysis technique for predicting a person's risk level based on their clinically recorded history. The methods introduced can be integrated into multiple applications, such as developing decision support systems, building risk management networks, and assisting experts and clinical staff. Methods: We employed the Framingham Heart Study dataset, which is publicly available on Kaggle, to train several machine learning classifiers, namely logistic regression (LR), K-nearest neighbor (KNN), Naïve Bayes (NB), decision tree (DT), random forest (RF) and gradient boosting classifier (GBC), for disease prediction. The p-value method was used for feature elimination, and the selected features were incorporated for further prediction. Various thresholds were used with the different classifiers to make predictions. To estimate the precision of the classifiers, the ROC curve, confusion matrix and AUC value were considered for model verification. The performance of the six classifiers was compared in predicting coronary heart disease (CHD). Results: After applying the p-value backward elimination statistical method to the 10-year CHD data set, 6 significant features were selected from 14 features with p < 0.05. Among the machine learning classifiers, GBC achieved the highest accuracy score, 87.61%. Conclusions: Statistical methods, such as the p-value backward elimination method, can be combined with machine learning classifiers, thereby improving classifier accuracy and shortening run time.


Introduction
Identifying the risk factors that increase the incidence of cardiovascular illness is one of the significant achievements of twentieth-century epidemiology (Einarson et al. 2018). Building on this, analysts can construct multivariate risk prediction algorithms to help clinicians perform risk assessment. Over the last 10 years, many risk scores have been proposed (Sofi et al. 2014). These are all designed for risk assessment over a limited horizon of ten years or less. To address this limitation, some reports have introduced lifetime risks of CVD, CHD and stroke, and some experts have estimated life expectancy and long-term risk within classes of risk-factor levels (WHO 2012). Their findings emphasize the importance of risk-factor levels in early adulthood for lifetime CVD risk, just as CVD risk factors have a large impact on all-cause mortality. They also pointed out that ten-year estimates may understate the real risk, especially among young people and women. These outcomes highlight the need for CVD risk prediction models that remain relevant for young adults and account for the competing risk of non-CVD mortality (Singh et al. 2020). However, no method has yet been proposed to measure 10-year CVD risk directly as a risk factor. The difficulty lies in finding a sufficiently long and thoroughly followed cohort, and in the methodological complexity of integrating competing risks of death into multivariate risk assessments (Proust-Lima et al. 2016). This article describes a procedure for assessing the 10-year risk of hard CVD events among people free of the condition at baseline. Our risk scale accounts for the competing risk of non-CVD death and uses standard risk factors that can be gathered during doctor visits.

In related work, a method combining FCBF, PSO and ACO achieved a maximum classification accuracy of 99.65% (Khourdifi and Bahaj 2019). Another study built an artificial firefly (Lampyridae) classifier and compared it with a Takagi-Sugeno-Kang fuzzy classifier and an ANN classifier in terms of accuracy, sensitivity, specificity and Matthews correlation coefficient; among the performance metrics, MCC in particular tests the capability of the classifiers. The use case was implemented in Scilab, and the obtained results indicate that the constructed ALC outperforms the TSK fuzzy classifier and the ANN classifier. The results are encouraging: the reported accuracy is 87.60% for male diabetic patients and 87.27% for female diabetic patients (Narasimhan and Malathi 2019).

Methods
Several techniques and methods were used in this experiment to assess the ten-year risk of CHD. The methods section is divided into two parts: the first describes the applied machine learning algorithms, and the second describes the experimental methodology.

I. Applied Machine Learning Algorithms

In this section, we discuss the machine learning algorithms that are used as methods throughout this research article.

Logistic Regression (LR)
Logistic or logit models are used to model the probability of a particular class or event. The model can be extended to several classes of events; for instance, the probability of each object detected in an image would be assigned a value between 0 and 1, and the probabilities sum to one (Balu et al. 2019). Consider a model with two predictors x1 and x2 and a binary response variable Y, and let p = P(Y = 1). We assume a linear relationship between the predictor variables and the log odds of the event Y = 1. This linear relationship can be written in the following mathematical form, where l is the log odds, b is the base of the logarithm, and βi are the parameters of the model:

l = log_b(p / (1 − p)) = β0 + β1·x1 + β2·x2
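As an illustration only (not part of the original study), the following minimal Python sketch applies this logistic function to two predictors; the coefficient values b0, b1 and b2 are invented for demonstration.

```python
import numpy as np

def logistic_probability(x1, x2, b0=-2.0, b1=0.05, b2=0.3):
    """Return P(Y = 1) for two predictors using illustrative coefficients."""
    log_odds = b0 + b1 * x1 + b2 * x2          # l = b0 + b1*x1 + b2*x2
    return 1.0 / (1.0 + np.exp(-log_odds))     # inverse of the logit maps to (0, 1)

# Example: predicted probability of the event for x1 = 50, x2 = 1
print(logistic_probability(50, 1))
```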

Random Forest (RF)
Random forest is an ensemble learning technique for classification, regression and other tasks (Dudek 2015, Chaurasia and Pal 2021). It works by constructing a large number of decision trees at training time and outputting the class that is the mode of the individual trees' classes (classification) or their mean prediction (regression). Random forests correct for the tendency of individual decision trees to overfit their training set. A random forest consists of many deep decision trees; its accuracy can be lower than that of gradient boosted trees, and the nature of the data affects its performance.

Decision Tree (DT)
The structure of a decision tree is similar to a flowchart, in which each internal node tests an attribute, each branch represents the outcome of the test, and each leaf node holds a class label (Adebayo and Chaubey 2019). The paths from root to leaf correspond to classification rules. In decision analysis, decision trees and the closely related influence diagrams are used as visual and analytical decision support tools to determine the expected values of competing alternatives.

Naïve Bayes (NB)
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are among the simplest Bayesian network models, yet they can be coupled with kernel density estimation and achieve higher levels of accuracy. The naive Bayes classifier is highly scalable, requiring a number of parameters linear in the number of variables in a learning problem (Krawczyk 2017). In contrast to the expensive iterative approximation used for many other types of classifiers, maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time. Using Bayes' theorem, the conditional probability can be written as:

P(C | x) = P(C) · P(x | C) / P(x)

K-Nearest Neighbor (KNN)
The K-nearest neighbor algorithm is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. k-NN is a type of instance-based learning, or lazy learning, in which the function is only approximated locally and all computation is deferred until evaluation. Since this algorithm relies on distances between samples, normalizing the training data can improve its accuracy considerably (Chaurasia and Pal 2018). For both classification and regression, a useful technique is to assign weights to the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones.

Gradient Boosting Classifier (GBC)
Gradient boosting is a machine learning technique for regression and classification problems. It produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Like other boosting methods, it builds the model in a stage-wise fashion and generalizes it by allowing optimization of an arbitrary differentiable loss function (Stamate et al. 2018). For now, let us consider a gradient boosting algorithm with M stages. At each stage m (1 ≤ m ≤ M), suppose the model F_m is imperfect. To improve F_m, a new estimator h_m(x) is added to the ensemble, so that:

F_{m+1}(x) = F_m(x) + h_m(x)
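As a hedged illustration of this stage-wise construction, the sketch below uses scikit-learn's GradientBoostingClassifier on synthetic data (not the study's configuration) and inspects how accuracy evolves as successive estimators h_m are added.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# M = 100 boosting stages; each stage adds a small tree h_m to the current model F_m
gbc = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbc.fit(X_train, y_train)

# staged_predict yields the prediction after each stage, showing F_m improving as m grows
for m, y_pred in enumerate(gbc.staged_predict(X_test), start=1):
    if m % 25 == 0:
        print(f"stage {m:3d}: accuracy = {accuracy_score(y_test, y_pred):.3f}")
```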

Logistic Regression and P-value Interpretation: Backward Elimination (Feature Selection)
Regression analysis produces an equation that describes the statistical relationship between one or more predictor variables and the response variable (Suguna et al. 2019). The p-value for each term tests the null hypothesis that the coefficient is equal to zero. A low p-value (< 0.05) indicates that the null hypothesis can be rejected. In other words, a predictor with a low p-value is likely to be a meaningful addition to the model, because changes in the predictor's value are associated with changes in the response variable. Conversely, a larger p-value suggests that changes in the predictor are not associated with changes in the response.
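A minimal sketch of this backward-elimination idea using statsmodels is given below; the data frame, its feature columns, and the target name TenYearCHD are assumptions for illustration rather than the authors' exact code.

```python
import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    """Repeatedly drop the predictor with the largest p-value until all p-values < alpha."""
    features = list(X.columns)
    while features:
        model = sm.Logit(y, sm.add_constant(X[features])).fit(disp=0)
        pvalues = model.pvalues.drop("const")        # ignore the intercept term
        worst = pvalues.idxmax()                     # least significant predictor
        if pvalues[worst] < alpha:                   # all remaining predictors significant
            return features, model
        features.remove(worst)                       # eliminate it and refit
    return features, None

# Usage (assuming df holds the Framingham data with a TenYearCHD target column):
# selected, fitted = backward_eliminate(df.drop(columns="TenYearCHD"), df["TenYearCHD"])
```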

Iteration Log
The iteration log lists the log likelihood at each iteration. The first log likelihood (iteration 0) is that of a "null" model, that is, a model with no predictors (Harrell Jr 2015). At each iteration the log likelihood increases, because the goal is to maximize it. When the difference between successive iterations is small, the model is said to have converged, the iterating stops, and the results are displayed.

Log likelihood
The value of the log likelihood has little meaning in itself; rather, this number can be used to compare nested models (Smith and Levy 2013).

Number of observations

This is the number of observations used in the analysis. If any of the variables used in the logistic regression have missing values, this number may be smaller than the total number of observations in the data set. Statistical software uses listwise deletion by default, which means that if any of the variables in the logistic regression are missing for a case, the entire case is excluded from the analysis.

Pseudo R-squared

Logistic regression does not have an exact equivalent of the R-squared found in OLS regression; instead, there are many kinds of pseudo R-squared statistics (Ye et al. 2019). This metric should not be interpreted in the same way as R-squared in OLS regression.

Dependent Variables
This is the response variable in the logistic regression.

Coef.
These are the estimated coefficients of the regression equation and are used to predict the dependent variable from the independent variables (Gao et al. 2016). They are expressed in terms of log odds. As in OLS regression, the prediction equation is:

logit(p) = log(p / (1 − p)) = b0 + b1·sexmale + b2·age + b3·cigsPerDay + b4·totChol + b5·sysBP + b6·glucose

where p is the probability of having the condition. For the variables used in this model, the fitted logistic regression equation is:

log(p / (1 − p)) = −9.1264 + 0.5815·sexmale + 0.0655·age + 0.0197·cigsPerDay + 0.0023·totChol + 0.0174·sysBP + 0.0076·glucose

These estimates describe the relationship between the independent variables and the dependent variable on a log-odds scale. Each coefficient gives the expected increase in the log odds of the 10-year CHD outcome for a one-unit increase in the corresponding predictor, holding all other predictors constant.
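To make the fitted equation concrete, the short sketch below plugs a hypothetical patient's values into the coefficients reported above and converts the resulting log odds back to a probability; the patient values are invented for illustration.

```python
import numpy as np

# Coefficients from the fitted model reported above
intercept = -9.1264
coefs = {"sexmale": 0.5815, "age": 0.0655, "cigsPerDay": 0.0197,
         "totChol": 0.0023, "sysBP": 0.0174, "glucose": 0.0076}

# Hypothetical patient (values invented for illustration)
patient = {"sexmale": 1, "age": 55, "cigsPerDay": 10,
           "totChol": 240, "sysBP": 140, "glucose": 85}

log_odds = intercept + sum(coefs[k] * patient[k] for k in coefs)
probability = 1.0 / (1.0 + np.exp(-log_odds))   # convert log odds to a probability
print(f"log odds = {log_odds:.3f}, P(10-year CHD) = {probability:.3f}")
```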

Std. Err.
These are the standard errors associated with the coefficients (Cole 2004). The standard error is used to test whether a parameter is significantly different from 0; dividing the parameter estimate by its standard error gives a z-value. The standard errors can also be used to form a confidence interval for the parameter.

z and P>|z| Values
These columns provide the z-value and the two-tailed p-value used to test the null hypothesis that the coefficient is 0 (Miyamoto et al. 2018). Coefficients with p-values less than or equal to α are statistically significant. For example, if we choose an alpha of 0.05, a coefficient with a p-value of 0.05 or less is statistically significant, i.e., we can reject the null hypothesis and say that the coefficient is significantly different from 0.

Odds ratio (OR) and Logistic Regression (LR)
An odds ratio is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure. When a logistic regression is calculated, the regression coefficient (b) is the estimated increase in the log odds of the outcome per unit increase in the value of the exposure (Park 2013).

Confidence Intervals (CI)
The 95% confidence interval is used to estimate the precision of the odds ratio. A large CI indicates a low level of precision of the OR, while a small CI indicates a higher precision of the OR. It is important to note, however, that unlike the p-value, the 95% CI does not report a measure's statistical significance (Park et al. 2016). In practice, the 95% CI is often used as a proxy for the presence of statistical significance if it does not overlap the null value. Nevertheless, it is improper to interpret an OR whose 95% CI spans the null value as evidence of a lack of association between the exposure and the outcome.
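A brief sketch of how odds ratios and their 95% confidence intervals can be obtained by exponentiating the coefficients and confidence bounds of a fitted statsmodels logistic model is shown below; the fitted object is assumed to come from a sketch such as the backward-elimination example above.

```python
import numpy as np
import pandas as pd

def odds_ratio_table(fitted):
    """Build a table of odds ratios, 95% CIs and p-values from a statsmodels Logit result."""
    ci = fitted.conf_int()                      # default 95% bounds on the coefficients
    return pd.DataFrame({
        "odds_ratio": np.exp(fitted.params),    # exp(coef) is the odds ratio
        "ci_lower": np.exp(ci[0]),
        "ci_upper": np.exp(ci[1]),
        "p_value": fitted.pvalues,
    })

# Usage, assuming `fitted` is the results object returned by backward_eliminate():
# print(odds_ratio_table(fitted))
```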

Model Validation
In machine learning, model validation refers to the process in which a trained model is evaluated with a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived. The main purpose of using the testing data set is to test the generalization ability of a trained model.
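As a minimal illustration of this validation step (the library, data and split ratio are our own choices, not necessarily the authors'), the sketch below holds out part of the data as a test set and evaluates a trained classifier on it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Hold out 30% of the data purely for testing the trained model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```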

Confusion Matrix
A confusion matrix is a table of rows and columns that is frequently used to describe the performance of a classification model on a set of test data for which the true values are known (Aggrawal and Pal 2020).

Positive Likelihood Ratio (LR+):
It is obtained by dividing the true positive rate (sensitivity) by the false positive rate (1 − specificity).

Negative Likelihood Ratio (LR-):
It is the probability of a patient who has the disease testing negative divided by the probability of a patient who does not have the disease testing negative, i.e., (1 − sensitivity) / specificity.
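A small sketch computing sensitivity, specificity and both likelihood ratios from a confusion matrix follows; the label vectors are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]   # invented ground-truth labels
y_pred = [0, 0, 1, 0, 0, 1, 1, 1, 0, 0]   # invented model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)                     # true positive rate (TPR)
specificity = tn / (tn + fp)                     # true negative rate
lr_positive = sensitivity / (1 - specificity)    # LR+ = TPR / FPR
lr_negative = (1 - sensitivity) / specificity    # LR- = FNR / TNR

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, "
      f"LR+={lr_positive:.2f}, LR-={lr_negative:.2f}")
```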

Threshold Values:
To map the output of logistic regression to a binary category, we must define a classification threshold (decision threshold) (Besse et al. 2013). A value above this threshold indicates "disease", and a value below it indicates "no disease". It is tempting to assume that the classification threshold should always be 0.5, but thresholds are problem-dependent, and are therefore values we must tune.
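The sketch below illustrates applying a custom decision threshold to predicted probabilities instead of the default 0.5; the model and data names are placeholders.

```python
import numpy as np

def classify_with_threshold(model, X, threshold=0.5):
    """Label a sample as 'disease' (1) when its predicted probability exceeds threshold."""
    proba = model.predict_proba(X)[:, 1]     # probability of the positive class
    return (proba >= threshold).astype(int)

# Lowering the threshold (e.g. to 0.3) flags more patients as at risk,
# trading extra false positives for fewer false negatives:
# y_pred_default = classify_with_threshold(model, X_test, threshold=0.5)
# y_pred_lowered = classify_with_threshold(model, X_test, threshold=0.3)
```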

ROC Curve:
A receiver operating characteristic (ROC) curve is a graph showing the performance of a classification model at all classification thresholds (Chaurasia and Pal 2020). This curve plots two parameters: the true positive rate (TPR) and the false positive rate (FPR).

Area under Curve:
Area under the curve (AUC) measures the entire two-dimensional area underneath the ROC curve from (0, 0) to (1, 1). AUC provides an aggregate measure of performance across all possible classification thresholds (Gao and Wang). One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
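A short sketch of computing the ROC curve points and the AUC with scikit-learn is shown below, assuming the fitted model and held-out test set from the validation sketch above.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# `model`, `X_test` and `y_test` are assumed to come from the earlier train/test split sketch
proba = model.predict_proba(X_test)[:, 1]          # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, proba)    # one (FPR, TPR) point per threshold
print("AUC =", roc_auc_score(y_test, proba))       # area under the ROC curve
```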

II. Experimental Methodology
The cardiovascular data under study concern residents of Framingham, Massachusetts. The objective is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients' information and comprises over 4,000 records and 15 features. After preprocessing the data set, logistic regression is applied to obtain statistical results such as the standard error, z-value, p-value, and 95% confidence interval. These p-values are then used to select features with p-values <= 0.05. Six machine learning algorithms are applied to obtain accuracy. At the next stage, all the results obtained from the classifiers enter the validation stage, where the ROC, AUC values and confusion matrix are examined. Figure 1 describes the steps used in this experiment.

Experimental Setup
The experimental data are taken from the Framingham Heart Study data set (Kannel et al. 1979). The data set contains 4240 records and 15 attributes. Variable information is provided in Table 1 below. In this data set, some values are missing in the attributes cigsPerDay, BPMeds, totChol, BMI, heartRate and glucose. The total number of missing values was 489, so the rows with missing values were excluded from further analysis. Of the remaining 3751 records, 3179 patients have no 10-year risk of coronary heart disease, and 572 patients are at risk after this time period.
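The following sketch outlines the overall experimental flow under stated assumptions: the file name framingham.csv and the column names follow the common Kaggle release of this data set, the six selected features mirror those reported in the text, and all hyperparameters are scikit-learn defaults rather than the authors' exact settings.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Load the Framingham data (file and column names as in the common Kaggle release)
# and drop the rows with missing values (listwise deletion), leaving 3751 records.
df = pd.read_csv("framingham.csv").dropna()

# Six features retained by the p-value backward elimination described in the text
features = ["male", "age", "cigsPerDay", "totChol", "sysBP", "glucose"]
X, y = df[features], df["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```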

Results
In this section, all the results produced by the models and their significance are examined and explained.

Logistic Regression
Here we use the backward approach: the features with the largest p-values are eliminated one by one, and the regression is refitted repeatedly until all remaining attributes have p-values below 0.05. Table 2 below lists the p-values for the different attributes.

Odds Ratio, Confidence Intervals and P-values
In Table 4 below, the odds ratio, confidence interval and P value are calculated.

Model evaluation with corresponding Statistics
As shown in Table 5, the accuracy, misclassification rate, sensitivity, specificity, positive predictive value, negative predictive value, positive likelihood ratio and negative likelihood ratio have been calculated for each classifier. The accuracy of the GBC classifier is the highest, so it makes the fewest classification errors. Table 6 reports the threshold value (0.5) used by the classifiers to predict whether a patient has heart disease. Table 2 shows the set of attributes with p-values greater than the preferred alpha (5%), which therefore show little statistically significant association with the probability of coronary heart disease. The fitted model (Table 4) shows that, holding all other features constant, the odds of being diagnosed with coronary heart disease for males (sex = 1) relative to females (sex = 0) are 1.788687. In other words, the odds for males are 78.8% higher than the odds for females. The coefficient for age indicates that, holding all else constant, there is a 7% increase in the odds of being diagnosed with CHD for a one-year increase in age, since exp(coef) = 1.067644. Furthermore, with each additional cigarette smoked per day there is a 2% increase in the odds of CHD. For total cholesterol level and glucose level there is no substantial change.
There is a 1.7% increase in the odds for every unit increase in systolic blood pressure. Out of 15 features (Table 5), we selected only six features for analysis using backward elimination based on p-values (Maldonado et al. 2014). Statistical analysis of the data was performed, and descriptive statistics were determined for demographic and disease-specific variables.
Since the model predicts heart disease, a large number of Type II errors is unacceptable. In this setting (Table 6), false negatives are more dangerous than false positives; therefore, to increase sensitivity, the threshold can be lowered. A typical way to visualize the trade-offs of different thresholds is a ROC curve (Figure 3), a plot of the true positive rate versus the false positive rate for all possible choices of threshold. A model with good classification accuracy should have substantially more true positives than false positives at all thresholds. The ideal region of the ROC curve is toward the upper left corner, where specificity and sensitivity are both at their best.
The area under the ROC curve measures model classification accuracy; the higher the area, the greater the separation between true and false positives, and the stronger the model at classifying members of the data set. An area of 0.5 corresponds to a model that performs no better than random classification, and a good classifier stays as far from that as possible (Figure 4). The closer the AUC is to 1, the better. We compare our results with earlier studies in Table 7. Our method achieves better results by using a feature selection (p-value) technique together with six machine learning methods. Based on these results, our model outperforms the results reported in those articles. Importantly, experts need to handle only a fraction of the features rather than the full set, while obtaining results comparable to those achieved with all features. Our strategy can help remove uninformative features and increase the value of the information.

Conclusion
In this report, we propose a straightforward technique to assess the 10-year risk of hard CVD, which relies on the risk factors measured routinely during clinic visits. The result depends on more than 10 years of comprehensive follow-up and ascertainment of the occurrence and course of CVD. Our calculation takes into account the assessment of risk factors, including continuous and categorical risk factors, and it also accounts for the competing risk of non-cardiovascular death. Our method is based on p-value based statistical feature selection and six ML classifiers. Table 5 is a performance table, in which GBC performs better than the other classifiers. The performance of the classifiers is measured by the confusion matrix, ROC and AUC. The 10-year heart disease data set estimates the patient's future heart disease, so the threshold prediction is calculated as p = 0.5 in Table 6. The following conclusions have been drawn from this research:
- All features retained after the elimination process show p-values below 5%, and thus suggest a significant role in heart disease prediction.
- Men appear to be more vulnerable to coronary heart disease than women. Increases in age, number of cigarettes smoked per day, and systolic blood pressure likewise show increasing odds of having coronary disease.
- Total cholesterol shows no significant change in the odds of CHD. This could be a result of the presence of 'good' cholesterol (HDL) in the total cholesterol reading. Glucose also causes an almost negligible change in odds (0.2%).
- The model predicted with 87.61% accuracy using GBC. The model's specificity is higher than its sensitivity.
- The area under the ROC curve is 73.86%, which is reasonably satisfactory.
- In general, the model could be improved with more data.