Regression Tree Based Correlation Technique in Spatial Data Classification



Introduction
Spatial data mining is the extraction of knowledge from large volumes of spatial data, such as the data held in Geographic Information Systems, whose information is tied to geographic locations. Spatial data comprise location characteristics and are stored in spatial databases. Spatial data mining applies mining techniques such as clustering and classification to a spatial database in order to extract significant patterns. Classification, one of the most important data mining methods, is the task of assigning the objects in the spatial database to classes so that the data in one class are similar and share common features. Traditional techniques have been applied to mining spatial data, but they incur high complexity. A suitable classification technique therefore helps to discover and extract interesting patterns from a spatial dataset with lower complexity.
A Logitboost Ensemble-based Decision Tree (LEDT) method was developed in [1] for mapping forest fire vulnerability, but the method failed to improve the classification accuracy. A Graph Convolutional Neural Network (GCNN) architecture was introduced in [2] to evaluate graph-structured spatial data. Although the GCNN improves the accuracy, the classification time was not minimized.
A support vector machine (SVM) algorithm was introduced in [3] with a data modeling technique to estimate the area burned by forest fires, but the error and the false-positive rate were not minimized. A Naive Bayes classification method was introduced in [4] for a fire alarm system. The method achieved better prediction accuracy on a training set based on the smoke source, but its time complexity was not minimized.
An adaptive ensemble method was introduced in [5] to improve the classification of the spatial characteristics of imbalanced data, but it failed to minimize the time complexity of spatial data classification. A Map-Reduce based approach was developed in [6] to find all co-location patterns in a spatial dataset. The approach reasonably reduces the execution time for pattern mining, but the accuracy was not improved. Three different methods were introduced in [7] to reduce land cover classification problems. The methods improve the classification accuracy, but the mapping problem was not solved. A random forests (RFs) classifier was developed in [8] for mapping land cover through the classification of remote sensing big data. The classifier minimizes the classification error, but the classification time was not reduced.
A stacked sparse autoencoder was introduced in [9] to learn high-level features, and spatial data classification was then performed using a random forest classifier. The classifier failed to achieve robust performance. A spatio-temporal data classification method with multidimensional patterns was developed in [10], but the method failed to improve the classification performance in minimum time.
The integration of remote sensing data and GIS was introduced in [11] to identify forest areas at high risk of fire, but the approach did not use machine learning for effective risk prediction. Least-squares support vector machines (LSSVM) and artificial bee colony (ABC) optimization were introduced in [12] for the spatial prediction and mapping of landslides. The model failed to minimize the prediction error.

Problem Statement
Spatial data analysis is a major concern in spatial data mining because of the large size of spatial databases. Numerous recent works address spatial data classification with the aid of different data mining techniques. However, the classification accuracy of these works is not adequate, the conventional techniques incur high complexity, the classification time is high, and the false-positive rate is not reduced. To resolve these issues, a novel Pearson Correlated Regression Tree-based Affine Projective spatial data Classification (PCRT-APSDC) technique is introduced.
The PCRT-APSDC technique employs a fuzzy rule-based classification algorithm and works with multiple spatial data. The spatial data are positioned in the dimensional space and projected into different subsets. The fuzzy rule procedure is used to construct the regression tree, which classifies the input data into different classes with minimum error based on the Pearson correlation measured between the training features and the testing features. After the classification, neighboring spatial data paths in the constructed tree are identified by computing the stress function based on a distance measure.

Contribution
A PCRT-APSDC technique is developed for spatial data classification. In comparison with related works, the proposed technique exhibits improved performance.
The major contributions are described as follows.

• The PCRT-APSDC technique is introduced to improve the spatial data classification accuracy and minimize the classification time. This is achieved by an affine spatial projection, which maps the total data into different subsets using a fuzzy rule-based regression tree classification technique. Each internal node in the regression tree measures the relationship between the training features and the testing features using the Pearson correlation coefficient. Based on the correlation value, the data are classified into different subsets in minimum time.
• Fuzzy rule-based classification is used to discover the fired region in the forest according to the correlation value.
• The gradient descent function is applied after the spatial data classification to minimize the training error, which helps to reduce the false-positive rate. In the tree, the neighboring spatial data path is identified by calculating the stress function, a distance function measured between the nodes in the tree.
The paper is organized as follows. Related works are presented in Section 2. The problem definition and the proposed Pearson Correlated Regression Tree-based Affine Projective Spatial Data Classification (PCRT-APSDC) methodology are presented in Section 3. The experimental evaluation and parameter settings are presented in Section 4, and the performance analysis of the different parameters across the compared classification techniques is described in Section 5. Finally, the conclusion of the paper is presented in Section 6.

Related Works
A hybrid of Random Subspace (RSS) and Classification and Regression Trees (CART) was developed in [13] for forecasting landslides with the help of spatial data. The hybrid technique failed to minimize the prediction error. A novel approach that uses extinction filters was designed in [14] to accurately extract spatial and contextual information from remote sensing images. However, it is not applicable to conventional attribute profiles. Four different classification algorithms were introduced in [15] for identifying burned areas on a global scale, but the classification accuracy remained unaddressed. The stacked sparse autoencoder (SSAE) was developed in [16] for classifying data based on local spatial information. Although the method improves the classification accuracy, the false-positive rate was not minimized.
A GIS-based machine learning technique was introduced in [17] for estimating groundwater nitrate concentration from spatial data, but the spatial data classification time was not minimized. A Bayesian spatial generalized linear mixed model (SGLMM) was developed in [18] to classify spatial data. The model has a high complexity in spatial data classification. Formal concept analysis (FCA) was presented in [19] for the dynamic classification of spatial data, but the classification error was not minimized. Machine learning models were developed in [20] to improve predictive performance with spatial data. Although the models improve the prediction accuracy, the prediction time was not minimized. An Extreme Learning Machine (ELM) was introduced in [21] for classifying spatial environmental data. The ELM minimizes the mean square error, but the time complexity remained unaddressed.
Differential Flower Pollination (DFP) and mini-match backpropagation (MnBp) were introduced in [22] for predicting forest fire danger using spatial data, but advanced machine learning or soft computing techniques were not used to improve the forest fire danger prediction. An artificial neural network based on a multilayer perceptron was developed in [23] for predicting forest fires. The network minimizes the global error at the output layer, but the time complexity was not minimized.
Piecewise linear regression and predictive modeling were introduced in [24] for predictive analytics in data management systems (DMS). A novel multifeature dictionary learning algorithm (MF-SADL) was developed in [25] for hyperspectral image classification. However, the classification accuracy was not improved. A deep neural network (DNN) was introduced in [26] to extract features for improving the accuracy, but the classification time was not reduced. A new mining paradigm named spatial-temporal fluctuating patterns (STFs) was introduced in [27] for determining frequent patterns in spatial-temporal data. A spectral clustering approach was designed in [28] for multivariate geostatistical data. However, it did not address classification accuracy.

Methodology
The size of spatial datasets has been growing significantly in recent times. Spatial data analysis is difficult for human analysts because the databases are large and novel techniques are required to discover the patterns. Moreover, analyzing such a database is time-consuming and error-prone, since the spatial data structure is more complex than that of an ordinary database. Spatial data mining is therefore a difficult and complex task.
An efficient data mining technique, classification, is employed to address these issues. Motivated by this, the proposed Pearson Correlated Regression Tree-based Affine Projective spatial data Classification (PCRT-APSDC) technique is developed to improve the spatial data classification accuracy with a minimal error rate.

Fig. 1 Flow process of the PCRT-APSDC technique
Fig. 1 shows the flow process of the proposed PCRT-APSDC technique for classifying spatial data in minimum time. The spatial dataset (i.e., the forest fire dataset) includes a number of attributes (i.e., features) $a_1, a_2, a_3, \ldots, a_m$, and each attribute contains a set of data $\{d_1, d_2, d_3, \ldots, d_n\}$. Using the forest fire dataset, the burned area is predicted based on the classification. Initially, the data are collected from the dataset. The classification is then performed using the Pearson correlated regression tree. The classification process of the proposed PCRT-APSDC technique is described in the following subsection.

Pearson Correlated Regression Tree-Based Affine Projective Spatial Data Classification
The multiple spatial values of the data are positioned in the dimensional space. Mathematically, the affine spatial projection is the process of mapping the total dataset into different subsets based on the fuzzy rule. Here, the total dataset represents the data taken from the spatial dataset and the subsets denote the classification outcomes. The proposed PCRT-APSDC technique performs the classification through the fuzzy rules.
The Pearson Correlated Regression Tree is a machine learning technique whose flow-chart-like structure is used to classify the given dataset into two classes, fired region and non-fired region. A regression tree includes three types of nodes: the root node, internal nodes, and leaf nodes. The topmost node in the tree is the root node, where the decision is taken by applying the fuzzy rules. Each internal (non-leaf) node performs a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node provides a class label. The root node measures the correlation between the features, and the fuzzy rule is then applied to classify the data. The Pearson correlation coefficient is computed as
$$r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[ n \sum x_i^2 - \left( \sum x_i \right)^2 \right] \left[ n \sum y_i^2 - \left( \sum y_i \right)^2 \right]}} \quad (1)$$
In (1), $r$ denotes the correlation coefficient and $n$ denotes the number of paired feature values. $\sum x_i y_i$ denotes the sum of the products of the paired scores of the two features, $\sum x_i^2$ represents the sum of the squared scores of $x_i$, and $\sum y_i^2$ represents the sum of the squared scores of $y_i$. The correlation coefficient $r$ takes values between $-1$ and $+1$: a value of $+1$ indicates a positive correlation and a value of $-1$ indicates a negative correlation between the two features.
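The correlation test performed at a node can be sketched in Python as follows. This is a minimal illustration that assumes the standard raw-score form of the Pearson coefficient; the function name `pearson_r` and the sample values are hypothetical and are not taken from the forest fire dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (raw-score form)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # sum of products of paired scores
    sum_xx = sum(a * a for a in x)              # sum of squared scores of x
    sum_yy = sum(b * b for b in y)              # sum of squared scores of y
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return numerator / denominator


# Hypothetical training and testing feature values.
train_feature = [1.0, 2.0, 3.0]
r_positive = pearson_r(train_feature, [2.0, 4.0, 6.0])
r_negative = pearson_r(train_feature, [6.0, 4.0, 2.0])
```

With these toy values the first pair is perfectly positively correlated and the second perfectly negatively correlated, which correspond to the two extreme outcomes described above.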

Fuzzy Rule-Based Classifications
After finding the correlation between the features, the fuzzy rule is applied to the nearest-neighbor dimensional space to classify the spatial data. The fuzzy rule connects the inputs (i.e., the spatial data) with the outputs (i.e., the classification results). The rules are formulated with an IF (condition) part and a THEN (conclusion) part. The condition part checks the correlation value between the features, and the conclusion part provides the desired classification result.
In (2), $Y$ represents the classification output. In this way, the total dataset is projected into two different subsets. After the classification, the error is computed to reduce incorrect data classification. The training error is calculated from the actual classification and the predicted classification results, as given in (3). In (3), $E_t$ denotes the training error, $y$ represents the actual classification, and $y_p$ represents the predicted classification result. The gradient descent function is used to minimize the error in the classification process,
$$f(x) = \arg\min E_t \quad (4)$$
In (4), $f(x)$ represents the gradient descent function, $\arg\min$ denotes the argument of the minimum function, and $E_t$ denotes the training error. In this way, all the data are classified and the fired region in the forest is predicted.
After the classification, the frequent and persistent soft-cycle paths in the tree are identified from the spatial data by computing the stress function, which speeds up the tree construction process. The stress function is calculated in terms of distance: the distance between spatial data points is used to identify the soft-cycle neighboring spatial data paths in the constructed tree. Let the coordinates of two nodes be $(x_1, y_1)$ and $(x_2, y_2)$ in the two-dimensional space. The distance between the nodes in the tree is computed as follows,
$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \quad (5)$$
In (5), $d$ represents the distance between the nodes. The minimum distance is used to find the neighboring spatial data paths in the tree. This helps to accurately find the neighboring fired area in the forest in minimum time. The algorithmic procedure of the proposed PCRT-APSDC technique is described in Algorithm 1.
Algorithm 1 describes the process of Pearson Correlated Regression Tree-based Affine Projective spatial data classification with minimum error. The spatial data are positioned in the given dimensional space. The input dataset is then mapped into different subsets by constructing the regression tree. The regression tree-based classification is performed through the correlation between the training feature data and the testing feature data. If the two features are positively correlated, the data are classified into one subset. Otherwise, the data are classified into the other subset. Next, the classification error is calculated and minimized using the gradient descent function. This helps to improve the spatial data classification accuracy and to minimize the error rate. Finally, the neighboring spatial data paths are identified through the distance function to find the neighboring fired paths in the forest in minimum time.
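The steps described for Algorithm 1 can be sketched in Python as follows. This is only an illustrative reading of the description, not the authors' implementation: the sign-based rule, the Euclidean distance between node coordinates, the function names, and all sample values are assumptions, and the gradient descent step is represented only by measuring the error that it would minimize.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (deviation form)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def classify(train_features, test_features):
    """Assumed rule: positive correlation -> fired, otherwise non-fired."""
    return "fired" if pearson_r(train_features, test_features) > 0 else "non-fired"


def error_rate(actual, predicted):
    """Fraction of instances whose predicted class differs from the actual one."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


def nearest_fired(node, fired_nodes):
    """Fired node with the minimum Euclidean distance to `node`."""
    return min(fired_nodes, key=lambda other: math.dist(node, other))


# Hypothetical data: node coordinates mapped to testing feature values.
reference = [1.0, 2.0, 3.0, 4.0]
samples = {
    (0.0, 0.0): [2.0, 4.1, 6.2, 7.9],
    (3.0, 4.0): [8.0, 6.1, 3.9, 2.0],
    (1.0, 1.0): [1.1, 1.9, 3.2, 4.0],
}
labels = {pos: classify(reference, feats) for pos, feats in samples.items()}
fired = [pos for pos, label in labels.items() if label == "fired"]

# Hypothetical ground truth, used only to show the error computation.
actual = ["fired", "non-fired", "non-fired"]
error = error_rate(actual, list(labels.values()))
neighbor = nearest_fired((3.0, 4.0), fired)
```

In this toy run two of the three nodes are labeled as fired, and the fired node nearest to (3.0, 4.0) is found by minimizing the distance, mirroring the final step of the description.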
The above algorithm is implemented in the experimental evaluation to show the performance of the proposed PCRT-APSDC technique.
The experiments are carried out with the following parameters:
• classification accuracy
• false-positive rate
• classification time

Datasets
The Forest Fires Dataset [29] is taken from the UCI machine learning repository. The aim of the dataset is to predict the burned area of forest fires in the northeast region of Portugal with the help of meteorological data. The dataset comprises 517 instances and 13 attributes. The associated task of the dataset is regression. The attribute characteristics are real and the dataset characteristics are multivariate. The proposed PCRT-APSDC technique uses the holdout method for validation. The input dataset is separated into a training set and a testing set: 60% of the data are used for training and the remaining 40% are used for testing. In the experiments, the number of spatial data (i.e., instances) taken from the forest fires dataset ranges from 50 to 500.
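A 60/40 holdout split of this kind can be sketched in Python as follows. The shuffling step, the fixed seed, and the function name `holdout_split` are assumptions made for illustration; the text does not state how the instances were assigned to the two sets.

```python
import random


def holdout_split(instances, train_fraction=0.6, seed=0):
    """Shuffle the instances and split them into training and testing sets."""
    rows = list(instances)
    random.Random(seed).shuffle(rows)       # assumed: random assignment
    cut = int(len(rows) * train_fraction)   # size of the training set
    return rows[:cut], rows[cut:]


# Instance indices stand in for the 517 rows of the dataset.
train, test = holdout_split(range(517))
```

For 517 instances this yields 310 training and 207 testing instances, since 60% of 517 is not a whole number and the training size is rounded down.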

Results And Discussion
The experimental results of the proposed PCRT-APSDC technique and of the existing methods, namely LEDT [1], GCNN [2], the SVM algorithm [3], and the Naive Bayes classification method [4], are discussed in this section with respect to classification accuracy, false-positive rate, and classification time. The performance results are evaluated with the help of graphical representations. For each subsection, a sample mathematical computation is presented.

Performance Results of Classification Accuracy
The classification accuracy is defined as the ratio of the number of data correctly classified for predicting the burned area to the total number of spatial data. The classification accuracy is calculated as follows,
$$\text{Classification accuracy} = \frac{\text{number of data correctly classified}}{n} \times 100 \quad (6)$$
In (6), $n$ refers to the total number of spatial data. The classification accuracy is measured as a percentage (%). The classification accuracy of the PCRT-APSDC technique is compared with that of the four conventional methods LEDT [1], GCNN [2], the SVM algorithm [3], and the Naive Bayes classification method [4]. The classification accuracy results are shown in Table 1.
Table 1
Fig. 4 illustrates the experimental results of classification accuracy versus the number of spatial data.

Fig. 4 Performance results of classification accuracy
For the experimental evaluation, the number of spatial data ranges from 50 to 500. In total, ten classification accuracy results are obtained with the various inputs, as shown in Fig. 4. The graphical results clearly show that the classification accuracy is higher with the proposed PCRT-APSDC technique than with the conventional techniques. This improvement is achieved by projecting the total spatial data into different subsets. The mapping of the spatial data is carried out using the regression tree. The spatial data are collected from the forest fire dataset. The correlations of the training features with the fire testing features are then measured to classify the given instance (i.e., data) into the burned area. This helps the proposed PCRT-APSDC technique to increase the number of spatial data correctly classified and to predict the burned area in the given location effectively. In addition, the neighboring burned area is identified by measuring the distance between the nodes in the tree. The statistical results confirm that the classification accuracy of the proposed PCRT-APSDC technique is higher than that of the existing methods. Consider 50 spatial data: 43 data are correctly classified using the PCRT-APSDC technique, giving an accuracy of 86%. Similarly, 33, 37, 30, and 28 data are correctly classified by the existing LEDT, GCNN, SVM algorithm, and Naive Bayes classification method, giving classification accuracies of 66%, 74%, 60%, and 56%, respectively.
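The sample computation above can be reproduced with a short Python sketch. The counts are the ones reported for 50 spatial data; the function name and the dictionary layout are illustrative choices.

```python
def classification_accuracy(correct, total):
    """Percentage of spatial data that are correctly classified."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Correctly classified counts reported for 50 spatial data.
correct_counts = {
    "PCRT-APSDC": 43,
    "LEDT": 33,
    "GCNN": 37,
    "SVM": 30,
    "Naive Bayes": 28,
}
accuracies = {
    name: classification_accuracy(count, 50)
    for name, count in correct_counts.items()
}
```

The resulting values are 86%, 66%, 74%, 60%, and 56%, matching the percentages stated for this sample.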
The classification accuracy of the proposed technique is compared with that of the existing techniques. The comparison shows that the proposed PCRT-APSDC technique improves the classification accuracy by 20%, 10%, 36%, and 56% compared with LEDT, GCNN, the SVM algorithm, and the Naive Bayes classification method, respectively.

Performance Results of False-Positive Rate
The false-positive rate is defined as the ratio of the number of data incorrectly classified to the total number of spatial data. The false-positive rate is calculated as follows,
$$\text{False-positive rate} = \frac{\text{number of data incorrectly classified}}{n} \times 100 \quad (7)$$
In (7), $n$ refers to the total number of spatial data. The false-positive rate is measured as a percentage (%).
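A minimal Python sketch of this measure follows. It uses the definition given here (incorrectly classified data over all spatial data) rather than the confusion-matrix form; the function name and the sample counts are hypothetical and are not results from the experiments.

```python
def false_positive_rate(incorrect, total):
    """Percentage of spatial data that are incorrectly classified."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= incorrect <= total:
        raise ValueError("incorrect must lie between 0 and total")
    return 100.0 * incorrect / total


# Hypothetical example: 5 of 50 spatial data are incorrectly classified.
example_rate = false_positive_rate(5, 50)
```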
The false-positive rate of the PCRT-APSDC technique is compared with that of four state-of-the-art methods: LEDT [1], GCNN [2], the SVM algorithm [3], and the Naive Bayes classification method [4]. The false-positive rate results are given in Table 2. In Fig. 5, the false-positive rate is plotted against the number of spatial data. The fuzzy rule is applied to the tree structure to classify the given spatial data with the help of the correlation between the features, from which the fire in the specific location, as well as in the neighboring area, is predicted with minimum error. Ten different false-positive rate results are obtained for each technique. The results of the proposed PCRT-APSDC technique are compared with the results of the existing classification methods. On average, the false-positive rate of the proposed PCRT-APSDC technique is lower by 62% compared with LEDT, 47% compared with GCNN, 73% compared with the SVM algorithm, and 79% compared with the Naive Bayes classification method.

Performance Results of Classification Time
The classification time is defined as the amount of time required to classify the spatial data. The classification time is calculated as follows,
$$CT = n \times t_c \quad (8)$$
In (8), $CT$ represents the classification time, $n$ denotes the number of spatial data, and $t_c$ represents the time taken to classify a single spatial data. The classification time is measured in milliseconds (ms). The classification time of the PCRT-APSDC technique is compared with that of four state-of-the-art methods: LEDT [1], GCNN [2], the SVM algorithm [3], and the Naive Bayes classification method [4]. The classification time results are given in Table 3.
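The definition of classification time can be expressed as a one-line Python function. The per-item time used below is a hypothetical value chosen for illustration, not a measurement from the experiments.

```python
def classification_time(num_data, time_per_item_ms):
    """Total classification time in ms: number of data times time per item."""
    if num_data < 0 or time_per_item_ms < 0:
        raise ValueError("inputs must be non-negative")
    return num_data * time_per_item_ms


# Hypothetical example: 50 spatial data at 0.5 ms each.
example_time_ms = classification_time(50, 0.5)
```

Under this definition the classification time grows linearly with the number of spatial data when the per-item time is constant.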

Fig. 2 Structure of the regression tree
Fig. 2 illustrates the structure of the regression tree used to classify the spatial data into a fired region or a non-fired region. For each node in the tree, the correlation between the training features and the testing features (i.e., the forest fire causing features) is measured using the Pearson correlation, as given in (1).

Fig. 3 Fuzzy rule-based classification
Fig. 3 shows the fuzzy rule-based classification used to identify the fired region in the forest based on the correlation value. By the established rules, the data are classified into two subsets, fired region and non-fired region, based on the correlation values. The output of the regression tree is given below, where $Y$ is the classification output and $r$ is the correlation coefficient,
$$Y = \begin{cases} \text{fired region}, & \text{if } r = +1 \\ \text{non-fired region}, & \text{if } r = -1 \end{cases} \quad (2)$$
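The IF-THEN rule on the correlation value can be sketched in Python as follows. This is a crisp simplification for illustration: the text only specifies the outcomes for correlation values of +1 and -1, so treating every positive value as the fired class, along with the `threshold` parameter and the function name, is an assumption.

```python
def fuzzy_rule(r, threshold=0.0):
    """IF the correlation is positive THEN fired region ELSE non-fired region."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation coefficient must lie in [-1, 1]")
    # Assumed rule: values above the threshold map to the fired class.
    return "fired region" if r > threshold else "non-fired region"


label_positive = fuzzy_rule(1.0)
label_negative = fuzzy_rule(-1.0)
```

Exposing the threshold as a parameter keeps the sketch usable for intermediate correlation values, which occur in practice even though only the two extreme cases are listed in the rule.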

Fig. 5 Performance results of the false-positive rate
The false-positive rate measures the spatial data that are incorrectly classified. The false-positive rates of the five methods, namely PCRT-APSDC, LEDT [1], GCNN [2], the SVM algorithm [3], and the Naive Bayes classification method [4], are represented by five lines of different colors, as shown in Fig. 5. The false-positive rate of the proposed PCRT-APSDC technique is lower than that of the existing methods. The reason is that the classification error is minimized by the gradient descent function: applying the gradient descent function after the classification minimizes the training error.