An Optimal Feature Selection with Wavelet Kernel Extreme Learning Machine for Big Data Analysis of Product Reviews

In recent times, big data has been generated at an exponential rate from diverse textual sources such as review sites, media, and blogs. Sentiment analysis (SA) is useful for classifying the opinions found in big data into different kinds of sentiments. Therefore, SA on big data helps a business to draw commercially beneficial insights from text-based content. Though several SA approaches have been presented, there is still a need to improve the performance of SA in order to interpret customers' feedback and increase product quality. This paper introduces a novel social spider optimization based feature selection with wavelet kernel extreme learning machine (SSO-WKELM) model. The proposed model initially performs preprocessing to remove unwanted words. Then, Term Frequency-Inverse Document Frequency (TF-IDF) is utilized as a feature extraction technique to extract the set of feature vectors. Besides, a social spider optimization (SSO) algorithm is utilized for the feature selection process and thereby achieves improved classification performance. Subsequently, WKELM is employed as a classifier to label user reviews as positive or negative. For experimental validation, a product review dataset derived from Amazon along with synthetic data is used. The experimental results demonstrated the superior classification performance of the SSO-WKELM model.

[Figure: Data complexity in big data]
Prediction of sentiments in decision making is a major concern, since it helps prevent wrong decisions in human-driven business actions. Moreover, SA is described as the process of examining the emotions expressed in a text [7]. Hence, in this competitive world, understanding user demand and market-driven manufacturing is one of the key objectives of businesses. In line with this, text analysis and SA help a business to make informed decisions from text-based content such as word documents, email, and posts on social networks like Twitter, Facebook (FB), and LinkedIn. Consequently, pervasive SA has attracted growing attention in both academic research and business.
Tsai et al. [8] reviewed studies on data analytics, from classical data analysis to present-day big data analysis. From a system perspective, knowledge discovery is consolidated into 3 portions, namely input, prediction, and result. Sharef et al. [9] outlined advanced models in SA, such as sentiment polarity forecasting, SA features, sentiment classification models, and utilities of SA. Moreover, Graham et al. [10] examined the classification of "big data logistics." It provides a novel view for identifying accurate words and predicting "positive" and "negative" emotions. Diverse business sectors such as public and private, industrial, healthcare, and retail are described as combining big data and logistics. Text is handled with KNIME text modeling and converted into mathematical objects so that standard KNIME data mining models can be applied.
Besides, Sharma et al. [11] carried out a study on SA for big data that covers recent developments in opinion mining. The authors identified sentiment mining as one of the well-known and important objectives. Although numerous works have been carried out, specific challenges in sentiment mining remain, particularly those related to unstructured data. Lastly, Balaji et al. [12] performed a study on stock prediction based on emotion and big data analysis. This work reviewed each framework together with stock prediction models and big data analytics. It demonstrates that numerous approaches are used in various processes and that predictive analytics is highly effective in stock prediction.
This paper introduces a novel optimal social spider optimization based feature selection (FS) with wavelet kernel extreme learning machine (SSO-WKELM) method. The proposed model initially performs preprocessing to remove unwanted words. Then, Term Frequency-Inverse Document Frequency (TF-IDF) is utilized as a feature extraction method to extract the set of feature vectors. Besides, a social spider optimization (SSO) algorithm is utilized for the FS process and thereby achieves improved classification performance. At last, the WKELM model is applied as a classification technique to assign class labels to the product reviews. To verify the performance of the SSO-WKELM model, a series of simulations was performed on a product review dataset from Amazon and a synthetic dataset.

The Proposed SSO-WKELM Model
The workflow involved in the SSO-WKELM model is illustrated in Fig. 2. The input product review dataset is initially preprocessed to remove the unwanted words that exist in it. Afterward, a set of feature vectors is extracted by the use of the TF-IDF technique. Next, the SSO algorithm chooses an appropriate subset of features. At last, the WKELM model is applied as a classification technique to assign class labels to the product reviews.
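The preprocessing step can be pictured with a minimal sketch. The stop-word list and the function name `preprocess` are assumptions made for illustration only; the paper does not specify which words are treated as unwanted.

```python
import re

# Hypothetical, deliberately tiny stop-word list (not taken from the paper).
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "or", "of", "to", "in", "was"}

def preprocess(review):
    """Lowercase a review, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("This is a great phone and the battery was excellent!"))
```

The resulting token lists are what a feature extraction step such as TF-IDF would consume.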

TF-IDF Model
The most prominently applied metric in information retrieval is TF-IDF. Such weighting approaches measure the probability-weighted amount of information carried by a term in a given document. In traditional information theory, IDF is interpreted as the 'amount of information', given by the log of the inverse probability. Based on this definition, TF-IDF is defined as the product of two quantities, namely TF and IDF. The term frequency, in turn, estimates the probability of occurrence of a term, normalized by the total term frequency in a document, depending on the scope of estimation. According to the fundamental formulation of information theory [13], a document is regarded as an unordered set of terms.
Assume $D = \{d_1, \dots, d_N\}$ is a collection of documents and $W = \{w_1, \dots, w_M\}$ is the set of distinct terms in $D$. Here, the documents in $D$ form the corpus of data gathered from Twitter, while $W$ denotes the query terms. Hence, $N$ is the overall number of documents, whereas $M$ signifies the number of terms. In information theory, the selection of a term $w_i$ from $W$ and the selection of a document $d_j$ from $D$ are considered.
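As a rough illustration of the TF-IDF weighting, the sketch below uses one common variant, tf = count / document length and idf = log(N / df). The exact weighting used in the paper is not stated here, so this variant, the function name `tf_idf`, and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / len(doc), idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = [["great", "phone"], ["poor", "phone"], ["great", "battery", "great"]]
weights = tf_idf(docs)
print(weights[1])
```

Under this variant a term that occurs in every document receives weight 0, which is why rarer terms such as "poor" dominate the vector of the second review.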

SSO based Feature Selection Process
In general, SSO considers the search space as a communal spider web, and each candidate solution in the population represents a spider. Each spider receives a weight according to the fitness of the solution it represents. In this model, 2 different sets of evolutionary operators are used to simulate the distinct cooperative behaviors adopted in the colony.
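It is modeled for resolving nonlinear global optimization problems with box constraints of the form:
$$\text{maximize or minimize } f(\mathbf{x}), \quad \mathbf{x} = (x^1, \dots, x^d) \in \mathbb{R}^d, \quad \text{subject to } \mathbf{x} \in \mathbf{X}$$
where $f: \mathbb{R}^d \to \mathbb{R}$ is a nonlinear function and $\mathbf{X} = \{\mathbf{x} \in \mathbb{R}^d \mid l_h \le x^h \le u_h,\ h = 1, \dots, d\}$ defines a bounded feasible space restricted by the lower ($l_h$) and upper ($u_h$) limits.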
Here, SSO applies a population $\mathbf{S}$ of $N$ candidate solutions to solve the optimization problem. Each solution denotes a spider position, while the communal web represents the search space $\mathbf{X}$. The population is divided into 2 groups of search agents, namely males ($\mathbf{M}$) and females ($\mathbf{F}$). To simulate an actual spider colony, the number of females $N_f$ is selected randomly from the overall population $N$, while the rest, $N_m$, are taken as male individuals ($N_m = N - N_f$). Accordingly, the group $\mathbf{F}$ gathers the female individuals ($\mathbf{F} = \{f_1, f_2, \dots, f_{N_f}\}$) and the group $\mathbf{M}$ the male individuals ($\mathbf{M} = \{m_1, m_2, \dots, m_{N_m}\}$), where $\mathbf{S} = \mathbf{F} \cup \mathbf{M}$ ($\mathbf{S} = \{s_1, s_2, \dots, s_N\}$).
In this algorithm, each spider is assigned a weight $w_i$ that depends on the fitness of its solution, and the weight is estimated using the given expression:
$$w_i = \frac{J(s_i) - worst_S}{best_S - worst_S}$$
where $J(s_i)$ depicts the fitness of the $i$-th spider position, $i \in 1, \dots, N$, and $best_S$ and $worst_S$ denote the best and worst fitness values of the complete population $\mathbf{S}$.
The exchange of information is a basic action of SSO in the optimization model [14]. It is simulated by the vibrations generated in the web. The vibration that spider $i$ receives from spider $j$ is modeled as follows:
$$Vib_{i,j} = w_j \, e^{-d_{i,j}^2}$$
where $w_j$ denotes the weight of the $j$-th spider and $d_{i,j}$ depicts the distance between the 2 spiders. A spider is capable of perceiving 3 classes of vibration, $Vibn_i$, $Vibf_i$, and $Vibb_i$. $Vibn_i$ is the vibration originated by the nearest spider $s_n$ with a higher weight than spider $i$ ($w_n > w_i$). $Vibf_i$ is generated by the closest female spider $s_f$ and is relevant only if $i$ is a male spider, and $Vibb_i$ is generated by the best spider $s_b$ in the population $\mathbf{S}$.
The population of spiders is evolved from the initial stage ($k = 0$) up to a predefined number of iterations. Depending on gender, each spider is updated by a different set of evolutionary operators. For female spiders, the new position $f_i^{k+1}$ is obtained by modifying the current spider position $f_i^k$. The modification is governed by a probability factor $PF$, and the movement is generated in relation to other spiders according to the vibrations they transmit over the search space:
$$f_i^{k+1} = \begin{cases} f_i^k + \alpha\, Vibn_i\,(s_n - f_i^k) + \beta\, Vibb_i\,(s_b - f_i^k) + \delta\,(rand - \tfrac{1}{2}) & \text{with probability } PF \\ f_i^k - \alpha\, Vibn_i\,(s_n - f_i^k) - \beta\, Vibb_i\,(s_b - f_i^k) + \delta\,(rand - \tfrac{1}{2}) & \text{with probability } 1 - PF \end{cases}$$
where $\alpha$, $\beta$, $\delta$, and $rand$ are random values from $[0, 1]$, $k$ denotes the iteration number, and $s_n$ and $s_b$ denote the nearest spider with a higher weight than $f_i^k$ and the best spider in the communal web, respectively.
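A minimal sketch of the weighting, vibration, and female movement steps is given below. It assumes the usual SSO formulation: weights are fitness values rescaled to [0, 1], a vibration is the sender's weight damped by exp(-distance²), and a female is attracted to (with probability `pf`) or repelled from the nearest heavier spider and the best spider. The value `pf=0.7`, the function names, and the handling of edge cases are assumptions, and clipping to the box limits is omitted.

```python
import math
import random

def spider_weights(fitness):
    """Rescale fitness values to [0, 1]; the best spider gets weight 1 (maximization assumed)."""
    best, worst = max(fitness), min(fitness)
    if best == worst:  # assumption: equal weights for a degenerate population
        return [1.0] * len(fitness)
    return [(j - worst) / (best - worst) for j in fitness]

def sq_dist(p, q):
    """Squared Euclidean distance between two positions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def vibration(w_j, pos_i, pos_j):
    """Vibration perceived at pos_i from a spider of weight w_j located at pos_j."""
    return w_j * math.exp(-sq_dist(pos_i, pos_j))

def move_female(i, positions, weights, pf=0.7, rng=random):
    """Return the new position of female spider i."""
    f = positions[i]
    best = max(range(len(positions)), key=lambda j: weights[j])
    heavier = [j for j in range(len(positions)) if weights[j] > weights[i]]
    # Nearest heavier spider; if none exists, near == i and that term vanishes.
    near = min(heavier, key=lambda j: sq_dist(f, positions[j])) if heavier else i
    vib_n = vibration(weights[near], f, positions[near])
    vib_b = vibration(weights[best], f, positions[best])
    alpha, beta, delta = rng.random(), rng.random(), rng.random()
    sign = 1.0 if rng.random() < pf else -1.0  # attraction with probability pf, else repulsion
    noise = delta * (rng.random() - 0.5)
    return [
        x + sign * (alpha * vib_n * (positions[near][h] - x)
                    + beta * vib_b * (positions[best][h] - x)) + noise
        for h, x in enumerate(f)
    ]

w = spider_weights([1.0, 3.0, 2.0])
print(w, move_female(0, [[0.1, 0.2], [0.8, 0.9], [0.4, 0.5]], w, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is convenient when comparing feature subsets across experiments.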
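Conversely, the male spiders are categorized into 2 classes, namely dominant ($\mathbf{D}$) and non-dominant ($\mathbf{ND}$). The group $\mathbf{D}$ is composed of the male spiders whose weights are above the median weight of the complete male set, and the group $\mathbf{ND}$ is formed by the remaining male spiders. In the optimization method, male spiders are updated by the given expression:
$$m_i^{k+1} = \begin{cases} m_i^k + \alpha\, Vibf_i\,(s_f - m_i^k) + \delta\,(rand - \tfrac{1}{2}) & \text{if } m_i \in \mathbf{D} \\ m_i^k + \alpha \left( \dfrac{\sum_{h=1}^{N_m} m_h^k\, w_{N_f+h}}{\sum_{h=1}^{N_m} w_{N_f+h}} - m_i^k \right) & \text{if } m_i \in \mathbf{ND} \end{cases}$$
where $\alpha$, $\delta$, and $rand$ are random values from $[0, 1]$ and $s_f$ indicates the closest female spider to male individual $i$. When a new spider $s_{new}$ is produced, it is compared with the rest of the population: if the new spider has better fitness than the worst spider, the worst spider is replaced by $s_{new}$; otherwise $s_{new}$ is discarded. Fig. 3 illustrates the flowchart of the SSO technique. The SSO algorithm for the FS process involves the following operations [15].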
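The male update can be sketched in the same spirit. The sketch assumes the usual SSO convention that a male is dominant when its weight exceeds the median male weight, that dominant males move toward the closest female, and that non-dominant males move toward the weighted mean position of all males. The function names and arguments are illustrative assumptions.

```python
import random
import statistics

def weighted_mean(males, weights):
    """Weighted mean position of the male population."""
    total = sum(weights)
    if total == 0:  # assumption: fall back to a plain mean
        weights, total = [1.0] * len(males), float(len(males))
    dims = len(males[0])
    return [sum(w * m[h] for m, w in zip(males, weights)) / total for h in range(dims)]

def move_male(i, males, weights, female, vib_f, rng=random):
    """Return the new position of male i.

    female: position of the closest female; vib_f: vibration perceived from her.
    """
    m = males[i]
    alpha = rng.random()
    if weights[i] > statistics.median(weights):  # dominant male
        noise = rng.random() * (rng.random() - 0.5)
        return [x + alpha * vib_f * (female[h] - x) + noise for h, x in enumerate(m)]
    center = weighted_mean(males, weights)  # non-dominant male
    return [x + alpha * (center[h] - x) for h, x in enumerate(m)]

males = [[0.0], [1.0], [2.0]]
weights = [0.1, 0.5, 0.9]
print(move_male(0, males, weights, female=[5.0], vib_f=0.3, rng=random.Random(0)))
```

Non-dominant males therefore drift toward the center of the male group, while dominant males are pulled toward a female in proportion to the vibration they perceive.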

Initial stage
Here, the approach is initialized by generating a random population of $N$ solutions (spider positions) $x_i$, $i = 1, \dots, N$. To evaluate the solutions, a fitness function (FF) is applied; beforehand, each solution $x_i(t)$ at iteration $t$ has to be transformed into a binary vector by thresholding its components against a random value drawn from $[0, 1]$. In the resulting boolean vector, the selected features correspond to 1, and the remaining features, marked 0, are discarded.
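The binary transformation can be sketched as follows, assuming a plain comparison of each component with a fresh random threshold in [0, 1). Other transfer functions (for example a sigmoid applied before the comparison) are also common, so the exact rule, like the names `binarize` and `selected_features`, is an assumption.

```python
import random

def binarize(position, rng=random):
    """Map a continuous spider position to a 0/1 feature mask.

    Assumed rule: a feature is kept when its component exceeds a random threshold.
    """
    return [1 if x > rng.random() else 0 for x in position]

def selected_features(mask, names):
    """Return the feature names whose mask entry is 1."""
    return [name for name, bit in zip(names, mask) if bit == 1]

mask = binarize([0.9, 0.1, 0.6], rng=random.Random(0))
print(mask, selected_features(mask, ["great", "poor", "battery"]))
```

Components close to 1 are kept most of the time and components close to 0 are rarely kept, so the continuous position acts as a per-feature selection probability.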
The FF value applied in the proposed approach is defined as:

$$Fit(x_i) = \omega_1 \times Err(x_i) + \omega_2 \times |x_i| \qquad (7)$$
$$Err(x_i) = \frac{\text{No. of incorrectly classified samples}}{\text{Total number of samples}}$$
where $|\cdot|$ means the count of selected features, and the parameters $\omega_1$ and $\omega_2$ balance the classification error against $|\cdot|$. Eq. (7) is comprised of 2 parts: the first part estimates the classification error of the WKELM classifier trained on the selected features, whereas the second part reflects the number of selected features. The WKELM classifier is trained on the training set and predicts the labels of the testing set; the error is then evaluated by comparing the predicted labels with the actual labels.
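A minimal sketch of such a fitness function is shown below. It assumes a weighted sum of the misclassification rate and the number of selected features; the weights `w1` and `w2`, their default values, and the function name `fs_fitness` are hypothetical. In the full pipeline the predictions would come from a WKELM classifier trained on the selected features; here they are passed in directly.

```python
def fs_fitness(mask, y_true, y_pred, w1=0.99, w2=0.01):
    """Fitness of a feature mask: w1 * error rate + w2 * number of selected features.

    Lower is better. w1 and w2 are assumed balancing weights.
    """
    errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    error_rate = errors / len(y_true)
    return w1 * error_rate + w2 * sum(mask)

# Hypothetical labels: 2 of 4 test reviews are misclassified with 2 features selected.
print(fs_fitness([1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0], w1=0.9, w2=0.1))
```

With a small `w2`, two masks with equal error are ranked by size, so the search prefers the more compact feature subset.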

Updated stage
In this stage, the population is divided into 2 groups, namely females and males, and the position of each female spider and each male spider is updated by its respective operator. Afterward, mating is performed between female and male spiders depending on the mating values. The FF value of each spider is computed and compared with the FF of the offspring, and the offspring replaces the worst spider when it has a better FF value. Once the spider positions are updated, the FF values are evaluated again. The best solutions from the 2 populations are chosen to build the new population, and the best solution is updated. These steps are repeated until the termination condition is met, and the result of the FS process is the best solution, which gives the subset of chosen features $Feat_{Sel}$.

WKELM based Classification
Here, a wavelet-mix kernel function is introduced into the KELM method, and the weighting scheme of weighted ELM is incorporated into KELM, which yields the weighted WKELM framework. The sigmoid kernel function is defined as a global kernel function as well as a translation-invariant kernel function. In addition, wavelet kernel functions are considered local kernel functions. Thus, the integration of the 2 kernel functions balances learning and generalization capability and captures features in a rotation-invariant as well as translation-invariant manner.
The weighted ELM approach was presented for dealing with samples that are imbalanced in probability distribution, and it performs quite well when compared with classical frameworks. Hence, the weighted WKELM scheme applies the weighting mechanism to the cost function and achieves the same effect as weighted ELM [16]. Since the KELM method is derived from the ELM scheme, the weighted cost function is represented as shown below:
$$\min\; L = \frac{1}{2}\|\beta\|^2 + \frac{1}{2}\, C\, W \sum_{i=1}^{N} \|\xi_i\|^2 \quad \text{subject to } h(x_i)\,\beta = t_i^T - \xi_i^T,\ i = 1, \dots, N$$
where $\beta$ is the output weight vector, $h(x_i)$ the hidden-layer output for sample $x_i$, $t_i$ its target, and $\xi_i$ its training error. In the KELM approach, the output is given by:
$$f(x) = \left[ K(x, x_1), \dots, K(x, x_N) \right] \left( \frac{I}{C} + W\,\Omega \right)^{-1} W\, T$$
where $\Omega$ denotes the kernel matrix, $W$ refers to the weight matrix, $C$ defines the regularization parameter, and $T$ is the matrix of training targets. Next, the wavelet-mix kernel functions are introduced for KELM and the weighted WKELM framework is examined.
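To make the classifier concrete, the sketch below trains a weighted kernel ELM on a toy problem in pure Python. Several choices are assumptions rather than the paper's settings: the mix is taken as a convex combination `rho * wavelet + (1 - rho) * sigmoid`, the wavelet kernel is the common Morlet-type form cos(1.75u)·exp(-u²/2), the sample weights are 1 / class size, and the coefficients solve (I/C + WΩ)A = Wt, the standard weighted KELM system. All function names and parameter values are illustrative.

```python
import math

def wavelet_kernel(x, y, a=1.0):
    """Morlet-type wavelet kernel: product over dimensions of cos(1.75u) * exp(-u^2 / 2)."""
    k = 1.0
    for xh, yh in zip(x, y):
        u = (xh - yh) / a
        k *= math.cos(1.75 * u) * math.exp(-0.5 * u * u)
    return k

def sigmoid_kernel(x, y, gamma=0.1, c=0.0):
    """Sigmoid kernel tanh(gamma * <x, y> + c)."""
    return math.tanh(gamma * sum(xh * yh for xh, yh in zip(x, y)) + c)

def mix_kernel(x, y, rho=0.8):
    """Assumed wavelet-mix kernel: convex combination of the two kernels."""
    return rho * wavelet_kernel(x, y) + (1.0 - rho) * sigmoid_kernel(x, y)

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def train_wkelm(X, t, C=100.0, kernel=mix_kernel):
    """Return coefficients A solving (I/C + W*Omega) A = W t, with W = diag(1 / class size)."""
    n = len(X)
    counts = {label: t.count(label) for label in set(t)}
    w = [1.0 / counts[label] for label in t]
    lhs = [[w[i] * kernel(X[i], X[j]) + (1.0 / C if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    rhs = [w[i] * t[i] for i in range(n)]
    return solve(lhs, rhs)

def predict_wkelm(x, X, coef, kernel=mix_kernel):
    """Sign of the kernel expansion: +1 (positive review) or -1 (negative review)."""
    score = sum(kernel(x, xi) * a for xi, a in zip(X, coef))
    return 1 if score >= 0 else -1

X = [[0.0], [0.2], [2.0], [2.2]]   # toy one-dimensional feature vectors
t = [1, 1, -1, -1]                 # +1 = positive, -1 = negative
coef = train_wkelm(X, t)
print([predict_wkelm(x, X, coef) for x in X])
```

For real review data the feature vectors would be the TF-IDF values of the selected features, and a numerical library would normally replace the hand-written solver.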

Experimental Analysis
To validate the performance of the presented SSO-WKELM technique, the Amazon product review dataset along with synthetic data is used. The dataset includes 235,000 positive and 147,000 negative review files. The experiments were performed on 3 Intel Xeon E3-1220 processors running at 3.1 GHz with 64 GB RAM and the Hadoop framework. Table 1 and Fig. 4 present the accuracy analysis of the SSO-WKELM model on the applied dataset. From the experimental outcomes, it is apparent that the MI-CART method yielded the weakest classification outcome with an accuracy of 0.7945, whereas the MI-RF model attained a slightly higher accuracy of 0.8285. At the same time, the FSO-CART and FSO-RF models demonstrated moderate accuracy values of 0.8599 and 0.8914, respectively. But the SSO-WKELM model accomplished superior performance by offering a maximum accuracy of 0.9456.
A recall analysis of the SSO-WKELM methodology against the existing models on the classification of positive and negative sentiments is depicted in Fig. 6 and Table 3, and the F-measure analysis in Fig. 7 and Table 4. The simulation results showed that the MI-CART approach produced the lowest F-measure, 0.8229 and 0.7553 on the classification of positive and negative sentiments, respectively. Likewise, the MI-RF methodology exhibited somewhat improved results by reaching an F-measure of 0.8545 and 0.7913 on the classification of positive and negative sentiments, respectively. Along with that, the FSO-CART method offered a moderate performance by attaining an F-measure of 0.8833 and 0.8250 on the classification of positive and negative sentiments, respectively.
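For reference, accuracy, recall, and F-measure of one class can be computed from confusion-matrix counts as sketched below. The counts in the usage line are hypothetical and are not taken from the paper's tables.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F-measure for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical counts for the positive class of a sentiment classifier.
print(classification_metrics(tp=90, fp=10, fn=30, tn=70))
```

The metrics for the negative class follow by swapping the roles of the two classes, that is, calling the function with `tp` and `tn` exchanged and `fp` and `fn` exchanged.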

Conclusion
This paper has presented an efficient SSO-WKELM model for the SA of big data from online product reviews. The SSO-WKELM model performs SA on big data through different processes, namely preprocessing for unwanted noise removal, TF-IDF for feature vector generation, the SSO algorithm for FS, and WKELM for classification. The usage of the SSO algorithm for the FS process led to enhanced classifier results. In order to verify the performance of the SSO-WKELM model, a series of simulations was performed on a product review dataset from Amazon and a synthetic dataset. The experimental outcomes verified the strong performance of the SSO-WKELM model on the applied dataset. As a part of future scope, the classification performance can be improved by the use of other FS methodologies.