E-Mail Spam Filtering Through Feature Selection Using Enriched Firefly Optimization Algorithm

Abstract: E-mail is the most common method of communication because of its ease of access, the rapid exchange of messages, and its low cost of distribution. Spam causes traffic issues and bottlenecks that consume memory, bandwidth, power, and computing time. Various filtering approaches exist that automatically detect and suppress these unsolicited messages. A methodology based on the Sine Cosine Algorithm (SCA) was previously introduced to address the increased space and time complexity of e-mail spam detection. In that method, a WordNet-based optimized semantic ontology applies semantic relations and similarity measures to reduce the large number of extracted textual features. This paper proposes the Enriched Firefly Optimization Algorithm (EFOA), which selects suitable features from a high-dimensional space using a fitness function. Once the best feature space is identified through EFOA, spam classification is performed using an artificial neural network (ANN). Initially, the e-mail spam dataset is preprocessed; the extracted textual features then undergo semantic-based reduction, and the feature weights are updated using the optimized semantic WordNet. The results show that the ANN classifier, after feature selection with EFOA, is able to classify e-mails as spam or non-spam, and that the proposed method achieves a notable improvement over the SCA method.


Introduction
With the growth in the number of Internet users, e-mail has become the most widely used communication mechanism. Over the past few years, the increased use of e-mail has led to the emergence and aggravation of the problems caused by spam [1]. E-mail has maintained its leadership in business communications and continues to be a prerequisite for other electronic communications and transactions. The use of e-mail has led to a noticeable improvement in group communications, the impact of which is seen in the growth of enterprises worldwide [2].
E-mail is also used for illegal and malicious purposes such as phishing and fraud. Spam may carry malicious links that can damage the recipient's system or probe it for information. A spammer may collect the name of the individual who owns a specific e-mail address and include that name in the greeting of the message [3]. It is therefore necessary to identify such fraudulent spam e-mails, here using ANN techniques.
The enriched firefly optimization algorithm (EFOA) [4][5][6] is a meta-heuristic algorithm based on the communicating behavior of tropical fireflies. Two issues are central to the algorithm: the variation of light intensity and the formulation of attractiveness. The attractiveness of a firefly depends on its luminosity, and because attractiveness decreases with the distance between two fireflies, a lower perceived intensity means less attractiveness. The firefly algorithm has good search capability, but it tends to fall into local optima; its simple structure, however, leaves considerable room for improvement. EFOA addresses these shortcomings by adding mechanisms that make the search more effective.
The proposed approach, the Enriched Firefly Optimization Algorithm (EFOA), includes several components for selecting the optimal feature subset, which an ANN classifier then uses to filter e-mails into two classes: spam and non-spam. The rest of the article is organized as follows: Section II presents a literature review of earlier spam detection techniques. Section III outlines the proposed EFOA spam detection approach. Section IV presents the performance analysis of the EFOA methodology. Finally, Section V concludes the discussion.

Related Works
Li et al. [7] studied three different user environments: a research institute, a university, and a business corporation. Five basic supervised machine learning classifiers were evaluated: Naive Bayes, J48, IBK, Radial Basis Function Network (RBF-Network), and Library for Support Vector Machines (Lib-SVM). The classification results indicate that the decision tree and the support vector machine produce better results than the other classifiers in the study [5].
Mallik et al. [8] proposed text parsing for spam filtering, in order to parse the text embedded in junk mail. The Naive Bayesian (NB) classification algorithm is used to build the model, and the R tool is used for the pre-processing step. The method identified the most and least commonly used topics in spam e-mails. The approach could be combined with various algorithms, including hybrid algorithms, to obtain the best results. Moreover, the Orange software could be used to find the outcome of each algorithm in a short period of time, after which the approach might be developed into a system for real environments and organizations.
Zhang et al. [9] proposed an automated detection approach specific to Chinese e-business websites that uses the URL and the functionality-specific content of the website. Four machine learning classifiers were used, namely Random Forest (RF), Sequential Minimal Optimization (SMO), logistic regression, and Naive Bayes (NB), and their results were evaluated using Chi-square statistics.
Laorden et al. [10] highlighted the importance of anomaly detection in unsolicited bulk e-mail (UBE) filtering to reduce the need for UBE classification. Their work analyzes an anomaly-based UBE screening approach that uses data minimization to reduce pre-processing while preserving information about the relevance of e-mail messages to the nature of the e-mail. More recently, many works have studied the suitability of various machine learning approaches, including K-Nearest Neighbors (KNN), SVM, NB, neural networks, and others, for spam and malware e-mail filtering, owing to the ability of such approaches to learn, adapt, and generalize.
Awad & Foqaha [11] proposed a hybrid algorithm that combines an RBF neural network with particle swarm optimization (HC-RBFPSO) for the classification of spam e-mails. They used the particle swarm optimization algorithm to tune the parameters of Radial Basis Function Neural Networks (RBFNN), based on PSO's scalable heuristic search. They split the spambase dataset into a 70% training set and a 30% testing set. The experiments were run with different numbers of coverages, from 10 to 50. The accuracy achieved on the test set was 91.4%, which led to the conclusion that the hybrid approach performed well compared to other algorithms tested on the same dataset.
Vyas et al. [12] compared alternative classification techniques using WEKA to filter spam e-mails. Naive Bayes showed good accuracy and the least time among the techniques considered. The paper provides a comparative review of all the techniques in terms of accuracy and the time taken.

Proposed Methodology
The EFOA-based spam detection method is described in detail in this section. It addresses the increased space and time complexity of e-mail spam filtering and improves the filtering itself. A critical step in e-mail spam filtering is good feature selection and representation. The optimized semantic WordNet ontology is introduced to reduce the extracted textual features and to update the weighted feature representation. It enhances the accuracy of spam filtering for high-dimensional e-mail data. Figure 1 shows the architecture of EFOA.

Preprocessing
In pre-processing, tokens are extracted from the body and subject of the e-mail, and non-relevant tokens such as numbers and symbols are eliminated. Following this, the stop words are removed. For further data cleansing, WordNet is used as the natural language processing tool. WordNet is an optimized semantic network of words and their synonyms, antonyms, hyponyms, hypernyms, meronyms, and many other relations between words, which arranges English words into collections of synonyms called optsynsets [13]. Once an e-mail has been pre-processed, it may be represented as a set of term-weight pairs, d = {(t1, w1), (t2, w2), ..., (tn, wn)}, where each term 't' is regarded as a feature that has a matching weight 'w'.
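The pre-processing steps described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name preprocess and the very small stop-word list are hypothetical, and a real system would use a fuller stop-word list and the WordNet-based cleansing that the text mentions.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "your", "you"}

def preprocess(subject: str, body: str) -> list[str]:
    """Tokenize subject and body, drop numbers and symbols, then drop stop words."""
    text = f"{subject} {body}".lower()
    # Keeping alphabetic runs only discards digits and punctuation.
    tokens = re.findall(r"[a-z]+", text)
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("WIN $1000 now!!!", "Click the link to claim your prize in 24 hours.")
print(tokens)  # ['win', 'now', 'click', 'link', 'claim', 'prize', 'hours']
```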

Feature Extraction and Calculate Feature Weighting
Each retrieved term 't' is weighted by a weight 'w' using the term frequency-inverse document frequency (TF-IDF) method. The term frequency indicates how many times the term 't' appears in the e-mail document 'd', as given in Eq. (3.2), where fd(t) is the frequency of the term 't' in the e-mail. The inverse document frequency measures the rarity of a term across the e-mail collection, as given in Eq. (3.3), where dft is the number of e-mails containing the term 't' and 'N' is the total number of e-mails. Finally, the TF-IDF weight is obtained by multiplying Eqs. (3.2) and (3.3).
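The weighting scheme above can be illustrated with a short Python sketch. It assumes the most common textbook forms, namely a raw count for the term frequency and log(N / dft) with the natural logarithm for the inverse document frequency; the exact normalisation used in Eqs. (3.2) and (3.3) may differ, and the function name tf_idf is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} map per e-mail, with weight = f_d(t) * log(N / df_t)."""
    n_docs = len(documents)
    # df[t]: number of e-mails that contain term t at least once.
    df = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)  # f_d(t): raw count of t in this e-mail (assumed TF form)
        weighted.append({t: f * math.log(n_docs / df[t]) for t, f in counts.items()})
    return weighted

emails = [["win", "prize", "win"], ["meeting", "agenda"], ["prize", "meeting"]]
weights = tf_idf(emails)
print(weights[0])  # "win" is frequent here and rare elsewhere, so it gets the largest weight
```

A term that occurs in every e-mail receives a weight of zero under this form, which is the intended effect of the rarity factor.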

Feature Reduction using Semantic Approach
In this process, the synonyms of each feature are extracted, and the extracted features are then replaced by their synonym-set concepts. In addition to extracting the synonyms of each term, the hypernym/hyponym relationships are considered through the optimized semantic WordNet [21,13]. Word-to-word similarity is measured through various semantic similarity measures.

Semantic-based Reduction
The optimized semantic WordNet is used as a resource that links the different forms of a word, such as English nouns, adjectives, verbs, and adverbs, to all of their synonyms. These synonym sets are referred to as optsynsets, which are related to each other by semantic relationships. In semantic-based reduction, the synonym set of each term in the e-mail is used to group the terms that have common synonyms. The hyponymic relationship refers to "is a kind of" or "is a", which connects more general optsynsets to more specific ones, while the hypernymic link represents the converse of the hyponymic relation. After applying the semantic relationships, various optimized semantic similarity measurements are applied to increase the reduction rate for the features.
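The grouping of terms that share synonyms can be sketched as follows. The SYNONYMS table is a hypothetical stand-in for a WordNet lookup, and summing the weights of merged terms is an assumption about how a merged feature is weighted; the sketch only illustrates the idea of collapsing several surface terms into one feature.

```python
# Hypothetical toy synonym sets standing in for WordNet lookups (an assumption).
SYNONYMS = {
    "buy": {"buy", "purchase"},
    "purchase": {"purchase", "buy"},
    "cheap": {"cheap", "inexpensive"},
    "inexpensive": {"inexpensive", "cheap"},
    "meeting": {"meeting"},
}

def merge_by_synonyms(weights: dict[str, float]) -> dict[str, float]:
    """Group terms whose synonym sets overlap and sum the weights of each group."""
    groups = []  # each entry: [representative term, union of synonyms, summed weight]
    for term in sorted(weights):
        synset = SYNONYMS.get(term, {term})
        for group in groups:
            if group[1] & synset:  # shares at least one synonym with this group
                group[1] |= synset
                group[2] += weights[term]
                break
        else:
            groups.append([term, set(synset), weights[term]])
    # Single pass only: groups that become connected later are not re-merged.
    return {rep: total for rep, _, total in groups}

reduced = merge_by_synonyms(
    {"buy": 1.0, "purchase": 0.5, "cheap": 2.0, "inexpensive": 1.0, "meeting": 0.3}
)
print(reduced)  # {'buy': 1.5, 'cheap': 3.0, 'meeting': 0.3}
```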

Path based Measurements
Path-based measures [20] rely on the length of the path between two concepts. Three similarity measures are tested for their performance: the path length measure, the WUP measure, and the LCH measure.
Path Length computes the semantic similarity of a concept pair by counting the number of nodes along the shortest path between the concepts in WordNet's 'is-a' hierarchies. The path-length score is inversely related to the number of nodes along the shortest path between the two terms t1 and t2. The WUP measure calculates similarity by relating the depths of both terms in the optimized semantic WordNet to the depth of their least common subsumer (LCS), where the LCS is the most specific common ancestor of the two optsynsets (t1, t2).
The Leacock and Chodorow (LCH) measure takes the shortest path length between two concepts and scales the resulting value by the maximum depth found in the "is-a" hierarchy.
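The three path-based measures can be illustrated on a toy hierarchy. The sketch below assumes the forms in which these measures are usually stated: 1 / (number of nodes on the shortest path) for path length, 2 * depth(LCS) / (depth(t1) + depth(t2)) for WUP, and -log(path / (2 * D)) for LCH, with D the maximum depth. The PARENT table is a hypothetical stand-in for the WordNet 'is-a' hierarchy, and depths are counted in nodes with the root at depth 1.

```python
import math

# Hypothetical toy "is-a" hierarchy (child -> parent); WordNet would supply this in practice.
PARENT = {
    "entity": None,
    "message": "entity",
    "money": "entity",
    "email": "message",
    "letter": "message",
    "cash": "money",
}

def ancestors(term):
    """Path from the term up to the root, starting with the term itself."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def depth(term):
    return len(ancestors(term))  # node counting: the root has depth 1

def lcs(t1, t2):
    """Least common subsumer: the deepest ancestor shared by both terms."""
    seen = set(ancestors(t1))
    return next(a for a in ancestors(t2) if a in seen)

def path_nodes(t1, t2):
    """Number of nodes on the shortest is-a path between t1 and t2, inclusive."""
    common = depth(lcs(t1, t2))
    return (depth(t1) - common) + (depth(t2) - common) + 1

def path_sim(t1, t2):
    return 1.0 / path_nodes(t1, t2)

def wup_sim(t1, t2):
    return 2.0 * depth(lcs(t1, t2)) / (depth(t1) + depth(t2))

MAX_DEPTH = max(depth(t) for t in PARENT)

def lch_sim(t1, t2):
    return -math.log(path_nodes(t1, t2) / (2.0 * MAX_DEPTH))

for pair in [("email", "letter"), ("email", "cash")]:
    print(pair, path_sim(*pair), wup_sim(*pair), lch_sim(*pair))
```

On this toy hierarchy, "email" and "letter" share the parent "message" and therefore score higher on all three measures than "email" and "cash", whose only common ancestor is the root.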
The Information Content (IC) provides an indication of how specific a concept is [18]. One IC measure is the Resnik measure, which calculates the information content of the least common subsumer (LCS) of the two terms. The IC of a term (t) is defined in terms of P(t), the probability of the term (t) in a given message with (N) distinct terms.
Relationship measurement: a different type of measure has also been applied, namely HSO (Hirst and St-Onge) [16]. The HSO measure estimates the relatedness of two terms from the path distance between their nodes, the number of changes of direction along the path linking the two terms, and whether the path is allowable. In the HSO function, dir is the number of directional changes on the path from term t1 to term t2, and C and K are constants whose values are determined experimentally.
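The information-content and HSO measures can be sketched in the same spirit. The code assumes IC(t) = -log P(t) with P(t) estimated from relative frequency, takes the Resnik similarity to be the IC of the LCS, and uses the usual HSO form C - path length - K * dir. The count table, the LCS table, and the default constants c=8 and k=1 are illustrative assumptions, not values taken from this paper.

```python
import math

# Hypothetical concept counts; each count includes the concept's descendants.
COUNTS = {"entity": 100, "message": 40, "email": 25, "letter": 15, "money": 60}
TOTAL = COUNTS["entity"]  # the root covers every observation

# Hypothetical precomputed least common subsumers for a few term pairs.
LCS = {
    frozenset({"email", "letter"}): "message",
    frozenset({"email", "money"}): "entity",
}

def information_content(term):
    """IC(t) = -log P(t), with P(t) estimated as count(t) / total."""
    return -math.log(COUNTS[term] / TOTAL)

def resnik_sim(t1, t2):
    """Resnik similarity: information content of the pair's least common subsumer."""
    return information_content(LCS[frozenset({t1, t2})])

def hso_relatedness(path_length, direction_changes, c=8.0, k=1.0):
    """HSO-style score: shorter paths with fewer turns are more related."""
    return c - path_length - k * direction_changes

print(resnik_sim("email", "letter"))  # IC of "message"
print(resnik_sim("email", "money"))   # IC of the root, i.e. no shared information
print(hso_relatedness(path_length=3, direction_changes=1))
```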

Update of Feature Weights
After the reduced feature set is obtained, a new weight is assigned to the terms depending on the optimized semantic similarity metric applied. Once the semantic measure is computed, a similarity value between two terms is available. If the distance between them is below a certain threshold, the weight of the term is updated with a new value computed from wi and wj, the TF-IDF weights of the two terms i and j, and the similarity value between the two terms. The resulting sequence is transformed into a binary sequence by means of the relation in Eq. (3.14).

Enriched Firefly Optimization Algorithm for Feature Selection
The location of each firefly is a binary vector in which 1 indicates that a feature is selected and 0 that it is not. Each firefly in the swarm is initialized with its own location according to the numbers generated for it [4]. In the proposed algorithm, the fitness function (FF) is defined so as to minimize the classification error rate on the validation dataset, as given in Eq. (3.15). The distance between two fireflies is computed using the Hamming distance, in which each bit of one firefly is compared with the corresponding bit of the other. In this way, the distance is the number of positions at which the binary strings of the two fireflies differ [6]. This enables EFOA to work more effectively with binary features than with continuous values.
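The binary distance described above, together with a distance-dependent attractiveness, can be sketched as follows. The attractiveness form beta0 * exp(-gamma * r^2) is the one used in the standard firefly algorithm and is assumed here; the values of beta0 and gamma are hypothetical.

```python
import math

def hamming_distance(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("fireflies must encode the same number of features")
    return sum(x != y for x, y in zip(a, b))

def attractiveness(r, beta0=1.0, gamma=0.1):
    """Standard firefly attractiveness, decreasing with distance r (assumed form)."""
    return beta0 * math.exp(-gamma * r * r)

firefly_a = [1, 0, 1, 1, 0]  # features 0, 2 and 3 selected
firefly_b = [1, 1, 0, 1, 0]  # features 0, 1 and 3 selected
r = hamming_distance(firefly_a, firefly_b)
print(r)                  # 2
print(attractiveness(r))  # exp(-0.4), lower than the value 1.0 at distance 0
```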
In the swarm, every firefly is attracted to the brighter fireflies. In the algorithm, the best firefly position is updated by means of Eq. (3.18).
Randomness is reduced at a constant rate δ, where δ ∈ [0.95, 0.97], so that at the final stage of optimization the value of α is minimised, as in Eq. (3.19).
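The effect of the rate δ on the randomness parameter can be illustrated with a short sketch. It assumes the multiplicative update α ← α · δ that is common in firefly algorithms; the starting value 0.5 and the function name alpha_schedule are hypothetical.

```python
def alpha_schedule(alpha0, delta, iterations):
    """Return the value of alpha at every iteration under alpha <- alpha * delta."""
    alphas = [alpha0]
    for _ in range(iterations):
        alphas.append(alphas[-1] * delta)
    return alphas

schedule = alpha_schedule(alpha0=0.5, delta=0.96, iterations=50)
print(schedule[0], schedule[-1])  # starts at 0.5 and shrinks steadily
```

With δ in [0.95, 0.97], the random component of each move shrinks gradually, so the swarm explores widely in early iterations and settles in later ones.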

EFOA Algorithm
Step 1: The spam dataset is preprocessed and normalized.
Step 2: Extract the e-mail features and get the synonym set of each term.
Step 3: Merge the terms that have common synonyms and increase their weight.
Step 4: Reduce the features using semantic path-based reduction.
Step 5: Calculate the similarity distance between terms for each measure.
Step 6: Calculate the new weights and update the weighted features.
Step 7: Select the best features using fireflies.
Step 8: Initialize the swarm of n fireflies.
Step 9: Calculate the fitness and the light intensity of each firefly using Eq. (3.16).
Step 10: Update the position of the best firefly using Eq. (3.18).
Step 11: Sort the swarm and locate the best firefly.
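The firefly-based selection steps listed above can be sketched as a generic binary firefly loop. This is not the authors' EFOA: the enrichment mechanisms and the update rules of Eqs. (3.16) and (3.18) are not reproduced in the text, so the move rule below (copy bits from a brighter firefly with probability beta, then flip bits with probability alpha) is an assumption. The function toy_error and the mask USEFUL are hypothetical stand-ins for the ANN validation error used as the fitness function.

```python
import math
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def move_towards(firefly, brighter, alpha, rng, beta0=1.0, gamma=0.1):
    """Copy bits from the brighter firefly with probability beta, then flip bits with probability alpha."""
    r = hamming(firefly, brighter)
    beta = beta0 * math.exp(-gamma * r * r)
    moved = [b if rng.random() < beta else f for f, b in zip(firefly, brighter)]
    return [1 - bit if rng.random() < alpha else bit for bit in moved]

def binary_firefly_select(error_fn, n_features, n_fireflies=10, iterations=30,
                          alpha=0.2, delta=0.96, seed=0):
    """Search for a 0/1 feature mask with a low value of error_fn."""
    rng = random.Random(seed)
    swarm = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_fireflies)]
    errors = [error_fn(f) for f in swarm]
    best_error, best = min(zip(errors, swarm))
    for _ in range(iterations):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if errors[j] < errors[i]:  # firefly j is brighter (lower error)
                    candidate = move_towards(swarm[i], swarm[j], alpha, rng)
                    swarm[i], errors[i] = candidate, error_fn(candidate)
        # Rank the swarm and remember the best mask seen so far.
        it_error, it_best = min(zip(errors, swarm))
        if it_error < best_error:
            best_error, best = it_error, list(it_best)
        alpha *= delta  # reduce randomness over time
    return best, best_error

# Hypothetical stand-in for the classifier's validation error.
USEFUL = [1, 0, 1, 1, 0, 0, 1, 0]

def toy_error(mask):
    return hamming(mask, USEFUL) / len(USEFUL)

best_mask, best_err = binary_firefly_select(toy_error, n_features=len(USEFUL))
print(best_mask, best_err)
```

In an actual pipeline, error_fn would train or evaluate the ANN on the columns selected by the mask and return the validation error rate, which is far more expensive than the toy function used here.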

A. Datasets
The Spam dataset is intended to categorize e-mail as spam or non-spam. There are 4601 e-mails and 58 attributes in this dataset, so each instance is made up of 58 attributes. Most attributes denote the frequency of a particular word or character in the e-mail that corresponds to the instance. The first 48 attributes give the frequency of the word named by the attribute within the e-mail. Attributes 49-54 give the frequency of the characters ';', '(', '[', '!', '$', and '#'. Attributes 55 to 57 give the average, longest, and total run length of uppercase letters. Attribute 58 specifies the type of mail, which is either "nonspam" or "spam".
Enron-Spam is a collection of e-mails consisting of six different datasets, each containing ham messages from a single Enron corpus user. Of the six datasets, most e-mails in Enron 1-3 are legitimate, whereas most e-mails in Enron 4-6 are spam. This dataset consists of 30,041 electronic messages [17].

B. Evaluation Metrics
General performance indices are needed to assess the proposed algorithm and to compare its results with those of other methods.

Accuracy
Accuracy is computed as the percentage of the dataset correctly categorized by the algorithm, that is, the percentage of all e-mails that are properly recognized. It is defined by the following formula:

Recall
Spam recall is defined as the probability of correctly classifying spam e-mails as spam, and legitimate recall is defined as the probability of correctly classifying legitimate e-mails as legitimate. The recall formulas are listed below:

False Positive Rate
The false positive rate is the proportion of legitimate messages that are misjudged as spam. The FPR formula is given below:

False Negative Rate
The false negative rate is the proportion of spam messages that are misjudged as legitimate. The FNR formula is given below:
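The four metrics defined above can be computed from the confusion-matrix counts, as in the following sketch. It uses the standard definitions with spam as the positive class (tp: spam classified as spam, tn: legitimate classified as legitimate, fp: legitimate classified as spam, fn: spam classified as legitimate); the function name spam_metrics and the example counts are hypothetical.

```python
def spam_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics with spam as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "spam_recall": tp / (tp + fn),
        "legitimate_recall": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Hypothetical counts: 100 spam and 100 legitimate e-mails.
metrics = spam_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these definitions the false positive rate is the complement of the legitimate recall, and the false negative rate is the complement of the spam recall.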

Time
It measures the amount of time taken to filter the spam e-mails from the dataset.

C. Results
The proposed method produces higher accuracy than the previous method. The efficiency of EFOA is analysed and compared against existing methodologies, namely MLP [15] and SCA [14]. Table 1 shows the filtering accuracy for spam and non-spam messages. The results show that the EFOA method filters spam and non-spam messages more accurately than the other existing methods on the spam dataset, and the same holds on the Enron-Spam dataset. Table 3 shows the performance assessment for the spam dataset. The EFOA method achieves better precision, recall, false positive rate, false negative rate, and time on the spam dataset than on the Enron-Spam dataset. Table 4 shows the performance evaluation for the Enron-Spam dataset. These results compare the spam and Enron-Spam datasets in terms of precision, recall, false positive rate, false negative rate, and time for the different methods. From this analysis, it is observed that the proposed EFOA takes less time than SCA for e-mail spam filtering on both the spam and Enron-Spam datasets.

Conclusion
In this article, EFOA is proposed to deal effectively with the space and time complexity problem in e-mail spam filtering systems. The optimized semantic WordNet is employed to clean noisy data and to reduce the large number of text features extracted from the e-mail. Within the optimized semantic WordNet, semantic-based reduction and feature weight updating are used for better feature reduction and representation. The selected best features are then fed into the ANN classifier, which classifies e-mail messages as spam or non-spam. The experiments are carried out on the spam and Enron-Spam datasets, and the experimental results demonstrate that the proposed EFOA performs better than the existing SCA method.