Automatic Genre Categorization of Emails into predefined categories using machine learning

In today’s dynamic world, there is a need for fast, efficient, and reliable means of communication. The email system was developed to meet these requirements, and it became popular with the invention of the WWW. Email is now used extensively for official, business, and personal communication. On average, an individual user receives 50-60 emails each day, and managing them is becoming a burden. There is therefore a need for an effective and reliable means of organizing emails for easy and fast retrieval. This paper proposes an efficient approach to classify emails into predefined genres. The proposed research observed that classifying emails greatly improves efficiency and saves the time and effort needed to manage them. The results obtained are very encouraging: over 90% of emails are categorized correctly. Email genres are predefined and corresponding keyword lists are generated. The tf-idf frequency of the keywords in an email decides its genre. SVM is used as a multiclass classifier. The need for negative training data is removed, because the proposed classifier works on the principle of one class against the rest.


Introduction
Data has long been classified to allow fast and easy retrieval of information. Earlier, both the amount of data and its applications were limited. In the non-digital era, data managers were forced to classify data manually. They succeeded to an extent in solving their local problems through cumbersome and laborious manual effort. With the adoption of digital technology, however, huge amounts of data started to be generated, and this growing corpus began to challenge data managers. Engineers started working to automate the job of data classification, applying various automatic and semi-automatic techniques [3]. The results obtained were not encouraging: human intervention was still required to a large extent to build confidence in the classification procedure. The next big leap came with the invention of the World Wide Web. Today, terabytes of data are generated daily in the form of web pages and documents, blogs, articles, news items, educational content, and emails. This enormous volume poses a further challenge: classifying data into various categories so that it can be used in more meaningful ways. Data can be classified by predefined categories such as sentiment [2], subject, genre, and functional category. A large number of applications can consume the classified data and benefit from the categorization. Email is one of the most prominent applications that generate and consume data. Over 70-80% of the population uses it for fast and effective personal and official communication. Billions of emails are generated daily, resulting in a large volume of data. There is a need to automatically classify [10] the contents of email to gain maximum benefit. Automatic classification of email data leads to efficient organization and better performance over time, as a large number of emails gather in the inbox and sent folders [8].
It becomes difficult to immediately organize, search, and retrieve relevant emails. The purpose of the proposed research is to automate the classification of emails into various categories for easy and fast access. It further helps in managing relevant and irrelevant emails. The author therefore set out to develop a scheme that classifies emails to obtain these benefits. Both binary and multiclass classification techniques, as shown in fig1 and fig2, are used in the literature.

Organization
Section 2 discusses the literature survey corresponding to the proposed work. Section 3 describes the motivation behind this work. Section 8 explains the various algorithms developed for this scheme. Section 10 presents the experimentation results and discussion. Finally, Section 11 concludes the article and outlines its future scope.

Gomes et al. (2017) [1] studied two approaches, Naive Bayes and the Hidden Markov Model (HMM). A combination of natural language processing techniques was tried with both approaches to compare their accuracy and find the best method. Saidani et al. (2020) [2] used semantic analysis to enhance spam detection performance. Their scheme was based on two levels of semantic analysis: in the first step, domain-based email classification was performed, and in the next step, semantic features for each domain were applied to detect spam emails. They concluded that it is a better method for spam detection. Mohammad (2020) [3] used data mining and machine learning techniques and proposed an enhanced model for lifelong spam classification using Adjustable Dataset Partitioning (ELCADP). This method showed enhanced performance in comparison to stream mining algorithms. The work further emphasized the role of offline spam emails in creating lifelong classification systems.

Chen et al. (2019) [4] also worked on spam detection, using a Long Short-Term Memory (LSTM) model. An active learning model was used to reduce the cost of labeling, and a deep learning approach was applied to attain better performance. The results indicated that this technique is better than classical CNN- and RNN-based models. Saini et al. (2018) [5] used a self-organizing map (SOM) in the exploratory phase of data mining. Input data is projected onto a lower-dimensional map. This technique relied on unlabeled data for classification. The cascaded SOM was reported to be capable of solving any unlabeled multi-class classification problem. Li et al. (2019) [6] proposed multi-view disagreement-based semi-supervised learning to reduce the threat of suspicious emails to the Internet of Things (IoT). This method provides rich information for email categorization. In the opinion of the researchers, multi-view data is more likely to achieve higher accuracy than single-view data. Bahgat et al. (2018) [7] used syntactic feature selection, a relatively new technique. The study concluded that this approach takes less time and delivers significant performance with higher accuracy. Gupta et al. (2017) [8] studied the issue of online websites and service providers using a single mail ID to address the issues and concerns of customers. They applied an artificial neural network (ANN) to correctly identify spam emails. This method demonstrated text-based classification of emails. Kumaresan et al. (2017) [9] used hybrid kernel-based support vector machine learning. Features were extracted from both text and images: term frequency (TF) was used for textual features, and the image-dependent wavelet moment was considered for classification of emails. The results claimed accuracy of up to 97.235%. Alkhereyf et al. (2017) [10] focused on extracting lexical features in addition to features from social network data. The Enron and Avocado email datasets were used for experimentation. SVM and Extra-Trees classifiers were applied to these features to compare the results, and it was concluded that SVM performs better in terms of accuracy.

MOTIVATION
Email users send and receive a large number of emails on a daily basis. Over the period

• A multiclass classifier is used, which reduces the data set, training time, and cost, and improves efficiency
• More genres can be easily incorporated
• No negative training data set is needed, because the classifier works as one class against the other classes
• To effectively manage the large number of emails
• To improve upon the existing methods of email classification
• To save time in data organization
• To gain a better user experience

To overcome the effort and problems of manual and semi-automatic classification methods, the author proposes an efficient scheme for automatic classification of emails based on the predefined genres, as shown in fig3. In the proposed scheme, an algorithm is developed to extract the features from the text of an email. The dataset used for this purpose is the Enron email dataset.

FEATURES SOURCES
Email data mostly takes the form of bags of words present in the header ("subject") and the body ("content") of an email, so the header and body are good sources of features. Text classification techniques are mostly applied to this kind of data.

Fig4. Feature extraction and Training, Testing data generation
Features from the header and body are extracted separately for the purpose of experimentation, as shown in fig4, to analyze the contribution of the header alone in deciding the genre of an email. Bags of words are extracted from the header and body text by removing stop words and punctuation. A list of words is created. These individual words are then processed to obtain stemmed words, and a final list of stemmed words is generated.
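
The extraction steps above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a simple suffix stripper standing in for whatever stemmer the author used (the source does not name one).

```python
import re

# Tiny illustrative stop-word list (assumption; a real system would use a fuller list).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

# Suffixes tried by the naive stemmer, in this order (stand-in for a real stemmer).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_stemmed_words(text):
    """Lowercase, drop punctuation and stop words, then stem each remaining word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def email_features(subject, body):
    """Return header and body word lists separately, as the text describes."""
    return {
        "header": extract_stemmed_words(subject),
        "body": extract_stemmed_words(body),
    }
```

Keeping the header and body lists apart makes it possible to train on the header features alone and compare against header plus body.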

GENRES
Predefined genres are proposed for experimentation: official, personal, promotional, confidential, and others. For each genre, a bag of words is generated in the form of a genre keyword list. These lists are also stemmed to improve the results. The dimensionality of the feature set is reduced by assigning zero weight to irrelevant features, which removes them. This approach is applied iteratively to the model to select the most promising feature set for high performance and accuracy.
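
A rough sketch of how keyword lists and tf-idf weights could combine into genre features is given below. The keyword lists are hypothetical examples, not the paper's lists, and the tf-idf variant (relative term frequency times the natural log of inverse document frequency) is one common choice assumed here; the source does not state which formula was used.

```python
import math
from collections import Counter

# Hypothetical stemmed keyword lists per genre (illustrative only, not the paper's lists).
GENRE_KEYWORDS = {
    "official": ["meet", "report", "deadline"],
    "personal": ["family", "dinner", "weekend"],
    "promotional": ["offer", "discount", "sale"],
}


def tf_idf_vector(doc_words, corpus, keywords):
    """Weight each keyword by its term frequency in the email times its IDF in the corpus."""
    counts = Counter(doc_words)
    total = len(doc_words) or 1
    vector = []
    for kw in keywords:
        tf = counts[kw] / total
        df = sum(1 for d in corpus if kw in d)
        idf = math.log(len(corpus) / df) if df else 0.0
        vector.append(tf * idf)
    return vector


def genre_scores(doc_words, corpus):
    """Sum the tf-idf weights of each genre's keywords found in one email."""
    return {
        genre: sum(tf_idf_vector(doc_words, corpus, kws))
        for genre, kws in GENRE_KEYWORDS.items()
    }


def drop_zero_weight_features(vectors):
    """Keep only feature columns that are non-zero in at least one email."""
    keep = [j for j in range(len(vectors[0])) if any(v[j] != 0.0 for v in vectors)]
    return [[v[j] for j in keep] for v in vectors], keep
```

Dropping all-zero columns is one simple reading of "assigning zero weight to irrelevant features"; the iterative selection the text mentions would repeat this after each retraining.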

CLASSIFIER MODEL
Support vector machines (SVM) are supervised learning algorithms that learn from the dataset fed as training data and use decision planes to generate decision boundaries. A decision plane separates sets of items belonging to different categories. SVM can effectively solve non-linear, high-dimensional problems, and its training objective has a global optimum. SVM was initially introduced by Vapnik et al. [11] [14] as a supervised machine learning tool. It is extensively used for categorization and classification of data. The function $f(x) = w^{T}x + b$ is defined, where $w$ is the weight vector and $b$ is the bias [11]. The value of $b$ displaces the decision plane $f(x) = 0$ away from the origin, as shown in fig5.

Fig5 Support vectors
New data is then fed to the model, which uses the previously learned features to decide the category of the data. A support vector machine (SVM) can also be used to train a multiclass classifier. In the proposed work, the model is trained as one class vs. the other classes. This removes the requirement for a negative training data set.
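
To make the one-class-vs-the-others idea concrete, the sketch below evaluates a linear decision function of the form w^T x + b for each genre and picks the genre with the largest value. The weights in `MODELS` are hand-picked, hypothetical numbers for illustration; they are not learned from data and do not come from the paper.

```python
def decision_value(w, x, b):
    """Linear SVM decision function: the dot product of w and x, plus the bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_one_vs_rest(models, x):
    """Score x with every per-genre model and return the genre with the largest value."""
    scores = {genre: decision_value(w, x, b) for genre, (w, b) in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical (weights, bias) per genre, chosen by hand for illustration only.
MODELS = {
    "official": ([1.0, -0.5], -0.2),
    "personal": ([-0.8, 0.9], 0.1),
    "promotional": ([0.1, 0.2], -0.5),
}
```

Because every genre's model treats all other genres as its negative examples, no separate negative training set has to be collected.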

ALGORITHM
The following four algorithms are developed for the proposed research work. Algorithm 1 describes the method for predefined genre selection and generation of genre-related keywords. Algorithm 2 is used for feature extraction and feature set generation from the email dataset. Algorithm 3 describes the method of feature set generation and of assigning weights to the features used for genre classification. Algorithm 4 creates the multiclass classifier, which operates on the principle of one genre against the other genres. A sigmoid function is applied, and the largest confidence value decides the class of the email. Five-fold cross-validation is used to cross-check the output.
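
The last two steps of Algorithm 4 might look like the sketch below, under stated assumptions: the sigmoid is applied directly to raw decision values (the source does not say whether it is calibrated), and the folds are contiguous and unshuffled for simplicity. All function names are illustrative.

```python
import math


def sigmoid(z):
    """Map a raw decision value to a confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def most_confident_genre(decision_values):
    """Apply the sigmoid to each genre's decision value and pick the largest confidence."""
    confidences = {genre: sigmoid(z) for genre, z in decision_values.items()}
    best = max(confidences, key=confidences.get)
    return best, confidences[best]


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size
```

In practice the sample order would usually be shuffled before splitting so that each fold contains a mix of genres.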

DATASET ENRON
This dataset was prepared by the CALO Project. Enron contains data organized into folders and is used as a resource for research. This email dataset is in the public domain [15]; other such datasets are generally not public because of privacy concerns. Only a few folders of this dataset are used in the proposed work, to collect approximately 5000 samples. The genre labels shown in table1 were used in the dataset. Dataset feature vectors are scaled [14] to the range [-1, 1]. The scaled training dataset is fed to the SVM for training.
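
Scaling to [-1, 1] is typically a per-feature linear (min-max) mapping. The sketch below assumes that form; the function names are illustrative, and mapping constant features to 0.0 is a choice made here, not something the source specifies.

```python
def fit_scaler(rows):
    """Record the per-feature minimum and maximum seen in the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale_rows(rows, mins, maxs, lower=-1.0, upper=1.0):
    """Linearly map each feature from [min, max] to [lower, upper]."""
    scaled = []
    for row in rows:
        out = []
        for value, lo, hi in zip(row, mins, maxs):
            if hi == lo:
                out.append(0.0)  # constant feature: nothing to scale (choice made here)
            else:
                out.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(out)
    return scaled
```

The minimum and maximum should be fitted on the training data only and then reused unchanged for the test data, so that both sets pass through the same mapping.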

EXPERIMENTATION RESULTS AND DISCUSSION
The LIBSVM support vector machine tool [12] is used to simulate the results, and the data is converted to the SVM data format. Training and test datasets are organized in the format required by the LIBSVM tool.
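
LIBSVM reads sparse text files with one sample per line, written as `<label> <index>:<value> ...` with feature indices starting at 1 and in ascending order; zero-valued features may be omitted. The helper below is a small sketch of producing such lines. The function names and the use of integer genre labels are assumptions for illustration.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' with 1-based indices, skipping zeros."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def write_libsvm_file(path, labels, rows):
    """Write one formatted line per sample to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, features in zip(labels, rows):
            handle.write(to_libsvm_line(label, features) + "\n")
```

For example, a sample with label 1 and features `[0.5, 0.0, -1.0]` becomes the line `1 1:0.5 3:-1`.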
Researchers applied the linear, polynomial, and radial kernel functions with different optimum parameters, as shown in figures [6,7,8].

Fig9: CV Acc: Enron data and Random data
The multiclass classifier finally achieved a satisfactory level of cross-validation (CV) accuracy with respect to the predefined genres. For the Enron data, as shown in Table3, the CV accuracy is 97% for official, 87% for personal, 93% for promotional, and 92% for confidential emails. For random samples, the CV accuracy is 91% for official, 82% for personal, 89% for promotional, 79% for confidential, and 92% for others. The data is further analyzed, and the corresponding test dataset is applied to the optimally trained model to compute the evaluation metrics precision (P), recall (R), accuracy (A), and F1-measure, in order to rate the actual performance of the model and to increase confidence in the proposed work. The results in fig10 show that this work achieves a high level of accuracy. Moreover, no negative data is required for classification, because the multiclass classifier works on the principle of one class against the other classes. This multiclass classifier can also be applied to spam [1] and fraud [7] detection.

Some researchers were able to produce relevant results, but the proposed work produces better results and is more generic. It can easily be extended to incorporate new categories, and it does not require negative training data.
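
The evaluation metrics named above follow the standard definitions over true/false positives and negatives. The sketch below computes them for one genre treated as the positive class against all others; the function names are illustrative, and the labels in the test are made up, not drawn from the paper's results.

```python
def one_vs_rest_counts(true_labels, predicted_labels, genre):
    """Count TP, FP, FN, TN treating `genre` as positive and all other genres as negative."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == genre and truth == genre:
            tp += 1
        elif guess == genre:
            fp += 1
        elif truth == genre:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluation_metrics(tp, fp, fn, tn):
    """Return precision (P), recall (R), accuracy (A) and F1 from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"P": precision, "R": recall, "A": accuracy, "F1": f1}
```

Reporting the metrics per genre, rather than only as a single overall figure, shows which genres the classifier confuses most often.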

CONCLUSION
The email system has contributed hugely to transforming the world into a global village by providing a fast and reliable means of communication. People all over the world depend on various email services, using them for personal, professional, social, and promotional communication. Users' mailboxes are flooded with dozens of emails each day, which makes it hard to trace the important messages. While browsing through the whole lot of emails, a user may miss or ignore important communications and deadlines. It is therefore necessary to develop a scheme that effectively categorizes these emails into predefined genres to enhance the user experience and improve productivity. To this end, the proposed scheme provides a framework for automatic classification of emails into set genres. A machine learning model is developed, based on a carefully selected variety of features that generate multi-attribute criteria. The experimental setup on sample and random datasets produced promising results in terms of efficiency and overall accuracy of up to 90%. The proposed scheme uses SVM as its machine learning tool. A useful and relevant feature set is the most important factor in the success of the system and the trained model.

FUTURE SCOPE
The scope of this work will be extended by considering more features and genres to improve the results. Further, the work can be extended to accommodate regional-language content, for which language-specific keywords and feature sets need to be explored. The idea is to eventually develop a multilingual framework that fits the requirements of modern societies and enhances the user experience.