Optimizing Text Categorization for Indonesian Text Using Clustering Label Technique

Text categorization plays an important role in organizing the rapidly growing, yet unstructured, body of Indonesian text in digital format. It has become even more important as access to digital text has become more necessary and widespread. Many clustering algorithms are used for text categorization. Unfortunately, these algorithms cannot easily cluster Indonesian texts because stemming and stopword processing for the Indonesian language are imperfect. This paper presents an intelligent system that categorizes Indonesian text documents into meaningful cluster labels. The Label Induction Grouping Algorithm (LINGO) and Bisecting K-means are applied through five phases, namely pre-processing, frequent phrase extraction, cluster label induction, content discovery, and final cluster formation. The experimental results show that the system could categorize up to 93% of the Indonesian text documents. Furthermore, the clustering quality evaluation indicates that text categorization using LINGO achieves high Precision and Recall, with values of 0.85 and 1, respectively, compared to Bisecting K-means, which achieves 0.78 and 0.99. Therefore, the results show that LINGO is suitable for categorizing Indonesian text. The main contribution of this study is to optimize the clustering results by applying and maximizing text processing using an Indonesian stemmer and stopword list.


Introduction
Text categorization using clustering is the grouping of text documents in a way that can handle high volumes of data. Data stored on a computer fall into three types: structured, semi-structured, and unstructured. Currently, a large amount of data is stored in unstructured form, such as full-text documents provided on websites, in email, and elsewhere [1]. Therefore, text categorization has a role in placing text documents into the appropriate groups, which helps in the process of finding information in large data sources [2,3].
Several techniques can be used for text categorization, and one of them is clustering. Clustering is the grouping of a data set into smaller meaningful groups, also called clusters. One example of an application that uses clustering is the Google News app, which takes news from several sites and groups it into specific topics such as business, technology, entertainment, sports, science, and others. Furthermore, text clustering plays a significant role in navigation and browsing, and can also help manage a large amount of stored electronic data [4]. Therefore, text categorization using clustering is automatic text grouping that can handle a significant amount of data by maximizing the similarity between documents in the same group and minimizing the similarity between groups.
The problem with huge volumes of unstructured data is the difficulty of obtaining information about the characteristics of the desired document from the data source. This is because the data are too widely spread and there are no proper tools to solve the problem. Search engines usually return numerous results when common words are used as keywords, so users have difficulty finding the desired information [5]. Furthermore, there are several challenges in text clustering, such as determining the similarity between texts and determining how well a text fits into a cluster. Moreover, a proper text consists of a set of words from a particular language, and every language has its own morphemes, words, and grammar. Therefore, text categorization using clustering techniques is one way to organize unstructured text for easy management and access.
Text categorization using the LINGO clustering algorithm has been applied to English [15], Polish [9], and Marathi [5]. Text categorization in English using LINGO reported evaluation values for Precision (89.  and Recall (90-91); the algorithm comparison in that work indicates that LINGO performs better than K-Means [15].
Likewise, text categorization for Polish, as evaluated by several users, shows that 70-80% of the clusters are useful and that 80-95% of the snippets inside those clusters match their topic [9]. Similarly, for Marathi, the evaluation results are good, with Precision and Recall values of 86.58 and 96.33, respectively [5].
Research on Indonesian text categorization has also applied clustering techniques, namely the following algorithms: Fuzzy C-Means [16], K-Means [10], [17], and Single Pass Clustering [18]. However, only the Single Pass Clustering study was evaluated using Precision and Recall, with values of 0.79 and 0.88, respectively. On this basis, Indonesian text categorization using the Single Pass Clustering algorithm shows lower values than the use of LINGO in the other languages mentioned above.
Another method, the Bisecting K-means clustering technique, processes the data set in four stages [11]: (1) select a cluster to split, (2) find 2 sub-clusters using the basic K-means algorithm, (3) repeat step 2, the bisecting step, a fixed number of times and take the split that produces the clustering with the highest overall similarity, (4) repeat steps 1, 2, and 3 until the desired number of clusters is reached. Finally, Precision, Recall, and F-measure are used to determine the experimental performance. Therefore, text categorization of Indonesian text documents using the LINGO and Bisecting K-Means algorithms needs to be examined.
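The four bisecting steps listed above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the study: it assumes Euclidean distance, uses the sum of squared errors (SSE) as an inverse proxy for "overall similarity", and selects the cluster with the largest SSE in step 1. The function names and parameters are hypothetical.

```python
import random

def mean(points):
    """Centroid of a non-empty list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(cluster):
    """Sum of squared distances to the centroid (lower = more cohesive)."""
    center = mean(cluster)
    return sum(sq_dist(p, center) for p in cluster)

def two_means(points, rng, iters=10):
    """Step 2: basic K-means with k = 2; returns the last valid split or None."""
    centers = rng.sample(points, 2)
    halves = None
    for _ in range(iters):
        new = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            new[nearer].append(p)
        if not new[0] or not new[1]:
            break  # degenerate split: keep the previous valid one
        halves = new
        centers = [mean(new[0]), mean(new[1])]
    return halves

def bisecting_kmeans(points, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:  # Step 4: repeat until k clusters exist
        # Step 1 (assumption): split the splittable cluster with the largest SSE.
        target = max((c for c in clusters if len(c) > 1), key=sse, default=None)
        if target is None:
            break
        # Step 3: try the bisecting step several times and keep the best split.
        best = None
        for _ in range(trials):
            split = two_means(target, rng)
            if split and (best is None or sum(map(sse, split)) < sum(map(sse, best))):
                best = split
        if best is None:
            break
        clusters.remove(target)
        clusters.extend(best)
    return clusters
```

For document clustering, the vectors would typically be TF-IDF document vectors and cosine similarity would replace Euclidean distance; that substitution is left out here to keep the sketch short.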
This paper presents an intelligent system that categorizes Indonesian text using two clustering techniques. In the proposed technique, the clustering algorithms process the data set in several stages: (1) pre-processing, (2) frequent phrase extraction, (3) cluster label induction, (4) content discovery, and (5) final cluster formation. Pre-processing of the dataset performs tokenization, stemming, and stopword marking. In this phase, we optimize stemming and stopword handling to improve the clustering result [28][29]. The next phase extracts the phrases that become candidate labels. In the cluster label induction phase, the term-document matrix (TDM) is constructed based on the term frequency-inverse document frequency (TF-IDF) weighting scheme, and the Singular Value Decomposition (SVD) technique is applied to identify the label of each cluster. The cluster content discovery phase uses the Vector Space Model to assign the input documents. Finally, the text clusters are sorted by score. Precision, Recall, and F-measure are used to determine the experimental performance.
The experimental results of this study demonstrate that the system could categorize 93% of the Indonesian text in the data set using LINGO and 80% using Bisecting K-means. They also show that LINGO, with Precision, Recall, and F-measure values of 0.85, 1, and 0.92, performs better than Bisecting K-means, which reaches only 0.78, 0.99, and 0.87, respectively. These results are close to the Precision and Recall reported for LINGO on English. Therefore, LINGO is suitable for categorizing Indonesian text. Furthermore, this study applies and maximizes the use of an Indonesian stemmer and stopword list so that the system can optimize the cluster label results, especially the Precision and Recall performance metrics.
This study proposes optimizing text categorization for Indonesian text. Text processing uses an Indonesian stemmer to produce the correct root words, and the stopword marking process uses an Indonesian stopword list. Both the LINGO and Bisecting K-means algorithms are used.

Methodology
For this experiment, we used a dataset of Indonesian text documents containing the Indonesian Translation of the Quran (ITQ), Surah Al-Kahf, verses 1 through 30, collected from http://tanzil.net/#trans/id.indonesian, last updated on June 4, 2010. The dataset contains 398 words. The stopword list contains 759 common words that often appear in Bahasa Indonesia [22]. Each dataset was translated and its attributes were selected by the Carrot controller, and then categorized using the LINGO and Bisecting K-means algorithms.
The clustering method processes the dataset in five phases, namely Pre-Processing, Frequent Phrase Extraction, Cluster Label Induction, Cluster Label Discovery, and Final Cluster Formation [9]. Figure 1 presents the text categorization process using the clustering method.

Figure 1. Text Categorization Process
The details of the text categorization process using the clustering method are described in the following sections.

Pre-processing
In the pre-processing phase, the selected data are cleansed through a process that starts with tokenization of the dataset to produce tokens. The tokens are normalized by converting them to lowercase. During tokenization, a tokenization script splits the input text on space characters and lowercases it. For example, the verse "Mereka kekal di dalamnya untuk selamalamanya" (Where they will abide forever) is tokenized to: mereka | kekal | di | dalamnya | untuk | selamalamanya. The next process applies stemming to reduce the words in the pre-processed text to their root forms [23]. After tokenization, the tokens above are stemmed to mereka | kekal | di | dalam | untuk | lama. The stemming script stems each word using an Indonesian stemmer that is set up based on the prefixes, suffixes, and prepositions of the Indonesian language.
The last step is stopword marking based on the indexed stopword list. In this process, the script marks each stopword with -1 (minus one), the verse title with 0 (zero), and each term or word with 1 (one). So, for the words above, the script assigns the marks mereka |1| kekal |1| di |-1| dalam |1| untuk |1| lama |1|. Pseudo-code Algorithm for Preprocessing:
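As a rough illustration of the three pre-processing steps described above (tokenization, stemming, stopword marking), the following Python sketch reproduces the worked example. It is not the script used in the study: the one-word stopword set and the lookup-table stemmer are hypothetical stand-ins for the 759-word stopword list and the rule-based Indonesian stemmer, and giving the stopword mark precedence over the title mark is an assumption.

```python
import re

# Hypothetical stand-ins (assumptions): a tiny stopword set and a lookup-table
# "stemmer" that only covers the tokens of the worked example.
STOPWORDS = {"di"}
STEM_LOOKUP = {"dalamnya": "dalam", "selamalamanya": "lama"}

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Return the root form if the lookup table knows it, else the token itself."""
    return STEM_LOOKUP.get(token, token)

def mark(token, is_title=False):
    """-1 for a stopword, 0 for a verse-title token, 1 for an ordinary term."""
    if token in STOPWORDS:
        return -1
    return 0 if is_title else 1

def preprocess(text, is_title=False):
    """Tokenize, stem, and mark each token; returns (stem, mark) pairs."""
    stems = [stem(t) for t in tokenize(text)]
    return [(s, mark(s, is_title)) for s in stems]

if __name__ == "__main__":
    for stemmed, flag in preprocess("Mereka kekal di dalamnya untuk selamalamanya"):
        print(stemmed, flag)
```

In practice the lookup table would be replaced by a real Indonesian stemming algorithm; the table is used here only so that the example stays self-contained.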

Frequent Phrase Extraction
In this phase, phrases are extracted as candidate labels if they meet several requirements [9]: (1) the phrase or single term appears in the input documents at least as many times as the term frequency threshold, (2) it does not cross sentence boundaries, and (3) it is a complete phrase that neither begins nor ends with a stopword.
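The requirements above can be illustrated with a simple n-gram counter. This is a simplified sketch under stated assumptions: the function name, the `max_len` limit, and the toy data are hypothetical, and the "complete phrase" check (which LINGO performs with suffix arrays [9]) is not implemented.

```python
from collections import Counter

def frequent_phrases(sentences, stopwords, min_tf=2, max_len=3):
    """Count n-grams per sentence and keep those that pass the filters.

    sentences: list of token lists, one list per sentence, so that no
               n-gram can cross a sentence boundary (requirement 2).
    min_tf:    term frequency threshold (requirement 1).
    """
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # Requirement 3 (partly): no stopword at either end.
                if gram[0] in stopwords or gram[-1] in stopwords:
                    continue
                counts[gram] += 1
    return {" ".join(g): tf for g, tf in counts.items() if tf >= min_tf}

if __name__ == "__main__":
    toy = [["tahun", "di", "gua"], ["tiga", "ratus", "tahun"], ["tiga", "ratus", "orang"]]
    print(frequent_phrases(toy, stopwords={"di"}))
```

Because completeness is not checked, this sketch also returns sub-phrases (for example both "tiga" and "tiga ratus") that a full implementation would collapse into the longest complete phrase.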

Cluster Label Induction
Cluster label induction performs the following four steps: term-document matrix building, abstract concept discovery, phrase matching, and label pruning. Firstly, the term-document matrix is constructed based on term frequency-inverse document frequency (TF-IDF) [24]-[26] from the output of the first phase. Secondly, the term-document matrix is decomposed using Singular Value Decomposition to find its orthogonal basis, which supposedly represents the abstract concepts appearing in the input documents [9]. Thirdly, the phrase matching process uses the standard cosine distance to calculate how well a phrase or a single term represents an abstract concept, resulting in a value that is also used as the score of a label.
The cosine between the document vector a_j and the query vector q is calculated by formula (1):
\[ \cos\theta_j = \frac{a_j^{T} q}{\|a_j\|\,\|q\|} = \frac{\sum_{i=1}^{t} a_{ij}\, q_i}{\sqrt{\sum_{i=1}^{t} a_{ij}^{2}}\;\sqrt{\sum_{i=1}^{t} q_i^{2}}} \qquad (1) \]
where a_ij is the degree of relationship between term i and document j, a_j is the j-th document vector, t is the number of terms, ‖a‖ denotes the length of vector a, and the superscript T denotes the transpose taken over the sequence of term elements (t1, t2, t3, …, tn).
Fourthly, during the label pruning process, the similarities between all pairs of candidate labels are calculated using the classic Vector Space Model, and the label with the highest score is selected from each group of similar labels.
Pseudo-code Algorithm for Cluster Label Induction:
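The first three steps of this phase (TF-IDF matrix, SVD, and phrase matching by cosine) can be sketched with NumPy as shown below. This is an illustration under assumptions rather than the study's implementation: the function names, the `log(N/df)` form of IDF, the use of raw term counts for phrase vectors, and taking the absolute value of the cosine (to sidestep the sign ambiguity of singular vectors) are all choices made for this sketch. Label pruning is omitted.

```python
import numpy as np

def tfidf_matrix(docs, terms):
    """Term-document matrix (rows = terms, columns = documents),
    TF-IDF weighted with unit-length columns."""
    tf = np.array([[doc.count(t) for doc in docs] for t in terms], dtype=float)
    df = np.count_nonzero(tf, axis=1)
    idf = np.log(len(docs) / np.maximum(df, 1))  # assumed IDF variant
    a = tf * idf[:, None]
    norms = np.linalg.norm(a, axis=0)
    return a / np.where(norms == 0, 1.0, norms)

def induce_labels(docs, terms, phrases, k):
    """For each of the k leading abstract concepts, pick the candidate
    phrase with the highest cosine and report that cosine as its score."""
    a = tfidf_matrix(docs, terms)
    u, _, _ = np.linalg.svd(a, full_matrices=False)
    concepts = u[:, :k]  # orthonormal basis vectors in term space
    p = np.array([[ph.count(t) for t in terms] for ph in phrases], dtype=float).T
    norms = np.linalg.norm(p, axis=0)
    p = p / np.where(norms == 0, 1.0, norms)
    # Both sides are unit length, so the dot product is the cosine.
    m = np.abs(concepts.T @ p)
    best = m.argmax(axis=1)
    return [(" ".join(phrases[j]), float(m[i, j])) for i, j in enumerate(best)]

if __name__ == "__main__":
    docs = [["tahun", "gua", "tahun"], ["tahun", "gua"], ["tahun", "gua", "gua"],
            ["niscaya", "beruntung"], ["niscaya", "beruntung", "niscaya"]]
    terms = ["tahun", "gua", "niscaya", "beruntung"]
    phrases = [["tahun"], ["niscaya"]]
    print(induce_labels(docs, terms, phrases, k=2))
```

The toy documents are invented for the example; they form two disjoint topics, so the two leading singular vectors each align with one topic and each selects a different label.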

Cluster Label Discovery
The cluster label discovery phase uses the classic Vector Space Model to assign the input text to the cluster labels induced in the previous stage; it matches the input snippets against a series of queries, each of which is a single cluster label [9]. Snippet Assignment Threshold values fall within the 0.0-1.0 range, and it has been empirically verified that thresholds within the 0.15-0.30 range produce the best results. For a given query label, if the similarity between a snippet and the label exceeds the Snippet Assignment Threshold, the snippet is allocated to the corresponding cluster. Snippets that do not match any cluster label are assigned to "Others" [20].
Pseudo-code algorithm for Cluster Content Discovery:
STEP 4: Cluster Content Discovery
  for all L ∈ Cluster Label Candidates do
    create cluster C described with L; (L: Label, C: Cluster)
    add to C all documents whose similarity to C exceeds the Snippet Assignment Threshold;
  end for
  put all unassigned documents in the "Others" group
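A runnable Python counterpart of the pseudo-code above might look as follows. It is a sketch under assumptions: documents and labels are compared with cosine similarity over raw term counts instead of TF-IDF weights, the default threshold of 0.2 is simply a value inside the 0.15-0.30 range mentioned in the text, and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_documents(docs, labels, threshold=0.2):
    """Treat each label as a query; a document joins every cluster whose
    label it matches above the threshold, otherwise it goes to "Others"."""
    clusters = {label: [] for label in labels}
    clusters["Others"] = []
    for idx, tokens in enumerate(docs):
        doc_vec = Counter(tokens)
        matched = False
        for label in labels:
            if cosine(doc_vec, Counter(label.split())) > threshold:
                clusters[label].append(idx)
                matched = True
        if not matched:
            clusters["Others"].append(idx)
    return clusters

if __name__ == "__main__":
    docs = [["tahun", "gua", "tidur"], ["niscaya", "beruntung"], ["gua", "anjing"]]
    print(assign_documents(docs, ["tahun", "niscaya"]))
```

Because every label is tested independently, a document can land in more than one cluster in this sketch.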

Final Cluster Formation
In the final cluster formation phase, clusters are sorted based on their score. The score is calculated using formula (2):
\[ \mathit{ClusterScore} = \mathit{LabelScore} \times \|C\| \qquad (2) \]
where C is the cluster, LabelScore is the label score, and ‖C‖ is the total number of documents assigned to cluster C [9].
Pseudo-code Algorithm for Final Cluster Formation:
STEP 5: Final Cluster Formation
  for all clusters C do
    ClusterScore ← LabelScore × ||C||;
  end for
C: Cluster; ||C||: the number of documents assigned to cluster C; ClusterScore: Cluster Score; LabelScore: Label Score
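The scoring and sorting step is small enough to show in a few lines of Python. In this sketch the function name and the dictionary layout are hypothetical, and the label scores in the example are made-up values chosen only to demonstrate the ordering.

```python
def rank_clusters(label_scores, clusters):
    """Score each labelled cluster as label score x number of documents
    and return (label, score) pairs sorted from highest to lowest."""
    scored = [
        (label, label_scores[label] * len(members))
        for label, members in clusters.items()
        if label in label_scores  # the "Others" group has no label score
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    label_scores = {"tahun": 0.5, "niscaya": 0.25}  # illustrative values
    clusters = {"tahun": [0, 1], "niscaya": [2, 3, 4], "Others": [5]}
    print(rank_clusters(label_scores, clusters))
```

A cluster with a weaker label can still outrank another one if it holds enough documents, since the score is a product of both factors.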

Evaluation
This phase measures the experimental testing and evaluates its performance using Precision, Recall, and F-measure [27]. The formulas are Precision (3), Recall (4), and F-measure (5):
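As a minimal sketch, assuming the standard information-retrieval definitions of these three measures, they can be computed as below. The counts `tp`, `fp`, and `fn` (true positives, false positives, false negatives) are illustrative names that do not come from the study. Under the further assumption of a balanced F-measure (the harmonic mean of Precision and Recall), the Precision and Recall pairs reported in this paper reproduce the reported F-measure values.

```python
def precision(tp, fp):
    """Share of documents placed in a category that belong there."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Share of documents belonging to a category that were placed in it."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

if __name__ == "__main__":
    # Precision/Recall pairs reported in this paper.
    print(round(f_measure(0.85, 1.0), 2))   # LINGO
    print(round(f_measure(0.78, 0.99), 2))  # Bisecting K-means
```

The two printed values round to 0.92 and 0.87, which agrees with the F-measure figures given for LINGO and Bisecting K-means.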

Results and Discussion
This section explains each process applied to the experimental data and evaluates the performance of the experimental testing. The dataset used in this experiment is the Indonesian text taken from the Indonesian Translation of the Qur'an (ITQ). Text categorization is performed on each of the datasets. First, the dataset is parsed, and then the attributes are selected. The selected attributes are the name of the surah and the verse content.

Pre-processing
In the tokenization phase, the data are cleansed of characters and terms that may affect the quality of the group descriptions. The text is cleaned by cutting it into words/tokens at space characters and punctuation, as seen in the first row of Table 1. Table 1 shows the results for a sample document that contains the text "Segala puji bagi Allah yang telah..." ("All Praise be to Allah who has…"). The Field Index row in Table 1 shows the field type according to the specified attribute: a value of -1 for a stopword, 0 for the name of the surah, and 1 for the verse content. In the normalization phase, each token is converted to lowercase, and stemming is applied to obtain the root of the word; for example, the token "siapakah" (who) is turned into "siapa" (who), and likewise for other tokens. However, irregularities were found in this phase, such as the token "seorang" (a man), which should have been stemmed to "orang" (man).

Frequent Phrase Extraction
In this phase, the resulting candidate phrases become the cluster labels. Table 2 shows the phrases that were successfully extracted from the tokens produced in the previous stage. The Phrases column in this table lists the cluster label candidates to be used. A token sequence becomes a candidate if it meets the requirements: it does not begin or end with a stopword, and its term frequency (TF) exceeds the predetermined minimum term frequency, namely 1. So, only phrases with a TF greater than 1 become candidates, such as the phrases "Tahun" (Year) and "Niscaya" (Undoubtedly), which have TF values of 3 and 4, respectively.

Cluster Label Induction
Based on the label candidates from the previous phase, 16 cluster label indexes were generated in this phase, with a score for each cluster label, such as index 476 with a score of 23.79, index 111 with a score of 16.97, and so on. The cluster label index represents the position of a phrase within the dataset. Therefore, when checked against the cluster label data, cluster label index 476 represents the phrase "Tahun (Year)," index 111 represents the phrase "Niscaya (Undoubtedly)," and so on. The cluster label score shows the similarity of the label to an abstract concept. Table 3 presents sample results of cluster label induction with the cluster label index and score.

Cluster Content Discovery
In this phase, the documents associated with each cluster label are found using the classic Vector Space Model. Table 4 presents a sample of the documents related to cluster label index "476," whose actual content is "Tahun" (Year), and cluster label index "111," whose content is "Niscaya" (Undoubtedly). For the cluster label "tahun" (year), two related documents were found (Table 4). These documents were chosen because they contain the phrase "tahun" (year). Documents that are not assigned to any cluster are put in a cluster labeled "Others." One sample document in Table 4, in translation, reads: "Indeed, if they can figure out where you are, they will throw you with a stone, or force you to return to their religion, and if so, undoubtedly you will not win forever."

Final Cluster Formation
This phase produces the final clusters and presents the scores of the resulting clusters. The cluster score is shown only for clusters that are considered good by LINGO and Bisecting K-means. For example, the cluster "Tahun (Year)" has a score of 23.79, the cluster "Niscaya (Undoubtedly)" has a score of 16.97, and so on. The score does not indicate the quality level of a cluster; a higher score shows only how good the cluster is for LINGO, so it has no influence on the outcome of the performance metrics.

Evaluation
In this experiment, as shown in Table 5, the prepared dataset consists of 30 documents. Cluster label induction using the LINGO method generated 16 categories or cluster labels. The result covers 93% of the documents, while 7% of the documents could not be categorized. Cluster labeling using the Bisecting K-means method generated 7 categories or cluster labels covering 80% of the documents, while 20% of the documents could not be categorized.
The accuracy of the documents placed in each category (Precision) shows an average value of 0.85, while the proportion of relevant documents retrieved for each category (Recall) shows an average value of 1. Compared with Bisecting K-means, which has a Precision of 0.78 and a Recall of 0.99, the LINGO clustering algorithm has better quality. The F-measure is 0.92 using LINGO and 0.87 using Bisecting K-means.
Previous research used the K-Means algorithm for the categorization of Indonesian text [17] and evaluated it using F-measure (0.67) and purity (0.61). Our results are therefore better than those obtained with the K-means algorithm [17], as our research optimizes the stemming and stopword processes for the Indonesian language. Similarly, our experimental results are also better than those of the Fuzzy C-Means algorithm [16], which uses TF-IDF weighting for cluster label determination, whereas our research uses Singular Value Decomposition and the Vector Space Model to create cluster labels and cluster documents.
Based on the experimental results, LINGO has a limitation for text categorization of large data sets, since the algorithm takes up much memory and requires a high number of matrix transformations. Similarly, the stemming errors discovered in the pre-processing phase greatly affect the subsequent phases and influence the precision outcome in the performance metrics. Thus, a more optimal stemming algorithm is crucial for producing clusters with high accuracy. Another limitation is the lexical usage in Indonesian texts, which can influence the performance metrics and has to be considered when clustering the documents [21].

Conclusion
This paper presents an intelligent system that categorizes Indonesian text into meaningful cluster labels by adopting the LINGO and Bisecting K-means clustering algorithms. Of the Indonesian text documents in the dataset, 93% and 80% could be categorized under cluster labels using LINGO and Bisecting K-means, respectively. Furthermore, the clustering quality evaluation indicates that text categorization using LINGO has Precision, Recall, and F-measure values of 0.85, 1, and 0.92, respectively, against 0.78, 0.99, and 0.87 for Bisecting K-means. The LINGO clustering algorithm therefore gives a better result than the Bisecting K-means method and is suitable for categorizing Indonesian text documents.
The experimental results indicate that users can obtain Indonesian text categorization with greater exactness and completeness using the clustering technique. The main contribution of this study is to optimize the clustering results by applying and maximizing text processing using an Indonesian stemmer and stopword list.