MTStemmer: A multilevel stemmer for effective word pre-processing in Marathi

Article History: Received: 10 November 2020; Revised: 12 January 2021; Accepted: 27 January 2021; Published online: 05 April 2021

Abstract: In natural language processing, it is important that the context and the meaning of words are retained while also ensuring the efficacy of the data modelling process. During human-to-human interactions, special care is taken regarding the tense and phrasing of words by taking into consideration the grammar rules of the specific language. While this modification of words is necessary for framing consistent sentences, these appendages do not add significant value to the original meaning of the word. Stemming is the process of converting words back to their root form for efficient and accurate modelling of the data. In this paper, MTStemmer, a new stemmer for the Marathi language, is proposed. It focuses on stripping suffixes to obtain the root word form. The proposed stemmer applies a multilevel approach by taking into consideration both auxiliary verb-based suffixes and gender-based suffixes. The presented approach intends to address the limitations of previously proposed stemmers for this language. The stemming performed by the stemmer is found to be more accurate in terms of mapping to the root words. Stemming is often an important pre-processing step before the data is processed further for the main task. The benefit of the proposed stemmer is demonstrated by using it for an extractive Marathi text summarization task. A significant improvement across multiple performance metrics is achieved owing to the stemming done by MTStemmer. The working of the proposed stemmer shows promising signs for the development of similar engines for other Indic languages.


Introduction
Natural language processing (NLP) involves modelling human-readable sentences in mathematical form so as to automate a particular language understanding task. It is crucial that the modelling retains as much of the semantic structure and meaning of the original data as possible. NLP finds application in information retrieval, data mining, knowledge extraction, linguistics, and other sub-domains of artificial intelligence. The two main sub-categories of NLP are natural language understanding (NLU) and natural language generation (NLG). NLU tasks involve analysing the provided text for reading comprehension. NLG involves generating new text based on available data, including tasks like translation, summarization, and chatbots. In both types of tasks, fast and effective conversion of text is desired. Stemming is one such method used for the effective resolution of words, especially as a pre-processing step. Stemming involves converting words back to their root form (Porter et al., 1980). Consider the word 'play', which has different forms such as 'playing' and 'played'. While these forms are necessary for maintaining grammatical correctness in human interactions, they do not add any extra value in terms of the meaning of the root word. Thus, a better approach would not treat these three words differently and would rather map them to their root word before modelling them (Dogra et al., 2013). Stemming does this job of stripping words back to their root form.
In the past, multiple stemmers have been proposed, mainly for the English language (Porter et al., 1980; Porter, 2001). Recent years have also seen a rise in stemmers being developed for non-English text, especially for Indian languages (Makhija, 2016; Kaur and Buttar, 2019; Kumar et al., 2020). Marathi is an Indian language written in the Devanagari script and is spoken prominently in the state of Maharashtra and nearby areas. While stemmers are available for the Devanagari script and the Hindi language, work on the Marathi language has been scarce. Also, a transfer learning or fine-tuning based approach to stemming may turn out to be risky, as the Marathi language has some peculiar characteristics.
In this paper, MTStemmer, a stemmer for the Marathi language, is proposed, which performs stemming using a multilevel approach. The stemming is performed in two phases: in the first phase, auxiliary verb suffixes are stripped, while in the second phase, the gender-based suffixes are removed. Neither type of suffix is found to add significant value to the root meaning of the word. Through this paper, the following contributions have been made:
1. A new stemmer for the Marathi language has been created which takes into consideration the nuances of the language and still offers better performance than existing solutions.
2. The proposed stemmer considers not only tense-based suffix forms, but also the gender-based suffixes and auxiliary verb-based suffixes which are prevalent in the Marathi language.
3. The benefit of using MTStemmer is demonstrated by showing the improvement in results obtained for the task of extractive text summarization when using the stemmer as a pre-processing step.
The outline of the paper is as follows: Section 2 explains the background work. Section 3 describes the challenges faced when working with the Marathi language. The proposed methodology is explained in Section 4. The dataset description is provided in Section 5, while the obtained results are shown in Section 6. Finally, the conclusion of the work is presented in Section 7.

Background and Related Work
The initial methods of natural language processing took into consideration statistical features such as word count, position, and part-of-speech tags, which were then fed to either mathematical functions or machine learning algorithms. However, later years saw a rise in the emphasis placed on pre-processing methods for improving accuracy in the modelling stage. Gradually, methods like sequence padding, lemmatization, stop word removal and stemming became prevalent as common pre-processing steps.
The stemmer proposed by Martin Porter became one of the standard stemmers adopted by the community for the English language and is still used in many applications (Porter et al., 1980). Porter also went on to write Snowball, a language for stemming algorithms that tried to address two main issues: the lack of standard stemmers for non-English text, and the extrapolation by developers when implementing the Porter stemmer (Porter, 2001). The Porter and Snowball stemmers continue to be the most used stemmers for the English language. Recent years have also seen a rise in the development of stemmers for non-English languages, especially those used in a linguistically diverse country like India (Harish and Rangan, 2020). Ramanathan and Rao proposed one of the first stemmers for Hindi, which deployed a suffix stripping and lookup table approach (Ramanathan and Rao, 2003). Mishra and Prakash combined suffix stripping with brute force to propose a stemmer, "Maulik", for Hindi (Mishra and Prakash, 2012). Some work has also been done for inflectional languages (Paik and Paruj, 2008). Unsupervised learning has also been used for this purpose (Husain, 2012). A hybrid approach to stemming was presented by Sharma et al. to improve the efficiency of information retrieval (Sharma et al., 2016). Some previous works on sentiment analysis have also made use of in-built stemmers offered by contemporary transformer architectures (Malte and Ratadiya, 2019; Ratadiya and Mishra, 2019). The Hindi language is written in the Devanagari script, and there have been efforts to build stemmers for some other languages written in the Devanagari script as well.
Makhija proposed an affix removal-based stemmer for the Sindhi language (Makhija, 2016). Recently, Nathani et al. presented an unsupervised learning-based stemmer to improve the performance (Nathani et al., 2020). Desai and Dalwadi came up with a stemmer for Gujarati text which involved the removal of both prefixes and suffixes (Desai and Dalwadi, 2016). A stemmer for verbs in the Punjabi language, following a rule-based approach, was developed by Kaur and Buttar (Kaur and Buttar, 2019). The first stemmer for Maithili was proposed by Priyadarshi and Saha using a hybrid approach (Priyadarshi and Saha, 2019). Barman worked on Marglish code-mixed Marathi text in a way that improves the performance on opinion mining. With increasing work on various NLP tasks, efforts are also being made to ensure that the accountability of these models does not go unnoticed (Verma and Verma, 2020).
There are certain limitations in the previous work which leave ample scope for research in this domain. Firstly, dictionary or lookup table-based stemmers are large in size and also slow in terms of retrieval. Hybrid stemmers proposed in the past have primarily relied on brute force as one of the involved approaches, thus leaving room for manual errors (Dogra et al., 2013). For the Marathi language, the work has been quite limited. Further, there are certain challenges in using the existing approaches for the Marathi language, which are described in the next section.

Challenges When Dealing with The Marathi Language
There are some peculiar features of the Marathi language which make it difficult to deploy a cross-language transfer learning concept. This is owing to some language-specific grammar rules which need to be addressed accurately. Some of the challenges when processing the Marathi language, especially for a stemmer, are as follows:

Opposite words formed using prefixes
In Marathi, the opposite of a word is often formed by using a prefix. Consider the word Uchit (appropriate). The opposite of this word is Anuchit (inappropriate), which is formed by prepending the prefix 'An-'. Similarly, many other such prefixes are used to create contrasting terms. As a result, a stemmer that directly strips off prefixes may completely change the meaning of the term. This problem is analogous to including the word "not" in the list of stop words when processing the English language.

Prefix terms used independently
The terms which are used as prefixes to create opposites are also found as an integral part of some words. Consider the previous example of 'An-'. The word 'Anukul' means suitable, but after removing 'An-', the remaining word does not have any meaning on its own. In this case, 'An-' does not act as a prefix. Thus, the removal of prefixes in a stemmer is not recommended: in some cases they reverse the meaning of the word, while in other cases they are an essential part of the word itself and are not used merely for grammatical correctness.

Variation in suffixes based on gender and auxiliary verb tenses
Unlike English, where verbs do not have different forms for genders, in Marathi there are suffixes appended to verbs only to indicate the gender of the subject. These suffixes do not add much to the meaning of the action denoted by the verb and can be removed. While in English stemming takes place only based on the tense of the verbs, in Marathi the stemmer should also take into consideration these gender-based suffixes to further improve the efficiency of the system (Patil et al., 2017). Table 1 gives examples for each of these challenges to show the problems which need to be addressed by the stemmer. The proposed MTStemmer tries to tackle all these challenges while improving the results.

Proposed Methodology
The proposed stemmer involves a two-stage stemming of the suffixes of the given word. As mentioned earlier, these two stages are dependent on the two types of suffixes which it intends to remove: gender-based suffixes and auxiliary verb-based suffixes.

Auxiliary verb-based stemming
In Marathi, many suffixes are added to the verb to indicate its tense or to ensure the grammatical correctness of the sentence. These suffixes could be added to a plethora of words and tokens, and it is a challenging task to determine the list of all possible suffixes. The aforementioned challenge of a suffix-like ending being a meaningful part of the word also persists. To tackle this problem, the length of the word is mapped to a list of suffixes. From Marathi grammar rules, it can be derived that for words of a particular length, only a specific set of characters can act as an auxiliary verb suffix.
Thus, for every token word, the length of the word is checked. Based on this length, the terminal character of the word is compared with the mapping table of suffixes to decide whether the terminal character is to be stripped off or not. The list of suffixes is mapped not to a specific length, but to a length threshold. Table 2 lists the suffixes for each threshold on the length of the individual word. Note that the rows of this mapping are checked from top to bottom, in the same sequence as in the table.
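The length-threshold lookup described above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the thresholds and suffixes in `AUX_SUFFIX_RULES` are placeholders rather than the entries of Table 2, the function name is hypothetical, and the rule "threshold is compared against the length minus one" is an assumption based on the description of both stemming levels in this section.

```python
# Placeholder rows, NOT the contents of the paper's Table 2.
# Each row is (minimum effective length, candidate suffixes);
# rows are checked from top to bottom.
AUX_SUFFIX_RULES = [
    (4, ("वे",)),
    (3, ("त",)),
]


def strip_aux_suffix(word, rules=AUX_SUFFIX_RULES):
    """Strip at most one suffix, selected by the word's length threshold."""
    # len() counts Unicode code points, so a consonant plus a vowel sign
    # counts as two; the paper may count characters differently.
    effective_len = len(word) - 1  # assumption: the terminal character is skipped
    for min_len, suffixes in rules:
        if effective_len < min_len:
            continue  # word too short for this row
        for suffix in suffixes:
            if word.endswith(suffix):
                return word[: -len(suffix)]
    return word  # no row matched: leave the word untouched


if __name__ == "__main__":
    print(strip_aux_suffix("उठावे"))  # long enough for the first row
    print(strip_aux_suffix("नवे"))    # too short, so it is left as it is
```

Tying the suffix list to a length threshold means that a short word which merely ends in the same characters is not truncated, which is the concern raised earlier about endings that belong to the word itself.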

Gender-based suffix stemming
Once the verb-based stemming is done, a gender-based stemming strategy is deployed which is similar in flow to the previous one. The length threshold is mapped to a set of suffixes to determine the stemming order. Table 3 indicates the mapping for these gender-based suffixes, which are checked in the same sequence as the rows in the table. In this case, there are more suffixes involved, since multiple genders are addressed through these characters. For both these stemming levels, the word length considered is one less than its original length, to skip the terminal suffix character. After clearing the auxiliary suffixes, the gender suffixes are removed to obtain the final root form of the given word. Table 4 shows examples of stemming some Marathi words using the proposed stemmer.

उठावे → उठा
The working of the proposed MTStemmer is described in Algorithm 1. Beyond evaluating the stemmer on its own, it is also important to check its effectiveness on downstream tasks, especially those where stemming is regarded as an important preprocessing step.
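As a rough illustration of the two-level flow, the following Python sketch applies an auxiliary-suffix pass followed by a gender-suffix pass to every token of a document and collects the results, mirroring the loop structure of Algorithm 1. The suffix tables are placeholders, not Tables 2 and 3; `remove_suffix` and `mt_stem_document` are hypothetical names; and the condition that guards the gender step in Algorithm 1 is not reproduced here because it is not recoverable from the text.

```python
# Placeholder tables, NOT the paper's Tables 2 and 3.
AUX_TABLE = [(4, ("वे",))]                 # (length threshold, suffixes)
GENDER_TABLE = [(3, ("ी", "ा", "े"))]


def remove_suffix(word, table):
    """Strip the first suffix whose row threshold the word satisfies."""
    effective_len = len(word) - 1  # length considered is one less than the original
    for min_len, suffixes in table:  # rows are checked top to bottom
        if effective_len >= min_len:
            for suffix in suffixes:
                if word.endswith(suffix):
                    return word[: -len(suffix)]
    return word


def mt_stem_document(tokens, aux_table=AUX_TABLE, gender_table=GENDER_TABLE):
    """Two-level stemming of a token list: auxiliary pass, then gender pass."""
    stemmed = []
    for w in tokens:
        w_prime = remove_suffix(w, aux_table)           # level 1
        w_prime = remove_suffix(w_prime, gender_table)  # level 2
        stemmed.append(w_prime)
    return stemmed


if __name__ == "__main__":
    print(mt_stem_document(["उठावे", "गेली"]))
```

With these placeholder tables, the sample pair उठावे → उठा is reproduced: the first pass strips the auxiliary ending, and the shortened word then falls below the gender threshold, so the second pass leaves it unchanged.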

Using stemmer as a preprocessing method
Stemming has long been used as an effective and efficient preprocessing method for the modelling of data. Evaluating a stemmer by itself does not prove its effectiveness, because some root form is always obtained; the benefit of the stemmer can instead be checked by using it as a preprocessing step for a natural language processing task and comparing the impact on the obtained results. Figure 1 shows the flowchart of an input case in the system.
The proposed stemmer is used as a preprocessing step for the task of extractive text summarization. Extractive text summarization involves ranking the sentences of the document and retaining only the top n% of those sentences as the summary of the document. There are various methods for performing extractive text summarization, but the most widely used technique is the TextRank algorithm (Mihalcea and Tarau, 2004).

Algorithm 1 Algorithm of MTStemmer
8: w' = remove_gender_suffix(w', G)
9: end if
10: D'.append(w')
11: end for
12: return D'

A graph is constructed in which each sentence is a node and each edge carries the cosine similarity between the two sentences that it connects. Based on the user input, the top n sentences from this graph are then taken as the final summary of the document (Mihalcea and Tarau, 2004). Being an unsupervised method, it has low complexity and produces results without any training, which makes it a convenient modelling technique for this evaluation.
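The graph-based ranking described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the configuration used in the experiments: sentences are represented as term-frequency vectors over whitespace tokens, the damping factor of 0.85 and the fixed number of iterations are conventional choices, and the `stem` parameter is a hypothetical hook showing where a stemmer such as MTStemmer would be applied before the similarity computation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank_summary(sentences, top_n, stem=lambda w: w, damping=0.85, iterations=50):
    """Rank sentences with a PageRank-style iteration and keep the top_n."""
    n = len(sentences)
    if n == 0:
        return []
    # Node representation: bag of (optionally stemmed) words per sentence.
    vectors = [Counter(stem(w) for w in s.split()) for s in sentences]
    # Edge weights: cosine similarity between distinct sentences.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(sim[j][i] * scores[j] / out_weight[j]
                            for j in range(n) if out_weight[j])
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]  # keep document order


if __name__ == "__main__":
    doc = ["a b c", "a b d", "x y z"]
    print(textrank_summary(doc, top_n=2))
```

Because stemming maps inflected forms to a shared root, passing a stemmer through `stem` increases the term overlap between related sentences, which is the mechanism by which the preprocessing step can change the ranking.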

Dataset Description
The extractive text summarization task is conducted on a news article dataset¹. The dataset consists of over 100 articles covering topics such as the economy, finance, and politics. As the summarization method is unsupervised, its performance with and without MTStemmer is evaluated over all these documents. A sample document text is shown in Table 5. The TextRank algorithm is run on this document set twice: once without MTStemmer, and once with MTStemmer as a preprocessing step before the text is forwarded to the algorithm. The results obtained in both instances across multiple performance metrics are presented in the next section.

Result and Analysis
For extractive text summarization, the ROUGE metric is considered a standard performance metric. ROUGE takes into consideration the number of overlapping words present in the human summary and the predicted summary. Considering this approach, the precision, recall, and F1 scores under ROUGE for a human summary HS and a predicted summary PS are defined.
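The definitions themselves are not reproduced in the text. Assuming the usual overlap-based formulation, where |HS ∩ PS| denotes the number of words shared by the human summary HS and the predicted summary PS, and |HS| and |PS| denote their word counts, they can be written as:

```latex
\begin{align}
\text{Precision} &= \frac{|HS \cap PS|}{|PS|} \\
\text{Recall}    &= \frac{|HS \cap PS|}{|HS|} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```

For example, if a predicted summary of 8 words shares 4 words with a human summary of 10 words, then precision is 4/8 = 0.5, recall is 4/10 = 0.4, and F1 is 2(0.5)(0.4)/(0.5 + 0.4) ≈ 0.444.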
The F1 score is the harmonic mean of the previously mentioned metrics of precision and recall. Based on these metrics, the results obtained with the TextRank algorithm are tabulated in Table 6. It can be seen that the use of the proposed stemmer has led to gains of 2-3% in the performance of the system without changing any other configuration. To further support this finding, results are also evaluated on ROUGE-1, ROUGE-2, and ROUGE-L precision, which take into consideration the overlap of one word (unigram), two words (bigram), and the whole sentence respectively. The obtained results are indicated in Table 7. While the proposed approach focuses on the development of the stemmer, it has also been shown that its use helps improve performance across most of the variants of the ROUGE metric. A similar trend can be expected when using the proposed MTStemmer for other language processing tasks and other performance metrics.
The performance of the proposed stemmer is also compared with two other standard stemmers for the Devanagari script: the Snowball stemmer² and the Marathi stemmer package from the Indic stemmer library³. The comparison of the average precision, recall, and F1 score over the 100 documents is tabulated in Table 8. It can be seen that the proposed stemmer has comfortably outperformed the two standard stemmers in this domain, owing to its consideration of various kinds of suffixes and appropriate handling of the nuances of the Marathi language. As a part of ablation studies, the results obtained by these three stemmers on the F1 score of the ROUGE-1, ROUGE-2, and ROUGE-L metrics are shown in Table 9.
It can thus be concluded that the proposed stemmer is a better alternative to the existing standard stemmers for the Marathi language, as demonstrated by the results across multiple performance metrics for a sample task of extractive text summarization. A similar trend can be expected when extending the use case to other natural language understanding tasks.
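To make the ROUGE variants mentioned above concrete, the following Python sketch computes ROUGE-1 and ROUGE-2 precision from clipped n-gram overlap, and ROUGE-L precision from the longest common subsequence, which is how ROUGE-L is commonly defined. It is an illustrative simplification that assumes whitespace tokenization; the function names are hypothetical and it is not the evaluation code behind the reported tables.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_precision(reference, candidate, n):
    """Clipped n-gram overlap divided by the number of candidate n-grams."""
    ref = Counter(ngrams(reference.split(), n))
    cand = Counter(ngrams(candidate.split(), n))
    total = sum(cand.values())
    overlap = sum((ref & cand).values())  # & keeps the minimum count per n-gram
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_precision(reference, candidate):
    """LCS length divided by the candidate length."""
    ref, cand = reference.split(), candidate.split()
    return lcs_length(ref, cand) / len(cand) if cand else 0.0


if __name__ == "__main__":
    hs = "the cat sat on the mat"
    ps = "the cat lay on the mat"
    print(rouge_n_precision(hs, ps, 1))
    print(rouge_n_precision(hs, ps, 2))
    print(rouge_l_precision(hs, ps))
```

In a setting like the one described here, both summaries would typically be passed through the same stemmer before scoring, so that inflected forms of one root count as a match.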

Conclusion
A new stemmer for the Marathi language, MTStemmer, has been proposed in this paper. It deploys a multilevel suffix stemming approach using a lookup table for reducing words to their root forms. The proposed stemmer is used as a preprocessing method for the use case of extractive text summarization, and significant gains are observed in the performance of the system over multiple performance metrics. The proposed stemmer is lightweight in nature and still manages to be effective, while not violating the grammatical nuances of the Marathi language. Both kinds of suffixes, verb-based and gender-based, are addressed by the stemmer. Future work includes extending the use of this stemmer to other language processing tasks such as sentiment analysis and question answering. The use of contextual vectors for tokenization can also help improve the performance of the system for the mentioned text summarization task. As the digital world provides access to more data across multiple languages, efficient and accurate methods for processing and modelling such data will help extend the scope of smart systems to areas and domains where English or other global languages may not necessarily be used. The proposed MTStemmer marks a positive step in this direction for the Marathi language.