A Heuristic Approach for Telugu Text Summarization with Improved Sentence Ranking

Text summarization is the task of extracting or abstracting a condensed form of an original text document while retaining its information and overall meaning. Creating summaries manually from large text documents is difficult and time-consuming for humans. Text summarization has therefore become an important and challenging area in natural language processing. This paper presents a heuristic approach to extract summaries of e-news articles in the Telugu language. The method proposes new lexical parameter-based information extraction (IE) rules for scoring sentences. The event score and the named entity score are the novel parts of sentence scoring; they identify the essential information in the text. Sentences are selected for the summary depending on how frequently events and named entities occur in the sentence and in the document. Data for the experiments is collected from online news sources (Eenadu, Sakshi, Andhra Jyothi, and Namaste Telangana). The proposed method is compared with other techniques developed for Telugu text summarization. Precision, recall, and F1 score are used to measure the proposed method's performance. An extensive statistical and qualitative evaluation of the system's summaries has been conducted using Recall-Oriented Understudy for Gisting Evaluation (ROUGE), a standard summary evaluation tool. The results show improved performance compared to the other methods.


1.Introduction
For English, several advancements have been made in the field of text summarization, but not for Indian languages. Telugu is the second most popular language in India and the fifteenth most widely spoken language in the world [4]. Telugu is an agglutinative language, so text summarizers developed for other Indian languages such as Hindi and Bengali do not support Telugu. Text summarization for Telugu has received little attention because Telugu resources such as datasets, dictionaries, and WordNet are not available. Nowadays, Telugu e-newspapers (Eenadu, Sakshi, Andhra Jyothi, Namaste Telangana) are freely available online. Extracting important information from these newspapers is a time-consuming task. Text summarization mines the significant sentences to generate a summary of the entire document.
An automatic text summarization method provides a condensed form of the original text document while retaining its meaning and information. The summary helps readers understand the content quickly without reading the entire text. Depending on the type of summary, text summarization methods are broadly classified as extractive or abstractive. Extractive summarization retrieves selected sentences from the source text. Sentences are extracted depending on statistical and linguistic features of the input text [16]. Abstractive summarization methods interpret the source document and rewrite the sentences to obtain summaries. This paper proposes an improved sentence ranking approach that generates effective summaries for Telugu text documents based on occurrences of events and named entities in the text.
The rest of the paper is organized as follows: Section 2 reviews the literature on summarization techniques developed for Indian languages. Section 3 describes the framework of the text summarization approach developed for Telugu. Section 4 presents the dataset and the experimental results. Section 5 concludes the paper.

2.Related Work
In the literature, automatic text summarization systems are widely available for English and other foreign languages, but few exist for Indian languages. This section reviews text summarizers developed for Indian languages.
features like cue words, nouns, title words, sentence length, position, numerical data, inverted commas, etc., to obtain different sentence scores. A lexical rule-based text summarizer has been developed for Hindi [12]. Word-level features such as word frequency, word length, and word occurrences, and sentence-level features such as sentence length and a sentence similarity score, are used in rule formation.
In [14], vector space term weighting is used to rank the sentences in the document. Query words are given importance in sentence scoring. Topic-based opinion summaries for Bengali have been developed that consolidate the sentiment information in the given input text document [4]. Extractive summarization for Bengali has been created using thematic terms and word position as features [5]. In [11], multi-document text summarization for Bengali is explained. Statistical methods such as term frequency are used to score the sentences and extract the relevant information from multiple documents.
A text summarization method for Tamil is proposed in [7]. In this method, semantic graphs are built for the source text document. By analysing these semantic graphs, human experts obtain the summary of the text. For online sports news in Tamil, statistical features such as word frequency, word position, and the number of named entities in a sentence are used to score the sentences, and the highest-ranked sentences are retrieved to generate the summary [15].
For Kannada, extraction-based text summarization has been developed based on key term scores [10] [13]. Sentences are scored using key terms obtained from term frequency and inverse document frequency measures. In [2], relevant sentences are extracted by computing sentence scores in Malayalam text documents. The term frequency and the position of words are used to compute the score.
For Telugu, keyword-based approaches are used to generate summaries [9]. The probability distribution of tags is used to identify the keywords, which in turn help to score the sentences. Human intervention is needed to some extent at the annotation level to identify the keywords. In [3], a neural network based approach is used to generate summaries, but the summaries are not evaluated for their performance. The literature study shows that all the text summarizers surveyed for Indian languages are extraction-based and that statistical and lexical features are used to rank the sentences. This paper presents a fully automated heuristic approach to text summarization with an improved sentence ranking mechanism.

3.Proposed Summarization Method
Text summarization for Telugu is one of the vital applications in Natural Language Processing (NLP). This section proposes a heuristic approach for automatic text summarization of Telugu documents. An improved sentence scoring method is used to rank the sentences. The sentence scoring mechanism is based on event and named entity scores. An event is defined as a happening or occurrence of any situation in the real world. A named entity is defined as a person, place, or thing involved in an event. Statistical lexical extraction rules are developed for scoring the events and named entities, and these scores are then used to compute the sentence scores.
In the proposed method, a Telugu text document is taken as input. Pre-processing steps such as tokenization and stemming are performed. Tokenization splits the text document into a sequence of words. Stemming divides each term into a stem and a suffix; the stemmer algorithm removes suffixes using a set of frequent suffixes. For example, for the words దేశంలో and దేశం, the suffix లో is removed so that both terms are treated as the same. Stop words are then identified and removed from the document. A list of 228 stop words has been built for Telugu; stop words such as లో, ఒక, మార్పు, పేజీ, ఈ, and కు are removed from the text. The remaining terms are sent for Parts of Speech (POS) tagging.
"Events" and "Named Entities" are the linguistic features used in the proposed method. Events are terms that indicate happenings in the real world. The verbs in the text express these actions, and they play an essential role in scoring the relevance of a sentence for the summary. Named entities are the names of the persons, places, things, and animals involved in the occurrence of an action. Such words are tagged as nouns by the POS tagger. The feature extraction part of the proposed method retrieves the events and named entities available in each sentence and passes them on for statistical analysis.
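The pre-processing steps described in this section (tokenization, suffix stripping, and stop-word removal) can be sketched as follows. This is a minimal illustration under stated assumptions, not the stemmer used in the paper: the `STOP_WORDS` and `SUFFIXES` collections are tiny hypothetical subsets (the paper's 228-word list and its suffix set are not reproduced here), and the function names are chosen only for this example.

```python
import re

# Hypothetical, illustrative subsets; the paper's full 228-word stop list
# and its set of frequent suffixes are not reproduced here.
STOP_WORDS = {"లో", "ఒక", "ఈ", "కు"}
SUFFIXES = ("లో",)


def tokenize(text):
    """Split text into word tokens on whitespace and common punctuation."""
    return re.findall(r"[^\s,.;:!?()]+", text)


def stem(token):
    """Strip the longest matching suffix, keeping a non-empty stem."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, stem, then drop stop words (the order given in the text)."""
    stems = [stem(token) for token in tokenize(text)]
    return [s for s in stems if s not in STOP_WORDS]
```

With these assumptions, the two surface forms from the example above reduce to the same stem, so their occurrences are counted together in the later scoring steps.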
Sentence scoring is done by applying statistical measures to the events and named entities obtained. The number of occurrences of an event or named entity is used to find its word frequency score. The word frequency score is the ratio of the number of occurrences of the event/named entity in the document to the total number of events/named entities. Equation 1 is used to calculate the word frequency score. The number of sentences in which the event/named entity occurs is used to find the inverse sentence frequency. Equation 2 gives the calculation of the inverse sentence frequency of events/named entities. A word that occurs in many sentences receives the least significance for inclusion in the summary. The product of word frequency and inverse sentence frequency gives the wf_isf score of term t. Equation 3 determines the term's significance for inclusion in the summary based on its wf_isf score.
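The wf_isf computation can be sketched as below. Equations 1 to 3 are not reproduced in this text, so the sketch assumes the common forms: wf as a term's share of all event/named-entity occurrences in the document, and isf as the logarithm of the number of sentences divided by the number of sentences containing the term. The function name `wf_isf_scores` and the input layout are hypothetical.

```python
import math
from collections import Counter


def wf_isf_scores(sentences):
    """Return a wf_isf score per term.

    `sentences` is a list of lists; each inner list holds the events and
    named entities extracted from one sentence.
    Assumed forms (not confirmed by the source):
      wf(t)  = occurrences of t / total event and named-entity occurrences
      isf(t) = log(number of sentences / number of sentences containing t)
    """
    all_terms = [term for sentence in sentences for term in sentence]
    total_terms = len(all_terms)
    counts = Counter(all_terms)
    n_sentences = len(sentences)

    scores = {}
    for term, count in counts.items():
        wf = count / total_terms
        sentence_freq = sum(1 for sentence in sentences if term in sentence)
        isf = math.log(n_sentences / sentence_freq)
        scores[term] = wf * isf
    return scores
```

Under this assumed isf, a term that appears in every sentence scores zero, which matches the statement that words occurring in many sentences are the least significant.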

The wf_isf scores of all events and named entities in a sentence are summed. The sentence score is obtained by normalizing this sum by the number of events and named entities in the sentence. Equation 4 shows the calculation of the sentence score.
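A minimal sketch of this step follows, assuming that the sentence score is the sum of the wf_isf scores of the sentence's events and named entities divided by their count. Equation 4 is not reproduced in this text, so the division is an interpretation, and `sentence_score` is a hypothetical name.

```python
def sentence_score(terms, wf_isf):
    """Average wf_isf score of the events and named entities in a sentence.

    `terms` lists the events/named entities found in the sentence and
    `wf_isf` maps each term to its wf_isf score. A sentence without any
    event or named entity is given a score of 0.0 (an assumption).
    """
    if not terms:
        return 0.0
    return sum(wf_isf.get(term, 0.0) for term in terms) / len(terms)
```

Dividing by the term count keeps long sentences from outranking short ones merely because they contain more events and named entities.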
The sentence ranking step arranges the sentences in descending order of their sentence scores. The average of these sentence scores is used as the threshold for sentence selection. In the proposed method, a sentence is selected for the summary only if its score is greater than the threshold. The sentences retrieved for the summary may sometimes contain duplicate content.
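The ranking and threshold step can be illustrated with a short sketch. It assumes the scores are held in a list indexed by sentence position; `select_sentences` is a hypothetical name.

```python
def select_sentences(scores):
    """Return indices of sentences scoring above the mean, best first.

    `scores` holds one sentence score per sentence, in document order.
    """
    if not scores:
        return []
    threshold = sum(scores) / len(scores)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] > threshold]
```

For scores of 0.2, 0.8, 0.5, and 0.1 the mean is 0.4, so only the second and third sentences pass the threshold.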
Uniqueness detection in the proposed method identifies whether a selected sentence contains unique information. A sentence similarity measure is used to determine whether two sentences are similar. The sentences are converted to vectors, and the similarity between two sentences Si and Sj is computed using Equation 5. If the similarity score between two sentences is greater than 80%, the lower-scored sentence is eliminated and the higher-scored sentence is retained in the summary. For example, suppose sentence S1 is 95% similar to S2 according to the sentence similarity metric, the sentence score of S1 is 0.24, and the score of S2 is 0.73. Of these two sentences, S1 is eliminated from the summary because its sentence score is lower than that of S2. The summary generation part of the proposed method extracts the highest-scored unique sentences to form the summary.
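The uniqueness check can be sketched as follows. Equation 5 is not reproduced in this text, so cosine similarity over term-count vectors is assumed here as the similarity measure; the function names are hypothetical, and the 0.8 default mirrors the 80% threshold stated above.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-count vectors of two sentences."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_near_duplicates(candidates, threshold=0.8):
    """Keep the higher-scored sentence of every near-duplicate pair.

    `candidates` is a list of (tokens, score) pairs. Sentences are visited
    from highest to lowest score, and a sentence is kept only if it is not
    more than `threshold` similar to any sentence already kept.
    """
    kept = []
    for tokens, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(cosine_similarity(tokens, k[0]) <= threshold for k in kept):
            kept.append((tokens, score))
    return kept
```

Visiting candidates in descending score order means that, as in the S1 and S2 example, the sentence with the lower score is always the one dropped.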

4.Experimental Results and Discussion
This section evaluates the quality of the summaries obtained by the proposed algorithm. The experimentation starts with data collection by scraping content from popular e-newspapers such as Eenadu, Sakshi, Andhra Jyothi, and Namaste Telangana. The dataset contains 90 articles from each newspaper, collected over 30 days, for a total of 360 articles. Each document contains around 50 to 60 sentences. Human-generated summaries for these documents were written by Telugu linguists and are termed model summaries. The system summaries are compared against these model summaries to measure performance.
To compare the results of the proposed method, precision, recall, and F-score are calculated using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) 1.5.5 tool [6]. It is a standard summary evaluation tool used to assess summaries generated by systems. The ROUGE tool returns three evaluation metrics, namely average precision, average recall, and average F-score, to determine the performance of the system. Precision is defined as the ratio of the number of sentences common to the model and system summaries to the number of sentences in the system summary. Recall is the ratio of the number of sentences common to the model and system summaries to the number of sentences in the model summary. F-score is defined as the harmonic mean of the precision and recall scores. Table 1 compares the experiments conducted on the created dataset of 360 documents. The results are compared with those of the keyword-based text summarizer [9] and the neural text summarizer [3] developed for Telugu in the literature. The results show that the proposed work outperforms the other methods in terms of average precision, average recall, and average F-score. Figure 2 gives a comparative chart of the average scores of the three evaluation metrics (precision, recall, and F-score) obtained by the different summarization methods.
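The sentence-level definitions of precision, recall, and F-score given above can be sketched as follows. This is only an illustration of those definitions and not the ROUGE 1.5.5 tool itself, which scores n-gram overlap; `precision_recall_f1` is a hypothetical name.

```python
def precision_recall_f1(system_sentences, model_sentences):
    """Sentence-overlap precision, recall, and F-score for one document.

    Sentences are compared by exact match, so duplicates are collapsed
    into sets before counting the overlap.
    """
    system, model = set(system_sentences), set(model_sentences)
    overlap = len(system & model)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(model) if model else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

The averages reported for a dataset would then be obtained by computing these three values per document and taking their means.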

5.Conclusion
This paper proposed a heuristic-based method of extractive text summarization with an improved sentence ranking mechanism for Telugu text documents. Events and named entities are the linguistic parameters used to identify the significant sentences in the text. Sentence scores are computed from the occurrences of events and named entities in the text. The highest-ranked unique sentences are selected to generate the summary. A total of 360 articles were collected from various Telugu e-newspapers and used to evaluate the proposed method. The standard evaluation metrics of precision, recall, and F-score are used to measure the proposed method's performance, and the ROUGE evaluation tool is used to compute these scores. The results obtained for the proposed method are compared with those of other approaches, namely keyword-based and neural-based approaches. The proposed method achieved an average precision of 0.883, an average recall of 0.865, and an average F-score of 0.873. In comparison, the proposed heuristic-based approach showed improved performance over the other methods.