A Review of Machine Translation for South Asian Low Resource Languages

Machine translation is an application of natural language processing. Humans use natural languages to communicate with one another, whereas programming languages are used for communication between humans and computers. NLP is the field that provides a broad set of techniques for the analysis, manipulation and automatic generation of human (natural) languages with the help of computers. In the present information age, it is essential to give people access to information for their development, and equal emphasis must be placed on removing the language barrier between different sections of society. NLP strives to bridge this language barrier through machine translation, in which one natural language is transformed into another with the aid of computers. The first few years of the field were dedicated to the development of rule-based systems; later, owing to the increase in computational power, there was a transition towards statistical machine translation. The goal of machine translation is that the meaning of the text should be preserved during translation. This paper analyses the machine translation approaches used for resource-poor languages and identifies the needs and challenges that researchers face. It also reviews the machine translation systems that are available for resource-poor languages.


Introduction
Machine translation is the technique of translating text from one natural language into another natural language using computer software, e.g. English to Urdu. It is an automated process in which the computer does the translation work. Machine translation is an application of computational linguistics [1]. Computational linguistics is an interdisciplinary field that requires both language and computer experts: language experts frame the rules of the languages, and computer experts program the computer to understand these rules. The area of machine translation started when electronic computers came into existence. The concept was first put forward around World War II by Weaver, one of the pioneers of machine translation, drawing on the code-breaking work used to crack the German Enigma code. In the 1950s, machine translation became a reality with the Georgetown experiment, which automatically translated more than sixty Russian sentences into English [2]. As a result, a lot of interest and funding flowed in for almost a decade. The United States led the research and funding, with the primary aim of strengthening its military and defence intelligence. However, research in machine translation came to a halt for about a decade after the 1966 report of the Automatic Language Processing Advisory Committee (ALPAC).
According to the ALPAC report, machine-translation output was costly and no faster than full human translation, because machine translation required post-editing. The report concluded that there was no advantage in using machine translation and suggested that funding should instead go to basic linguistic research to improve human translation. Due to recent industrial growth, interest in machine translation has revived: content now needs to be available in all regional languages worldwide [3]. The early years of research in this field were dedicated to rule-based systems; during the 1980s, with increasing computational power, there was a transition from rule-based systems to the statistical machine translation approach. The enormous increase in multilingual electronic text has ignited plenty of monolingual and cross-lingual information retrieval efforts. It is vital to share information with people for their development [3]. If MT researchers can develop a multilingual machine translation model, individuals speaking various languages can share their knowledge and ideas worldwide in their native language, and everyone in the world can access this information in their own language. The purpose of the translation process is that the meaning of the translated text should be the same as that of the original. One advantage of translation is the accessibility of information in people's native languages; due to technological limitations, it has not been possible to generate information in many languages of the world. The majority of research in the last few decades was dedicated to automatic natural language processing (NLP) for English, East Asian and European languages, while South Asian languages unfortunately received less attention [4]. Due to the scarcity of digital resources, machine translation is a challenging task for resource-poor languages.

Machine Translation Approaches
The machine translation approaches are classified as rule-based machine translation (RBMT), corpus-based, hybrid and knowledge-based approaches [5] [6]. The classification is shown in Figure 1.

Classification of Machine Translation approaches based on RBMT
The rule-based machine translation techniques can be further divided into three categories based on the Vauquois pyramid, as shown in Figure 2.

Direct Machine Translation
This machine translation (MT) approach operates at the lowest level of the machine translation pyramid given by Bernard Vauquois, as shown in Figure 2. It is one of the oldest methods; it works at the word level and uses a bilingual dictionary to directly map the source language to the target language [7]. This approach performs no structural or morphological analysis of the source language and hence does not give good results [8].
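As a minimal illustration of the word-level mapping described above, the sketch below translates word by word using a toy bilingual dictionary; the dictionary entries are hypothetical, and the output deliberately shows the word-order problem that this approach cannot fix.

```python
# A minimal sketch of direct (word-for-word) machine translation.
# The bilingual dictionary below is a hypothetical toy example, not a real resource.
bilingual_dict = {
    "i": "main",           # English -> romanised Urdu, illustrative entries only
    "eat": "khata hoon",
    "bread": "roti",
}

def direct_translate(sentence: str) -> str:
    """Map each source word to its dictionary entry; keep unknown words as-is."""
    return " ".join(bilingual_dict.get(word, word) for word in sentence.lower().split())

print(direct_translate("I eat bread"))  # -> "main khata hoon roti" (note the wrong word order)
```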

Transfer Approach
This machine translation approach operates at level 2 of the machine translation pyramid: the source language is first converted into an intermediate representation, which is then used to generate the target text with the help of a bilingual dictionary. The approach works in three phases: analysis, transfer and generation. The source-language text is first analysed using linguistic information and a source-language parser to form a syntactic representation. In the transfer stage, this source syntactic representation is converted into a target syntactic representation. In the final generation stage, the target-language text is produced, applying target-language morphology [5].
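The short sketch below illustrates the analysis-transfer-generation pipeline under simplified assumptions; the parse returned by analyse(), the toy lexicon, and the SVO-to-SOV reordering in generate() are hypothetical placeholders rather than parts of any real transfer system.

```python
# A skeletal sketch of the three transfer-approach phases; every helper here is a
# hypothetical placeholder, not part of any real toolkit.

def analyse(source_text: str) -> dict:
    """Parse the source sentence into a source-language syntactic representation."""
    # A real system would run a source-language parser; this returns a fixed toy parse.
    return {"subject": "boy", "verb": "eat", "object": "apple"}

def transfer(source_tree: dict) -> dict:
    """Map the source representation onto the target language using a toy bilingual lexicon."""
    lexicon = {"boy": "larka", "eat": "khana", "apple": "saib"}  # illustrative entries
    return {role: lexicon.get(word, word) for role, word in source_tree.items()}

def generate(target_tree: dict) -> str:
    """Produce target text in SOV order, the common order of languages such as Urdu."""
    return f"{target_tree['subject']} {target_tree['object']} {target_tree['verb']}"

print(generate(transfer(analyse("The boy eats an apple"))))
```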

Interlingua Approach
The Interlingua approach is similar to the transfer-based approach, but here an extensive syntactic, semantic and morphological analysis of the source language is performed. The text to be translated is converted into an intermediate form called a meta-language or interlingua, which is a language-neutral representation. The target language is then generated from this intermediate representation [9].

Knowledge-Based Machine Translation
Knowledge-based machine translation consists of a huge knowledge base containing parallel sentences and an inference engine. The problem with this approach is that it is difficult to represent the knowledge and to define its granularity.

Hybrid Approach
The hybrid approach combines two or more machine translation methods, such as SMT and RBMT, or RBMT and EBMT. The accuracy of the hybrid approach is reasonable compared to other methods, but it is costly in the initial stage.

Corpus-Based Machine Translation (CBMT)
CBMT is also known as the data-driven or empirical machine translation approach. It overcomes specific problems of the approaches based on the Vauquois pyramid: there is no need for syntactic, semantic and morphological analysis. However, a huge corpus is required for good-quality output. Corpus-based machine translation can be divided into three types, which are as follows:

Example-Based Machine Translation
Example-based machine translation is a subtype of corpus-based machine translation and does not require a dictionary or grammatical rules. It is based on a database of stored examples that have already been translated. When a new sentence is encountered, a best-matching algorithm is applied to these past translations to obtain the translation of the new sentence.
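A minimal sketch of the best-matching idea is given below, assuming a tiny hypothetical example base and using string similarity from Python's standard difflib module as the matching criterion; real EBMT systems use far more sophisticated matching and recombination.

```python
# Retrieve the stored translation whose source side is most similar to the new input.
# The example base is a toy, hypothetical one (romanised Urdu on the target side).
from difflib import SequenceMatcher

example_base = [
    ("how are you", "aap kaise hain"),
    ("where is the station", "station kahan hai"),
]

def best_match(new_sentence: str):
    """Return the stored (source, translation) pair most similar to the input."""
    return max(example_base,
               key=lambda pair: SequenceMatcher(None, new_sentence, pair[0]).ratio())

src, tgt = best_match("how are you today")
print(src, "->", tgt)   # the closest past translation is reused as a starting point
```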

Statistical Machine Translation
This is another approach that comes under corpus-based machine translation. It requires a huge bilingual corpus to train the system, and no rules or grammar are needed. The model learns mappings from the parallel corpus and then uses these learned mappings to translate new sentences. SMT consists of three main components: a language model, a translation model and a decoder [13]. Corpus size has a significant impact on translation quality: increasing the corpus size in SMT increases the BLEU score.
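Since BLEU is the evaluation metric referred to above, the snippet below shows how a sentence-level BLEU score can be computed with NLTK (assuming the nltk package is installed); the reference and hypothesis sentences are made-up examples.

```python
# Sentence-level BLEU with NLTK; in practice, corpus-level BLEU is reported.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "boy", "is", "eating", "an", "apple"]]   # human reference translation(s)
hypothesis = ["the", "boy", "eats", "an", "apple"]            # machine translation output

score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.2f}")  # higher is better
```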

Neural Machine Translation (NMT)
Neural machine translation is a promising approach that uses artificial neural networks and a substantial parallel corpus. Its strength is that it is based on end-to-end learning, using an encoder-decoder architecture. Figure 3 shows the evolution of machine translation approaches. From 2014 onwards, the two main approaches in use have been statistical machine translation and neural machine translation, and this paper elaborates on these two approaches only.

Statistical Machine Translation Approach (SMT)
This is the data-driven approach based on statistical models and on the noisy channel model of communication introduced by Shannon in 1948. SMT is based on Bayes' theorem and uses an enormous bilingual corpus to derive the rules and mappings between the source language and the target language. SMT is still a promising approach because of its several advantages: low cost, rapid prototyping, the use of human translations as its building blocks, and support for many languages that lack rich lexical resources. This approach produces the best results when large datasets are available.
Following Shannon's noisy channel model, consider a distorted message R (the foreign string f), a model of how the message is distorted (the translation model p(f|e)) and a model of which original messages are probable (the language model p(e)). The objective is to retrieve the original message S (the English string e), as shown in Figure 4 below: e* = argmax_e p(e|f) = argmax_e p(f|e) p(e). The denominator p(f) is dropped because it is the same for every candidate translation e, so the most likely translation is simply the one that maximises the product of the two terms.
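The toy sketch below makes the argmax concrete: each candidate translation e is scored by the product of a translation-model probability and a language-model probability, and the candidate with the highest product wins. All candidates and probability values are invented for illustration.

```python
# Toy noisy-channel decoding: pick the target sentence e that maximises p(f|e) * p(e).
candidates = {
    # candidate e            (p(f|e), p(e))
    "he eats bread":         (0.40,   0.30),
    "he eat bread":          (0.45,   0.05),   # good translation-model score, poor language-model score
    "bread eats him":        (0.10,   0.02),
}

best = max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])
print(best)  # "he eats bread" -- the product of the two models decides
```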

Neural Machine Translation (NMT)
Neural machine translation is a promising approach that maps source-language words to target-language words in an end-to-end fashion, addressing the drawbacks of classical machine translation approaches. The NMT architecture consists of two recurrent neural networks (RNNs): an encoder and a decoder. The encoder network takes the input and creates a fixed-length vector, whereas the decoder generates the translated output text from this encoded vector [16]. The architecture can be combined with an attention model to achieve excellent performance.
From a probabilistic viewpoint, translation is equivalent to finding a target sentence t that maximises the conditional probability of t given a source sentence s [17], i.e. argmax_t P(t|s). The encoder reads the source sentence s as a sequence of vectors (x_1, x_2, x_3, ...) and summarises it in a vector v. A standard RNN computes its output with the recurrence h_t = σ(W_x x_t + W_h h_{t-1}), where h_t, the hidden state at time t, is a non-linear function of the current input x_t multiplied by a weight matrix W_x, added to the previous hidden state h_{t-1} multiplied by W_h. RNNs generalise feed-forward neural networks by storing previous inputs and combining them with the current input: before making any decision, the network looks back at what has happened in the previous steps.
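A small NumPy sketch of this recurrence is shown below; the dimensions, random weights and input vectors are illustrative assumptions, not values from any trained model.

```python
# Minimal sketch of the recurrence h_t = tanh(W_x x_t + W_h h_{t-1}) used by an RNN encoder.
import numpy as np

emb_dim, hid_dim, seq_len = 4, 6, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hid_dim, emb_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hid_dim, hid_dim))   # hidden-to-hidden weights

x = rng.normal(size=(seq_len, emb_dim))     # embedded source words x_1 ... x_T
h = np.zeros(hid_dim)                       # initial hidden state
for t in range(seq_len):
    h = np.tanh(W_x @ x[t] + W_h @ h)       # combine current input with previous state

print(h)  # final hidden state: the fixed-length vector handed to the decoder
```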

Proposed Models of NMT
Several architectures have been proposed by researchers for NMT; some of them are mentioned in this paper. Bahdanau et al. proposed NMT by jointly learning to align and translate. This architecture belongs to the RNN encoder-decoder family, in which the source sentence is encoded into a fixed-length vector from which the decoder generates the target sentence. The problem with the basic encoder-decoder model is that it cannot handle long sentences, and the authors proposed a joint align-and-translate mechanism as a solution. However, this approach does not work well for languages whose script is written in a complex fashion, as it is difficult to extract the individual units of the language. The model was tested on English and French, both Subject-Verb-Object (SVO) languages, and has not been evaluated on languages with different word orders, which remain challenging for machine translation.
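The sketch below illustrates the attention idea at the heart of this model: score each encoder annotation against the decoder state, normalise the scores with a softmax, and form a weighted context vector. A simple dot-product scorer is used here for brevity, whereas the original paper uses a small feed-forward alignment network, and all values are random placeholders.

```python
# Toy attention step: alignment scores -> softmax weights -> context vector.
import numpy as np

rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(5, 6))   # one 6-dim annotation per source word
decoder_state = rng.normal(size=6)         # previous decoder hidden state s_{i-1}

scores = encoder_states @ decoder_state                  # alignment scores e_ij
weights = np.exp(scores) / np.exp(scores).sum()          # softmax -> attention weights
context = weights @ encoder_states                       # context vector c_i

print(weights.round(2), context.round(2))
```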

GNMT (Google's Neural Machine Translation):
Google developed this architecture in 2017 to bridge the gap between human translation and machine translation. It consists of three components: an encoder, a decoder and an attention network. The encoder and decoder each use 8 layers of LSTM (Long Short-Term Memory) RNN units, which address the vanishing gradient problem in RNNs. The attention network was added to the encoder-decoder model to increase performance, and the architecture is claimed to reduce translation errors by about 60% compared to the earlier phrase-based system. The model uses sequence-to-sequence learning, works at the sub-word level and focused on languages such as French, Spanish and Chinese.
Hierarchy-to-sequence attention NMT model: This model was proposed by Jinsong Su et al. in 2018. The source sentence is segmented into a sequence of short clauses, which are translated sequentially. The bottom-level RNN operates at the word level: the sentence S is divided into clauses c1, c2, c3, ..., cn, where each clause contains a sequence of words and a special clause token is placed at the end of each clause to mark its boundary. On the decoder side, two attention networks predict the next word based on the given context and the words generated previously. The clause length is chosen arbitrarily, and no mechanism is employed to detect the optimal clause length. This model was evaluated on Chinese-English and English-German translation.

Need for Machine Translation
The Internet World Stats report shows that the amount of content available on the internet varies across languages and that the most dominant language on the internet is English [20]. In view of this, there is a dire need for machine translation systems to make web content available to everyone in their native language. Figure 5 below shows the top ten languages on the internet, in millions of users.

Figure 5. Languages used on the web [18]

Machine translation frameworks are expected to translate creative works from any language into the local language. Such systems can break the language barrier by quickly making work accessible to the masses across the globe. Numerous web pages contain information of interest in a foreign language, and machine translation helps us understand the content of those pages. Machine translation can also help commercial product manufacturers prepare product manuals in many languages for use in different countries [13]. With the growth of the internet, millions of users worldwide can obtain information in their native language with the help of machine translation. In modern society, machine translation has a growing need and importance in economics, business and industrialisation. The social and political urgency of machine translation rises in societies where more than one language is spoken [21]. In health care, machine translation plays a crucial role in improving access to multilingual health materials; several machine translation systems are available, but their performance is not yet adequate in the public health domain [22]. During the last decade, machine translation technology has improved, and it is currently used by language service providers, several companies and government departments.
Machine translation provides an economical means of translating an enormous corpus from one language to another with limited post-editing, and it translates a vast amount of text in far less time than a human translator.

Challenges in Machine Translation
Machine translation is a difficult and challenging problem. The main difficulty is handling the different ambiguities present in the source and target languages. These ambiguities are either naturally present in sentences or arise from the inability to form grammatical sentences. Natural languages differ in their features: if one language represents a concept in one way, another language may represent it differently. Some of the issues that cause problems in machine translation are described below.

Lexical ambiguity: Ideally, each word in a language should have a unique meaning or sense; however, in natural languages many words have multiple interpretations, which makes a sentence unclear or vague. Lexical ambiguity can be of two types: (i) a word belongs to more than one lexical category (noun, verb, adjective, etc.), and (ii) a word has more than one meaning within the same lexical category. The first type can be resolved by syntactic analysis.

Word order issues: Word order is challenging for machine translation. Consider English and Urdu: English follows Subject-Verb-Object (SVO) order, whereas Urdu is a free word order language whose most common order is Subject-Object-Verb (SOV). An example of English and Urdu word order is shown in Figure 6, which also shows that English script is read and written from left to right, whereas Urdu script is read and written from right to left.

Parallel corpus: A parallel corpus is an essential resource for SMT and NMT, and an enormous amount of parallel data is necessary for these two approaches. The availability of parallel corpora for low-resource languages is therefore a major challenge.
Sentence alignment: Sentence alignment is also an essential step in corpus preparation. Various sentence alignment algorithms and tools are available in the literature, but they do not offer adequate support for resource-poor languages.
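As a rough illustration of length-based alignment (in the spirit of the well-known Gale-Church approach, but greatly simplified), the sketch below keeps only sentence pairs whose character-length ratio is plausible; real aligners use dynamic programming and handle one-to-two and two-to-one alignments.

```python
# Highly simplified, length-based sentence alignment; illustrative only.

def rough_align(src_sentences, tgt_sentences, max_ratio=1.6):
    pairs = []
    for s, t in zip(src_sentences, tgt_sentences):      # assumes equal sentence counts
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        if ratio <= max_ratio:                          # keep only plausible pairs
            pairs.append((s, t))
    return pairs

src = ["Hello.", "How are you today?"]
tgt = ["Assalam o alaikum.", "Aap aaj kaise hain?"]
print(rough_align(src, tgt))
```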
Morphological variation: Some low-resource languages are morphologically rich, and a single word can be inflected in many ways. The translation system should handle all inflected forms, and covering these forms in the training data is challenging.

Machine Translation for Low Resource Languages
Various machine translation systems are available in the literature. In this paper, we focus on the machine translation systems available for low-resource languages and their findings. The research work carried out on low-resource languages mostly uses direct and rule-based approaches because of the non-availability of the massive parallel data needed to build SMT or NMT systems. In India, 22 languages are given official status and encouragement (8th Schedule of the Constitution). The top 15 languages spoken in India are shown in Figure 8 below; it is clear from the figure that Hindi is the most dominant language in India.

Figure 8. Top 15 languages spoken in India
Table 3 below summarises various machine translation systems for low-resource languages, the approach used, the purpose of each system and its findings. One such system consists of three dictionaries built from a parallel corpus of words, phrases and sentences (a word dictionary, a phrase dictionary and a sentence dictionary); its example base consists of 75,000 sentences that were manually translated into three languages.

Year | System | Approach | Domain | Findings
2006 | English to Bangla [29] | Example-based approach | General | The proposed method uses shallow analysis to identify input phrases and obtains target phrases using EBMT.
2007 | Punjabi to Hindi MTS | Direct word-to-word translation approach | | The system uses pre-processing modules and performs morphological analysis of the source language; it also performs transliteration. Claimed accuracy of 92.8%.
2009 | English-Kannada machine-aided translation system by University of Hyderabad [26] | Transfer-based approach | Government | The system uses Universal Clause Structure Grammar. Its results are not good, with a BLEU score of 0.21, and translation quality decreases as sentence length increases.

Comparison of Machine Translation Approaches
We compared the machine translation approaches on some basic parameters, as described in Table 4 below.

Discussion
In this research paper, several machine translation approaches were discussed. Most translation systems developed so far for resource-poor languages use the classical approaches, and little focus has been placed on the promising approaches, SMT and NMT. Researchers do not apply SMT and NMT because of the unavailability of the enormous parallel corpora required for resource-poor languages. One solution is to develop datasets using crowdsourcing: a Google Form can be created asking each respondent who knows the language to enter three sentences for each domain, such as health care, day-to-day life, tourism and business. The form can be circulated to engineering colleges and universities in states where the language is official and where students have e-mail IDs and are familiar with Google Forms. A second method is to create groups on social networking sites for data collection. Another technique is to obtain data from news APIs and web scraping, which we have done: we collected 2,000 sentences and translated them into Urdu using existing translation tools and language experts. We will soon place these parallel sentences in the public domain.
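As a hedged illustration of the web-scraping step, the sketch below collects paragraph text from a news page using the requests and BeautifulSoup libraries (both must be installed); the URL and the choice of paragraph tags are placeholders, not the actual sources or code used in this work.

```python
# Collect paragraph text from a web page for corpus building.
import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url: str) -> list[str]:
    """Download a page and return the text of its paragraph tags."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

sentences = scrape_paragraphs("https://example.com/news-article")  # placeholder URL
print(len(sentences), "paragraphs collected")
```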

Conclusion
In this paper, we reviewed various machine translation systems. We found that promising approaches such as neural machine translation are not applied to resource-poor languages because of the unavailability of enormous parallel corpora. We also found that some NMT techniques, such as character-level sequence-to-sequence models that work for languages like Spanish and German, cannot be applied directly to some resource-poor languages because of their complex scripts, and long sentences also create problems in word rearrangement. This paper discussed machine translation approaches and their evolution, the need for and challenges of machine translation, and the problems faced by researchers working on resource-poor languages. MT strives to bridge the language barrier, and a lot of work has been carried out on European languages; however, Asian languages have received less attention. Hence, to close this gap, we should apply the promising machine translation approaches to these Asian languages and create the required language resources.