Improvement in Efficiency of The State-Of-The-Art Handwritten Text Recognition Models

In the past few years, research in Handwritten Text Recognition (HTR) has accelerated as many computer vision researchers have turned to the field. Most deep learning models are prone to the vanishing gradient problem when processing paragraph images such as scanned pages. The most crucial problem with these models is that they have many parameters, which require a large amount of data and computational resources. Most recent offline HTR systems follow the Convolutional Recurrent Neural Network (CRNN) architecture. To obtain better results, we use a recently developed architecture, the Gated Convolutional Neural Network (Gated-CNN), which has fewer layers and parameters. An HTR system based on Gated-CNN can outperform a CRNN-based one. The proposed model surpasses state-of-the-art HTR systems on five handwritten datasets: Bentham, IAM, RIMES, Saint Gall, and Washington. It needs few computational resources, which makes it suitable for real-life use on devices such as smartphones and robots.


1. Introduction
Handwritten text recognition (HTR) has a broad range of applications in both the academic and industrial sectors. HTR converts handwritten text to numeric codes (ASCII or Unicode) from either a static or a dynamic mode of information [1]. In offline text recognition the information is an image, which makes it possible to digitize manuscripts [2], medical records [3], applications [4], and many more. These applications drive the development of HTR for different languages and scripts.

Offline HTR was initially framed as sequence matching: features extracted from the input image are arranged as a sequence and matched against an output sequence that maps to a combination of characters. Initially, the most successful approach was the Hidden Markov Model (HMM) [5]. However, the model could not use context information, because the Markov assumption makes each observation depend only on its current state. In the past few years, research in HTR has advanced well beyond HMMs. Deep learning methods such as Convolutional Recurrent Neural Networks (CRNN) have repeatedly improved and have given practical results in industrial applications [6]. Long Short-Term Memory (LSTM) plays the role of sequence decoder in the CRNN model [7]. To increase the accuracy of HTR, the Multidimensional LSTM (MDLSTM) [8] extends the recurrent architecture to multidimensional data. Because MDLSTM has high complexity and computational cost, more recent studies use the Bidirectional LSTM (BLSTM) [9], which offers results comparable to MDLSTM at lower computational cost and complexity. Models using BLSTM, such as CNN-BLSTM, provide excellent results but have difficulty retaining long contexts because of the vanishing gradient problem. Moreover, current optical models have a very large number of parameters, which require a lot of training data. This is a considerable problem for real-world applications [10].

To reduce the number of parameters, the Gated-CNN-BLSTM method has been used, but at some cost to the model's performance [11]. To increase the accuracy of offline HTR systems, we use a Gated Convolutional Recurrent Neural Network (Gated-CRNN) architecture built on the gated mechanism introduced by Dauphin [12]. The model also has a bidirectional gated recurrent unit (BGRU). The proposed optical model, Gated-CNN-BGRU, requires fewer parameters (in the thousands) to achieve a high accuracy rate. The proposed model is trained and tested on five well-known datasets, namely Bentham, IAM, RIMES, Saint Gall, and Washington [15]. The results of the proposed model are then compared with the work of Puigcerver [13], Bluche [14], and Flor [15].

Materials and Methods
The well-known datasets Bentham, IAM, RIMES, Saint Gall, and Washington [15] were used to compare the results of the proposed model with the well-known models of Puigcerver [13], Bluche [14], and Flor [15].

2.1. Proposed Model
The model proposed in this paper is derived from the well-known models of Puigcerver, Bluche, and Flor. The model uses a gated mechanism to reduce the future context.

Compared to the Puigcerver model [13], the proposed model has fewer trainable parameters and higher efficiency. The trainable parameters are reduced because the gated mechanism introduced by Dauphin is used. Max pooling is used to counter overfitting; in the proposed model it is applied at the end of the convolutional block rather than after each convolutional layer as in the Puigcerver model, because this decreases the parameters while giving equivalent results. He uniform is used as the initializer rather than Glorot uniform, which improves the initial weight distribution. The parametric rectified linear unit (PReLU) is used as the activation rather than the leaky rectified linear unit, so that accuracy benefits from a trainable parameter rather than a fixed function. Batch renormalization is used rather than batch normalization to ensure that all layers are trained on the internal representations that are used during inference. A bidirectional gated recurrent unit (BGRU) is used instead of a bidirectional long short-term memory unit because the GRU has no separate memory unit (it exposes its full hidden content without control).

Compared to the Bluche model [14], the proposed model is also more efficient. The gated mechanism is changed from Bluche's to Dauphin's, which decreases the trainable parameters. The Bluche gated mechanism is a pointwise product of the original features (X) and the sigmoid activation (S) of those features:

Y = S(X) ⊙ X

The gated mechanism of the proposed model is similar to Bluche's, with a minor difference in the formula. The original features are divided in half; the sigmoid function is applied to the first half (H1), and a pointwise product is then taken between the sigmoid output (S) and the second half (H2) of the original features:

Y = S(H1) ⊙ H2

As in the comparison with Puigcerver, max pooling is applied at the end of the convolutional block to counter overfitting, and He uniform is used as the initializer rather than Glorot uniform, which improves the initial weight distribution. PReLU is used as the activation rather than the hyperbolic tangent, so that accuracy benefits from a trainable parameter rather than a fixed function. Batch renormalization is used to ensure that all layers are trained on the internal representations that are used during inference. A BGRU is used instead of a bidirectional long short-term memory unit because the GRU has no separate memory unit (it exposes its full hidden content without control).

Compared to the Flor model [15], the proposed model is more efficient as well. It uses the same gated mechanism introduced by Dauphin. Max pooling is again applied at the end of the convolutional block, and He uniform is used as the initializer rather than Glorot uniform. The number of convolutional layers is increased, with different numbers of filters, to improve efficiency by detecting important features.

The proposed model is thus inspired by all three models (Puigcerver, Bluche, and Flor). It uses a gated mechanism and an architecture similar to Flor's, with minor changes to increase accuracy with fewer parameters (approx. 830,000) [15]. Fig. 1 depicts the proposed architecture, which includes 7 convolutional layers, 6 gated convolutional layers, and 2 BGRU layers. Table 1 shows the text line image partitioning for all the datasets used.
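
The difference between the two gating formulas can be illustrated with a small sketch. It is not the paper's implementation: it treats the features as a flat list of channel values at a single position, and the function names `bluche_gate` and `dauphin_gate` are hypothetical labels chosen here for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function S(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def bluche_gate(features: list[float]) -> list[float]:
    """Y = S(X) * X: each channel gates itself, channel count is unchanged."""
    return [sigmoid(x) * x for x in features]


def dauphin_gate(features: list[float]) -> list[float]:
    """Y = S(H1) * H2: the first half gates the second half (as described above)."""
    if len(features) % 2 != 0:
        raise ValueError("channel count must be even to split into H1 and H2")
    half = len(features) // 2
    h1, h2 = features[:half], features[half:]
    return [sigmoid(a) * b for a, b in zip(h1, h2)]


# Illustrative channel values (assumed, not taken from the paper).
channels = [0.0, 2.0, -1.0, 4.0]
bluche_out = bluche_gate(channels)    # 4 values in, 4 values out
dauphin_out = dauphin_gate(channels)  # 4 values in, 2 values out
print(len(bluche_out), len(dauphin_out))
```

With the split form, the gated output has half as many channels as the input, so whatever layer follows it receives a smaller input.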

2.2.1. Bentham
The dataset consists of manuscripts written by the English philosopher Jeremy Bentham. It is a collection of historical manuscripts in the form of grayscale images with dark backgrounds and noisy text. The dataset has about 11,500 lines of text. The partitions comprise 9195 images for training, 1415 images for validation, and 860 images for testing. The main challenge with this dataset is the large number of punctuation marks in the text lines.

Fig. 2: Sample of Bentham Database

2.2.2. IAM
The dataset was prepared by the Institut für Informatik und Angewandte Mathematik (IAM, Department of Computer Science and Applied Mathematics). It comprises 1539 grayscale scanned pages of handwritten English, with about 9000 lines of text written by 657 writers. The dataset was prepared so that HTR systems are independent of the writer's handwriting, so all lines written by a single writer belong to a single subset. The partitions comprise 6161 images for training, 900 images for validation, and 1861 images for testing. The main challenge with this dataset is that it has many writers, and some of the images contain cursive handwriting that is very hard to recognize.

Fig. 3: Sample of IAM Database

2.2.3. RIMES
The dataset was compiled by Reconnaissance et Indexation de données Manuscrites et de fac similÉS (RIMES, Recognition and Indexing of Handwritten Documents and Faxes). It comprises 12,000 handwritten lines taken from 5600 mails written in French. The text line images have more readable writing and a clear background. The dataset was prepared so that HTR systems are independent of the writer's handwriting, so all text lines written by a single writer belong to a single subset. The partitions comprise 6161 images for training, 900 images for validation, and 1861 images for testing. The main challenge with this dataset is that it contains many words from local dialects.

Fig. 4: Sample of RIMES Database

2.2.4. Saint Gall
The dataset was written in the 9th century by a Latin writer. It is a collection of historical manuscripts written in Latin. The dataset has about 6,000 unique words and 48 unique characters, in about 1,410 lines of text. The partitions comprise 468 images for training, 235 images for validation, and 707 images for testing. The advantage of this dataset is that the text line images are already normalized and binarized. The main challenge is that it has very little data, which may result in overfitting.

Fig. 5: Sample of Saint Gall Database

2.2.5. Washington
The dataset was built from English papers written by George Washington in the 18th century. It comprises historical manuscripts written by two writers and has much less data than Saint Gall. The dataset has about 1,189 unique words and 68 unique characters, in about 656 lines of text. The partitions comprise 325 images for training, 168 images for validation, and 163 images for testing. The advantage of this dataset is that the text line images are already normalized and binarized. The main challenge is that it has very little data, which may result in overfitting.

2.3. Experimental Setup
Puigcerver's model used images of whole paragraphs, with each case having its own hyperparameters. Bluche's model used a large private training set of 132,000 images. Flor's model used text line images. To have a fair comparison between the models and their statistical results, we use the same workflow and hyperparameters for all datasets and models. This idea is inspired by the work of [10].

The experimental setup starts with training the optical models with CTC functions to reduce the loss value. The RMSprop optimizer [17] is used at each step with a mini-batch of 16 images and a learning rate of 0.001. After 15 epochs with no improvement in the loss value, Reduce Learning Rate on Plateau is applied with a factor of 0.2, and after 20 epochs with no improvement, Early Stopping is applied. Word Beam Search [18] is used in this paper as the CTC decoding function. For encoding and decoding, a charset of 150 is used, which consists of all the useful characters from the ASCII table.

The images must be normalized so that the model can interpret them better. Normalization takes place in four parts:
(i) Illumination compensation [19] to balance brightness and contrast.
(ii) Deslanting [20] of the cursive writing images.
(iii) Resizing and padding all images to 1024x128x1.
(iv) Data augmentation, such as displacement transformation and morphological scaling, for all input images, done in three parts:
(a) Rotation and scaling of the image by 3° and 5%, respectively.
(b) Height and width shifting by 5% each.
(c) Erosion and dilation with kernels up to 5x5 and 3x3, respectively.

To enhance the results, an N-gram statistical character language model is used, built with the freely available SRILM Toolkit [21]. The language model uses text rather than images, so it is easily trainable. The project files are run on Google Colaboratory, another free online platform, using a GPU for stronger computational power.
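
The interaction between the two patience values in the training schedule can be sketched as follows. This is a simplified, framework-free simulation rather than the callbacks actually used in the experiments: the function name `simulate_schedule` and its parameter names are hypothetical, and only the numeric defaults (learning rate 0.001, factor 0.2, patience of 15 and 20 epochs) come from the setup described above.

```python
def simulate_schedule(val_losses, lr=0.001, factor=0.2,
                      lr_patience=15, stop_patience=20):
    """Return (learning rate per epoch, stopping epoch or None).

    A loss counts as an improvement only if it is strictly lower
    than the best loss seen so far (an assumption of this sketch).
    """
    best = float("inf")
    since_best = 0       # non-improving epochs since the best loss
    since_lr_cut = 0     # non-improving epochs since the last LR cut
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            since_best = 0
            since_lr_cut = 0
        else:
            since_best += 1
            since_lr_cut += 1
        if since_lr_cut >= lr_patience:
            lr *= factor          # reduce learning rate on plateau
            since_lr_cut = 0
        history.append(lr)
        if since_best >= stop_patience:
            return history, epoch  # early stopping
    return history, None


# A loss that never improves after the first epoch (illustrative input).
flat_history, flat_stop = simulate_schedule([1.0] * 30)
print(flat_stop, flat_history[-1])
```

In this simulation a completely flat loss gets one learning-rate reduction before training stops, because the stopping patience is only five epochs longer than the reduction patience.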

2.4. Exploratory evaluation
The experimental evaluation of the models uses two metrics: Character Error Rate (CER) and Word Error Rate (WER). Both are calculated using the Levenshtein distance [22] between predictions and ground truth.
To declare that the proposed model has a lower error rate, the p-value must be less than the significance level alpha, i.e., 0.05.
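
As a minimal sketch of how the two metrics can be computed from the Levenshtein distance, the following uses only the Python standard library. Dividing by the length of the ground truth is a common convention and an assumption here, since the text does not state the normalization; the function names are illustrative.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth: str, prediction: str) -> float:
    """Character Error Rate: character edits / ground-truth characters."""
    return levenshtein(ground_truth, prediction) / max(len(ground_truth), 1)


def wer(ground_truth: str, prediction: str) -> float:
    """Word Error Rate: word edits / ground-truth words."""
    truth_words = ground_truth.split()
    return levenshtein(truth_words, prediction.split()) / max(len(truth_words), 1)


print(cer("kitten", "sitting"))          # 3 edits over 6 characters
print(wer("the cat sat", "the cat sit"))  # 1 edit over 3 words
```

A single wrong character changes the CER only slightly but marks the whole word as wrong for the WER, which is why WER values are typically higher than CER values for the same predictions.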

Results
The models were applied to the well-known datasets Bentham, IAM, RIMES, Saint Gall, and Washington to obtain better results than the previous models of Puigcerver, Bluche, and Flor. For the lower CER and WER obtained relative to those models, the proposed model has a p-value lower than 0.01. The p-values of all the models are significantly low and are given in Table 2.

Fig. 1: Proposed Architecture

2.2. Datasets
All the datasets are partitioned into three subsets: training, validation, and testing. Each of the datasets (Bentham, IAM, RIMES, Saint Gall, and Washington) has its own partitioning methodology. Table 1 shows the text line image partitioning for all the datasets used.

Table 1: Description of Datasets

Fig. 6: Sample of Washington Database

Fig. 7: CER and WER comparison for all test partitions

The improvements of the proposed model compared to previous models have three causes: (i) the latest deep learning techniques and toolkits, (ii) the gated mechanism in the convolutional block, and (iii) the bidirectional gated recurrent units in the recurrent block. The results are clear that the performance