Artificial Neural Network Based Amharic Language Speaker Recognition

In the current era of artificial intelligence, speaker recognition is among the most useful biometric recognition techniques. Security needs careful attention because more and more activities are becoming automated and internet based. For security purposes, unique features of the authorized user are needed, and voice is one such unique biometric feature. Developing speaker recognition based on scientific research is therefore an important concern. Criminal activities are also increasing day by day and are carried out in increasingly clever ways, so every country should strengthen its forensic investigation using such technologies. This study was motivated by the wish to contextualize this concept for our country. In this study, a text-independent Amharic language speaker recognition model was developed using Mel-Frequency Cepstral Coefficients (MFCC) to extract features from preprocessed speech signals and an Artificial Neural Network (ANN) to model the resulting feature vectors and to classify them during testing. The researcher used 20 speech samples from each of 10 speakers (200 speech samples in total) for training and testing separately. By setting the number of hidden neurons to 15, 20, and 25, three different models were developed and evaluated for accuracy. MATLAB, a fourth-generation high-level programming language and interactive environment, was used for all implementations in the study. Very promising findings were obtained: the study achieved better performance than related research that used Vector Quantization and Gaussian Mixture Model modelling techniques. A practically implementable result could be obtained in the future by increasing the number of speakers and speech samples and by including the four Amharic accents.

From a security perspective, identification is different from verification. For example, presenting your passport at border control is a verification process: the agent compares your face to the picture in the document. Conversely, a police officer comparing a sketch of an assailant against a database of previously documented criminals to find the closest match(es) is an identification process.
Speaker recognition systems fall into two categories: text-dependent and text-independent. Text-dependent systems use the same text for enrollment and testing, whereas text-independent systems use different text for enrollment and testing. This study deals with text-independent speaker identification for Amharic language speech.
At the highest level, all speaker recognition systems contain two main modules: feature extraction, and feature modeling and matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature modeling and matching models the extracted features and performs the actual identification of the unknown speaker by comparing the features extracted from the input voice with those from a set of known speakers.
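As an illustration of the feature extraction module, the sketch below shows the mel-scale conversion that underlies MFCC features. It is written in Python for illustration only (the study itself was implemented in MATLAB). The conversion formula is the commonly used 2595·log10 form; the filter count (26) and sample rate (16 kHz) are assumptions chosen for the example, not values reported by the study.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters: int, sample_rate_hz: int) -> List[float]:
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    max_mel = hz_to_mel(sample_rate_hz / 2.0)  # up to the Nyquist frequency
    step = max_mel / (num_filters + 1)
    return [mel_to_hz(step * (i + 1)) for i in range(num_filters)]


# Assumed values for illustration only: 26 filters, 16 kHz audio.
centers = mel_center_frequencies(num_filters=26, sample_rate_hz=16000)
print([round(c) for c in centers[:5]])
```

Because the filters are evenly spaced in mel rather than in Hz, they are denser at low frequencies, which is where much of the speaker-discriminating detail of speech lies.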
However, the speech-based systems developed so far each serve a specific language and are largely limited to a few technologically advanced countries. Developing countries like Ethiopia need to follow the lead of those countries in such technological advancements so as not to lose the opportunities that the technologies provide. For this reason, speech engineers and language experts in various countries are making noticeable efforts to develop recognition systems that work for their own languages. In our country, although it is not enough, 40 speech recognition and 3 speaker recognition studies have been attempted. This shows a shortage of research on speaker recognition. The three speaker recognition studies used Vector Quantization and Gaussian Mixture Model as modeling techniques. The goal of this study is to explore the possibility of using a state-of-the-art modeling technique for building Amharic language speaker recognition.

Literature Review
S. A. Mahmood and L. E. George, 2007, investigated a neural-network-based speaker recognition system. LPC was used as the feature extraction method, and a back propagation neural network was used for speaker modeling and identification. The system achieved 90% accuracy for 10 speakers [4].
M. S. Sinith et al., 2010, focused on a text-independent speaker identification system using Mel-Frequency Cepstral Coefficients (MFCC) as the speech feature parameters and Gaussian Mixture Modeling (GMM) for modeling the extracted speech features. The Maximum Likelihood Ratio Detector algorithm was used for the decision-making process. The experimental study was conducted in MATLAB 7. The Gaussian mixture speaker model attains a high recognition rate for various speech durations. The recognition rate is highest (98.8%) when the speech is 60 seconds long and the number of Gaussians is 16 [5].
Amr Rashed, 2014, proposed a fast algorithm for speaker recognition. The algorithm first records voice patterns of speakers via a noisy channel and applies noise removal techniques. Features are extracted using Mel Frequency Cepstral Coefficients (MFCC) and reduced using Principal Component Analysis (PCA). The resulting vector is then fed to an ANN classifier. Experimental results indicate that the ANN with the weight/bias training algorithm performs better. The proposed algorithm achieved about 99% accuracy on average and a higher speed than other methods [7].
A. Azene, 2015, made the first attempt at Amharic language speaker recognition research, presenting a text-independent speaker identification system for the Amharic language. Speech signals were collected from different speakers, including both sexes and different age groups. MFCC was used to extract features from the speech signals and to generate feature vectors. VQ and GMMs were used for training and identification. The researcher's intention was to see which modeling approach is better for text-independent speaker identification. For a total of 50 speakers, 74.2% accuracy was achieved with the VQ approach, whereas 84.3% accuracy was achieved with GMMs. The researcher also examined speaker identification accuracy by gender, considering 25 male and 25 female speakers. In that experiment, 86.2% and 85.9% accuracy were achieved for male and female speakers respectively [5].
A. D. Mengistu, 2017, presented an automatic text-independent speaker identification system for the Amharic language in noisy environments. Speech signals were collected from 100 different speakers of both genders, with 10 seconds of speech from each individual. A combination of MFCC, LPCC, and GFCC was used for feature extraction, and VQ and GMMs were used for training and identification. The researcher aimed to see which modeling approach is better for text-independent speaker identification when combined with the three feature extraction techniques (MFCC, LPCC, and GFCC). Two experiments were conducted. The first used the VQ modeling approach with the combination of the three feature extraction techniques for 30, 60, and 90 speakers and achieved 77.2%, 70.9%, and 69% accuracy respectively. The second used the GMM modeling approach with the same feature combination for 30, 60, and 90 speakers and achieved 75.2%, 76.9%, and 78% accuracy respectively [6].
M. Islam, F. Khan, and A. M. Haque, 2013, presented the implementation of a text-independent speaker identification system. Feature extraction was done using the Mel Frequency Cepstral Coefficients (MFCC) algorithm, which extracts features from the speech signal in the form of coefficient vectors. The artificial neural network, using the backpropagation algorithm, stores the extracted features in a database and then identifies the speaker based on that information. The system achieved nearly 100% accuracy for static speech signals and above 90% accuracy for real-time speech signals [6].
D. Mengistu D. Melesew, 2017, presented a hybrid approach of VQ and GMM for classifying dialects of the Amharic language. For speech signal collection, a total of 100 speakers from each dialect group (Gojjam, Wollo, Shewa, and Gonder) were considered. MFCC feature vectors were used to recognize the dialects of speakers. When 25 speakers were considered from the areas, 85.9% accuracy was achieved. When the number of speakers was increased to 100, the maximum number of dialect speakers in the experiment, 92.7% accuracy was achieved [7].
A. Antony and R. Gopikakumari, 2018, introduced an isolated-word speaker identification system based on a new feature extractor and ANN. The system is designed for both text-independent and text-dependent speaker identification for English words. The speech is recorded using an audio wave recorder, and preprocessing is then applied to the speech signals. UMRT is a transform that has been used for image compression. Combinations of MFCC and UMRT are used as the feature extractor. The features are classified using a multi-layer perceptron with the back propagation algorithm, and accuracy is measured using a confusion matrix. The accuracy achieved is around 97.91% for text-dependent systems and around 94.44% for text-independent systems [8].

Methodology
The methodological framework of this study involves speech collection, preprocessing of speech signals, feature extraction from the preprocessed data, modeling of the extracted features, feature matching during identification, and performance evaluation. All activities were carried out using appropriate techniques and tools. Figure 1 shows the overall detailed framework of the methodology.

Proposed Speaker Recognition Prototype
To keep interaction simple during demonstration, we developed a graphical user interface (GUI) prototype, and all processes in this study are conducted through this GUI speaker recognition prototype. The input is a preprocessed speech signal, which always passes through the feature extraction technique (MFCC). After feature extraction, there are two paths: training and testing. During training, the feature vector of the speech signal is created and stored in the database. Then, the feedforward neural network is trained on all feature vectors stored in the database, and a "Network.mat" file containing the trained neural network is created. During testing, the feature vector of the input speech is compared with the feature vectors of the speech signals represented in the trained neural network (in Network.mat). Finally, based on this comparison, the neural network makes a decision to identify the speaker.
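The final decision step of the testing path can be sketched as follows. The text does not state how per-frame network outputs are combined into a single decision for a test speech, so the majority vote used here is an assumption made for illustration, as are the speaker codes "BB" and "CC" and the example scores. The sketch is in Python, whereas the study's prototype was built in MATLAB.

```python
from collections import Counter
from typing import List, Sequence


def predict_frame(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring class for one frame."""
    return max(range(len(scores)), key=lambda i: scores[i])


def identify_speaker(frame_scores: List[Sequence[float]],
                     speaker_codes: List[str]) -> str:
    """Majority vote over per-frame predictions (an assumed decision rule)."""
    votes = Counter(predict_frame(scores) for scores in frame_scores)
    winner, _ = votes.most_common(1)[0]
    return speaker_codes[winner]


# Hypothetical network outputs for four frames of one test speech.
codes = ["AA", "BB", "CC"]
outputs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.8, 0.1, 0.1],
]
print(identify_speaker(outputs, codes))  # three of the four frames vote for "AA"
```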

Feature Extraction
Since the prototype is GUI based, the preprocessed input speech is loaded into the system one file at a time for feature extraction, and the class ID is given when the input speech is loaded. 200 speech samples from 10 speakers were loaded into the system for feature extraction and stored in the database (speech_feature_vector_database). The code given to each speaker is formed from the two capital initial letters of the speaker's first and last names. Each speaker has 20 speeches, and the wav files are named using the speaker code followed by a number from 00 to 19 (i.e. AA_00.wav, AA_01.wav, AA_02.wav, ..., AA_19.wav).
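The file naming convention described above can be expressed as a short Python sketch. The helper names `speech_filenames` and `parse_filename` are hypothetical and are not part of the study's prototype; only the `AA_00.wav` to `AA_19.wav` pattern comes from the text.

```python
from typing import List, Tuple


def speech_filenames(speaker_code: str, count: int = 20) -> List[str]:
    """Build the wav file names for one speaker, e.g. AA_00.wav ... AA_19.wav."""
    return [f"{speaker_code}_{i:02d}.wav" for i in range(count)]


def parse_filename(filename: str) -> Tuple[str, int]:
    """Recover the speaker code (class label) and sample index from a file name."""
    stem = filename.rsplit(".", 1)[0]   # drop the ".wav" extension
    code, index = stem.split("_")
    return code, int(index)


files = speech_filenames("AA")
print(files[0], files[-1], len(files))
print(parse_filename("AA_07.wav"))
```

Deriving the class label from the file name in this way would keep the label and the audio file from drifting apart when samples are loaded one at a time.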

Training of Neural Network
During the feature extraction stage, the speech signals are transformed into feature vectors in a form suitable for training the neural network. The next step is training and testing the neural network. As mentioned in chapter 3, three experiments were conducted in this study. After the training parameters are set, the set of inputs with their respective target outputs is fed to the neural network sequentially.
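For a 10-speaker classifier, the target output paired with each input vector is commonly a one-hot vector with a 1 in the position of the true speaker. The Python sketch below illustrates this encoding; it assumes one-hot targets, which the text does not state explicitly, and every speaker code other than the "AA" pattern is hypothetical.

```python
from typing import List


def one_hot_targets(labels: List[str], classes: List[str]) -> List[List[int]]:
    """Return one target row per label, with a 1 in the column of its class."""
    column = {code: i for i, code in enumerate(classes)}
    targets = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1
        targets.append(row)
    return targets


# Hypothetical speaker codes; the study's actual codes are not listed in the text.
classes = ["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II", "JJ"]
frame_labels = ["AA", "AA", "CC", "JJ"]   # class label of each feature vector
targets = one_hot_targets(frame_labels, classes)
for label, row in zip(frame_labels, targets):
    print(label, row)
```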

Performance Evaluation
Here, the evaluation has been performed in two ways:
- Evaluation of the trained neural network model based on the confusion matrix.
- Real-time test using speeches that are prepared for testing.
Based on all confusion matrices, the results are summarized below.

Experiment One: [Hidden Neuron=15]
Table 4 shows the classification of the sound sample frames in terms of the four confusion matrix parameters. For example, Class 1 has a total of 2730 sound sample frames stored in the sound database, extracted from the speaker's voice. Of these, 2558 frames of actual Class 1 were predicted correctly (TP). 83 frames were classified incorrectly as Class 1 (FP). 172 frames of actual Class 1 were predicted incorrectly as other classes (FN). The remaining 23214 frames (TN), out of 26027 samples in total, neither belong to Class 1 nor were predicted as Class 1.

Experiment Two: [Hidden Neuron=20]
Table 4 shows the classification of the sound sample frames in terms of the four confusion matrix parameters. For example, Class 1 has a total of 2730 sound sample frames stored in the sound database, extracted from the speaker's voice. Of these, 2567 frames of actual Class 1 were predicted correctly (TP). 64 frames were classified incorrectly as Class 1 (FP). 163 frames of actual Class 1 were predicted incorrectly as other classes (FN). The remaining 23206 frames (TN), out of 26000 samples in total, neither belong to Class 1 nor were predicted as Class 1.
[Table: per-class Model Accuracy (%) and Classification Accuracy (%)]

Experiment Three: [Hidden Neuron=25]
Table 4 shows the classification of the sound sample frames in terms of the four confusion matrix parameters. For example, Class 1 has a total of 2730 sound sample frames stored in the sound database, extracted from the speaker's voice. Of these, 2579 frames of actual Class 1 were predicted correctly (TP). 50 frames were classified incorrectly as Class 1 (FP). 151 frames of actual Class 1 were predicted incorrectly as other classes (FN). The remaining 23229 frames (TN), out of 26009 samples in total, neither belong to Class 1 nor were predicted as Class 1.

Here, the prototype was tested using testing speeches that were preprocessed separately. 20 testing speeches were prepared for each speaker, so each speaker was tested 20 times with his/her own prepared test speeches.
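The confusion-matrix counts reported for Class 1 in the three experiments can be converted into precision, recall, F1-score, and one-vs-rest accuracy. The following Python sketch uses the TP, FP, FN, and TN values quoted in the text; the function and variable names are illustrative, and the accuracy computed here is the per-class (Class 1 versus all other classes) figure, not the overall classification accuracy.

```python
from typing import Dict, NamedTuple


class Counts(NamedTuple):
    tp: int  # actual Class 1, predicted Class 1
    fp: int  # other class, predicted Class 1
    fn: int  # actual Class 1, predicted other class
    tn: int  # other class, predicted other class


def metrics(c: Counts) -> Dict[str, float]:
    """Standard one-vs-rest metrics derived from the four counts."""
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (c.tp + c.tn) / sum(c)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Class 1 counts as reported for experiments one, two, and three.
class1 = {
    1: Counts(tp=2558, fp=83, fn=172, tn=23214),
    2: Counts(tp=2567, fp=64, fn=163, tn=23206),
    3: Counts(tp=2579, fp=50, fn=151, tn=23229),
}

for experiment, counts in class1.items():
    rounded = {name: round(value, 4) for name, value in metrics(counts).items()}
    print(experiment, sum(counts), rounded)
```

In each experiment TP + FN equals the 2730 frames of actual Class 1, and the four counts add up to the reported totals (26027, 26000, and 26009), which is a quick consistency check on the quoted figures.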

Result Discussions
The experiments and findings section presents various findings, which address the research questions. The first research question was whether ANN has promising performance for speaker recognition. The findings indicate that it does, as discussed below. First, based on the confusion matrices, the overall accuracy is 96.0%, 96.7%, and 97.3% for the three experiments respectively. This shows how often the classifier is correct. Second, based on Table 4, which presents the performance in terms of TP, TN, FP, and FN, the approach showed good results in classifying samples; only a few samples were classified wrongly. Third, based on Table 5, which presents the performance in terms of precision, recall, and F1-score, the approach showed good results. Finally, based on the real-time testing in Table 5, the trained neural network provides a promising result in recognizing untrained speeches of registered speakers.
The study then explored internal factors that could improve the performance of the model and of recognition. Selecting an appropriate number of hidden layers and neurons is still an open problem. There is no common standard approach; various approaches are listed in [12], and of these the rule of thumb was used in this study. Accordingly, three different numbers of hidden neurons (15, 20, and 25) were used separately. The response time during training also increases as the number of hidden neurons increases.

[Table: Average Precision, Recall, F1-score, and Overall Accuracy]
It is not fully appropriate to compare this study with the three previous papers, because a fair comparison requires studies with a common background or equivalent environments.
[5] and [6] used directly recorded speeches from 50 and 100 speakers respectively, whereas in this study the speeches were collected from multimedia sources and the experiments were conducted with 10 speakers. Helpfully, in [5] the researcher reported the performance per number of speakers (10, 20, 30, 40, 50); for 10 speakers, he achieved 83.2% and 87% accuracy using VQ and GMM respectively. Similarly, in [6] the researcher reported the performance per number of speakers (30, 60, 90); for 30 speakers, he achieved 70% and 66% accuracy using VQ and GMM respectively. Based on this analysis, our study achieved better performance than the two previous papers. However, this does not mean that this modeling technique performs better than the others for a large number of speakers, because this study was conducted with only 10 speakers.
Finally, the study introduces a novel modeling technique for Amharic language speaker recognition research and shows that ANN performs better than the other techniques in this context.
Table 6. Comparison of this research with other related research

Conclusion
Biometric techniques are among the modern advances in security systems. They no longer require entering a password or a PIN, which is difficult to remember; physical characteristics of the person are used instead. This thesis has presented the voiceprint as one of the most promising and useful technologies for biometric security.
The general issues and applications of speaker recognition are described in the introduction of this thesis. The goal of the thesis was to enable the development of text-independent speaker recognition for the Amharic language using a novel modeling technique that had not previously been attempted for Amharic.
A dataset of 200 speech samples in total was prepared from public speeches of 10 famous people. MFCC feature extraction and the ANN modeling technique were used to meet the goal. MFCC transformed the preprocessed speech signals into feature vectors suitable for training the neural network. The feedforward neural network was trained on the feature vectors using the Levenberg-Marquardt training algorithm with the number of training epochs set to 100. The training speech dataset was divided into 70%, 15%, and 15% for training, validation, and testing respectively using the dividerand function, and the tansig activation function was used to convert the input signal of a neuron into its output signal. The proposed modeling technique showed better performance than the techniques used in previous research.
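The 70%/15%/15% random division and the tansig activation can be illustrated with a small Python sketch. This is not MATLAB's `dividerand` implementation: `random_split` is a hypothetical helper that only mimics the idea of a random index split, and the seed and sample count are chosen for the example. The `tansig` function follows the usual definition 2/(1 + e^(-2n)) - 1, which is mathematically equal to tanh(n).

```python
import math
import random
from typing import List, Sequence, Tuple


def tansig(n: float) -> float:
    """Hyperbolic tangent sigmoid; maps any moderate real input into (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0


def random_split(num_samples: int,
                 ratios: Sequence[float] = (0.70, 0.15, 0.15),
                 seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Randomly divide sample indices into training, validation, and test sets."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)      # seeded only for repeatability
    n_train = round(ratios[0] * num_samples)
    n_val = round(ratios[1] * num_samples)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


train_idx, val_idx, test_idx = random_split(200)
print(len(train_idx), len(val_idx), len(test_idx))
print(tansig(0.5), math.tanh(0.5))
```

In the study the division would apply to the feature vectors fed to the network rather than to whole speech files; the count of 200 is used here only to keep the example small.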
The findings addressed all the research questions raised in the study and were discussed in the discussion section of chapter four. Since the third research question is a general question, its answer was found after completion of the whole research process. Its implication was to explore external causes that could reduce the performance of the model and the recognition accuracy. There are various external factors that affect the performance of the model as well as the recognition accuracy.

From the speech perspective:
- Way of speech collection (directly recorded or from a speech database)
- Environment in which the speech is recorded (noisy or noise free)
- Emotion and health condition of the speaker
- Accent of the speaker

From the techniques perspective:
- Appropriateness of preprocessing steps
- Selection of the best feature extraction technique
- Modeling using a technique that is best at discriminating
- Criteria to evaluate the performance
Every research context has its own fitting requirements, and scientific investigation is needed to find the appropriate requirements for a specific research context. Standardizing common requirements that work for any kind of research context remains a big gap for researchers. Mimicking nature is very difficult.
Finally, we observed that the architecture selected for the neural network in this thesis, a pattern recognition feed-forward neural network with a 26-coefficient input vector, one hidden layer containing 15, 20, or 25 neurons, and an output layer containing 10 neurons, was effective and suitable for the identification process.
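To make the size of this architecture concrete, the Python sketch below counts the trainable weights and biases for the three hidden-layer sizes and runs one forward pass through an untrained network of the same shape. The random initial weights, the tanh hidden activation (equivalent to tansig), and the softmax output layer are assumptions made for illustration; the text specifies only the layer sizes.

```python
import math
import random
from typing import List


def parameter_count(n_in: int, n_hidden: int, n_out: int) -> int:
    """Weights plus biases of a network with one hidden layer."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out


def init_layer(n_in: int, n_out: int, rng: random.Random) -> List[List[float]]:
    """One row per neuron: n_in weights followed by a bias term."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def layer_sums(rows: List[List[float]], inputs: List[float]) -> List[float]:
    """Weighted sum plus bias for every neuron in a layer."""
    return [sum(w * v for w, v in zip(row, inputs)) + row[-1] for row in rows]


def forward(x: List[float], hidden_rows: List[List[float]],
            output_rows: List[List[float]]) -> List[float]:
    """tanh hidden layer followed by a softmax output (assumed output function)."""
    hidden = [math.tanh(s) for s in layer_sums(hidden_rows, x)]
    scores = layer_sums(output_rows, hidden)
    peak = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


for n_hidden in (15, 20, 25):
    print(n_hidden, parameter_count(26, n_hidden, 10))

rng = random.Random(0)
hidden_rows = init_layer(26, 15, rng)
output_rows = init_layer(15, 10, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(26)]   # stand-in for one feature vector
probs = forward(x, hidden_rows, output_rows)
print(len(probs), round(sum(probs), 6))
```

The parameter count grows linearly with the number of hidden neurons, which is consistent with the longer training response time observed for the larger networks.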
The study contributes a novel modeling technique for Amharic language speaker recognition. Since the technique achieved better performance than the modeling techniques used in previous papers, it gives an indication of which modeling technique to choose for anyone who needs a practical implementation. This study also gives future researchers interested in this area a basis for conducting comparative analysis of modeling techniques, especially for Amharic language speaker recognition. Finally, although it was not an objective of this study, the GUI prototype developed by the researcher is a product of the study and could be used as a tool by future researchers working in this field with MFCC and ANN. It removes the need to search for different source codes by incorporating everything in one place.