Performance Evolution of Face and Speech Recognition system using DTCWT and MFCC Features

Many everyday activities now depend on automated systems to ensure security. Biometric security systems recognize people automatically and overcome the limitations of traditional methods such as passwords, Personal Identification Numbers (PINs) and ID cards. Face recognition is a broad research area with many applications. In the proposed work, face recognition is carried out using the Dual-Tree Complex Wavelet Transform (DTCWT) combined with the Quick Fourier Transform (QFT), and speech recognition is carried out using Mel-Frequency Cepstral Coefficients (MFCC). A distance measure is used to match the test features against the database features of the face images and speech signals. Performance parameters such as Equal Error Rate (EER), False Rejection Rate (FRR), False Acceptance Rate (FAR) and Total Success Rate (TSR) are evaluated for person recognition.


Introduction
In recent years, with the rapid development of computer technology, authentication has shifted from traditional methods such as identity cards, PINs and passwords to biometrics, because biometric traits are convenient for users and difficult to steal or forge.
The term biometric derives from Greek: bio means life and metric means measure, so it denotes the measurement of life [1]. Biometrics relies on two kinds of characteristics, physiological and behavioral. Physiological characteristics are based on the shape of the human body, such as the face, iris, palm print and fingerprint, whereas behavioral characteristics are based on a person's behavior and include signature, gait and voice. The face is among the most commonly used biometrics for verification in security systems because it is intuitive and non-intrusive. Face recognition is an active area of research with wide-ranging applications in fields such as automated access control, video content indexing, surveillance, crime investigation and human-computer interaction.
Speech is the main form of communication between human beings, and speech recognition makes it possible for machines to understand human language. Automatic Speech Recognition (ASR) technology has reduced human effort in many fields by providing an efficient user interface for devices and applications such as telephone networks, voice dictation, voice navigation on smartphones and smart speakers. A speech recognition system takes human speech as input through a device, analyzes it, and transforms the voice signal into the corresponding logical information that a computer can recognize [1]. In the speech recognition method, pre-processing is carried out to obtain the speech signal, and features are extracted so that the test signal can be matched against the stored signals using a distance measure.

Related Works
Tadi Chandrasekhar and Ch. Sumanthkumar [2] proposed a face recognition model using an Adaptive Neuro-Fuzzy Inference System (ANFIS) classifier. In this model, DTCWT was used to enhance the face images. Discriminative face features were extracted from the enhanced images using Principal Component Analysis (PCA). The performance of the approach was measured on the YALE B and ORL data sets with the ANFIS classifier. The researchers in [3] designed an algorithm using fast PCA and HOG (Histogram of Oriented Gradients) for recognizing faces in unconstrained environments. The raw data was preprocessed to extract the face region using a Haar feature classifier, and HOG features were then extracted from the face image. Matching was performed with a Support Vector Machine (SVM) on the Labeled Faces in the Wild (LFW) database. Ningning Zhou et al. [4] constructed a face recognition technique by improving CS-LBP (Center-Symmetric Local Binary Pattern). In CS-LBP, ignoring the central pixel information reduces discriminative ability. To overcome this effect, a descriptor was designed for feature extraction by fusing the central pixel information into CS-LBP. The effectiveness of the algorithm was tested on the FERET, YALE B, YALE and ORL data sets using the nearest neighbor classifier for matching. Ravi J et al. [5] developed a face recognition algorithm based on DTCWT and LBP. In the preprocessing stage, the original face images of all the databases are resized for uniformity. The DTCWT coefficients of the resized face images are extracted using a five-level DTCWT and then segmented into 3×3 matrices. The final face features are extracted from the segmented matrices using the LBP descriptor. Experiments were carried out on different databases by comparing the test features with the trained features using a Euclidean distance classifier.
In [6], the researchers developed a face recognition algorithm using a fusion of feature learning techniques. The desired face region was captured by a tree-structured part model on the basis of facial landmark points. Scale Invariant Feature Transform (SIFT) descriptors were computed from patches of this face region. Feature learning methods such as block coordinate descent, sparse representation coding, coordinate descent and locality-constrained linear coding were applied to the SIFT descriptors to obtain different face features of the input image. The scores of these learning techniques were fused to make the recognition decision. Performance was evaluated on different databases using a multiclass SVM. Lijian Zhou et al. [7] designed a face recognition algorithm based on 2DLPP (two-dimensional locality preserving projection) and LBP. Enhanced texture features were extracted with the LBP descriptor, which removes illumination and noise effects; 2DLPP was then applied to these features to capture the structure of the image space while reducing its dimension. To assess the effectiveness of the algorithm, experiments were performed on the Yale, expanded CMU PIE C09 and Yale B standard databases with a Nearest Neighbor Classifier (NNC). Tong Xiao et al. [8] developed an encrypted face recognition technique using the Tent Map, the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT). A pseudo-random sequence was generated using the Tent Map. DWT-DCT was applied to the face image to extract a coefficient matrix, and the dot product of this matrix with the pseudo-random sequence gave the encrypted face image. A projection matrix was generated from the encrypted image by applying PCA and was used to train a back-propagation neural network for recognition. A simulation experiment was conducted on the ORL database to check the robustness of the algorithm.
Eyad I. Abbas and Mohammed E. Safi [9] developed a face recognition algorithm that reduces the database size through wavelet decomposition. Discrete wavelet decomposition was applied to the training and test images to decrease the database size, and the final features were extracted from these images using PCA. The algorithm was tested on the ORL database using a Euclidean distance classifier.
Punnam and Satyasavithri [10] proposed DT-CWT sub-band segmentation for face recognition. Using DT-CWT, the face image is split into variously oriented sub-bands. A novel representation of these sub-bands was formed by arranging them from low to high frequency as a column vector, and PCA was applied to this representation to obtain the final features. Performance was evaluated on the ORL database using a k-nearest neighbor classifier with the Mahalanobis cosine distance.
Hua Wang et al. [11] developed a face recognition algorithm by fusing HOG and Local Difference Binary (LDB). Local pattern characteristics of the face image are obtained through the LDB descriptor, and edge features are extracted by the HOG descriptor. These features are fused to form the final feature vector. The accuracy of the algorithm was tested on the ORL and Yale data sets using a linear SVM classifier. Chunling Tang and Min Li [12] proposed an algorithm for speech recognition in noisy environments that combines speech enhancement with a discard feature model to eliminate noise and identify the correct voice in the voice information, and measured the speech recognition rate in the noise environment of an automobile. Lucas Debatin et al. [13] reviewed offline speech recognition techniques across different speech recognition topics. The authors concluded that the speech recognition rate can be improved by reducing the error rate, using neural networks for language models, and using n-gram statistical models. R. Thiruvengatanadhan [14] developed a speech recognition algorithm using the Autoassociative Neural Network technique. In that algorithm, speech features are extracted using Mel Frequency Cepstral Coefficients (MFCC) for individual words, which are used to train the system, and the recognition rate is measured.
Ritesh A. Magre and Ajit S. Ghodke [15] proposed a robust feature extraction algorithm for visual speech and speaker recognition. Features are extracted from the speaker's mouth region and compared with different visual features to select the best one, increasing the accuracy and reliability of visual speech identification.
Mehryar Mohri et al. [16] proposed a framework for speech recognition based on weighted finite-state transducers. The system is built by representing its different components as transducers, including context-dependency models, statistical grammars, hidden Markov models, pronunciation dictionaries and phone/word lattices.
Ashok Kumar and Vikas Mittal [17] developed a speech recognition algorithm using different feature extraction methods: Linear Predictive Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP).
Jayanthi Kumari and Jayanna [18] presented research on speaker verification with limited data (not more than 15 seconds), using different feature extraction algorithms that provide good verification performance. Features are extracted using methods such as LPCC, MFCC, LPRP and Linear Prediction on the NIST-2003 database, and the equal error rate is measured. P. Krishnamoorthy et al. [19] proposed a method for speaker recognition under limited data (less than 15 s) with added noise, aiming for better performance in that condition, and measured the Signal to Noise Ratio (SNR) for 100 speakers selected randomly from the TIMIT database. MFCC features and a Gaussian Mixture Model are used to measure the performance of the system.

Proposed Model
The proposed work uses the Dual-Tree Complex Wavelet Transform (DTCWT) together with the Quick Fourier Transform (QFT) to obtain the combined face features, and Mel-Frequency Cepstral Coefficients (MFCC) to obtain the speech characteristics, so that the person is recognised more accurately. Figure 1 shows the proposed model for face recognition.
DTCWT is an efficient algorithm for implementing a wavelet transform. It is known for bringing properties of the Fourier transform into the wavelet transform. The Dual-Tree Complex Wavelet Transform has several advantages, such as limited redundancy, perfect reconstruction, approximate shift invariance and directional selectivity, and it provides a good basis for de-blurring and de-noising.

Figure 1: Block Diagram for Face Recognition
The proposed work incorporates multi-scale resolution methods to obtain the characteristics of the face image. The system applies the DTCWT independently to obtain one set of feature coefficients. The DTCWT of a signal x(n) is computed with two critically sampled DWTs applied in parallel to the same data. The filter bank of the DTCWT is described in Figure 2. The first tree generates the real part of the transform, whereas the second tree produces the imaginary part. The high-pass (h_h) and low-pass (h_l) information is obtained from the real tree, and the corresponding complex coefficients of the low-pass (h_l) and high-pass (h_h) pair are obtained from the imaginary tree. Together they give six sub-bands oriented at ±15°, ±45° and ±75°. This sub-band information is taken as the features. The 2D separable complex wavelets and the 2D complex scaling function are described as:
$$\Psi_1(p, q) = \phi(p)\,\Psi(q)$$
$$\Psi_2(p, q) = \Psi(p)\,\phi(q)$$
$$\Psi_3(p, q) = \Psi(p)\,\Psi(q)$$
$$\phi(p, q) = \phi(p)\,\phi(q)$$
The final features are obtained by combining the QFT features with the DTCWT features through an arithmetic operation. A Euclidean distance (ED) classifier is used to classify the test features against the trained set of features. Samples from the L-Spacek data set showing one person in different poses are shown in Figure 3.
The QFT exploits the symmetry properties of the cosine and sine functions to derive an efficient algorithm.
In speech recognition, the voice carries a great deal of information, and the speaker must be identified by extracting the characteristics of that person's voice. In preprocessing, the speech signal is converted into a digital representation. The speech signal varies with time, but when examined over a short interval of between 5 ms and 100 ms its characteristics are relatively unchanged. Over longer intervals, of 0.2 seconds or more, the signal characteristics change noticeably; hence short-term spectral analysis is the preferred way to process the audio signal.

MFCC (Mel-Frequency Cepstral Coefficients)
The Mel-Frequency Cepstral Coefficients method is based on human hearing behavior, which does not perceive frequencies above 1 kHz on a linear scale [18]. The human ear is able to distinguish different frequencies, but its resolution is not uniform across the spectrum.
The mel scale is used to express the signal and is based on pitch judgments made by observers at regularly spaced intervals. The scale uses filters that are spaced linearly at frequencies below 1000 Hz and logarithmically above 1000 Hz.

Framing
Framing is the process of segmenting the speech signal into short frames of 20 ms to 40 ms. The voice signal is split into frames of N samples, with adjacent frames separated by M samples (M < N). A Hamming window is applied to each frame to limit its length. Values of N = 128 and 256 and of M = 50 and 100 were tried, and the combination M = 100 and N = 256 gave the best performance. The FFT is then applied to convert the N samples of every frame from the time domain to the frequency domain.
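The framing, windowing and FFT steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `frame_signal` and `magnitude_spectrum`, the 8 kHz sampling rate and the synthetic 440 Hz test tone are assumptions made for the example. At 8 kHz, N = 256 samples corresponds to a 32 ms frame, which falls inside the 20 ms to 40 ms range given in the text.

```python
import numpy as np

def frame_signal(signal, frame_len=256, hop=100):
    """Split a 1-D signal into overlapping frames of frame_len samples, hop samples apart."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        # Zero-pad short signals so that at least one full frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row i of idx holds the sample indices that belong to frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def magnitude_spectrum(frames):
    """Apply a Hamming window to every frame, then take the FFT magnitude."""
    window = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

# Assumed test input: one second of a 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
frames = frame_signal(np.sin(2 * np.pi * 440 * t))   # N = 256, M = 100
spectrum = magnitude_spectrum(frames)
```

Because M < N, consecutive frames overlap by N - M = 156 samples, so no part of the signal is analysed only at the tapered edge of a window.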

Mel Filter Bank Processing
The frequency range of the FFT spectrum of the voice signal is wide, and the voice signal does not follow a linear scale, so Mel Filter Bank (MFB) processing is applied. Figure 5 describes the filter bank. The filters compute a weighted sum of spectral components so that the output approximates the mel scale. The magnitude frequency response of each filter is triangular: it equals unity at the filter's centre frequency and decreases linearly to zero at the centre frequencies of the two adjacent filters. The output of each filter is the sum of its filtered spectral components. The mel value for a given frequency f in hertz is calculated using the following equation:
$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
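A triangular mel filter bank of the kind described above might be built as in the sketch below. It assumes the widely used mapping mel(f) = 2595 log10(1 + f/700); the number of filters (20), the FFT length (256) and the sampling rate (8 kHz) are illustrative assumptions, and the function names are hypothetical rather than taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    """Commonly used Hz-to-mel mapping (assumed here)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(n_filters=20, n_fft=256, fs=8000):
    """Return an (n_filters, n_fft // 2 + 1) matrix of triangular filters."""
    # n_filters + 2 points equally spaced on the mel scale: band edges and centres.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bin_freqs = np.linspace(0.0, fs / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (bin_freqs - left) / (centre - left)     # 0 at left edge, 1 at centre
        falling = (right - bin_freqs) / (right - centre)  # 1 at centre, 0 at right edge
        # The triangle is the lower of the two slopes, clipped at zero outside the band.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = mel_filter_bank()
```

Given a matrix of per-frame FFT magnitudes with 129 columns, the filter outputs would be obtained as `magnitudes @ bank.T`, one weighted sum of spectral components per filter.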

Feature Matching
Of the available feature matching techniques, such as Hidden Markov Models (HMM), Dynamic Time Warping (DTW) and Vector Quantization [20], the Vector Quantization technique is used in the proposed work because it is easy to implement and gives good accuracy.

Distance measure
The unknown speaker's voice is characterized by a sequence of feature vectors (y_1, y_2, …, y_i), which is then compared with the codebooks in the database. The unknown speaker is recognized by measuring the Euclidean distortion distance between the feature vectors and each codebook; when this distance is low, the person is recognized as a known person.
To carry out the work, a speech database is created that contains 6 speech signals for each of the first 20 persons.
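The codebook comparison can be illustrated with the following sketch. It assumes that the distortion is the average Euclidean distance from each test feature vector to its nearest codeword, which is a common choice but is not stated explicitly in the text. The function names and the two toy 2-D codebooks are hypothetical, and codebook training (for example by clustering the training features) is not shown.

```python
import numpy as np

def vq_distortion(features, codebook):
    """Average Euclidean distance from each feature vector to its nearest codeword."""
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # dists[i, j] is the distance between feature vector i and codeword j.
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

def identify_speaker(features, codebooks):
    """Return (speaker_id, distortion) for the codebook with the lowest distortion."""
    scores = {spk: vq_distortion(features, cb) for spk, cb in codebooks.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

# Toy codebooks for two hypothetical speakers (illustrative values only).
codebooks = {
    "speaker_a": np.array([[0.0, 0.0], [1.0, 1.0]]),
    "speaker_b": np.array([[5.0, 5.0], [6.0, 6.0]]),
}
test_features = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
best_speaker, best_distortion = identify_speaker(test_features, codebooks)
```

In this toy case two of the three test vectors coincide with codewords of `speaker_a` and the third lies at distance 1 from the nearest one, so that codebook gives the lowest distortion.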

Training
For training, 6 speech signals per person from the first 20 persons of the Speech Accent Archive dataset are used, giving a total of 120 training signals.

For FRR and TSR calculation
For the FRR and TSR calculation, the seventh signal of each of the first 20 persons is used (inside the database).

For FAR calculation
For the FAR calculation, the seventh signal of each of persons 21 to 30 is used (outside the database).
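One way to compute the three rates from this protocol is sketched below. The definitions are assumptions based on common usage: FRR is the share of inside-database probes rejected at the threshold, TSR is the share of inside-database probes that are accepted and matched to the correct person, and FAR is the share of outside-database probes that are accepted. The function name and all numeric values are hypothetical and are not results from this work.

```python
import numpy as np

def error_rates(inside_dist, inside_correct, outside_dist, threshold):
    """Return (FRR, FAR, TSR) in percent for one threshold value."""
    inside_dist = np.asarray(inside_dist, dtype=float)
    inside_correct = np.asarray(inside_correct, dtype=bool)
    outside_dist = np.asarray(outside_dist, dtype=float)
    accepted_inside = inside_dist < threshold
    frr = 100.0 * np.mean(~accepted_inside)                 # genuine probes rejected
    tsr = 100.0 * np.mean(accepted_inside & inside_correct)  # accepted and correctly matched
    far = 100.0 * np.mean(outside_dist < threshold)          # impostor probes accepted
    return float(frr), float(far), float(tsr)

# Toy minimum distances for four inside and four outside probes (illustrative only).
inside_dist = [9.5, 10.0, 11.0, 12.5]
inside_correct = [True, True, False, True]   # was the nearest entry the right person?
outside_dist = [10.5, 12.0, 13.0, 14.0]
frr, far, tsr = error_rates(inside_dist, inside_correct, outside_dist, threshold=11.5)
```

Repeating the call over a range of thresholds gives the FRR and FAR curves; the EER is read off at the threshold where the two curves cross.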

Matching
Matching is carried out separately for face images and speech signals using the Euclidean distance. The minimum distance between the corresponding feature vectors of the two faces or speech signals being compared corresponds to the best match. If the Euclidean distance between two feature vectors is less than a threshold value, the decision is that the two samples match and come from the same person; otherwise, the decision is that they do not match and come from different persons.
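The matching rule can be written as a short nearest-neighbour search followed by a threshold test. The sketch below is illustrative only: the function name `match`, the dictionary-based database layout, the 3-element feature vectors and the threshold of 1.0 are assumptions chosen for the example.

```python
import numpy as np

def match(test_vec, database, threshold):
    """Return (person_id, distance) of the nearest entry, or (None, distance) if rejected."""
    test_vec = np.asarray(test_vec, dtype=float)
    best_id, best_dist = None, float("inf")
    for person_id, ref_vec in database.items():
        dist = float(np.linalg.norm(test_vec - np.asarray(ref_vec, dtype=float)))
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    # Accept only when the smallest distance is below the threshold.
    if best_dist < threshold:
        return best_id, best_dist
    return None, best_dist

# Toy database with two hypothetical persons (illustrative feature vectors).
database = {"p1": [0.0, 0.0, 0.0], "p2": [3.0, 4.0, 0.0]}
accepted_id, accepted_dist = match([0.1, 0.0, 0.0], database, threshold=1.0)
rejected_id, rejected_dist = match([10.0, 10.0, 10.0], database, threshold=1.0)
```

The same routine applies to face and speech features, since only the feature vectors and the threshold differ between the two modalities.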

Results Analysis
The L-Spacek face database and the Speech Accent Archive dataset are used to evaluate the performance of the proposed work.

Result Analysis for speech
Table 2 gives the values of %TSR, %FRR and %FAR for the speech data. As the threshold is varied from 8.2 to 12.2, FAR rises from 0 to 85 percent and TSR reaches 97.5%. FRR is 100% up to a threshold of 9 and falls to 2.5% when the threshold reaches 12.2. The FRR and FAR curves for different threshold values, including the point at which FAR and FRR intersect, are shown in Figures 6 and 7.

Conclusion
In the proposed work, face identification using DTCWT has been applied effectively to the L-Spacek database. Pre-processing is performed on the face images to bring all of them to a uniform size, the Dual-Tree Complex Wavelet Transform is applied to the resized face images to obtain DTCWT features, and these are taken as the final features. The Euclidean distance is adopted for matching. The Total Success Rate is 98.83% for the face database and 97.50% for the speech database.