An Enhanced CNN-2D for Audio-Visual Emotion Recognition (AVER) Using ADAM Optimizer

Associate Professor, Department of Information Technology, Gudlavalleru Engineering College, Gudlavalleru, AP, India. E-mail: indiragamini@gmail.com
Associate Professor, Department of Information Technology, Gudlavalleru Engineering College, Gudlavalleru, AP, India. E-mail: sureshdani2004@gmail.com
Assistant Professor, Department of Information Technology, Gudlavalleru Engineering College, Gudlavalleru, AP, India. E-mail: venkatgecit@gmail.com
UG Student, Department of Computer Science and Engineering, Gudlavalleru Engineering College, Gudlavalleru, AP, India. E-mail: lakshmi.hariprasanna@gmail.com


Introduction
Our eyes are the natural organ for seeing, and our ears for hearing. If the eyes were hypothetically faster and more capable than the ears, would it not be useful to send sound signals to the eyes for processing? [1][2]. Human emotion recognition is divided into several approaches, three of which have played a crucial role in recent years: (1) facial emotion recognition, (2) speech emotion recognition, and (3) audio-visual emotion recognition. Today, artificial intelligence and neural networks [3][8] can be implemented with tools such as Python, Jupyter, and Anaconda, which are well suited to image analytics. In this paper we implement a CNN [4][7][11] and a CNN-2D on the RAVDESS dataset to recognize human emotion. For the CNN-2D, the whole dataset is converted from WAV files into image files, which is a step in audio-visual emotion recognition.
Convolutional neural networks are primarily used for image classification: arrays of pixel values are given as input. The operations performed by 1-, 2-, and 3-dimensional convolutional networks are the same; the difference lies in the direction of convolution rather than in the input or filter dimension. In a Conv-2D network the kernel moves in two directions. Conv-1D [5] is mainly used for time series or signals such as audio, whereas Conv-2D is used for image data analysis.
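The difference in convolution direction can be sketched in plain NumPy. This is an illustrative toy, not the network used in the paper: the 1-D kernel slides along a single (time) axis, while the 2-D kernel slides along both height and width. As is conventional in deep learning, the kernel is not flipped (cross-correlation).

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid 1-D convolution: the kernel slides in one direction (time)."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

def conv2d(image, kernel):
    """Valid 2-D convolution: the kernel slides in two directions (height, width)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

signal = np.array([1., 2., 3., 4.])
print(conv1d(signal, np.array([1., 1.])))   # one output axis

image = np.arange(9.).reshape(3, 3)
print(conv2d(image, np.ones((2, 2))))       # two output axes
```

The same kernel arithmetic underlies both models in this paper; only the number of sliding directions changes.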
Here, the input features (MFCC, mel spectrogram, tonnetz, and spectral contrast) are given in array form to a sequential CNN-1D model [7]. The model has four convolutional layers (with ReLU activation), the first two of which are each followed by a max-pooling layer; a flatten layer and a dense layer are then stacked, and the output layer uses a softmax activation. The optimizer is RMSprop with a learning rate of 0.00005, the loss is sparse_categorical_crossentropy, the batch size is 20, and the number of epochs is 500.
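The stack described above could be written in Keras roughly as follows. This is a hedged sketch: the paper fixes the layer types, activations, optimizer, learning rate, and loss, but the filter counts, kernel sizes, dense width, and the concatenated feature length are assumptions made here for illustration.

```python
# Sketch of the CNN-1D stack described above (assumed widths/kernels).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FEATURES = 193   # assumed: concatenated mfcc + mel + tonnetz + contrast
NUM_CLASSES = 8      # RAVDESS emotion classes

model = models.Sequential([
    layers.Input(shape=(NUM_FEATURES, 1)),
    layers.Conv1D(64, 5, activation="relu"),   # conv 1 + pooling
    layers.MaxPooling1D(2),
    layers.Conv1D(64, 5, activation="relu"),   # conv 2 + pooling
    layers.MaxPooling1D(2),
    layers.Conv1D(128, 5, activation="relu"),  # conv 3
    layers.Conv1D(128, 5, activation="relu"),  # conv 4
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.00005),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# Training, per the paper's settings (labels as integer class ids):
# model.fit(X_train, y_train, batch_size=20, epochs=500)
```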

Methods Used
Here we present the literature survey in tabular form. We studied papers published between 2010 and 2020 and listed the algorithms they used, observing that the same types of algorithms recurred throughout the decade. From this we identified a need for work in the audio-visual mode; image analytics also plays a vital role today.

Methodology
Pooling Layer - The pooling layer (POOL) down-samples the feature maps and is usually added after a convolution layer. Max and average pooling are the two common pooling operations, taking the maximum and the average of the features, respectively. Max pooling is used here.
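Max pooling as used here can be illustrated with a small NumPy example (a toy, not the framework's implementation): a 2x2 window with stride 2 keeps only the largest value in each block, halving each spatial dimension.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep the largest value in each block."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 1],
                        [4, 2, 0, 5],
                        [6, 1, 3, 2],
                        [0, 7, 1, 4]])
print(max_pool_2x2(feature_map))
# [[4 5]
#  [7 4]]
```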
Fully Connected Layers - The fully connected layer (FC) operates on a flattened input, where each input is connected to all the neurons. FC layers are normally used at the end of the network to link the hidden layers to the output layer and to compute class scores.
Each model has the same structure: a stack of five convolution and pooling layers (as pairs) without any dropout, then a fully connected layer with a dropout of 0.8, followed by a fully connected regression layer.

CNN-2D
In the CNN-2D network, the image dataset values are given as input. The RAVDESS dataset [18] is converted into a spectrogram dataset: every audio file is converted into a spectrogram image [6][9] manually using the Spek tool. The images are converted to grayscale and resized to 150x150. The model has a stack of five convolution and pooling layers (as pairs) without any dropout, then a fully connected layer with a dropout of 0.8, followed by a fully connected regression layer. ReLU activation is used for all CNN layers and the last layer uses softmax. The regression part uses the Adam optimizer with a learning rate of 0.005, the loss is categorical_crossentropy, and the number of epochs is 50.
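A Keras sketch of this CNN-2D stack is given below. The paper fixes the 150x150 grayscale input, the five conv+pool pairs, the 0.8 dropout, the ReLU/softmax activations, and the Adam optimizer with learning rate 0.005; the filter counts, kernel sizes, and dense width are assumptions made here.

```python
# Sketch of the CNN-2D stack described above (assumed widths/kernels).
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SIZE = 150
NUM_CLASSES = 8  # RAVDESS emotion classes

model = models.Sequential([layers.Input(shape=(IMG_SIZE, IMG_SIZE, 1))])
for filters in (32, 64, 64, 128, 128):  # assumed filter counts
    model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
    model.add(layers.MaxPooling2D(2))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation="relu"))  # assumed width
model.add(layers.Dropout(0.8))  # 0.8 as stated; in Keras this is the fraction dropped
model.add(layers.Dense(NUM_CLASSES, activation="softmax"))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(X_train, y_train, epochs=50)  # one-hot labels, per the categorical loss
```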

Optimizers
Three optimizers are used in this paper to enhance the output: Adagrad, RMSprop, and Adam. Among these, the Adam optimizer gave the best result at a learning rate of 0.005. In the classification procedure, happy and neutral are reported as positive emotions, and angry and sad as negative emotions.
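The Adam update rule that gave the best result can be written out in plain NumPy. This is an illustrative implementation, not the framework's; it is applied to a toy quadratic objective (not the paper's loss) at the paper's best learning rate, 0.005.

```python
import numpy as np

def adam_minimize(grad, w, lr=0.005, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Adam: bias-corrected running means of the gradient (m) and of the
    squared gradient (v) jointly scale each parameter update."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = (w - 3)^2, gradient 2(w - 3); the minimum is at w = 3.
w_star = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(w_star)
```

Because the per-step size is bounded by roughly the learning rate, Adam settles within a small neighbourhood of the minimum.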

Results and Discussion
The convolutional neural network model is applied to the generated image dataset.
i) Image size = 150, grayscale images, learning rate = 0.001. a) ReLU activation [18] is used for all CNN layers and the last layer uses softmax. The regression part uses the Adagrad optimizer with categorical_crossentropy loss. The accuracy obtained for this model is 70%. The classification report is:
ii) Image size = 150, grayscale images, learning rate = 0.005. ReLU activation is used for all CNN layers and the last layer uses softmax. The regression part uses the Adagrad optimizer with categorical_crossentropy loss. The accuracy obtained for this model is 79%. The classification report is:

Conclusion and Future Work
The optimizers Adam, RMSprop, and Adagrad are used throughout the design of the model. ReLU activation is used for all layers and softmax for the last layer; the loss is categorical cross-entropy. Considering all the optimizers with learning rates tuned to 0.001 and 0.005, we observed that the Adam optimizer with a learning rate of 0.005 outperforms all the other models, achieving 89% accuracy. The next best model uses RMSprop with a learning rate of 0.005, at 82%. The remaining models, at a learning rate of 0.001, gave 70%, 74%, and 76% for Adagrad, RMSprop, and Adam respectively. This paper's contribution is three-fold. First, we propose a novel speech-enhancement model based on CNN-2D that transforms audio files into spectrograms; the spectrograms are then used, with more accurate phase information, to synthesize enhanced speech waveforms. Second, we derive an optimizer configuration that takes various metrics into account in the objective function. In future work we will investigate integrating video and voice input data as spectrograms, and explore this application in other fields.