Acoustic-Based Scene Event Identification Using a Deep Learning CNN

Abstract: Deep learning is becoming increasingly popular for solving classification problems when compared with conventional classifiers. A large number of researchers are exploiting deep learning for sound event detection in environmental scene analysis. In this research, a deep learning convolutional neural network (CNN) classifier is modelled using extracted MFCC features to classify environmental event sounds. The experimental results clearly show that the proposed MFCC-CNN outperforms other existing methods, with a high classification accuracy of 90.65%.


Introduction
Building computers with senses (touch, vision, hearing, taste, and smell) like humans is a long-awaited goal among researchers. Simply put, computers are made aware of their surroundings just as human beings are. By making such sensing computers, we can produce robots that are conscious of their surroundings, design environment-aware hearing aids, or even design self-driving cars for hazardous situations, acoustic-based surveillance systems, and more. But real acoustic surroundings are hard to decode due to the presence of simultaneous sounds, high background noise, or a long distance to the sound source.
This work is aimed at mining valuable information from environmental audio recordings. The proposed system is targeted at recognizing different classes of environmental sounds, i.e. the noisy sounds that we hear in our day-to-day lives (dog barks, a car engine running, etc.).

Related Work
Feature extraction is the basis of acoustic data classification; based on their temporal resolution, features fall into three subdivisions. The first is frame-level features, which are derived from short analysis frames/windows of between 10 ms and 100 ms to represent local characteristics. Examples include cepstral, spectral, and temporal features such as LPC, LPCC, MFCC, short-time energy, ZCR, centroid, roll-off, flatness, etc. The second is segment-level features, which capture the sound's texture characteristics since their analysis windows are long compared with frame-level ones (Tzanetakis et al., 2002); they are also called texture windows (Seyerlehner et al., 2010). The third is clip-level features, which represent the global audio characteristics of the signal. These features are the same for all frames extracted from a single audio clip.

Mel-Frequency Cepstral Coefficients (MFCC)
MFCC is a compact, frame-level feature that tries to mimic the human auditory system and is derived from the audio through the discrete Fourier transform (DFT). The Fourier spectrum is modified through mel-scaling to adapt it to the human perception level. The mel filterbank consists of n triangular filters whose mel-frequency scale is approximately linear up to 1 kHz and logarithmic thereafter, as given in equations (1) and (2), where M is the mel frequency and f is the spectral frequency:

M = 2595 log10(1 + f / 700)        (1)
f = 700 (10^(M / 2595) − 1)        (2)
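For illustration, a minimal Python sketch of these standard conversions (the function names hz_to_mel and mel_to_hz are our own):

```python
import numpy as np

def hz_to_mel(f):
    # Equation (1): spectral frequency f (Hz) to mel frequency M.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Equation (2): mel frequency M back to spectral frequency f (Hz).
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))              # ~1000 mel: near-linear around 1 kHz
print(mel_to_hz(hz_to_mel(4000.0)))   # round-trips back to 4000.0 Hz
```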

Figure 1. MFCC Extraction
The calculation of MFCC includes the following six steps:
1. The acquired audio sample is framed and windowed to make each short-time frame locally stationary and free from edge discontinuities.
2. Transform the short-time frames into spectral frames using the DFT.
3. Map the spectral frequencies into mel-scale bins.
4. Take the logs of the values of the mel bands.
5. Apply the discrete cosine transform (DCT) to the log mel bands to derive the cepstral coefficients.
6. Further differentiating the cepstral coefficients yields the delta and double-delta coefficients.
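A minimal Python sketch of steps 2 to 5 is given below, assuming the frames have already been framed and windowed as in step 1; the filterbank size n_mels and coefficient count n_ceps are illustrative choices, not values taken from the paper:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_frames(frames, sr=16000, n_mels=40, n_ceps=13):
    """frames: (n_frames, frame_len) array, already windowed (step 1)."""
    n_fft = frames.shape[1]
    # Step 2: DFT -> power spectrum of each short-time frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 3: map spectral frequencies into mel bins via a triangular filterbank.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energy = power @ mel_fb.T
    # Step 4: take the log of the mel-band energies.
    log_mel = np.log(mel_energy + 1e-10)
    # Step 5: the DCT decorrelates the log mel bands into cepstral coefficients.
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]
```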

Deep Learning
Artificial intelligence is playing a tremendous role in bringing the capabilities of machines closer to those of humans. It is mainly employed in tasks such as the classification, recognition, reconstruction, and analysis of audio, image, and video data, and its growth in areas like computer vision has been maximized by the involvement of deep learning.

Convolutional Neural Network
A CNN is a deep learning classifier that takes a two-dimensional matrix as input and learns network weights and biases from it, in order to identify the same class of data during testing. Very little preprocessing is needed for a CNN. The CNN first reduces the input data to as simple a representation as possible without losing the important features.

Architecture of CNN
The CNN is made up of convolution layers with the ReLU activation function, pooling layers for finding patterns, and a fully connected layer with a softmax function at the end for classification. In Figure 3, the filter travels to 9 different positions since the stride is 1; at each position it performs a dot-product operation between the filter and the corresponding region of the 2D matrix over which the filter currently lies. If we want feature maps with the same dimensions as the input matrix, we apply zero padding, as shown in Figure 3. Here a 5x5x1 data matrix is first zero-padded and then convolved with a 3x3x1 kernel, giving a 5x5x1 feature map. If we do the same convolution without zero padding, we get a 3x3x1 feature map.
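A minimal NumPy sketch of this stride-1 convolution follows; the helper conv2d and the example values are ours, chosen only to reproduce the 5x5-input, 3x3-kernel dimensions discussed above:

```python
import numpy as np

def conv2d(x, k, pad=0):
    # Zero-pad the input, then slide the kernel with stride 1, taking a
    # dot product (element-wise multiply and sum) at each position.
    x = np.pad(x, pad)
    out_h = x.shape[0] - k.shape[0] + 1
    out_w = x.shape[1] - k.shape[1] + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i+k.shape[0], j:j+k.shape[1]] * k)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)   # 5x5 input matrix
k = np.ones((3, 3))                            # 3x3 kernel
print(conv2d(x).shape)         # (3, 3): 9 filter positions without padding
print(conv2d(x, pad=1).shape)  # (5, 5): zero padding preserves the input size
```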

Rectified Linear Unit (ReLU) activation function:
The convolution layer output is given to a ReLU activation function, which introduces non-linearity into the data. ReLU sets all values ≤ 0 to zero, while all positive values are left unchanged.

Figure 5. ReLU Activation Function
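In code, ReLU is a single element-wise operation; a tiny NumPy sketch:

```python
import numpy as np

def relu(x):
    # Element-wise ReLU: values <= 0 become 0, positives pass through unchanged.
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0. 0. 0. 1.5 3.]
```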
Pooling Layer: Pooling layers are used to decrease the redundancy present in the data matrix. These layers down-sample the data, so the resources needed to run the program are also reduced. In pooling, an analysis window moves over the 2D data matrix with a given stride. For max pooling, at each step the greatest value within the analysis window is chosen, whereas for average pooling, the average of all values in the analysis window is taken.
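A minimal NumPy sketch of both pooling variants (the helper pool2d and the example matrix are ours):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    # Slide a size x size analysis window over x with the given stride,
    # keeping the max (max pooling) or mean (average pooling) of each window.
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            win = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = win.max() if mode == "max" else win.mean()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 8, 1],
              [3, 4, 0, 9]], dtype=float)
print(pool2d(x))              # max pooling  -> [[6. 4.] [7. 9.]]
print(pool2d(x, mode="avg"))  # avg pooling  -> [[3.75 2.25] [4.   4.5 ]]
```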

Classification - Fully Connected Layer (FC Layer):
Finally, FC layers are added at the end to learn from the high-level features derived from the previous layers. At this stage, the 2D data matrix is flattened out and fed to feed-forward neurons, and backpropagation is utilized to update the network weights at each iteration. Then, using the softmax activation function at the final output layer, the input data is classified into its corresponding class.
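A minimal sketch of the softmax computation used at the output layer (the five example scores are made-up values, one per class):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating,
    # then normalize so the outputs form a probability distribution.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # one logit per class
probs = softmax(scores)
print(probs, probs.sum())  # class probabilities, summing to 1.0
```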

Experimental Setup
In this proposed system, deep learning-based modelling is employed through a CNN classifier to recognize an event in a specific scene from its acoustic data. The audio data is represented through MFCC cepstral features. The proposed system comprises training and testing phases, and each phase has three stages, namely preprocessing, audio feature extraction, and finally classification, as shown in Figure 8.

Preprocessing
Pre-processing includes audio signal normalization, framing, windowing, and then silence removal. The audio signal, sampled at 16,000 Hz, is separated into successive frames. Each frame has a duration of 20 ms and contains 320 samples (16,000 Hz / 1000 = 16 samples per millisecond, since 1 second = 1000 milliseconds; 20 ms x 16 = 320 samples). Such 20 ms frames are extracted every 10 ms. Each frame's signal amplitude is normalized to the range [−1, 1], after which a Hamming window is applied to each extracted 20 ms audio frame. For each frame, the mean energy is calculated, and frames with low energy values are discarded to remove empty (silent) audio frames.
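A minimal NumPy sketch of this preprocessing pipeline; the silence-removal threshold energy_ratio is our own assumption, since the paper does not state the exact energy cutoff:

```python
import numpy as np

SR = 16000                    # sampling rate (Hz)
FRAME_LEN = int(0.020 * SR)   # 20 ms frame -> 320 samples
HOP_LEN = int(0.010 * SR)     # 10 ms hop  -> 160 samples

def preprocess(signal, energy_ratio=0.1):
    # Normalize the amplitude to [-1, 1].
    signal = signal / np.max(np.abs(signal))
    # Slice into overlapping 20 ms frames extracted every 10 ms.
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP_LEN
    frames = np.stack([signal[i*HOP_LEN:i*HOP_LEN+FRAME_LEN]
                       for i in range(n_frames)])
    # Apply a Hamming window to each frame.
    frames *= np.hamming(FRAME_LEN)
    # Discard low-energy (silent) frames; the cutoff relative to the
    # mean frame energy is an assumed value for illustration.
    energy = np.sum(frames ** 2, axis=1)
    return frames[energy > energy_ratio * energy.mean()]
```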

Feature Extraction
Audio frames are analyzed in the cepstral domain to derive the most discriminative feature vectors, which form a high-level representation of the input. In this system, frame-level 39-dimensional MFCC features (static, delta, and double-delta coefficients) are utilized to effectively distinguish environmental sounds.
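A minimal librosa sketch of assembling the 39-dimensional feature matrix (static, delta, and double-delta); the file name clip.wav is a placeholder:

```python
import numpy as np
import librosa

# Load a clip at the paper's 16 kHz sampling rate ("clip.wav" is a placeholder).
y, sr = librosa.load("clip.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 static coefficients per frame
delta = librosa.feature.delta(mfcc)                 # first-order (delta) coefficients
delta2 = librosa.feature.delta(mfcc, order=2)       # second-order (double-delta)

features = np.vstack([mfcc, delta, delta2])         # shape: (39, n_frames)
```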

Modelling using CNN
The extracted features, of dimension [39 x no. of frames], are taken as input for training. The CNN is trained using the known training dataset's feature vectors for all five classes. After the CNN is trained, the test data, which is new to the classifier, is given one by one to analyse the recognition accuracy. The duration of the audio samples is varied and the corresponding results are noted each time. Figure 11 clearly shows that the CNN classifier gives almost the same accuracy for 3-second and 5-second audio samples.
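For illustration, a minimal Keras sketch of such a CNN; the paper's implementation is in Matlab and does not state layer sizes, so every architectural choice below (filter counts, kernel sizes, N_FRAMES) is an assumption:

```python
from tensorflow.keras import layers, models

N_FRAMES = 300   # assumed frames per clip; the paper does not specify this
N_CLASSES = 5    # five environmental sound classes

model = models.Sequential([
    layers.Input(shape=(39, N_FRAMES, 1)),          # 39 MFCCs x frames, 1 channel
    layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```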

Conclusion
A scene recognition system based on the audio input modality is proposed and implemented in Matlab. The proposed system utilizes cepstral MFCC features, which discriminate well and reduce the redundancy in the audio data, to effectively classify environmental sounds into predefined classes. A deep learning CNN classifier and two machine learning classifiers, namely SVM and BPNN, were employed to recognize the acoustic events related to the respective acoustic scenes. The test results clearly show a higher accuracy of 90.65% for the deep CNN and lower accuracies for the other machine learning classifiers: 72% for SVM and 83% for BPNN. Thus, the CNN clearly outperforms the other state-of-the-art classifiers in recognition accuracy.