Visual Question Answering using Convolutional Neural Networks

Enabling a computer system to understand its surroundings and to process information as a human does has long been a central goal of Computer Science. Visual Question Answering (VQA) is one route toward this kind of artificial intelligence. A VQA system is trained to answer natural-language questions about a given image. It is a general-purpose system that can be applied to any image-based scenario, given adequate training on relevant data. This is achieved with neural networks, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). In this study, we compare different approaches to VQA and explore a CNN-based model. With continued progress in Computer Vision and question-answering systems, VQA is becoming an essential tool that can handle multiple scenarios, each with its own data.


Introduction
Artificial Intelligence (AI) is often pictured as a robotic system able to think like a human, but technically it can be divided into subfields such as Natural Language Processing (NLP), Computer Vision, Image Processing, and Text Processing. For a system to be called human-like, it should understand and respond to a stimulus in the way a human would. This is a challenging task, as the actual workings of human thought are still not fully understood. Nevertheless, there has been progress in the development of such systems. Visual Question Answering is one of them: it interprets a given image and answers a question asked about that image. This is done in two parts. The system extracts the features of the input image, and it analyzes the question to find its important words and the associations between those words and the image features. Finally, an answer is generated in natural language.
Because the task consists of two separate processing steps, both the individual processing of the image and the question and the mapping between the question and the image features must be accurate to achieve the desired result. This depends largely on how the model is trained on the dataset and on the choice of properly fine-tuned neural networks.
In this research, we compare previous approaches to VQA by studying their model training, accuracies, feature-extraction methods and use of datasets. We also propose an approach that implements VQA with Convolutional Neural Networks and Recurrent Neural Networks, supplemented by external knowledge about the images in the dataset. External knowledge helps the system map image information to its corresponding question-answer pair by providing additional details about the features in the image. This reduces the number of random answers that are irrelevant to the image or the question.

Related Work
Several approaches have tackled the VQA challenge, mainly with Artificial Neural Networks, specifically Convolutional Neural Networks (Qi Wu 2017) and Recurrent Neural Networks (Iqbal Chowdhury et al.). We compared these approaches on the basis of their test accuracy in answering questions correctly. We found that early approaches such as (Yuetan Lin et al. 2016) were based on smaller image and question datasets. Models that used external input or information (Qi Wu 2017) proved to be a better approach, because they relied not only on the given image and its associated question-answer pair but also on additional information that describes the extracted image features in detail.
Later, attention-based models (Peter Anderson 2018) were developed. These models focus on the feature-extraction phase for the image set. Feature extraction using VGGNet (Yuetan Lin et al. 2016) only accepts images of size 224×224. Attention models, which search an image for features using either top-down attention (Peter Anderson 2018) or adaptive attention (Geonmo Gu 2017), improved the feature-extraction phase by focusing on the major features of the image.

Proposed System
The main model uses a CNN (Convolutional Neural Network) for the image model and an LSTM (Long Short-Term Memory) network for the question-answer model (Kavita Moholkar et al.). The system consists of four modules: an image feature extraction module, a question-answer vocabulary, an image model and a prediction module. Handling the VQA task starts with processing and understanding the image dataset, which in this system is MSCOCO 2014 (Tsung-Yi Lin et al.). Image features are extracted with VGG16, a pre-trained 16-layer convolutional neural network that takes RGB (Red, Green, Blue) input images of size 224 by 224. The images are processed in batches of 10 until all images have been processed. This yields a feature array of shape (82568, 4096), which is stored in an HDF5 feature file (.h5) using h5py. Alongside the feature file, a list of 82568 image identifiers, one per feature vector, is created. Both files are then used as inputs for creating the image model.
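The batching step described above can be sketched as follows. This is a minimal illustration and not the authors' code: the names `batches`, `extract_all` and `fake_extractor` are hypothetical, and `fake_extractor` is a stand-in that returns zero vectors where the real system would run the VGG16 forward pass. The sketch only shows how features and image identifiers stay aligned while images are processed 10 at a time.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

FEATURE_DIM = 4096  # feature length per image, as stated in the text
BATCH_SIZE = 10     # images per batch, as stated in the text


def batches(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_all(
    image_ids: Sequence[str],
    extract_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = BATCH_SIZE,
) -> Tuple[List[List[float]], List[str]]:
    """Run `extract_batch` on every batch, keeping features aligned with ids."""
    features: List[List[float]] = []
    ids: List[str] = []
    for batch in batches(image_ids, batch_size):
        batch_features = extract_batch(batch)
        if len(batch_features) != len(batch):
            raise ValueError("extractor must return one vector per image")
        features.extend(batch_features)
        ids.extend(batch)
    return features, ids


def fake_extractor(batch: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for the CNN: one zero vector per image."""
    return [[0.0] * FEATURE_DIM for _ in batch]
```

With 82568 images and a batch size of 10, this loop would run 8257 times, the last batch holding the remaining 8 images. In a real pipeline the two returned lists would be written to the feature file and the identifier list, respectively.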
The second stage of processing concerns the questions and answers. Here, the annotations of the MSCOCO images are split into two files: a question-answer file and a vocab file. The question-answer file is required as input while training the model, and the vocab file is used during the prediction stage to interpret the input question.
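As a rough sketch of this stage, the snippet below builds a list of question-answer pairs and a word-to-index vocabulary from annotation records. The record layout (keys `image_id`, `question`, `answer`), the simple tokenizer and the `<unk>` token are assumptions made for illustration; the actual file formats used by the system are not specified here.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

UNKNOWN = "<unk>"  # assumed placeholder for words missing from the vocabulary


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_qa_and_vocab(
    records: List[dict], min_count: int = 1
) -> Tuple[List[Tuple[int, List[str], str]], Dict[str, int]]:
    """Return (question-answer pairs, vocabulary) from annotation records."""
    qa_pairs = []
    counts: Counter = Counter()
    for rec in records:
        tokens = tokenize(rec["question"])
        qa_pairs.append((rec["image_id"], tokens, rec["answer"].lower()))
        counts.update(tokens)
    vocab = {UNKNOWN: 0}  # index 0 is reserved for unknown words
    for word, count in sorted(counts.items()):
        if count >= min_count:
            vocab[word] = len(vocab)
    return qa_pairs, vocab


def encode(question: str, vocab: Dict[str, int]) -> List[int]:
    """Map a question to vocabulary indices, as needed at prediction time."""
    return [vocab.get(tok, vocab[UNKNOWN]) for tok in tokenize(question)]
```

The pairs correspond to the question-answer file used for training, while `encode` shows how the vocab file could be used to interpret a new input question.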

Image Model
We use the VGG16 network to train on the MSCOCO 2014 dataset. VGG16 is a pre-trained CNN that comprises 13 convolutional layers, 5 max-pooling layers and 3 dense layers. The image features extracted in the earlier processing stage are given as input to this network to create a trained model. Along with the image features, the question-answer vocabulary is given as input and is handled by the LSTM network.
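The layer counts above can be checked against the standard VGG16 layout with a few lines of Python. The layout list below encodes the published architecture (integers are 3x3 convolution channel counts, "M" is a 2x2 max-pooling layer); the function name `summarize_vgg16` is hypothetical, and the final 1000-unit dense layer assumes the usual ImageNet-trained configuration.

```python
from typing import Dict, List, Union

# Standard VGG16 layout: an integer is the output channel count of a 3x3
# convolution (padding keeps the spatial size), "M" is 2x2 max pooling.
VGG16_LAYOUT: List[Union[int, str]] = [
    64, 64, "M",
    128, 128, "M",
    256, 256, 256, "M",
    512, 512, 512, "M",
    512, 512, 512, "M",
]
# Dense layers of the ImageNet-trained network (assumed configuration).
DENSE_UNITS: List[int] = [4096, 4096, 1000]


def summarize_vgg16(input_size: int = 224) -> Dict[str, int]:
    """Count layers and track the spatial size through the pooling stages."""
    conv_layers = sum(1 for item in VGG16_LAYOUT if item != "M")
    pool_layers = VGG16_LAYOUT.count("M")
    spatial, channels = input_size, 3
    for item in VGG16_LAYOUT:
        if item == "M":
            spatial //= 2  # each pooling layer halves height and width
        else:
            channels = int(item)
    return {
        "conv_layers": conv_layers,
        "pool_layers": pool_layers,
        "dense_layers": len(DENSE_UNITS),
        "weight_layers": conv_layers + len(DENSE_UNITS),
        "final_spatial_size": spatial,
        "flattened_size": spatial * spatial * channels,
        "feature_dim": DENSE_UNITS[0],
    }
```

For a 224 by 224 input, the five pooling stages reduce the spatial size to 7 by 7, so the first dense layer maps 7 * 7 * 512 = 25088 values to the 4096-dimensional feature vector. The 13 convolutional and 3 dense layers together give the 16 weight layers in the network's name.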
The LSTM has 3 layers of 512 units each. The image feature length is 4096, which is determined by the VGG16 network. The dropout rate for both the image and word embeddings is set to 0.5. The model is trained with a batch size of 200 for 100 epochs at a learning rate of 0.001. This model combines the extracted image features with the associated question-answer pair. With the given amount of training and resources, the model achieves a training accuracy of 60%.
Training this model takes approximately 80 hours on an Intel i7-8750H processor with 8 GB of RAM.
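For reference, the hyperparameters reported in this section can be gathered in one configuration object. The class name `TrainingConfig`, its field names and the helper `steps_per_epoch` are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    lstm_units: int = 512          # units per LSTM layer
    lstm_layers: int = 3           # number of stacked LSTM layers
    image_feature_dim: int = 4096  # fixed by the VGG16 feature layer
    image_dropout: float = 0.5
    word_dropout: float = 0.5
    batch_size: int = 200
    epochs: int = 100
    learning_rate: float = 0.001

    def steps_per_epoch(self, num_samples: int) -> int:
        """Number of batches needed to see `num_samples` examples once."""
        return -(-num_samples // self.batch_size)  # ceiling division


config = TrainingConfig()
```

Keeping these values in a single immutable object makes it easier to log a run and to reproduce it later with identical settings.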

Prediction Model
At this stage, we have two trained models, one for the image data and one for the question data. These models are merged to form a single model that generates the relevant answer, as represented in Figure 1.
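One common way to merge an image representation with a question representation is sketched below. The text does not state how the two models are combined, so everything here is an assumption made for illustration: element-wise multiplication as the fusion step, a single linear scoring layer, and a softmax over a fixed list of candidate answers. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fuse(image_vec: Sequence[float], question_vec: Sequence[float]) -> List[float]:
    """Assumed fusion: element-wise product of two equal-length embeddings."""
    if len(image_vec) != len(question_vec):
        raise ValueError("embeddings must have the same length")
    return [i * q for i, q in zip(image_vec, question_vec)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_answer(
    image_vec: Sequence[float],
    question_vec: Sequence[float],
    answer_weights: Sequence[Sequence[float]],
    answers: Sequence[str],
) -> Tuple[str, float]:
    """Score each candidate answer on the fused vector and return the best."""
    fused = fuse(image_vec, question_vec)
    scores = [sum(w * f for w, f in zip(row, fused)) for row in answer_weights]
    probs = softmax(scores)
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs[best]
```

Here `answer_weights` has one row per candidate answer. Concatenating the two embeddings instead of multiplying them would be an equally plausible design; the choice affects only the `fuse` function.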

Transfer Learning
Transfer learning is a method in which a neural network is trained on one dataset and the resulting weights are then reused in another neural network, instead of training that network from scratch on the new data. In this way, the knowledge gained from the previously trained data is transferred. This makes the system more efficient at handling new data.
Our system is trained on the MSCOCO (Tsung-Yi Lin et al.) dataset of over 82,000 training images. To extend its capabilities, the model can accept a new training dataset and be re-trained on it, starting from what it learned on the existing data. This makes the learning procedure of the system more natural and improves its ability to answer questions about different situations or images. The weights obtained from pre-training on MSCOCO (Tsung-Yi Lin et al.) are given to another neural network. In this manner, our system can be adapted to a new dataset by reusing the previously trained weights.
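The weight reuse described above can be illustrated with a small framework-free sketch. It is not the system's implementation: weights are represented as plain dictionaries of lists, the layer names are invented, and the choice to freeze layers by name prefix is an assumption. The idea shown is that matching layers are copied from the pre-trained model, some of them are frozen, and only the remaining layers are updated on the new dataset.

```python
from typing import Dict, List, Set, Tuple

Weights = Dict[str, List[float]]


def transfer(
    pretrained: Weights, new_model: Weights, freeze_prefixes: Tuple[str, ...]
) -> Set[str]:
    """Copy shape-compatible layers into `new_model`; return the frozen layer names."""
    frozen: Set[str] = set()
    for name, values in pretrained.items():
        if name in new_model and len(new_model[name]) == len(values):
            new_model[name] = list(values)  # reuse the learned weights
            if name.startswith(freeze_prefixes):
                frozen.add(name)
    return frozen


def sgd_step(
    model: Weights, grads: Weights, frozen: Set[str], lr: float = 0.001
) -> None:
    """Apply one gradient step, skipping frozen (transferred) layers."""
    for name, grad in grads.items():
        if name in frozen:
            continue
        model[name] = [w - lr * g for w, g in zip(model[name], grad)]
```

Layers whose shapes differ, such as an answer layer sized for a new answer vocabulary, are left at their fresh initialization and trained on the new data.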

Datasets
As the field has developed, the number of available VQA datasets has also grown.

Discussion and Conclusion
In Table 1, we compare the accuracies of different approaches. Only a few approaches reach 70% accuracy, whereas the rest lie in the range of 50-60%. Our model, which combines VGG networks and an LSTM, reached a training accuracy of 60% on the MSCOCO 2014 (Tsung-Yi Lin et al.) training dataset of about 82,000 images. The testing accuracy of this model reached 57% on the same dataset. In Fig. 2, the accuracies of the approaches on their respective datasets are compared. The chart shows that approaches [11] and [14] achieve the highest accuracy on VQA, 70%. This suggests that approaches based on knowledge bases with attention provide higher accuracy. Since even the highest accuracy achieved is 70%, there is considerable scope for improvement. With our proposed system, we try a similar approach, adding transfer learning to improve the model's ability to give answers relevant to the question asked. In conclusion, we have found that artificially reproducing the human ability to answer questions is a challenging but rewarding task. As VQA is a general-purpose system, its applications are limited mainly by the availability of a dataset for the scenario at hand.