A Method for Improving DNN Performance Using 2-bit Quantization

Background/Objectives: Recently, interest in AI (Artificial Intelligence) has increased, and many studies are being conducted to enable AI to be used in embedded and mobile environments. Quantization is one method of reducing model size, but quantization below 8 bits generally cannot be implemented without additional hardware such as an FPGA. With this in mind, we propose two new algorithms that implement 2-bit quantization in software. Methods/Statistical analysis: We propose a packing operation that quantizes weights consisting of 32-bit real values into 2 bits and stores four 2-bit quantized weights in one 8-bit memory location, and a Masking Matrix Multiplication function that performs the computation between the packed weights and the input values. These functions operate in parallel in GPU memory. Findings: Compared with the existing 32-bit model, the quantized model showed about 16 times lower memory usage and about 4 times faster operation. Nevertheless, the DNN model showed an error of around 1% when trained on the MNIST and HandWritten data, and the CNN model showed an error of around 1% when trained on EEG (Electroencephalography) data. Improvements/Applications: The functions used in this study focus on the DNN domain; although they were extended to a CNN, quantization could be performed only in the FC (Fully Connected) part. Applying them to the convolution layer requires an additional function, and it remains to be confirmed whether the difference in accuracy stays small on more complex data sets.


Introduction
As hardware has advanced, the field of AI has also undergone remarkable development. Various companies and research institutes have conducted learning on high-performance PCs, and this learning has produced a succession of surprising results. In recent years, however, research on learning in resource-constrained environments such as mobile [1] and embedded devices, rather than in high-performance PC environments, has been increasing. With the creation of chips for AI computation and the release of USB accelerators, the era in which AI can run on mobile phones is approaching. In such environments, the existing 32-bit model and matrix multiplication operations cannot be used because they require too many resources. Therefore, processing such as reducing the number of bits through quantization [2][3][4] or reducing the amount of computation [5] is necessary. Current software environments do not support quantization below 8 bits, and additional hardware such as an FPGA must be used for this [6][7][8][9][10].
In this paper, we implement, in a purely software environment, a packing operation in which the model's weights are quantized to 2 bits and four weights are stored in one 8-bit memory location to reduce overhead. In addition, a masking matrix multiplication operation is implemented for computing with the packed data and the input values. The packing and masking matrix multiplication operations are described in the following sections.

Packing Function
The packing operation quantizes 32-bit data into 2 bits and stores the results in 8-bit memory in a big-endian format through bit operations. Quantization is done by threshold comparison, and the threshold used in this paper is 0.004. The value 0.004 was chosen because the weights are initialized from a normal distribution with mean 0 and standard deviation 0.01, and thresholds of approximately ±0.004 (about ±0.43σ) divide that distribution into three regions of roughly equal probability. If a weight value is less than -0.004, it is quantized to -1; if it is greater than 0.004, it is quantized to 1; otherwise, it is quantized to 0. The packing procedure is sketched below.
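As a rough illustration of how the packing described above might be realized, the NumPy sketch below quantizes weights by threshold comparison and packs four 2-bit codes per byte. The specific 2-bit encoding of the levels (-1, 0, +1) and the helper names are assumptions, since the paper does not state them; the actual implementation performs this step in parallel in GPU memory.

```python
import numpy as np

# Hypothetical 2-bit codes for the three levels; the paper's encoding is not specified.
CODE = {0: 0b00, 1: 0b01, -1: 0b10}
THRESHOLD = 0.004  # threshold from the paper (weights ~ N(0, 0.01^2), ±0.004 ≈ ±0.43σ)


def quantize_2bit(weights: np.ndarray) -> np.ndarray:
    """Ternary-quantize 32-bit weights to {-1, 0, +1} by threshold comparison."""
    q = np.zeros(weights.shape, dtype=np.int8)
    q[weights > THRESHOLD] = 1
    q[weights < -THRESHOLD] = -1
    return q


def pack_weights(q: np.ndarray) -> np.ndarray:
    """Pack four 2-bit quantized weights into one uint8, big-endian within the byte
    (the first weight occupies the two most significant bits)."""
    flat = q.ravel()
    pad = (-len(flat)) % 4                       # pad to a multiple of four weights
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.int8)])
    packed = np.zeros(len(flat) // 4, dtype=np.uint8)
    for i, w in enumerate(flat):
        shift = (3 - i % 4) * 2                  # big-endian placement inside the byte
        packed[i // 4] |= CODE[int(w)] << shift
    return packed
```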

Masking Matrix Multiplication Function
The Masking Matrix Multiplication operation is a function created to compute the product of the packed data described above and the input values. Since the packed data contains four weight values in one 8-bit memory location, the four weights must be matched with their four corresponding input values before they can be multiplied. This process is carried out collectively through parallel operation, and the selection of the required weight value from the 8-bit memory is handled with bit operations (masking). A sketch of the algorithm is given below.
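The following is a minimal serial sketch of how such a masked multiplication could work, reusing the quantize_2bit and pack_weights helpers and the assumed 2-bit encoding from the previous sketch; the paper's version runs these per-element lookups in parallel in GPU memory, so this is only a reference illustration, not the authors' implementation.

```python
def unpack_code(byte: int, pos: int) -> int:
    """Extract the 2-bit code at position pos (0 = most significant pair) via masking."""
    shift = (3 - pos) * 2
    code = (byte >> shift) & 0b11
    return {0b01: 1, 0b10: -1}.get(code, 0)     # unused code 0b11 treated as 0


def masked_matmul(packed: np.ndarray, x: np.ndarray, out_dim: int, in_dim: int) -> np.ndarray:
    """Multiply a packed (out_dim x in_dim) ternary weight matrix with input vector x."""
    y = np.zeros(out_dim, dtype=np.float32)
    for r in range(out_dim):
        acc = 0.0
        for c in range(in_dim):
            idx = r * in_dim + c                 # position in the flattened weight matrix
            w = unpack_code(int(packed[idx // 4]), idx % 4)
            if w:                                # zero weights contribute nothing
                acc += w * x[c]
        y[r] = acc
    return y


# Example with hypothetical shapes: a 4x8 weight matrix against an 8-dimensional input.
W = np.random.normal(0.0, 0.01, size=(4, 8)).astype(np.float32)
x = np.random.rand(8).astype(np.float32)
y = masked_matmul(pack_weights(quantize_2bit(W)), x, out_dim=4, in_dim=8)
```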

Results and Discussion
To test the results of this paper, two models were created, a DNN and a CNN. The DNN was tested on the MNIST and HandWritten data sets, and the CNN was tested by learning human positive and negative emotions from the EEG data set. First, the DNN was tested in a desktop Linux environment; the environment configuration and the timing results are shown in the table below. The table comparing the overall memory occupancy and average accuracy of the above tests (Ours vs. 32-bit) is as follows.

Conclusions
In this paper, a method of compressing a 32-bit model into 2 bits and performing the corresponding computation was proposed, and the operation was extended, applied to the FC layers of a DNN and a CNN, and tested, with the overall results described above. It is believed that if quantization is also carried out in the convolution layer in the future and the operation is improved to handle the learning of more complex data, better results will be achieved.