State-Of-The-Art In Video Processing: Compression, Optimization And Retrieval

Video compression plays a vital role in modern social media networking and its plethora of multimedia applications. It enables the transmission medium to transfer videos competently and allows resources to store them efficiently. High-resolution video data are now sent over high-bit-rate communication channels so that multiple compressed videos can be carried at once. Transmission capability and the storage of compressed video have both advanced, and compression remains the primary task in multimedia services. This paper summarizes the compression standards and describes the main concepts involved in video coding. Video compression converts the large raw bit stream of a video sequence into a compact one, achieving a high compression ratio with good perceptual quality. Removing redundant information is the main task in compressing a video sequence. The survey focuses on block matching algorithms, quantization and entropy coding. It finds that many of the methods are computationally complex and need improvement through optimization.


Introduction
Video data is the digitized representation of a pictorial scene together with its audio. It is sampled both spatially and temporally. Social media use has become a necessary daily activity for accessing news and information, interacting and making decisions. Facebook, YouTube, Instagram, Tumblr, WhatsApp, WeChat and others are the most popular social media platforms, where a billion users share video files. Netflix, an American media service provider, streams video at an adaptive bitrate: it adjusts the video and audio quality according to the condition and speed of the user's broadband connection. Sharing video information concisely, in a few seconds, captures the viewer's interest. Video data representation involves sampling spatially within a frame on a rectangular grid and temporally across the sequence of frames at regular time intervals. A complete visual scene is sampled at a point in time to generate a frame, which consists of odd- and even-numbered spatially sampled lines. Sampling is repeated at intervals of 0.04 or 0.03 seconds to generate a moving video signal.
Each pixel in the spatio-temporal space is represented by a set of luminance (brightness) and chrominance (color) values. The YCbCr color space, with components Y (luma), Cb (blue chroma) and Cr (red chroma), has an advantage over RGB. Because the human visual system is less sensitive to color than to luminance, the Cb and Cr components can be represented at a lower resolution than Y. Thus, the data used to represent the chrominance components can be reduced without affecting the visual quality [3]. Video represented in YCbCr with reduced chroma resolution shows no obvious difference from RGB. To reduce storage space and transmission requirements, RGB images are therefore converted to YCbCr and the chroma components are represented at low resolution. The process of compressing video data is known as video coding. A smooth transition between the images of a video scene requires a higher temporal sampling rate. A video is a sequence of images, called frames, displayed at a frequency known as the frame rate and measured in frames per second (fps). Video coding standards usually support 24 fps and 30 fps video and are developed with the aim of high coding efficiency. Video coding is the compression and/or decompression of a video signal, encoding and decoding the video data at the lowest bit rate without compromising video quality. Two quality metrics are used to measure coding efficiency: Peak Signal to Noise Ratio (PSNR), an objective metric, and perceived video quality, a subjective metric.
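The color conversion and chroma reduction described above can be illustrated with a minimal Python sketch. The paper does not state which conversion matrix is used, so the full-range BT.601 (JFIF) coefficients below are an assumption, as are the helper names `rgb_to_ycbcr`, `subsample_420` and `sample_counts`.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr (assumed: full-range BT.601/JFIF)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample_420(plane):
    """Halve a chroma plane in both directions by averaging 2x2 neighbourhoods.

    The plane is a list of rows; even width and height are assumed.
    """
    h, w = len(plane), len(plane[0])
    return [
        [(plane[i][j] + plane[i][j + 1] + plane[i + 1][j] + plane[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def sample_counts(width, height):
    """Return (samples for 4:4:4, samples for 4:2:0) for one frame."""
    full = 3 * width * height
    reduced = width * height + 2 * (width // 2) * (height // 2)
    return full, reduced
```

With 4:2:0 subsampling the luma plane keeps its full resolution while each chroma plane keeps a quarter of its samples, so a frame needs half as many samples as in 4:4:4.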

Motivation
Research in video compression is not new and has continued for almost four decades. Video-on-demand communication has increased tremendously in the last decade, so the area still attracts researchers. Motion estimation and compensation is also an old topic, but one that still needs optimization. For these optimization problems, deep learning algorithms, evolutionary algorithms, and bio-inspired and nature-inspired algorithms can be explored to move beyond the regular block matching search patterns and improve accuracy. Throughout these decades, the goal of video compression research has been a high compression ratio with good video quality.

Video Compression Basics
Video data consists of an ordered sequence of groups of pictures. Many surveillance applications share the available network bandwidth infrequently with others. Compression techniques reduce the bit rate of a video file. When the compression ratio is high, little bandwidth is consumed. Increasing the compression may also increase the degradation, which appears as artifacts.
Compression is done by reducing the image data of the video sequence in order to reduce the media overhead of distribution. Compression techniques basically compare adjacent frames, reduce color resolution relative to the predominant light intensity, and remove imperceptible slices, outliers and noise. The result is a significantly reduced file size for the video sequence while visual quality is maintained. There are two types of compression: lossy and lossless. A lossless compression algorithm removes statistical redundancy and allows the original data to be reconstructed exactly from the compressed data. It achieves only limited data reduction because few techniques are available. Lossy compression discards part of the original data, so the original cannot be reconstructed exactly during decompression. Latency is the delay, or time taken, for the applied algorithm to interpret the video data and show the video on the display. Latency increases as the compression algorithm becomes more advanced, because it compares adjacent frames. The main objective of video compression algorithms is to attain the best compression ratio with minimal distortion. Removing spatial, temporal and frequency-domain redundancy makes it possible to compress the data meaningfully, at the cost of a certain loss of information. Compression is then completed by coding schemes such as arithmetic coding and Huffman coding. The Discrete Cosine Transform (DCT) together with Motion Compensation (MC) are the most widely used video coding techniques. Block-based motion compensation with DCT, known as motion-compensated DCT video coding, is mostly used in the H.26x and MPEG formats.

Standard Video Codec Lexicon
The following terms are used in understanding video coding standards.
Frame - A group of pictures consists of three different types of frames: the I-frame (intra-coded picture/frame), the P-frame (inter-frame predicted picture/frame) and the B-frame (bi-directionally predicted picture).
Macroblock - A region of size 16 x 16 which consists of four Y (luminance) blocks, one Cr (red chroma) block and one Cb (blue chroma) block. In chroma processing, the color accuracy is compared to that of ten-bit YCbCr for formats such as 4:4:4 and 4:2:0 [31].
Block - A region of size 8 x 8 in a picture or frame. The Discrete Cosine Transform (DCT) is used to code the block.
Motion vector (MV) - A key element in the process of motion estimation. It represents a macroblock in a picture (frame) based on its position in another picture, known as the reference frame. The motion vector is found by calculating the correspondence between the frames at times t and t-1. A vector diagram depicts the direction, magnitude and velocity of a moving object. Motion vectors store the changes in a block. A motion vector acts like a two-dimensional pointer that tells the decoder whether the predicted macroblock is located left, right, up or down relative to the reference frame.
Motion estimation - A video compression scheme which analyzes two frames and uses motion vectors to identify whether macroblocks have changed. It examines the movement of objects from one frame to another. It exploits the redundancy between the reference frame and the subsequent frame by finding the best prediction of each macroblock. Compressing a video using motion estimation is referred to as interframe coding.
Motion compensation - Based on knowledge of the moving objects, motion compensation exploits the high correlation between successive frames of a video sequence.

Existing Video Coding Standards or Compression Formats
Many standards have previously been developed to compress images and videos. MPEG is the basic family of compression standards associated with digital audio-visual sequences. The following is a collection of compression standard formats:
H.120 - The first video coding standard, established by the ITU-T organization, with bit rates of 1544 and 2048 kbps.
JPEG-2000 - It uses the wavelet transform instead of the DCT. It removes the blockiness of JPEG, which is replaced by a fuzzy picture.
Motion JPEG 2000 - A still-picture compression technique used to represent a video sequence with a better compression ratio than JPEG. This technique was unsuccessful because of the poor viewing experience of the video stream.
H.261 or H.263 - Designed for video conferencing through low-bandwidth telephone lines, but not suitable for video encoding.
MPEG-1 - Developed to code motion pictures with associated audio. It also supports storage media at 1.5 Mbps.
MPEG-2 - Designed to code audio-visual sequences with good video quality at a higher bit rate than MPEG-1.
H.263 - A coding standard developed for video conferencing. This codec is optimized for low data rates with relatively small motion.
MPEG-4 - Used to represent audio-visual data in terms of objects.
MPEG-7 - A multimedia content description standard that allows fast searching and efficient retrieval. It uses XML to store metadata and does not perform encoding. The contents and events of the video stream can be tagged for intelligent processing in video management.
H.264 - Its goal is to provide high coding efficiency, in the form of a high compression ratio, adapted to the network environment. It also delivers good video quality at low and high latency, at low and high bit rates, and for low and high resolutions. The applications of H.264 include streaming or video-on-demand services, multimedia messaging services and conversational services.
H.265 - Also known as the High Efficiency Video Coding (HEVC) standard [75], which specifies the decoding format along with the encoded video sequence involved in the video compression process. It also defines the syntax of the compressed sequence.

Video Compression Essentials
A video sequence consists of groups of pictures (frames), which in turn comprise three different frame types, the I, P and B frames, as mentioned in Section 2. The intra-coded frame (I-frame) exploits the spatial redundancy that exists between neighboring pixels within a frame. The inter-coded frame (P-frame) exploits the temporal redundancy between consecutive frames in a sequence. The I-frame is encoded by itself, with Huffman coding applied to its quantized DCT coefficients. In [32], moving objects and motion failures are detected by an adaptive thresholding and differencing approach, combined with particle swarm optimization-based motion estimation to remove temporal redundancy.
The video codec model performs transformation, quantization, block-based motion estimation and motion compensation, and entropy coding. Video coding exploits both temporal and spatial redundancy to achieve high compression. The temporal model, the spatial model and the entropy encoder are the three functional units of a video encoder. The entropy encoder compresses the motion vectors and the coefficients produced by the temporal and spatial models. Fig. 1 illustrates the system that performs video compression.

Figure 1. Video Compression System.
The encoder compresses the video sequence and creates a compressed video bit stream which may be used for storage or transmission. The video encoder involves the following steps: (i) partition the picture (frame) into multiple units, namely Prediction Units (PU) [15], Coding Units (CU) and Coding Tree Units (CTU); (ii) perform inter and intra prediction and subtract the prediction from the unit; (iii) transform and quantize the residual errors; (iv) encode the transform output, the prediction information and the header information. The video decoder decompresses the compressed bit stream and reconstructs the frames. The decoder steps are (i) entropy decoding and extraction of the sequence elements; (ii) rescaling and inverse transformation; (iii) adding the prediction to the inverse transform output of each prediction unit; and (iv) reconstructing the decoded sequence.
Each coded video frame, or picture, is partitioned into slices. Each slice consists of several macroblocks or Coding Tree Units (CTUs) with a maximum size of 64x64 pixels. A CTU is further divided into Coding Units (CUs), as shown in Fig. 2. Each CU is further split [9] into one or more Prediction Units (PUs), each of which is predicted by either intra prediction or inter prediction. PU modes and CU combinations [4] are examined iteratively to select the optimal Rate Distortion (RD) cost of the CTU [6,17]. A quadtree with a multi-type tree structure recursively partitions the slice and generates flexible block sizes [7,8]. The Bjøntegaard Delta Bit Rate (BD-BR) measurement is used to compare the performance of video encoders. A negative value indicates by how much the bit rate is reduced, whereas a positive value indicates that the bit rate is increased for the same PSNR value. CTU rate control [12] estimates the allocation of bits to each CTU by exploiting the correlation between quantization parameters and features. A deep neural network based model improves the efficiency of CTU-level rate control in video coding.
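The recursive splitting of a CTU into smaller CUs can be sketched as follows. This is only an illustration: real encoders choose the partition by comparing rate-distortion costs, whereas the sketch uses block variance against a threshold as a stand-in, and the names `block_variance` and `quadtree_partition` and all parameters are hypothetical.

```python
def block_variance(frame, x, y, size):
    """Variance of the pixel values in the size x size block at column x, row y."""
    vals = [frame[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def quadtree_partition(frame, x, y, size, min_size, threshold):
    """Return the leaf blocks (x, y, size) of a quadtree over one square region.

    A block is kept whole when it is flat enough (variance <= threshold) or
    has reached min_size; otherwise it is split into four equal quadrants.
    Variance is an assumed stand-in for the encoder's rate-distortion cost.
    """
    if size <= min_size or block_variance(frame, x, y, size) <= threshold:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_partition(frame, x + dx, y + dy, half, min_size, threshold)
    return leaves
```

Flat regions stay as large blocks while detailed regions are split further, which is the behaviour that flexible block sizes are meant to provide.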

Intra Prediction
Each prediction unit is predicted from neighboring pixel information within the same frame. In intra prediction coding [20], the first frame is divided into 8 x 8 blocks, as shown in Fig. 2. The Discrete Cosine Transform (DCT) and quantization are then applied to each block to obtain the DC (direct current) and AC (alternating current) coefficients. All DC and AC coefficients are then scanned in zigzag order so that run-length coding can be performed. Finally, entropy coding is carried out with the Huffman algorithm. A neural network based linear model for intra prediction is built in [19,37] to demonstrate its performance in Versatile Video Coding against conventional methods. A neural network combined with the Gray Level Co-occurrence Matrix (GLCM) [5] is used for intra prediction with a flexible quadtree CU. CUs are classified into natural content and screen content using a decision tree model [10] to reduce the complexity of splitting homogeneous CUs. To speed up encoding, an early CU decision scheme is adopted, which skips the splitting procedure for CUs that fall below a threshold value.
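The zigzag scan and run-length step mentioned above can be sketched in a few lines of Python. The (zero-run, value) pair format with an end-of-block marker is an assumption borrowed from JPEG-style coders, and the function names are illustrative only.

```python
def zigzag_indices(n):
    """Return the (row, col) visiting order of an n x n zigzag scan."""
    order = []
    for s in range(2 * n - 1):
        # All positions on the anti-diagonal row + col == s, row increasing.
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()  # even diagonals are walked bottom-left to top-right
        order.extend(diag)
    return order


def zigzag_scan(block):
    """Flatten a square block of quantized coefficients in zigzag order."""
    return [block[i][j] for i, j in zigzag_indices(len(block))]


def run_length_encode(coeffs):
    """Encode a coefficient list as (zero_run, value) pairs ending with 'EOB'.

    Trailing zeros are absorbed by the end-of-block marker.
    """
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    pairs.append("EOB")
    return pairs
```

Because quantization leaves most high-frequency coefficients at zero, the zigzag order groups those zeros at the end of the list, where a single end-of-block marker replaces them.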

Inter Prediction
Each prediction unit is predicted from the data of neighboring frames using motion compensation. Succeeding frames undergo inter prediction coding in order to eliminate the temporal redundancy that exists between neighboring frames. Frame rate up-conversion [11] enhances the temporal resolution of the original video, converting the frame rate between different systems by interpolating frames between consecutive ones. The encoding process of HEVC inter prediction [14,20] computes the quadtree partitioning of the CTU and evaluates its hierarchy using rate distortion optimization.

Transformation
DCT and DWT are important transform-based multimedia compression schemes. The Discrete Cosine Transform (DCT) is the basic transform used to compress an image or video. It is widely accepted for signal compression because of its robustness and is the transform most extensively used for coding in video compression. The main purpose of the DCT technique [48] is to transform the M dimensions of a video or image into N dimensions and to compact the energy, especially into the low-frequency coefficients. Groups of correlated coefficients can also be decorrelated using the cosine transform.
In a video bit stream, the DCT transforms the residual samples and the resulting coefficients are quantized. The quantized coefficients are then traversed in zigzag fashion, with the aim of splitting the DCT coefficients into two sub-streams by macroblock-level partitioning. The video is fragmented into frames whenever it is sensed by a node [33]. Every frame in the RGB color domain is converted into the luminance-chrominance (YCbCr) color domain so as to eliminate the redundant components of the RGB color space, which are inefficient for both storage and transmission. An 8 x 8 block is selected from the luminance-chrominance color space to undergo the DCT by measuring the rows at the end. In this conversion, the original pixels are converted into spatial frequencies for further processing. The Inverse Discrete Cosine Transform decodes the spatial frequencies back into pixel values. In [38], the nonsubsampled contourlet transform is combined with Huffman coding followed by run-length coding, achieving high compression performance. The contourlet transform captures the distinct shapes in a picture in different directions.
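As an illustration of the forward and inverse transform described here, the following pure-Python sketch applies an orthonormal two-dimensional DCT-II to a square block. The orthonormal scaling is an assumption (codecs use scaled integer approximations), and the helper names are hypothetical.

```python
import math


def dct_matrix(n=8):
    """Return the n x n orthonormal DCT-II basis matrix C, so that X = C x C^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2(block):
    """Forward 2-D DCT: pixel values to spatial-frequency coefficients."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2-D DCT: spatial-frequency coefficients back to pixel values."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)
```

For a flat 8 x 8 block all of the energy ends up in the single DC coefficient, which is the energy compaction property that makes the later quantization and entropy coding effective.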

In Loop Filtering
When video files are compressed with lossy coding, quality degradation occurs. In-loop filters are used in video coding standards to improve the visual quality of the video [63]. To improve video quality, the key frame (I-frame) is sent to the in-loop filtering system of the encoder [1], which reduces the noise in the video frames. Many works address the deblocking filter, CU partitioning and adaptive filtering. Deep learning techniques [2] are also used to achieve better performance.

Motion Compensation
Motion compensation (MC) is mainly used to reduce the temporally redundant information of the current frame with respect to a reference frame, in order to achieve a high compression ratio. MC predicts the current P or B frame from the reference frame [35] and encodes the prediction error. Segmentation [16] can also be applied in motion compensation to predict the object edges within a block accurately. The stages of motion compensation are (i) motion estimation between the current frame and the previously reconstructed frame; (ii) prediction of the current frame; and (iii) encoding of the difference between the prediction and the original frame, i.e. the prediction error.

Motion Estimation
Motion estimation (ME) is the process of identifying, for the current frame, matching blocks in neighboring frames (either previous or future frames). ME can be performed with various techniques and is the basis of interframe encoding in video compression. ME is the initial step of compression and computes the motion vectors, that is, the displacement values, of each pixel. Positive motion vector values indicate movement to the right or downward; negative values indicate movement to the left or upward.
In the MPEG-4 formats, block-based ME and motion compensation (MC) are adapted to the arbitrary shape of the video object plane. A bounding box is selected around the video object plane of a frame and motion is estimated within its macroblocks (MBs). Block-based ME is preferred for the current MB when motion is estimated within interior MBs. When motion is estimated at the boundary, polygon matching, a modified block-based ME, is preferred. In polygon matching, only the pixels of the current MB that belong to the video object plane are used to calculate the distortion measure. ME is an interframe prediction process that includes the pel-recursive algorithm and the block matching algorithm. Block-matching motion estimation assumes that the object motion being predicted is rigid and non-rotational.
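A minimal sketch of exhaustive (full-search) block matching is given below. It assumes the Sum of Absolute Differences as the matching cost and a square search window of plus or minus p pixels; the function names and the convention that the vector points from the current block to its match in the reference frame are choices made for this illustration.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between two n x n blocks.

    (cx, cy) is the top-left corner in the current frame and (rx, ry) the
    top-left corner of the candidate block in the reference frame.
    """
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(n) for j in range(n))


def full_search(cur, ref, cx, cy, n, p):
    """Return (dx, dy, cost) of the best match within a +/- p pixel window.

    Positive dx means the match lies to the right and positive dy means it
    lies below the block position; candidates outside the frame are skipped.
    """
    h, w = len(ref), len(ref[0])
    best = (0, 0, sad(cur, ref, cx, cy, cx, cy, n))
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > w or ry + n > h:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if cost < best[2]:
                best = (dx, dy, cost)
    return best
```

The full search evaluates up to (2p + 1) x (2p + 1) candidate positions per block, which is why faster search patterns and optimization-based methods are studied as alternatives.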

Block based Motion Estimation
Matching blocks of the reference frame and the current frame is the main step in detecting temporal and spatial redundancy. Each video sequence consists of several GOPs and each GOP consists of any number of the three frame types. An I-frame indicates a scene change. Frames are split into macroblocks of size 8x8 or 16x16. For each macroblock of the current frame, the best matching macroblock in the reference frame is found by one of the block-based ME algorithms, as shown in Fig. 3. The quality of the best matched macroblock is assessed by the Mean Squared Error (MSE), Mean Absolute Difference (MAD), Sum of Absolute Differences (SAD) and Peak Signal to Noise Ratio (PSNR). The similarity measure between macroblock regions of frame f_t and frame f_{t-1} is computed as

\[ D(d_1, d_2) = \sum_{(n_1, n_2) \in B} E\big( f_t(x_1 + n_1,\, x_2 + n_2),\; f_{t-1}(x_1 + n_1 + d_1,\, x_2 + n_2 + d_2) \big) \tag{1} \]

where E is the expected value, (n_1, n_2) is a pixel of the block B located at (x_1, x_2), and (d_1, d_2) is the displacement. The search region of a frame has size (2p + 1) x (2p + 1).

Figure 3. Block matching: current frame f_t, reference frame f_{t-1}, search area, best matched block and motion vector.

The Mean Squared Error (MSE) is the mean of the squared differences between the reference and the current frame. Quality is high only when the MSE is low, that is, when the error is small.

\[ \mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \big( f(i, j) - \hat{f}(i, j) \big)^2 \tag{2} \]

Here the squared difference between the pixel value at (i, j) of the original frame f and of the reconstructed frame \hat{f} is computed over an M x N frame. PSNR is used to measure the quality of the reconstructed picture at the decoder. It is based on the ratio of the squared maximum possible pixel value MAX_I (255 for 8-bit samples) to the Mean Squared Error. Ten times the base-10 logarithm of this ratio is the PSNR, expressed in decibels (dB).

\[ \mathrm{PSNR} = 10 \cdot \log_{10}\!\left( \frac{MAX_I^2}{\mathrm{MSE}} \right) \tag{3} \]

A higher PSNR value indicates that higher quality is achieved in the reconstructed frame. The data compression ratio is defined as the ratio of the uncompressed original video data size to the compressed video data size. Compression reduces the data representation to the relevant size. For streamed audio and video data it is expressed as

\[ \text{Compression ratio} = \frac{\text{Uncompressed size}}{\text{Compressed size}} \tag{4} \]
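The quality and compression measures discussed in this section can be computed with a short Python sketch. It assumes 8-bit samples with a peak value of 255 and frames stored as lists of rows; the function names are illustrative.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two frames of equal size."""
    h, w = len(original), len(original[0])
    total = sum((original[i][j] - reconstructed[i][j]) ** 2
                for i in range(h) for j in range(w))
    return total / (h * w)


def psnr(original, reconstructed, max_value=255):
    """Peak signal-to-noise ratio in dB; infinite when the frames are identical."""
    err = mse(original, reconstructed)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / err)


def compression_ratio(uncompressed_size, compressed_size):
    """Ratio of uncompressed size to compressed size (same unit for both)."""
    return uncompressed_size / compressed_size
```

For example, a frame that differs from the original by an MSE of 1.0 has a PSNR of about 48.13 dB, and a 100 MB raw clip stored in 4 MB has a compression ratio of 25.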

Quantization
Quantization is the process of mapping input source symbols to a smaller set of possible output values in such a way that the picture reconstructed after lossy compression stays close to the original. The DCT outputs the coefficients to be quantized. The quantization process uses a quantization parameter (QP) which scales down the DCT coefficient values. Scalar quantization and vector quantization are the two types of quantization used in compression techniques [36]. Defining the quantization step size is an important part of a video coding standard's compression efficiency. When the step size is coarse, the compression ratio is high but the quality of the reconstructed picture is lower. When the step size is small, compression efficiency is lower. Scalar quantization maps one input signal value to a single quantized output value and is performed together with transform coding. A scalar quantizer may be uniform (the same step size throughout) or non-uniform (different step sizes). Vector quantization maps a vector (a set of input values) to a set of quantized values. A survey of various quantization techniques is shown in Table 2.
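The effect of the step size on a uniform scalar quantizer can be seen in the following sketch. It is a simplified illustration: a single step size is applied to every coefficient, whereas real codecs derive the step from the QP and may use per-frequency scaling. The function names are hypothetical.

```python
def quantize(coeffs, step):
    """Map each coefficient to the index of its nearest multiple of step."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


def dequantize(levels, step):
    """Reconstruct approximate coefficients from quantization indices."""
    return [[q * step for q in row] for row in levels]


def max_error(coeffs, step):
    """Largest absolute reconstruction error for the given step size."""
    recon = dequantize(quantize(coeffs, step), step)
    return max(abs(c - r) for row_c, row_r in zip(coeffs, recon)
               for c, r in zip(row_c, row_r))
```

With the coefficients 100, 13, -7 and 3, a step of 10 yields the levels 10, 1, -1 and 0 with a maximum error of 3, while a step of 4 yields 25, 3, -2 and 1 with a maximum error of 1: the coarser step produces smaller numbers and more zeros to encode at the price of a larger error.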

Entropy Encoding
Encoding follows the quantization operation and is performed with schemes that reduce the transmission energy. Each frame of the video sequence is encoded so as to reduce the block size and the statistically redundant information. Entropy coding determines the smallest number of bits needed to represent the information without any distortion. It converts the motion vectors and transform coefficients into a compressed bit stream for storage or transmission. High-dimensional multimedia data takes more time to encode and decode. Variable Length Coding (VLC), Huffman coding and arithmetic coding are the broadly used entropy encoders. Context-adaptive VLC (CAVLC) and context-adaptive binary arithmetic coding save bit rate in compression. The H.264 standard uses the CAVLC entropy encoding technique, which offers low complexity. A fast and efficient compression mechanism is needed for video codecs. Machine learning algorithms can be used for encoding high and varying resolutions. The multi-reconstruction recurrent residual network (MRRN) [62,64] extracts the features of artifacts and feeds them into a CNN reconstruction model to achieve content-related restoration at different denoising ratios. ML algorithms involve high computation cost, enormous storage space and communication overhead. Given these constraints, researchers are motivated to design enhanced encoding methods. When the size of the quantized bit stream grows during encoding [33], the average number of bits that can be truncated falls, which in turn hampers transmission efficiency. Therefore, to increase the average number of truncated bits during encoding, the quantized bit stream can be partitioned into two, three or more parts. In this scheme, half of the total length is taken as the quantized bit stream size, so that a threshold value can be fixed.
Huffman encoding and arithmetic coding have been implemented in hardware to parallelize and optimize coding, but bottlenecks such as complex computation still exist. A Huffman tree is built using a heap data structure and stored in a memory block, which is then used to generate the codeword length of every symbol. In Huffman decoding, the variable-length codes are concatenated in the encoded bit stream, which makes decoding more difficult. The maximum codeword length is 255, and the codewords share prefix codes with the canonical code. The main drawback of the classical approach is that it looks up the memory several times while the canonical codes extract the common prefix code. Various entropy coding algorithms used in compression are comparatively analyzed in Table 3.

(Continuation of the preceding table row) Improves accuracy of detecting objects in inter and intra frames with less distortion.
[49] Generative Adversarial Network. Description: Blocks are downsampled in prediction. After coding at a low bit rate, the signals are upsampled in the decoder to match the original resolution, producing good video quality. Advantage: Higher network training performance and flexibility. Limitation: Feature extraction is more complex and time consuming in arriving at efficient features.
[51] Spatiotemporal knowledge distillation. Description: Distills the inherent knowledge from a complex model to a simple one for low-resolution dynamic saliency estimation. Advantage: Redundancies at the inter and intra level are removed step by step, producing high accuracy at low computational cost. Limitation: Temporal cues are computationally expensive.
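The heap-based Huffman construction and the canonical code assignment described in this section can be sketched as follows. This is a software illustration using Python's standard `heapq` module, not the hardware design discussed in the cited works; the function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data):
    """Return {symbol: codeword length} using a min-heap of subtree weights."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        return {sym: 1 for sym in freq}
    # Each heap entry is (weight, tie-breaker id, {symbol: depth so far}).
    heap = [(count, idx, {sym: 0}) for idx, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in d1.items()}
        merged.update({s: depth + 1 for s, depth in d2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def canonical_codes(lengths):
    """Assign canonical codewords: sorted by length, then by symbol."""
    code, prev_len, out = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)
        out[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return out
```

Because canonical codes are fully determined by the codeword lengths, only the lengths need to be stored or transmitted, which is what makes the length table produced by the heap stage sufficient for the decoder.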

Optimization Techniques
Optimization can be defined as the process of refining a solution so as to find the most efficient method. Evolutionary algorithms, nature-inspired algorithms [41] and swarm intelligence algorithms are the most preferred optimization techniques.
Optical flow, block matching [42,43], recursive methods, various transform techniques and adaptive PSO techniques are used to optimize motion estimation. ME-based video compression saves bits by sending encoded difference images, which have lower entropy, instead of fully coded frames. The ME process is the most computationally expensive and resource-intensive operation. Hence fast and computationally inexpensive ME algorithms are needed. Harmony search, simulated annealing and ray optimization are physics- or chemistry-based algorithms. Table 4 compares different block matching algorithms by their search points and PSNR values for the Claire input video sequence.

Video Retrieval
Video can be represented as objects moving in space and time, and consists of pictures (frames), audio, metadata and captions. Automatic indexing and retrieval of video units is a recent topic in multimedia research. Segmentation splits a video into several candidate units to improve indexing. High-definition video compression and video summarization can be performed using video object segmentation [22] and video object tracking mechanisms. In [61,64], content-based video indexing and retrieval (CBVIR) is organized into four concepts: video segmentation, indexing, dimensionality reduction and machine learning (ML) techniques. The video is segmented into shots, the fundamental unit, to extract the information describing the frames, using shot boundary detection [46]. Redundant frames are eliminated and keyframes are identified for video summarization. Segmentation approaches include cut-based, machine learning, color-based and entropy-based methods. Features are extracted [45] with color-based, visual, motion-based and texture approaches. Algorithms such as k-means, Principal Component Analysis (PCA) and document-frequency-based methods are used to reduce dimensionality. The machine learning algorithms used for CBVIR include nearest-neighbor techniques, RF, Softmax, SVM, NN, MV-based methods, regression, learning-based, tree-based, density-based and probability boosting methods, querying and labelling methods, and clustering techniques such as fuzzy and hierarchical methods.
Quality of Experience (QoE) [72,74] is measured from user satisfaction with the videos retrieved from the social cloud. QoE can be assessed in two ways, subjectively and objectively. Video degradation falls under subjective QoE, whereas the impact of technical data loss on video quality falls under objective QoE. YouTube videos in .wmv and .mp4 formats, at 240p and 360p respectively, can be downloaded through https://www.tubeoffline.com/ [73]. When these videos are uploaded to a social cloud, they are automatically compressed, which lowers their quality and reduces the file size for storage. A video file is described by parameters such as frame width, frame height, data rate, bit rate, frame rate, audio bit rate, storage size, length, codec type and file type. In [40], the video quality parameters are measured across differently compressed and distorted video sequences.
Social clouds such as Tumblr and Qzone video provided good quality of service compared to YouTube, with increased QoE and less noise and blur at low transmission bit rates. Cloud resources perform transcoding as a service [39] to reduce complexity at the user level. The cloud transcoder splits a video into chunks of different sizes made up of GOPs, processes each chunk individually and merges them into one compressed video. A mobile video streaming platform [50] can also be built in collaboration with the cloud environment to process video. A large amount of data is transmitted when the frame resolution is high. Mobile and edge devices with dynamic transmission rates are handled by the adaptive algorithm implemented in Netvision [55], which addresses the on-demand video scheduling problem and queries, and optimizes the response time. In [57], a lookup table, a history tree and a map table are built to improve query response time while keeping space complexity under control. Table 5 summarizes video compression and retrieval approaches together with the basic quality metrics they use.
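As one illustration of color-based shot segmentation, the sketch below marks a shot boundary wherever the intensity histograms of consecutive frames differ strongly. The histogram-difference rule, the bin count and the threshold are assumptions made for this example and are not taken from the surveyed methods; frames are assumed to be lists of rows of 8-bit integer values.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of one frame (list of rows of ints)."""
    hist = [0] * bins
    width = max_value // bins
    count = 0
    for row in frame:
        for v in row:
            hist[min(v // width, bins - 1)] += 1
            count += 1
    return [h / count for h in hist]


def shot_boundaries(frames, threshold=0.5):
    """Return the indices of frames that start a new shot.

    A boundary is declared when the L1 distance between the histograms of
    consecutive frames exceeds the (assumed) threshold.
    """
    cuts = []
    prev = histogram(frames[0])
    for idx in range(1, len(frames)):
        cur = histogram(frames[idx])
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            cuts.append(idx)
        prev = cur
    return cuts
```

The first frame of each detected shot could then serve as a simple keyframe candidate for summarization or indexing.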

Challenges
As more video traffic moves to sequences at 4K resolution, several challenges arise.
Pre-sequence transcoding pecking: Providers fix the resolutions and bit rates of the media, but a single size does not fit all media services. Movies have different bit rates combined with perceived quality metrics, and reducing the bit rate of encoded video is the major research area in providing video services without affecting quality.
Adopting new standards: Newly developed video standards provide a high compression ratio compromising the video quality. The compression performance of new standards triggers demand for transcoding.

Future Directions
Although many papers were reviewed, many of the methods were found unable to identify the redundant features and the correlated frames of the original frame in order to perform transformations. Machine learning algorithms are more useful in content-based video retrieval because they preserve geometric structures and features. Semantic labelling for video indexing can be considered, as it could explore correlations and thereby improve annotation performance.
Deep learning based compression of images or video is expected to be pivotal for representing good-quality video at a lower bit rate. The following open issues need further examination.
Decision making during encoding or motion estimation is a complicated problem in the development of a compression standard. To tackle it, learning-based algorithms such as active learning, reinforcement learning, deep learning, ensemble learning and transfer learning can be applied to improve the performance of the coding operation.
Memory- and computation-efficient video codec design, semantics-oriented video compression, and compaction of feature descriptors and visual content with a deep learning based framework. CNN model compression is also a multi-variable optimization problem, which should be optimized jointly considering the computational cost, the CNN performance [18,19] and the rates used for CNN transmission.
Automatic compression of media files in social clouds adds noise, blocking effects, blurring, information loss during quantization and dropped frames in a video sequence. This degradation lowers the quality of experience on the user side.
Many nature-inspired algorithms (NIA) have been implemented to optimize motion estimation. This paves the way for research on combining NIA with learning concepts to enhance the performance of motion vector computation. In future, hybridization of artificial bee colony and cuckoo search with differential evolution (DE) can serve as the base work for ME optimization using nature-inspired algorithms.

Conclusion
This paper presents a study of the popular video compression standards, the fundamentals required and the operations performed to compress a video. A comparative analysis of quality, together with its evaluation metrics, is discussed. Many aspects of prediction, variable block size coding, compression artifacts, perceptual and semantic processing, and optimization of block estimation can be further improved for high-resolution video sequences. Deep learning techniques can be used effectively to compute temporal coherence in predictive coding, but their time and memory complexity are high. Various block matching algorithms for motion estimation were studied. Artificial bee colony and cuckoo search with differential evolution were found to provide the best matching in motion estimation, with high PSNR values. These techniques reduce resource usage, quality degradation and time consumption. In summary, block matching optimization combined with deep learning based motion estimation and compensation is the major topic for exploration, given its computational and time complexity.