Analysing Effect of t-SNE and 1-D CNN on Performance of Hyperspectral Image Classification

Feature extraction is a crucial step in hyperspectral image (HSI) classification that aids in processing data effectively without losing relevant information. This step is essential when dealing with high-dimensional images because they suffer from the Hughes phenomenon, or the curse of dimensionality, which occurs in high-dimensional datasets where the number of training samples is limited. In this paper, we study the influence of feature extraction techniques on HSI classification. We compare the efficiency of three widely used techniques, namely Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE) and Convolutional Neural Network (CNN). The overall classification accuracy for PCA when used with KNN, a commonly used classification algorithm, was found to be 69.79%, while t-SNE with KNN reached 71.04%. When CNN was used for feature extraction, it outperformed t-SNE and PCA by a wide margin, with classification accuracy reaching as high as 95.06%.

Keywords: feature extraction, convolutional neural network, t-SNE, principal component analysis, hyperspectral image classification


I. INTRODUCTION
Hyperspectral image classification is an emerging technology applied in geology, mining, ecology and surveillance. Each pixel in the image contains the entire spectrum of the scene which aids in accurate mining of all available information from the scene.
Machine learning problems with high-dimensional data would need an enormous quantity of training data, yet the number of available samples is often limited. For a fixed number of training samples, the predictive accuracy of the algorithm first increases as the number of features increases but then decreases, which is known as the Hughes phenomenon [1].
Since a scene contains a huge amount of information and only a small number of samples is available, hyperspectral image classification becomes a daunting task. This issue is addressed by reducing the complexity of the dataset using feature extraction [2].
In feature extraction, the number of features in the dataset is reduced by constructing new features from the original ones [3]. The new, reduced set of features retains the information present in the original features.
Principal Component Analysis (PCA) is an extensively used linear feature extraction technique in which combinations of the input features that capture the available information are obtained from the input [4]. PCA does this by preserving the parts of the data that exhibit maximum variance.
t-distributed Stochastic Neighbor Embedding (t-SNE) is a manifold learning feature extraction technique used particularly for high-dimensional datasets [5]. Unlike PCA, which is a deterministic linear transformation, t-SNE is probabilistic. It is a variation of stochastic neighbor embedding that produces more useful visualizations by reducing the tendency of points to crowd in the middle of the plot [6].
The use of deep learning has increased significantly in recent years due to its exceptional performance in terms of classification accuracy. For image recognition and classification challenges, Convolutional Neural Networks (CNNs) are used extensively. The human visual system has influenced the construction of the CNN architecture [7]. CNNs prove to be an excellent combination of feature extractor and classifier.
In this paper, we assess the influence of feature extraction on HSI classification by comparing the performance of three different classes of feature extraction (FE) techniques. The major contributions of this paper are:
1) Identify whether PCA, t-SNE or CNN provides better classification accuracy.
2) Verify the results obtained using other datasets.

II. METHODOLOGY
Three feature extraction techniques belonging to different categories are considered.

A. Principal Component Analysis
The adjacent bands of a hyperspectral image are highly correlated and contain redundant information. PCA finds the optimum linear combination of the bands of the image which expresses the variation of image pixel values [8].
To perform PCA, the data first has to be standardized so that it has mean 0 and standard deviation 1: the average of the pixel values is subtracted from each pixel and the result is divided by the standard deviation. This is followed by calculating the covariance matrix of the input image. Covariance is obtained using the formula

$C = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T$

where $x_i$ is the image pixel vector, $\bar{x}$ is the mean pixel vector (zero after standardization), and $N = a \times b$, where a is the total number of rows and b is the total number of columns.
The eigendecomposition of the covariance matrix is then obtained. The eigenvectors are ranked in descending order of their eigenvalues, that is, of the variance they explain. The top k eigenvectors, with k chosen from the scree plot, represent the new bands which are an orthogonal
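The PCA steps described in this subsection (standardize, form the covariance matrix, eigendecompose, keep the top k eigenvectors) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `pca_bands`, the per-band standardization, and the synthetic cube are assumptions made for the example.

```python
import numpy as np

def pca_bands(cube, k):
    """Project an (a, b, bands) cube onto its top-k principal components."""
    a, b, bands = cube.shape
    X = cube.reshape(a * b, bands).astype(float)   # N = a*b pixel vectors
    # Standardize each band to mean 0 and standard deviation 1 (assumed per band).
    std = X.std(axis=0)
    std[std == 0] = 1.0                            # guard against constant bands
    X = (X - X.mean(axis=0)) / std
    # Covariance matrix of the standardized pixel vectors (bands x bands).
    C = (X.T @ X) / X.shape[0]
    # eigh is meant for symmetric matrices; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]              # rank by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = X @ eigvecs[:, :k]                    # new, uncorrelated bands
    return scores.reshape(a, b, k), eigvals

rng = np.random.default_rng(0)
cube = rng.normal(size=(6, 5, 8))    # synthetic stand-in for a real scene
reduced, eigvals = pca_bands(cube, k=3)
```

Plotting `eigvals` against the component index gives the scree plot from which k would be read off.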

B. t-Distributed Stochastic Neighbour Embedding
t-SNE is a non-linear feature extraction technique used chiefly for high-dimensional datasets [9]. The algorithm works as follows. The similarity of pairs of data points is expressed as a probability in both the high-dimensional and the low-dimensional space. In the high-dimensional space, this similarity is the conditional probability that one point would choose another as its neighbour if neighbours were chosen in proportion to their probability density under a normal distribution centred at the first point. The mismatch between the two sets of probabilities is then minimized so that the low-dimensional points represent the original data as faithfully as possible. Concretely, the sum of the Kullback-Leibler divergences over all data points is minimized by gradient descent [10,11].
In t-SNE, the Student t-distribution is used in the low-dimensional space. The joint probability $q_{ij}$ for this distribution, which gives the pairwise similarities in the low-dimensional space, is defined as

$q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}$

where $y_i$ is the low-dimensional counterpart of data point i. The cost function in this case is defined as:

$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$

where $p_{ij}$ is the corresponding joint probability in the high-dimensional space. Thus t-SNE maps high-dimensional data to a low-dimensional space and reveals patterns in the data as clusters of points that are similar across their many features, which can then be analysed and classified.
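As a rough illustration of the quantities in this subsection, the sketch below computes the Student-t similarities of a low-dimensional embedding and the Kullback-Leibler cost against a given high-dimensional similarity matrix. It uses the standard definitions from the t-SNE literature; the function names and the random stand-in data are assumptions for the example, and in practice P would be built from Gaussian kernels whose widths are set by the perplexity.

```python
import numpy as np

def low_dim_affinities(Y):
    """Student-t joint probabilities q_ij for an (n, d) embedding Y."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dist)       # heavy-tailed kernel (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(inv, 0.0)        # a point is not its own neighbour
    return inv / inv.sum()            # normalize over all pairs k != l

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all pairs; zero entries of P contribute nothing."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
Q = low_dim_affinities(rng.normal(size=(5, 2)))
# Stand-in for the high-dimensional p_ij (hypothetical, for illustration only).
P = low_dim_affinities(rng.normal(size=(5, 2)))
cost = kl_divergence(P, Q)
```

Gradient descent on the embedding coordinates would repeatedly recompute Q and move the points so that `cost` decreases.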

C. 1-D CNN
Convolutional Neural Networks are extensively used in image processing and have been shown to perform very well for hyperspectral image classification. To implement a CNN for feature extraction, an architecture [12] with five layers is used. The network consists of an input layer, a convolutional layer, a max pooling layer, a fully connected layer and an output layer. Conventional CNNs use both spatial and spectral data for classification. To demonstrate the efficiency of the CNN, the spectral signature of each pixel is considered here.

Training:
We initialize the trainable parameters to values between -0.05 and 0.05. The training process includes two crucial steps: forward propagation and backward propagation. Forward propagation computes the classification result with the current parameters. Backward propagation updates the parameters after each iteration in order to minimize the cost function.

Forward propagation:
The hyperbolic tangent function is used as the activation function for the convolutional layer and the fully connected layer. The maximum function is used in the max pooling layer. Because the CNN acts as a multiclass classifier, the output of the FC layer is passed to a softmax layer, which produces a probability distribution over the classes to be identified. The batch size is fixed at 32.
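The forward pass through the five layers can be sketched in plain NumPy for a single pixel spectrum. This is a sketch under assumptions, not the architecture of [12]: the kernel count, kernel size, pooling size, hidden width and class count below are hypothetical, and the initialization is assumed to be uniform in the stated range.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(shape):
    # Values in [-0.05, 0.05]; a uniform distribution is an assumption here.
    return rng.uniform(-0.05, 0.05, size=shape)

def forward(x, params, pool=2):
    """x: (bands,) spectrum of one pixel. Returns class probabilities."""
    Wc, bc, Wf, bf, Wo, bo = params
    n_kernels, ksize = Wc.shape
    # Convolutional layer with valid padding: output length bands - ksize + 1.
    windows = np.lib.stride_tricks.sliding_window_view(x, ksize)
    conv = np.tanh(windows @ Wc.T + bc).T          # (n_kernels, length)
    # Max pooling over non-overlapping windows (any remainder is dropped).
    length = conv.shape[1] // pool
    pooled = conv[:, :length * pool].reshape(n_kernels, length, pool).max(axis=2)
    # Fully connected layer with tanh activation.
    hidden = np.tanh(Wf @ pooled.ravel() + bf)
    # Output layer followed by a numerically stable softmax.
    logits = Wo @ hidden + bo
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Hypothetical sizes chosen only for the example.
bands, n_kernels, ksize, hidden_units, n_classes = 200, 4, 11, 16, 16
flat = n_kernels * ((bands - ksize + 1) // 2)
params = (init((n_kernels, ksize)), init(n_kernels),
          init((hidden_units, flat)), init(hidden_units),
          init((n_classes, hidden_units)), init(n_classes))
probs = forward(rng.uniform(-1, 1, size=bands), params)
```

With freshly initialized weights the output is close to a uniform distribution over the classes; training sharpens it.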

Backward propagation:
The trainable parameters are updated using the gradient descent algorithm in back propagation. After each iteration, the updated weights are passed through each layer again, which reduces the cost. Mathematically, this amounts to computing the partial derivative of the cost with respect to the weights in each layer [13]. In the architecture, C1 and M2 act as trainable feature extractors. In the loss function, n is the number of samples used for training and Y is the desired output. As the number of iterations increases, the difference between the actual output and the desired output decreases until it reaches a minimum.
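The update rule can be illustrated on the output layer alone. The sketch below assumes a mean cross-entropy loss over the n training samples, which is a common pairing with a softmax output but is not confirmed by the text; the learning rate, data and sizes are likewise hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(W, X, Y):
    """Mean cross-entropy over n samples and its gradient with respect to W."""
    n = X.shape[0]
    P = softmax(X @ W)                        # actual output
    loss = -np.sum(Y * np.log(P + 1e-12)) / n
    grad = X.T @ (P - Y) / n                  # partial derivatives dJ/dW
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))                 # 64 samples, 10 features
labels = rng.integers(0, 3, size=64)
Y = np.eye(3)[labels]                         # desired output, one-hot
W = rng.uniform(-0.05, 0.05, size=(10, 3))
lr = 0.1                                      # hypothetical learning rate

losses = []
for _ in range(50):
    loss, grad = loss_and_grad(W, X, Y)
    losses.append(loss)
    W -= lr * grad                            # gradient descent step
```

In the full network the same step is applied to every layer, with the gradients obtained by the chain rule.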

A. Datasets
The Indian Pines dataset was obtained using the AVIRIS sensor. The region covered is north-western Indiana. The dataset consists of 220 spectral channels in the visible and infrared spectrum, covering the range 0.4 to 2.45 µm. The image scene has a spatial resolution of 20 m.
Development data is obtained by dividing the data into training and testing samples, which can be used for parameter tuning in the case of the CNN. Each pixel is scaled uniformly to the range -1 to 1.
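One simple way to realize the scaling to the range -1 to 1 is min-max scaling, sketched below. Using the global minimum and maximum of the data is an assumption; the text does not say whether scaling is done globally or per band.

```python
import numpy as np

def scale_to_unit_range(X):
    """Linearly map the values of X onto [-1, 1] using the global min and max."""
    lo, hi = X.min(), X.max()
    return 2.0 * (X - lo) / (hi - lo) - 1.0

pixels = np.array([[0.0, 5.0], [10.0, 2.5]])   # toy values, not real radiances
scaled = scale_to_unit_range(pixels)
```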
The other dataset, Salinas, was also obtained by the AVIRIS sensor. It captures the Salinas Valley scene and has a spatial resolution of 3.7 m. This scene consists of 220 spectral bands with 16 different classes.

B. PCA with KNN
The first k principal components, with k determined from the scree plot, are computed from the 200 original bands available in the image. The hyperspectral image pixel values are stored as a vector whose length is the total number of pixels.
The result of PCA is then used by the KNN classification algorithm. KNN stands for K nearest neighbors. Here, K is the number of nearest neighbor pixels that vote on the label of the chosen pixel. The measure used to find the closest points is Euclidean distance. The algorithm is run over different K values in order to find the optimum value exhibiting maximum accuracy.
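The KNN step and the search over K can be sketched as below. This is a minimal NumPy version for illustration, with synthetic two-class data standing in for the PCA features; the function names and the candidate K values are assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Label each test sample by majority vote among its k nearest neighbors."""
    # Squared Euclidean distance between every test and training sample.
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]
    # bincount needs non-negative integer labels; ties go to the smaller label.
    return np.array([np.bincount(v).argmax() for v in votes])

def best_k(X_train, y_train, X_val, y_val, candidates):
    """Return the K with the highest accuracy and all accuracies."""
    scores = {k: float(np.mean(knn_predict(X_train, y_train, X_val, k) == y_val))
              for k in candidates}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(0)
# Two well-separated synthetic classes in a 3-D feature space.
X_train = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(10, 1, (20, 3))])
y_train = np.array([0] * 20 + [1] * 20)
X_val = np.vstack([rng.normal(0, 1, (10, 3)), rng.normal(10, 1, (10, 3))])
y_val = np.array([0] * 10 + [1] * 10)
k_opt, scores = best_k(X_train, y_train, X_val, y_val, candidates=[1, 3, 5])
```

On real hyperspectral features the accuracies would differ between K values, and `k_opt` picks the best one.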

C. t-SNE with KNN
The data is standardized before applying t-SNE. Perplexity is a tunable parameter that plays an imperative role in the performance of t-SNE algorithm. This value lies somewhere

D. 1-D CNN
The dataset is split into training and testing data, with 50% of the samples used for training and the remaining 50% for testing. The data is standardized such that each data point lies in a particular range. The learning model parameters are then generated, and the transformed parameters are obtained before feeding them into the neural network. Since the network requires categorical class labels in numeric form, the labels are one-hot encoded.
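The 50/50 split and the one-hot encoding of the labels can be sketched as follows. The random shuffling, the seed and the toy data are assumptions; the text does not state how the samples are assigned to the two halves.

```python
import numpy as np

def split_half(X, y, seed=0):
    """Shuffle the samples and split them 50/50 into training and testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half = len(X) // 2
    train, test = idx[:half], idx[half:]
    return X[train], y[train], X[test], y[test]

def one_hot(labels, n_classes):
    """Turn integer class labels into rows with a single 1."""
    return np.eye(n_classes)[labels]

X = np.arange(40, dtype=float).reshape(10, 4)   # 10 toy pixels, 4 bands
y = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
X_train, y_train, X_test, y_test = split_half(X, y)
Y_train = one_hot(y_train, n_classes=3)
```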
The batch size is taken as 32 for forward propagation. Since our objective is to extract crucial features from the plethora of data available, valid padding is done. This ensures that after each layer, the number of features reduces drastically to the most important ones.
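The effect of valid padding on the feature length follows from simple arithmetic, sketched below. The kernel size of 11 and pooling size of 2 are hypothetical values chosen for the example, not the settings used in the experiments.

```python
def valid_output_length(n_in, kernel, stride=1):
    """Output length of a 1-D convolution or pooling layer with valid padding."""
    return (n_in - kernel) // stride + 1

after_conv = valid_output_length(200, kernel=11)                   # 200 bands in
after_pool = valid_output_length(after_conv, kernel=2, stride=2)   # max pooling
```

With these values the length shrinks from 200 to 190 after the convolution and to 95 after pooling, since no zeros are added at the borders.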
Dropout regularization is applied by randomly ignoring nodes during training. This cuts down the cost of storage and time, as well as the interdependencies arising among the nodes.

IV. CONCLUSION
In this paper, we have presented a detailed analysis of three feature extraction techniques that are widely used for image classification problems. The datasets used were Indian Pines and Salinas, which are standard datasets available for hyperspectral image processing. Experimental analysis of the Indian Pines dataset indicates that CNN outperforms PCA and t-SNE by a huge margin, with an overall accuracy of 91.41%, but this comes at the cost of increased time and computational complexity. When we compare t-SNE with PCA, there is only a slight improvement in performance, with t-SNE at 71.04% compared to 69.79% for PCA. This increase in accuracy is not justified by the enormous computational complexity of t-SNE. This modest performance of t-SNE, despite it being cited as a novel technique, might be explained by the fact that t-SNE is better suited to visualizing high-dimensional datasets in a lower-dimensional space than to classification tasks.
In the case of the Salinas dataset, the results agree with those of the Indian Pines dataset, but since the scene consists of similar classes of vegetation, all three feature extraction techniques exhibit similar performance. The average accuracy also tends to be higher owing to the similarity of the classes. In conclusion, CNN can be used if high accuracy is required despite the computational cost, PCA can be used at the cost of accuracy if computational resources are limited, and t-SNE can be used for visualization tasks.