Data Analysis and Data Classification in Machine Learning using Linear Regression and Principal Component Analysis

This paper explains a step-by-step procedure for implementing linear regression and principal component analysis (PCA), considering two examples for each model, to predict the continuous values of target variables. Linear regression methods are widely used in prediction, forecasting and error reduction, while principal component analysis is applied in facial recognition, computer vision and related areas. For principal component analysis, the paper explains how to select a point with respect to variance. A Lagrange multiplier is also used to maximize the principal component function, so that an optimized solution is obtained.


Introduction
Linear regression is the simplest method for modelling the relation between dependent and independent variables. The model considered here has one dependent and one independent variable, hence it is a simple linear regression. If multiple independent variables are used for the analysis, the model is called a multiple linear regression model (Mohan, 2019).
Principal component analysis is an unsupervised learning method for dimensionality reduction and feature extraction, and it is in some sense the most widely used unsupervised learning algorithm in the field of machine learning and artificial intelligence. PCA is a linear transformation that maps the given data from an n-dimensional space to another space with the same number of dimensions. Although the original data may have correlations among its variables (dimensions or features) (Arunkarthikeyan, 2021; Pavan, 2020), the resulting data in the target space always has variables that are completely uncorrelated with each other and mutually orthogonal (Wang, 2008). When statistical methods are used to study multivariate problems, a large number of variables increases the complexity of the problem and the amount of calculation. The principal component analysis method uses dimension reduction to transform the many indexes into a small number of mutually uncorrelated comprehensive indexes, which are then used for further analysis (Zhao, 2014; Garikipati, 2021). The following sections show how PCA does this and how it can be used to reduce the dimensionality of given data.

Procedure for linear regression
The vital feature of linear regression is the use of the least squares method. Three variants can be distinguished: "standard" Least Square Error (LSE) methods, which fit data to a function y = f(x), where x is an independent variable and y is a measured or given value; "orthogonal" Total Least Square Error (TLSE) methods, which fit data to a function f(x) = 0, i.e. fit data to some (d-1)-dimensional entity in the d-dimensional space, e.g. a line in 2-dimensional space or a plane in 3-dimensional space (Groen, 1980; Balamurugan, 2018; Arunkarthikeyan, 2020; Huffel, 1991); and "orthogonally mapping" Total Least Square Error (MTLSE) methods, which fit data to a given entity in a subspace of the given space. The last problem is much more complicated. As an example, consider data given in a d-dimensional space for which an optimal line, i.e. a one-dimensional entity, has to be found in that space. The typical problem is to find a line in the space that has the minimum orthogonal distance from the given points. This algorithm is quite complex and its solution can be found in (Skala, 2015).
In this paper the least squares method is used, as it is for most quantitative prediction. Estimates for the parameters are obtained by minimizing the sum of squares of the differences between the observed values and the predicted values (Zahedan, 2015). Partial differentiation of this sum with respect to each parameter, with the derivatives set to zero, gives the equations from which the parameters are obtained. Using these formulas, the mathematical modelling of linear regression can be done, which is given in the following section.
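The partial differentiation step can be written out as a short worked derivation. This is a sketch under the assumption that the fitted line is written y = a + bx, with a as the intercept and b as the slope; the paper names the parameters a and b but the roles assigned to them here are an assumption.

```latex
% Sum of squared errors over n data points (x_i, y_i)
E(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2

% Setting both partial derivatives to zero
\frac{\partial E}{\partial a} = -2 \sum_{i=1}^{n} \left( y_i - a - b x_i \right) = 0
\qquad
\frac{\partial E}{\partial b} = -2 \sum_{i=1}^{n} x_i \left( y_i - a - b x_i \right) = 0

% Normal equations in matrix form
\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}

% Closed-form solution
b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2},
\qquad
a = \frac{\sum y_i - b \sum x_i}{n}
```

The matrix form is a 2 × 2 linear system, so it can be solved directly once the four sums are known.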

Mathematical Modelling
Consider the data for x and y given below, where x is the number of laborers involved and y is the number of shoes produced. Here x is the independent variable and y is the dependent variable.
The above matrix elements can be written in the form of equations, and solving them gives the values of a and b: a = +1.792 and b = +0.178. The error at a data point (x, y) is (true y) - (predicted y). The total error obtained for the straight-line equation using the linear regression model is 2.821.

Flow Chart of Linear Regression
The step-by-step procedure for performing linear regression is explained through the flow chart given in figure (1). The input data x and y are considered for a given straight-line equation. First, the sum of x, the sum of y and the sum of squares of x are found. The next step is to find the sum of the products of x and y, where n is the total length of x. The slope M is then calculated, followed by the y-intercept. Finally, the residuals are obtained from the data x and y, which give the error between the actual and predicted values of the straight-line equation.
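The flow chart steps can be sketched in a few lines of code. The paper's own implementation is in MATLAB and is not reproduced here; the Python function below, its name `fit_line` and the sample values are illustrative assumptions, not the shoe production data from the paper.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by least squares using running sums."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)               # sum of squares of x
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of products of x and y

    denominator = n * sum_xx - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    # Residual = actual y minus predicted y at each data point
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return slope, intercept, residuals


# Made-up sample data for illustration only
slope, intercept, residuals = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept, residuals)
```

Working from the sums keeps the code close to the flow chart; a matrix-based solver would give the same line.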

Principal Component Analysis
As a simple description of the idea behind PCA, assume that there are some data points in a 2-dimensional space with two feature variables X and Y. For example, when studying patient data, X can be the blood pressure (BP) and Y can be the heart rate. These two variables represent the data points in 2-dimensional space. Given a few samples, suppose that only one of the variables X or Y may be selected. The variable with the higher variance should be selected, and here X appears to have the higher variance, because the variance of a variable is related to its range of variation. To perform PCA, first calculate the covariance matrix of the existing data and then calculate the eigenvalues and eigenvectors of that covariance matrix.
In the flow chart of the basic steps of PCA, the first step is to form the data as a p × q matrix, where p is the number of attributes and q is the number of samples. Next, the formed data is optimized and the correlation matrix is calculated. First, the PCA problem has to be converted into an optimization problem [9]. A new variable Z is formed as a weighted combination of the original variables,
$Z = u_1 X_1 + u_2 X_2 + \dots + u_n X_n$ (15)
and the problem is to find $u_1, u_2, \dots, u_n$ that maximize the variance of Z. This is the idea behind principal component analysis. PCA is then implemented step by step using MATLAB code.
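The role of the Lagrange multiplier in this optimization can be shown with a short derivation. It is a sketch under stated assumptions: the data are centred, S denotes their covariance matrix, u is the weight vector, and the constraint is that u has unit length. The symbols S and λ are introduced here for illustration and do not appear in the paper.

```latex
% Variance of the projected variable Z = u^T x for centred data
\operatorname{Var}(Z) = u^{\mathsf{T}} S u

% Constrained problem: maximize the variance over unit-length u
\max_{u} \; u^{\mathsf{T}} S u \quad \text{subject to} \quad u^{\mathsf{T}} u = 1

% Lagrangian with multiplier \lambda
L(u, \lambda) = u^{\mathsf{T}} S u - \lambda \left( u^{\mathsf{T}} u - 1 \right)

% Stationarity condition
\frac{\partial L}{\partial u} = 2 S u - 2 \lambda u = 0
\quad \Longrightarrow \quad S u = \lambda u

% Variance attained at a stationary point
u^{\mathsf{T}} S u = \lambda \, u^{\mathsf{T}} u = \lambda
```

The stationary points are therefore eigenvectors of S, and the variance of Z equals the corresponding eigenvalue, so the first principal component is the eigenvector with the largest eigenvalue.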
First, random data points are created in a two-dimensional space, and then the principal components of the data produced by the random generator are found.
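The random-data case can be sketched as follows. The paper's MATLAB code is not reproduced; this Python version, the function name `pca_2d`, the seed and the noise levels are all illustrative assumptions. It relies on the closed-form eigen decomposition of a 2 × 2 symmetric matrix, which only applies to two-dimensional data.

```python
import math
import random


def pca_2d(xs, ys):
    """Return ((lam1, lam2), (u1, u2)) for 2-D data, with lam1 >= lam2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    syy = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix
    half_trace = (sxx + syy) / 2.0
    spread = math.hypot((sxx - syy) / 2.0, sxy)
    lam1 = half_trace + spread
    lam2 = half_trace - spread

    # Direction of largest variance, and the perpendicular direction
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    u1 = (math.cos(angle), math.sin(angle))
    u2 = (-math.sin(angle), math.cos(angle))
    return (lam1, lam2), (u1, u2)


# Correlated random points (assumed generator settings, for illustration)
random.seed(0)
xs = [random.gauss(0.0, 3.0) for _ in range(200)]
ys = [0.5 * x + random.gauss(0.0, 1.0) for x in xs]
eigenvalues, (u1, u2) = pca_2d(xs, ys)
print("eigenvalues:", eigenvalues)
print("first principal direction:", u1)
```

The first returned direction corresponds to the line of largest variance through the cloud of points.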
As a further example, the digits.csv data set is used for finding principal components. The input data x and y are obtained from digits.csv: MATLAB first reads the data from the file, and then principal component analysis is applied to reduce the dimensionality of the data and summarize the larger data into smaller data sets.

Flow Chart for PCA
The flow chart of PCA shows the steps for performing PCA, with the digits data set taken as the input data. The system reads the input data X and Y from the digits.csv file. The principal components of the inputs are calculated by finding the covariance of the input data. To check whether the points are satisfactory, the eigenvalues and eigenvectors are computed. After the cumulative sum of the eigenvalues is found, it is plotted. After applying principal component analysis to the digits data set, the dimensionality of the data is reduced and the large data is summarized into smaller data sets. The same code can also be used to apply PCA to random numbers. Using the appropriate keyword, the code reads some random data. The covariance matrix of the random data is then found, and finally the eigenvalues and eigenvectors are obtained. The results for random data are shown in the Results and Discussion section.
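The covariance and eigenvector steps of the flow chart can be sketched for data with any number of features. This Python sketch is not the paper's MATLAB code: it substitutes power iteration for a full eigen decomposition and only recovers the leading principal component. The function names and the sample data are illustrative assumptions.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix; each row is one sample, each column a feature."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two samples")
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_component(cov, iterations=500):
    """Power iteration: largest eigenvalue of cov and a unit eigenvector for it.

    Assumes the starting vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # covariance maps v to zero; nothing more to refine
        v = [x / norm for x in w]
    # Rayleigh quotient v^T cov v gives the eigenvalue estimate
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v


# Made-up samples lying along the direction (1, 2, 2), for illustration
samples = [[t, 2.0 * t, 2.0 * t] for t in range(5)]
value, vector = leading_component(covariance_matrix(samples))
print(value, vector)
```

For a complete set of eigenvalues and eigenvectors, a library eigen solver would normally be used instead of power iteration.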
This paper also considers the iris data set, and the same code can be used to perform PCA on it. Using the keyword for the iris data set in MATLAB, the code is able to read the iris data. The results of PCA on the iris data set are shown in fig.5.

Results and Discussion
The output of linear regression using the least squares method is shown in fig. 5. It clearly shows that the best-fitting line obtained by the least squares method passes through or close to most of the data points considered during mathematical modelling. The error obtained is 2.821, which is small.
For PCA on random data, all the points touching the thick line belong to the original data, and the line captures the largest possible variance of the data. Figure 6 shows the plot of the random data on the x- and y-axes, where the blue dots give the original data, which form a large, scattered group. The green line shows how the entire random original data can be put into one set of data. This result is obtained from MATLAB.

Case 2:
The iris data set is considered as an example for performing PCA. A scatter plot of the iris data is shown in fig.7. In the iris data set, x is a 150 × 4 double data set and y is a 150 × 1 double data set. From these a new variable z is obtained whose dimension is 150 × 2. The figure shows species1 (blue), species2 (green) and species3 (red) of the iris flower [18][19][20][21]. Figure 5 gives a 2D view of pc1 and pc2, and it clearly shows that species1 is completely separated from species2 and species3, whereas separating species2 from species3 is harder.
By increasing the number of eigenvalues and using more principal components, more of the data can be preserved. For example, in the digits example, preserving 80% of the data requires 13 principal components, and using 30 eigenvalues and principal components preserves 95% of the original data. This is how principal component analysis works and how it reduces the amount of data needed to work with a data set.
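The choice of how many components to keep follows from the cumulative sum of the eigenvalues. The sketch below shows the calculation in Python; the function names and the eigenvalues are made up for illustration and are not the values obtained from the digits data set.

```python
def cumulative_explained_variance(eigenvalues):
    """Cumulative fraction of total variance, largest eigenvalue first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    fractions = []
    running = 0.0
    for value in ordered:
        running += value
        fractions.append(running / total)
    return fractions


def components_needed(eigenvalues, target):
    """Smallest number of components whose cumulative fraction reaches target."""
    fractions = cumulative_explained_variance(eigenvalues)
    for k, fraction in enumerate(fractions, start=1):
        if fraction >= target - 1e-12:   # small tolerance for rounding
            return k
    return len(fractions)


# Made-up eigenvalues for illustration only
example = [4.0, 3.0, 2.0, 1.0]
print(cumulative_explained_variance(example))
print(components_needed(example, 0.80), components_needed(example, 0.95))
```

Plotting the list returned by `cumulative_explained_variance` against the component index gives the kind of cumulative curve described for the digits data.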

Conclusion
This paper explains linear regression using the least squares method and how data fitting is done for a straight-line equation. When the linear regression model is applied to the data of a shoe production company, the error obtained between the actual and predicted Y is 2.821, which is small. Principal component analysis is discussed with three different cases. In the first case, random inputs are considered, and the output is shown in figure.6, where the thick straight line is the best-fitting line for PCA. In the second case, the iris data set is considered; the plot for the first two components is shown and the three different species are classified, as shown in fig.7. In the third case, the digits data set is considered, where selecting 13 principal components preserves 80% of the original data, as shown in figure.9. It is also shown that the best principal components are obtained by selecting the eigenvalues with the largest variance.