A Survey on Machine Learning Approach to Detect Malware

Malware is one of the predominant challenges for Internet users. In recent times, the injection of malware into machines by anonymous hackers has increased. This creates an urgent need for a system that detects malware. Our idea is to build a system that learns from previously collected malware data and detects malware in a given file, if it is present. We propose using various machine learning algorithms to detect malware and alert the user to the danger. In particular, we propose to use the algorithm that gives the best result for both hardware- and software-oriented malware.


Introduction
Malware, short for "malicious software", is software that can be injected into a computer to gain access to its programs and files. Once installed, it can perform harmful tasks on a personal computer [6], such as destroying data on behalf of a third party. Common types include spyware, viruses, phishing, fileless malware, and worms. Malware typically enters a network through a third party, often gaining access through social engineering, and then carries out malicious activity. Cyber attackers develop a variety of malicious software to obtain data from businesses, which can cause huge damage to the user's network [11]. Phishing is a malicious activity in which a hacker obtains unauthorized data from the user, usually through email or social media [5]. In this paper, we detect malware using various machine learning algorithms [Figure 1]. The system is trained on previously collected data to detect malware in a file [8]. It learns to distinguish malicious from benign samples based on databases of both malicious and benign code. The scope of the project is a system that efficiently detects malware present on a machine and alerts the user to the danger. The tremendous increase in Internet access has led to various forms of data corruption [14]. Earlier, viruses attacked personal computers by copying themselves into other programs or files, which led to harmful actions such as destroying data. In phishing, for example, hackers steal data using social engineering or mock-up websites. Many methods have been proposed to detect phishing websites [20], but hackers evolve their methods to escape these detection methods. The most successful approach to detecting malware is machine learning [21].
This is because phishing has characteristics that a machine learning algorithm can learn from previously collected data. We have compared the results of various machine learning methods for detecting malware, and we pick the method with the best success rate [21]. Our aim is an optimal result for detecting both software- and hardware-related malware. The remainder of this document is organized as follows. We first discuss several methods of classifying and detecting malware using machine learning techniques [28]. We then try to obtain a better result in a traditional way, by comparing the results of two or more algorithms. Finally, the proposed malware detection method is the algorithm chosen through this comparison.

Machine Learning
Machine learning is one of the most inspiring technologies to have evolved, and it is needed across the globe to take over routine tasks in an automated manner. Machine learning comprises algorithms that allow application software to make predictions without being explicitly programmed to do so. Its basic operation is to build algorithms that receive input data and use statistical analysis to predict an output. Machine learning is classified into three types: supervised, unsupervised, and reinforcement learning.
In supervised learning, the machine works with labeled data, in which each sample is tagged with the correct label. It is divided into classification and regression techniques. In unsupervised learning, the machine works with uncategorized data and is not trained beforehand [25]. It is divided into clustering and association. In reinforcement learning, the machine is not given any data; instead it interacts with an environment, from which it receives rewards for correct actions and penalties for incorrect ones.

Algorithms

Supervised Learning
These algorithms work with labeled data to learn a mapping function that turns input variables into an output variable. This lets us generate accurate outputs for new inputs. When the output variable is categorical, classification is used to predict the outcome for a given sample. A classification model may also use the input data to assign labels. When the output variable is a real value, regression is used to predict the outcome for a given sample. Examples of supervised learning algorithms [3] are Naive Bayes, Linear Regression, CART, Logistic Regression, and K-Nearest Neighbors (KNN).
Another form of supervised learning is ensembling, which combines the predictions of two or more models because the prediction of an individual model is not accurate enough.

Unsupervised Learning
These models work only with the input data and have no corresponding output data for a sample. Unlabeled training data is used to model the underlying structure. In clustering, samples are grouped so that objects within the same cluster are more similar to one another than to objects from another cluster. Dimensionality reduction reduces the number of variables in a data set while ensuring that important information is still conveyed [10]. It relies on feature extraction and feature selection methods. Feature extraction transforms the data from a high-dimensional space to a low-dimensional space.
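As an illustration of feature selection for dimensionality reduction, the following minimal sketch keeps the columns with the highest variance and drops the rest. The feature names and values are hypothetical and are not taken from any data set used in this survey. Variance ranking is only one possible selection criterion, and in practice the features would usually be scaled first, since variance depends on the unit of measurement.

```python
from statistics import pvariance


def select_features(rows, k):
    """Return the indices of the k highest-variance columns, in column order."""
    columns = list(zip(*rows))
    variances = [pvariance(col) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: variances[i], reverse=True)
    return sorted(ranked[:k])


def project(rows, keep):
    """Keep only the selected columns of every row."""
    return [[row[i] for i in keep] for row in rows]


# Hypothetical per-file features: [size_kb, num_imports, constant_flag, entropy]
samples = [
    [120.0, 14.0, 1.0, 6.1],
    [300.0, 3.0, 1.0, 7.8],
    [95.0, 20.0, 1.0, 5.2],
    [410.0, 2.0, 1.0, 7.9],
]

keep = select_features(samples, 2)
reduced = project(samples, keep)
print(keep, reduced[0])
```

The constant column carries no information and is always ranked last, which is the kind of variable dimensionality reduction is meant to remove.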

Reinforcement learning
Reinforcement learning denotes algorithms that let an agent decide its future behavior based on its present state so as to maximize a reward [33]. It learns the optimal actions through trial and error.
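The trial-and-error idea can be sketched in its simplest setting, a single-state problem with two possible actions (a "two-armed bandit"). The reward probabilities, the epsilon-greedy exploration rule, and all parameter values below are assumptions chosen for illustration; they are not part of the surveyed work.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Estimate the value of each action by epsilon-greedy trial and error."""
    rng = random.Random(seed)
    n = len(reward_probs)
    counts = [0] * n
    values = [0.0] * n  # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.random() * n)  # explore: pick a random action
        else:
            action = max(range(n), key=lambda a: values[a])  # exploit best estimate
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental update of the average reward for the chosen action
        values[action] += (reward - values[action]) / counts[action]
    return values


# Hypothetical environment: action 1 is rewarded far more often than action 0
values = run_bandit([0.2, 0.8])
best_action = max(range(len(values)), key=lambda a: values[a])
print(best_action)
```

The agent is never told which action is better; it discovers this only from the rewards it receives.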

Logistic Regression
This is a supervised classification algorithm that predicts the probability of a target variable from the predictor variables. In logistic regression, the target (dependent) variable has only two possible classes [35]. The dependent variable is binary, so the result for each processed sample is coded as either 1 or 0.
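The binary, probability-valued output described above can be sketched with a small logistic regression trained by stochastic gradient descent. The two features (a scaled entropy value and a ratio of suspicious API calls), the sample values, and the learning rate are hypothetical and serve only to make the example runnable.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=1000):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict_proba(w, b, x):
    """Probability that sample x belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical features: [scaled_entropy, suspicious_api_ratio]; 1 = malicious, 0 = benign
xs = [[0.9, 0.8], [0.95, 0.7], [0.85, 0.9], [0.5, 0.1], [0.6, 0.2], [0.4, 0.0]]
ys = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(xs, ys)
p_malicious = predict_proba(w, b, [0.95, 0.9])
p_benign = predict_proba(w, b, [0.4, 0.05])
print(round(p_malicious, 3), round(p_benign, 3))
```

Thresholding the predicted probability at 0.5 yields the 1 or 0 coding of the dependent variable.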

Naive Bayes
Naive Bayes is a supervised learning algorithm based on Bayes' theorem and is used for many classification problems. It is one of the simplest and most effective classification algorithms, and it builds fast learning models that make quick predictions [36].
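A minimal sketch of a Naive Bayes classifier over binary features is given below, assuming Laplace (add-one) smoothing so that unseen feature values do not produce zero probabilities. The three features and the training rows are hypothetical.

```python
import math


def train_nb(rows, labels):
    """Estimate class priors and P(feature == 1 | class) with add-one smoothing."""
    priors = {}
    likelihoods = {}
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        priors[c] = len(members) / len(rows)
        likelihoods[c] = [
            (sum(r[j] for r in members) + 1) / (len(members) + 2)
            for j in range(len(rows[0]))
        ]
    return priors, likelihoods


def predict_nb(model, row):
    """Pick the class with the highest log posterior under the independence assumption."""
    priors, likelihoods = model
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for j, value in enumerate(row):
            p = likelihoods[c][j]
            score += math.log(p if value else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical binary features: [is_packed, writes_registry, has_valid_signature]
rows = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 1]]
labels = ["malicious", "malicious", "malicious", "benign", "benign", "benign"]

model = train_nb(rows, labels)
print(predict_nb(model, [1, 1, 0]), predict_nb(model, [0, 0, 1]))
```

Training only counts feature occurrences per class, which is why the model is fast to build and to query.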

KNN
This supervised algorithm is among the simplest and can solve both classification and regression problems. It is easy to understand and implement. KNN is often used in recommendation systems. It does not work well on high-dimensional data, but it is an efficient algorithm for baseline systems. It is also known as instance-based learning [36].
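The instance-based nature of KNN can be seen in a short sketch: there is no training step, and a new sample is labeled by a majority vote of its k nearest stored samples. Euclidean distance, k = 3, and the feature values below are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, labels, query, k=3):
    """Label the query by majority vote among its k nearest training samples."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: [scaled_entropy, suspicious_api_ratio]
train = [[0.9, 0.8], [0.95, 0.7], [0.85, 0.9], [0.5, 0.1], [0.6, 0.2], [0.4, 0.0]]
labels = ["malicious", "malicious", "malicious", "benign", "benign", "benign"]

print(knn_predict(train, labels, [0.88, 0.75]))
print(knn_predict(train, labels, [0.5, 0.15]))
```

Every prediction scans all stored samples, which is one reason the method becomes costly and less reliable as the number of dimensions grows.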

K-means
This clustering algorithm is one of the simplest and most popular unsupervised algorithms. K-means identifies k centroids and then assigns every data point to the nearest cluster, while keeping the distance from each point to its centroid as small as possible. K-means is mostly used to group unlabeled data into classes [36].
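The assign-and-update loop of K-means can be sketched as follows. The initial centroids are passed in explicitly to keep the example deterministic; real implementations usually choose them at random or with a seeding heuristic. The points are hypothetical.

```python
import math


def kmeans(points, centroids, max_iters=100):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = [list(c) for c in centroids]
    assignments = []
    for _ in range(max_iters):
        new_assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical two-dimensional samples forming two well-separated groups
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]

centroids, assignments = kmeans(points, [points[0], points[3]])
print(assignments)
```

No labels are used at any point; the grouping emerges from the distances alone.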

Bagging with Random Forests
Random forest is a supervised machine learning algorithm that is useful for both classification and regression problems. It builds decision trees from samples of the data and then combines the predictions of those trees to select the best solution [36].
Random forest is a flexible, easy-to-use algorithm that produces good results most of the time, even without hyperparameter tuning. It is also one of the most used algorithms because of its simplicity [36]. Random forest is a bagging technique, not a boosting technique: the trees in a random forest are built independently and can run in parallel, whereas the trees in boosting algorithms are trained sequentially [36].
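The bagging idea behind random forests can be sketched with a deliberately simplified model. Each "tree" below is a one-level decision stump trained on a bootstrap sample, and the prediction is a majority vote. A real random forest grows deeper trees and also samples a random subset of features at each split, which this sketch omits. The features, data, and parameters are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_label, right_label = stump
    return left_label if row[j] <= threshold else right_label


def fit_bagged(rows, labels, n_trees=25, seed=0):
    """Train each stump independently on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [int(rng.random() * n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest


def predict_bagged(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Hypothetical features: [scaled_entropy, suspicious_api_ratio]
rows = [[0.9, 0.8], [0.95, 0.7], [0.85, 0.9], [0.5, 0.1], [0.6, 0.2], [0.4, 0.0]]
labels = ["malicious", "malicious", "malicious", "benign", "benign", "benign"]

forest = fit_bagged(rows, labels)
print(predict_bagged(forest, [0.97, 0.95]), predict_bagged(forest, [0.3, 0.0]))
```

Because no stump depends on another, the loop in `fit_bagged` could run in parallel, which is the contrast with sequentially trained boosting models.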

Case Study
In this project, our aim is to build a system that learns from previously collected malware data and detects malware in a given file, if it is present. The files and their details are given as input, and the system uses them to detect malware in the files. We apply various algorithms, find the best one, and use it to detect malware in the files.
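The step of trying several algorithms and keeping the best one can be sketched as a small comparison harness. It assumes that "best" means the highest accuracy (success rate) on held-out labeled files. The candidate detectors here are hand-written rules standing in for trained models, and the file details and labels are hypothetical.

```python
def accuracy(classifier, rows, labels):
    """Fraction of samples for which the classifier returns the true label."""
    correct = sum(classifier(r) == y for r, y in zip(rows, labels))
    return correct / len(rows)


def pick_best(candidates, rows, labels):
    """Score every candidate on the same data and return the top scorer."""
    scores = {name: accuracy(fn, rows, labels) for name, fn in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical candidates; in practice these would be trained models
candidates = {
    "entropy_rule": lambda f: "malicious" if f["entropy"] > 7.0 else "benign",
    "signature_rule": lambda f: "benign" if f["signed"] else "malicious",
    "always_benign": lambda f: "benign",
}

# Hypothetical held-out file details with their true labels
held_out = [
    {"entropy": 7.6, "signed": False},
    {"entropy": 7.2, "signed": True},
    {"entropy": 5.1, "signed": True},
    {"entropy": 6.8, "signed": False},
]
truth = ["malicious", "malicious", "benign", "benign"]

best, scores = pick_best(candidates, held_out, truth)
print(best, scores)
```

Scoring all candidates on the same held-out files keeps the comparison fair; accuracy alone can be misleading when malicious files are rare, so other measures may be preferable in practice.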

Use Case
Malware detection can be very useful in business, the IT sector, education, healthcare, and government, because these fields hold important and confidential data. Malware detection helps secure that data.
Some notable malware attacks are:
1) LockerGoga hit large corporations such as Altran Technologies and Hydro in 2019. It caused losses of millions of dollars for the companies [23].
2) WannaCry, one of the worst attacks in history, struck in 2017, reportedly through phishing emails. Many sectors were affected, and it caused losses of nearly 4 billion USD [30].
3) CryptoLocker was one of the worst attacks of 2013. It was delivered through email and is said to have caused losses of 3 million USD while infecting nearly 200,000 people all over the world [48].
4) NetWalker is one of the latest attacks; it targeted government agencies, healthcare organizations, corporations, and remote employees in 2020-2021.
5) Tycoon is a recently discovered malware type. Many organizations in the education and software industries have suffered from it [31].
6) In 2016, LinkedIn was attacked and 6.5 million passwords were stolen [39].
7) In 2013, Adobe declared that the credit card details of 3 million customers had been stolen by hackers [41].
8) CovidLock is malware that encrypts key data on an Android device and denies the user access [44].
These are some of the well-known malware attacks that have caused heavy losses in many sectors. To prevent such attacks, the malware must be detected early, which is the purpose of this malware detection system. This method can protect a large amount of data.

Advantages of Using Malware Detection
• Safeguard from viruses and their transmission: The main role of malware detection is to stand against viruses and other forms of malware. Viruses not only damage data but also degrade the performance of the system. Malware detection identifies malware before the damage happens [30].
• Defence against data thieves and hackers: Malware detection protects against hackers and data thieves by detecting them before they access or steal the data.
• Spyware protection: Spyware is a type of malware that spies on a system and steals confidential information. Malware detection can prevent this type of attack [19].
• Secure your data and files: The purpose of malware detection is to keep data and files secure.
• Control access to websites to build up web protection: While browsing the Internet, users can come across different forms of threat. Malware detection helps overcome them, so users can protect their information.

Front End

React
React is an open-source JavaScript library for front-end development. It was developed by Facebook. It lets us build user interfaces, especially for single-page applications, and it also supports mobile application development. React has become popular because of its simplicity and flexibility. With the other popular frameworks that competed with React in its early days, programmers had to write code for almost every change, whether minor or major. This remained a problem until the development of React. As mentioned earlier, React is flexible and adapts easily to changes. It is no wonder that many top companies such as Facebook, Uber, PayPal, Airbnb, and Instagram use React, which has contributed to its popularity. This credibility has drawn more people to the library [Figure 2].

Figure 2: Percentage of users who would use a framework again
As the figure shows, the use of React keeps increasing in comparison with its competitors.

Necessity in this project
The front end of the project is a single page, and we have not yet identified a better user interface library for it. Although all the processing happens in the back end, the aim is not fulfilled without a good and pleasing front end [Figure 3].

Conclusion
To summarize, although there are quite a few methods of detecting malware, none of them is highly reliable. The method we follow will therefore be of help not only to the Internet giants but also to ordinary people who unknowingly fall prey to malware. Growth in technology also means growth in risk. Risk management and prevention should advance as well, and this approach takes us one step closer to what is needed.