The Similarity of Essay Examination Results using Preprocessing Text Mining with Cosine Similarity and Nazief-Adriani Algorithms

Article History: Received: 10 November 2020; Revised: 12 January 2021; Accepted: 27 January 2021; Published online: 05 April 2021

Abstract. Exams are one way to measure students' ability to follow the learning process. One type of exam given to students is the essay. This study focuses on automatic assessment of essay-type exams using cosine similarity. The method has several stages: case folding, tokenizing, filtering, stemming, and analyzing, that is, weighting the words in documents with cosine similarity. The stemming process uses the Nazief & Adriani algorithm. The study concludes that the choice of words treated as keywords in the answer key greatly affects the system's assessment results. This is evidenced by testing with the cosine law, which gave a result of 89.5%. However, some types of questions produce significantly different results because there are unique characters in the database and answer keys that do not contain keywords matching the correct answer.


Preliminary
The world is currently feeling the impact of the Coronavirus Disease (COVID-19) pandemic. Indonesia is one of the countries that has been badly affected, especially in education, where schools and universities have been unable to carry out face-to-face learning. Learning has moved online, using media such as Google Classroom, Zoom, WhatsApp, and other methods [1]. The development of technology in the digital era brings many benefits to society, including educational institutions [2].
Online methods are also applied to the Mid-Semester Examination (UTS) and the Final Semester Examination (UAS). Exam questions given to students can be essays or multiple choice. Multiple-choice questions are answered by selecting from the options provided. Essay questions, in contrast, require students to give their own answers according to their understanding. Students' answers are not simply right or wrong; an answer may also be close to correct. For example, if a perfect answer is worth 100, a wrong answer is given 0 and a nearly correct answer is given 40, and so on. Previous research by Rahimi Fitri and Arifin Noor Asyikin applied the Cosine Similarity algorithm to the assessment of student essay exams. In that work, the stemming stage of text mining preprocessing did not apply an algorithm, so the determination of root words was ineffective and had no criteria [3]. The stemming process influences the accuracy of information retrieval. Stemming is done by removing the affixes contained in words.
Another study, by Saipech, Pongsakorn, and Pusadee, detected similarity in test results for Thai by applying the Cosine Similarity algorithm. That study applied the DCB approach to Thai for word segmentation. Preprocessing was limited to word segmentation and stop word elimination [4].
The choice of a method or algorithm for a case must be precise because it depends on the objectives and on the accuracy of the results [5]. In this study, the authors use text mining preprocessing and apply the Nazief and Adriani algorithm at the stemming stage for Indonesian words. Many algorithms have been developed for Indonesian stemming, including the Nazief and Adriani algorithm, Porter's algorithm, and the Arifin and Setiono algorithm [6]. A research scenario with stemming produced an average similarity value 10% higher than one without stemming [7].
The authors apply the Cosine Similarity method to analyze student answers and measure their similarity, combined with the Nazief & Adriani algorithm for stemming the words.

A. Information Retrieval System
An information retrieval system belongs to the branch of computer science concerned with retrieving information from document collections, in both content and context, to satisfy users' information needs [8].
The information obtained from an information retrieval system can take the form of text, pictures, audio, and video, and is useful for searching for and maintaining information [9].

B. Stemming
Stemming is one of the processes in an IR (information retrieval) system. It transforms the words in a document into their root words by applying certain rules [9].
Stemming is also one of the steps used to improve performance. In information retrieval on Indonesian text, stemming removes suffixes, confixes, and prefixes, unlike English text, where stemming is used to remove suffixes [10].

C. Nazief-Adriani Algorithm
The Nazief-Adriani algorithm was first developed by Bobby Nazief and Mirna Adriani. It is based on Indonesian morphological rules, which group affixes into prefixes, suffixes, and confixes (called conjunctions) [11].
The Nazief & Adriani algorithm uses a root word dictionary and supports recoding, that is, rearranging words that have undergone excessive stemming. The affixes are grouped into several categories according to the morphological rules of Indonesian, as follows [12]:
1. Inflection suffixes are a group of suffixes that do not change the root word. For example, the word "makan" (eat) given the "-lah" ending becomes "makanlah". This group can be divided into two:
   - Particles (P), such as "-pun", "-tah", "-kah", and "-lah".
   - Possessive pronouns (PP), such as "-ku", "-nya", and "-mu".
2. Derivation suffixes (DS) are the original Indonesian suffixes added directly to root words, namely "-kan", "-an", and "-i".
3. Derivation prefixes (DP) are prefixes added directly to a pure root word, or to a root word that already has up to two prefixes. They include:
   - "be-", "te-", "pe-", and "me-", which are prefixes with morphological variation.
   - "ke-", "se-", and "di-", which are prefixes without morphological variation.
The form of affixed words in Indonesian, based on the affix classification above, can be modeled as follows [13]:

[DP + [DP + [DP +]]] Root Word [[+DS][+PP]]
Where:
DP: Derivation prefix
DS: Derivation suffix
PP: Possessive pronoun
The rules used in the Nazief & Adriani algorithm are as follows [13]:
1. The prefix-suffix combinations that are not allowed are "se-kan", "be-i", "ke-kan", "ke-i", "me-an", "te-an", and "se-i".
2. An affix may not be used repeatedly.
3. If a word consists of only one or two letters, stemming is not carried out.
4. Adding a prefix can change the original form of the root word or of a prefix that has already been attached; for example, the prefix "me-" can change to "men-", "mem-", "meng-", and "meny-". Rules are therefore needed to handle these morphological changes.
3. Remove derivation suffixes ("-i", "-an", or "-kan"). If the word is found in the root word dictionary, the algorithm stops. If not, go to step 3a.
   a. If "-an" has been removed and the last letter of the word is "k", then "k" is also removed. If the word is found in the dictionary, the algorithm stops. If not, go to step 3b.
   b. The removed suffix ("-i", "-an", or "-kan") is restored; go to step 4.
4. Remove derivation prefixes. If a suffix was removed in step 3, go to step 4a; otherwise go to step 4b.
   a. Check the table of disallowed prefix-suffix combinations. If the combination is found, the algorithm stops; otherwise go to step 4b.
   b. For i = 1 to 3, determine the prefix type and then remove the prefix. If the root word has not been found, go to step 5; if it has, the algorithm stops. Note: if the second prefix is the same as the first prefix, the algorithm stops.
5. Recoding. If all steps have been completed without success, the initial word is assumed to be the root word. The process is complete.
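The suffix and prefix removal steps can be illustrated with a small Python sketch. It is a simplified approximation under stated assumptions, not the full Nazief & Adriani algorithm: `ROOT_WORDS` is a hypothetical three-word dictionary, the prefix list and its grouping into families by the first two letters are assumptions, the early dictionary lookup and inflection suffix removal are assumed because the listed steps begin at step 3, a disallowed combination is skipped rather than stopping the whole process, and recoding is not implemented.

```python
# Simplified sketch of Nazief & Adriani-style stemming (not the full algorithm).
ROOT_WORDS = {"makan", "main", "beli"}  # hypothetical toy dictionary

PARTICLES = ("lah", "kah", "tah", "pun")
POSSESSIVES = ("ku", "mu", "nya")
# "-kan" is handled as "-an" followed by a trailing "k", as in step 3a.
DERIVATION_SUFFIXES = ("i", "an")
# Assumed prefix variants; the first two letters identify the prefix family.
PREFIXES = ("meng", "meny", "mem", "men", "me", "ber", "be",
            "ter", "te", "per", "pe", "di", "ke", "se")
# Disallowed prefix-suffix combinations from rule 1 in the text.
DISALLOWED = {("se", "kan"), ("be", "i"), ("ke", "kan"), ("ke", "i"),
              ("me", "an"), ("te", "an"), ("se", "i")}


def strip_suffix(word, suffixes):
    """Return (stem, removed_suffix); removed_suffix is '' if none matched."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)], suffix
    return word, ""


def strip_prefixes(word, removed_suffix, depth=0, previous_family=None):
    """Try to reach a root word by removing up to three prefixes (step 4)."""
    if depth == 3:
        return None
    for prefix in PREFIXES:
        rest = word[len(prefix):]
        family = prefix[:2]
        if not word.startswith(prefix) or len(rest) < 2:
            continue
        # Skip repeated prefixes and disallowed prefix-suffix combinations.
        if family == previous_family or (family, removed_suffix) in DISALLOWED:
            continue
        if rest in ROOT_WORDS:
            return rest
        found = strip_prefixes(rest, removed_suffix, depth + 1, family)
        if found is not None:
            return found
    return None


def stem(word):
    word = word.lower()
    # Rule 3: words of one or two letters are not processed.
    if len(word) <= 2 or word in ROOT_WORDS:
        return word
    original = word
    # Assumed early step: remove a particle, then a possessive pronoun.
    word, _ = strip_suffix(word, PARTICLES)
    word, _ = strip_suffix(word, POSSESSIVES)
    if word in ROOT_WORDS:
        return word
    candidates = []
    stemmed, suffix = strip_suffix(word, DERIVATION_SUFFIXES)  # step 3
    if suffix:
        if stemmed in ROOT_WORDS:
            return stemmed
        candidates.append((stemmed, suffix))
        if suffix == "an" and stemmed.endswith("k"):  # step 3a
            if stemmed[:-1] in ROOT_WORDS:
                return stemmed[:-1]
            candidates.append((stemmed[:-1], "kan"))
    candidates.append((word, ""))  # step 3b: suffix restored
    for candidate, removed_suffix in candidates:
        root = strip_prefixes(candidate, removed_suffix)
        if root is not None:
            return root
    return original  # step 5: the initial word is assumed to be the root
```

With this toy dictionary, `stem("membelikan")` strips "-kan" and then "mem-" to reach "beli", while a word that matches nothing is returned unchanged.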

D. Cosine Similarity Method
Cosine similarity is a similarity measure used in information retrieval. It measures the angle between the document vector Da (point (ax, bx)) and the document vector Db (point (ay, by)). Each vector component represents a word in the document (text). The two vectors being compared form a triangle, so the law of cosines can be applied [14]: the similarity is the cosine of the angle between the vectors, that is, their dot product divided by the product of their lengths.
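As an illustration, the cosine of the angle between two documents can be computed from their word counts. This is a minimal sketch that assumes raw term frequencies as the word weights; the weighting actually used in the study is not specified here, and the function name `cosine_similarity` is chosen only for this example.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the words that occur in both documents.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)
```

Two identical token lists give a value of about 1, and two lists with no words in common give 0.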

Research Methods
This study uses a qualitative research method, which aims to understand the phenomena experienced by research subjects as a whole, in the form of words and language, in a natural context. The stages of the research are as follows:

Data collection
Data were collected on the questions and answer keys to be tested for the Data Mining course at AMIK Tunas Bangsa Medan, North Sumatra, together with supporting concepts and theories for this research.

Essay Exam Modeling
The right keywords are determined from the answer key as a reference for assessment. Student answers are then checked against these keywords as the reference for correct answers. Finally, the value resulting from the weight calculation of each question is computed, and these values are added up, within the maximum value of the questions, to produce the student's final score.
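One possible reading of this scoring step is sketched below. It assumes that each question's score is its similarity value (between 0 and 1) multiplied by that question's maximum points, and that the final score is the sum over all questions. The function name `final_score` and the point values in the usage note are hypothetical.

```python
def final_score(similarities, max_points):
    """Sum of similarity * maximum points over all questions (assumed rule)."""
    if len(similarities) != len(max_points):
        raise ValueError("one similarity value is needed per question")
    return sum(s * p for s, p in zip(similarities, max_points))
```

For instance, `final_score([1.0, 0.5], [60, 40])` gives 80.0 for a two-question exam worth 100 points in total.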

Essay Exam Architecture
The method used is to match the answer key with the answer from the student and to fix the system functions when an error occurs.

Algorithm Implementation
Student answers are matched against the answer keys using the Cosine Similarity method, and the results of the data processing are reported. The automatic assessment system has two main stages: preprocessing and analyzing. Preprocessing includes several stages, such as tokenizing, filtering, and stemming [15], while the analyzing stage calculates the weighting of words in documents with cosine similarity. Stemming is the process of changing words into their root words. The stemming process in this exam sample uses the Nazief and Adriani algorithm.

Results and Discussion
The trial used a sample of 60 students with three different questions. The answer keys were then compared with the students' answers using the Nazief & Adriani algorithm combined with the cosine similarity method. The framework of this study is shown in Figure 1. The results of the calculations using the Nazief & Adriani algorithm and cosine similarity are as follows:

Case Folding and Tokenizing
At the case folding stage, all letters are converted to lowercase. This can be seen in Table III.1:

A (Student Answer)
If outlook is overcast then yes if outlook rain and wind is false then yes or wind is true then no if outlook is sunny and humidity is <77.50 then no or humidity <= (less or equal to) 77.500 then yes.

B (Answer Key)
if outlook = overcast then yes if outlook = rain and wind = false then yes or wind = true then no if outlook = sunny and humidity => 77,500 then no or humidity <= 77,500 then yes

The tokenizing stage then breaks each sentence into individual words. The tokenizing stages can be seen in Table III.2:
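Case folding and tokenizing can be sketched in a few lines of Python. This is only an illustration: it assumes that a token is any run of letters or digits, so punctuation and operators such as "=" are dropped and a number like "77,500" is split in two. The study's actual tokenizing rules may differ.

```python
import re


def case_fold(text):
    """Convert all letters to lowercase."""
    return text.lower()


def tokenize(text):
    """Split case-folded text into runs of letters or digits (assumed rule)."""
    return re.findall(r"[a-z0-9]+", case_fold(text))
```

Applied to the start of the answer key, `tokenize("if outlook = overcast then yes")` yields the five words without the "=" sign.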

Filtering
In the filtering stage, words considered to have no effect on the core meaning of the sentence are deleted. The data are trimmed by deleting the words "di-", "ke-", and "se-". The filtering results can be seen in Table III.3.

After the cosine similarity calculation, the result obtained was 0.965, which is a similarity percentage of 96.5%.
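The filtering step amounts to removing tokens that appear in a stop word list. The sketch below assumes the list contains only the three words named in the text, written without hyphens; a practical list would likely be longer. The names `STOP_WORDS` and `filter_tokens` are chosen for this example.

```python
# Assumed stop word list, taken from the three words named in the text.
STOP_WORDS = frozenset({"di", "ke", "se"})


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token not in stop_words]
```

For example, `filter_tokens(["pergi", "ke", "pasar"])` keeps "pergi" and "pasar" and drops "ke".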

Value calculation process
The assessment results for one of the students were similarity percentages of 96.5%, 82%, and 90% on the 3 questions. The score is the mean of these values: (96.5 + 82 + 90) / 3 = 89.5. The mean is above 80%, which means that the answer keys and the student's answers that were compared have significant word similarity, with a total value of 89.5%.

Conclusion
Applying the Cosine Similarity and Nazief & Adriani algorithms to essay exam assessment shows that the choice of words treated as keywords in the answer key greatly affects the system's assessment results. The tests carried out obtained a match accuracy of 89.5%. Several suggestions are given for further research, including paying attention to synonyms and antonyms. It is also hoped that better weighting of the answer keys used as the assessment reference will improve the results of assessing student answers.

Acknowledgement
We would like to express our gratitude to STMIK Pontianak for fully funding this research.