Cervical Cancer Diagnosis System Using Ant-Miner for Managing the Knowledge in Medical Database

Article History: Received: 10 November 2020; Revised: 12 January 2021; Accepted: 27 January 2021; Published online: 05 April 2021
Abstract: Cervical cancer is the fourth most frequent cause of cancer death in women. No signs can be observed in the early stages of the disease. In addition, the cervical cancer diagnosis methods used in health centers are time-consuming and costly. Data classification has been widely applied in the diagnosis of cervical cancer for knowledge acquisition. However, none of the existing intelligent methods is comprehensible, and they appear as black boxes to clinicians. In this paper, an ant colony optimization-based classification algorithm, Ant-Miner, is applied to analyze the cervical cancer data set, which was obtained from the repository of the University of California, Irvine. The proposed algorithm outperforms the previous approach in the same domain, the support vector machine, in terms of classification accuracy. The proposed method is implemented as the engine of a prototype system named the cervical cancer detection system. Evaluation of the prototype system demonstrates good results for its usability and functionality.


Introduction
Cervical cancer is the fourth leading cause of cancer death among women [1]. Worldwide, around 500,000 women are diagnosed with cervical cancer every year, and more than 280,000 die of it [2]. Cervical cancer begins when the cells lining the lower segment of the uterus (the cervix) begin to grow out of control [3]. The normal cells of the cervix progressively become abnormal and develop into cancer. These cell transformations can be identified by a screening test, namely the Pap test [4], and by imaging techniques such as diffusion-weighted imaging (DWI) and magnetic resonance imaging (MRI), which can reveal cervical cancer up to a certain stage [5], [6]. However, in developing and low-income countries, people have low awareness of routine Pap screening. In addition, limited medical expertise and the lack of medical equipment contribute to the number of deaths caused by cervical cancer.
The use of computers and artificial intelligence has improved the health care system, which has led to an increase in the demand for intelligent systems and knowledge discovery in modern medical practice [7], [8]. The collection of disease data leads to large volumes of data stored in medical databases. Applying artificial intelligence techniques to these databases therefore helps to enhance the process of diagnosing and detecting diseases, as in work on breast cancer [9], liver tumors [10] and malaria fever [11]. Over the past few years, several methods have been proposed and applied for diagnosing and detecting cervical cancer in medical databases, including principal component analysis (PCA), linear regression (LR), particle swarm optimization (PSO), support vector machine (SVM), artificial neural network (ANN), fuzzy possibilistic C-means clustering, genetic algorithm (GA), hierarchical decision approach (HDA), texture analysis, artificial bee colony (ABC), simulated annealing (SA), firefly algorithm (FA) and ant colony optimization (ACO) [8], [12]-[24].
The models produced by these algorithms are usually incomprehensible, and exploiting such intelligent systems effectively requires considerable experience [24]. This paper aims to apply the ant colony optimization (ACO) based classification algorithm, Ant-Miner, to extract classification rules for cervical cancer diagnosis. Ant-Miner provides a predictive rule list that is fully understandable and can easily be applied in a decision support system. Ant-Miner is then applied to analyze the benchmark cervical cancer dataset. The results show that Ant-Miner can produce higher accuracy in cervical cancer classification than the SVM-based approaches it is compared with.
In addition, this paper describes the design of a cervical cancer diagnosis prototype system that uses Ant-Miner data classification as its engine. This prototype will enable medical specialists to exploit the results of the classification.
This paper is organized as follows. Section 2 reviews several studies on approaches to cervical cancer diagnosis. Section 3 describes the cervical cancer dataset used in this study. Section 4 discusses the proposed method as applied to the cervical cancer dataset. Section 5 describes the experimental setup and the analysis of results, while Section 6 illustrates the prototype system that implements the Ant-Miner data classification. Finally, Sections 7 and 8 present the discussion and the conclusion, respectively.

Related Approaches
Many studies have addressed the diagnosis of cervical cancer, using different types of data and different algorithms.
Reference [25] proposed computerized classification for detecting normal and abnormal cervical cells with artificial neural networks (ANN) and learning vector quantization (LVQ). The sample data sets are gathered and sorted using the digital image processing phases, i.e. pre-processing, purifying, and feature filtering and selection. The input image is kept in the ANN, while the LVQ method is employed to compute the mean coefficient value of the obtained image, which is used to categorize normal and abnormal cells with 90% accuracy.
Reference [26] diagnosed the cancer stage to help treat cancer patients by categorizing the clinical data set for cervical cancer patients. The initial step is to segment the Pap smear image with edge detection to separate the cell nucleus from the background and cytoplasm. Next, various features of the cervical Pap images, such as elongation, perimeter, and region, are extracted. These features are then normalized with the min-max method. After normalization, the K-nearest neighbor (KNN) method is used to classify cancer on the basis of its abnormality.
Reference [27] introduced a support vector machine (SVM) as an approach to cervical cancer diagnosis. For classification, SVM can group data into various types after the training procedure. In the training procedure, the learning model is generated by separating the initial data into groups according to their labels. The result is a hyperplane constructed by SVM, which predicts the label of new data. Two improved SVM methods, namely support vector machine-principal component analysis (SVM-PCA) and support vector machine-recursive feature elimination (SVM-RFE), were suggested to diagnose malignant cancer samples. They used the cervical cancer data set obtained from the University of California Irvine (UCI) repository website, represented by four target variables, namely Cytology, Biopsy, Schiller and Hinselmann, and 32 risk factors. The four targets were classified by the three SVM-based approaches. The results showed good accuracy for SVM. The basic SVM method could distinguish benign from malignant cancer. The other two methods could achieve comparable functionality with fewer factors than the basic SVM used.

Cervical Cancer Data Set
The dataset is openly accessible on the University of California Irvine (UCI) repository website, i.e. https://archive.ics.uci.edu/ml/datasets/. The dataset was contributed to the website by [28]. It focuses on the prediction of indicators of cervical cancer. The dataset comprises data on 858 patients who were examined at the gynecology unit of the Hospital Universitario de Caracas in Caracas, Venezuela. The data were collected randomly between 2012 and 2013.
As shown in Table 1, the dataset is described by 32 attributes, including age, the patient's sexual history, the patient's smoking habits, and past medical information. The dataset also includes four target variables that represent the screening procedures, namely Hinselmann, Schiller, Cytology, and Biopsy. All of the screening procedures involve colposcopy (a procedure to closely examine the cervix, vagina and vulva for signs of disease) with different types of chemical substances.
A number of patients chose not to respond to some of the questions because of privacy concerns, which means that the dataset contains missing values. To obtain a good dataset, which is essential for the classification process, the dataset must be pre-processed. The pre-processing includes several steps, such as data cleaning (eliminating noisy data, filling in missing values), data reduction, and data conversion (aggregation, normalization). After pre-processing, attribute 27, Sexually Transmitted Diseases (STDs): Time since first diagnosis, and attribute 28, STDs: Time since last diagnosis, were removed because their values were missing. Hence, the cervical cancer data analyzed comprised 858 patients with only 30 attributes.
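The pre-processing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say how missing values were filled, so median imputation is assumed here, the "?" placeholder for unanswered questions is an assumption, and the three toy records and the helper names are hypothetical.

```python
from statistics import median

MISSING = "?"  # assumption: placeholder used for unanswered questions


def drop_attributes(rows, names):
    """Data reduction: remove the named attributes from every record."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


def fill_missing_with_median(rows, name):
    """Data cleaning: replace missing values of one numeric attribute
    with the median of its observed values (assumed strategy)."""
    observed = [float(r[name]) for r in rows if r[name] != MISSING]
    fill = median(observed)
    return [{**r, name: fill if r[name] == MISSING else float(r[name])}
            for r in rows]


def min_max_normalize(rows, name):
    """Data conversion: rescale one numeric attribute to the range [0, 1]."""
    values = [r[name] for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [{**r, name: 0.0 if span == 0 else (r[name] - lo) / span}
            for r in rows]


# Toy records, invented for illustration only.
records = [
    {"Age": "18", "Smokes (years)": "0", "STDs: Time since first diagnosis": "?"},
    {"Age": "34", "Smokes (years)": "?", "STDs: Time since first diagnosis": "?"},
    {"Age": "50", "Smokes (years)": "4", "STDs: Time since first diagnosis": "?"},
]

# Drop the attribute whose values are all missing, then clean and normalize.
cleaned = drop_attributes(records, {"STDs: Time since first diagnosis"})
for attr in ("Age", "Smokes (years)"):
    cleaned = fill_missing_with_median(cleaned, attr)
    cleaned = min_max_normalize(cleaned, attr)
```

Dropping the empty attribute before imputation avoids computing a median over a column with no observed values.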

The Proposed Method
The algorithm of the proposed method, i.e. Ant-Miner, is shown in Figure 1. The training and test cases are partitioned using five-fold cross-validation, so that in each fold eighty percent of the data are used for training and twenty percent for testing. After pheromone initialization, numerous rules are constructed in the inner 'Repeat-Until' loop. Each rule construction is followed by rule pruning and the pheromone update. The loop stops when the ants build the same rule consecutively more than once (No_Rule_Converg) or when the_number_of_ants equals the_number_of_rules. When the inner 'Repeat-Until' loop is completed, the best rule is added to the list of rules. All training cases covered by this rule are then removed from the training set, and the pheromone is initialized again. The outer loop controls this cycle and continues while the number of uncovered training cases exceeds a limit called Max_uncovered_cases.
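The outer loop of this procedure follows a sequential covering pattern, which the Python sketch below illustrates. It is not the authors' implementation and it is not ant-based: the function discover_rule is a greedy stand-in for the whole inner loop (rule construction by ants, pruning, and pheromone update), and the toy cases, attribute names, and rule representation are hypothetical. Only the role of Max_uncovered_cases and the removal of covered cases are taken from the description above.

```python
from collections import Counter


def covers(rule, case):
    """A rule covers a case when every (attribute, value) term matches."""
    return all(case[attr] == value for attr, value in rule["terms"].items())


def discover_rule(cases, target):
    """Greedy stand-in for the inner loop (NOT ant colony optimization):
    choose the single attribute-value term whose covered cases are purest,
    breaking ties in favor of the term that covers more cases."""
    best = None
    for attr in cases[0]:
        if attr == target:
            continue
        for value in sorted({c[attr] for c in cases}):
            covered = [c for c in cases if c[attr] == value]
            label, hits = Counter(c[target] for c in covered).most_common(1)[0]
            candidate = (hits / len(covered), len(covered), attr, value, label)
            if best is None or candidate[:2] > best[:2]:
                best = candidate
    _, _, attr, value, label = best
    return {"terms": {attr: value}, "label": label}


def sequential_covering(cases, target, max_uncovered_cases):
    """Outer loop: add the best rule, remove the cases it covers, and repeat
    while more than max_uncovered_cases training cases remain uncovered."""
    rules = []
    remaining = list(cases)
    while len(remaining) > max_uncovered_cases:
        rule = discover_rule(remaining, target)
        rules.append(rule)
        remaining = [c for c in remaining if not covers(rule, c)]
    # Default class: majority among uncovered cases, or all cases if none remain.
    default = Counter(c[target] for c in (remaining or cases)).most_common(1)[0][0]
    return rules, default


# Toy training cases, invented for illustration only.
toy_cases = [
    {"a1": "x", "a2": "p", "cls": "pos"},
    {"a1": "y", "a2": "p", "cls": "pos"},
    {"a1": "x", "a2": "q", "cls": "neg"},
    {"a1": "y", "a2": "q", "cls": "neg"},
    {"a1": "y", "a2": "q", "cls": "neg"},
]
rules, default_label = sequential_covering(toy_cases, "cls", max_uncovered_cases=1)
```

Because every discovered rule covers at least one remaining case, the training set shrinks on each pass and the loop always terminates.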

Experiments and Analysis
The proposed method is programmed in Java using Eclipse and runs on a computer with an Intel(R) Core(TM) i5 Duo CPU @ 2.40 GHz processor and Windows 10. The parameters used in the proposed method are shown in Figure 2. To evaluate the performance of the proposed method, the dataset is divided into training and test groups according to the number of folds in the cross-validation, i.e. 5. Four target variables of the diagnostic procedures, known as Hinselmann, Schiller, Cytology, and Biopsy [29], are diagnosed separately. The performance evaluation is based on the accuracy rate, the number of rules, and the number of conditions, as shown in Figure 3.
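As an illustration of the five-fold partitioning, the following Python sketch splits the indices of 858 cases into five folds, so that each fold holds out roughly one fifth of the cases for testing and trains on the rest. The shuffling, the seed, and the function name are assumptions made for the sketch; the paper does not state how the folds were drawn.

```python
import random


def k_fold_indices(n_cases, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.
    Shuffling with a fixed seed is an assumption made for repeatability."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(858, k=5))
fold_sizes = [len(test) for _, test in splits]
```

With 858 cases, three folds contain 172 test cases and two contain 171, and every case is used for testing exactly once.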
The accuracy rate is calculated based on (1). Across the four target variables, Ant-Miner was able to discover the malignant samples and accomplish the classification of the cervical cancer data set. The accuracy results are compared with those of other approaches, i.e. SVM, SVM-PCA and SVM-RFE, which were also applied to the same cervical cancer data set, as shown in Table 3. The comparison indicates that the Ant-Miner classification achieves higher accuracy than SVM, SVM-PCA, and SVM-RFE for all four target classes.
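For readers who want to reproduce the accuracy figures, a minimal Python sketch follows. It assumes that the accuracy rate is the usual ratio of correctly classified test cases to all test cases, averaged over the cross-validation folds; this definition is an assumption, and the function names and example labels are hypothetical.

```python
def accuracy_rate(predicted, actual):
    """Fraction of test cases whose predicted class equals the actual class
    (assumed definition of the accuracy rate)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def mean_accuracy(fold_results):
    """Average the per-fold accuracy over all cross-validation folds."""
    rates = [accuracy_rate(predicted, actual) for predicted, actual in fold_results]
    return sum(rates) / len(rates)


# Invented labels for a single fold, for illustration only.
example = accuracy_rate(["pos", "neg", "neg", "neg"], ["pos", "neg", "pos", "neg"])
```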

Prototype System
This section describes the cervical cancer diagnosis prototype system that implements the Ant-Miner data classification as its engine. The prototype is named the Cervical Cancer Detection System (CCDS). CCDS is developed using JavaScript and hypertext markup language (HTML). The use case and sequence diagrams for CCDS are shown in Figure 4 and Figure 5, respectively. In both figures, the user interface provides information (Figure 6) and presents queries to the user through question forms (Figure 7). The queries cover historical medical records, user habits, and demographic information. This information is used to diagnose the occurrence of cervical cancer, with the rules obtained from Ant-Miner data classification serving as the inference engine.
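The way an ordered rule list can serve as an inference engine is sketched below. The sketch is written in Python for brevity, whereas CCDS itself is built with JavaScript and HTML. The rules, attribute names, values, and default label are hypothetical placeholders with no clinical meaning, and the first-match policy is an assumption about how the rule list is applied.

```python
# Hypothetical rule list: attribute names, values, and labels are invented
# for illustration and carry no clinical meaning.
RULE_LIST = [
    {"if": {"attribute_a": "high", "attribute_b": "yes"}, "then": "infected"},
    {"if": {"attribute_c": "no"}, "then": "uninfected"},
]
DEFAULT_LABEL = "uninfected"  # assumed default when no rule fires


def rule_matches(rule, answers):
    """A rule fires when every condition equals the user's answer."""
    return all(answers.get(attr) == value for attr, value in rule["if"].items())


def diagnose(answers, rule_list=RULE_LIST, default_label=DEFAULT_LABEL):
    """Apply the rules in order and return the label of the first that fires."""
    for rule in rule_list:
        if rule_matches(rule, answers):
            return rule["then"]
    return default_label


# Answers as they might arrive from a question form (invented values).
form_answers = {"attribute_a": "high", "attribute_b": "yes", "attribute_c": "no"}
result = diagnose(form_answers)
```

In a system with four target tests, one such rule list per test would be evaluated against the same form answers.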
The input from the user through the question forms consists of two types of attributes, i.e. non-indexed and indexed, as shown in Table 4 and Table 5, respectively. The non-indexed attributes consist of information about the user's habits and demographic information, while the indexed attributes consist mainly of historical medical records.
Based on the values that users enter into CCDS, the state of the disease, i.e. infected or uninfected, according to the four tests, namely Hinselmann, Schiller, Cytology, and Biopsy, is listed as shown in Figure 8.
According to [30], a questionnaire with Likert-scale responses produces ordinal data, i.e. the responses can be rated or ranked, but the distance between responses is not measurable. Therefore, ordinal data can be interpreted by calculating the median and the interquartile range (IQR). The median is the value at the midpoint of a frequency distribution of observed values. The IQR is a measure of spread that indicates whether the responses are clustered together or scattered across the range of possible responses.
The median is calculated by listing all the response scale values for each question from smallest to largest. For example, for question 1, the listing is 3, 3, 4, 4, 4, 5, 5, 5, 5, 5 (20% Good, 30% Very Good, and 50% Excellent). The median formula in a spreadsheet application, i.e. Microsoft Excel, is applied to this listing. The IQR is also calculated in the spreadsheet application by applying the Quartile formula (Quartile 3 minus Quartile 1) to the same listing. Table 7 shows the median and the IQR of the responses for each question.
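The same calculation can be checked outside a spreadsheet. The Python sketch below applies the standard library's statistics module to the question 1 listing given above. It assumes that the spreadsheet Quartile formula interpolates inclusively between data points, which corresponds to method="inclusive" in statistics.quantiles.

```python
from statistics import median, quantiles

# Responses to question 1, as listed in the text.
responses = [3, 3, 4, 4, 4, 5, 5, 5, 5, 5]

med = median(responses)  # 4.5

# Quartiles with inclusive interpolation (assumed to match the spreadsheet).
q1, q2, q3 = quantiles(responses, n=4, method="inclusive")
iqr = q3 - q1  # 5.0 - 4.0 = 1.0
```

For this listing the median is 4.5 and the IQR is 1, which is a small spread on a five-point scale.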
A relatively small IQR for the responses to each question, as in Table 7, is an indication of consensus [30]. Based on the median values (4, 4.5, and 5) and the small IQR values, the results demonstrate that most of the users in the evaluation phase agreed with the usability and functionality provided by CCDS.

Discussion
Patients who do not undergo routine screening tests for cervical cancer need a proper tracking system for easy diagnosis. Patients who use a cervical cancer diagnosis system provide accurate information to physicians, who use it to understand and evaluate the patients' symptoms. Computer-based cervical cancer detection systems are expected to gain traction among patients and doctors. Such systems are useful because they can be easily accessed via mobile or smart devices. To our knowledge, this study is the first to use metaheuristic algorithms for cervical cancer diagnosis. Patient examination data are classified according to certain criteria by using the ACO-based classification algorithm on data from 858 patients. The target variables are Hinselmann, Schiller, Cytology, and Biopsy. By comparing the proposed model with other works, we found that the ACO-based classification algorithm discovers understandable rules with a high degree of classification accuracy. With the proposed method and the prototype system, it is hoped that medical experts who are not specialists in artificial intelligence or machine learning may be able to use this method and system in clinical practice.

Conclusion
Several studies have reported strong classification performance in cervical cancer diagnosis by building models and discovering knowledge rules, but the results of most of these methods are not comprehensible. Experts therefore still need considerable experience to work with the extracted results effectively, rather than understanding the knowledge bases directly and designing a system that enables specialists to exploit the prediction results. In this paper, an ACO-based classification algorithm, Ant-Miner, was proposed to classify the cervical cancer data set and obtain a rule list that is fully understandable. The proposed method achieved high accuracy, i.e. more than 90 percent. Compared with the other approaches, Ant-Miner has higher accuracy. Thus, among the compared approaches, this algorithm is the most suitable for classifying the cervical cancer data set. In addition, this paper describes the prototype system that implements the proposed method as its engine. The evaluation of the prototype system shows good results for its usability and functionality.