Comprehensive Data Corruption Identification Using Machine Learning Algorithms (PAACDA)

Vanitha M.; Maneesha K.; Uma Renu Sri K.; Nancy K.

doi:10.61841/turcomat.v15i3.14785

Published: 2024-09-04

Dr. M. Vanitha

Professor, Department of CSE, Malla Reddy Engineering College for Women, Autonomous, Hyderabad

Maneesha K.

Student, Department of CSE, Malla Reddy Engineering College for Women, Autonomous, Hyderabad

Uma Renu Sri K.

Student, Department of CSE, Malla Reddy Engineering College for Women, Autonomous, Hyderabad

Nancy K.

Student, Department of CSE, Malla Reddy Engineering College for Women, Autonomous, Hyderabad

Abstract

With the rise of technology, data and analysis have evolved from scattered numbers and attributes in spreadsheets into a means of transforming any substantial industry. Data can be corrupted in many ways, including unethical and unlawful ones, so it is important to detect and flag all corrupted data in a dataset effectively. Detecting damaged data, or restoring information from a corrupted dataset, is not an easy task. If corruption is not handled early, it can cause problems for later data processing with machine learning or deep learning methods. Rather than focusing on outlier identification, this study introduces PAACDA, the Presence-driven Adamic Adar Corruption Identification Algorithm, and consolidates the findings. Current state-of-the-art models such as Isolation Forest and DBSCAN (Density-Based Spatial Clustering of Applications with Noise) rely on parameter tuning to achieve high accuracy, and they still show considerable uncertainty when corrupted data is involved. This study investigates the specific performance problems of several unsupervised learning methods on corrupted linear and clustered datasets. The proposed PAACDA technique achieves a precision of 96.35% on clustered data and 99.04% on linear data, higher than 15 prominent unsupervised learning baselines, including K-means clustering, Isolation Forest, and LOF (Local Outlier Factor). The paper also reviews the relevant literature from these perspectives, identifies the problems with current methods, and suggests directions for future research in this area.
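The abstract reports results as precision on corrupted linear and clustered datasets. The sketch below illustrates how such a figure can be computed once a detector has flagged a set of points. It is not PAACDA, which this page does not describe: the detector here is a deliberately simple stand-in (a least-squares line followed by a median-absolute-deviation test on the residuals), and the function names, the threshold of 3.5, and the toy dataset are all assumptions made for illustration.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def flag_by_residual(xs, ys, threshold=3.5):
    """Flag indices whose residual has a large robust z-score.

    Stand-in detector (an assumption, not the paper's method): residuals
    from the fitted line are scaled by the median absolute deviation (MAD).
    """
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    if mad == 0:
        # Degenerate scale: this sketch makes no decision in that case.
        return set()
    return {
        i
        for i, r in enumerate(residuals)
        if 0.6745 * abs(r - med) / mad > threshold
    }


def precision_recall(flagged, truth):
    """Precision = TP / flagged, recall = TP / truly corrupted."""
    tp = len(flagged & truth)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Toy linear dataset y = 2x + 1 with one corrupted value at index 10.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
ys[10] += 100
truth = {10}

flagged = flag_by_residual(xs, ys)
print(flagged, precision_recall(flagged, truth))
```

Precision penalizes clean points that are wrongly flagged, while recall penalizes corrupted points that are missed, so a detector tuned for one can score poorly on the other. Any comparison against the figures in the abstract would need the paper's own datasets and corruption procedure, which are not given here.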

Issue

Vol. 15 No. 3 (2024)

Section

Articles

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Comprehensive Data Corruption Identification Using Machine Learning Algorithms (PAACDA). (2024). Turkish Journal of Computer and Mathematics Education (TURCOMAT), 15(3), 144-153. https://doi.org/10.61841/turcomat.v15i3.14785
