Phishing is a social engineering cyberattack in which criminals trick users into revealing their credentials through a deceptive login form that submits the information to a malicious server. In this project, we compare machine learning techniques and propose a method for detecting phishing websites through URL analysis. Most current state-of-the-art phishing detection solutions use homepages without login forms as the legitimate class. We differ by including URLs of login pages in both classes, which we believe better reflects real-world scenarios, and we demonstrate that existing techniques yield a high false-positive rate when tested with URLs from legitimate login pages. Furthermore, we use datasets from different years to show how model accuracy declines over time: we train a base model on older datasets and evaluate it on recent URLs. Additionally, we conduct a frequency analysis of current phishing domains to identify the techniques phishers employ in their campaigns. To support our claims, we introduce a new dataset, Phishing Index Login URL (PILU-90K), which consists of 60,000 legitimate URLs, covering both index and login pages, and 30,000 phishing URLs. Lastly, we present a Logistic Regression model that, combined with Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction, achieves an accuracy of 96.50% on the provided login URL dataset.
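To make the last step concrete, the following is a minimal sketch of TF-IDF feature extraction over URLs followed by logistic regression, written with the Python standard library only. It is not the authors' implementation: the tokenizer, the smoothed IDF formula (a common convention), the learning rate, the number of epochs, and the eight toy URLs (built on reserved example domains and addresses) are all assumptions made for illustration, and the toy set is in no way representative of PILU-90K. Note that login-style tokens appear in both classes, as in the setting described above.

```python
import math
import re
from collections import Counter

def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def fit_idf(urls):
    """Smoothed inverse document frequency for every token seen in training."""
    n = len(urls)
    df = Counter(tok for u in urls for tok in set(tokenize(u)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(url, idf):
    """Sparse, L2-normalised TF-IDF vector; tokens unseen in training are ignored."""
    tf = Counter(t for t in tokenize(url) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(vectors, labels, epochs=300, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient descent."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            p = sigmoid(bias + sum(weights[t] * v for t, v in vec.items()))
            err = p - y  # gradient of the log-loss with respect to the logit
            bias -= lr * err
            for t, v in vec.items():
                weights[t] -= lr * err * v
    return weights, bias

def predict_proba(url, idf, weights, bias):
    """Estimated probability that a URL belongs to the phishing class."""
    vec = tfidf(url, idf)
    return sigmoid(bias + sum(weights[t] * v for t, v in vec.items()))

# Hypothetical toy data: label 0 = legitimate (index and login pages), 1 = phishing.
urls = [
    "https://www.example.com/",
    "https://www.example.com/login",
    "https://accounts.example.org/signin",
    "https://shop.example.net/account/login",
    "https://example.com.verify-account.invalid/login",
    "https://secure-update.invalid/example/signin.php",
    "https://192.0.2.10/example/login/confirm.php",
    "https://login-example.invalid/account/verify",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

idf = fit_idf(urls)
weights, bias = train([tfidf(u, idf) for u in urls], labels)

for u in urls:
    print(f"{predict_proba(u, idf, weights, bias):.3f}  {u}")
```

In practice, a library implementation with a held-out test split would be the natural choice; the sketch only shows how token weights derived from URL text feed a linear classifier.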