Machine learning-based phishing detection using URL features: a comprehensive review

Asif, Asif Uz Zaman, authorRay, Indrakshi, advisorShirazi, Hossein, advisorRay, Indrajit, committee memberWang, Haonan, committee member2024-01-012024-01-012023https://hdl.handle.net/10217/237380In a social engineering attack known as phishing, a perpetrator sends a false message to a victim while posing as a trusted representative in an effort to collect private information such as login passwords and financial information for personal gain. To successfully carry out a phishing attack, fraudulent websites, emails, and messages that are counterfeit are utilized to trick the victim. Machine learning appears to be a promising technique for phishing detection. Typically, website content and Unified Resource Locator (URL) based features are used. However, gathering website content features requires visiting malicious sites, and preparing the data is labor-intensive. Towards this end, researchers are investigating if URL-only information can be used for phishing detection. This approach is lightweight and can be installed at the client's end, they do not require data collection from malicious sites and can identify zero-day attacks. We conduct a systematic literature review on URL-based phishing detection. We selected recent papers (2018 --) or if they had a high citation count (50+ in Google Scholar) that appeared in top conferences and journals in cybersecurity. This survey will provide researchers and practitioners with information on the current state of research on URL-based website phishing attack detection methodologies. The results of this study show that, despite the lack of a centralized dataset, this is beneficial because it prevents attackers from seeing the features that classifiers employ. However, the approach is time-consuming for researchers. Furthermore, for algorithms, both machine learning and deep learning algorithms can be utilized since they have very good classification accuracy, and in this work, we found that Random Forest and Long Short-Term Memory are good choices of algorithms. Using task-specific lexical characteristics rather than concentrating on the number of features is essential for this work because feature selection will impact how accurately algorithms will detect phishing URLs.born digitalmasters thesesengCopyright and other restrictions may apply. User is responsible for compliance with all applicable laws. For information about copyright law, please see https://libguides.colostate.edu/copyright.machine learningsocial engineeringURL-basedphishingcyber-securitysurveyMachine learning-based phishing detection using URL features: a comprehensive reviewText