Authors: Tarush Katiyar, Khushi Pant, Sumit Yadav, Aman Anand, Dr. Vivek Kumar, Dr. Hitesh Singh
Abstract: The rapid growth of cyber threats, especially malicious-domain (phishing) attacks, calls for detection methods that go beyond traditional signature matching. We propose a machine learning framework that analyzes URL characteristics and other features to classify domains as safe or malicious. The system uses Python-based tools (Pandas, scikit-learn, XGBoost) to train and evaluate multiple classifiers, including Random Forest, Decision Tree, Gradient Boosting, XGBoost, and Logistic Regression. Features are extracted directly from URL strings (such as domain entropy, URL length, and special-character counts) and combined with blacklist/whitelist checks. On a large URL dataset (≈700k samples), the ensemble methods achieved high accuracy: Random Forest reached 95% accuracy on the test set and XGBoost reached 94%. In contrast, a simple Logistic Regression model achieved only 78% accuracy, which shows the advantage of tree-based models on this task. Our results demonstrate that ML-driven analysis of URL-based features can effectively detect malicious domains and clearly outperforms the logistic regression baseline. The framework is implemented as a Python pipeline and can be integrated into real-time security tools. Future work will extend this approach with additional data sources and more advanced learning techniques to further improve detection rates.
DOI: http://doi.org/
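
To make the URL-derived features named in the abstract concrete, the following is a minimal sketch of how domain entropy, URL length, and special-character count might be computed with the Python standard library. It is an illustration under stated assumptions, not the authors' implementation: the function names `shannon_entropy` and `extract_url_features`, the `SPECIAL_CHARS` set, and the choice of Shannon entropy over the hostname are all hypothetical.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

# Assumption: the exact set of "special characters" is not given in the
# abstract; this set is chosen only for illustration.
SPECIAL_CHARS = set("-_.@?=&%/:~#+")


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_url_features(url: str) -> dict:
    """Compute three lexical features from a raw URL string."""
    # urlsplit only recognizes the host when a scheme or a leading "//" is
    # present, so scheme-less inputs such as "example.com/x" get "//" prepended.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "domain_entropy": shannon_entropy(host),
        "special_char_count": sum(ch in SPECIAL_CHARS for ch in url),
    }


if __name__ == "__main__":
    print(extract_url_features("http://example.com/login?user=1"))
```

A dictionary per URL can be collected into a Pandas DataFrame and passed to any of the classifiers listed in the abstract. Blacklist/whitelist checks are omitted here because they depend on external lists that the abstract does not specify.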