# Phishing URL Detection using Machine Learning and Deep Learning
Comparative analysis of Machine Learning and Deep Learning models for phishing URL detection using lexical feature engineering, SMOTE balancing, and performance benchmarking.
## Overview
This project presents a comparative analysis of Machine Learning and Deep Learning techniques for phishing URL detection. The system extracts lexical URL features, applies data preprocessing and class balancing using SMOTE, and evaluates multiple classification models for identifying malicious URLs.
The objective is to improve phishing detection accuracy while comparing model performance, resource utilization, and scalability.
## Dataset
The project uses a dataset of 504,983 labeled URLs:
- 54,807 malicious URLs
- 450,176 legitimate URLs
The dataset was cleaned, normalized, and transformed through feature engineering before model training.
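The class counts above imply a strong imbalance, which is what motivates SMOTE later on. A quick arithmetic sketch follows; it assumes the listed counts describe the data that is balanced, which the README does not state, and the variable names are illustrative only.

```python
# Class counts as listed in the Dataset section.
malicious = 54_807
legitimate = 450_176

total = malicious + legitimate
malicious_share = malicious / total          # fraction of URLs labeled malicious
imbalance_ratio = legitimate / malicious     # legitimate URLs per malicious URL

# Upper bound on synthetic minority samples needed for a 1:1 balance.
# In practice SMOTE is usually fitted on the training portion only,
# so the real number would be smaller.
synthetic_needed = legitimate - malicious

print(f"total URLs:        {total:,}")
print(f"malicious share:   {malicious_share:.1%}")
print(f"imbalance ratio:   {imbalance_ratio:.1f} : 1")
print(f"synthetic samples: {synthetic_needed:,}")
```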
## Feature Engineering
The following lexical URL features were extracted:
- URL Length
- Number of Dots
- Number of Dashes
- Number of Underscores
- Number of Question Marks
- Number of Equals Signs
- Subdomain Count
- Domain Length
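The features listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the function name `extract_lexical_features`, the dictionary keys, and the rule that the last two host labels form the registered domain are all assumptions. The project lists TLDExtract in its stack, which handles multi-part suffixes such as `.co.uk` that this naive rule gets wrong.

```python
from urllib.parse import urlsplit


def extract_lexical_features(url: str) -> dict:
    """Return the eight lexical features for one URL (illustrative sketch)."""
    # urlsplit only recognises the host when a scheme is present.
    candidate = url if "://" in url else "http://" + url
    host = urlsplit(candidate).hostname or ""
    labels = [label for label in host.split(".") if label]

    # Naive assumption: the last two labels are "domain.suffix";
    # everything before them counts as subdomain labels.
    subdomain_labels = labels[:-2]
    domain = labels[-2] if len(labels) >= 2 else host

    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_dashes": url.count("-"),
        "num_underscores": url.count("_"),
        "num_question_marks": url.count("?"),
        "num_equals": url.count("="),
        "subdomain_count": len(subdomain_labels),
        "domain_length": len(domain),
    }


features = extract_lexical_features(
    "http://secure-login.accounts.example.com/verify?id=42"
)
print(features)
```

Here "Domain Length" is read as the length of the registered domain label (`example`), which is one possible interpretation; the length of the full host name would be another.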
## Data Preprocessing
- URL normalization and cleaning
- Duplicate removal
- Feature scaling using StandardScaler
- Class imbalance handling using SMOTE
- Train-test split (70:30)
- 4-fold cross-validation
## Models Evaluated
### Machine Learning Models
- XGBoost
- Random Forest
- Linear SVM
### Deep Learning Models
- Deep Neural Network (DNN)
- 1D Convolutional Neural Network (CNN)
- Long Short-Term Memory Network (LSTM)
## Results
Of the six models evaluated, Random Forest achieved the best overall performance:
| Metric | Random Forest |
|----------|----------|
| Accuracy | 90.4% |
| F1 Score | 0.920 |
| AUC Score | 0.960 |
The study demonstrates that traditional machine learning approaches can outperform more complex deep learning architectures for lexical phishing URL detection.
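For reference, the three reported metrics can be computed as in the sketch below, which uses their standard definitions and only the standard library. The labels and scores are made-up toy values, not the project's results, and the function names are illustrative; in practice `sklearn.metrics` provides equivalent functions.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels, with 1 as the phishing class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return accuracy, f1


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: true labels, model scores, and predictions at a 0.5 threshold.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(accuracy, f1, auc)
```

With classes as imbalanced as in this dataset, accuracy alone can look high for a model that rarely flags phishing URLs, which is why F1 and AUC are reported alongside it.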
## Technology Stack
- Python
- Pandas
- NumPy
- Scikit-Learn
- XGBoost
- TensorFlow / Keras
- Matplotlib
- Seaborn
- SMOTE (for example, via imbalanced-learn)
- TLDExtract
## Installation
Install the dependencies with `pip install -r requirements.txt`.
## Running the Project
Run the training script with `python url_model_training.py`.
## Applications
- Phishing Detection
- Threat Intelligence
- Secure Browsing Systems
- Email Security Solutions
- Cybersecurity Analytics
## Future Enhancements
- Real-time URL classification API
- Browser extension integration
- Domain reputation analysis
- Hybrid lexical and content-based detection
- Ensemble learning approaches