secureml-au/malicious-url-detection-using-ml

GitHub: mlsecdev/malicious-url-detection-using-ml

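A machine-learning-based malicious URL detection system. It classifies URLs from more than 30 structural and statistical features and identifies phishing and malware links without fetching any page content.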


# Malicious URL Detection System

**Machine-learning classifier for phishing and malware URLs**

[![Python](https://img.shields.io/badge/Python-3776AB?style=flat-square&logo=python&logoColor=white)](https://www.python.org/) [![Scikit-learn](https://img.shields.io/badge/Scikit--Learn-F7931E?style=flat-square&logo=scikitlearn&logoColor=white)](https://scikit-learn.org/) [![Streamlit](https://img.shields.io/badge/Streamlit-FF4B4B?style=flat-square&logo=streamlit&logoColor=white)](https://streamlit.io/) [![License](https://img.shields.io/badge/License-Apache%202.0-blue?style=flat-square)](LICENSE)

## Overview

A machine-learning-based cybersecurity system that detects and classifies malicious URLs by analyzing their structural and statistical features, without inspecting web page content.

**Key features:**

- Real-time URL analysis with detection accuracy above 95% for the XGBoost and ensemble models
- More than 30 engineered URL features covering length, entropy, domain, and security dimensions
- No page content inspection, so the design is privacy-preserving
- Supports single-URL lookup, batch processing, and REST API integration

## Tech Stack

| Layer | Technology |
|---|---|
| Machine learning | Scikit-learn, XGBoost |
| Data processing | Pandas, NumPy |
| Web interface | Streamlit |
| API | Flask |
| Data visualization | Matplotlib, Seaborn |

## Installation

```
git clone https://github.com/ares-coding/malicious-url-detection-using-ml.git
cd malicious-url-detection-using-ml
pip install -r requirements.txt
streamlit run app.py
```

**Docker:**

```
docker build -t url-detector .
docker run -p 8501:8501 url-detector
```

## Usage

### Web Interface

```
streamlit run app.py
# Open http://localhost:8501
```

### Python API

```
from url_detector import URLDetector

detector = URLDetector(model='xgboost')

# Single URL
result = detector.predict('https://suspicious-site.com')
print(f"Malicious: {result['is_malicious']}")
print(f"Confidence: {result['confidence']:.2%}")

# Batch
urls = ['url1.com', 'url2.com', 'url3.com']
results = detector.predict_batch(urls)
```

### REST API

```
python api.py

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

## How It Works

### Feature Extraction

```
# check_ip_address() and extract_domain() are helpers not shown in this excerpt.
def extract_url_features(url):
    features = {
        # Length and character counts over the raw URL string
        'url_length': len(url),
        'num_dots': url.count('.'),
        'num_hyphens': url.count('-'),
        'num_underscores': url.count('_'),
        'num_slashes': url.count('/'),
        'num_questionmarks': url.count('?'),
        'num_equals': url.count('='),
        'num_ats': url.count('@'),
        'num_digits': sum(c.isdigit() for c in url),
        # Domain and security indicators
        'has_ip': check_ip_address(url),
        'has_https': url.lower().startswith('https://'),
        'domain_length': len(extract_domain(url)),
        # 20+ additional features
    }
    return features
```

### Classification

```
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'XGBoost': XGBClassifier(max_depth=6),
    'SVM': SVC(kernel='rbf', probability=True)
}

# `model` is one fitted entry of `models`; `features` is a 2-D array with one row.
proba = model.predict_proba(features)[0]
prediction = proba.argmax()  # index into model.classes_
confidence = proba[prediction]
```

## Feature Engineering

Features are grouped into the following categories:

| Category | Features |
|---|---|
| Length-based | URL length, domain length, path length |
| Character-based | Dots, hyphens, slashes, special characters |
| Domain | Presence of an IP address, number of subdomains, TLD type |
| Path | Directory depth, file extension |
| Query parameters | Number of parameters, suspicious patterns |
| Security | HTTPS, certificate validity |
| Entropy | Randomness of the character distribution |
| Reputation | Domain age, blacklist score |

**Top 10 features by importance:**

```
url_length        0.142
has_ip_address    0.128
num_subdomains    0.095
domain_length     0.087
num_dots          0.076
has_https         0.068
entropy           0.062
num_hyphens       0.055
path_depth        0.051
num_digits        0.048
```

## Model Performance

| Model | Accuracy | Precision | Recall | F1 Score | AUC-ROC |
|---|---|---|---|---|---|
| Random Forest | 94.2% | 93.8% | 94.6% | 94.2% | 0.97 |
| **XGBoost** | **96.5%** | **96.2%** | **96.8%** | **96.5%** | **0.98** |
| SVM (RBF) | 92.8% | 92.3% | 93.2% | 92.7% | 0.96 |
| Ensemble | 97.1% | 96.9% | 97.3% | 97.1% | 0.99 |

**Confusion matrix (XGBoost):**

```
                     Predicted
                 Benign   Malicious
Actual Benign     4,823      152
       Malicious    118    4,907
```

## API Reference

### `POST /predict`

Analyzes a single URL.

**Request:**

```
{
  "url": "https://example.com/path?param=value"
}
```

**Response:**

```
{
  "url": "https://example.com/path?param=value",
  "is_malicious": false,
  "confidence": 0.923,
  "risk_score": "low",
  "features": {
    "url_length": 36,
    "has_https": true,
    "num_dots": 1
  },
  "timestamp": "2025-02-13T10:30:00Z"
}
```

### `POST /batch`

Analyzes multiple URLs.

**Request:**

```
{
  "urls": [
    "https://google.com",
    "http://suspicious-site.tk"
  ]
}
```

## Project Structure

```
malicious-url-detection/
├── data/
│   ├── raw/                 # Original datasets
│   ├── processed/           # Cleaned data
│   └── models/              # Trained model files
├── src/
│   ├── feature_extraction.py
│   ├── model_training.py
│   ├── prediction.py
│   └── utils.py
├── notebooks/
│   ├── 01_data_analysis.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_evaluation.ipynb
├── api/
│   ├── app.py
│   └── schemas.py
├── app.py                   # Streamlit interface
├── train.py                 # Training script
├── requirements.txt
└── README.md
```

## License

Licensed under the [Apache License 2.0](LICENSE).

## Author

**Au Amores**, AI/ML Engineer

[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=flat-square&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/au-amores/) [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/ares-coding) [![Email](https://img.shields.io/badge/Email-EA4335?style=flat-square&logo=gmail&logoColor=white)](mailto:auamores3@gmail.com)
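
## Appendix: Helper Function Sketch

The feature-extraction snippet above calls `check_ip_address` and `extract_domain` without showing them, and the feature table lists entropy and directory depth. The sketch below is one possible way to write such helpers with only the Python standard library. It is an illustration under assumptions, not the repository's code: the bodies of `check_ip_address` and `extract_domain` are guesses at their intent, and `shannon_entropy` and `path_depth` are hypothetical names introduced here.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    """Return the lowercase hostname of a URL, or '' if none can be parsed.

    Assumed behavior: URLs without a scheme (e.g. 'url1.com') yield ''.
    """
    try:
        return urlparse(url).hostname or ""
    except ValueError:  # urlparse rejects some malformed IPv6 literals
        return ""


def check_ip_address(url: str) -> bool:
    """Return True when the host is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(extract_domain(url))
        return True
    except ValueError:
        return False


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def path_depth(url: str) -> int:
    """Count the non-empty segments of the URL path."""
    return sum(1 for part in urlparse(url).path.split("/") if part)


if __name__ == "__main__":
    sample = "http://192.168.0.1/login/verify.php"
    print(check_ip_address(sample), path_depth(sample))
    print(round(shannon_entropy(sample), 3))
```

Values like these could be added to the dictionary returned by `extract_url_features` under keys such as `entropy` and `path_depth`, which would match the names in the feature-importance list. Whether the project computes them this way is not stated in the README.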
