cont1n3nt/ai-phishing-detector

GitHub: cont1n3nt/ai-phishing-detector

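A phishing-email text detection system based on TF-IDF and ensemble learning, providing complete training, evaluation, and a FastAPI inference service.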

Stars: 3 | Forks: 0

# AI Phishing Detector

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/05/fb8d614553031527.svg)](https://github.com/cont1n3nt/ai-phishing-detector/actions/workflows/ci.yml) [![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![pytest](https://img.shields.io/badge/pytest-passing-brightgreen)](https://docs.pytest.org/)

## Project Overview

This project is a machine learning system for detecting phishing messages from text. It implements a complete pipeline covering preprocessing, model training, evaluation, inference, and a FastAPI service.

The solution uses TF-IDF vectorization together with an ensemble model that combines the following algorithms:

* Logistic Regression
* Random Forest
* XGBoost

## CI/CD

Every push and pull request to the `main`/`master` branch triggers an automated pipeline:

```
checkout → setup Python 3.11 → install deps → run unit tests (pytest)
```

## Installation

### Prerequisites

* Python 3.11+
* pip

### Setup

```
# 1. Clone the repository
git clone https://github.com/cont1n3nt/ai-phishing-detector.git
cd ai-phishing-detector

# 2. Create and activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux / macOS
# source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt
```

### Run with the Pretrained Model

Download the model from HuggingFace Hub, then start the API server:

```
python scripts/download_model.py
python app.py
```

### Train from Scratch

Training requires the dataset at `data/preprocess/cleaned_dataset.csv`:

```
python -m src.train
```

### Run with Docker

```
docker build -t ai-phishing-detector .
docker run -p 8000:8000 ai-phishing-detector
```

### Run Tests

```
pytest
# or a specific subset
pytest tests/test_features.py tests/test_api.py -v
```

## Configuration

Create a `.env` file in the project root to override the default configuration:

```
THRESHOLD=0.4
MAX_FEATURES=5000
HF_MODEL_REPO=cont1n3nt/ai-phishing-model
HF_DATASET_REPO=cont1n3nt/ai-phishing-dataset
```

## Motivation

Phishing attacks remain one of the most common cybersecurity threats. Automated detection systems can significantly reduce risk by identifying malicious messages before users interact with them.

## Dataset

* Source: multiple phishing-related datasets collected from Kaggle and merged into a single dataset
* Total samples: about 114,000
* Class balance: about 50% phishing / about 50% legitimate
* [HuggingFace dataset](https://huggingface.co/datasets/cont1n3nt/ai-phishing-dataset)

### Preprocessing Steps

* Convert to lowercase
* Remove URLs
* Remove non-alphabetic characters
* Normalize whitespace
* TF-IDF vectorization (unigrams + bigrams, up to 5000 features)

Note: to keep the repository lightweight, the raw dataset is not included in this repository.

## Features and Preprocessing

Implemented in `src/features.py`:

* `clean_text(text: str) -> str`
* `preprocess_texts(texts: List[str]) -> List[str]`
* Input validation for inference
* Feature-contribution extraction for explainability

TF-IDF vectorization is applied inside a unified Pipeline.

## Model

The final model is a soft-voting ensemble:

* Logistic Regression (interpretable baseline)
* Random Forest (non-linear patterns)
* XGBoost (gradient boosting)

The ensemble improves robustness and generalization while retaining interpretability through Logistic Regression.

[HuggingFace model](https://huggingface.co/cont1n3nt/ai-phishing-model)

## Model Management

The trained pipeline is versioned and distributed through HuggingFace Hub. Artifacts include:

- TF-IDF vocabulary
- Preprocessing pipeline
- Ensemble weights
- Metadata/configuration

Use `python scripts/download_model.py` to fetch the latest version.

## Experiments and Metrics

Visualizations:

![mt](https://static.pigsec.cn/wp-content/uploads/repos/2026/05/052c84eaf2031528.png)
![cm](https://static.pigsec.cn/wp-content/uploads/repos/2026/05/562d789990031529.png)
![roc](https://static.pigsec.cn/wp-content/uploads/repos/2026/05/4f3a1ad1f5031530.png)

## Project Architecture

Mermaid diagram (GitHub-compatible):

```
graph TD
    A[Raw Data from Kaggle] --> B[Preprocessing]
    B --> C[TF-IDF Vectorization]
    C --> D[Model Training]
    D --> E[VotingClassifier: LR + RF + XGB]
    E --> F[Evaluation & Metrics]
    F --> G[Saved Pipeline]
    G --> H[Inference: predict.py]
    H --> I[FastAPI: app.py]
    I --> J[Prediction Output]
```

## API Usage

### Run FastAPI

```
python app.py
```

### Send a Prediction Request

```
curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Your account is blocked. Verify immediately."}'
```

Example response:

```
{
  "prediction": 1,
  "label": "PHISHING",
  "probability": 0.987,
  "top_words": [["verify", 0.84], ["account", 0.63]]
}
```

## Directory Structure

```
ai-phishing-detector/
├── app.py              FastAPI application
├── config.py           Configuration
├── requirements.txt    Dependencies
├── Dockerfile          Docker image
├── data/               datasets (not included)
├── images/             ROC curve and confusion matrix
├── model/              saved pipeline (auto-downloaded)
├── scripts/
│   └── download_model.py
├── src/
│   ├── features.py     preprocessing and explainability
│   ├── train.py        training and evaluation
│   └── predict.py      inference logic
└── tests/              unit tests
```

## References

* [scikit-learn](https://scikit-learn.org/)
* [XGBoost](https://xgboost.readthedocs.io/)
* [FastAPI](https://fastapi.tiangolo.com/)
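
## Illustrative Sketch: Preprocessing and Soft Voting

The text-cleaning steps listed under preprocessing and the soft-voting decision described for the model can be sketched with the standard library alone. This is a minimal illustration and not the repository's actual code. The function names `clean_text` and `preprocess_texts` mirror those listed for `src/features.py`, but their bodies here are assumptions. The helpers `soft_vote` and `label_for`, the equal model weights, the `>=` comparison, the ASCII-only letter filter, and the `"LEGITIMATE"` label are all hypothetical. Only the `0.4` threshold and the `"PHISHING"` label come from the README.

```python
import re
from typing import List, Sequence

# Assumed patterns; the real project may use different ones.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # assumes ASCII letters only
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Lowercase, drop URLs, drop non-alphabetic characters, collapse spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def preprocess_texts(texts: List[str]) -> List[str]:
    """Apply clean_text to every message in a batch."""
    return [clean_text(t) for t in texts]


def soft_vote(probabilities: Sequence[float]) -> float:
    """Average the per-model phishing probabilities (equal weights assumed)."""
    if not probabilities:
        raise ValueError("at least one model probability is required")
    return sum(probabilities) / len(probabilities)


def label_for(probability: float, threshold: float = 0.4) -> str:
    """Map a probability to a label; "LEGITIMATE" is a hypothetical name."""
    return "PHISHING" if probability >= threshold else "LEGITIMATE"


if __name__ == "__main__":
    cleaned = clean_text("Your ACCOUNT is blocked! Verify at https://evil.example/login now.")
    # Made-up probabilities standing in for LR, RF, and XGBoost outputs.
    combined = soft_vote([0.9, 0.8, 0.7])
    print(cleaned, round(combined, 3), label_for(combined))
```

In the real pipeline the cleaned text would pass through TF-IDF vectorization before reaching the three classifiers. The sketch skips that step and only shows how averaged probabilities interact with the `THRESHOLD` setting.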
