Coaad/RPDS_KAAD

GitHub: Coaad/RPDS_KAAD

A real-time phishing URL detection system built on a hybrid engine of Char-CNN, LightGBM, and blacklists, providing explainable risk scores and a RESTful API.

Stars: 0 | Forks: 2

# RPDS – Real-Time Phishing & Web Threat Detection System

### KAAD Cyber Intelligence Core · Mini Brain #1

**Architect:** KAAD · **Model Version:** KAAD-1.1.0

## 🏗 Project Structure

```
rpds/
├── backend/                          ← FastAPI Python backend
│   ├── config.py                     ← Central configuration
│   ├── logging_config.py             ← Structured JSON logging (structlog)
│   ├── connectivity.py               ← Cached online/offline checker
│   ├── circuit_breaker.py            ← Async circuit breaker
│   ├── url_validator.py              ← URL validation & normalisation
│   ├── feature_extractor.py          ← Structural feature extractor
│   ├── main.py                       ← FastAPI app entry point
│   ├── api/
│   │   ├── routes.py                 ← /analyze, /health, /status
│   │   ├── schemas.py                ← Pydantic v2 models
│   │   └── middleware.py             ← CORS + request logging
│   ├── engines/
│   │   ├── whitelist_engine.py       ← Trusted domain bypass
│   │   ├── blacklist_engine.py       ← OpenPhish feed + cache
│   │   ├── cnn_engine.py             ← Char-CNN inference (PyTorch CPU)
│   │   ├── tree_engine.py            ← LightGBM/RandomForest
│   │   └── orchestrator.py           ← KAAD Core: adaptive fusion
│   ├── training/                     ← Standalone offline training scripts
│   │   ├── dataset_utils.py          ← Deduplicate, balance, split
│   │   ├── train_cnn.py              ← CNN training (FocalLoss + 5-fold CV)
│   │   ├── train_tree.py             ← Tree model training
│   │   ├── calibrate.py              ← Temperature scaling calibration
│   │   └── tune_threshold.py         ← ROC threshold tuning
│   ├── models/                       ← Trained model files (generated by training)
│   └── data/
│       ├── raw/                      ← Place your phishing dataset CSV here
│       └── whitelists/whitelist.txt
├── frontend/                         ← Next.js 14 UI
│   ├── app/
│   │   ├── layout.tsx
│   │   ├── page.tsx                  ← Main scan page
│   │   └── globals.css               ← Cyber dark theme
│   ├── components/
│   │   ├── StatusBar.tsx             ← Online/offline + KAAD branding
│   │   ├── ScanInput.tsx             ← Multi-URL scanner input
│   │   ├── RiskGauge.tsx             ← Animated SVG arc gauge
│   │   ├── EngineCards.tsx           ← Per-engine score cards
│   │   ├── ThreatReasoningPanel.tsx  ← Explainability layer
│   │   ├── SystemLog.tsx             ← AI system log panel
│   │   └── ResultCard.tsx            ← Full result per URL
│   └── lib/api.ts                    ← Typed fetch client
├── smoke_test.py                     ← Quick orchestrator verification
├── start_backend.bat                 ← Windows: install + start backend
└── start_frontend.bat                ← Windows: install + start frontend
```

## 🚀 Quick Start

### Step 1: Install Python Dependencies and Start the Backend

```
start_backend.bat
```

Or manually:

```
cd backend
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

### Step 2: Start the Frontend

```
start_frontend.bat
```

Or manually:

```
cd frontend
npm install
npm run dev
```

Open **http://localhost:3000** in your browser.

## 🤖 Train Your Own Models (Optional, but Recommended for Full Accuracy)

The system **works without trained models**: it runs structural risk analysis using entropy, TLD scoring, keyword detection, and similar signals. Until they are trained, the CNN and Tree models show "N/A".

### 1. Get a Dataset

Download a phishing URL dataset:

- [PhishTank](https://www.phishtank.com/developer_info.php): free CSV download
- [UCI Phishing URLs](https://archive.ics.uci.edu/ml/datasets/phishing+websites)

Place the CSV in `backend/data/raw/`.

### 2. Prepare the Dataset

```
cd backend
python -m training.dataset_utils data/raw/phishing.csv url label 1
```

### 3. Train the Char-CNN

```
python -m training.train_cnn
# Saves to: backend/models/cnn_model.pt
```

### 4. Train the Tree Model

```
python -m training.train_tree
# Saves to: backend/models/tree_model.pkl
```

### 5. Calibrate and Tune the Threshold

```
python -m training.calibrate
python -m training.tune_threshold
```

Restart the backend server; the engines will load automatically.

## 📡 API Reference

**Base URL:** `http://localhost:8000`

| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/analyze` | POST | Analyze 1-10 URLs |
| `/api/v1/health` | GET | System health + engine status |
| `/api/v1/status` | GET | Version + mode information |
| `/docs` | GET | Swagger UI |

### Analyze Request

```
POST /api/v1/analyze
{ "urls": ["https://suspicious-site.tk/login"] }
```

### Response

```
{
  "engine": "RPDS – KAAD CORE",
  "architect": "KAAD",
  "mode": "online",
  "url": "http://suspicious-site.tk/login",
  "final_verdict": "HIGH RISK",
  "final_score": 0.73,
  "confidence_class": "HIGH",
  "engines": {
    "whitelist": {"hit": false, "matched": null},
    "blacklist": {"hit": false, "score": 0.0, "available": true},
    "cnn": {"score": 0.0, "available": false},
    "tree": {"score": 0.0, "available": false, "model_type": "none"}
  },
  "threat_reasoning": [
    {"factor": "High-risk TLD (risk=1.0)", "impact": "TLD commonly abused for phishing campaigns"},
    {"factor": "Suspicious keyword detected", "impact": "Common phishing pattern"}
  ],
  "structural_analysis": { "entropy": 3.8, "tld_risk_score": 1.0, ... },
  "threat_signature": "sha256hex...",
  "model_version": "KAAD-1.1.0",
  "analysis_time_ms": 1.3
}
```

## 🔒 Stability Guarantees

| Rule | Implementation |
|---|---|
| Models are loaded only once | `@app.on_event("startup")` |
| No retraining during inference | Training scripts are standalone |
| Blacklist timeout + circuit breaker | `circuit_breaker.py` |
| Every engine is wrapped in try/except | Returns a degraded result, never crashes |
| Offline fallback | With no network connection, only CNN + Tree are used |
| At most 10 URLs per request | Pydantic `max_length=10` |
| Maximum URL length of 2048 | Validator + Pydantic |
| Global exception handler | 500 → JSON error (stack traces are never exposed) |

## 🎯 Target Performance (After Training on a High-Quality Dataset)

| Metric | Target |
|---|---|
| Accuracy | > 99% |
| Precision | > 98% |
| Recall | > 97% |
| False positive rate | < 1% |
| ROC-AUC | > 0.995 |
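
The analyze request from the API reference can be sent from Python with the standard library alone. The sketch below is illustrative and not part of the repository: the helper names `build_analyze_request` and `analyze` are hypothetical, and the client-side checks simply mirror the documented limits (1-10 URLs per request, at most 2048 characters per URL). The exact response shape for multi-URL requests is not shown in this README, so the decoded JSON is returned as-is.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # Base URL from the API reference
MAX_URLS = 10                       # documented per-request limit
MAX_URL_LENGTH = 2048               # documented maximum URL length


def build_analyze_request(urls, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/v1/analyze.

    Hypothetical helper: it mirrors the documented limits on the client
    side so obviously invalid requests fail before reaching the server.
    """
    if not 1 <= len(urls) <= MAX_URLS:
        raise ValueError(f"expected 1-{MAX_URLS} URLs, got {len(urls)}")
    for url in urls:
        if len(url) > MAX_URL_LENGTH:
            raise ValueError(f"URL longer than {MAX_URL_LENGTH} characters")
    body = json.dumps({"urls": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(urls, base_url=BASE_URL, timeout=10.0):
    """Send the request and return the decoded JSON response unchanged."""
    request = build_analyze_request(urls, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running locally, `analyze(["https://suspicious-site.tk/login"])` should return a payload like the response example above.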

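The "blacklist timeout + circuit breaker" guarantee can be pictured with a minimal async circuit breaker. This is a sketch of the general pattern, not the contents of `circuit_breaker.py`: the class name, the parameters (`failure_threshold`, `reset_timeout`, `call_timeout`), and their default values are all assumptions chosen for illustration.

```python
import asyncio
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class AsyncCircuitBreaker:
    """Illustrative breaker: opens after repeated failures, retries later."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 call_timeout=2.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.call_timeout = call_timeout            # per-call timeout (seconds)
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        # After reset_timeout the breaker is half-open: one trial call passes.
        return self._clock() - self._opened_at < self.reset_timeout

    async def call(self, func, *args, **kwargs):
        """Await func(*args, **kwargs) unless the circuit is open."""
        if self.is_open:
            raise CircuitOpenError("circuit open; skipping call")
        try:
            result = await asyncio.wait_for(
                func(*args, **kwargs), timeout=self.call_timeout
            )
        except Exception:
            # Timeouts and errors both count as failures.
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller such as a blacklist lookup would catch `CircuitOpenError` (and timeouts) and return a degraded result instead of failing the whole analysis, which matches the "returns a degraded result, never crashes" rule in the table above.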
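Tags: AV bypass, Char-CNN, FastAPI, Go tools, LightGBM, Pydantic V2, Python, PyTorch, SEO security detection, web security, credential scanning, threat intelligence, character-level convolutional neural network, real-time threat protection, developer tools, async circuit breaker, malicious URL detection, no backdoor, machine learning inference, model fine-tuning, hybrid machine learning model, whitelist mechanism, structured logging, network threat blocking, network security, network security API, network security defense system, phishing detection, automated attacks, adaptive decision fusion, blue team analysis, reverse engineering tools, random forest, privacy protection, blacklist filtering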