partha2005-source/Network_Anomaly_Detection

GitHub: partha2005-source/Network_Anomaly_Detection

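An AI network intrusion detection system that trains Decision Tree and Random Forest classifiers on the NSL-KDD dataset, with an end-to-end ML pipeline and an interactive Streamlit dashboard.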

Stars: 0 | Forks: 0

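# AI-Powered Network Anomaly Detection System 🛡️🤖

[![Python Version](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/) [![Framework](https://img.shields.io/badge/framework-Streamlit-red.svg)](https://streamlit.io/) [![ML Framework](https://img.shields.io/badge/ML-Scikit--Learn-orange.svg)](https://scikit-learn.org/) [![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

An end-to-end machine learning pipeline and interactive network security dashboard that uses the **NSL-KDD dataset** for real-time network traffic classification and intrusion detection.

## 📖 Project Overview

Modern enterprise network architectures face persistent and evolving threats. Signature-based firewalls cannot effectively classify zero-day exploits or custom attack vectors. This repository implements a complete **anomaly-based Network Intrusion Detection System (NIDS)**. By evaluating parameters of each connection flow (duration, protocol, routing flags, packet volume, error rates), the system distinguishes **normal traffic** from **anomalous traffic** (representing Denial of Service, probe scanning, Remote-to-Local, and User-to-Root privilege escalation attacks).

We trained and compared two main classifiers:

1. **Decision Tree Classifier**
2. **Random Forest ensemble classifier**

The best-performing model is exported and served through a web dashboard that offers interactive, real-time single-connection diagnostics and batch traffic log prediction.

## ✨ Features

* 📥 **Automatic Ingestion Pipeline:** Downloads and caches the NSL-KDD dataset automatically, with a fallback synthetic data generator so the project runs offline out of the box.
* ⚠️ **Robust Preprocessor:** A custom Label Encoder with an out-of-vocabulary fallback mapping, plus defensive standardization of numeric columns to handle missing fields.
* 🔍 **Single-Connection Diagnostics:** A web form with preset configurations (for example normal HTTP traffic, SYN Flood DoS, Port Sweep Probe) for quick evaluation.
* 📁 **Batch Log Processor:** Upload a CSV packet capture log, run predictions over thousands of rows, and download the diagnosed threat log.
* 📊 **Security Analytics Dashboard:** Interactive Plotly charts showing protocol distribution, the correlation matrix, and model comparison gauges.
* 📄 **Academic Report Portal:** A built-in portal for viewing the academic report summary and technical tables, and for downloading the serialized model pickle files directly.

## 🛠️ Tech Stack

* **Programming Language:** Python 3.12
* **Machine Learning:** Scikit-Learn
* **Data Science:** Pandas, NumPy
* **Visualization:** Plotly, Matplotlib, Seaborn
* **Frontend:** Streamlit
* **Serialization:** Joblib

## 📦 Directory Structure

```
Network-Anomaly-Detection/
├── data/                        # Cached dataset files (KDDTrain+, KDDTest+)
├── models/                      # Serialized preprocessors and classifiers
│   ├── preprocessor.pkl
│   ├── decision_tree.pkl
│   └── random_forest.pkl
├── reports/                     # Diagnostic plots, JSON metrics, and reports
│   ├── confusion_matrix_*.png
│   ├── feature_importance_*.png
│   ├── decision_tree_metrics.json
│   ├── random_forest_metrics.json
│   ├── project_report.md        # Full academic report (Chapters 1-8)
│   ├── interview_prep.md        # 30 Tech QA & 25 Viva QA
│   ├── resume_content.md        # Ready resume statements
│   └── testing_report.md        # Functional/Load test records
├── src/                         # Pipeline source code modules
│   ├── data_loader.py           # Data ingestion and synthetic generator
│   ├── preprocessing.py         # Robust encoders and standard scalers
│   └── models.py                # Classifier definition & metric evaluations
├── app.py                       # Multi-page Streamlit web application
├── train.py                     # Training orchestrator script
├── predict.py                   # CLI single/batch inference utility
├── requirements.txt             # Package dependencies
└── model.pkl                    # Production model file (Random Forest)
```

## 🚀 Installation and Setup

### 1. Clone the Repository

```
git clone https://github.com/your-username/network-anomaly-detection.git
cd network-anomaly-detection
```

### 2. Configure a Virtual Environment (Optional but Recommended)

```
python -m venv venv
venv\Scripts\activate # On Windows
source venv/bin/activate # On macOS/Linux
```

### 3. Install Dependencies

```
python -m pip install -r requirements.txt
```

## 💻 Usage

### 1. Run the Training Pipeline

This loads the dataset, fits the preprocessing parameters, trains both classifiers, saves the evaluation plots, and exports the production weights:

```
python train.py
```

### 2. Run CLI Predictions

Test predictions directly in the terminal using a mock configuration:

```
python predict.py --run-test
```

Or run a prediction on a custom JSON payload:

```
python predict.py --json "{\"duration\":0,\"protocol_type\":\"tcp\",\"service\":\"http\",\"flag\":\"SF\",\"src_bytes\":340,\"dst_bytes\":1480,\"count\":2,\"srv_count\":2,\"serror_rate\":0.0}"
```

Or run batch predictions on a raw CSV file and write the results to a file:

```
python predict.py --csv path/to/logs.csv --out path/to/results.csv
```

### 3. Launch the Streamlit Dashboard

Start the interactive web application:

```
streamlit run app.py
```

Open your web browser and go to `http://localhost:8501`.

## 📊 Experimental Results

| Classifier Model | Test Accuracy | Precision | Recall | F1-Score | ROC-AUC |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Decision Tree** | 75.62% | 96.54% | 59.29% | 73.46% | 72.98% |
| **Random Forest** | **76.40%** | **96.66%** | **60.64%** | **74.53%** | **96.99%** |

We selected the **Random Forest** model for production because it scores higher than the Decision Tree on every metric, most notably F1-Score (74.53%) and ROC-AUC (96.99%). This indicates a more reliable anomaly boundary and, of the two models, the lower risk of missing security violations.

## 🔮 Future Improvements

1. **Multi-Class Labels:** Upgrade the classification layer to identify specific attack types (such as Neptune and Satan) to support specialized incident response.
2. **Unsupervised Anomaly Detection:** Implement an Autoencoder network to catch novel attacks without requiring historical labels.
3. **Live Capture Auditing:** Integrate the pipeline with Scapy to sniff and classify live network packet captures.
4. **Edge Integration:** Convert the serialized pipeline to ONNX format to support low-latency firewalls.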
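The Features section describes a custom Label Encoder with an out-of-vocabulary fallback mapping. The repository's actual implementation in `src/preprocessing.py` is not shown here, so the following is only a minimal sketch of the general idea: categories seen during fitting get their own integer, and anything unseen at prediction time falls back to a reserved index instead of raising an error. The class name `SafeLabelEncoder` and the choice of index 0 for unknown values are assumptions made for illustration.

```python
class SafeLabelEncoder:
    """Hypothetical encoder: unseen categories map to a reserved index."""

    UNKNOWN_INDEX = 0  # assumed convention, not taken from the repository

    def fit(self, values):
        # Sort the distinct categories so the mapping is deterministic,
        # and start at 1 to keep index 0 free for unseen values.
        self.mapping_ = {
            category: index
            for index, category in enumerate(sorted(set(values)), start=1)
        }
        return self

    def transform(self, values):
        # dict.get supplies the fallback index for out-of-vocabulary input.
        return [self.mapping_.get(value, self.UNKNOWN_INDEX) for value in values]


encoder = SafeLabelEncoder().fit(["tcp", "udp", "icmp"])
# "gre" was never seen during fitting, so it maps to the fallback index.
print(encoder.transform(["tcp", "gre", "icmp"]))  # prints [2, 0, 1]
```

Reserving a dedicated index keeps inference from failing on a protocol, service, or flag value that did not appear in the training data, at the cost of treating all unseen values as one category.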
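As a reading aid for the Experimental Results table, the F1-Score column can be reproduced from the Precision and Recall columns, since F1 is their harmonic mean. The sketch below uses only the figures reported in the table; the function name `f1_from_precision_recall` is hypothetical and not part of the repository.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table.
reported = {
    "Decision Tree": (0.9654, 0.5929),
    "Random Forest": (0.9666, 0.6064),
}

for name, (precision, recall) in reported.items():
    f1 = f1_from_precision_recall(precision, recall)
    # Share of anomalous connections the model fails to flag.
    miss_rate = 1 - recall
    print(f"{name}: F1 = {f1 * 100:.2f}%, missed anomalies = {miss_rate * 100:.2f}%")
```

The computed values match the table (73.46% and 74.53%). The same arithmetic shows that the high precision is paired with a recall near 60%, so roughly four in ten anomalous test connections are still classified as normal by either model.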

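Tags: Apex, Kubernetes, Python, Scikit-learn, Streamlit, intrusion detection system, backend development, security data lake, anomaly detection, backdoor-free, machine learning, network security, access control, reverse-engineering tools, privacy protection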