Vimal7747/Cybersecurity-Threat-Intelligence

GitHub: Vimal7747/Cybersecurity-Threat-Intelligence

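An AI/ML-based research project on predictive cybersecurity threat intelligence that addresses class imbalance and the missed detection of rare attacks in network traffic.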

Stars: 1 | Forks: 0

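# 🔐 Leveraging Artificial Intelligence for Predictive Cybersecurity Threat Intelligence

## 📌 Project Overview

This project develops and evaluates an AI/ML framework for **predictive cybersecurity threat intelligence**, focusing on detecting **minority-class cyber attacks** (Bot traffic, infiltration, APT) in highly imbalanced network traffic datasets.

The core challenge addressed: traditional machine learning models are biased toward the majority class (benign traffic), which causes them to **miss rare but dangerous attacks**. This project tackles the problem by applying **SMOTE oversampling** and comparing four classification models.

## 🧠 Models Evaluated

| Model | Accuracy | Bot Recall | Bot F1-Score |
|---|---|---|---|
| Random Forest | **1.00** | **1.00** | **1.00** |
| XGBoost | **1.00** | **1.00** | **1.00** |
| SVM (LinearSVC) | 0.96 | 1.00 | 0.93 |
| Logistic Regression | 0.97 | 1.00 | 0.95 |

## 📂 Repository Structure

```
cybersecurity-threat-intelligence/
│
├── notebooks/
│   └── Artefacts.ipynb                  # Main ML pipeline notebook
│
├── data/
│   └── README_data.md                   # Dataset info & download instructions
│
├── results/
│   ├── random_forest_report.txt
│   ├── xgboost_report.txt
│   ├── svm_report.txt
│   └── logistic_regression_report.txt
│
├── docs/
│   └── Applied_Research_Project.pdf     # Full thesis document
│
├── requirements.txt                     # Python dependencies
├── .gitignore
└── README.md
```

## 📊 Dataset

**CIC-IDS2018 (UNB, 2018)**: intrusion detection dataset from the Canadian Institute for Cybersecurity

- **Rows:** 1,048,575 | **Features:** 80 | **Target:** `Label` (Benign / Bot)
- The dataset is **not included in this repository** because of its size (about 336 MB).
- 📥 Download: [https://www.unb.ca/cic/datasets/ids-2018.html](https://www.unb.ca/cic/datasets/ids-2018.html)
- After downloading, place the file at: `data/CIC-IDS2018.csv`

**Class imbalance:**

| Class | Count |
|---|---|
| Benign | ~700,000 |
| Bot | ~286,000 |

After SMOTE, both classes are balanced to **608,644 records** each.

## ⚙️ Methodology (CRISP-DM)

```
1. Business Understanding  → Detect rare cyber threats proactively
2. Data Understanding      → EDA on CIC-IDS2018 (class dist, flow analysis, port analysis)
3. Data Preprocessing      → Remove duplicates, encode labels, handle inf/NaN, drop Timestamp
4. Modeling                → Random Forest, XGBoost, SVM, Logistic Regression
5. Evaluation              → Precision, Recall, F1-Score, Confusion Matrix
6. Deployment              → Framework for real-time SOC integration
```

## 🚀 Quick Start

### 1. Clone the repository

```
git clone https://github.com/YOUR_USERNAME/cybersecurity-threat-intelligence.git
cd cybersecurity-threat-intelligence
```

### 2. Install dependencies

```
pip install -r requirements.txt
```

### 3. Add the dataset

Download the CIC-IDS2018 dataset and place it at:

```
data/CIC-IDS2018.csv
```

### 4. Run the notebook

```
jupyter notebook notebooks/Artefacts.ipynb
```

Alternatively, open it directly in **Google Colab** (recommended for the large dataset):

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/)

## 🔬 Key Findings

- **SMOTE** resolved the class imbalance, balancing both classes to 608,644 samples each
- **Random Forest** and **XGBoost** achieved perfect classification scores on the balanced dataset
- **SVM** and **Logistic Regression** achieved perfect Bot recall (1.00) but lower Bot precision, as their Bot F1-scores (0.93 and 0.95) show: some benign flows were misclassified as Bot
- **Recall** on malicious traffic exceeded **90%** for all models, satisfying hypothesis H1
- Ensemble models clearly outperformed linear models on non-linear feature relationships

## 🛠️ Tech Stack

- **Language:** Python 3.x
- **ML libraries:** scikit-learn, XGBoost, imbalanced-learn
- **Data processing:** pandas, NumPy
- **Visualization:** Matplotlib, Seaborn
- **Environment:** Google Colab / Jupyter Notebook

## 🔮 Future Work

- [ ] Apply **SHAP / LIME** for model explainability
- [ ] Test **ADASYN** and the hybrid **SMOTE-ENN** resampling methods
- [ ] Validate the models on **real network traffic** to assess real-world generalization
- [ ] Implement a **continuous retraining** pipeline to keep up with emerging threats
- [ ] Extend to multi-class attack classification (infiltration, DDoS, web attacks, APT)

## 📄 Citation

If you use this project in your research, please cite:

```
Vimalkanth, M.V. (2025). Leveraging Artificial Intelligence for Predictive Cybersecurity Threat Intelligence. M.Sc. Applied Research Project, Dublin Business School.
```

## 📜 License

This project is for academic use only. Use of the dataset is subject to the [UNB CIC-IDS2018 terms of use](https://www.unb.ca/cic/datasets/ids-2018.html).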
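
## 🧮 Worked Example: Accuracy vs. Bot Recall

The core challenge described in the project overview, a model biased toward benign traffic that looks accurate while missing attacks, can be illustrated with a small calculation. The sketch below is not taken from the project notebook: the `binary_metrics` helper and all counts are hypothetical, chosen only to show how accuracy, precision, recall and F1 are derived from a confusion matrix for the Bot class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive (Bot) class.

    Hypothetical helper for illustration; not part of the project code.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up test set: 990 benign flows and 10 Bot flows.
# Model A labels every flow as benign, so it never raises an alert.
always_benign = binary_metrics(tp=0, fp=0, fn=10, tn=990)

# Model B catches all 10 Bot flows at the cost of 40 false alarms.
high_recall = binary_metrics(tp=10, fp=40, fn=0, tn=950)

for name, metrics in [("always benign", always_benign),
                      ("high recall", high_recall)]:
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

With these made-up counts, the first model reaches 0.99 accuracy with zero Bot recall, while the second has lower accuracy (0.96) but catches every Bot flow. This is why the results table reports Bot recall and Bot F1 alongside accuracy.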

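Tags: AI Security, Apex, APT Detection, Bot Detection, Chat Copilot, CIC-IDS2018, Logistic Regression, MSc Research Project, NoSQL, SMOTE, SVM, XGBoost, Imbalanced Data, Classification Models, Minority-Class Detection, Anomaly Detection, Data Science, Machine Learning, Model Evaluation, Deep Learning, Feature Engineering, Cybersecurity, Network Traffic Analysis, Resource Validation, Oversampling, Reverse-Engineering Tools, Random Forest, Privacy Protection, Predictive Threat Intelligence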