Tharaa-Oueslati/MalScan-ML

GitHub: Tharaa-Oueslati/MalScan-ML

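A machine-learning-based static analysis and classification platform for Windows PE malware. It extracts 77 features to automatically distinguish malicious files from benign ones.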

Stars: 0 | Forks: 0

# 🛡️ MalScan-ML: Static Malware Detector

[![Python](https://img.shields.io/badge/Python-3.10+-3776ab?logo=python&logoColor=white)](https://python.org) [![scikit-learn](https://img.shields.io/badge/scikit--learn-1.4+-f7931e?logo=scikitlearn&logoColor=white)](https://scikit-learn.org) [![XGBoost](https://img.shields.io/badge/XGBoost-2.0+-189fdd)](https://xgboost.ai) [![Tests](https://img.shields.io/badge/tests-28%20passed-brightgreen)](tests/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow)](LICENSE)

## 📊 Results

| Model | Accuracy | F1 Score | AUC-ROC | Precision | Recall | CV AUC (5-fold) |
|---|---|---|---|---|---|---|
| Random Forest | 95.5% | 0.955 | 0.953 | 0.958 | 0.952 | 0.964 ± 0.002 |
| **XGBoost** ✓ | **95.5%** | **0.955** | **0.955** | **0.958** | **0.952** | **0.965 ± 0.001** |

*Evaluated on a held-out set of 1,000 samples. 4% label noise was added to simulate evasive or mislabeled samples.*

## ⚡ Quick Start

```
git clone https://github.com/Tharaa-Oueslati/MalScan-ML.git
cd MalScan-ML
pip install -r requirements.txt

# Train models
python src/train.py --data data/processed/features.csv

# Scan a file
python scan.py suspicious.exe
```

### CLI Output

```
╔══════════════════════════════════════════╗
║       Static Malware Detector v1.0       ║
╚══════════════════════════════════════════╝

[*] Analyzing: suspicious.exe...

┌────────────────────────────────────────────┐
│  File: suspicious.exe                      │
│  ⚠️ VERDICT: MALWARE   Confidence: 94.7%   │
│  [████████████████████░░░░]                │
│  Model: xgboost                            │
├────────────────────────────────────────────┤
│  Risk Indicators:                          │
│  • High file entropy (7.8) — likely packed │
│  • CreateRemoteThread — process injection  │
│  • IsDebuggerPresent — anti-debug technique│
│  • Network library imports (WinINet)       │
│  • No debug info — stripped binary         │
└────────────────────────────────────────────┘

⏱ Scan time: 47ms
```

## 🔬 Feature Engineering (77 Features)

We extract **77 features** across 5 categories from the PE (Portable Executable) binary format:

| Category | Count | Key signals |
|---|---|---|
| **Header** | 22 | `compile_timestamp` (often 0 in malware), `checksum`, `entry_point_entropy` |
| **Sections** | 14 | Shannon entropy, `virtual/raw ratio`, W^X sections, packer names |
| **Imports** | 21 | `CreateRemoteThread`, `WriteProcessMemory`, `IsDebuggerPresent`, networking DLLs |
| **Strings** | 10 | URLs, IP addresses, base64 blobs, registry keys, `powershell`/`cmd.exe` |
| **Metadata** | 7 | `file_entropy`, `has_debug_info`, packer heuristics |

### Why Entropy Is a Key Signal

Shannon entropy measures the randomness of a byte sequence, on a scale from 0 to 8 bits per byte. Malware authors often encrypt or compress their payloads to evade signature-based detection, which pushes entropy above 7.0. Legitimate executables typically fall between 4.0 and 6.5.

![Entropy distribution](https://raw.githubusercontent.com/Tharaa-Oueslati/MalScan-ML/main/reports/figures/02_entropy_distribution.png)

## 📈 Evaluation

### ROC Curves

![ROC curves](https://raw.githubusercontent.com/Tharaa-Oueslati/MalScan-ML/main/reports/figures/05_roc_curves.png)

### Confusion Matrices

![Confusion matrices](https://raw.githubusercontent.com/Tharaa-Oueslati/MalScan-ML/main/reports/figures/06_confusion_matrices.png)

### Feature Importance

![Feature importance](https://raw.githubusercontent.com/Tharaa-Oueslati/MalScan-ML/main/reports/figures/07_feature_importance.png)

### Cross-Validation (5-fold)

![Cross-validation](https://raw.githubusercontent.com/Tharaa-Oueslati/MalScan-ML/main/reports/figures/08_cross_validation.png)

### Prediction Confidence and Misclassification Analysis

![Confidence analysis](https://raw.githubusercontent.com/Tharaa-Oueslati/MalScan-ML/main/reports/figures/09_confidence_misclassification.png)

## 🗂️ Project Structure

```
MalScan-ML/
├── src/
│   ├── feature_extractor.py       # PE parsing + 77-feature extraction engine
│   ├── train.py                   # Training pipeline with CV and plot generation
│   ├── evaluate.py                # ROC, confusion matrix, importance plots
│   ├── predict.py                 # Inference + human-readable risk indicators
│   └── utils.py                   # Dataset utilities + synthetic data generator
├── notebooks/
│   └── MalScan_Analysis.ipynb     # Full EDA + model comparison (11 sections)
├── data/
│   ├── processed/                 # Feature CSVs (generated by feature_extractor.py)
│   └── samples/                   # Test PE files (benign only in repo)
├── models/                        # Saved model .pkl files
├── reports/figures/               # Generated evaluation plots (10 plots)
├── tests/
│   └── test_feature_extractor.py  # 28 unit + integration tests
├── .github/workflows/ci.yml       # GitHub Actions CI (Python 3.10 + 3.11)
├── scan.py                        ← CLI entry point
└── requirements.txt
```

## 🧪 Tests

```
pytest tests/ -v
# 28 passed in 2.36s
```

The tests cover entropy edge cases, feature-extraction validation, dataset integrity checks, and a full RF + XGBoost pipeline integration test.

## 📦 Using Real Data

This project is compatible with the **[EMBER dataset](https://github.com/elastic/ember)** (Elastic Malware Benchmark for Empowering Researchers, built from 1.1 million PE files). To extract features from your own PE samples:

```
# Extract features from a malware directory
python src/feature_extractor.py -d /path/to/malware/ -l 1 -o malware_features.csv

# Extract features from benign samples
python src/feature_extractor.py -d /path/to/benign/ -l 0 -o benign_features.csv

# Combine and train
cat malware_features.csv benign_features.csv > combined.csv
python src/train.py --data combined.csv
```

## 🔧 Tech Stack

`pefile` · `scikit-learn` · `xgboost` · `numpy` · `pandas` · `matplotlib` · `seaborn` · `click` · `joblib` · `pytest`

## 📄 License

MIT. See [LICENSE](LICENSE) for details.
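
To make the entropy signal from the feature engineering section concrete, here is a minimal sketch of byte-level Shannon entropy using only the standard library. The function names `shannon_entropy` and `looks_packed` are hypothetical, and the 7.0 cutoff is simply the figure quoted in this README; this is not necessarily how `src/feature_extractor.py` computes `file_entropy`.

```python
import math
from collections import Counter

# Assumption: cutoff taken from the "above 7.0" figure quoted in this README.
PACKED_THRESHOLD = 7.0


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_packed(data: bytes, threshold: float = PACKED_THRESHOLD) -> bool:
    """Hypothetical heuristic: flag data whose entropy exceeds the threshold."""
    return shannon_entropy(data) > threshold


if __name__ == "__main__":
    uniform = bytes(range(256)) * 4   # every byte value equally likely
    constant = b"A" * 1024            # a single repeated byte
    print(f"uniform:  {shannon_entropy(uniform):.2f} packed={looks_packed(uniform)}")
    print(f"constant: {shannon_entropy(constant):.2f} packed={looks_packed(constant)}")
```

A uniform byte distribution reaches the maximum of 8 bits per byte, while a constant buffer scores 0, which is why compressed or encrypted payloads stand out against ordinary code and data.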
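
One caveat about the `cat` step in the real-data instructions: if each feature CSV starts with a header row (an assumption, since the output format of `feature_extractor.py` is not shown here), plain concatenation leaves a second header in the middle of `combined.csv`. The sketch below, with a hypothetical helper `merge_feature_csvs`, merges the files while keeping a single header.

```python
import csv
from typing import Iterable


def merge_feature_csvs(inputs: Iterable[str], output: str) -> int:
    """Concatenate CSVs that share a header row; return the data rows written.

    Assumes every input file begins with the same header (hypothetical layout).
    """
    header = None
    rows_written = 0
    with open(output, "w", newline="") as out_f:
        writer = csv.writer(out_f)
        for path in inputs:
            with open(path, newline="") as in_f:
                reader = csv.reader(in_f)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                elif file_header != header:
                    raise ValueError(f"{path}: header differs from first file")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


if __name__ == "__main__":
    n = merge_feature_csvs(
        ["malware_features.csv", "benign_features.csv"], "combined.csv"
    )
    print(f"wrote {n} rows to combined.csv")
```

If the extractor writes no header at all, the original `cat` command is already correct and this helper is unnecessary.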

Tags: Apex, reverse DNS lookup, Windows PE files, XGBoost, cloud security monitoring, security rule engine, machine learning, reverse-engineering tools, static analysis