vytautas-bunevicius/windows-malware-classifier

GitHub: vytautas-bunevicius/windows-malware-classifier

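An end-to-end Python pipeline for static malware analysis that extracts PE file features and uses XGBoost, neural network, and AutoGluon models to detect Windows malware, optimized for a low false positive rate.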

Stars: 0 | Forks: 0

# Windows Malware Classifier

A static analysis pipeline for detecting malicious Windows binaries. It extracts Portable Executable (PE) and script features, builds a signal-rich dataset, and trains machine learning models optimized for a low false positive rate (FPR).

## Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Dataset](#dataset)
4. [Key Results](#key-results)
5. [Repository Layout](#repository-layout)
6. [Prerequisites](#prerequisites)
7. [Installation](#installation)
8. [Usage](#usage)
9. [Testing](#testing)
10. [Development](#development)
11. [Future Improvements](#future-improvements)
12. [License](#license)

## Overview

- End-to-end feature extraction downloads binaries from S3, validates PE headers, and exports a CSV with schema documentation for modeling.
- Feature engineering layers hand-crafted structural, entropy, and temporal signals with Featuretools-generated interaction terms to improve classifier recall.
- The modeling workflow covers neural networks, gradient-boosted trees, and AutoGluon ensembles. Each model is tuned to meet a strict FPR target while staying interpretable through SHAP and reporting tools.

## Architecture

### Project Flow

```mermaid
flowchart TD
    S1["Feature Extraction<br/>PE and Script Analysis"]
    S2["Feature Engineering<br/>Signal Enhancement"]
    S3["Model Training<br/>XGBoost, NN, AutoGluon"]
    S4["Evaluation<br/>FPR Tuning and SHAP"]
    S5["Deployment<br/>Model Artifacts"]

    S1 --> S2
    S2 --> S3
    S3 --> S4
    S4 --> S5
    S4 -.-> S2
    S4 -.-> S3

    classDef prepStyle fill:#7E7AE6,stroke:#3D3270,stroke-width:2px,color:#FFFFFF
    classDef analysisStyle fill:#3A5CED,stroke:#18407F,stroke-width:2px,color:#FFFFFF
    classDef decisionStyle fill:#85A2FF,stroke:#18407F,stroke-width:2px,color:#FFFFFF
    classDef validationStyle fill:#82E5E8,stroke:#1C8BA5,stroke-width:2px,color:#FFFFFF
    classDef documentationStyle fill:#C2A9FF,stroke:#3D3270,stroke-width:2px,color:#FFFFFF

    class S1 prepStyle
    class S2 analysisStyle
    class S3 decisionStyle
    class S4 validationStyle
    class S5 documentationStyle
```

### Feature Extraction Pipeline

```mermaid
flowchart TD
    S1["S3 File Download<br/>s3_downloader.py"]
    S2["PE Validation<br/>utils.py"]
    S3["File Metadata<br/>models.py"]
    S4["PE Feature Extraction<br/>pipeline.py"]
    S5["Non-PE Analysis<br/>non_pe_analyzer.py"]
    S6["CSV Export<br/>main.py"]

    S1 --> S2
    S2 --> S3
    S3 --> S4
    S3 --> S5
    S4 --> S6
    S5 --> S6

    classDef prepStyle fill:#7E7AE6,stroke:#3D3270,stroke-width:2px,color:#FFFFFF
    classDef analysisStyle fill:#3A5CED,stroke:#18407F,stroke-width:2px,color:#FFFFFF
    classDef decisionStyle fill:#85A2FF,stroke:#18407F,stroke-width:2px,color:#FFFFFF
    classDef validationStyle fill:#82E5E8,stroke:#1C8BA5,stroke-width:2px,color:#FFFFFF
    classDef documentationStyle fill:#C2A9FF,stroke:#3D3270,stroke-width:2px,color:#FFFFFF

    class S1 prepStyle
    class S2,S3 analysisStyle
    class S4,S5 decisionStyle
    class S6 documentationStyle
```

## Dataset

This repository includes dataset snapshots under `data/` (`data/raw/`, `data/processed/`, and `data/engineered/`). The full extraction contains roughly 25,000 Windows binaries from a public AWS S3 bucket. After filtering to valid PE files, 23,895 samples remain, split chronologically into 19,115 training samples and 4,780 test samples. The stratified temporal split preserves chronological order within each class while keeping the malware share at about 61% in both sets.

## Key Results

The results below come from `notebooks/03_modeling_and_evaluation.ipynb`, run on the PE-only temporal split (`data/raw/malware_dataset_{train,test}.csv`). Every model listed meets the success criteria: recall of at least 95% and a false positive rate below 1%.

| Model | Recall | FPR | F1 | AUC | Notes |
|-------|--------|-----|-----|-----|-------|
| Random Forest | 98.63% | 0.54% | 0.991 | 1.000 | Strong baseline with conservative errors |
| XGBoost (Default) | 98.98% | 0.27% | 0.994 | 1.000 | High-performing baseline |
| XGBoost (Calibrated) | 99.21% | 0.32% | 0.995 | 0.998 | Calibrated probabilities with strong recall |
| Simple NN (Default) | 98.09% | 0.11% | 0.990 | 0.999 | Focal loss with FPR monitoring |
| Simple NN (Threshold Opt) | 99.35% | 0.16% | 0.996 | 0.999 | Threshold tuned on held-out calibration |
| AutoGluon (FPR-opt, default) | 99.45% | 0.05% | 0.997 | 1.000 | Best FPR control in current run |
| AutoGluon (FPR-opt, tuned) | 99.59% | 0.16% | 0.997 | 1.000 | Highest recall in current run |

AutoGluon is trained with a custom FPR-constrained scorer, and its decision threshold is tuned on a held-out calibration set (see `notebooks/03_modeling_and_evaluation.ipynb`).

## Repository Layout

```
windows-malware-classifier/
├── scripts/
│   └── pe_features_pipeline/       # Feature extraction package
│       ├── __init__.py             # Package exports
│       ├── main.py                 # CLI entry point
│       ├── pipeline.py             # PE feature extraction
│       ├── non_pe_analyzer.py      # Non-PE file analysis
│       ├── s3_downloader.py        # S3 client and tracking
│       ├── models.py               # Data models and constants
│       └── utils.py                # Utility functions
├── src/windows_malware_classifier/
│   ├── preprocessing/              # Data preprocessing and feature engineering
│   ├── modeling/                   # Model training and evaluation
│   ├── analysis/                   # Statistical tests and SHAP helpers
│   ├── visualization/              # Plotting utilities
│   ├── config/                     # Typed config and data models
│   └── utils/                      # Shared helpers (warnings, etc.)
├── notebooks/                      # Jupyter notebooks (01-03)
├── images/                         # Notebook figures (EDA, modeling, etc.)
├── models/                         # Trained models and predictions (gitignored)
├── data/                           # Extracts and intermediate datasets
│   ├── raw/                        # Extracted CSV + schema + temporal splits
│   ├── processed/                  # Cleaned parquet splits
│   └── engineered/                 # Feature-engineered parquet outputs
└── tests/
    ├── scripts/
    │   └── pe_features_pipeline/   # Unit tests for extraction pipeline
    └── windows_malware_classifier/ # Unit tests for the main package
```

## Prerequisites

- Python 3.12 or newer
- Optional (feature extraction only): AWS credentials with read access to the source bucket
- Optional (feature extraction only): a `BUCKET_URL` environment variable pointing to the root of the S3 dataset
- Optional: [uv](https://github.com/astral-sh/uv) for fast, reproducible environments

## Installation

Clone the repository before installing dependencies:

```
git clone https://github.com/vytautas-bunevicius/windows-malware-classifier.git
cd windows-malware-classifier
```

### Using uv (recommended)

1. **Install uv**

       # Unix/macOS
       curl -LsSf https://astral.sh/uv/install.sh | sh

       # Windows (PowerShell)
       irm https://astral.sh/uv/install.ps1 | iex

2. **Create and activate a virtual environment**

       uv venv
       source .venv/bin/activate  # Unix/macOS
       # or
       .venv\Scripts\activate     # Windows

3. **Sync project dependencies**

       uv sync

### Using pip (alternative)

1. **Create and activate a virtual environment**

       python3 -m venv .venv
       source .venv/bin/activate  # Unix/macOS
       # or
       .venv\Scripts\activate     # Windows

2. **Install the project**

       pip install --upgrade pip
       pip install -e .

## Usage

If you only want to reproduce the feature engineering and model results, start from the notebooks using the dataset snapshots under `data/` and skip feature extraction.

1. **Configure credentials (feature extraction only)**
   - Copy `.env.example` to `.env` and update `BUCKET_URL` to point to your S3 bucket.
   - Make sure AWS credentials are configured (through environment variables or `~/.aws/credentials`). The extractor uses `boto3` defaults.
   - The S3 bucket is organized with `0/` for benign binaries and `1/` for malware samples.

2. **Extract features**

       python -m scripts.pe_features_pipeline.main

   This writes `data/raw/malware_dataset.csv` and `data/raw/malware_dataset_schema.csv`, plus logs in `data/logs/`.

3. **Run model experiments**
   - Use the notebooks in `notebooks/` to reproduce the feature engineering and training workflows.
   - Reusable utilities live under `src/windows_malware_classifier/`.

4. **Evaluate models**
   - Model artifacts are written to `models/` (ignored by git).
   - Notebook figures are saved under `images/`.

## Testing

Run the unit test suite:

```
pytest
```

## Development

We use `uv` for dependency management, `ruff` for linting and formatting, and `ty` for type checking.

### Linting and Formatting

```
ruff check
ruff format
```

### Type Checking

```
ty check
```

### Git Hooks

An optional pre-push hook is provided at `.github/hooks/pre-push`. To enable it locally:

```
cp .github/hooks/pre-push .git/hooks/pre-push
chmod +x .git/hooks/pre-push
```

## Future Improvements

### Transformer-Based Malware Detection

Test transformer architectures on PE file structure and bytecode sequences to evaluate how they perform against tree-based models and neural networks.

### Web-Based User Interface

Build a basic web interface for the best-performing model, including:

- Binary classification endpoint
- SHAP value visualization
- Batch file processing
- Performance metrics display

## License

This project is licensed under the terms of the Unlicense. See `LICENSE` for details.
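
As a rough illustration of the threshold tuning on a held-out calibration set noted in the results table, the sketch below picks the lowest decision threshold whose false positive rate stays within a target. It is not taken from the repository: the function names `pick_threshold` and `recall_and_fpr`, the sample scores, and the pure-Python approach are all assumptions made for illustration, and the notebook's actual procedure may differ.

```python
from typing import Sequence, Tuple


def pick_threshold(
    scores: Sequence[float], labels: Sequence[int], max_fpr: float = 0.01
) -> float:
    """Hypothetical helper: lowest threshold with calibration-set FPR <= max_fpr.

    A sample is flagged as malware when score >= threshold.
    Labels follow the bucket convention: 1 = malware, 0 = benign.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not benign:
        raise ValueError("calibration set needs at least one benign sample")
    allowed_fp = int(max_fpr * len(benign))  # benign samples we may misflag
    if allowed_fp >= len(benign):
        return float("-inf")  # target is so loose that everything may be flagged
    # The benign score at index allowed_fp must stay below the threshold,
    # so take the smallest observed score strictly above it.
    boundary = benign[allowed_fp]
    above = [s for s in scores if s > boundary]
    return min(above) if above else float("inf")


def recall_and_fpr(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Recall and false positive rate at a given threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    recall = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return recall, fpr


# Toy calibration scores (made up for this example, not project data).
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.60, 0.55, 0.70, 0.80, 0.90, 0.95]
cal_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

threshold = pick_threshold(cal_scores, cal_labels, max_fpr=0.2)
recall, fpr = recall_and_fpr(cal_scores, cal_labels, threshold)
print(f"threshold={threshold}, recall={recall}, fpr={fpr}")
```

The threshold is chosen on calibration data only and would then be applied unchanged to the test split, so the reported FPR is not tuned on the data it is measured on. Ties at the boundary score are resolved conservatively, which can leave the achieved FPR slightly below the target.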

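Tags: AMSI bypass, Apex, AutoGluon, Python, SHAP interpretability, Windows PE file analysis, XGBoost, binary analysis, cloud security monitoring, cloud security operations, artificial intelligence, false positive rate optimization, threat detection, archive response download, malicious code identification, file metadata analysis, no backdoor, machine learning, feature engineering, user-mode hook bypass, neural networks, cybersecurity, reverse engineering tools, privacy protection, static analysis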