bolin8017/upxelfdet

GitHub: bolin8017/upxelfdet

A machine-learning-based analysis tool for detecting UPX-packed ELF malware.


# upxelfdet

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python Version](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/) [![GitHub release](https://img.shields.io/github/v/release/bolin8017/upxelfdet)](https://github.com/bolin8017/upxelfdet/releases) [![GitHub issues](https://img.shields.io/github/issues/bolin8017/upxelfdet)](https://github.com/bolin8017/upxelfdet/issues) [![GitHub stars](https://img.shields.io/github/stars/bolin8017/upxelfdet)](https://github.com/bolin8017/upxelfdet/stargazers)

A machine learning approach for identifying UPX-packed ELF malware, based on n-gram feature extraction and Support Vector Machine (SVM) classification.

## Overview

upxelfdet is a Python tool for malware analysis and research. It extracts features from sections of ELF binaries, vectorizes them with an n-gram method, and applies machine learning models to classify whether a binary is packed with UPX or to identify its malware family.

**Key features:**

* **ELF binary analysis**: extracts features from specific sections of ELF files
* **N-gram vectorization**: converts binary features into numeric vectors with a configurable n-gram size
* **SVM classification**: trains and evaluates Support Vector Machine models
* **Flexible configuration**: JSON-based configuration that is easy to adjust between experiments
* **CLI interface**: command-line tools for training, evaluation, and prediction
* **Structured logging**: comprehensive logs in both human-readable and JSON formats

## Table of Contents

* [Installation](#installation)
* [Quick Start](#quick-start)
* [Usage](#usage)
  * [Configuration](#configuration)
  * [Training](#training)
  * [Evaluation](#evaluation)
  * [Prediction](#prediction)
* [Project Structure](#project-structure)
* [Architecture](#architecture)
* [Examples](#examples)
* [Development](#development)
* [License](#license)
* [Citation](#citation)

## Installation

### Requirements

* Python >= 3.12
* pip or uv (recommended)

### Install from Source

```
# Clone the repository
git clone https://github.com/bolin8017/upxelfdet.git
cd upxelfdet

# Install dependencies (uv recommended)
uv pip install -e .

# Or use pip
pip install -e .
```

### Install from PyPI (Future)

```
pip install upxelfdet
```

## Quick Start

1. **Prepare the dataset**: organize the ELF binaries in `input/dataset/` and create CSV files with labels.
2. **Configure the detector**: edit `config.json` to set paths and parameters.
3. **Train the model**: `upxelfdet train --config config.json`
4. **Evaluate performance**: `upxelfdet evaluate --config config.json`
5. **Make predictions**: `upxelfdet predict --config config.json`

## Usage

### Configuration

Create or modify `config.json`:

```
{
  "data": {
    "train": "./input/train.csv",
    "test": "./input/test.csv",
    "predict": "./input/test.csv",
    "dataset": "./data/samples"
  },
  "output": {
    "feature": "./output/features",
    "model": "./output/model",
    "prediction": "./output/predictions/predictions.csv",
    "log": "./output/logs"
  },
  "feature": {
    "section_name": ".block_1"
  },
  "vectorize": {
    "method": "ngram_numeric",
    "size_features": 256,
    "offset": 0,
    "ngram_size": 2,
    "encoding": "TF"
  },
  "model": {
    "type": "SVM",
    "params": {
      "C": 100,
      "gamma": 0.001,
      "kernel": "rbf"
    }
  },
  "classify": true,
  "seed": 8017
}
```

**Configuration options:**

* `data.train`: path to the training CSV file
* `data.test`: path to the test CSV file
* `data.dataset`: directory containing the ELF binaries
* `feature.section_name`: ELF section to extract features from (for example `.block_1`)
* `vectorize.method`: vectorization method (`ngram_numeric` or `raw_bytes`)
* `vectorize.ngram_size`: n-gram size (typically 2-4)
* `vectorize.encoding`: encoding method (`TF` for term frequency)
* `model.type`: model type (currently `SVM`)
* `classify`: if `true`, perform multi-class classification; if `false`, perform binary classification

### Training

Train a new model on your dataset:

```
upxelfdet train --config config.json
```

**What happens during training:**

1. Training data is loaded from the CSV file
2. Features are extracted from the ELF binaries in the dataset directory
3. Features are vectorized with the specified method
4. An SVM model is trained with the configured parameters
5. The trained model is saved to `output/model/`

**Output:**

* Trained model files in `output/model/`
* Feature extraction results in `output/features/`
* Vectorization results in `output/vectorize/`
* Training logs in `output/logs/`

### Evaluation

Evaluate model performance on the test data:

```
upxelfdet evaluate --config config.json
```

**Reported metrics:**

* Accuracy
* Precision
* Recall
* F1 score
* Confusion matrix
* Classification report (multi-class case)

### Prediction

Make predictions on new samples:

```
upxelfdet predict --config config.json
```

Predictions are saved to the path specified by `config.output.prediction`.

### Python API

You can also use the detector programmatically:

```
from upxelfdet import UpxElfDetector
from upxelfdet.config import UpxElfDetectorConfig

# Load the configuration
config = UpxElfDetectorConfig.from_file("config.json")

# Initialize the detector
detector = UpxElfDetector(config)

# Train the model
model_path = detector.train()

# Evaluate the model
metrics = detector.evaluate()
print(f"Accuracy: {metrics['accuracy']:.4f}")

# Make predictions
predictions_path = detector.predict()
```

See [examples/basic_usage.py](examples/basic_usage.py) for a complete example.

## Project Structure

```
upxelfdet/
├── src/
│   └── upxelfdet/
│       ├── __init__.py
│       ├── cli.py              # Command-line interface
│       ├── config.py           # Configuration management
│       ├── detector.py         # Main detector class
│       ├── constants.py        # Constants and defaults
│       ├── exceptions.py       # Custom exceptions
│       ├── logging.py          # Logging configuration
│       ├── feature/            # Feature extraction
│       │   ├── __init__.py
│       │   └── extractor.py
│       ├── vectorizer/         # Vectorization methods
│       │   ├── __init__.py
│       │   ├── base.py
│       │   ├── ngram_numeric.py
│       │   ├── raw_bytes.py
│       │   └── factory.py
│       ├── model/              # ML models
│       │   ├── __init__.py
│       │   ├── base.py
│       │   ├── svm.py
│       │   └── factory.py
│       └── predictor/          # Prediction logic
│           ├── __init__.py
│           └── predictor.py
├── tests/                      # Unit tests
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_config.py
│   └── test_detector.py
├── examples/                   # Usage examples
│   └── basic_usage.py
├── data/                       # Example data (see data/README.md)
│   ├── samples/
│   └── README.md
├── input/                      # Input data (not in repo)
│   ├── dataset/                # ELF binaries (excluded)
│   ├── train.csv               # Training labels (excluded)
│   └── test.csv                # Test labels (excluded)
├── output/                     # Output directories
│   ├── features/               # Extracted features
│   ├── vectorize/              # Vectorized features
│   ├── model/                  # Trained models
│   ├── predictions/            # Prediction results
│   └── logs/                   # Log files
├── config.json                 # Configuration file
├── pyproject.toml              # Project metadata and dependencies
├── LICENSE                     # MIT License
├── README.md                   # This file
└── .gitignore                  # Git ignore rules
```

## Architecture

### Feature Extraction Pipeline

1. **Input**: ELF binaries plus a CSV file with labels
2. **Feature extraction**: the specified section (for example `.block_1`) is extracted from each ELF file
3. **Vectorization**: the binary data is converted into numeric vectors using n-grams
4. **Model training**: an SVM classifier is trained on the vectorized features
5. **Evaluation/prediction**: the trained model is applied to new samples

### Component Overview

* **FeatureExtractor**: extracts binary sections from ELF files using `upx-elf-parser`
* **Vectorizer**: implements the different vectorization strategies (n-gram, raw bytes)
* **Model**: wraps scikit-learn models behind a consistent interface
* **Predictor**: handles the complete prediction pipeline
* **UpxElfDetector**: the main class that coordinates all components

## Examples

### Example 1: Basic Training and Evaluation

```
from upxelfdet import UpxElfDetector
from upxelfdet.config import UpxElfDetectorConfig

config = UpxElfDetectorConfig.from_file("config.json")
detector = UpxElfDetector(config)

# Train and evaluate
detector.train()
metrics = detector.evaluate()
```

### Example 2: Custom Configuration

```
from upxelfdet import UpxElfDetector
from upxelfdet.config import (
    UpxElfDetectorConfig,
    DataConfig,
    VectorizeConfig,
    ModelConfig,
)

# Build the configuration in code instead of loading config.json
config = UpxElfDetectorConfig(
    data=DataConfig(
        train="./my_train.csv",
        test="./my_test.csv",
        dataset="./my_dataset",
    ),
    vectorize=VectorizeConfig(
        method="ngram_numeric",
        ngram_size=3,
        size_features=512,
    ),
    model=ModelConfig(
        type="SVM",
        params={"C": 10, "kernel": "linear"},
    ),
)

detector = UpxElfDetector(config)
detector.train()
```

See [examples/basic_usage.py](examples/basic_usage.py) for a complete working example.

## Development

### Setting Up the Development Environment

```
# Clone the repository
git clone https://github.com/bolin8017/upxelfdet.git
cd upxelfdet

# Install development dependencies
uv pip install -e ".[dev]"
```

### Running Tests

```
pytest tests/
```

### Code Quality

This project uses:

* **ruff**: linting and formatting
* **mypy**: type checking
* **pytest**: testing

```
# Lint
ruff check src/ tests/

# Format
ruff format src/ tests/

# Type check
mypy src/
```

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

## Citation

If you use this tool in your research, please cite:

```
@software{upxelfdet,
  author = {bolin8017},
  title = {upxelfdet: Machine Learning-Based Detection for UPX-Packed ELF Malware},
  year = {2025},
  url = {https://github.com/bolin8017/upxelfdet}
}
```

## Acknowledgments

This project builds on:

* [islab-malware-detector](https://github.com/yourusername/islab-malware-detector): base malware detection framework
* [upx-elf-parser](https://github.com/yourusername/upx-elf-parser): ELF parsing tool
* [scikit-learn](https://scikit-learn.org/): machine learning library

## Security Notice

⚠️ **This tool is intended for security research and educational purposes only.**

* Do not use this tool for malicious activities
* Take extra care when handling malware samples
* Use an isolated environment when analyzing malicious binaries
* Comply with all applicable laws and regulations

## Contact

For questions, feedback, or contributions:

* **Issues**: [GitHub Issues](https://github.com/bolin8017/upxelfdet/issues)
* **Repository**: [GitHub](https://github.com/bolin8017/upxelfdet)

**Note**: This project is under active development; the API and features may change.
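
## Appendix: N-gram Vectorization Sketch

The vectorization step (n-grams over section bytes with `TF` encoding) can be illustrated with a small standalone sketch. This README does not describe how `ngram_numeric` actually maps n-grams onto `size_features` dimensions, so the function name `ngram_tf_vector` and the modulo bucketing below are assumptions chosen for illustration, not the project's implementation. The parameter names mirror the `vectorize` keys in `config.json`.

```python
from collections import Counter


def ngram_tf_vector(
    data: bytes,
    ngram_size: int = 2,
    size_features: int = 256,
    offset: int = 0,
) -> list[float]:
    """Hypothetical helper: turn raw bytes into a term-frequency vector.

    Each sliding window of `ngram_size` bytes is read as a big-endian
    integer and folded into one of `size_features` buckets by modulo
    (an assumption, not the project's documented behavior).
    """
    if ngram_size < 1 or size_features < 1:
        raise ValueError("ngram_size and size_features must be positive")

    payload = data[offset:]                  # skip the first `offset` bytes
    total = len(payload) - ngram_size + 1    # number of sliding windows
    vector = [0.0] * size_features
    if total <= 0:
        return vector                        # too short: all-zero vector

    counts = Counter(
        int.from_bytes(payload[i:i + ngram_size], "big") % size_features
        for i in range(total)
    )
    for bucket, count in counts.items():
        vector[bucket] = count / total       # TF: relative frequency
    return vector


if __name__ == "__main__":
    sample = b"UPX!" * 4                     # stand-in for section bytes
    vec = ngram_tf_vector(sample, ngram_size=2, size_features=256)
    print(len(vec), round(sum(vec), 6))
```

With modulo bucketing and `size_features=256`, the bucket of a 2-gram is determined by its last byte alone, so this mapping is only a placeholder. A vocabulary of the most frequent n-grams or a hash function are common alternatives.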
