MarcusToledo/firmware-ml-classifier

GitHub: MarcusToledo/firmware-ml-classifier

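A firmware classification pipeline based on static analysis and machine learning. It extracts statistical features and semantic embeddings to identify firmware manufacturers and assess security risk.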

Stars: 0 | Forks: 0

# Firmware Classification (TCC)

## Overview

This repository accompanies a course completion project (TCC) in Software Engineering, focused on automated classification of embedded firmware using static analysis and machine learning.

## Goals

- Classify firmware by manufacturer (supervised learning).
- Extract statistical features from binaries (entropy, byte mean, compression_ratio).
- Extract ASCII strings and Doc2Vec embeddings as semantic features.
- Train an Extra Trees model (primary) and a Random Forest model (baseline).

## Experimental Workflow

1. Collect and organize the dataset by manufacturer.
2. Extract statistical features and strings.
3. Train Doc2Vec embeddings (DM/DBOW) for each firmware image.
4. Train and evaluate the supervised models.
5. Generate metrics and reports.

## How to Run

Install dependencies:

- `python3 -m pip install -r requirements.txt`

Install in editable mode (enables the CLI commands):

- `python3 -m pip install -e .`

Train Doc2Vec (standalone):

- `python3 scripts/train_doc2vec.py --config configs/feature_extraction.yaml --input dataset/raw/`

Train Doc2Vec via the installed CLI:

- `train-doc2vec --config configs/feature_extraction.yaml --input dataset/raw/`

Extract features using the embeddings:

- `python3 scripts/extract_features.py --config configs/feature_extraction.yaml --input dataset/raw/ --output dataset/processed/features.parquet`

Extract features via the installed CLI:

- `extract-features --config configs/feature_extraction.yaml --input dataset/raw/ --output dataset/processed/features.parquet`

Inspect the tokens used by Doc2Vec:

- `python3 scripts/inspect_tokens.py --config configs/feature_extraction.yaml --input dataset/raw/ --limit 50 --max-docs 20`

Inspect tokens via the installed CLI:

- `inspect-tokens --config configs/feature_extraction.yaml --input dataset/raw/ --limit 50 --max-docs 20`

Per-firmware read limit:

- Configured via `max_bytes` in `configs/feature_extraction.yaml`.

Tests:

- `python3 -m pytest`
- `python3 -m pytest tests/path::test_name`

## Code Quality

Install the development tools:

- `python3 -m pip install -e ".[dev]"`

Set up pre-commit:

- `pre-commit install`

Run manually:

- `ruff check .`
- `black .`
- `mypy src/`

## Internal API

src/io_utils:

- `read_binary`: safe binary read with an optional size limit.
- `normalize_binary`: byte type validation.

src/features/statistics:

- `shannon_entropy`, `byte_mean`, `compress_ratio`: statistical features.

src/features/strings:

- `extract_ascii_strings`: extracts ASCII strings.
- `limit_strings`, `strings_to_document`, `tokenize_document`: text processing.

src/features/doc2vec:

- `build_corpus`, `train_doc2vec`: Doc2Vec training.
- `infer_embedding`: embedding inference.
- `save_doc2vec`, `load_doc2vec`: model persistence.

src/feature_extraction:

- `extract_features`: full extraction (stats + embedding).
- `combine_features`: converts the result to a flat dictionary.

pipeline/feature_extraction:

- `load_pipeline_config`: YAML loading with overrides.
- `extract_features_from_path`, `extract_features_batch`: extraction pipeline.
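The statistical features named in the goals (entropy, byte mean, compression ratio) can be illustrated with a minimal sketch. The function names mirror those listed for `src/features/statistics`, but the bodies below are assumptions, not the repository's code. In particular, defining the compression ratio as zlib-compressed size divided by original size is one common choice, and the project may define it differently.

```python
import math
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def byte_mean(data: bytes) -> float:
    """Arithmetic mean of the byte values (0.0 to 255.0)."""
    if not data:
        return 0.0
    return sum(data) / len(data)


def compress_ratio(data: bytes) -> float:
    """Assumed definition: zlib-compressed size / original size.

    Values near or above 1.0 suggest already compressed or encrypted
    content; values near 0.0 suggest highly repetitive content.
    """
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)


if __name__ == "__main__":
    sample = bytes(range(256)) * 16  # hypothetical stand-in for a firmware blob
    print(shannon_entropy(sample), byte_mean(sample), compress_ratio(sample))
```

High entropy combined with a compression ratio close to 1.0 typically points to packed or encrypted regions, which is why these cheap statistics are often useful inputs alongside string-based features.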

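The string-processing step (extract ASCII strings, join them into a document, tokenize it for Doc2Vec) can be sketched as follows. The function names match those listed for `src/features/strings`, but the signatures, the minimum string length of 4, the `max_strings` cap, and the tokenization rule are all assumptions made for illustration.

```python
import re


def extract_ascii_strings(data: bytes, min_length: int = 4) -> list[str]:
    """Return runs of printable ASCII (0x20 to 0x7e) at least min_length long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


def strings_to_document(strings: list[str], max_strings: int = 1000) -> str:
    """Join at most max_strings strings into one space-separated document."""
    return " ".join(strings[:max_strings])


def tokenize_document(document: str) -> list[str]:
    """Lowercase the document and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", document.lower())


if __name__ == "__main__":
    # Hypothetical blob: two readable strings surrounded by non-printable bytes.
    blob = b"\x00\x01BusyBox v1.31.1\x00\xffab\x00/etc/passwd\x7f\x80"
    strings = extract_ascii_strings(blob)
    print(strings)
    print(tokenize_document(strings_to_document(strings)))
```

A token list of this shape is what a Doc2Vec model would consume as one document per firmware image; capping the number of strings keeps very large images from dominating training time.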
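Tags: Apex, Caido project analysis, CVE, DeepSeek, Doc2Vec, Extra Trees, NLP, Python, Random Forest, binary analysis, cloud security monitoring, cloud security operations, firmware analysis, firmware security, domain enumeration, embedded systems, digital signatures, backdoor-free, machine learning, vulnerability classification, entropy analysis, feature extraction, supervised learning, cybersecurity, software supply chain security, remote method invocation, reverse engineering tools, privacy protection, static analysis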