ar005/mal-rev

GitHub: ar005/mal-rev

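A self-hosted, local pipeline for automated malware reverse engineering that combines static analysis, decompilation, and LLM-driven threat intelligence generation.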


# RE Pipeline

A local malware reverse-engineering pipeline for threat hunting. Drop in a binary and get a full threat intelligence report with decompiled C pseudocode and LLM-driven analysis. Analysis stays on your machine via Ollama, or is routed to the Anthropic API.

```
sample.exe ──► unpack ──► static analysis ──► disassembly + decompilation (r2ghidra)
                                                        │
                                              C pseudocode / raw asm
                                                        │
                                              LLM analysis (4 passes)
                                                        │
                                          report.md + code/*.c + vt.json
```

## What it does

For each sample, the pipeline runs six stages:

| Stage | Tool | Output |
|-------|------|--------|
| **Ingestion** | python-magic | MD5 / SHA1 / SHA256, file type |
| **Unpacking** | upx, unipacker | Transparent UPX / MPRESS / ASPack / PEtite unpacking before analysis |
| **Static analysis** | pefile, YARA | PE headers, imports/exports, entropy, packer detection, YARA hits |
| **Disassembly** | radare2 + r2pipe + r2ghidra | Function list; **C pseudocode** is extracted first, with a fallback to raw assembly on failure |
| **LLM analysis** | Ollama / Claude API / Claude-Ollama | Behavior summary, MITRE ATT&CK mapping, IOC extraction, family classification |
| **Report** | n/a | `report.md` plus one `.c` file per decompiled function |

Everything is saved under `reports//`:

```
reports/
├── index.md
└── /
    ├── report.md   ← full threat report
    ├── vt.json     ← VirusTotal enrichment (if vt.py has been run)
    └── code/
        ├── main.c
        ├── fcn_00401000.c
        └── ...
```

## Requirements

| Requirement | Notes |
|-------------|-------|
| Python 3.10+ | |
| [radare2](https://github.com/radareorg/radare2) **5.8+** | Build from source (see below); the apt package (5.5) is too old for r2ghidra |
| [Ollama](https://ollama.com) *(provider=ollama / claude-ollama)* | Local LLM runtime |
| [Anthropic API key](https://console.anthropic.com/) *(provider=claude)* | Set `claude.api_key` in config.yaml or the `ANTHROPIC_API_KEY` environment variable |
| r2ghidra *(optional)* | C decompilation: `r2pm -ci r2ghidra` (strongly recommended) |
| upx *(optional)* | UPX unpacking: `sudo apt install upx-ucl` |
| unipacker *(optional)* | Unpacks MPRESS, ASPack, FSG, PEtite, etc.: `pip install unipacker` |
| libxxhash-dev | Needed to build r2ghidra: `sudo apt install libxxhash-dev` |
| ~8 GB VRAM | For `gemma4:12b` (recommended, 256K context, thinking mode) |

## Quick start

### 1. Install

```
git clone re
cd re
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

### 2. System tools

```
# radare2 5.8+ is required (apt ships 5.5, which is too old; build from source)
sudo apt remove -y radare2 libradare2-5.0.0t64 libradare2-dev libradare2-common 2>/dev/null || true
sudo apt install -y build-essential git meson ninja-build pkg-config \
    libcapstone-dev libssl-dev zlib1g-dev libxxhash-dev
git clone --depth=1 https://github.com/radareorg/radare2
cd radare2 && sudo ./sys/install.sh && cd ..

# r2ghidra decompiler (strongly recommended; enables C pseudocode output)
r2pm update && r2pm -ci r2ghidra

# Unpacking support (optional)
sudo apt install -y upx-ucl   # UPX unpacker
pip install unipacker         # MPRESS, ASPack, FSG, PEtite, etc.
```

### 3. Start Ollama and pull a model

```
ollama serve &
ollama pull gemma4:12b   # ~7.6 GB, 256K context, thinking mode
```

### 4. Analyze a single sample

```
python main.py analyze /path/to/sample.exe
```

### 5. Batch-analyze a folder

```
# Set your samples folder in config.yaml (batch.samples_dir), then:
python batch.py run

# In another terminal: web monitor with pause/resume controls
python batch.py monitor --host 0.0.0.0 --port 7000
# Open http://:7000
```

### 6. VirusTotal enrichment

```
cp .env.example .env      # edit .env and add your VT API key(s)
python vt.py              # fetch VT reports for all samples
python check_reports.py   # summarise malware / benign / not-defined
```

## LLM providers

Three providers are supported, selected via `provider:` in `config.yaml`:

| Provider | Value | Notes |
|----------|-------|-------|
| Ollama (default) | `ollama` | Local models through Ollama's native API. Requires `ollama` to be running. |
| Anthropic Claude | `claude` | Anthropic cloud API. Requires `ANTHROPIC_API_KEY`. |
| Claude-Ollama | `claude-ollama` | Local Ollama models through Ollama's Anthropic-compatible API. No API key needed. |

**Switching to the Claude API:**

```
# config.yaml:
provider: claude
# claude.api_key: sk-ant-...
# or:
ANTHROPIC_API_KEY=sk-ant-... python main.py analyze sample.exe

pip install anthropic   # if not already installed
```

**Switching to Claude-Ollama (local, no API key):**

```
# config.yaml:
provider: claude-ollama
# Uses Ollama's Anthropic-compatible endpoint: same gemma4:12b model, no key required.
# See: https://docs.ollama.com/integrations/claude-code

pip install anthropic   # SDK routes to Ollama, not Anthropic
```

## Single-file CLI (`main.py`)

```
python main.py [--config CONFIG]

Commands:
  analyze                        Analyze a single binary
  watch                          Watch a directory and analyze new files automatically
  serve [--host HOST] [--port]   Launch the interactive web control panel (default: 127.0.0.1:8080)
```

## Batch runner (`batch.py`)

Analyzes a folder of samples in bulk, with pause/resume and remote monitoring.

```
python batch.py

Commands:
  run [--dir DIR]                  Start or resume batch (dir defaults to batch.samples_dir in config.yaml)
  pause                            Signal the runner to stop after the current sample
  resume                           Clear the pause flag
  status                           Print a progress table in the terminal
  reset [sha256 …]                 Re-queue done/failed sample(s) back to pending
  monitor [--host HOST] [--port]   Start the read-only web monitor (default: 0.0.0.0:7000)
```

**Typical workflow:**

```
# Terminal 1
python batch.py run

# Terminal 2 (or a remote browser)
python batch.py monitor --host 0.0.0.0 --port 7000
```

The monitor dashboard shows overall progress, per-sample status, live logs, and **Pause / Resume / Reset** controls. State is written to `batch_state.json` after every pipeline stage; per-sample logs are saved to `logs/.log`. Pressing Ctrl-C in the runner terminal also pauses gracefully: the current sample finishes processing before the runner stops.

## VirusTotal enrichment (`vt.py`)

Fetches the VirusTotal report for every sample by SHA256 and saves it to `reports//vt.json`.

```
# Configure API keys in .env (see .env.example):
VT_KEY_1=your_key_here
VT_KEY_2=second_key_if_available   # keys are round-robin rotated

# Fetch reports for all files in samples/
python vt.py

# Fetch specific hashes
python vt.py [sha256 …]

# Show an index of fetched hashes
python vt.py --list
```

API keys are rotated automatically to stay within the free-tier rate limit (4 requests per minute per key). Using multiple keys raises throughput proportionally. Hashes that have already been fetched are skipped automatically; use `--force` to fetch them again.

```
# Summarize verdicts after fetching
python check_reports.py
# Malware      12   48.0%
# Benign        8   32.0%
# Not defined   5   20.0%
```

Pass `--threshold N` to require at least N antivirus engine detections before a sample is classified as malware (default: 1).

## Configuration (`config.yaml`)

```
# LLM provider: "ollama" (default) | "claude" (Anthropic API) | "claude-ollama" (Ollama Anthropic-compat)
provider: ollama

ollama:
  base_url: http://localhost:11434
  model: gemma4:12b          # 256K context, thinking mode, ~7.6 GB VRAM
  chunk_size: 120000         # chars per disasm chunk (most binaries fit in one pass)
  num_ctx: 131072            # KV cache tokens; 128K is safe on 16 GB VRAM
  think: true                # chain-of-thought reasoning for deeper analysis

claude:
  api_key: ""                # or set ANTHROPIC_API_KEY env var
  model: claude-sonnet-4-6   # claude-opus-4-7 for maximum depth
  chunk_size: 180000         # 200K context window
  max_tokens: 8192
  think: false               # extended thinking (higher cost)
  think_budget_tokens: 5000

claude_ollama:               # local models via Ollama's Anthropic-compatible API
  base_url: http://localhost:11434
  api_key: ollama            # any non-empty string; Ollama ignores it
  model: gemma4:12b          # any model pulled in Ollama
  chunk_size: 120000
  max_tokens: 8192

paths:
  samples: ./samples         # watch-mode drop zone
  reports: ./reports         # output root
  rules: ./rules             # YARA rules directory (.yar / .yara)

batch:
  samples_dir: ./samples     # directory batch.py scans (override with --dir)

analysis:
  max_functions: 30          # top N functions by size to disassemble + decompile
  min_string_len: 6
  entropy_threshold: 7.0     # sections above this are flagged as packed

web:
  backend_url: ""            # set to remote host URL for split server/browser setup
  cors_origins:
    - "*"
```

## YARA rules

Drop `.yar` or `.yara` files into the `rules/` directory. They are compiled and matched during static analysis; hits are included in the report and passed to the LLM as context.

```
git clone https://github.com/Yara-Rules/rules rules/
```

## Report sections

Each `report.md` contains:

1. **File metadata**: hashes, type, size, entropy, unpacking status
2. **PE analysis**: sections (with entropy and packed flag), imports, exports
3. **YARA matches** *(if rules/ is populated)*
4. **Extracted strings**: the first 100 printable strings
5. **Disassembly overview**: function table with decompilation status
6. **Behavioral analysis**: the LLM's analysis narrative
7. **MITRE ATT&CK mapping**: table of techniques with supporting evidence
8. **Indicators of compromise**: IPs, domains, registry keys, mutex names, file paths
9. **Malware family classification**: family, type, confidence, evidence

## Tuning the LLM prompts

All prompts are plain-text files in `prompts/` with `{placeholder}` substitution. Edit them directly; no code changes are needed.

| File | Variables |
|------|-----------|
| `behavior.txt` | `{strings}`, `{disassembly}` (C pseudocode from r2ghidra, if available) |
| `mitre.txt` | `{behavior_summary}` |
| `ioc.txt` | `{strings}`, `{behavior_summary}` |
| `family.txt` | `{behavior_summary}`, `{mitre_mapping}`, `{iocs}` |

## Adding a custom analysis stage

1. Create `pipeline/newstage.py` containing `run(sample: Sample, cfg: dict) -> Sample`
2. Add the new fields to `Sample` in `pipeline/__init__.py` (keep them serializable)
3. Insert the stage in both `main.py → run_pipeline()` and `batch.py → run_sample()`
4. Add its output in `pipeline/reporting.py → _render_report()`

## Project layout

```
re/
├── main.py              # single-file CLI (analyze / watch / serve)
├── batch.py             # batch runner + web monitor
├── vt.py                # VirusTotal enrichment
├── check_reports.py     # VT verdict summary
├── config.yaml          # all tunable settings
├── .env.example         # VT API key template
├── requirements.txt
├── pipeline/
│   ├── __init__.py          # Sample dataclass
│   ├── ingestion.py         # hashing + file-type detection
│   ├── unpacking.py         # UPX / unipacker transparent unpacking
│   ├── static_analysis.py   # pefile, strings, entropy, YARA
│   ├── disassembly.py       # radare2 + r2ghidra (C pseudocode preferred over asm)
│   ├── llm_analysis.py      # multi-provider LLM: Ollama / Claude API / Claude-Ollama
│   └── reporting.py         # report.md + code/*.c output
├── prompts/             # editable prompt templates
├── rules/               # YARA rules (gitignored content)
├── samples/             # drop zone (gitignored)
├── reports/             # output (gitignored)
├── logs/                # per-sample batch logs (gitignored)
└── web/
    ├── app.py               # FastAPI interactive control panel
    ├── queue_manager.py     # job queue, per-stage checkpointing, resume logic
    └── templates/
        └── dashboard.html
```

## Disclaimer

This tool is intended only for authorized threat hunting and malware research. Always analyze samples in an isolated environment. Never run untrusted binaries on your host system.
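
For readers tuning `analysis.entropy_threshold` in `config.yaml`, here is a minimal sketch of how a section could be flagged as packed, assuming the check is a Shannon entropy computed over the section's raw bytes. The function names `shannon_entropy` and `looks_packed` are hypothetical and are not taken from the project's `static_analysis.py`.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over every byte value that actually occurs.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def looks_packed(data: bytes, threshold: float = 7.0) -> bool:
    """Hypothetical helper: flag data whose entropy exceeds the threshold."""
    return shannon_entropy(data) > threshold


if __name__ == "__main__":
    uniform = bytes(range(256)) * 16   # every byte value equally likely
    padding = b"\x00" * 4096           # a single repeated byte
    print(shannon_entropy(uniform), looks_packed(uniform))
    print(shannon_entropy(padding), looks_packed(padding))
```

Compressed or encrypted data tends toward 8 bits per byte, while ordinary code and text usually sit well below the 7.0 default, so raising the threshold reduces false positives at the cost of missing lightly packed sections.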
