arman-hosain/malforge

GitHub: arman-hosain/malforge

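MalForge is a multimodal benchmark platform for dynamic malware behavior analysis. It uses cross-validated ATT&CK labels and agentic evaluation to address the limitations of traditional static-text benchmarks.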


# MalForge

NeurIPS 2026 (Evaluations and Datasets track) benchmark for evaluating large language models on dynamic malware behavior analysis.

## Deadlines

- Abstract: May 4, 2026 (AoE)
- Full paper: May 6, 2026 (AoE)

## Pipeline

```
MalwareBazaar (hashes)
    → VirusTotal (behavior reports + ATT&CK)
    → Malpedia (cross-validation)
    → LLM Evaluation (static + agentic)
    → Metrics + Analysis
```

## Execution Order

```
conda activate AdaptiveAttackAgent_with_RAG_based_defense

python scripts/01_pull_hashes.py            # Pull hashes from MalwareBazaar
python scripts/02_fetch_vt_reports.py       # Fetch VT sandbox reports (overnight)
python scripts/03_build_ground_truth.py     # Cross-validate ATT&CK labels
python scripts/04_preprocess.py             # Generate text/json/graph modalities
python scripts/05_sanity_check.py           # Smoke test on 5 samples
python scripts/06_run_static_eval.py        # Static eval: 4 models x 3 tasks x 3 modalities
python scripts/07_run_agentic_eval.py       # Agentic ReAct eval: GPT-4o + LLaMA3-70B
python scripts/08_compute_metrics.py        # Tables 1-5 with bootstrap CIs
python scripts/09_error_analysis.py         # Error analysis for paper
python scripts/10_upload_to_huggingface.py  # Release dataset
```

## Tasks

| Task  | Input              | Ground truth               | Metric                  |
|-------|--------------------|----------------------------|-------------------------|
| MFC   | VT behavior report | MalwareBazaar family label | Accuracy                |
| ATE   | VT behavior report | VT ∩ Malpedia ATT&CK       | Micro-F1                |
| IOC-E | VT behavior report | VT network/file fields     | Precision / Recall / F1 |

## Models

| Key         | Source | Model string                         |
|-------------|--------|--------------------------------------|
| gpt-4o      | OpenAI | gpt-4o                               |
| gpt-oss-20b | vLLM   | openai/gpt-oss-20b                   |
| llama3-70b  | vLLM   | meta-llama/Meta-Llama-3-70B-Instruct |
| llama3-8b   | vLLM   | meta-llama/Meta-Llama-3-8B-Instruct  |

## Key Design Decisions

- **Cross-validated ground truth**: an ATT&CK label is kept only when both VT and Malpedia confirm it
- **Multimodal inputs**: text / JSON / graph (process-tree edge list)
- **Pre-/post-cutoff split**: samples are split by `first_seen` relative to a January 2024 cutoff
- **Difficulty tiers**: easy (≤3 techniques) / medium (4-7) / hard (8+)
- **5 concrete agentic tools**: search_mitre, lookup_hash, lookup_family, analyze_api_call, check_ioc
- **95% bootstrap confidence intervals on all reported metrics**

## Directory Structure

```
malforge/
├── config.json
├── dataset/
│   ├── hashes/                        # per-family sha256 lists
│   ├── vt_reports/                    # raw VT behaviour_summary
│   ├── vt_metadata/                   # raw VT file metadata
│   ├── llm_inputs/
│   │   ├── text/
│   │   ├── json/
│   │   └── graph/
│   ├── hash_index.json
│   ├── sample_meta.json               # first_seen, cutoff flags
│   ├── ground_truth_per_sample.json
│   ├── ground_truth_by_family.json
│   └── malpedia_techniques.json
├── results/
│   ├── static_results.json
│   ├── agentic_results.json
│   ├── all_metrics.json
│   └── error_analysis.json
├── scripts/
│   ├── 01_pull_hashes.py
│   ├── 02_fetch_vt_reports.py
│   ├── 03_build_ground_truth.py
│   ├── 04_preprocess.py
│   ├── 05_sanity_check.py
│   ├── 06_run_static_eval.py
│   ├── 07_run_agentic_eval.py
│   ├── 08_compute_metrics.py
│   ├── 09_error_analysis.py
│   └── 10_upload_to_huggingface.py
└── tools/
    └── agentic_tools.py               # 5 tools + OpenAI schemas + dispatch
```

## Differences from CTIBench (NeurIPS 2024)

| Aspect             | CTIBench                 | MalForge                         |
|--------------------|--------------------------|----------------------------------|
| Input source       | Static text descriptions | Dynamic sandbox behavior reports |
| Modalities         | Text only                | Text + JSON + graph              |
| Ground truth       | VT labels only           | VT ∩ Malpedia cross-validation   |
| Agentic evaluation | None                     | ReAct loop + judge               |
| Agentic tools      | None                     | 5 concrete tools                 |
| Temporal split     | Yes (CTI-ATE)            | Yes (all tasks)                  |
| Difficulty levels  | None                     | Easy / medium / hard             |
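
## Illustrative scoring sketch

The ATE ground truth (VT ∩ Malpedia), the difficulty tiers, the micro-F1 metric, and the 95% bootstrap confidence intervals described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `scripts/03_build_ground_truth.py` or `scripts/08_compute_metrics.py`: the function names, the use of a percentile bootstrap, and the default of 1000 resamples are all assumptions made for the example.

```python
import random


def cross_validate(vt_techniques, malpedia_techniques):
    """Keep only ATT&CK technique IDs reported by both sources (VT ∩ Malpedia)."""
    return set(vt_techniques) & set(malpedia_techniques)


def difficulty_tier(n_techniques):
    """Map a technique count to the tiers listed above: <=3, 4-7, 8+."""
    if n_techniques <= 3:
        return "easy"
    if n_techniques <= 7:
        return "medium"
    return "hard"


def micro_f1(pairs):
    """Micro-averaged F1 over (predicted_set, gold_set) pairs.

    Counts are pooled across all samples before computing F1.
    """
    tp = sum(len(pred & gold) for pred, gold in pairs)
    fp = sum(len(pred - gold) for pred, gold in pairs)
    fn = sum(len(gold - pred) for pred, gold in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(pairs, metric=micro_f1, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed variant): resample samples with replacement."""
    rng = random.Random(seed)
    n = len(pairs)
    scores = sorted(
        metric([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical technique IDs, for illustration only.
    gold = cross_validate({"T1055", "T1059", "T1547"}, {"T1055", "T1547", "T1027"})
    pairs = [({"T1055", "T1059"}, gold), ({"T1547"}, {"T1547", "T1027"})]
    print(sorted(gold), difficulty_tier(len(gold)))
    print(round(micro_f1(pairs), 4), bootstrap_ci(pairs))
```

Pooling true positives, false positives, and false negatives before computing F1 is what makes the score micro-averaged: samples with many techniques weigh more than samples with few, which is one reason to also report results per difficulty tier.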

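Tags: APT evaluation, Ask search, bootstrap confidence intervals, DAST, DLL hijacking, HuggingFace, IOC, LLM evaluation, MalwareBazaar, NeurIPS, Ollama, ReAct, VirusTotal, agentic evaluation, anti-forensics, multi-task evaluation, multimodal, large-model benchmarking, large language models, security evaluation, malware analysis, dataset benchmark, behavior reports, cross-validation, reverse-engineering tools, static text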