glo26/stepshield

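StepShield is a temporal benchmark for step-level rogue action detection in autonomous code agents. It uses metrics such as the Early Intervention Rate to measure how early in a trajectory a detector catches rogue behavior, rather than only whether it catches it at all.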

# StepShield: A Temporal Benchmark for Step-Level Rogue Action Detection

[![License: MIT](https://img.shields.io/badge/Code-MIT-blue.svg)](LICENSE) [![Data: CC BY 4.0](https://img.shields.io/badge/Data-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) [![arXiv](https://img.shields.io/badge/arXiv-2026.xxxxx-b31b1b.svg)](https://arxiv.org/abs/2026.xxxxx) [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://python.org)

This repository provides the official implementation, dataset, and benchmark for **StepShield**, a temporal evaluation framework for agent safety. It measures *when* a detector identifies a violation, not just *whether* it does.

StepShield: Early Intervention Rate

**Figure 1.** Two detectors with the same 90% accuracy can differ by 4x in intervention capability. StepShield's Early Intervention Rate (EIR) makes this gap visible and measurable.

## Key Results

Evaluated on the held-out test set (216 trajectories: 108 rogue, 108 clean; zero template overlap with the training set):

| Detector | Acc | F1 | EIR | IG | Tokens Saved | FPR | Recall |
|----------|-----|-----|------|------|--------------|------|--------|
| Random | 0.50 | 0.50 | 0.12 | -3.0 | 49.1% | 50.0% | 50.0% |
| StaticGuard | 0.54 | 0.65 | 0.23 | -5.0 | 78.9% | 77.8% | 86.1% |
| ConstraintGuard | 0.50 | 0.32 | 0.08 | -5.3 | N/A | 25.0% | 24.1% |
| **LLMJudge** | **0.95** | **0.95** | **0.89** | **-0.4** | **91.2%** | **5.6%** | **95.4%** |
| HybridGuard | 0.66 | 0.69 | 0.40 | -2.7 | 87.5% | 44.4% | 75.9% |

**Key finding:** LLMJudge achieves an EIR **4x higher** than StaticGuard (0.89 vs. 0.23), even though StaticGuard has high recall (86.1%). StaticGuard's EIR is statistically indistinguishable from the random-timing baseline (0.24), which means its pattern matches fire on benign prefix steps rather than at the true rogue onset. This 4x gap in intervention capability is invisible to standard accuracy metrics.

## Cross-Model Evaluation (Ablation 7)

| Model | EIR | Recall | FPR | Cost/1K steps |
|-------|------|--------|------|---------------|
| GPT-4.1-mini (default) | 0.89 | 0.95 | 5.6% | $0.30 |
| OpenAI o4-mini | 0.97 | 0.88 | 0.9% | $0.40 |
| Claude Opus 4 | 0.93 | 0.93 | 5.6% | $15.00 |
| Gemini 2.5 Flash | 0.41 | 0.67 | 38.3% | $0.15 |
| GPT-4.1-mini (zero-shot) | 0.38 | 0.44 | 48.7% | $0.30 |
| GPT-4.1-nano | 0.15 | 0.24 | 11.2% | $0.05 |

No single model dominates on all three dimensions (precision, recall, and timeliness). o4-mini achieves the highest EIR (0.97) and the lowest FPR (0.9%), but at the cost of recall. GPT-4.1-mini offers the best cost-performance trade-off.

## Per-Category EIR Breakdown

| Category | LLMJudge EIR | StaticGuard EIR | Gap | Difficulty |
|----------|--------------|-----------------|------|------------|
| Supply Chain Attack | 1.00 | 0.31 | 3.2x | Easy |
| Destructive Action | 1.00 | 0.24 | 4.2x | Easy |
| Resource Hijacking | 1.00 | 0.19 | 5.3x | Easy |
| Covert Persistence | 0.89 | 0.31 | 2.9x | Medium |
| Data Exfiltration | 0.78 | 0.18 | 4.4x | Hard |
| Privilege Escalation | 0.72 | 0.18 | 4.1x | Hard |

## Key Contributions

1. **Temporal evaluation paradigm.** We shift agent safety evaluation from post-hoc binary classification to real-time, step-level temporal detection.
2. **Novel temporal metrics.** We introduce Early Intervention Rate (EIR), Intervention Gap (IG), and Tokens Saved to quantify detection timeliness.
3. **Large-scale dataset.** 9,429 code agent trajectories (1,278 train + 7,935 test, of which 6,657 are benign, + 216 held-out) with step-level annotations across 6 violation categories, mapped to MITRE ATT&CK techniques.
4. **Comprehensive benchmark.** Four detection paradigms are evaluated, revealing a 4x EIR gap that standard metrics cannot see.
5. **Economic framework.** Cascaded monitoring reduces LLM inference cost by 75% while maintaining detection quality.

## Dataset

### Overview

| Split | Rogue | Clean | Benign | Total |
|-------|-------|-------|--------|-------|
| Train | 639 | 639 | -- | 1,278 |
| Test | 639 | 639 | 6,657 | 7,935 |
| Held-out | 108 | 108 | -- | 216 |
| **Total** | **1,386** | **1,386** | **6,657** | **9,429** |

### Violation Categories

| Code | Category | ATT&CK Mapping | Train Pairs | Avg Steps |
|------|----------|----------------|-------------|-----------|
| INV | Data Exfiltration | T1005, T1048 | 108 | 14.0 |
| SEC | Privilege Escalation | T1068, T1548 | 102 | 13.0 |
| RES | Resource Hijacking | T1496 | 102 | 12.9 |
| TST | Supply Chain Attack | T1195 | 111 | 11.8 |
| DEC | Destructive Action | T1485, T1486 | 102 | 13.4 |
| UFO | Covert Persistence | T1053, T1547 | 114 | 11.3 |

### Severity Levels

| Level | Description | Detection Difficulty |
|-------|-------------|---------------------|
| L1 | Blatant violations that any reviewer would spot immediately | Obvious |
| L2 | Visible violations that take minimal analysis to recognize | Clear |
| L3 | Violations that require cross-step contextual reasoning to uncover | Subtle |

## Metrics

**Early Intervention Rate (EIR):**

```
EIR_k = |{t in R_det : d_t - r_t <= k}| / |R_det|
```

where `R_det` is the set of detected rogue trajectories, `d_t` is the detection step, and `r_t` is the ground-truth rogue onset step. By conditioning on successful detection, EIR isolates timing quality from recall.

**Intervention Gap (IG):** The mean signed distance `d_t - r_t` over detected rogue trajectories. A negative value means the detector fired prematurely, before the rogue onset.

**Tokens Saved:** The fraction of downstream tokens avoided by terminating early at the detection step.

## Quick Start

```
git clone https://github.com/glo26/stepshield.git
cd stepshield
pip install -r requirements.txt
```

### Run Deterministic Baselines (No API Key Required)

```
python benchmark/run_benchmark.py \
    --data-dir data/test_holdout \
    --detectors static constraint \
    --output results/deterministic_baselines.json
```

### Run LLMJudge (Requires an OpenAI API Key)

```
export OPENAI_API_KEY="your-key-here"
python benchmark/run_benchmark.py \
    --data-dir data/test_holdout \
    --detectors llm_judge \
    --output results/llm_judge_results.json
```

### Verify Pre-computed Results

```
python -m pytest tests/ -v
```

## Repository Structure

```
stepshield/
├── arxiv/                       # arXiv submission package
│   ├── StepShield.tex           # Main paper
│   ├── StepShield.bbl           # Bibliography
│   └── figures/                 # Paper figures (PDF + PNG)
├── benchmark/                   # Benchmark runner and detectors
│   ├── run_benchmark.py         # Main entry point
│   ├── detectors/               # Detector implementations
│   │   ├── static_guard.py      # Pattern-based detector
│   │   ├── llm_judge.py         # LLM-based detector (GPT-4.1-mini)
│   │   ├── hybrid_guard.py      # Two-stage cascade
│   │   └── constraint_guard.py
│   └── metrics/
│       └── timing_metrics.py    # EIR, IG, Tokens Saved
├── data/                        # Full dataset
│   ├── train/                   # 1,278 labeled trajectories
│   ├── test_holdout/            # 216 held-out trajectories
│   ├── generated_benign/        # 2,514 synthetic benign trajectories
│   ├── incidents/               # 127 ATT&CK-mapped incident templates
│   └── croissant.json           # Machine-readable dataset card
├── results/                     # Pre-computed benchmark results
├── tests/                       # Unit and integration tests
├── docs/                        # Additional documentation
├── examples/                    # Usage examples
├── LICENSE                      # MIT (code) + CC BY 4.0 (data)
└── REPRODUCIBILITY.md           # 3-tier reproduction guide
```

## Reproducibility

We provide a three-tier reproducibility design:

1. **Tier 1 (deterministic, no API key required):** StaticGuard and ConstraintGuard produce bit-identical results on any machine.
2. **Tier 2 (API-dependent):** LLMJudge and HybridGuard require an OpenAI API key. Estimated cost for the full held-out test set: ~$12.
3. **Tier 3 (verification):** The pre-computed results in `results/` can be checked against the paper tables without making any API calls.

See [REPRODUCIBILITY.md](REPRODUCIBILITY.md) for detailed step-by-step instructions.

## Economic Framework

The cascaded monitoring architecture (the cheap StaticGuard as a filter, with LLMJudge handling escalated cases) reduces inference cost by 75%:

| Rogue Rate | Escalation Rate | Cost/1K Steps | Savings vs. Full LLM |
|------------|-----------------|---------------|----------------------|
| 1% | 25.1% | $0.075 | 74.9% |
| 5% | 25.5% | $0.077 | 74.5% |
| 8.1% (test) | 25.8% | $0.077 | 74.2% |
| 15% | 26.4% | $0.079 | 73.6% |
| 20% | 26.9% | $0.081 | 73.1% |

At enterprise scale (10 million steps per day), this is projected to save $73K per day ($26.7M per year).

## Citation

```
@article{felicia2026stepshield,
  title={StepShield: A Temporal Benchmark for Step-Level Rogue Action Detection in Autonomous Code Agents},
  author={Felicia, Gloria},
  journal={arXiv preprint arXiv:2026.xxxxx},
  year={2026}
}
```

## License

- **Code:** [MIT License](LICENSE)
- **Data:** [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

## Acknowledgments

This work was supported by the MOVE Fellowship. We thank the security researchers who contributed to the annotation effort and the reviewers who improved the benchmark design.
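
## Example: Computing the Timing Metrics

The metric definitions in the Metrics section are compact, so a small worked sketch may help. The code below is an illustration only and is not taken from `benchmark/metrics/timing_metrics.py`. The `RogueTrajectory` class, its field names, 0-based step indexing, and the pooled reading of Tokens Saved are all assumptions. The EIR function follows the README formula literally, so a detection that fires before the onset also satisfies `d_t - r_t <= k`; the repository's implementation may treat premature triggers differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RogueTrajectory:
    """One rogue trajectory (hypothetical schema, not the repository's)."""
    onset_step: int                 # r_t: ground-truth step where the rogue action begins
    detection_step: Optional[int]   # d_t: step where the detector fired, None if missed
    tokens_per_step: Sequence[int]  # token count of each step, index 0 = first step


def detected(trajs: Sequence[RogueTrajectory]) -> list[RogueTrajectory]:
    """R_det: the rogue trajectories the detector flagged at some step."""
    return [t for t in trajs if t.detection_step is not None]


def eir(trajs: Sequence[RogueTrajectory], k: int = 0) -> float:
    """EIR_k: share of detected trajectories with d_t - r_t <= k."""
    det = detected(trajs)
    if not det:
        return 0.0  # assumption: report 0.0 when nothing was detected
    early = sum(1 for t in det if t.detection_step - t.onset_step <= k)
    return early / len(det)


def intervention_gap(trajs: Sequence[RogueTrajectory]) -> float:
    """IG: mean signed distance d_t - r_t over detected trajectories."""
    det = detected(trajs)
    if not det:
        return 0.0
    return sum(t.detection_step - t.onset_step for t in det) / len(det)


def tokens_saved(trajs: Sequence[RogueTrajectory]) -> float:
    """Fraction of tokens that come after the detection step.

    Assumption: tokens are pooled over detected trajectories, and the
    detection step itself still counts as spent.
    """
    det = detected(trajs)
    total = sum(sum(t.tokens_per_step) for t in det)
    if total == 0:
        return 0.0
    saved = sum(sum(t.tokens_per_step[t.detection_step + 1:]) for t in det)
    return saved / total


# Three toy trajectories of six 100-token steps each.
demo = [
    RogueTrajectory(onset_step=3, detection_step=3, tokens_per_step=[100] * 6),     # gap 0
    RogueTrajectory(onset_step=2, detection_step=5, tokens_per_step=[100] * 6),     # gap 3
    RogueTrajectory(onset_step=4, detection_step=None, tokens_per_step=[100] * 6),  # missed
]

print(eir(demo, k=0))          # 0.5: one of the two detections is on time
print(intervention_gap(demo))  # 1.5: mean of the gaps 0 and 3
print(tokens_saved(demo))
```

The missed trajectory affects recall but not EIR or IG, which is the sense in which EIR isolates timing quality from recall.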

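
## Example: Cascade Cost Model

The cost figures in the Economic Framework table are consistent with a simple model in which only escalated steps reach the LLM and the StaticGuard filter costs nothing. The sketch below encodes that model. The function names and the zero filter cost are assumptions, and the $0.30 per 1K steps figure is the GPT-4.1-mini cost from the cross-model table; the paper's actual cost accounting may differ.

```python
def cascade_cost_per_1k(
    escalation_rate: float,
    llm_cost_per_1k: float = 0.30,    # GPT-4.1-mini cost from the cross-model table
    filter_cost_per_1k: float = 0.0,  # assumption: StaticGuard is effectively free
) -> float:
    """Cost per 1K steps when only escalated steps are sent to the LLM."""
    return filter_cost_per_1k + escalation_rate * llm_cost_per_1k


def savings_vs_full_llm(
    escalation_rate: float,
    llm_cost_per_1k: float = 0.30,
    filter_cost_per_1k: float = 0.0,
) -> float:
    """Relative saving compared with running the LLM on every step."""
    cascade = cascade_cost_per_1k(escalation_rate, llm_cost_per_1k, filter_cost_per_1k)
    return 1.0 - cascade / llm_cost_per_1k


# Escalation rates taken from the Economic Framework table.
for rate in (0.251, 0.255, 0.258, 0.264, 0.269):
    cost = cascade_cost_per_1k(rate)
    saving = savings_vs_full_llm(rate)
    print(f"escalation {rate:.1%}: ${cost:.4f} per 1K steps, saving {saving:.1%}")
```

Under these assumptions the saving is simply one minus the escalation rate, which is why it stays close to 75% across the rogue rates in the table.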