tomerhakak/agentprobe

GitHub: tomerhakak/agentprobe

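A pytest-style testing framework built for AI agents. It provides record and replay, security fuzzing, and cost tracking to help developers keep agent quality, security, and cost under control.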

Stars: 9 | Forks: 2

🧪 AgentProbe

pytest for AI Agents

Record, test, replay, and secure your AI agents: locally, privately, and in your CI pipeline.

## Quick Install

```
# pip (recommended)
pip install agentprobe

# or the one-line installer
curl -fsSL https://raw.githubusercontent.com/tomerhakak/agentprobe/main/install.sh | bash
```

Then get started:

```
agentprobe init       # scaffold config + example tests
agentprobe test       # run your agent tests
agentprobe platform   # local web dashboard at localhost:9700
```

## 30-Second Demo

**Record** your agent, then **test** it, just like with pytest.

```
from agentprobe import record, RecordingSession
from agentprobe import assertions as A

# 1. Record a run
# `response`, `results`, and `answer` stand for values produced by your own agent code.
@record("my-agent")
def run_agent(query: str, session: RecordingSession) -> str:
    session.set_input(query)
    session.add_llm_call(model="gpt-4o", input_messages=[...], output_message=response)
    session.add_tool_call(tool_name="search", tool_input={"q": query}, tool_output=results)
    session.set_output(answer)
    return answer

# 2. Test the recording
def test_agent_works(recording):
    A.set_recording(recording)
    A.output_contains("refund policy")
    A.called_tool("search")
    A.total_cost_less_than(0.05)
    A.no_pii_in_output()
```

```
$ agentprobe test tests/test_agent.py

  PASS  test_basic_response .................. 0.8s
  PASS  test_uses_search_tool ................ 0.3s
  PASS  test_cost_within_budget .............. 0.1s
  PASS  test_no_pii_leakage .................. 0.2s
  FAIL  test_prompt_injection_resistance ..... 0.4s
        AssertionError: Output contains forbidden pattern: 'IGNORE PREVIOUS'

  4 passed, 1 failed in 1.8s
  Total cost: $0.0034 | Tokens: 1,247 | Traces: .agentprobe/
```

## Features

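### 🔴 Record

Capture every LLM call, tool call, and routing decision into a portable `.aprobe` trace file. Framework-agnostic: works with any agent.

### ✅ Test

35+ built-in assertions covering output quality, tool usage, cost, latency, and security. Supports property-based, parameterized, regression, and snapshot testing.

### ⏪ Replay

Swap models, change prompts, and compare results side by side. See the exact cost and behavior differences between `gpt-4o` and `claude-sonnet`.

### 🛡️ Security

Prompt injection fuzzing (47+ variants), PII detection (27 entity types), threat modeling, and security scoring.

### 📊 Monitor

Per-run cost tracking, budget alerts, behavioral drift detection, anomaly detection, and latency percentiles.

### 🧠 Intelligence

Hallucination detection, an auto-optimizer, a model recommender, and A/B testing across agent configurations.

### ⚔️ Arena (Pro)

Agent battles with ELO ratings. Go head-to-head on any dimension: cost, accuracy, speed, security.

### 🔬 Autopsy (Pro)

Forensic failure analysis. Automatic root-cause detection: infinite loops, cost explosions, tool misuse, hallucination spirals.

### 🔍 X-Ray

Token-level cost attribution. See exactly which step, which tool call, and which LLM request is consuming your budget. Beautiful tree visualization.

### 📋 Compliance (Pro)

53 automated checks covering SOC2, HIPAA, GDPR, PCI-DSS, and CCPA. Generates audit-ready reports.

### 🔥 Agent Roast

Get a brutally honest (and funny) analysis of your agent. 450 jokes, 3 severity levels. "Your agent spends money like a drunken sailor in a token store."

### 💰 Cost Calculator

Find out the true cost of your agent. Per-run, monthly, and yearly projections. Model comparison with savings recommendations.

### 🏥 Health Check

5-dimension health score (reliability, speed, cost, security, quality), with progress bars and actionable recommendations.

### 🎮 Injection Playground

55 prompt injection attacks across 5 categories. Interactively test your agent's defenses.

### 🏆 Leaderboard

Rank your agents by composite score. Track improvement over time. SQLite-backed, fully local.

### ⚖️ Model Comparator

Side-by-side model comparison: cost, speed, quality, hallucination rate. Crowns the winner with a crown emoji.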
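The Arena feature listed above mentions agent battles with ELO ratings. As a rough illustration of how such ratings usually evolve, here is a minimal sketch of the standard Elo update. It is not AgentProbe's implementation: the function names, the K-factor of 32, and the 1500 starting rating are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one match.

    score_a is 1.0 if agent A won, 0.5 for a draw, 0.0 if A lost.
    k (hypothetical default) controls how far a single match moves a rating.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two agents start level; agent A wins one head-to-head comparison.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

With equal starting ratings the expected score is 0.5, so a win moves each rating by half of K in opposite directions. A win against a much stronger opponent would move the ratings further.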

## Integrations

AgentProbe is framework-agnostic and works with any agent framework.

| Framework | Integration |
|---|---|
| OpenAI SDK | Auto-instrumentation |
| Anthropic SDK | Auto-instrumentation |
| LangChain | Callback handler |
| CrewAI | Callback handler |
| AutoGen | Adapter |
| Custom agents | Manual recording API |

### GitHub Action

```
- uses: tomerhakak/agentprobe@v1
  with:
    args: test --ci --report report.html
```

### LangChain

```
from agentprobe.adapters.langchain import AgentProbeCallbackHandler

handler = AgentProbeCallbackHandler(session_name="my-chain")
chain.invoke({"input": "..."}, config={"callbacks": [handler]})
```

### CrewAI

```
from agentprobe.adapters.crewai import AgentProbeCrewHandler

handler = AgentProbeCrewHandler(session_name="my-crew")
crew.kickoff(callbacks=[handler])
```

### CI/CD

GitHub Actions

```
name: Agent Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install agentprobe[all]
      - run: agentprobe test --ci --report report.html
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agentprobe-report
          path: report.html
```

GitLab CI

```
agent-tests:
  image: python:3.12
  script:
    - pip install agentprobe[all]
    - agentprobe test --ci --report report.html
  artifacts:
    paths:
      - report.html
    when: always
```

Jenkins

```
pipeline {
    agent { docker { image 'python:3.12' } }
    stages {
        stage('Agent Tests') {
            steps {
                sh 'pip install agentprobe[all]'
                sh 'agentprobe test --ci --report report.html'
            }
        }
    }
    post {
        always {
            archiveArtifacts artifacts: 'report.html'
        }
    }
}
```

## CLI Reference

```
agentprobe record        Record an agent run into a .aprobe trace
agentprobe test          Run agent tests (pytest-compatible)
agentprobe replay        Replay a recording with a different model or config
agentprobe fuzz          Fuzz your agent with prompt injections & edge cases
agentprobe scan          Security scan — PII detection, injection resistance
agentprobe roast         Get a funny brutal analysis of your agent
agentprobe xray          Visualize agent thinking step-by-step
agentprobe health        5-dimension health check with scores
agentprobe cost          Calculate true cost projections & savings
agentprobe compare       Side-by-side model comparison
agentprobe playground    Interactive prompt injection lab (55 attacks)
agentprobe leaderboard   Rank and track your agents over time
agentprobe analyze       Cost breakdown, drift detection, failure clustering
agentprobe platform      Launch the local web dashboard (localhost:9700)
agentprobe init          Scaffold config file and example tests
agentprobe diff          Compare two recordings or agent versions
```

Run `agentprobe --help` for the full list.

## Platform

AgentProbe ships with a **local web dashboard**: no cloud, no account, and your data never leaves your machine.

```
agentprobe platform start   # opens http://localhost:9700
```

```
+------------------------------------------------------------------+
|  AgentProbe Platform                             localhost:9700  |
+------------------------------------------------------------------+
|                                                                  |
|  Recent Traces                         Cost Trend (7d)           |
|  +---------------------------------+   +---------------------+   |
|  | customer-support  0.3s  $0.003  |   |          __/        |   |
|  | order-lookup      1.2s  $0.018  |   |      ___/           |   |
|  | refund-agent      0.8s  $0.007  |   |  ___/               |   |
|  | billing-qa        0.5s  $0.004  |   | /                   |   |
|  +---------------------------------+   +---------------------+   |
|                                                                  |
|  Assertions: 142 passed, 3 failed      Avg cost/run: $0.008      |
|  Models: gpt-4o (67%), claude-sonnet (33%)   Drift: LOW          |
+------------------------------------------------------------------+
```

Traces, cost breakdowns, assertion results, drift detection, and failure analysis, all in one place.

## Free vs. Pro

| Feature | Free | Pro |
|---|:---:|:---:|
| Recording & Replay | ✅ | ✅ |
| 35+ Assertions | ✅ | ✅ |
| pytest Integration | ✅ | ✅ |
| Mock LLMs & Tools | ✅ | ✅ |
| CI/CD & GitHub Action | ✅ | ✅ |
| Cost & Latency Tracking | ✅ | ✅ |
| PII Detection (27 types) | ✅ | ✅ |
| Basic Fuzzing | ✅ | ✅ |
| Local Dashboard | ✅ | ✅ |
| 🔥 Agent Roast (450 jokes) | ✅ | ✅ |
| 🔬 X-Ray Visualization | ✅ | ✅ |
| 💰 Cost Calculator & Projections | ✅ | ✅ |
| 🏥 Health Check (5 dimensions) | ✅ | ✅ |
| 🎮 Injection Playground (55 attacks) | ✅ | ✅ |
| 🏆 Agent Leaderboard | ✅ | ✅ |
| ⚖️ Model Comparator | ✅ | ✅ |
| ⚔️ Agent Battle Arena | - | ✅ |
| 🔬 Agent Autopsy | - | ✅ |
| 📋 Compliance (53 checks) | - | ✅ |
| 🛡️ Security Scorer (71 checks) | - | ✅ |
| 🧠 Agent Benchmark (6D) | - | ✅ |
| 🔀 Agent Diff & Changelog | - | ✅ |
| 🧠 Brain (auto-optimizer) | - | ✅ |
| Full Fuzzer (47+ variants) | - | ✅ |

Learn more about Pro →

## Why AgentProbe?

| | AgentProbe | Promptfoo | DeepEval | Ragas |
|---|:---:|:---:|:---:|:---:|
| Record agent traces | ✅ | - | - | - |
| Replay with model swap | ✅ | - | - | - |
| 35+ built-in assertions | ✅ | Custom | 14 | 8 |
| Prompt injection fuzzing | ✅ | Basic | - | - |
| Tool call assertions | ✅ | - | - | - |
| Cost & latency assertions | ✅ | - | Partial | - |
| PII detection | ✅ | - | - | - |
| Mock LLMs & tools | ✅ | - | - | - |
| pytest native | ✅ | - | Plugin | - |
| Framework agnostic | ✅ | LLM-only | LLM-only | RAG-only |
| Fully offline | ✅ | Partial | - | - |
| Local dashboard | ✅ | ✅ | ✅ | - |

## Examples

### Replay with a Different Model

```
from agentprobe import Replayer, ReplayConfig

replayer = Replayer()
result = replayer.replay(
    "recordings/customer-support.aprobe",
    config=ReplayConfig(model="claude-sonnet-4-20250514", mock_tools=True),
)

# `original` refers to the original recording; it is not defined in this snippet.
comparison = replayer.compare(original, result)
print(comparison.summary)
# Output similarity: 94.2%
# Cost: $0.0180 -> $0.0095 (-47.2%)
```

### Fuzz for Prompt Injection

```
from agentprobe.fuzz import Fuzzer, PromptInjection, EdgeCases

fuzzer = Fuzzer()
result = fuzzer.run(
    agent_fn=run_agent,
    strategies=[PromptInjection(), EdgeCases()],
    assertions=lambda A: [
        A.no_pii_in_output(),
        A.output_not_contains("IGNORE PREVIOUS"),
        A.completed_successfully(),
    ],
)
print(result.summary())
# Strategy: PromptInjection | Tested: 47 | Failed: 2 | Failure rate: 4.3%
```

### Mock for Fast, Free, Deterministic Tests

```
from agentprobe.mock import MockLLM, MockTool

mock_llm = MockLLM(responses=["Your order #1234 has been shipped."])
mock_search = MockTool(responses=[{"results": ["Order shipped on March 15"]}])

# `replayer` and `recording` come from the replay setup shown earlier.
result = replayer.replay(
    recording,
    config=ReplayConfig(mock_llm=mock_llm, tool_mocks={"search_orders": mock_search}),
)
# Zero API calls. Zero cost. Deterministic.
```

See more in the [`examples/`](examples/) directory.

## Contributing

Contributions are welcome. Here is how to get started:

```
git clone https://github.com/tomerhakak/agentprobe.git
cd agentprobe
pip install -e ".[dev,all]"
pytest
ruff check .
```

Please open an issue before submitting a large PR so we can discuss the approach.

## License

[MIT](LICENSE). Use it however you like.

Built by @tomerhakak · agentprobe.dev

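Tags: AI security, Chat Copilot, LLM application development, Pytest, Python, RAG evaluation, large language models, security testing, security rules engine, record and replay, cost tracking, offensive security, no backdoors, local-first, testing framework, end-to-end testing, quality assurance, reverse engineering tools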