langwatch/scenario

GitHub: langwatch/scenario

一个基于模拟的 Agent 测试框架，用于在多轮对话中验证智能体行为并评估其安全性。

Stars: 920 | Forks: 67

![scenario](https://static.pigsec.cn/wp-content/uploads/repos/2026/04/da14385c66023015.webp)

# 场景 Scenario 是一个基于模拟的 Agent 测试框架，它可以： - 通过模拟不同场景和边缘情况下的用户来测试真实 Agent 行为 - 在对话的任何阶段进行评估和判断，提供强大的多轮控制 - 与任何 LLM 评估框架或自定义评估结合使用，设计上保持中立 - 仅通过实现一个 [`call()`](https://scenario.langwatch.ai/agent-integration) 方法即可集成你的 Agent - 提供 Python、TypeScript 和 Go 版本 📖 [文档](https://scenario.langwatch.ai)\ 📺 [观看视频教程](https://www.youtube.com/watch?v=f8NLpkY0Av4) ## 示例这是使用 Scenario 进行带工具检查的模拟示例： ``` # 定义任何自定义断言 def check_for_weather_tool_call(state: scenario.ScenarioState): assert state.has_tool_call("get_current_weather") result = await scenario.run( name="checking the weather", # Define the prompt to guide the simulation description=""" The user is planning a boat trip from Barcelona to Rome, and is wondering what the weather will be like. """, # Define the agents that will play this simulation agents=[ WeatherAgent(), scenario.UserSimulatorAgent(model="openai/gpt-4.1-mini"), ], # (Optional) Control the simulation script=[ scenario.user(), # let the user simulator generate a user message scenario.agent(), # agent responds check_for_weather_tool_call, # check for tool call after the first agent response scenario.succeed(), # simulation ends successfully ], ) assert result.success ```

TypeScript 示例

``` const result = await scenario.run({ name: "vegetarian recipe agent", // Define the prompt to guide the simulation description: ` The user is planning a boat trip from Barcelona to Rome, and is wondering what the weather will be like. `, // Define the agents that will play this simulation agents: [new MyAgent(), scenario.userSimulatorAgent()], // (Optional) Control the simulation script: [ scenario.user(), // let the user simulator generate a user message scenario.agent(), // agent responds // check for tool call after the first agent response (state) => expect(state.has_tool_call("get_current_weather")).toBe(true), scenario.succeed(), // simulation ends successfully ], }); ```

## 快速开始安装 scenario 和测试运行器： ``` # 在 Python 上 uv add langwatch-scenario pytest # 或在 TypeScript 上 pnpm install @langwatch/scenario vitest ``` 现在创建你的第一个场景，复制下面的完整可运行示例。

快速开始 - Python

保存为 `tests/test_vegetarian_recipe_agent.py`： ``` import pytest import scenario import litellm scenario.configure(default_model="openai/gpt-4.1-mini") @pytest.mark.agent_test @pytest.mark.asyncio async def test_vegetarian_recipe_agent(): class Agent(scenario.AgentAdapter): async def call(self, input: scenario.AgentInput) -> scenario.AgentReturnTypes: return vegetarian_recipe_agent(input.messages) # Run a simulation scenario result = await scenario.run( name="dinner idea", description=""" It's saturday evening, the user is very hungry and tired, but have no money to order out, so they are looking for a recipe. """, agents=[ Agent(), scenario.UserSimulatorAgent(), scenario.JudgeAgent( criteria=[ "Agent should not ask more than two follow-up questions", "Agent should generate a recipe", "Recipe should include a list of ingredients", "Recipe should include step-by-step cooking instructions", "Recipe should be vegetarian and not include any sort of meat", ] ), ], set_id="python-examples", ) # Assert for pytest to know whether the test passed assert result.success # 示例代理实现 import litellm @scenario.cache() def vegetarian_recipe_agent(messages) -> scenario.AgentReturnTypes: response = litellm.completion( model="openai/gpt-4.1-mini", messages=[ { "role": "system", "content": """ You are a vegetarian recipe agent. Given the user request, ask AT MOST ONE follow-up question, then provide a complete recipe. Keep your responses concise and focused. """, }, *messages, ], ) return response.choices[0].message # type: ignore ```

快速开始 - TypeScript

保存为 `tests/vegetarian-recipe-agent.test.ts`： ``` import scenario, { type AgentAdapter, AgentRole } from "@langwatch/scenario"; import { openai } from "@ai-sdk/openai"; import { generateText } from "ai"; import { describe, it, expect } from "vitest"; describe("Vegetarian Recipe Agent", () => { const agent: AgentAdapter = { role: AgentRole.AGENT, call: async (input) => { const response = await generateText({ model: openai("gpt-4.1-mini"), messages: [ { role: "system", content: `You are a vegetarian recipe agent.\nGiven the user request, ask AT MOST ONE follow-up question, then provide a complete recipe. Keep your responses concise and focused.`, }, ...input.messages, ], }); return response.text; }, }; it("should generate a vegetarian recipe for a hungry and tired user on a Saturday evening", async () => { const result = await scenario.run({ name: "dinner idea", description: `It's saturday evening, the user is very hungry and tired, but have no money to order out, so they are looking for a recipe.`, agents: [ agent, scenario.userSimulatorAgent(), scenario.judgeAgent({ model: openai("gpt-4.1-mini"), criteria: [ "Agent should not ask more than two follow-up questions", "Agent should generate a recipe", "Recipe should include a list of ingredients", "Recipe should include step-by-step cooking instructions", "Recipe should be vegetarian and not include any sort of meat", ], }), ], setId: "javascript-examples", }); expect(result.success).toBe(true); }); }); ```

导出你的 OpenAI API 密钥： ``` OPENAI_API_KEY= ``` 现在运行测试： ``` # 在 Python 上 pytest -s tests/test_vegetarian_recipe_agent.py # 或在 TypeScript 上 npx vitest run tests/vegetarian-recipe-agent.test.ts ``` 效果如下： [![asciicast](https://static.pigsec.cn/wp-content/uploads/repos/2026/04/2849ef9310023017.svg)](https://asciinema.org/a/nvO5GWGzqKTTCd8gtNSezQw11) 你可以在 [python/examples/](python/examples/test_vegetarian_recipe_agent.py) 或 [javascript/examples/](javascript/examples/vitest/tests/vegetarian-recipe-agent.test.ts) 找到相同的代码示例。现在查看 [完整文档](https://scenario.langwatch.ai) 以了解更多信息和后续步骤。 ## 自动模拟通过提供一个用户模拟器 Agent 和场景描述（无需脚本），模拟用户将自动向 Agent 生成消息，直到场景成功或达到最大轮数。然后可以使用 Judge Agent 根据特定标准实时评估场景，每一轮 Judge Agent 都会决定是继续模拟还是以判决结束。例如，以下是一个测试 vibe coding 助手的场景： ``` result = await scenario.run( name="dog walking startup landing page", description=""" the user wants to create a new landing page for their dog walking startup send the first message to generate the landing page, then a single follow up request to extend it, then give your final verdict """, agents=[ LovableAgentAdapter(template_path=template_path), scenario.UserSimulatorAgent(), scenario.JudgeAgent( criteria=[ "agent reads the files before go and making changes", "agent modified the index.css file, not only the Index.tsx file", "agent created a comprehensive landing page", "agent extended the landing page with a new section", "agent should NOT say it can't read the file", "agent should NOT produce incomplete code or be too lazy to finish", ], ), ], max_turns=5, # optional ) ``` 查看 [examples/test_lovable_clone.py](examples/test_lovable_clone.py) 中的完整可运行 Lovable Clone 示例。你也可以结合部分脚本使用！例如只控制对话的开头，其余自动进行。 ## 对话的完全控制你可以通过向 `script` 字段传递步骤列表来指定场景引导脚本，这些步骤是接收当前场景状态作为参数的任意函数，因此你可以： - 控制用户说什么，或让内容自动生成 - 控制 Agent 说什么，或让内容自动生成 - 添加自定义断言，例如确保调用了某个工具 - 添加自定义评估，使用外部库 - 让模拟进行指定轮数，并在每轮新回合进行评估 - 触发 Judge Agent 做出判决 - 在对话中间添加任意消息，例如模拟工具调用一切皆有可能，使用相同的简单结构： ``` @pytest.mark.agent_test @pytest.mark.asyncio async def test_early_assumption_bias(): result = await scenario.run( name="early assumption bias", description=""" The agent makes false assumption that the user is talking about an ATM bank, and user corrects it that they actually mean river banks """, agents=[ Agent(), scenario.UserSimulatorAgent(), scenario.JudgeAgent( criteria=[ "user should get good recommendations on river crossing", "agent should NOT keep following up about ATM recommendation after user has corrected them that they are actually just hiking", ], ), ], max_turns=10, script=[ # Define hardcoded messages scenario.agent("Hello, how can I help you today?"), scenario.user("how do I safely approach a bank?"), # Or let it be generated automatically scenario.agent(), # Add custom assertions, for example making sure a tool was called check_if_tool_was_called, # Generate a user follow-up message scenario.user(), # Let the simulation proceed for 2 more turns, print at every turn scenario.proceed( turns=2, on_turn=lambda state: print(f"Turn {state.current_turn}: {state.messages}"), ), # Time to make a judgment call scenario.judge(), ], ) assert result.success ``` ## 红队测试 Scenario 还提供了一个 `RedTeamAgent` —— 作为用户模拟器的即插即用替代方案，它使用相同的 `scenario.run()` 循环和 CI 管道，对你的 Agent 执行多轮对抗攻击（包括 Crescendo 升级、每轮评分、拒绝检测和回溯）。完整指南：[scenario.langwatch.ai/advanced/red-teaming](https://scenario.langwatch.ai/advanced/red-teaming)（以及 [快速开始](https://scenario.langwatch.ai/advanced/red-teaming/quick-start)）。 ## LangWatch 可视化设置你的 [LangWatch API 密钥](https://app.langwatch.ai/) 以在运行时实时可视化场景，获得更好的调试体验和团队协作： ``` LANGWATCH_API_KEY="your-api-key" ``` ![LangWatch Visualization](https://static.pigsec.cn/wp-content/uploads/repos/2026/04/669edbd3a2023018.webp) ## 调试模式你可以通过在 `Scenario.configure` 方法或正在运行的特定场景中设置 `debug` 字段为 `True`，或通过传递 `--debug` 标志给 pytest 来启用调试模式。调试模式允许你逐步慢动作查看消息，并介入输入以从对话中间调试你的 Agent。 ``` scenario.configure(default_model="openai/gpt-4.1-mini", debug=True) ``` 或 ``` pytest -s tests/test_vegetarian_recipe_agent.py --debug ``` ## 缓存每次运行 scenario 时，测试 Agent 可能会选择不同的起始输入，这有助于确保覆盖真实用户的多样性，但我们理解其非确定性可能使结果不易重复、成本更高且更难调试。为解决此问题，你可以在 `Scenario.configure` 方法或正在运行的特定场景中使用 `cache_key` 字段，这将使测试 Agent 对相同的场景输入相同的内容： ``` scenario.configure(default_model="openai/gpt-4.1-mini", cache_key="42") ``` 要清除缓存，可以简单地传递不同的 `cache_key`、禁用它，或删除位于 `~/.scenario/cache` 的缓存文件。更进一步，你可以使用 `@scenario.cache` 装饰器将应用程序中的 LLM 调用或任何其他非确定性函数包装在测试一侧： ``` # 在实际代理实现中 class MyAgent: @scenario.cache() def invoke(self, message, context): return client.chat.completions.create( # ... ) ``` 这将在运行测试时缓存你装饰的函数调用，根据函数参数、执行的场景和你提供的 `cache_key` 进行哈希处理，使结果可重复。你可以排除不应参与缓存键计算的参数。 ## 分组和批次虽然可选，但我们强烈建议为场景、集合和批次设置稳定的标识符，以便在 LangWatch 中更好地组织和跟踪： - **set_id**：将相关场景分组为一个测试套件，对应 UI 中的“模拟集合” - **SCENARIO_BATCH_RUN_ID**：环境变量，用于将一起运行的所有场景分组（例如单个 CI 作业），会自动生成但可以覆盖 ``` import os result = await scenario.run( name="my first scenario", description="A simple test to see if the agent responds.", set_id="my-test-suite", agents=[ scenario.Agent(my_agent), scenario.UserSimulatorAgent(), ] ) ``` 你也可以使用环境变量设置 `batch_run_id` 以用于 CI/CD 集成： ``` import os # 为 CI/CD 集成设置批次 ID os.environ["SCENARIO_BATCH_RUN_ID"] = os.environ.get("GITHUB_RUN_ID", "local-run") result = await scenario.run( name="my first scenario", description="A simple test to see if the agent responds.", set_id="my-test-suite", agents=[ scenario.Agent(my_agent), scenario.UserSimulatorAgent(), ] ) ``` `batch_run_id` 会为每个测试运行自动生成，但你也可以使用 `SCENARIO_BATCH_RUN_ID` 环境变量全局设置。 ## 关闭输出你可以通过从 pytest 中移除 `-s` 标志来隐藏测试期间的输出，仅在测试失败时显示。或者，你可以在 `Scenario.configure` 方法或正在运行的特定场景中设置 `verbose=False`。 ## 并行运行随着场景数量增长，你可能希望并行运行以加快整个测试套件的速度。我们建议使用 [pytest-asyncio-concurrent](https://pypi.org/project/pytest-asyncio-concurrent/) 插件来实现。只需从上述链接安装插件，然后将测试中的 `@pytest.mark.asyncio` 注解替换为 `@pytest.mark.asyncio_concurrent`，并添加组名称以标记应一起并行运行的场景组，例如： ``` @pytest.mark.agent_test @pytest.mark.asyncio_concurrent(group="vegetarian_recipe_agent") async def test_vegetarian_recipe_agent(): # ... @pytest.mark.agent_test @pytest.mark.asyncio_concurrent(group="vegetarian_recipe_agent") async def test_user_is_very_hungry(): # ... ``` 这两个场景现在应该并行运行。 ## 支持 - 📖 [文档](https://scenario.langwatch.ai) - 💬 [Discord 社区](https://discord.gg/langwatch) - 🐛 [问题追踪器](https://github.com/langwatch/scenario/issues) ## 许可证 MIT 许可证 - 详见 [LICENSE](LICENSE) 以获取详细信息。

标签：Agentic testing, Agent 框架, Agent 测试, AI 代理测试, Go, Langwatch, LLM 评估, Python, Ruby工具, Scenario, TypeScript, 多轮对话, 安全插件, 对话评估, 开源测试框架, 无后门, 日志审计, 模拟用户行为, 自动化攻击, 行为模拟, 跨语言 SDK, 边缘用例, 逆向工具, 集成测试