gonzaloMorenoc/ai-testing-lab

GitHub: gonzaloMorenoc/ai-testing-lab

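A structured teaching lab for LLM testing. Its 13 independent modules cover semantic evaluation, red teaming, hallucination detection, guardrails, observability, and drift monitoring, helping developers quickly learn how to test and assure the quality of LLM applications.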

Stars: 2 | Forks: 0

# ai-testing-lab

Testing for LLMs and chatbots: from the classic test pyramid to semantic evaluation, red teaming, and observability.

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/04/ba90520240173145.svg)](https://github.com/gonzaloMorenoc/ai-testing-lab/actions/workflows/ci.yml)
[![Coverage](https://codecov.io/gh/gonzaloMorenoc/ai-testing-lab/branch/main/graph/badge.svg)](https://codecov.io/gh/gonzaloMorenoc/ai-testing-lab)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue)](https://www.python.org/)
[![Open in Dev Containers](https://img.shields.io/static/v1?label=Dev%20Containers&message=Open&color=blue&logo=visualstudiocode)](https://vscode.dev/redirect?url=vscode://ms-vscode-remote.remote-containers/cloneInVolume?url=https://github.com/gonzaloMorenoc/ai-testing-lab)

## What you will learn

- Write your first `LLMTestCase` with DeepEval and run it through pytest
- Evaluate a RAG pipeline with 9 RAGAS metrics (faithfulness, context recall, answer relevancy, and others)
- Build LLM judges with custom scoring rubrics (G-Eval, DAG Metric)
- Test multi-turn conversations with `ConversationalTestCase`
- Detect prompt regressions with Promptfoo: a YAML matrix of prompts × models
- Detect hallucinations with `HallucinationMetric` and the TruLens RAG Triad
- Scan LLMs for vulnerabilities with Garak (DAN, encoding attacks, toxicity)
- Attack and defend with DeepTeam and the OWASP Top 10 LLM 2025
- Add input/output guardrails with NeMo Guardrails and Guardrails AI
- Evaluate agents: `ToolCallAccuracy`, `AgentGoalAccuracy`, trajectory evaluation
- Write streaming E2E tests for a chatbot UI in Playwright
- Instrument a pipeline with Langfuse/Phoenix and measure production data (OTel)
- Detect semantic drift with Evidently AI and raise alerts before users notice

## Quick start

```
git clone https://github.com/gonzaloMorenoc/ai-testing-lab.git
cd ai-testing-lab
pip install deepeval pytest pytest-recording pyyaml   # minimal dependencies
pytest modules/01-primer-eval/tests/ -m "not slow"    # 8 tests, 0 real LLM calls
pytest modules/02-ragas-basics/tests/ -m "not slow"   # 10 tests, mocked RAGAS metrics
```

No module needs an API key to run its fast tests. For tests marked `@pytest.mark.slow`, export `GROQ_API_KEY` (free tier) before running `pytest -m slow`.

## Repository map

```
ai-testing-lab/
├── modules/                      # 13 numbered, independent labs
│   ├── 01-primer-eval/           # ← start here ✅
│   ├── 02-ragas-basics/          # ✅
│   ├── 03-llm-as-judge/          # ✅
│   ├── 04-multi-turn/            # ✅
│   ├── 05-prompt-regression/     # ✅
│   ├── 06-hallucination-lab/     # ✅
│   ├── 07-redteam-garak/         # ✅
│   ├── 08-redteam-deepteam/      # ✅
│   ├── 09-guardrails/            # ✅
│   ├── 10-agent-testing/         # ✅
│   ├── 11-playwright-streaming/  # ✅
│   ├── 12-observability/         # ✅
│   └── 13-drift-monitoring/      # ✅
├── demos/                        # systems under test (RAG, Streamlit, Rasa, vulnerable bot)
├── docs/                         # manual split by chapter + glossary
├── goldens/                      # versioned evaluation datasets
├── exercises/solutions/          # solutions to each module's exercises
└── docker/compose.yml            # Langfuse + Ollama + demos
```

## Modules

| # | Module | Tests | Status | Core concepts |
|---|--------|-------|--------|----------------|
| 01 | [primer-eval](modules/01-primer-eval/) | 8 | ✅ Implemented | LLMTestCase · AnswerRelevancy · Faithfulness |
| 02 | [ragas-basics](modules/02-ragas-basics/) | 10 | ✅ Implemented | RAGAS · faithfulness · context_precision · context_recall |
| 03 | [llm-as-judge](modules/03-llm-as-judge/) | 11 | ✅ Implemented | G-Eval · DAG Metric · position bias · verbosity bias |
| 04 | [multi-turn](modules/04-multi-turn/) | 10 | ✅ Implemented | ConversationalTestCase · KnowledgeRetention · conversation history |
| 05 | [prompt-regression](modules/05-prompt-regression/) | 11 | ✅ Implemented | PromptRegistry · RegressionChecker · Promptfoo |
| 06 | [hallucination-lab](modules/06-hallucination-lab/) | 9 | ✅ Implemented | claim extraction · groundedness · RAG Triad |
| 07 | [redteam-garak](modules/07-redteam-garak/) | 8 | ✅ Implemented | DAN · encoding attacks · jailbreak · hit rate |
| 08 | [redteam-deepteam](modules/08-redteam-deepteam/) | 8 | ✅ Implemented | OWASP Top 10 LLM 2025 · prompt injection · agency |
| 09 | [guardrails](modules/09-guardrails/) | 11 | ✅ Implemented | PII detection · output validation · input/output pipeline |
| 10 | [agent-testing](modules/10-agent-testing/) | 9 | ✅ Implemented | tool selection · trajectory evaluation · AgentGoalAccuracy |
| 11 | [playwright-streaming](modules/11-playwright-streaming/) | 8 | ✅ Implemented | SSE streaming · E2E chatbot UI · FastAPI mock server |
| 12 | [observability](modules/12-observability/) | 8 | ✅ Implemented | OTel spans · @trace decorator · latency · error tracking |
| 13 | [drift-monitoring](modules/13-drift-monitoring/) | 9 | ✅ Implemented | PSI · semantic drift · alert rules |

## Running all implemented modules

```
# Modules 01-13 combined (107+ tests, ~0.1s, no API key needed)
pytest modules/01-primer-eval/tests/ \
       modules/02-ragas-basics/tests/ \
       modules/03-llm-as-judge/tests/ \
       modules/04-multi-turn/tests/ \
       modules/05-prompt-regression/tests/ \
       modules/06-hallucination-lab/tests/ \
       modules/07-redteam-garak/tests/ \
       modules/08-redteam-deepteam/tests/ \
       modules/09-guardrails/tests/ \
       modules/10-agent-testing/tests/ \
       modules/11-playwright-streaming/tests/ \
       modules/12-observability/tests/ \
       modules/13-drift-monitoring/tests/ \
       -m "not slow"
# Module 11 requires playwright+fastapi; it is skipped automatically if they are not installed.
```

## Credits

| Tool | Purpose |
|------------|---------|
| [DeepEval](https://deepeval.com) | pytest-native evaluation, 50+ metrics, LLM-as-judge |
| [RAGAS](https://docs.ragas.io) | Reference-free RAG metrics (faithfulness, context recall…) |
| [Promptfoo](https://promptfoo.dev) | Prompt regression testing, YAML matrix |
| [Garak](https://github.com/NVIDIA/garak) | LLM vulnerability scanner (NVIDIA, Apache 2.0) |
| [PyRIT](https://github.com/Azure/PyRIT) | Red teaming framework (Microsoft, MIT) |
| [Botium](https://botium-docs.readthedocs.io) | Conversational chatbot testing |
| [TruLens](https://www.trulens.org) | RAG Triad + OpenTelemetry |
| [Phoenix](https://github.com/Arize-ai/phoenix) | Open-source observability (OTel, auto-instrumentation) |
| [Langfuse](https://langfuse.com) | Tracing, online evaluation, self-hosting (MIT) |
| [Guardrails AI](https://guardrailsai.com) | I/O validation for LLMs |
| [NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails) | Conversational guardrails (Colang DSL) |

## License

```
MIT © 2026 Gonzalo Moreno
```
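The module table lists PSI and alert rules as core concepts of module 13. As an illustration only, the following is a minimal standard-library sketch of a Population Stability Index check with a threshold-based alert. It is not taken from the repository: the function names `psi` and `drift_alert`, the equal-width binning, the `eps` smoothing, and the 0.1 / 0.25 thresholds (a common rule of thumb) are all assumptions made for this example.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a current sample.

    Hypothetical helper for illustration; bins are equal-width and are
    derived from the baseline sample only.
    """
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("baseline sample has zero range")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / total, eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def drift_alert(score: float, warn: float = 0.1, critical: float = 0.25) -> str:
    """Map a PSI score to an alert level (thresholds are assumed defaults)."""
    if score >= critical:
        return "critical"
    if score >= warn:
        return "warning"
    return "ok"


# Example: a uniform baseline versus a sample shifted toward higher values.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
same_level = drift_alert(psi(baseline, baseline))
shifted_level = drift_alert(psi(baseline, shifted))
```

In a semantic drift setting, the inputs would typically be per-response scores (for example, similarity to a reference answer) collected over two time windows, so the check compares score distributions rather than raw text.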
