llm-books/sre-agent


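Companion project for a book on production-grade AI agents: a synthetic microservice environment plus an SRE incident-response agent that is built up chapter by chapter.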


# sre-agent

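The companion repository for [*Production AI Agents: Building Systems That Survive Real Users*](https://www.llm-books.com/production-ai-agents). It contains two things:

1. A **synthetic production environment** you can run on a laptop: six microservices that simulate an e-commerce checkout, a full telemetry stack, a load generator, and a chaos engine that injects realistic faults on a schedule.
2. The **SRE agent** that operates in that environment, built component by component, one chapter at a time, so the architecture in the book is real code you can run rather than a diagram you have to take on trust.

The agent watches for incidents, investigates with telemetry tools, proposes a diagnosis and a remediation, and (once it has earned trust, per Chapter 12) executes them. The environment is what it works on.

## Prerequisites

Just **Docker Desktop**. Go, k6, and the Python tooling all run inside containers, so there is nothing else to install. Everything runs on a laptop with 16GB of RAM and idles well under 6GB.

- Install Docker Desktop: https://www.docker.com/products/docker-desktop/
- The first image pull needs about 3GB of disk space.
- Docker's default resource allocation is enough. If you have changed the settings, make sure Docker has at least 4GB of memory (Docker Desktop -> Settings -> Resources).

## Quick start

**Step 0: Start Docker Desktop and wait until it is ready.** This is the step people most often skip. The `make` commands talk to the Docker daemon, and the daemon only runs while Docker Desktop is running. Open the Docker Desktop app (on macOS you can also run `open -a Docker`) and wait until the whale icon in the menu bar stops animating, or until the following command prints version information without errors:

```
docker info
```

If you run `make up` before Docker is ready, you will see a `Cannot connect to the Docker daemon` error. That is not a bug in this repository; it only means Docker Desktop is not running yet. Start it, wait a moment, and try again. See [Troubleshooting](#troubleshooting) below.

**Step 1: Start the environment.**

```
make up      # build and start everything
make smoke   # confirm the six services respond (expect six 200s)
make urls    # print the dashboard links
```

The **first** `make up` pulls several images and builds the Go service, so it can take a few minutes. Later starts take seconds. If it looks stuck, it is almost certainly still pulling or building; check progress in the Docker Desktop dashboard or with `docker compose logs -f`.

**Step 2: Watch the system, then break it.** Open Grafana at http://localhost:3000 (anonymous admin) and look at the Services Overview dashboard. Then inject a fault and watch it happen:

```
make chaos-list
make chaos-inject NAME=orders-slow-query
# ... watch orders p99 climb in Grafana ...
make chaos-clear NAME=orders-slow-query
```

Or run the whole chaos day, which compresses five incidents into about a minute:

```
make chaos-day   # or: make chaos-day SPEED=20 (slower, more watchable)
```

Stop with `make down`, or use `make nuke` to remove the volumes as well.

## Troubleshooting

**`Cannot connect to the Docker daemon at unix:///...docker.sock. Is the docker daemon running?`** Docker Desktop is not running. Start it, wait until `docker info` succeeds, and rerun your command. Every `make` target here needs the daemon. This is the most common first-run obstacle.

**The first `make up` seems to hang.** It is pulling images (Postgres, Prometheus, Grafana, Loki, Tempo, k6) and building the Go service, including a `go mod tidy`. On a normal connection the first run takes minutes, not seconds. Watch actual progress from another terminal with `docker compose logs -f`, or in the Docker Desktop dashboard.

**`Bind for 0.0.0.0:3000 failed: port is already allocated`** (or the same error for ports 9090, 5432, 3100, 3200, 6379, or 8081-8086). Something else on your machine already holds that port. Stop that process, or change the host side of the port mapping in `docker-compose.yml` (the left-hand number in `"3000:3000"`), then run `make up` again.

**The Go build fails while resolving modules.** The build runs `go mod tidy`, which needs network access on the first run to fetch the dependency graph. Check your connection and retry `make up`. Once the build succeeds, the image is cached.

**Grafana shows no data.** Wait about 30 seconds after `make up` for the first Prometheus scrape, and confirm the load generator is running with `docker compose ps load`. If `load` has exited, its ramp has finished; it restarts automatically, or you can run `docker compose up -d load`.

**No logs in Loki / the Explore page is empty.** Promtail ships logs through the mounted Docker socket. This works on Docker Desktop, but logs are a convenience rather than a core signal; most of the agent's work is driven by metrics. If you do need logs and they are missing, check that the `promtail` container is running.

**Tempo is empty.** This is intentional. The services start emitting traces in the Chapter 9 build (observability). Until then, Tempo runs empty and waits.

**I want a clean slate.** `make reset` clears every injected fault and restarts the services in well under 30 seconds. `make nuke` stops everything and removes the volumes, for a full rebuild.

## Layout

```
sre-agent/
  env/                 the synthetic environment (fixed across chapters)
    services/          one configurable Go service, run as six instances
    telemetry/         prometheus, grafana, loki, promtail, tempo configs
    load/              k6 load generator
    chaos/             the chaos engine
    scenarios/         incidents as YAML; also the eval ground truth
    runbooks/          deliberately uneven runbooks
    initdb/            postgres schema
  agent/               the SRE agent (grows per chapter; scope.yaml is ch03)
  evals/               the eval harness (lands in ch07)
  deploys.jsonl        the deploy ledger the agent correlates against
  README-CHAPTERS.md   which git tag holds the agent at the end of each chapter
```

## Architecture

```
flowchart LR
  load[k6 load] --> web
  web --> gw[api-gateway]
  gw --> orders
  orders --> payments
  orders --> inventory
  notifications[notifications worker]
  subgraph telemetry
    prom[Prometheus]
    loki[Loki]
    tempo[Tempo]
    graf[Grafana]
  end
  web -.metrics/logs.-> telemetry
  orders -.-> telemetry
  payments -.-> telemetry
  inventory -.-> telemetry
  notifications -.-> telemetry
  chaos[chaos engine] -. /admin/fault .-> orders
  chaos -. /admin/fault .-> payments
  chaos -. /admin/fault .-> inventory
  chaos -. /admin/fault .-> notifications
```

Traffic enters at `web` and fans out through the dependency chain, so a fault in one service surfaces as symptoms in the services upstream of it. The chaos engine injects faults by calling each service's `/admin/fault` endpoint, with no restart required. The `notifications` worker drains a queue on a timer, and that drain is what stalls in the silent-failure scenario.

## How the agent evolves

Each chapter's "build" section adds one component and is tagged in git. See [README-CHAPTERS.md](README-CHAPTERS.md) for the full mapping. In short: boundaries come first, as `agent/scope.yaml` (Chapter 3), followed by the orchestrator and executor (Chapter 4), state (Chapter 5), tools (Chapter 6), evaluation (Chapters 7 and 8), observability (Chapter 9), cost (Chapter 10), security (Chapter 11), and rollout (Chapter 12). Chapter 13 takes an in-depth, quantitative look at multi-agent designs.

## Status

This is a scaffold. What runs today: the full environment, telemetry, load, and chaos system, plus the Chapter 3 scope configuration. The agent's executable components ship with their corresponding chapters. Trace emission from the services is deferred to the Chapter 9 build (observability), which is why Tempo starts out empty.

## License

MIT. See [LICENSE](LICENSE).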
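## Appendix: checking host ports before `make up`

The port-conflict entry in Troubleshooting lists the host ports the environment binds. The following is a minimal sketch, not part of the repository, of how you might check those ports ahead of time. The names `PORTS`, `port_in_use`, and `busy_ports` are hypothetical, and the port list only mirrors the ports named in that entry; adjust it if you have edited the mappings in `docker-compose.yml`.

```python
import socket

# Host ports named in the port-conflict troubleshooting entry (assumed list).
# Edit this if you changed the host side of any mapping in docker-compose.yml.
PORTS = [3000, 9090, 5432, 3100, 3200, 6379, *range(8081, 8087)]


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP listener already accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=PORTS, host: str = "127.0.0.1") -> list[int]:
    """Return the subset of ports that are already taken, in the given order."""
    return [p for p in ports if port_in_use(p, host)]


if __name__ == "__main__":
    taken = busy_ports()
    if taken:
        print("Ports already in use:", ", ".join(map(str, taken)))
    else:
        print("All expected host ports are free.")
```

Run it while the environment is stopped. Once the stack is up, the same ports will report as busy because the environment's own containers hold them.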
