venkiverse7/launch-ops-ai

GitHub: venkiverse7/launch-ops-ai

An AI-driven DevOps monitoring system themed around SpaceX telemetry, combining streaming anomaly detection, a RAG-based SRE copilot, and a full observability stack.

Stars: 0 | Forks: 0

# Launch Ops AI

An AI-driven DevOps monitoring and incident-response system themed around SpaceX mission telemetry. It simulates a mission-control environment that monitors SpaceX-style telemetry (fuel levels, engine temperatures, stage-separation events), detects anomalies, and uses an LLM agent with **RAG + memory** to explain them and recommend runbook actions: an SRE copilot for space missions.

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/2fe9f12409053935.svg)](https://github.com/venkiverse7/launch-ops-ai/actions/workflows/ci.yml)

## Why it's interesting

It pulls together what AI/platform engineers are actually asked to build: a streaming telemetry pipeline, transparent anomaly detection, a **RAG-grounded LLM agent** (LangGraph-style orchestration + memory), a FastAPI service, and a full **Prometheus + Grafana** observability stack, all containerized and tested in CI. Crucially, it **runs end-to-end with no external services and no API keys** (it has a solid offline fallback), so cloning and demoing it is trivial.

## Architecture

```
SpaceX v4 API specs ──► telemetry_simulator ──► data_pipeline ──► AnomalyDetector
(static baselines)      (time-series ticks)   (stream + metrics)    (Incidents)
                                                                         │
             knowledge_base (runbooks + failure reports)                 │
                   │                                                     ▼
                   └────► ai_agent (RAG ► LangGraph ► memory) ──► SRE copilot
                                                                         │
                   Prometheus ◄── /metrics ── chat_interface (FastAPI) ◄─┘
                       │                            ▲
                    Grafana            /chat /simulate /incidents
```

| Package | Role |
|---|---|
| `telemetry_simulator/` | Generates per-second SpaceX-style telemetry from a Falcon 9 flight profile; injects faults |
| `data_pipeline/` | Redis-Streams-compatible queue, rule-based `AnomalyDetector`, ingest layer, Prometheus metrics |
| `knowledge_base/` | Runbooks and real failure reports (CRS-7, AMOS-6) for RAG |
| `ai_agent/` | RAG retriever, LLM client (Anthropic/OpenAI/fallback), LangGraph-style graph, memory, `SRECopilot` |
| `chat_interface/` | FastAPI "Mission Control Chat" service |
| `monitoring/` | Prometheus scrape config + provisioned Grafana datasource and dashboard |

**Tech stack:** Python · FastAPI · Redis-Streams-compatible queue · Prometheus + Grafana · Docker / Compose · LangGraph-style agent · Chroma-ready RAG · GitHub Actions.

## Quick start

```
make install      # venv + core deps
make test         # 50 tests, all offline
make run          # http://localhost:8000/docs
```

Or run the full observability stack:

```
make docker-up
# API      http://localhost:8000/docs
# Metrics  http://localhost:8000/metrics
# Prom     http://localhost:9090
# Grafana  http://localhost:3000  (admin/admin) → "Launch Ops AI" dashboard
```

### Try it

```
# Stream a mission with two injected faults
python -m telemetry_simulator.run --anomaly engine_out@120:40:3 --anomaly fuel_leak@60:80

# Run a mission through the pipeline + agent via the API
curl -s localhost:8000/simulate -H 'content-type: application/json' -d '{
  "anomalies": [{"kind":"engine_out","start_t":50,"engine_index":1},
                {"kind":"engine_out","start_t":50,"engine_index":2},
                {"kind":"engine_out","start_t":50,"engine_index":3}],
  "explain": true }' | python -m json.tool

# Ask the copilot
curl -s localhost:8000/chat -H 'content-type: application/json' \
  -d '{"message":"what do I do about an engine overtemp?"}' | python -m json.tool
```

## How the AI agent works

1. **Detect**: the rule-based `AnomalyDetector` uses the vehicle's profile limits (for example, Falcon 9's `engine_loss_max = 2`) to turn telemetry into typed `Incident`s (engine_out, overtemp, fuel_leak, sensor_dropout, pressure_drop).
2. **Retrieve**: a TF-IDF cosine vector store over the knowledge base pulls the most relevant runbooks + failure reports. (Chroma can be plugged in via `requirements-ai.txt`; it sits behind the same `Retriever` interface.)
3. **Reason**: the nodes `classify → retrieve → diagnose → recommend → compose` build a grounded explanation and extract the *immediate actions* from the runbook.
4. **Remember**: `AgentMemory` keeps recent incidents + conversation turns so the copilot can answer "what should I do about the last anomaly?".

### LLM backend

`LLMClient.auto()` resolves in this order: **Anthropic** (if `ANTHROPIC_API_KEY` is set and `anthropic` is installed) → **OpenAI** → **deterministic offline fallback**. The fallback is genuinely useful (it produces a grounded diagnosis + runbook actions from the RAG results), which is why the whole project runs and passes its tests with zero keys.

```
make install-ai
export ANTHROPIC_API_KEY=sk-...
# now /chat uses a real LLM, no code changes
```

## Telemetry model (Day 1 → Day 2 mapping)

The static SpaceX v4 fields become the simulator's baselines and limits:

| Real API field | Simulated signal |
|---|---|
| `first_stage.engines: 9`, `thrust_*` | 9 per-engine temperature + thrust channels |
| `fuel_amount_tons`, `burn_time_sec` | Per-stage `fuel_pct` depletion during the burn |
| `engines.engine_loss_max: 2` | Engine-out severity threshold |
| `stages: 2` + burn times | MECO / stage separation / SECO event timeline |
| `cores[].landing_*` | Booster landing burn phase |
| `propellant_1/2` (LOX / RP-1) | Two propellant levels |

## API

| Method | Path | Description |
|---|---|---|
| GET | `/health` | Liveness + active LLM backend + KB size |
| GET | `/metrics` | Prometheus exposition format |
| POST | `/chat` | Ask the SRE copilot a question |
| POST | `/simulate` | Run a mission (with optional anomalies), then detect and explain incidents |
| GET | `/incidents` | Incidents from the last `/simulate` run |
| GET | `/telemetry/sample` | A small sample of telemetry points |

## Tests

```
make test     # pytest: simulator, pipeline, RAG, agent, API, monitoring
make lint     # ruff
```

CI (GitHub Actions) runs lint + tests on Python 3.9/3.11/3.12, then builds and smoke-tests the Docker image.

## Project layout

```
telemetry_simulator/   simulator, flight profile, anomalies, API explorer
data_pipeline/         stream queue, anomaly detector, ingest, incidents
ai_agent/              llm, rag, memory, graph, agent (SRE copilot)
knowledge_base/        runbooks/ + failure_reports/ + loader
chat_interface/        FastAPI app + schemas
monitoring/            metrics.py, prometheus.yml, grafana/
tests/                 full pytest suite
.github/workflows/     CI
Dockerfile, docker-compose.yml, Makefile, pyproject.toml
```

## Roadmap

- [x] Day 1: Scaffolding + SpaceX API exploration
- [x] Day 2: Telemetry simulator (profile-based + anomaly injection)
- [x] Day 3: Streaming pipeline + anomaly detector + metrics
- [x] Day 4: Knowledge base + RAG retriever
- [x] Day 5: LLM agent (RAG + LangGraph-style graph + memory)
- [x] Day 6: FastAPI Mission Control Chat
- [x] Day 7: Prometheus + Grafana
- [x] Day 8: Docker + Compose
- [x] Day 9: CI/CD + Makefile + docs
- [ ] Future: Redis Streams backend, Chroma vector store, Kubernetes, and the live API once it is back online
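
The retrieval step of the agent (a TF-IDF cosine vector store over the knowledge base) can be illustrated with a small standard-library sketch. This is not the project's `Retriever` implementation: the `TinyRetriever` class, the `tokenize` helper, the smoothed IDF formula, and the three sample documents are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class TinyRetriever:
    """Hypothetical stand-in for a TF-IDF cosine retriever (not the repo's class)."""

    def __init__(self, docs: dict[str, str]) -> None:
        self.names = list(docs)
        tokenized = [tokenize(docs[name]) for name in self.names]
        n_docs = len(tokenized)
        # Document frequency: how many documents contain each term.
        df = Counter(term for tokens in tokenized for term in set(tokens))
        # Smoothed IDF (an assumption): rarer terms get a larger weight.
        self.idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
        self.vectors = [self._vectorize(tokens) for tokens in tokenized]

    def _vectorize(self, tokens: list[str]) -> dict[str, float]:
        """Build an L2-normalized sparse TF-IDF vector; unknown terms are dropped."""
        counts = Counter(tokens)
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def search(self, query: str, k: int = 2) -> list[tuple[str, float]]:
        """Return the top-k (document name, cosine similarity) pairs."""
        q = self._vectorize(tokenize(query))
        scores = [
            # Both vectors are unit length, so the dot product is the cosine.
            (name, sum(w * vec.get(t, 0.0) for t, w in q.items()))
            for name, vec in zip(self.names, self.vectors)
        ]
        return sorted(scores, key=lambda s: s[1], reverse=True)[:k]


# Made-up placeholder documents, not the contents of knowledge_base/.
docs = {
    "runbook_engine_overtemp": "engine overtemp throttle down affected engine monitor temperature",
    "runbook_fuel_leak": "fuel leak isolate tank check propellant pressure and valve state",
    "report_sensor_dropout": "sensor dropout telemetry gap switch to redundant sensor channel",
}
retriever = TinyRetriever(docs)
hits = retriever.search("what do I do about an engine overtemp?", k=2)
print(hits[0][0])
```

Because every stored vector and the query vector are normalized up front, ranking reduces to a sparse dot product, which keeps the sketch dependency-free in the same spirit as the offline-first design described in this README.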

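The backend resolution order described for `LLMClient.auto()` (Anthropic, then OpenAI, then the deterministic offline fallback) might look roughly like the sketch below. It is only an assumption about the shape of that logic: the `resolve_backend` function is hypothetical, and the README states the "key set + package installed" condition only for Anthropic, so the `OPENAI_API_KEY` variable and the `openai` package check are guesses by analogy.

```python
import importlib.util
import os


def resolve_backend(env=None) -> str:
    """Pick a backend name in priority order (hypothetical helper)."""
    env = os.environ if env is None else env

    def available(key_var: str, package: str) -> bool:
        # A backend counts as usable only if its key is set to a non-empty
        # value AND its client package can be found on the import path.
        return bool(env.get(key_var)) and importlib.util.find_spec(package) is not None

    if available("ANTHROPIC_API_KEY", "anthropic"):
        return "anthropic"
    if available("OPENAI_API_KEY", "openai"):  # assumed variable and package
        return "openai"
    # No keys or no client libraries: fall back to the offline responder.
    return "offline"


# With an empty environment the sketch always selects the offline fallback.
print(resolve_backend({}))
```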
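Tags: AIOps, API integration, AV bypass, FastAPI, RAG, observability, anomaly detection, custom request headers, request interception, operations monitoring, reverse-engineering tools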