YanpengQi7/ai-reliability-copilot
# AI Reliability Copilot
The real story isn't the prompt. It's the **eval pipeline**: a 5-dimension rubric, a 5-scenario regression suite, and an LLM-as-judge that scores every change, so prompt iteration is measured rather than vibes-based.
**Live demo:** [ai-reliability-copilot.vercel.app](https://ai-reliability-copilot.vercel.app)
**📖 Usage guide (in Chinese):** [USAGE.md](./USAGE.md), an end-to-end walkthrough of how to actually use it
**Methodology deep-dive:** [EVALUATION.md](./EVALUATION.md)
**30-day build log:** [`notes/`](./notes/)
## Architecture
```
┌───────────────────┐      ┌───────────────────┐      ┌──────────────────┐
│  Browser (RSC)    │◀────▶│  Next.js 16 App   │◀────▶│ DeepSeek (AI SDK)│
│  experimental_    │      │  Router on Vercel │      │ generate/stream  │
│  useObject hook   │      │  (Fluid Compute)  │      │ Object           │
└───────────────────┘      └─────────┬─────────┘      └──────────────────┘
                                     │
                                     ▼
                           ┌───────────────────┐
                           │  Supabase (PG)    │
                           │  incidents /      │
                           │  analyses /       │
                           │  scenarios /      │
                           │  evaluations      │
                           └───────────────────┘
```
- **Next.js 16** App Router, RSC for read-heavy pages (incident list/detail, evals dashboard); client components only where needed (streaming form, copy buttons)
- **AI SDK** with `streamObject` + Zod schema for guaranteed structured output
- **DeepSeek** for both analyzer and judge (provider-swappable in one file)
- **Supabase Postgres** for persistence; service-role client server-side only
- **Vercel** auto-deploys on push to `main`
## The 9-section output schema
Enforced by Zod ([`src/lib/schema.ts`](./src/lib/schema.ts)):
1. **Summary** + severity badge with quantitative reasoning
2. **Severity** (SEV1/2/3) — apply rubric strictly
3. **Root cause hypotheses** (3–5, ranked by likelihood with cited evidence)
4. **Investigation checklist** (copy-pasteable commands, expected outputs)
5. **Mitigation plan** (with risk + mandatory rollback per step)
6. **Customer impact** (externally-facing)
7. **Postmortem draft** (markdown, all H2 sections in order)
8. **Follow-ups** (P0–P2, tied to owner roles)
9. **Severity reasoning** (citing the rubric rule applied)
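To make the shape concrete, here is a dependency-free sketch of the nine sections as TypeScript types, with a small checker for three of the rules listed above. The repository enforces its schema with Zod in `src/lib/schema.ts`; the field names and the `checkAnalysis` helper below are assumptions for illustration, not the project's actual schema.

```typescript
// Hypothetical field names; the real schema is the Zod definition in src/lib/schema.ts.
type Severity = "SEV1" | "SEV2" | "SEV3";
type Priority = "P0" | "P1" | "P2";

interface MitigationStep {
  action: string;
  risk: string;
  rollback: string; // mandatory for every step
}

interface IncidentAnalysis {
  summary: string;
  severity: Severity;
  rootCauseHypotheses: { hypothesis: string; evidence: string }[]; // 3-5, ranked
  investigationChecklist: { command: string; expectedOutput: string }[];
  mitigationPlan: MitigationStep[];
  customerImpact: string;
  postmortemDraft: string; // markdown
  followUps: { priority: Priority; ownerRole: string; task: string }[];
  severityReasoning: string;
}

// Returns a list of rule violations; an empty list means the object passes.
function checkAnalysis(a: IncidentAnalysis): string[] {
  const problems: string[] = [];
  const n = a.rootCauseHypotheses.length;
  if (n < 3 || n > 5) {
    problems.push(`expected 3-5 root cause hypotheses, got ${n}`);
  }
  a.mitigationPlan.forEach((step, i) => {
    if (step.rollback.trim() === "") {
      problems.push(`mitigation step ${i + 1} has no rollback`);
    }
  });
  if (a.severityReasoning.trim() === "") {
    problems.push("severity reasoning is empty");
  }
  return problems;
}
```

With Zod, the same constraints would typically be expressed declaratively (array length bounds, non-empty strings), so the model output is rejected at parse time instead of in a separate pass.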
## Prompt engineering, measured
Every prompt iteration is tracked against the same 5-scenario regression suite × 2 output languages (en/zh), scored by an LLM judge against a 5-dimension rubric.
| Dimension | What it measures |
|---|---|
| Specificity | Are commands/metrics/services concrete? |
| Safety | Is every mitigation reversible? Are destructive ops gated? |
| Actionability | Can on-call execute in <5 min without further research? |
| Domain correctness | Right SRE mechanism? No invented evidence? |
| Completeness | All 9 sections substantively filled? |
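How the five dimension scores roll up into a single overall number is not spelled out in this README. A minimal sketch, assuming each dimension is scored 1–5 and the overall score is their unweighted mean (the `RubricScores` type and `overallScore` function are hypothetical names):

```typescript
// Dimension keys mirror the rubric table; the camelCase spelling is an assumption.
const DIMENSIONS = [
  "specificity",
  "safety",
  "actionability",
  "domainCorrectness",
  "completeness",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type RubricScores = Record<Dimension, number>;

// Unweighted mean of the five dimension scores (assumed aggregation rule).
function overallScore(scores: RubricScores): number {
  for (const d of DIMENSIONS) {
    const s = scores[d];
    if (!Number.isFinite(s) || s < 1 || s > 5) {
      throw new RangeError(`${d} must be in [1, 5], got ${s}`);
    }
  }
  const total = DIMENSIONS.reduce((sum, d) => sum + scores[d], 0);
  return total / DIMENSIONS.length;
}
```

Rejecting out-of-range values up front keeps a malformed judge response from silently skewing the average.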
### Latest results (n=18, deepseek-chat for both analyzer and judge)
| | overall (1–5) |
|---|---|
| **Prompt v1** (rules-only) | **4.64** |
| **Prompt v2** (rules + anchors + few-shot) | **4.44** |
| **English output** | **4.60** |
| **Chinese output** | **4.47** |
**Surprising finding:** v2, which I wrote specifically to fix v1's known failure modes (vague commands, missing rollbacks, severity under-rating), scored *worse* on average (4.44 vs. 4.64). Likely cause: the additional constraints (mandatory rollback fields, required postmortem H2 list, etc.) over-narrow the model, so it produces shorter, more checklist-like responses that the judge marks down on completeness. This is exactly the kind of regression that only a measured rubric catches: to the eye, v2 output "looks more disciplined," but the judge disagrees. Investigating in v3.
**Cross-lingual finding:** Chinese output scored ~0.13 lower on average (4.47 vs. 4.60). The per-dimension breakdown attributes the loss to `actionability`: Chinese explanations are slightly more verbose, which buries commands in walls of prose. Code and commands themselves were correctly kept in English, so the prompt's `languageInstruction()` works.
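A per-dimension breakdown like the one described here amounts to comparing group means one dimension at a time. The sketch below shows one way to do that; the `EvalRow` shape and function names are assumptions, and it expects every row to carry a score for every dimension.

```typescript
type Scores = Record<string, number>;

// Hypothetical row shape for one judged analysis.
interface EvalRow {
  language: "en" | "zh";
  promptVersion: "v1" | "v2";
  scores: Scores;
}

// Mean score per dimension across a group of rows.
function meanByDimension(rows: EvalRow[]): Scores {
  const sums: Scores = {};
  for (const row of rows) {
    for (const [dim, value] of Object.entries(row.scores)) {
      sums[dim] = (sums[dim] ?? 0) + value;
    }
  }
  const means: Scores = {};
  for (const [dim, total] of Object.entries(sums)) {
    means[dim] = total / rows.length;
  }
  return means;
}

// Per-dimension difference (group A minus group B), largest gap first.
function dimensionDeltas(a: EvalRow[], b: EvalRow[]): [string, number][] {
  const meanA = meanByDimension(a);
  const meanB = meanByDimension(b);
  return Object.keys(meanA)
    .map((dim): [string, number] => [dim, (meanA[dim] ?? 0) - (meanB[dim] ?? 0)])
    .sort((x, y) => y[1] - x[1]);
}
```

Calling `dimensionDeltas(englishRows, chineseRows)` puts the dimension where English leads by the most at the front of the list, which is how a gap concentrated in a single dimension would surface.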
See [EVALUATION.md](./EVALUATION.md) for the full methodology, including limitations and roadmap.
## Scenario library
Five curated SRE scenarios cover common production failure modes:
| Scenario | Category |
|---|---|
| Payment-svc connection pool exhausted | Database |
| Order-svc OOM crashloop after deploy | Deploy |
| Stripe API timeout cascading into checkout outage | Dependency |
| Regional 5xx after DNS misconfiguration | Network |
| Black Friday cache stampede | Capacity |
Each has enough context (metrics, logs, deploy history, on-call notes) to differentiate prompt versions. Browse them at `/scenarios`.
## Use it from your own Claude Code (MCP server mode)
This project ships as **both a web app and an MCP server**. Power users can add the MCP endpoint to their local Claude Code and drive analysis with their own Claude subscription: the platform pays $0 in LLM costs, and the user gets Claude Opus quality.
```bash
claude mcp add --transport http ai-reliability https://ai-reliability-copilot.vercel.app/api/mcp
```
7 tools exposed: `search_kb`, `find_similar_incidents`, `list_scenarios`, `get_scenario`, `parse_alert_json`, `get_output_schema`, `save_incident_analysis`. See [USAGE.md](./USAGE.md) workflow D-bis for the full pattern.
## Knowledge base (internal RAG)
Make the AI understand **your company**: drop your runbooks, postmortems, and service catalog into `sample-kb/` (or any directory), then `npm run kb:ingest`. Every subsequent analysis automatically retrieves the top-5 most relevant chunks and injects them into the prompt as `# Internal context`, so the LLM grounds its answer in *your* systems instead of generic SRE advice.
- **Storage:** `kb_documents` (one row per file, dedupe by content hash) + `kb_chunks` (paragraph-aware chunks ~1500 chars with 150-char overlap)
- **Embeddings:** OpenAI `text-embedding-3-small` (1536-dim) when `OPENAI_API_KEY` is set; **falls back to pg_trgm** otherwise
- **Audit trail:** `analysis_kb_chunks` records which chunks fed which analysis with their similarity scores. Detail page shows "📚 Internal docs used by the AI" with bracket-numbered citations matching what was in the prompt.
- **CLI:** `npm run kb:ingest -- ./docs/runbooks` (idempotent via SHA256 content hash, skip-if-unchanged)
- **Sample docs** in `sample-kb/` show what's expected — replace with yours.
The signature for similarity retrieval is the chunk text itself. Service-catalog snippets, runbook playbook steps, and past postmortems all index correctly.
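The ingestion steps described in this section (content-hash dedupe, then paragraph-aware chunking with overlap) can be sketched as follows. This is an illustration under assumptions, not the project's ingest code: `contentHash` and `chunkText` are hypothetical names, paragraphs are taken to be separated by blank lines, and a single paragraph longer than the limit is kept whole instead of being split.

```typescript
import { createHash } from "node:crypto";

// SHA-256 hex digest of a document's content; an unchanged hash means the file can be skipped.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Packs whole paragraphs into chunks of roughly maxChars, carrying the last
// `overlap` characters of each finished chunk into the start of the next one.
function chunkText(text: string, maxChars = 1500, overlap = 150): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    const candidate = current === "" ? para : `${current}\n\n${para}`;
    if (current === "" || candidate.length <= maxChars) {
      current = candidate; // still fits, or this is the first paragraph of a chunk
      continue;
    }
    chunks.push(current);
    const tail = overlap > 0 ? current.slice(-overlap) : "";
    current = tail === "" ? para : `${tail}\n\n${para}`;
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Splitting on paragraph boundaries keeps runbook steps and log excerpts intact inside a chunk, and the overlap gives retrieval some shared context across chunk borders.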
## Similar-incident search
Every incident gets a **signature** (concatenation of title + service + symptoms + summary + severity) and, when `OPENAI_API_KEY` is configured, a **1536-dim embedding**. The detail page shows up to 5 past incidents ranked by similarity.
Two backends, chosen at runtime:
- **`pgvector` + HNSW + cosine distance:** semantic match (preferred). Embeddings come from `text-embedding-3-small` ($0.02/M tokens). Returns matches where `1 - cosine_distance > 0.4`, i.e. cosine similarity above 0.4.
- **`pg_trgm`:** lexical fallback when no embedding provider is configured. Returns matches with trigram similarity > 0.15.
The choice is automatic and shown in the UI (`semantic match (pgvector)` vs `lexical match (pg_trgm)`). Migration to OpenAI later is one env var away; existing rows backfill via `npm run backfill:similar`.
The signature deliberately excludes `raw_context` — logs and timestamps dominate that field and produce noisy matches.
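In the project the semantic ranking runs inside Postgres via `pgvector`. The in-memory sketch below only illustrates the logic: build a signature from the listed fields, then keep the top matches whose cosine similarity exceeds the threshold. The type names, function names, and the `" | "` separator are assumptions.

```typescript
interface IncidentFields {
  title: string;
  service: string;
  symptoms: string;
  summary: string;
  severity: string;
}

// Concatenates the signature fields; raw_context is deliberately not included.
function buildSignature(i: IncidentFields): string {
  return [i.title, i.service, i.symptoms, i.summary, i.severity]
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .join(" | ");
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let k = 0; k < a.length; k++) {
    dot += a[k] * b[k];
    normA += a[k] * a[k];
    normB += b[k] * b[k];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors match nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Candidate {
  id: string;
  embedding: number[];
}

// Keeps candidates with similarity above the threshold, best first, at most `limit`.
function findSimilar(
  query: number[],
  candidates: Candidate[],
  threshold = 0.4,
  limit = 5,
): { id: string; similarity: number }[] {
  return candidates
    .map((c) => ({ id: c.id, similarity: cosineSimilarity(query, c.embedding) }))
    .filter((m) => m.similarity > threshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, limit);
}
```

Since cosine similarity equals `1 - cosine_distance`, the default threshold here corresponds to the 0.4 cutoff quoted above.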
## Run locally
```bash
git clone https://github.com/YanpengQi7/ai-reliability-copilot
cd ai-reliability-copilot
npm install

# env: create .env.local with
# DEEPSEEK_API_KEY=
# NEXT_PUBLIC_SUPABASE_URL=
# NEXT_PUBLIC_SUPABASE_ANON_KEY=
# SUPABASE_SERVICE_ROLE_KEY=

# DB: in Supabase SQL editor, run supabase/schema.sql

# seed the scenario library
npm run seed:scenarios

# dev
npm run dev # → http://localhost:3000

# run the eval batch (writes to your Supabase)
npm run evals:run
```
## Known limitations
- **In-memory rate limiter** (`src/lib/rateLimit.ts`) — resets on cold start. Production swap: Upstash Redis.
- **Judge ≠ ground truth** — same model family judges the analyzer. ~10–20% optimistic bias likely. Mitigation: periodic human review (see EVALUATION.md).
- **No per-scenario repeats** — single-shot evaluation. Doesn't capture run-to-run variance.
- **5 scenarios is narrow** — real production has long tails.
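To show why an in-memory limiter resets on cold start, here is a minimal fixed-window sketch. It is not the contents of `src/lib/rateLimit.ts`; the class name, the fixed-window strategy, and the parameters are assumptions. All state lives in a `Map` inside the process, so a fresh serverless instance starts with empty counters.

```typescript
// Fixed-window limiter: at most `limit` calls per key within each `windowMs` window.
class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is injectable so the behaviour can be checked without waiting.
  allow(key: string, now: number = Date.now()): boolean {
    const w = this.windows.get(key);
    if (w === undefined || now - w.start >= this.windowMs) {
      this.windows.set(key, { start: now, count: 1 }); // open a new window
      return true;
    }
    if (w.count >= this.limit) return false;
    w.count += 1;
    return true;
  }
}
```

A shared store such as Redis moves these counters out of the process, which is what makes the limit hold across instances and restarts.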
## Tech stack
- Next.js 16 (App Router, RSC, Turbopack), TypeScript, Tailwind v4
- Vercel AI SDK 6 (`streamObject`, `generateObject`, `experimental_useObject`)
- DeepSeek API (analyzer + judge)
- Supabase (Postgres, service-role client on server)
- Zod (schemas everywhere — DB inserts, LLM outputs, API inputs)
- react-markdown + `@tailwindcss/typography` for postmortem rendering
## License
MIT
Built in 30 days as a side project to learn AI engineering and evaluation methodology. The full daily build log lives in [`notes/`](./notes/).