# AI Incident Response System
## Table of Contents
1. [Motivation](#1-motivation)
2. [Architecture](#2-architecture)
3. [The 5 Agents](#3-the-5-agents)
4. [Technical Stack](#4-technical-stack)
5. [Architecture Decision Records (ADRs)](#5-architecture-decision-records-adrs)
6. [Evaluation Results](#6-evaluation-results)
7. [Repository Structure](#7-repository-structure)
8. [Setup](#8-setup)
9. [Usage](#9-usage)
10. [Observability](#10-observability)
11. [Roadmap](#11-roadmap)
## 1. Motivation
SRE teams spend an average of 40-60 minutes identifying the root cause of a P1 incident. During that time, a production system can generate losses of tens of thousands of euros and affect thousands of users. The manual process has three structural problems:
**Context fragmentation.** Logs are in Loki, metrics in Prometheus, recent commits in GitHub, and pod status in Kubernetes. The on-call SRE must correlate these sources manually under pressure.
**Non-persistent knowledge.** Runbooks and historical postmortems exist but are not consulted systematically. Each incident is resolved from scratch without leveraging accumulated knowledge.
**High-risk decisions without full context.** The engineer approving a rollback at 3am does not always have access to the complete diagnosis that led to that recommendation.
This system addresses all three: it collects context in parallel, automatically queries the historical knowledge base, and presents high-risk decisions with all the context needed for an informed approval.
## 2. Architecture
### Full flow
```
Prometheus/PagerDuty Alert
         |
         v
+-----------------+
|     Agent 1     |  Groq Llama 3.3 70B -- P1/P2/P3 classification
|    Monitor &    |  Target latency: <500ms
|     Triage      |
+--------+--------+
         |  [P1/P2: escalate] ----------- [P3 trivial: auto-resolve]
         v
+-----------------+
|     Agent 2     |  asyncio.gather -- parallel collection
|      Data       |  Loki (logs) + Prometheus (metrics) +
|    Collector    |  GitHub API (commits) + K8s API (pods)
+--------+--------+
         |
         v
+-----------------+
|     Agent 3     |  Groq Llama 3.3 70B (dev) / Claude Sonnet (prod)
|   Diagnostic    |  RAG over runbooks + historical postmortems
|    Reasoner ----+--[low confidence]--> Agent 2 (retry, wider window)
|     (Core)      |  Chain-of-thought + Pydantic structured output
+--------+--------+
         |  [confidence >= threshold]
         v
+-----------------+
|     Agent 4     |  Classifies actions by risk (LOW/HIGH)
|   Remediation   |  LOW: auto-executable
|     Planner     |  HIGH: generates HITLRequest for Slack approval
+--------+--------+
         |
    +----+----+
    |         |
    v         v
 [HIGH]   [LOW / auto]
  HITL        Auto
  Slack --> execution
 approval     |
    +----+----+
         |
         v
+-----------------+
|     Agent 5     |  Generates structured postmortem
|   Postmortem    |  Ingests into ChromaDB <-- Learning loop
|     Writer      |
+-----------------+
```
### LangGraph graph properties
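**Persistent state.** `IncidentState` (a TypedDict) is checkpointed at each node. With a persistent checkpointer (`SqliteSaver` in production), a run interrupted mid-incident resumes from the last completed node. The in-memory `MemorySaver` used in development does not survive a process restart.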
**Native cyclic loop.** If the Diagnostic Reasoner determines it needs more data (`requires_more_data=True`), the graph automatically returns to the Data Collector with an expanded time window. Maximum 2 retries to prevent infinite loops.
**Conditional edges.** Three explicit decision points in the graph: escalate or auto-resolve (after triage), re-diagnose or plan (after diagnosis), HITL or execute (after planning).
**Error recovery.** Dedicated `error` node that captures exceptions, logs them to LangSmith, and prevents the graph from entering an inconsistent state.
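The three decision points can be sketched as plain routing functions. This is a minimal illustration, not the code in `graph/workflow.py`: the state keys `should_escalate`, `retries_used`, and `has_high_risk_action`, as well as the returned node names, are assumptions made for the example. In LangGraph, functions of this shape are what `StateGraph.add_conditional_edges` expects as its routing callable.

```python
from typing import TypedDict

MAX_RETRIES = 2  # mirrors the "maximum 2 retries" rule described above


class RoutingState(TypedDict, total=False):
    # Hypothetical subset of IncidentState, used only for routing decisions.
    should_escalate: bool
    requires_more_data: bool
    retries_used: int
    has_high_risk_action: bool


def route_after_triage(state: RoutingState) -> str:
    """Escalate P1/P2 incidents to the full graph, auto-resolve trivial P3s."""
    return "data_collector" if state.get("should_escalate") else "auto_resolve"


def route_after_diagnosis(state: RoutingState) -> str:
    """Loop back for more data only while the retry budget is not exhausted."""
    wants_retry = state.get("requires_more_data", False)
    if wants_retry and state.get("retries_used", 0) < MAX_RETRIES:
        return "data_collector"
    return "remediation_planner"


def route_after_planning(state: RoutingState) -> str:
    """Send plans containing HIGH risk actions to human approval."""
    return "hitl" if state.get("has_high_risk_action") else "execute"
```

Keeping the retry cap inside the routing function means the loop terminates even if the model keeps reporting low confidence.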
## 3. The 5 Agents
### Agent 1 -- Monitor & Triage (`agents/monitor_triage.py`)
**Role:** System entry point. First contact with the alert.
**Model:** Groq Llama 3.3 70B -- chosen for latency (<500ms), not reasoning capability.
**Responsibilities:** receives `IncidentAlert` from Prometheus or PagerDuty, classifies severity P1/P2/P3, extracts affected components and time window, decides whether to escalate to the full graph.
**Output:** `IncidentReport` -- severity, affected components, escalation flag, classification reasoning.
### Agent 2 -- Data Collector (`agents/data_collector.py`)
**Role:** Investigator. Collects full context in parallel.
**Model:** No LLM -- direct tool-calling to APIs only.
**Responsibilities:** Loki logs (last N hours), Prometheus metrics (error rate, p99 latency, CPU, memory), GitHub commits and PRs, Kubernetes pod status.
**Parallelism:** `asyncio.gather` -- all 4 sources queried simultaneously. All tools have automatic mock fallback when URL/token is not configured.
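A minimal sketch of the parallel collection pattern with mock fallback, using only the standard library. The helper `fetch_source` and the environment variable names are hypothetical; the real tools live under `tools/` and may be structured differently.

```python
import asyncio
import os

# Hypothetical mapping of source name -> environment variable that configures it.
SOURCES = {
    "loki": "LOKI_URL",
    "prometheus": "PROMETHEUS_URL",
    "github": "GITHUB_TOKEN",
    "kubernetes": "K8S_API_URL",
}


async def fetch_source(name: str, env_var: str) -> dict:
    """Return mock data when the source is not configured."""
    if not os.environ.get(env_var):
        return {"source": name, "mock": True, "data": []}
    await asyncio.sleep(0)  # placeholder for the real HTTP call
    return {"source": name, "mock": False, "data": []}


async def collect_context() -> dict:
    """Query all sources concurrently; one failing source does not abort the rest."""
    results = await asyncio.gather(
        *(fetch_source(name, env_var) for name, env_var in SOURCES.items()),
        return_exceptions=True,
    )
    context = {}
    for name, result in zip(SOURCES, results):
        if isinstance(result, Exception):
            context[name] = {"source": name, "error": str(result)}
        else:
            context[name] = result
    return context


context = asyncio.run(collect_context())
```

`asyncio.gather` returns results in the order the awaitables were passed, which is why zipping against `SOURCES` is safe here.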
### Agent 3 -- Diagnostic Reasoner (`agents/diagnostic_reasoner.py`)
**Role:** Core of the project. The most complex agent in the system.
**Model:** Groq Llama 3.3 70B (dev) / Claude Sonnet (prod).
**Responsibilities:** reasons over the full collected context, queries the RAG knowledge base of runbooks and historical postmortems, generates root cause hypotheses ordered by probability with evidence, sets `requires_more_data=True` if confidence is low.
**Output:** `DiagnosisResult` with hypotheses, `overall_confidence`, full `reasoning_chain`.
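To illustrate the shape of this output, the sketch below uses plain dataclasses as a dependency-free stand-in for the Pydantic v2 models in `schemas/diagnosis.py`. The field names `probability` and `evidence`, and the threshold value of 0.7, are assumptions; the README does not state them.

```python
from dataclasses import dataclass, field

CONFIDENCE_THRESHOLD = 0.7  # assumed value, not taken from the project


@dataclass
class RootCauseHypothesis:
    hypothesis: str
    probability: float
    evidence: list[str] = field(default_factory=list)


@dataclass
class DiagnosisResult:
    hypotheses: list[RootCauseHypothesis]
    overall_confidence: float
    reasoning_chain: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Keep hypotheses ordered from most to least probable.
        self.hypotheses.sort(key=lambda h: h.probability, reverse=True)

    @property
    def top_hypothesis(self) -> RootCauseHypothesis:
        return self.hypotheses[0]

    @property
    def requires_more_data(self) -> bool:
        # Low confidence triggers the loop back to the Data Collector.
        return self.overall_confidence < CONFIDENCE_THRESHOLD


diagnosis = DiagnosisResult(
    hypotheses=[
        RootCauseHypothesis("Slow N+1 query", 0.2),
        RootCauseHypothesis("DB connection pool exhaustion", 0.7),
    ],
    overall_confidence=0.55,
)
```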
### Agent 4 -- Remediation Planner (`agents/remediation_planner.py`)
**Role:** Translates the diagnosis into concrete actions.
**Model:** Groq Llama 3.3 70B (dev) / Claude Sonnet (prod).
**Permission matrix:**
- **LOW (auto-executable):** restart pod, clear cache, reload config, scale replicas
- **HIGH (requires approval):** rollback deployment, delete data, modify firewall
**Output:** `RemediationPlan` with classified actions + `HITLRequest` for high-risk actions.
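The permission matrix above can be sketched as an allow-list lookup. The action identifiers and the choice to treat unknown actions as HIGH are assumptions made for this example, not confirmed behavior of `agents/remediation_planner.py`.

```python
from enum import Enum


class Risk(str, Enum):
    LOW = "LOW"
    HIGH = "HIGH"


# Hypothetical identifiers for the LOW risk actions listed in the matrix.
LOW_RISK_ACTIONS = {"restart_pod", "clear_cache", "reload_config", "scale_replicas"}


def classify_action(action_type: str) -> Risk:
    """Anything not explicitly allow-listed is treated as HIGH (assumed fail-safe default)."""
    return Risk.LOW if action_type in LOW_RISK_ACTIONS else Risk.HIGH


def split_plan(actions: list[str]) -> tuple[list[str], list[str]]:
    """Separate auto-executable actions from those that need Slack approval."""
    auto = [a for a in actions if classify_action(a) is Risk.LOW]
    needs_approval = [a for a in actions if classify_action(a) is Risk.HIGH]
    return auto, needs_approval


auto, needs_approval = split_plan(["restart_pod", "rollback_deployment", "clear_cache"])
```

An allow-list for LOW rather than a deny-list for HIGH means a newly introduced action type cannot be auto-executed by accident.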
### Agent 5 -- Postmortem Writer (`agents/postmortem_writer.py`)
**Role:** Closes the loop. Generates the postmortem and feeds back into the system.
**Model:** Groq Llama 3.3 70B (dev) / Claude Sonnet (prod).
**Key differentiator:** ingests the generated postmortem into ChromaDB so future diagnoses benefit from accumulated knowledge. Closed learning loop.
**Output:** `PostmortemDraft` + automatic ingestion into the RAG knowledge base.
## 4. Technical Stack
### Agent orchestration
| Component | Technology | Decision |
|---|---|---|
| Agent framework | LangGraph | Cyclic state, checkpointing, conditional edges |
| Graph state | Typed TypedDict | Static typing, mypy compatible |
| Checkpointing dev | MemorySaver | In-memory, no dependencies |
| Checkpointing prod | SqliteSaver | Persistence across restarts |
### LLM Models
| Agent | Model (dev) | Model (prod) | Criterion |
|---|---|---|---|
| Agent 1 (Triage) | Groq Llama 3.3 70B | Groq Llama 3.3 70B | Latency <500ms |
| Agents 3/4/5 | Groq Llama 3.3 70B | Claude Sonnet | Deep reasoning |
| RAG contextualization | Groq Llama 3.3 70B | Claude Haiku | Minimum cost per chunk |
| Global fallback | GPT-4o | GPT-4o | Resilience |
### RAG -- Knowledge Base
| Component | Technology | Decision |
|---|---|---|
| Vector store | ChromaDB | Local persistence, easy setup |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) | Open source, no per-call cost |
| Chunking | RecursiveCharacterTextSplitter | 512 tokens, 50 overlap |
| Contextualization | Contextual Retrieval (Anthropic) | +50-100 context tokens per chunk |
| Reranking | Cohere Rerank v3 | Improved retrieval precision (optional) |
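The chunking and contextualization rows can be illustrated with a dependency-free sketch. It approximates tokens with whitespace-separated words and replaces the LLM call with a stub, so it shows the mechanics only. The function names are hypothetical and this is not the `RecursiveCharacterTextSplitter` logic used in `rag/ingestion.py`.

```python
from typing import Callable


def split_with_overlap(
    words: list[str], chunk_size: int = 512, overlap: int = 50
) -> list[list[str]]:
    """Split a word list into windows where each window repeats the last `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the document
    return chunks


def contextualize(chunk: list[str], describe: Callable[[str], str]) -> str:
    """Prepend a short generated context to the chunk text before embedding."""
    text = " ".join(chunk)
    return f"{describe(text)}\n\n{text}"


words = [f"w{i}" for i in range(1000)]
chunks = split_with_overlap(words)

# Stub standing in for the LLM that writes 50-100 tokens of context per chunk.
embedded_inputs = [
    contextualize(c, lambda _text: "From runbook: DB connection pool exhaustion.")
    for c in chunks
]
```

The string that gets embedded is the context plus the chunk, so a chunk that only says "restart the service" still carries which runbook it came from.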
### Schemas and validation
All implemented with **Pydantic v2** -- strict validation, native JSON Schema, retry logic on parsing errors: `IncidentAlert`, `IncidentReport`, `DiagnosisResult`, `RemediationPlan`, `HITLRequest`, `PostmortemDraft`.
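A rough sketch of what retry-on-parse-error can look like, using `json` and a hand-written validator so it runs without dependencies. With Pydantic v2 the validation step would typically be `Model.model_validate_json(raw)`, which raises `pydantic.ValidationError`. The function `parse_with_retry` and its parameters are hypothetical.

```python
import json
from typing import Any, Callable


def parse_with_retry(
    call_llm: Callable[[str], str],
    validate: Callable[[Any], Any],
    max_attempts: int = 3,
) -> Any:
    """Call the model until its output parses and validates, feeding errors back."""
    feedback = ""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = call_llm(feedback)
        try:
            return validate(json.loads(raw))
        except (ValueError, KeyError) as exc:  # json.JSONDecodeError is a ValueError
            last_error = exc
            feedback = f"Previous output was invalid ({exc}). Return valid JSON only."
    raise ValueError(f"no valid output after {max_attempts} attempts") from last_error


def validate_report(payload: dict) -> dict:
    if payload["severity"] not in {"P1", "P2", "P3"}:
        raise ValueError("severity must be P1, P2 or P3")
    return payload


# Fake model: the first answer is malformed, the second is valid.
responses = iter(["not json", '{"severity": "P2"}'])
report = parse_with_retry(lambda _feedback: next(responses), validate_report)
```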
### Infrastructure
| Component | Technology |
|---|---|
| API backend | FastAPI + WebSockets |
| Containerization | Docker + docker-compose |
| CI/CD | GitHub Actions |
| Observability | LangSmith |
| HITL | Slack bot with interactive buttons |
## 5. Architecture Decision Records (ADRs)
### ADR-001 -- LangGraph over crewAI
**Context:** LangGraph, crewAI and AutoGen were evaluated as orchestration frameworks.
**Decision:** LangGraph.
**Trade-off:** More verbose code and steeper learning curve than crewAI. Acceptable in exchange for granular flow control.
### ADR-002 -- Three models with task-specific criteria
**Context:** A single model could be used for all agents.
**Decision:** Different models per task type.
**Reasons:** Agent 1 needs latency below 500ms -- Groq with Llama 3.3 70B responds in ~300ms. Agents 3, 4 and 5 need deep reasoning and reliable structured outputs -- Claude Sonnet in production. Claude Haiku for RAG chunk contextualization where hundreds of small calls are generated. GPT-4o as global fallback.
**Trade-off:** Higher operational complexity (multiple API keys). Offset by cost-latency-quality optimization per task.
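One way to picture the global fallback is an ordered provider chain. The sketch below uses stub callables instead of real SDK clients, and `call_with_fallback` is a hypothetical helper, not the project's implementation.

```python
from typing import Callable


def call_with_fallback(
    prompt: str, providers: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider failure moves on to the next one
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def unavailable(_prompt: str) -> str:
    raise TimeoutError("provider timed out")


# Stubs standing in for the primary model and the GPT-4o fallback.
providers = [
    ("primary", unavailable),
    ("fallback", lambda prompt: f"diagnosis for: {prompt}"),
]
used, answer = call_with_fallback("p99 latency spike", providers)
```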
### ADR-003 -- Contextual Retrieval over classic RAG
**Context:** Classic RAG (chunk -> embedding -> retrieval) was initially implemented.
**Decision:** Contextual Retrieval (Anthropic, September 2024).
**Reasons:** Operational runbooks lose meaning when split into 512-token chunks. Contextual Retrieval adds 50-100 LLM-generated context tokens to each chunk before embedding. According to Anthropic's published benchmarks, this reduces retrieval failures by up to 67% vs classic RAG when combined with BM25 and reranking (about 35% with contextual embeddings alone).
**Trade-off:** Additional ingestion cost (one-time per document, not per retrieval). Offset by improved diagnostic precision.
### ADR-004 -- HITL by action risk, not incident severity
**Context:** Whether to implement HITL for all incidents or only some.
**Decision:** Permission matrix by action type, independent of incident severity.
**Reasons:** Severity describes incident impact. Action risk describes remediation impact. They are orthogonal dimensions. A reversible action (restart pod) is safe to auto-execute even in a P1. A destructive action (rollback deployment) requires human approval even in a P2.
**Trade-off:** Agent 4's LOW/HIGH classification may be wrong in edge cases. The immutable audit log of all decisions allows identifying and correcting these cases.
### ADR-005 -- Custom evaluation over RAGAS
**Context:** RAGAS is the standard framework for RAG system evaluation.
**Decision:** Direct evaluation against ground truth with LangSmith and custom metrics.
**Reasons:** RAGAS has dependency conflicts with LangGraph 0.2+ in Python 3.11 and adds LLM cost per evaluation. Direct evaluation against a dataset of 8 historical incidents with real root causes is more relevant than abstract faithfulness metrics.
## 6. Evaluation Results
Evaluation over **8 historical incidents** with real ground truth (root cause, severity, correct actions). Model: Groq Llama 3.3 70B (development).
| Metric | Result | Production target |
|---|---|---|
| Severity accuracy | 62% | >85% |
| Top-1 diagnostic accuracy | 38% | >70% |
| Top-3 diagnostic accuracy | 62% | >90% |
| Avg keyword score | 23% | >50% |
| Avg confidence | 84% | -- |
| HITL trigger rate | 100% | -- |
| Postmortem rate | 100% | -- |
| Avg diagnosis attempts | 1.0 | -- |
| Time-to-diagnose (avg) | ~7s | <30s |
**Analysis:** The system correctly diagnoses incidents with clear signals in logs and commits (DB connection pool, N+1 queries, Elasticsearch). It struggles with infrastructure incidents without code signals (expired SSL, full disk). These results cover only the Groq development configuration. The gap to the production targets is expected to narrow with Claude Sonnet, which reasons better over ambiguous signals, but that configuration is not part of this evaluation.
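With only 8 incidents, each accuracy metric moves in steps of 12.5 points, so the reported 38% and 62% correspond to 3 of 8 and 5 of 8. The sketch below shows how a top-k accuracy can be computed. The rank list is illustrative, chosen only to reproduce those counts, and is not the actual per-incident data in `evals/results_latest.json`.

```python
from typing import Optional


def top_k_accuracy(ranks: list[Optional[int]], k: int) -> float:
    """ranks holds the 1-based position of the true root cause per incident, or None if absent."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)


# Illustrative ranks for 8 incidents (not the real evaluation output).
ranks = [1, 1, 1, 2, 3, None, None, None]

top1 = top_k_accuracy(ranks, k=1)  # 3 of 8 incidents
top3 = top_k_accuracy(ranks, k=3)  # 5 of 8 incidents
```

On a dataset this small, a single incident flipping changes a metric by 12.5 points, which is worth keeping in mind when comparing runs.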
## 7. Repository Structure
```
ai-incident-response/
|-- agents/                          # The 5 LangGraph agents
|   |-- monitor_triage.py            # Agent 1: P1/P2/P3 classification with Groq
|   |-- data_collector.py            # Agent 2: parallel collection with asyncio
|   |-- diagnostic_reasoner.py       # Agent 3: RAG + chain-of-thought (core)
|   |-- remediation_planner.py       # Agent 4: remediation plan + HITL trigger
|   `-- postmortem_writer.py         # Agent 5: postmortem + RAG ingestion
|
|-- graph/                           # LangGraph orchestration
|   |-- state.py                     # IncidentState -- typed TypedDict
|   |-- workflow.py                  # StateGraph + conditional edges
|   `-- checkpointer.py              # MemorySaver (dev) / SqliteSaver (prod)
|
|-- tools/                           # Tool-calling to external APIs
|   |-- prometheus.py                # Historical metrics (service-specific mock)
|   |-- loki.py                      # Logs for last N hours (service-specific mock)
|   |-- github_api.py                # Recent commits and PRs (service-specific mock)
|   |-- kubernetes_api.py            # Pod status (with mock)
|   `-- slack_hitl.py                # Slack bot HITL with Approve/Reject buttons
|
|-- rag/                             # RAG pipeline with Contextual Retrieval
|   |-- ingestion.py                 # Chunking + contextualization + embedding
|   |-- retriever.py                 # Dense search + optional Cohere reranker
|   |-- chroma_store.py              # ChromaDB singleton
|   `-- seed_runbooks.py             # 5 example runbooks for initial ingestion
|
|-- schemas/                         # Pydantic v2 -- typed structured outputs
|   |-- incident.py                  # IncidentAlert + IncidentReport
|   |-- diagnosis.py                 # DiagnosisResult + RootCauseHypothesis
|   |-- remediation.py               # RemediationPlan + HITLRequest
|   `-- postmortem.py                # PostmortemDraft + to_rag_document()
|
|-- evals/                           # Evaluation with LangSmith + ground truth
|   |-- datasets/
|   |   `-- historical_incidents.json  # 8 incidents with real root causes
|   |-- run_evals.py                 # Evaluation runner
|   |-- metrics.py                   # Top-1/3 accuracy, keyword score, severity accuracy
|   `-- results_latest.json          # Latest evaluation results
|
|-- api/                             # FastAPI backend
|   `-- main.py                      # REST + WebSocket + /slack/actions callback
|
|-- frontend/                        # React + Vite dashboard
|   `-- src/App.jsx                  # Live pipeline log, HITL queue, eval metrics
|
|-- infra/                           # Infrastructure
|   |-- Dockerfile
|   `-- docker-compose.yml
|
|-- .github/workflows/
|   `-- ci.yml                       # Tests + lint + Docker build on each push
|
|-- docs/                            # Architecture Decision Records
|   |-- ADR-001-langgraph.md
|   |-- ADR-002-models.md
|   |-- ADR-003-contextual-retrieval.md
|   |-- ADR-004-hitl.md
|   `-- ADR-005-evaluation.md
|
|-- tests/                           # 71 tests -- schemas, routing, tools, HITL, API, integration
|   |-- test_schemas.py
|   |-- test_state.py
|   |-- test_tools_mock.py
|   |-- test_hitl.py
|   `-- test_data_collector.py
|
|-- demo.py                          # Single-command demo script
|-- run_incident.py                  # End-to-end test script
|-- pyproject.toml
|-- .env.example
`-- README.md
```
## 8. Setup
### Requirements
- Python 3.11+
- Docker Desktop (for containerized execution)
- Node.js 18+ (for frontend dashboard)
### Installation
```bash
# 1. Clone the repository
git clone https://github.com/Dafomu96/ai-incident-response.git
cd ai-incident-response

# 2. Virtual environment and dependencies
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pip install langchain-text-splitters python-multipart

# 3. Environment variables
cp .env.example .env
# Edit .env -- minimum: GROQ_API_KEY

# 4. Seed runbooks into ChromaDB
python -m rag.seed_runbooks

# 5. Run the demo
python demo.py
```
### Minimum environment variables for development
```bash
GROQ_API_KEY=gsk_...                 # Required -- all agents
LANGSMITH_API_KEY=lsv2_pt_...        # Recommended -- observability
LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://eu.api.smith.langchain.com
LANGCHAIN_PROJECT=ai-incident-response
SLACK_BOT_TOKEN=xoxb-...             # Optional -- real HITL
SLACK_HITL_CHANNEL=#incident-approvals
```
### Docker
```bash
docker-compose -f infra/docker-compose.yml up --build
# API available at http://localhost:8000
```
### Frontend dashboard
```bash
cd frontend
npm install
npm run dev
# Dashboard at http://localhost:3000
```
## 9. Usage
### Single-command demo
```bash
python demo.py
```
Runs the full pipeline with a sample P2 incident, shows all 5 agents executing, sends HITL to Slack if applicable, and displays results with colored output.
### End-to-end test
```python
from schemas.incident import IncidentAlert
from graph.workflow import compile_graph

alert = IncidentAlert(
    alert_id="inc-001",
    service="payment-service",
    metric="http_request_duration_seconds_p99",
    value=2.34,
    threshold=0.5,
    description="P99 latency spike -- possible DB connection pool exhaustion",
    labels={"env": "production", "region": "eu-west-1"},
)

graph = compile_graph()
result = graph.invoke(
    {"alert": alert, "diagnosis_attempts": 0, "resolved": False, "messages": []},
    config={"configurable": {"thread_id": alert.alert_id}},
)

print(result["incident_report"].severity)              # P2
print(result["diagnosis"].top_hypothesis.hypothesis)   # root cause
print(result["diagnosis"].overall_confidence)          # 0.90
print(result["remediation_plan"].requires_approval)    # HIGH risk actions
```
### REST API
```bash
# Trigger incident
curl -X POST http://localhost:8000/incident \
  -H "Content-Type: application/json" \
  -d '{"alert_id": "inc-001", "service": "payment-service",
       "metric": "error_rate", "value": 0.45, "threshold": 0.05,
       "description": "Critical error rate spike"}'

# Get incident status
curl http://localhost:8000/incident/inc-001

# Health check
curl http://localhost:8000/health
```
### Evaluation
```bash
# Evaluate single incident
python -m evals.run_evals --incident hist-001

# Evaluate all 8 incidents
python -m evals.run_evals
# Results saved to evals/results_latest.json
```
### Tests
```bash
pytest tests/ -v        # 71 tests
pytest tests/ --cov=.   # with coverage
```
## 10. Observability
### LangSmith
Every graph execution generates a full trace with input/output per node, token usage per agent, latency, and errors. Set `LANGCHAIN_TRACING_V2=true` and `LANGSMITH_API_KEY` to enable.
Typical trace metrics:
- Total: ~5.5s, ~5.1K tokens
- monitor_triage: 0.82s, 643 tokens
- diagnostic_reasoner: 1.77s, 2.3K tokens
- remediation_planner: 1.02s, 879 tokens
- postmortem_writer: 1.51s, 1.3K tokens
### HITL -- Slack bot
When Agent 4 generates a HIGH risk action, the bot sends to `#incident-approvals`:
- Action description and risk level
- Diagnosis summary with confidence score
- Exact command to execute
- **Approve** / **Reject** buttons
- Auto-escalation after 10 minutes with no response
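The approval timeout can be sketched with `asyncio.wait_for`. This is only an illustration of the escalation rule listed above: `wait_for_decision` is a hypothetical helper, and the real bot in `tools/slack_hitl.py` receives decisions through the `/slack/actions` callback rather than a local future.

```python
import asyncio

APPROVAL_TIMEOUT_SECONDS = 10 * 60  # auto-escalate after 10 minutes without a response


async def wait_for_decision(
    decision: asyncio.Future, timeout: float = APPROVAL_TIMEOUT_SECONDS
) -> str:
    """Return the reviewer's decision, or "escalated" if nobody answers in time."""
    try:
        return await asyncio.wait_for(decision, timeout=timeout)
    except asyncio.TimeoutError:
        return "escalated"


async def demo() -> tuple[str, str]:
    loop = asyncio.get_running_loop()

    answered = loop.create_future()
    answered.set_result("approved")  # simulates a click on the Approve button

    unanswered = loop.create_future()  # nobody responds to this request

    # A short timeout keeps the demo fast; production would use the 10 minute default.
    return (
        await wait_for_decision(answered, timeout=0.05),
        await wait_for_decision(unanswered, timeout=0.05),
    )


outcomes = asyncio.run(demo())
```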
## 11. Roadmap
### Week 1 -- Core skeleton [DONE]
- [x] Repository structure and Pydantic v2 schemas
- [x] `IncidentState` TypedDict and `StateGraph` with conditional edges
- [x] All 5 agents with main logic
- [x] Service-specific mock tools
- [x] RAG pipeline (Contextual Retrieval)
- [x] 71 tests passing
### Week 2 -- Real integrations [DONE]
- [x] ChromaDB with 5 seeded runbooks
- [x] End-to-end test with real Groq
- [x] HITL with real Slack bot -- Approve/Reject buttons working
- [x] Learning loop: postmortem -> ChromaDB verified
- [x] FastAPI with REST and WebSocket endpoints
### Week 3 -- Evaluation + CI/CD + Docker [DONE]
- [x] LangSmith integrated -- full traces per execution
- [x] Dataset of 8 historical incidents with ground truth
- [x] Evaluation runner with custom metrics
- [x] GitHub Actions CI/CD
- [x] Docker working end-to-end
### Week 4 -- Frontend + Documentation [DONE]
- [x] React + Vite dashboard with live pipeline log
- [x] HITL queue with Approve/Reject from dashboard
- [x] Evaluation metrics tab
- [x] 5 complete ADRs in `/docs/`
- [x] Railway deployment configured
## Project Summary
**For recruiters:** AI-powered incident response system that automates the diagnosis and remediation of production infrastructure outages. 5 specialized AI agents work together to detect, investigate, diagnose, remediate, and document incidents automatically. High-risk actions require human approval via a Slack bot before execution. Produces a diagnosis in about 7 seconds on average in evaluation, compared with the 40-60 minutes a manual root-cause investigation typically takes. Evaluated on 8 real historical incidents, covered by 71 automated tests, and deployable with Docker.
**For technical interviewers:** 5 LangGraph agents with cyclic state and conditional edges -- the Diagnostic Reasoner loops back to the Data Collector when confidence is low. RAG over internal runbooks with Contextual Retrieval (up to 67% fewer retrieval failures vs classic RAG in Anthropic's published benchmarks, when combined with BM25 and reranking). Pydantic v2 structured outputs, LangSmith tracing, Slack HITL with Approve/Reject buttons. Top-3 diagnostic accuracy of 62% with Groq in development; architecture prepared for Claude Sonnet in production.
## Author
**David Font Munoz** -- AI/ML Engineer
[GitHub](https://github.com/Dafomu96) · [GitLab](https://gitlab.com/Dafomu96) · [LinkedIn](https://www.linkedin.com/in/davidfontmunoz/)
*Weeks 1-4 completed.*