BasitS-hash/incident-response-agent
# Incident Response Agent
An AI-powered incident response system that automates triage, root cause analysis, and on-call notification. Built with LangGraph, MCP, Mem0, Langfuse, and AG-UI.
## What It Does
When an incident fires, an SRE typically has to manually:
- Pull logs and metrics
- Figure out what changed (deployments, config)
- Write up a root cause analysis
- Notify the on-call team
This agent does all of that automatically — and pauses for human approval before sending any notifications.
**Alert fires → Agent investigates → Human approves → Team notified**
## The 5-Agent Workflow
Intake → Triage → RCA → Approval (HITL pause) → Notify
| Agent | What it does |
|-------|-------------|
| **Intake** | Fetches incident details from Jira (title, description, reporter, priority) |
| **Triage** | Assigns severity (P1–P4), identifies affected systems, queries Mem0 for similar past incidents |
| **RCA** | Queries logs, metrics, and deployment history — LLM identifies root cause and recommends a fix |
| **Approval** | Human-in-the-loop interrupt — workflow pauses until an authorized approver accepts or rejects |
| **Notify** | Sends email notification with full RCA summary to the on-call team |
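
The flow in the table above can be sketched in plain Python. This is not the project's LangGraph graph: `DemoState`, the placeholder agent functions, `run_until_approval`, and `resume` are all hypothetical names, used only to show how the run stops before notification and continues once a decision arrives.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DemoState:
    """Simplified, hypothetical stand-in for the IncidentState schema."""
    incident_id: str
    completed: list[str] = field(default_factory=list)
    approved: Optional[bool] = None
    notified: bool = False


def _mark(state: DemoState, step: str) -> DemoState:
    state.completed.append(step)
    return state


# Placeholder agents: each one only records that it ran.
def intake(state: DemoState) -> DemoState:
    return _mark(state, "intake")


def triage(state: DemoState) -> DemoState:
    return _mark(state, "triage")


def rca(state: DemoState) -> DemoState:
    return _mark(state, "rca")


def notify(state: DemoState) -> DemoState:
    state.notified = True
    return _mark(state, "notify")


def run_until_approval(state: DemoState) -> DemoState:
    """Run the automated steps, then stop so a human can review."""
    for step in (intake, triage, rca):
        step(state)
    return state


def resume(state: DemoState, approved: bool) -> DemoState:
    """Record the human decision; notify only when it is an approval."""
    state.approved = approved
    _mark(state, "approval")
    if approved:
        notify(state)
    return state
```

In the project itself, the pause and resume are handled by LangGraph checkpointing rather than by two separate function calls.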
## Tech Stack
| Layer | Technology |
|-------|-----------|
| **Orchestration** | [LangGraph](https://langchain-ai.github.io/langgraph/) — stateful multi-agent workflow with MemorySaver checkpointing for HITL persistence |
| **LLM** | Google Gemini 2.5 Flash via `langchain-google-genai` (swap to OpenAI via `LLM_PROVIDER=openai`) |
| **Tools** | [MCP](https://modelcontextprotocol.io/) — modular tool server (Jira, logs, metrics, email) |
| **Memory** | [Mem0](https://mem0.ai/) — remembers past incidents and resolutions across runs |
| **Observability** | [Langfuse](https://langfuse.com/) — full LLM tracing, token usage, latency per agent |
| **Streaming** | [AG-UI](https://github.com/ag-ui-protocol/ag-ui) — SSE protocol for real-time frontend updates |
| **Backend** | FastAPI + uvicorn |
| **Frontend** | React 19 + TypeScript + Vite |
| **Audit log** | SQLite — persists every run with agent outputs, approval decision, and timestamps |
| **Rate limiting** | slowapi — 10 req/min on POST /incident, 20 req/min on POST /approve |
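
As an illustration of the audit log row above, here is a minimal sketch using Python's built-in `sqlite3` module. The table name, column names, and the `init_db`, `record_run`, and `get_run` helpers are assumptions made for this example; the actual schema in `backend/audit/log.py` may differ.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per workflow run.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS runs (
            run_id            TEXT PRIMARY KEY,
            incident_id       TEXT NOT NULL,
            agent_outputs     TEXT NOT NULL,  -- JSON-encoded dict
            approval_decision TEXT,           -- NULL until a human decides
            created_at        TEXT NOT NULL   -- ISO 8601, UTC
        )
        """
    )


def record_run(conn: sqlite3.Connection, run_id: str, incident_id: str,
               agent_outputs: dict, approval_decision: Optional[str]) -> None:
    """Insert or overwrite the audit row for a run."""
    conn.execute(
        "INSERT OR REPLACE INTO runs VALUES (?, ?, ?, ?, ?)",
        (run_id, incident_id, json.dumps(agent_outputs), approval_decision,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def get_run(conn: sqlite3.Connection, run_id: str) -> Optional[dict]:
    """Return one run as a dict, or None if the run_id is unknown."""
    row = conn.execute(
        "SELECT incident_id, agent_outputs, approval_decision, created_at "
        "FROM runs WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    if row is None:
        return None
    return {
        "incident_id": row[0],
        "agent_outputs": json.loads(row[1]),
        "approval_decision": row[2],
        "created_at": row[3],
    }
```

Passing `":memory:"` to `sqlite3.connect` gives a throwaway database, which is convenient for tests.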
## Demo Incidents
| ID | Service | Issue | Severity |
|----|---------|-------|----------|
| `INC-101` | Auth service | Redis connection pool reduced by deployment → 97% cache miss rate → OOMKill | P1 |
| `INC-205` | Payment service | ORM migration didn't port connection pool config → PostgreSQL connection exhaustion | P2 |
| `INC-312` | Notification service | Marketing deploy accidentally applied unsubscribe logic to transactional emails → AWS SES suspended | P2 |
## Project Structure
```
incident-response-agent/
├── backend/
│   ├── agents/
│   │   ├── intake_agent.py          # Fetches incident from Jira
│   │   ├── triage_agent.py          # Assigns severity and affected systems
│   │   ├── rca_agent.py             # Root cause analysis via logs + LLM
│   │   ├── notify_agent.py          # Sends email notification
│   │   └── llm_factory.py           # Gemini / OpenAI provider switch
│   ├── api/
│   │   └── main.py                  # FastAPI routes + SSE streaming + rate limiting
│   ├── audit/
│   │   └── log.py                   # SQLite audit log — persists every run
│   ├── data/                        # Runtime SQLite DB (gitignored)
│   ├── graph/
│   │   ├── workflow.py              # LangGraph graph definition
│   │   ├── nodes.py                 # Node wrappers + routing logic
│   │   └── state.py                 # IncidentState schema
│   ├── mcp_server/
│   │   └── tools/
│   │       ├── jira_tools.py        # Jira integration (mock → real via API token)
│   │       ├── log_tools.py         # Log/metrics/deployment data (mock → Splunk/Loki)
│   │       └── email_tools.py       # Email sending (mock → SMTP/SendGrid)
│   ├── memory/
│   │   └── mem0_client.py           # Mem0 for cross-run incident memory
│   ├── observability/
│   │   └── langfuse_client.py       # Langfuse tracing
│   ├── config.py                    # Env var loading
│   └── requirements.txt
├── frontend/
│   └── src/
│       ├── components/
│       │   ├── ChatUI.tsx           # Real-time chat event log
│       │   ├── WorkflowStepper.tsx  # Visual 5-step progress indicator
│       │   ├── IncidentDetails.tsx  # Live state panel
│       │   ├── HITLApprovalModal.tsx # Human review modal
│       │   └── RunHistory.tsx       # Audit log table with search/filter
│       ├── hooks/
│       │   └── useAgentStream.ts    # SSE event consumer
│       └── api/
│           └── client.ts            # Axios API client
├── tests/
│   ├── test_audit_log.py            # SQLite audit log — 19 tests
│   ├── test_agents.py               # Agent helper functions — 18 tests
│   └── test_api.py                  # Input validators + safe-state — 20 tests
├── docker-compose.yml               # Local Langfuse + Postgres observability stack
├── start.sh                         # macOS one-click startup
├── start.ps1                        # Windows one-click startup
└── .env                             # Secrets — never committed
```
## Running Locally
### Prerequisites
- Python 3.11+
- Node.js 18+
- A `.env` file in the project root (see below)
### Quick start (macOS)
```bash
./start.sh
```
Opens both servers in separate Terminal windows automatically.
### Quick start (Windows)
```powershell
.\start.ps1
```
### Manual start
**Backend** (run from project root):
```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\Activate.ps1
pip install -r backend/requirements.txt
uvicorn backend.api.main:api --reload --port 8000
```
**Frontend:**
```bash
cd frontend
npm install
npm run dev
```
Open **http://localhost:5173** (or 5174 if 5173 is in use).
### Running Tests
```bash
source .venv/bin/activate
python -m pytest tests/ -v
```
57 tests, no external dependencies required.
## Environment Variables (`.env` in project root)
```bash
# LLM — required
GEMINI_API_KEY=your_gemini_key

# LLM provider — "gemini" (default) or "openai"
LLM_PROVIDER=gemini
OPENAI_API_KEY=           # required if LLM_PROVIDER=openai

# API auth — leave blank to run in dev mode (no key required)
API_KEY=

# Jira (optional — mocked if not set)
JIRA_URL=https://yourorg.atlassian.net
JIRA_EMAIL=you@yourorg.com
JIRA_TOKEN=your_jira_api_token

# Email / SMTP (optional — mocked if not set)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@gmail.com
SMTP_PASSWORD=your_app_password

# Memory and observability (optional — graceful fallback if not set)
MEM0_API_KEY=your_mem0_key
LANGFUSE_PUBLIC_KEY=your_public_key
LANGFUSE_SECRET_KEY=your_secret_key
LANGFUSE_HOST=http://localhost:3001  # local Docker instance
```
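
To show how these variables might be interpreted at startup, here is a minimal sketch using only the standard library. The `Settings` class and `load_settings` function are hypothetical and are not taken from `backend/config.py`. The sketch also assumes that only the key for the selected provider is mandatory, which may not match the project's actual behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Which key each provider needs (variable names come from the .env listing).
PROVIDER_KEY_VARS = {"gemini": "GEMINI_API_KEY", "openai": "OPENAI_API_KEY"}


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    llm_api_key: str
    auth_enabled: bool  # False means dev mode: no X-API-Key required


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Validate the LLM settings and derive the auth mode (assumed logic)."""
    provider = (env.get("LLM_PROVIDER") or "gemini").strip().lower()
    key_var = PROVIDER_KEY_VARS.get(provider)
    if key_var is None:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")

    api_key = env.get(key_var, "").strip()
    if not api_key:
        raise ValueError(f"{key_var} is required when LLM_PROVIDER={provider}")

    return Settings(
        llm_provider=provider,
        llm_api_key=api_key,
        auth_enabled=bool(env.get("API_KEY", "").strip()),
    )
```

Accepting any mapping instead of reading `os.environ` directly lets tests pass a plain dict.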
## API Endpoints
| Method | Path | Auth | Description |
|--------|------|------|-------------|
| `POST` | `/incident` | API key | Start a new incident workflow |
| `GET` | `/stream/{run_id}` | — | SSE stream of AG-UI events |
| `POST` | `/approve/{run_id}` | API key | Submit approval / rejection |
| `GET` | `/runs` | API key | List all past runs (audit log) |
| `GET` | `/runs/{run_id}` | API key | Full detail for a single run |
| `GET` | `/incidents/search` | API key | Semantic search over Mem0 memory |
| `GET` | `/health` | — | Health check |
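
A client of `GET /stream/{run_id}` has to split the SSE byte stream into events. The sketch below parses `data:` lines into JSON objects using only the standard library. `iter_sse_events` is a hypothetical helper, and the sample payloads are illustrative; the exact events the backend emits are not documented here.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE event.

    Handles only the `data` field and comment lines. Other SSE fields
    (`event`, `id`, `retry`) are ignored in this sketch.
    """
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if name == "data":
            # A single leading space after the colon is not part of the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)
    # An event with no terminating blank line is dropped, as in the SSE spec.


# Illustrative input only; real payloads depend on the backend.
sample = [
    ": keep-alive\n",
    'data: {"type": "STEP_STARTED", "stepName": "triage"}\n',
    "\n",
    'data: {"type": "RUN_FINISHED"}\n',
    "\n",
]
events = list(iter_sse_events(sample))
```

In practice the `lines` argument would come from a streaming HTTP response iterated line by line.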
## Swapping Mocks for Real Integrations
Everything is built with a clear swap point — the agent logic doesn't change, only the data source.
| Mock | Real integration | What to change |
|------|-----------------|----------------|
| Jira mock | Jira REST API | Set `JIRA_URL`, `JIRA_EMAIL`, `JIRA_TOKEN` in `.env` |
| Log mock | Splunk / Loki / Datadog | Replace `query_system_logs()` in `log_tools.py` |
| Email mock | Gmail / SendGrid / SES | Set `SMTP_*` vars in `.env` |
| In-memory checkpointer (`MemorySaver`) | PostgreSQL | Swap `MemorySaver` for `PostgresSaver` in `workflow.py` |
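
One way to picture the swap point is a small selector that returns either a mock or a real data source behind the same call signature. Every name below (`fetch_incident_mock`, `make_jira_fetcher`, `select_incident_fetcher`) is hypothetical, and the real fetcher is left unimplemented because the Jira call itself is outside the scope of this sketch.

```python
from typing import Callable, Mapping

# Both the mock and the real source share this signature,
# so the calling agent does not need to know which one it got.
IncidentFetcher = Callable[[str], dict]

JIRA_VARS = ("JIRA_URL", "JIRA_EMAIL", "JIRA_TOKEN")


def fetch_incident_mock(incident_id: str) -> dict:
    """Return canned data, as a mocked Jira tool might."""
    return {"id": incident_id, "title": "Mock incident", "source": "mock"}


def make_jira_fetcher(url: str, email: str, token: str) -> IncidentFetcher:
    """Build a fetcher bound to real credentials (body intentionally omitted)."""
    def fetch(incident_id: str) -> dict:
        raise NotImplementedError("Call the Jira REST API here")
    return fetch


def select_incident_fetcher(env: Mapping[str, str]) -> IncidentFetcher:
    """Use the real integration only when all three Jira variables are set."""
    creds = [env.get(name, "").strip() for name in JIRA_VARS]
    if all(creds):
        return make_jira_fetcher(*creds)
    return fetch_incident_mock
```

The same pattern would apply to the email tool, keyed on the `SMTP_*` variables.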
## Security
- **Auth** — POST endpoints, the audit log (`/runs`), and `/incidents/search` require an `X-API-Key` header; a blank `API_KEY` enables dev mode (no key required)
- **Rate limiting** — 10 req/min on `POST /incident`, 20 req/min on `POST /approve`
- **Input validation** — incident ID format enforced (`INC-` plus a numeric ID, e.g. `INC-101`), approver name and notes sanitized against prompt injection (control chars stripped, max length enforced)
- **State allowlist** — internal fields (`notification_recipients`, `messages`, `similar_incidents`) are never sent to the browser
- **Memory injection cap** — Mem0 context truncated to 500 chars per entry to limit the impact of RAG data poisoning
- **Security headers** — `X-Content-Type-Options`, `X-Frame-Options`, `Referrer-Policy` on every response
- **Langfuse TLS** — warns if `LANGFUSE_HOST` points to a remote host over plain HTTP
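
Three of the controls above (input validation, the state allowlist, and the memory cap) can be sketched with the standard library. The regex, `MAX_TEXT_LEN`, the `PUBLIC_FIELDS` set, and all function names are assumptions for illustration; only the 500-character cap and the three internal field names come from the list above.

```python
import re
import unicodedata

# Assumed pattern: "INC-" followed by 3 or 4 digits, so the demo IDs pass.
INCIDENT_ID_RE = re.compile(r"INC-\d{3,4}")
MAX_TEXT_LEN = 200       # hypothetical limit for approver name and notes
MEMORY_ENTRY_CAP = 500   # per-entry cap stated in the Security list

# Hypothetical allowlist. Internal fields such as notification_recipients,
# messages, and similar_incidents are dropped because they are not listed.
PUBLIC_FIELDS = {"incident_id", "severity", "root_cause", "status"}


def validate_incident_id(value: str) -> str:
    """Return the ID unchanged if it matches the expected format."""
    if not INCIDENT_ID_RE.fullmatch(value):
        raise ValueError(f"Invalid incident ID: {value!r}")
    return value


def sanitize_text(value: str, max_len: int = MAX_TEXT_LEN) -> str:
    """Strip control characters (Unicode category Cc), then cap the length."""
    cleaned = "".join(ch for ch in value if unicodedata.category(ch) != "Cc")
    return cleaned.strip()[:max_len]


def public_state(state: dict) -> dict:
    """Keep only allowlisted fields before sending state to the browser."""
    return {k: v for k, v in state.items() if k in PUBLIC_FIELDS}


def cap_memory_entries(entries: list[str], cap: int = MEMORY_ENTRY_CAP) -> list[str]:
    """Truncate each retrieved memory entry before it reaches the prompt."""
    return [entry[:cap] for entry in entries]
```

An allowlist fails closed: a new internal field added to the state later stays hidden until someone deliberately adds it to `PUBLIC_FIELDS`.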
## Roadmap
- [ ] PagerDuty / Prometheus webhook trigger (zero-touch incident creation)
- [ ] Slack / Teams notification channel
- [ ] SSO on the approval modal (restrict to authorized SREs)
- [ ] Reject flow re-runs RCA with approver feedback
- [ ] Real Jira, email, and log integrations (credentials pending)
## Author
Basit Sherazi — DMI LLC