BasitS-hash/incident-response-agent

GitHub: BasitS-hash/incident-response-agent

Stars: 0 | Forks: 0

# Incident Response Agent An AI-powered incident response system that automates triage, root cause analysis, and on-call notification. Built with LangGraph, MCP, Mem0, Langfuse, and AG-UI. ## What It Does When an incident fires, an SRE typically has to manually: - Pull logs and metrics - Figure out what changed (deployments, config) - Write up a root cause analysis - Notify the on-call team This agent does all of that automatically — and pauses for human approval before sending any notifications. **Alert fires → Agent investigates → Human approves → Team notified** ## The 5-Agent Workflow Intake → Triage → RCA → Approval (HITL pause) → Notify | Agent | What it does | |-------|-------------| | **Intake** | Fetches incident details from Jira (title, description, reporter, priority) | | **Triage** | Assigns severity (P1–P4), identifies affected systems, queries Mem0 for similar past incidents | | **RCA** | Queries logs, metrics, and deployment history — LLM identifies root cause and recommends a fix | | **Approval** | Human-in-the-loop interrupt — workflow pauses until an authorized approver accepts or rejects | | **Notify** | Sends email notification with full RCA summary to the on-call team | ## Tech Stack | Layer | Technology | |-------|-----------| | **Orchestration** | [LangGraph](https://langchain-ai.github.io/langgraph/) — stateful multi-agent workflow with MemorySaver checkpointing for HITL persistence | | **LLM** | Google Gemini 2.5 Flash via `langchain-google-genai` (swap to OpenAI via `LLM_PROVIDER=openai`) | | **Tools** | [MCP](https://modelcontextprotocol.io/) — modular tool server (Jira, logs, metrics, email) | | **Memory** | [Mem0](https://mem0.ai/) — remembers past incidents and resolutions across runs | | **Observability** | [Langfuse](https://langfuse.com/) — full LLM tracing, token usage, latency per agent | | **Streaming** | [AG-UI](https://github.com/ag-ui-protocol/ag-ui) — SSE protocol for real-time frontend updates | | **Backend** | FastAPI + uvicorn | | **Frontend** | React 19 + TypeScript + Vite | | **Audit log** | SQLite — persists every run with agent outputs, approval decision, and timestamps | | **Rate limiting** | slowapi — 10 req/min on POST /incident, 20 req/min on POST /approve | ## Demo Incidents | ID | Service | Issue | Severity | |----|---------|-------|----------| | `INC-101` | Auth service | Redis connection pool reduced by deployment → 97% cache miss rate → OOMKill | P1 | | `INC-205` | Payment service | ORM migration didn't port connection pool config → PostgreSQL connection exhaustion | P2 | | `INC-312` | Notification service | Marketing deploy accidentally applied unsubscribe logic to transactional emails → AWS SES suspended | P2 | ## Project Structure incident-response-agent/ ├── backend/ │ ├── agents/ │ │ ├── intake_agent.py # Fetches incident from Jira │ │ ├── triage_agent.py # Assigns severity and affected systems │ │ ├── rca_agent.py # Root cause analysis via logs + LLM │ │ ├── notify_agent.py # Sends email notification │ │ └── llm_factory.py # Gemini / OpenAI provider switch │ ├── api/ │ │ └── main.py # FastAPI routes + SSE streaming + rate limiting │ ├── audit/ │ │ └── log.py # SQLite audit log — persists every run │ ├── data/ # Runtime SQLite DB (gitignored) │ ├── graph/ │ │ ├── workflow.py # LangGraph graph definition │ │ ├── nodes.py # Node wrappers + routing logic │ │ └── state.py # IncidentState schema │ ├── mcp_server/ │ │ └── tools/ │ │ ├── jira_tools.py # Jira integration (mock → real via API token) │ │ ├── log_tools.py # Log/metrics/deployment data (mock → Splunk/Loki) │ │ └── email_tools.py # Email sending (mock → SMTP/SendGrid) │ ├── memory/ │ │ └── mem0_client.py # Mem0 for cross-run incident memory │ ├── observability/ │ │ └── langfuse_client.py # Langfuse tracing │ ├── config.py # Env var loading │ └── requirements.txt ├── frontend/ │ └── src/ │ ├── components/ │ │ ├── ChatUI.tsx # Real-time chat event log │ │ ├── WorkflowStepper.tsx # Visual 5-step progress indicator │ │ ├── IncidentDetails.tsx # Live state panel │ │ ├── HITLApprovalModal.tsx # Human review modal │ │ └── RunHistory.tsx # Audit log table with search/filter │ ├── hooks/ │ │ └── useAgentStream.ts # SSE event consumer │ └── api/ │ └── client.ts # Axios API client ├── tests/ │ ├── test_audit_log.py # SQLite audit log — 19 tests │ ├── test_agents.py # Agent helper functions — 18 tests │ └── test_api.py # Input validators + safe-state — 20 tests ├── docker-compose.yml # Local Langfuse + Postgres observability stack ├── start.sh # macOS one-click startup ├── start.ps1 # Windows one-click startup └── .env # Secrets — never committed ## Running Locally ### Prerequisites - Python 3.11+ - Node.js 18+ - A `.env` file in the project root (see below) ### Quick start (macOS) ./start.sh Opens both servers in separate Terminal windows automatically. ### Quick start (Windows) .\start.ps1 ### Manual start **Backend** (run from project root): python -m venv .venv source .venv/bin/activate # Windows: .venv\Scripts\Activate.ps1 pip install -r backend/requirements.txt uvicorn backend.api.main:api --reload --port 8000 **Frontend:** cd frontend npm install npm run dev Open **http://localhost:5173** (or 5174 if 5173 is in use). ### Running Tests source .venv/bin/activate python -m pytest tests/ -v 57 tests, no external dependencies required. ## Environment Variables (`.env` in project root) # LLM — required GEMINI_API_KEY=your_gemini_key # LLM provider — "gemini" (default) or "openai" LLM_PROVIDER=gemini OPENAI_API_KEY= # required if LLM_PROVIDER=openai # API auth — leave blank to run in dev mode (no key required) API_KEY= # Jira (optional — mocked if not set) JIRA_URL=https://yourorg.atlassian.net JIRA_EMAIL=you@yourorg.com JIRA_TOKEN=your_jira_api_token # Email / SMTP (optional — mocked if not set) SMTP_HOST=smtp.gmail.com SMTP_PORT=587 SMTP_USER=you@gmail.com SMTP_PASSWORD=your_app_password # Observability (optional — graceful fallback if not set) MEM0_API_KEY=your_mem0_key LANGFUSE_PUBLIC_KEY=your_public_key LANGFUSE_SECRET_KEY=your_secret_key LANGFUSE_HOST=http://localhost:3001 # local Docker instance ## API Endpoints | Method | Path | Auth | Description | |--------|------|------|-------------| | `POST` | `/incident` | API key | Start a new incident workflow | | `GET` | `/stream/{run_id}` | — | SSE stream of AG-UI events | | `POST` | `/approve/{run_id}` | API key | Submit approval / rejection | | `GET` | `/runs` | API key | List all past runs (audit log) | | `GET` | `/runs/{run_id}` | API key | Full detail for a single run | | `GET` | `/incidents/search` | API key | Semantic search over Mem0 memory | | `GET` | `/health` | — | Health check | ## Swapping Mocks for Real Integrations Everything is built with a clear swap point — the agent logic doesn't change, only the data source. | Mock | Real integration | What to change | |------|-----------------|----------------| | Jira mock | Jira REST API | Set `JIRA_URL`, `JIRA_EMAIL`, `JIRA_TOKEN` in `.env` | | Log mock | Splunk / Loki / Datadog | Replace `query_system_logs()` in `log_tools.py` | | Email mock | Gmail / SendGrid / SES | Set `SMTP_*` vars in `.env` | | SQLite checkpointer | PostgreSQL | Swap `MemorySaver` for `PostgresSaver` in `workflow.py` | ## Security - **Auth** — POST endpoints and audit log protected by `X-API-Key` header; blank `API_KEY` enables dev mode - **Rate limiting** — 10 req/min on `POST /incident`, 20 req/min on `POST /approve` - **Input validation** — incident ID format enforced (`INC-NNNN`), approver name and notes sanitized against prompt injection (control chars stripped, max length enforced) - **State allowlist** — internal fields (`notification_recipients`, `messages`, `similar_incidents`) are never sent to the browser - **Memory injection cap** — Mem0 context truncated to 500 chars per entry to prevent RAG data poisoning - **Security headers** — `X-Content-Type-Options`, `X-Frame-Options`, `Referrer-Policy` on every response - **Langfuse TLS** — warns if `LANGFUSE_HOST` points to a remote host over plain HTTP ## Roadmap - [ ] PagerDuty / Prometheus webhook trigger (zero-touch incident creation) - [ ] Slack / Teams notification channel - [ ] SSO on the approval modal (restrict to authorized SREs) - [ ] Reject flow re-runs RCA with approver feedback - [ ] Real Jira, email, and log integrations (credentials pending) ## Author Basit Sherazi — DMI LLC