anshikapundeel/incident-pilot

GitHub: anshikapundeel/incident-pilot

Stars: 0 | Forks: 0

# incident-pilot

An **AI-augmented incident response service** in Go. It receives Alertmanager-shaped alerts, correlates them with recent deploys, logs, and traces via a deterministic rule engine, produces a structured incident with cited evidence, and optionally rephrases it with a locally configured LLM. It routes the result to Slack (or stdout).

This is the runnable implementation of the design in [`ai-ops-design`](https://github.com/anshikapundeel/ai-ops-design). The architecture, failure modes, threat model, and cost analysis live there; this repo is the v0 code that brings the design to life.

    Alertmanager webhook ──> incident-pilot ──> Slack / stdout
                                   │
                           rule engine + LLM
                                   │
                     (cited findings → rephrased prose)

**Pure Go stdlib. Zero runtime dependencies. No API keys baked in.**

## Why this exists

When an alert fires at 3 AM, the on-call engineer needs three things, in this order:

1. **What just happened?** (a deploy, a config change, a partial outage of a dependency)
2. **Is this already being worked on?** (a duplicate of an open incident)
3. **What's the most likely fix?**

incident-pilot answers all three before the page arrives. It does this without using an LLM for the analysis (LLMs hallucinate causal claims on time-pressure data; see the [llm-integration design doc](https://github.com/anshikapundeel/ai-ops-design/blob/main/docs/08-llm-integration.md)). The analysis is deterministic rule code. The LLM, if configured, only rephrases the rule output into prose.

## Features

- **Alertmanager-compatible webhook** at `POST /alerts`. Existing Prometheus setups can route to it without changes.
- **5 correlation rules** in v0: recent-deploy correlation, error-log spike clustering, trace tail-latency, duplicate open-incident detection, team-routing decision.
- **Pluggable enrichment:** push deploys / traces / logs into `/context/*` endpoints; rules pick up the data on the next alert.
- **Templated narration always works.** LLM narration is optional and gated behind explicit env vars.
- **Slack output** via incoming webhook URL (env-driven; no secret in code).
- **Rule contract enforced:** every finding cites at least one Evidence; broken rules become a LOW finding rather than crashing the pipeline.
- **Race-detector clean** under concurrent ingest + query.

## Build and run

Requires Go 1.22+.

    git clone https://github.com/anshikapundeel/incident-pilot
    cd incident-pilot
    go build -o incident-pilot ./cmd/incident-pilot
    ./incident-pilot        # listens on :9099

## Quick demo

In one terminal:

    ./incident-pilot

In another terminal, record a deploy, feed in some error logs, then fire the alert:

    # Tell the platform a deploy happened 3 minutes ago
    curl -X POST localhost:9099/context/deploys -d '{
      "service": "api-gateway",
      "commit": "a39bf2c",
      "author": "alice",
      "when": "'$(date -u -d "3 minutes ago" +"%Y-%m-%dT%H:%M:%SZ" 2>/dev/null || date -u -v-3M +"%Y-%m-%dT%H:%M:%SZ")'"
    }'

    # Add some error logs
    for i in 1 2 3 4 5 6 7 8 9 10; do
      curl -X POST localhost:9099/context/logs -d "{
        \"service\":\"api-gateway\",\"level\":\"ERROR\",
        \"message\":\"upstream timeout request $i\"
      }"
    done

    # Fire the alert
    curl -X POST localhost:9099/alerts \
      -H 'Content-Type: application/json' \
      -d @examples/sample_alert.json

    # See the structured incident
    curl -s "localhost:9099/api/incidents?limit=1" | jq .

In the first terminal you'll see the templated incident report appear, with a deploy-correlation finding (HIGH severity, since the deploy was 3 minutes before the alert) and an error-spike finding (MEDIUM, with the clustered error pattern).

## The rules shipped in v0

| Rule ID | What it catches |
|-------------------------------|----------------|
| `deploy.recent_correlated` | Alert started shortly after a deploy to the same service. Strongest single signal. |
| `logs.error_spike` | Many ERROR-level log lines for the alerting service; surfaces the dominant error pattern by clustering. |
| `traces.tail_latency` | p99 latency over recent traces for the service exceeds threshold. |
| `incident.duplicate_open` | Same fingerprint, or same service+alertname, already has an open incident. Prevents alert-storm page-floods. |
| `routing.team` | Surfaces the routing decision as a visible finding so it's auditable. |

Each rule lives in its own file in `internal/rules/`, and is ~100 lines of obvious Go. Adding a sixth rule is a small change; see [ADDING_RULES.md](docs/ADDING_RULES.md).

Each finding includes:

- `severity` (info / low / medium / high / critical)
- `confidence` (low / medium / high), orthogonal to severity
- `summary`: one paragraph
- `suggestion`: concrete next command
- `evidence`: pointer back to the data that triggered it

## API

| Endpoint | Purpose |
|----------|---------|
| `POST /alerts` | Alertmanager webhook (envelope or bare alert) |
| `POST /context/deploys` | Record a deploy event |
| `POST /context/traces` | Record a trace sample (TraceID, service, duration) |
| `POST /context/logs` | Record an error log line |
| `GET /api/incidents?limit=N` | List recent incidents |
| `GET /api/incidents/{id}` | Fetch one incident in full |
| `POST /api/incidents/{id}/resolve` | Mark an incident resolved |
| `GET /api/stats` | Counters: alerts in, incidents opened, etc. |
| `GET /healthz` | Liveness probe |

## Optional LLM narration

incident-pilot ships with **no API keys, no default provider, and no outbound calls.** Templated narration is used by default.
To enable an LLM rephrasing layer, set environment variables before starting:

    # Local Ollama (recommended for data-sensitive environments):
    export INCIDENT_PILOT_LLM_PROVIDER=ollama
    export INCIDENT_PILOT_LLM_BASE_URL=http://localhost:11434
    export INCIDENT_PILOT_LLM_MODEL=llama3.1:8b
    ./incident-pilot

    # Or any OpenAI-compatible endpoint:
    export INCIDENT_PILOT_LLM_PROVIDER=openai-compatible
    export INCIDENT_PILOT_LLM_BASE_URL=https://your-endpoint.example.com
    export INCIDENT_PILOT_LLM_MODEL=gpt-4o-mini
    export INCIDENT_PILOT_LLM_API_KEY=YOUR_KEY
    ./incident-pilot

The LLM is **strictly a presentation layer**. The system prompt explicitly forbids the model from introducing new findings, changing severity, or inventing actions. The structured incident (the authoritative data) is always preserved alongside the narration.

When the LLM is unavailable or fails, incident-pilot silently falls back to templated narration. The alert still gets routed.

For the full reasoning behind this design see [`ai-ops-design/docs/08-llm-integration.md`](https://github.com/anshikapundeel/ai-ops-design/blob/main/docs/08-llm-integration.md).

## Slack output

Set `INCIDENT_PILOT_SLACK_WEBHOOK` to a Slack incoming-webhook URL:

    export INCIDENT_PILOT_SLACK_WEBHOOK=https://hooks.slack.com/services/T.../B.../...
    ./incident-pilot

The Slack sink is added in addition to (not instead of) the stdout sink. Use `-quiet` to suppress stdout if you only want Slack.

## What this is and isn't

**Is:**

- A complete, runnable incident-correlation service. Drop it in, point Alertmanager at `/alerts`, push deploys into `/context/deploys`, and you get structured incident reports on every fire.
- A reference implementation of the [ai-ops-design](https://github.com/anshikapundeel/ai-ops-design) intelligence-plane subsystem, specifically the incident pipeline documented in [docs/02-incident-pipeline.md](https://github.com/anshikapundeel/ai-ops-design/blob/main/docs/02-incident-pipeline.md).
- A demonstration that AI-augmented incident response can be built with a deterministic core and an *optional* LLM layer, rather than an LLM-as-judge that hallucinates causal claims.

**Isn't:**

- A storage backend. Incidents live in memory; they're lost on restart. Production deployments would back the store with Postgres or similar.
- An alert manager. Use Prometheus Alertmanager (or any other) for rule evaluation. incident-pilot starts where Alertmanager hands off.
- A search engine. Querying past incidents by content is not implemented. The full design covers this via vector retrieval (see ai-ops-design); not in v0.
- A full observability stack. Pair it with [`flow-trace`](https://github.com/anshikapundeel/flow-trace) for tracing, [`sched-trace`](https://github.com/anshikapundeel/sched-trace) for kernel telemetry, [`redfish-exporter`](https://github.com/anshikapundeel/redfish-exporter) for hardware events, and [`perf-advisor`](https://github.com/anshikapundeel/perf-advisor) for performance rule analysis.

## Project layout

    internal/model/        Alert, Context, Finding, Incident (domain types)
    internal/correlate/    Rule engine: runs rules, enforces evidence contract
    internal/rules/        Five v0 detection rules (one file each)
    internal/store/        In-memory incident store, sync.RWMutex-guarded
    internal/narrate/      Templated narration + optional LLM hook
    internal/route/        Stdout + Slack delivery sinks
    internal/server/       HTTP API (alerts, context, query endpoints)
    cmd/incident-pilot/    Server binary
    examples/              Sample alert + deploy JSON payloads
    docs/                  DESIGN.md, ADDING_RULES.md

## Tests

    go test ./...          # 27 tests, all green
    go test -race ./...    # also race-clean

CI matrix: Go 1.22 and 1.23, both with `-race`, plus an end-to-end smoke that fires an alert against a live server and verifies the incident shows up.
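
To make the race-clean claim concrete, here is a minimal sketch of an in-memory store guarded by a `sync.RWMutex`, exercised by concurrent writers and readers. The `Store` and `Incident` types and their methods are hypothetical and much simpler than whatever lives in `internal/store/`; the sketch only shows the locking pattern.

```go
package main

import (
	"fmt"
	"sync"
)

// Incident is a stand-in for the real domain type; its fields are hypothetical.
type Incident struct {
	ID      int
	Service string
}

// Store keeps incidents in memory behind a read/write mutex.
type Store struct {
	mu        sync.RWMutex
	incidents []Incident
}

// Add appends an incident under the write lock and returns its ID.
func (s *Store) Add(service string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	id := len(s.incidents) + 1
	s.incidents = append(s.incidents, Incident{ID: id, Service: service})
	return id
}

// Recent returns a copy of up to limit incidents, newest first.
// Readers share the read lock, so queries do not block each other.
func (s *Store) Recent(limit int) []Incident {
	s.mu.RLock()
	defer s.mu.RUnlock()
	n := len(s.incidents)
	if limit > n {
		limit = n
	}
	if limit < 0 {
		limit = 0
	}
	out := make([]Incident, 0, limit)
	for i := n - 1; i >= n-limit; i-- {
		out = append(out, s.incidents[i])
	}
	return out
}

func main() {
	var s Store
	var wg sync.WaitGroup
	// Simulate concurrent ingest and query, as in the race-detector runs.
	for i := 0; i < 50; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); s.Add("api-gateway") }()
		go func() { defer wg.Done(); _ = s.Recent(5) }()
	}
	wg.Wait()
	fmt.Println(len(s.Recent(100))) // prints 50
}
```

Running a program like this with `go run -race` reports nothing because every access to the slice goes through the mutex. Returning a copy from `Recent` keeps callers from holding a reference into the slice that writers are appending to.
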
## Roadmap

- [x] Alertmanager-shape ingest, bare-alert fallback
- [x] 5 correlation rules with cited evidence
- [x] Optional LLM hook (Ollama / OpenAI-compatible)
- [x] Slack + stdout delivery sinks
- [x] In-memory store + REST query API
- [x] 27 unit + e2e tests, race-clean
- [ ] Past-incident retrieval (vector similarity over resolved incidents)
- [ ] Postgres-backed store for durability
- [ ] Hallucination-check on LLM output (claim → finding grounding)
- [ ] PagerDuty / Opsgenie sinks
- [ ] Prometheus `/metrics` endpoint
- [ ] More rules: cgroup OOM, leader-election thrash, dependency-graph cascade

## License

MIT.