nimblenitin/sre-incident-copilot
# sre-incident-copilot
`sre-incident-copilot` is a governed, read-only SRE incident assistant. It helps on-call engineers triage alerts, reason about SLO and error-budget impact, retrieve runbooks, suggest safe diagnostic commands, draft Slack-style updates, and write audit logs.
The project is motivated by a common SRE failure mode: real incidents getting buried under noisy pages. In Splunk's [State of Observability 2025](https://www.splunk.com/en-us/blog/observability/state-of-observability-2025.html), based on a survey of 1,855 ITOps and engineering professionals, 73% of respondents reported outages due to ignored or suppressed alerts. On-call engineers rarely ignore alerts out of carelessness; they become desensitized when low-value pages stretch their attention past its limit. As a result, the one alert that actually matters can get lost in the pile.
It is not an autonomous remediation bot. It never restarts pods, scales services, rolls back deployments, deletes data, applies Terraform, or permanently silences alerts. When an alert or runbook suggests an irreversible action, the copilot blocks it and returns `requires_human_approval: true`.
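As a rough illustration of the blocking behavior described above, here is a minimal sketch of a regex-based read-only filter. The function name `filter_actions` and the pattern list are assumptions made for this example; the project's real rules live in `config/policies.yaml` and may be structured differently.

```python
import re

# Hypothetical deny-list; the real patterns come from config/policies.yaml.
BLOCKED_PATTERNS = [
    r"\brollback\b",
    r"\brestart\b",
    r"\bscale\b",
    r"\bdelete\b",
    r"\bterraform apply\b",
    r"\bsilence\b",
]


def filter_actions(actions):
    """Split candidate actions into (allowed, blocked) lists.

    An action is blocked when any deny pattern matches it, case-insensitively.
    Blocked entries carry the reason and the human-approval flag.
    """
    allowed, blocked = [], []
    for action in actions:
        hit = next(
            (p for p in BLOCKED_PATTERNS if re.search(p, action, re.IGNORECASE)),
            None,
        )
        if hit is None:
            allowed.append(action)
        else:
            blocked.append(
                {
                    "action": action,
                    "reason": (
                        "Read-only policy blocks destructive or irreversible "
                        f"action matching pattern: {hit}"
                    ),
                    "requires_human_approval": True,
                }
            )
    return allowed, blocked
```

Under these assumed patterns, `filter_actions(["kubectl get pods -n payments -l app=payment-api", "rollback deployment payment-api"])` keeps the `kubectl get` command and blocks the rollback. A deny-list like this is easy to audit, but it only catches phrasings it has a pattern for.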
## SRE Workflow
When an alert arrives, the service:
1. Classifies severity from `config/severity_matrix.yaml`.
2. Checks SLO impact from `config/services.yaml`.
3. Retrieves a matching YAML runbook from `config/runbooks/`.
4. Filters runbook actions through `config/policies.yaml`.
5. Recommends escalation when severity or burn rate warrants it.
6. Drafts an incident update suitable for Slack.
7. Appends a traceable JSONL audit event to `logs/audit.jsonl`.
Trace fields include `incident_id`, `timestamp`, `agent_version`, `policy_version`, `decision_reason`, and `human_owner`.
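The audit step can be pictured as one JSON object appended per line. The sketch below is an assumption about how such an append might look, not the project's actual code: `append_audit_event` and the default version strings are hypothetical, and only the six field names come from the list above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_audit_event(
    path,
    *,
    incident_id,
    decision_reason,
    human_owner,
    agent_version="0.1.0",  # placeholder value, not the project's version
    policy_version="v1",    # placeholder value, not the project's version
):
    """Append one audit record to a JSONL file and return the record."""
    event = {
        "incident_id": incident_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_version": agent_version,
        "policy_version": policy_version,
        "decision_reason": decision_reason,
        "human_owner": human_owner,
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier events intact; one JSON object per line.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, sort_keys=True) + "\n")
    return event
```

Calling `append_audit_event("logs/audit.jsonl", incident_id="inc-demo", decision_reason="blocked rollback", human_owner="oncall@example.com")` would add a single line to the log. Append-only JSONL suits an audit trail because each event can be written and parsed independently.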
## Seven Habits
The project demonstrates the Seven Habits of Effective Agentic Systems.
## Setup
```bash
cd sre-incident-copilot
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.main:app --reload
```
Health check:
```bash
curl http://127.0.0.1:8000/health
```
## API Examples
Triage a payment error-rate alert:
```bash
curl -s -X POST http://127.0.0.1:8000/triage \
  -H "Content-Type: application/json" \
  --data @examples/alert_error_rate.json
```
List runbooks for a service:
```bash
curl http://127.0.0.1:8000/runbooks/payment-api
```
Draft a postmortem:
```bash
curl -s -X POST http://127.0.0.1:8000/postmortem/draft \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "inc-demo",
    "service": "payment-api",
    "severity": "sev1",
    "summary": "Elevated payment 5xx rate affected checkout completion.",
    "timeline": [
      {"time": "10:00", "event": "Alert fired"},
      {"time": "10:05", "event": "On-call began read-only diagnostics"}
    ],
    "customer_impact": "Some customers could not complete payment."
  }'
```
## Sample Output
```json
{
  "incident_id": "inc-1234567890",
  "service": "payment-api",
  "severity": "sev1",
  "probable_cause": "Payment processor dependency errors or recent deployment regression.",
  "error_budget_status": {
    "status": "critical",
    "remaining_percent": 0.0,
    "burn_rate": 70.0
  },
  "recommended_runbook": "payment-api-error-rate",
  "diagnostic_commands": [
    "kubectl get pods -n payments -l app=payment-api",
    "kubectl logs -n payments deploy/payment-api --since=15m | grep ERROR",
    "curl -s https://payment-api.example.com/health"
  ],
  "blocked_actions": [
    {
      "action": "rollback deployment payment-api",
      "reason": "Read-only policy blocks destructive or irreversible action matching pattern: \\brollback\\b",
      "requires_human_approval": true
    }
  ],
  "requires_human_approval": true
}
```
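One way to read the `error_budget_status` fields is through the common SRE definition of burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes that definition. The helper names, the 99.9% target, and the 7% error ratio are illustrative assumptions, not values taken from `config/services.yaml`, and the project may compute these fields differently.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO permits."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed


def remaining_budget_percent(
    bad_events: int, total_events: int, slo_target: float
) -> float:
    """Share of the window's error budget still unspent, floored at 0."""
    budget = total_events * (1.0 - slo_target)
    if budget <= 0:
        return 0.0
    return max(0.0, (1.0 - bad_events / budget) * 100.0)


# Illustrative numbers only: a 99.9% SLO with 7% of requests failing.
print(round(burn_rate(0.07, 0.999), 1))                  # 70.0
print(remaining_budget_percent(7_000, 100_000, 0.999))   # 0.0
```

With these assumed inputs, errors arrive about 70 times faster than the SLO allows and the window's budget is already exhausted, which matches a `critical` status like the one in the sample.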
## Tests
```bash
pytest
```
The tests cover severity classification, policy blocking for rollback/restart/scale, SLO burn-rate escalation, audit logging, and blameless postmortem structure.