shubhamaher8/OpsAgent

# 🚨 AI Incident Response Agent

## The Problem

| Pain Point | Impact |
|---|---|
| On-call engineer woken at 3 AM | 15–40 min MTTR before even opening a laptop |
| Runbooks exist, rarely followed | Manual → inconsistent → slow |
| Alert fatigue | Engineers ignore alerts; real incidents get missed |
| RCA never written | Same incidents repeat every month |

## The Solution

No human is needed for common failure modes: **OOM kills**, **traffic spikes**, **unhealthy ECS tasks**.

## Architecture

    flowchart TB
        subgraph AWS ["AWS Infrastructure"]
            ECS["ECS Service<br/>payment-service"]
            CWL["CloudWatch Logs<br/>/ecs/payment-service"]
            CWM["CloudWatch Metrics<br/>CPU, Memory, Errors"]
            CWA["CloudWatch Alarm<br/>threshold breach"]
            SNS["SNS Topic<br/>incident-alerts"]
            LAMBDA["Lambda Function<br/>ai-incident-agent"]
        end

        subgraph AGENT ["Agent Pipeline"]
            direction TB
            LH["lambda_handler.py<br/>Parse SNS event"]
            AG["agent.py<br/>Orchestrate flow"]
            TL["tools.py<br/>get_logs"]
            TM["tools.py<br/>get_metrics"]
            PR["prompts.py<br/>build_prompt"]
            TA["tools.py<br/>restart / scale"]
            SL["slack.py<br/>post_slack"]
        end

        subgraph EXTERNAL ["External Services"]
            OR["OpenRouter API<br/>Claude 3.5 Sonnet"]
            SLACK["Slack<br/>incidents channel"]
        end

        ECS -->|emits logs| CWL
        ECS -->|emits metrics| CWM
        CWM -->|threshold breach| CWA
        CWA -->|ALARM state| SNS
        SNS -->|invoke| LAMBDA
        LAMBDA --> LH
        LH -->|alert dict| AG
        AG --> TL
        AG --> TM
        TL -->|boto3| CWL
        TM -->|boto3| CWM
        AG --> PR
        PR -->|prompt| OR
        OR -->|JSON diagnosis| AG
        AG -->|confidence above 0.75| TA
        TA -->|boto3| ECS
        AG --> SL
        SL -->|webhook| SLACK

### Data Flow: 5 Phases

    sequenceDiagram
        participant CW as CloudWatch
        participant SNS as SNS Topic
        participant LH as lambda_handler
        participant AG as agent.py
        participant TL as tools.py
        participant LLM as OpenRouter
        participant ECS as ECS API
        participant SL as Slack

        Note over CW,SNS: TRIGGER
        CW->>SNS: Alarm state change
        SNS->>LH: Invoke Lambda

        Note over LH,TL: OBSERVE
        LH->>AG: handle_incident(alert)
        AG->>TL: get_logs(service, 5min)
        TL-->>AG: last 50 log lines
        AG->>TL: get_metrics(service)
        TL-->>AG: cpu, memory, error_rate

        Note over AG,LLM: DIAGNOSE
        AG->>LLM: SRE prompt + logs + metrics
        LLM-->>AG: root_cause, action, confidence

        Note over AG,ECS: ACT
        alt confidence above 0.75 and action is restart
            AG->>ECS: stop_task - ECS auto-replaces
        else confidence above 0.75 and action is scale
            AG->>ECS: update_service desired +2
        else low confidence or action is none
            AG->>AG: Skip action, mark escalated
        end

        Note over AG,SL: REPORT
        AG->>SL: Post incident summary + RCA

## Tech Stack

| Layer | Technology | Purpose |
|---|---|---|
| **Trigger** | CloudWatch Alarm → SNS → Lambda | Detect incident, wake agent |
| **Agent Logic** | Python 3.11 | Orchestrate entire flow |
| **LLM** | OpenRouter API (`anthropic/claude-3.5-sonnet`) | Root cause diagnosis |
| **LLM SDK** | `openai` Python SDK | OpenAI-compatible interface to OpenRouter |
| **AWS Actions** | `boto3` | Logs, metrics, ECS restart, ECS scale |
| **Notifications** | Slack Incoming Webhook | Incident report delivery |
| **Packaging** | Docker → AWS ECR | Lambda container image |

## Project Structure

    ai-incident-agent/
    ├── lambda_handler.py    # AWS Lambda entry point: parses SNS events
    ├── agent.py             # Core orchestration: observe → diagnose → act → report
    ├── tools.py             # All boto3 AWS interactions (logs, metrics, ECS)
    ├── prompts.py           # LLM prompt builder
    ├── slack.py             # Slack webhook poster (resolved / escalated / failure)
    ├── Dockerfile           # Container image for Lambda via ECR
    ├── requirements.txt     # Python dependencies
    └── .env.example         # Environment variable template

### Module Dependency Graph

    graph TD
        LH["lambda_handler.py"] --> AG["agent.py"]
        AG --> TL["tools.py<br/>boto3 wrappers"]
        AG --> PR["prompts.py<br/>prompt builder"]
        AG --> SL["slack.py<br/>webhook poster"]

        style LH fill:#e94560,stroke:#333,color:#fff
        style AG fill:#533483,stroke:#333,color:#fff
        style TL fill:#0f3460,stroke:#333,color:#fff
        style PR fill:#0f3460,stroke:#333,color:#fff
        style SL fill:#0f3460,stroke:#333,color:#fff

## Quick Start

### 1. Clone & Configure

    git clone https://github.com/shubhamaher8/OpsAgent.git
    cd OpsAgent/ai-incident-agent
    cp .env.example .env
    # Edit .env with your actual keys

### 2. Environment Variables

    # LLM
    OPENROUTER_API_KEY=sk-or-...              # Required: no default, fails loudly if missing
    LLM_MODEL=anthropic/claude-3.5-sonnet

    # AWS
    AWS_REGION=ap-south-1
    ECS_CLUSTER=prod-cluster
    LOG_GROUP_PREFIX=/ecs/

    # Slack
    SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...

### 3. Build & Deploy

    # Build Docker image
    docker build -t ai-incident-agent .

    # Log in to ECR, then tag and push the image
    aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin .dkr.ecr.ap-south-1.amazonaws.com
    docker tag ai-incident-agent:latest .dkr.ecr.ap-south-1.amazonaws.com/ai-incident-agent:latest
    docker push .dkr.ecr.ap-south-1.amazonaws.com/ai-incident-agent:latest

    # Create the Lambda function from the image
    aws lambda create-function \
      --function-name ai-incident-agent \
      --package-type Image \
      --code ImageUri=.dkr.ecr.ap-south-1.amazonaws.com/ai-incident-agent:latest \
      --role arn:aws:iam:::role/lambda-incident-agent-role \
      --timeout 300 \
      --memory-size 256

## How It Works

### Agent Decision Table

| LLM Action | Confidence | Result |
|---|---|---|
| `restart` | > 0.75 | `stop_task()` → ECS auto-launches replacement |
| `scale` | > 0.75 | `update_service(desired_count + 2)` |
| `none` | any | Skip action → escalate to Slack |
| any | ≤ 0.75 | Skip action → escalate to Slack |

### Guardrails

| Guardrail | Mechanism |
|---|---|
| **Confidence Threshold** | LLM confidence ≤ 0.75 → no action taken → escalate to human |
| **Action Whitelist** | Only `restart` and `scale` are executable. `none` always skips. |
| **Error Fallback** | Any exception → post failure message to Slack. Never silently fail. |
| **Lambda Timeout** | Set to 300s (5 min). Agent typically completes in ~15s. |

### Slack Notifications

**✅ Resolved:**

    🚨 Incident: payment-service | P2
    Root Cause: Memory leak from unclosed DB connection pool
    Action Taken: restart (confidence: 0.91)
    Status: ✅ Resolved
    RCA: OOM errors + connection pool exhaustion = classic leak pattern

**⚠️ Escalated:**

    🚨 Incident: payment-service | P2
    Root Cause: Intermittent network timeout to upstream API
    Action Taken: None — confidence too low (0.62)
    Status: ⚠️ Escalated to on-call engineer
    RCA: Could be transient issue or upstream degradation, insufficient signal

**❌ Agent Failure:**

    🚨 Incident: payment-service
    Status: ❌ Agent failed — manual intervention required
    Error: ConnectionTimeout: openrouter.ai

## Failure Modes

| Failure | Behavior |
|---|---|
| OpenRouter API down | Catch exception → post failure to Slack → human escalation |
| boto3 action fails | Catch exception → post failure to Slack → human escalation |
| LLM returns invalid JSON | `JSONDecodeError` caught → escalate with error context |
| Low confidence diagnosis | Skip action → post escalation message to Slack |
| Lambda timeout | Lambda runtime ends the invocation → CloudWatch logs the timeout |
| Slack webhook fails | Log error → agent continues (acceptable failure) |

## IAM Permissions Required

The Lambda execution role needs these permissions:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "logs:FilterLogEvents",
            "logs:DescribeLogGroups"
          ],
          "Resource": "arn:aws:logs:*:*:*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:GetMetricStatistics"
          ],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "ecs:ListTasks",
            "ecs:StopTask",
            "ecs:DescribeServices",
            "ecs:UpdateService"
          ],
          "Resource": "*"
        }
      ]
    }

## Demo

### Test Event (Simple Format)

    {
      "Records": [{
        "Sns": {
          "Message": "{\"service\": \"payment-service\", \"metric\": \"MemoryUtilization\", \"value\": 92, \"threshold\": 85}"
        }
      }]
    }

### Expected Timeline

    T+00s  Lambda invoked by SNS
    T+02s  boto3 fetches real OOM logs from CloudWatch
    T+03s  boto3 fetches CPU/Memory/Error metrics
    T+08s  OpenRouter returns diagnosis (confidence: 0.93)
    T+09s  Guardrail passes → action: restart
    T+12s  boto3 stops ECS task → replacement auto-starts
    T+15s  Slack RCA posted
    T+45s  Service healthy

## Design Decisions

| Question | Answer |
|---|---|
| **Why OpenRouter over Anthropic SDK?** | Model flexibility: swap models via env var without code changes |
| **Why not LangChain?** | A linear flow doesn't need an orchestration framework. Fewer deps = faster Lambda cold start |
| **How does the LLM interact with AWS?** | It doesn't. The LLM returns JSON advice and `boto3` acts on it. LLM = advisor, boto3 = executor |
| **How do you prevent bad actions?** | Confidence must exceed 0.75, and only safe actions (restart/scale) are exposed |
| **What's the failure mode?** | Low confidence → skip → escalate to Slack → human takes over |
| **How would you improve it?** | Post-action verification, multi-service correlation, HITL approval for rollbacks |

## Author

**Shubham Aher**

## License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.
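
## Appendix: Guardrail Sketch

The Agent Decision Table and Guardrails under How It Works can be read as one small pure function. What follows is a minimal sketch of that logic, not the code in `agent.py`. The function name `decide`, the constants, and the returned dictionary shape are hypothetical, and the JSON keys `action` and `confidence` are assumed from the fields shown in the sequence diagram.

```python
import json

# Values taken from the decision table; the names are hypothetical.
CONFIDENCE_THRESHOLD = 0.75
ALLOWED_ACTIONS = {"restart", "scale"}


def decide(llm_response: str) -> dict:
    """Apply the guardrails to a raw LLM reply (assumed to be a JSON object).

    Returns a dict with keys "execute" (bool), "action" (str) and "reason" (str).
    """
    try:
        diagnosis = json.loads(llm_response)
        action = diagnosis["action"]
        confidence = float(diagnosis["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        # Invalid JSON or missing fields: never act, escalate instead.
        return {"execute": False, "action": "none",
                "reason": f"invalid diagnosis: {exc}"}

    # Action whitelist: only restart and scale are executable.
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return {"execute": False, "action": str(action),
                "reason": "action not in whitelist"}

    # Confidence threshold: act only when strictly above 0.75.
    if confidence <= CONFIDENCE_THRESHOLD:
        return {"execute": False, "action": action,
                "reason": "confidence too low"}

    return {"execute": True, "action": action, "reason": "guardrails passed"}


if __name__ == "__main__":
    print(decide('{"root_cause": "OOM", "action": "restart", "confidence": 0.91}'))
    print(decide('{"root_cause": "timeout", "action": "restart", "confidence": 0.62}'))
```

Keeping the decision separate from the `boto3` calls matches the "LLM = advisor, boto3 = executor" split in Design Decisions, and it lets the guardrails be unit-tested without AWS credentials.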