GitHub: shubhamaher8/OpsAgent
# 🚨 AI Incident Response Agent
## The Problem
| Pain Point | Impact |
|---|---|
| On-call engineer woken at 3 AM | 15–40 min added to MTTR before the engineer even opens a laptop |
| Runbooks exist, rarely followed | Manual → inconsistent → slow |
| Alert fatigue | Engineers ignore alerts; real incidents get missed |
| RCA never written | Same incidents repeat every month |
## The Solution
For common failure modes (**OOM kills**, **traffic spikes**, **unhealthy ECS tasks**), the agent diagnoses and remediates the incident with no human in the loop.
## Architecture

    flowchart TB
        subgraph AWS ["AWS Infrastructure"]
            ECS["ECS Service<br/>payment-service"]
            CWL["CloudWatch Logs<br/>/ecs/payment-service"]
            CWM["CloudWatch Metrics<br/>CPU, Memory, Errors"]
            CWA["CloudWatch Alarm<br/>threshold breach"]
            SNS["SNS Topic<br/>incident-alerts"]
            LAMBDA["Lambda Function<br/>ai-incident-agent"]
        end

        subgraph AGENT ["Agent Pipeline"]
            direction TB
            LH["lambda_handler.py<br/>Parse SNS event"]
            AG["agent.py<br/>Orchestrate flow"]
            TL["tools.py<br/>get_logs"]
            TM["tools.py<br/>get_metrics"]
            PR["prompts.py<br/>build_prompt"]
            TA["tools.py<br/>restart / scale"]
            SL["slack.py<br/>post_slack"]
        end

        subgraph EXTERNAL ["External Services"]
            OR["OpenRouter API<br/>Claude 3.5 Sonnet"]
            SLACK["Slack<br/>incidents channel"]
        end

        ECS -->|emits logs| CWL
        ECS -->|emits metrics| CWM
        CWM -->|threshold breach| CWA
        CWA -->|ALARM state| SNS
        SNS -->|invoke| LAMBDA
        LAMBDA --> LH
        LH -->|alert dict| AG
        AG --> TL
        AG --> TM
        TL -->|boto3| CWL
        TM -->|boto3| CWM
        AG --> PR
        PR -->|prompt| OR
        OR -->|JSON diagnosis| AG
        AG -->|confidence above 0.75| TA
        TA -->|boto3| ECS
        AG --> SL
        SL -->|webhook| SLACK

### Data Flow: 5 Phases

    sequenceDiagram
        participant CW as CloudWatch
        participant SNS as SNS Topic
        participant LH as lambda_handler
        participant AG as agent.py
        participant TL as tools.py
        participant LLM as OpenRouter
        participant ECS as ECS API
        participant SL as Slack

        Note over CW,SNS: TRIGGER
        CW->>SNS: Alarm state change
        SNS->>LH: Invoke Lambda

        Note over LH,TL: OBSERVE
        LH->>AG: handle_incident(alert)
        AG->>TL: get_logs(service, 5min)
        TL-->>AG: last 50 log lines
        AG->>TL: get_metrics(service)
        TL-->>AG: cpu, memory, error_rate

        Note over AG,LLM: DIAGNOSE
        AG->>LLM: SRE prompt + logs + metrics
        LLM-->>AG: root_cause, action, confidence

        Note over AG,ECS: ACT
        alt confidence above 0.75 and action is restart
            AG->>ECS: stop_task - ECS auto-replaces
        else confidence above 0.75 and action is scale
            AG->>ECS: update_service desired +2
        else low confidence or action is none
            AG->>AG: Skip action, mark escalated
        end

        Note over AG,SL: REPORT
        AG->>SL: Post incident summary + RCA

## Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| **Trigger** | CloudWatch Alarm → SNS → Lambda | Detect incident, wake agent |
| **Agent Logic** | Python 3.11 | Orchestrate entire flow |
| **LLM** | OpenRouter API (`anthropic/claude-3.5-sonnet`) | Root cause diagnosis |
| **LLM SDK** | `openai` Python SDK | OpenAI-compatible interface to OpenRouter |
| **AWS Actions** | `boto3` | Logs, metrics, ECS restart, ECS scale |
| **Notifications** | Slack Incoming Webhook | Incident report delivery |
| **Packaging** | Docker → AWS ECR | Lambda container image |
## Project Structure

    ai-incident-agent/
    ├── lambda_handler.py    # AWS Lambda entry point: parses SNS events
    ├── agent.py             # Core orchestration: observe → diagnose → act → report
    ├── tools.py             # All boto3 AWS interactions (logs, metrics, ECS)
    ├── prompts.py           # LLM prompt builder
    ├── slack.py             # Slack webhook poster (resolved / escalated / failure)
    ├── Dockerfile           # Container image for Lambda via ECR
    ├── requirements.txt     # Python dependencies
    └── .env.example         # Environment variable template

### Module Dependency Graph

    graph TD
        LH["lambda_handler.py"] --> AG["agent.py"]
        AG --> TL["tools.py<br/>boto3 wrappers"]
        AG --> PR["prompts.py<br/>prompt builder"]
        AG --> SL["slack.py<br/>webhook poster"]
        style LH fill:#e94560,stroke:#333,color:#fff
        style AG fill:#533483,stroke:#333,color:#fff
        style TL fill:#0f3460,stroke:#333,color:#fff
        style PR fill:#0f3460,stroke:#333,color:#fff
        style SL fill:#0f3460,stroke:#333,color:#fff

## Quick Start
### 1. Clone & Configure

    git clone https://github.com/shubhamaher8/OpsAgent.git
    cd OpsAgent/ai-incident-agent
    cp .env.example .env
    # Edit .env with your actual keys

### 2. Environment Variables

    # LLM
    OPENROUTER_API_KEY=sk-or-...   # Required: no default, fails loudly if missing
    LLM_MODEL=anthropic/claude-3.5-sonnet

    # AWS
    AWS_REGION=ap-south-1
    ECS_CLUSTER=prod-cluster
    LOG_GROUP_PREFIX=/ecs/

    # Slack
    SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...

### 3. Build & Deploy

    # Build Docker image
    docker build -t ai-incident-agent .

    # Log in, tag, and push to ECR (replace <ACCOUNT_ID> with your AWS account ID)
    aws ecr get-login-password --region ap-south-1 | docker login --username AWS --password-stdin <ACCOUNT_ID>.dkr.ecr.ap-south-1.amazonaws.com
    docker tag ai-incident-agent:latest <ACCOUNT_ID>.dkr.ecr.ap-south-1.amazonaws.com/ai-incident-agent:latest
    docker push <ACCOUNT_ID>.dkr.ecr.ap-south-1.amazonaws.com/ai-incident-agent:latest

    # Create/update Lambda to use the image
    aws lambda create-function \
      --function-name ai-incident-agent \
      --package-type Image \
      --code ImageUri=<ACCOUNT_ID>.dkr.ecr.ap-south-1.amazonaws.com/ai-incident-agent:latest \
      --role arn:aws:iam::<ACCOUNT_ID>:role/lambda-incident-agent-role \
      --timeout 300 \
      --memory-size 256

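
The environment variables from step 2 could be read at startup roughly as follows. This is a minimal sketch, not the project's actual code: `load_config` is a hypothetical helper, and the fallback values are simply the example values from step 2, assumed here as defaults. Only the "required, no default" behavior of `OPENROUTER_API_KEY` is taken from the README.

```python
import os
from typing import Mapping, Optional


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read agent settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        # No default on purpose: a missing key should stop the agent early.
        raise RuntimeError("OPENROUTER_API_KEY is required but not set")
    return {
        "openrouter_api_key": api_key,
        # The fallbacks below are assumptions copied from the example .env.
        "llm_model": env.get("LLM_MODEL", "anthropic/claude-3.5-sonnet"),
        "aws_region": env.get("AWS_REGION", "ap-south-1"),
        "ecs_cluster": env.get("ECS_CLUSTER", "prod-cluster"),
        "log_group_prefix": env.get("LOG_GROUP_PREFIX", "/ecs/"),
        "slack_webhook_url": env.get("SLACK_WEBHOOK_URL"),
    }
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in tests.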
## How It Works
### Agent Decision Table
| LLM Action | Confidence | Result |
|---|---|---|
| `restart` | > 0.75 | `stop_task()` → ECS auto-launches replacement |
| `scale` | > 0.75 | `update_service(desired_count + 2)` |
| `none` | any | Skip action → escalate to Slack |
| any | ≤ 0.75 | Skip action → escalate to Slack |
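
The decision table above can be expressed as one small pure function. The sketch below assumes hypothetical names (`decide`, `CONFIDENCE_THRESHOLD`, `EXECUTABLE_ACTIONS`); the real `agent.py` may structure this differently.

```python
CONFIDENCE_THRESHOLD = 0.75
EXECUTABLE_ACTIONS = {"restart", "scale"}


def decide(action: str, confidence: float) -> str:
    """Return the action to execute, or "escalate" when a guardrail blocks it."""
    if action not in EXECUTABLE_ACTIONS:
        # Covers "none" and anything outside the whitelist.
        return "escalate"
    if confidence <= CONFIDENCE_THRESHOLD:
        # The threshold is strict: exactly 0.75 still escalates.
        return "escalate"
    return action
```

Checking the whitelist first means an unexpected action string from the LLM can never reach boto3, whatever confidence it reports.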
### Guardrails
| Guardrail | Mechanism |
|---|---|
| **Confidence Threshold** | LLM confidence ≤ 0.75 → no action taken → escalate to human |
| **Action Whitelist** | Only `restart` and `scale` are executable. `none` always skips. |
| **Error Fallback** | Any exception → post failure message to Slack. Never silently fail. |
| **Lambda Timeout** | Set to 300s (5 min). Agent typically completes in ~15s. |
### Slack Notifications
**✅ Resolved:**
🚨 Incident: payment-service | P2
Root Cause: Memory leak from unclosed DB connection pool
Action Taken: restart (confidence: 0.91)
Status: ✅ Resolved
RCA: OOM errors + connection pool exhaustion = classic leak pattern
**⚠️ Escalated:**
🚨 Incident: payment-service | P2
Root Cause: Intermittent network timeout to upstream API
Action Taken: None — confidence too low (0.62)
Status: ⚠️ Escalated to on-call engineer
RCA: Could be transient issue or upstream degradation, insufficient signal
**❌ Agent Failure:**
🚨 Incident: payment-service
Status: ❌ Agent failed — manual intervention required
Error: ConnectionTimeout: openrouter.ai
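
A Slack Incoming Webhook accepts a JSON body with a `text` field, so posting any of the messages above needs only the standard library. The following is a sketch under assumptions: `build_slack_request` and `post_slack` are illustrative names (the README only confirms that `slack.py` posts via a webhook), and the 5-second timeout is arbitrary.

```python
import json
import logging
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build the POST request for a Slack Incoming Webhook without sending it."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_slack(webhook_url: str, text: str, timeout: float = 5.0) -> bool:
    """Send the message; log and return False on network errors instead of raising."""
    request = build_slack_request(webhook_url, text)
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError) as exc:
        logger.error("Slack webhook failed: %s", exc)
        return False
```

Returning `False` rather than raising lets the caller treat a failed notification as non-fatal.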
## Failure Modes
| Failure | Behavior |
|---|---|
| OpenRouter API down | Catch exception → post failure to Slack → human escalation |
| boto3 action fails | Catch exception → post failure to Slack → human escalation |
| LLM returns invalid JSON | `JSONDecodeError` caught → escalate with error context |
| Low confidence diagnosis | Skip action → post escalation message to Slack |
| Lambda timeout | Lambda runtime terminates the invocation → the timeout is recorded in CloudWatch Logs |
| Slack webhook fails | Log error → agent continues (acceptable failure) |
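
The "invalid JSON" row can be handled by validating the LLM reply before anything acts on it. This is a hedged sketch: `parse_diagnosis` and the shape of its return value are assumptions, while the three field names come from the data flow described earlier.

```python
import json

REQUIRED_KEYS = ("root_cause", "action", "confidence")


def parse_diagnosis(raw: str) -> dict:
    """Turn the LLM reply into a validated dict, or an escalation record."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"status": "escalated", "error": f"invalid JSON from LLM: {exc.msg}"}
    if not isinstance(data, dict) or any(key not in data for key in REQUIRED_KEYS):
        return {"status": "escalated", "error": "diagnosis is missing required fields"}
    try:
        confidence = float(data["confidence"])
    except (TypeError, ValueError):
        return {"status": "escalated", "error": "confidence is not a number"}
    return {
        "status": "ok",
        "root_cause": str(data["root_cause"]),
        "action": str(data["action"]),
        "confidence": confidence,
    }
```

Every failure path returns an escalation record with an error string, which gives the Slack message the error context the table mentions.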
## IAM Permissions Required
The Lambda execution role needs these permissions:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "logs:FilterLogEvents",
            "logs:DescribeLogGroups"
          ],
          "Resource": "arn:aws:logs:*:*:*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:GetMetricStatistics"
          ],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "ecs:ListTasks",
            "ecs:StopTask",
            "ecs:DescribeServices",
            "ecs:UpdateService"
          ],
          "Resource": "*"
        }
      ]
    }

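
The four ECS permissions map onto the boto3 ECS client calls `list_tasks`, `stop_task`, `describe_services`, and `update_service`. The sketch below shows one plausible way the restart and scale actions could use them. The function names are hypothetical, the client is passed in so the code runs without AWS credentials, and stopping every listed task is an assumption (the project may stop only one).

```python
def restart_service(ecs, cluster: str, service: str) -> list:
    """Stop the service's tasks; the ECS service scheduler launches replacements."""
    task_arns = ecs.list_tasks(cluster=cluster, serviceName=service)["taskArns"]
    for arn in task_arns:
        ecs.stop_task(cluster=cluster, task=arn, reason="ai-incident-agent restart")
    return task_arns


def scale_service(ecs, cluster: str, service: str, step: int = 2) -> int:
    """Raise the service's desired count by `step` and return the new value."""
    described = ecs.describe_services(cluster=cluster, services=[service])
    desired = described["services"][0]["desiredCount"] + step
    ecs.update_service(cluster=cluster, service=service, desiredCount=desired)
    return desired
```

In Lambda, `ecs` would be `boto3.client("ecs", region_name=...)`; in tests, any object with the same four methods works.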
## Demo
### Test Event (Simple Format)

    {
      "Records": [{
        "Sns": {
          "Message": "{\"service\": \"payment-service\", \"metric\": \"MemoryUtilization\", \"value\": 92, \"threshold\": 85}"
        }
      }]
    }

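
Parsing this simple-format event comes down to decoding the JSON string nested in `Records[0].Sns.Message`. Here is a minimal sketch with a hypothetical `parse_sns_event` helper. It handles only the simple format shown above, not a native CloudWatch alarm payload.

```python
import json


def parse_sns_event(event: dict) -> dict:
    """Extract the alert dict from the first SNS record of a Lambda event."""
    message = event["Records"][0]["Sns"]["Message"]
    alert = json.loads(message)  # SNS delivers the message body as a JSON string
    return {
        "service": alert["service"],  # required: a KeyError here should surface
        "metric": alert.get("metric"),
        "value": alert.get("value"),
        "threshold": alert.get("threshold"),
    }
```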
### Expected Timeline

    T+00s  Lambda invoked by SNS
    T+02s  boto3 fetches real OOM logs from CloudWatch
    T+03s  boto3 fetches CPU/Memory/Error metrics
    T+08s  OpenRouter returns diagnosis (confidence: 0.93)
    T+09s  Guardrail passes → action: restart
    T+12s  boto3 stops ECS task → replacement auto-starts
    T+15s  Slack RCA posted
    T+45s  Service healthy

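
The log fetch at T+02s could look roughly like this. It is a sketch, not the project's `tools.py`: the signature is an assumption, the CloudWatch Logs client is injected, and pagination via `nextToken` is left out. Only `filter_log_events`, the `/ecs/` prefix, the 5-minute window, and the 50-line cap come from the README.

```python
import time
from typing import Optional


def get_logs(logs, service: str, prefix: str = "/ecs/", minutes: int = 5,
             limit: int = 50, now: Optional[float] = None) -> list:
    """Return up to `limit` of the most recent log messages for a service."""
    now = time.time() if now is None else now
    start_ms = int((now - minutes * 60) * 1000)  # the API expects epoch milliseconds
    response = logs.filter_log_events(
        logGroupName=f"{prefix}{service}",
        startTime=start_ms,
    )
    messages = [event["message"] for event in response.get("events", [])]
    # Keep the tail of this page only; later pages are ignored in this sketch.
    return messages[-limit:]
```

The `now` parameter exists only to make the time window deterministic in tests.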
## Design Decisions
| Question | Answer |
|---|---|
| **Why OpenRouter over Anthropic SDK?** | Model flexibility — swap models via env var without code changes |
| **Why not LangChain?** | Linear flow doesn't need an orchestration framework. Fewer deps = faster Lambda cold start |
| **How does the LLM interact with AWS?** | It doesn't. LLM returns JSON advice. `boto3` acts on it. LLM = advisor, boto3 = executor |
| **How do you prevent bad actions?** | Confidence threshold > 0.75 + only safe actions (restart/scale) exposed |
| **What's the failure mode?** | Low confidence → skip → escalate to Slack → human takes over |
| **How would you improve it?** | Post-action verification, multi-service correlation, HITL approval for rollbacks |
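
Because the LLM only advises, the prompt has to pin down the JSON shape that the executor side expects. A rough sketch of a prompt builder follows. The wording of the prompt and the signature of `build_prompt` are assumptions; only the field names `root_cause`, `action`, and `confidence` and the action values come from the sections above.

```python
import json


def build_prompt(alert: dict, logs: list, metrics: dict) -> str:
    """Assemble an SRE-style diagnosis prompt that asks for a JSON-only answer."""
    return "\n".join([
        "You are an SRE diagnosing a production incident.",
        f"Alert: {json.dumps(alert, sort_keys=True)}",
        f"Metrics: {json.dumps(metrics, sort_keys=True)}",
        "Recent logs:",
        *logs,
        "Respond with JSON only, using the keys root_cause, action, confidence.",
        'action must be one of "restart", "scale", "none"; '
        "confidence is a number between 0 and 1.",
    ])
```

Naming the allowed actions in the prompt mirrors the action whitelist, though the whitelist check in code remains the actual guardrail.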
## Author
**Shubham Aher**
## License
This project is licensed under the MIT License — see the [LICENSE](LICENSE) file for details.