SushanthKS06/opsgenie-slack-agent
# OpsGenie — Slack Workflow Autonomous Response Machine
### AI Incident Command Center
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Project Structure](#project-structure)
3. [Prerequisites](#prerequisites)
4. [Environment Variables](#environment-variables)
5. [Installation & Setup](#installation--setup)
6. [Deployment](#deployment)
7. [Configuring Inbound Webhooks](#configuring-inbound-webhooks)
8. [Incident Lifecycle](#incident-lifecycle)
9. [MCP Client Reference](#mcp-client-reference)
10. [Block Kit UI Reference](#block-kit-ui-reference)
11. [Running Tests](#running-tests)
12. [Extending SWARM](#extending-swarm)
13. [Troubleshooting](#troubleshooting)
## Architecture Overview
```
┌──────────────────────────────────────────────────────────────────────────┐
│                            SWARM Architecture                            │
│                                                                          │
│  PagerDuty / Datadog                                                     │
│         │ Webhook POST                                                   │
│         ▼                                                                │
│  ┌─────────────┐    /incident declare    ┌──────────────────┐            │
│  │   Webhook   │◄───────────────────────►│  Slash Command   │            │
│  │   Trigger   │                         │     Trigger      │            │
│  └──────┬──────┘                         └────────┬─────────┘            │
│         │                                         │                      │
│         └────────────────┬────────────────────────┘                      │
│                          ▼                                               │
│              ┌───────────────────────┐                                   │
│              │   incident_workflow   │  ← Slack Orchestrator             │
│              └───────────┬───────────┘                                   │
│                          │                                               │
│          ┌───────────────┼───────────────┐                               │
│          │               │               │                               │
│          ▼               ▼               ▼                               │
│    ┌───────────┐  ┌─────────────┐  ┌────────────┐                        │
│    │ triage.ts │  │channel_     │  │ (parallel) │                        │
│    │  Claude   │  │ create.ts   │  │            │                        │
│    │    3.5    │  │             │  │notify_ +   │                        │
│    │  Sonnet   │  │#inc-{svc}-  │  │enrich_     │                        │
│    │   P1–P4   │  │{timestamp}  │  │context.ts  │                        │
│    └───────────┘  └─────────────┘  └────────────┘                        │
│                                          │                               │
│                      ┌───────────────────┤                               │
│                      │                   │                               │
│              ┌───────▼──────┐    ┌───────▼──────┐                        │
│              │  PagerDuty   │    │ Datadog MCP  │                        │
│              │  MCP Client  │    │ + Slack RTS  │                        │
│              │ + GitHub MCP │    │ + GitHub     │                        │
│              └──────────────┘    │ Deployments  │                        │
│                                  └──────────────┘                        │
│                                                                          │
│  "Resolve" button ──► post_mortem.ts                                     │
│                           │                                              │
│           ┌───────────────┴──────────────┐                               │
│           │                              │                               │
│    ┌──────▼──────┐              ┌────────▼──────┐                        │
│    │ GitHub MCP  │              │   Jira MCP    │                        │
│    │ Post-Mortem │              │   Follow-up   │                        │
│    │ PR (Draft)  │              │    Ticket     │                        │
│    └─────────────┘              └───────────────┘                        │
└──────────────────────────────────────────────────────────────────────────┘
```
### Key Design Decisions
| Decision | Rationale |
|---|---|
| **Post-mortem outside main workflow** | Incidents can last hours; Slack workflows have time limits. The `post_mortem` function is triggered by the "Resolve" button action, not the workflow, keeping both paths short-lived. |
| **Steps 3a + 3b run in parallel** | `notify_responders` and `enrich_context` (steps 4a and 4b in the Incident Lifecycle diagram, which counts the alert itself as step 1) have no data dependency on each other. The Slack platform detects this and executes them concurrently, shaving ~10s off P1 response time. |
| **All enrichment sources are non-fatal** | If Datadog, GitHub Deployments, or Slack RTS fails, the incident channel still opens and responders are still paged. Partial enrichment is always better than no incident channel. |
| **MCP clients are HTTP wrappers for now** | Real MCP servers are still maturing. Each client is structured like a future MCP tool call, so the `fetch()` internals can be swapped for `mcpCall()` without changing the client's public API. |
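
The "non-fatal enrichment" decision can be illustrated with a minimal sketch. It assumes each source is wrapped as an async loader that returns lines of text; the names `EnrichmentSource`, `EnrichmentResult`, and `enrichNonFatal` are hypothetical and are not taken from `enrich_context.ts`.

```typescript
// Hypothetical shapes for illustration; the real enrich_context.ts may differ.
interface EnrichmentSource {
  name: string;
  load: () => Promise<string[]>;
}

interface EnrichmentResult {
  findings: Record<string, string[]>; // source name → lines to post in the channel
  failed: string[]; // sources that threw; reported, never fatal
}

async function enrichNonFatal(sources: EnrichmentSource[]): Promise<EnrichmentResult> {
  // Start every source at once. Each promise catches its own error, so
  // Promise.all never rejects and one outage cannot abort the others.
  const outcomes = await Promise.all(
    sources.map(async (source) => {
      try {
        return { name: source.name, ok: true, lines: await source.load() };
      } catch (_err) {
        return { name: source.name, ok: false, lines: [] as string[] };
      }
    }),
  );

  const result: EnrichmentResult = { findings: {}, failed: [] };
  for (const outcome of outcomes) {
    if (outcome.ok) result.findings[outcome.name] = outcome.lines;
    else result.failed.push(outcome.name);
  }
  return result;
}
```

With this shape, a caller can post whatever is in `findings` and mention the names in `failed` as "unavailable" without blocking the incident channel.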
## Project Structure
```
opsgenie-slack-agent/
├── manifest.ts                  # App manifest: scopes, functions, workflow, domains
├── slack.json                   # Runtime config: Deno permissions, env var declarations
├── import_map.json              # Pinned Deno module versions
│
├── functions/
│   ├── triage.ts                # Claude 3.5 Sonnet → P1–P4 + routing_team
│   ├── channel_create.ts        # channels.create → #inc-{service}-{timestamp}
│   ├── notify_responders.ts     # PagerDuty Events API + Slack @mention + incident card
│   ├── enrich_context.ts        # Datadog MCP + Slack RTS + GitHub Deployments
│   ├── timeline_updater.ts      # Pinned timeline: init / ack / escalate / resolve
│   └── post_mortem.ts           # Claude RCA doc → GitHub PR + Jira ticket
│
├── triggers/
│   ├── webhook_trigger.ts       # PagerDuty/Datadog inbound webhook + filter rules
│   └── slash_command.ts         # /incident declare shortcut
│
├── workflows/
│   └── incident_workflow.ts     # 5-step orchestrator (steps 3a+3b parallel)
│
├── mcp/
│   ├── pagerduty.ts             # 6 tools: trigger/ack/resolve/oncall/detail/alerts
│   ├── github.ts                # 3 tools: createPostMortemPR/commits/deployments
│   ├── jira.ts                  # 4 tools: create/search/transition/comment
│   └── datadog.ts               # 5 tools: monitors/dashboards/metrics/events/sendEvent
│
├── blocks/
│   ├── incident_card.ts         # Incident header + metadata + 3 action buttons w/ confirms
│   └── postmortem_card.ts       # Resolution summary + conditional PR/Jira buttons
│
└── tests/
    ├── triage.test.ts           # 20 unit tests: parsing, routing, sanitisation
    └── e2e_incident.test.ts     # 18 E2E tests: full lifecycle with fetch stubs
```
## Prerequisites
| Tool | Version | Install |
|---|---|---|
| [Slack CLI](https://api.slack.com/automation/cli/install) | ≥ 2.x | `curl -fsSL https://downloads.slack-edge.com/slack-cli/install.sh \| bash` |
| [Deno](https://deno.land/) | ≥ 1.40 | `curl -fsSL https://deno.land/install.sh \| sh` |
| Slack Workspace | Admin access | Required for app deployment |
## Environment Variables
Set all secrets via the Slack CLI — **never commit values to source control**:
```bash
slack env add ANTHROPIC_API_KEY sk-ant-...
slack env add PAGERDUTY_API_KEY your-pd-api-key
slack env add PAGERDUTY_ROUTING_KEY your-pd-routing-key
slack env add DATADOG_API_KEY your-dd-api-key
slack env add DATADOG_APP_KEY your-dd-app-key
slack env add GITHUB_TOKEN ghp_...
slack env add GITHUB_ORG your-org-name
slack env add GITHUB_POSTMORTEM_REPO postmortems
slack env add JIRA_API_TOKEN your-jira-token
slack env add JIRA_EMAIL you@your-org.com
slack env add JIRA_BASE_URL https://your-org.atlassian.net
slack env add JIRA_PROJECT_KEY OPS
```
**Optional — PagerDuty escalation policy IDs per team:**
```bash
slack env add PD_POLICY_PLATFORM PXXXXXX
slack env add PD_POLICY_CHECKOUT PXXXXXX
slack env add PD_POLICY_PAYMENTS PXXXXXX
slack env add PD_POLICY_DATA PXXXXXX
slack env add PD_POLICY_INFRA PXXXXXX
slack env add PD_POLICY_SECURITY PXXXXXX
slack env add PD_POLICY_ML PXXXXXX
slack env add PD_POLICY_FRONTEND PXXXXXX
```
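
How these variables are read is not shown in this README. The sketch below assumes the naming convention visible above (`PD_POLICY_` followed by the upper-cased team name) and takes the environment as a plain record, so it has no runtime dependency. `policyIdForTeam` is a hypothetical helper, not a function from `notify_responders.ts`.

```typescript
// Hypothetical helper: resolve a team's escalation policy ID from env vars
// named PD_POLICY_<TEAM>. Returns undefined when the variable is unset or
// empty, so the caller can fall back to a default routing key.
function policyIdForTeam(
  team: string,
  env: Record<string, string | undefined>,
): string | undefined {
  const key = "PD_POLICY_" + team.trim().toUpperCase();
  const value = env[key];
  return value !== undefined && value.trim().length > 0 ? value.trim() : undefined;
}
```

Because the policy IDs are optional, returning `undefined` rather than throwing keeps paging on the default path when a team has no dedicated policy.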
## Installation & Setup
```bash
# 1. Clone the repo
git clone https://github.com/your-org/opsgenie-slack-agent
cd opsgenie-slack-agent

# 2. Login to Slack CLI
slack login

# 3. Run locally (hot-reload dev mode)
slack run

# 4. In a new terminal, check the trigger URL
slack triggers list --app
```
### First-time workspace setup
After `slack run` succeeds, configure your Slack usergroups so SWARM can @mention them:
1. Create usergroups matching the handles in `triage.ts`: `oncall-platform`, `oncall-checkout`, `oncall-payments`, `oncall-data`, `oncall-infra`, `oncall-security`, `oncall-ml`, `oncall-frontend`.
2. Add the relevant engineers to each usergroup.
3. Update `PAGERDUTY_POLICY_MAP` in `notify_responders.ts` with your actual escalation policy IDs from PagerDuty.
## Deployment
```bash
# Deploy to production
slack deploy

# Verify deployed functions
slack functions list

# Check environment variables are set
slack env list

# View logs
slack activity --tail
```
## Configuring Inbound Webhooks
### PagerDuty
1. Go to **PagerDuty → Services → [Your Service] → Integrations → Add Integration**
2. Select **Generic Webhook (v3)**
3. Set the **Webhook URL** to the URL from `slack triggers list`
4. Under **Events to send**, enable: `incident.triggered`, `incident.acknowledged`, `incident.resolved`
5. Set **Method**: POST, **Content-Type**: application/json
PagerDuty will send a body with this structure:
```json
{
  "event": {
    "event_type": "incident.triggered",
    "data": {
      "title": "High error rate on checkout-api",
      "status": "triggered",
      "service": { "name": "checkout-api" },
      ...
    }
  }
}
```
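
As a rough illustration of consuming that body, the sketch below pulls out only the fields shown above and rejects payloads that lack them. The `NormalizedAlert` shape and `normalizePagerDuty` name are assumptions for this example; the actual trigger in `webhook_trigger.ts` may map fields differently.

```typescript
// Only the fields visible in the sample body are modelled; everything is
// optional because webhook payloads should not be trusted blindly.
interface PagerDutyWebhookBody {
  event?: {
    event_type?: string;
    data?: {
      title?: string;
      status?: string;
      service?: { name?: string };
    };
  };
}

// Hypothetical internal shape handed to the workflow.
interface NormalizedAlert {
  eventType: string;
  serviceName: string;
  title: string;
  status: string;
}

function normalizePagerDuty(body: PagerDutyWebhookBody): NormalizedAlert | null {
  const eventType = body.event?.event_type;
  const data = body.event?.data;
  const title = data?.title;
  const serviceName = data?.service?.name;
  // Without these three fields there is nothing useful to triage.
  if (!eventType || !title || !serviceName) return null;
  return { eventType, serviceName, title, status: data?.status ?? "unknown" };
}
```

Returning `null` instead of throwing lets the caller acknowledge the webhook and skip the event without failing the whole request.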
### Datadog
1. Go to **Datadog → Integrations → Webhooks → New Webhook**
2. Set **URL** to the webhook trigger URL
3. Set **Payload** to:
   ```json
   {
     "service_name": "$SERVICE",
     "error_message": "$ALERT_TITLE",
     "raw_payload": "$EVENT_MSG",
     "alert_id": "$ALERT_ID",
     "alert_status": "$ALERT_STATUS",
     "priority": "$PRIORITY",
     "tags": "$TAGS",
     "alert_url": "$LINK"
   }
   ```
4. On your monitors, add `@webhook-swarm` to the **Notify your team** section.
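
Every value in the Datadog payload arrives as a string, and `tags` is a single comma-separated string. A minimal sketch of tidying that up is shown below. Both helpers are hypothetical, and the fallback to a `service:<name>` tag when `service_name` is empty is an assumption of this example, not documented behaviour of SWARM.

```typescript
// Split the comma-separated "tags" string into clean, non-empty tag values.
function parseDatadogTags(tags: string): string[] {
  return tags
    .split(",")
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);
}

// Assumption: when service_name is blank, a "service:<name>" tag is an
// acceptable substitute; otherwise fall back to a fixed placeholder.
function resolveServiceName(serviceName: string, tags: string[]): string {
  const trimmed = serviceName.trim();
  if (trimmed.length > 0) return trimmed;
  const prefix = "service:";
  for (const tag of tags) {
    if (tag.indexOf(prefix) === 0 && tag.length > prefix.length) {
      return tag.slice(prefix.length);
    }
  }
  return "unknown-service";
}
```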
## Incident Lifecycle
```
1. Alert fires (PagerDuty/Datadog webhook) OR /incident declare
        │
        ▼
2. [triage.ts]  Claude 3.5 Sonnet classifies → P1/P2/P3/P4 + routing_team
        │
        ▼
3. [channel_create.ts]  #inc-{service}-{YYYYMMDD-HHMM} created + topic set
        │
   ┌────┴────┐
   ▼         ▼  (parallel)
4a. [notify_responders.ts]       4b. [enrich_context.ts]
    PagerDuty Events API v2          Datadog monitors in Alert
    → @mention routing_team          Slack RTS: recent mentions
    → Post incident card             GitHub: recent deploys
        │                                │
        └────────────┬───────────────────┘
                     ▼
5. [timeline_updater.ts] init
   Pinned timeline header posted + pinned to channel
        │
        │   < Responders join, investigate >
        │
   ┌────┴──────────────────┐
   │  Button Actions       │
   │  (async handlers)     │
   │  Acknowledge          │ → timeline entry
   │  Escalate             │ → re-pages, severity bump, timeline entry
   │  Resolve              │ → triggers post_mortem.ts
   └───────────────────────┘
        │
        ▼
6. [post_mortem.ts]
   Claude generates 8-section RCA doc
   → GitHub PR (draft) in postmortems repo
   → Jira follow-up ticket created
   → Resolution summary card posted to channel
```
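
The channel name in step 3 follows the pattern `#inc-{service}-{YYYYMMDD-HHMM}`. A minimal sketch of producing such a name is given below. It assumes UTC timestamps and relies on Slack's channel-name rules (lowercase letters, digits, hyphens and underscores, at most 80 characters). `buildIncidentChannelName` is a hypothetical name, not the function in `channel_create.ts`.

```typescript
// Build "inc-{service}-{YYYYMMDD-HHMM}" without the leading "#".
// Assumption: timestamps are rendered in UTC.
function buildIncidentChannelName(service: string, declaredAt: Date): string {
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  const stamp =
    String(declaredAt.getUTCFullYear()) +
    pad(declaredAt.getUTCMonth() + 1) +
    pad(declaredAt.getUTCDate()) +
    "-" +
    pad(declaredAt.getUTCHours()) +
    pad(declaredAt.getUTCMinutes());

  // Lowercase, collapse anything outside [a-z0-9] into single hyphens,
  // and strip hyphens from both ends.
  const slug = service
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");

  // Slack caps channel names at 80 characters: trim the slug, keep the stamp.
  const maxSlug = 80 - "inc-".length - "-".length - stamp.length;
  const safeSlug = slug.slice(0, maxSlug).replace(/-+$/, "") || "unknown";
  return "inc-" + safeSlug + "-" + stamp;
}
```

For example, a service called `Checkout API` declared at 14:23 UTC on 15 March 2024 would yield `inc-checkout-api-20240315-1423`.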
## MCP Client Reference
Each client in `mcp/` is structured identically for future MCP server compatibility.
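
To make "structured identically" concrete, here is a minimal sketch of the wrapper pattern, assuming each public method delegates to one private transport call. `ExampleMCPClient`, `Transport`, and the base URL used in the usage note are hypothetical and are not one of the four clients in `mcp/`.

```typescript
// Minimal transport types so the sketch does not depend on global fetch typings.
interface HttpRequest {
  method: string;
  headers: Record<string, string>;
  body?: string;
}
interface HttpResponse {
  ok: boolean;
  status: number;
  json: () => Promise<unknown>;
}
type Transport = (url: string, init: HttpRequest) => Promise<HttpResponse>;

// Hypothetical client illustrating the pattern only.
class ExampleMCPClient {
  private readonly baseUrl: string;
  private readonly headers: Record<string, string>;
  private readonly transport: Transport;

  constructor(baseUrl: string, headers: Record<string, string>, transport: Transport) {
    this.baseUrl = baseUrl;
    this.headers = headers;
    this.transport = transport;
  }

  // Public "tool" method: typed arguments in, parsed JSON out.
  getIncidentDetail(id: string): Promise<unknown> {
    return this.call("GET", "/incidents/" + encodeURIComponent(id));
  }

  // The only place that knows about HTTP. Replacing this body with an MCP
  // tool call would leave every public method signature untouched.
  private async call(method: string, path: string, payload?: unknown): Promise<unknown> {
    const init: HttpRequest = { method, headers: this.headers };
    if (payload !== undefined) init.body = JSON.stringify(payload);
    const res = await this.transport(this.baseUrl + path, init);
    if (!res.ok) throw new Error(method + " " + path + " failed with HTTP " + res.status);
    return res.json();
  }
}
```

In production the transport could simply forward to the global `fetch`; in tests it can be a stub, which is one way to get the "no real APIs called" property described under Running Tests.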
### `PagerDutyMCPClient`
| Method | Vendor API | Description |
|---|---|---|
| `triggerIncident(payload)` | Events API v2 | Fire an incident event |
| `acknowledgeIncident(dedupKey)` | Events API v2 | Acknowledge via dedup key |
| `resolveIncident(dedupKey)` | Events API v2 | Resolve via dedup key |
| `getOncallUsers(policyId)` | REST v2 `/oncalls` | Fetch on-call engineers |
| `getIncidentDetail(id)` | REST v2 `/incidents/{id}` | Full incident metadata |
| `listRecentAlerts(serviceId)` | REST v2 `/alerts` | Recent alerts for a service |
### `GitHubMCPClient`
| Method | Description |
|---|---|
| `createPostMortemPR(params)` | 8-step: branch → commit → labels → PR → reviewers |
| `getRecentCommits(repo, branch)` | Recent commits on a branch |
| `listDeployments(repo, env)` | Production deployments for a repo |
### `DatadogMCPClient`
| Method | Datadog API | Description |
|---|---|---|
| `searchMonitors(params)` | `/api/v1/monitor/search` | Monitors in Alert for a service |
| `searchDashboards(query)` | `/api/v1/dashboard` | Find dashboards by keyword |
| `queryMetrics(params)` | `/api/v1/query` | Time-series metric query |
| `getEventStream(params)` | `/api/v1/events` | Event stream for a service |
| `sendEvent(params)` | `/api/v1/events` POST | Annotate dashboards with incident markers |
### `JiraMCPClient`
| Method | Jira API | Description |
|---|---|---|
| `createIssue(params)` | `/rest/api/3/issue` | Create issue with ADF description |
| `searchIssues(jql)` | `/rest/api/3/search` | JQL search |
| `transitionIssue(key, name)` | `/rest/api/3/issue/{key}/transitions` | Move to new status |
| `addComment(key, text)` | `/rest/api/3/issue/{key}/comment` | Add ADF comment |
## Block Kit UI Reference
### `buildIncidentCard(params)` → `KnownBlock[]`
```
P1 CRITICAL — INC-20240315-1423-A3F2
─────────────────────────────────────────
Service: checkout-api  │  Team: @oncall-checkout
Declared: 14:23Z       │  Status: Triaged · Paged
─────────────────────────────────────────
AI Triage Summary:
> Critical: checkout-api is experiencing a 40% error rate...
─────────────────────────────────────────
Paged: alice@example.com, bob@example.com
─────────────────────────────────────────
[Acknowledge]  [Escalate]  [Resolve]
─────────────────────────────────────────
ID: INC-20240315-1423-A3F2 · 📟 PagerDuty Incident
```
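
The card above maps naturally onto Block Kit's `header`, `section`, and `actions` blocks, with a `confirm` dialog on each button. The sketch below is a reduced approximation under that assumption. The parameter shape, the `action_id` values, and the function names are hypothetical, and the real `buildIncidentCard` returns typed `KnownBlock[]` rather than plain records.

```typescript
type Block = Record<string, unknown>;

interface IncidentCardParams {
  incidentId: string;
  severity: string;
  service: string;
  team: string;
  summary: string;
}

// A button with a confirmation dialog. The action_id values passed in below
// are hypothetical; use whatever your action handlers register.
function confirmButton(label: string, actionId: string, incidentId: string, style?: "primary" | "danger"): Block {
  const button: Block = {
    type: "button",
    text: { type: "plain_text", text: label },
    action_id: actionId,
    value: incidentId,
    confirm: {
      title: { type: "plain_text", text: label + " incident?" },
      text: { type: "plain_text", text: "This applies to " + incidentId + "." },
      confirm: { type: "plain_text", text: label },
      deny: { type: "plain_text", text: "Cancel" },
    },
  };
  if (style !== undefined) button["style"] = style;
  return button;
}

function buildIncidentCardSketch(p: IncidentCardParams): Block[] {
  return [
    { type: "header", text: { type: "plain_text", text: p.severity + " · " + p.incidentId } },
    {
      type: "section",
      fields: [
        { type: "mrkdwn", text: "*Service:* " + p.service },
        { type: "mrkdwn", text: "*Team:* @" + p.team },
      ],
    },
    { type: "section", text: { type: "mrkdwn", text: "*AI Triage Summary:*\n> " + p.summary } },
    {
      type: "actions",
      elements: [
        confirmButton("Acknowledge", "incident_ack", p.incidentId, "primary"),
        confirmButton("Escalate", "incident_escalate", p.incidentId),
        confirmButton("Resolve", "incident_resolve", p.incidentId, "danger"),
      ],
    },
  ];
}
```

Carrying the incident ID in each button's `value` lets the action handler identify the incident without parsing the message text.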
### `buildPostMortemCard(params)` → `KnownBlock[]`
```
RESOLVED — INC-20240315-1423-A3F2
─────────────────────────────────────────
Service: checkout-api  │  Severity: P1
Duration: 47m          │  Resolved By: @alice
─────────────────────────────────────────
Resolved At: 15:10Z    │  Team: @oncall-checkout
─────────────────────────────────────────
Incident Summary:
> Critical: checkout-api experienced...
─────────────────────────────────────────
Resolution Notes:
Rolled back deployment v2.5.1...
─────────────────────────────────────────
[Review Post-Mortem PR]  [Jira Follow-Up]  [Copy Key]
─────────────────────────────────────────
Generated by SWARM · Post-mortem PR: view · Jira: view
```
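
The `Duration` field is the gap between declaration and resolution: in the two sample cards, 14:23Z to 15:10Z gives 47 minutes. A minimal sketch of that calculation follows. The `formatDuration` name and the `2h 5m` style for longer incidents are assumptions of this example.

```typescript
// Render the elapsed time between two instants as "47m" or "2h 5m".
// Partial minutes are dropped and negative spans are clamped to zero.
function formatDuration(declaredAt: Date, resolvedAt: Date): string {
  const elapsedMs = Math.max(0, resolvedAt.getTime() - declaredAt.getTime());
  const totalMinutes = Math.floor(elapsedMs / 60000);
  const hours = Math.floor(totalMinutes / 60);
  const minutes = totalMinutes % 60;
  return hours > 0 ? hours + "h " + minutes + "m" : minutes + "m";
}
```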
## Running Tests
```bash
# Unit tests (no network, no env vars required)
deno test --allow-env tests/triage.test.ts

# E2E integration tests (uses fetch stubs, no real APIs called)
deno test --allow-env tests/e2e_incident.test.ts

# All tests
deno test --allow-env tests/

# With verbose output
deno test --allow-env --reporter=verbose tests/
```
### Test Coverage Summary
| Test File | Suites | Tests | Network Calls |
|---|---|---|---|
| `triage.test.ts` | 8 | 20 | None (pure functions) |
| `e2e_incident.test.ts` | 8 | 18 | Stubbed via mock fetch |
## Extending SWARM
### Add a new routing team
1. Add the team to `TEAM_ROUTING_MAP` in `functions/triage.ts`
2. Add the PagerDuty policy ID to `PAGERDUTY_POLICY_MAP` in `functions/notify_responders.ts`
3. Run `slack env add PD_POLICY_<TEAM> <policy-id>`
4. Create the Slack usergroup `oncall-<team>`
### Add a new MCP integration
1. Create `mcp/<vendor>.ts` following the pattern of existing clients
2. Add the vendor domain to `outgoingDomains` in `manifest.ts`
3. Add the domain to `net` permissions in `slack.json`
4. Import and call the client from the relevant function
### Change the AI model
In `functions/triage.ts` and `functions/post_mortem.ts`:
```typescript
const CLAUDE_MODEL = "claude-3-5-sonnet-20241022"; // ← change this
```
Supported values: `claude-3-5-sonnet-20241022`, `claude-3-opus-20240229`, `claude-3-haiku-20240307`
## Troubleshooting
| Symptom | Likely Cause | Fix |
|---|---|---|
| `ANTHROPIC_API_KEY not set` | Env var missing | `slack env add ANTHROPIC_API_KEY sk-ant-...` |
| Channel not created | Bot missing `channels:manage` scope | Reinstall app: `slack install` |
| PagerDuty not paging | Wrong routing key or policy ID | Check `PD_POLICY_*` env vars |
| `outgoing_domain_not_allowed` | Missing domain in manifest | Add to `outgoingDomains` in `manifest.ts` |
| Timeline not pinned | Bot missing `pins:write` scope | Reinstall app |
| GitHub PR not created | Repo doesn't exist or wrong org | Check `GITHUB_ORG` + `GITHUB_POSTMORTEM_REPO` |
| Claude returns non-JSON | Model changed behaviour | Check `parseTriageDecision` fence-stripping |
| `name_taken` on channel create | Rapid duplicate webhooks | Handled automatically via `conversations.list` fallback |
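
For the "Claude returns non-JSON" row, the sketch below shows one way fence-stripping can work. It is not the repo's `parseTriageDecision`: the `parseTriageSketch` name, the `severity` field name, and the validation rules are assumptions, and only `routing_team` and the P1 to P4 range come from this README.

```typescript
// Three backticks, built at runtime so this example nests safely in Markdown.
const FENCE = "\u0060".repeat(3);
const SEVERITIES = ["P1", "P2", "P3", "P4"] as const;
type Severity = typeof SEVERITIES[number];

interface TriageDecisionSketch {
  severity: Severity;
  routing_team: string;
}

function parseTriageSketch(raw: string): TriageDecisionSketch | null {
  // 1. Drop an opening fence (with optional language tag) and a closing fence.
  const opening = new RegExp("^" + FENCE + "[a-zA-Z]*\\s*");
  const closing = new RegExp("\\s*" + FENCE + "$");
  const unfenced = raw.trim().replace(opening, "").replace(closing, "");

  // 2. Keep only the outermost {...} span in case prose surrounds the JSON.
  const start = unfenced.indexOf("{");
  const end = unfenced.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  // 3. Parse defensively and validate both fields before trusting them.
  let parsed: unknown;
  try {
    parsed = JSON.parse(unfenced.slice(start, end + 1));
  } catch (_err) {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const fields = parsed as Record<string, unknown>;
  const severity = SEVERITIES.find((s) => s === fields["severity"]);
  const team = fields["routing_team"];
  if (severity === undefined || typeof team !== "string" || team.length === 0) return null;
  return { severity, routing_team: team };
}
```

Returning `null` on any malformed reply leaves the caller free to apply a conservative default instead of failing the workflow.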
### Viewing live logs
```bash
# Stream all function execution logs
slack activity --tail

# Filter by function
slack activity --tail | grep triage

# View a specific run's output
slack activity --run
```
## License
MIT — see [LICENSE](LICENSE)
*Built with Slack Next-Gen Platform, Anthropic Claude, and the Model Context Protocol.*