SushanthKS06/opsgenie-slack-agent

GitHub: SushanthKS06/opsgenie-slack-agent

Stars: 0 | Forks: 0

# OpsGenie: Slack Workflow Autonomous Response Machine

### AI Incident Command Center

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Project Structure](#project-structure)
3. [Prerequisites](#prerequisites)
4. [Environment Variables](#environment-variables)
5. [Installation & Setup](#installation--setup)
6. [Deployment](#deployment)
7. [Configuring Inbound Webhooks](#configuring-inbound-webhooks)
8. [Incident Lifecycle](#incident-lifecycle)
9. [MCP Client Reference](#mcp-client-reference)
10. [Block Kit UI Reference](#block-kit-ui-reference)
11. [Running Tests](#running-tests)
12. [Extending SWARM](#extending-swarm)
13. [Troubleshooting](#troubleshooting)

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────────────────┐
│                            SWARM Architecture                            │
│                                                                          │
│  PagerDuty / Datadog                                                     │
│         │ Webhook POST                                                   │
│         ▼                                                                │
│  ┌─────────────┐    /incident declare    ┌──────────────────┐            │
│  │   Webhook   │◄───────────────────────►│  Slash Command   │            │
│  │   Trigger   │                         │     Trigger      │            │
│  └──────┬──────┘                         └────────┬─────────┘            │
│         │                                         │                      │
│         └────────────────┬────────────────────────┘                      │
│                          ▼                                               │
│              ┌───────────────────────┐                                   │
│              │   incident_workflow   │  ← Slack Orchestrator             │
│              └───────────┬───────────┘                                   │
│                          │                                               │
│          ┌───────────────┼───────────────┐                               │
│          │               │               │                               │
│          ▼               ▼               ▼                               │
│    ┌───────────┐  ┌─────────────┐  ┌────────────┐                        │
│    │ triage.ts │  │channel_     │  │ (parallel) │                        │
│    │  Claude   │  │  create.ts  │  │            │                        │
│    │    3.5    │  │             │  │notify_ +   │                        │
│    │  Sonnet   │  │#inc-{svc}-  │  │enrich_     │                        │
│    │  P1–P4    │  │{timestamp}  │  │context.ts  │                        │
│    └───────────┘  └─────────────┘  └────────────┘                        │
│                                          │                               │
│                      ┌───────────────────┤                               │
│                      │                   │                               │
│              ┌───────▼──────┐    ┌───────▼──────┐                        │
│              │ PagerDuty    │    │ Datadog MCP  │                        │
│              │ MCP Client   │    │ + Slack RTS  │                        │
│              │ + GitHub MCP │    │ + GitHub     │                        │
│              └──────────────┘    │ Deployments  │                        │
│                                  └──────────────┘                        │
│                                                                          │
│  "Resolve" button ──► post_mortem.ts                                     │
│                             │                                            │
│             ┌───────────────┴──────────────┐                             │
│             │                              │                             │
│      ┌──────▼──────┐              ┌────────▼──────┐                      │
│      │ GitHub MCP  │              │   Jira MCP    │                      │
│      │ Post-Mortem │              │   Follow-up   │                      │
│      │ PR (Draft)  │              │    Ticket     │                      │
│      └─────────────┘              └───────────────┘                      │
└──────────────────────────────────────────────────────────────────────────┘
```

### Key Design Decisions

| Decision | Rationale |
|---|---|
| **Post-mortem outside main workflow** | Incidents can last hours; Slack workflows have time limits. The `post_mortem` function is triggered by the "Resolve" button action, not the workflow, keeping both paths short-lived. |
| **Steps 3a + 3b run in parallel** | `notify_responders` and `enrich_context` have no data dependency on each other. The Slack platform detects this and executes them concurrently, shaving ~10s off P1 response time. |
| **All enrichment sources are non-fatal** | If Datadog, GitHub Deployments, or Slack RTS fails, the incident channel still opens and responders are still paged. Partial enrichment is always better than no incident channel. |
| **MCP clients are HTTP wrappers now** | Real MCP servers are still maturing. Each client is structured identically to a future MCP tool call so the `fetch()` internals can be swapped with `mcpCall()` with zero API surface changes. |

## Project Structure

```
opsgenie-slack-agent/
├── manifest.ts                  # App manifest: scopes, functions, workflow, domains
├── slack.json                   # Runtime config: Deno permissions, env var declarations
├── import_map.json              # Pinned Deno module versions
│
├── functions/
│   ├── triage.ts                # Claude 3.5 Sonnet → P1–P4 + routing_team
│   ├── channel_create.ts        # channels.create → #inc-{service}-{timestamp}
│   ├── notify_responders.ts     # PagerDuty Events API + Slack @mention + incident card
│   ├── enrich_context.ts        # Datadog MCP + Slack RTS + GitHub Deployments
│   ├── timeline_updater.ts      # Pinned timeline: init / ack / escalate / resolve
│   └── post_mortem.ts           # Claude RCA doc → GitHub PR + Jira ticket
│
├── triggers/
│   ├── webhook_trigger.ts       # PagerDuty/Datadog inbound webhook + filter rules
│   └── slash_command.ts         # /incident declare shortcut
│
├── workflows/
│   └── incident_workflow.ts     # 5-step orchestrator (steps 3a+3b parallel)
│
├── mcp/
│   ├── pagerduty.ts             # 6 tools: trigger/ack/resolve/oncall/detail/alerts
│   ├── github.ts                # 3 tools: createPostMortemPR/commits/deployments
│   ├── jira.ts                  # 4 tools: create/search/transition/comment
│   └── datadog.ts               # 5 tools: monitors/dashboards/metrics/events/sendEvent
│
├── blocks/
│   ├── incident_card.ts         # Incident header + metadata + 3 action buttons w/ confirms
│   └── postmortem_card.ts       # Resolution summary + conditional PR/Jira buttons
│
└── tests/
    ├── triage.test.ts           # 20 unit tests: parsing, routing, sanitisation
    └── e2e_incident.test.ts     # 18 E2E tests: full lifecycle with fetch stubs
```

## Prerequisites

| Tool | Version | Install |
|---|---|---|
| [Slack CLI](https://api.slack.com/automation/cli/install) | ≥ 2.x | `curl -fsSL https://downloads.slack-edge.com/slack-cli/install.sh \| bash` |
| [Deno](https://deno.land/) | ≥ 1.40 | `curl -fsSL https://deno.land/install.sh \| sh` |
| Slack Workspace | Admin access | Required for app deployment |

## Environment Variables

Set all secrets via the Slack CLI. **Never commit values to source control**:

```bash
slack env add ANTHROPIC_API_KEY sk-ant-...
slack env add PAGERDUTY_API_KEY your-pd-api-key
slack env add PAGERDUTY_ROUTING_KEY your-pd-routing-key
slack env add DATADOG_API_KEY your-dd-api-key
slack env add DATADOG_APP_KEY your-dd-app-key
slack env add GITHUB_TOKEN ghp_...
slack env add GITHUB_ORG your-org-name
slack env add GITHUB_POSTMORTEM_REPO postmortems
slack env add JIRA_API_TOKEN your-jira-token
slack env add JIRA_EMAIL you@your-org.com
slack env add JIRA_BASE_URL https://your-org.atlassian.net
slack env add JIRA_PROJECT_KEY OPS
```

**Optional: PagerDuty escalation policy IDs per team:**

```bash
slack env add PD_POLICY_PLATFORM PXXXXXX
slack env add PD_POLICY_CHECKOUT PXXXXXX
slack env add PD_POLICY_PAYMENTS PXXXXXX
slack env add PD_POLICY_DATA PXXXXXX
slack env add PD_POLICY_INFRA PXXXXXX
slack env add PD_POLICY_SECURITY PXXXXXX
slack env add PD_POLICY_ML PXXXXXX
slack env add PD_POLICY_FRONTEND PXXXXXX
```

## Installation & Setup

```bash
# 1. Clone the repo
git clone https://github.com/your-org/opsgenie-slack-agent
cd opsgenie-slack-agent

# 2. Login to Slack CLI
slack login

# 3. Run locally (hot-reload dev mode)
slack run

# 4. In a new terminal, check the trigger URL
slack triggers list --app
```

### First-time workspace setup

After `slack run` succeeds, configure your Slack usergroups so SWARM can @mention them:

1. Create usergroups matching the handles in `triage.ts`: `oncall-platform`, `oncall-checkout`, `oncall-payments`, `oncall-data`, `oncall-infra`, `oncall-security`, `oncall-ml`, `oncall-frontend`
2. Add the relevant engineers to each usergroup.
3. Update `PAGERDUTY_POLICY_MAP` in `notify_responders.ts` with your actual escalation policy IDs from PagerDuty.

## Deployment

```bash
# Deploy to production
slack deploy

# Verify deployed functions
slack functions list

# Check environment variables are set
slack env list

# View logs
slack activity --tail
```

## Configuring Inbound Webhooks

### PagerDuty

1. Go to **PagerDuty → Services → [Your Service] → Integrations → Add Integration**
2. Select **Generic Webhook (v3)**
3. Set the **Webhook URL** to the URL from `slack triggers list`
4. Under **Events to send**, enable: `incident.triggered`, `incident.acknowledged`, `incident.resolved`
5. Set **Method**: POST, **Content-Type**: application/json

PagerDuty will send a body with this structure:

```json
{
  "event": {
    "event_type": "incident.triggered",
    "data": {
      "title": "High error rate on checkout-api",
      "status": "triggered",
      "service": { "name": "checkout-api" },
      ...
    }
  }
}
```

### Datadog

1. Go to **Datadog → Integrations → Webhooks → New Webhook**
2. Set **URL** to the webhook trigger URL
3. Set **Payload** to:

   ```json
   {
     "service_name": "$SERVICE",
     "error_message": "$ALERT_TITLE",
     "raw_payload": "$EVENT_MSG",
     "alert_id": "$ALERT_ID",
     "alert_status": "$ALERT_STATUS",
     "priority": "$PRIORITY",
     "tags": "$TAGS",
     "alert_url": "$LINK"
   }
   ```

4. On your monitors, add `@webhook-swarm` to the **Notify your team** section.

## Incident Lifecycle

```
1. Alert fires (PagerDuty/Datadog webhook) OR /incident declare
        │
        ▼
2. [triage.ts] Claude 3.5 Sonnet classifies → P1/P2/P3/P4 + routing_team
        │
        ▼
3. [channel_create.ts] #inc-{service}-{YYYYMMDD-HHMM} created + topic set
        │
   ┌────┴────┐
   ▼         ▼   (parallel)
4a. [notify_responders.ts]       4b. [enrich_context.ts]
    PagerDuty Events API v2          Datadog monitors in Alert
    → @mention routing_team          Slack RTS: recent mentions
    → Post incident card             GitHub: recent deploys
        │                               │
        └────────────┬──────────────────┘
                     ▼
5. [timeline_updater.ts] init
   Pinned timeline header posted + pinned to channel
        │
        │   < Responders join, investigate >
        │
   ┌────┴──────────────────┐
   │ Button Actions        │
   │ (async handlers)      │
   │ Acknowledge           │ → timeline entry
   │ Escalate              │ → re-pages, severity bump, timeline entry
   │ Resolve               │ → triggers post_mortem.ts
   └───────────────────────┘
        │
        ▼
6. [post_mortem.ts]
   Claude generates 8-section RCA doc
   → GitHub PR (draft) in postmortems repo
   → Jira follow-up ticket created
   → Resolution summary card posted to channel
```

## MCP Client Reference

Each client in `mcp/` is structured identically for future MCP server compatibility.

### `PagerDutyMCPClient`

| Method | Vendor API | Description |
|---|---|---|
| `triggerIncident(payload)` | Events API v2 | Fire an incident event |
| `acknowledgeIncident(dedupKey)` | Events API v2 | Acknowledge via dedup key |
| `resolveIncident(dedupKey)` | Events API v2 | Resolve via dedup key |
| `getOncallUsers(policyId)` | REST v2 `/oncalls` | Fetch on-call engineers |
| `getIncidentDetail(id)` | REST v2 `/incidents/{id}` | Full incident metadata |
| `listRecentAlerts(serviceId)` | REST v2 `/alerts` | Recent alerts for a service |

### `GitHubMCPClient`

| Method | Description |
|---|---|
| `createPostMortemPR(params)` | 8-step: branch → commit → labels → PR → reviewers |
| `getRecentCommits(repo, branch)` | Recent commits on a branch |
| `listDeployments(repo, env)` | Production deployments for a repo |

### `DatadogMCPClient`

| Method | Datadog API | Description |
|---|---|---|
| `searchMonitors(params)` | `/api/v1/monitor/search` | Monitors in Alert for a service |
| `searchDashboards(query)` | `/api/v1/dashboard` | Find dashboards by keyword |
| `queryMetrics(params)` | `/api/v1/query` | Time-series metric query |
| `getEventStream(params)` | `/api/v1/events` | Event stream for a service |
| `sendEvent(params)` | `/api/v1/events` POST | Annotate dashboards with incident markers |

### `JiraMCPClient`

| Method | Jira API | Description |
|---|---|---|
| `createIssue(params)` | `/rest/api/3/issue` | Create issue with ADF description |
| `searchIssues(jql)` | `/rest/api/3/search` | JQL search |
| `transitionIssue(key, name)` | `/rest/api/3/issue/{key}/transitions` | Move to new status |
| `addComment(key, text)` | `/rest/api/3/issue/{key}/comment` | Add ADF comment |

## Block Kit UI Reference

### `buildIncidentCard(params)` → `KnownBlock[]`

```
P1 CRITICAL — INC-20240315-1423-A3F2
─────────────────────────────────────────
Service: checkout-api  │  Team: @oncall-checkout
Declared: 14:23Z       │  Status: Triaged · Paged
─────────────────────────────────────────
AI Triage Summary:
> Critical: checkout-api is experiencing a 40% error rate...
─────────────────────────────────────────
Paged: alice@example.com, bob@example.com
─────────────────────────────────────────
[Acknowledge]  [Escalate]  [Resolve]
─────────────────────────────────────────
ID: INC-20240315-1423-A3F2 · 📟 PagerDuty Incident
```

### `buildPostMortemCard(params)` → `KnownBlock[]`

```
RESOLVED — INC-20240315-1423-A3F2
─────────────────────────────────────────
Service: checkout-api  │  Severity: P1
Duration: 47m          │  Resolved By: @alice
─────────────────────────────────────────
Resolved At: 15:10Z    │  Team: @oncall-checkout
─────────────────────────────────────────
Incident Summary:
> Critical: checkout-api experienced...
─────────────────────────────────────────
Resolution Notes:
Rolled back deployment v2.5.1...
─────────────────────────────────────────
[Review Post-Mortem PR]  [Jira Follow-Up]  [Copy Key]
─────────────────────────────────────────
Generated by SWARM · Post-mortem PR: view · Jira: view
```

## Running Tests

```bash
# Unit tests (no network, no env vars required)
deno test --allow-env tests/triage.test.ts

# E2E integration tests (uses fetch stubs, no real APIs called)
deno test --allow-env tests/e2e_incident.test.ts

# All tests
deno test --allow-env tests/

# With verbose output
deno test --allow-env --reporter=verbose tests/
```

### Test Coverage Summary

| Test File | Suites | Tests | Network Calls |
|---|---|---|---|
| `triage.test.ts` | 8 | 20 | None (pure functions) |
| `e2e_incident.test.ts` | 8 | 18 | Stubbed via mock fetch |

## Extending SWARM

### Add a new routing team

1. Add the team to `TEAM_ROUTING_MAP` in `functions/triage.ts`
2. Add the PagerDuty policy ID to `PAGERDUTY_POLICY_MAP` in `functions/notify_responders.ts`
3. Run `slack env add PD_POLICY_ `
4. Create the Slack usergroup `oncall-`

### Add a new MCP integration

1. Create `mcp/.ts` following the pattern of existing clients
2. Add the vendor domain to `outgoingDomains` in `manifest.ts`
3. Add the domain to `net` permissions in `slack.json`
4. Import and call the client from the relevant function

### Change the AI model

In `functions/triage.ts` and `functions/post_mortem.ts`:

```ts
const CLAUDE_MODEL = "claude-3-5-sonnet-20241022"; // ← change this
```

Supported values: `claude-3-5-sonnet-20241022`, `claude-3-opus-20240229`, `claude-3-haiku-20240307`

## Troubleshooting

| Symptom | Likely Cause | Fix |
|---|---|---|
| `ANTHROPIC_API_KEY not set` | Env var missing | `slack env add ANTHROPIC_API_KEY sk-ant-...` |
| Channel not created | Bot missing `channels:manage` scope | Reinstall app: `slack install` |
| PagerDuty not paging | Wrong routing key or policy ID | Check `PD_POLICY_*` env vars |
| `outgoing_domain_not_allowed` | Missing domain in manifest | Add to `outgoingDomains` in `manifest.ts` |
| Timeline not pinned | Bot missing `pins:write` scope | Reinstall app |
| GitHub PR not created | Repo doesn't exist or wrong org | Check `GITHUB_ORG` + `GITHUB_POSTMORTEM_REPO` |
| Claude returns non-JSON | Model changed behaviour | Check `parseTriageDecision` fence-stripping |
| `name_taken` on channel create | Rapid duplicate webhooks | Handled automatically via `conversations.list` fallback |

### Viewing live logs

```bash
# Stream all function execution logs
slack activity --tail

# Filter by function
slack activity --tail | grep triage

# View a specific run's output
slack activity --run
```

## License

MIT, see [LICENSE](LICENSE)

*Built using the Slack Next-Gen Platform, Anthropic Claude, and the Model Context Protocol.*
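The channel naming scheme from the Incident Lifecycle, `#inc-{service}-{YYYYMMDD-HHMM}`, can be sketched as a small pure function. This is a minimal illustration and not the code in `channel_create.ts`: the name `buildChannelName`, the choice of UTC, and the sanitisation rules (lowercase, runs of other characters collapsed to a hyphen, 80-character cap to stay within Slack's channel-name limit) are all assumptions.

```typescript
// Hypothetical helper: builds an incident channel name of the form
// inc-{service}-{YYYYMMDD-HHMM}. Not taken from channel_create.ts.
function buildChannelName(service: string, at: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");

  // Assumption: the timestamp is rendered in UTC.
  const stamp =
    `${at.getUTCFullYear()}${pad(at.getUTCMonth() + 1)}${pad(at.getUTCDate())}` +
    `-${pad(at.getUTCHours())}${pad(at.getUTCMinutes())}`;

  // Slack channel names allow lowercase letters, digits, hyphens and
  // underscores; this sketch maps everything else to single hyphens.
  const slug = service
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");

  const prefix = "inc-";
  const suffix = `-${stamp}`;
  // Truncate only the service part so the timestamp always survives.
  const maxSlug = 80 - prefix.length - suffix.length;
  return `${prefix}${slug.slice(0, maxSlug)}${suffix}`;
}
```

For example, `buildChannelName("checkout-api", new Date(Date.UTC(2024, 2, 15, 14, 23)))` returns `inc-checkout-api-20240315-1423`. Truncating the service slug rather than the whole string keeps the timestamp intact, which matters if the timestamp is what distinguishes two incidents on the same service.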
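The design decision that all enrichment sources are non-fatal can be illustrated with a per-source `try`/`catch` around concurrent calls. The sketch below is not the contents of `enrich_context.ts`; `EnrichmentSource`, `EnrichmentResult`, `enrichAll`, and the `load` callback are hypothetical names standing in for the real Datadog, Slack RTS, and GitHub Deployments calls.

```typescript
// Hypothetical shape of one enrichment source (e.g. Datadog, Slack RTS,
// GitHub Deployments). `load` stands in for the real client call.
interface EnrichmentSource {
  name: string;
  load: () => Promise<string[]>;
}

interface EnrichmentResult {
  context: Record<string, string[]>; // lines from sources that succeeded
  failed: string[];                  // names of sources that failed
}

// Runs every source concurrently. A failure in one source is recorded,
// never rethrown, so the caller always gets a (possibly partial) result.
async function enrichAll(sources: EnrichmentSource[]): Promise<EnrichmentResult> {
  const outcomes = await Promise.all(
    sources.map(async (source) => {
      try {
        return { name: source.name, lines: await source.load() };
      } catch (_err) {
        return { name: source.name, lines: null };
      }
    }),
  );

  const result: EnrichmentResult = { context: {}, failed: [] };
  for (const outcome of outcomes) {
    if (outcome.lines === null) result.failed.push(outcome.name);
    else result.context[outcome.name] = outcome.lines;
  }
  return result;
}
```

Because each failure is caught inside its own mapper, `Promise.all` never rejects here, and a caller could still post whatever context arrived while listing the failed sources in the channel.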
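The Troubleshooting entry for "Claude returns non-JSON" points at fence-stripping in `parseTriageDecision`. The sketch below shows one way such a step could work; it is not the repository's implementation. The names `stripCodeFence` and `parseTriageJson`, the `TriageDecision` shape (`severity`, `routing_team`), and returning `null` on any failure are assumptions made for illustration.

```typescript
// Hypothetical shape; the real triage output may carry more fields.
interface TriageDecision {
  severity: "P1" | "P2" | "P3" | "P4";
  routing_team: string;
}

const SEVERITIES = ["P1", "P2", "P3", "P4"];

// Built from parts so this sample contains no literal fence marker.
const FENCE = "`".repeat(3);
const FENCED = new RegExp("^" + FENCE + "[a-zA-Z]*\\s*([\\s\\S]*?)\\s*" + FENCE + "$");

// Removes a single surrounding Markdown code fence (with optional language
// tag) if present; otherwise returns the trimmed input unchanged.
function stripCodeFence(raw: string): string {
  const trimmed = raw.trim();
  const match = trimmed.match(FENCED);
  const inner = match ? match[1] : undefined;
  return (inner ?? trimmed).trim();
}

// Parses model output into a TriageDecision, or null if the text is not
// valid JSON or fails the (assumed) shape checks.
function parseTriageJson(raw: string): TriageDecision | null {
  try {
    const parsed: unknown = JSON.parse(stripCodeFence(raw));
    if (typeof parsed !== "object" || parsed === null) return null;
    const { severity, routing_team } = parsed as Record<string, unknown>;
    if (typeof severity !== "string" || SEVERITIES.indexOf(severity) === -1) return null;
    if (typeof routing_team !== "string" || routing_team.length === 0) return null;
    return { severity: severity as TriageDecision["severity"], routing_team };
  } catch (_err) {
    return null;
  }
}
```

Returning `null` instead of throwing leaves the fallback policy (for example, defaulting to a conservative severity) to the caller; what the actual function does in that case is not stated in this README.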