raman01211/incident-response-automation

Stars: 1 | Forks: 0

# Incident Response Automation

![Python](https://img.shields.io/badge/python-3.11-blue) ![PagerDuty](https://img.shields.io/badge/integration-PagerDuty-orange) ![Slack](https://img.shields.io/badge/integration-Slack-green) ![Docker](https://img.shields.io/badge/container-Docker-2496ED) ![Prometheus](https://img.shields.io/badge/monitoring-Prometheus-E6522C)

## Architecture

    ┌─────────────┐     ┌──────────────┐     ┌──────────────────┐
    │ Prometheus  │────▶│ Alertmanager │────▶│   Incident API   │
    │  (alerts)   │     │  (routing)   │     │     (Flask)      │
    └─────────────┘     └──────────────┘     └────────┬─────────┘
                                                      │
                        ┌─────────────────────────────┤
                        │             │               │
                  ┌─────▼────┐  ┌─────▼────┐   ┌──────▼─────┐
                  │PostgreSQL│  │  Slack   │   │  Grafana   │
                  │(storage) │  │(notify)  │   │(dashboard) │
                  └──────────┘  └──────────┘   └────────────┘
                                      │
                                ┌─────▼────┐
                                │PagerDuty │
                                │(on-call) │
                                └──────────┘

## Features

- **Alert Ingestion**: Receives webhooks from Alertmanager and PagerDuty
- **Incident Management**: Create, acknowledge, resolve with full lifecycle
- **Slack Notifications**: Rich Block Kit messages for every state change
- **Automated Runbooks**: K8s pod restart, scale-up, rollback, health checks
- **Blameless Postmortems**: Auto-generated markdown reports with timeline & action items
- **Dashboard**: Grafana dashboards for real-time incident visibility
- **On-Call Integration**: PagerDuty schedule-aware escalation

## Quick Start

    # Start the stack
    make up

    # Simulate an alert
    make simulate-alert

    # Simulate a full incident lifecycle
    make simulate-incident

    # List incidents
    make list-incidents

    # Generate postmortem
    make generate-postmortem

    # Open Grafana dashboard
    make dashboard

    # Tear down
    make down

## Prerequisites

- Docker & Docker Compose
- Python 3.11+
- Kubernetes cluster (for runbook execution)
- Slack webhook URL (optional, for notifications)
- PagerDuty API key (optional, for on-call sync)

## Commands

| Command | Description |
|---------|-------------|
| `make up` | Start all services |
| `make down` | Stop all services |
| `make simulate-alert` | Send test Alertmanager webhook |
| `make simulate-incident` | Run full incident lifecycle simulation |
| `make list-incidents` | List all incidents |
| `make generate-postmortem` | Generate postmortem for latest incident |
| `make dashboard` | Open Grafana |
| `make clean` | Remove all data volumes |
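
To make the alert-ingestion and incident-lifecycle features listed above more concrete, here is a minimal, standard-library-only sketch of how an Alertmanager webhook payload might be turned into incidents that are then acknowledged and resolved. It is not the repository's actual code: the names `Incident`, `TRANSITIONS`, `incidents_from_webhook`, and `SAMPLE_PAYLOAD`, the `open` status, and the choice to use the `summary` annotation as the title are all assumptions made for illustration. Only the payload shape (`alerts`, `status`, `labels`, `annotations`) follows the Alertmanager webhook format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed lifecycle: open -> acknowledged -> resolved (open may also resolve directly).
TRANSITIONS = {
    "open": {"acknowledged", "resolved"},
    "acknowledged": {"resolved"},
    "resolved": set(),
}


@dataclass
class Incident:
    """Hypothetical in-memory incident record."""
    title: str
    severity: str
    status: str = "open"
    timeline: list = field(default_factory=list)

    def transition(self, new_status: str) -> None:
        """Move to new_status and record a timestamped timeline entry."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        self.status = new_status
        self.timeline.append((datetime.now(timezone.utc), new_status))


def incidents_from_webhook(payload: dict) -> list[Incident]:
    """Create one Incident per firing alert in an Alertmanager webhook payload."""
    incidents = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not open new incidents in this sketch
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        incidents.append(
            Incident(
                title=annotations.get("summary") or labels.get("alertname", "unknown"),
                severity=labels.get("severity", "unknown"),
            )
        )
    return incidents


# Made-up sample payload in the Alertmanager webhook shape.
SAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighErrorRate", "severity": "critical"},
            "annotations": {"summary": "High error rate on checkout"},
        },
        {
            "status": "resolved",
            "labels": {"alertname": "DiskFull", "severity": "warning"},
            "annotations": {},
        },
    ],
}

if __name__ == "__main__":
    incident = incidents_from_webhook(SAMPLE_PAYLOAD)[0]
    incident.transition("acknowledged")
    incident.transition("resolved")
    print(incident.title, incident.severity, incident.status)
```

Keeping the allowed transitions in a single table means an invalid state change (for example, reopening a resolved incident) fails loudly with a `ValueError`, and the recorded timeline is the kind of data a postmortem report could later be generated from.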