akshayapannir-sre/sre-runbooks
GitHub: akshayapannir-sre/sre-runbooks
Stars: 0 | Forks: 0
# SRE Runbooks & Incident Playbooks
## 📁 Structure
sre-runbooks/
├── incidents/
│ ├── high-cpu-ec2.md
│ ├── pod-crashloopbackoff.md
│ ├── disk-pressure-node.md
│ └── jenkins-pipeline-failure.md
├── postmortems/
│ └── postmortem-template.md
├── kubernetes/
│ ├── node-not-ready.md
│ ├── pvc-stuck-terminating.md
│ └── cluster-upgrade-checklist.md
├── aws/
│ ├── ec2-instance-recovery.md
│ └── rds-failover-playbook.md
└── templates/
└── on-call-handoff.md
## 🚨 Incident Runbooks
| Runbook | Severity | Service |
|---|---|---|
| [High CPU on EC2](incidents/high-cpu-ec2.md) | P2 | EC2 |
| [Pod CrashLoopBackOff](incidents/pod-crashloopbackoff.md) | P1 | Kubernetes |
| [Disk Pressure on Node](incidents/disk-pressure-node.md) | P2 | Kubernetes |
| [Jenkins Pipeline Failure](incidents/jenkins-pipeline-failure.md) | P3 | CI/CD |
| [Node Not Ready](kubernetes/node-not-ready.md) | P1 | Kubernetes |
| [PVC Stuck Terminating](kubernetes/pvc-stuck-terminating.md) | P2 | Kubernetes |
## 📋 Postmortem Template
Every incident gets a postmortem. See [postmortem-template.md](postmortems/postmortem-template.md)
## 🔧 Based On Real Experience
- Self-managed Kubernetes cluster operations (v1.32 → v1.34)
- AWS infrastructure (EC2, EKS, RDS, VPC)
- Jenkins CI/CD pipeline management
- GCP → AWS migration incident handling
- SOC 2 compliance operations
## 📖 References
- [Google SRE Book](https://sre.google/sre-book/table-of-contents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)