Scoutflo/Scoutflo-SRE-Playbooks

GitHub: Scoutflo/Scoutflo-SRE-Playbooks

Stars: 71 | Forks: 18

# SRE Playbooks Repository [![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE) [![Contributions Welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md) [![GitHub Issues](https://img.shields.io/github/issues/Scoutflo/scoutflo-SRE-Playbooks)](https://github.com/Scoutflo/scoutflo-SRE-Playbooks/issues) [![GitHub Stars](https://img.shields.io/github/stars/Scoutflo/scoutflo-SRE-Playbooks)](https://github.com/Scoutflo/scoutflo-SRE-Playbooks/stargazers) [![GitHub Forks](https://img.shields.io/github/forks/Scoutflo/scoutflo-SRE-Playbooks)](https://github.com/Scoutflo/scoutflo-SRE-Playbooks/network/members) [![GitHub Discussions](https://img.shields.io/github/discussions/Scoutflo/scoutflo-SRE-Playbooks)](https://github.com/Scoutflo/scoutflo-SRE-Playbooks/discussions) [![GitHub Contributors](https://img.shields.io/github/contributors/Scoutflo/scoutflo-SRE-Playbooks)](https://github.com/Scoutflo/scoutflo-SRE-Playbooks/graphs/contributors) ## Table of Contents - [Overview](#overview) - [Repository Structure](#repository-structure) - [Contents](#contents) - [Getting Started](#getting-started) - [Usage](#usage) - [Terminology & Glossary](#terminology--glossary) - [Quick Reference](#quick-reference) - [Troubleshooting Guide](#troubleshooting-guide) - [Examples & Use Cases](#examples--use-cases) - [FAQ](#faq) - [Video Tutorials](#video-tutorials) - [Roadmap](#roadmap) - [Contributing](#contributing) - [Connect with Us](#connect-with-us) - [Support](#support) - [Related Resources](#related-resources) - [License](#license) ## Overview This repository contains **414 comprehensive incident response playbooks** designed to help Site Reliability Engineers (SREs) systematically diagnose and resolve common infrastructure and application issues in AWS, Kubernetes, and Sentry environments. ### Why This Repository? - **Systematic Approach**: Each playbook follows a consistent structure with clear diagnostic steps - **Time-Saving**: Quickly identify root causes with correlation analysis frameworks - **Community-Driven**: Continuously improved by the open-source community - **Production-Ready**: Based on real-world incident response scenarios - **Comprehensive Coverage**: 232 Kubernetes playbooks + 157 AWS playbooks + 25 Sentry playbooks - **Proactive Monitoring**: 56 K8s + 65 AWS proactive playbooks for capacity planning and compliance ### Diagnosis Improvements All playbooks use an **events-first approach** for root cause analysis: - Diagnosis sections prioritize checking recent events and changes before diving into configuration details - Conditional logic patterns help narrow down causes based on observed symptoms - Time-based correlation analysis connects events to failures systematically ### Use Cases - **During Incidents**: Quick reference for troubleshooting common issues - **On-Call Rotation**: Essential runbook collection for on-call engineers - **Knowledge Sharing**: Standardize troubleshooting procedures across teams - **Training**: Learn systematic incident response methodologies - **Documentation**: Build your own runbook library ## Repository Structure scoutflo-SRE-Playbooks/ ├── AWS Playbooks/ # 157 AWS playbooks │ ├── 01-Compute/ # 27 playbooks (EC2, Lambda, ECS, EKS) │ ├── 02-Database/ # 8 playbooks (RDS, DynamoDB) │ ├── 03-Storage/ # 7 playbooks (S3) │ ├── 04-Networking/ # 17 playbooks (VPC, ELB, Route53) │ ├── 05-Security/ # 16 playbooks (IAM, KMS, GuardDuty) │ ├── 06-Monitoring/ # 8 playbooks (CloudTrail, CloudWatch) │ ├── 07-CI-CD/ # 9 playbooks (CodePipeline, CodeBuild) │ ├── 08-Proactive/ # 65 proactive monitoring playbooks │ └── README.md ├── K8s Playbooks/ # 232 Kubernetes playbooks │ ├── 01-Control-Plane/ # 24 playbooks │ ├── 02-Nodes/ # 24 playbooks │ ├── 03-Pods/ # 41 playbooks │ ├── 04-Workloads/ # 25 playbooks │ ├── 05-Networking/ # 27 playbooks │ ├── 06-Storage/ # 9 playbooks │ ├── 07-RBAC/ # 6 playbooks │ ├── 08-Configuration/ # 6 playbooks │ ├── 09-Resource-Management/ # 8 playbooks │ ├── 10-Monitoring-Autoscaling/ # 3 playbooks │ ├── 11-Installation-Setup/ # 1 playbook │ ├── 12-Namespaces/ # 2 playbooks │ ├── 13-Proactive/ # 56 proactive monitoring playbooks │ └── README.md ├── Sentry Playbooks/ # 25 Sentry playbooks │ ├── 01-Error-Tracking/ # 19 playbooks │ ├── 02-Performance/ # 6 playbooks │ ├── 03-Release-Health/ # Placeholder │ └── README.md ├── CONTRIBUTING.md └── README.md ## Contents ### AWS Playbooks (`AWS Playbooks/`) **157 playbooks** covering 7 service categories + proactive monitoring: - **Compute Services** (27 playbooks): EC2, Lambda, ECS, EKS - **Database** (8 playbooks): RDS, DynamoDB - **Storage** (7 playbooks): S3 - **Networking** (17 playbooks): VPC, ELB, Route 53, NAT Gateway - **Security** (16 playbooks): IAM, KMS, GuardDuty, CloudTrail - **Monitoring** (8 playbooks): CloudTrail, CloudWatch - **CI/CD** (9 playbooks): CodePipeline, CodeBuild - **Proactive** (65 playbooks): Capacity planning, compliance, cost optimization **Key Topics:** - Connection timeouts and network issues - Access denied and permission problems - Resource unavailability and capacity issues - Security breaches and threat detection - Service integration failures - Proactive capacity and compliance monitoring See [AWS Playbooks/README.md](AWS%20Playbooks/README.md) for complete documentation and playbook list. ### Kubernetes Playbooks (`K8s Playbooks/`) **194 playbooks** organized into **13 categorized folders** covering Kubernetes cluster and workload issues: **Folder Structure:** - `01-Control-Plane/` (18 playbooks) - API Server, Scheduler, Controller Manager, etcd - `02-Nodes/` (12 playbooks) - Node readiness, kubelet issues, resource constraints - `03-Pods/` (31 playbooks) - Scheduling, lifecycle, health checks, resource limits - `04-Workloads/` (23 playbooks) - Deployments, StatefulSets, DaemonSets, Jobs, HPA - `05-Networking/` (19 playbooks) - Services, Ingress, DNS, Network Policies, kube-proxy - `06-Storage/` (9 playbooks) - PersistentVolumes, PersistentVolumeClaims, StorageClasses - `07-RBAC/` (6 playbooks) - ServiceAccounts, Roles, RoleBindings, authorization - `08-Configuration/` (6 playbooks) - ConfigMaps and Secrets access issues - `09-Resource-Management/` (8 playbooks) - Resource Quotas, overcommit, compute resources - `10-Monitoring-Autoscaling/` (3 playbooks) - Metrics Server, Cluster Autoscaler - `11-Installation-Setup/` (1 playbook) - Helm and installation issues - `12-Namespaces/` (2 playbooks) - Namespace management issues - `13-Proactive/` (56 playbooks) - Proactive monitoring, capacity planning, compliance **Key Topics:** - Pod lifecycle issues (CrashLoopBackOff, Pending, Terminating) - Control plane component failures - Network connectivity and DNS resolution - Storage and volume mounting problems - RBAC and permission errors - Resource quota and capacity constraints - Proactive capacity and compliance monitoring See [K8s Playbooks/README.md](K8s%20Playbooks/README.md) for complete documentation and playbook list. ### Sentry Playbooks (`Sentry Playbooks/`) **25 playbooks** covering error tracking and performance monitoring: **Folder Structure:** - `01-Error-Tracking/` (19 playbooks) - Error capture, grouping, alerting, and debugging - `02-Performance/` (6 playbooks) - Transaction monitoring, performance issues, tracing - `03-Release-Health/` - Release tracking and health monitoring (placeholder) **Key Topics:** - Error capture and reporting issues - Issue grouping and deduplication - Alert configuration and routing - Performance transaction monitoring - SDK integration troubleshooting - Release health tracking See [Sentry Playbooks/README.md](Sentry%20Playbooks/README.md) for complete documentation and playbook list. ## Getting Started ### Prerequisites - Basic knowledge of AWS services, Kubernetes, or Sentry - Access to AWS Console, Kubernetes cluster, or Sentry dashboard (for using playbooks) - Git (for cloning the repository) ### Installation #### Option 1: Clone the Repository # Clone the repository git clone https://github.com/Scoutflo/scoutflo-SRE-Playbooks.git # Navigate to the repository cd scoutflo-SRE-Playbooks # View available playbooks ls AWS\ Playbooks/ ls K8s\ Playbooks/ ls Sentry\ Playbooks/ #### Option 2: Use as Git Submodule Include playbooks in your own projects: git submodule add https://github.com/Scoutflo/scoutflo-SRE-Playbooks.git playbooks #### Option 3: Download Specific Playbooks Browse and download individual playbooks directly from GitHub web interface. ### Quick Start 1. **Identify Your Issue**: Determine if it's an AWS, Kubernetes, or Sentry issue 2. **Navigate to Playbooks**: - AWS issues -> `AWS Playbooks/` - K8s issues -> `K8s Playbooks/[category-folder]/` - Sentry issues -> `Sentry Playbooks/[category-folder]/` 3. **Find the Playbook**: Match your symptoms to a playbook title 4. **Follow the Steps**: Execute diagnostic steps in order 5. **Use Diagnosis Section**: Apply correlation analysis for root cause identification ### Learn More - **Watch Tutorials**: Check our [YouTube channel](https://www.youtube.com/@scoutflo6727) for video walkthroughs and best practices - **AI SRE Demo**: Watch the [Scoutflo AI SRE Demo](https://youtu.be/P6xzFUtRqRc?si=0VN9oMV05rNzXFs8) to see AI-powered incident response - **Scoutflo Documentation**: Visit [Scoutflo Documentation](https://scoutflo-documentation.gitbook.io/scoutflo-documentation) for platform guides - **Join the Community**: Connect with other SREs in our [Slack workspace](https://scoutflo.slack.com) ### Example Usage **Scenario**: EC2 instance SSH connection timeout 1. Navigate to `AWS Playbooks/` 2. Open `Connection-Timeout-SSH-Issues-EC2.md` 3. Follow the Playbook steps, replacing `` with your actual instance ID 4. Use the Diagnosis section to correlate events with failures 5. Apply the identified fix ## Usage ### How Playbooks Work **Important**: These playbooks are designed for **AI agents** using natural language processing (NLP). They use natural language instructions that AI agents interpret and execute using available tools (like AWS MCP tools, Kubernetes MCP tools, or kubectl). **Example Playbook Step:** - Natural Language: "Retrieve logs from pod `` in namespace `` and analyze error messages" - AI Agent Action: Interprets this instruction and uses appropriate tools to fetch and analyze pod logs **For Manual Use:** - While playbooks are optimized for AI agents, you can also use them manually - The README files in each category folder include equivalent kubectl/AWS CLI commands for manual verification - Replace placeholders with actual resource identifiers when following steps manually ### Playbook Structure 1. **Title** - Clear, descriptive issue identification 2. **Meaning** - What the issue means, triggers, symptoms, root causes 3. **Impact** - Business and technical implications 4. **Playbook** - 8-10 numbered diagnostic steps in natural language (ordered from common to specific) 5. **Diagnosis** - Correlation analysis framework with time windows using events-first approach and conditional logic patterns ### Best Practices - **For AI Agents**: Playbooks are optimized for AI interpretation - use natural language instructions - **For Manual Use**: See category README files for equivalent kubectl/AWS CLI commands - **Replace Placeholders**: All playbooks use placeholders (e.g., ``, ``) that must be replaced with actual values - **Follow Order**: Execute steps sequentially unless you have strong evidence pointing to a specific step - **Correlate Timestamps**: Use the Diagnosis section to correlate events with failures - **Extend Windows**: If initial correlations don't reveal causes, extend time windows as suggested ### Placeholder Reference **AWS Playbooks:** - ``, ``, ``, ``, ``, ``, ``, ``, ``, `` **Kubernetes Playbooks:** - ``, ``, ``, ``, ``, ``, ``, ``, `` **Sentry Playbooks:** - ``, ``, ``, ``, ``, `` ## Terminology & Glossary Understanding the terms used in these playbooks will help you use them more effectively. For detailed glossaries, see: - [AWS Terminology](AWS%20Playbooks/README.md#terminology--glossary) - [Kubernetes Terminology](K8s%20Playbooks/README.md#terminology--glossary) ### Quick Reference **SRE (Site Reliability Engineering)** - A discipline combining software engineering and operations to build reliable systems. **Playbook / Runbook** - A step-by-step guide for diagnosing and resolving specific issues. **Incident** - An event that disrupts or degrades a service, requiring immediate attention. **On-Call** - Engineers available to respond to incidents outside normal business hours. **MTTR (Mean Time To Recovery)** - Average time to restore a service after an incident. Playbooks help reduce MTTR. **Correlation Analysis** - Finding relationships between events (like configuration changes) and symptoms (like service failures) by comparing timestamps. **Root Cause** - The underlying reason why an issue occurred, as opposed to just the symptoms. **Placeholder** - A value in playbooks (like ``) that you replace with your actual resource identifier. **Diagnosis Section** - Part of each playbook that helps you correlate events with failures using time-based analysis. ### Common Abbreviations - **K8s**: Kubernetes (K + 8 letters + s) - **SRE**: Site Reliability Engineering - **MTTR**: Mean Time To Recovery - **API**: Application Programming Interface - **DNS**: Domain Name System - **RBAC**: Role-Based Access Control - **PVC**: PersistentVolumeClaim - **HPA**: Horizontal Pod Autoscaler **For detailed explanations of AWS and Kubernetes terms, see the respective README files above.** ## Quick Reference Need a quick cheat sheet? Check out our [Quick Reference Card](QUICK_REFERENCE.md) for: - One-page overview - Common commands - Quick lookup tables - Essential links ## Troubleshooting Guide Not sure which playbook to use? Use our [Troubleshooting Decision Tree](TROUBLESHOOTING_FLOWCHART.md) to: - Quickly identify the right playbook - Navigate by issue type - Look up by error message or alert name ## Examples & Use Cases See real-world scenarios in [EXAMPLES.md](EXAMPLES.md): - Step-by-step examples - Common workflows - Success stories - Best practices ## FAQ Have questions? Check our [FAQ](FAQ.md) for answers to: - General questions - Usage questions - Technical questions - Contributing questions ## Video Tutorials Learn how to use these playbooks effectively: **Coming Soon**: Video tutorials for: - How to use playbooks effectively - Common troubleshooting scenarios - Contributing to playbooks - Advanced correlation analysis ## Roadmap Check out our [ROADMAP.md](ROADMAP.md) to see: - Planned features and new playbook categories - Short-term and long-term goals - How to suggest new features - Release history #### 1. Reporting Issues Found a bug, unclear instruction, or have a suggestion? 1. **Check Existing Issues**: Search [GitHub Issues](https://github.com/Scoutflo/scoutflo-SRE-Playbooks/issues) first 2. **Create a New Issue**: - Use clear, descriptive title - Describe the problem or suggestion - Include relevant service/component, error messages, or examples - Tag with appropriate labels (`aws-playbook`, `k8s-playbook`, `sentry-playbook`, `bug`, `enhancement`, etc.) #### 2. Improving Existing Playbooks To fix or enhance existing playbooks: 1. **Fork the Repository**: Create your own fork 2. **Create a Branch**: git checkout -b fix/playbook-name-improvement 3. **Make Your Changes**: - Follow the established playbook structure - Maintain consistency with existing formatting - Update placeholders and examples as needed 4. **Test Your Changes**: Verify the playbook is accurate and helpful 5. **Commit and Push**: git add . git commit -m "Fix: Improve [playbook-name] with [description]" git push origin fix/playbook-name-improvement 6. **Create a Pull Request**: - Provide clear description of changes - Reference any related issues - Request review from maintainers #### 3. Adding New Playbooks To add a new playbook for an uncovered issue: 1. **Check for Duplicates**: Ensure a similar playbook doesn't already exist 2. **Follow the Structure**: Use existing playbooks as templates 3. **Choose the Right Location**: - AWS playbooks -> `AWS Playbooks/` - K8s playbooks -> Appropriate category folder in `K8s Playbooks/` - Sentry playbooks -> Appropriate category folder in `Sentry Playbooks/` 4. **Follow Naming Conventions**: - AWS: `-.md` - K8s: `-.md` - Sentry: `-.md` 5. **Include All Sections**: Title, Meaning, Impact, Playbook (8-10 steps), Diagnosis (5 correlations) 6. **Update README**: Add the new playbook to the appropriate README's playbook list 7. **Create Pull Request**: Follow standard contribution process ### Contribution Guidelines ### Review Process 1. All contributions require review from maintainers 2. Feedback will be provided within 2-3 business days 3. Address any requested changes promptly 4. Once approved, your contribution will be merged ## Connect with Us We'd love to hear from you! Here are the best ways to connect: ### Feedback & Feature Requests Have an idea for improvement or a new playbook topic? - **GitHub Issues**: Create a [feature request](https://github.com/Scoutflo/scoutflo-SRE-Playbooks/issues/new?template=feature_request.md) - **Slack**: Share your ideas in our `#playbooks` channel ### Bug Reports Found a bug or error in a playbook? - **GitHub Issues**: Create a [bug report](https://github.com/Scoutflo/scoutflo-SRE-Playbooks/issues/new?template=bug_report.md) - **Slack**: Report in our `#playbooks` channel for quick response ### Scoutflo Resources - **Official Documentation**: [Scoutflo Documentation](https://scoutflo-documentation.gitbook.io/scoutflo-documentation) - Complete guide to Scoutflo platform - **Website**: [scoutflo.com](https://scoutflo.com/) - Learn more about Scoutflo - **AI SRE Tool**: [ai.scoutflo.com](https://ai.scoutflo.com/get-started) - AI-powered SRE assistant - **Infra Management Tool**: [deploy.scoutflo.com](https://deploy.scoutflo.com/) - Kubernetes deployment platform - **YouTube Channel**: [@scoutflo6727](https://www.youtube.com/@scoutflo6727) - Tutorials and demos - **AI SRE Demo**: [Watch Demo Video](https://youtu.be/P6xzFUtRqRc?si=0VN9oMV05rNzXFs8) - See Scoutflo AI SRE in action - **Blog**: [scoutflo.com/blog](https://scoutflo.com/blog) and [blog.scoutflo.com](https://blog.scoutflo.com/) - Latest articles and insights - **Pricing**: [scoutflo.com/pricing](https://scoutflo.com/pricing) - Pricing information ### Additional Resources - **Roadmap**: Check out our [project roadmap](https://github.com/Scoutflo/scoutflo-SRE-Playbooks/projects) to see what's coming - **Documentation**: Visit our [wiki](https://github.com/Scoutflo/scoutflo-SRE-Playbooks/wiki) for detailed guides - **Legal**: [Privacy Policy](https://blog.scoutflo.com/privacy/) | [Terms of Service](https://blog.scoutflo.com/terms/) ## Related Resources ### AWS Resources **Official Documentation:** - [AWS Documentation](https://docs.aws.amazon.com/) - Complete AWS service documentation - [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/) - Best practices for building on AWS - [AWS Troubleshooting Guides](https://docs.aws.amazon.com/general/latest/gr/aws_troubleshooting.html) - Official troubleshooting guides - [AWS Service Health Dashboard](https://status.aws.amazon.com/) - Check AWS service status **Learning & Best Practices:** - [AWS Architecture Center](https://aws.amazon.com/architecture/) - Reference architectures - [AWS Security Best Practices](https://aws.amazon.com/security/security-resources/) - Security guidelines - [AWS re:Post](https://repost.aws/) - AWS community Q&A - [AWS Training](https://aws.amazon.com/training/) - Free and paid training courses **Tools & Utilities:** - [AWS CLI Documentation](https://docs.aws.amazon.com/cli/latest/userguide/) - Command-line interface - [AWS CloudShell](https://aws.amazon.com/cloudshell/) - Browser-based shell - [AWS Systems Manager](https://docs.aws.amazon.com/systems-manager/) - Operations management - [AWS CloudWatch](https://docs.aws.amazon.com/cloudwatch/) - Monitoring and observability ### Kubernetes Resources **Official Documentation:** - [Kubernetes Documentation](https://kubernetes.io/docs/) - Complete Kubernetes documentation - [kubectl Cheat Sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/) - Quick command reference - [Kubernetes Troubleshooting](https://kubernetes.io/docs/tasks/debug/) - Official troubleshooting guide - [Kubernetes API Reference](https://kubernetes.io/docs/reference/kubernetes-api/) - API documentation **Learning & Best Practices:** - [Kubernetes Best Practices](https://kubernetes.io/docs/concepts/cluster-administration/) - Cluster administration - [Kubernetes Security Best Practices](https://kubernetes.io/docs/concepts/security/) - Security guidelines - [CNCF Cloud Native Trail Map](https://github.com/cncf/trailmap) - Learning path - [Kubernetes.io Blog](https://kubernetes.io/blog/) - Latest updates and tutorials **Tools & Utilities:** - [k9s](https://k9scli.io/) - Terminal UI for Kubernetes - [Lens](https://k8slens.dev/) - Kubernetes IDE - [Helm](https://helm.sh/) - Package manager for Kubernetes - [kubectx & kubens](https://github.com/ahmetb/kubectx) - Context and namespace switching ### SRE Resources **Books & Guides:** - [Google SRE Book](https://sre.google/books/) - Site Reliability Engineering book - [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/) - SRE practices - [The Site Reliability Workbook](https://sre.google/workbook/table-of-contents/) - Practical SRE guide - [Building Secure & Reliable Systems](https://sre.google/books/building-secure-reliable-systems/) - Security and reliability **Learning Resources:** - [SRE Foundation Course](https://www.cncf.io/certification/training/) - CNCF training - [SRE Weekly](https://sreweekly.com/) - Weekly newsletter - [SREcon](https://www.usenix.org/conferences/byname/srecon) - SRE conferences - [Incident Response Guide](https://response.pagerduty.com/) - PagerDuty's incident response guide **Tools & Platforms:** - [Prometheus](https://prometheus.io/) - Monitoring and alerting - [Grafana](https://grafana.com/) - Visualization and dashboards - [Jaeger](https://www.jaegertracing.io/) - Distributed tracing - [ELK Stack](https://www.elastic.co/what-is/elk-stack) - Logging and analysis ### Incident Response & Runbooks **Runbook Resources:** - [PagerDuty Incident Response](https://response.pagerduty.com/) - Incident response best practices - [Atlassian Incident Management](https://www.atlassian.com/incident-management) - Incident management guide - [GitLab Runbooks](https://about.gitlab.com/handbook/engineering/infrastructure/runbooks/) - Example runbooks - [Google's SRE Runbook Template](https://sre.google/workbook/runbooks/) - Runbook structure **Incident Management:** - [Incident.io](https://incident.io/) - Incident management platform - [FireHydrant](https://www.firehydrant.com/) - Incident response platform - [Statuspage](https://www.statuspage.io/) - Status page management ## Statistics - **Total Playbooks**: 376 - AWS: 157 playbooks (92 reactive + 65 proactive) - Kubernetes: 194 playbooks (138 reactive + 56 proactive) - Sentry: 25 playbooks - **Coverage**: Major AWS services, Kubernetes components, and Sentry monitoring - **Format**: Markdown with structured sections - **Language**: English - **Community**: Open source, community-driven ## License This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. ## Maintainers This project is maintained by: - [@AtharvaBondreScoutflo](https://github.com/AtharvaBondreScoutflo) - [@Vedant-Vyawahare](https://github.com/Vedant-Vyawahare) For maintainer information, see [MAINTAINERS.md](MAINTAINERS.md). ## Acknowledgments