23seriy/devops-ai-workflows
# devops-ai-workflows
A growing collection of **AI-agent workflows, prompts, and rules** for day-to-day DevOps / SRE / platform work.
## What's inside
| Folder | Purpose | Audience |
|---|---|---|
| [`workflows/`](./workflows) | Workflow definitions, grouped by domain | Everyone |
| [`prompts/`](./prompts) | Reusable system / task prompts (incident triage, code review, post-mortem, etc.) | Any LLM |
| [`rules/`](./rules) | Editor / agent rule files (`.windsurfrules`, `.cursorrules`, Copilot instructions) | Per-tool |
| [`scripts/`](./scripts) | Standalone shell scripts referenced by workflows | Anyone with a shell |
## Available workflows
### Kubernetes
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| [k8s-debug](./workflows/kubernetes/k8s-debug.md) | `/k8s-debug` | General-purpose, read-only cluster diagnostics across nodes, pods, workloads, networking, storage, RBAC, events, and resource pressure. | `kubectl`. Optional: `jq`, metrics-server. |
| [k8s-workload-debug](./workflows/kubernetes/k8s-workload-debug.md) | `/k8s-workload-debug` | Deep-dive on a single Deployment / StatefulSet / DaemonSet / Job / Pod: rollout, spec, probes, resources, logs, networking, storage, config. | `kubectl`. Optional: `jq`, metrics-server. |
| [k8s-rbac-audit](./workflows/kubernetes/k8s-rbac-audit.md) | `/k8s-rbac-audit` | RBAC risk audit — wildcards, cluster-admin bindings, risky verb/resource combos, over-privileged ServiceAccounts, anonymous access. | `kubectl`, `jq`. Optional: `kubectl-who-can`. |
| [k8s-cost-hotspots](./workflows/kubernetes/k8s-cost-hotspots.md) | `/k8s-cost-hotspots` | Find waste: over-provisioned workloads, missing requests/limits, idle workloads, orphan PVCs/PVs, idle LoadBalancers. | `kubectl`, `jq`, metrics-server. |
| [k8s-upgrade-readiness](./workflows/kubernetes/k8s-upgrade-readiness.md) | `/k8s-upgrade-readiness` | Pre-flight before a control-plane / node upgrade: deprecated APIs, version skew, PDB gaps, expiring certs, broken webhooks. | `kubectl`. Optional: `kubent` or `pluto`, `helm`. |
| [helm-release-debug](./workflows/kubernetes/helm-release-debug.md) | `/helm-release-debug` | Diagnose a stuck or failed Helm release: history, values diff, hook failures, rendered manifest vs cluster, workload health. | `helm` v3, `kubectl`. Optional: `jq`, `yq`. |
| [helm-chart-review](./workflows/kubernetes/helm-chart-review.md) | `/helm-chart-review` | Review a Helm chart for security, reliability, and best practices: resource specs, probes, security context, PDBs, anti-affinity, RBAC. | Helm chart source. Optional: `helm` CLI. |
### AWS / Cloud
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| [aws-account-audit](./workflows/aws/aws-account-audit.md) | `/aws-account-audit` | Read-only AWS account security & hygiene audit: IAM, S3, EC2, RDS, CloudTrail, encryption, GuardDuty, SecurityHub. | `aws` CLI. Optional: `jq`. |
| [aws-cost-quickscan](./workflows/aws/aws-cost-quickscan.md) | `/aws-cost-quickscan` | Find AWS cost waste: idle EC2/RDS, unattached EBS, old snapshots, expensive log groups, NAT data processing, missing Savings Plans. | `aws` CLI, Cost Explorer enabled. Optional: `jq`. |
| [aws-vpc-debug](./workflows/aws/aws-vpc-debug.md) | `/aws-vpc-debug` | Diagnose VPC connectivity: trace path across SGs, NACLs, route tables, NAT/IGW/TGW, VPC endpoints, DNS, and flow logs. | `aws` CLI. Optional: `jq`, `dig`. |
| [aws-iam-policy-review](./workflows/aws/aws-iam-policy-review.md) | `/aws-iam-policy-review` | Explain an IAM policy and flag risks: admin-equivalent access, privilege escalation paths, wildcard actions, missing conditions. | `aws` CLI. Optional: `jq`. |
### IaC
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| [terraform-plan-review](./workflows/iac/terraform-plan-review.md) | `/terraform-plan-review` | Explain a Terraform plan and flag risky changes: destroys, replacements, security group mutations, IAM changes, blast radius. | `terraform plan` output. Optional: `terraform` CLI, `jq`. |
### Containers & CI/CD
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| [ci-debug](./workflows/cicd/ci-debug.md) | `/ci-debug` | Diagnose a failing CI/CD pipeline: parse build logs from Jenkins, GitHub Actions, GitLab CI, or Bitbucket Pipelines. Root cause analysis and fix suggestions. | Build log output. Optional: repo source, CI config file. |
| [jenkins-pipeline-review](./workflows/cicd/jenkins-pipeline-review.md) | `/jenkins-pipeline-review` | Review Jenkinsfile / shared-library Groovy for security risks, anti-patterns, missing error handling, credential leaks, CPS issues, and build config cross-references. | Jenkinsfile(s) or `vars/*.groovy`. Optional: `repositories_v2.json`. |
| [release-checklist](./workflows/cicd/release-checklist.md) | `/release-checklist` | Pre-release safety gate: scope, deploy order, rollback, tests, monitoring, and communication before production release. | PR/diff summary. Optional: test results, plans, diffs. |
| [dockerfile-review](./workflows/containers/dockerfile-review.md) | `/dockerfile-review` | Review Dockerfiles for security, size, caching, and best practices. Flags CVE-prone bases, leaked secrets, missing health checks. | Dockerfile(s). Optional: `docker`, `trivy`. |
### Security
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| [secrets-leak-scan](./workflows/security/secrets-leak-scan.md) | `/secrets-leak-scan` | Scan git repo history for leaked secrets: API keys, passwords, tokens, private keys. Uses gitleaks, trufflehog, or regex fallback. | Git repo. Optional: `gitleaks`, `trufflehog`. |
| [repo-health](./workflows/security/repo-health.md) | `/repo-health` | Audit repository hygiene: README, license, CI, branch/release hygiene, tracked secrets, ownership, and automation gaps. | Local git repo. Optional: `gh`, `jq`. |
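The regex fallback mentioned for `/secrets-leak-scan` might look roughly like the sketch below. It is not taken from the workflow file: the function name `scan_for_secrets` and the two patterns (AWS access key IDs and PEM private-key headers) are assumptions chosen for illustration, and a dedicated scanner such as `gitleaks` or `trufflehog` will cover far more cases.

```shell
# Hypothetical helper, not part of this repo: flag lines on stdin that look
# like an AWS access key ID or the header of a PEM private key.
scan_for_secrets() {
  # -n prefixes each match with its line number;
  # -e marks the next argument as the pattern.
  grep -nE -e 'AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----'
}

# Example: scan the full patch history of every ref in the current repo.
# git log -p --all | scan_for_secrets
```

Because the function simply wraps `grep`, it exits non-zero when nothing matches, so a clean scan and a failed pipeline need to be told apart by the caller.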
### Observability & Incident
| Workflow | Slash command | Description | Prerequisites |
|---|---|---|---|
| [incident-triage](./workflows/observability/incident-triage.md) | `/incident-triage` | Guided first 15 minutes of a production incident: timeline, blast radius, evidence gathering, mitigation suggestions. | Access to affected environment. |
More on the way — see [Roadmap](#roadmap).
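Since every workflow lists CLI prerequisites, it can help to confirm they are on `PATH` before starting. The following is a minimal sketch, assuming a POSIX shell; `check_prereqs` is a hypothetical helper and not something this repo ships. It only covers command-line tools, so cluster-side prerequisites such as metrics-server still need a separate check.

```shell
# Hypothetical helper, not part of this repo: report which CLI tools are
# missing from PATH before starting a workflow.
check_prereqs() {
  missing=""
  for tool in "$@"; do
    # command -v is the POSIX way to test whether a command is available.
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
  echo "all prerequisites found"
}

# Example: the required tools listed for /k8s-rbac-audit.
# check_prereqs kubectl jq
```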
## Prompts
Reusable system prompts you can paste into any AI agent for common DevOps tasks:
| Prompt | What it does |
|---|---|
| [incident-commander](./prompts/incident-commander.md) | Puts the AI in incident-commander mode: timeline, blast radius, action tracking, status updates. |
| [postmortem-writer](./prompts/postmortem-writer.md) | Generates a blameless post-mortem from incident notes: timeline, root cause, impact, action items. |
| [code-review-devops](./prompts/code-review-devops.md) | Reviews IaC / pipeline / Docker / K8s code with a security-first DevOps lens. |
| [pr-description](./prompts/pr-description.md) | Generates a PR description from a diff: what, why, how, testing, risk, rollback plan. |
| [explain-like-a-senior](./prompts/explain-like-a-senior.md) | Explains infrastructure code to junior engineers: what it does, why, gotchas, and how it fits together. |
| [runbook-from-incident](./prompts/runbook-from-incident.md) | Converts incident notes or post-mortems into reusable runbooks with diagnosis, mitigation, escalation, and follow-up steps. |
## Rules
Persistent instruction files that shape AI behavior. Copy a file into a project's `.windsurf/rules/` directory, or save it as `.windsurfrules` in the project root:
| Rule file | What it does |
|---|---|
| [devops-agent.windsurfrules](./rules/devops-agent.windsurfrules) | Safety guardrails for AI in DevOps repos: never modify prod without confirmation, prefer read-only, never hardcode secrets, always check context, GitOps awareness, multi-repo coordination. |
| [terraform.windsurfrules](./rules/terraform.windsurfrules) | Terraform-specific: state safety, ForceNew attribute warnings, provider/module pinning, workspace safety, import workflow, `prevent_destroy` reminders. |
| [kubernetes.windsurfrules](./rules/kubernetes.windsurfrules) | Kubernetes-specific: context verification, dry-run first, Helm safety, ArgoCD/GitOps awareness, secret handling, debugging approach, RBAC best practices. |
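Copying a rule file into a project can be scripted. The sketch below assumes it is run from a checkout of this repo; `install_rule` and the example project path are hypothetical names used only for illustration. Whether your editor version expects a particular file name or extension inside `.windsurf/rules/` is worth confirming in its documentation.

```shell
# Hypothetical helper, not part of this repo: copy one rule file into a
# project's .windsurf/rules/ directory, creating the directory if needed.
install_rule() {
  src=$1
  project=$2
  dest="$project/.windsurf/rules"
  mkdir -p "$dest" && cp "$src" "$dest/"
}

# Example (the project path is a placeholder):
# install_rule rules/terraform.windsurfrules "$HOME/code/my-infra-repo"
```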
## Scripts
Standalone shell utilities referenced by workflows or useful on their own:
| Script | Usage |
|---|---|
| [k8s-snapshot.sh](./scripts/k8s-snapshot.sh) | `./k8s-snapshot.sh [namespace\|all] [output-dir]` — dump cluster state (nodes, pods, events, services, top) to a timestamped Markdown file. |
| [aws-whoami.sh](./scripts/aws-whoami.sh) | `./aws-whoami.sh [profile]` — quick AWS identity check: caller, region, account alias, org, SSO role. |
| [stale-branches.sh](./scripts/stale-branches.sh) | `./stale-branches.sh [days] [--remote]` — list git branches older than N days with last commit info. |
| [validate-repo.sh](./scripts/validate-repo.sh) | `./scripts/validate-repo.sh` — validate workflow frontmatter, README links, script executability, and optional lint checks. |
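To give a sense of what a README link check involves, here is a rough sketch of one way to do it. It is not the code in `validate-repo.sh`; the function name `check_readme_links` is an assumption, and the sketch only handles relative links of the form `](./path)` whose paths contain no spaces.

```shell
# Hypothetical helper, not the repo's validate-repo.sh: report relative
# Markdown links in a README whose targets do not exist on disk.
check_readme_links() {
  readme=$1
  root=$(dirname "$readme")
  rc=0
  # Extract link targets written as ](./path), dropping any #fragment,
  # then strip the leading "](" so only the path remains.
  for target in $(grep -o '](\./[^)#]*' "$readme" | sed 's/^](//' | sort -u); do
    if [ ! -e "$root/$target" ]; then
      echo "broken link: $target"
      rc=1
    fi
  done
  return $rc
}

# Example:
# check_readme_links ./README.md
```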
## Using a workflow
### In AI agents
Open the matching file in [`workflows/`](./workflows) and either:
- invoke it as a slash command if your agent supports workflow discovery from this repo,
- paste the relevant section into the agent's chat, or
- include the file as context and ask the agent to follow it.
### As a plain human workflow
Every workflow is just Markdown with shell commands. You can run the steps yourself in a terminal — no AI required.
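If you want to see only the commands in a workflow before running anything, a short filter can pull them out of the Markdown. This is a sketch under one assumption that the source does not confirm: that commands sit in fenced code blocks tagged `bash`, `sh`, or `shell`. The function name `extract_shell_blocks` is hypothetical.

```shell
# Hypothetical helper, not part of this repo: print the contents of fenced
# code blocks tagged bash, sh, or shell from a Markdown file.
extract_shell_blocks() {
  # Build the three-backtick fence marker from octal escapes so this
  # snippet can itself live inside a fenced block.
  fence=$(printf '\140\140\140')
  awk -v fence="$fence" '
    index($0, fence) == 1 {
      if (inside) { inside = 0; keep = 0; next }   # closing fence
      inside = 1                                    # opening fence
      lang = substr($0, length(fence) + 1)
      keep = (lang == "bash" || lang == "sh" || lang == "shell")
      next
    }
    keep { print }
  ' "$1"
}

# Example: review the commands first, then run them by hand.
# extract_shell_blocks workflows/kubernetes/k8s-debug.md | less
```

Reviewing the output rather than piping it straight into a shell keeps the read-only intent of the diagnostic workflows under your control.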
## Repo layout
devops-ai-workflows/
├── workflows/
│ ├── kubernetes/ # Kubernetes workflow definitions
│ ├── aws/ # AWS / cloud workflow definitions
│ ├── iac/ # Infrastructure as Code workflows
│ ├── cicd/ # CI/CD pipeline workflows
│ ├── containers/ # Container & image workflows
│ ├── security/ # Security & repo hygiene workflows
│ └── observability/ # Observability & incident workflows
├── prompts/ # Reusable LLM prompts
├── rules/ # Editor/agent rule files
├── scripts/ # Standalone shell helpers
├── CONTRIBUTING.md
├── LICENSE
└── README.md
## Roadmap
Ideas I plan to add (PRs welcome):
**AWS / cloud**
- [ ] `/aws-eks-debug` — bridge EKS + Kubernetes: node groups, OIDC, add-ons, IAM roles for service accounts
- [ ] `/aws-rds-health` — RDS/Aurora diagnostics: events, metrics, parameter groups, replication lag
- [ ] `/aws-lambda-debug` — Lambda diagnostics: errors, throttles, DLQ, VPC/ENI, CloudWatch logs
- [ ] `/aws-ecs-service-debug` — ECS/Fargate service rollout failures: task events, target group health, IAM roles
**IaC**
- [ ] `/terraform-state-debug` — diagnose locks, drift, orphans
- [ ] `/iac-secrets-scan` — repo-wide hardcoded-secret sweep
**Containers & CI/CD**
- [ ] `/image-cve-triage` — prioritise CVE scanner output by exploitability + fix availability
- [ ] `/github-actions-review` — security review of GitHub Actions workflow files
**Observability & incident**
- [ ] `/prometheus-query-helper` — intent → PromQL with rationale
- [ ] `/log-pattern-extract` — cluster repeated errors out of a log dump
- [ ] `/postmortem` — blameless post-mortem from a transcript
- [ ] `/runbook-from-incident` — turn a resolved incident into a reusable runbook
**Networking / database**
- [ ] `/dns-debug` — multi-resolver dig, propagation, DNSSEC
- [ ] `/tls-cert-audit` — chain inspection, expiry, weak ciphers across a list of hosts
- [ ] `/postgres-health` — bloat, long queries, replication lag, missing indexes
- [ ] `/redis-health` — memory pressure, slow log, persistence config, eviction patterns
- [ ] `/db-migration-review` — flag risky migration patterns
**Security & repo hygiene**
- [ ] `/cve-impact-assessment` — given a CVE, check whether your stack is affected
- [x] `/repo-health` — README, license, CI, branch protection, stale branches (now available, see [Security](#security))
- [ ] `/dependency-upgrade-plan` — group outdated deps by risk and suggest batching
## License
[MIT](./LICENSE) — use freely, attribution appreciated but not required.