nirmal84/aws-cloud-ops-power


# ⚡ AWS Cloud Operations Power

### A [Kiro](https://kiro.dev) Power for autonomous AWS cloud operations

[![Kiro Power](https://img.shields.io/badge/Kiro-Power-6C47FF?style=for-the-badge&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjQiIGhlaWdodD0iMjQiIHZpZXdCb3g9IjAgMCAyNCAyNCIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj48cGF0aCBkPSJNMTIgMkw0IDE0aDE2TDEyIDJ6IiBmaWxsPSJ3aGl0ZSIvPjwvc3ZnPg==)](https://kiro.dev) [![AWS](https://img.shields.io/badge/AWS-Cloud_Operations-FF9900?style=for-the-badge&logo=amazon-aws&logoColor=white)](https://aws.amazon.com) [![MCP](https://img.shields.io/badge/MCP-4_Servers-00D4AA?style=for-the-badge)](https://modelcontextprotocol.io) [![License](https://img.shields.io/badge/License-MIT-blue?style=for-the-badge)](LICENSE)

```
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   "Production is down!"                                             │
│                                                                     │
│   Without aws-cloud-ops:           With aws-cloud-ops:              │
│   ┌──────────────────────┐         ┌──────────────────────────────┐ │
│   │ 🔍 CloudWatch tab    │         │ You: "Lambda timeouts        │ │
│   │ 📋 CloudTrail tab    │         │      since 3:42 PM"          │ │
│   │ 🔭 X-Ray tab         │   VS    │                              │ │
│   │ 💰 Cost Explorer     │         │ AI: Root cause found in      │ │
│   │ 🛡️ GuardDuty tab     │         │     30 seconds:              │ │
│   │ 🏗️ Config tab        │         │     IAM policy removed by    │ │
│   │ 📊 Trusted Advisor   │         │     terraform at 3:38 PM ✅  │ │
│   │ 😱 3 Slack threads   │         └──────────────────────────────┘ │
│   │ ⏱️ 45 minutes later  │                                          │
│   └──────────────────────┘                                          │
└─────────────────────────────────────────────────────────────────────┘
```

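## 🎯 What This Power Does

This is a **Kiro Power**: a specialized context bundle that gives Kiro's AI agent on-demand access to AWS APIs, observability data, and SRE runbooks. It activates automatically when you mention an AWS-related problem and brings **8 expert workflows** into your IDE conversation.
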
```
                  ┌─────────────────────────────────┐
                  │       aws-cloud-ops Power       │
                  │                                 │
                  │  Keywords trigger activation:   │
                  │  "incident" "outage" "deploy"   │
                  │  "blast-radius" "chaos" "cost"  │
                  │  "guardduty" "fis" "sre" ...    │
                  └────────────────┬────────────────┘
                                   │
           ┌───────────────────────┼───────────────────────┐
           ▼                       ▼                       ▼
   ┌───────────────┐       ┌────────────────┐      ┌───────────────┐
   │  aws-api MCP  │       │ aws-pricing MCP│      │  aws-iac MCP  │
   │               │       │                │      │               │
   │ 100+ AWS APIs │       │ Cost lookups   │      │ CDK / CFn     │
   │ CloudWatch    │       │ RI savings     │      │ Best practice │
   │ CloudTrail    │       │ Rightsizing    │      │ Validation    │
   │ X-Ray, FIS    │       │ SP analysis    │      │               │
   └───────────────┘       └────────────────┘      └───────────────┘
                                   │
                                   ▼
   ┌─────────────────────────────────────────────────────────────┐
   │                  aws-knowledge MCP (HTTP)                   │
   │         AWS Documentation · Best Practices · Guides         │
   └─────────────────────────────────────────────────────────────┘
```

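The keyword-triggered activation shown in the diagram can be pictured roughly as follows. This is only an illustrative sketch: the keyword set is copied from the diagram (which ends in `...`, so the real list is longer), while `matched_keywords`, `should_activate`, and the whole-word matching rule are hypothetical names and assumptions, not how Kiro itself decides when to load a Power.

```python
import re

# Keywords visible in the activation diagram; the real list is longer.
ACTIVATION_KEYWORDS = {
    "incident", "outage", "deploy", "blast-radius",
    "chaos", "cost", "guardduty", "fis", "sre",
}


def matched_keywords(message: str) -> set[str]:
    """Return the activation keywords that occur as whole words in message."""
    # Lowercase the message and split it into word-like tokens (hyphens kept).
    words = set(re.findall(r"[a-z0-9-]+", message.lower()))
    return ACTIVATION_KEYWORDS & words


def should_activate(message: str) -> bool:
    """Hypothetical rule: activate when at least one keyword matches."""
    return bool(matched_keywords(message))


if __name__ == "__main__":
    print(should_activate("We had an outage after the last deploy"))
    print(should_activate("Please rename this variable"))
```

Whole-word matching is used here so that an unrelated word such as "costume" would not trigger on "cost"; a real implementation may well behave differently.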
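## 🚀 The 8 Superpowers

| ⚡ Capability | Problem It Solves | Wow Moment |
|--------------|-------------------|------------|
| 🔍 **Incident Investigator** | Production is down and the cause is unknown | Root cause in 30 seconds via 5-Why analysis, automatically walking CloudWatch → CloudTrail → X-Ray → Logs |
| 💥 **Blast Radius Analyzer** | You want to delete or modify something but worry about the consequences | A full dependency map before you touch any AWS resource, covering EC2, SGs, RDS, Lambda, VPCs, and IAM roles |
| 🏆 **Resiliency Scorecard** | You are unsure how resilient your architecture really is | A live, graded (A–F) Well-Architected REL assessment with 15 checks and fix-cost vs. outage-cost estimates |
| 🌪️ **Chaos Engineering Studio** | You have never run a Game Day because it feels too risky | Design and run AWS FIS experiments with safe hypotheses, stop conditions, and post-experiment analysis: a one-click Game Day |
| 🚦 **Deployment Safety Gate** | Every production deployment makes you nervous | A 7-check GO / NO-GO report, including rollback commands, before any code is deployed |
| 💰 **Cost Anomaly Detective** | The AWS bill spiked and the cause is unknown | Traces the spend back to its source: identifies zombie resources and rightsizing opportunities, and produces a ranked "cost kill list" |
| 🛡️ **Security Posture Scanner** | Security findings are scattered across 5 different AWS consoles | Aggregates GuardDuty + Security Hub + IAM Access Analyzer + Config into a plain-English list of top risks, with one-click fixes |
| 🏢 **Multi-Account Operations** | You manage dozens of AWS accounts in an organization | Org-wide resiliency assessment, SCP analysis, cross-account incident investigation, account vending checklist |
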
## 📁 Repository Structure

```
aws-cloud-ops-power/
│
├── aws-cloud-ops/                      ← The Kiro Power
│   ├── POWER.md                        ← Orchestrator: capability map,
│   │                                     MCP tool catalog, activation rules
│   ├── mcp.json                        ← 4 MCP servers, 100+ auto-approved
│   │                                     read-only AWS API tools
│   │
│   └── steering/                       ← On-demand expert workflows
│       ├── incident-investigator.md    ← 6-phase root cause analysis
│       ├── blast-radius-analyzer.md    ← 7 resource type dependency maps
│       ├── resiliency-scorecard.md     ← 15-check A–F graded assessment
│       ├── deployment-safety-gate.md   ← 7-check GO/NO-GO pre-deploy gate
│       ├── chaos-engineering.md        ← 6 AWS FIS experiment templates
│       ├── cost-anomaly-detective.md   ← 10 zombie types + rightsizing
│       ├── security-posture.md         ← 6 security domain aggregation
│       └── multi-account-ops.md        ← Org / SCP / Control Tower
│
└── .kiro/hooks/                        ← IDE event-driven automations
    ├── aws-deployment-safety.kiro.hook       ← Fires on .tf/.yaml/CDK changes
    ├── aws-incident-investigation.kiro.hook  ← Fires on incident/runbook files
    ├── aws-cost-sentinel.kiro.hook           ← Fires on billing CSV/JSON files
    └── aws-chaos-game-day.kiro.hook          ← Fires on FIS/chaos files
```

## 🔧 Installation

### Prerequisites

- [Kiro IDE](https://kiro.dev) installed
- AWS CLI configured (`aws configure` or `aws sso login`)
- Python `uvx` available (`pip install uv`)

### Step 1: Add the Power

```
# Clone this repo
git clone https://github.com/nirmal84/aws-cloud-ops-power.git
```

In Kiro IDE:

1. Open the **Powers** panel (sidebar)
2. Click **Add Power** → **Local Directory**
3. Paste the path to the `aws-cloud-ops/` folder
4. Click **Add**

### Step 2: Configure AWS Credentials

Edit `aws-cloud-ops/mcp.json` and update the two values marked below (the `← change this` markers are annotations, not part of the file):

```
{
  "mcpServers": {
    "aws-api": {
      "env": {
        "AWS_PROFILE": "your-profile-name",   ← change this
        "AWS_REGION": "us-east-1"             ← change this
      }
    }
  }
}
```

```
# Verify that the credentials work
aws sts get-caller-identity
```

### Step 3: Install Hooks (Optional but Recommended)

Copy the hooks into your workspace:

```
cp .kiro/hooks/*.kiro.hook /path/to/your/workspace/.kiro/hooks/
```

### Step 4: Test the Power

Open the Kiro agent chat and try:

```
"My Lambda function started timing out 30 minutes ago. What happened?"
```

The Power activates automatically and starts investigating. 🎉

## 🎯 Example Conversations

**🔍 Incident Investigation: "Production is down"**

```
You: My payments service Lambda is timing out since 3:42 PM UTC.
     Error rate went from 0.1% to 100%. What happened?

Kiro: Loading incident-investigator...

  📊 Phase 1: CloudWatch Timeline
  → Lambda Errors: 0% → 100% starting 15:42:03 UTC
  → Duration: spiking to max timeout (15,000ms)
  → Throttles: 0 (not the cause)

  📋 Phase 2: CloudTrail - "What Changed?"
  → Found: DeleteRolePolicy at 15:38:47 UTC
    Actor: arn:aws:iam::123456789012:user/ci-terraform-user
    Policy: payments-lambda-dynamodb-access

  📝 Phase 3: Application Logs
  → Error: AccessDeniedException: User is not authorized
    to perform: dynamodb:GetItem on resource payments-table

  🎯 ROOT CAUSE FOUND:
  A Terraform apply at 15:38 UTC removed the DynamoDB access
  policy from the Lambda execution role. The Lambda cannot
  read from DynamoDB → 100% errors.

  ✅ Immediate Fix:
  aws iam put-role-policy \
    --role-name payments-lambda-role \
    --policy-name payments-lambda-dynamodb-access \
    --policy-document file://policy.json

  📋 Follow-up:
  • Add Terraform plan review gate for IAM changes
  • Add CloudWatch alarm for AccessDeniedException errors
```

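The "What Changed?" step in this transcript boils down to finding change events that occurred shortly before the error onset. The sketch below shows that correlation on hand-written data; it makes no AWS calls. The function name `changes_before`, the 15-minute window, the calendar date, and the `UpdateFunctionCode` event are all assumptions added for illustration; only the two timestamps and the `DeleteRolePolicy` event come from the transcript.

```python
from datetime import datetime, timedelta, timezone


def changes_before(onset, events, window=timedelta(minutes=15)):
    """Return events that occurred within `window` before `onset`, newest first."""
    candidates = [e for e in events if onset - window <= e["time"] <= onset]
    return sorted(candidates, key=lambda e: e["time"], reverse=True)


UTC = timezone.utc

# Error onset from the transcript; the date itself is arbitrary.
onset = datetime(2024, 1, 1, 15, 42, 3, tzinfo=UTC)

events = [
    # Change event reported in the transcript.
    {"name": "DeleteRolePolicy", "time": datetime(2024, 1, 1, 15, 38, 47, tzinfo=UTC)},
    # Hypothetical unrelated change from earlier in the day.
    {"name": "UpdateFunctionCode", "time": datetime(2024, 1, 1, 9, 5, 0, tzinfo=UTC)},
]

suspects = changes_before(onset, events)

if __name__ == "__main__":
    for event in suspects:
        print(event["name"], event["time"].isoformat())
```

Only the `DeleteRolePolicy` event falls inside the window, which matches the root cause the transcript reports.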
**💥 Blast Radius: "Is it safe to delete this?"**

```
You: I want to terminate EC2 instance i-0abc12345. What breaks?

Kiro: Loading blast-radius-analyzer...

  🗺️ Mapping dependencies for i-0abc12345 (payments-worker)...

  ┌─────────────────────────────────────────────────────┐
  │                 BLAST RADIUS REPORT                 │
  │               Risk Score: 🔴 CRITICAL               │
  ├─────────────────────────────────────────────────────┤
  │ Resource            Impact                          │
  │ ─────────────────────────────────────────────       │
  │ tg-payments (ALB)   LAST healthy target → OUTAGE    │
  │ asg-payments        Will replace in ~3 min          │
  │ eip-52.x.x.x        Released → DNS updates needed   │
  │ hc-payments-prod    Route53 failover triggers       │
  └─────────────────────────────────────────────────────┘

  ⚠️ DO NOT terminate now: this is the last healthy target
  in your production load balancer.

  ✅ Safe Sequence:
  1. Drain ALB target group connections (30 sec)
  2. Terminate instance
  3. Reassociate EIP to new ASG instance
  4. Verify new target is healthy

  Proceed with safe sequence? (yes/no)
```

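One way to think about a report like this is as a walk over a dependency graph starting at the resource you intend to change. The sketch below is a plain breadth-first traversal over a hand-written graph. The resource names are taken from the report, but the edges between them (for example, that the Route 53 health check depends on the target group) and the function name `blast_radius` are assumptions made for illustration, not the analyzer's actual logic.

```python
from collections import deque


def blast_radius(graph, start):
    """Breadth-first walk returning every resource reachable from `start`."""
    seen = {start}
    queue = deque([start])
    impacted = []
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                impacted.append(dependent)
                queue.append(dependent)
    return impacted


# Hypothetical edges: "changing the key affects each resource in the list".
graph = {
    "i-0abc12345": ["tg-payments", "asg-payments", "eip-52.x.x.x"],
    "tg-payments": ["hc-payments-prod"],
}

impacted = blast_radius(graph, "i-0abc12345")

if __name__ == "__main__":
    for resource in impacted:
        print(resource)
```

The `seen` set keeps the walk from looping if two resources reference each other, which is common with security groups and IAM roles.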
**🏆 Resiliency Scorecard: "How resilient am I?"**

```
You: Score my production VPC for resiliency.

Kiro: Loading resiliency-scorecard...
      Running 15 checks...

  ╔══════════════════════════════════════════╗
  ║         AWS RESILIENCY SCORECARD         ║
  ║ Account: 123456789012  Region: us-east-1 ║
  ╠══════════════════════════════════════════╣
  ║ Overall Grade: B (78/100)                ║
  ╠══════════════════╦═══════╦═══════════════╣
  ║ Domain           ║ Score ║ Grade         ║
  ╠══════════════════╬═══════╬═══════════════╣
  ║ Compute          ║ 19/25 ║ B             ║
  ║ Data Durability  ║ 22/25 ║ A             ║
  ║ Network          ║ 14/20 ║ C             ║
  ║ Observability    ║ 12/15 ║ B             ║
  ║ Deployment Safety║ 11/15 ║ B             ║
  ╚══════════════════╩═══════╩═══════════════╝

  🔴 CRITICAL: Single NAT Gateway (AZ-level SPOF)
     Fix: Add NAT GW to 1b and 1c (+$65/month)
     Risk: Prevents $50,000+ outage cost

  🔴 CRITICAL: RDS Multi-AZ disabled on payments-db
     Fix: Enable Multi-AZ (+$180/month)
     Risk: 10-30 min manual recovery on failure

  🟡 HIGH: No CloudWatch alarm on Lambda Throttles
  🟡 HIGH: ECS Deployment Circuit Breaker disabled
```

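The arithmetic behind the sample scorecard can be checked with a few lines of code. The domain scores below are copied from the scorecard; the letter-grade cutoffs are not stated anywhere in this README, so the `CUTOFFS` table is a hypothetical set of thresholds chosen only because it reproduces the grades shown above.

```python
# Hypothetical cutoffs (percent of maximum). The README does not document
# the real thresholds; these merely reproduce the sample scorecard.
CUTOFFS = [(85, "A"), (72, "B"), (60, "C"), (50, "D")]


def grade(score, max_score):
    """Map a score to a letter grade using the assumed cutoffs."""
    percent = 100 * score / max_score
    for cutoff, letter in CUTOFFS:
        if percent >= cutoff:
            return letter
    return "F"


# (score, maximum) per domain, as printed in the sample scorecard.
domains = {
    "Compute": (19, 25),
    "Data Durability": (22, 25),
    "Network": (14, 20),
    "Observability": (12, 15),
    "Deployment Safety": (11, 15),
}

total = sum(score for score, _ in domains.values())
maximum = sum(limit for _, limit in domains.values())

if __name__ == "__main__":
    for name, (score, limit) in domains.items():
        print(f"{name}: {score}/{limit} -> {grade(score, limit)}")
    print(f"Overall: {total}/{maximum} -> {grade(total, maximum)}")
```

The domain maxima add up to 100 and the domain scores to 78, which agrees with the overall 78/100 in the sample.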
**💰 Cost Detective: "Why did my bill spike?"**

```
You: My AWS bill jumped $4,200 last month. What happened?

Kiro: Loading cost-anomaly-detective...

  🔍 Analyzing Cost Explorer anomalies...
  🔍 Hunting zombie resources...
  🔍 Checking rightsizing opportunities...

  ═══════════════════════════════════════════
  💀 COST KILL LIST (Ranked by Savings)
  ═══════════════════════════════════════════

  1. 5× EC2 m5.4xlarge (load-test env)
     Owner: john.doe (created Feb 3 per CloudTrail)
     Monthly cost: $1,680 → Terminate immediately ✅

  2. 12 unattached EBS volumes
     Monthly cost: $240 → Delete (snapshots exist) ✅

  3. rds-analytics-dev (db.r5.2xlarge, 0 connections)
     Monthly cost: $580 → Verify with team, then delete

  4. 3 idle NAT Gateways (0 bytes for 30 days)
     Monthly cost: $96 → Delete ✅

  ─────────────────────────────────────────
  💰 Total immediate savings: $2,596/month
  💰 Annual savings: $31,152/year
  ═══════════════════════════════════════════
```

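The totals at the bottom of the kill list follow directly from the four monthly figures. The short sketch below recomputes them and also sorts the items by monthly cost; the variable names are illustrative, and the figures are copied from the sample conversation rather than from any real bill.

```python
# (description, monthly cost in USD) as listed in the sample kill list.
kill_list = [
    ("5x EC2 m5.4xlarge (load-test env)", 1680),
    ("12 unattached EBS volumes", 240),
    ("rds-analytics-dev (db.r5.2xlarge)", 580),
    ("3 idle NAT Gateways", 96),
]

# Strict ranking by monthly savings, largest first.
ranked = sorted(kill_list, key=lambda item: item[1], reverse=True)

monthly_savings = sum(cost for _, cost in kill_list)
annual_savings = monthly_savings * 12

if __name__ == "__main__":
    for description, cost in ranked:
        print(f"${cost:>5,}/month  {description}")
    print(f"Total: ${monthly_savings:,}/month, ${annual_savings:,}/year")
```

Note that a strict sort by savings places the RDS instance second, whereas the sample lists the EBS volumes second, and that the $2,596 total includes the RDS instance even though the sample says to verify with the team before deleting it.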
## 🔌 Included MCP Servers

| Server | Transport | Purpose |
|--------|-----------|---------|
| `aws-api` | `uvx` (stdio) | Direct access to 100+ AWS service APIs: EC2, RDS, Lambda, CloudWatch, CloudTrail, X-Ray, FIS, GuardDuty, Security Hub, Cost Explorer, Organizations |
| `aws-pricing` | `uvx` (stdio) | Real-time AWS pricing lookups, Reserved Instance analysis, Savings Plans recommendations |
| `aws-knowledge` | HTTP | AWS documentation, Well-Architected best practices, service limits |
| `aws-iac` | `uvx` (stdio) | CDK + CloudFormation documentation, template validation, IaC best practices |

## 🪝 Hooks: Automatic Activation

| Hook | Trigger | What It Does |
|------|---------|--------------|
| `aws-deployment-safety` | `.tf`, `*.yaml`, or CDK files are edited | Automatically runs the pre-deployment safety gate |
| `aws-incident-investigation` | Runbook, incident, or postmortem files are edited | Offers help with RCA or postmortem writing |
| `aws-cost-sentinel` | Billing CSV/JSON files are added | Automatically runs a cost anomaly investigation |
| `aws-chaos-game-day` | FIS experiment or chaos files are created | Guides safe experiment design, including stop conditions |

## 🛡️ Safety and Security Model

```
┌──────────────────────────────────────────────────────────────────┐
│                       SAFETY ARCHITECTURE                        │
│                                                                  │
│  ✅ AUTO-APPROVED (no confirmation needed)                       │
│  ┌────────────────────────────────────────┐                      │
│  │ describe_* · list_* · get_*            │                      │
│  │ CloudWatch reads · CloudTrail lookups  │                      │
│  │ Cost Explorer queries · X-Ray traces   │                      │
│  └────────────────────────────────────────┘                      │
│                                                                  │
│  ⚠️ REQUIRES EXPLICIT CONFIRMATION                               │
│  ┌────────────────────────────────────────┐                      │
│  │ ec2:TerminateInstances                 │                      │
│  │ rds:DeleteDBInstance                   │                      │
│  │ fis:StartExperiment                    │                      │
│  │ ssm:SendCommand                        │                      │
│  │ Any write / delete / modify operation  │                      │
│  └────────────────────────────────────────┘                      │
│                                                                  │
│  🚨 PRODUCTION CHAOS EXPERIMENTS                                 │
│  ┌────────────────────────────────────────┐                      │
│  │ Require: hypothesis + stop conditions  │                      │
│  │ Require: blast radius analysis first   │                      │
│  │ Require: typed confirmation phrase     │                      │
│  │ Require: on-call engineer available    │                      │
│  └────────────────────────────────────────┘                      │
└──────────────────────────────────────────────────────────────────┘
```

**Minimum required IAM permissions** (read-only baseline for most capabilities):

```
{
  "Effect": "Allow",
  "Action": [
    "cloudwatch:GetMetricData",
    "cloudwatch:DescribeAlarms",
    "logs:FilterLogEvents",
    "logs:StartQuery",
    "cloudtrail:LookupEvents",
    "xray:GetServiceGraph",
    "xray:GetTraceSummaries",
    "ec2:Describe*",
    "rds:Describe*",
    "lambda:Get*",
    "lambda:List*",
    "elasticloadbalancing:Describe*",
    "ce:GetCostAndUsage",
    "ce:GetAnomalies",
    "guardduty:List*",
    "guardduty:Get*",
    "securityhub:GetFindings",
    "iam:List*",
    "iam:Get*",
    "sts:GetCallerIdentity"
  ],
  "Resource": "*"
}
```

## 🗺️ Steering File Architecture

```
POWER.md (Orchestrator)
  │  Detects user intent → routes to correct steering file
  │
  ├─── "production is down"            → incident-investigator.md
  │        6-phase investigation:
  │        Metrics → CloudTrail → Logs → X-Ray → Infra → AWS Health
  │
  ├─── "what breaks if I..."           → blast-radius-analyzer.md
  │        7 resource types mapped:
  │        EC2 · SG · RDS · Lambda · VPC · IAM · S3
  │
  ├─── "score my architecture"         → resiliency-scorecard.md
  │        5 domains, 15 checks, A–F grade:
  │        Compute · Data · Network · Observability · Deployments
  │
  ├─── "chaos experiment / game day"   → chaos-engineering.md
  │        6 experiment templates:
  │        AZ failure · Latency · CPU · RDS failover · Throttle · Spot
  │
  ├─── "safe to deploy?"               → deployment-safety-gate.md
  │        7-check pre-flight gate:
  │        Errors · Alarms · Dependencies · Traffic · History · Rollback · Health
  │
  ├─── "why did my bill spike?"        → cost-anomaly-detective.md
  │        4-phase investigation:
  │        Anomaly · Zombie hunt · Rightsizing · Data transfer
  │
  ├─── "security findings / posture"   → security-posture.md
  │        6 security domains:
  │        GuardDuty · Security Hub · Access Analyzer · Config · IAM · Network
  │
  └─── "cross-account / org"           → multi-account-ops.md
           Enterprise ops:
           Org map · SCP analysis · Control Tower · Cross-account RCA
```

## 📊 What Loads and When

One of the key innovations of Kiro Powers is **zero context bloat**: only the relevant expert workflow is loaded, and only when it is needed.

```
Without Powers:                  With aws-cloud-ops Power:
────────────────                 ─────────────────────────
Always loaded:                   On activation (keywords):
• 5+ MCP servers                 • POWER.md capability map
• 50,000+ tokens wasted          • ~2,000 tokens
• 40% context window gone
                                 On-demand (per conversation):
                                 • 1 steering file (~400 lines)
                                 • Only the expert workflow needed
                                 • 95% context window preserved
```

## 🤝 Contributing

Pull requests are welcome! Contributions are especially welcome in these areas:

- 🆕 New steering files (EKS operations, API Gateway debugging, EventBridge patterns)
- 🔧 Additional MCP server integrations (Datadog, PagerDuty, Jira)
- 🧪 Real-world example conversations
- 📖 Additional runbook patterns
- 🪝 More hook automations

### Adding a New Steering File

1. Create `aws-cloud-ops/steering/your-capability.md`
2. Add an entry to the capability table in `POWER.md`
3. Add keywords to the `keywords` array in the `POWER.md` frontmatter
4. Submit a PR!

## 📖 Learn More

- [Kiro Powers documentation](https://kiro.dev/docs/powers)
- [Kiro Powers marketplace](https://kiro.dev/powers)
- [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/)
- [AWS Fault Injection Service](https://aws.amazon.com/fis/)
- [AWS MCP Servers (awslabs)](https://github.com/awslabs/mcp)

**Built with ❤️ for the AWS community**

*Turning reactive firefighting into proactive resilience engineering*

[![Star this repo](https://img.shields.io/github/stars/nirmal84/aws-cloud-ops-power?style=social)](https://github.com/nirmal84/aws-cloud-ops-power)