GitHub: nirmal84/aws-cloud-ops-power
# ⚡ AWS Cloud Operations Power
### A [Kiro](https://kiro.dev) Power for autonomous AWS cloud operations
```
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│                        "Production is down!"                        │
│                                                                     │
│  Without aws-cloud-ops:            With aws-cloud-ops:              │
│  ┌──────────────────────┐          ┌──────────────────────────┐     │
│  │ 🔍 CloudWatch tab    │          │ You: "Lambda timeouts    │     │
│  │ 📋 CloudTrail tab    │          │  since 3:42 PM"          │     │
│  │ 🔭 X-Ray tab         │    VS    │                          │     │
│  │ 💰 Cost Explorer     │          │ AI: Root cause found in  │     │
│  │ 🛡️ GuardDuty tab     │          │  30 seconds:             │     │
│  │ 🏗️ Config tab        │          │  IAM policy removed by   │     │
│  │ 📊 Trusted Advisor   │          │  terraform at 3:38 PM ✅ │     │
│  │ 😱 3 Slack threads   │          └──────────────────────────┘     │
│  │ ⏱️ 45 minutes later  │                                           │
│  └──────────────────────┘                                           │
└─────────────────────────────────────────────────────────────────────┘
```
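## 🎯 What This Power Does
This is a **Kiro Power**: a specialized context package that gives Kiro's AI agent on-demand access to AWS APIs, observability data, and SRE runbooks. It activates automatically when you mention an AWS-related problem and brings **8 expert workflows** into your IDE conversation.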
```
                 ┌─────────────────────────────────┐
                 │       aws-cloud-ops Power       │
                 │                                 │
                 │  Keywords trigger activation:   │
                 │  "incident" "outage" "deploy"   │
                 │  "blast-radius" "chaos" "cost"  │
                 │  "guardduty" "fis" "sre" ...    │
                 └──────────────┬──────────────────┘
                                │
        ┌───────────────────────┼───────────────────────┐
        ▼                       ▼                       ▼
┌───────────────┐       ┌────────────────┐      ┌───────────────┐
│ aws-api MCP   │       │ aws-pricing MCP│      │ aws-iac MCP   │
│               │       │                │      │               │
│ 100+ AWS APIs │       │ Cost lookups   │      │ CDK / CFn     │
│ CloudWatch    │       │ RI savings     │      │ Best practice │
│ CloudTrail    │       │ Rightsizing    │      │ Validation    │
│ X-Ray, FIS    │       │ SP analysis    │      │               │
└───────────────┘       └────────────────┘      └───────────────┘
                                │
                                ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                  aws-knowledge MCP (HTTP)                   │
 │         AWS Documentation · Best Practices · Guides         │
 └─────────────────────────────────────────────────────────────┘
```
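The diagram above says that keywords trigger activation, but the README does not describe how Kiro performs that matching. As a rough illustration only, the sketch below assumes a simple whole-word match against the keywords shown in the diagram. The names `ACTIVATION_KEYWORDS`, `matched_keywords`, and `should_activate` are hypothetical and are not part of Kiro or this Power.

```python
import re

# Keywords copied from the diagram above. The diagram ends with "...",
# so the real list is longer and this set is deliberately incomplete.
ACTIVATION_KEYWORDS = {
    "incident", "outage", "deploy", "blast-radius",
    "chaos", "cost", "guardduty", "fis", "sre",
}


def matched_keywords(prompt: str) -> set[str]:
    """Return the activation keywords that appear as whole words in the prompt."""
    # Tokenize on letters, digits, and inner hyphens so that
    # "blast-radius" survives as a single token.
    tokens = set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", prompt.lower()))
    return ACTIVATION_KEYWORDS & tokens


def should_activate(prompt: str) -> bool:
    """Hypothetical rule: activate when at least one keyword matches."""
    return bool(matched_keywords(prompt))


if __name__ == "__main__":
    print(should_activate("We had an outage after the last deploy"))
    print(should_activate("Rename this variable"))
```

Whole-word matching is a deliberate simplification: under this rule "deployment" or "costs" would not match, so a real implementation would likely need stemming or a richer trigger list.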
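## 🚀 The 8 Superpowers
| ⚡ Power | Problem It Solves | The Wow Moment |
|---|---|---|
| 🔍 Incident Investigator | Production is down and you don't know why | Root cause in 30 seconds via 5-Why analysis, automatically walking CloudWatch → CloudTrail → X-Ray → Logs |
| 💥 Blast Radius Analyzer | You want to delete or modify something but fear the consequences | A full dependency map before you touch any AWS resource, covering EC2, SGs, RDS, Lambda, VPCs, and IAM roles |
| 🏆 Resiliency Scorecard | You're not sure how resilient your architecture really is | A live, graded (A–F) Well-Architected REL assessment with 15 checks and fix-cost vs. outage-cost estimates |
| 🌪️ Chaos Engineering Studio | You've never run a Game Day because it feels too risky | Design and run AWS FIS experiments with safe hypotheses, stop conditions, and post-experiment analysis: a one-click Game Day |
| 🚦 Deployment Safety Gate | Every production deploy makes you nervous | A 7-check GO / NO-GO report, with rollback commands, before any code ships |
| 💰 Cost Anomaly Detective | Your AWS bill spiked and you don't know why | Traces every dollar to its source, identifies zombie resources and rightsizing opportunities, and produces a ranked "Cost Kill List" |
| 🛡️ Security Posture Scanner | Security findings are scattered across 5 different AWS consoles | Aggregates GuardDuty + Security Hub + IAM Access Analyzer + Config into a plain-English top-risks list with one-click remediation |
| 🏢 Multi-Account Operations | You manage dozens of AWS accounts in one organization | Org-wide resiliency assessment, SCP analysis, cross-account incident investigation, and an account vending checklist |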
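The Incident Investigator row describes walking from a CloudWatch error timeline to CloudTrail to find what changed. One way to picture that correlation step is the minimal sketch below, which filters a list of change events down to state-changing calls made shortly before the errors began. Everything here is an assumption made for illustration: `ChangeEvent` is a simplified stand-in for a CloudTrail record, the `MUTATING_PREFIXES` heuristic and the 30-minute lookback are arbitrary choices, and the sample data (including the date) is invented. It is not the Power's actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ChangeEvent:
    """Simplified, hypothetical stand-in for a CloudTrail record."""
    time: datetime
    name: str
    actor: str


# Assumption: event names starting with these verbs change state; reads do not.
MUTATING_PREFIXES = ("Create", "Delete", "Put", "Update", "Attach", "Detach")


def suspect_changes(events, error_onset, lookback=timedelta(minutes=30)):
    """Return mutating events inside the lookback window, most recent first."""
    window_start = error_onset - lookback
    hits = [
        e for e in events
        if window_start <= e.time <= error_onset
        and e.name.startswith(MUTATING_PREFIXES)
    ]
    return sorted(hits, key=lambda e: e.time, reverse=True)


if __name__ == "__main__":
    utc = timezone.utc
    onset = datetime(2024, 1, 1, 15, 42, 3, tzinfo=utc)  # illustrative date
    events = [
        ChangeEvent(datetime(2024, 1, 1, 15, 38, 47, tzinfo=utc),
                    "DeleteRolePolicy", "ci-terraform-user"),
        ChangeEvent(datetime(2024, 1, 1, 15, 40, 0, tzinfo=utc),
                    "GetItem", "payments-lambda"),
        ChangeEvent(datetime(2024, 1, 1, 9, 0, 0, tzinfo=utc),
                    "PutRolePolicy", "admin"),
    ]
    for e in suspect_changes(events, onset):
        print(e.time.isoformat(), e.name, e.actor)
```

Sorting most-recent-first reflects the common heuristic that the change closest to the error onset is the likeliest culprit; it is a starting point for investigation, not proof of cause.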
🔍 Incident Investigation: "Production is down"
```
You: My payments service Lambda is timing out since 3:42 PM UTC.
     Error rate went from 0.1% to 100%. What happened?

Kiro: Loading incident-investigator...

📊 Phase 1: CloudWatch Timeline
   → Lambda Errors: 0% → 100% starting 15:42:03 UTC
   → Duration: spiking to max timeout (15,000ms)
   → Throttles: 0 (not the cause)

📋 Phase 2: CloudTrail - "What Changed?"
   → Found: DeleteRolePolicy at 15:38:47 UTC
     Actor: arn:aws:iam::123456789012:user/ci-terraform-user
     Policy: payments-lambda-dynamodb-access

📝 Phase 3: Application Logs
   → Error: AccessDeniedException: User is not authorized
     to perform: dynamodb:GetItem on resource payments-table

🎯 ROOT CAUSE FOUND:
   A Terraform apply at 15:38 UTC removed the DynamoDB access
   policy from the Lambda execution role. The Lambda cannot
   read from DynamoDB → 100% errors.

✅ Immediate Fix:
   aws iam put-role-policy \
     --role-name payments-lambda-role \
     --policy-name payments-lambda-dynamodb-access \
     --policy-document file://policy.json

📋 Follow-up:
   • Add Terraform plan review gate for IAM changes
   • Add CloudWatch alarm for AccessDeniedException errors
```

💥 Blast Radius: "Is it safe to delete this?"
```
You: I want to terminate EC2 instance i-0abc12345. What breaks?

Kiro: Loading blast-radius-analyzer...

🗺️ Mapping dependencies for i-0abc12345 (payments-worker)...

┌─────────────────────────────────────────────────────┐
│ BLAST RADIUS REPORT                                 │
│ Risk Score: 🔴 CRITICAL                             │
├─────────────────────────────────────────────────────┤
│ Resource              Impact                        │
│ ─────────────────────────────────────────────       │
│ tg-payments (ALB)     LAST healthy target → OUTAGE  │
│ asg-payments          Will replace in ~3 min        │
│ eip-52.x.x.x          Released → DNS updates needed │
│ hc-payments-prod      Route53 failover triggers     │
└─────────────────────────────────────────────────────┘

⚠️ DO NOT terminate now - this is the last healthy target
   in your production load balancer.

✅ Safe Sequence:
   1. Drain ALB target group connections (30 sec)
   2. Terminate instance
   3. Reassociate EIP to new ASG instance
   4. Verify new target is healthy

Proceed with safe sequence? (yes/no)
```

🏆 Resiliency Scorecard: "How resilient am I?"
```
You: Score my production VPC for resiliency.

Kiro: Loading resiliency-scorecard... Running 15 checks...

╔═════════════════════════════════════════╗
║        AWS RESILIENCY SCORECARD         ║
║  Account: 123456789012 Region: us-east-1║
╠═════════════════════════════════════════╣
║  Overall Grade: B (78/100)              ║
╠══════════════════╦═══════╦══════════════╣
║ Domain           ║ Score ║ Grade        ║
╠══════════════════╬═══════╬══════════════╣
║ Compute          ║ 19/25 ║ B            ║
║ Data Durability  ║ 22/25 ║ A            ║
║ Network          ║ 14/20 ║ C            ║
║ Observability    ║ 12/15 ║ B            ║
║ Deployment Safety║ 11/15 ║ B            ║
╚══════════════════╩═══════╩══════════════╝

🔴 CRITICAL: Single NAT Gateway (AZ-level SPOF)
   Fix: Add NAT GW to 1b and 1c (+$65/month)
   Risk: Prevents $50,000+ outage cost

🔴 CRITICAL: RDS Multi-AZ disabled on payments-db
   Fix: Enable Multi-AZ (+$180/month)
   Risk: 10-30 min manual recovery on failure

🟡 HIGH: No CloudWatch alarm on Lambda Throttles
🟡 HIGH: ECS Deployment Circuit Breaker disabled
```

💰 Cost Detective: "Why did my bill spike?"
```
You: My AWS bill jumped $4,200 last month. What happened?

Kiro: Loading cost-anomaly-detective...

🔍 Analyzing Cost Explorer anomalies...
🔍 Hunting zombie resources...
🔍 Checking rightsizing opportunities...

═══════════════════════════════════════════
💀 COST KILL LIST (Ranked by Savings)
═══════════════════════════════════════════

1. 5× EC2 m5.4xlarge (load-test env)
   Owner: john.doe (created Feb 3 per CloudTrail)
   Monthly cost: $1,680 → Terminate immediately ✅

2. 12 unattached EBS volumes
   Monthly cost: $240 → Delete (snapshots exist) ✅

3. rds-analytics-dev (db.r5.2xlarge, 0 connections)
   Monthly cost: $580 → Verify with team, then delete

4. 3 idle NAT Gateways (0 bytes for 30 days)
   Monthly cost: $96 → Delete ✅

─────────────────────────────────────────
💰 Total immediate savings: $2,596/month
💰 Annual savings: $31,152/year
═══════════════════════════════════════════
```
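The totals in the Cost Kill List are simple arithmetic over the per-resource monthly costs. The minimal sketch below reproduces them from the four figures in the transcript, assuming whole-dollar amounts and an annual figure of twelve times the monthly one. The `Finding` type and the `kill_list` and `savings` helpers are hypothetical names introduced for illustration, not the Power's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One candidate for cleanup (hypothetical shape)."""
    resource: str
    monthly_cost: int  # whole US dollars per month


def kill_list(findings):
    """Rank findings by monthly cost, highest first."""
    return sorted(findings, key=lambda f: f.monthly_cost, reverse=True)


def savings(findings):
    """Return (monthly, annual) savings if every finding is cleaned up."""
    monthly = sum(f.monthly_cost for f in findings)
    return monthly, monthly * 12


# Monthly costs taken from the sample transcript above.
FINDINGS = [
    Finding("5x EC2 m5.4xlarge (load-test env)", 1680),
    Finding("12 unattached EBS volumes", 240),
    Finding("rds-analytics-dev (db.r5.2xlarge)", 580),
    Finding("3 idle NAT Gateways", 96),
]

if __name__ == "__main__":
    for rank, f in enumerate(kill_list(FINDINGS), start=1):
        print(f"{rank}. {f.resource}: ${f.monthly_cost:,}/month")
    monthly, annual = savings(FINDINGS)
    print(f"Total: ${monthly:,}/month, ${annual:,}/year")
```

A strict sort by cost places the $580 RDS instance ahead of the $240 EBS volumes, which differs from the order in the sample transcript. The monthly and annual totals match the transcript's $2,596 and $31,152.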
**Built with ❤️ for the AWS community**
*Turning reactive firefighting into proactive resilience engineering*