duyluann/aws-devops-agent-demo

GitHub: duyluann/aws-devops-agent-demo

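A production-ready AWS infrastructure test platform that simulates realistic failure scenarios and validates the incident detection and response capabilities of DevOps AI agents.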

Stars: 0 | Forks: 0

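# AWS DevOps Agent Demo

![Version](https://img.shields.io/badge/version-1.5.3-blue) ![License](https://img.shields.io/badge/license-MIT-green) ![Terraform](https://img.shields.io/badge/terraform-%E2%89%A5%201.0-623CE4) ![AWS](https://img.shields.io/badge/AWS-Infrastructure-FF9900)

A production-ready AWS infrastructure platform designed to **test and validate the incident response capabilities of DevOps AI agents**. The demo environment simulates real-world infrastructure incidents through fast CloudWatch alarms, automatic recovery mechanisms, and comprehensive monitoring, which makes it well suited to training and validating AI-driven automation agents.

## Features

- **Interactive dashboard**: Python-based web application with one-click incident simulation
- **6 incident types**: unhealthy host, all hosts unhealthy, application crash, slow response, 5xx flood, and instance shutdown
- **Fast alarms**: 10-second CloudWatch evaluation periods for rapid incident detection (2-4 minutes)
- **Auto-recovery**: built-in 5-minute recovery timer that restores the healthy state automatically
- **Cost optimization**: an auto-shutdown Lambda stops instances after 2 hours, saving $50-100 per month
- **GitHub Actions integration**: trigger incidents automatically through CI/CD workflows

## Table of Contents

1. [Quick Start](#quick-start)
2. [Architecture](#architecture)
3. [The Web Application](#the-web-application)
4. [Incident Simulation](#incident-simulation)
5. [Monitoring & Alarms](#monitoring--alarms)
6. [DevOps Agent Integration](#devops-agent-integration)
7. [Infrastructure Details](#infrastructure-details)
8. [Development](#development)
9. [Advanced Topics](#advanced-topics)
10. [Documentation Index](#documentation-index)
11. [Technical Reference](#technical-reference)

## Quick Start

Bring up the demo infrastructure in 5 minutes.

### Prerequisites

- An AWS account with configured credentials
- Terraform >= 1.0 ([installation guide](https://learn.hashicorp.com/tutorials/terraform/install-cli))

### Deploy

```bash
# 1. Clone the repository
git clone <repository-url>
cd aws-devops-agent-demo

# 2. Initialize Terraform
terraform init

# 3. Deploy to the dev environment
terraform apply -var-file=environments/dev/dev.tfvars

# 4. Get the application URL
terraform output -raw alb_url
```

### First Incident Test

```bash
# Get your ALB URL
ALB_URL=$(terraform output -raw alb_url)

# Trigger an unhealthy-host incident from the dashboard
open ${ALB_URL}  # Opens interactive dashboard in browser

# Or trigger it with curl
curl "${ALB_URL}/simulate/unhealthy"

# Check the CloudWatch alarm (should fire within 2-3 minutes)
# The system recovers automatically within 5 minutes
```

## Architecture

The infrastructure creates a complete test environment with an Application Load Balancer, EC2 instances, CloudWatch monitoring, and automatic recovery mechanisms.

```mermaid
graph TB
    Internet[Internet Users]
    ALB[Application Load Balancer<br/>Port 80]
    EC2_1[EC2 Instance #1<br/>Python Web App<br/>Port 80]
    EC2_2[EC2 Instance #2<br/>Python Web App<br/>Port 80]
    CW[CloudWatch Alarms<br/>10-second periods]
    Lambda[Auto-Shutdown Lambda<br/>2-hour timer]

    Internet -->|HTTP Requests| ALB
    ALB -->|Health Checks| EC2_1
    ALB -->|Health Checks| EC2_2
    EC2_1 -->|Custom Metrics<br/>HealthStatus, Incidents| CW
    EC2_2 -->|Custom Metrics<br/>HealthStatus, Incidents| CW
    Lambda -.->|Stop Instances<br/>Cost Savings| EC2_1
    Lambda -.->|Stop Instances<br/>Cost Savings| EC2_2

    subgraph VPC["VPC (10.0.0.0/16)"]
        ALB
        subgraph PublicSubnets["Public Subnets (Multi-AZ)"]
            EC2_1
            EC2_2
        end
    end

    style CW fill:#FF9900,color:#fff
    style Lambda fill:#FF9900,color:#fff
    style ALB fill:#0066CC,color:#fff
    style EC2_1 fill:#0066CC,color:#fff
    style EC2_2 fill:#0066CC,color:#fff
```

### Components

| Component | Description | Purpose |
|-----------|-------------|---------|
| **VPC** | 10.0.0.0/16 network | Isolated network environment |
| **Application Load Balancer** | Internet-facing ALB | Traffic routing, health checks |
| **EC2 instances** | t3.micro (2 by default) | Run the Python web application |
| **Python Web App** | HTTP server on port 80 | Simulation endpoints, metrics |
| **CloudWatch alarms** | 10-second evaluation periods | Fast incident detection |
| **Auto-Shutdown Lambda** | Runs every 2 hours | Saves cost on idle demos |
| **IAM role** | EC2 instance profile | CloudWatch metrics, SSM access |

### Network Architecture

- **Public subnets**: multi-AZ deployment for high availability
- **Internet gateway**: direct internet access for the ALB
- **Security groups**: ALB (port 80 inbound) → EC2 (port 80 allowed only from the ALB)
- **Route table**: public route pointing at the internet gateway

### Key Files

- **main.tf**: VPC, ALB, EC2 instances, networking
- **monitoring.tf**: CloudWatch alarms (lines 5-77)
- **auto_shutdown.tf**: Lambda function for cost optimization (lines 20-96)
- **templates/userdata.sh.tpl**: Python application installation

## The Web Application

A Python 3.12 HTTP server that runs as a systemd service on port 80 and provides an interactive dashboard plus simulation endpoints for incident testing.

### Overview

- **Technology**: Python 3 built-in `http.server` module
- **Deployment**: systemd service with automatic restart
- **Metrics**: CloudWatch custom metrics published through boto3
- **Auto-recovery**: 5-minute timer for automatic health restoration

### Interactive Dashboard

Open the dashboard at your ALB URL to:

- View real-time health status
- Monitor instance information (ID, environment)
- Trigger incidents with one-click buttons
- See all available API endpoints
- Auto-refresh every 5 seconds

### Simulation Endpoints

| Endpoint | Method | Description | Auto-Recovery |
|----------|--------|-------------|---------------|
| `/simulate/unhealthy` | GET | Fail health checks (returns 503) | 5 minutes |
| `/simulate/healthy` | GET | Restore healthy state (returns 200) | N/A |
| `/simulate/crash` | GET | Crash the application after 10 seconds | systemd restart |
| `/simulate/slow-health` | GET | Trigger slow responses/timeouts | 5 minutes |

### Health Check Endpoint

**`GET /health`** - ALB health check endpoint

**Responses**:

- **200 OK**: `{"status": "healthy"}` - instance is healthy
- **503 Service Unavailable**: `{"status": "unhealthy", "reason": "..."}` - health check failed

ALB health check configuration:

- **Interval**: 30 seconds
- **Timeout**: 5 seconds
- **Healthy threshold**: 2 consecutive successes
- **Unhealthy threshold**: 2 consecutive failures

### CloudWatch Metrics

The application publishes custom metrics every 60 seconds:

| Metric | Value | Purpose |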
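|--------|--------|---------|
| **HealthStatus** | 1.0 (healthy) / 0.0 (unhealthy) | Track health status over time |
| **IncidentSimulations** | Count | Track how often incidents are triggered |

**Namespace**: `CustomApp/HealthDemo`

**Dimensions**: InstanceId, Environment, IncidentType

### Auto-Recovery Feature

When an unhealthy or slow-health incident is triggered, the application schedules a recovery automatically:

- **Delay**: 300 seconds (5 minutes)
- **Action**: restores the `healthy = true` state
- **Cancel**: calling `/simulate/healthy` cancels the pending recovery
- **Metrics**: publishes the updated health status to CloudWatch
- **Logs**: all recovery events are logged to `/var/log/user-data.log`

This allows agent detection and response to be tested without manual intervention.

### Implementation Details

**Source**: `templates/userdata.sh.tpl` (lines 33-650)

**Key characteristics**:

- Python 3.12 with boto3 as the AWS SDK
- Multi-threaded: main HTTP server plus a background metrics publisher
- Graceful error handling for CloudWatch API failures
- IMDSv2 for secure instance metadata retrieval
- systemd service restarts automatically on crash

**systemd service**:

```bash
# Check service status
sudo systemctl status webapp

# Follow the logs
sudo journalctl -u webapp -f

# Restart the service
sudo systemctl restart webapp
```

## Incident Simulation

Test a DevOps agent's incident response against 6 realistic failure scenarios.

### Overview

Incidents can be triggered in four ways:

1. **Interactive dashboard**: one-click buttons in the web UI
2. **Direct API**: curl commands against the ALB URL
3. **GitHub Actions**: automated workflow (`.github/workflows/trigger-incidents.yml`)
4. **AWS SSM**: commands sent directly to the instances

Every incident produces a CloudWatch alarm within 2-4 minutes, which keeps agent testing fast.

### Incident Types

#### 1. Unhealthy Host (`unhealthy_host`)

- **Behavior**: a single instance fails its health checks
- **Triggers**: `unhealthy-hosts` alarm (UnHealthyHostCount ≥ 1)
- **Expected alarm time**: 2-3 minutes
- **Auto-recovery**: 5 minutes
- **Use case**: test single-host failure detection and recovery

```bash
# Via the API
curl "${ALB_URL}/simulate/unhealthy"

# Via GitHub Actions
gh workflow run trigger-incidents.yml \
  -f environment=dev \
  -f incident_type=unhealthy_host
```

#### 2. All Hosts Unhealthy (`unhealthy_all`)

- **Behavior**: all instances fail their health checks at the same time
- **Triggers**: `unhealthy-hosts` alarm (UnHealthyHostCount ≥ instance_count)
- **Expected alarm time**: 2-3 minutes
- **Auto-recovery**: 5 minutes
- **Use case**: test detection of a complete service outage

```bash
# Via GitHub Actions (requires SSM)
gh workflow run trigger-incidents.yml \
  -f environment=dev \
  -f incident_type=unhealthy_all
```

#### 3. Application Crash (`crash_instance`)

- **Behavior**: the application exits with code 1 after 10 seconds
- **Triggers**: `unhealthy-hosts` alarm (systemd restarts the app after about 5 seconds)
- **Expected alarm time**: 2-4 minutes
- **Auto-recovery**: automatic systemd restart
- **Use case**: test application crash detection and systemd recovery

```bash
# Via the API
curl "${ALB_URL}/simulate/crash"

# Note: the app responds "will crash in 10 seconds" and then exits
```

#### 4. Slow Health Check (`slow_health`)

- **Behavior**: the health endpoint becomes unresponsive or slow
- **Triggers**: `high-response-time` alarm (TargetResponseTime > 2s)
- **Expected alarm time**: 2-3 minutes
- **Auto-recovery**: 5 minutes
- **Use case**: test performance degradation detection

```bash
# Via the API
curl "${ALB_URL}/simulate/slow-health"
```

#### 5. Shutdown Instances (`shutdown_instances`)

- **Behavior**: the Lambda function stops all EC2 instances
- **Triggers**: all alarms (unhealthy hosts, response time, 5xx)
- **Expected alarm time**: 3-5 minutes
- **Auto-recovery**: manual (requires `aws ec2 start-instances`)
- **Use case**: test infrastructure shutdown detection
- **Requires**: `enable_auto_shutdown = true`

```bash
# Via GitHub Actions
gh workflow run trigger-incidents.yml \
  -f environment=dev \
  -f incident_type=shutdown_instances
```

#### 6. HTTP 5xx Error Flood (`http_5xx_flood`)

- **Behavior**: generates 15 concurrent HTTP 503 errors
- **Triggers**: `5xx-errors` alarm (HTTPCode_Target_5XX_Count > 10)
- **Expected alarm time**: 2-3 minutes
- **Auto-recovery**: none (the errors are transient)
- **Use case**: test error-rate spike detection

```bash
# Via the API (fires multiple requests in parallel)
for i in {1..15}; do
  curl -s "${ALB_URL}/simulate/unhealthy" &
done
wait
```

### GitHub Actions Workflow

For automated, repeatable incident tests with alarm monitoring and automatic restoration.

**File**: `.github/workflows/trigger-incidents.yml`

**Features**:

- Automatic infrastructure discovery from Terraform outputs
- Pre-flight health checks
- CloudWatch alarm state monitoring
- Optional automatic restoration with a configurable delay
- Detailed execution summary with a timeline

**Example usage**:

```bash
# Trigger the workflow with the GitHub CLI
gh workflow run trigger-incidents.yml \
  -f environment=dev \
  -f incident_type=unhealthy_host \
  -f target_instance_index=0 \
  -f restore_after_incident=true \
  -f restore_delay_seconds=300 \
  -f wait_for_alarm=true
```

For full workflow documentation, see [.github/workflows/README-incidents.md](.github/workflows/README-incidents.md).

### Running Incidents via SSM

For direct instance access without SSH:

```bash
# Get the instance ID
INSTANCE_ID=$(terraform output -json instance_ids | jq -r '.[0]')

# Trigger the unhealthy state through SSM
aws ssm send-command \
  --instance-ids ${INSTANCE_ID} \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["curl http://localhost/simulate/unhealthy"]'

# Restore the healthy state
aws ssm send-command \
  --instance-ids ${INSTANCE_ID} \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["curl http://localhost/simulate/healthy"]'
```

## Monitoring & Alarms

Fast CloudWatch alarms (10-second evaluation periods) designed for rapid incident detection.

### CloudWatch Alarms

**Configuration**: `monitoring.tf` (enabled when `enable_monitoring = true`)

| Alarm Name | Metric | Threshold | Period | Evaluation | Expected Time to Fire |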
|------------|--------|-----------|--------|------------|-----------------------|
| **unhealthy-hosts** | UnHealthyHostCount | ≥ 1 | 10 seconds | 1 period | 2-3 minutes |
| **high-response-time** | TargetResponseTime | > 2 seconds | 10 seconds | 1 period | 2-3 minutes |
| **5xx-errors** | HTTPCode_Target_5XX_Count | > 10 | 10 seconds | 1 period | 2-3 minutes |

**Namespace**: `AWS/ApplicationELB`

**Dimensions**: TargetGroup, LoadBalancer

### Why 10-Second Periods?

Conventional CloudWatch alarms use 60-300 second periods. This demo uses **10-second periods** to get:

- **Faster detection**: alarms fire in 2-3 minutes instead of 5-10 minutes
- **Better agent testing**: a quick feedback loop for AI agent validation
- **Realistic demos**: test sessions stay short and interactive

**Trade-off**: higher CloudWatch API cost (about $0.10 per alarm per month). For production, use 60-300 second periods.

### Custom Metrics

Published by the Python application every 60 seconds.

**Namespace**: `CustomApp/HealthDemo`

| Metric | Value | Use Case |
|--------|-------|----------|
| **HealthStatus** | 1.0 / 0.0 | Track application health over time |
| **IncidentSimulations** | Count by type | Monitor incident test frequency |

**Example query**:

```bash
aws cloudwatch get-metric-statistics \
  --namespace CustomApp/HealthDemo \
  --metric-name HealthStatus \
  --dimensions Name=Environment,Value=dev \
  --start-time 2026-01-24T00:00:00Z \
  --end-time 2026-01-24T23:59:59Z \
  --period 300 \
  --statistics Average
```

### Monitoring Best Practices for Agent Testing

1. **Establish a baseline first**: deploy the infrastructure and observe normal metrics for 10 minutes
2. **Record timelines**: log the time from incident trigger to alarm transition
3. **Test one type at a time**: isolate variables for accurate agent validation
4. **Use auto-recovery**: for repeatable automated tests
5. **Disable auto-recovery**: when validating agent remediation actions
6. **Watch CloudWatch costs**: 10-second periods increase API usage

### Viewing Alarms

```bash
# List all alarms
aws cloudwatch describe-alarms \
  --query 'MetricAlarms[*].[AlarmName,StateValue]' \
  --output table

# Get the state of a specific alarm
ALARM_NAME=$(terraform output -json cloudwatch_alarm_names | jq -r '.unhealthy_hosts')
aws cloudwatch describe-alarms --alarm-names ${ALARM_NAME}

# View alarm history
aws cloudwatch describe-alarm-history \
  --alarm-name ${ALARM_NAME} \
  --max-records 10
```

## DevOps Agent Integration

Configure and test AI-driven DevOps agents against this demo infrastructure.

### Setting Up the Agent Environment

#### 1. AWS Credentials (Read-Only Recommended)

Grant your agent these IAM permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:DescribeAlarms",
        "cloudwatch:GetMetricStatistics",
        "ec2:DescribeInstances",
        "elasticloadbalancing:DescribeTargetHealth",
        "elasticloadbalancing:DescribeLoadBalancers",
        "logs:GetLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
```

For agent remediation testing, add this statement:

```json
{
  "Effect": "Allow",
  "Action": [
    "ssm:SendCommand",
    "ec2:StartInstances",
    "ec2:RebootInstances"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:RequestedRegion": "us-east-1"
    }
  }
}
```

#### 2. Resource Discovery

Configure the agent to discover infrastructure by tag:

```bash
# Get the environment tag used for filtering
ENVIRONMENT=$(terraform output -raw environment_tag)

# Discover EC2 instances
aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=${ENVIRONMENT}" \
  --query 'Reservations[*].Instances[*].[InstanceId,State.Name,Tags[?Key==`Name`].Value|[0]]'

# Get the alarm names
terraform output -json cloudwatch_alarm_names
```

**Resource tags**:

- `Environment`: dev/qa/prod (used for filtering)
- `ManagedBy`: Terraform (provisioned by automation)
- `Name`: descriptive resource name

#### 3. Alarm Configuration

Give the agent the alarm names to monitor:

```bash
terraform output cloudwatch_alarm_names

# Example output:
# {
#   "unhealthy_hosts": "dev-devops-demo-unhealthy-hosts",
#   "high_response_time": "dev-devops-demo-high-response-time",
#   "http_5xx_errors": "dev-devops-demo-5xx-errors"
# }
```

### Testing the Incident Detection Workflow

**Goal**: verify that the agent can detect and diagnose incidents.

#### Step 1: Trigger an Incident

```bash
# Trigger an unhealthy-host incident
curl "${ALB_URL}/simulate/unhealthy"
echo "Incident triggered at $(date)"
```

#### Step 2: Wait for the Alarm (2-4 Minutes)

```bash
# Monitor the alarm state
ALARM_NAME=$(terraform output -json cloudwatch_alarm_names | jq -r '.unhealthy_hosts')

while true; do
  STATE=$(aws cloudwatch describe-alarms \
    --alarm-names ${ALARM_NAME} \
    --query 'MetricAlarms[0].StateValue' \
    --output text)
  echo "[$(date +%H:%M:%S)] Alarm state: ${STATE}"
  [ "$STATE" == "ALARM" ] && break
  sleep 10
done
```

#### Step 3: Observe Agent Detection

Your agent should:

1. **Detect**: notice the alarm state change (polling or EventBridge)
2. **Identify**: determine the affected resources (EC2 instance, ALB target group)
3. **Diagnose**: correlate the alarm with the health check failures
4. **Propose**: suggest remediation actions

#### Step 4: Validate the Proposed Remediation

**Expected agent output**:

```
Incident Detected: Unhealthy Host Alarm
- Alarm: dev-devops-demo-unhealthy-hosts
- Metric: UnHealthyHostCount = 1
- Affected Instance: i-0123456789abcdef0
- Root Cause: Health check endpoint returning 503

Proposed Remediation:
1. Restart application: systemctl restart webapp
2. Or restore via API: curl http://localhost/simulate/healthy
3. Or wait for auto-recovery (5 minutes)
```

### Expected Agent Capabilities

| Capability | Description | Validation Method |
|------------|-------------|-------------------|
| **Alarm detection** | Detects CloudWatch alarm state changes | Trigger an incident, check the agent logs |
| **Resource correlation** | Identifies the affected EC2/ALB resources | Check that the agent names the correct instance |
| **Root cause analysis** | Diagnoses health check failures | Verify that the agent identifies the 503 responses |
| **Remediation proposal** | Suggests remediation actions | Review the proposed commands |
| **Action execution** | Executes SSM commands or API calls | Test with read-write IAM permissions |
| **Recovery validation** | Verifies that the alarm returns to OK | Confirm that the agent monitors post-fix state |

### Testing Automated Remediation

For agents with remediation capabilities (IAM permissions required):

#### Restart the Application via SSM

```bash
# Executed by the agent
INSTANCE_ID="<instance-id>"
aws ssm send-command \
  --instance-ids ${INSTANCE_ID} \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["sudo systemctl restart webapp"]'
```

#### Restore Health via the API

```bash
# Executed by the agent
INSTANCE_ID="<instance-id>"
aws ssm send-command \
  --instance-ids ${INSTANCE_ID} \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["curl http://localhost/simulate/healthy"]'
```

#### Start Stopped Instances

```bash
# For shutdown_instances incidents
INSTANCE_IDS=$(terraform output -json instance_ids | jq -r '.[]')
aws ec2 start-instances --instance-ids ${INSTANCE_IDS}
```

### Recovery Validation Checklist

After the agent remediates, verify:

- [ ] The alarm state returns to OK (within 2-3 minutes)
- [ ] Target health shows "healthy" in the ALB
- [ ] The application responds with 200 on `/health`
- [ ] CloudWatch metrics show HealthStatus = 1.0
- [ ] No new alarms were triggered

### Example Agent Prompt

For testing AI agents with a natural-language interface:

```
You are a DevOps agent monitoring AWS infrastructure in us-east-1.

Environment: dev
Available Alarms:
- dev-devops-demo-unhealthy-hosts
- dev-devops-demo-high-response-time
- dev-devops-demo-5xx-errors

Task: Monitor CloudWatch alarms and respond to incidents.
When an alarm triggers:
1. Identify affected resources
2. Diagnose root cause
3. Propose remediation
4. Execute fix (if approved)
5. Validate recovery

Start monitoring now.
```

### Integration Patterns

#### Polling-Based

```python
import time

import boto3

cloudwatch = boto3.client("cloudwatch")
alarm_list = ["dev-devops-demo-unhealthy-hosts"]  # from `terraform output cloudwatch_alarm_names`

def handle_incident(alarm):
    """Placeholder: replace with your agent's incident handler."""
    print(f"ALARM: {alarm['AlarmName']}")

# Poll CloudWatch every 30 seconds
while True:
    alarms = cloudwatch.describe_alarms(AlarmNames=alarm_list)
    for alarm in alarms["MetricAlarms"]:
        if alarm["StateValue"] == "ALARM":
            handle_incident(alarm)
    time.sleep(30)
```

#### Event-Driven (EventBridge)

```json
{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "state": {
      "value": ["ALARM"]
    }
  }
}
```

#### API-Based (GitHub Actions Webhook)

Configure a webhook so the agent is notified whenever the workflow triggers an incident.

## Infrastructure Details

### Environment Configuration

Three preconfigured environments with different resource profiles:

| Setting | dev | qa | prod |
|---------|-----|-----|------|
| **Instance type** | t3.micro | t3.small | t3.medium |
| **Instance count** | 2 | 2 | 3 |
| **Auto-shutdown** | Enabled (2 hours) | Enabled (2 hours) | Disabled |
| **Monitoring** | Enabled | Enabled | Enabled |
| **SSH access** | Optional | Optional | Disabled |
| **Region** | us-east-1 | us-east-1 | us-east-1 |

**Cost estimates** (with auto-shutdown enabled):

- **dev**: about $15-20/month (active 8 hours per day)
- **qa**: about $25-30/month (active 8 hours per day)
- **prod**: about $80-100/month (always on, larger instances)

### Customization

Edit the environment-specific tfvars file:

```hcl
# environments/dev/dev.tfvars
env                  = "dev"
region               = "us-east-1"
instance_type        = "t3.micro"
instance_count       = 2
enable_auto_shutdown = true
enable_monitoring    = true
enable_ssh_access    = false
```

**Common customizations**:

```hcl
# Increase the instance count for multi-host testing
instance_count = 4

# Use a larger instance for performance testing
instance_type = "t3.small"

# Disable auto-shutdown for long-running tests
enable_auto_shutdown = false

# Enable SSH for direct instance access
enable_ssh_access = true
ssh_allowed_cidrs = ["1.2.3.4/32"]  # Your IP
key_pair_name     = "my-key"        # Existing EC2 key pair
```

Apply the changes:

```bash
terraform apply -var-file=environments/dev/dev.tfvars
```

### Cost Optimization

#### Auto-Shutdown Lambda

**File**: `auto_shutdown.tf`

**Enabled by default**: dev, qa (disabled in prod)

**How it works**:

1. An EventBridge rule triggers the Lambda every 2 hours
2. The Lambda stops all EC2 instances in the environment
3. This saves about 60% of the EC2 cost for idle demos

**Manual control**:

```bash
# Stop the instances manually
INSTANCE_IDS=$(terraform output -json instance_ids | jq -r '.[]')
aws ec2 stop-instances --instance-ids ${INSTANCE_IDS}

# Start the instances
aws ec2 start-instances --instance-ids ${INSTANCE_IDS}
```

**Disable auto-shutdown**:

```hcl
# environments/dev/dev.tfvars
enable_auto_shutdown = false
```

**Cost savings example**:

- **Without auto-shutdown**: 2x t3.micro × 730 hours = $15/month
- **With auto-shutdown (8 hours/day)**: 2x t3.micro × 240 hours = $5/month
- **Savings**: $10/month (67%)

#### Destroy When Not Needed

```bash
# Remove the infrastructure completely
terraform destroy -var-file=environments/dev/dev.tfvars
```

### SSH Access (Optional)

**Default**: SSH is disabled for security

**To enable**:

1. Create an EC2 key pair in the AWS console
2. Update the tfvars file:

   ```hcl
   enable_ssh_access = true
   ssh_allowed_cidrs = ["YOUR_IP/32"]
   key_pair_name     = "your-key-name"
   ```

3. Apply the changes:

   ```bash
   terraform apply -var-file=environments/dev/dev.tfvars
   ```

4. SSH into an instance:

   ```bash
   INSTANCE_IP=$(aws ec2 describe-instances \
     --instance-ids $(terraform output -json instance_ids | jq -r '.[0]') \
     --query 'Reservations[0].Instances[0].PublicIpAddress' \
     --output text)
   ssh -i ~/.ssh/your-key.pem ec2-user@${INSTANCE_IP}
   ```

### Terraform State Backend

**Default**: local state (`terraform.tfstate` in the working directory)

**Team collaboration**: update `versions.tf` to use an S3 backend:

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "aws-devops-demo/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}
```

Create the backend resources:

```bash
# Create the S3 bucket
aws s3 mb s3://my-terraform-state-bucket --region us-east-1

# Create the DynamoDB table used for locking
aws dynamodb create-table \
  --table-name terraform-state-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1
```

## Development

### Local Setup

```bash
# Install the pre-commit hooks (required before the first commit)
pre-commit install

# Initialize Terraform
terraform init

# Run the pre-commit checks manually
pre-commit run --all-files

# Format and validate Terraform
terraform fmt -recursive
terraform validate

# Run TFLint (downloads the AWS plugin on first run)
tflint --init
tflint --config=.tflint.hcl

# Run the Checkov security scan
checkov --config-file .checkov.yml

# Generate documentation
terraform-docs markdown table --output-file README.md .
```

### Pre-Commit Checks

Run automatically before every commit:

| Check | Purpose |
|-------|---------|
| **terraform_fmt** | Format .tf files |
| **terraform_validate** | Validate syntax |
| **terraform_docs** | Generate README documentation |
| **terraform_tflint** | Lint with AWS rules |
| **terraform_checkov** | Security scanning |
| **gitleaks** | Secret detection |
| **trailing-whitespace** | Fix whitespace |
| **check-yaml** | Validate YAML files |

**Configuration**: `.pre-commit-config.yaml`

### CI/CD Workflows

**Location**: `.github/workflows/`

| Workflow | Trigger | Purpose |
|----------|---------|---------|
| **pre-commit-ci.yaml** | Pull request | Run all quality checks |
| **terraform-aws.yml** | Manual (workflow_dispatch) | Deploy the infrastructure |
| **trigger-incidents.yml** | Manual (workflow_dispatch) | Trigger incidents for testing |
| **release.yaml** | Push to main | Semantic versioning and changelog |

### Semantic Commits

Follow Conventional Commits for automated versioning:

```bash
# Patch bump (1.0.0 → 1.0.1)
git commit -m "fix(monitoring): correct alarm threshold"

# Minor bump (1.0.0 → 1.1.0)
git commit -m "feat(incidents): add new slow-response incident type"

# Major bump (1.0.0 → 2.0.0)
git commit -m "feat(api): change health endpoint path

BREAKING CHANGE: health endpoint moved from /health to /api/v2/health"

# No version bump
git commit -m "docs: update README with new examples"
git commit -m "chore: update pre-commit hooks"
```

**Types**:

- `feat`: new feature (minor bump)
- `fix`: bug fix (patch bump)
- `perf`: performance improvement (patch bump)
- `refactor`: code refactoring (patch bump)
- `docs`: documentation only (no version bump)
- `chore`: maintenance task (no version bump)
- `test`: test additions (no version bump)

### Branching Strategy

**Key rule**: never commit directly to `main`

**Workflow**:

1. Create a feature branch: `git checkout -b feat/your-feature`
2. Make changes using semantic commits
3. Push and open a pull request
4. Wait for the CI/CD checks to pass
5. Merge through the pull request

**Branch naming**:

- `feat/description` - new feature
- `fix/description` - bug fix
- `chore/description` - maintenance task

### DevContainer

A preconfigured development environment is available:

```bash
# Open in VS Code
code .
# Prompt: "Reopen in Container" → accept

# Or start it manually
docker-compose -f .devcontainer/docker-compose.yml up
```

**Includes**: Terraform, TFLint, Checkov, AWS CLI, pre-commit

For the complete development guide, see [CLAUDE.md](CLAUDE.md).

## Advanced Topics

### Troubleshooting

#### Web App Not Responding

**Symptoms**: the ALB returns 502/503 errors and health checks fail

**Diagnosis**:

```bash
# Check the instance state
INSTANCE_ID=$(terraform output -json instance_ids | jq -r '.[0]')
aws ec2 describe-instances --instance-ids ${INSTANCE_ID} \
  --query 'Reservations[0].Instances[0].State.Name'

# Check the service status through SSM
aws ssm send-command \
  --instance-ids ${INSTANCE_ID} \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["systemctl status webapp"]'

# View the application logs
aws ssm send-command \
  --instance-ids ${INSTANCE_ID} \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["journalctl -u webapp -n 50"]'
```

**Solutions**:

1. Restart the web app: `systemctl restart webapp`
2. Check the UserData log: `/var/log/user-data.log`
3. Verify that boto3 is installed: `python3 -c "import boto3"`
4. Reboot the instance if necessary

#### Alarm Not Firing

**Symptoms**: the incident was triggered but the alarm stays in OK

**Diagnosis**:

```bash
# Check the alarm configuration
ALARM_NAME=$(terraform output -json cloudwatch_alarm_names | jq -r '.unhealthy_hosts')
aws cloudwatch describe-alarms --alarm-names ${ALARM_NAME}

# Look up the ALB and target group ARNs
ALB_ARN=$(terraform output -raw alb_arn)
TG_ARN=$(aws elbv2 describe-target-groups \
  --load-balancer-arn ${ALB_ARN} \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

# Check the actual metric values; the dimensions are the ARN suffixes
# ("targetgroup/<name>/<id>" and "app/<name>/<id>")
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name UnHealthyHostCount \
  --dimensions Name=TargetGroup,Value=$(echo "${TG_ARN}" | cut -d: -f6) \
               Name=LoadBalancer,Value=$(echo "${ALB_ARN}" | cut -d: -f6 | cut -d/ -f2-) \
  --start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Maximum

# Verify target health
aws elbv2 describe-target-health --target-group-arn ${TG_ARN}
```

**Solutions**:

1. Wait for a full evaluation period (10-60 seconds)
2. Verify that the instance really is unhealthy in the ALB
3. Check that the alarm dimensions match the resources
4. Confirm `enable_monitoring = true`

#### Auto-Shutdown Problems

**Symptoms**: instances are not stopped after 2 hours

**Diagnosis**:

```bash
# Check the Lambda function
LAMBDA_ARN=$(terraform output -raw auto_shutdown_lambda_arn)
aws lambda get-function --function-name ${LAMBDA_ARN}

# Check the EventBridge rule
aws events list-rules --name-prefix "dev-devops-demo-auto-shutdown"

# Tail the Lambda logs
aws logs tail /aws/lambda/dev-devops-demo-auto-shutdown --follow
```

**Solutions**:

1. Invoke the Lambda manually to test it
2. Check that the Lambda has IAM permission for ec2:StopInstances
3. Verify that the EventBridge rule is enabled
4. Check the instance IDs in the Lambda environment variables

### Multi-Region Deployment

Use Terraform workspaces to deploy to multiple regions:

```bash
# Create a workspace for us-west-2
terraform workspace new us-west-2

# Update the provider region
export TF_VAR_region=us-west-2

# Deploy
terraform apply -var-file=environments/dev/dev.tfvars

# Switch back to us-east-1
terraform workspace select default
```

Or use a separate state file per region:

```bash
# Deploy to us-east-1
terraform apply -var-file=environments/dev/dev.tfvars -var="region=us-east-1"

# Deploy to us-west-2 with a different state file
terraform apply -var-file=environments/dev/dev.tfvars -var="region=us-west-2" \
  -state=terraform-us-west-2.tfstate
```

### Custom Incident Types

Extend the web application with custom incidents.

**File**: `templates/userdata.sh.tpl` (lines 545+)

**Example**: add a database connection failure simulation

```python
        elif self.path == '/simulate/db-timeout':
            health_status["healthy"] = False
            health_status["reason"] = "Database connection timeout"
            health_status["last_updated"] = datetime.now().isoformat()
            schedule_auto_recovery(300)
            publish_health_metric()
            publish_incident_metric('db-timeout')
            self.send_json_response(200, {
                "message": "Simulating database timeout (auto-recovery in 5 minutes)"
            })
```

Redeploy:

```bash
terraform taint 'aws_instance.web[0]'  # Force instance replacement so UserData runs again
terraform apply -var-file=environments/dev/dev.tfvars
```

### Integrating with External Monitoring

Send CloudWatch alarms to external systems.

**SNS topic** (add to `monitoring.tf`):

```hcl
resource "aws_sns_topic" "alarms" {
  name = "${local.name_prefix}-alarm-notifications"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alarms.arn
  protocol  = "email"
  endpoint  = "your-email@example.com"
}

# Add to the alarm
resource "aws_cloudwatch_metric_alarm" "unhealthy_hosts" {
  # ... existing config ...
  alarm_actions = [aws_sns_topic.alarms.arn]
}
```

**Webhook integration**:

```hcl
resource "aws_sns_topic_subscription" "webhook" {
  topic_arn = aws_sns_topic.alarms.arn
  protocol  = "https"
  endpoint  = "https://your-agent-endpoint.com/alarms"
}
```

## Documentation Index

| Document | Purpose |
|----------|---------|
| **README.md** (this file) | Main documentation, quick start, architecture |
| **[CLAUDE.md](CLAUDE.md)** | Development guide, branching strategy, CI/CD |
| **[BEST-PRACTICES.md](BEST-PRACTICES.md)** | Terraform best practices, security, performance |
| **[.github/workflows/README-incidents.md](.github/workflows/README-incidents.md)** | Full incident workflow documentation |
| **[CHANGELOG.md](CHANGELOG.md)** | Version history and release notes |
| **[LICENSE](LICENSE)** | MIT license terms |

### External Resources

- **Terraform AWS Provider**: https://registry.terraform.io/providers/hashicorp/aws/latest/docs
- **CloudWatch Alarms**: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
- **ALB Health Checks**: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
- **AWS SSM Session Manager**: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html
- **Semantic Commits**: https://www.conventionalcommits.org/

## Technical Reference

Auto-generated Terraform documentation (updated by the `terraform-docs` pre-commit hook).

## Requirements

| Name | Version |
|------|---------|
| [terraform](#requirement\_terraform) | >= 1.0 |
| [archive](#requirement\_archive) | ~> 2.7 |
| [aws](#requirement\_aws) | ~> 6.28 |

## Providers

| Name | Version |
|------|---------|
| [archive](#provider\_archive) | 2.7.1 |
| [aws](#provider\_aws) | 6..0 |

## Modules

No modules.

## Resources

| Name | Type |
|------|------|
| [aws_cloudwatch_event_rule.auto_shutdown](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_rule) | resource |
| [aws_cloudwatch_event_target.auto_shutdown](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_event_target) | resource |
| [aws_cloudwatch_log_group.auto_shutdown](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_log_group) | resource |
| [aws_cloudwatch_metric_alarm.high_response_time](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_metric_alarm) | resource |
| [aws_cloudwatch_metric_alarm.http_5xx_errors](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_metric_alarm) | resource |
| [aws_cloudwatch_metric_alarm.unhealthy_hosts](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/cloudwatch_metric_alarm) | resource |
| [aws_iam_instance_profile.ec2](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_instance_profile) | resource |
| [aws_iam_role.ec2_instance](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
| [aws_iam_role.lambda_auto_shutdown](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role) | resource |
| [aws_iam_role_policy.lambda_ec2_stop](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy) | resource |
| [aws_iam_role_policy_attachment.ec2_cloudwatch](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
| [aws_iam_role_policy_attachment.ec2_ssm](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
| [aws_iam_role_policy_attachment.lambda_basic](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_role_policy_attachment) | resource |
| [aws_instance.web](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance) | resource |
| [aws_internet_gateway.main](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/internet_gateway) | resource |
| [aws_lambda_function.auto_shutdown](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function) | resource |
| [aws_lambda_permission.auto_shutdown](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_permission) | resource |
| [aws_lb.main](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lb) | resource |
| [aws_lb_listener.http](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lb_listener) | resource |
| [aws_lb_target_group.main](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lb_target_group) | resource |
| [aws_lb_target_group_attachment.web](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lb_target_group_attachment) | resource |
| [aws_route.public_internet](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route) | resource |
| [aws_route_table.public](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route_table) | resource |
| [aws_route_table_association.public](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/route_table_association) | resource |
| [aws_security_group.alb](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource |
| [aws_security_group.ec2](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/security_group) | resource |
| [aws_subnet.public](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/subnet) | resource |
| [aws_vpc.main](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc) | resource |
| [aws_vpc_security_group_egress_rule.alb_to_ec2](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_egress_rule) | resource |
| [aws_vpc_security_group_egress_rule.ec2_all](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_egress_rule) | resource |
| [aws_vpc_security_group_ingress_rule.alb_http](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_ingress_rule) | resource |
| [aws_vpc_security_group_ingress_rule.ec2_http_from_alb](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_ingress_rule) | resource |
| [aws_vpc_security_group_ingress_rule.ec2_ssh](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_ingress_rule) | resource |
| [archive_file.auto_shutdown](https://registry.terraform.io/providers/hashicorp/archive/latest/docs/data-sources/file) | data source |
| [aws_ami.amazon_linux_2023](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/ami) | data source |
| [aws_availability_zones.available](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/availability_zones) | data source |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| [enable\_auto\_shutdown](#input\_enable\_auto\_shutdown) | Enable automatic instance shutdown after 2 hours (demo cost savings) | `bool` | `true` | no |
| [enable\_monitoring](#input\_enable\_monitoring) | Enable CloudWatch alarms for the ALB and targets | `bool` | `true` | no |
| [enable\_ssh\_access](#input\_enable\_ssh\_access) | Enable SSH access to the EC2 instances (requires ssh\_allowed\_cidrs) | `bool` | `false` | no |
| [env](#input\_env) | Environment in which resources are deployed | `string` | `"dev"` | no |
| [instance\_count](#input\_instance\_count) | Number of EC2 instances to create | `number` | `2` | no |
| [instance\_type](#input\_instance\_type) | EC2 instance type for the demo web servers | `string` | `"t3.micro"` | no |
| [key\_pair\_name](#input\_key\_pair\_name) | EC2 key pair name for SSH access (optional; leave empty to disable SSH keys) | `string` | `""` | no |
| [prefix](#input\_prefix) | Prefix for all resource names | `string` | `"dev"` | no |
| [region](#input\_region) | Region in which resources are deployed | `string` | `"us-east-1"` | no |
| [ssh\_allowed\_cidrs](#input\_ssh\_allowed\_cidrs) | List of CIDR blocks allowed to SSH to the EC2 instances | `list(string)` | `[]` | no |
| [vpc\_cidr](#input\_vpc\_cidr) | CIDR block for the VPC | `string` | `"10.0.0.0/16"` | no |

## Outputs

| Name | Description |
|------|-------------|
| [alb\_arn](#output\_alb\_arn) | ARN of the Application Load Balancer |
| [alb\_dns\_name](#output\_alb\_dns\_name) | DNS name of the Application Load Balancer |
| [alb\_url](#output\_alb\_url) | URL for accessing the application |
| [auto\_shutdown\_enabled](#output\_auto\_shutdown\_enabled) | Whether auto-shutdown is enabled |
| [auto\_shutdown\_lambda\_arn](#output\_auto\_shutdown\_lambda\_arn) | ARN of the auto-shutdown Lambda function (if enabled) |
| [cloudwatch\_alarm\_names](#output\_cloudwatch\_alarm\_names) | Map of alarm type to alarm name for incident testing |
| [cloudwatch\_alarms](#output\_cloudwatch\_alarms) | Names of the CloudWatch alarms (if monitoring is enabled) |
| [environment](#output\_environment) | Environment name |
| [environment\_tag](#output\_environment\_tag) | Tag value used for resource discovery in the DevOps Agent Space |
| [health\_check\_url](#output\_health\_check\_url) | URL of the health check endpoint |
| [instance\_ids](#output\_instance\_ids) | IDs of the EC2 instances |
| [instance\_private\_ips](#output\_instance\_private\_ips) | Private IP addresses of the EC2 instances |
| [public\_subnet\_ids](#output\_public\_subnet\_ids) | IDs of the public subnets |
| [region](#output\_region) | AWS region in which resources are deployed |
| [resource\_prefix](#output\_resource\_prefix) | Prefix used for resource naming |
| [restore\_health\_command](#output\_restore\_health\_command) | Command to restore the healthy state |
| [trigger\_failure\_command](#output\_trigger\_failure\_command) | Command to trigger a health check failure (run via SSM or SSH) |
| [vpc\_id](#output\_vpc\_id) | ID of the VPC |
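
The outputs listed above are what an agent or test script would typically consume. The following is a minimal sketch, assuming the standard `terraform output -json` layout in which every output is wrapped in an object with a `value` key. The helper name `flatten_outputs` and the sample values are hypothetical and are not part of this repository.

```python
import json


def flatten_outputs(raw_json: str) -> dict:
    """Map each Terraform output name to its bare value.

    Assumes the `terraform output -json` layout:
    {"name": {"value": ..., "type": ..., "sensitive": ...}}
    """
    parsed = json.loads(raw_json)
    return {name: entry["value"] for name, entry in parsed.items()}


# Hypothetical sample; real data comes from `terraform output -json`.
SAMPLE = json.dumps({
    "alb_url": {
        "value": "http://example-alb.us-east-1.elb.amazonaws.com",
        "type": "string",
        "sensitive": False,
    },
    "instance_ids": {
        "value": ["i-0123456789abcdef0"],
        "type": ["tuple", ["string"]],
        "sensitive": False,
    },
    "cloudwatch_alarm_names": {
        "value": {"unhealthy_hosts": "dev-devops-demo-unhealthy-hosts"},
        "type": ["object", {"unhealthy_hosts": "string"}],
        "sensitive": False,
    },
})

outputs = flatten_outputs(SAMPLE)
print(outputs["cloudwatch_alarm_names"]["unhealthy_hosts"])
```

In practice the input string could come from `subprocess.run(["terraform", "output", "-json"], capture_output=True, text=True).stdout`, run from the repository root after `terraform apply`.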

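Tags: AIOps, AWS, CloudWatch, DPI, ECS, GitHub Actions, Lambda, LNA, Python, Terraform, infrastructure, failure simulation, no backdoor, test platform, chaos engineering, exploit detection, capability evaluation, automated operations, automated notes, blue team exercises, footprint analysis, operations monitoring, reverse engineering tools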