Likhith-BlueLotus/cloudops-intelligence

GitHub: Likhith-BlueLotus/cloudops-intelligence

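A CloudOps and security-operations training environment for LLM agents. It simulates real-world cloud failures and security incidents, and lets agents investigate logs, locate root causes, and apply and verify fixes.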


---
title: CloudOps Intelligence Environment
emoji: ☁️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
license: bsd-3-clause
short_description: CloudOps + SOC Analyst env for LLM agents
tags:
  - reinforcement-learning
  - openenv
  - cloudops
  - finops
  - cloud-security
  - sre
  - devops
  - secops
  - soc-analyst
  - incident-response
  - terraform
  - threat-intel
  - aws
---

# CloudOps Intelligence Environment

[![OpenEnv Spec Compliant](https://img.shields.io/badge/OpenEnv-≥0.2.2-blue)](https://github.com/openenv/openenv) [![Python 3.10+](https://img.shields.io/badge/python-3.10+-green.svg)](https://python.org) [![Tests](https://img.shields.io/badge/tests-40%20passing-brightgreen)](#testing)

## Why This Environment?

Real-world cloud operations teams deal with two distinct classes of problems **daily**:

| Domain                  | Challenge                   | Real impact                           | This environment                     |
| ----------------------- | --------------------------- | ------------------------------------- | ------------------------------------ |
| **CloudOps / FinOps**   | Idle cloud resources        | $28B/year waste (Flexera 2024)        | Zombie EC2 fleet burning $12k/month  |
| **CloudOps / Security** | Cloud misconfiguration      | 82% of breaches (IBM X-Force)         | S3 public exposure + IAM typo        |
| **CloudOps / DDoS**     | Live attack + runaway cost  | $50k/hr average DDoS impact           | DDoS + auto-scaling at $51k/hr       |
| **SOC / Alert triage**  | Account compromise          | 80% of attacks use stolen credentials | Brute-force SSH → active session     |
| **SOC / Malware**       | C2 beacon + credential dump | QakBot precursor to ransomware        | Feodo C2 + LSASS dump (8 accounts)   |
| **SOC / APT**           | Multi-stage threat          | Avg. 207 days to detect (IBM)         | C2 + lateral movement + 2.3 GB exfil |

Existing benchmarks (SWE-bench, WebArena, OSWorld) test coding or web navigation. **No existing benchmark tests CloudOps + SOC agents**, which must reason jointly across billing data, IAM policies, VPC flow logs, SIEM alerts, and threat intelligence.

## Tasks

### CloudOps Track

### Easy (FinOps): Zombie EC2 Cost Anomaly

**Scenario**: The monthly AWS bill has spiked 340% ($12,400 versus a $2,800 baseline). Three `m5.2xlarge` instances from a cancelled project ("ProjectPhoenix") have been running for 32 consecutive days at 0% CPU, burning $885 per month.

**Investigation path**:

```
view_billing(ec2, month)             → See $9,600 EC2 spike
list_resources(ec2) / run_cli(       → Find 3 zombie instances
  aws ec2 describe-instances)          tagged Project=ProjectPhoenix, Status=cancelled
apply_fix(ec2_fleet, terminate,      → Terminate all three
  config_key=zombie)
verify(ec2_fleet)                    → Confirm fleet healthy
```

**Root cause**: `zombie_ec2_cost_overrun` | **Services**: 2 | **Budget**: 15 steps

### Medium (Security + SRE): S3 Exposure + IAM Typo

**Scenario**: A bad deployment (v4.2.0) introduced two concurrent issues:

1. The ACL on S3 bucket `prod-customer-data` was set to `public-read-write`, exposing customer PII to the public internet; the GDPR violation window has been open for 3 hours.
2. The payment service IAM role policy contains a typo (`s3:GetObejct` instead of `s3:GetObject`). AWS silently ignores the invalid action, so every payment certificate load fails with a 403; **89% checkout error rate**.

**Investigation path**:

```
view_logs(payment_service)           → See S3 403 errors
run_cli(aws s3api get-bucket-acl     → Find public-read-write ACL
  --bucket prod-customer-data)
run_cli(aws iam get-role-policy      → Find typo: s3:GetObejct
  --role-name payment-service-role)
apply_fix(s3, block_public_access)   → Block all public access
apply_fix(iam_role, update_policy,   → Fix typo to s3:GetObject
  config_key=s3:GetObject)
verify(payment_service)              → Confirm checkout restored
```

**Root causes**: `s3_public_access_enabled`, `iam_role_typo` | **Services**: 5 | **Budget**: 25 steps

### Hard (DDoS + FinOps + SRE): Live Attack + Runaway Cost + Cascading Failure

**Scenario**: A coordinated volumetric DDoS attack from three CIDR ranges floods the API gateway with 840,000 req/min (700x baseline). Auto-scaling responds by launching 200 additional EC2 instances; the current cost is **$51,200/hr**, with `max_capacity=500` (unbounded). The attack triggers cascading failures: `order_service` crashes and `inventory_service` is degraded.

**Three root causes**:

1. **No WAF Web ACL**: Terraform must be written and deployed:

   ```
   resource "aws_wafv2_ip_set" "block_ips" {
     ip_address_version = "IPV4"
     addresses          = ["203.0.113.0/24", "198.51.100.0/24", "192.0.2.0/24"]
   }

   resource "aws_wafv2_web_acl" "main" {
     rule {
       action { block {} }
     }
   }
   ```

2. **Auto-scaling `max_capacity=500`** with no DDoS protection configured: the cap must be lowered and excess instances scaled in.
3. **No API Gateway rate limiting**: throttling must be enabled.

**Investigation path**:

```
view_logs(api_gateway)               → See 840k req/min flood
run_cli(aws vpc get-flow-logs)       → Find attack CIDRs: 203.0.113.0/24,
                                       198.51.100.0/24, 192.0.2.0/24
run_cli(aws wafv2 list-web-acls)     → Confirm no WAF exists
view_billing(ec2, realtime)          → See $51,200/hr from auto-scaling
write_terraform(aws_wafv2_web_acl,   → Deploy WAF blocking rule
  config=)
apply_fix(auto_scaling,              → Cap max_capacity + terminate excess
  adjust_config, max_capacity=20)
apply_fix(api_gateway,               → Enable rate limiting
  enable_rate_limiting, throttle)
verify(api_gateway)                  → Confirm attack mitigated
```

**Root causes**: `waf_not_configured`, `autoscaling_unbounded`, `api_gateway_no_rate_limit`
**Services**: 6 | **Budget**: 40 steps

### SOC Analyst Track

### SOC Easy: Brute-Force SSH → Account Compromise

**Scenario**: SIEM alert SOC-2847: 247 failed SSH login attempts from Tor exit node `185.220.101.45` (Spamhaus DROP list), followed by 1 **successful** login as `svc_deploy`. The attacker is running `sudo` commands and attempting to download an implant.

**Investigation path**:

```
lookup_threat_intel(185.220.101.45)  → Confirm: Tor exit node, abuse score 97/100
view_logs(bastion_host)              → Find active session + attacker commands
apply_fix(bastion_host,              → Revoke attacker session immediately
  revoke_session, session_token)
verify(bastion_host)                 → Confirm clean
```

**Root cause**: `compromised_bastion_access` | **Services**: 2 | **Budget**: 15 steps

### SOC Medium: QakBot C2 + LSASS Credential Dump

**Scenario**: SIEM alert SOC-3991, with three correlated rules:

1. QakBot C2 beacon from `ENG-WORKSTATION-47` to `162.243.103.246:8080` (Feodo Tracker, online)
2. LSASS memory access: NTLM hashes for 8 accounts dumped (MITRE T1003.001)
3. SMB lateral-movement probing across the entire `10.0.2.0/24` subnet

**Investigation path**:

```
lookup_threat_intel(162.243.103.246) → Confirm: QakBot C2, Feodo Tracker
view_logs(endpoint_security)         → Find infected host + C2 connection
apply_fix(endpoint_security,         → Isolate ENG-WORKSTATION-47
  isolate_host, ENG-WORKSTATION-47)
apply_fix(auth_service,              → Rotate all 8 compromised credentials
  revoke_credentials, compromised_accounts)
verify(endpoint_security)            → Confirm C2 severed
```

**Root causes**: `malware_c2_beacon`, `credential_dump` | **Services**: 4 | **Budget**: 25 steps

### SOC Hard (APT): C2 + Lateral Movement + S3 Data Exfiltration

**Scenario**: SIEM alert SOC-4128, with five correlated GuardDuty/IDS findings:

1. Active QakBot C2 from `PROD-SRV-12` → `50.16.16.211:443` (online, Feodo Tracker, beaconing for 6+ hours)
2. WMI/SMB lateral movement from `PROD-SRV-12` → `PROD-SRV-07`, `PROD-SRV-09`, `DB-PRIMARY` (MITRE T1021)
3. `DataScienceRole` credential theft + 1,847 S3 GetObject calls = 2.3 GB exfiltrated (MITRE T1530)
4. API calls originating from the C2 IP (MITRE T1078, Valid Accounts)
5. GuardDuty finding: `UnauthorizedAccess:IAMUser/TorIPCaller`

**Investigation path**:

```
view_logs(endpoint_security)         → EDR: PROD-SRV-12 → 50.16.16.211:443 beacon,
                                       WMI lateral movement to PROD-SRV-07/09/DB-PRIMARY
apply_fix(endpoint_security,         → Isolate PROD-SRV-12 (primary C2 host)
  isolate_host, PROD-SRV-12)
write_terraform(aws_network_acl,     → Block C2 IP 50.16.16.211 at network ACL
  cidr=50.16.16.211/32, rule=DENY)
view_logs(auth_service)              → DataScienceRole session from C2 IP
apply_fix(s3_data_lake,              → Revoke stolen DataScienceRole IAM session
  revoke_session, DataScienceRole)
verify(s3_data_lake)                 → Confirm C2 severed + exfiltration stopped
```

**Root causes**: `active_c2_beacon`, `lateral_movement`, `s3_data_exfiltration`
**Services**: 5 | **Budget**: 40 steps

## Action Space

All actions are text-based JSON objects: no spatial grid, no physics engine.

| Action                | Description                          | Example |
| --------------------- | ------------------------------------ | ------- |
| `view_logs`           | Service log output                   | `{"action_type": "view_logs", "target": "bastion_host"}` |
| `view_metrics`        | Time-series data                     | `{"action_type": "view_metrics", "target": "api_gateway", "parameters": {"metric": "request_rate"}}` |
| `list_resources`      | AWS resource inventory               | `{"action_type": "list_resources", "parameters": {"type": "ec2"}}` |
| `run_cli`             | AWS CLI / system command             | `{"action_type": "run_cli", "parameters": {"command": "aws guardduty list-findings"}}` |
| `view_billing`        | Cost and usage reports               | `{"action_type": "view_billing", "target": "ec2", "parameters": {"period": "month"}}` |
| `lookup_threat_intel` | Query Feodo/Spamhaus/AbuseIPDB feeds | `{"action_type": "lookup_threat_intel", "parameters": {"ioc": "50.16.16.211", "ioc_type": "ip"}}` |
| `apply_fix`           | Apply remediation                    | `{"action_type": "apply_fix", "target": "endpoint_security", "parameters": {"fix_type": "isolate_host", "config_key": "ENG-WORKSTATION-47"}}` |
| `write_terraform`     | Generate + validate Terraform        | `{"action_type": "write_terraform", "parameters": {"resource_type": "aws_network_acl", "config": "cidr=50.16.16.211/32 rule=DENY"}}` |
| `verify`              | Health / security check              | `{"action_type": "verify", "target": "network_ids"}` |
| `escalate`            | Hand off (partial credit)            | `{"action_type": "escalate"}` |

**CloudOps `fix_type` options**: `terminate`, `block_public_access`, `fix_iam`, `adjust_config`, `enable_rate_limiting`

**SOC `fix_type` options**: `revoke_session`, `block_ip`, `isolate_host`, `quarantine`, `revoke_credentials`, `revoke_access`

## Observation Space

```
from typing import List

class IncidentObservation(Observation):
    situation_report: str          # Current step/task status summary
    services: List[ServiceHealth]  # Per-service health snapshot
    action_output: str             # Result of the last action (logs, CLI output, etc.)
    available_actions: List[str]   # Valid action types
    services_healthy: int          # Count of healthy services
    services_total: int            # Total services in episode
    root_causes_found: int         # Root causes identified so far
    root_causes_total: int         # Total root causes in scenario
    reward: float                  # Step reward ∈ [0.0, 1.0]
    done: bool                     # Episode complete flag
```

## Reward Function

```
R_step = +0.08  investigation discovery (view_logs / run_cli / view_billing /
                lookup_threat_intel that reveals a new root cause clue
                for the first time)
       + 0.30   for each new root cause correctly identified via
                apply_fix / write_terraform
       + 0.30   for each correct fix applied
       + 0.10   for each service verified healthy after fix
       + 0.20   episode completion bonus (all root causes resolved +
                all services healthy)
       - 0.05   wrong-target fix penalty
       - 0.02   redundant repeated query penalty

All step rewards are clipped to [0.0, 1.0] in the observation.
Penalties accumulate in cumulative_reward only (for grading).
```

**Investigation-first design**: Root cause evidence only appears in the observation *after* the agent investigates the relevant service. The `+0.08` clue-discovery reward incentivises proper diagnostic investigation before applying fixes.
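As a rough illustration of how the `R_step` terms combine, the sketch below adds up the listed weights for one step and clips the observed value to `[0.0, 1.0]`. The `StepEvents` container and the `step_reward` function are hypothetical names invented for this example, not types from the repository, and treating the unclipped value as what feeds `cumulative_reward` is one reading of the note about penalties, not a confirmed implementation detail.

```python
from dataclasses import dataclass

# Weights copied from the R_step listing in this README.
CLUE = 0.08
ROOT_CAUSE = 0.30
CORRECT_FIX = 0.30
SERVICE_VERIFIED = 0.10
COMPLETION = 0.20
WRONG_TARGET = -0.05
REDUNDANT_QUERY = -0.02


@dataclass
class StepEvents:
    """Hypothetical summary of what a single action achieved."""
    new_clues: int = 0
    new_root_causes: int = 0
    correct_fixes: int = 0
    services_verified: int = 0
    episode_completed: bool = False
    wrong_target_fix: bool = False
    redundant_query: bool = False


def step_reward(ev: StepEvents) -> tuple[float, float]:
    """Return (observed_reward, raw_reward) for one step.

    observed_reward is clipped to [0.0, 1.0], as in the observation.
    raw_reward keeps penalties; this sketch assumes it is the value that
    would be added to cumulative_reward.
    """
    raw = (
        CLUE * ev.new_clues
        + ROOT_CAUSE * ev.new_root_causes
        + CORRECT_FIX * ev.correct_fixes
        + SERVICE_VERIFIED * ev.services_verified
    )
    if ev.episode_completed:
        raw += COMPLETION
    if ev.wrong_target_fix:
        raw += WRONG_TARGET
    if ev.redundant_query:
        raw += REDUNDANT_QUERY
    observed = min(1.0, max(0.0, raw))
    return observed, raw
```

Under these assumptions, a wrong-target fix yields an observed reward of 0.0 while the raw value stays at -0.05, which is the behaviour the clipping note describes.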
**Grading formula** (same for all tasks):

```
score = 0.35 × (root_causes_found / total_root_causes)
      + 0.25 × (services_healthy / total_services)
      + 0.20 × normalised_cumulative_reward
      + 0.20 × completion_bonus
```

## Baseline Scores (gpt-4o-mini)

| Task       | Domain                | Root causes | Score      | Steps used | Step budget | Success |
| ---------- | --------------------- | ----------- | ---------- | ---------- | ----------- | ------- |
| easy       | FinOps                | 1           | **0.8440** | 4          | 15          | ✅      |
| medium     | Security+SRE          | 2           | **0.8740** | 4          | 25          | ✅      |
| hard       | DDoS+FinOps+SRE       | 3           | **0.8320** | 13         | 40          | ✅      |
| soc_easy   | SecOps (brute-force)  | 1           | **0.8587** | 3          | 15          | ✅      |
| soc_medium | SecOps (C2+cred dump) | 2           | **0.8740** | 4          | 25          | ✅      |
| soc_hard   | SecOps (APT)          | 3           | **0.8693** | 6          | 40          | ✅      |

**Primary mean (easy/medium/hard): 0.8500** | **Overall mean (all 6 tasks): 0.8587**
*(gpt-4o-mini, single episode per task, investigation-first flow)*

**All 6 tasks complete successfully**. Scores are differentiated by task difficulty, step efficiency, and the number of root causes an agent must discover and remediate.

Because of the investigation-first design, an agent that skips investigation and blindly applies fixes will fail to identify root causes and score near 0. The scoring formula leaves headroom for stronger agents (GPT-4o, Claude 3.5, Llama-3-70B): a model that minimises wasted investigation steps can approach 0.95+ on easy/medium tasks.
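The grading formula from the Reward Function section can be sketched in a few lines. The function name `grade` and its argument names are assumptions made for illustration. How `normalised_cumulative_reward` is normalised is not specified in this README, so the sketch takes it as an input already in `[0.0, 1.0]`, and it assumes `completion_bonus` is 1.0 for a completed episode and 0.0 otherwise.

```python
def grade(
    root_causes_found: int,
    total_root_causes: int,
    services_healthy: int,
    total_services: int,
    normalised_cumulative_reward: float,
    completed: bool,
) -> float:
    """Weighted score following the README's grading formula.

    Assumptions: normalised_cumulative_reward is already in [0.0, 1.0],
    and the completion bonus is binary (1.0 if completed, else 0.0).
    """
    if total_root_causes <= 0 or total_services <= 0:
        raise ValueError("totals must be positive")
    completion_bonus = 1.0 if completed else 0.0
    score = (
        0.35 * (root_causes_found / total_root_causes)
        + 0.25 * (services_healthy / total_services)
        + 0.20 * normalised_cumulative_reward
        + 0.20 * completion_bonus
    )
    return round(score, 4)
```

With these assumptions, an episode that finds every root cause, restores every service, and completes can lose at most 0.20, all of it through the cumulative-reward term.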
## Quick Start

### Local (Python)

```
git clone https://github.com/Likhith-BlueLotus/cloudops-intelligence
cd cloudops-intelligence
pip install -r requirements.txt

# (Optional) Pre-fetch real-world datasets into data/
python data_fetcher.py

# Start the environment server
uvicorn server.app:app --host 0.0.0.0 --port 7860

# In another terminal, set credentials and run the baseline agent
# (runs all 3 tasks sequentially: easy, medium, hard)
export HF_TOKEN=sk-...   # your OpenAI API key (or HF token)
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4o-mini
python inference.py
```

### Docker

```
docker build -t cloudops-env .
docker run -p 7860:7860 \
  -e API_BASE_URL=https://api.openai.com/v1 \
  -e MODEL_NAME=gpt-4o-mini \
  -e HF_TOKEN=sk-... \
  cloudops-env
```

### HF Spaces (live)

```
https://le0atis-cloudops-intelligence.hf.space
```

## API Reference

### Session lifecycle

The environment is **stateful and session-based**. Each episode has a `session_id` UUID that must be threaded through `/step` and `/state` calls.

```
POST /reset  {"task": "easy|medium|hard", "seed": 42}
             → {"session_id": "", "observation": {...}}
POST /step   {"action": {...}, "session_id": ""}
             → {"observation": {...}}
GET  /state?session_id=
             → current IncidentState JSON
```

Sessions expire after **5 minutes of inactivity**. Call `/reset` to start a new one.
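The session lifecycle above can be driven from a few lines of standard-library Python. This is a minimal sketch, not the repository's `client.py`: the helper names `build_action`, `post_json`, and `run_episode` are hypothetical, and the only response fields assumed are the ones shown in this README (`session_id`, `observation`, and the observation's `done` flag).

```python
import json
from typing import Any, Optional
from urllib import request


def build_action(action_type: str, target: Optional[str] = None, **parameters: Any) -> dict:
    """Build an action object in the shape shown in the Action Space table."""
    action: dict = {"action_type": action_type}
    if target is not None:
        action["target"] = target
    if parameters:
        action["parameters"] = parameters
    return action


def post_json(base_url: str, path: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(base_url: str, task: str, actions: list, seed: int = 42) -> dict:
    """Reset, then send actions in order, threading session_id through /step."""
    reset = post_json(base_url, "/reset", {"task": task, "seed": seed})
    session_id = reset["session_id"]
    observation = reset["observation"]
    for action in actions:
        step = post_json(base_url, "/step", {"action": action, "session_id": session_id})
        observation = step["observation"]
        if observation.get("done"):
            break  # episode finished; later actions are not sent
    return observation
```

With the server from Quick Start running locally, a call such as `run_episode("http://localhost:7860", "easy", [build_action("view_billing", target="ec2", period="month")])` would return the last observation. Because sessions expire after 5 minutes of inactivity, a long-running agent would need to call `/reset` again after an idle gap.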
### Endpoint summary

| Endpoint        | Method | Description                                     |
| --------------- | ------ | ----------------------------------------------- |
| `/health`       | GET    | Environment health + uptime                     |
| `/metadata`     | GET    | Environment metadata (name, version, tasks)     |
| `/schema`       | GET    | Action/observation JSON schemas                 |
| `/tasks`        | GET    | All task definitions with metadata              |
| `/reset`        | POST   | Start new episode → returns `session_id`        |
| `/step`         | POST   | Take an action (requires `session_id`)          |
| `/state`        | GET    | Current episode state (requires `?session_id=`) |
| `/grade/{task}` | POST   | Programmatic grader (episode stats → score)     |

## Project Structure

```
cloudops-intelligence/
├── models.py            # Pydantic types (Action, Observation, State)
├── client.py            # Async HTTP client wrapper
├── inference.py         # GPT-4o-mini baseline agent (runs all 3 tasks)
├── data_fetcher.py      # Downloads real-world datasets into data/
├── openenv.yaml         # OpenEnv manifest
├── requirements.txt
├── Dockerfile           # Fetches real data at build time
├── .env.example         # Safe credential template
├── data/                # Auto-generated: Spamhaus, CIC-IDS2018, etc.
├── server/
│   ├── app.py           # FastAPI routes + grader
│   └── environment.py   # Scenario engine + action handlers
└── tests/
    ├── conftest.py
    ├── test_models.py       # Pydantic model tests
    ├── test_environment.py  # Environment logic tests
    ├── test_api.py          # FastAPI endpoint tests
    └── test_client.py       # Client smoke tests
```

## OpenEnv Compliance Checklist

- `models.py`: `Action`, `Observation`, `State` inherit from the OpenEnv base classes
- `client.py`: async `reset()`, `step()`, `state` interface
- `server/environment.py`: `reset()`, `step()`, and a `state` property
- `openenv.yaml`: spec_version, name, version, description, tasks
- Rewards normalised to `[0.0, 1.0]` in the observation
- Programmatic grader at `/grade/{task}`
- `≥ 3 tasks`, progressing easy → medium → hard
- Docker + HF Spaces deployment
- 103 automated tests

## Citation

```
@misc{cloudops-intelligence-2026,
  title        = {CloudOps Intelligence: A Multi-Domain Cloud Operations Environment for LLM Agents},
  author       = {le0atis},
  year         = {2026},
  howpublished = {Hugging Face Spaces},
  url          = {https://huggingface.co/spaces/Le0AtiS/cloudops-intelligence},
  note         = {OpenEnv-compatible. Combines FinOps, Security, and SRE incident response in a single text-based environment.}
}
```
