codewithbrandon/cloud-threat-detection


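A production-grade Kubernetes runtime security platform that combines Prometheus metric alerting, Loki log analysis, and Falco eBPF syscall monitoring into three layers of defense-in-depth detection.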


[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/03/4f42cd9dc2200143.svg)](https://github.com/codewithbrandon/cloud-threat-detection/actions/workflows/ci.yml) [![Python](https://img.shields.io/badge/Python-3.12-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org) [![Kubernetes](https://img.shields.io/badge/Kubernetes-1.29+-326CE5?style=for-the-badge&logo=kubernetes&logoColor=white)](https://kubernetes.io) [![Prometheus](https://img.shields.io/badge/Prometheus-2.51-E6522C?style=for-the-badge&logo=prometheus&logoColor=white)](https://prometheus.io) [![Grafana](https://img.shields.io/badge/Grafana-10.4-F46800?style=for-the-badge&logo=grafana&logoColor=white)](https://grafana.com) [![Falco](https://img.shields.io/badge/Falco-0.38-00AEC7?style=for-the-badge&logo=falco&logoColor=white)](https://falco.org) [![Docker](https://img.shields.io/badge/Docker-Multi--stage-2496ED?style=for-the-badge&logo=docker&logoColor=white)](https://docker.com) [![License](https://img.shields.io/badge/License-MIT-22c55e?style=for-the-badge)](LICENSE)
[![MITRE ATT&CK](https://img.shields.io/badge/MITRE%20ATT%26CK-11%20Tactics%20Covered-ff4444?style=flat-square&logo=target&logoColor=white)](docs/threat-model.md) [![Alert Rules](https://img.shields.io/badge/Alert%20Rules-12%20Prometheus%20%7C%204%20LogQL-orange?style=flat-square&logo=bell&logoColor=white)](monitoring/prometheus/alert-rules.yaml) [![Falco Rules](https://img.shields.io/badge/Falco%20Rules-9%20Custom%20Runtime-blue?style=flat-square&logo=shield&logoColor=white)](monitoring/falco/falco-rules.yaml) [![Security](https://img.shields.io/badge/Security-Non--root%20%7C%20ReadOnly%20FS%20%7C%20Zero--trust%20Network-brightgreen?style=flat-square&logo=lock&logoColor=white)](k8s/network-policy.yaml)

[**Quick Start**](#-quick-start) • [**Architecture**](#-architecture) • [**Simulating Attacks**](#-simulating-attacks) • [**Alert Reference**](#-alert-reference) • [**Interview Talking Points**](#-interview-talking-points)

## The Problem This Solves

```
Most Kubernetes environments have zero runtime visibility.
A compromised container can exfiltrate data, pivot laterally,
and mine crypto for weeks before anyone notices.
```

| Before | After |
|--------|-------|
| Shell spawned in a container → **silent** | Shell spawned → Falco fires in **< 1 second** |
| 500 failed logins → **nobody knows** | 10 failed logins / 2 min → Slack alert + playbook link |
| Memory exhaustion → **surprise outage** | 75% memory threshold → warning before the OOM kill |
| Pod crash loop → **users report it** | 3 restarts / 15 min → PagerDuty page fires |
| "What do we do?" → **ad hoc response** | SEC-001, SEC-002 playbooks → 15-minute containment SLA |

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                    CLOUD-NATIVE THREAT DETECTION PLATFORM                    │
│                    Kubernetes Namespace: threat-detection                    │
└──────────────────────────────────────────────────────────────────────────────┘

╔══════════════════════════════════════════════════════════════════════════════╗
║                           ATTACK SIMULATION LAYER                            ║
║  ┌──────────────────┐   ┌──────────────┐  ┌──────────────┐   ┌─────────────┐ ║
║  │  brute_force.py  │   │ cpu_spike.py │  │ memory_ex.py │   │ kill_chain  │ ║
║  │  T1110 BruteForce│   │ T1499 DoS    │  │ T1499.004    │   │ Full APT sim│ ║
║  └────────┬─────────┘   └──────┬───────┘  └──────┬───────┘   └──────┬──────┘ ║
╚═══════════╪════════════════════╪═════════════════╪══════════════════╪════════╝
            │ HTTP Requests      │                 │                  │
╔═══════════▼════════════════════▼═════════════════▼══════════════════▼════════╗
║                 APPLICATION LAYER (Python Flask + Gunicorn)                  ║
║                                                                              ║
║  /health  /ready  /metrics  /login  /load  /memory  /exec  /probe            ║
║                                                                              ║
║  UID 1001 │ ReadOnlyRootFS │ No SA token │ Drop ALL caps │ Seccomp           ║
║  Prometheus metrics client → Counter, Gauge, Histogram                       ║
║  Structured JSON logs → stdout → captured by Promtail                        ║
╚══════════════════════════════╤═══════════════════════════════════════════════╝
                               │
            ┌──────────────────┴──────────────────┐
            │                                     │
╔═══════════▼═══════════════╗       ╔═════════════▼═════════════╗
║     METRICS PIPELINE      ║       ║     LOGGING PIPELINE      ║
║                           ║       ║                           ║
║  ┌─────────────────────┐  ║       ║  ┌─────────────────────┐  ║
║  │  Prometheus         │  ║       ║  │  Promtail           │  ║
║  │  Scrapes /metrics   │  ║       ║  │  DaemonSet per node │  ║
║  │  every 15 seconds   │  ║       ║  │  Pipeline stages    │  ║
║  │  15-day retention   │  ║       ║  │  Drop probe noise   │  ║
║  └──────────┬──────────┘  ║       ║  └──────────┬──────────┘  ║
║             │ Evaluates   ║       ║             │ Ships to    ║
║             │ 12 rules    ║       ║  ┌──────────▼──────────┐  ║
║  ┌──────────▼──────────┐  ║       ║  │  Loki               │  ║
║  │  Alertmanager       │  ║       ║  │  Label-indexed logs │  ║
║  │  Routing by         │◄─╫───────╫──│  4 LogQL alert rules│  ║
║  │  severity + team    │  ║       ║  │  30-day retention   │  ║
║  │  Dedup + Inhibition │  ║       ║  └─────────────────────┘  ║
║  └──────────┬──────────┘  ║       ╚═══════════════════════════╝
╚═════════════╪═════════════╝
              │
╔═════════════▼════════════════════════════════════════════════════════════════╗
║                              NOTIFICATION LAYER                              ║
║  ┌─────────────────┐   ┌──────────────────┐   ┌───────────────────────┐      ║
║  │ Slack Webhook   │   │  PagerDuty       │   │  Email (SMTP)         │      ║
║  │ #sec-incidents  │   │  Critical pages  │   │  Email (configurable) │      ║
║  │ #platform-ops   │   │  30min re-alert  │   │  HTML template        │      ║
║  └─────────────────┘   └──────────────────┘   └───────────────────────┘      ║
╚══════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════╗
║          RUNTIME SECURITY LAYER ── Falco eBPF Syscall Interception           ║
║                                                                              ║
║  Every node │ Every container │ Every syscall                                ║
║                                                                              ║
║  exec()      → shell_spawned_in_container      (T1059.004)  CRITICAL         ║
║  connect()   → unexpected_outbound_connection  (T1071)      HIGH             ║
║  open(WRITE) → write_sensitive_file            (T1222)      CRITICAL         ║
║  setuid()    → privilege_escalation_attempt    (T1068)      CRITICAL         ║
║  open(/proc) → proc_filesystem_access          (T1057)      CRITICAL         ║
║                                                                              ║
║  Falco → Falcosidekick → Alertmanager + Loki + Slack                         ║
╚══════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════╗
║                              ENFORCEMENT LAYER                               ║
║  ┌──────────────────────┐  ┌─────────────────────┐  ┌──────────────────┐     ║
║  │  NetworkPolicy       │  │  ResourceQuota      │  │  Pod Security    │     ║
║  │  Default deny all    │  │  4 CPU / 4Gi hard   │  │  Admission       │     ║
║  │  8 allowlist rules   │  │  20 pod limit       │  │  restricted      │     ║
║  │  Zero-trust east-    │  │  No LoadBalancer    │  │  profile enforced│     ║
║  │  west traffic        │  │  No NodePort        │  │  on namespace    │     ║
║  └──────────────────────┘  └─────────────────────┘  └──────────────────┘     ║
╚══════════════════════════════════════════════════════════════════════════════╝
```

## Detection Flow

```
ATTACK OCCURS
      │
      ▼
──────────────────────────────────────────────────────────────
 LAYER 1 │ METRIC SIGNAL                               ~15-30s
──────────────────────────────────────────────────────────────
  Flask Prometheus client increments counter
  failed_logins_total{source_ip="x.x.x.x"} += 1
  Prometheus scrapes /metrics every 15s
  Alert rule evaluates: increase(failed_logins_total[2m]) > 10
  PENDING → FIRING after `for:` duration
      │
      ▼
──────────────────────────────────────────────────────────────
 LAYER 2 │ LOG SIGNAL                                   ~5-15s
──────────────────────────────────────────────────────────────
  Structured log emitted:
    AUTHENTICATION_FAILURE user=admin source_ip=x.x.x.x
  Promtail pipeline extracts label: event_type=AUTHENTICATION_FAILURE
  Loki ingests log stream with labels
  LogQL rule fires: count_over_time > threshold
  Loki ruler → Alertmanager → second correlated alert
      │
      ▼
──────────────────────────────────────────────────────────────
 LAYER 3 │ RUNTIME SIGNAL (exec/file/network)             < 1s
──────────────────────────────────────────────────────────────
  eBPF probe intercepts exec() syscall
  Falco matches rule: shell_spawned_in_container
  Falcosidekick fans out to Alertmanager + Loki + Slack
  Correlation: same pod, overlapping time window
      │
      ▼
──────────────────────────────────────────────────────────────
 RESPONSE │ SOC ACTION
──────────────────────────────────────────────────────────────
  Analyst receives Slack alert with direct playbook link
  Opens Grafana: correlates metrics + logs + Falco events
  Executes playbook: isolate → preserve evidence → eradicate
  MTTC target: 15 minutes from first alert
```

## Repository Structure

```
cloud-threat-detection/
│
├── 📦 app/
│   ├── app.py                          # Flask app: 10 endpoints, Prometheus metrics, attack surfaces
│   └── requirements.txt                # Pinned Python dependencies
│
├── 🐳 docker/
│   ├── Dockerfile                      # Multi-stage build, UID 1001, readOnlyRootFS, health checks
│   └── .dockerignore
│
├── ☸️ k8s/
│   ├── namespace.yaml                  # PSA restricted enforcement
│   ├── serviceaccount.yaml             # No API token mounted
│   ├── configmap.yaml
│   ├── deployment.yaml                 # Full securityContext, probes, resource limits
│   ├── service.yaml                    # ClusterIP only (no external exposure)
│   ├── network-policy.yaml             # Default-deny + 8 allowlist rules
│   └── resource-quota.yaml             # Namespace CPU/memory/object caps
│
├── 📊 monitoring/
│   ├── prometheus/
│   │   ├── prometheus-config.yaml      # Scrape configs, pod discovery, self-monitoring
│   │   ├── alert-rules.yaml            # 12 production alert rules across 6 groups
│   │   └── prometheus-deployment.yaml
│   │
│   ├── alertmanager/
│   │   ├── alertmanager-config.yaml    # Routing tree, receivers, inhibition rules
│   │   └── alertmanager-deployment.yaml # Webhook simulator included
│   │
│   ├── loki/
│   │   ├── loki-config.yaml            # Loki + 4 LogQL alert rules
│   │   └── loki-deployment.yaml        # Loki StatefulSet + Promtail DaemonSet
│   │
│   ├── falco/
│   │   ├── falco-rules.yaml            # 9 custom rules with MITRE ATT&CK mapping
│   │   └── falco-deployment.yaml       # DaemonSet + Falcosidekick + RBAC
│   │
│   └── grafana/
│       └── grafana-deployment.yaml     # Datasource provisioning (Prometheus + Loki)
│
├── 💥 attacks/
│   ├── brute_force.py                  # Sequential + distributed credential stuffing
│   ├── cpu_spike.py                    # CPU exhaustion with alert monitoring
│   ├── memory_exhaustion.py            # Escalating memory pressure + OOM simulation
│   └── suspicious_commands.py          # Full kill chain: recon → persistence → C2
│
├── 📋 docs/
│   ├── incident-playbook-brute-force.md            # SEC-001 with forensic queries + containment
│   ├── incident-playbook-container-compromise.md   # SEC-002 with 15min MTTC target
│   └── threat-model.md                 # STRIDE + MITRE ATT&CK for Containers
│
├── docker-compose.yaml                 # Local dev stack (no K8s required)
├── Makefile                            # deploy / attack / port-forward / verify targets
└── README.md
```

## Quick Start

### Prerequisites

| Tool | Version | Purpose |
|------|---------|---------|
| `kubectl` | 1.28+ | Cluster management |
| `helm` | 3.x | Falco deployment |
| `python3` | 3.10+ | Attack simulation scripts |
| `docker` | 24+ | Image builds |
| CNI | Calico / Cilium | NetworkPolicy enforcement |

### Option A: Full Kubernetes Deployment

```
# Clone
git clone https://github.com/codewithbrandon/cloud-threat-detection.git
cd cloud-threat-detection

# Deploy everything with make
make deploy          # namespace + monitoring + app + network policies
make deploy-falco    # Falco via Helm (requires Linux node for eBPF)

# Access the dashboards
make port-forward

# Verify stack health
make verify
```

### Option B: Local Docker Compose (no K8s required)

```
# Start the full monitoring stack locally
docker-compose up -d

# Verify that all containers are running
docker-compose ps

# Follow the application logs
docker-compose logs -f app
```

### Access Points

| Service | URL | Credentials |
|---------|-----|-------------|
| Grafana | http://localhost:3000 | anonymous viewer |
| Prometheus | http://localhost:9090 | none |
| Alertmanager | http://localhost:9093 | none |
| Application | http://localhost:8080 | none |

## Simulating Attacks

### Brute-Force Login (`T1110`)

```
# Sequential: single IP, rapid fire (tests the per-IP Prometheus threshold)
python3 attacks/brute_force.py \
  --target http://localhost:8080 \
  --mode sequential --count 25 --rate 5

# Distributed: multiple IPs (tests the global Loki LogQL threshold)
python3 attacks/brute_force.py \
  --target http://localhost:8080 \
  --mode distributed --count 60 --concurrency 4
```

**Triggers:** `ExcessiveFailedLogins` → `BruteForceAttackCritical` → `BruteForceInLogs`

**Verify:**

```
# Prometheus
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(increase(failed_logins_total[2m])) by (source_ip)'

# Loki (in Grafana Explore)
{app="threat-detection-app"} |= "AUTHENTICATION_FAILURE" | json
```

### CPU Spike (`T1499`)

```
python3 attacks/cpu_spike.py \
  --target http://localhost:8080 \
  --intensity 0.9 --duration 120
```

**Triggers:** `HighCPUUsage` (>75% for 2 minutes) → `CriticalCPUSpike` (>95% for 1 minute)

### Memory Exhaustion (`T1499.004`)

```
# Escalate in 4 steps up to 480MB (limit: 512MB)
python3 attacks/memory_exhaustion.py \
  --target http://localhost:8080 \
  --mode escalating --size 480 --steps 4 --hold 30
```

**Triggers:** `HighMemoryUsage` → `MemoryExhaustionCritical` → Kubernetes OOM kill → `PodCrashLoopDetected`

### Full Kill Chain (`T1059 → T1057 → T1222 → T1071`)

```
# Runs: recon → network discovery → persistence → C2 beaconing
python3 attacks/suspicious_commands.py \
  --target http://localhost:8080 \
  --scenario kill-chain
```

**Triggers:** Falco `shell_spawned_in_container` + `unexpected_outbound_connection` + `write_sensitive_file`

```
# Watch Falco alerts in real time
kubectl logs -n threat-detection -l app=falco -f | \
  jq '{rule: .rule, priority: .priority, pod: .output_fields."k8s.pod.name"}'
```

### Run Everything

```
make attack-all
```

## Alert Reference

| Alert | Trigger condition | Severity | Channel | Response SLA |
|-------|-------------------|----------|---------|--------------|
| `ExcessiveFailedLogins` | >10 failures / 2 min / IP | ⚠️ WARNING | #security-alerts | 5 minutes |
| `BruteForceAttackCritical` | >50 failures / 1 min / IP | 🔴 CRITICAL | #security-incidents + page | Immediate |
| `HighCPUUsage` | CPU >75% for 2 minutes | ⚠️ WARNING | #platform-alerts | 15 minutes |
| `CriticalCPUSpike` | CPU >95% for 1 minute | 🔴 CRITICAL | #platform-oncall + page | 5 minutes |
| `HighMemoryUsage` | Memory >384Mi for 2 minutes | ⚠️ WARNING | #platform-alerts | 15 minutes |
| `MemoryExhaustionCritical` | Memory >460Mi | 🔴 CRITICAL | #platform-oncall + page | 5 minutes |
| `High5xxErrorRate` | 5xx >5% for 2 minutes | ⚠️ WARNING | #platform-alerts | 15 minutes |
| `ServiceUnavailable` | 5xx >50% for 1 minute | 🔴 CRITICAL | #platform-oncall + page | 5 minutes |
| `PodCrashLoopDetected` | >3 restarts / 15 minutes | 🔴 CRITICAL | #platform-oncall + page | 5 minutes |
| `SuspiciousActivityDetected` | `suspicious_activity_total` > 0 | ⚠️ WARNING | #security-alerts | 10 minutes |
| `PrometheusTargetDown` | Scrape target down for 2 minutes | 🔴 CRITICAL | #platform-oncall | 5 minutes |
| `WatchdogHeartbeat` | Always firing (dead man's switch) | 🔵 NONE | External monitor | N/A |
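
The `WatchdogHeartbeat` row works as a dead man's switch: the alert fires continuously, and an external monitor raises the alarm when the heartbeat stops arriving. A minimal sketch of that external side follows. The `HeartbeatMonitor` class and the 5-minute timeout are assumptions for illustration and are not taken from the repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical external monitor for an always-firing heartbeat alert."""

    timeout_seconds: float = 300.0  # assumed value, not from the repository
    last_seen: Optional[float] = None

    def record(self, timestamp: float) -> None:
        """Call whenever a WatchdogHeartbeat notification arrives."""
        self.last_seen = timestamp

    def pipeline_is_healthy(self, now: float) -> bool:
        """False if no heartbeat has ever arrived or the last one is stale."""
        if self.last_seen is None:
            return False
        return (now - self.last_seen) <= self.timeout_seconds


monitor = HeartbeatMonitor()
monitor.record(timestamp=1_000.0)
print(monitor.pipeline_is_healthy(now=1_120.0))  # heartbeat is 2 minutes old
print(monitor.pipeline_is_healthy(now=1_400.0))  # heartbeat is older than the timeout
```

The inversion is the point: every other alert signals a problem by firing, while this one signals a problem by going quiet, so it has to be checked from outside the alerting pipeline.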

## Falco Runtime Rules

| Rule | Syscall trigger | MITRE technique | Priority |
|------|-----------------|-----------------|----------|
| `shell_spawned_in_container` | exec() → sh/bash/zsh | T1059.004 | 🔴 CRITICAL |
| `unexpected_outbound_connection` | connect() to a non-allowlisted IP | T1071 | 🟠 HIGH |
| `write_sensitive_file` | open(WRITE) under /etc, /bin, /usr | T1222 | 🔴 CRITICAL |
| `dangerous_binary_in_container` | exec() → wget, curl, nc, nmap | T1105 | 🟠 HIGH |
| `container_running_as_root` | spawned_process, UID=0 | T1078 | 🟠 HIGH |
| `proc_filesystem_access` | open() on /proc/1, /proc/kcore | T1057 | 🔴 CRITICAL |
| `crypto_miner_detected` | exec() → xmrig, stratum+tcp | T1496 | 🔴 CRITICAL |
| `k8s_secret_access_in_container` | open() under /var/run/secrets | T1552 | 🔴 CRITICAL |
| `privilege_escalation_attempt` | setuid()/setgid() succeeds | T1068 | 🔴 CRITICAL |
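
When these rules fire, they surface as JSON events in the Falco pod logs. The sketch below does in Python the same projection as a `jq` filter over `rule`, `priority`, and `output_fields."k8s.pod.name"`. It assumes only those three fields; the sample event and the `summarize_falco_events` helper are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def summarize_falco_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {rule, priority, pod} for each JSON log line; skip anything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text lines such as startup banners
        if not isinstance(event, dict):
            continue
        fields = event.get("output_fields") or {}
        yield {
            "rule": event.get("rule"),
            "priority": event.get("priority"),
            "pod": fields.get("k8s.pod.name"),
        }


# Made-up sample input for illustration only.
sample = [
    "Falco initialized",
    json.dumps({
        "rule": "shell_spawned_in_container",
        "priority": "Critical",
        "output_fields": {"k8s.pod.name": "threat-detection-app-abc123"},
    }),
]
for summary in summarize_falco_events(sample):
    print(summary)
```

Feeding it from `kubectl logs ... -f` via `sys.stdin` would give a stream of per-rule summaries that can be counted or asserted on in a test run.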

## Security Controls

```
Container Layer
  ✅ Non-root user (UID 1001)
  ✅ Read-only root filesystem
  ✅ No privilege escalation
  ✅ Drop ALL Linux capabilities
  ✅ Seccomp RuntimeDefault profile
  ✅ Multi-stage minimal image

Pod Layer
  ✅ No ServiceAccount token mounted
  ✅ Dedicated ServiceAccount
  ✅ Pod Security Admission: restricted
  ✅ Topology spread constraints
  ✅ Resource limits (CPU + memory)
  ✅ Liveness + readiness + startup probes

Namespace Layer
  ✅ Default-deny NetworkPolicy
  ✅ 8 explicit allowlist rules
  ✅ ResourceQuota (CPU/mem/objects)
  ✅ LimitRange (per-container defaults)
  ✅ No LoadBalancer services
  ✅ No NodePort services

Runtime Layer
  ✅ Falco eBPF syscall monitoring
  ✅ 9 custom rules (MITRE-mapped)
  ✅ Falcosidekick fan-out routing
  ✅ 12 Prometheus alert rules
  ✅ 4 LogQL log-based alert rules
  ✅ Dead man's switch heartbeat
  ✅ Alert deduplication + inhibition
  ✅ Multi-channel notification
```

## MITRE ATT&CK Coverage

| Tactic | Techniques covered | Detection layer |
|--------|--------------------|-----------------|
| Initial Access | T1190 Exploit Public-Facing Application | Loki + Falco |
| Execution | T1059.004 Unix Shell | Falco (exec syscall) |
| Persistence | T1222 File and Directory Permissions Modification | Falco (open syscall) |
| Privilege Escalation | T1068, T1611 Container Escape | Falco (setuid syscall) |
| Defense Evasion | T1070 Indicator Removal | Falco (readOnlyFS blocks) |
| Credential Access | T1552 Unsecured Credentials, T1110 Brute Force | Prometheus + Loki + Falco |
| Discovery | T1057 Process Discovery | Falco (exec syscall) |
| Lateral Movement | T1210 Exploitation of Remote Services | Falco + NetworkPolicy |
| Command & Control | T1071 Application Layer Protocol | Falco + NetworkPolicy |
| Exfiltration | T1041 Exfiltration Over C2 Channel | Falco + NetworkPolicy |
| Impact | T1496 Resource Hijacking, T1499 Endpoint DoS | Prometheus |

## Incident Response Playbooks

| Playbook | Scenario | Trigger | MTTC target |
|----------|----------|---------|-------------|
| [SEC-001](docs/incident-playbook-brute-force.md) | Brute force / credential stuffing | `BruteForceAttackCritical` | N/A (alert + block) |
| [SEC-002](docs/incident-playbook-container-compromise.md) | Container compromise / runtime attack | `shell_spawned_in_container` | **15 minutes** |

Each playbook includes:

- Detection signal checklist
- Incident timeline template
- Triage decision tree
- Step-by-step containment commands
- Forensic evidence collection (before the pod is terminated)
- Loki / Prometheus investigation queries
- Eradication and recovery procedures
- Lessons-learned template

## Interview Talking Points

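**Walk me through your detection stack.**

The platform has three independent detection layers. Prometheus scrapes metrics every 15 seconds; I alert when `failed_logins_total` exceeds 10 failures per IP in 2 minutes, which catches brute force. Loki aggregates structured logs through Promtail pipeline stages that extract an `event_type` label; I use LogQL `count_over_time` rules to detect distributed attacks that stay below the per-IP threshold. Falco uses eBPF to intercept syscalls such as `exec()`, `connect()`, and `open()`, which catches post-exploitation behavior that metrics and logs miss entirely. All three feed into Alertmanager, which groups and deduplicates alerts and routes them by severity to Slack, PagerDuty, and email.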
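
The difference between the per-IP metric threshold and the global log threshold can be illustrated with a sliding-window count. This is only a sketch: the threshold of 10 failures per IP in 2 minutes comes from the answer above, while the global threshold of 50 and all names in the code are assumptions, and the real checks run as PromQL and LogQL rules rather than Python.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class FailedLogin(NamedTuple):
    timestamp: float  # seconds
    source_ip: str


WINDOW_SECONDS = 120
PER_IP_THRESHOLD = 10  # from the text: more than 10 failures per IP in 2 minutes
GLOBAL_THRESHOLD = 50  # assumed value for the global log-based rule


def evaluate(events: Iterable[FailedLogin], now: float) -> dict:
    """Count failures inside the window, per source IP and in total."""
    recent = [e for e in events if now - WINDOW_SECONDS < e.timestamp <= now]
    per_ip = Counter(e.source_ip for e in recent)
    return {
        "per_ip_offenders": sorted(ip for ip, n in per_ip.items() if n > PER_IP_THRESHOLD),
        "global_alert": len(recent) > GLOBAL_THRESHOLD,
    }


# Distributed attack: 12 source IPs, 5 failures each, all inside the window.
distributed = [FailedLogin(100.0 + i, f"10.0.0.{i % 12}") for i in range(60)]
print(evaluate(distributed, now=160.0))
```

No single address crosses the per-IP limit here, so only the global count raises an alert, which is the gap the log-based layer is meant to cover.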

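**How do you handle alert fatigue?**

Three mechanisms. First, Alertmanager inhibition rules suppress warning-level alerts when a higher-severity alert for the same incident is already firing: `BruteForceAttackCritical` inhibits `ExcessiveFailedLogins`. Second, grouping batches related alerts into one notification instead of 50. Third, a `drop` pipeline stage in Promtail filters out `/health`, `/ready`, and `/metrics` scrape logs before ingestion, so Loki alert queries are not polluted by expected traffic patterns.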
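
The inhibition mechanism can be modeled in a few lines. This is a simplified illustration, not Alertmanager's implementation: it assumes each alert is a dict of labels and that a critical alert mutes warnings sharing the same `source_ip` label. The choice of `source_ip` as the matching label is an assumption, not something the repository confirms.

```python
from typing import Dict, List, Sequence

Alert = Dict[str, str]


def apply_inhibition(alerts: List[Alert], equal: Sequence[str] = ("source_ip",)) -> List[Alert]:
    """Drop warning alerts when a critical alert carries the same `equal` labels."""
    critical_keys = {
        tuple(a.get(label) for label in equal)
        for a in alerts
        if a.get("severity") == "critical"
    }
    return [
        a
        for a in alerts
        if not (
            a.get("severity") == "warning"
            and tuple(a.get(label) for label in equal) in critical_keys
        )
    ]


firing = [
    {"alertname": "ExcessiveFailedLogins", "severity": "warning", "source_ip": "203.0.113.7"},
    {"alertname": "BruteForceAttackCritical", "severity": "critical", "source_ip": "203.0.113.7"},
    {"alertname": "ExcessiveFailedLogins", "severity": "warning", "source_ip": "198.51.100.4"},
]
for alert in apply_inhibition(firing):
    print(alert["alertname"], alert["source_ip"])
```

The warning for `203.0.113.7` is dropped because the critical alert already covers that address, while the warning for the unrelated address still goes out.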

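**What would you add for production?**

Three things. First, Istio or Cilium for mTLS between pods: NetworkPolicy currently provides network-level isolation but no identity-based encryption. Second, image signing with Cosign plus an admission webhook that rejects unsigned images, which closes the supply-chain gap in my STRIDE threat model. Third, shipping Kubernetes audit logs to Loki: today I detect container behavior but not API server activity such as unusual RBAC changes or Secret access patterns.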

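**How do you prove this actually works?**

The attack simulation scripts are the test suite. `brute_force.py` triggers `ExcessiveFailedLogins`, and I measure the time from the first request to the Slack notification: the SLA is 2 minutes, and in practice it usually takes 35-45 seconds. `suspicious_commands.py` triggers three Falco rules (`shell_spawned_in_container`, `unexpected_outbound_connection`, and `write_sensitive_file`), and I verify that each one shows up in the Falco pod logs and in Alertmanager within 10 seconds. The dead man's switch alert, `WatchdogHeartbeat`, verifies that the alerting pipeline itself is healthy: if it stops firing, we have a bigger problem than any single alert.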
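
Timing the path from the first request to a firing alert can be scripted. The sketch below is illustrative and makes several assumptions: `fetch_alert_names` is a hypothetical callable that you would implement against your own Alertmanager, and the clock and sleep functions are injected so the polling logic can be exercised without a cluster.

```python
import time
from typing import Callable, Iterable, Optional


def seconds_until_alert(
    alert_name: str,
    fetch_alert_names: Callable[[], Iterable[str]],  # hypothetical: names of firing alerts
    timeout: float = 120.0,  # the 2-minute SLA mentioned above
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until `alert_name` is firing; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if alert_name in set(fetch_alert_names()):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)


# Offline demonstration with a fake clock: the alert appears on the third poll.
fake_now = [0.0]
polls = iter([[], [], ["ExcessiveFailedLogins"]])
elapsed = seconds_until_alert(
    "ExcessiveFailedLogins",
    fetch_alert_names=lambda: next(polls),
    clock=lambda: fake_now[0],
    sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s),
)
print(elapsed)
```

Starting the timer just before launching `brute_force.py` and failing the run when the function returns `None` would turn the stated SLA into a repeatable check. The measured value is only as precise as the polling interval.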

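**Why Loki over Elasticsearch?**

Loki indexes labels, not full text. For security use cases I know exactly what I am searching for: a specific event type, pod name, or source IP. Loki's label-based queries are orders of magnitude cheaper at scale. Loki also integrates natively with Prometheus labels, so I can correlate metrics with logs in Grafana without switching context. At the same log volume, storage costs are roughly 90% lower than with Elasticsearch.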

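**Why three detection layers instead of one?**

Defense in depth. An attacker who compromises the application and stops it from writing logs still generates syscalls that Falco captures. An attack that stays below the per-IP metric threshold still shows up in the global Loki log counts. A gap in the Falco rules does not make an attack invisible, because Prometheus still picks up the metric signal. Each layer has different blind spots; combining them means an attacker has to evade all three at once, which is far harder than evading any single one.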
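
How the layers reinforce each other can be shown with a small correlation sketch: signals from different layers are grouped when they name the same pod and fall within a shared time window. The `Signal` type, the 60-second window, and the sample data are assumptions for illustration and are not part of the repository.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Signal(NamedTuple):
    layer: str  # "metrics", "logs", or "runtime"
    pod: str
    timestamp: float  # seconds


def layers_per_pod(signals: List[Signal], window: float = 60.0) -> Dict[str, Set[str]]:
    """For each pod, list the layers that fired within `window` seconds of its first signal."""
    by_pod: Dict[str, List[Signal]] = defaultdict(list)
    for s in signals:
        by_pod[s.pod].append(s)
    result: Dict[str, Set[str]] = {}
    for pod, items in by_pod.items():
        first = min(i.timestamp for i in items)
        result[pod] = {i.layer for i in items if i.timestamp - first <= window}
    return result


signals = [
    Signal("runtime", "app-7d9f", 0.4),    # Falco: shell spawned
    Signal("logs", "app-7d9f", 12.0),      # Loki: log-based rule
    Signal("metrics", "app-7d9f", 28.0),   # Prometheus: metric rule
    Signal("metrics", "app-1c2b", 300.0),  # unrelated pod, single layer
]
for pod, layers in layers_per_pod(signals).items():
    print(pod, sorted(layers))
```

A pod flagged by all three layers inside one window is a much stronger indicator than a single-layer hit, which is the reasoning behind correlating on pod and time rather than treating each alert in isolation.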

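## Quality Gates

Every push and pull request runs the full CI pipeline:

| Check | Tool | What it catches |
|-------|------|-----------------|
| YAML lint | `yamllint` | Indentation, trailing whitespace, and type errors across all manifests |
| K8s schema validation | `kubeconform` | Invalid Kubernetes API fields, checked against the 1.29 schema |
| Python lint | `ruff` | Import order, unused variables, style, pyupgrade suggestions |
| Python formatting | `black` | Consistent code formatting; any drift fails the build |
| Secret scanning | `gitleaks` | Accidentally committed keys, tokens, and passwords |
| Container CVE scan | `trivy` (image) | OS and library CVEs in the built Docker image (fails on CRITICAL) |
| Filesystem scan | `trivy` (fs) | IaC misconfigurations and secrets in the source tree (informational) |

Trivy results appear in the [GitHub Security tab](https://github.com/codewithbrandon/cloud-threat-detection/security).

### Running the Checks Locally

```
# Install the tools once
pip install ruff==0.4.10 black==24.4.2 yamllint==1.35.1

# Install kubeconform (this asset is Linux amd64; pick the matching release asset on macOS)
curl -sSL https://github.com/yannh/kubeconform/releases/download/v0.6.7/kubeconform-linux-amd64.tar.gz \
  | tar -xz -C /usr/local/bin kubeconform

# Run all checks
make lint            # yaml + python
make validate-k8s    # kubeconform schema validation
make lint-fix        # auto-fix python formatting (writes files)
```

## Threat Model

Full STRIDE analysis with risk register → [docs/threat-model.md](docs/threat-model.md)

**Highest residual risks (accepted by design):**

- Supply-chain attack: mitigated by pinning base image digests; Cosign signing is the next control
- Distributed brute force below the per-IP threshold: mitigated by the global Loki rule; a WAF is the next control
- Falco rule gaps: mitigated by parallel metric and log detection; continuous rule testing is the process control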

**Built to close a real gap in container runtime visibility.** *Detection without response is just expensive logging. This platform connects the two.*
[![Star this repo](https://img.shields.io/github/stars/codewithbrandon/cloud-threat-detection?style=social)](https://github.com/codewithbrandon/cloud-threat-detection)
