0hardik1/agentmoat

GitHub: 0hardik1/agentmoat

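agentmoat is a Kubernetes operations tool that safely and reversibly automates the migration of container workloads from the runc runtime to the gVisor sandbox, to defend against container-escape attacks.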

Stars: 1 | Forks: 0

# agentmoat

[![release](https://img.shields.io/github/v/release/0hardik1/agentmoat?label=release&sort=semver)](https://github.com/0hardik1/agentmoat/releases) [![license](https://img.shields.io/github/license/0hardik1/agentmoat)](LICENSE) [![CI](https://img.shields.io/github/actions/workflow/status/0hardik1/agentmoat/ci.yml?branch=main&label=CI&logo=github)](https://github.com/0hardik1/agentmoat/actions/workflows/ci.yml) [![Go](https://img.shields.io/github/go-mod/go-version/0hardik1/agentmoat?logo=go&label=Go)](go.mod) [![Go Report Card](https://goreportcard.com/badge/github.com/0hardik1/agentmoat)](https://goreportcard.com/report/github.com/0hardik1/agentmoat)

**Safely and reversibly migrate Kubernetes workloads from `runc` to gVisor (`runsc`).**

https://github.com/user-attachments/assets/2b04bc86-74a2-4936-a829-4653559eb954

agentmoat migrates Kubernetes workloads from the default `runc` runtime to gVisor (`runsc`), a user-space kernel that defends against the kernel-exploit step in container-escape chains. It scans the cluster, classifies every workload by gVisor compatibility, produces a deterministic migration plan with a stable hash, applies that plan via `RuntimeClass`, and rolls back on demand.

On EKS specifically, there is no managed switch: Bottlerocket does not ship `runsc`, managed node groups have no gVisor toggle, and AWS does not officially support gVisor. agentmoat fills that gap.

The threat model and CVE background are in [`docs/threat-model.md`](docs/threat-model.md).

## Quick start

From a fresh clone, against a local kind cluster with real gVisor preinstalled:

```
git clone https://github.com/0hardik1/agentmoat
cd agentmoat
make kind-up   # builds the gVisor-enabled kind node image, then creates the cluster
make e2e       # runs scan -> plan -> apply -> rollback end-to-end and asserts the results
```

`make e2e` exercises the full pipeline on a real `runsc` runtime and probes the patched pod's `dmesg` for gVisor markers. It is the fastest way to see the tool in action.

Against your own cluster (with the `kubectl` context already set):

```
make build                                               # produces ./bin/agentmoat
./bin/agentmoat scan                                     # human-readable table
./bin/agentmoat scan --output json > scan.json           # versioned schema
./bin/agentmoat plan --scan scan.json --output json > plan.json
./bin/agentmoat apply --plan plan.json                   # dry-run by default
./bin/agentmoat apply --plan plan.json --dry-run=false   # actually mutate
./bin/agentmoat rollback --plan plan.json --dry-run=false
```

When no workload is incompatible, `scan` exits with code 0; if at least one workload is incompatible, it exits with code 2. `apply` and `rollback` are idempotent: re-running an already-applied plan reports every step as `already-applied` and exits with code 0.

## Install

Build from source with Go 1.26+:

```
go install github.com/0hardik1/agentmoat/cmd/agentmoat@latest
go install github.com/0hardik1/agentmoat/cmd/agentmoat-mcp@latest
```

Once releases are tagged, prebuilt binaries (linux/darwin, amd64/arm64) and a checksums file will be attached to each [GitHub release](https://github.com/0hardik1/agentmoat/releases); a Homebrew formula and a `kubectl agentmoat` krew plugin will follow.

## What it does

```
┌──────────┐      ┌──────────┐      ┌──────────┐      ┌────────────┐
│   scan   │ ───> │   plan   │ ───> │  apply   │ ───> │  rollback  │
│ScanReport│      │ Migration│      │  Apply   │      │  Rollback  │
│   (RO)   │      │   Plan   │      │  Result  │      │   Result   │
└──────────┘      └──────────┘      └──────────┘      └────────────┘
  cluster         pure function     strategic-merge   reverse, also
  client-go       over a Scan       patch, default    default dry-run
                  Report            dry-run
```

- **`scan`** is strictly read-only. It enumerates every Pod, Deployment, StatefulSet, DaemonSet, Job, and CronJob in the selected namespaces, classifies each workload against the built-in rules, and emits a versioned `ScanReport`.
- **`plan`** is a pure function from a `ScanReport` to a `MigrationPlan`. The same scan in produces the same plan and the same SHA-256 `planHash` out. The plan orders steps by risk (for example, stateless first and non-host-network first) and excludes every workload classified as `incompatible`.
- **`apply`** is the only stage that mutates cluster state. It patches each pod template with `runtimeClassName` and a matching `runtime=gvisor:NoSchedule` toleration, stamps the namespace with the `agentmoat.io/plan-hash` annotation, emits one Kubernetes Event for each step it actually changes, and appends an audit record for every step (dry runs included) to `~/.agentmoat/audit.jsonl`. It defaults to `--dry-run=true`. Re-running an already-applied plan uses the namespace annotation to report each step as `already-applied` and exits with code 0.
- **`rollback`** walks the same plan in reverse order, removes `runtimeClassName`, and clears the namespace annotation. It deliberately leaves the toleration in place (a toleration without a matching taint is harmless, and removing a specific toleration by JSON Patch index is brittle).

The manual alternative is a tedious sequence: enumerate with `kubectl get`, classify by hand against the gVisor docs, run `kubectl patch` on every Deployment/StatefulSet/DaemonSet, and build a separate audit trail yourself. agentmoat takes over classifying, ordering, making the changes idempotent, recording them, and undoing them.

## Compatibility checks

There are 14 stable rule IDs across three severity levels. Rule IDs are a public interface: they appear in `--output json`, in `--rules` overrides, and in the `docs/compatibility-checklist.md` table. The rule implementations live in [`pkg/classifier/builtin_rules.go`](pkg/classifier/builtin_rules.go).

| Rule ID              | Severity | What it inspects                                                       |
| -------------------- | -------- | ---------------------------------------------------------------------- |
| `raw-socket`         | error    | `CAP_NET_RAW`, or `agentmoat.io/needs-raw-socket=true`                 |
| `host-network`       | error    | `pod.spec.hostNetwork=true`                                            |
| `host-pid`           | error    | `pod.spec.hostPID=true`                                                |
| `host-ipc`           | error    | `pod.spec.hostIPC=true`                                                |
| `privileged`         | error    | Any container with `securityContext.privileged=true`                   |
| `ebpf`               | error    | Image hint (cilium, tetragon, falco) or `CAP_BPF`                      |
| `kvm-nested`         | error    | `hostPath` mount of `/dev/kvm`                                         |
| `host-path-mount`    | warn     | Any `hostPath` volume                                                  |
| `gpu-passthrough`    | warn     | `nvidia.com/gpu` resource request or limit                             |
| `fuse-mount`         | warn     | CSI driver name containing `fuse`, or `AGENTMOAT_USES_FUSE=true`       |
| `io-uring`           | warn     | Annotation `agentmoat.io/uses-iouring=true`                            |
| `perf-events`        | warn     | `CAP_PERFMON` or `CAP_SYS_ADMIN`                                       |
| `network-throughput` | info     | Image hint: nginx, envoy, haproxy, traefik (expect 20-40% overhead)    |
| `syscall-heavy`      | info     | Image hint: redis, memcached (expect higher latency)                   |

Any workload that triggers an `error` rule is marked `incompatible` and excluded from the plan. A `warn` marks it `review`; the planner includes `review` workloads only when `--include-review` is passed. `info` is advisory and never blocks migration. Severities can be overridden (or new rules added) without recompiling by passing a YAML file via `--rules`. See [`docs/compatibility-checklist.md`](docs/compatibility-checklist.md).

## Sample output

Shapes are taken directly from [`internal/schema/types.go`](internal/schema/types.go); omitted fields are marked `...`.

`agentmoat scan --output json` (one `WorkloadResult` from `.spec.workloads`):

```
{
  "kind": "Deployment",
  "namespace": "edge",
  "name": "frontdoor",
  "compatibility": "review",
  "reasons": [
    {
      "ruleId": "network-throughput",
      "severity": "info",
      "description": "Workload appears network-throughput-bound (image hint: nginx, envoy, haproxy, traefik); expect ~20-40% throughput overhead under gVisor's sandbox network stack.",
      "remediationUrl": "https://gvisor.dev/docs/architecture_guide/performance/"
    }
  ],
  "recommendation": "Benchmark under gVisor before opting in; consider host-network alternatives if throughput-critical.",
  "overhead": "Network throughput: 20-40%"
}
```

`agentmoat plan --output json` (one `PlanStep` and the top-level `planHash`):

```
{
  "apiVersion": "agentmoat.io/v1alpha1",
  "kind": "MigrationPlan",
  "metadata": {
    "generatedAt": "2026-05-22T17:14:03Z",
    "planHash": "9f4b1c2d8a3e6f5b7c1e0d2a4f6b8e3c5d7a9f1b2e4d6a8c0f1b3e5d7a9c2e4f"
  },
  "spec": {
    "summary": {"total": 3, "included": 2, "excluded": 1},
    "options": {"runtimeClassName": "gvisor"},
    "steps": [
      {
        "order": 1,
        "target": {"kind": "Deployment", "namespace": "default", "name": "web"},
        "action": "set-runtime-class",
        "runtimeClassName": "gvisor",
        "addToleration": true,
        "waitFor": "Ready",
        "riskScore": 10,
        "notes": "Stateless Deployment fronted by a Service; safe to roll first."
      }
    ],
    "excluded": [...]
  }
}
```

`agentmoat apply --output json` (one `StepResult` and the apply envelope):

```
{
  "apiVersion": "agentmoat.io/v1alpha1",
  "kind": "ApplyResult",
  "metadata": {
    "planHash": "9f4b1c2d8a3e6f5b7c1e0d2a4f6b8e3c5d7a9f1b2e4d6a8c0f1b3e5d7a9c2e4f",
    "dryRun": false
  },
  "spec": {
    "summary": {"total": 2, "applied": 2, "alreadyApplied": 0, "skipped": 0, "failed": 0},
    "steps": [
      {
        "order": 1,
        "target": {"kind": "Deployment", "namespace": "default", "name": "web"},
        "status": "applied",
        "patch": "{\"spec\":{\"template\":{\"spec\":{\"runtimeClassName\":\"gvisor\",\"tolerations\":[{\"key\":\"runtime\",\"operator\":\"Equal\",\"value\":\"gvisor\",\"effect\":\"NoSchedule\"}]}}}}"
      }
    ]
  }
}
```

## Operational guarantees

- **Read-only by default.** `scan` and `plan` never mutate cluster state. They are safe to run in CI jobs or against production with a read-only kubeconfig.
- **Dry-run by default.** `apply` and `rollback` default to `--dry-run=true`: patches are computed and shown in the `StepResult.patch` field but never sent to the API server. To actually change state you must pass `--dry-run=false` explicitly.
- **Idempotent.** Every `apply` writes the plan hash to the affected namespace as `agentmoat.io/plan-hash`. Re-running the same plan against the same cluster reports each step as `already-applied` and exits with code 0.
- **Auditable.** Every step appends one JSON line to `~/.agentmoat/audit.jsonl` (disable with `--no-audit`), including dry-run steps, which are marked `dryRun: true`. Every step that actually mutates also emits a Kubernetes Event on the patched object (disable with `--no-events`).
- **Deterministic exit codes.** CI scripts can branch on them:

| Code | Meaning                                                                                   |
| ---- | ----------------------------------------------------------------------------------------- |
| 0    | Success. Cluster matches requested state.                                                 |
| 1    | Generic error (kubeconfig, network, malformed plan, etc).                                 |
| 2    | `scan` / `explain namespace` / `explain workload`: at least one `incompatible` workload.  |
| 3    | `apply` or `rollback`: partial outcome; idempotent re-run is safe.                        |
| 4    | `verify`: `runtimeClassName` mismatch and/or in-pod probe did not find gVisor.            |

The full table is in [`docs/exit-codes.md`](docs/exit-codes.md).

## EKS via Packer

[`packer/eks-gvisor-al2023.pkr.hcl`](packer/eks-gvisor-al2023.pkr.hcl) builds an EKS-optimized AL2023 AMI with `runsc` and the containerd v2 shim preinstalled, the containerd plugin preconfigured, and the platform pinned to systrap (KVM is not available on EKS instances).

```
cd packer
packer init .
packer validate .
packer build .
```

Plug the resulting AMI into a self-managed node group or a Karpenter `EC2NodeClass`, label the nodes `runtime=gvisor`, taint them `runtime=gvisor:NoSchedule`, then apply `deploy/runtimeclass.yaml`. The detailed end-to-end guide (CFN/Terraform snippets, IAM, Karpenter config) belongs in [`docs/eks-deployment.md`](docs/eks-deployment.md), which is currently a placeholder (planned for Phase 5).

## Local kind

`make kind-up` builds a custom kind worker image ([`kind/Dockerfile.gvisor-node`](kind/Dockerfile.gvisor-node)) that contains `/usr/local/bin/runsc` and the containerd v2 shim. The cluster topology ([`kind/cluster.yaml`](kind/cluster.yaml)) is one stock control plane plus one worker labeled `runtime=gvisor`; the `RuntimeClass` in [`test/e2e/manifests/runtimeclass.yaml`](test/e2e/manifests/runtimeclass.yaml) uses `handler: gvisor`, so pods with `runtimeClassName: gvisor` really do run under `runsc`. `CLUSTER_NAME` and `KEEP_CLUSTER=1` are supported for iteration.

```
make kind-up            # idempotent; rebuilds the image only when missing
make e2e                # full scan -> plan -> apply -> rollback against the cluster
KEEP_CLUSTER=1 make e2e
make kind-down
```

## Cheat sheet

### Commands

| Command              | Purpose                                                                         |
| -------------------- | ------------------------------------------------------------------------------- |
| `agentmoat scan`     | Enumerate workloads and classify gVisor compatibility. RO.                      |
| `agentmoat plan`     | Produce a deterministic `MigrationPlan` from a scan. RO.                        |
| `agentmoat apply`    | Patch workloads per the plan. Default dry-run. Idempotent.                      |
| `agentmoat rollback` | Reverse a previously applied plan. Default dry-run.                             |
| `agentmoat verify`   | Confirm live pods match the plan's `runtimeClassName`. RO.                      |
| `agentmoat explain`  | Embedded docs viewer; `explain namespace` / `explain workload` for deep scans.  |
| `agentmoat version`  | Print binary version and git SHA.                                               |

### Global flags

| Flag                    | Default            | Purpose                                                                   |
| ----------------------- | ------------------ | ------------------------------------------------------------------------- |
| `--output / -o`         | `table`            | `table`, `json`, or `yaml`. `json`/`yaml` follow `agentmoat.io/v1alpha1`. |
| `--kubeconfig`          | `$KUBECONFIG`      | Path to kubeconfig.                                                       |
| `--context`             | current-context    | kubeconfig context to use.                                                |
| `--namespace / -n`      | (all)              | Repeatable. Default: scan every namespace.                                |
| `--all-namespaces / -A` | true if `-n` unset | Explicit all-namespaces flag.                                             |
| `--selector / -l`       | (none)             | Kubernetes label selector applied to every list call.                     |
| `--include-system`      | `false`            | Include `kube-system` and other `kube-*` namespaces.                      |
| `--rules`               | (none)             | Path to YAML overriding rule severities or adding rules.                  |
| `--explain`             | `false`            | Inline educational notes in supported output formats.                     |

### Per-command flags

| Flag               | Command        | Default  | Purpose                                                              |
| ------------------ | -------------- | -------- | -------------------------------------------------------------------- |
| `--scan`           | `plan`         | (inline) | Read a stored `ScanReport` from disk instead of scanning.            |
| `--include-review` | `plan`         | `false`  | Also include `review`-class workloads in the plan.                   |
| `--runtime-class`  | `plan`         | `gvisor` | RuntimeClass name to patch onto migrated workloads.                  |
| `--plan`           | apply/rollback | required | Path to a `MigrationPlan` JSON/YAML.                                 |
| `--dry-run`        | apply/rollback | `true`   | Compute patches but do not mutate the cluster.                       |
| `--no-events`      | apply/rollback | `false`  | Do not emit Kubernetes Events per mutation.                          |
| `--no-audit`       | apply/rollback | `false`  | Do not append to `~/.agentmoat/audit.jsonl`.                         |
| `--in-pod-probe`   | verify         | `false`  | Exec into a running pod and check dmesg/cmdline for gVisor markers.  |

### Output formats

| Format  | Use case                                                                                  |
| ------- | ----------------------------------------------------------------------------------------- |
| `table` | Default human-readable view; per-workload row with verdict and top reason.                |
| `json`  | Versioned, stable schema (`agentmoat.io/v1alpha1`). Pipe into `jq` or store as evidence.  |
| `yaml`  | Byte-identical to JSON after canonicalisation; convenient for review/diff.                |

## FAQ

**Why not just `kubectl patch` everything?** You can. agentmoat performs the same operations and adds the bookkeeping: it classifies (so you do not accidentally patch an incompatible workload), it orders by risk (stateless and non-host-network first), it is idempotent (the namespace annotation makes re-runs safe), it keeps an audit trail, and it offers one-command rollback.

**Is it safe to run in production?** `scan` and `plan` are strictly read-only. `apply` and `rollback` default to `--dry-run=true` and show the exact strategic-merge patch in `StepResult.patch` before anything changes. A read-only kubeconfig is enough for `scan` and `plan`; ready-to-bind RBAC for both modes lives under [`deploy/`](deploy/): `clusterrole-readonly.yaml` (for scan / plan / verify) and `clusterrole-apply.yaml` (for apply / rollback).

**What if a workload needs raw sockets, eBPF, or GPU passthrough?** The classifier flags them. `raw-socket` and `ebpf` are `error` rules, so those workloads are marked `incompatible` and the planner excludes them; `gpu-passthrough` is a `warn` rule, so those workloads are marked `review` and are excluded unless `--include-review` is passed. The `reasons[]` field on each `WorkloadResult` names the rule that fired and links to the relevant gVisor documentation, so the recommendation is concrete: either keep the workload on `runc` (and place it on a non-gVisor node pool), or adopt the gVisor option that supports it (`--net-raw`, `nvproxy`).

**Does agentmoat install anything in the cluster?** No CRDs, no webhooks, and no controllers. It only patches pod templates and reads/writes one namespace annotation (`agentmoat.io/plan-hash`). The only in-cluster prerequisite is a `RuntimeClass` named `gvisor` (or whatever you pass as `--runtime-class`) that points at a node which actually carries `runsc`.

**Why a custom Packer AMI on EKS?** Bottlerocket does not ship `runsc`, managed node groups have no gVisor toggle, and AWS does not officially support gVisor. The path of least resistance is to bring your own AL2023 node image with `runsc` preinstalled. `packer/eks-gvisor-al2023.pkr.hcl` is that image.

**Why `systrap` and not KVM?** On EKS, the instance kernel does not expose `/dev/kvm` to user space; in kind on a macOS host, KVM is not available either. The `systrap` platform works in both environments. The setting is pinned in [`kind/runsc.toml`](kind/runsc.toml) and `packer/files/runsc.toml`.

## Status and roadmap

Phase 0 (scaffolding), Phase 1 (read-only scan + classifier), and Phase 2 (planner + applier + rollback, with idempotency and audit) have landed. `agentmoat scan`, `plan`, `apply`, and `rollback` are wired up and tested end-to-end against a real gVisor kind cluster.

Roadmap:

- **Phase 3**: `agentmoat verify` (checks each pod's `runtimeClassName`; optional `--in-pod-probe` for in-container confirmation) and `agentmoat explain` (embedded docs viewer). Both have shipped.
- **Phase 5 and later**: EKS end-to-end guide (CloudFormation/Terraform/Karpenter) and more Packer variants.

## Where to go next

- [Architecture](docs/architecture.md): the core Go library; the CLI is a thin shell around it.
- [gVisor 101](docs/gvisor-101.md): Sentry, Gofer, platforms, and where the overhead comes from.
- [RuntimeClass 101](docs/runtimeclass-101.md): a one-page primer on the `RuntimeClass` API.
- [Threat model](docs/threat-model.md): which attacks gVisor stops that `runc` cannot, with CVE references.
- [Compatibility checklist](docs/compatibility-checklist.md): the full rule catalog and the `--rules` override schema.
- [Exit codes](docs/exit-codes.md): deterministic exit codes, by command.
- [EKS deployment](docs/eks-deployment.md): Packer + EKS guide (currently a placeholder; planned for Phase 5).
- [Kind quickstart](docs/kind-quickstart.md): bring up a local cluster with gVisor preinstalled (the doc is currently a placeholder; `make kind-up` is the working path today).

## License

Apache License 2.0. See [LICENSE](LICENSE).
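
The severity-to-verdict rule from the compatibility checks (any `error` rule makes a workload `incompatible`, any `warn` rule makes it `review`, and `info` never blocks) can be sketched in a few lines of Go. This is an illustration only, not agentmoat's actual classifier: the `Reason` struct, the `Verdict` and `Included` functions, and the `compatible` verdict name are assumptions made for the example; only the rule IDs, the severities, and the `--include-review` behavior come from this README.

```go
package main

import "fmt"

// Severity mirrors the three levels named in the compatibility table.
type Severity string

const (
	SeverityError Severity = "error"
	SeverityWarn  Severity = "warn"
	SeverityInfo  Severity = "info"
)

// Reason is a hypothetical, trimmed-down stand-in for one reasons[] entry.
type Reason struct {
	RuleID   string
	Severity Severity
}

// Verdict folds the fired rules into one label: any error wins outright,
// any warn yields "review", and info-only (or no) reasons stay "compatible".
// The "compatible" label is an assumed name, not taken from the README.
func Verdict(reasons []Reason) string {
	verdict := "compatible"
	for _, r := range reasons {
		switch r.Severity {
		case SeverityError:
			return "incompatible"
		case SeverityWarn:
			verdict = "review"
		}
	}
	return verdict
}

// Included reports whether a planner following the README's rules would
// keep the workload: review-class workloads need includeReview, and
// incompatible workloads are always excluded.
func Included(verdict string, includeReview bool) bool {
	switch verdict {
	case "compatible":
		return true
	case "review":
		return includeReview
	default:
		return false
	}
}

func main() {
	reasons := []Reason{
		{RuleID: "network-throughput", Severity: SeverityInfo},
		{RuleID: "host-path-mount", Severity: SeverityWarn},
	}
	v := Verdict(reasons)
	// Prints: review false true
	fmt.Println(v, Included(v, false), Included(v, true))
}
```

Returning early on the first `error` keeps the precedence explicit: a later `warn` or `info` can never downgrade an `incompatible` verdict.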

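The `plan` stage is described above as a pure function whose output carries a SHA-256 `planHash`. One way such a hash could be made independent of input order is to sort the steps, serialize them in a fixed field order, and hash the resulting bytes. The sketch below shows that idea under stated assumptions: the `Step` struct, the `PlanHash` function, the sort keys, and the lowest-risk-first ordering are hypothetical and are not taken from agentmoat's source.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// Step is a hypothetical, reduced plan step used only for this sketch.
type Step struct {
	Kind      string `json:"kind"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	RiskScore int    `json:"riskScore"`
}

// PlanHash sorts a copy of the steps into a total order, marshals them to
// JSON (struct fields are emitted in declaration order), and returns the
// hex-encoded SHA-256 digest of those bytes.
func PlanHash(steps []Step) (string, error) {
	ordered := append([]Step(nil), steps...) // leave the caller's slice untouched
	sort.Slice(ordered, func(i, j int) bool {
		a, b := ordered[i], ordered[j]
		if a.RiskScore != b.RiskScore {
			return a.RiskScore < b.RiskScore // assumption: lowest risk first
		}
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		if a.Name != b.Name {
			return a.Name < b.Name
		}
		return a.Kind < b.Kind
	})
	payload, err := json.Marshal(ordered)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a := []Step{
		{"Deployment", "default", "web", 10},
		{"StatefulSet", "data", "db", 40},
	}
	b := []Step{a[1], a[0]} // same steps, different input order
	ha, _ := PlanHash(a)
	hb, _ := PlanHash(b)
	// Prints: true 64
	fmt.Println(ha == hb, len(ha))
}
```

Note that a timestamp such as `generatedAt` would have to stay out of the hashed payload for two runs over the same scan to agree; whether agentmoat does this is not stated in this README.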