radheshpai87/checkDK

GitHub: radheshpai87/checkDK

An AI-powered static validation tool for Docker Compose and Kubernetes configurations that provides intelligent fix suggestions and Pod failure risk prediction.

Stars: 5 | Forks: 2

# checkDK

**AI-powered Docker Compose and Kubernetes configuration validator. Visit [checkdk.app](https://checkdk.app)**

Catch port conflicts, security misconfigurations, missing health-check probes, and similar issues before they cause damage in production. Upload a config file to get an instant analysis with AI-generated fix suggestions, and keep a searchable history of every scan.

## Features

- **Web UI**: upload or paste a config file at [checkdk.app](https://checkdk.app) and get results instantly
- **Docker Compose analysis**: port conflicts, broken dependencies, missing resource limits, undefined environment variables, `:latest` tags
- **Kubernetes analysis**: NodePort conflicts, selector/label mismatches, privilege-escalation risks, missing health-check probes
- **AI-powered fixes**: Mistral and Groq LLMs explain each issue and provide copy-paste-ready fix steps
- **ML risk prediction**: a RandomForest model estimates the probability that a Pod or container will fail
- **Analysis history**: every scan is stored per user; past results can be searched and reopened
- **GitHub & Google OAuth**: sign in with an existing account, no password required
- **CLI**: an optional `checkdk` CLI wrapper for local use, installed via `npm install -g @checkdk/cli` (no Python required) or `pip install checkdk-cli` (see [cli/README.md](cli/README.md))

## Live Demo

Visit **[checkdk.app](https://checkdk.app)**. The playground requires no sign-up. Sign in with GitHub or Google to save your analysis history.

## Architecture

```
Browser
  │
  ├─▶ CloudFront CDN ──▶ S3 (React SPA)
  │
  └─▶ CloudFront /api/* ──▶ AWS App Runner (FastAPI backend)
                                  │
                                  ├── DynamoDB (users + history)
                                  ├── Mistral / Groq (AI fixes)
                                  └── RandomForest model (risk score)
```

| Layer    | Technology                                           |
| -------- | ---------------------------------------------------- |
| Frontend | React 19, TypeScript, Vite 7, TailwindCSS 4          |
| Backend  | FastAPI, Python 3.11, Uvicorn                        |
| Auth     | GitHub OAuth, Google OAuth, JWT (HS256)              |
| Database | AWS DynamoDB                                         |
| AI       | Mistral AI, Groq (Llama 3.3 70B)                     |
| ML       | scikit-learn RandomForest                            |
| Hosting  | AWS App Runner (backend), S3 + CloudFront (frontend) |
| CI/CD    | GitHub Actions → ECR → App Runner + S3               |

## Local Development

### Prerequisites

- Docker and Docker Compose v2
- A `.env` file in the project root (see below)

### 1. Create `.env`

```
# OAuth (create apps at github.com/settings/developers and console.cloud.google.com)
GITHUB_CLIENT_ID=
GITHUB_CLIENT_SECRET=
GOOGLE_CLIENT_ID=
GOOGLE_CLIENT_SECRET=

# JWT
JWT_SECRET=change-me-to-a-long-random-string

# AI providers (optional; analysis still works without them)
GROQ_API_KEY=
MISTRAL_API_KEY=

# AWS (required for DynamoDB history storage)
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
DYNAMODB_REGION=us-east-1
DYNAMODB_USERS_TABLE=checkdk_users
DYNAMODB_HISTORY_TABLE=checkdk_history
```

### 2. Start the stack

```
docker compose up --build
```

| Service     | URL                        |
| ----------- | -------------------------- |
| Frontend    | http://localhost:3000      |
| Backend API | http://localhost:8000      |
| API docs    | http://localhost:8000/docs |

### 3. Run the backend tests

```
cd backend
pip install -e ".[api,ml]"
pytest tests/ -v
```

### 4. CLI (optional)

The CLI is a standalone Python package. The backend must be running before you use it.

```
cd cli
bash setup.sh                      # creates .venv and installs the package
source .venv/bin/activate

export CHECKDK_API_URL=http://localhost:8000

checkdk docker validate path/to/docker-compose.yml --dry-run
checkdk k8s validate path/to/deployment.yml --dry-run
checkdk predict --cpu 85 --memory 78

deactivate
```

To use the CLI against production, set `CHECKDK_API_URL=https://checkdk.app/api`.

## What Gets Validated

### Docker Compose

- Port conflicts between services
- Missing image or build specification
- Broken service dependencies (`depends_on`)
- Undefined environment variables
- Missing resource limits (`deploy.resources`)
- `:latest` image tags
- Undefined volumes / networks

### Kubernetes

- NodePort conflicts
- Duplicate ports within a Service
- Selector / label mismatches
- Security issues (privileged containers, running as root)
- Missing liveness / readiness probes
- Missing resource limits / requests
- `:latest` image tags

## Example Output

```
┌─────────────────────── Analysis ───────────────────────┐
│ File: docker-compose-complex.yml                       │
│ Score: 23 / 100 ▓░░░░░░░░░░░░░░░                       │
└────────────────────────────────────────────────────────┘

🔴 Critical Issues (7)

  1. Port conflict — services 'web' and 'web2' both bind port 8080
     Fix: Change 'web2' port mapping to 8081:80

  2. Undefined variable — 'backend' references ${DB_URL} (not set)
     Fix: Add DB_URL to your .env file or remove the reference

⚠ Warnings (10)

  1. 'frontend' uses :latest tag — pin to a specific digest for reproducibility
  2. 'backend' has no CPU/memory limits — at risk in resource-constrained environments

╭────────────── AI Suggestion ──────────────╮
│ Top priority: resolve the port conflict.  │
│ Both containers will fail to start until  │
│ one of them is remapped.                  │
╰───────────────────────────────────────────╯
```

## CI / CD

GitHub Actions runs automatically on every pull request and on every merge to `main`.

| Workflow   | Trigger        | Steps                                                                                                         |
| ---------- | -------------- | ------------------------------------------------------------------------------------------------------------- |
| **CI**     | PR to `main`   | pytest, `tsc --noEmit`, ESLint                                                                                |
| **Deploy** | Push to `main` | Build & push Docker image to ECR → deploy to App Runner → build frontend → sync to S3 → invalidate CloudFront |

AWS authentication uses OIDC, so no long-lived AWS keys are stored in GitHub. For the minimum required permissions, see [.github/iam-policy-github-actions.json](.github/iam-policy-github-actions.json).

Required GitHub repository secrets:

| Secret                       | Value                      |
| ---------------------------- | -------------------------- |
| `AWS_ROLE_ARN`               | IAM role ARN for OIDC      |
| `VITE_API_BASE_URL`          | Production API URL         |
| `VITE_GITHUB_CLIENT_ID`      | GitHub OAuth app client ID |
| `VITE_GOOGLE_CLIENT_ID`      | Google OAuth app client ID |
| `CLOUDFRONT_DISTRIBUTION_ID` | CloudFront distribution ID |

## Roadmap

| Phase | Feature                                                              | Status      |
| ----- | -------------------------------------------------------------------- | ----------- |
| 1     | AWS infrastructure (App Runner, S3, CloudFront, ECR)                 | ✅ Complete |
| 2     | Auth + Database (GitHub/Google OAuth, JWT, DynamoDB history)         | ✅ Complete |
| 3     | Post-login app interface (dashboard, playground, get-started)        | ✅ Complete |
| 4     | CI/CD (GitHub Actions: pytest, lint, ECR deploy, S3 sync)            | ✅ Complete |
| 5     | Real-time monitoring (WebSocket pod metrics stream, recharts)        | 🔲 Planned  |
| 6     | Chaos dataset + ML retraining (real EKS failure data via Chaos Mesh) | 🔲 Planned  |
| 7     | Amazon Bedrock (replace Mistral with Claude Haiku via IAM role)      | 🔲 Planned  |

## ML Prediction API

The backend exposes four prediction endpoints for Kubernetes Pod failure risk:

| Endpoint                      | Model                          |
| ----------------------------- | ------------------------------ |
| `POST /predict/random-forest` | scikit-learn RandomForest      |
| `POST /predict/xgboost`       | XGBoost                        |
| `POST /predict/lstm`          | PyTorch LSTM                   |
| `POST /predict/ensemble`      | Majority vote across all three |

**Request fields:** `cpu_usage`, `memory_usage`, `disk_usage`, `network_latency`, `restart_count`, `probe_failures`, `node_cpu_pressure`, `node_memory_pressure`, `pod_age_minutes`

Example: ensemble prediction (failure case):

```
curl -X POST https://checkdk.app/api/predict/ensemble \
  -H "Content-Type: application/json" \
  -d '{
    "cpu_usage": 94.5,
    "memory_usage": 96.2,
    "disk_usage": 88.0,
    "network_latency": 45.0,
    "restart_count": 7,
    "probe_failures": 4,
    "node_cpu_pressure": 1,
    "node_memory_pressure": 1,
    "pod_age_minutes": 95
  }'
```

```
{
  "ensemble_label": "failure",
  "ensemble_confidence": 0.87,
  "random_forest": { "label": "failure", "confidence": 0.62 },
  "xgboost": { "label": "failure", "confidence": 1.0 },
  "lstm": { "label": "failure", "confidence": 1.0 }
}
```

For local use, replace `https://checkdk.app/api` with `http://localhost:8000`.

## Project Structure

```
checkDK/
├── backend/                 # FastAPI application
│   └── checkdk/
│       ├── api/             # Routes (auth, analysis, history)
│       ├── ai/              # Mistral & Groq providers
│       ├── ml/              # RandomForest predictor + training
│       ├── parsers/         # Docker Compose & Kubernetes YAML parsers
│       ├── validators/      # Rule-based validators
│       └── services/        # Business logic
├── frontend/                # React + Vite SPA
│   └── src/
│       ├── components/      # UI components + dashboard + playground
│       ├── contexts/        # AuthContext
│       ├── lib/             # API client, mock analyzer, utilities
│       └── pages/
├── cli/                     # Optional CLI wrapper (`checkdk` command)
├── ml-models/               # Standalone model training scripts
├── .github/
│   ├── workflows/
│   │   ├── ci.yml           # PR checks
│   │   └── deploy.yml       # Production deployment
│   ├── iam-policy-github-actions.json
│   └── trust-policy.json
├── docker-compose.yml       # Local development stack
└── .env                     # Local secrets (not committed)
```

## License

MIT. See [LICENSE](LICENSE) for details.

## Support

- **Bugs / feature requests**: [open an issue](https://github.com/radheshpai87/checkDK/issues)
- **Questions**: open a discussion

**Made with ❤️ for developers who want to catch issues early**
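The `POST /predict/ensemble` endpoint described above is listed as a majority vote across the three models. The following is a minimal sketch of what such a vote could look like, not the backend's actual code. The helper name `majority_vote` and the confidence rule (mean confidence of the models that agree with the winning label) are assumptions made for illustration. That rule happens to reproduce the `0.87` in the sample response, but the README does not confirm how the real value is computed.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions: dict[str, dict]) -> dict:
    """Combine per-model predictions into one label by simple majority.

    `predictions` maps a model name to {"label": str, "confidence": float}.
    Hypothetical helper for illustration; not taken from the checkDK codebase.
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")

    # Count how many models voted for each label and pick the most common one.
    votes = Counter(p["label"] for p in predictions.values())
    top_label, _ = votes.most_common(1)[0]

    # Assumption: the ensemble confidence is the mean confidence of the
    # models that agree with the winning label.
    agreeing = [
        p["confidence"] for p in predictions.values() if p["label"] == top_label
    ]

    return {
        "ensemble_label": top_label,
        "ensemble_confidence": round(mean(agreeing), 2),
        "votes": dict(votes),
    }


# Per-model outputs copied from the sample response in the README.
sample = {
    "random_forest": {"label": "failure", "confidence": 0.62},
    "xgboost": {"label": "failure", "confidence": 1.0},
    "lstm": {"label": "failure", "confidence": 1.0},
}
result = majority_vote(sample)
print(result)
```

With an odd number of models and two possible labels, a tie cannot occur, which is one reason a three-model vote is a convenient design. A real implementation would still need a tie-breaking rule if more labels or models were added.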

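Tags: AI code auditing, AV bypass, AWS Serverless, CI/CD plugins, DevSecOps tools, Docker Compose, Docker security, FastAPI, K8s security, LLM code repair, Mistral AI, NodePort detection, OAuth authentication, Pod failure prediction, React, SaaS, Syscalls, URL discovery, web screenshots, cloud security monitoring, health checks, credential scanning, security detection, container security, port conflict detection, operations automation, reverse-engineering tools, configuration validation, random forest, static analysis, risk scoring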