# checkDK
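**AI-powered validator for Docker Compose and Kubernetes configurations. Try it at [checkdk.app](https://checkdk.app).**
Catch port conflicts, security misconfigurations, missing health probes, and similar issues before they cause damage in production. Upload a config file to get an instant analysis with AI-generated fix suggestions, and keep a searchable history of every scan.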
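## Features
- **Web UI**: upload or paste a config file at [checkdk.app](https://checkdk.app) and get results instantly
- **Docker Compose analysis**: port conflicts, broken dependencies, missing resource limits, undefined environment variables, `:latest` tags
- **Kubernetes analysis**: NodePort conflicts, selector/label mismatches, privilege escalation risks, missing health probes
- **AI-powered fixes**: Mistral and Groq LLMs explain each issue and provide copy-paste-ready fix steps
- **ML risk prediction**: a RandomForest model estimates the probability of Pod/container failure
- **Analysis history**: every scan is stored per user; search and reopen past results
- **GitHub & Google OAuth**: sign in with an existing account, no password required
- **CLI**: optional `checkdk` CLI wrapper for local use. Install it with `npm install -g @checkdk/cli` (no Python required) or `pip install checkdk-cli` (see [cli/README.md](cli/README.md))
## Live demo
Visit **[checkdk.app](https://checkdk.app)**. The playground requires no sign-up.
Sign in with GitHub or Google to save your analysis history.
## Architecture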
```
Browser
  │
  ├─▶ CloudFront CDN ──▶ S3 (React SPA)
  │
  └─▶ CloudFront /api/* ──▶ AWS App Runner (FastAPI backend)
                                  │
                                  ├── DynamoDB (users + history)
                                  ├── Mistral / Groq (AI fixes)
                                  └── RandomForest model (risk score)
```
| Layer    | Technology                                           |
| -------- | ---------------------------------------------------- |
| Frontend | React 19, TypeScript, Vite 7, TailwindCSS 4 |
| Backend | FastAPI, Python 3.11, Uvicorn |
| Auth | GitHub OAuth, Google OAuth, JWT (HS256) |
| Database | AWS DynamoDB |
| AI | Mistral AI, Groq (Llama 3.3 70B) |
| ML | scikit-learn RandomForest |
| Hosting | AWS App Runner (backend), S3 + CloudFront (frontend) |
| CI/CD | GitHub Actions → ECR → App Runner + S3 |
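The Auth row lists JWT with HS256, which is an HMAC-SHA256 signature over the base64url-encoded header and payload. The following is a minimal standard-library sketch of that scheme, not the backend's actual code: the function names `sign_hs256` and `verify_hs256` are hypothetical, and the sketch deliberately skips claim checks such as `exp` and does not inspect the `alg` header.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _mac(signing_input: str, secret: str) -> bytes:
    """HMAC-SHA256 over 'header.payload' using the shared secret."""
    return hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()


def sign_hs256(claims: dict, secret: str) -> str:
    """Hypothetical helper: build a compact HS256-signed token."""
    header = {"alg": "HS256", "typ": "JWT"}
    segments = [
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    ]
    signing_input = ".".join(segments)
    return f"{signing_input}.{_b64url_encode(_mac(signing_input, secret))}"


def verify_hs256(token: str, secret: str) -> Optional[dict]:
    """Return the claims if the signature matches, otherwise None."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    try:
        expected = _mac(f"{parts[0]}.{parts[1]}", secret)
        supplied = _b64url_decode(parts[2])
    except ValueError:  # malformed base64 or non-ASCII input
        return None
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, supplied):
        return None
    return json.loads(_b64url_decode(parts[1]))
```

Because HS256 is symmetric, anything that can verify a token can also mint one, so the secret (`JWT_SECRET` in the `.env` template) should stay on the backend only.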
## Local development
### Prerequisites
- Docker and Docker Compose v2
- A `.env` file in the project root (see below)
### 1. Create `.env`
```
# OAuth (create apps at github.com/settings/developers and console.cloud.google.com)
GITHUB_CLIENT_ID=
GITHUB_CLIENT_SECRET=
GOOGLE_CLIENT_ID=
GOOGLE_CLIENT_SECRET=
# JWT
JWT_SECRET=change-me-to-a-long-random-string
# AI providers (optional; analysis still works without them)
GROQ_API_KEY=
MISTRAL_API_KEY=
# AWS (required for DynamoDB history storage)
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
DYNAMODB_REGION=us-east-1
DYNAMODB_USERS_TABLE=checkdk_users
DYNAMODB_HISTORY_TABLE=checkdk_history
```
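Before starting the stack, it can help to see which entries of the template are still empty. The sketch below is a hypothetical helper, not part of the repository: the grouping mirrors the comments in the template, and whether the backend itself validates these variables this way is an assumption. The parser handles only plain `KEY=value` lines (no quoting, no `export` prefix).

```python
# Variable names are copied from the .env template; the group names are
# illustrative labels, not identifiers used by checkDK.
ENV_GROUPS = {
    "oauth": (
        "GITHUB_CLIENT_ID",
        "GITHUB_CLIENT_SECRET",
        "GOOGLE_CLIENT_ID",
        "GOOGLE_CLIENT_SECRET",
    ),
    "jwt": ("JWT_SECRET",),
    "ai (optional)": ("GROQ_API_KEY", "MISTRAL_API_KEY"),
    "aws": (
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "DYNAMODB_REGION",
        "DYNAMODB_USERS_TABLE",
        "DYNAMODB_HISTORY_TABLE",
    ),
}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_by_group(values: dict[str, str]) -> dict[str, list[str]]:
    """Return, per group, the variables that are absent or empty."""
    report: dict[str, list[str]] = {}
    for group, names in ENV_GROUPS.items():
        unset = [name for name in names if not values.get(name)]
        if unset:
            report[group] = unset
    return report


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        for group, names in missing_by_group(parse_env(handle.read())).items():
            print(f"{group}: {', '.join(names)}")
```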
### 2. Start the stack
```
docker compose up --build
```
| Service     | URL                        |
| ----------- | -------------------------- |
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:8000 |
| API docs | http://localhost:8000/docs |
### 3. Run the backend tests
```
cd backend
pip install -e ".[api,ml]"
pytest tests/ -v
```
### 4. CLI (optional)
The CLI is a standalone Python package. The backend must be running before you use it.
```
cd cli
bash setup.sh # creates .venv and installs the package
source .venv/bin/activate
export CHECKDK_API_URL=http://localhost:8000
checkdk docker validate path/to/docker-compose.yml --dry-run
checkdk k8s validate path/to/deployment.yml --dry-run
checkdk predict --cpu 85 --memory 78
deactivate
```
To use the CLI against production, set `CHECKDK_API_URL=https://checkdk.app/api`.
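## What gets validated
### Docker Compose
- Port conflicts between services
- Missing image or build specification
- Broken service dependencies (`depends_on`)
- Undefined environment variables
- Missing resource limits (`deploy.resources`)
- `:latest` image tags
- Undefined volumes / networks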
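To illustrate the first check in the list, here is a minimal sketch of host-port conflict detection over an already-parsed Compose file (a plain `dict`, for example the result of loading the YAML). It is not the validator shipped in `backend/checkdk/validators/`; the function names are hypothetical, port ranges are not handled, and the host IP is ignored, so two services binding the same port on different interfaces are still reported.

```python
from collections import defaultdict
from typing import Optional


def host_port(entry) -> Optional[str]:
    """Extract the published host port from one `ports` entry, if any."""
    if isinstance(entry, dict):
        # Long syntax, e.g. {"target": 80, "published": 8080}
        published = entry.get("published")
        return str(published) if published is not None else None
    # Short syntax: drop the protocol suffix, then split host parts.
    parts = str(entry).split("/")[0].split(":")
    if len(parts) < 2:
        # "80" alone publishes the container port on a random host port.
        return None
    # "8080:80" and "127.0.0.1:8080:80" both have the host port second to last.
    return parts[-2]


def find_port_conflicts(compose: dict) -> dict[str, list[str]]:
    """Map each host port claimed by more than one service to those services."""
    users: dict[str, list[str]] = defaultdict(list)
    for name, service in (compose.get("services") or {}).items():
        for entry in (service or {}).get("ports") or []:
            port = host_port(entry)
            if port is not None and name not in users[port]:
                users[port].append(name)
    return {port: names for port, names in users.items() if len(names) > 1}


if __name__ == "__main__":
    example = {
        "services": {
            "web": {"ports": ["8080:80"]},
            "web2": {"ports": ["8080:80"]},
        }
    }
    for port, names in find_port_conflicts(example).items():
        print(f"Port conflict: {', '.join(names)} all bind host port {port}")
```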
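### Kubernetes
- NodePort conflicts
- Duplicate ports within a Service
- Selector / label mismatches
- Security issues (privileged containers, running as root)
- Missing liveness / readiness probes
- Missing resource limits / requests
- `:latest` image tags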
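The selector/label check can be sketched as follows: a Service matches a workload only if every key/value pair in its `spec.selector` appears in the workload's Pod template labels. This is an illustrative sketch over already-parsed manifests, not the project's validator; the function names are hypothetical, and namespaces and set-based selectors are ignored.

```python
WORKLOAD_KINDS = {"Deployment", "StatefulSet", "DaemonSet"}


def selector_matches(selector: dict, labels: dict) -> bool:
    """True if every selector pair is present in the Pod labels."""
    return bool(selector) and all(
        labels.get(key) == value for key, value in selector.items()
    )


def template_labels(manifest: dict) -> dict:
    """Read spec.template.metadata.labels, tolerating missing levels."""
    template = (manifest.get("spec") or {}).get("template") or {}
    return (template.get("metadata") or {}).get("labels") or {}


def find_unmatched_services(manifests: list[dict]) -> list[str]:
    """Names of Services whose selector matches no workload in the input."""
    pod_label_sets = [
        template_labels(doc)
        for doc in manifests
        if doc.get("kind") in WORKLOAD_KINDS
    ]
    unmatched = []
    for doc in manifests:
        if doc.get("kind") != "Service":
            continue
        selector = (doc.get("spec") or {}).get("selector") or {}
        if not selector:
            # Selector-less Services (e.g. ExternalName) are out of scope here.
            continue
        if not any(selector_matches(selector, labels) for labels in pod_label_sets):
            unmatched.append(doc["metadata"]["name"])
    return unmatched
```

Passing the documents of a multi-document manifest as one list lets the check see Services and workloads together; a Service whose target lives in a separate file would be flagged, which is a limitation of this sketch.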
## Sample output
```
┌─────────────────────── Analysis ───────────────────────┐
│ File: docker-compose-complex.yml │
│ Score: 23 / 100 ▓░░░░░░░░░░░░░░░ │
└────────────────────────────────────────────────────────┘
🔴 Critical Issues (7)
1. Port conflict — services 'web' and 'web2' both bind port 8080
Fix: Change 'web2' port mapping to 8081:80
2. Undefined variable — 'backend' references ${DB_URL} (not set)
Fix: Add DB_URL to your .env file or remove the reference
⚠ Warnings (10)
1. 'frontend' uses :latest tag — pin to a specific digest for reproducibility
2. 'backend' has no CPU/memory limits — at risk in resource-constrained environments
╭────────────── AI Suggestion ──────────────╮
│ Top priority: resolve the port conflict. │
│ Both containers will fail to start until │
│ one of them is remapped. │
╰───────────────────────────────────────────╯
```
## CI / CD
GitHub Actions runs automatically on every pull request and on every merge to `main`.
| Workflow | Trigger | Steps |
| ---------- | -------------- | ------------------------------------------------------------------------------------------------------------- |
| **CI** | PR to `main` | pytest, `tsc --noEmit`, ESLint |
| **Deploy** | Push to `main` | Build & push Docker image to ECR → deploy to App Runner → build frontend → sync to S3 → invalidate CloudFront |
AWS authentication uses OIDC, so no long-lived AWS keys are stored in GitHub. See [.github/iam-policy-github-actions.json](.github/iam-policy-github-actions.json) for the minimum permissions required.
Required GitHub repository secrets:
| Secret | Value |
| ---------------------------- | -------------------------- |
| `AWS_ROLE_ARN` | IAM role ARN for OIDC |
| `VITE_API_BASE_URL` | Production API URL |
| `VITE_GITHUB_CLIENT_ID` | GitHub OAuth app client ID |
| `VITE_GOOGLE_CLIENT_ID` | Google OAuth app client ID |
| `CLOUDFRONT_DISTRIBUTION_ID` | CloudFront distribution ID |
## Roadmap
| Phase | Feature | Status |
| ----- | -------------------------------------------------------------------- | ----------- |
| 1 | AWS infrastructure (App Runner, S3, CloudFront, ECR) | ✅ Complete |
| 2 | Auth + Database (GitHub/Google OAuth, JWT, DynamoDB history) | ✅ Complete |
| 3 | Post-login app interface (dashboard, playground, get-started) | ✅ Complete |
| 4 | CI/CD (GitHub Actions — pytest, lint, ECR deploy, S3 sync) | ✅ Complete |
| 5 | Real-time monitoring (WebSocket pod metrics stream, recharts) | 🔲 Planned |
| 6 | Chaos dataset + ML retraining (real EKS failure data via Chaos Mesh) | 🔲 Planned |
| 7 | Amazon Bedrock (replace Mistral with Claude Haiku via IAM role) | 🔲 Planned |
## ML Prediction API
The backend exposes four prediction endpoints for Kubernetes Pod failure risk:
| Endpoint | Model |
| ----------------------------- | ------------------------------ |
| `POST /predict/random-forest` | scikit-learn RandomForest |
| `POST /predict/xgboost` | XGBoost |
| `POST /predict/lstm` | PyTorch LSTM |
| `POST /predict/ensemble` | Majority vote across all three |
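The ensemble row describes a majority vote across the three models. A minimal sketch of such a vote is shown below. The per-model `label` and `confidence` fields follow the example response in this section, but the `majority_vote` function, the `votes` field, and in particular the way the ensemble confidence is computed (mean confidence of the models that voted for the winning label) are assumptions, not the backend's documented behavior.

```python
from collections import Counter


def majority_vote(predictions: dict[str, dict]) -> dict:
    """Combine per-model predictions into one label by majority vote.

    `predictions` maps a model name to {"label": str, "confidence": float}.
    Ties are resolved in favor of the label that appears first.
    """
    labels = [result["label"] for result in predictions.values()]
    winner, votes = Counter(labels).most_common(1)[0]
    agreeing = [
        result["confidence"]
        for result in predictions.values()
        if result["label"] == winner
    ]
    return {
        "ensemble_label": winner,
        "votes": votes,
        # Assumption: average the confidence of the agreeing models only.
        "ensemble_confidence": round(sum(agreeing) / len(agreeing), 2),
    }


if __name__ == "__main__":
    print(
        majority_vote(
            {
                "random_forest": {"label": "failure", "confidence": 0.62},
                "xgboost": {"label": "failure", "confidence": 1.0},
                "lstm": {"label": "failure", "confidence": 1.0},
            }
        )
    )
```

With three voters and two possible labels a tie cannot occur, which is one reason an odd number of models is convenient for this kind of ensemble.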
**Request fields:** `cpu_usage`, `memory_usage`, `disk_usage`, `network_latency`, `restart_count`, `probe_failures`, `node_cpu_pressure`, `node_memory_pressure`, `pod_age_minutes`
Example: ensemble prediction (failure case):
```
curl -X POST https://checkdk.app/api/predict/ensemble \
-H "Content-Type: application/json" \
-d '{
"cpu_usage": 94.5,
"memory_usage": 96.2,
"disk_usage": 88.0,
"network_latency": 45.0,
"restart_count": 7,
"probe_failures": 4,
"node_cpu_pressure": 1,
"node_memory_pressure": 1,
"pod_age_minutes": 95
}'
```
```
{
"ensemble_label": "failure",
"ensemble_confidence": 0.87,
"random_forest": { "label": "failure", "confidence": 0.62 },
"xgboost": { "label": "failure", "confidence": 1.0 },
"lstm": { "label": "failure", "confidence": 1.0 }
}
```
For local use, replace `https://checkdk.app/api` with `http://localhost:8000`.
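The same request can be sent from Python using only the standard library. This is a sketch, not an official client: `build_predict_request` and `predict` are hypothetical names, the default base URL is the local backend address from the development section, and which request fields are mandatory is not stated here, so the helper only rejects field names that are not in the documented list.

```python
import json
import urllib.request

# Field names copied from the "Request fields" list above.
REQUEST_FIELDS = frozenset({
    "cpu_usage", "memory_usage", "disk_usage", "network_latency",
    "restart_count", "probe_failures", "node_cpu_pressure",
    "node_memory_pressure", "pod_age_minutes",
})


def build_predict_request(
    metrics: dict,
    base_url: str = "http://localhost:8000",
    model: str = "ensemble",
) -> urllib.request.Request:
    """Build (but do not send) a POST request to /predict/<model>."""
    unknown = sorted(set(metrics) - REQUEST_FIELDS)
    if unknown:
        raise ValueError(f"unknown request fields: {', '.join(unknown)}")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict/{model}",
        data=json.dumps(metrics).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(metrics: dict, **options) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_predict_request(metrics, **options)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running backend, e.g. `docker compose up --build`.
    result = predict({"cpu_usage": 94.5, "memory_usage": 96.2, "restart_count": 7})
    print(result)
```

To target production instead, pass `base_url="https://checkdk.app/api"`.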
## Project structure
```
checkDK/
├── backend/                  # FastAPI application
│   └── checkdk/
│       ├── api/              # Routes (auth, analysis, history)
│       ├── ai/               # Mistral & Groq providers
│       ├── ml/               # RandomForest predictor + training
│       ├── parsers/          # Docker Compose & Kubernetes YAML parsers
│       ├── validators/       # Rule-based validators
│       └── services/         # Business logic
├── frontend/                 # React + Vite SPA
│   └── src/
│       ├── components/       # UI components + dashboard + playground
│       ├── contexts/         # AuthContext
│       ├── lib/              # API client, mock analyzer, utilities
│       └── pages/
├── cli/                      # Optional CLI wrapper (`checkdk` command)
├── ml-models/                # Standalone model training scripts
├── .github/
│   ├── workflows/
│   │   ├── ci.yml            # PR checks
│   │   └── deploy.yml        # Production deployment
│   ├── iam-policy-github-actions.json
│   └── trust-policy.json
├── docker-compose.yml        # Local development stack
└── .env                      # Local secrets (not committed)
```
## License
MIT. See [LICENSE](LICENSE) for details.
## Support
- **Bugs / feature requests**: [open an issue](https://github.com/radheshpai87/checkDK/issues)
- **Questions**: open a discussion
**Made with ❤️ for developers who want to catch issues early**