Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent

GitHub: Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent

基于AI Agent的智能Kubernetes事件响应系统，自动化完成检测、诊断、通知与修复。

Stars: 0 | Forks: 0

# 🦅 K8sHawk-使用 AI Agent 进行智能 Kubernetes 事件响应 ## 📐 架构图 ![Architecture diagram](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/e3c8e98ba50df13dd10b62c7e4193378a42e43e6/Gemini_Generated_Image_mjm71pmjm71pmjm7.png) ## 🔍 什么是 K8sHawk？ **K8sHawk** is an end-to-end autonomous Kubernetes incident response system that I built to eliminate the manual toil of debugging Kubernetes cluster failures. When a pod crashes, fails to pull an image, runs out of memory, or hits any other common Kubernetes failure, K8sHawk automatically: 1. **Detects** the incident in real time by watching the Kubernetes event stream 2. **Investigates** the root cause using an AI agent (Groq LLM) that runs actual `kubectl` commands iteratively — just like a human SRE would 3. **Generates a voice alert** in English, Tamil, or Tanglish using Sarvam AI's TTS API, and uploads the audio directly to Slack 4. **Sends a rich Slack notification** with full RCA, severity rating, and an interactive one-click fix button 5. **Executes the approved fix** command (`kubectl delete`, `kubectl rollout restart`, etc.) when you click Approve in Slack 6. **Monitors recovery** by polling the cluster every 15 seconds and reports back to the Slack thread 7. **Saves documentation** as structured Markdown RCA files, organized by date The system handles everything from image pull failures and crash loops to OOMKilled events and scheduling failures — all with zero human investigation required. ## ✨ 关键特性 - **Agentic AI Investigation** — The LLM doesn't just read static data. It iteratively calls `kubectl get`, `kubectl describe`, `kubectl logs`, and `kubectl top` in a tool-use loop (up to 5 rounds) to investigate the incident like a real engineer - **Multi-model Fallback** — Primary: `llama-3.3-70b-versatile`. Falls back silently to `llama-3.1-8b-instant` on rate limits, then to pattern-based kubectl analysis if both are unavailable - **Voice Alerts (Multilingual)** — Generates spoken summaries in English (Priya), Tamil (Anushka), or Tanglish via Sarvam AI TTS and uploads the MP3 directly to the Slack thread - **Interactive Slack Notifications** — Rich Block Kit messages with severity colors, RCA summaries, kubectl commands, and Approve / Dismiss buttons - **One-Click Fix Execution** — Clicking Approve in Slack triggers the fix command on the actual cluster with real-time terminal output and Slack thread updates - **Recovery Monitoring** — Polls cluster health every 15 seconds for up to 3 minutes after fix execution - **Deduplication** — Prevents duplicate alerts for the same pod within a 60-second window - **Date-Organized Storage** — RCA Markdown files and voice MP3s are saved under `knowledge-base/` organized by date ## 🛠️ 技术栈 | Layer | Technology | |---|---| | Language | Python 3.11+ | | API Server | FastAPI + Uvicorn | | Async Runtime | asyncio | | Kubernetes Client | kubernetes-python | | AI Investigation | Groq API (`llama-3.3-70b-versatile`) | | Voice Generation | Sarvam AI TTS | | Notifications | Slack SDK (Block Kit) | | HTTP Client | httpx (async) | | Config Management | Pydantic Settings | | Terminal UI | Rich | | Local Cluster | k3d (k3s in Docker) | ## 📸 系统流程 — 现场演示 The following screenshots show the complete K8sHawk pipeline end-to-end, from cluster setup to incident resolution. Each step is shown in sequence as it happened during a live demo run. ### 步骤 1 — 通过 Docker Desktop (k3d) 运行 Kubernetes 集群 The cluster backing K8sHawk is a lightweight k3d cluster (k3s running inside Docker containers) managed via Docker Desktop. The five containers shown represent the k3d loadbalancer, server node, and three agent nodes. ![Docker Desktop - k3d cluster containers](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(584).png?raw=true) ### 步骤 2 — K8sHawk 启动并连接到集群 Running `python -m k8shawk.main` launches the application. It loads configuration, initializes all services, connects to the Kubernetes cluster, prints a live summary of cluster stats (3 nodes, 11 pods, 5 namespaces, 7 services, 6 deployments), and begins watching all namespaces for Warning events. ![K8sHawk startup banner and cluster status](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(573).png?raw=true) ### 步骤 3 — 触发真实事件（错误镜像拉取） To simulate a real incident, I created a pod using a nonexistent Docker image. This causes Kubernetes to emit a `Failed` / `ImagePullBackOff` warning event, which K8sHawk detects immediately. The second command shows it already exists from a previous test run. ![Creating test pod with bad image in PowerShell](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(582).png?raw=true) ### 步骤 4 — 事件被检测到，AI 调查开始 K8sHawk picks up the `Failed` event on `test-incident` in the `default` namespace within seconds. The full error message from the Kubernetes event is captured. The AI investigator starts working — it first tries the primary Groq model, hits a rate limit, silently switches to the fallback model, which is also rate-limited, and finally falls back to kubectl-based pattern analysis. The voice alert generation then begins by calling the Sarvam TTS API. ![Incident detected and AI pipeline running](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(574).png?raw=true) ### 步骤 5 — 生成语音警报，上传到 Slack，根因分析（RCA）保存 The Sarvam TTS API generates a 1.8MB MP3 voice alert file. It is uploaded to Slack using the `files.uploadV2` workflow and attached to the incident thread. The RCA Markdown file is simultaneously saved to `knowledge-base/rca/2026-04-16/`. The incident response pipeline completes in full with severity MEDIUM. ![Voice upload, Slack confirmation, RCA save](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(575).png?raw=true) ### 步骤 6 — 带有批准/忽略按钮的丰富警报到达 Slack The Slack notification arrives in `#k8shawk-alerts` with full Block Kit formatting. It shows the severity level (🟡 MEDIUM), the affected pod and namespace, detection timestamp, RCA summary, proposed kubectl fix command, and two interactive buttons — **✅ Approve Fix** and **❌ Dismiss**. The alert is structured so any on-call engineer can understand the situation without touching the terminal. ![Slack alert with severity, RCA, and action buttons](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(577).png?raw=true) ### 步骤 7 — Slack 线程显示完整根因分析和语音警报音频播放器 The Slack thread reply includes the complete Root Cause Analysis text — the full investigation output written by the AI explaining what happened, why, and what to review. Below it, the uploaded MP3 audio file appears as an inline waveform player labelled **K8sHawk Voice Alert** (0:41 duration, 2 MB), so the engineer can listen to the spoken summary without reading. ![Slack thread with full RCA text and voice audio player](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(576).png?raw=true) ### 步骤 8 — 工程师点击批准修复 — 确认对话框出现 When the on-call engineer clicks **✅ Approve Fix** in Slack, a confirmation modal appears showing exactly which command will be run on the cluster: `kubectl describe pod test-incident -n default`. This gives one final chance to review before execution. ![Slack confirmation dialog before fix execution](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(578).png?raw=true) ### 步骤 9 — 在终端中批准修复，执行开始 The Slack button click is received by the FastAPI webhook server. The server logs show the full request trace with a unique request ID (`REQ-1912ab91`), action ID, and thread timestamp. The fix command is confirmed, the background task is started, and the terminal shows the complete fix execution box — pod name, namespace, command, and description. ![Terminal showing fix approval and execution kickoff](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(581).png?raw=true) ### 步骤 10 — kubectl 命令运行，修复成功应用 The kubectl command executes with a 30-second subprocess timeout. The full command output is shown — the pod's name, namespace, priority, service account, node assignment, start time, labels, annotations, status, and IP address. K8sHawk confirms the fix succeeded and begins monitoring recovery every 15 seconds. ![Terminal showing kubectl output and Fix Applied Successfully banner](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(580).png?raw=true) ### 步骤 11 — 修复结果回传到 Slack 线程 The fix command output is immediately posted back to the Slack thread so the engineer sees the result without switching context. The full `kubectl describe` output — including node assignment, container image (`nonexistent-image:latest`), and pod status (`Pending`) — appears inline in the thread alongside the **Applying fix** status message. ![Slack thread showing applied fix output](https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent/blob/508904254f49329871326cd00df33c1579f0f220/Screenshot%20(579).png?raw=true) ## 📁 项目结构 ``` k8shawk/ ├── main.py # Application entry point ├── config/ │ └── settings.py # Pydantic settings (env vars, defaults) ├── models/ │ ├── incident.py # KubernetesIncident dataclass │ ├── analysis.py # IncidentAnalysis + Severity enum │ └── fix.py # PendingFix dataclass ├── services/ │ ├── event_watcher.py # K8s event stream watcher + deduplication │ ├── ai_investigator.py # Groq agentic tool loop (kubectl tools) │ ├── voice_generator.py # Sarvam AI TTS + file storage │ ├── notification_service.py # Slack Block Kit messages + audio upload │ ├── fix_executor.py # kubectl fix execution + recovery monitor │ ├── rca_manager.py # Markdown RCA file writer │ └── kubectl_tools.py # Safe kubectl subprocess wrapper ├── handlers/ │ └── incident_handler.py # Orchestrates the full response pipeline └── server/ └── webhook_server.py # FastAPI webhook for Slack button interactions knowledge-base/ ├── rca/ │ └── YYYY-MM-DD/ │ └── rca_{pod}_{timestamp}.md └── voice-messages/ └── YYYY-MM-DD/ └── voice_{pod}_{timestamp}.mp3 k8s/ └── k8shawk-deployment.yaml # Kubernetes manifests (ServiceAccount, RBAC, Deployment) ``` ## ⚙️ 配置 All configuration is loaded from a `.env` file using Pydantic Settings. ``` # .env # API 密钥 GROQ_API_KEY=your_groq_api_key_here SLACK_BOT_TOKEN=xoxb-your-slack-bot-token SARVAM_API_KEY=your_sarvam_api_key_here # Slack SLACK_CHANNEL=#k8shawk-alerts # 语音 VOICE_LANGUAGE=english # Options: english | tamil | tanglish # Webhook WEBHOOK_PORT=8081 # Kubernetes WATCH_NAMESPACE= # Leave empty to watch all namespaces # 恢复 RECOVERY_TIMEOUT_SECONDS=180 ``` ## 🚀 如何运行 ### 先决条件 - Python 3.11+ - A running Kubernetes cluster (local k3d, minikube, or remote) - `kubectl` configured and pointing to your cluster - ngrok (to expose the webhook server to Slack) - Slack app with bot token and interactive components enabled ### 设置 ``` # 克隆仓库 git clone https://github.com/Naveen15github/K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent.git cd K8sHawk-Intelligent-Kubernetes-Incident-Response-with-AI-Agent # 安装依赖项 pip install -r requirements.txt # 配置环境 cp .env.example .env # 使用你的 API 密钥编辑 .env # 暴露 Slack 的 Webhook（在单独的终端中） ngrok http 8081 # 复制 ngrok HTTPS URL → 粘贴到 Slack 应用的交互性和快捷方式 URL # 格式：https://your-ngrok-url.ngrok.io/slack/interactions # 启动 K8sHawk python -m k8shawk.main ``` ### 触发测试事件 ``` # 创建一个使用不存在镜像的 Pod — 触发 ImagePullBackOff / Failed 事件 kubectl run test-incident --image=nonexistent-image:latest # 测试后清理 kubectl delete pod test-incident ``` ### 验证 Webhook 是否正常工作 ``` curl http://localhost:8081/health # → {"status": "healthy"} curl http://localhost:8081/slack/validate # → 返回 ngrok URL 和服务器配置信息 ``` ## 🔄 事件响应流水线 ``` K8s Warning Event │ ▼ EventWatcher detects it (deduplication: 60s window per pod) │ ▼ IncidentHandler orchestrates: │ ├─ 1. AIInvestigator.investigate() │ └─ Agentic tool loop: │ kubectl get → kubectl describe → kubectl logs → kubectl top │ (up to 5 rounds, primary → fallback → basic analysis) │ ├─ 2. VoiceGenerator.generate_alert() │ └─ Sarvam TTS API → MP3 saved to knowledge-base/ │ ├─ 3. NotificationService.post_incident_alert() │ └─ Slack Block Kit → Audio upload → Approve/Dismiss buttons │ ├─ 4. RCAManager.save_rca() │ └─ Markdown file → knowledge-base/rca/YYYY-MM-DD/ │ └─ 5. Wait for Slack button click... │ ├─ Approve → FixExecutor.apply_fix() │ └─ kubectl command → recovery polling → Slack update │ └─ Dismiss → Log dismissal, no action ``` ## 🧠 AI 调查 — 工作原理 The `AIInvestigator` runs an agentic tool loop where the LLM is given access to four kubectl tools: | Tool | Command | |---|---| | `kubectl_get` | `kubectl get -n ` | | `kubectl_describe` | `kubectl describe -n ` | | `kubectl_logs` | `kubectl logs -n --tail=50` | | `kubectl_top` | `kubectl top pods -n ` | The model receives the incident details and the full cluster context prompt, then decides which commands to run. It iterates up to 5 rounds, reads the outputs, and produces a structured JSON response containing severity, severity reason, full RCA (3–5 sentences), one-line RCA summary, fix command, and fix description. If the primary model (`llama-3.3-70b-versatile`) hits a rate limit, it silently switches to `llama-3.1-8b-instant`. If that also fails, it generates a basic analysis from kubectl pattern matching (CrashLoopBackOff → restart pod, ImagePullBackOff → delete pod to stop retries, OOMKilled → review memory limits, etc.). ## 📊 严重级别 | Severity | Color | Indicator | Examples | |---|---|---|---| | LOW | 🟢 Green | `#36a64f` | Minor warnings, non-critical events | | MEDIUM | 🟡 Yellow | `#ffcc00` | ImagePullBackOff, image not found | | HIGH | 🟠 Orange | `#ff8c00` | CrashLoopBackOff, OOMKilled | | CRITICAL | 🔴 Red | `#cc0000` | Multiple pods down, node failures | ## 🎙️ 语音警报语言 | Setting | Language | Speaker | Voice ID | |---|---|---|---| | `english` | English (India) | Priya (Female) | `en-IN` | | `tamil` | Tamil | Anushka (Female) | `ta-IN` | | `tanglish` | Tamil + English mix | Anushka (Female) | `ta-IN` | Voice scripts are capped at 500 characters due to the Sarvam API constraint and are automatically truncated if needed. Generated MP3 files are stored in `knowledge-base/voice-messages/YYYY-MM-DD/` and deleted locally after successful Slack upload. ## 🔐 Kubernetes RBAC 要求 For in-cluster deployment, K8sHawk needs the following permissions via ClusterRole: ``` - pods: get, list, watch, delete - events: get, list, watch - deployments: get, list, watch, patch, update - nodes: get, list, watch - namespaces: get, list, watch - services: get, list, watch - configmaps: get, list, watch ``` ## 📈 性能特征 | Stage | Typical Duration | |---|---| | Event detection latency | < 5 seconds | | AI investigation | 10–30 seconds | | Voice generation | 2–5 seconds | | Slack notification | 1–3 seconds | | Fix execution | 1–5 seconds | | Recovery monitoring | Up to 3 minutes (15s polling) | | **Total pipeline (excl. recovery)** | **15–45 seconds** | ## 🏗️ 集群内部署 K8sHawk can be deployed inside the cluster it monitors using the provided Kubernetes manifests: ``` # 应用所有清单文件 kubectl apply -f k8s/k8shawk-deployment.yaml # 该清单创建： # - ServiceAccount: k8shawk # - ClusterRole + ClusterRoleBinding (RBAC) # - Secret (API 密钥) # - Deployment (1 个副本，python -m k8shawk.main) # - Service (ClusterIP，端口 8081) ``` ## 🔭 经验总结 Building K8sHawk gave me hands-on experience with: - **Agentic AI systems** — designing tool-use loops where an LLM drives investigation decisions iteratively rather than receiving a static prompt - **Kubernetes internals** — working with the event stream API, pod lifecycle states, and RBAC in a real cluster - **Multi-service orchestration** — wiring together Groq, Sarvam, Slack SDK, FastAPI, and the Kubernetes Python client into a cohesive async pipeline - **Production-grade error handling** — graceful degradation across model fallbacks, silent voice failures, deduplication, and subprocess timeouts - **Slack Block Kit** — building interactive messages with buttons, file uploads via the V2 upload flow, and thread-based conversations - **k3d for local clusters** — running a full multi-node Kubernetes environment inside Docker for rapid local testing ## 📝 许可证 This project is for educational and demonstration purposes. *Built by Naveen G — K8sHawk 🦅 | Intelligent Kubernetes Incident Response*

标签：CrashLoopBackOff, kubectl, LLM, Markdown, OOMKilled, Pod崩溃, RCA, Sarvam AI, SEO, Slack, TTS, Unmanaged PE, 事件检测, 多模型回退, 容器故障, 文档生成, 根因分析, 结构化报告, 自动化响应, 计算机取证, 语音告警, 调度失败, 迭代命令执行, 逆向工具, 镜像拉取失败, 零手动排查