sanjanamahajan2001-sys/production-monitoring-platform

GitHub: sanjanamahajan2001-sys/production-monitoring-platform

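A centralized observability platform for Kubernetes production environments. It combines metrics collection, log aggregation, and automated alerting workflows so that operations teams can monitor infrastructure and application health and respond to incidents.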


# 🔍 Production Monitoring & Incident Platform

[![Prometheus](https://img.shields.io/badge/Metrics-Prometheus-E6522C?logo=prometheus)](https://prometheus.io/) [![Grafana](https://img.shields.io/badge/Viz-Grafana-F46800?logo=grafana)](https://grafana.com/) [![Loki](https://img.shields.io/badge/Logs-Loki-FFFFFF?logo=grafana)](https://grafana.com/oss/loki/) [![OpenSearch](https://img.shields.io/badge/Search-OpenSearch-005EB8?logo=opensearch)](https://opensearch.org/)

A comprehensive, production-ready observability and incident response platform. This project provides cloud-native monitoring of infrastructure and application health, integrating full-stack metrics, log aggregation, and automated alerting workflows.

## 🎯 Project Goals

This project was created to explore and demonstrate:

- **Centralized observability** patterns for multi-cluster Kubernetes environments.
- **Log aggregation and correlation** with Grafana Loki.
- **Incident response automation** through Alertmanager and Slack integration.
- **Service level objective (SLO)** tracking for application reliability.
- **Dashboard standardization** for consistent infrastructure visibility.

## 🏗️ Architecture

The platform is designed to handle high-cardinality metrics and scalable log storage.

### 1. Observability Architecture

```mermaid
graph TD
    subgraph K8s_Nodes [Kubernetes Cluster]
        Apps[Application Pods]
        Nodes[Node Exporter]
        KubeState[Kube-State-Metrics]
    end

    subgraph Monitoring_Stack [Observability Stack]
        Prom[Prometheus]
        Loki[Grafana Loki]
        AM[AlertManager]
        Graf[Grafana Dashboards]
    end

    Apps -->|Logs| Loki
    Apps -->|Metrics| Prom
    Nodes -->|Metrics| Prom
    KubeState -->|Metrics| Prom
    Prom --> AM
    AM -->|Alerts| Slack[Slack Channel]
    Prom --> Graf
    Loki --> Graf
```

### 2. Alerting & Incident Flow

```mermaid
sequenceDiagram
    participant Pod as Kubernetes Pod
    participant Prom as Prometheus
    participant AM as AlertManager
    participant Slack as Slack API

    Pod->>Pod: Threshold Exceeded (CPU > 80%)
    Prom->>Prom: Rule Evaluation
    Prom->>AM: Fire Alert
    AM->>AM: Deduplication & Inhibition
    AM->>Slack: Send Notification
```

## 🛠️ Validation & Testing

The observability workflows and alerting patterns were validated in the following ways:

- **Local K8s (k3d)**: Deployed the full stack locally to verify scraping and log ingestion.
- **Alert simulation**: Manually triggered high-load scenarios to verify Alertmanager routing.
- **Dashboard validation**: Checked the Grafana JSON dashboards against live metric sources.
- **Slack webhook testing**: Verified end-to-end notification delivery from K8s to Slack.

## 📊 Sample Output

### Kubernetes Pod Status

```
kubectl get pods -n monitoring
NAME                               READY   STATUS    RESTARTS   AGE
prometheus-server-7d89f4b5-x2p89   2/2     Running   0          1d
grafana-5d9d9f8c-m9lqz             1/1     Running   0          1d
loki-0                             1/1     Running   0          1d
alertmanager-server-abc12          1/1     Running   0          1d
```

### AlertManager Logs

```
level=info msg="Alert fired" alert=HighCPULoad instance=node-1 severity=critical
level=info msg="Notification sent" receiver=slack-ops duration=120ms
```

## 🚀 Future Improvements

- **OpenTelemetry integration**: Move to the OTel collector for unified telemetry ingestion.
- **Anomaly detection**: Add ML-based anomaly detection on top of Prometheus metrics for proactive alerting.
- **Canary monitoring**: Automatically generate dashboards for canary releases.
- **FinOps observability**: Add per-service cost tracking to the Grafana dashboards.

## 📂 Repository Structure

```
production-monitoring-platform/
├── kubernetes/              # K8s manifests for the stack
│   ├── base/                # Unified deployment manifests
│   ├── prometheus/          # Rules and scraping config
│   └── grafana/             # Dashboards and Datasources
├── alerts/                  # Incident Response Configuration
│   ├── alertmanager.yml     # Routing logic (Slack/Email)
│   └── rules/               # Threshold definitions
├── dashboards/              # Exported JSON dashboards
└── README.md                # Project Documentation
```

## 🤝 Contact

Built by **Sanjana Mahajan**.

- **Portfolio**: [personal-portfolio-gold-phi-44.vercel.app](https://personal-portfolio-gold-phi-44.vercel.app)
- **LinkedIn**: [linkedin.com/in/sanjana-mahajan-467982233/](https://www.linkedin.com/in/sanjana-mahajan-467982233/)
- **Email**: [sanjanamaahi2001@gmail.com](mailto:sanjanamaahi2001@gmail.com)
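
## 🧪 Appendix: Alert Flow Sketch

The alerting sequence in this README (threshold exceeded, rule evaluation, deduplication, Slack notification) can be illustrated with a small, self-contained Python sketch. This is not code from the repository and not how Prometheus or Alertmanager are implemented. The names `Alert`, `evaluate_cpu_rule`, `Deduplicator`, and `slack_payload` are hypothetical, the 80% threshold and the `HighCPULoad` alert name are taken from the diagram and sample logs, and inhibition, grouping, and repeat intervals are deliberately left out. The sketch only builds the JSON body that a Slack incoming webhook would accept; it does not send anything.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical, simplified alert: a name plus two identifying labels."""
    name: str
    instance: str
    severity: str

    def fingerprint(self) -> str:
        # Identical label sets hash to the same value, which is what
        # makes duplicate detection possible in this sketch.
        key = f"{self.name}|{self.instance}|{self.severity}"
        return hashlib.sha256(key.encode()).hexdigest()[:12]


def evaluate_cpu_rule(samples: dict[str, float], threshold: float = 80.0) -> list[Alert]:
    """Return one alert per instance whose CPU percentage exceeds the threshold."""
    return [
        Alert("HighCPULoad", instance, "critical")
        for instance, cpu in sorted(samples.items())
        if cpu > threshold
    ]


class Deduplicator:
    """Remembers fingerprints so a repeated alert is only forwarded once."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def filter_new(self, alerts: list[Alert]) -> list[Alert]:
        fresh = []
        for alert in alerts:
            fp = alert.fingerprint()
            if fp not in self._seen:
                self._seen.add(fp)
                fresh.append(alert)
        return fresh


def slack_payload(alert: Alert) -> dict[str, str]:
    """Build a minimal Slack incoming-webhook body (a JSON object with 'text')."""
    return {"text": f"[{alert.severity.upper()}] {alert.name} on {alert.instance}"}


if __name__ == "__main__":
    dedup = Deduplicator()
    # Two evaluation cycles with the same readings: only the first notifies.
    for cycle in range(2):
        firing = evaluate_cpu_rule({"node-1": 92.5, "node-2": 41.0})
        for alert in dedup.filter_new(firing):
            print(json.dumps(slack_payload(alert)))
```

Running the script evaluates the same readings twice and prints a payload only on the first cycle, which mirrors the reason for deduplication: a condition that keeps firing should not flood the channel with repeated messages.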

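Tags: Alertmanager, Grafana, Kubernetes monitoring, Kube-State-Metrics, Loki, Node Exporter, Slack integration, SLO tracking, SRE, incident response platform, cloud-native architecture, cloud-native monitoring, dashboard standardization, bias filtering, full-stack monitoring, observability solution, alerting workflow, infrastructure monitoring, multi-cluster management, subdomain mutation, container monitoring, application reliability, application performance monitoring, log correlation, log aggregation, time-series data, production monitoring, automated alerting, custom request headers, operations automation, centralized observability, high-cardinality metrics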