linuxjmp/linux-monitoring-ir-lab

GitHub: linuxjmp/linux-monitoring-ir-lab

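A Linux cluster monitoring deployment and incident response lab built on Prometheus, Grafana, and Ansible, intended to demonstrate end-to-end practice in operations monitoring and troubleshooting.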


# linux-monitoring-ir-lab

**Linux Operations Monitoring and Incident Response Lab**

This is a portfolio home-lab project. It uses Ansible and Podman Quadlet to deploy Prometheus, Grafana, and node_exporter across a four-host Linux cluster, then simulates and documents five realistic production incidents to demonstrate monitoring, troubleshooting, and communication skills.

## Purpose

Demonstrate the operational skills required for Linux infrastructure roles:

- Deploying and configuring a monitoring stack
- Health check scripts (Bash, Python)
- Detecting, investigating, and resolving incidents
- Communicating with both technical and non-technical audiences
- SELinux, systemd, and log analysis in a running environment

## Architecture

```
+--------------------+
| servera (monitor)  |
| Prometheus :9090   |
| Grafana :3000      |
| node_exporter :9100|
+---------+----------+
          |
          |  Prometheus scrape (HTTP :9100)
          |
+---------+----+  +---------+  +---------+
| serverb      |  | serverc |  | serverd |
| node_exporter|  | node_ex |  | node_ex |
+--------------+  +---------+  +---------+
```

See [architecture.md](architecture.md) for the full component diagram.

## Tech Stack

- RHEL / Rocky / AlmaLinux
- Prometheus + Grafana (Podman Quadlet containers)
- node_exporter (Podman Quadlet)
- Ansible (deployment and validation)
- systemd, SELinux, auditd, firewalld

## Repository Structure

```
linux-monitoring-ir-lab/
├── README.md
├── architecture.md
├── ansible.cfg
├── inventory
├── group_vars/
│   └── all.yml                  # image names, ports, scrape targets
├── playbooks/
│   ├── 01-deploy-node-exporter.yml
│   ├── 02-deploy-monitoring-stack.yml
│   └── 03-validate-monitoring.yml
├── configs/
│   ├── prometheus/
│   │   └── prometheus.yml       # scrape config
│   └── grafana/
│       └── provisioning/        # auto-loaded datasource + dashboards
├── dashboards/
│   └── linux-node-health.json
├── scripts/
│   ├── disk-health.sh
│   ├── service-health.py
│   └── network-check.sh
├── incidents/
│   ├── INC-001-disk-full.md
│   ├── INC-002-ssh-failure.md
│   ├── INC-003-selinux-denial.md
│   ├── INC-004-service-failure-after-update.md
│   └── INC-005-dns-outage.md
├── docs/
│   ├── monitoring-runbook.md
│   └── lessons-learned.md
└── screenshots/
    ├── monitoring-validation.txt
    └── prometheus-targets.txt
```

## Quick Start

```
# Deploy node_exporter on all hosts
ansible-playbook playbooks/01-deploy-node-exporter.yml

# Deploy Prometheus + Grafana on the monitoring host (servera)
ansible-playbook playbooks/02-deploy-monitoring-stack.yml

# Validate the whole stack
ansible-playbook playbooks/03-validate-monitoring.yml
```

Access Grafana at `http://servera:3000` (admin / admin). Access Prometheus at `http://servera:9090`.

## Health Check Scripts

These can be run directly on any managed host:

```
bash scripts/disk-health.sh          # check disk usage
python3 scripts/service-health.py    # check required services
bash scripts/network-check.sh        # check DNS + gateway + ports
```

## Incidents

Five simulated incidents, each fully investigated and documented:

| ID      | Scenario                                             | Severity |
|---------|------------------------------------------------------|----------|
| INC-001 | Disk pressure caused by a large temporary file       | Medium   |
| INC-002 | SSH configuration typo causes reload failure         | High     |
| INC-003 | SELinux blocks httpd from using port 8888            | Medium   |
| INC-004 | inode exhaustion causes auditd failure               | Medium   |
| INC-005 | /etc/resolv.conf change causes DNS outage            | High     |

Each incident includes a technical summary, a plain-language summary, investigation commands with actual output, the root cause, the fix, and lessons learned.

## Portfolio Evidence

| File | What it shows |
|------|---------------|
| `screenshots/monitoring-validation.txt` | All validation checks passed |
| `screenshots/prometheus-targets.txt` | All scrape targets are UP |
| `incidents/INC-001..005.md` | Five incident investigation reports with real command output |
| `dashboards/linux-node-health.json` | Grafana dashboard (importable) |

## Known Issues and Deployment Notes

**Cockpit / port 9090 conflict:** RHEL systems run a Cockpit socket (`cockpit.socket`) that claims port 9090 at boot. The deployment playbook stops and masks this socket on the monitoring host before starting Prometheus. On shared hosts, change `prometheus_port` in `group_vars/all.yml` instead.

See [docs/monitoring-runbook.md](docs/monitoring-runbook.md) for the full operations notes.

## Resume Highlights
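## Example: Checking Scrape Targets

The validation step and `screenshots/prometheus-targets.txt` both come down to confirming that every scrape target reports UP. The repository's actual validation logic is not shown here, so the following is only a minimal sketch of one way to perform that check against the Prometheus HTTP API (`/api/v1/targets`). The function names `fetch_targets` and `unhealthy_targets` are hypothetical, and the default URL assumes the `http://servera:9090` address from Quick Start.

```python
import json
import urllib.request

# Assumption: Prometheus is reachable at the address given in Quick Start.
PROMETHEUS_URL = "http://servera:9090"


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5):
    """Fetch the targets document from the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (instance, health) pairs for active targets that are not 'up'.

    Falls back to the scrape URL when a target has no 'instance' label.
    """
    active = payload.get("data", {}).get("activeTargets", [])
    result = []
    for target in active:
        health = target.get("health", "unknown")
        if health != "up":
            labels = target.get("labels", {})
            name = labels.get("instance", target.get("scrapeUrl", "unknown"))
            result.append((name, health))
    return result
```

Calling `unhealthy_targets(fetch_targets())` from the monitoring host returns an empty list when all targets are healthy. Keeping the parsing separate from the HTTP call lets the check be exercised against a saved API response without a running Prometheus.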
