# 🌐 Network Observability Lab
### A Production-Style Network Monitoring & NOC Simulation Environment
## Overview
A fully containerised network monitoring lab that simulates a small ISP-style topology - core router, edge nodes, and a DMZ - wired into a complete observability stack. Everything is provisioned as code: no manual Grafana clicking, no YAML hunting after startup.
Built as a portfolio project to demonstrate hands-on NOC, networking, and observability skills on a CV.
**What it shows:**
- Multi-node Docker network topology design
- ICMP reachability monitoring + HTTP application-layer probing
- Metrics collection, PromQL querying, and alerting rules
- Alertmanager routing pipeline (Slack-ready)
- Log aggregation with Loki + Promtail
- Live NOC Wallboard dashboard auto-provisioned in Grafana
- Fault injection and chaos testing scripts
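## Architecture
`core-router` is the hub, multi-homed across every segment. `edge-a` (branch office) and `edge-b` (datacenter edge) are the edge nodes, and the DMZ holds `dmz-host` and `web-dmz`. Prometheus scrapes the node exporters and probes all five nodes through the blackbox exporter, Alertmanager routes the resulting alerts, and Promtail ships container logs to Loki for display in Grafana.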
## Stack
| Container | Image | Role |
|-----------|-------|------|
| `core-router` | alpine:3.19 | Hub node - multi-homed across all segments |
| `edge-a` | alpine:3.19 | Branch office simulation |
| `edge-b` | alpine:3.19 | Datacenter edge simulation |
| `dmz-host` | alpine:3.19 | DMZ node (ICMP monitored) |
| `web-dmz` | nginx:alpine | DMZ web server (HTTP + ICMP monitored) |
| `exporter-*` ×4 | prom/node-exporter | CPU, memory, interface metrics per node |
| `blackbox` | prom/blackbox-exporter | ICMP reachability + HTTP health probes |
| `prometheus` | prom/prometheus | Metrics collection and alerting rules |
| `alertmanager` | prom/alertmanager | Alert routing (Slack-ready) |
| `loki` | grafana/loki | Log aggregation backend |
| `promtail` | grafana/promtail | Collects and ships all container logs |
| `grafana` | grafana/grafana | NOC Wallboard + dashboards (auto-provisioned) |
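The probing path in the table (Prometheus asks `blackbox` to probe each node) is usually wired up with the standard blackbox relabelling pattern. The fragment below is a minimal sketch of what such jobs could look like in `prometheus/prometheus.yml`, not a copy of the repo's file. The job names, the module names `icmp` and `http_2xx`, the exporter port `9115`, and the probe URL are assumptions for illustration.

```yaml
# Sketch only: job names, module names, and port 9115 are assumed.
scrape_configs:
  - job_name: blackbox-icmp
    metrics_path: /probe
    params:
      module: [icmp]            # must match a module defined in blackbox.yml
    static_configs:
      - targets: [core-router, edge-a, edge-b, dmz-host, web-dmz]
    relabel_configs:
      # Pass the target name to the exporter as ?target=<name>
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed node as the instance label shown in dashboards
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the blackbox exporter
      - target_label: __address__
        replacement: blackbox:9115

  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["http://web-dmz/health"]   # assumed in-network URL
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115
```

With this pattern, each probed node appears as its own `instance` in `probe_success`, which is what reachability tiles and alert rules typically key on.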
## Prerequisites
- **Docker Desktop for Windows** (WSL2 or Hyper-V backend)
- **PowerShell** (built into Windows - no install needed)
- **Git** (optional, for cloning)
## Quick Start
    # Clone
    git clone https://github.com/MrN7King/network-observability-lab.git
    cd network-observability-lab

    # Wipe any old data (important for a clean first run)
    docker compose down -v

    # Launch the full stack
    docker compose up -d

    # Confirm everything is running
    docker compose ps

Once all containers show `Up`:
| Service | URL | Credentials |
|---------|-----|-------------|
| **Grafana NOC Wallboard** | http://localhost:3001 | admin / netlab123 |
| Prometheus | http://localhost:9090 | - |
| Alertmanager | http://localhost:9093 | - |
| DMZ Web Server | http://localhost:8080 | - |
| Loki | http://localhost:3100 | - |
## Usage
### Generate traffic
Makes the Grafana dashboard panels show live graphs instead of flatlines:

    .\scripts\traffic.ps1

### Fault injection - single node
Simulate a node failure and watch the alert fire in Grafana:

    .\scripts\fault-inject.ps1 -Node edge-a -DownSeconds 30

Open the NOC Wallboard at http://localhost:3001. The `edge-a` tile turns **red** within ~20 seconds and the alert fires; the node then recovers automatically once the outage window ends.
### Chaos mode
Random continuous faults - great for a screen recording or live demo:

    .\scripts\chaos.ps1              # runs forever
    .\scripts\chaos.ps1 -Rounds 5    # 5 random faults, then stops

## Alerting
Alerts defined in `prometheus/alerts.yml`:
| Alert | Condition | Severity |
|-------|-----------|----------|
| `NodeUnreachable` | ICMP probe fails for > 20s | critical |
| `WebServerDown` | HTTP probe fails for > 15s | critical |
| `HighLatency` | ICMP RTT > 100ms for > 30s | warning |
| `ExporterDown` | Node exporter stops responding | warning |
| `HighCPU` | CPU > 85% for > 1 minute | warning |
| `HighRxTraffic` | Interface RX > 50 MB/s | warning |
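The conditions in the table map fairly directly onto Prometheus rule syntax. The following is a hedged sketch of how they might be written in `prometheus/alerts.yml`; the actual expressions in the repo may differ. The group name, the `job` label values, and the annotation text are assumptions, and 50 MB/s is taken as 50e6 bytes per second.

```yaml
# Sketch only: job label values and the group name are assumed.
groups:
  - name: noc-lab
    rules:
      - alert: NodeUnreachable
        expr: 'probe_success{job="blackbox-icmp"} == 0'
        for: 20s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is not answering ICMP"

      - alert: WebServerDown
        expr: 'probe_success{job="blackbox-http"} == 0'
        for: 15s
        labels:
          severity: critical

      - alert: HighLatency
        # blackbox exposes the ICMP round trip under phase="rtt", in seconds
        expr: 'probe_icmp_duration_seconds{phase="rtt"} > 0.1'
        for: 30s
        labels:
          severity: warning

      - alert: ExporterDown
        expr: 'up{job="node"} == 0'
        labels:
          severity: warning

      - alert: HighCPU
        # 100 minus the idle share gives busy CPU percentage per instance
        expr: '100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > 85'
        for: 1m
        labels:
          severity: warning

      - alert: HighRxTraffic
        expr: 'rate(node_network_receive_bytes_total{device!="lo"}[1m]) > 50e6'
        labels:
          severity: warning
```

Short `for:` windows such as 15s and 20s only behave as expected when the scrape and rule evaluation intervals are a few seconds long, so those global settings matter as much as the rules themselves.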
### Enable Slack notifications
1. Create an Incoming Webhook at https://api.slack.com/messaging/webhooks
2. Open `alertmanager/alertmanager.yml`
3. Uncomment the `slack_configs` block and paste your webhook URL
4. Reload Alertmanager:

       docker exec alertmanager wget -qO- --post-data='' http://localhost:9093/-/reload

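For orientation, an enabled Slack receiver in `alertmanager/alertmanager.yml` could look roughly like the sketch below. It is an assumed layout rather than the repo's exact file: the receiver name, channel, grouping timers, and message templates are illustrative, and the webhook URL is a placeholder to replace with your own.

```yaml
# Sketch only: receiver name, channel, and timers are assumed values.
route:
  receiver: noc-slack
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 1h

receivers:
  - name: noc-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/WITH/YOURS"  # placeholder
        channel: "#noc-alerts"
        send_resolved: true      # also post when the alert clears
        title: '{{ .CommonLabels.alertname }} ({{ .Status }})'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}'
```

Setting `send_resolved: true` is useful for the fault-injection demos, since the recovery message shows up in Slack as well as the initial alert.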
## Grafana Dashboards
- Node reachability tiles (green = UP, red = DOWN) for all 5 nodes
- HTTP response time for the DMZ web server
- ICMP round-trip time history for all nodes
- Interface RX / TX traffic per node
- Active alerts panel (live from Prometheus)
- Container log stream (live from Loki)
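The panels above depend on the Prometheus and Loki datasources being present before the dashboard loads. A minimal sketch of a Grafana datasource provisioning file follows; the file name (for example `grafana/provisioning/datasources/datasources.yml`) and the in-network URLs are assumptions based on the container names and ports listed in this README.

```yaml
# Sketch only: file name and URLs are assumed from the container names.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana backend makes the requests
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

Grafana reads files in its provisioning directory at startup, which is what removes the need to add datasources by hand in the UI.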
## Project Structure
    network-observability-lab/
    ├── docker-compose.yml
    ├── prometheus/
    │   ├── prometheus.yml          # Scrape configs, relabelling, alertmanager integration
    │   └── alerts.yml              # Alert rules (ICMP, HTTP, CPU, traffic)
    ├── alertmanager/
    │   └── alertmanager.yml        # Routing tree + Slack config (commented out)
    ├── blackbox/
    │   └── blackbox.yml            # ICMP + HTTP + TCP probe modules
    ├── loki/
    │   └── loki.yml                # Log storage backend config
    ├── promtail/
    │   └── promtail.yml            # Docker log scraping via Docker socket
    ├── nginx/
    │   ├── conf/default.conf       # nginx with /health endpoint + stub_status
    │   └── html/index.html         # DMZ landing page
    ├── grafana/
    │   ├── provisioning/
    │   │   ├── datasources/        # Auto-configures Prometheus + Loki
    │   │   └── dashboards/         # Auto-loads dashboard JSON
    │   └── dashboards/
    │       └── noc-wallboard.json  # Main NOC dashboard
    └── scripts/
        ├── traffic.ps1             # Continuous ICMP + HTTP traffic generator
        ├── fault-inject.ps1        # Single node failure simulation
        └── chaos.ps1               # Random continuous fault injection

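As an illustration of the "Docker log scraping via Docker socket" entry, a Promtail configuration using Docker service discovery might look like the sketch below. It assumes the Docker socket is mounted into the `promtail` container at `/var/run/docker.sock` and that Loki is reachable as `loki:3100`; the listen port, positions path, and label name are illustrative choices, not values taken from the repo.

```yaml
# Sketch only: ports, paths, and the "container" label name are assumed.
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml   # where Promtail remembers read offsets

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      # Docker reports names as "/edge-a"; strip the leading slash
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: container
```

With a `container` label in place, a Grafana logs panel can filter with a LogQL selector such as `{container="web-dmz"}`.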
## Stopping the Lab
    docker compose down       # stop containers, keep Prometheus/Grafana data
    docker compose down -v    # stop containers and wipe all stored data

## Skills Demonstrated
- **Docker networking** - multi-bridge topology, static IP allocation, subnet planning
- **Prometheus** - scrape configuration, metric relabelling, PromQL, alerting rules
- **Alertmanager** - routing trees, group configuration, receiver setup, Slack integration
- **Grafana** - datasource provisioning, dashboard-as-code (JSON model), stat/timeseries/logs panels
- **Blackbox Exporter** - ICMP reachability and HTTP endpoint monitoring
- **Node Exporter** - interface-level telemetry (RX/TX bytes, CPU, memory)
- **Loki + Promtail** - log aggregation pipeline with Docker service discovery
- **nginx** - server configuration, health endpoints, access log formatting
- **Observability methodology** - metrics, logs, alerting as a unified pipeline
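To make the Docker networking item concrete, here is a hedged sketch of a multi-bridge layout with static addressing in Compose. The network names and subnets are hypothetical and are not the ones used in this repo's `docker-compose.yml`; only the container names, images, and the `8080` host port come from this README.

```yaml
# Sketch only: network names, subnets, and addresses are hypothetical.
networks:
  core-net:
    driver: bridge
    ipam:
      config:
        - subnet: 10.10.0.0/24
  dmz-net:
    driver: bridge
    ipam:
      config:
        - subnet: 10.20.0.0/24

services:
  core-router:
    image: alpine:3.19
    command: ["tail", "-f", "/dev/null"]   # keep the container alive
    networks:                              # multi-homed: one leg per segment
      core-net:
        ipv4_address: 10.10.0.2
      dmz-net:
        ipv4_address: 10.20.0.2

  web-dmz:
    image: nginx:alpine
    ports:
      - "8080:80"
    networks:
      dmz-net:
        ipv4_address: 10.20.0.10
```

Addresses start at `.2` because Docker normally assigns the first host address of each bridge subnet to the gateway.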
A project made during my free time · BSc (Hons) Computer Networking · CCNA in progress