ascjreddy/Network-Monitoring-Incident-Response-Platform
Network Incident Intelligence Platform
Automated root cause analysis for multi-device network outages.
The problem
When a network goes down, engineers SSH into every device one by one, manually grep through logs, and cross-reference timestamps across routers and switches to figure out what broke first. That takes 15–30 minutes per incident.
This project automates that entire process.
What it does
The system watches all devices in real time. When something fails, it correlates events across the topology, identifies the root cause, and generates a plain-English incident report — all within 60 seconds.
Example: Core-SW-01 interface goes down at 06:43:41 → OSPF adjacency lost at 06:43:43 → incident detected and reported at 06:43:44 with root cause, affected devices, and remediation steps.
How it works
Three scripts run together:
1. syslog_collector.py — runs on the GNS3 VM
Listens on UDP 514. All network devices send their syslog messages here. Parses each message (severity, device name, raw message) and writes it to PostgreSQL on the Windows host.
2. snmp_poller2.py — runs on the GNS3 VM
Polls all devices every 30 seconds using snmpget. Checks whether each device is reachable and writes the status to PostgreSQL. This catches devices that become unreachable without sending a syslog message.
3. incident_engine.py — runs on Windows
Reads the last 10 minutes of events from PostgreSQL every 30 seconds. Looks for known failure patterns (interface down, OSPF adjacency lost, device unreachable, link flapping). Traces the causal chain using the topology map. Saves the incident to the database and prints a full report. Also detects when incidents resolve and marks them closed.
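As a rough illustration of the pattern-matching step described for incident_engine.py, the sketch below classifies raw syslog text into failure types. Only the name OSPF_NEIGHBOR_LOST appears in this project's sample output; the INTERFACE_DOWN label, the regular expressions, and the flap threshold are assumptions based on common Cisco IOS message formats, not the engine's actual rules.

```python
import re

# Hypothetical pattern table. OSPF_NEIGHBOR_LOST is the type name used in the
# sample output; INTERFACE_DOWN and the regexes are illustrative assumptions.
PATTERNS = [
    ("OSPF_NEIGHBOR_LOST",
     re.compile(r"%OSPF-5-ADJCHG:.*from \w+ to DOWN")),
    ("INTERFACE_DOWN",
     re.compile(r"%LINK-\d-(?:UPDOWN|CHANGED):.*changed state to (?:administratively )?down")),
    ("INTERFACE_DOWN",
     re.compile(r"%LINEPROTO-5-UPDOWN:.*changed state to down")),
]


def classify(message):
    """Return the first matching failure type, or None for unrelated messages."""
    for name, pattern in PATTERNS:
        if pattern.search(message):
            return name
    return None


def is_flapping(messages, threshold=3):
    """Treat a link as flapping when it went down `threshold` or more times
    within the window of messages passed in (the threshold is an assumption)."""
    downs = sum(1 for m in messages if classify(m) == "INTERFACE_DOWN")
    return downs >= threshold
```

In this sketch the caller would pass in the messages for a single device and interface from the last 10 minutes; anything that classifies to None is ignored.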
GNS3 VM (Linux)                         Windows Host
─────────────────────                   ──────────────────────
syslog_collector.py ──┐                 incident_engine.py
snmp_poller2.py     ──┼──► PostgreSQL ◄── reads events
                      │         └──► Grafana dashboard
Network devices ──────┘
(Cisco IOU, 12 devices)
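The collector's parsing step can be sketched as follows. This is a minimal example, not the code in syslog_collector.py: it assumes messages arrive with a standard syslog priority prefix such as <189>, where the value encodes facility * 8 + severity, and it uses the sender's IP address as the device name. The function names and the record layout are hypothetical, and the database write is left to a caller-supplied handler.

```python
import re
import socket

PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)
SEVERITY_NAMES = ["emergency", "alert", "critical", "error",
                  "warning", "notice", "informational", "debug"]


def parse_syslog(datagram, source_ip):
    """Split one UDP payload into device, facility, severity, and message."""
    text = datagram.decode("utf-8", errors="replace").strip()
    match = PRI_RE.match(text)
    if match is None:
        # No priority prefix: keep the raw text so nothing is lost.
        return {"device": source_ip, "facility": None, "severity": None,
                "severity_name": "unknown", "message": text}
    facility, severity = divmod(int(match.group(1)), 8)
    return {"device": source_ip, "facility": facility, "severity": severity,
            "severity_name": SEVERITY_NAMES[severity],
            "message": match.group(2).strip()}


def serve(handle, host="0.0.0.0", port=514):
    """Receive datagrams forever and pass each parsed record to `handle`.
    Binding to port 514 normally requires root, hence the sudo in 'How to run'."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        data, (ip, _port) = sock.recvfrom(4096)
        handle(parse_syslog(data, ip))
```

Here `handle` stands in for whatever inserts the record into PostgreSQL; passing `print` is enough to watch parsed events on the console.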
Network topology
12 Cisco IOU devices in GNS3, connected in a 4-tier hierarchy:
              ISP-RTR-A
                  |
             Edge-RTR-01
                  |
     Core-SW-01 ──── Core-SW-02
         |               |
      Dist-HQ       Dist-Branch
      /     \            |
 Acc-SW1  Acc-SW2     Acc-SW3
    |        |           |
 IT-PCs   ENG-PCs    SALES-PCs
All routers run OSPF. Syslog forwarding and the SNMP community string public are configured on every device.
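One way to trace a causal chain through this hierarchy is to model it as a dict of downstream neighbors and keep only the failing devices that have no failing device above them. The sketch below is an assumption about how a TOPOLOGY map could be shaped, not the structure used in incident_engine.py. It treats the network as a simple tree, assumes Edge-RTR-01 feeds both core switches, and ignores the Core-SW-01 to Core-SW-02 link.

```python
# Hypothetical topology map: device -> directly downstream devices.
# Device names come from the diagram above; the tree shape is a simplification.
TOPOLOGY = {
    "ISP-RTR-A": ["Edge-RTR-01"],
    "Edge-RTR-01": ["Core-SW-01", "Core-SW-02"],
    "Core-SW-01": ["Dist-HQ"],
    "Core-SW-02": ["Dist-Branch"],
    "Dist-HQ": ["Acc-SW1", "Acc-SW2"],
    "Dist-Branch": ["Acc-SW3"],
    "Acc-SW1": [],
    "Acc-SW2": [],
    "Acc-SW3": [],
}


def downstream(device):
    """Return every device reachable below `device`, sorted by name."""
    seen = set()
    stack = list(TOPOLOGY.get(device, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(TOPOLOGY.get(node, []))
    return sorted(seen)


def root_causes(failing):
    """Return failing devices that are not downstream of another failing device."""
    failing = set(failing)
    explained = set()
    for device in failing:
        explained.update(downstream(device))
    return sorted(failing - explained)
```

With this map, a failure set of Core-SW-01, Dist-HQ, and Acc-SW1 reduces to Core-SW-01 as the single root cause, and `downstream("Core-SW-01")` gives the affected devices for the report.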
Tech stack
| Component | Tool |
|---|---|
| Network simulation | GNS3 |
| Network devices | Cisco IOU (routers + switches) |
| Telemetry | Python syslog server + snmpget subprocess |
| Database | PostgreSQL |
| Topology modeling | Python dict (TOPOLOGY map in engine) |
| Dashboards | Grafana |
How to run
Requirements: GNS3 VM, Python 3, PostgreSQL, Grafana, snmp-utils installed on GNS3 VM
On each Cisco IOU device:
conf t
snmp-server community public RO
logging host
logging trap informational
logging on
end
write memory
GNS3 VM — terminal 1:
sudo python3 ~/syslog_collector.py
GNS3 VM — terminal 2 (SSH from Windows):
sudo python3 ~/snmp_poller2.py
Windows PowerShell:
cd C:\network-incident-platform\monitoring\correlation
python incident_engine.py
Grafana runs as a Windows service at http://localhost:3000. Import grafana-dashboard.json from the dashboards folder.
Test scenarios
Failures injected manually, system response observed:
| # | Scenario | How triggered | What the system detected |
|---|---|---|---|
| 1 | OSPF adjacency loss | shutdown on Core-SW-01 Ethernet0/0 | OSPF_NEIGHBOR_LOST on downstream device, root cause traced, incident resolved when interface restored |
| 2 | Interface down | - | - |
| 3 | Device unreachable | - | - |
| 4 | Link flapping | - | - |
| 5 | HSRP failover | - | - |
| 6–15 | More scenarios | - | - |
Cases 2–15 in progress — results will be updated here as each is tested.
Sample output
INCIDENT DETECTED -- 2026-06-01 02:57:28
Type: OSPF_NEIGHBOR_LOST
Root Cause: 192.168.42.253
Event: OSPF adjacency lost at 2026-06-01 02:57:22
Affected: (downstream devices)
Remediation: Verify OSPF config on both sides.
Run: show ip ospf neighbor. Check interface status.
Timeline:
2026-06-01 02:57:22 | 192.168.42.253 | %OSPF-5-ADJCHG: Process 1,
Nbr 2.2.2.1 on FastEthernet0/1 from FULL to DOWN
============================================================
INCIDENT RESOLVED -- 2026-06-01 03:13:37
Device: 192.168.42.253
Type: OSPF_NEIGHBOR_LOST
Status: OSPF adjacency restored
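The open and resolved states in the report above could be derived by tracking the latest state of each OSPF adjacency. The sketch below parses %OSPF-5-ADJCHG messages like the one in the timeline and reports adjacencies whose most recent state is DOWN. The function names and the (timestamp, device, message) event layout are assumptions for illustration; the real engine may key or store incidents differently.

```python
import re

# Matches the ADJCHG format shown in the timeline; \s+ tolerates line wrapping.
ADJCHG_RE = re.compile(
    r"%OSPF-5-ADJCHG: Process (\d+),\s+Nbr (\S+) on (\S+) from (\w+) to (\w+)"
)


def parse_adjchg(message):
    """Extract process, neighbor, interface, and state change, or return None."""
    match = ADJCHG_RE.search(message)
    if match is None:
        return None
    process, neighbor, interface, old_state, new_state = match.groups()
    return {"process": int(process), "neighbor": neighbor,
            "interface": interface, "old_state": old_state,
            "new_state": new_state}


def open_adjacency_incidents(events):
    """Given (timestamp, device, message) tuples in time order, return a dict of
    (device, neighbor, interface) -> timestamp for adjacencies still DOWN."""
    latest = {}
    for timestamp, device, message in events:
        change = parse_adjchg(message)
        if change is None:
            continue
        key = (device, change["neighbor"], change["interface"])
        latest[key] = (change["new_state"], timestamp)
    return {key: ts for key, (state, ts) in latest.items() if state == "DOWN"}
```

An adjacency that later returns to FULL drops out of the result, which is the point at which an incident like the one above would be marked resolved.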
Dashboards
Grafana dashboard at http://localhost:3000 shows:
Live incident status (HEALTHY / active incident count)
Device health heatmap (SNMP reachability per device)
Syslog stream (live events by device and severity)
Incident log with root cause, affected devices, and resolution time
SNMP polling history over time