ascjreddy/Network-Monitoring-Incident-Response-Platform

GitHub: ascjreddy/Network-Monitoring-Incident-Response-Platform


# Network Incident Intelligence Platform

Automated root cause analysis for multi-device network outages.

## The problem

When a network goes down, engineers SSH into every device one by one, manually grep through logs, and cross-reference timestamps across routers and switches to figure out what broke first. That takes 15–30 minutes per incident. This project automates that entire process.

## What it does

The system watches all devices in real time. When something fails, it correlates events across the topology, identifies the root cause, and generates a plain-English incident report, all within 60 seconds.

Example: Core-SW-01 interface goes down at 06:43:41 → OSPF adjacency lost at 06:43:43 → incident detected and reported at 06:43:44 with root cause, affected devices, and remediation steps.

## How it works

Three scripts run together:

1. syslog_collector.py (runs on the GNS3 VM). Listens on UDP 514. All network devices send their syslog messages here. It parses each message (severity, device name, raw message) and writes it to PostgreSQL on the Windows host.
2. snmp_poller2.py (runs on the GNS3 VM). Polls all devices every 30 seconds using snmpget. It checks whether each device is reachable and writes the status to PostgreSQL, so devices that go unreachable are detected even if they never send a syslog message.
3. incident_engine.py (runs on Windows). Reads the last 10 minutes of events from PostgreSQL every 30 seconds. It looks for known failure patterns (interface down, OSPF adjacency lost, device unreachable, link flapping), traces the causal chain using the topology map, saves the incident to the database, and prints a full report. It also detects when incidents resolve and marks them closed.
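The correlation step in incident_engine.py can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the project's actual code: the shape of the `TOPOLOGY` dict (device mapped to its upstream parent), the event dictionaries, the `INTERFACE_DOWN` label, and the heuristic itself (take the earliest event in the window, breaking ties in favor of the device closest to the top of the hierarchy).

```python
from datetime import datetime, timedelta

# Hypothetical topology map: device -> upstream parent (None at the top).
# The real TOPOLOGY dict in incident_engine.py may be shaped differently.
TOPOLOGY = {
    "Edge-RTR-01": None,
    "Core-SW-01": "Edge-RTR-01",
    "Dist-HQ": "Core-SW-01",
    "Acc-SW1": "Dist-HQ",
    "Acc-SW2": "Dist-HQ",
}


def depth(device):
    """Count the hops from a device up to the top of the hierarchy."""
    hops = 0
    while TOPOLOGY.get(device) is not None:
        device = TOPOLOGY[device]
        hops += 1
    return hops


def downstream(device):
    """Return every device whose upstream path passes through `device`."""
    found = []
    for candidate in TOPOLOGY:
        node = TOPOLOGY[candidate]
        while node is not None:
            if node == device:
                found.append(candidate)
                break
            node = TOPOLOGY.get(node)
    return sorted(found)


def find_root_cause(events, window=timedelta(minutes=10), now=None):
    """Pick the earliest event in the window; ties go to the shallowest device."""
    if not events:
        return None
    now = now or max(e["time"] for e in events)
    recent = [e for e in events if now - e["time"] <= window]
    if not recent:
        return None
    root = min(recent, key=lambda e: (e["time"], depth(e["device"])))
    return {
        "root_cause": root["device"],
        "event": root["type"],
        "affected": downstream(root["device"]),
    }


# Events modeled on the example above; the date and the device that reports
# the OSPF loss are placeholders.
example_events = [
    {"time": datetime(2026, 1, 1, 6, 43, 43), "device": "Dist-HQ",
     "type": "OSPF_NEIGHBOR_LOST"},
    {"time": datetime(2026, 1, 1, 6, 43, 41), "device": "Core-SW-01",
     "type": "INTERFACE_DOWN"},
]
report = find_root_cause(example_events)
print(report)
```

Sorting by timestamp first reflects the idea that the first failure in a cascade is the likeliest cause; a real engine would probably also check that later events sit downstream of the candidate before attributing them to it.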
```
GNS3 VM (Linux)                          Windows Host
─────────────────────                    ──────────────────────
syslog_collector.py ──┐                  incident_engine.py
snmp_poller2.py     ──┼──► PostgreSQL ◄── reads events
                      │        └──► Grafana dashboard
Network devices ──────┘
(Cisco IOU, 12 devices)
```

## Network topology

12 Cisco IOU devices in GNS3, connected in a 4-tier hierarchy:

```
             ISP-RTR-A
                 |
            Edge-RTR-01
                 |
    Core-SW-01 ──── Core-SW-02
        |                |
     Dist-HQ        Dist-Branch
     /     \             |
Acc-SW1   Acc-SW2     Acc-SW3
   |         |           |
IT-PCs    ENG-PCs    SALES-PCs
```

All routers run OSPF. Syslog and SNMP community public configured on every device.

## Tech stack

| Component | Tool |
|---|---|
| Network simulation | GNS3 |
| Network devices | Cisco IOU (routers + switches) |
| Telemetry | Python syslog server + snmpget subprocess |
| Database | PostgreSQL |
| Topology modeling | Python dict (TOPOLOGY map in engine) |
| Dashboards | Grafana |

## How to run

Requirements: GNS3 VM, Python 3, PostgreSQL, Grafana, snmp-utils installed on GNS3 VM

On each Cisco IOU device:

```
conf t
snmp-server community public RO
logging host
logging trap informational
logging on
end
write memory
```

GNS3 VM, terminal 1:

```bash
sudo python3 ~/syslog_collector.py
```

GNS3 VM, terminal 2 (SSH from Windows):

```bash
sudo python3 ~/snmp_poller2.py
```

Windows PowerShell:

```powershell
cd C:\network-incident-platform\monitoring\correlation
python incident_engine.py
```

Grafana runs as a Windows service at http://localhost:3000. Import grafana-dashboard.json from the dashboards folder.

## Test scenarios

Failures injected manually, system response observed:

| # | Scenario | How triggered | What the system detected |
|---|---|---|---|
| 1 | OSPF adjacency loss | shutdown on Core-SW-01 Ethernet0/0 | OSPF_NEIGHBOR_LOST on downstream device, root cause traced, incident resolved when interface restored |
| 2 | Interface down | - | - |
| 3 | Device unreachable | - | - |
| 4 | Link flapping | - | - |
| 5 | HSRP failover | - | - |
| 6–15 | More scenarios | - | - |

Cases 2–15 in progress; results will be updated here as each is tested.
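Scenario 1 depends on recognizing the OSPF adjacency change message a Cisco device logs when a neighbor drops or comes back. The sketch below shows one way such a line might be classified. It relies on the standard Cisco `%FACILITY-SEVERITY-MNEMONIC: text` message layout; the function name `classify` and the `OSPF_NEIGHBOR_RESTORED` and `OTHER` labels are hypothetical, and the project's own parser may work differently.

```python
import re

# Standard Cisco system message layout: %FACILITY-SEVERITY-MNEMONIC: text
CISCO_MSG = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+): "
    r"(?P<text>.*)"
)

# Body of an OSPF ADJCHG message: neighbor, interface, old and new state.
ADJCHG = re.compile(
    r"Nbr (?P<neighbor>\S+) on (?P<interface>\S+) "
    r"from (?P<old>\w+) to (?P<new>\w+)"
)


def classify(raw):
    """Turn a raw syslog line into an event dict, or None if it is not a Cisco message."""
    match = CISCO_MSG.search(raw)
    if match is None:
        return None
    event = {
        "facility": match["facility"],
        "severity": int(match["severity"]),
        "mnemonic": match["mnemonic"],
        "type": "OTHER",  # hypothetical catch-all label
    }
    if match["facility"] == "OSPF" and match["mnemonic"] == "ADJCHG":
        adj = ADJCHG.search(match["text"])
        if adj is not None:
            event["neighbor"] = adj["neighbor"]
            event["interface"] = adj["interface"]
            if adj["new"] == "DOWN":
                event["type"] = "OSPF_NEIGHBOR_LOST"
            elif adj["new"] == "FULL":
                # Hypothetical label for the recovery side of the incident.
                event["type"] = "OSPF_NEIGHBOR_RESTORED"
    return event


lost = classify(
    "%OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.1 on FastEthernet0/1 from FULL to DOWN"
)
restored = classify(
    "%OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.1 on FastEthernet0/1 "
    "from LOADING to FULL, Loading Done"
)
print(lost["type"], restored["type"])
```

Keying on the new adjacency state rather than the full message text keeps the match tolerant of the trailing reason string that IOS appends to some ADJCHG messages.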
## Sample output

```
INCIDENT DETECTED -- 2026-06-01 02:57:28
Type: OSPF_NEIGHBOR_LOST
Root Cause: 192.168.42.253
Event: OSPF adjacency lost at 2026-06-01 02:57:22
Affected: (downstream devices)
Remediation: Verify OSPF config on both sides. Run: show ip ospf neighbor. Check interface status.
Timeline:
2026-06-01 02:57:22 | 192.168.42.253 | %OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.1 on FastEthernet0/1 from FULL to DOWN
============================================================
INCIDENT RESOLVED -- 2026-06-01 03:13:37
Device: 192.168.42.253
Type: OSPF_NEIGHBOR_LOST
Status: OSPF adjacency restored
```

## Dashboards

Grafana dashboard at http://localhost:3000 shows:

- Live incident status (HEALTHY / active incident count)
- Device health heatmap (SNMP reachability per device)
- Syslog stream (live events by device and severity)
- Incident log with root cause, affected devices, and resolution time
- SNMP polling history over time
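The reachability values behind the device health heatmap come from snmp_poller2.py, which shells out to snmpget. A minimal sketch of one such check follows, assuming SNMP v2c, the `public` community from the device config, and a query for sysUpTime.0. The function names, timeout values, and the choice of OID are illustrative assumptions, not taken from the project.

```python
import subprocess

# sysUpTime.0 from the standard MIB-II system group; any OID the device
# answers would serve as a reachability probe.
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"


def build_snmpget_cmd(host, community="public", timeout_s=2, retries=1):
    """Build the net-snmp snmpget command line for one reachability probe."""
    return [
        "snmpget", "-v2c", "-c", community,
        "-t", str(timeout_s), "-r", str(retries),
        host, SYS_UPTIME_OID,
    ]


def is_reachable(host, run=subprocess.run):
    """Treat a zero exit status from snmpget as 'reachable'.

    `run` is injectable so the check can be exercised without a live device.
    """
    try:
        result = run(build_snmpget_cmd(host), capture_output=True,
                     text=True, timeout=10)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # The probe hung, or snmpget is not installed on this host.
        return False
    return result.returncode == 0


class _FakeResult:
    """Stand-in for subprocess.CompletedProcess, used only for the demo."""
    def __init__(self, returncode):
        self.returncode = returncode


up = is_reachable("192.168.42.253", run=lambda *a, **k: _FakeResult(0))
down = is_reachable("192.168.42.253", run=lambda *a, **k: _FakeResult(1))
print(up, down)
```

A poller built on this would call `is_reachable` for each device every 30 seconds and write the result to PostgreSQL, which is the data the SNMP polling history panel would then plot.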