higiphysical-maker/HiGI-IDS

# HiGI IDS: Unsupervised Network Anomaly Detection Engine

## Overview

## Snapshot

- 🛡 Built a fully unsupervised IDS that detects unknown network attacks from raw PCAP traffic
- ⚡ Sub-minute detection latency (< 1 min end-to-end)
- 🎯 100% recall on DoS/DDoS (CIC-IDS2017 controlled benchmark)
- 🧠 No labels required: learns a baseline of normal traffic
- 📊 Produces explainable forensic reports (MITRE ATT&CK mapped)

## Live Pipeline / Demo

Fully reproducible end-to-end pipeline from training to forensic report generation (extraction → training → detection → forensic report).

- Notebook: [DEMO_NOTEBOOK.ipynb](./DEMO_NOTEBOOK.ipynb)
- Technical deep dive: [docs/technical_deep_dive.md](./docs/technical_deep_dive.md)

## Recruiter Notes

This project was built as an end-to-end demonstration of:

- Statistical anomaly detection and unsupervised learning
- Machine learning engineering and reproducible workflows
- Data engineering pipelines (Polars, PCAP processing)
- Explainable ML systems (XAI) and forensic reporting
- End-to-end system design, from ingestion to benchmark validation

The repository includes reproducible pipelines, a walkthrough notebook, benchmark results on CIC-IDS2017, [explainable forensic outputs](./reports/forensic_wednesday/Wednesday_Victim_50_results_FORENSIC.md), and [full technical documentation](./docs/).

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](./LICENSE) [![PEP8](https://img.shields.io/badge/code%20style-PEP8-green.svg)](https://peps.python.org/pep-0008/) [![CIC-IDS2017 Validated](https://img.shields.io/badge/benchmark-CIC--IDS2017-orange.svg)](./reports/benchmarks/) [![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/05/43d7053408195826.svg)](https://github.com/higiphysical-maker/HiGI-IDS/actions/workflows/ci.yml)

## Why it matters

Most intrusion detection systems (IDS) are built around signatures: they can only flag what they already know. That works for known threats, but it leaves a blind spot for anything new, such as a novel attack, a subtle probe, or a slow exfiltration that stays under the radar.

HiGI takes a different path. Instead of chasing signatures, it learns what normal traffic looks like and watches for anything that deviates from that baseline. When network behavior shifts in a statistically meaningful way, the system flags it, without needing labeled attack data, without retraining on new attack types, and without assuming that tomorrow's threat will look like yesterday's.

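
The baseline idea can be illustrated with a deliberately small sketch. It is not HiGI's detector, which combines several models; it only shows the principle of fitting statistics on benign traffic and flagging windows that deviate from them. The function names, the packets-per-second feature, the sample values, and the 3-sigma threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def fit_baseline(benign_values):
    """Learn the mean and sample standard deviation of one feature from benign traffic."""
    return mean(benign_values), stdev(benign_values)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a window whose feature lies more than `threshold` std devs from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A constant baseline: any different value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical packets-per-second readings from benign 1-second windows.
benign_pps = [98, 102, 101, 99, 100, 103, 97, 100]
baseline = fit_baseline(benign_pps)

print(is_anomalous(101, baseline))   # a typical window
print(is_anomalous(5000, baseline))  # a flood-like window
```

No attack samples appear anywhere in the fit step, which is the property that lets this style of detector react to traffic it has never seen labeled.
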
## My contribution

- Designed the full unsupervised detection architecture from scratch
- Implemented the feature engineering pipeline over raw PCAP traffic
- Built the ensemble anomaly detection system (GMM, Isolation Forest, k-NN, velocity gating)
- Developed the explainable forensic engine with MITRE ATT&CK mapping
- Validated the system on the CIC-IDS2017 benchmark

## Key results

- 100% detection rate on DoS/DDoS attacks (CIC-IDS2017)
- 0 false positives on the benign traffic baseline
- Early attack signal up to 21 minutes before the first destructive action

## Performance (CIC-IDS2017 Benchmark)

| Metric | Value |
|------|------|
| Precision | **1.000** *(no reportable false positives on benign control day)* |
| Recall | **0.875–1.000** *(depending on classification of edge cases in Thursday session)* |
| F1-score | **0.933 (conservative)** |
| False positives | **0 (incident-level, benign day)** |
| Detection latency | **≤ 1 minute** |
| Pre-attack detection | **21 min early signal (recon phase)** |

- **Dataset:** CIC-IDS2017 (UNB)
- **Training:** Only benign Monday traffic
- **Evaluation:** Unseen attack days (Wed/Thu)

## Operational Impact

| Operational Metric | Measured Value | Operational Meaning |
|---|---|---|
| **Reportable false positives (8h benign)** | **0** | No spurious escalations during a full benign monitoring session |
| **Sub-threshold transients suppressed** | **266** | Events absorbed by stabilization layer without analyst intervention |
| **Incident-level F1** | **0.933 (conservative)** | Detection reliability with Precision = 1.000 across all reported incidents |
| **Detection latency** | **≤ 1 minute** | All attack classes, including slow-rate vectors |
| **Pre-attack detection window** | **21 minutes** | Reconnaissance flagged before first destructive action (Wednesday) |
| **DoS/DDoS recall** | **100%** | All four Wednesday DoS variants detected; 0 missed incidents |

## Project layout

This repository separates research, runtime artifacts, and production pipeline:

- `src/` → core detection
engine
- `models/` → trained statistical baselines
- `reports/` → forensic outputs (auditable results)
- `data/` → raw and processed network traffic
- `docs/` → technical documentation

## Core Architecture

HiGI is a **multi-layer unsupervised detection system** built on statistical modeling of normal traffic.

```mermaid
graph TD
    A[Raw PCAP] --> B["Feature Extraction (36 features)"]
    B --> C[Data Conditioning]
    C --> D[Ensemble Detectors]
    D --> E[Consensus Decision Layer]
    E --> F[Temporal Stabilization]
    F --> G[Forensic Engine]
    G --> H["SOC Report (MITRE #43; XAI)"]
```

## Pipeline Overview

### 1. Feature Extraction

* 36 flow-level features from 1s network windows
* Polars-based streaming ingestion (PCAP → feature matrix)

### 2. Data Conditioning

* Yeo-Johnson power transform (variance stabilization)
* Blocked PCA per feature family
* Whitening, so that Euclidean distance in the transformed space approximates Mahalanobis distance

### 3. Detection Ensemble

**Tier 1: Geometric Detector**

* k-NN BallTree distance from baseline manifold

**Tier 2: Probabilistic Detectors**

* Bayesian Gaussian Mixture Model
* Isolation Forest (structural outliers)

**Tier 3: Feature-Level Sentinel**

* Per-feature anomaly scoring
* Directionality (SPIKE / DROP)
* High-sensitivity detection + interpretability

**Tier 4: Velocity Gate**

* Detects high-rate floods (DoS/DDoS)
* Catches cases where geometry fails (compressed variance attacks)

### 4. Consensus Decision Layer

Weighted ensemble voting:

* GMM + IForest + BallTree + Velocity signals
* Final anomaly decision via calibrated threshold

### 5. Forensic Engine (XAI Layer)

Each incident includes:

* Culprit features (ranked)
* Severity score
* MITRE ATT&CK mapping
* Analyst-ready structured report (PDF + JSON + Markdown)

## Feature Families

| Family     | Meaning                        |
| ---------- | ------------------------------ |
| Volume     | traffic intensity (bytes, PPS) |
| Payload    | payload structure & density    |
| Flags      | TCP/ICMP state signals         |
| Protocol   | transport distribution         |
| Connection | graph + timing behavior        |

## Benchmark Results (CIC-IDS2017)

### Wednesday: DoS/DDoS

| Attack       | Detection | Key Signal                |
| ------------ | --------- | ------------------------- |
| Slowloris    | ✅        | connection exhaustion     |
| Slowhttptest | ✅        | ICMP + session stress     |
| Hulk         | ✅        | payload collapse          |
| GoldenEye    | ✅        | extreme payload deviation |

### Key insight

HiGI detects attacks via **statistical deviation from baseline traffic structure**, not payload signatures.

## Example Output (Forensic Engine)

```
Incident #29 | Severity: CRITICAL

Features:
  [Connection] unique_dst_ports +45.84 SPIKE
  [Flags] syn_ratio +9.8 SPIKE

MITRE ATT&CK:
  T1499.001 — Resource Exhaustion Flood

Decision:
  BallTree ✔ | GMM ✔ | IForest ✔ | Sentinel ✔
```

## Tech Stack

* Python 3.11+
* scikit-learn (GMM, IsolationForest, BallTree)
* Polars (high-performance ingestion)
* NumPy / Pandas
* Matplotlib (analysis & reports)
* ReportLab (PDF forensic reports)
* PyYAML (config-driven architecture)

## Quickstart

```bash
git clone https://github.com/higiphysical-maker/HiGI-IDS
# git clone creates a directory named after the repository (case-sensitive)
cd HiGI-IDS
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

### Train baseline

```bash
python main.py train --source data/raw/Monday.pcap --bundle models/baseline.pkl
```

### Detect attacks

```bash
python main.py detect \
  --source data/raw/Wednesday.pcap \
  --bundle models/baseline.pkl
```

### Generate forensic report

```bash
python main.py report \
  --results data/processed/results.csv \
  --bundle models/baseline.pkl \
  --output-dir reports/
```

## Limitations

HiGI does **not** detect:

* Semantic Layer 7 attacks (SQLi, XSS,
injection)
* Anomalies inside encrypted payload content
* Anomalies in highly non-stationary environments, unless the baseline is refreshed

## Future Work

* Real-time streaming ingestion (AF_PACKET / scapy)
* Adaptive baselines for non-stationary networks
* Active response integration (iptables / nftables)
* Multi-dataset validation (UNSW-NB15, CIC-IDS2019)

## Further Reading

For the complete technical deep dive (philosophy, full architecture, detailed limitations, and Docker deployment), **[see the Technical README](./docs/technical_deep_dive.md)**.

Additional manuals in English and Spanish are available in the **[docs/](./docs/)** directory.

## License

MIT License. See [`LICENSE`](./LICENSE) for full terms. Open for research and production experimentation.

*HiGI IDS: created and developed by Pablo Aguadero, 2026. Built with AI-assisted tooling (Gemini, Claude, GitHub Copilot) for architectural iteration, code review, and documentation drafting. All design decisions, feature engineering, and validation protocols are original work by the author.*

*Validated against CIC-IDS2017. Reference: Engelen, G., Rimmer, V., & Joosen, W. (2021). Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study. IEEE EuroS&PW. doi:10.1109/EuroSPW54576.2021.00015*
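
As a supplementary illustration, the weighted voting described under the Consensus Decision Layer can be sketched roughly as follows. The weights, the 0.5 threshold, the sample scores, and the assumption that each detector emits a score in [0, 1] are illustrative placeholders, not values taken from HiGI's configuration or code.

```python
# Hypothetical weights; HiGI's actual calibration is not given in this README.
WEIGHTS = {"gmm": 0.3, "iforest": 0.2, "balltree": 0.3, "velocity": 0.2}


def consensus_score(scores, weights=WEIGHTS):
    """Weighted average of per-detector anomaly scores, each assumed to be in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


def is_incident(scores, threshold=0.5):
    """Declare an incident when the weighted score reaches the (assumed) threshold."""
    return consensus_score(scores) >= threshold


# Made-up detector scores for a quiet window and a flood-like window.
quiet = {"gmm": 0.1, "iforest": 0.0, "balltree": 0.2, "velocity": 0.0}
flood = {"gmm": 0.9, "iforest": 0.7, "balltree": 0.8, "velocity": 1.0}

print(is_incident(quiet), is_incident(flood))
```

Normalizing by the sum of the weights keeps the score in [0, 1] even if the weights are later retuned and no longer sum to one.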