# HiGI IDS — Unsupervised Network Anomaly Detection Engine
## Overview
## Snapshot
- 🛡 Built a fully unsupervised IDS that detects unknown network attacks from raw PCAP traffic
- ⚡ Sub-minute detection latency (< 1 min end-to-end)
- 🎯 100% recall on DoS/DDoS (CIC-IDS2017 controlled benchmark)
- 🧠 No labels required — learns normal traffic baseline
- 📊 Produces explainable forensic reports (MITRE ATT&CK mapped)
## Live Pipeline / Demo
Fully reproducible end-to-end pipeline: extraction → training → detection → forensic report.
- Notebook: [DEMO_NOTEBOOK.ipynb](./DEMO_NOTEBOOK.ipynb)
- Technical deep dive: [docs/technical_deep_dive.md](./docs/technical_deep_dive.md)
## Recruiter Notes
This project was built as an end-to-end demonstration of:
- Statistical anomaly detection & unsupervised learning.
- Machine learning engineering & reproducible workflows.
- Data engineering pipelines (Polars, PCAP processing).
- Explainable ML systems (XAI) & forensic reporting.
- End-to-end system design, from ingestion to benchmark validation.
The repository includes reproducible pipelines, a walkthrough notebook, benchmark results on CIC-IDS2017, [explainable forensic outputs](./reports/forensic_wednesday/Wednesday_Victim_50_results_FORENSIC.md), and [full technical documentation](./docs/).
## Why it matters
Most intrusion detection systems (IDS) are built around signatures: they can only
flag what they already know. That works for known threats, but it leaves a
blind spot for anything new — a novel attack, a subtle probe, or a slow
exfiltration that stays under the radar.
HiGI takes a different path. Instead of chasing signatures, it learns what
normal traffic looks like and watches for anything that deviates from that
baseline. When network behavior shifts in a statistically meaningful way, the
system flags it — without needing labeled attack data, without retraining, and
without assuming that tomorrow's threat will look like yesterday's.
## My contribution
- Designed full unsupervised detection architecture from scratch
- Implemented feature engineering pipeline over raw PCAP traffic
- Built ensemble anomaly detection system (GMM, Isolation Forest, k-NN, velocity gating)
- Developed explainable forensic engine with MITRE ATT&CK mapping
- Validated system on CIC-IDS2017 benchmark
## Key results
- 100% detection rate on DoS/DDoS attacks (CIC-IDS2017)
- 0 false positives on benign traffic baseline
- Reconnaissance activity flagged up to 21 minutes before the first destructive action
## Performance (CIC-IDS2017 Benchmark)
| Metric | Value |
|------|------|
| Precision | **1.000** *(no reportable false positives on benign control day)* |
| Recall | **0.875–1.000** *(depending on classification of edge cases in Thursday session)* |
| F1-score | **0.933 (conservative)** |
| False positives | **0 (incident-level, benign day)** |
| Detection latency | **≤ 1 minute** |
| Pre-attack detection | **21 min early signal (recon phase)** |
**Dataset:** CIC-IDS2017 (UNB)
**Training:** Only benign Monday traffic
**Evaluation:** Unseen attack days (Wed/Thu)
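As a quick sanity check on the table, the conservative F1 follows from the standard harmonic-mean formula applied to the reported precision and the lower recall bound. The helper name `f1_score` below is only illustrative and is not taken from the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall bounds are the values reported in the table above.
conservative = f1_score(1.000, 0.875)
optimistic = f1_score(1.000, 1.000)

print(f"conservative F1 = {conservative:.3f}")
print(f"optimistic F1   = {optimistic:.3f}")
```

With the lower recall bound this gives 0.933, which matches the conservative figure in the table.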
## Operational Impact
| Operational Metric | Measured Value | Operational Meaning |
|---|---|---|
| **Reportable false positives (8h benign)** | **0** | No spurious escalations during a full benign monitoring session |
| **Sub-threshold transients suppressed** | **266** | Events absorbed by stabilization layer without analyst intervention |
| **Incident-level F1** | **0.933 (conservative)** | Detection reliability with Precision = 1.000 across all reported incidents |
| **Detection latency** | **≤ 1 minute** | All attack classes, including slow-rate vectors |
| **Pre-attack detection window** | **21 minutes** | Reconnaissance flagged before first destructive action (Wednesday) |
| **DoS/DDoS recall** | **100%** | All four Wednesday DoS variants detected; 0 missed incidents |
## Project layout
This repository separates research, runtime artifacts, and the production pipeline:
- `src/` → core detection engine
- `models/` → trained statistical baselines
- `reports/` → forensic outputs (auditable results)
- `data/` → raw and processed network traffic
- `docs/` → technical documentation
## Core Architecture
HiGI is a **multi-layer unsupervised detection system** built on statistical modeling of normal traffic.
```mermaid
graph TD
A[Raw PCAP] --> B["Feature Extraction (36 features)"]
B --> C[Data Conditioning]
C --> D[Ensemble Detectors]
D --> E[Consensus Decision Layer]
E --> F[Temporal Stabilization]
F --> G[Forensic Engine]
G --> H["SOC Report (MITRE #43; XAI)"]
```
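The Temporal Stabilization stage is not detailed in this README. One common way to absorb short-lived transients is a persistence filter that keeps an alert only when several consecutive windows fire. The sketch below assumes that approach; the function name `stabilize` and the `min_consecutive` parameter are hypothetical and may not match the actual implementation.

```python
def stabilize(flags, min_consecutive=3):
    """Keep alerts only inside runs of at least `min_consecutive` fired windows."""
    out = [False] * len(flags)
    run_start = None
    # A trailing False acts as a sentinel so a run ending at the last window is closed.
    for i, hit in enumerate(list(flags) + [False]):
        if hit and run_start is None:
            run_start = i
        elif not hit and run_start is not None:
            if i - run_start >= min_consecutive:
                out[run_start:i] = [True] * (i - run_start)
            run_start = None
    return out


# Per-window raw decisions (illustrative): two isolated blips and one sustained run.
raw = [False, True, False, True, True, True, True, False, True]
stable = stabilize(raw)
suppressed = sum(raw) - sum(stable)
print(stable, "suppressed windows:", suppressed)
```

Here the two isolated blips are dropped and only the four-window run survives, which is the kind of behavior the "sub-threshold transients suppressed" metric describes.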
## Pipeline Overview
### 1. Feature Extraction
* 36 flow-level features from 1s network windows
* Polars-based streaming ingestion (PCAP → feature matrix)
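To make the 1-second windowing concrete, here is a minimal sketch that buckets packets by timestamp and derives a few per-window features. It uses plain Python rather than Polars to stay dependency-free, and the packet tuple layout `(timestamp, size_bytes, is_syn)` and the function name `window_features` are assumptions for illustration, not the project's actual schema.

```python
from collections import defaultdict


def window_features(packets, window_s=1.0):
    """Aggregate (timestamp, size_bytes, is_syn) tuples into per-window features."""
    buckets = defaultdict(list)
    for ts, size, is_syn in packets:
        buckets[int(ts // window_s)].append((size, is_syn))

    rows = []
    for idx in sorted(buckets):
        sizes = [size for size, _ in buckets[idx]]
        syn_count = sum(1 for _, is_syn in buckets[idx] if is_syn)
        rows.append({
            "window_start": idx * window_s,
            "pps": len(sizes) / window_s,        # packets per second
            "bytes": sum(sizes),                 # total bytes in the window
            "syn_ratio": syn_count / len(sizes), # share of SYN packets
        })
    return rows


# Four synthetic packets: three in the first second, one in the next.
packets = [(0.10, 60, True), (0.55, 1500, False), (0.90, 60, True), (1.20, 40, False)]
rows = window_features(packets)
for row in rows:
    print(row)
```

Each resulting row corresponds to one row of the feature matrix; the real pipeline produces 36 such columns.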
### 2. Data Conditioning
* Yeo-Johnson power transform (variance stabilization)
* Blocked PCA per feature family
* Whitening → enables Euclidean ≈ Mahalanobis distance
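The claim that whitening makes Euclidean distance behave like Mahalanobis distance can be checked numerically. The sketch below uses a single PCA-whitening block on synthetic two-feature data; the real pipeline applies PCA per feature family, so treat this as an illustration of the principle only. All data and variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated synthetic "baseline" with two features (illustrative data only).
cov = np.array([[4.0, 1.5], [1.5, 1.0]])
X = rng.multivariate_normal(mean=[10.0, -2.0], cov=cov, size=500)

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)

# PCA whitening: rotate onto the eigenvectors, then scale each axis to unit variance.
eigvals, eigvecs = np.linalg.eigh(S)
W = eigvecs / np.sqrt(eigvals)  # column j is divided by sqrt(eigvals[j])


def whiten(x):
    """Center with the baseline mean and project into the whitened space."""
    return (x - mu) @ W


x = np.array([14.0, 1.0])
euclid_whitened = np.linalg.norm(whiten(x))

d = x - mu
mahalanobis = np.sqrt(d @ np.linalg.inv(S) @ d)

print(f"Euclidean (whitened space): {euclid_whitened:.6f}")
print(f"Mahalanobis (original):     {mahalanobis:.6f}")
```

The two distances agree up to floating-point error, which is why a plain Euclidean k-NN search in the whitened space can stand in for a Mahalanobis search against the baseline.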
### 3. Detection Ensemble
**Tier 1 — Geometric Detector**
* k-NN BallTree distance from baseline manifold
**Tier 2 — Probabilistic Detectors**
* Bayesian Gaussian Mixture Model
* Isolation Forest (structural outliers)
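As a rough illustration of Tiers 1 and 2, the following sketch fits the three scikit-learn estimators named in the tech stack on synthetic baseline data and scores one typical and one far-out point. The data, hyperparameters (`k=5`, `n_components=3`, `n_estimators=100`), and variable names are assumptions chosen for the example, not the project's tuned configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.mixture import BayesianGaussianMixture
from sklearn.neighbors import BallTree

rng = np.random.default_rng(42)
baseline = rng.normal(size=(500, 4))               # stand-in for conditioned benign windows
probe = np.vstack([np.zeros(4), np.full(4, 8.0)])  # one typical point, one far outlier

# Tier 1: mean distance to the k nearest baseline points (higher = more anomalous).
tree = BallTree(baseline)
knn_dist, _ = tree.query(probe, k=5)
knn_score = knn_dist.mean(axis=1)

# Tier 2a: log-likelihood under a Bayesian GMM (lower = more anomalous).
gmm = BayesianGaussianMixture(n_components=3, random_state=0).fit(baseline)
gmm_loglik = gmm.score_samples(probe)

# Tier 2b: Isolation Forest score (lower = more anomalous).
iforest = IsolationForest(n_estimators=100, random_state=0).fit(baseline)
if_score = iforest.score_samples(probe)

print("k-NN distance :", knn_score)
print("GMM log-lik   :", gmm_loglik)
print("IForest score :", if_score)
```

Note the opposite score directions: the k-NN distance grows for anomalies while the GMM and Isolation Forest scores shrink, so any combination step has to normalize them first.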
**Tier 3 — Feature-Level Sentinel**
* Per-feature anomaly scoring
* Directionality (SPIKE / DROP)
* High-sensitivity detection + interpretability
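A per-feature sentinel with directionality could look like the sketch below, which assumes a simple z-score against baseline mean and standard deviation. The scoring rule, the threshold `z_thresh`, the baseline values, and the feature name `payload_mean` are hypothetical; `unique_dst_ports` and `syn_ratio` are borrowed from the example forensic output in this README.

```python
import numpy as np


def sentinel(window, mean, std, names, z_thresh=3.0):
    """Return (name, z, direction) for features beyond z_thresh, ranked by |z|."""
    z = (np.asarray(window, dtype=float) - mean) / std
    hits = [
        (names[i], float(z[i]), "SPIKE" if z[i] > 0 else "DROP")
        for i in np.flatnonzero(np.abs(z) >= z_thresh)
    ]
    return sorted(hits, key=lambda h: abs(h[1]), reverse=True)


# Illustrative baseline statistics (not measured values).
names = ["unique_dst_ports", "syn_ratio", "payload_mean"]
mean = np.array([5.0, 0.10, 400.0])
std = np.array([2.0, 0.05, 100.0])

hits = sentinel([60.0, 0.60, 50.0], mean, std, names)
for name, z, direction in hits:
    print(f"{name:18s} {z:+7.2f} {direction}")
```

The ranked list doubles as the raw material for the "culprit features" section of a forensic report.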
**Tier 4 — Velocity Gate**
* Detects high-rate floods (DoS/DDoS)
* Catches cases where geometry fails (compressed variance attacks)
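The README does not spell out how the velocity gate is computed. One plausible reading is a robust rate threshold derived from the benign baseline, sketched here as median plus a multiple of the median absolute deviation (MAD). The function name `make_velocity_gate`, the multiplier `k`, and the sample values are assumptions for illustration.

```python
import statistics


def make_velocity_gate(baseline_pps, k=6.0):
    """Build a gate that fires when PPS exceeds median + k * MAD of the baseline."""
    med = statistics.median(baseline_pps)
    # Fall back to 1.0 if the baseline is constant and the MAD is zero.
    mad = statistics.median([abs(v - med) for v in baseline_pps]) or 1.0
    limit = med + k * mad
    return lambda pps: pps > limit


# Packets-per-second observed in benign windows (illustrative numbers).
baseline_pps = [100, 110, 95, 105, 98, 102, 107, 93]
gate = make_velocity_gate(baseline_pps)

print(gate(120))   # modest burst, below the limit of 131
print(gate(5000))  # flood-level rate
```

Because the gate looks only at rate, it still fires when a flood of near-identical packets collapses the variance that the geometric detectors rely on.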
### 4. Consensus Decision Layer
Weighted ensemble voting:
* GMM + IForest + BallTree + Velocity signals
* Final anomaly decision via calibrated threshold
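A weighted vote over the four signals can be sketched as follows. The weights and the threshold are placeholders chosen for the example; the calibrated values used by HiGI are not given in this README.

```python
def consensus(votes, weights, threshold=0.5):
    """Return (score, decision) where score is the weight share of fired detectors."""
    total = sum(weights.values())
    fired = sum(weights[name] for name, hit in votes.items() if hit)
    score = fired / total
    return score, score >= threshold


# Hypothetical weights; they only need to be non-negative with a positive sum.
weights = {"gmm": 0.3, "iforest": 0.2, "balltree": 0.3, "velocity": 0.2}

score, is_anomaly = consensus(
    {"gmm": True, "iforest": False, "balltree": True, "velocity": False}, weights
)
weak_score, weak_decision = consensus(
    {"gmm": False, "iforest": True, "balltree": False, "velocity": False}, weights
)

print(f"score={score:.2f} anomaly={is_anomaly}")
print(f"score={weak_score:.2f} anomaly={weak_decision}")
```

Normalizing by the total weight keeps the score in [0, 1], so the threshold stays meaningful if detectors are added or reweighted.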
### 5. Forensic Engine (XAI Layer)
Each incident includes:
* Culprit features (ranked)
* Severity score
* MITRE ATT&CK mapping
* Analyst-ready structured report (PDF + JSON + Markdown)
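The JSON form of an incident might be assembled along these lines. The class and field names (`Incident`, `Culprit`, `mitre_technique`, and so on) are hypothetical and do not reflect the repository's actual schema; the sample values are taken from the example output shown later in this README.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Culprit:
    family: str
    feature: str
    deviation: float
    direction: str  # "SPIKE" or "DROP"


@dataclass
class Incident:
    incident_id: int
    severity: str
    mitre_technique: str
    culprits: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the incident with culprits ranked by absolute deviation."""
        ranked = sorted(self.culprits, key=lambda c: abs(c.deviation), reverse=True)
        payload = asdict(self)
        payload["culprits"] = [asdict(c) for c in ranked]
        return json.dumps(payload, indent=2)


incident = Incident(
    incident_id=29,
    severity="CRITICAL",
    mitre_technique="T1499.001",
    culprits=[
        Culprit("Flags", "syn_ratio", 9.8, "SPIKE"),
        Culprit("Connection", "unique_dst_ports", 45.84, "SPIKE"),
    ],
)
report = json.loads(incident.to_json())
print(incident.to_json())
```

Keeping the structured record as the single source makes it straightforward to render the same incident to Markdown or PDF afterwards.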
## Feature Families
| Family | Meaning |
| ---------- | ------------------------------ |
| Volume | traffic intensity (bytes, PPS) |
| Payload | payload structure & density |
| Flags | TCP/ICMP state signals |
| Protocol | transport distribution |
| Connection | graph + timing behavior |
## Benchmark Results (CIC-IDS2017)
### Wednesday — DoS/DDoS
| Attack | Detection | Key Signal |
| ------------ | --------- | ------------------------- |
| Slowloris | ✅ | connection exhaustion |
| Slowhttptest | ✅ | ICMP + session stress |
| Hulk | ✅ | payload collapse |
| GoldenEye | ✅ | extreme payload deviation |
### Key insight
HiGI detects attacks via **statistical deviation from baseline traffic structure**, not payload signatures.
## Example Output (Forensic Engine)
```text
Incident #29 | Severity: CRITICAL
Features:
[Connection] unique_dst_ports +45.84 SPIKE
[Flags] syn_ratio +9.8 SPIKE
MITRE ATT&CK:
T1499.001 — Resource Exhaustion Flood
Decision:
BallTree ✔ | GMM ✔ | IForest ✔ | Sentinel ✔
```
## Tech Stack
* Python 3.11+
* scikit-learn (GMM, IsolationForest, BallTree)
* Polars (high-performance ingestion)
* NumPy / Pandas
* Matplotlib (analysis & reports)
* ReportLab (PDF forensic reports)
* PyYAML (config-driven architecture)
## Quickstart
```bash
git clone https://github.com/higiphysical-maker/HiGI-IDS
cd HiGI-IDS
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
### Train baseline
```bash
python main.py train --source data/raw/Monday.pcap --bundle models/baseline.pkl
```
### Detect attacks
```bash
python main.py detect \
  --source data/raw/Wednesday.pcap \
  --bundle models/baseline.pkl
```
### Generate forensic report
```bash
python main.py report \
  --results data/processed/results.csv \
  --bundle models/baseline.pkl \
  --output-dir reports/
```
## Limitations
HiGI does **not** detect:
* Semantic Layer 7 attacks (SQLi, XSS, injection)
* Encrypted payload content anomalies
* Highly non-stationary environments without baseline refresh
## Future Work
* Real-time streaming ingestion (AF_PACKET / scapy)
* Adaptive baselines for non-stationary networks
* Active response integration (iptables / nftables)
* Multi-dataset validation (UNSW-NB15, CIC-IDS2019)
## Further Reading
For the complete technical deep dive (philosophy, full architecture, detailed limitations, and Docker deployment), **[see the Technical README](./docs/technical_deep_dive.md)**.
Additional manuals in English and Spanish are available in the **[docs/](./docs/)** directory.
## License
MIT License — see [`LICENSE`](./LICENSE) for full terms. Open for research and production experimentation.
*HiGI IDS — Created and Developed by Pablo Aguadero, 2026. Built with AI-assisted tooling (Gemini, Claude, GitHub Copilot) for architectural iteration, code review, and documentation drafting. All design decisions, feature engineering, and validation protocols are original work by the author.*
*Validated against CIC-IDS2017. Reference: Engelen, G., Rimmer, V., & Joosen, W. (2021). Troubleshooting an Intrusion Detection Dataset: the CICIDS2017 Case Study. IEEE EuroS&PW. doi:10.1109/EuroSPW54576.2021.00015*