GitHub: PeakCoder-Here/threat-hunter
# 🛡 Threat Hunter — Autonomous AI-Driven Threat Hunting Agent
A complete **SIEM companion** that ingests network logs, uses machine learning to detect anomalies, and deploys a multi-agent AI system to investigate alerts and generate forensic incident reports.
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                          THREAT HUNTER                          │
│                                                                 │
│  ┌──────────┐   ┌───────────────┐   ┌──────────────────────┐    │
│  │ Log      │   │ ML Anomaly    │   │ Multi-Agent AI       │    │
│  │ Ingestor │──▶│ Detector      │──▶│ Investigation        │    │
│  │          │   │ (Isolation    │   │                      │    │
│  │ Synthetic│   │  Forest)      │   │ Agent 1: Researcher  │    │
│  │ or Real  │   │               │   │ Agent 2: Forensics   │    │
│  │ Logs     │   │ Score & Flag  │   │ Agent 3: Reporter    │    │
│  └──────────┘   └───────────────┘   └──────────┬───────────┘    │
│                                                │                │
│  ┌────────────────────────────────────────────┐│                │
│  │       SOC Dashboard (FastAPI + HTML)       ││                │
│  │ - Pipeline control   - Alert management    ││                │
│  │ - Live log stream    - Incident reports    ││                │
│  │ - Stats & severity   - Remediation cmds    │◀                │
│  └────────────────────────────────────────────┘                 │
└─────────────────────────────────────────────────────────────────┘
```
## Quick Start
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Run the system
python main.py

# 3. Open dashboard
# http://localhost:8000
```
Or use the setup script:
```bash
chmod +x run.sh && ./run.sh
```
## Pipeline Steps
### Step 1 — Log Ingestion
Generates synthetic network logs simulating normal and anomalous traffic (DoS, Exploits, Reconnaissance, Backdoor, Shellcode). You can also load real datasets like UNSW-NB15 or CICIDS2017.
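As a rough illustration of the idea, the sketch below generates labeled flow records with a fixed share of attack traffic. It is not the generator in `ingestion/log_ingestor.py`: the field names, value ranges, and the `make_log` / `generate_logs` helpers are all assumptions made for this example.

```python
import random

ATTACK_TYPES = ["DoS", "Exploits", "Reconnaissance", "Backdoor", "Shellcode"]


def make_log(rng, anomaly_rate=0.08):
    """Return one synthetic flow record as a dict (field names are illustrative)."""
    if rng.random() < anomaly_rate:
        # Assumed attack profile: short, bursty, one-sided flows.
        record = {
            "duration": rng.uniform(0.001, 0.5),
            "src_packets": rng.randint(200, 5000),
            "dst_packets": rng.randint(0, 20),
            "label": rng.choice(ATTACK_TYPES),
        }
    else:
        # Assumed normal profile: longer, roughly balanced flows.
        record = {
            "duration": rng.uniform(0.5, 30.0),
            "src_packets": rng.randint(5, 100),
            "dst_packets": rng.randint(5, 100),
            "label": "Normal",
        }
    total_packets = record["src_packets"] + record["dst_packets"]
    record["rate"] = total_packets / record["duration"]  # packets per second
    return record


def generate_logs(n, seed=42, anomaly_rate=0.08):
    """Generate n records reproducibly from a seeded random generator."""
    rng = random.Random(seed)
    return [make_log(rng, anomaly_rate) for _ in range(n)]
```

Seeding the generator keeps runs reproducible, which helps when comparing model metrics across experiments.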
### Step 2 — ML Anomaly Detection
Trains an **Isolation Forest** (200 estimators, StandardScaler preprocessing) on 16 network features. The model learns normal traffic patterns and flags outliers without needing labeled data.
**Features used:** duration, packets (src/dst), bytes (src/dst), rate, TTL, load, inter-packet time, jitter, connection counts.
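A minimal sketch of this kind of pipeline with scikit-learn, using random stand-in data rather than real flow features. The `contamination` value and the toy data are assumptions; the README does not say how the project in `ml/anomaly_detector.py` sets them.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in data: 16 numeric features per flow (placeholders for the real ones).
normal = rng.normal(loc=0.0, scale=1.0, size=(920, 16))
attacks = rng.normal(loc=6.0, scale=3.0, size=(80, 16))
X = np.vstack([normal, attacks])

# contamination=0.08 is an assumed value chosen to match the toy data.
model = make_pipeline(
    StandardScaler(),
    IsolationForest(n_estimators=200, contamination=0.08, random_state=0),
)
model.fit(X)  # unsupervised: no labels are passed

labels = model.predict(X)            # +1 = inlier, -1 = outlier
scores = model.decision_function(X)  # negative = flagged; lower = more anomalous
```

Wrapping the scaler and the forest in one pipeline means the same scaling is applied at training and scoring time.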
### Step 3 — Anomaly Scoring
Scores all unprocessed logs and creates alerts for those crossing the anomaly threshold. Assigns severity levels: critical, high, medium, low.
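One way the scoring step could map scores to severities is sketched below. It assumes the scikit-learn convention that lower scores are more anomalous; the cut-off values and the `severity_for` / `create_alerts` helpers are hypothetical, not taken from the project.

```python
def severity_for(score, threshold=0.0):
    """Map an anomaly score to a severity label, or None if it is not an alert.

    Lower scores are treated as more anomalous. The cut-offs are illustrative.
    """
    if score >= threshold:
        return None
    depth = threshold - score  # how far the score falls below the threshold
    if depth >= 0.20:
        return "critical"
    if depth >= 0.12:
        return "high"
    if depth >= 0.05:
        return "medium"
    return "low"


def create_alerts(scored_logs, threshold=0.0):
    """Turn (log_id, score) pairs into alert dicts, skipping non-anomalous logs."""
    alerts = []
    for log_id, score in scored_logs:
        severity = severity_for(score, threshold)
        if severity is not None:
            alerts.append({"log_id": log_id, "score": score, "severity": severity})
    return alerts
```

For example, `create_alerts([(1, 0.1), (2, -0.3)])` yields a single alert for log 2 with severity `critical`.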
### Step 4 — Multi-Agent AI Investigation
When an alert triggers, three AI agents work in sequence:
| Agent | Role | Output |
|-------|------|--------|
| **Researcher** | Queries threat intel APIs, assesses IP reputation | Threat score, geo, categories, IOCs |
| **Forensics Analyst** | Correlates logs, builds MITRE ATT&CK timeline | Attack timeline, techniques, kill chain |
| **Reporter** | Compiles findings into incident report | Markdown report + firewall commands |
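The hand-off between the three agents can be sketched as a chain of coroutines, each receiving the previous agent's output. The stubs below return fixed placeholder values instead of calling an LLM or a threat intel API, and every function name and field here is hypothetical; the project's own orchestration lives in `agents/threat_agents.py`.

```python
import asyncio


async def researcher(alert):
    # Stand-in for threat intel lookups; returns fixed placeholder values.
    return {"ip": alert["src_ip"], "threat_score": 80, "categories": ["scanner"]}


async def forensics(alert, intel):
    # Stand-in for log correlation. T1046 is MITRE ATT&CK "Network Service Discovery".
    return {
        "timeline": [f"{intel['ip']} probed multiple ports"],
        "techniques": ["T1046"],
    }


async def reporter(alert, intel, analysis):
    # Compile the earlier findings into a small Markdown report.
    lines = [
        f"# Incident report for alert {alert['id']}",
        f"Threat score: {intel['threat_score']}",
        f"Techniques: {', '.join(analysis['techniques'])}",
    ]
    return "\n".join(lines)


async def investigate(alert):
    """Run the three agents in sequence, passing results forward."""
    intel = await researcher(alert)
    analysis = await forensics(alert, intel)
    return await reporter(alert, intel, analysis)


report = asyncio.run(investigate({"id": 7, "src_ip": "203.0.113.5"}))
```

Running the agents sequentially keeps each step's input explicit, at the cost of not overlapping slow lookups.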
## Project Structure
```
threat-hunter/
├── main.py                     # Entry point
├── config.py                   # All configuration
├── db.py                       # Database models (SQLAlchemy + async)
├── requirements.txt
├── run.sh                      # Quick setup script
│
├── ingestion/
│   └── log_ingestor.py         # Synthetic data gen + DB ingestion
│
├── ml/
│   ├── anomaly_detector.py     # Isolation Forest pipeline
│   └── models/                 # Saved model files
│
├── agents/
│   ├── llm_provider.py         # LLM abstraction (Groq/Ollama/Mock)
│   ├── threat_intel.py         # AlienVault OTX / VirusTotal client
│   └── threat_agents.py        # Multi-agent orchestration
│
├── api/
│   └── server.py               # FastAPI endpoints
│
├── static/
│   └── dashboard.html          # SOC dashboard
│
└── reports/                    # Generated incident reports
```
## LLM Configuration
The system supports three LLM backends. Set via environment variable:
```bash
# Mock mode (default, no API key needed; great for demos)
LLM_PROVIDER=mock python main.py

# Groq Cloud (fast, free tier available)
LLM_PROVIDER=groq GROQ_API_KEY=gsk_... python main.py

# Ollama (fully local, privacy-first)
# First: ollama pull llama3
LLM_PROVIDER=ollama python main.py
```
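Selecting a backend from `LLM_PROVIDER` could look roughly like the dispatch table below. This is a sketch under assumptions, not the code in `agents/llm_provider.py`: the handler names, the `PROVIDERS` dict, and `complete()` are hypothetical, and the Groq and Ollama handlers are left as stubs.

```python
import asyncio
import os


async def mock_complete(prompt):
    # Canned response so the pipeline runs without any API key.
    return f"[mock] {prompt[:40]}"


async def groq_complete(prompt):
    raise NotImplementedError("call the Groq API here")


async def ollama_complete(prompt):
    raise NotImplementedError("call a local Ollama server here")


PROVIDERS = {
    "mock": mock_complete,
    "groq": groq_complete,
    "ollama": ollama_complete,
}


async def complete(prompt):
    """Dispatch to the backend named by LLM_PROVIDER (defaults to mock)."""
    name = os.environ.get("LLM_PROVIDER", "mock").lower()
    try:
        handler = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown LLM_PROVIDER: {name!r}") from None
    return await handler(prompt)
```

With a table like this, adding a backend means writing one async function and adding one dictionary entry.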
## Threat Intelligence APIs
Optional — enriches investigations with real threat data:
```bash
# AlienVault OTX (free)
export OTX_API_KEY=your_key_here

# VirusTotal (free tier: 4 lookups/min)
export VT_API_KEY=your_key_here
```
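Because the VirusTotal free tier allows 4 lookups per minute, a client needs some form of throttling. The sliding-window limiter below is one possible approach; the `RateLimiter` class is hypothetical and the README does not say how `agents/threat_intel.py` handles the quota.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `max_calls` calls per `period` seconds (sliding window)."""

    def __init__(self, max_calls=4, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
```

A client would call `limiter.wait()` immediately before each lookup, so the fifth request within a minute pauses instead of being rejected by the API.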
## API Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/` | SOC Dashboard |
| GET | `/api/status` | System status |
| POST | `/api/ingest/synthetic?n=5000` | Ingest synthetic logs |
| POST | `/api/ingest/simulate-live` | Simulate live batch |
| POST | `/api/ml/train` | Train anomaly model |
| GET | `/api/ml/metrics` | Model evaluation |
| POST | `/api/detect` | Run anomaly detection |
| GET | `/api/alerts` | List alerts |
| GET | `/api/alerts/{id}` | Alert details |
| POST | `/api/alerts/{id}/investigate` | AI investigation |
| POST | `/api/alerts/investigate-all` | Batch investigation |
| GET | `/api/logs?flagged_only=true` | Browse logs |
| GET | `/api/logs/stats` | Aggregate statistics |
Interactive API docs available at `/docs` (Swagger UI).
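The endpoints above can be chained to drive the pipeline from a script. The sketch below uses only the standard library and the paths from the table; it assumes the server is running at the Quick Start address and that responses are JSON, which the README does not state explicitly. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address from Quick Start

# (method, path) pairs in the order of the pipeline steps.
PIPELINE = [
    ("POST", "/api/ingest/synthetic?n=5000"),
    ("POST", "/api/ml/train"),
    ("POST", "/api/detect"),
    ("GET", "/api/alerts"),
]


def build_request(method, path, base_url=BASE_URL):
    """Create a request object for one endpoint without sending it."""
    return urllib.request.Request(base_url + path, method=method)


def run_pipeline(base_url=BASE_URL):
    """Call each endpoint in turn; response bodies are assumed to be JSON."""
    results = []
    for method, path in PIPELINE:
        with urllib.request.urlopen(build_request(method, path, base_url)) as resp:
            results.append(json.load(resp))
    return results
```

With the server running, `run_pipeline()` returns one decoded response per step, the last being the alert list.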
## Using Real Datasets
### UNSW-NB15
```python
import asyncio

from ingestion.log_ingestor import load_unsw_csv, ingest_dataframe

df = load_unsw_csv("path/to/UNSW-NB15.csv")
asyncio.run(ingest_dataframe(df))  # ingest_dataframe is awaited, so run it in an event loop
```
### CICIDS2017
CICIDS2017 uses different column names, so map them to the UNSW-NB15 names before ingesting. The ingestor recognizes common UNSW-NB15 column names automatically.
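A possible mapping step is sketched below with pandas. The target names follow the UNSW-NB15 training-set headers (`dur`, `spkts`, and so on); whether the ingestor expects exactly these names is an assumption, and `CICIDS_TO_UNSW` and `map_cicids_columns` are hypothetical helpers covering only a subset of columns.

```python
import pandas as pd

# CICIDS2017 header -> UNSW-NB15-style name. An illustrative subset; extend as needed.
CICIDS_TO_UNSW = {
    "Flow Duration": "dur",
    "Total Fwd Packets": "spkts",
    "Total Backward Packets": "dpkts",
    "Total Length of Fwd Packets": "sbytes",
    "Total Length of Bwd Packets": "dbytes",
    "Flow Packets/s": "rate",
}


def map_cicids_columns(df):
    """Return a copy of df with CICIDS2017 columns renamed to UNSW-NB15 names."""
    # CICIDS2017 CSV headers often carry stray leading spaces.
    df = df.rename(columns=lambda c: c.strip())
    df = df.rename(columns=CICIDS_TO_UNSW)
    # CICFlowMeter reports duration in microseconds; UNSW-NB15 uses seconds.
    df["dur"] = df["dur"] / 1_000_000
    return df
```

The result could then be passed to `ingest_dataframe` in the same way as the UNSW-NB15 example.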
## Performance
Tested on a synthetic dataset (3,000 samples, 8% anomaly rate):
| Metric | Value |
|--------|-------|
| Precision | 0.870 |
| Recall | 1.000 |
| F1 Score | 0.930 |
| Flagged | 9.2% |
| Training time | ~2s |
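The reported figures are mutually consistent, which a short calculation confirms. The sketch takes precision, recall, and the anomaly rate from the table and derives the other two values; no new measurements are involved.

```python
precision, recall, anomaly_rate = 0.870, 1.000, 0.08

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Flagged fraction = true positives / precision,
# where true positives = recall * anomaly_rate.
flagged = recall * anomaly_rate / precision

print(f"F1 = {f1:.3f}, flagged = {flagged:.1%}")
# prints "F1 = 0.930, flagged = 9.2%"
```

With perfect recall, every extra flagged log beyond the 8% true anomalies is a false positive, which is why the flagged share (9.2%) sits slightly above the anomaly rate.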
## Extending the System
**Add a new agent:** Create a new system prompt + handler in `agents/threat_agents.py` and wire it into `investigate_anomaly()`.
**Add a new LLM provider:** Add a new async function in `agents/llm_provider.py` and register it in `llm_complete()`.
**Switch to PostgreSQL:** Change `DATABASE_URL` in config.py to `postgresql+asyncpg://...` and install `asyncpg`.
**Switch to ELK Stack:** Replace the SQLite ingestion with a Logstash pipeline feeding Elasticsearch, and query ES in the detection loop.
## License
Educational project. Use responsibly.