PeakCoder-Here/threat-hunter


# 🛡 Threat Hunter — Autonomous AI-Driven Threat Hunting Agent

A complete **SIEM companion** that ingests network logs, uses machine learning to detect anomalies, and deploys a multi-agent AI system to investigate alerts and generate forensic incident reports.

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          THREAT HUNTER                          │
│                                                                 │
│  ┌──────────┐   ┌───────────────┐   ┌──────────────────────┐    │
│  │ Log      │   │ ML Anomaly    │   │ Multi-Agent AI       │    │
│  │ Ingestor │──▶│ Detector      │──▶│ Investigation        │    │
│  │          │   │ (Isolation    │   │                      │    │
│  │ Synthetic│   │  Forest)      │   │ Agent 1: Researcher  │    │
│  │ or Real  │   │               │   │ Agent 2: Forensics   │    │
│  │ Logs     │   │ Score & Flag  │   │ Agent 3: Reporter    │    │
│  └──────────┘   └───────────────┘   └──────────┬───────────┘    │
│                                                │                │
│  ┌────────────────────────────────────────────┐│                │
│  │       SOC Dashboard (FastAPI + HTML)       ││                │
│  │  - Pipeline control   - Alert management   ││                │
│  │  - Live log stream    - Incident reports   ││                │
│  │  - Stats & severity   - Remediation cmds   │◀                │
│  └────────────────────────────────────────────┘                 │
└─────────────────────────────────────────────────────────────────┘
```

## Quick Start

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Run the system
python main.py

# 3. Open dashboard
# http://localhost:8000
```

Or use the setup script:

```bash
chmod +x run.sh && ./run.sh
```

## Pipeline Steps

### Step 1 — Log Ingestion

Generates synthetic network logs simulating normal and anomalous traffic (DoS, Exploits, Reconnaissance, Backdoor, Shellcode). You can also load real datasets like UNSW-NB15 or CICIDS2017.

### Step 2 — ML Anomaly Detection

Trains an **Isolation Forest** (200 estimators, StandardScaler preprocessing) on 16 network features. The model learns normal traffic patterns and flags outliers without needing labeled data.

**Features used:** duration, packets (src/dst), bytes (src/dst), rate, TTL, load, inter-packet time, jitter, connection counts.
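
The detection step described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the code in `ml/anomaly_detector.py`: the random data stands in for the 16 real features, and the `contamination=0.08` setting, the seed, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

N_FEATURES = 16  # matches the feature count in the README; the columns here are synthetic
rng = np.random.default_rng(42)

# Stand-in data (assumption): 920 "normal" rows and 80 widely scattered "anomalous" rows.
normal = rng.normal(loc=0.0, scale=1.0, size=(920, N_FEATURES))
anomalous = rng.normal(loc=0.0, scale=4.0, size=(80, N_FEATURES))
X = np.vstack([normal, anomalous])

# StandardScaler preprocessing followed by an Isolation Forest with 200 estimators.
# contamination=0.08 is an assumed value, not taken from the project's config.
model = make_pipeline(
    StandardScaler(),
    IsolationForest(n_estimators=200, contamination=0.08, random_state=42),
)
model.fit(X)  # unsupervised: no labels are passed

scores = model.decision_function(X)  # lower score = more anomalous
flags = model.predict(X) == -1       # True where the model flags an outlier

print(f"flagged {flags.sum()} of {len(X)} rows")
```

Because the forest is fitted without labels, the only tuning knob in this sketch is `contamination`, which controls what fraction of the training data ends up below the decision threshold.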
### Step 3 — Anomaly Scoring

Scores all unprocessed logs and creates alerts for those crossing the anomaly threshold. Assigns severity levels: critical, high, medium, low.

### Step 4 — Multi-Agent AI Investigation

When an alert triggers, three AI agents work in sequence:

| Agent | Role | Output |
|-------|------|--------|
| **Researcher** | Queries threat intel APIs, assesses IP reputation | Threat score, geo, categories, IOCs |
| **Forensics Analyst** | Correlates logs, builds MITRE ATT&CK timeline | Attack timeline, techniques, kill chain |
| **Reporter** | Compiles findings into incident report | Markdown report + firewall commands |

## Project Structure

```
threat-hunter/
├── main.py                  # Entry point
├── config.py                # All configuration
├── db.py                    # Database models (SQLAlchemy + async)
├── requirements.txt
├── run.sh                   # Quick setup script
│
├── ingestion/
│   └── log_ingestor.py      # Synthetic data gen + DB ingestion
│
├── ml/
│   ├── anomaly_detector.py  # Isolation Forest pipeline
│   └── models/              # Saved model files
│
├── agents/
│   ├── llm_provider.py      # LLM abstraction (Groq/Ollama/Mock)
│   ├── threat_intel.py      # AlienVault OTX / VirusTotal client
│   └── threat_agents.py     # Multi-agent orchestration
│
├── api/
│   └── server.py            # FastAPI endpoints
│
├── static/
│   └── dashboard.html       # SOC dashboard
│
└── reports/                 # Generated incident reports
```

## LLM Configuration

The system supports three LLM backends. Select one via the `LLM_PROVIDER` environment variable:

```bash
# Mock mode (default, no API key needed; great for demos)
LLM_PROVIDER=mock python main.py

# Groq Cloud (fast, free tier available)
LLM_PROVIDER=groq GROQ_API_KEY=gsk_... python main.py

# Ollama (fully local, privacy-first)
# First: ollama pull llama3
LLM_PROVIDER=ollama python main.py
```

## Threat Intelligence APIs

Optional. These keys enrich investigations with real threat data:

```bash
# AlienVault OTX (free)
export OTX_API_KEY=your_key_here

# VirusTotal (free tier: 4 lookups/min)
export VT_API_KEY=your_key_here
```

## API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/` | SOC Dashboard |
| GET | `/api/status` | System status |
| POST | `/api/ingest/synthetic?n=5000` | Ingest synthetic logs |
| POST | `/api/ingest/simulate-live` | Simulate live batch |
| POST | `/api/ml/train` | Train anomaly model |
| GET | `/api/ml/metrics` | Model evaluation |
| POST | `/api/detect` | Run anomaly detection |
| GET | `/api/alerts` | List alerts |
| GET | `/api/alerts/{id}` | Alert details |
| POST | `/api/alerts/{id}/investigate` | AI investigation |
| POST | `/api/alerts/investigate-all` | Batch investigation |
| GET | `/api/logs?flagged_only=true` | Browse logs |
| GET | `/api/logs/stats` | Aggregate statistics |

Interactive API docs are available at `/docs` (Swagger UI).

## Using Real Datasets

### UNSW-NB15

```python
import asyncio

from ingestion.log_ingestor import load_unsw_csv, ingest_dataframe

df = load_unsw_csv("path/to/UNSW-NB15.csv")
asyncio.run(ingest_dataframe(df))  # ingest_dataframe is a coroutine, so run it in an event loop
```

### CICIDS2017

Requires column mapping; the ingestor handles common UNSW-NB15 column names automatically.

## Performance

Tested on a synthetic dataset (3000 samples, 8% anomaly rate):

| Metric | Value |
|--------|-------|
| Precision | 0.870 |
| Recall | 1.000 |
| F1 Score | 0.930 |
| Flagged | 9.2% |
| Training time | ~2s |

## Extending the System

**Add a new agent:** Create a new system prompt + handler in `agents/threat_agents.py` and wire it into `investigate_anomaly()`.

**Add a new LLM provider:** Add a new async function in `agents/llm_provider.py` and register it in `llm_complete()`.
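
One way to picture the "add an async function and register it" pattern is a small dispatch table keyed by provider name. The sketch below is hypothetical: the real signature of `llm_complete()` is not shown in this README, so the `(system_prompt, user_prompt, provider)` parameters, the `PROVIDERS` dict, and both handler functions are assumptions made for illustration.

```python
import asyncio
import os
from typing import Awaitable, Callable, Dict, Optional

# Hypothetical provider signature: (system_prompt, user_prompt) -> completion text.
Provider = Callable[[str, str], Awaitable[str]]


async def mock_complete(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for a mock backend: returns a canned string, no network call."""
    return f"[mock] {user_prompt}"


async def my_provider_complete(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical new provider; a real one would call its HTTP API here."""
    await asyncio.sleep(0)  # placeholder for the awaited network request
    return f"[my_provider] {user_prompt}"


# Registering a provider means adding one entry to this table (assumed structure).
PROVIDERS: Dict[str, Provider] = {
    "mock": mock_complete,
    "my_provider": my_provider_complete,
}


async def llm_complete(
    system_prompt: str, user_prompt: str, provider: Optional[str] = None
) -> str:
    """Dispatch to the provider named by the argument or by LLM_PROVIDER (default: mock)."""
    name = (provider or os.environ.get("LLM_PROVIDER", "mock")).lower()
    handler = PROVIDERS.get(name)
    if handler is None:
        raise ValueError(f"Unknown LLM provider: {name!r}")
    return await handler(system_prompt, user_prompt)


if __name__ == "__main__":
    print(asyncio.run(llm_complete("You are a SOC analyst.", "Summarize alert 7", "my_provider")))
```

Keeping every backend behind the same coroutine signature means the agents never need to know which provider is active.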
**Switch to PostgreSQL:** Change `DATABASE_URL` in `config.py` to `postgresql+asyncpg://...` and install `asyncpg`.

**Switch to ELK Stack:** Replace the SQLite ingestion with a Logstash pipeline feeding Elasticsearch, and query ES in the detection loop.

## License

Educational project. Use responsibly.