tyceerickson/ai-soc-pipeline

# AI-Powered SOC Pipeline

**CMU MSISPM Portfolio, Project 4**

Tyce Erickson · May 2026

A production-grade, AI-powered Security Operations Center pipeline built on real internet attack data. Three honeypots exposed to the public internet collect live attack traffic across SSH, web, and malware-capture vectors. Data is processed through a Wazuh SIEM, enriched with geolocation and VirusTotal threat intelligence, and analyzed by a locally hosted large language model. A custom real-time dashboard provides 12 integrated threat-intelligence panels across three honeypot sources. The deployment is still live and collecting data in real time.

**This is not a simulation. Every alert in this dataset came from a real attacker.**

## 🎥 Live Demo

[![Watch the SOC dashboard demo](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/5de5cf1417020412.png)](https://youtu.be/YOUR_VIDEO_ID "Watch the 3-minute demo")

**▶ [Watch the 3-minute walkthrough](https://youtu.be/YOUR_VIDEO_ID)**: a live tour of the dashboard running on real honeypot data, covering the attack timeline and geographic map, behavioral botnet fingerprinting, VirusTotal-verified WannaCry malware capture, cross-honeypot threat-actor correlation, and on-demand AI threat analysis.

## Live Stats (collection window May 21–29, 2026)

| Metric | Value |
|--------|-------|
| Total Wazuh alerts | 11,611,908 (~11.6M) |
| Cowrie SSH/Telnet events | 872,871 |
| nginx web requests | 1,352+ |
| Malware binaries captured | 7 (6 WannaCry + 1 downloader, VirusTotal-verified) |
| Unique attacker IPs | 1,000+ |
| Countries observed | 99 |
| Active botnets identified | 6 |
| MITRE ATT&CK tactics | 7 |
| Peak day | ~2.8M alerts / 24h |

## Architecture

    Internet Attackers
            │
            ▼
    [DigitalOcean VPS, NYC1]                    [Alienware m16 R2]
      Cowrie SSH/Telnet Honeypot                  Ollama + llama3.1:8b
      nginx Web Honeypot                          RTX 4070 (8GB VRAM)
      Dionaea Malware Capture
            │                                         │
            │ rsync / 15min (Tailscale)               │ HTTP API (Tailscale)
            ▼                                         │
    [Ubuntu Server, aarch64]                          │
      Wazuh SIEM + OpenSearch                         │
      GeoIP + VirusTotal Enrichment                   │
      parse_nginx.py + parse_dionaea.py               │
      Flask SOC Dashboard ────────────────────────────┘
            │
            ▼
      Browser (Tailscale only)
      http://:5000

Three machines connected via Tailscale mesh VPN. The dashboard is **not** accessible from the public internet. Credentials are supplied via environment variables, never stored in source.

## Dashboard

The custom Flask dashboard provides **12 real-time intelligence panels** across three sections.

### Cowrie SSH Honeypot

| Panel | Description |
|-------|-------------|
| Alert Timeline | Time series with severity and MITRE ATT&CK overlay |
| Geographic Attack Map | Natural Earth world map with volume-scaled attack dots |
| Attack Chain Funnel | Kill-chain dropout: Connect → KEX → Login → Commands → Downloads |
| Attack Velocity | Real-time attacks/min with 60-min spark chart |
| Attack Heatmap | 14-day hour × day intensity grid |
| Botnet Fingerprints | Auto-detected campaigns with timelines and on-click AI analysis |
| Credential Intelligence | Success rates, coordinated-attack detection, botnet badges |
| Attacker Intelligence | Threat-scored attackers with full session breakdown |
| MITRE ATT&CK Framework | Dynamic tactic/technique mapping from live alert data |
| On-Demand AI Analysis | llama3.1:8b summary, full, and executive triage modes |

### Multi-Honeypot & Correlation (nginx + Dionaea)

| Panel | Description |
|-------|-------------|
| Dionaea Malware Capture | Captured binaries with **VirusTotal verdict, malware family, source attribution, file size**, service breakdown, and activity timeline |
| nginx Web Honeypot | Scanner fingerprints, CVE probe paths, user agents, request timeline |
| Cross-Honeypot Attackers | IPs seen attacking multiple honeypots simultaneously |
| **Threat Actor Correlation** | Unified per-IP profiles across all three honeypots, ranked by composite threat score; links SSH, web, and malware activity into single coordinated-actor views |

### Incident Management

Built-in case management (open/investigating/closed, severity, audit log), alert pivoting and search, and five response playbooks, all backed by a local SQLite store (`schema.sql`).

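The Threat Actor Correlation panel is described above as unified per-IP profiles ranked by a composite threat score. The scoring formula itself is not given in this README, so the following is only a minimal sketch under stated assumptions: the `(ip, honeypot)` event tuples, the `WEIGHTS` table, and the `correlate` function are all hypothetical, and the sample addresses come from documentation-reserved ranges rather than the dataset.

```python
from collections import defaultdict

# Hypothetical per-honeypot weights; the dashboard's real scoring is not documented here.
WEIGHTS = {"cowrie": 1.0, "nginx": 0.5, "dionaea": 3.0}


def correlate(events):
    """Group (ip, honeypot) events per IP and rank by an illustrative score."""
    counts = defaultdict(lambda: defaultdict(int))
    for ip, honeypot in events:
        counts[ip][honeypot] += 1

    profiles = []
    for ip, per_pot in counts.items():
        base = sum(WEIGHTS.get(pot, 1.0) * n for pot, n in per_pot.items())
        # Assumed bonus: activity on more honeypots multiplies the score.
        score = base * len(per_pot)
        profiles.append({"ip": ip, "honeypots": sorted(per_pot), "score": score})
    return sorted(profiles, key=lambda p: p["score"], reverse=True)


sample = [
    ("203.0.113.7", "cowrie"), ("203.0.113.7", "cowrie"),
    ("203.0.113.7", "dionaea"),
    ("198.51.100.4", "nginx"),
]
ranked = correlate(sample)
# IPs seen on more than one honeypot, highest score first.
cross_honeypot = [p["ip"] for p in ranked if len(p["honeypots"]) > 1]
```

Multiplying by the number of distinct honeypots is just one way to favor actors seen across several sensors; an additive bonus or a per-technique weighting would serve equally well as a starting point.
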
## Key Findings

### Live WannaCry Capture

The Dionaea honeypot captured **7 unique malware binaries** over SMB: **6 confirmed WannaCry ransomware variants** (59–66 of ~76 VirusTotal engines flagging each) plus one trojan downloader, delivered from source IPs in the United States, Thailand, Sri Lanka, and Vietnam. Each sample is SHA256-hashed, VirusTotal-verified, attributed to its source, and preserved in a permanent read-only archive. WannaCry still self-propagating over exposed SMB years after 2017 is a concrete demonstration of long-tail internet threat activity.

### The mdrfckr Botnet

The most sophisticated campaign observed. It installs a persistent SSH backdoor via `~/.ssh/authorized_keys` using a distinctive RSA key ending in `mdrfckr`. The implant also runs `chattr -ia`. Note that `-ia` clears the immutable and append-only attributes rather than setting them (locking the key against removal, even by root, would take `chattr +i`), so this step most likely unlocks existing files before the key is written. There were ~90,000 implant attempts in the collection window from hundreds of distributed IPs.

### The 345gs5662d34 Campaign

Massive credential stuffing using `root/345gs5662d34`, attempted **103,084 times from 357 unique IPs**, the largest single-credential effort in the dataset.

### nginx Web Honeypot

Requests probed hundreds of unique paths: IoT botnets, `.env` credential theft targeting SendGrid/Twilio API keys, Hikvision CVE-2021-36260 RCE probes, TP-Link firmware exploits (CVE-2021-22161), and Tomcat manager brute-force.

### Attack Scale

At peak, a single day saw **~2.8 million alerts**, driven by overlapping botnet campaigns. Across the 6-day capture the system averaged roughly 1.9 million alerts/day, on the order of 20+ attack events every second at sustained volume.

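The averages quoted under Attack Scale follow from the totals in the Live Stats table. As a quick check, here is the arithmetic, assuming the 6-day capture length stated in that section and taking 2.8 million as the approximate peak-day count; the variable names are only for illustration.

```python
TOTAL_ALERTS = 11_611_908      # total Wazuh alerts from the Live Stats table
CAPTURE_DAYS = 6               # capture length as stated under Attack Scale
PEAK_DAY_ALERTS = 2_800_000    # approximate peak-day figure (~2.8M)
SECONDS_PER_DAY = 24 * 60 * 60

avg_per_day = TOTAL_ALERTS / CAPTURE_DAYS
avg_per_second = avg_per_day / SECONDS_PER_DAY
peak_per_second = PEAK_DAY_ALERTS / SECONDS_PER_DAY

print(f"average: {avg_per_day:,.0f} alerts/day, {avg_per_second:.1f} alerts/s")
print(f"peak day: {peak_per_second:.1f} alerts/s")
```

This works out to about 1.94 million alerts per day and about 22 per second on average, rising to about 32 per second on the peak day, which matches the "roughly 1.9 million" and "20+" figures quoted.
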
## Repository Structure

    ai-soc-pipeline/
    ├── dashboard/
    │   ├── app.py                        # Flask backend, 16 API endpoints
    │   ├── schema.sql                    # Incident-management SQLite schema
    │   └── templates/
    │       └── index.html                # SOC dashboard frontend (12 panels)
    ├── pipeline/
    │   ├── parse_nginx.py                # nginx CLF → Wazuh JSON parser
    │   ├── parse_dionaea.py              # Dionaea SQLite → Wazuh JSON; SHA256 + VirusTotal + archive
    │   ├── sync_dionaea.sh               # VPS → SIEM sync (SQLite + binaries), then parse
    │   └── rebuild_geoip_cache.py        # MaxMind GeoLite2 cache refresh
    ├── triage/
    │   ├── ai_triage.py                  # LLM threat-analysis engine
    │   ├── alert_poller.py               # OpenSearch alert sampler
    │   └── triage_runner.py              # 30-min cron orchestrator
    ├── scripts/
    │   └── resolve_alert_ips.py          # GeoIP backfill for alert IPs
    ├── config/
    │   ├── soc-dashboard.service         # Dashboard systemd unit
    │   ├── dionaea-sync.service          # Dionaea sync+parse systemd unit
    │   ├── dionaea-sync.timer            # 15-min timer for the above
    │   ├── geoip-enrich.cron             # Hourly enrichment cron
    │   ├── wazuh-cowrie-rules.xml        # Cowrie detection rules (100100–100110)
    │   ├── wazuh-honeypot-web-rules.xml  # Dionaea + nginx rules (100200–100360)
    │   └── wazuh-ossec-snippet.xml       # Wazuh agent localfile config
    ├── docs/                             # 01–09 architecture, deployment, design, findings
    ├── data/samples/                     # Sample alert JSON for testing
    ├── requirements.txt
    └── README.md

## API Endpoints

| Endpoint | Description |
|----------|-------------|
| `GET /api/stats?minutes=N` | Full stats: timeline, countries, IPs, MITRE, credentials, commands |
| `GET /api/attack_chain?minutes=N` | Kill-chain funnel stage counts |
| `GET /api/velocity` | Real-time attacks/min + 60-min spark data |
| `GET /api/heatmap` | 14-day hour × day attack matrix |
| `GET /api/sessions?minutes=N` | Top sessions with full event chains |
| `GET /api/botnets?minutes=N` | Behavioral botnet fingerprints |
| `GET /api/intel?minutes=N` | Parallel: attack chain + sessions + botnets + cred_intel |
| `GET /api/cred_intel?minutes=N` | Credential success rates + coordination detection |
| `POST /api/botnet_analysis` | AI analysis of a specific campaign |
| `GET /api/dionaea?minutes=N` | Dionaea malware stats + VirusTotal-enriched binaries |
| `GET /api/nginx?minutes=N` | nginx web honeypot stats |
| `GET /api/honeypots?minutes=N` | Combined Dionaea + nginx (parallel) |
| `GET /api/threat_actors?minutes=N` | Cross-honeypot threat-actor correlation |
| `GET /api/alert/` | Full alert/context drawer for a source IP |
| `GET /api/search?q=&type=` | Pivot/search across IPs, credentials, commands |
| `GET /api/cases` · `POST /api/cases` · `PATCH /api/cases/` | Incident case management |
| `GET /api/playbooks` | Response playbooks |
| `GET /api/triage` · `POST /api/analysis/run` | AI triage report / on-demand analysis |

## Technology Stack

- **Honeypots:** Cowrie SSH/Telnet, nginx, Dionaea (Docker on DigitalOcean NYC1)
- **Transport:** rsync over Tailscale VPN (15-min intervals)
- **Enrichment:** Python + MaxMind GeoLite2 (City + ASN) + VirusTotal API (hash-only)
- **Log parsers:** custom Python for nginx CLF and Dionaea SQLite
- **SIEM:** Wazuh 4.x + OpenSearch (Ubuntu Server, aarch64)
- **Backend:** Python 3, Flask (minimal dependencies: `flask`, `geoip2`)
- **Frontend:** vanilla HTML/CSS/JS, HTML5 Canvas, Natural Earth 50m geodata
- **AI:** Ollama + llama3.1:8b on NVIDIA RTX 4070 (fully local inference)
- **Network:** Tailscale mesh VPN (no public dashboard exposure)
- **Secrets:** environment-variable based; no credentials in source

## Setup

See `docs/02-wazuh-installation.md` for full deployment instructions. High-level steps:

1. Deploy Cowrie, nginx, and Dionaea on a VPS (Docker Compose)
2. Install Wazuh all-in-one on your SIEM server
3. Configure key-based sync from VPS → SIEM server via Tailscale
4. Deploy `pipeline/parse_nginx.py` (cron) and `pipeline/sync_dionaea.sh` (systemd timer)
5. Set up GeoIP enrichment cron (`config/geoip-enrich.cron`)
6. Add Wazuh rules (`config/wazuh-cowrie-rules.xml`, `config/wazuh-honeypot-web-rules.xml`)
7. Deploy the Flask dashboard (`config/soc-dashboard.service`); set `OPENSEARCH_PASS` in the unit
8. (Optional) set `VT_API_KEY` in `config/dionaea-sync.service` to enable VirusTotal enrichment
9. Install Ollama and pull `llama3.1:8b` on your AI inference machine

## Documentation

Full project documentation is in `docs/`:

1. **Architecture:** system design, data flow, infrastructure
2. **Wazuh Installation:** SIEM deployment and configuration
3. **Log Ingestion:** pipeline from honeypot to SIEM
4. **Custom Rules:** decoders, rules, MITRE mapping
5. **AI Triage Design:** LLM integration and prompt engineering
6. **Dashboard Guide:** panel reference and interpretation
7. **Alert Samples:** real attack session analysis
8. **Lessons Learned:** technical retrospective (incl. the Dionaea schema bug and secrets-handling migration)
9. **Executive Summary:** CISO-level findings and significance

*Built as Project 4 of 4 for a CMU MSISPM application portfolio. All data collected from real internet attack traffic on infrastructure owned and operated by the author.*