Madathanapalleleena/Bda_threat_sig

GitHub: Madathanapalleleena/Bda_threat_sig

# ThreatSig: a Real-Time Threat Intelligence Dashboard

A Big Data Analytics (BDA) project that detects, classifies, and visualizes cyber threats in real time using **PySpark MLlib**, **FastAPI**, **Kafka**, and live threat intelligence APIs.

## Features

- **Live threat stream**: WebSocket-based event feed with simulated Kafka pipeline
- **IP reputation check**: AbuseIPDB API (threat score, country, ISP, TOR/VPN detection)
- **Email breach check**: XposedOrNot API (breach count, affected services)
- **Geo intelligence**: IP geolocation + ASN risk scoring (ip-api.com / ipinfo.io)
- **PySpark ML pipelines**:
  - KMeans clustering (Attacker / Scanner / Spammer / Clean)
  - Random Forest classification (threat tier prediction)
  - Anomaly detection (distance from cluster center)
  - Linear regression (score trend projection)
  - Feature assembly with VectorAssembler and StandardScaler

## Tech Stack

| Layer | Technology |
|---|---|
| Backend | FastAPI, Uvicorn, Python 3.10+ |
| Big Data / ML | PySpark 3.4, scikit-learn, pandas, NumPy |
| Streaming | WebSocket, aiokafka (optional) |
| Threat Intel | AbuseIPDB API, XposedOrNot API |
| Geo Intel | ip-api.com, ipinfo.io |
| Container | Docker + Docker Compose (Kafka/Zookeeper) |

## Project Structure

```
bda/
├── main.py            # FastAPI app, REST + WebSocket endpoints, Kafka integration
├── ml_engine.py       # PySpark MLlib pipelines (KMeans, RF, regression)
├── geo_intel.py       # IP geolocation + ASN risk enrichment
├── requirements.txt
├── commands.txt       # Quick-start run guide
└── .env               # API keys (do NOT commit)
```

## Setup

### Prerequisites

- Python 3.10+
- Java 8+ (required for PySpark); set `JAVA_HOME`
- Docker (only for Kafka mode)

### Install dependencies

```bash
pip install -r requirements.txt
```

### Configure environment

Create a `.env` file:

```
ABUSEIPDB_API_KEY=your_free_key_here
```

Get a free key at [abuseipdb.com](https://www.abuseipdb.com) (1000 checks/day, no cost).
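How `main.py` actually reads the key is not shown in this README, so the following is only a minimal sketch of one way to pick up `ABUSEIPDB_API_KEY`, using the standard library alone. The helper names `load_env_file` and `get_abuseipdb_key` are hypothetical and are not taken from the project.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict.

    Hypothetical helper: blank lines, comment lines, and lines without
    '=' are skipped; surrounding quotes around a value are removed.
    """
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_abuseipdb_key(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    return os.environ.get("ABUSEIPDB_API_KEY") or load_env_file(path).get("ABUSEIPDB_API_KEY")
```

Checking the process environment first means a key injected by Docker or a CI runner takes precedence over a local `.env` file, which keeps the file optional in deployed setups.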
## Running

### Option A: Without Kafka (simplest)

```bash
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

Open `index.html` in your browser or visit `http://localhost:8000`. Stream events appear automatically within ~5 seconds.

### Option B: With Kafka (full pipeline)

```bash
# 1. Start Kafka + Zookeeper
docker-compose up -d

# 2. Start backend
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/check/ip/{ip}` | IP reputation scan (AbuseIPDB + ML) |
| GET | `/api/check/email/{email}` | Email breach lookup (XposedOrNot) |
| GET | `/api/stream/events` | Recent threat events (REST fallback) |
| GET | `/api/stats` | Live stats and threat distribution |
| GET | `/api/ml/anomaly` | KMeans anomaly detection on live events |
| GET | `/api/ml/analytics` | Full ML batch report (all pipelines) |
| GET | `/api/ml/trend` | Linear regression score trend |
| GET | `/health` | Health check (API keys, ML, Spark status) |
| WS | `/ws/stream` | WebSocket live threat feed |

### Quick test scans

```bash
# Threat levels
curl http://localhost:8000/api/check/ip/9.9.9.9   # Critical
curl http://localhost:8000/api/check/ip/7.7.7.7   # High
curl http://localhost:8000/api/check/ip/8.8.8.8   # Medium

# Email breach
curl "http://localhost:8000/api/check/email/test@yahoo.com"

# ML reports
curl http://localhost:8000/api/ml/analytics
curl http://localhost:8000/api/ml/trend
```

## Threat Score Levels

| Score | Level |
|---|---|
| 85 – 100 | Critical |
| 60 – 84 | High |
| 35 – 59 | Medium |
| 0 – 34 | Low |

## Notes

- Kafka is **optional**: the app runs a built-in stream simulator if Kafka is unavailable.
- PySpark is **optional**: the ML engine gracefully degrades to scikit-learn if Java/Spark is not available.
- Never commit your `.env` file; add it to `.gitignore`.
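As an illustration of the bands in the Threat Score Levels table, the mapping from a score to a level could be written as below. This is a sketch, not the project's code: the function name `threat_level` is hypothetical, and treating fractional scores by lower bound (so 84.5 counts as High) is an assumption the table does not settle.

```python
def threat_level(score):
    """Map a 0-100 threat score to the level named in the README table.

    Assumption: each band is closed at its lower bound, so a fractional
    score such as 84.5 falls into the band that starts at 60.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 85:
        return "Critical"
    if score >= 60:
        return "High"
    if score >= 35:
        return "Medium"
    return "Low"
```

For example, `threat_level(72)` returns `"High"` and `threat_level(34)` returns `"Low"`.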