prattikkk/Phishing-Detection-and-Incident-Response-Lab

GitHub: prattikkk/Phishing-Detection-and-Incident-Response-Lab

# Phishing Detection and Incident Response Lab

SOC-oriented project that detects phishing attempts from email and URL intelligence, generates analyst-ready incident outputs, and maps findings to MITRE ATT&CK.

## Portfolio Highlights

- End-to-end pipeline from data preparation to response reporting.
- Explainable rule engine with optional ML augmentation.
- Automatic alert generation with risk score, severity, and response guidance.
- ATT&CK mapping for SOC documentation quality.
- Evaluated on 70,241 records combined from CEAS-08 and webpage intelligence data.

## Architecture

    flowchart LR
        A[Raw Datasets\nEmails + URLs] --> B[Data Normalization\nsrc/prepare_data.py]
        B --> C[Indicator Engine\nsrc/indicators.py]
        B --> D[Optional ML Scoring\nsrc/ml_classifier.py]
        C --> E[Risk Blending\nsrc/pipeline.py]
        D --> E
        E --> F[Alert Builder\nsrc/response.py]
        F --> G[MITRE Mapping\nsrc/mitre_mapping.py]
        G --> H[Outputs\ndetections.csv\nalerts.json\nresponse_report.md]

## Measured Results

External combined dataset summary:

- Total rows: 70,241
- Safe: 47,312
- Phishing: 22,929

Performance on labeled external data:

| Run | Threshold | Precision | Recall | F1 | Accuracy |
|---|---:|---:|---:|---:|---:|
| external | 55 | 0.9655 | 0.0024 | 0.0049 | 0.6743 |
| external_t30 | 30 | 0.9153 | 0.9208 | 0.9181 | 0.9463 |

Takeaway:

- Threshold 55 is very strict and prioritizes precision over recall.
- Threshold 30 gives balanced production-style triage performance for this dataset.

## Tech Stack

- Python
- pandas
- scikit-learn (optional, auto fallback to internal Naive Bayes)

## Quick Start (PowerShell)

1. Create and activate virtual environment:

       python -m venv .venv
       .\.venv\Scripts\Activate.ps1

2. Install dependencies:

       pip install -r requirements.txt

3. Run rules + ML on starter sample:

       python -m src.pipeline --input data/sample_emails.csv --output-dir reports --threshold 55 --use-ml

## Use External Datasets

Normalize and combine CEAS + webpage datasets:

    python -m src.prepare_data --ceas-path "CEAS_08.csv/CEAS_08.csv" --web-path "Webpages_Classification_test_data.csv/Webpages_Classification_test_data.csv" --output data/external_combined.csv --max-web-good 30000

Run pipeline with stricter threshold:

    python -m src.pipeline --input data/external_combined.csv --output-dir reports/external --threshold 55 --use-ml

Run pipeline with balanced threshold:

    python -m src.pipeline --input data/external_combined.csv --output-dir reports/external_t30 --threshold 30 --use-ml

Evaluate detection quality:

    python -m src.evaluate --detections reports/external_t30/detections.csv --run-name external_t30 --output-markdown reports/external_t30/evaluation.md

## Outputs

- `reports/*/detections.csv`: row-level risk and classification
- `reports/*/alerts.json`: SOC alert artifacts
- `reports/*/response_report.md`: human-readable incident summary
- `reports/*/evaluation.md`: quantitative detection metrics

## SOC-Oriented Skills Demonstrated

- IOC-driven detection engineering
- Rule tuning and threshold calibration
- ML-assisted security triage
- Analyst handoff report writing
- ATT&CK-aligned incident documentation

## Project Structure

- `src/pipeline.py`: orchestrates full workflow
- `src/prepare_data.py`: normalizes external datasets to lab schema
- `src/indicators.py`: phishing heuristic engine
- `src/ml_classifier.py`: ML scoring (sklearn or fallback)
- `src/response.py`: alert and response artifact generation
- `src/mitre_mapping.py`: ATT&CK technique mapping
- `src/evaluate.py`: confusion-matrix and quality metrics
- `docs/lab_guide.md`: guided learning walkthrough

## Notes

- This project is for education and portfolio demonstration.
- It is not a substitute for enterprise-grade email security controls.
- Generated reports and large raw datasets are excluded from git for clean repository hygiene.
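
The architecture blends an indicator-engine score with an optional ML score and compares the result against `--threshold`. This README does not show the actual blending formula, so the following is only a minimal sketch under stated assumptions: a rule score on a 0-100 scale, an ML probability in [0, 1], and a fixed weight. The names `blend_risk`, `ml_weight`, and `is_phishing` are hypothetical and are not taken from `src/pipeline.py`.

```python
from typing import Optional


def blend_risk(rule_score: float,
               ml_probability: Optional[float] = None,
               ml_weight: float = 0.4) -> float:
    """Combine a 0-100 rule score with an optional ML probability (0-1).

    Hypothetical weighting, not the repository's formula. When no ML
    score is available, the clamped rule score is returned unchanged.
    """
    rule_score = max(0.0, min(100.0, rule_score))
    if ml_probability is None:
        return rule_score
    # Rescale the probability to the same 0-100 range as the rule score.
    ml_score = 100.0 * max(0.0, min(1.0, ml_probability))
    return (1.0 - ml_weight) * rule_score + ml_weight * ml_score


def is_phishing(risk: float, threshold: float) -> bool:
    """Flag a record when its blended risk meets or exceeds the threshold."""
    return risk >= threshold


# A record with moderate rule evidence and a moderate ML probability.
risk = blend_risk(rule_score=35.0, ml_probability=0.6)  # about 45
flag_strict = is_phishing(risk, threshold=55)    # not flagged
flag_balanced = is_phishing(risk, threshold=30)  # flagged
```

Under these assumptions, the same record is missed at threshold 55 and caught at threshold 30, which is the kind of trade-off the two evaluation runs compare.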
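
The Measured Results table reports precision, recall, F1, and accuracy. As a reminder of how those values follow from confusion-matrix counts, here is a small sketch. `detection_metrics` is a hypothetical helper, not the interface of `src/evaluate.py`, and the first set of counts is made up purely for illustration. The second call uses only the class totals reported in this README to show the accuracy of a detector that flags nothing.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary-classification metrics from raw counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Illustrative counts only (not results from this project).
example = detection_metrics(tp=80, fp=10, fn=20, tn=890)

# Class totals from the dataset summary: a detector that never alerts
# gets every safe row right and every phishing row wrong.
SAFE, PHISHING = 47_312, 22_929
baseline = detection_metrics(tp=0, fp=0, fn=PHISHING, tn=SAFE)
# baseline["accuracy"] equals SAFE / (SAFE + PHISHING), roughly 0.6736
```

That never-alert baseline is close to the 0.6743 accuracy reported at threshold 55, which is consistent with the very low recall in that row: on an imbalanced dataset, accuracy alone can hide missed phishing, so recall and F1 are the more informative columns when comparing thresholds.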