prattikkk/Phishing-Detection-and-Incident-Response-Lab
# Phishing Detection and Incident Response Lab
SOC-oriented project that detects phishing attempts from email and URL intelligence, generates analyst-ready incident outputs, and maps findings to MITRE ATT&CK.
## Portfolio Highlights
- End-to-end pipeline from data preparation to response reporting.
- Explainable rule engine with optional ML augmentation.
- Automatic alert generation with risk score, severity, and response guidance.
- ATT&CK mapping for SOC documentation quality.
- Evaluated on 70,241 records combined from CEAS-08 and webpage intelligence data.
## Architecture
```mermaid
flowchart LR
A[Raw Datasets\nEmails + URLs] --> B[Data Normalization\nsrc/prepare_data.py]
B --> C[Indicator Engine\nsrc/indicators.py]
B --> D[Optional ML Scoring\nsrc/ml_classifier.py]
C --> E[Risk Blending\nsrc/pipeline.py]
D --> E
E --> F[Alert Builder\nsrc/response.py]
F --> G[MITRE Mapping\nsrc/mitre_mapping.py]
G --> H[Outputs\ndetections.csv\nalerts.json\nresponse_report.md]
```
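The risk blending step in the flow above could look roughly like the sketch below. It is an illustration only: the function names, the 0-100 rule score, the 0-1 ML probability, and the `ml_weight` value are assumptions, not the logic in `src/pipeline.py`.

```python
from typing import Optional


def blend_risk(rule_score: float,
               ml_probability: Optional[float] = None,
               ml_weight: float = 0.4) -> float:
    """Blend a 0-100 rule score with an optional 0-1 ML probability.

    ml_weight is a hypothetical tuning knob, not a value taken from the lab.
    """
    rule_score = max(0.0, min(100.0, rule_score))
    if ml_probability is None:
        # ML scoring is optional, so the rule score stands alone.
        return rule_score
    ml_score = 100.0 * max(0.0, min(1.0, ml_probability))
    return (1.0 - ml_weight) * rule_score + ml_weight * ml_score


def classify(risk: float, threshold: float = 55.0) -> str:
    """Label a record as phishing when its risk reaches the threshold."""
    return "phishing" if risk >= threshold else "safe"


# The same blended risk can land on either side of the two thresholds
# used in this README (55 and 30).
risk = blend_risk(rule_score=40.0, ml_probability=0.5, ml_weight=0.5)
print(risk, classify(risk, 55), classify(risk, 30))
```

A weighted average keeps the rule engine explainable while letting the ML score nudge borderline cases; other blending schemes (for example, taking the maximum) would be equally plausible.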
## Measured Results
External combined dataset summary:
- Total rows: 70,241
- Safe: 47,312
- Phishing: 22,929
Performance on labeled external data:
| Run | Threshold | Precision | Recall | F1 | Accuracy |
|---|---:|---:|---:|---:|---:|
| external | 55 | 0.9655 | 0.0024 | 0.0049 | 0.6743 |
| external_t30 | 30 | 0.9153 | 0.9208 | 0.9181 | 0.9463 |
Takeaway:
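- Threshold 55 is very strict: precision is 0.9655 but recall is only 0.0024, so almost all of the 22,929 phishing records go undetected.
- Threshold 30 gives balanced, production-style triage performance on this dataset (precision 0.9153, recall 0.9208).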
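To make the table easier to interpret, the sketch below computes the four metrics from confusion-matrix counts and derives the majority-class baseline from the dataset summary. The function name and the toy counts are illustrative assumptions, not the code in `src/evaluate.py`; only the 47,312 and 70,241 figures come from this README.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy counts, purely for illustration.
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))

# Majority-class baseline: label every row "safe".
# 47,312 safe rows out of 70,241 total (from the dataset summary).
baseline_accuracy = 47312 / 70241
print(round(baseline_accuracy, 4))
```

The baseline of about 0.6736 sits very close to the 0.6743 accuracy reported at threshold 55, which is why recall and F1 are the more informative columns for that run.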
## Tech Stack
- Python
- pandas
- scikit-learn (optional; falls back automatically to an internal Naive Bayes classifier)
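The optional-dependency pattern behind the scikit-learn fallback might be sketched as follows. `TinyNaiveBayes`, `HAVE_SKLEARN`, and the training strings are hypothetical; the lab's own fallback in `src/ml_classifier.py` may be structured differently.

```python
import math
from collections import Counter, defaultdict

try:
    # Real scikit-learn class; only imported to detect availability here.
    from sklearn.naive_bayes import MultinomialNB  # noqa: F401
    HAVE_SKLEARN = True
except ImportError:
    HAVE_SKLEARN = False


class TinyNaiveBayes:
    """Pure-Python multinomial Naive Bayes with Laplace smoothing (sketch)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(text.lower().split())
        self.vocab = {tok for c in self.token_counts.values() for tok in c}
        return self

    def phishing_probability(self, text):
        """Return P(label == 1 | text), where label 1 means phishing."""
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)
            for tok in text.lower().split():
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Normalize log scores into probabilities in a numerically stable way.
        peak = max(log_scores.values())
        exp = {k: math.exp(v - peak) for k, v in log_scores.items()}
        return exp.get(1, 0.0) / sum(exp.values())


model = TinyNaiveBayes().fit(
    ["verify your account now", "urgent password reset required",
     "team meeting agenda for monday", "lunch menu for friday"],
    [1, 1, 0, 0],
)
print("sklearn available:", HAVE_SKLEARN)
print(model.phishing_probability("verify your password"))
```

Guarding the import keeps the pipeline runnable on a bare Python install, at the cost of a simpler model when scikit-learn is missing.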
## Quick Start (PowerShell)
1. Create and activate virtual environment:

   ```powershell
   python -m venv .venv
   .\.venv\Scripts\Activate.ps1
   ```

2. Install dependencies:

   ```powershell
   pip install -r requirements.txt
   ```

3. Run rules + ML on starter sample:

   ```powershell
   python -m src.pipeline --input data/sample_emails.csv --output-dir reports --threshold 55 --use-ml
   ```
## Use External Datasets
Normalize and combine CEAS + webpage datasets:

```powershell
python -m src.prepare_data --ceas-path "CEAS_08.csv/CEAS_08.csv" --web-path "Webpages_Classification_test_data.csv/Webpages_Classification_test_data.csv" --output data/external_combined.csv --max-web-good 30000
```

Run pipeline with stricter threshold:

```powershell
python -m src.pipeline --input data/external_combined.csv --output-dir reports/external --threshold 55 --use-ml
```

Run pipeline with balanced threshold:

```powershell
python -m src.pipeline --input data/external_combined.csv --output-dir reports/external_t30 --threshold 30 --use-ml
```

Evaluate detection quality:

```powershell
python -m src.evaluate --detections reports/external_t30/detections.csv --run-name external_t30 --output-markdown reports/external_t30/evaluation.md
```
## Outputs
- reports/*/detections.csv: row-level risk and classification
- reports/*/alerts.json: SOC alert artifacts
- reports/*/response_report.md: human-readable incident summary
- reports/*/evaluation.md: quantitative detection metrics
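As a rough picture of what an alert artifact could contain, the sketch below turns a risk score into a severity and response guidance and serializes the result as JSON. The field names, severity bands, and guidance text are all assumptions for illustration; the actual schema of `alerts.json` is defined by `src/response.py`.

```python
import json

# Hypothetical severity bands (minimum risk score, label), highest first.
SEVERITY_BANDS = [(80, "critical"), (60, "high"), (30, "medium")]

# Hypothetical response guidance per severity.
GUIDANCE = {
    "critical": "Quarantine the message, block the sender and URL, and reset exposed credentials.",
    "high": "Quarantine the message and escalate to an analyst for review.",
    "medium": "Queue for analyst triage and monitor related activity.",
    "low": "Log for trend analysis; no immediate action.",
}


def severity_for(risk_score: float) -> str:
    """Return the first band whose floor the score reaches, else 'low'."""
    for floor, name in SEVERITY_BANDS:
        if risk_score >= floor:
            return name
    return "low"


def build_alert(record_id: str, risk_score: float, indicators) -> dict:
    """Assemble one JSON-serializable alert record."""
    severity = severity_for(risk_score)
    return {
        "record_id": record_id,
        "risk_score": round(risk_score, 1),
        "severity": severity,
        "indicators": list(indicators),
        "response_guidance": GUIDANCE[severity],
    }


alert = build_alert("msg-001", 72.5, ["ip_address_host"])
print(json.dumps(alert, indent=2))
```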
## SOC-Oriented Skills Demonstrated
- IOC-driven detection engineering
- Rule tuning and threshold calibration
- ML-assisted security triage
- Analyst handoff report writing
- ATT&CK-aligned incident documentation
## Project Structure
- src/pipeline.py: orchestrates full workflow
- src/prepare_data.py: normalizes external datasets to lab schema
- src/indicators.py: phishing heuristic engine
- src/ml_classifier.py: ML scoring (sklearn or fallback)
- src/response.py: alert and response artifact generation
- src/mitre_mapping.py: ATT&CK technique mapping
- src/evaluate.py: confusion-matrix and quality metrics
- docs/lab_guide.md: guided learning walkthrough
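To give a feel for how the heuristic engine and the ATT&CK mapping fit together, here is a minimal sketch. The indicator names, the urgency terms, and the choice of which indicator maps to which technique are assumptions; the technique IDs themselves (T1566 Phishing, T1566.002 Spearphishing Link) are real ATT&CK entries. The rules in `src/indicators.py` and `src/mitre_mapping.py` may differ.

```python
import re
from urllib.parse import urlparse

# Hypothetical urgency phrases often seen in phishing lures.
URGENCY_TERMS = ("urgent", "verify your account", "password expires", "suspended")

# Indicator -> (technique ID, technique name). The pairing is an assumption.
ATTACK_MAP = {
    "urgency_language": ("T1566", "Phishing"),
    "ip_address_host": ("T1566.002", "Phishing: Spearphishing Link"),
    "userinfo_in_url": ("T1566.002", "Phishing: Spearphishing Link"),
    "punycode_host": ("T1566.002", "Phishing: Spearphishing Link"),
}


def find_indicators(body: str, url: str = "") -> list:
    """Return the names of the heuristics that fire for one record."""
    hits = []
    lowered = body.lower()
    if any(term in lowered for term in URGENCY_TERMS):
        hits.append("urgency_language")
    if url:
        parsed = urlparse(url)
        host = parsed.hostname or ""
        # A raw IPv4 address in place of a domain name.
        if re.fullmatch(r"\d{1,3}(?:\.\d{1,3}){3}", host):
            hits.append("ip_address_host")
        # "user@host" URLs can disguise the real destination.
        if parsed.username is not None:
            hits.append("userinfo_in_url")
        # Punycode labels may indicate look-alike (homoglyph) domains.
        if any(label.startswith("xn--") for label in host.split(".")):
            hits.append("punycode_host")
    return hits


def map_to_attack(indicators) -> list:
    """Map indicator names to unique ATT&CK techniques, preserving order."""
    techniques = []
    for name in indicators:
        technique = ATTACK_MAP.get(name)
        if technique and technique not in techniques:
            techniques.append(technique)
    return techniques


hits = find_indicators("URGENT: verify your account", "http://192.168.1.10/login")
print(hits, map_to_attack(hits))
```

Keeping each heuristic as a named indicator is what makes the rule engine explainable: every alert can list exactly which checks fired and which techniques they suggest.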
## Notes
- This project is for education and portfolio demonstration.
- It is not a substitute for enterprise-grade email security controls.
- Generated reports and large raw datasets are excluded from git to keep the repository clean.