Joshua-Terna/phishing-detection-system
# Phishing Detector
A local phishing detection workspace with:
- URL phishing model using feature extraction + RandomForest
- Email phishing model using DistilBERT
- Streamlit UI for URL and email scanning
- Optional WHOIS, SSL, Google Safe Browsing, and VirusTotal checks
- Local SQLite scan history with hashed inputs
- Risk tiers, sensitivity controls, local-only mode, and local feedback capture
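Storing hashed inputs means the scan history never holds a raw URL or email body. A minimal sketch of that idea, assuming SHA-256 and a hypothetical `scans` table; the schema, column names, and helper functions here are illustrative and may not match what `storage.py` actually does:

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def hash_input(text: str) -> str:
    """Return a SHA-256 hex digest so the raw input is never written to disk."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema; column names are assumptions for illustration.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS scans (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            input_hash TEXT NOT NULL,
            kind TEXT NOT NULL,
            risk_score REAL NOT NULL,
            scanned_at TEXT NOT NULL
        )
        """
    )


def record_scan(conn: sqlite3.Connection, text: str, kind: str, risk_score: float) -> str:
    """Store only the digest, the scan type, the score, and a UTC timestamp."""
    digest = hash_input(text)
    conn.execute(
        "INSERT INTO scans (input_hash, kind, risk_score, scanned_at) VALUES (?, ?, ?, ?)",
        (digest, kind, risk_score, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return digest


# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")
init_db(conn)
digest = record_scan(conn, "http://example.com/login", "url", 0.87)
rows = conn.execute("SELECT input_hash, kind, risk_score FROM scans").fetchall()
```

Because the digest is deterministic, repeat scans of the same input can still be matched in the history without keeping the original text.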
## Setup
1. Create and activate your Python environment.
2. Install dependencies: `pip install -r requirements.txt`
3. Install development tools for tests and formatting (optional but recommended): `pip install -r requirements-dev.txt`
## Local secrets
The app supports local API key configuration from a `.env` file or from OS environment variables.
Create a `.env` file in the project root:
GOOGLE_SAFE_BROWSING_API_KEY=your_google_safe_browsing_api_key
VIRUSTOTAL_API_KEY=your_virustotal_api_key
The file `.env` is ignored by git, and `.env.example` remains available as a safe template for collaborators.
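One way to resolve a key from either source is sketched below using only the standard library. It assumes OS environment variables take precedence over `.env`, which is a guess about the app's behavior; `parse_env_lines` and `get_api_key` are hypothetical helpers, and the project may use a library such as `python-dotenv` instead.

```python
import os
from pathlib import Path


def parse_env_lines(lines):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(name, env_file=".env", environ=None):
    """Return the key from the OS environment, else from .env, else None."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name]
    path = Path(env_file)
    if path.is_file():
        file_values = parse_env_lines(path.read_text(encoding="utf-8").splitlines())
        return file_values.get(name) or None
    return None
```

A caller would treat `None` as "this reputation check is disabled", for example `get_api_key("VIRUSTOTAL_API_KEY")`.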
## Running the app
Run `streamlit run app.py`. On Windows you can also run `.\scripts\run_app.ps1`.
## Developer commands
- Install dependencies: `pip install -r requirements.txt`
- Run unit tests: `pytest`
- Run smoke test: `python smoke_test.py`
- Evaluate URL model: `python evaluate_url_model.py`
- Evaluate email model: `python evaluate_email_model.py`
- Check Python syntax: `python -m py_compile app.py analyzers.py config.py email_headers.py feature_extractor.py model_loader.py model_metadata.py reputation.py risk.py storage.py trust_signals.py train_url_model.py train_email_model.py evaluate_url_model.py evaluate_email_model.py smoke_test.py`
- Format code: `black .` (use `black --check .` to report formatting issues without rewriting files)
- Check linting: `ruff check .`
- Run all local checks on Windows: `.\scripts\run_checks.ps1`
VS Code tasks are available in `.vscode/tasks.json`.
## Optional API keys
These environment variables are supported:
- `GOOGLE_SAFE_BROWSING_API_KEY`
- `VIRUSTOTAL_API_KEY`
If keys are missing, reputation checks are disabled and the app still works for URL/email scanning.
VirusTotal checks first try to read an existing URL verdict. If the URL is unknown, the app submits it for analysis and reports the verdict as pending when VirusTotal has not finished processing it yet.
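The lookup-then-submit flow can be sketched as follows. VirusTotal's v3 API identifies a URL by its unpadded URL-safe base64 encoding, which `vt_url_id` computes. The HTTP layer is left out: `lookup` and `submit` are hypothetical callables standing in for the real requests, and the returned dictionary shape is an assumption rather than the app's actual format.

```python
import base64


def vt_url_id(url: str) -> str:
    """VirusTotal API v3 URL identifier: URL-safe base64 without '=' padding."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii").rstrip("=")


def check_url(url, lookup, submit):
    """Return an existing verdict, or submit the URL and report it as pending.

    lookup(url_id) -> analysis stats dict, or None if the URL is unknown.
    submit(url)    -> queues the URL for analysis.
    """
    stats = lookup(vt_url_id(url))
    if stats is None:
        submit(url)
        return {"status": "pending", "flagged": None}
    flagged = stats.get("malicious", 0) + stats.get("suspicious", 0)
    return {"status": "malicious" if flagged > 0 else "clean", "flagged": flagged}


# In-memory stand-ins for the API so the flow can be exercised offline.
known = {
    vt_url_id("http://bad.example/login"): {"malicious": 5, "suspicious": 1, "harmless": 60}
}
submitted = []

first = check_url("http://bad.example/login", known.get, submitted.append)
second = check_url("http://new.example/", known.get, submitted.append)
```

Injecting the two callables keeps the decision logic testable without an API key or network access.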
## Model quality
The training scripts write model metadata JSON files with dataset size, features, metrics, and training time. The evaluation scripts write JSON reports with:
- accuracy, precision, recall, F1, ROC-AUC
- confusion matrix
- classification report
- false positive examples
- false negative examples
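Accuracy, precision, recall, and F1 all follow from the confusion matrix, so a report can be sanity-checked by hand. A small sketch with made-up counts, assuming phishing is the positive class (ROC-AUC needs the per-sample scores and cannot be recovered from these four numbers):

```python
def metrics_from_confusion(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive threshold metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


# Hypothetical counts, not results from this project's models.
report = metrics_from_confusion(tp=90, fp=10, fn=30, tn=870)
```

In this made-up case accuracy looks high because legitimate samples dominate, while recall shows that a quarter of the phishing samples were missed.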
The URL training script uses a calibrated RandomForest so that its scores behave more like probabilities than raw forest outputs, which makes score thresholds easier to interpret.
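A minimal sketch of calibrating a forest with scikit-learn's `CalibratedClassifierCV`. The synthetic data, the `sigmoid` method, and the 3-fold setting are assumptions chosen for illustration; `train_url_model.py` may use different settings and real URL features.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for extracted URL features (not the project's dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Wrap the forest so its outputs are mapped to calibrated probabilities
# using cross-validated fits on the training data.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
calibrated = CalibratedClassifierCV(forest, method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

# Column 1 holds the probability of the positive (phishing) class.
phishing_scores = calibrated.predict_proba(X_test)[:, 1]
```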
## UI features
- `Strict`, `Balanced`, and `Lenient` sensitivity levels
- `Local-only mode` to disable WHOIS, SSL, Google Safe Browsing, and VirusTotal
- URL and email sample cases for demos
- `.eml`, `.html`, and `.txt` upload support
- Email header anomaly checks
- Risk score breakdowns
- Feedback buttons for correct, false positive, and false negative results
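As an illustration of what a header anomaly check can look like, the sketch below compares the `From` domain with the `Reply-To` and `Return-Path` domains using the standard library's `email` package. The specific rules and the `header_anomalies` function are assumptions; the checks in `email_headers.py` may differ.

```python
from email import message_from_string
from email.utils import parseaddr


def _domain(address: str) -> str:
    """Return the lowercased domain of an address, or '' if there is none."""
    return address.rsplit("@", 1)[-1].lower() if "@" in address else ""


def header_anomalies(raw_email: str) -> list:
    """Flag simple sender mismatches often seen in phishing emails."""
    msg = message_from_string(raw_email)
    from_domain = _domain(parseaddr(msg.get("From", ""))[1])
    reply_domain = _domain(parseaddr(msg.get("Reply-To", ""))[1])
    return_domain = _domain(parseaddr(msg.get("Return-Path", ""))[1])

    findings = []
    if not from_domain:
        findings.append("missing or unparseable From address")
    if from_domain and reply_domain and reply_domain != from_domain:
        findings.append("Reply-To domain differs from From domain")
    if from_domain and return_domain and return_domain != from_domain:
        findings.append("Return-Path domain differs from From domain")
    return findings


sample = (
    "From: Support <support@bank.example>\n"
    "Reply-To: helpdesk@attacker.test\n"
    "Return-Path: <bounce@bank.example>\n"
    "Subject: Verify your account\n"
    "\n"
    "Click the link below.\n"
)
findings = header_anomalies(sample)
```

Legitimate bulk mail sometimes uses a different `Return-Path` domain, so findings like these are better treated as risk signals than as verdicts.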
## Generated artifacts
The trained URL model, DistilBERT checkpoints, model metadata, evaluation reports, local database, virtual environment, and cache directories are ignored by git. If these were already tracked, remove them from the git index once:
`git rm -r --cached venv __pycache__ .pytest_cache .ruff_cache phishguard_logs.db phishing_model.pkl distilbert_email_model url_model_metadata.json email_model_metadata.json url_model_evaluation.json email_model_evaluation.json`
Then regenerate or restore the model files locally before running the app.
## Smoke test
Run the basic smoke test to verify imports and core functions:
`python smoke_test.py`
## Continuous integration
This repository includes `.github/workflows/ci.yml`.
The CI workflow installs dependencies, runs syntax checks and unit tests, checks formatting with `black`, lints with `ruff`, and executes `smoke_test.py`.