VishnuSas21/fraud-detection-system

GitHub: VishnuSas21/fraud-detection-system

Stars: 0 | Forks: 0

# Fraud Detection System — SQL Rule Engine + XGBoost ML Model A production-inspired fraud detection system built on 50,000 synthetic transactions. Combines a **SQL-based rule engine** with an **XGBoost machine learning model** to demonstrate how real fraud teams use rules and ML together — not as alternatives, but as layers. ## Project Structure fraud-detection-system/ ├── sql/ │ ├── 01_exploratory_analysis.sql │ ├── 02_velocity_bin_geo.sql │ ├── 03_ato_merchant_structuring.sql │ └── 04_hybrid_decision_engine.sql ├── python/ │ └── fraud_detection_xgboost.ipynb ├── data/ │ └── fraud_db_setup.sql ├── docs/ │ └── project_findings.md └── README.md ## Business Problem Financial platforms lose billions annually to fraud. Two failure modes exist: - **Miss fraud** → direct financial loss, chargebacks, regulatory penalties - **Block legitimate customers** → lost revenue, customer churn, support overhead This project builds a system that minimises both — catching fraud with high precision while keeping false positive rates at an acceptable level for operations teams. ## Dataset - **50,000 transactions** across 5,000 customers and 200 merchants - **782 fraud cases** — 1.56% fraud rate (realistic for card fraud) - **Fraud patterns included:** ATO, BIN attacks, geo-velocity, structuring, CNP fraud - **Features:** amount, transaction type, merchant category, KYC status, failed logins, account age, device ID, geo-location, risk segment Generated using `data/fraud_db_setup.sql` — fully reproducible in MySQL. ## Part 1 — SQL Rule Engine Four modules built progressively, each targeting specific fraud patterns. ### Modules | File | Fraud Patterns Covered | Key SQL Techniques | |---|---|---| | 01_exploratory_analysis | Baseline fraud rate by segment, KYC, account age | GROUP BY, conditional aggregation, CASE WHEN | | 02_velocity_bin_geo | Daily velocity, rolling 1-hour velocity, BIN attacks, geo-velocity | Self JOIN, INTERVAL window, CTE, LAG(), TIMESTAMPDIFF | | 03_ato_merchant_structuring | Account takeover, merchant chargeback risk, AML structuring | Multi-signal AND logic, JOIN, near-threshold detection | | 04_hybrid_decision_engine | Risk scoring engine + final decision engine | Weighted CASE WHEN, hard rules, 4-CTE chain | ### Risk Scoring Engine Each transaction receives a weighted risk score based on 10 fraud signals. Weights initially set by domain knowledge, then validated by XGBoost feature importance. | Signal | Weight | Basis | |---|---|---| | Failed logins > 2 | 35 | Strongest ATO predictor — 60.5% XGBoost importance | | Amount > $800 | 10 | High value = high loss exposure | | KYC not verified | 9 | Unverified identity = elevated risk | | High risk segment | 9 | Was underweighted in v1 — ML corrected this | | Crypto/Gaming merchant | 8 | 2.67% fraud rate vs 1.38% ecommerce | | Foreign transaction | 6 | Was overweighted in v1 — ML corrected this | | Amount > 3x user average | 5 | Behavioural spike signal | | Account age < 30 days | 5 | New account risk | | Off-hours (2am–5am) | 5 | Behavioural anomaly | | CNP transaction | 5 | Context signal — weak alone | ### Hybrid Decision Engine Hard rules for high-confidence single signals. Soft scoring for combinations of weaker signals. Strong single signal → HARD BLOCK (no score needed) Weak signals combined → Score ≥ 25 → REVIEW Score < 25 → APPROVE **Hard block conditions:** - failed_logins > 5 - failed_logins > 2 AND kyc_verified = 0 - failed_logins > 2 AND amount > $800 - risk_segment = High AND kyc_verified = 0 AND merchant = Crypto/Gaming **Why failed_logins is excluded from soft score:** It is already the trigger for 3 of 4 hard rules — including it in soft scoring would double-count the signal. The two layers are kept cleanly separated. ### Decision Engine Results | Decision | Transactions | Fraud Count | Fraud Rate | FP Rate | |---|---|---|---|---| | BLOCK | 285 | 112 | 39.3% | 60.7% | | REVIEW | 3,146 | 87 | 2.77% | 97.2% | | APPROVE | 46,569 | 583 | 1.25% | 0% | **Overall recall: 25.4%** — see limitations section for explanation. ## Part 2 — XGBoost ML Model ### Why ML alongside rules? Rule engines catch known fraud patterns. ML catches unknown ones. More importantly — ML tells you which signals actually matter vs which ones you assumed matter. ### Notebook Structure | Step | What it does | |---|---| | Step 1 | Load 50,000 transactions from MySQL | | Step 2 | Feature engineering v1 — binary flags baseline | | Step 3 | Train XGBoost model v1 | | Step 4 | Evaluate model v1 — reveals binary flags are too coarse | | Step 5 | Feature engineering v2 — raw continuous values | | Step 6 | Evaluate model v2 — dramatic improvement over v1 | | Step 7 | XGBoost feature importance analysis | | Step 8 | Map importance scores to SQL weights v1 | | Step 9 | Proportional ML-validated SQL weights v2 | | Step 10 | Threshold analysis — precision/recall trade-off | | Step 11 | Signal coverage analysis — why recall is capped | | Step 12 | Hybrid decision engine — hard rules + soft scoring | | Step 13 | Final 3-way comparison V1 vs V2 vs V3 | ### Model Performance | Metric | Model V1 (Binary Flags) | Model V2 (Raw + Engineered) | |---|---|---| | AUC-ROC | 0.71 | 0.9999* | | AUC-PR | 0.15 | 0.9956* | | Fraud caught | 41.7% | 98.1%* | | False Positives | 969 | 8* | *Near-perfect scores reflect synthetic data leakage — fraud labels were generated from the same features used in training. See limitations section. ### XGBoost Feature Importance — SQL Weight Validation | Feature | ML Importance | Old SQL Weight | New SQL Weight | Change | |---|---|---|---|---| | failed_logins | 60.5% | 30 | 35 | ↑ Confirmed strongest | | amount | 11.3% | 25 | 10 | ↓ Was overweighted | | kyc_verified | 8.2% | 20 | 9 | ↓ Reduced | | flag_high_risk_seg | 7.8% | 10 | 9 | ↑ Was underweighted | | flag_risky_merchant | 6.2% | 15 | 8 | ↓ Reduced | | is_foreign | 3.1% | 20 | 6 | ↓ Significantly overweighted | | daily_velocity | 0.0% | 20 | 0 | ❌ Removed entirely | ### Threshold Analysis Systematic evaluation showing no single threshold resolves the precision-recall trade-off: | Threshold | Precision | Recall | F1 | FP Rate | |---|---|---|---|---| | 35 | 17.6% | 20.6% | 0.190 | 82.4% | | 50 | 25.9% | 8.7% | 0.130 | 74.1% | | 65 | 80.0% | 1.5% | 0.030 | 20.0% | Best F1 at threshold 35 — but recall still only 20.6%. This directly motivated the hybrid architecture decision. ## Decision Engine Evolution | Version | Architecture | BLOCK Precision | BLOCK FP Rate | BLOCK Count | Recall | |---|---|---|---|---|---| | V1 | Domain weights, threshold 60 | 7.75% | 92.25% | 3,344 | 33.2% | | V2 | ML-validated weights, threshold 65 | 25.86% | 74.14% | 263 | 25.4% | | V3 | Hybrid hard rules + soft scoring | 39.30% | 60.70% | 285 | 25.4% | **V1 → V2:** ML weights reduced BLOCK size by 92% (3,344 → 263) while improving precision 3.3x **V2 → V3:** Hard rules pushed BLOCK precision to 39.3% — nearly 1 in 2 blocks is genuine fraud ## Key Findings **1. Failed logins dominates all other signals** At 60.5% XGBoost importance, failed_logins is 5x more predictive than the next feature. Any transaction with failed_logins > 2 combined with one secondary signal justifies a hard block. **2. Foreign transaction signal was overweighted** is_foreign fires on 80% of all transactions — too broad to discriminate. ML importance of 3.1% confirmed it should be a supporting signal only. Reduced from 20 to 6 points in the scoring engine. **3. Hybrid architecture outperforms pure scoring** Separating high-confidence single signals (hard rules) from weak signal combinations (soft scoring) improved BLOCK precision from 25.86% to 39.30% without changing recall. **4. Rules and ML validate each other** Domain expertise set initial weights. ML corrected three overweighted signals and one underweighted signal. Neither approach alone would have found this. **5. REVIEW bucket should use step-up authentication in production** 97% false positive rate in REVIEW is unacceptable for human analyst review. Solution: automated OTP or 3DS — customer self-clears in seconds, ops team never sees it, false positive cost drops to near zero. ## Limitations **Synthetic data ceiling — recall capped at 25%** 583 fraud cases score low because they were labelled by single isolated signals in the data generator. The scoring engine requires multiple signals to fire together, so single-signal fraud cases slip through at every threshold. In production this limitation does not exist — real fraudsters trigger multiple signals simultaneously. An ATO attack shows failed logins AND new device AND geo mismatch AND unusual hour together. The architecture is production-ready; the constraint is the synthetic data, not the engine design. **Near-perfect ML metrics indicate data leakage** Fraud labels were generated deterministically from the same features used in training. XGBoost reverse-engineers the labelling rules rather than learning generalised fraud patterns. Real-world fraud models achieve 0.75–0.90 AUC-ROC on production data. ## How to Run **Prerequisites:** - MySQL 8.0+ - Python 3.9+ - Jupyter Notebook **Step 1 — Set up the database:** -- In MySQL Workbench or mysql CLI source data/fraud_db_setup.sql **Step 2 — Run SQL modules in order:** USE fraud_db; source sql/01_exploratory_analysis.sql source sql/02_velocity_bin_geo.sql source sql/03_ato_merchant_structuring.sql source sql/04_hybrid_decision_engine.sql **Step 3 — Run the notebook:** pip install pandas numpy scikit-learn xgboost==2.1.1 mysql-connector-python matplotlib jupyter notebook Open `python/fraud_detection_xgboost.ipynb` and run all cells. Update MySQL password in connection cells before running. ## Skills Demonstrated | Skill | Where | |---|---| | Advanced SQL — CTEs, window functions, self joins | All SQL modules | | Fraud domain knowledge — ATO, BIN attack, geo-velocity, structuring | sql/02, sql/03 | | Feature engineering — raw values vs binary flags comparison | Notebook Steps 2 and 5 | | Class imbalance handling — scale_pos_weight in XGBoost | Notebook Step 3 | | Threshold optimisation — precision/recall trade-off | Notebook Step 10 | | Signal coverage analysis — why recall is capped | Notebook Step 11 | | ML-rule integration — feature importance → SQL weight update | Notebook Steps 7–9 | | Hybrid architecture — hard rules + soft scoring | sql/04, Notebook Step 12 | | Production thinking — approval rate, step-up auth recommendation | Notebook Step 12 | | Honest evaluation — synthetic data limitations acknowledged | Limitations section | ## Author **Vishnu Saseendran** Senior Executive — Payments Performance & Risk Investigations 6+ years in fraud detection, transaction monitoring, and payments risk [LinkedIn](https://linkedin.com/in/vishnu-saseendran-522798148) | [GitHub](https://github.com/VishnuSas21)