kgothatso605/credit-fraud-detection-ml
GitHub: kgothatso605/credit-fraud-detection-ml
Stars: 0 | Forks: 0
# Credit Card Fraud Detection ML Framework
[](https://www.python.org/)
[](LICENSE)
[](https://github.com/psf/black)
## Project Overview
End-to-end fraud detection system built from scratch, achieving 76.28% precision with 9.04% recall through systematic methodology and professional ML practices.
**Key Achievement**: Terminal-based features account for 58% of model importance, demonstrating that *where* transactions occur matters more than *when* or *how much*.
### Dataset
- **Source**: Kaggle - Credit Card Fraud Detection Challenge
- **Size**: 291,231 training transactions, 226,731 test transactions
- **Imbalance**: 1:43 ratio (2.26% fraud rate)
- **Structure**: Multi-table relational data (transactions, customers, terminals)
**Acknowledgment**: We gratefully acknowledge Kaggle for providing this synthetic dataset for educational purposes.
## Results Summary
| Metric | Value | Status |
|--------|-------|--------|
| Precision | 76.28% | High |
| Recall | 9.04% | Conservative |
| F1-Score | 16.17% | Best balance |
| Training Time | 36.7s | Fast |
| Top Feature | terminal_fraud_rate | 29.75% importance |
**Key Finding**: Discovered fundamental precision-recall constraint where 85%+ precision requires <1% recall.
## Architecture
End-to-End Pipeline:
Raw Data → Feature Engineering → Model Training → Evaluation → Deployment → API
### Features Engineered: 58 total
1. **Temporal (14)**: High-risk period (21:00-04:00), cyclical encodings
2. **Geographic (2)**: Distance calculations
3. **Amount Deviations (6)**: Personalized Z-scores per customer
4. **Terminal Risk (5)**: Historical fraud rates (58% of model importance!)
5. **Customer Behavioral (3)**: Usage patterns
6. **Advanced Contextual (7)**: Weekend patterns, customer time habits
7. **Interactions (6)**: Compound effects
### Models Evaluated: 20 configurations
- Logistic Regression, Random Forest, XGBoost, LightGBM, CatBoost
- Techniques: Baseline, Class Weights, SMOTE, Threshold Tuning
- **Selected**: Random Forest Baseline (optimal precision-recall balance)
## Quick Start
### Installation
# Clone repository
git clone https://github.com/yourusername/credit_fraud_ml.git
cd credit_fraud_ml
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install package
pip install -e .
# Or install dependencies only
pip install -r requirements.txt
### Usage
#### 1. Feature Engineering
from src.features.feature_engineering import FeatureEngineer
# Initialize
engineer = FeatureEngineer()
# Load and engineer features
train_engineered = engineer.fit_transform(train_df, customers_df, terminals_df)
test_engineered = engineer.transform(test_df, customers_df, terminals_df)
#### 2. Model Training
from src.models.train import train_model
# Train model
model, metrics = train_model(
X_train, y_train,
model_type='random_forest',
save_path='models/fraud_detector.pkl'
)
print(f"Precision: {metrics['precision']:.2%}")
print(f"Recall: {metrics['recall']:.2%}")
#### 3. Prediction
from src.deployment.predictor import FraudPredictor
# Initialize predictor
predictor = FraudPredictor(model_path='models/fraud_detector.pkl')
# Predict
result = predictor.predict(transaction_data)
print(f"Fraud Probability: {result['fraud_probability']:.2%}")
#### 4. API Deployment
# Start API server
uvicorn src.deployment.api:app --host 0.0.0.0 --port 8000
# Test prediction
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"transaction_id": 12345, "amount": 100.50, ...}'
## Project Structure
credit_fraud_ml/
├── src/ # Source code
│ ├── data/ # Data loading & validation
│ ├── features/ # Feature engineering
│ ├── models/ # Model training & evaluation
│ ├── deployment/ # API & serving
│ └── utils/ # Utilities
├── notebooks/ # Jupyter notebooks
├── tests/ # Unit tests
├── deployment/ # Docker & K8s configs
├── data/ # Data (gitignored)
├── models/ # Saved models (gitignored)
└── reports/ # Figures & results
## Methodology
### Phase 1: EDA (Days 1-3)
**Discoveries**:
- High-risk period: 21:00-04:00 (50% elevated fraud)
- Volume-risk paradox: Low-volume hours = higher fraud rates
- Amount analysis: Only $4.07 difference between fraud/legitimate
- Synthetic data recognition: Tuesday elevation (unusual pattern)
### Phase 2: Feature Engineering (Days 4-7)
**Created 58 features** across 7 categories with focus on:
- Terminal risk metrics (became 58% of model importance!)
- Personalized amount deviations (customer-specific baselines)
- Advanced temporal patterns (high-risk period indicators)
### Phase 3: Model Evaluation (Days 8-12)
**Systematic comparison**:
- Tested 20 configurations (5 models × 4 techniques)
- Discovered precision-recall constraint
- Random Forest baseline selected as optimal
### Phase 4: Deployment (Days 13-17)
**Production-ready**:
- FastAPI REST API
- Docker containerization
- Monitoring & logging
- Model versioning
## Key Learnings
### 1. Feature Engineering > Model Selection
Terminal features dominate (58%) despite testing 5 different model types.
### 2. Data-Driven > Assumption-Driven
Synthetic data showed Tuesday elevation (not weekends), teaching us to trust data over domain assumptions.
### 3. Systematic Evaluation Essential
### 4. Honest Limitation Documentation
85% precision target proved impossible (requires <1% recall). Documenting why matters as much as successes.
## Performance Optimization
### Current Model
- **Precision**: 76.28%
- **Recall**: 9.04%
- **Inference Time**: <10ms per transaction
### Potential Improvements
1. Time-series features (rolling windows)
2. Network features (fraud ring detection)
3. Ensemble methods (stacking)
4. Deep learning (LSTM for sequences)
## Deployment
### Docker
# Build image
docker build -t fraud-detector:latest .
# Run container
docker run -p 8000:8000 fraud-detector:latest
### Kubernetes
# Deploy to K8s
kubectl apply -f deployment/kubernetes/
# Check status
kubectl get pods -l app=fraud-detector
### Monitoring
- Prometheus metrics at `/metrics`
- Health check at `/health`
- Model performance tracking
## Testing
# Run all tests
pytest tests/
# With coverage
pytest tests/ --cov=src --cov-report=html
# Specific test
pytest tests/test_features.py -v
## API Documentation
Once running, visit: `http://localhost:8000/docs`
### Endpoints
**POST /predict** - Predict single transaction
{
"transaction_id": 12345,
"amount": 100.50,
"customer_id": 1,
"terminal_id": 50,
"timestamp": "2026-02-16T21:30:00"
}
**POST /predict/batch** - Batch predictions
**GET /health** - Health check
**GET /metrics** - Prometheus metrics
## Author
**Kgothatso Ntumbe**
MSc Physics Student, University of the Witwatersrand
Aspiring AI Engineer & Entrepreneur
## License
MIT License - See LICENSE file for details
## Acknowledgments
- **Dataset**: Kaggle - Credit Card Fraud Detection Challenge
- **Platform**: CERN SWAN for development
- **Tools**: scikit-learn, XGBoost, LightGBM, CatBoost, FastAPI
## Documentation
- [API Documentation](docs/api_documentation.md)
- [Feature Engineering Guide](docs/feature_engineering.md)
- [Deployment Guide](docs/deployment.md)
## Roadmap
- [x] EDA & Feature Engineering
- [x] Model Training & Evaluation
- [x] API Deployment
- [ ] Real-time Streaming Pipeline
- [ ] A/B Testing Framework
- [ ] AutoML Integration
- [ ] Model Monitoring Dashboard