msburns24/amazon-sentiment-ml-eng
# amazon-sentiment-ml-eng
A production-ready sentiment classifier built on the [Amazon Customer Reviews 2023](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023) dataset. Demonstrates the ML engineering patterns that sit between a notebook prototype and a deployed system: structured error handling, a three-tier fallback chain, drift detection, a FastAPI serving layer, and a Streamlit observability dashboard.
## Project Structure
```
amazon-sentiment-ml-eng/
├── src/amazon_sentiment/
│   ├── classifier.py                # DistilBERT inference, input validation, OOV detection
│   ├── fallback.py                  # Three-tier fallback chain
│   ├── drift.py                     # Drift report (OOV rate, KS test, JS divergence)
│   ├── api.py                       # FastAPI app (/predict, /health, /metrics)
│   └── dashboard.py                 # Streamlit observability dashboard
├── scripts/
│   ├── download.py                  # Download Amazon Reviews dataset from HuggingFace
│   ├── preprocess.py                # Derive sentiment labels from star ratings
│   ├── split_windows.py             # Partition data into early/late time windows
│   ├── train.py                     # Fine-tune DistilBERT on the training window
│   └── simulate_drift.py            # Compare windows and produce a drift report
├── reports/
│   └── drift.json                   # Drift simulation output
├── tests/
│   ├── unit/                        # Classifier, fallback, resource guard unit tests
│   ├── integration/                 # End-to-end with real DistilBERT checkpoint
│   ├── drift/                       # Drift detection unit and scenario tests
│   └── api/                         # FastAPI endpoint tests
├── HW1/                             # Original homework artifacts (archived)
│   ├── sentiment_classifier.py
│   ├── fallback_system.py
│   ├── requirements.md
│   └── assumptions.md
├── blog/
│   └── notebook-to-production.md    # Full blog post draft
└── pyproject.toml
```
## Quickstart
### Setup
```bash
python -m venv .venv

# Windows
.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

pip install -e .
```
### Run the classifier
```python
from amazon_sentiment.classifier import classify

result = classify("This product is absolutely fantastic!")
# {'label': 'positive', 'confidence': 0.93, 'status': 'ok', 'reason': ''}
```
Every call returns the same four-key dict — `label`, `confidence`, `status`, `reason` — and never raises to the caller. The `status` field tells you which tier responded: `"ok"` (model), `"fallback"` (rule-based or human queue), or `"rejected"` (failed input validation).
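Because the response shape is fixed, calling code can branch on `status` without any exception handling. The sketch below is illustrative only: the `Prediction` type and the `describe` helper are hypothetical names and are not part of the package.

```python
from typing import Literal, TypedDict


class Prediction(TypedDict):
    """Hypothetical type describing the four-key response dict."""
    label: str
    confidence: float
    status: Literal["ok", "fallback", "rejected"]
    reason: str


def describe(result: Prediction) -> str:
    """Turn a response dict into a one-line summary, branching on status."""
    if result["status"] == "ok":
        return f"model: {result['label']} ({result['confidence']:.2f})"
    if result["status"] == "fallback":
        return f"fallback: {result['label']} ({result['reason']})"
    # "rejected": the input failed validation; this sketch reports only the reason.
    return f"rejected: {result['reason']}"


example: Prediction = {
    "label": "positive", "confidence": 0.93, "status": "ok", "reason": "",
}
print(describe(example))  # model: positive (0.93)
```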
### Run the API server
```bash
uvicorn amazon_sentiment.api:app --reload

# Classify a review
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Great quality and fast shipping."}'

# Health check
curl http://localhost:8000/health

# Rolling metrics
curl http://localhost:8000/metrics
```
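The same `/predict` call can be made from Python using only the standard library. This is a minimal sketch: `build_predict_request` and `predict` are hypothetical helper names, and it assumes the endpoint returns a JSON body.

```python
import json
import urllib.request


def build_predict_request(
    text: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the same POST /predict request as the curl example."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = "http://localhost:8000") -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# With the server running:
# predict("Great quality and fast shipping.")
```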
### Run the observability dashboard
```bash
streamlit run src/amazon_sentiment/dashboard.py
```
The sidebar has a **Generate demo data** button that populates synthetic prediction logs so you can explore all panels without needing a live API.
### Simulate drift
```bash
# Print drift report to stdout
python scripts/simulate_drift.py

# Save JSON for the dashboard's Drift Report tab
python scripts/simulate_drift.py --output-json reports/drift.json
```
### Run the tests
```bash
# Unit + drift + API tests (no model download needed for unit/drift)
pytest tests/unit/ tests/drift/ tests/api/

# Full suite including integration tests
pytest
```
## Training Your Own Model
```bash
# 1. Download the raw Amazon Reviews dataset
python scripts/download.py

# 2. Derive sentiment labels from star ratings (1–2 → negative, 3 → neutral, 4–5 → positive)
python scripts/preprocess.py

# 3. Partition into early/late time windows for drift simulation
python scripts/split_windows.py

# 4. Fine-tune DistilBERT on the early window (writes to models/amazon-sentiment/)
python scripts/train.py
```
If a trained checkpoint is already present in `models/amazon-sentiment/`, skip step 4. To reproduce only the drift simulation, run through step 3, which produces the early and late windows that `simulate_drift.py` compares.
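The label mapping in step 2 can be sketched as a small function. `rating_to_label` is a hypothetical name, not necessarily what `preprocess.py` uses, and the treatment of out-of-range or fractional ratings is an assumption.

```python
def rating_to_label(rating: float) -> str:
    """Map a 1-5 star rating to a sentiment label (1-2 negative, 3 neutral, 4-5 positive)."""
    if not 1 <= rating <= 5:
        # Assumption: out-of-range ratings are treated as bad data.
        raise ValueError(f"rating out of range: {rating}")
    if rating <= 2:
        return "negative"
    if rating >= 4:
        return "positive"
    # Assumption: anything strictly between 2 and 4 counts as neutral.
    return "neutral"


print([rating_to_label(r) for r in (1, 2, 3, 4, 5)])
# ['negative', 'negative', 'neutral', 'positive', 'positive']
```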
## Key Design Decisions
### Classifier: fine-tuned 3-class DistilBERT
`distilbert-base-uncased` fine-tuned directly on Amazon Reviews with three native output classes: `positive`, `negative`, `neutral`. Trained on 18,851 examples for 2 epochs; achieves **91.4% accuracy** on the held-out validation set.
The model checkpoint is not included in this repo (too large for git). Run the training pipeline described under Training Your Own Model to produce it, or point `MODEL_NAME` in `classifier.py` at any HuggingFace-compatible checkpoint. If no local checkpoint is found, `classifier.py` falls back to `distilbert-base-uncased-finetuned-sst-2-english` (SST-2 binary) and maps low-confidence predictions (score < 0.75) to neutral.
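The binary-to-three-class mapping used with the SST-2 fallback can be sketched as follows. `map_sst2_prediction` is a hypothetical helper; the only details taken from the text above are the 0.75 threshold and the neutral mapping. The SST-2 checkpoint reports `POSITIVE` and `NEGATIVE` labels, which the sketch lowercases.

```python
NEUTRAL_THRESHOLD = 0.75  # from the text: scores below this map to neutral


def map_sst2_prediction(
    label: str, score: float, threshold: float = NEUTRAL_THRESHOLD
) -> str:
    """Map a binary SST-2 prediction onto positive / negative / neutral."""
    if score < threshold:
        # The binary model is not confident either way, so call it neutral.
        return "neutral"
    return label.lower()


print(map_sst2_prediction("POSITIVE", 0.98))  # positive
print(map_sst2_prediction("NEGATIVE", 0.62))  # neutral
```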
### Fallback chain: three tiers, one contract
Every input — valid or not — produces the same response shape. The fallback chain ensures this:
```text
Input → Validation → Resource check
      → Tier 1: DistilBERT (confidence ≥ 0.6)
      → Tier 2: Keyword heuristics (no match or tie → next)
      → Tier 3: Human review queue (always structured response)
```
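The chain above can be sketched in a few lines. This is not the code in `fallback.py`: the keyword sets, the `"unknown"` placeholder label, the 0.5 heuristic confidence, and the `classify_with_fallback` name are all hypothetical, and the resource check is omitted.

```python
from typing import Callable, Optional, Tuple

CONFIDENCE_FLOOR = 0.6  # Tier 1 threshold from the diagram

# Hypothetical keyword lists, for illustration only.
POSITIVE_WORDS = {"great", "excellent", "love"}
NEGATIVE_WORDS = {"terrible", "broken", "refund"}


def keyword_label(text: str) -> Optional[str]:
    """Tier 2: count keyword hits; return None on no match or a tie."""
    words = set(text.lower().split())
    pos = len(words & POSITIVE_WORDS)
    neg = len(words & NEGATIVE_WORDS)
    if pos == neg:
        return None
    return "positive" if pos > neg else "negative"


def _response(label: str, confidence: float, status: str, reason: str) -> dict:
    return {"label": label, "confidence": confidence, "status": status, "reason": reason}


def classify_with_fallback(
    text: str, model: Callable[[str], Tuple[str, float]]
) -> dict:
    """Run validation, then Tier 1 -> Tier 2 -> Tier 3, always returning a dict."""
    if not isinstance(text, str) or not text.strip():
        return _response("unknown", 0.0, "rejected", "empty input")
    try:
        label, confidence = model(text)  # Tier 1: the model
        if confidence >= CONFIDENCE_FLOOR:
            return _response(label, confidence, "ok", "")
        reason = "low model confidence"
    except Exception as exc:  # never raise to the caller
        reason = f"model error: {exc}"
    heuristic = keyword_label(text)  # Tier 2: keyword heuristics
    if heuristic is not None:
        return _response(heuristic, 0.5, "fallback", f"{reason}; keyword heuristic")
    # Tier 3: nothing could decide, so hand off to a human.
    return _response("unknown", 0.0, "fallback", f"{reason}; queued for human review")


print(classify_with_fallback("great but broken", lambda t: ("positive", 0.3))["reason"])
# low model confidence; queued for human review
```

Keeping every exit path behind one `_response` helper is one way to make the "same shape for every input" contract hard to break by accident.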
### Drift detection: three signals
`compute_drift_report(early_df, late_df)` compares two DataFrames and reports on:
| Signal | Method | Default threshold | Result on Amazon data |
|---|---|---|---|
| OOV rate | Mean delta | > 5% | **No drift** — delta 3.3% |
| Text length | Kolmogorov-Smirnov p-value | < 0.05 | **Drift** — mean 438 → 514 chars, p ≈ 0 |
| Label distribution | Jensen-Shannon divergence | > 0.05 | **Drift** — JS divergence 0.184 |
The label shift is the most striking signal: the share of positive reviews drops from 34% to 13% and the share of negative reviews rises from 61% to 83% in the later window, which is consistent with a review-bombing pattern. Overall drift is detected.
All thresholds are configurable. The simulation script accepts `--oov-threshold`, `--length-pvalue`, and `--js-threshold` flags.
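Two of the three checks are simple enough to sketch with the standard library. This is an illustration of the signals named in the table, not the code in `drift.py`: the function names are hypothetical, and the log base, the use of an absolute OOV delta, and the example distributions are assumptions. The numeric value of a Jensen-Shannon score depends on the log base and on whether the divergence or its square root (the distance) is reported, so this sketch is not expected to reproduce the figures in the table.

```python
import math
from typing import Sequence


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits (base 2), so the result lies in [0, 1]."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def label_drift(early: Sequence[float], late: Sequence[float],
                threshold: float = 0.05) -> bool:
    """Flag drift when the label distributions diverge by more than the threshold."""
    return js_divergence(early, late) > threshold


def oov_drift(early_rate: float, late_rate: float, threshold: float = 0.05) -> bool:
    """Flag drift when the mean OOV rate moves by more than the threshold."""
    return abs(late_rate - early_rate) > threshold


# Hypothetical label distributions (positive, neutral, negative) for two windows.
print(label_drift([0.6, 0.1, 0.3], [0.2, 0.1, 0.7]))
# A 3.3-point OOV delta stays under the default 5% threshold.
print(oov_drift(0.100, 0.133))
```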
## Requirements and Assumptions
The `HW1/` directory contains the original requirements and assumptions documents, which informed the production design:
- **`requirements.md`** — Business metrics (−15% churn in 90 days), system performance (p95 < 200 ms, 50 req/s sustained), model quality (macro F1 ≥ 85%), and data quality requirements
- **`assumptions.md`** — World-vs-Machine framework: what the system controls vs. what it assumes about its environment, and what happens when each assumption is violated
## Blog Post
`blog/notebook-to-production.md` is a full practitioner post on the notebook-to-production gap, using this project as the case study. Sections:
1. The gap between notebook accuracy and production readiness
2. Requirements doc walkthrough
3. Failure modes the Amazon dataset triggers
4. The three-tier fallback chain
5. Drift simulation results
6. What comes next (fine-tuning, containerization, retraining loop)