msburns24/amazon-sentiment-ml-eng

# amazon-sentiment-ml-eng

A production-ready sentiment classifier built on the [Amazon Customer Reviews 2023](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023) dataset. Demonstrates the ML engineering patterns that sit between a notebook prototype and a deployed system: structured error handling, a three-tier fallback chain, drift detection, a FastAPI serving layer, and a Streamlit observability dashboard.

## Project Structure

```
amazon-sentiment-ml-eng/
├── src/amazon_sentiment/
│   ├── classifier.py       # DistilBERT inference, input validation, OOV detection
│   ├── fallback.py         # Three-tier fallback chain
│   ├── drift.py            # Drift report (OOV rate, KS test, JS divergence)
│   ├── api.py              # FastAPI app (/predict, /health, /metrics)
│   └── dashboard.py        # Streamlit observability dashboard
├── scripts/
│   ├── download.py         # Download Amazon Reviews dataset from HuggingFace
│   ├── preprocess.py       # Derive sentiment labels from star ratings
│   ├── split_windows.py    # Partition data into early/late time windows
│   ├── train.py            # Fine-tune DistilBERT on the training window
│   └── simulate_drift.py   # Compare windows and produce a drift report
├── reports/
│   └── drift.json          # Drift simulation output
├── tests/
│   ├── unit/               # Classifier, fallback, resource guard unit tests
│   ├── integration/        # End-to-end with real DistilBERT checkpoint
│   ├── drift/              # Drift detection unit and scenario tests
│   └── api/                # FastAPI endpoint tests
├── HW1/                    # Original homework artifacts (archived)
│   ├── sentiment_classifier.py
│   ├── fallback_system.py
│   ├── requirements.md
│   └── assumptions.md
├── blog/
│   └── notebook-to-production.md   # Full blog post draft
└── pyproject.toml
```

## Quickstart

### Setup

```bash
python -m venv .venv

# Windows
.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

pip install -e .
```

### Run the classifier

```python
from amazon_sentiment.classifier import classify

result = classify("This product is absolutely fantastic!")
# {'label': 'positive', 'confidence': 0.93, 'status': 'ok', 'reason': ''}
```

Every call returns the same four-key dict (`label`, `confidence`, `status`, `reason`) and never raises to the caller. The `status` field tells you which tier responded: `"ok"` (model), `"fallback"` (rule-based or human queue), or `"rejected"` (failed input validation).

### Run the API server

```bash
uvicorn amazon_sentiment.api:app --reload

# Classify a review
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Great quality and fast shipping."}'

# Health check
curl http://localhost:8000/health

# Rolling metrics
curl http://localhost:8000/metrics
```

### Run the observability dashboard

```bash
streamlit run src/amazon_sentiment/dashboard.py
```

The sidebar has a **Generate demo data** button that populates synthetic prediction logs so you can explore all panels without needing a live API.

### Simulate drift

```bash
# Print drift report to stdout
python scripts/simulate_drift.py

# Save JSON for the dashboard's Drift Report tab
python scripts/simulate_drift.py --output-json reports/drift.json
```

### Run the tests

```bash
# Unit + drift + API tests (no model download needed for unit/drift)
pytest tests/unit/ tests/drift/ tests/api/

# Full suite including integration tests
pytest
```

## Training Your Own Model

```bash
# 1. Download the raw Amazon Reviews dataset
python scripts/download.py

# 2. Derive sentiment labels from star ratings (1–2 → negative, 3 → neutral, 4–5 → positive)
python scripts/preprocess.py

# 3. Partition into early/late time windows for drift simulation
python scripts/split_windows.py

# 4. Fine-tune DistilBERT on the early window (writes to models/amazon-sentiment/)
python scripts/train.py
```

A pre-trained checkpoint is already committed to `models/amazon-sentiment/`. Skip to step 3 or 4 if you just want to reproduce the drift simulation.

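The label derivation in step 2 of the training pipeline can be sketched as a small pure function. This is a minimal illustration of the stated mapping (1–2 → negative, 3 → neutral, 4–5 → positive), not the actual contents of `scripts/preprocess.py`; the function name `rating_to_label` and the range check are assumptions made for the example.

```python
def rating_to_label(stars: float) -> str:
    """Map a 1-5 star rating to a sentiment label.

    Hypothetical helper: mirrors the mapping described in the README,
    not necessarily how scripts/preprocess.py implements it.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars!r}")
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    # Anything strictly between 2 and 4 (in practice, 3 stars) is neutral.
    return "neutral"


labels = [rating_to_label(s) for s in (1, 2, 3, 4, 5)]
print(labels)
```

Keeping the mapping in one function makes it easy to unit-test and to change later if the neutral band needs to be widened or dropped.
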
## Key Design Decisions

### Classifier: fine-tuned 3-class DistilBERT

`distilbert-base-uncased` fine-tuned directly on Amazon Reviews with three native output classes: `positive`, `negative`, `neutral`. Trained on 18,851 examples for 2 epochs; achieves **91.4% accuracy** on the held-out validation set.

The model checkpoint is not included in this repo (too large for git). Run the training pipeline described under Training Your Own Model to produce it, or point `MODEL_NAME` in `classifier.py` at any HuggingFace-compatible checkpoint. If no local checkpoint is found, `classifier.py` falls back to `distilbert-base-uncased-finetuned-sst-2-english` (SST-2 binary), mapping low-confidence predictions (score < 0.75) to neutral.

### Fallback chain: three tiers, one contract

Every input, valid or not, produces the same response shape. The fallback chain ensures this:

```
Input → Validation → Resource check
          → Tier 1: DistilBERT (confidence ≥ 0.6)
          → Tier 2: Keyword heuristics (no match or tie → next)
          → Tier 3: Human review queue (always structured response)
```

### Drift detection: three signals

`compute_drift_report(early_df, late_df)` compares two DataFrames and reports on:

| Signal | Method | Default threshold | Result on Amazon data |
|---|---|---|---|
| OOV rate | Mean delta | > 5% | **No drift**: delta 3.3% |
| Text length | Kolmogorov-Smirnov p-value | < 0.05 | **Drift**: mean 438 → 514 chars, p ≈ 0 |
| Label distribution | Jensen-Shannon divergence | > 0.05 | **Drift**: JS divergence 0.184 |

The label shift is the most striking signal: positive reviews drop from 34% → 13% and negative reviews surge from 61% → 83% in the later window, consistent with a review-bombing pattern. Overall drift is detected.

All thresholds are configurable. The simulation script accepts `--oov-threshold`, `--length-pvalue`, and `--js-threshold` flags.

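To make the label-distribution signal concrete, here is a minimal standard-library sketch of the Jensen-Shannon computation applied to the class shares quoted above. It is not the code in `drift.py`. The function name `js_divergence` is hypothetical, and the neutral shares (5% and 4%) are an assumption derived from the remaining probability mass, since only the positive and negative shares are reported.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    # Normalise so raw counts work as well as probabilities.
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]
    # Mixture distribution used as the common reference point.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Order: positive, negative, neutral. Neutral shares are assumed (see above).
early = [0.34, 0.61, 0.05]
late = [0.13, 0.83, 0.04]

divergence = js_divergence(early, late)
distance = math.sqrt(divergence)  # the "Jensen-Shannon distance" form
print(f"JS divergence: {divergence:.3f}  JS distance: {distance:.3f}")
```

With these rounded shares the distance form comes out near 0.18, close to the 0.184 in the table. That may mean the report uses the square-root (distance) form, as some libraries do, but this is an inference rather than something the source states.
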
## Requirements and Assumptions

The `HW1/` directory contains the original requirements and assumptions documents, which informed the production design:

- **`requirements.md`**: Business metrics (−15% churn in 90 days), system performance (p95 < 200 ms, 50 req/s sustained), model quality (macro F1 ≥ 85%), and data quality requirements
- **`assumptions.md`**: World-vs-Machine framework: what the system controls vs. what it assumes about its environment, and what happens when each assumption is violated

## Blog Post

`blog/notebook-to-production.md` is a full practitioner post on the notebook-to-production gap, using this project as the case study. Sections:

1. The gap between notebook accuracy and production readiness
2. Requirements doc walkthrough
3. Failure modes the Amazon dataset triggers
4. The three-tier fallback chain
5. Drift simulation results
6. What comes next (fine-tuning, containerization, retraining loop)