msburns24/amazon-sentiment-ml-eng

GitHub: msburns24/amazon-sentiment-ml-eng

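A production-grade sentiment classification system built on Amazon review data. It demonstrates the engineering patterns that take an ML system from notebook prototype to deployable service, including a fallback chain, drift detection, and observability.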

Stars: 0 | Forks: 0

# amazon-sentiment-ml-eng

A production-grade sentiment classifier built on the [Amazon Customer Reviews 2023](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023) dataset. It demonstrates the machine learning engineering patterns that sit between a notebook prototype and a deployed system: structured error handling, a three-tier fallback chain, drift detection, a FastAPI serving layer, and a Streamlit observability dashboard.

The project began as a requirements-engineering assignment for DATA 789 (UNC MADS) and was later extended into a full end-to-end system.

## Project structure

```
amazon-sentiment-ml-eng/
├── src/amazon_sentiment/
│   ├── classifier.py       # DistilBERT inference, input validation, OOV detection
│   ├── fallback.py         # Three-tier fallback chain
│   ├── drift.py            # Drift report (OOV rate, KS test, JS divergence)
│   ├── api.py              # FastAPI app (/predict, /health, /metrics)
│   └── dashboard.py        # Streamlit observability dashboard
├── scripts/
│   ├── download.py         # Download Amazon Reviews dataset from HuggingFace
│   ├── preprocess.py       # Derive sentiment labels from star ratings
│   ├── split_windows.py    # Partition data into early/late time windows
│   ├── train.py            # Fine-tune DistilBERT on the training window
│   └── simulate_drift.py   # Compare windows and produce a drift report
├── reports/
│   └── drift.json          # Drift simulation output
├── tests/
│   ├── unit/               # Classifier, fallback, resource guard unit tests
│   ├── integration/        # End-to-end with real DistilBERT checkpoint
│   ├── drift/              # Drift detection unit and scenario tests
│   └── api/                # FastAPI endpoint tests
├── HW1/                    # Original homework artifacts (archived)
│   ├── sentiment_classifier.py
│   ├── fallback_system.py
│   ├── requirements.md
│   └── assumptions.md
├── blog/
│   └── notebook-to-production.md   # Full blog post draft
└── pyproject.toml
```

## Quick start

### Environment setup

```
python -m venv .venv

# Windows
.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

pip install -e .
```

### Run the classifier

```
from amazon_sentiment.classifier import classify

result = classify("This product is absolutely fantastic!")
# {'label': 'positive', 'confidence': 0.93, 'status': 'ok', 'reason': ''}
```

Every call returns a dict with the same four keys (`label`, `confidence`, `status`, `reason`) and never raises an exception to the caller. The `status` field tells you which tier produced the response: `"ok"` (model), `"fallback"` (rule-based or human queue), or `"rejected"` (input validation failed).

### Run the API server

```
uvicorn amazon_sentiment.api:app --reload
```

```
# Classify a review
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Great quality and fast shipping."}'

# Health check
curl http://localhost:8000/health

# Rolling metrics
curl http://localhost:8000/metrics
```

### Run the observability dashboard

```
streamlit run src/amazon_sentiment/dashboard.py
```

The sidebar has a **Generate demo data** button that populates a simulated prediction log, so you can explore every panel without a live API.

### Simulate drift

```
# Print the drift report to stdout
python scripts/simulate_drift.py

# Save JSON for the dashboard's Drift Report tab
python scripts/simulate_drift.py --output-json reports/drift.json
```

### Run the tests

```
# Unit + drift + API tests (unit/drift require no model download)
pytest tests/unit/ tests/drift/ tests/api/

# Full suite, including integration tests
pytest
```

## Train your own model

The data pipeline is a sequence of four scripts. Each step reads from and writes to the `data/` directory:

```
# 1. Download the raw Amazon Reviews dataset
python scripts/download.py

# 2. Derive sentiment labels from star ratings (1–2 → negative, 3 → neutral, 4–5 → positive)
python scripts/preprocess.py

# 3. Split into early/late time windows for drift simulation
python scripts/split_windows.py

# 4. Fine-tune DistilBERT on the early window (writes to models/amazon-sentiment/)
python scripts/train.py
```

A pre-trained checkpoint is already included in the repo at `models/amazon-sentiment/`. If you only want to reproduce the drift simulation, you can skip straight to step 3 or 4.

## Key design decisions

### Classifier: fine-tuned 3-class DistilBERT

`distilbert-base-uncased` is fine-tuned directly on Amazon Reviews with three native output classes: `positive`, `negative`, and `neutral`. It was trained for 2 epochs on 18,851 samples and reaches **91.4% accuracy** on the held-out validation set.

Deriving labels from star ratings is a documented trade-off: 3-star reviews are a noisy proxy for neutral sentiment. See `HW1/requirements.md` for the per-class precision/recall targets and their business justification.

The model checkpoint is not included in this repo (too large for git). Run the training pipeline (see "Train your own model") to generate it, or point `MODEL_NAME` in `classifier.py` at any HuggingFace-compatible checkpoint. If no local checkpoint is found, `classifier.py` falls back to `distilbert-base-uncased-finetuned-sst-2-english` (binary SST-2) and maps low-confidence predictions (score < 0.75) to neutral.

### Fallback chain: three tiers, one contract

Every input, valid or not, produces the same response structure. The fallback chain guarantees this:

```
Input
  → Validation
  → Resource check
  → Tier 1: DistilBERT          (confidence ≥ 0.6)
  → Tier 2: Keyword heuristics  (no match or tie → next)
  → Tier 3: Human review queue  (always structured response)
```

### Drift detection: three signals

`compute_drift_report(early_df, late_df)` compares two DataFrames and reports on the following signals:

| Signal | Method | Default threshold | Result on Amazon data |
|---|---|---|---|
| OOV rate | Difference in means | > 5% | **No drift** (delta of 3.3%) |
| Text length | Kolmogorov-Smirnov p-value | < 0.05 | **Drift** (mean went from 438 to 514 characters, p ≈ 0) |
| Label distribution | Jensen-Shannon divergence | > 0.05 | **Drift** (JS divergence of 0.184) |

Label shift is the strongest signal: in the late window, positive reviews drop from 34% to 13% and negative reviews rise from 61% to 83%, a pattern consistent with review bombing. The system flags overall drift.

All thresholds are configurable. The simulation script accepts `--oov-threshold`, `--length-pvalue`, and `--js-threshold`.

## Requirements and assumptions

The `HW1/` directory contains the original requirements and assumptions documents that guided the production design:

- **`requirements.md`**: business metrics (15% churn reduction within 90 days), system performance (p95 < 200 ms at a sustained 50 req/s), model quality (macro F1 ≥ 85%), and data quality requirements
- **`assumptions.md`**: a World-vs-Machine framing of what the system controls versus what it assumes about its environment, and what happens when each assumption is violated

## Blog post

`blog/notebook-to-production.md` is a practical write-up on the gap between notebook and production, using this project as a case study. Its sections cover:

1. The gap between notebook accuracy and production readiness
2. A walkthrough of the requirements document
3. Failure modes triggered by the Amazon dataset
4. The three-tier fallback chain
5. Drift simulation results
6. Next steps (fine-tuning, containerization, retraining loop)
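To make the "three tiers, one contract" idea from the fallback-chain section concrete, here is a minimal sketch of how such a chain could be wired. It is not the code in `fallback.py`: the function names, the keyword lists, the placeholder `label`/`confidence` values for non-model tiers, and the injected `model` callable are all assumptions made for illustration, and the resource check step is omitted. Only the four response keys, the three `status` values, and the 0.6 confidence floor come from the README.

```python
from typing import Callable, Optional, Tuple

# Hypothetical keyword lists; the repository's real heuristics are not shown in the README.
POSITIVE_WORDS = {"great", "fantastic", "excellent", "love"}
NEGATIVE_WORDS = {"terrible", "awful", "broken", "refund"}
MODEL_CONFIDENCE_FLOOR = 0.6  # Tier 1 threshold stated in the README


def _response(label, confidence, status, reason):
    """Build the four-key dict that every code path returns."""
    return {"label": label, "confidence": confidence, "status": status, "reason": reason}


def keyword_tier(text: str) -> Optional[str]:
    """Tier 2: count keyword hits; return None on no match or a tie."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    pos = len(words & POSITIVE_WORDS)
    neg = len(words & NEGATIVE_WORDS)
    if pos == neg:
        return None
    return "positive" if pos > neg else "negative"


def classify_with_fallback(text, model: Callable[[str], Tuple[str, float]]) -> dict:
    """Run validation, then each tier in order; never raise to the caller."""
    # Validation: reject anything that is not a non-empty string.
    if not isinstance(text, str) or not text.strip():
        return _response(None, 0.0, "rejected", "empty or non-string input")

    # Tier 1: trust the model only when it is confident enough.
    try:
        label, confidence = model(text)
        if confidence >= MODEL_CONFIDENCE_FLOOR:
            return _response(label, confidence, "ok", "")
    except Exception:
        pass  # a model failure falls through to the next tier

    # Tier 2: keyword heuristics (confidence 0.0 is a placeholder assumption).
    label = keyword_tier(text)
    if label is not None:
        return _response(label, 0.0, "fallback", "keyword heuristics")

    # Tier 3: nothing decided, so hand off to a human review queue.
    return _response(None, 0.0, "fallback", "queued for human review")
```

Passing the model in as a callable keeps the sketch testable without downloading a checkpoint; a stub such as `lambda text: ("neutral", 0.4)` is enough to exercise tiers 2 and 3.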

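The label-distribution signal from the drift table can be illustrated with a small standard-library sketch of the Jensen-Shannon divergence. This is not the implementation in `drift.py`: the helper names, the natural-log base, and the toy label lists are assumptions. The README also does not say whether the reported value is the divergence itself or its square root (the Jensen-Shannon distance, which is what `scipy.spatial.distance.jensenshannon` returns), so the sketch exposes both. The OOV-rate and Kolmogorov-Smirnov signals are not covered here.

```python
import math
from collections import Counter

LABELS = ("positive", "negative", "neutral")


def label_distribution(labels):
    """Turn a list of label strings into probabilities ordered as LABELS."""
    counts = Counter(labels)
    total = sum(counts[name] for name in LABELS)
    if total == 0:
        raise ValueError("no known labels to build a distribution from")
    return [counts[name] / total for name in LABELS]


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def label_drift(early_labels, late_labels, threshold=0.05):
    """Compare two label samples; the threshold default mirrors the README table."""
    p = label_distribution(early_labels)
    q = label_distribution(late_labels)
    divergence = js_divergence(p, q)
    return {
        "js_divergence": divergence,
        "js_distance": math.sqrt(divergence),
        "drift": divergence > threshold,
    }


if __name__ == "__main__":
    # Toy data, not the Amazon windows.
    early = ["negative"] * 6 + ["positive"] * 3 + ["neutral"]
    late = ["negative"] * 8 + ["positive"] + ["neutral"]
    print(label_drift(early, late))
```

With the natural log the divergence is bounded by ln 2 (about 0.693), so a threshold such as 0.05 means different things depending on the log base and on whether the distance or the divergence is compared; that choice is worth pinning down in any real configuration.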
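Tags: AV bypass, DistilBERT, FastAPI, Kubernetes, security rule engine, sentiment analysis, machine learning engineering, concept drift detection, model serving, reverse-engineering tools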