HarshikaReddyUppula/ab-test-lab


# ab-test-lab

A small Python toolkit for the design, execution, and **privacy-preserving** analysis of A/B tests. It is built around the things production experimentation platforms actually need: correct sample-size math, defensible confidence intervals, sequential testing that doesn't lie when you peek, and differentially private aggregates when user-level data can't leave the warehouse.

![Python](https://img.shields.io/badge/Python-3.11-3776AB?logo=python&logoColor=white) ![SciPy](https://img.shields.io/badge/SciPy-1.13-8CAAE6?logo=scipy&logoColor=white) ![PySpark](https://img.shields.io/badge/PySpark-3.5-E25A1C?logo=apachespark&logoColor=white) ![BigQuery](https://img.shields.io/badge/BigQuery-4285F4?logo=googlecloud&logoColor=white) ![Differential%20Privacy](https://img.shields.io/badge/Differential%20Privacy-✓-4B0082) ![Tests](https://img.shields.io/badge/tests-pytest-0A9EDC?logo=pytest&logoColor=white)

## Why this exists

Most A/B test "tutorials" on GitHub stop at `scipy.stats.ttest_ind`. Real production experimentation platforms have to answer harder questions:

- *How many users do I actually need for a 1% lift to be detectable?* → power-aware sample-size design
- *We peeked at the dashboard yesterday and called it. Are we sure?* → sequential testing with proper alpha-spending
- *Can we run an experiment on EU users without exposing user-level data?* → differential privacy on the aggregates

This toolkit gives clean, tested implementations of each, with notebooks showing real-world usage on public datasets.
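As a rough illustration of the first question, the sketch below computes a per-arm sample size from the textbook normal-approximation formula for a two-sided two-proportion z-test. It is not the library's implementation. The function name `approx_sample_size_per_arm` and its parameters are hypothetical, the minimum detectable effect is assumed to be an absolute difference in conversion rate, and the code uses only the standard library (`statistics.NormalDist`) so it runs without SciPy.

```python
from math import ceil, sqrt
from statistics import NormalDist


def approx_sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test.

    Hypothetical helper for illustration only. `mde_abs` is assumed to be an
    absolute difference in conversion rate (0.005 means 10.0% -> 10.5%).
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("both conversion rates must lie strictly between 0 and 1")

    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power

    p_bar = (p1 + p2) / 2  # pooled rate under the null hypothesis
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / mde_abs ** 2)


# Same illustrative inputs as the README's usage example.
n_small_lift = approx_sample_size_per_arm(0.10, 0.005)
n_smaller_lift = approx_sample_size_per_arm(0.10, 0.0025)
print(f"mde=0.0050 -> {n_small_lift:,} users per arm")
print(f"mde=0.0025 -> {n_smaller_lift:,} users per arm")
```

Because the required sample size scales with roughly the inverse square of the detectable effect, halving the effect approximately quadruples the users needed per arm, which is why small lifts are expensive to detect.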
## Architecture

```mermaid
flowchart LR
    subgraph Design
        D1[Sample-size calculator]
        D2[Power simulator]
    end
    subgraph Analyze
        A1[Frequentist tests]
        A2[Bootstrap CIs]
        A3[Sequential tests]
    end
    subgraph Private
        P1[Laplace mechanism]
        P2[Gaussian mechanism]
        P3[Private proportion test]
    end
    subgraph Scale
        S1[PySpark aggregations]
        S2[BigQuery user-level → bucketed]
    end
    D1 --> A1
    D2 --> A1
    A1 -.privacy noise.-> P1
    A1 -.privacy noise.-> P2
    S1 --> A1
    S2 --> A1
```

## What's in the library

```python
from ab_test_lab import design, analyze, private, simulate

# 1. Design: how many users do I need?
n = design.required_sample_size_proportion(
    baseline_rate=0.10, mde=0.005, alpha=0.05, power=0.80
)

# 2. Analyze: what's the lift, with a 95% CI?
result = analyze.proportion_test(
    control_conversions=520, control_n=10_000,
    treatment_conversions=565, treatment_n=10_000,
)
# result.lift, result.ci_lower, result.ci_upper, result.p_value

# 3. Private: same test, with (epsilon=1.0)-DP noise on the aggregates
private_result = private.private_proportion_test(
    control_conversions=520, control_n=10_000,
    treatment_conversions=565, treatment_n=10_000,
    epsilon=1.0,
)
# private_result.lift, private_result.ci_lower, ...

# 4. Simulate: what's my power, given my actual user-volume curve?
power = simulate.power_simulation(
    baseline_rate=0.10, true_lift=0.01, n_per_arm=10_000, n_sims=2_000
)
```

## Notebooks

| # | Notebook | What it shows |
|---|---|---|
| 01 | `01_design_and_analyze.ipynb` | End-to-end: design an experiment, simulate users, analyze results |
| 02 | `02_privacy_preserving.ipynb` | Same test, with differential privacy. Privacy/utility frontier |
| 03 | `03_case_study_at_scale.ipynb` | Criteo Uplift dataset (~14M rows) bucketed in BigQuery + PySpark |

## Stack

| Layer | Tool |
| --- | --- |
| Core stats | `scipy.stats`, `numpy`, `statsmodels` |
| Differential privacy | Custom Laplace/Gaussian mechanisms (and `diffprivlib` for comparison) |
| Scale demos | PySpark, BigQuery |
| Testing | `pytest` |
| Notebooks | Jupyter, Matplotlib, Seaborn |
| Packaging | `pyproject.toml` (installable via `pip install -e .`) |

## Repository layout

```
ab-test-lab/
├── ab_test_lab/          # library code (importable as `ab_test_lab`)
│   ├── design.py         # sample size & power
│   ├── analyze.py        # frequentist tests & CIs
│   ├── sequential.py     # sequential / always-valid inference
│   ├── private.py        # differential privacy mechanisms
│   └── simulate.py       # Monte Carlo simulations
├── notebooks/            # end-to-end demonstrations
├── tests/                # pytest suite
├── pipelines/            # PySpark + BigQuery scale demos
├── docs/methodology.md   # methodology deep-dive
└── pyproject.toml
```

## Getting started

```bash
git clone https://github.com/HarshikaReddyUppula/ab-test-lab.git
cd ab-test-lab
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
pytest

# Open the demo
jupyter lab notebooks/01_design_and_analyze.ipynb
```

## Methodology

Full writeup: [docs/methodology.md](docs/methodology.md). Topics:

- Why the standard two-proportion formula under-estimates *n* by ~10% when conversion rates are small.
- The "peeking problem" and Pocock vs. O'Brien-Fleming spending functions.
- The privacy/utility tradeoff: what ε buys you, what it costs in detectable effect.
- When to use this vs. a Bayesian framework (and when *both*).
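To make the Pocock vs. O'Brien-Fleming contrast concrete, here is a minimal sketch of the two Lan-DeMets alpha-spending functions that approximate those designs. It is an illustration under stated assumptions, not the contents of `sequential.py`: the function names `obrien_fleming_spend` and `pocock_spend` are hypothetical, and `t` is assumed to be the information fraction (share of the planned sample observed so far).

```python
from math import e, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def _check_fraction(t):
    if not 0 < t <= 1:
        raise ValueError("information fraction t must be in (0, 1]")


def obrien_fleming_spend(t, alpha=0.05):
    """Cumulative alpha spent by information fraction t (O'Brien-Fleming type).

    Lan-DeMets form: 2 - 2 * Phi(z_{alpha/2} / sqrt(t)).
    """
    _check_fraction(t)
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / sqrt(t)))


def pocock_spend(t, alpha=0.05):
    """Cumulative alpha spent by information fraction t (Pocock type).

    Lan-DeMets form: alpha * ln(1 + (e - 1) * t).
    """
    _check_fraction(t)
    return alpha * log(1 + (e - 1) * t)


# Compare how much of the 5% budget each design has spent at four equally
# spaced looks. Both reach the full alpha at t = 1.
for t in (0.25, 0.50, 0.75, 1.00):
    print(
        f"t={t:.2f}  O'Brien-Fleming={obrien_fleming_spend(t):.5f}  "
        f"Pocock={pocock_spend(t):.5f}"
    )
```

The O'Brien-Fleming-type function spends very little alpha at early looks and saves most of it for the end, while the Pocock-type function spends it more evenly. These functions only give the cumulative alpha budget; turning that budget into per-look critical values requires the joint distribution of the interim test statistics, which this sketch does not attempt.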
## Roadmap

- [ ] **design**: proportion + continuous sample size, cluster-randomized adjustments
- [ ] **analyze**: Welch's t-test, two-proportion z-test, bootstrap CIs, CUPED variance reduction
- [ ] **sequential**: Pocock and O'Brien-Fleming alpha-spending; mSPRT
- [ ] **private**: Laplace + Gaussian mechanisms, composition accounting, private proportion test
- [ ] **simulate**: power curves, false-positive rate verification under peeking
- [ ] **notebooks**: 01 design+analyze, 02 privacy, 03 scale (Criteo + PySpark + BigQuery)
- [ ] **tests**: ≥80% coverage; property-based tests via `hypothesis` for noise mechanisms
- [ ] **CI**: GitHub Actions: ruff + pytest on PR

## License

MIT. Use it, fork it, ship it.