# ab-test-lab
A small Python toolkit for designing, running, and analyzing A/B tests, including **privacy-preserving** analysis. It is built around the things production experimentation platforms actually need: correct sample-size math, defensible confidence intervals, sequential testing that doesn't lie when you peek, and differentially private aggregates when user-level data can't leave the warehouse.






## Why this exists
Most A/B test "tutorials" on GitHub stop at `scipy.stats.ttest_ind`. Real production experimentation platforms have to answer harder questions:
- *How many users do I actually need for a 1% lift to be detectable?* → power-aware sample-size design
- *We peeked at the dashboard yesterday and called it. Are we sure?* → sequential testing with proper alpha-spending
- *Can we run an experiment on EU users without exposing user-level data?* → differential privacy on the aggregates
This toolkit gives clean, tested implementations of each, with notebooks showing real-world usage on public datasets.
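As a rough illustration of the first question, here is a minimal sketch of the textbook normal-approximation sample-size formula for a two-sided two-proportion z-test. It is not the library's implementation: the function name is hypothetical, it uses only the standard library, and it assumes the minimum detectable effect is an absolute difference in conversion rate.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(baseline_rate, mde, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test.

    Hypothetical helper for illustration. `mde` is assumed to be an
    ABSOLUTE lift (0.005 means 10.0% -> 10.5%), not a relative one.
    """
    if not 0 < baseline_rate < 1 or not 0 < baseline_rate + mde < 1:
        raise ValueError("rates must lie strictly between 0 and 1")
    if mde == 0:
        raise ValueError("mde must be non-zero")

    p1 = baseline_rate
    p2 = baseline_rate + mde
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power

    # Pooled variance under H0, unpooled variance under H1.
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / mde ** 2)


n_per_arm = sample_size_two_proportions(baseline_rate=0.10, mde=0.005)
```

With a 10% baseline and a 0.5 percentage-point absolute lift, this approximation asks for a bit under 58,000 users per arm, which is why small lifts are expensive to detect.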
## Architecture
```mermaid
flowchart LR
    subgraph Design
        D1[Sample-size calculator]
        D2[Power simulator]
    end
    subgraph Analyze
        A1[Frequentist tests]
        A2[Bootstrap CIs]
        A3[Sequential tests]
    end
    subgraph Private
        P1[Laplace mechanism]
        P2[Gaussian mechanism]
        P3[Private proportion test]
    end
    subgraph Scale
        S1[PySpark aggregations]
        S2[BigQuery user-level → bucketed]
    end
    D1 --> A1
    D2 --> A1
    A1 -.privacy noise.-> P1
    A1 -.privacy noise.-> P2
    S1 --> A1
    S2 --> A1
```
## What's in the library
```python
from ab_test_lab import design, analyze, private, simulate

# 1. Design: how many users do I need?
n = design.required_sample_size_proportion(
    baseline_rate=0.10, mde=0.005, alpha=0.05, power=0.80
)

# 2. Analyze: what's the lift, with a 95% CI?
result = analyze.proportion_test(
    control_conversions=520, control_n=10_000,
    treatment_conversions=565, treatment_n=10_000,
)
# result.lift, result.ci_lower, result.ci_upper, result.p_value

# 3. Private: same test, with (epsilon=1.0)-DP noise on the aggregates
private_result = private.private_proportion_test(
    control_conversions=520, control_n=10_000,
    treatment_conversions=565, treatment_n=10_000,
    epsilon=1.0,
)
# private_result.lift, private_result.ci_lower, ...

# 4. Simulate: what's my power, given my actual user-volume curve?
power = simulate.power_simulation(
    baseline_rate=0.10, true_lift=0.01, n_per_arm=10_000, n_sims=2_000
)
```
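To make step 2 concrete, the following is a minimal standard-library sketch of what a two-proportion z-test with a Wald confidence interval computes on the same counts. The function and field names are hypothetical and only loosely mirror the attributes listed above; the library's `analyze.proportion_test` may differ, for example in how it defines lift or builds the interval.

```python
from math import sqrt
from statistics import NormalDist
from typing import NamedTuple


class ProportionTestResult(NamedTuple):
    # Hypothetical result container; `lift` is the ABSOLUTE difference in rates.
    lift: float
    ci_lower: float
    ci_upper: float
    p_value: float


def two_proportion_ztest(control_conversions, control_n,
                         treatment_conversions, treatment_n, alpha=0.05):
    """Two-sided two-proportion z-test plus a Wald CI for the difference."""
    p_c = control_conversions / control_n
    p_t = treatment_conversions / treatment_n
    lift = p_t - p_c

    # The test statistic uses the pooled rate, which is the estimate under H0.
    p_pool = (control_conversions + treatment_conversions) / (control_n + treatment_n)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))
    z = lift / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The interval uses the unpooled standard error (no H0 assumption).
    se_unpooled = sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treatment_n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return ProportionTestResult(
        lift=lift,
        ci_lower=lift - z_crit * se_unpooled,
        ci_upper=lift + z_crit * se_unpooled,
        p_value=p_value,
    )


result = two_proportion_ztest(520, 10_000, 565, 10_000)
```

On these counts the interval spans zero, so 10,000 users per arm is not enough to call a 0.45 percentage-point difference significant at the 5% level.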
## Notebooks
| # | Notebook | What it shows |
|---|---|---|
| 01 | `01_design_and_analyze.ipynb` | End-to-end: design an experiment, simulate users, analyze results |
| 02 | `02_privacy_preserving.ipynb` | Same test, with differential privacy. Privacy/utility frontier |
| 03 | `03_case_study_at_scale.ipynb` | Criteo Uplift dataset (~14M rows) bucketed in BigQuery + PySpark |
## Stack
| Layer | Tool |
| --- | --- |
| Core stats | `scipy.stats`, `numpy`, `statsmodels` |
| Differential privacy | Custom Laplace/Gaussian mechanisms (and `diffprivlib` for comparison) |
| Scale demos | PySpark, BigQuery |
| Testing | `pytest` |
| Notebooks | Jupyter, Matplotlib, Seaborn |
| Packaging | `pyproject.toml` (installable via `pip install -e .`) |
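For readers unfamiliar with the Laplace mechanism named in the table, here is a minimal sketch of how it releases a noisy count. This is an illustration under stated assumptions, not the code in `private.py`: the helper names are hypothetical, each user is assumed to change the count by at most 1 (sensitivity 1), and Python's `random` module is used for readability even though it is not suitable for a real privacy guarantee.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent Exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release `true_count` with epsilon-DP Laplace noise (hypothetical helper).

    Assumes one user can change the count by at most `sensitivity`.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Fixed seed only so the demo is reproducible; never seed in production.
rng = random.Random(0)
noisy_conversions = private_count(520, epsilon=1.0, rng=rng)
```

The noise scale is `sensitivity / epsilon`, so halving ε doubles the typical error on each released count. A production implementation would also need a secure noise source and protection against floating-point attacks, which this sketch ignores.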
## Repository layout
```
ab-test-lab/
├── ab_test_lab/            # library code (importable as `ab_test_lab`)
│   ├── design.py           # sample size & power
│   ├── analyze.py          # frequentist tests & CIs
│   ├── sequential.py       # sequential / always-valid inference
│   ├── private.py          # differential privacy mechanisms
│   └── simulate.py         # Monte Carlo simulations
├── notebooks/              # end-to-end demonstrations
├── tests/                  # pytest suite
├── pipelines/              # PySpark + BigQuery scale demos
├── docs/methodology.md     # methodology deep-dive
└── pyproject.toml
```
## Getting started
```bash
git clone https://github.com/HarshikaReddyUppula/ab-test-lab.git
cd ab-test-lab
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
pytest

# Open the demo
jupyter lab notebooks/01_design_and_analyze.ipynb
```
## Methodology
Full writeup: [docs/methodology.md](docs/methodology.md). Topics:
- Why the standard two-proportion formula under-estimates *n* by ~10% when conversion rates are small.
- The "peeking problem" and Pocock vs. O'Brien-Fleming spending functions.
- The privacy/utility tradeoff: what ε buys you, what it costs in detectable effect.
- When to use this vs. a Bayesian framework (and when *both*).
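To give a feel for the Pocock vs. O'Brien-Fleming contrast, the sketch below implements the two Lan-DeMets spending functions that approximate those designs. It is an illustration, not the contents of `sequential.py`: the function names are hypothetical, and it returns only the cumulative alpha allowed at information fraction `t`. Converting that into z-score boundaries needs numerical integration over the joint distribution of the looks, which is omitted here.

```python
from math import e, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending: alpha(t) = 2 - 2 * Phi(z_{alpha/2} / sqrt(t))."""
    if not 0 < t <= 1:
        raise ValueError("information fraction t must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / sqrt(t)))


def pocock_spending(t, alpha=0.05):
    """Pocock-type spending: alpha(t) = alpha * ln(1 + (e - 1) * t)."""
    if not 0 < t <= 1:
        raise ValueError("information fraction t must be in (0, 1]")
    return alpha * log(1 + (e - 1) * t)


# Cumulative alpha available at five equally spaced looks.
looks = [0.2, 0.4, 0.6, 0.8, 1.0]
obf_schedule = [obf_spending(t) for t in looks]
pocock_schedule = [pocock_spending(t) for t in looks]
```

Both schedules reach the full α at `t = 1`. The O'Brien-Fleming-type function spends almost nothing at early looks, while the Pocock-type function spends it far more evenly, which makes early stopping easier at the cost of a stricter final threshold.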
## Roadmap
- [ ] **design** — proportion + continuous sample size, cluster-randomized adjustments
- [ ] **analyze** — Welch's t-test, two-proportion z-test, bootstrap CIs, CUPED variance reduction
- [ ] **sequential** — Pocock and O'Brien-Fleming alpha-spending; mSPRT
- [ ] **private** — Laplace + Gaussian mechanisms, composition accounting, private proportion test
- [ ] **simulate** — power curves, false-positive rate verification under peeking
- [ ] **notebooks** — 01 design+analyze, 02 privacy, 03 scale (Criteo + PySpark + BigQuery)
- [ ] **tests** — ≥80% coverage; property-based tests via `hypothesis` for noise mechanisms
- [ ] **CI** — GitHub Actions: ruff + pytest on PR
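The "false-positive rate verification under peeking" item could be checked with a simulation along the following lines. This is a sketch under simplifying assumptions, not the planned `simulate` module: the function name is hypothetical, looks are equally spaced in information, and the z-statistic is simulated directly as a scaled random walk rather than from user-level conversions.

```python
import random
from math import sqrt


def peeking_false_positive_rate(n_looks, n_sims=20_000, z_crit=1.96, seed=0):
    """Share of A/A experiments flagged significant at ANY of `n_looks` peeks.

    Under H0, with equal information between looks, the z-statistic at look k
    is S_k / sqrt(k), where S_k is a sum of k independent N(0, 1) increments.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            cumulative += rng.gauss(0.0, 1.0)
            # Stop at the first look that crosses the fixed, uncorrected threshold.
            if abs(cumulative / sqrt(look)) > z_crit:
                false_positives += 1
                break
    return false_positives / n_sims


single_look = peeking_false_positive_rate(n_looks=1)
five_looks = peeking_false_positive_rate(n_looks=5)
```

With one look the rate stays near the nominal 5%. With five uncorrected looks it rises to roughly 14%, which is the inflation that alpha-spending is meant to remove.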
## License
MIT — use it, fork it, ship it.