benovamurat/dp-analytics-toolkit
GitHub: benovamurat/dp-analytics-toolkit
Stars: 0 | Forks: 0
# DP Analytics Toolkit
[](https://www.python.org/)
[](./LICENSE)
[](https://numpy.org/)
[](https://scipy.org/)
## What this is
`dp-analytics-toolkit` is a small, dependency-light Python package that
implements the core differential-privacy primitives a product-analytics
team actually uses: the Laplace mechanism for counts, sums, and means;
the Gaussian mechanism with Renyi-DP accounting; sensitivity helpers
that bound per-user contributions to a query; basic, advanced, and
Renyi-DP composition for tracking the cumulative privacy budget across
queries; local-DP primitives (randomized response and a simplified
RAPPOR over a Bloom filter) for telemetry where the central server is
not in the trust boundary; DP histograms and top-k via report-noisy-max;
and accuracy-vs-epsilon and empirical-leakage audit utilities for
calibrating and verifying deployments.
There is no hard dependency on `opendp` or `diffprivlib`. The
mechanisms and accountants here are written from scratch against the
standard published references so the math is auditable in one place.
The dependencies are `numpy`, `scipy`, `pandas`, and `matplotlib`.
## Why this exists
Differential privacy is a formal mathematical guarantee with precise
operational consequences: noise injected into aggregates, a privacy
budget that depletes with every query, and accuracy that degrades as
the population gets smaller. The decision to adopt DP is rarely about
the math; it is about whether the guarantee maps to the threat model
and whether the accuracy loss is recoverable. This toolkit is the
companion code for the long-form essay
[Differential Privacy for Product Analytics](https://productphilosophy.com/articles/privacy-preserving-analytics-differential-privacy),
which walks through the formalism, the production deployments (US
Census Bureau, Apple, Google's RAPPOR), and the failure modes. The
code lets you reproduce the accuracy-vs-epsilon curves from the essay,
audit the mechanisms against the published bounds, and prototype
internal deployments without pulling in a heavy production library.
## Install
pip install -e .
To also install the test dependencies:
pip install -e ".[test]"
The package supports Python 3.10 and newer.
## Quickstart
Release a noisy count and a noisy mean over synthetic product-analytics
events at epsilon = 1.0 each, and track the cumulative budget:
import numpy as np
from dp_analytics import (
BasicComposition,
generate_synthetic_events,
laplace_count,
laplace_mean,
)
rng = np.random.default_rng(2024)
events = generate_synthetic_events(n_users=5000, avg_sessions_per_user=4.0, seed=42)
budget = BasicComposition()
# Query 1: noisy count of premium-feature openings.
true_count = int(events["opened_premium_feature"].sum())
noisy_count = laplace_count(true_count, epsilon=1.0, rng=rng)
budget.spend(1.0, label="count_premium_opens")
# Query 2: noisy mean session duration, clipped per session to [0, 60].
noisy_mean = laplace_mean(
events["duration_minutes"].values,
epsilon=1.0,
lower=0.0,
upper=60.0,
n_known=len(events),
rng=rng,
)
budget.spend(1.0, label="mean_duration")
print(f"true count : {true_count}")
print(f"noisy count: {noisy_count:.2f}")
print(f"noisy mean : {noisy_mean:.4f} (true {events['duration_minutes'].mean():.4f})")
print(budget)
Run end-to-end:
python examples/quickstart.py
python examples/accuracy_eps_curve.py
python examples/local_dp_demo.py
## Method
### Epsilon-differential privacy
A randomized algorithm `M` satisfies `epsilon`-DP if, for every pair of
datasets `D` and `D'` that differ in a single record, and every event
`S`,
P(M(D) in S) <= exp(epsilon) * P(M(D') in S)
Smaller `epsilon` is a stronger guarantee. `(epsilon, delta)`-DP allows
the bound to fail with probability `delta`, typically set to at most
`1/n`. The relaxation is what makes Gaussian noise (and tighter
composition) practical.
### Laplace mechanism
For a real-valued query with L1 sensitivity `Delta_1`, releasing
M(D) = q(D) + Lap(0, Delta_1 / epsilon)
satisfies `epsilon`-DP. For a counting query (sensitivity 1) at
`epsilon = 1`, the noise standard deviation is `sqrt(2) approx 1.41`.
### Gaussian mechanism
For a vector-valued query with L2 sensitivity `Delta_2`, releasing
M(D) = q(D) + N(0, sigma^2 * I)
with
sigma >= Delta_2 * sqrt(2 * ln(1.25 / delta)) / epsilon
satisfies `(epsilon, delta)`-DP for `epsilon` in `(0, 1]`. The Gaussian
mechanism composes much better than Laplace under repeated queries; the
right accountant is Renyi DP.
### Composition
Three accountants are supplied:
| Accountant | When to use |
|---|---|
| `BasicComposition` | A handful of independent queries; epsilons add. |
| `advanced_composition` | Many queries with small per-query epsilon; epsilon grows as `O(sqrt(k))`. |
| `RDPAccountant` | Many Gaussian queries; track per-order Renyi divergence and convert to `(epsilon, delta)`-DP at release. |
### Randomized response
For binary attributes, randomized response (Warner 1965) satisfies
`epsilon`-local-DP: each user reports the true bit with probability
`p = e^epsilon / (e^epsilon + 1)` and the flipped bit otherwise. The
unbiased estimator for the population proportion is
pi_hat = (y_bar + p - 1) / (2 * p - 1)
### Simplified RAPPOR
For categorical attributes, a one-shot RAPPOR (Erlingsson et al. 2014)
hashes each user's value into a Bloom filter, applies per-bit
randomized response, and aggregates across many users. The supplied
implementation uses ordinary least squares with a non-negativity
projection over a candidate vocabulary; it is teaching-grade, not the
deployed multi-round system.
## Worked example: DP histogram at eps=2 vs eps=0.1
A heavy-tailed categorical distribution (think emoji-frequency
telemetry over 30,000 users) is recoverable at `epsilon = 2`, where
the noise standard deviation per bucket is `1 / 2 = 0.5`, and is
visibly degraded at `epsilon = 0.1`, where the per-bucket noise
standard deviation is `1 / 0.1 = 10`.
import numpy as np
import pandas as pd
from dp_analytics import dp_histogram
from dp_analytics.data import generate_categorical_telemetry
values, true_counts = generate_categorical_telemetry(n_users=30000, seed=3)
rng = np.random.default_rng(0)
for eps in [2.0, 0.1]:
out = dp_histogram(values, epsilon=eps, rng=rng)
out["rel_err_pct"] = (
(out["released_count"] - out["true_count"]).abs()
/ out["true_count"].clip(lower=1) * 100.0
)
print(f"\nepsilon = {eps}")
print(out.to_string(index=False))
At `epsilon = 2` the relative error on the head of the distribution is
typically under 1 percent. At `epsilon = 0.1` the rank order of the
top few items is usually preserved, but the long tail is dominated by
noise; this is the small-`n` collapse described in the essay.
## API
from dp_analytics import (
# Mechanisms.
laplace_count, laplace_sum, laplace_mean, laplace_scale,
gaussian_mechanism, gaussian_sigma_classic,
# Sensitivity.
clip_contributions, bounded_sensitivity,
# Accounting.
BasicComposition, advanced_composition, RDPAccountant,
# Local DP.
randomize_binary, estimate_proportion_rr, RAPPOR,
# Queries.
dp_histogram, dp_topk,
# Evaluation.
accuracy_vs_epsilon, empirical_leakage_audit,
# Data.
generate_synthetic_events, generate_binary_telemetry,
)
Every public function carries a docstring with the formal definition
and parameter constraints.
## Tests
pytest -q
The test suite covers: mean and variance of the Laplace and Gaussian
noise distributions on `n = 50,000` samples; clipping bounds; basic
composition (epsilons add); RDP closed-form for the Gaussian mechanism
matching `rho = alpha / (2 sigma^2)`; randomized-response estimator
unbiasedness on a large sample; DP histogram mass preservation under
negligible noise; DP top-k recovering the true ranking at large
epsilon.
## Limitations
The point of writing it from scratch is to keep the math auditable, not
to ship a production library. The following matter for any real
deployment:
1. **Utility loss is workload-specific.** The accuracy-privacy frontier
depends on the population size, the query, and the post-processing.
For populations under a few thousand, DP at any defensible epsilon
produces noise that dominates the signal; aggregate up or pick a
different model.
2. **Side-channel risk.** Naive floating-point implementations of the
Laplace and Gaussian distributions leak via the low-order bits of
the floating-point representation (Mironov 2012). This toolkit
relies on NumPy's PRNG, which is good for prototyping and audit but
is not the right RNG for an adversarial production setting.
3. **Composition correctness depends on the accountant matching the
deployment.** The `BasicComposition` tally is correct for
independent (epsilon, delta)-DP releases. If you use the same noisy
intermediate result across queries, you are not in the standard
composition setting and the supplied accountants do not apply.
4. **The simplified RAPPOR has no longitudinal defense.** The deployed
Google system added a "permanent randomized response" layer to
defend against repeated observations of the same user; this toolkit
implements only the one-shot mechanism.
5. **Sensitivity calibration is the user's responsibility.** A wrong
sensitivity argument silently voids the privacy guarantee. The
`bounded_sensitivity` helper is provided as scaffolding, but the
per-record contribution bound is a domain question.
## References
- Dwork, C., McSherry, F., Nissim, K., Smith, A. (2006).
Calibrating Noise to Sensitivity in Private Data Analysis. TCC.
- Dwork, C., Roth, A. (2014). The Algorithmic Foundations of
Differential Privacy. Foundations and Trends in Theoretical
Computer Science.
- Mironov, I. (2017). Renyi Differential Privacy. CSF.
- Mironov, I. (2012). On Significance of the Least Significant Bits for
Differential Privacy. CCS.
- Erlingsson, U., Pihur, V., Korolova, A. (2014). RAPPOR:
Randomized Aggregatable Privacy-Preserving Ordinal Response. CCS.
- Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I.,
Talwar, K., Zhang, L. (2016). Deep Learning with Differential
Privacy. CCS.
- Warner, S. L. (1965). Randomized Response: A Survey Technique for
Eliminating Evasive Answer Bias. JASA.
- Dwork, C., Rothblum, G., Vadhan, S. (2010). Boosting and
Differential Privacy. FOCS.
- Jagielski, M., Ullman, J., Oprea, A. (2020). Auditing Differentially
Private Machine Learning: How Private is Private SGD? NeurIPS.
## BibTeX
@inproceedings{dwork2006calibrating,
title = {Calibrating Noise to Sensitivity in Private Data Analysis},
author = {Dwork, Cynthia and McSherry, Frank and Nissim, Kobbi and Smith, Adam},
booktitle = {Theory of Cryptography Conference (TCC)},
year = {2006}
}
@book{dwork2014algorithmic,
title = {The Algorithmic Foundations of Differential Privacy},
author = {Dwork, Cynthia and Roth, Aaron},
publisher = {Foundations and Trends in Theoretical Computer Science},
year = {2014}
}
@inproceedings{mironov2017renyi,
title = {Renyi Differential Privacy},
author = {Mironov, Ilya},
booktitle = {Computer Security Foundations Symposium (CSF)},
year = {2017}
}
@inproceedings{erlingsson2014rappor,
title = {RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response},
author = {Erlingsson, Ulfar and Pihur, Vasyl and Korolova, Aleksandra},
booktitle = {ACM Conference on Computer and Communications Security (CCS)},
year = {2014}
}
@inproceedings{abadi2016dpsgd,
title = {Deep Learning with Differential Privacy},
author = {Abadi, Martin and Chu, Andy and Goodfellow, Ian and McMahan,
H. Brendan and Mironov, Ilya and Talwar, Kunal and Zhang, Li},
booktitle = {ACM Conference on Computer and Communications Security (CCS)},
year = {2016}
}
## License
MIT, copyright 2026 Murat Ova. See [LICENSE](./LICENSE).