benovamurat/dp-analytics-toolkit

GitHub: benovamurat/dp-analytics-toolkit

Stars: 0 | Forks: 0

# DP Analytics Toolkit [![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](./LICENSE) [![numpy](https://img.shields.io/badge/numpy-%E2%89%A51.24-orange.svg)](https://numpy.org/) [![scipy](https://img.shields.io/badge/scipy-%E2%89%A51.10-blueviolet.svg)](https://scipy.org/) ## What this is `dp-analytics-toolkit` is a small, dependency-light Python package that implements the core differential-privacy primitives a product-analytics team actually uses: the Laplace mechanism for counts, sums, and means; the Gaussian mechanism with Renyi-DP accounting; sensitivity helpers that bound per-user contributions to a query; basic, advanced, and Renyi-DP composition for tracking the cumulative privacy budget across queries; local-DP primitives (randomized response and a simplified RAPPOR over a Bloom filter) for telemetry where the central server is not in the trust boundary; DP histograms and top-k via report-noisy-max; and accuracy-vs-epsilon and empirical-leakage audit utilities for calibrating and verifying deployments. There is no hard dependency on `opendp` or `diffprivlib`. The mechanisms and accountants here are written from scratch against the standard published references so the math is auditable in one place. The dependencies are `numpy`, `scipy`, `pandas`, and `matplotlib`. ## Why this exists Differential privacy is a formal mathematical guarantee with precise operational consequences: noise injected into aggregates, a privacy budget that depletes with every query, and accuracy that degrades as the population gets smaller. The decision to adopt DP is rarely about the math; it is about whether the guarantee maps to the threat model and whether the accuracy loss is recoverable. This toolkit is the companion code for the long-form essay [Differential Privacy for Product Analytics](https://productphilosophy.com/articles/privacy-preserving-analytics-differential-privacy), which walks through the formalism, the production deployments (US Census Bureau, Apple, Google's RAPPOR), and the failure modes. The code lets you reproduce the accuracy-vs-epsilon curves from the essay, audit the mechanisms against the published bounds, and prototype internal deployments without pulling in a heavy production library. ## Install pip install -e . To also install the test dependencies: pip install -e ".[test]" The package supports Python 3.10 and newer. ## Quickstart Release a noisy count and a noisy mean over synthetic product-analytics events at epsilon = 1.0 each, and track the cumulative budget: import numpy as np from dp_analytics import ( BasicComposition, generate_synthetic_events, laplace_count, laplace_mean, ) rng = np.random.default_rng(2024) events = generate_synthetic_events(n_users=5000, avg_sessions_per_user=4.0, seed=42) budget = BasicComposition() # Query 1: noisy count of premium-feature openings. true_count = int(events["opened_premium_feature"].sum()) noisy_count = laplace_count(true_count, epsilon=1.0, rng=rng) budget.spend(1.0, label="count_premium_opens") # Query 2: noisy mean session duration, clipped per session to [0, 60]. noisy_mean = laplace_mean( events["duration_minutes"].values, epsilon=1.0, lower=0.0, upper=60.0, n_known=len(events), rng=rng, ) budget.spend(1.0, label="mean_duration") print(f"true count : {true_count}") print(f"noisy count: {noisy_count:.2f}") print(f"noisy mean : {noisy_mean:.4f} (true {events['duration_minutes'].mean():.4f})") print(budget) Run end-to-end: python examples/quickstart.py python examples/accuracy_eps_curve.py python examples/local_dp_demo.py ## Method ### Epsilon-differential privacy A randomized algorithm `M` satisfies `epsilon`-DP if, for every pair of datasets `D` and `D'` that differ in a single record, and every event `S`, P(M(D) in S) <= exp(epsilon) * P(M(D') in S) Smaller `epsilon` is a stronger guarantee. `(epsilon, delta)`-DP allows the bound to fail with probability `delta`, typically set to at most `1/n`. The relaxation is what makes Gaussian noise (and tighter composition) practical. ### Laplace mechanism For a real-valued query with L1 sensitivity `Delta_1`, releasing M(D) = q(D) + Lap(0, Delta_1 / epsilon) satisfies `epsilon`-DP. For a counting query (sensitivity 1) at `epsilon = 1`, the noise standard deviation is `sqrt(2) approx 1.41`. ### Gaussian mechanism For a vector-valued query with L2 sensitivity `Delta_2`, releasing M(D) = q(D) + N(0, sigma^2 * I) with sigma >= Delta_2 * sqrt(2 * ln(1.25 / delta)) / epsilon satisfies `(epsilon, delta)`-DP for `epsilon` in `(0, 1]`. The Gaussian mechanism composes much better than Laplace under repeated queries; the right accountant is Renyi DP. ### Composition Three accountants are supplied: | Accountant | When to use | |---|---| | `BasicComposition` | A handful of independent queries; epsilons add. | | `advanced_composition` | Many queries with small per-query epsilon; epsilon grows as `O(sqrt(k))`. | | `RDPAccountant` | Many Gaussian queries; track per-order Renyi divergence and convert to `(epsilon, delta)`-DP at release. | ### Randomized response For binary attributes, randomized response (Warner 1965) satisfies `epsilon`-local-DP: each user reports the true bit with probability `p = e^epsilon / (e^epsilon + 1)` and the flipped bit otherwise. The unbiased estimator for the population proportion is pi_hat = (y_bar + p - 1) / (2 * p - 1) ### Simplified RAPPOR For categorical attributes, a one-shot RAPPOR (Erlingsson et al. 2014) hashes each user's value into a Bloom filter, applies per-bit randomized response, and aggregates across many users. The supplied implementation uses ordinary least squares with a non-negativity projection over a candidate vocabulary; it is teaching-grade, not the deployed multi-round system. ## Worked example: DP histogram at eps=2 vs eps=0.1 A heavy-tailed categorical distribution (think emoji-frequency telemetry over 30,000 users) is recoverable at `epsilon = 2`, where the noise standard deviation per bucket is `1 / 2 = 0.5`, and is visibly degraded at `epsilon = 0.1`, where the per-bucket noise standard deviation is `1 / 0.1 = 10`. import numpy as np import pandas as pd from dp_analytics import dp_histogram from dp_analytics.data import generate_categorical_telemetry values, true_counts = generate_categorical_telemetry(n_users=30000, seed=3) rng = np.random.default_rng(0) for eps in [2.0, 0.1]: out = dp_histogram(values, epsilon=eps, rng=rng) out["rel_err_pct"] = ( (out["released_count"] - out["true_count"]).abs() / out["true_count"].clip(lower=1) * 100.0 ) print(f"\nepsilon = {eps}") print(out.to_string(index=False)) At `epsilon = 2` the relative error on the head of the distribution is typically under 1 percent. At `epsilon = 0.1` the rank order of the top few items is usually preserved, but the long tail is dominated by noise; this is the small-`n` collapse described in the essay. ## API from dp_analytics import ( # Mechanisms. laplace_count, laplace_sum, laplace_mean, laplace_scale, gaussian_mechanism, gaussian_sigma_classic, # Sensitivity. clip_contributions, bounded_sensitivity, # Accounting. BasicComposition, advanced_composition, RDPAccountant, # Local DP. randomize_binary, estimate_proportion_rr, RAPPOR, # Queries. dp_histogram, dp_topk, # Evaluation. accuracy_vs_epsilon, empirical_leakage_audit, # Data. generate_synthetic_events, generate_binary_telemetry, ) Every public function carries a docstring with the formal definition and parameter constraints. ## Tests pytest -q The test suite covers: mean and variance of the Laplace and Gaussian noise distributions on `n = 50,000` samples; clipping bounds; basic composition (epsilons add); RDP closed-form for the Gaussian mechanism matching `rho = alpha / (2 sigma^2)`; randomized-response estimator unbiasedness on a large sample; DP histogram mass preservation under negligible noise; DP top-k recovering the true ranking at large epsilon. ## Limitations The point of writing it from scratch is to keep the math auditable, not to ship a production library. The following matter for any real deployment: 1. **Utility loss is workload-specific.** The accuracy-privacy frontier depends on the population size, the query, and the post-processing. For populations under a few thousand, DP at any defensible epsilon produces noise that dominates the signal; aggregate up or pick a different model. 2. **Side-channel risk.** Naive floating-point implementations of the Laplace and Gaussian distributions leak via the low-order bits of the floating-point representation (Mironov 2012). This toolkit relies on NumPy's PRNG, which is good for prototyping and audit but is not the right RNG for an adversarial production setting. 3. **Composition correctness depends on the accountant matching the deployment.** The `BasicComposition` tally is correct for independent (epsilon, delta)-DP releases. If you use the same noisy intermediate result across queries, you are not in the standard composition setting and the supplied accountants do not apply. 4. **The simplified RAPPOR has no longitudinal defense.** The deployed Google system added a "permanent randomized response" layer to defend against repeated observations of the same user; this toolkit implements only the one-shot mechanism. 5. **Sensitivity calibration is the user's responsibility.** A wrong sensitivity argument silently voids the privacy guarantee. The `bounded_sensitivity` helper is provided as scaffolding, but the per-record contribution bound is a domain question. ## References - Dwork, C., McSherry, F., Nissim, K., Smith, A. (2006). Calibrating Noise to Sensitivity in Private Data Analysis. TCC. - Dwork, C., Roth, A. (2014). The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. - Mironov, I. (2017). Renyi Differential Privacy. CSF. - Mironov, I. (2012). On Significance of the Least Significant Bits for Differential Privacy. CCS. - Erlingsson, U., Pihur, V., Korolova, A. (2014). RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. CCS. - Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., Zhang, L. (2016). Deep Learning with Differential Privacy. CCS. - Warner, S. L. (1965). Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias. JASA. - Dwork, C., Rothblum, G., Vadhan, S. (2010). Boosting and Differential Privacy. FOCS. - Jagielski, M., Ullman, J., Oprea, A. (2020). Auditing Differentially Private Machine Learning: How Private is Private SGD? NeurIPS. ## BibTeX @inproceedings{dwork2006calibrating, title = {Calibrating Noise to Sensitivity in Private Data Analysis}, author = {Dwork, Cynthia and McSherry, Frank and Nissim, Kobbi and Smith, Adam}, booktitle = {Theory of Cryptography Conference (TCC)}, year = {2006} } @book{dwork2014algorithmic, title = {The Algorithmic Foundations of Differential Privacy}, author = {Dwork, Cynthia and Roth, Aaron}, publisher = {Foundations and Trends in Theoretical Computer Science}, year = {2014} } @inproceedings{mironov2017renyi, title = {Renyi Differential Privacy}, author = {Mironov, Ilya}, booktitle = {Computer Security Foundations Symposium (CSF)}, year = {2017} } @inproceedings{erlingsson2014rappor, title = {RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response}, author = {Erlingsson, Ulfar and Pihur, Vasyl and Korolova, Aleksandra}, booktitle = {ACM Conference on Computer and Communications Security (CCS)}, year = {2014} } @inproceedings{abadi2016dpsgd, title = {Deep Learning with Differential Privacy}, author = {Abadi, Martin and Chu, Andy and Goodfellow, Ian and McMahan, H. Brendan and Mironov, Ilya and Talwar, Kunal and Zhang, Li}, booktitle = {ACM Conference on Computer and Communications Security (CCS)}, year = {2016} } ## License MIT, copyright 2026 Murat Ova. See [LICENSE](./LICENSE).