hharcolezi/multi-freq-ldpy
GitHub: hharcolezi/multi-freq-ldpy
提供在本地差分隐私保证下进行单次、多维和纵向频率估计的 Python 工具包,用于隐私保护的数据收集与统计。
Stars: 52 | Forks: 5
# Multi-Freq-LDPy: Python 中的本地差分隐私多频率估计
[](https://pypi.org/project/multi-freq-ldpy/)
[](https://pypi.org/project/multi-freq-ldpy/)
[](https://github.com/hharcolezi/multi-freq-ldpy/blob/main/LICENSE)
[](https://pepy.tech/project/multi-freq-ldpy)
[](https://github.com/hharcolezi/multi-freq-ldpy/stargazers)
Multi-Freq-LDPy 是一个 Python 库,用于在本地差分隐私(LDP)保证下执行多频率估计任务(一次性、多维、纵向以及两者的组合)。主要目标是提供一个易于使用且执行快速的工具包,用于基准测试和实验最新的解决方案及 LDP 协议。
## 📚 资源
以下是我们包的简介[视频演示](https://screencast-o-matic.com/watch/c3hhQYVYNDi)、[幻灯片演示](http://hharcolezi.github.io/files/2022_Multi_Freq_LDPy_Presentation.pdf)和[演示论文](https://arxiv.org/abs/2205.02648)。
## 📜 引用
如果我们的代码和工作对您有帮助,我们希望您能引用:
```
@incollection{Arcolezi2022,
doi = {10.1007/978-3-031-17143-7_40},
url = {https://doi.org/10.1007/978-3-031-17143-7_40},
year = {2022},
publisher = {Springer Nature Switzerland},
pages = {770--775},
author = {H{\'{e}}ber H. Arcolezi and Jean-Fran{\c{c}}ois Couchot and S{\'{e}}bastien Gambs and Catuscia Palamidessi and Majid Zolfaghari},
title = {Multi-Freq-{LDPy}: Multiple Frequency Estimation Under Local Differential Privacy in~Python},
booktitle = {Computer Security {\textendash} {ESORICS} 2022}
}
```
```
@incollection{Arcolezi2023,
doi = {10.1007/978-3-031-37586-6_11},
url = {https://doi.org/10.1007/978-3-031-37586-6_11},
year = {2023},
publisher = {Springer Nature Switzerland},
pages = {165--183},
author = {H{\'{e}}ber H. Arcolezi and Selene Cerna and Catuscia Palamidessi},
title = {On the Utility Gain of Iterative Bayesian Update for Locally Differentially Private Mechanisms},
booktitle = {Data and Applications Security and Privacy {XXXVII}}
}
```
## 🚀 安装说明
请使用包管理器 [pip](https://pypi.org/project/multi-freq-ldpy/) 安装 multi-freq-ldpy。
```
pip install multi-freq-ldpy --upgrade
```
为确保您使用的是最新版本。
Multi-Freq-LDPy 需要以下 Python 包。
```
numpy
numba
xxhash
```
## 📁 内容
Multi-Freq-LDPy 及其模块的结构如下。
```
multi-freq-ldpy package
|
|- pure_frequency_oracles (Single Frequency Estimation)
| |- GRR (Generalized Randomized Response[1,2] a.k.a. k-RR or Direct Encoding)
| |- UE (Unary Encoding)
| | |- SUE (Symmetric UE[3] a.k.a. Basic One-Time RAPPOR[11])
| | |- OUE (Optimized UE[3])
| |- HE (Histogram Encoding)
| | |- SHE (Summation with HE[3])
| | |- THE (Thresholding with HE[3])
| |- LH (Local Hashing)
| | |- BLH (Binary LH[3,4])
| | |- OLH (Optimized LH[3])
| |- SS (Subset Selection[5,6])
| |- ADP (Adaptive, i.e., GRR or OUE)
|
|- mdim_freq_est (Multidimensional Frequency Estimation)
| |- SPL_solution (Splitting solution[7,8]): Splits the privacy budget and sanitizes using pure_frequency_oracles LDP protocols
| | |- SPL_GRR, SPL_SUE, SPL_OUE, SPL_BLH, SPL_OLH, SPL_SS, SPL_ADP
| |- SMP_solution (Random Sampling solution[7,8]): Samples a single attribute and sanitizes using pure_frequency_oracles LDP protocols
| | |- SMP_GRR, SMP_SUE, SMP_OUE, SMP_BLH, SMP_OLH, SMP_SS, SMP_ADP
| |- RSpFD_solution (Random Sampling + Fake Data solution[9]): Samples a single attribute to sanitize but also generates fake data for each non-sampled attribute
| | |- RSpFD_GRR (fake data generated following domain size)
| | |- RSpFD_SUE_zero (fake data generated with SUE applied to a zero-vector)
| | |- RSpFD_SUE_rnd (fake data generated with SUE applied to a random bit-vector)
| | |- RSpFD_OUE_zero (fake data generated with OUE applied to a zero-vector)
| | |- RSpFD_OUE_rnd (fake data generated with OUE applied to a random bit-vector)
| | |- RSpFD_ADP (RSpFD_GRR or RSpFD_OUE_z)
|
|- long_freq_est (Longitudinal Single Frequency Estimation)
| |- L_GRR (Longitudinal GRR[10])
| |- L_LH (Longitudinal LH[12])
| | |- L_BLH (Binary LH[12])
| | |- L_OLH (Optimized LH[12])
| |- L_OUE (Longitudinal OUE[10])
| |- L_OSUE (Longitudinal OUE-SUE[10])
| |- L_SUE (Longitudinal SUE[10], a.k.a. Basic RAPPOR[11])
| |- L_SOUE (Longitudinal SUE-OUE[10])
| |- L_ADP (Longitudinal ADP[10], i.e., L-GRR or L-OSUE)
| |- dBitFlipPM[13]
|
|- long_mdim_freq_est (Longitudinal Multidimensional Frequency Estimation)
| |- Longitudinal SPL (L_SPL_Solution[10]): Splits the privacy budget and sanitizes using long_freq_est LDP protocols
| | |- SPL_L_GRR, SPL_L_OUE, SPL_L_OSUE, SPL_L_SUE, SPL_L_SOUE, SPL_L_ADP, SPL_dBitFlipPM
| |- Longitudinal SMP (L_SMP_Solution[10]): Samples a single attribute and sanitizes using long_freq_est LDP protocols
| | |- SMP_L_GRR, SMP_L_OUE, SMP_L_OSUE, SMP_L_SUE, SMP_L_SOUE, SMP_L_ADP, SMP_dBitFlipPM
|
| - estimators (Distribution Estimator Methods)
| |- MI (Matrix Inverse)
| |- IBU (Iterative Bayesian Estimator[14])
| |- MLE (Maximum Likelihood Estimator[15])
```
## ⚡ 使用方法
这是一个基于函数的包,用于模拟用户和服务器的 LDP 数据收集流程。每个功能都有 ```Client``` 和 ```Aggregator``` 函数。如需更多详细信息,请参阅[tutorials](https://github.com/hharcolezi/multi-freq-ldpy/tree/main/tutorials)文件夹,其中涵盖使用真实世界开放数据集([Adult](https://archive.ics.uci.edu/ml/datasets/adult)、[Nursery](https://archive.ics.uci.edu/ml/datasets/nursery)、[MS-FIMU](https://github.com/hharcolezi/OpenMSFIMU))的所有 1--4 项任务。
```
# 常用库
import numpy as np
import matplotlib.pyplot as plt
# L-SUE 协议(又称 Basic RAPPOR[11])的 Multi-Freq-LDPy 函数
from multi_freq_ldpy.long_freq_est.L_SUE import L_SUE_Client, L_SUE_Aggregator_MI, L_SUE_Aggregator_IBU
# 模拟参数
epsilon_perm = 2 # longitudinal privacy guarantee, i.e., upper bound (infinity reports)
epsilon_1 = 0.5 * epsilon_perm # single report privacy guarantee, i.e., lower bound
n = int(1e6) # number of users
k = 5 # attribute's domain size
# 每个用户拥有 [0-k) 范围内数字、包含 n 个用户的模拟数据集
data = np.random.randint(k, size=n)
# 客户端模拟
l_sue_reports = [L_SUE_Client(input_data, k, epsilon_perm, epsilon_1) for input_data in data]
# 采用矩阵求逆(MI)的服务端聚合模拟
l_sue_est_freq_MI = L_SUE_Aggregator_MI(l_sue_reports, epsilon_perm, epsilon_1)
# 采用迭代贝叶斯更新(IBU)[14] 的服务端聚合模拟
l_sue_est_freq_IBU = L_SUE_Aggregator_IBU(l_sue_reports, k, epsilon_perm, epsilon_1)
# 真实频率
real_freq = np.unique(data, return_counts=True)[-1] / n
# 结果可视化
x = np.arange(k) # the label locations
barwidth = 0.3 # the width of the bars
plt.bar(x - barwidth, real_freq, label='Real Freq', width=barwidth)
plt.bar(x , l_sue_est_freq_MI, label='MI Est Freq: L-SUE', width=barwidth)
plt.bar(x + barwidth, l_sue_est_freq_IBU, label='IBU Est Freq: L-SUE', width=barwidth)
plt.ylabel('Normalized Frequency')
plt.xlabel('Domain values')
plt.legend(ncol=3, loc='upper right', bbox_to_anchor=(1., 1.1))
plt.show();
```
## 📬 联系方式
如有任何问题,请联系 [Héber H. Arcolezi](https://hharcolezi.github.io/)。
## 📝 许可证
[MIT](https://github.com/hharcolezi/multi-freq-ldpy/blob/main/LICENSE)
## 📚 主要参考文献
- [1] Kairouz, Peter, Keith Bonawitz, and Daniel Ramage. "Discrete distribution estimation under local privacy." International Conference on Machine Learning. PMLR, 2016.
- [2] Kairouz, Peter, Sewoong Oh, and Pramod Viswanath. "Extremal mechanisms for local differential privacy." Advances in neural information processing systems 27 (2014).
- [3] Wang, Tianhao, et al. "Locally differentially private protocols for frequency estimation." 26th USENIX Security Symposium (USENIX Security 17). 2017.
- [4] Bassily, Raef, and Adam Smith. "Local, private, efficient protocols for succinct histograms." Proceedings of the forty-seventh annual ACM symposium on Theory of computing. 2015.
- [5] Ye, Min, and Alexander Barg. "Optimal schemes for discrete distribution estimation under locally differential privacy." IEEE Transactions on Information Theory 64.8 (2018): 5662-5676.
- [6] Wang, Shaowei, et al. "Mutual information optimally local private discrete distribution estimation." arXiv preprint arXiv:1607.08025 (2016).
- [7] Nguyên, Thông T., et al. "Collecting and analyzing data from smart device users with local differential privacy." arXiv preprint arXiv:1606.05053 (2016).
- [8] Wang, Ning, et al. "Collecting and analyzing multidimensional data with local differential privacy." 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019.
- [9] Arcolezi, Héber H., et al. "Random sampling plus fake data: Multidimensional frequency estimates with local differential privacy." Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021.
- [10] Arcolezi, Héber H., et al. "Improving the utility of locally differentially private protocols for longitudinal and multidimensional frequency estimates." Digital Communications and Networks (2022).
- [11] Erlingsson, Úlfar, Vasyl Pihur, and Aleksandra Korolova. "Rappor: Randomized aggregatable privacy-preserving ordinal response." Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. 2014.
- [12] Arcolezi, Héber H., et al. "Frequency Estimation of Evolving Data Under Local Differential Privacy." arXiv preprint arXiv:2210.00262 (2022).
- [13] Ding, Bolin, Janardhan Kulkarni, and Sergey Yekhanin. "Collecting telemetry data privately." Advances in Neural Information Processing Systems 30 (2017).
- [14] Agrawal, Dakshi, and Charu C. Aggarwal. "On the design and quantification of privacy preserving data mining algorithms." Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 2001.
- [15] Pinzón, Carlos Antonio, et al. "Estimating the True Distribution of Data Collected with Randomized Response." arXiv preprint arXiv:2601.08603 (2026).
标签:BSD, LDP协议, meg, Python, 信息安全, 多维数据分析, 差分隐私, 数据挖掘, 无后门, 本地差分隐私, 纵向分析, 统计估计, 网络安全, 逆向工具, 隐私保护, 隐私计算, 频率估计