hharcolezi/multi-freq-ldpy

GitHub: hharcolezi/multi-freq-ldpy

提供在本地差分隐私保证下进行单次、多维和纵向频率估计的 Python 工具包，用于隐私保护的数据收集与统计。

Stars: 52 | Forks: 5

# Multi-Freq-LDPy: Python 中的本地差分隐私多频率估计 [![PyPI version](https://img.shields.io/pypi/v/multi-freq-ldpy.svg)](https://pypi.org/project/multi-freq-ldpy/) [![Python versions](https://img.shields.io/pypi/pyversions/multi-freq-ldpy.svg)](https://pypi.org/project/multi-freq-ldpy/) [![License](https://img.shields.io/github/license/hharcolezi/multi-freq-ldpy.svg)](https://github.com/hharcolezi/multi-freq-ldpy/blob/main/LICENSE) [![Downloads](https://static.pepy.tech/badge/multi-freq-ldpy)](https://pepy.tech/project/multi-freq-ldpy) [![Stars](https://img.shields.io/github/stars/hharcolezi/multi-freq-ldpy.svg)](https://github.com/hharcolezi/multi-freq-ldpy/stargazers) Multi-Freq-LDPy 是一个 Python 库，用于在本地差分隐私（LDP）保证下执行多频率估计任务（一次性、多维、纵向以及两者的组合）。主要目标是提供一个易于使用且执行快速的工具包，用于基准测试和实验最新的解决方案及 LDP 协议。 ## 📚 资源以下是我们包的简介[视频演示](https://screencast-o-matic.com/watch/c3hhQYVYNDi)、[幻灯片演示](http://hharcolezi.github.io/files/2022_Multi_Freq_LDPy_Presentation.pdf)和[演示论文](https://arxiv.org/abs/2205.02648)。 ## 📜 引用如果我们的代码和工作对您有帮助，我们希望您能引用： ``` @incollection{Arcolezi2022, doi = {10.1007/978-3-031-17143-7_40}, url = {https://doi.org/10.1007/978-3-031-17143-7_40}, year = {2022}, publisher = {Springer Nature Switzerland}, pages = {770--775}, author = {H{\'{e}}ber H. Arcolezi and Jean-Fran{\c{c}}ois Couchot and S{\'{e}}bastien Gambs and Catuscia Palamidessi and Majid Zolfaghari}, title = {Multi-Freq-{LDPy}: Multiple Frequency Estimation Under Local Differential Privacy in~Python}, booktitle = {Computer Security {\textendash} {ESORICS} 2022} } ``` ``` @incollection{Arcolezi2023, doi = {10.1007/978-3-031-37586-6_11}, url = {https://doi.org/10.1007/978-3-031-37586-6_11}, year = {2023}, publisher = {Springer Nature Switzerland}, pages = {165--183}, author = {H{\'{e}}ber H. Arcolezi and Selene Cerna and Catuscia Palamidessi}, title = {On the Utility Gain of Iterative Bayesian Update for Locally Differentially Private Mechanisms}, booktitle = {Data and Applications Security and Privacy {XXXVII}} } ``` ## 🚀 安装说明请使用包管理器 [pip](https://pypi.org/project/multi-freq-ldpy/) 安装 multi-freq-ldpy。 ``` pip install multi-freq-ldpy --upgrade ``` 为确保您使用的是最新版本。 Multi-Freq-LDPy 需要以下 Python 包。 ``` numpy numba xxhash ``` ## 📁 内容 Multi-Freq-LDPy 及其模块的结构如下。 ``` multi-freq-ldpy package | |- pure_frequency_oracles (Single Frequency Estimation) | |- GRR (Generalized Randomized Response[1,2] a.k.a. k-RR or Direct Encoding) | |- UE (Unary Encoding) | | |- SUE (Symmetric UE[3] a.k.a. Basic One-Time RAPPOR[11]) | | |- OUE (Optimized UE[3]) | |- HE (Histogram Encoding) | | |- SHE (Summation with HE[3]) | | |- THE (Thresholding with HE[3]) | |- LH (Local Hashing) | | |- BLH (Binary LH[3,4]) | | |- OLH (Optimized LH[3]) | |- SS (Subset Selection[5,6]) | |- ADP (Adaptive, i.e., GRR or OUE) | |- mdim_freq_est (Multidimensional Frequency Estimation) | |- SPL_solution (Splitting solution[7,8]): Splits the privacy budget and sanitizes using pure_frequency_oracles LDP protocols | | |- SPL_GRR, SPL_SUE, SPL_OUE, SPL_BLH, SPL_OLH, SPL_SS, SPL_ADP | |- SMP_solution (Random Sampling solution[7,8]): Samples a single attribute and sanitizes using pure_frequency_oracles LDP protocols | | |- SMP_GRR, SMP_SUE, SMP_OUE, SMP_BLH, SMP_OLH, SMP_SS, SMP_ADP | |- RSpFD_solution (Random Sampling + Fake Data solution[9]): Samples a single attribute to sanitize but also generates fake data for each non-sampled attribute | | |- RSpFD_GRR (fake data generated following domain size) | | |- RSpFD_SUE_zero (fake data generated with SUE applied to a zero-vector) | | |- RSpFD_SUE_rnd (fake data generated with SUE applied to a random bit-vector) | | |- RSpFD_OUE_zero (fake data generated with OUE applied to a zero-vector) | | |- RSpFD_OUE_rnd (fake data generated with OUE applied to a random bit-vector) | | |- RSpFD_ADP (RSpFD_GRR or RSpFD_OUE_z) | |- long_freq_est (Longitudinal Single Frequency Estimation) | |- L_GRR (Longitudinal GRR[10]) | |- L_LH (Longitudinal LH[12]) | | |- L_BLH (Binary LH[12]) | | |- L_OLH (Optimized LH[12]) | |- L_OUE (Longitudinal OUE[10]) | |- L_OSUE (Longitudinal OUE-SUE[10]) | |- L_SUE (Longitudinal SUE[10], a.k.a. Basic RAPPOR[11]) | |- L_SOUE (Longitudinal SUE-OUE[10]) | |- L_ADP (Longitudinal ADP[10], i.e., L-GRR or L-OSUE) | |- dBitFlipPM[13] | |- long_mdim_freq_est (Longitudinal Multidimensional Frequency Estimation) | |- Longitudinal SPL (L_SPL_Solution[10]): Splits the privacy budget and sanitizes using long_freq_est LDP protocols | | |- SPL_L_GRR, SPL_L_OUE, SPL_L_OSUE, SPL_L_SUE, SPL_L_SOUE, SPL_L_ADP, SPL_dBitFlipPM | |- Longitudinal SMP (L_SMP_Solution[10]): Samples a single attribute and sanitizes using long_freq_est LDP protocols | | |- SMP_L_GRR, SMP_L_OUE, SMP_L_OSUE, SMP_L_SUE, SMP_L_SOUE, SMP_L_ADP, SMP_dBitFlipPM | | - estimators (Distribution Estimator Methods) | |- MI (Matrix Inverse) | |- IBU (Iterative Bayesian Estimator[14]) | |- MLE (Maximum Likelihood Estimator[15]) ``` ## ⚡ 使用方法这是一个基于函数的包，用于模拟用户和服务器的 LDP 数据收集流程。每个功能都有 ```Client``` 和 ```Aggregator``` 函数。如需更多详细信息，请参阅[tutorials](https://github.com/hharcolezi/multi-freq-ldpy/tree/main/tutorials)文件夹，其中涵盖使用真实世界开放数据集（[Adult](https://archive.ics.uci.edu/ml/datasets/adult)、[Nursery](https://archive.ics.uci.edu/ml/datasets/nursery)、[MS-FIMU](https://github.com/hharcolezi/OpenMSFIMU)）的所有 1--4 项任务。 ``` # 常用库 import numpy as np import matplotlib.pyplot as plt # L-SUE 协议（又称 Basic RAPPOR[11]）的 Multi-Freq-LDPy 函数 from multi_freq_ldpy.long_freq_est.L_SUE import L_SUE_Client, L_SUE_Aggregator_MI, L_SUE_Aggregator_IBU # 模拟参数 epsilon_perm = 2 # longitudinal privacy guarantee, i.e., upper bound (infinity reports) epsilon_1 = 0.5 * epsilon_perm # single report privacy guarantee, i.e., lower bound n = int(1e6) # number of users k = 5 # attribute's domain size # 每个用户拥有 [0-k) 范围内数字、包含 n 个用户的模拟数据集 data = np.random.randint(k, size=n) # 客户端模拟 l_sue_reports = [L_SUE_Client(input_data, k, epsilon_perm, epsilon_1) for input_data in data] # 采用矩阵求逆（MI）的服务端聚合模拟 l_sue_est_freq_MI = L_SUE_Aggregator_MI(l_sue_reports, epsilon_perm, epsilon_1) # 采用迭代贝叶斯更新（IBU）[14] 的服务端聚合模拟 l_sue_est_freq_IBU = L_SUE_Aggregator_IBU(l_sue_reports, k, epsilon_perm, epsilon_1) # 真实频率 real_freq = np.unique(data, return_counts=True)[-1] / n # 结果可视化 x = np.arange(k) # the label locations barwidth = 0.3 # the width of the bars plt.bar(x - barwidth, real_freq, label='Real Freq', width=barwidth) plt.bar(x , l_sue_est_freq_MI, label='MI Est Freq: L-SUE', width=barwidth) plt.bar(x + barwidth, l_sue_est_freq_IBU, label='IBU Est Freq: L-SUE', width=barwidth) plt.ylabel('Normalized Frequency') plt.xlabel('Domain values') plt.legend(ncol=3, loc='upper right', bbox_to_anchor=(1., 1.1)) plt.show(); ``` ## 📬 联系方式如有任何问题，请联系 [Héber H. Arcolezi](https://hharcolezi.github.io/)。 ## 📝 许可证 [MIT](https://github.com/hharcolezi/multi-freq-ldpy/blob/main/LICENSE) ## 📚 主要参考文献 - [1] Kairouz, Peter, Keith Bonawitz, and Daniel Ramage. "Discrete distribution estimation under local privacy." International Conference on Machine Learning. PMLR, 2016. - [2] Kairouz, Peter, Sewoong Oh, and Pramod Viswanath. "Extremal mechanisms for local differential privacy." Advances in neural information processing systems 27 (2014). - [3] Wang, Tianhao, et al. "Locally differentially private protocols for frequency estimation." 26th USENIX Security Symposium (USENIX Security 17). 2017. - [4] Bassily, Raef, and Adam Smith. "Local, private, efficient protocols for succinct histograms." Proceedings of the forty-seventh annual ACM symposium on Theory of computing. 2015. - [5] Ye, Min, and Alexander Barg. "Optimal schemes for discrete distribution estimation under locally differential privacy." IEEE Transactions on Information Theory 64.8 (2018): 5662-5676. - [6] Wang, Shaowei, et al. "Mutual information optimally local private discrete distribution estimation." arXiv preprint arXiv:1607.08025 (2016). - [7] Nguyên, Thông T., et al. "Collecting and analyzing data from smart device users with local differential privacy." arXiv preprint arXiv:1606.05053 (2016). - [8] Wang, Ning, et al. "Collecting and analyzing multidimensional data with local differential privacy." 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2019. - [9] Arcolezi, Héber H., et al. "Random sampling plus fake data: Multidimensional frequency estimates with local differential privacy." Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021. - [10] Arcolezi, Héber H., et al. "Improving the utility of locally differentially private protocols for longitudinal and multidimensional frequency estimates." Digital Communications and Networks (2022). - [11] Erlingsson, Úlfar, Vasyl Pihur, and Aleksandra Korolova. "Rappor: Randomized aggregatable privacy-preserving ordinal response." Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. 2014. - [12] Arcolezi, Héber H., et al. "Frequency Estimation of Evolving Data Under Local Differential Privacy." arXiv preprint arXiv:2210.00262 (2022). - [13] Ding, Bolin, Janardhan Kulkarni, and Sergey Yekhanin. "Collecting telemetry data privately." Advances in Neural Information Processing Systems 30 (2017). - [14] Agrawal, Dakshi, and Charu C. Aggarwal. "On the design and quantification of privacy preserving data mining algorithms." Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 2001. - [15] Pinzón, Carlos Antonio, et al. "Estimating the True Distribution of Data Collected with Randomized Response." arXiv preprint arXiv:2601.08603 (2026).

标签：BSD, LDP协议, meg, Python, 信息安全, 多维数据分析, 差分隐私, 数据挖掘, 无后门, 本地差分隐私, 纵向分析, 统计估计, 网络安全, 逆向工具, 隐私保护, 隐私计算, 频率估计