# Presidio-research
This package provides evaluation and data-science capabilities for
[Presidio](https://github.com/microsoft/presidio) and PII detection models in general.
It also includes a fake data generator that creates synthetic sentences based on templates and fake PII.
## Who should use it?
- Anyone interested in **developing or evaluating PII detection models**, an existing Presidio instance or a Presidio PII recognizer.
- Anyone interested in **generating new data based on previous datasets or sentence templates** (e.g., to increase the coverage of entity values) for Named Entity Recognition models.
## Getting started
### Using notebooks
The easiest way to get started is by reviewing the notebooks.
- [Notebook 1](notebooks/1_Generate_data.ipynb): Shows how to use the PII data generator.
- [Notebook 2](notebooks/2_PII_EDA.ipynb): Shows a simple analysis of the PII dataset.
- [Notebook 3](notebooks/3_Split_by_pattern_number.ipynb): Provides tools to split the dataset into train/test/validation sets while avoiding leakage due to the same pattern appearing in multiple folds (only applicable for synthetically generated data).
- [Notebook 4](notebooks/4_Evaluate_Presidio_Analyzer.ipynb): Shows how to use the evaluation tools to measure how well Presidio detects PII. Note that this notebook uses the vanilla Presidio configuration, so the results are not very accurate.
- [Notebook 5](notebooks/5_Evaluate_Custom_Presidio_Analyzer.ipynb): Shows how to configure Presidio to detect PII much more accurately and boost the F-score by ~30%.
### Installation
#### From PyPI
```sh
conda create --name presidio python=3.12
conda activate presidio
pip install presidio-evaluator
python -m spacy download en_core_web_sm # for tokenization
python -m spacy download en_core_web_lg # for NER
```
#### From source
To install the package:
1. Clone the repo
2. Install all dependencies:
```sh
# Install package+dependencies
pip install poetry
poetry install --with=dev

# Download the spaCy pipeline used for tokenization
poetry run python -m spacy download en_core_web_sm

# To install with all additional NER dependencies (e.g. Flair, Stanza), run:
# poetry install --with='ner,dev'

# To use the default Presidio configuration, a spaCy model is required:
poetry run python -m spacy download en_core_web_lg

# Verify installation
poetry run pytest
```
Note that some dependencies (such as Flair and Stanza) are not automatically installed to reduce installation complexity.
## What's in this package?
1. **Fake data generator** for PII recognizers and NER models
2. **Data representation layer** for data generation, modeling and analysis
3. Multiple **Model/Recognizer evaluation** files (e.g. for Presidio, spaCy, Flair, Azure AI Language)
4. **Training and modeling code** for multiple models
5. Helper functions for **results analysis**
## 1. Data generation
See [Data Generator README](presidio_evaluator/data_generator/README.md) for more details.
The data generation process takes a file with templates, e.g. `My name is {{name}}`.
Then, it creates new synthetic sentences by sampling templates and PII values.
Furthermore, it tokenizes the data, creates tags (either IO/BIO/BILUO) and spans for the newly created samples.
- For information on data generation/augmentation, see the data generator [README](presidio_evaluator/data_generator/README.md).
- For an example of running the generation process, see [this notebook](notebooks/1_Generate_data.ipynb).
- For an understanding of the underlying fake PII data used, see this [exploratory data analysis notebook](notebooks/2_PII_EDA.ipynb).
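The generation steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the package's implementation: the templates, value pools, and the helper names `fill_template` and `bio_tags` are hypothetical, and tokenization here is a naive whitespace split rather than the spaCy tokenizer the package uses.

```python
import random
import re

# Hypothetical templates and value pools, for illustration only.
TEMPLATES = ["My name is {{name}}", "{{name}} lives in {{city}}"]
VALUES = {"name": ["Ada Lovelace", "Alan Turing"], "city": ["London", "Paris"]}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template, values, rng):
    """Replace each {{entity}} placeholder with a sampled value and
    record the character span of every inserted value."""
    parts, spans, cursor, out_len = [], [], 0, 0
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        entity = match.group(1)
        value = rng.choice(values[entity])
        parts.append(literal)
        out_len += len(literal)
        spans.append({
            "entity_type": entity.upper(),
            "start": out_len,
            "end": out_len + len(value),
        })
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


def bio_tags(text, spans):
    """Whitespace-tokenize the text and tag tokens that fall inside a span."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for span in spans:
            if match.start() >= span["start"] and match.end() <= span["end"]:
                prefix = "B" if match.start() == span["start"] else "I"
                tag = f"{prefix}-{span['entity_type']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


rng = random.Random(0)
text, spans = fill_template(rng.choice(TEMPLATES), VALUES, rng)
tokens, tags = bio_tags(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Tracking spans while the sentence is assembled avoids having to search for the inserted values afterwards, which would be ambiguous when a value also occurs in the template's literal text.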
Once data is generated, it can be split into train/test/validation sets
while ensuring that each template only exists in one set.
See [this notebook for more details](notebooks/3_Split_by_pattern_number.ipynb).
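The idea of a leakage-free split can be sketched as follows: group samples by the template they came from, then assign whole groups to splits. This is a simplified illustration and not the package's own splitting code; the `template_id` key and the `split_by_template` helper are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_template(samples, ratios, seed=0):
    """Assign whole template groups to splits so that no template
    appears in more than one split.

    `samples` is a list of dicts with a (hypothetical) "template_id" key.
    `ratios` maps split name to the target fraction of templates.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["template_id"]].append(sample)

    template_ids = sorted(groups)
    random.Random(seed).shuffle(template_ids)

    splits = {name: [] for name in ratios}
    names = list(ratios)
    start, cumulative = 0, 0.0
    for i, name in enumerate(names):
        cumulative += ratios[name]
        # The last split takes all remaining templates to avoid rounding gaps.
        if i == len(names) - 1:
            end = len(template_ids)
        else:
            end = round(cumulative * len(template_ids))
        for template_id in template_ids[start:end]:
            splits[name].extend(groups[template_id])
        start = end
    return splits


# 50 toy samples drawn from 10 templates
samples = [{"template_id": i % 10, "text": f"sample {i}"} for i in range(50)]
splits = split_by_template(samples, {"train": 0.7, "test": 0.2, "validation": 0.1})
for name, items in splits.items():
    print(name, len(items))
```

Note that the ratios apply to templates, not samples, so the resulting sample counts can deviate from the target fractions when templates produce different numbers of sentences.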
## 2. Data representation
In order to standardize the process,
we use specific data objects that hold all the information needed for generating,
analyzing, modeling and evaluating data and models. Specifically,
see [data_objects.py](presidio_evaluator/data_objects.py).
The standardized structure, `List[InputSample]`, can be translated into different formats:
- CoNLL
  - To CoNLL:

    ```python
    from presidio_evaluator import InputSample
    dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
    conll = InputSample.create_conll_dataset(dataset)
    conll.to_csv("dataset.csv", sep="\t")
    ```

  - From CoNLL:

    ```python
    from pathlib import Path
    from presidio_evaluator.dataset_formatters import CONLL2003Formatter
    # Read from a folder containing CoNLL-2003 files
    conll_formatter = CONLL2003Formatter(files_path=Path("data/conll2003").resolve())
    train_samples = conll_formatter.to_input_samples(fold="train")
    ```
- spaCy v3

  ```python
  from presidio_evaluator import InputSample
  dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
  InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
  ```
- Flair

  ```python
  from presidio_evaluator import InputSample
  dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
  flair = InputSample.create_flair_dataset(dataset)
  ```
- JSON

  ```python
  from presidio_evaluator import InputSample
  dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
  InputSample.to_json(dataset, output_file="dataset_json")
  ```
## 3. PII models evaluation
The presidio-evaluator framework allows you to evaluate Presidio as a system, a NER model, or a specific PII recognizer for precision, recall, and error analysis. See [Notebook 5](notebooks/5_Evaluate_Custom_Presidio_Analyzer.ipynb) for an example.
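To make the metrics concrete, here is a minimal sketch of token-level precision, recall, and F-beta, treating every non-`O` tag as PII. It does not use the package's evaluator classes; the `token_level_scores` helper and the choice of `beta=2.0` (which weights recall more heavily than precision) are assumptions made for the illustration.

```python
def token_level_scores(gold, predicted, beta=2.0):
    """Compute precision, recall, and F-beta over tokens.

    Any tag other than "O" counts as PII; entity types are ignored,
    so this measures PII vs. non-PII detection only.
    """
    tp = fp = fn = 0
    for gold_tag, pred_tag in zip(gold, predicted):
        gold_pii = gold_tag != "O"
        pred_pii = pred_tag != "O"
        if gold_pii and pred_pii:
            tp += 1
        elif pred_pii:
            fp += 1  # predicted PII where there is none
        elif gold_pii:
            fn += 1  # missed PII

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_beta


gold = ["O", "O", "O", "PERSON", "PERSON", "O", "LOCATION"]
predicted = ["O", "O", "O", "PERSON", "O", "LOCATION", "LOCATION"]
precision, recall, f_beta = token_level_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f_beta={f_beta:.2f}")
```

For error analysis, the same loop can be extended to collect the false-positive and false-negative tokens instead of only counting them.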
## For more information
- [Blog post on NLP approaches to data anonymization](https://towardsdatascience.com/nlp-approaches-to-data-anonymization-1fb5bde6b929)
- [How to evaluate PII Detection output with Presidio Evaluator](https://tranguyen221.medium.com/how-to-evaluate-pii-detection-output-with-presidio-evaluator-3f2684ba3091)
- [Conference talk about leveraging Presidio and utilizing NLP approaches for data anonymization](https://youtu.be/Tl773LANRwY)