BoranT-3000/turkish-pii-detection-openai-privacy-filter

# Turkish PII Detection with OpenAI Privacy Filter

This repository documents an end-to-end Turkish PII detection project based on **OpenAI Privacy Filter**. The project adapts the original English-oriented privacy filtering model to Turkish by creating a large synthetic Turkish privacy NER dataset, fine-tuning the model with Turkish-specific privacy labels, and evaluating the resulting checkpoint on a held-out synthetic test split.

Author: **Boran Toktay**

## Project Overview

The original [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) model is designed for detecting and redacting privacy-sensitive spans in text. However, Turkish contains language-specific challenges such as suffixes, local identifiers, Turkish address patterns, TCKN-like numbers, VKN-like numbers, Turkish phone formats, and IBAN variations.

This project fine-tunes OpenAI Privacy Filter for Turkish PII detection using a custom synthetic dataset and a Turkish-specific privacy label space.

## Hugging Face Resources

| Resource | Link |
|---|---|
| Dataset | [`BTX24/turkish-privacy-pii-ner`](https://huggingface.co/datasets/BTX24/turkish-privacy-pii-ner) |
| Fine-tuned model | [`BTX24/turkish-privacy-filter-pii`](https://huggingface.co/BTX24/turkish-privacy-filter-pii) |
| Base model | [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) |
| Base GitHub repository | [`openai/privacy-filter`](https://github.com/openai/privacy-filter) |

## Main Contributions

This project includes:

- A fully synthetic Turkish privacy NER dataset with **103,923 JSONL rows**
- A Turkish-specific privacy label space
- A Colab-based fine-tuning workflow for OpenAI Privacy Filter
- A fine-tuned Turkish Privacy Filter checkpoint
- Evaluation results on a held-out synthetic Turkish test split
- Documentation for dataset preparation, model training, inference, and limitations

## Repository Structure

Recommended structure:

```text
turkish-pii-detection-openai-privacy-filter/
├── README.md
├── LICENSE
├── privacy_filter_tr_pii_colab.ipynb
├── configs/
│   └── label_space.json
├── reports/
│   └── Turkish_Privacy_Filter_Final_Report_English_Boran_Toktay.docx
```

The full dataset and model weights are hosted on Hugging Face, not directly in this GitHub repository.

## Dataset

The dataset used in this project is available at:

https://huggingface.co/datasets/BTX24/turkish-privacy-pii-ner

### Dataset Summary

| Metric | Value |
|---|---:|
| Total JSONL rows | 103,923 |
| Label classes | 10 |
| Training examples | 83,138 |
| Validation examples | 10,392 |
| Test examples | 10,393 |
| Total characters | 7,404,532 |
| Average text length | 71.3 characters |
| Average entity length | 19.2 characters |
| Dataset type | Synthetic Turkish privacy NER |
| Annotation type | Character-level spans |

The dataset is fully synthetic. No real personal data was intentionally collected or used.

## Label Space

The final Turkish privacy label space is:

```json
{
  "category_version": "tr_privacy_v1",
  "span_class_names": [
    "O",
    "tckn",
    "secret",
    "iban",
    "vkn",
    "account_number",
    "private_address",
    "private_date",
    "private_phone",
    "private_email",
    "private_person"
  ]
}
```

### Label Descriptions

| Label | Description |
|---|---|
| `O` | Outside / non-PII token |
| `tckn` | Synthetic Turkish national identity-like values |
| `secret` | Synthetic passwords, OTPs, API-key-like strings, access tokens, recovery codes |
| `iban` | Synthetic Turkish IBAN-like values |
| `vkn` | Synthetic Turkish tax identification-like values |
| `account_number` | Account numbers, customer numbers, reference codes, membership IDs, order/ticket references |
| `private_address` | Synthetic Turkish-style address expressions |
| `private_date` | Privacy-relevant date expressions |
| `private_phone` | Turkish mobile-number-like synthetic phone numbers |
| `private_email` | Synthetic non-routable e-mail addresses |
| `private_person` | Synthetic Turkish-style person names |

## Dataset Format

The dataset uses JSONL format with
character-level span annotations.

Example:

```json
{
  "text": "Ahmet Yılmaz için telefon numarası 0532 000 00 00 olarak kaydedildi.",
  "spans": {
    "private_person: Ahmet Yılmaz": [[0, 12]],
    "private_phone: 0532 000 00 00": [[35, 49]]
  },
  "info": {
    "id": "synthetic_tr_000001",
    "source": "synthetic_tr"
  }
}
```

Character offsets use Python slicing semantics:

```python
text[start:end]
```

The `end` index is exclusive.

## Model

The fine-tuned model is available at:

https://huggingface.co/BTX24/turkish-privacy-filter-pii

It is based on:

```text
openai/privacy-filter
```

The model was fine-tuned on the Turkish synthetic privacy NER dataset using a custom Turkish privacy label space.

## Training Summary

| Metric | Value |
|---|---:|
| Base model | `openai/privacy-filter` |
| Fine-tuned model | `BTX24/turkish-privacy-filter-pii` |
| Dataset | `BTX24/turkish-privacy-pii-ner` |
| Best epoch | 3 |
| Best metric | `validation_loss` |
| Best validation loss | 0.002157915852249276 |
| Training examples | 83,138 |
| Validation examples | 10,392 |
| Device | Google Colab Pro GPU environment |

### Epoch Metrics

| Epoch | Train Loss | Train Token Accuracy | Validation Loss | Validation Token Accuracy |
|---:|---:|---:|---:|---:|
| 1 | 0.038908 | 0.990768 | 0.003961 | 0.999100 |
| 2 | 0.002714 | 0.999463 | 0.002181 | 0.999583 |
| 3 | 0.001393 | 0.999704 | 0.002158 | 0.999641 |

The best checkpoint was selected at epoch 3 based on validation loss.

## Test Evaluation

Evaluation was performed on the held-out synthetic Turkish test split.

| Metric | Value |
|---|---:|
| Test examples | 10,393 |
| Test tokens | 241,020 |
| Eval mode | typed |
| Loss | 0.0028 |
| Token accuracy | 0.9996 |
| Inference tokens/sec | 3027.20 |

### Detection Metrics

| Metric | Value |
|---|---:|
| Detection precision | 0.9998 |
| Detection recall | 0.9996 |
| Detection F1 | 0.9997 |
| Detection F2 | 0.9996 |
| Span precision | 0.9988 |
| Span recall | 0.9978 |
| Span F1 | 0.9983 |
| Span F2 | 0.9980 |

### Per-Class Span Metrics

| Label | Precision | Recall | F1 | F2 |
|---|---:|---:|---:|---:|
| `tckn` | 1.0000 | 0.9990 | 0.9995 | 0.9992 |
| `secret` | 0.9990 | 1.0000 | 0.9995 | 0.9998 |
| `iban` | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| `vkn` | 0.9990 | 0.9990 | 0.9990 | 0.9990 |
| `account_number` | 1.0000 | 0.9992 | 0.9996 | 0.9993 |
| `private_address` | 0.9980 | 0.9940 | 0.9960 | 0.9948 |
| `private_date` | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| `private_phone` | 0.9991 | 1.0000 | 0.9995 | 0.9998 |
| `private_email` | 1.0000 | 0.9902 | 0.9951 | 0.9922 |
| `private_person` | 0.9928 | 0.9959 | 0.9943 | 0.9953 |

These results are measured on a synthetic test split. Real-world performance may differ.
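To make the F1 and F2 columns easier to read, here is a minimal sketch of the standard F-beta formula applied to a few precision/recall pairs copied from the per-class table. F2 sets beta to 2, which weights recall more heavily than precision. The helper name `f_beta` and the choice of rows are illustrative, and it is an assumption that the OPF evaluation code computes these scores in exactly this way. Because the inputs are already rounded to four decimals, a recomputed value can differ from the table in the last digit.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    Hypothetical helper for illustration; not taken from the OPF codebase.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Rounded (precision, recall) pairs copied from the per-class table.
per_class = {
    "tckn": (1.0000, 0.9990),
    "private_address": (0.9980, 0.9940),
    "private_person": (0.9928, 0.9959),
}

for label, (p, r) in per_class.items():
    f1 = f_beta(p, r, beta=1.0)
    f2 = f_beta(p, r, beta=2.0)
    print(f"{label}: F1={f1:.4f} F2={f2:.4f}")
```

When recall is lower than precision, as for `private_address`, F2 falls below F1; when recall is higher, as for `private_person`, F2 rises above F1.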
## Colab Notebook

The main training notebook is:

```text
notebooks/privacy_filter_tr_pii_colab.ipynb
```

The notebook covers:

- Installing OpenAI Privacy Filter
- Downloading the OPF-native base checkpoint
- Creating `label_space.json`
- Loading and validating the Turkish privacy dataset
- Converting span annotations into OPF-compatible format
- Running fine-tuning
- Running evaluation
- Testing inference on Turkish examples
- Uploading the fine-tuned checkpoint to Hugging Face
- Verifying the uploaded model from Hugging Face

## Important Implementation Notes

During development, several practical engineering issues were resolved:

### JSONL newline issue

A JSONL writing function originally wrote records without newline characters, causing:

```text
JSONDecodeError: Extra data
```

The fix was to ensure each JSON object is written on a separate line:

```python
f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

### OPF checkpoint path issue

The OPF training command expects the OPF-native checkpoint under the `original/` directory of `openai/privacy-filter`, not only the root Transformers-style checkpoint.

Correct checkpoint loading pattern:

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# Assumed local directory name, chosen to match the --checkpoint path below.
BASE_SNAPSHOT_DIR = Path("base_openai_privacy_filter_snapshot")

# Download only the OPF-native files under original/.
snapshot_download(
    repo_id="openai/privacy-filter",
    repo_type="model",
    local_dir=str(BASE_SNAPSHOT_DIR),
    allow_patterns=["original/*"],
)
```

Then use:

```text
--checkpoint path/to/base_openai_privacy_filter_snapshot/original
```

### OPF JSON output parsing issue

Some OPF CLI outputs may include JSON followed by additional color legend text. To avoid `JSONDecodeError`, parse only the first JSON object:

```python
import json
payload, end_idx = json.JSONDecoder().raw_decode(stdout.lstrip())
```

## Local Usage

### 1. Install OpenAI Privacy Filter

```bash
git clone https://github.com/openai/privacy-filter.git
cd privacy-filter
pip install -e .
```

### 2. Download the fine-tuned checkpoint

```bash
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='BTX24/turkish-privacy-filter-pii', local_dir='tr_privacy_filter_pii')"
```

### 3. Run inference

CUDA:

```bash
opf --checkpoint ./tr_privacy_filter_pii --device cuda --format json "Mehmet Kaya TCKN 12345678901"
```

CPU:

```bash
opf --checkpoint ./tr_privacy_filter_pii --device cpu --format json "Mehmet Kaya TCKN 12345678901"
```

More examples:

```bash
opf --checkpoint ./tr_privacy_filter_pii --device cuda --format json "Ahmet Yılmaz için telefon numarası 0532 000 00 00 olarak kaydedildi."
opf --checkpoint ./tr_privacy_filter_pii --device cuda --format json "İade için IBAN TR00 0000 0000 0000 0000 0000 00 bilgisi girildi."
opf --checkpoint ./tr_privacy_filter_pii --device cuda --format json "Doğrulama kodu OTP-482193 destek kaydına yazılmış."
```

## Python Download Example

```python
from huggingface_hub import snapshot_download

checkpoint_dir = snapshot_download(
    repo_id="BTX24/turkish-privacy-filter-pii",
    local_dir="tr_privacy_filter_pii",
)
print(checkpoint_dir)
```

## Example Turkish Test Text

```text
Hasta Mehmet Kaya, 14.03.2025 tarihinde Ankara Çankaya'daki Atatürk Mahallesi No: 12 Daire: 5 adresinden başvuru yaptı. TCKN bilgisi 12345678901 olarak, VKN bilgisi ise 1234567890 olarak kaydedildi. Ödeme için verilen IBAN: TR330006100519786457841326. İletişim telefonu 0555 123 45 67. E-posta adresi mehmet.kaya@example.com. Sistem entegrasyonu için secret değeri sk-demo-1234567890abcdef olarak not edildi.
```

Expected behavior:

- Person names → `private_person`
- Dates → `private_date`
- Addresses → `private_address`
- TCKN-like values → `tckn`
- VKN-like values → `vkn`
- IBAN-like values → `iban`
- Phone numbers → `private_phone`
- E-mails → `private_email`
- Secret-like values → `secret`

## Limitations

This project uses synthetic data. Therefore:

- The model may overfit to synthetic templates.
- Real-world Turkish text may contain noisier, longer, or more ambiguous expressions.
- OCR errors, informal spelling, mixed-language text, and domain-specific documents may reduce performance.
- The model may produce false positives for numeric or code-like strings.
- The model may miss rare PII formats not represented in the synthetic dataset.
- Results should be validated on real or manually curated in-domain test sets before production use.

This model should not be treated as a legal anonymization guarantee.

## Ethical Considerations

The dataset was generated synthetically to avoid intentionally collecting or distributing real personal information. However, models trained on synthetic data can still make mistakes.

For sensitive or production use cases, this model should be used as one component of a broader privacy-preserving pipeline, together with:

- Human review
- Rule-based checks
- Domain-specific validation
- Conservative redaction policies
- Logging and monitoring
- Privacy-by-design safeguards

## Project Deliverables

This project includes:

- Synthetic Turkish privacy NER dataset
- Fine-tuned Turkish Privacy Filter model
- Colab training notebook
- Final academic report
- Dataset card
- Model card
- Evaluation metrics
- Example inference workflow

## License

This repository’s code and documentation are released under the **Apache License 2.0**, unless otherwise stated.

Related resources:

| Resource | License |
|---|---|
| Project code | Apache-2.0 |
| Fine-tuned model | Apache-2.0 |
| Training dataset | CC BY 4.0 |
| Base model | See `openai/privacy-filter` license |

Please review the licenses of the base model, dataset, and fine-tuned checkpoint before use.
## Citation

If you use this project or model, please cite:

```bibtex
@model{toktay_2026_turkish_privacy_filter_pii,
  title        = {Turkish Privacy Filter PII},
  author       = {Boran Toktay},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/BTX24/turkish-privacy-filter-pii}},
  note         = {Fine-tuned OpenAI Privacy Filter checkpoint for Turkish PII span detection}
}
```

If you use the dataset, please also cite:

```bibtex
@dataset{toktay_2026_turkish_privacy_pii_ner,
  title        = {Turkish Privacy PII NER Dataset},
  author       = {Boran Toktay},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/BTX24/turkish-privacy-pii-ner}},
  note         = {Synthetic Turkish privacy-oriented named entity recognition dataset for PII detection}
}
```

## Acknowledgements

This project builds on:

- [`openai/privacy-filter`](https://github.com/openai/privacy-filter)
- [`openai/privacy-filter` on Hugging Face](https://huggingface.co/openai/privacy-filter)
- Hugging Face Hub and Datasets tooling
- Synthetic Turkish PII data generation workflows

## Contact

For questions, suggestions, or issues, please open an issue in this repository.