DataFog/datafog-python

GitHub: DataFog/datafog-python

Stars: 61 | Forks: 14

# DataFog Python DataFog is a Python library for detecting and redacting personally identifiable information (PII). It provides: - Fast structured PII detection via regex - Optional NER support via spaCy and GLiNER - A simple agent-oriented API for LLM applications - Backward-compatible `DataFog` and `TextService` classes ## Installation # Core install (regex engine) pip install datafog # Add spaCy support pip install datafog[nlp] # Add GLiNER + spaCy support pip install datafog[nlp-advanced] # Everything pip install datafog[all] ## Quick Start import datafog text = "Contact john@example.com or call (555) 123-4567" clean = datafog.sanitize(text, engine="regex") print(clean) # Contact [EMAIL_1] or call [PHONE_1] ## For LLM Applications import datafog # 1) Scan prompt text before sending to an LLM prompt = "My SSN is 123-45-6789" scan_result = datafog.scan_prompt(prompt, engine="regex") if scan_result.entities: print(f"Detected {len(scan_result.entities)} PII entities") # 2) Redact model output before returning it output = "Email me at jane.doe@example.com" safe_result = datafog.filter_output(output, engine="regex") print(safe_result.redacted_text) # Email me at [EMAIL_1] # 3) One-liner redaction print(datafog.sanitize("Card: 4111-1111-1111-1111", engine="regex")) # Card: [CREDIT_CARD_1] ### Guardrails import datafog # Reusable guardrail object guard = datafog.create_guardrail(engine="regex", on_detect="redact") @guard def call_llm() -> str: return "Send to admin@example.com" print(call_llm()) # Send to [EMAIL_1] ## Engines Use the engine that matches your accuracy and dependency constraints: - `regex`: - Fastest and always available. - Best for structured entities: `EMAIL`, `PHONE`, `SSN`, `CREDIT_CARD`, `IP_ADDRESS`, `DATE`, `ZIP_CODE`. - `spacy`: - Requires `pip install datafog[nlp]`. - Useful for unstructured entities like person and organization names. - `gliner`: - Requires `pip install datafog[nlp-advanced]`. - Stronger NER coverage than regex for unstructured text. - `smart`: - Cascades regex with optional NER engines. - If optional deps are missing, it degrades gracefully and warns. ## Backward-Compatible APIs The existing public API remains available. ### `DataFog` class from datafog import DataFog result = DataFog().scan_text("Email john@example.com") print(result["EMAIL"]) ### `TextService` class from datafog.services import TextService service = TextService(engine="regex") result = service.annotate_text_sync("Call (555) 123-4567") print(result["PHONE"]) ## CLI # Scan text datafog scan-text "john@example.com" # Redact text datafog redact-text "john@example.com" # Replace text with pseudonyms datafog replace-text "john@example.com" # Hash detected entities datafog hash-text "john@example.com" ## Telemetry DataFog telemetry is disabled by default. To opt in: export DATAFOG_TELEMETRY=1 To force telemetry off: export DATAFOG_NO_TELEMETRY=1 # or export DO_NOT_TRACK=1 Telemetry does not include input text or detected PII values. ## Development git clone https://github.com/datafog/datafog-python cd datafog-python python -m venv .venv source .venv/bin/activate # Windows: .venv\Scripts\activate pip install -e ".[all,dev]" pip install -r requirements-dev.txt pytest tests/