PreethiAndichamy342/datagate-llm

GitHub: PreethiAndichamy342/datagate-llm

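datagate-llm is a zero-dependency, local inference-boundary library that scans for and handles sensitive data such as PII and secrets before text is sent to an LLM API.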


# datagate-llm

[![PyPI version](https://img.shields.io/pypi/v/datagate-llm.svg)](https://pypi.org/project/datagate-llm/) [![Python versions](https://img.shields.io/pypi/pyversions/datagate-llm.svg)](https://pypi.org/project/datagate-llm/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Tests](https://github.com/datagate-llm/datagate-llm/actions/workflows/test.yml/badge.svg)](https://github.com/datagate-llm/datagate-llm/actions/workflows/test.yml)

**The inference boundary layer between your data and outbound AI requests.**

Scan text for sensitive data (PII, secrets, credentials, and sector-specific identifiers) before it leaves your system and reaches an LLM API.

## The problem

In 2023, Samsung engineers accidentally leaked proprietary source code and internal meeting notes by pasting them into ChatGPT. The data was retained and could have been used for model training. This is not a hypothetical risk: it is the default behavior when you send unfiltered text to an external AI model.

datagate-llm is the guard you place in front of that API call. It inspects what you are about to send, tells you what it found, and lets you decide whether to flag, redact, or block.

## Installation

```
pip install datagate-llm
```

Zero dependencies. Requires Python 3.9+. Works offline.

## Quick start

```
from datagate_llm import scan

# Basic scan
result = scan("Contact Alice at alice@company.com or call 415-555-0192")
print(result["safe"])        # False
print(result["risk_score"])  # 0.8 (or similar)
print(result["findings"])    # list of matched spans

# Redact mode: replace PII before sending to the LLM
result = scan(
    "My SSN is 123-45-6789 and card number 4111111111111111",
    mode="redact"
)
print(result["redacted_text"])
# "My SSN is [REDACTED:universal/ssn] and card number [REDACTED:universal/credit_card]"

# Block mode: hard stop on high-risk content
result = scan("AKIAIOSFODNN7EXAMPLEKEY", sectors=["technology"], mode="block")
if result["action"] == "block":
    raise ValueError("Refusing to send credentials to LLM")

# Multi-sector scan
result = scan(
    "Patient MRN: AB12345, account 123456789012",
    sectors=["healthcare", "finance"]
)
for finding in result["findings"]:
    print(finding["rule_id"], finding["severity"], finding["confidence"])
```

## What it detects

| Category | Rule ID | Severity |
|----------|---------|----------|
| Email address | `universal/email` | high |
| US phone number | `universal/phone_us` | medium |
| Social Security number (SSN) | `universal/ssn` | critical |
| Credit card number | `universal/credit_card` | critical |
| IP address | `universal/ip_address` | low |
| AWS access key | `technology/aws_access_key` | critical |
| OpenAI API key | `technology/openai_key` | critical |
| Anthropic API key | `technology/anthropic_key` | critical |
| GitHub token | `technology/github_token` | critical |
| Stripe key | `technology/stripe_key` | critical |
| JWT token | `technology/jwt_token` | high |
| Private key (PEM) | `technology/private_key` | critical |
| Database connection string | `technology/connection_string` | critical |
| NPI number | `healthcare/npi_number` | high |
| ICD-10 diagnosis code | `healthcare/icd10_code` | medium |
| Insurance member ID | `healthcare/insurance_member_id` | high |
| Medical record number | `healthcare/medical_record_number` | critical |
| DEA number | `healthcare/dea_number` | critical |
| IBAN | `finance/iban` | high |
| SWIFT/BIC code | `finance/swift_bic` | medium |
| ABA routing number | `finance/routing_number` | high |
| Bank account number | `finance/bank_account` | high |
| Tax ID / EIN | `finance/tax_id_ein` | critical |
| Bitcoin address | `finance/crypto_btc` | medium |
| Ethereum address | `finance/crypto_eth` | medium |

## How it works

```
text input
    │
    ▼
tokenize()      ← NFKC normalization, zero-width char removal
    │
    ▼
match()         ← regex scan against compiled rule set
    │
    ▼
score()         ← context-aware confidence (boost / suppress words)
    │
    ▼
resolve()       ← remove overlapping spans, keep highest confidence
    │
    ▼
aggregate()     ← single risk_score in [0.0, 1.0]
    │
    ▼
build_result()  ← assemble final dict with action, findings, fingerprint
```

Every step is a pure function. No network calls. No disk writes. No global state apart from an in-process rule cache.

## Scan modes

| Mode | When risk > 0 | Use case |
|------|---------------|----------|
| `flag` (default) | `action = "flag"` | Log and review before sending |
| `redact` | `action = "flag"`, and matched spans are replaced in `redacted_text` | Strip PII and send the sanitized text |
| `block` | `action = "block"` | Hard stop: raise an error upstream |

## Honest limitations

- **Regex only**: datagate-llm uses deterministic pattern matching. It cannot catch PII embedded in obfuscated text, paraphrased content, or formats it has never seen.
- **English-centric**: Phone and ID number patterns currently target mainly US formats. International variants may be missed.
- **No semantic understanding**: "The patient's temperature was 98.6" is not flagged as health data because no pattern matches it. Semantic scanning would require the optional `onnxruntime` layer, which has not been released yet.
- **False positives are possible**: Short patterns such as SWIFT codes can match arbitrary uppercase strings. Use `context.suppress` words in your rule JSON to cut down on false positives.
- **Not a compliance tool**: Passing a scan does not mean a document complies with HIPAA, GDPR, or PCI-DSS. Treat it as one layer of a defense-in-depth strategy, not the only one.

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). In short: add a rule as JSON, add tests, and open a PR.

## License

MIT. See [LICENSE](LICENSE).
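
## Pipeline sketch (illustrative)

To make the normalize, match, resolve, and redact steps of the pipeline above more concrete, here is a minimal sketch of how such stages could fit together. It is not the library's implementation: the three regex patterns, their confidence values, and the names `RULES`, `Finding`, `normalize`, `match`, `resolve`, `redact`, and `sketch_redact` are all assumptions made up for this example. Only the rule IDs and the `[REDACTED:<rule_id>]` output format come from the README.

```python
import re
import unicodedata
from typing import List, NamedTuple

# Hypothetical rules for illustration only. The library's real patterns and
# confidence values are not shown in this README.
RULES = {
    "universal/ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.9),
    "universal/credit_card": (re.compile(r"\b\d{16}\b"), 0.8),
    "universal/phone_us": (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), 0.6),
}


class Finding(NamedTuple):
    rule_id: str
    start: int
    end: int
    confidence: float


def normalize(text: str) -> str:
    """Apply NFKC normalization and strip common zero-width characters."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub("[\u200b\u200c\u200d\u2060\ufeff]", "", text)


def match(text: str) -> List[Finding]:
    """Run every rule's regex over the text and collect matched spans."""
    return [
        Finding(rule_id, m.start(), m.end(), confidence)
        for rule_id, (pattern, confidence) in RULES.items()
        for m in pattern.finditer(text)
    ]


def resolve(findings: List[Finding]) -> List[Finding]:
    """Drop overlapping spans, keeping the highest-confidence one."""
    kept: List[Finding] = []
    for f in sorted(findings, key=lambda f: -f.confidence):
        # Keep f only if it does not overlap any span already kept.
        if all(f.end <= k.start or f.start >= k.end for k in kept):
            kept.append(f)
    return sorted(kept, key=lambda f: f.start)


def redact(text: str, findings: List[Finding]) -> str:
    """Replace each non-overlapping span with a [REDACTED:<rule_id>] marker."""
    out, cursor = [], 0
    for f in findings:
        out.append(text[cursor:f.start])
        out.append(f"[REDACTED:{f.rule_id}]")
        cursor = f.end
    out.append(text[cursor:])
    return "".join(out)


def sketch_redact(text: str) -> str:
    clean = normalize(text)
    return redact(clean, resolve(match(clean)))


print(sketch_redact("My SSN is 123-45-6789 and card number 4111111111111111"))
```

With these example rules, the script prints `My SSN is [REDACTED:universal/ssn] and card number [REDACTED:universal/credit_card]`. Normalizing before matching is what lets a pattern still catch a value that has a zero-width character inserted into it.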

Tags: Python, Large Language Models (LLM), Data Masking, No Backdoors, Cybersecurity, Reverse Engineering Tools, Privacy Protection