testpatterndev/patterns

GitHub: testpatterndev/patterns

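An open-source registry of DLP detection patterns. It maintains sensitive-data detection rules in a standardized format that can be exported for direct use on major platforms such as Microsoft Purview, GCP DLP, and AWS Macie.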

Stars: 1 | Forks: 0

# TestPattern: DLP Detection Patterns

An open registry of DLP (data loss prevention) detection patterns. It contains regular expressions, keyword lists, and classification rules for detecting sensitive data, including PII, PHI, financial records, government identifiers, and credentials.

**TestPattern is to DLP what Sigma is to SIEM.**

Browse the patterns at [testpattern.dev](https://testpattern.dev).

## Quick Start

Fetch the pre-compiled `patterns.json` for direct use:

```
curl -sL https://raw.githubusercontent.com/testpatterndev/patterns/main/patterns.json -o patterns.json
```

Or clone the repository and compile from the YAML sources:

```
git clone https://github.com/testpatterndev/patterns.git
cd patterns
npm install
npm run compile
```

## Repository Structure

```
data/
  patterns/      1,407 detection pattern YAML files
  collections/   14 curated pattern bundles
  keywords/      105 keyword dictionary YAMLs
  reference/     Large consolidated reference lists (JSON)
scripts/
  compile.js     YAML → patterns.json compiler
patterns.json    Pre-compiled output (checked in for direct consumption)
```

## Pattern Schema

Every pattern follows the `testpattern/v1` schema:

```
schema: testpattern/v1
name: Australian Tax File Number
slug: au-tax-file-number
version: 1.0.0
type: regex          # regex | keyword_list | keyword_dictionary | fingerprint
engine: boost_regex  # boost_regex | pcre2 | ecma | python_re | universal
description: >-
  Human-readable description of what this pattern detects.
operation: >-
  Technical details: validation algorithm, regex approach, corroborative evidence config.
pattern: \b\d{3}[\s-]?\d{3}[\s-]?\d{3}\b
confidence: high     # high | medium | low
confidence_justification: ...
jurisdictions:
  - au
regulations:
  - privacy-act-1988
data_categories:
  - pii
  - financial
  - government-id
corroborative_evidence:
  keywords:
    - tax file number
    - TFN
  proximity: 300
  keyword_lists:
    - au-identity-tfn
test_cases:
  should_match:
    - value: 123 456 789
      description: Standard spaced format
  should_not_match:
    - value: 12 345 678
      description: Only 8 digits
false_positives:
  - description: Generic nine-digit numbers
    mitigation: Require corroborative evidence keywords.
exports:
  - purview_xml
  - yaml
  - regex_copy
scope: narrow        # wide | narrow | specific
purview:             # Optional: full Microsoft SIT definition
  patterns_proximity: 300
  recommended_confidence: 85
  pattern_tiers:
    - confidence_level: 85
      id_match: Regex_1
      matches:
        - ref: Keyword_1
    - confidence_level: 65
      id_match: Regex_1
  regexes:
    - id: Regex_1
      pattern: '\b\d{3}\s?\d{3}\s?\d{3}\b'
      validators: [Validator_1]
  keywords:
    - id: Keyword_1
      groups:
        - match_style: word
          terms: [tax file number, tfn]
  validators:
    - id: Validator_1
      type: Checksum
      weights: '1 4 3 7 5 8 6 9 10'
      mod: 11
      check_digit: last
created: '2026-02-08'
updated: '2026-02-08'
author: testpattern-community
license: MIT
```

The `purview` block is optional. When it is present, the website uses it to produce a full Purview XML export with multiple confidence levels, checksum validators, filters, and nested AND/OR/NOT match trees. When it is absent, the simple export path generates basic XML from the top-level `pattern` and `corroborative_evidence` fields.

## Export Formats

Patterns can be exported in the following formats:

- **Microsoft Purview XML**: RulePack format, importable with `New-DlpSensitiveInformationTypeRulePackage`
- **Purview deployment script**: a PowerShell script that creates the keyword dictionaries and imports the SIT
- **GCP DLP JSON**: InspectTemplate format for Google Cloud DLP
- **AWS Macie JSON**: custom data identifier format for Amazon Macie
- **Raw YAML**: the full pattern definition
- **Regex**: the regular expression, ready to copy to the clipboard

## Data Categories

Patterns cover the following categories of sensitive data:

| Category | Description |
|---|---|
| `pii` | Personally identifiable information |
| `phi` | Protected health information |
| `financial` | Financial records, account numbers |
| `government-id` | Government-issued identifiers |
| `credentials` | API keys, tokens, secrets |
| `security` | Security-sensitive information |
| `location` | Geographic and address data |
| `business-id` | Business identifiers |
| `healthcare` | General healthcare information |
| `government` | Government records |
| `network` | Network identifiers |
| `device-id` | Device identifiers |

## Jurisdictions

Patterns are tagged by jurisdiction: `au`, `us`, `uk`, `eu`, `global`, and specific country codes (`es`, `fr`, `de`, `br`, `ca`, `in`, `jp`, `kr`, `sg`, `za`, and so on).

## Contributing

Contributions are welcome. To add or improve a pattern:

1. Fork this repository
2. Create or edit a YAML file under `data/patterns/`
3. Make sure your pattern meets the quality requirements:
   - At least 3 `should_match` and 2 `should_not_match` test cases
   - False-positive documentation with a mitigation strategy
   - At least one regulation tag and one jurisdiction tag
   - A confidence level with a justification
4. Run `npm run compile` to validate
5. Submit a pull request

See the [contributing guide](https://testpattern.dev/contributing) for details.

## Generating Patterns

### From a description (AI-assisted)

Give the prompt in `prompts/generate-pattern.md` to any AI assistant (Claude, ChatGPT, etc.) as context, then describe the pattern you need. The AI outputs a complete `testpattern/v1` YAML file that can be saved directly into `data/patterns/`.

### From sample data (local script)

Analyze a CSV or text file to detect sensitive data patterns automatically:

```
node scripts/generate-from-sample.js sample.csv
node scripts/generate-from-sample.js sample.csv --output-dir ./drafts
node scripts/generate-from-sample.js sample.csv --verbose
```

The script detects emails, credit cards, IBANs, IP addresses, UUIDs, AWS keys, SSNs, phone numbers, dates, URLs, and unknown structured formats. The output is draft YAML that you can review and refine before submitting.

### From sample data (AI-assisted)

Give the prompt in `prompts/generate-from-sample.md` to any AI assistant as context, then paste your sample data. The AI analyzes the data, identifies every sensitive data type, and generates complete pattern YAML files.

## License

MIT. See [LICENSE](LICENSE).

## Sponsor

[Compl8](https://aairii.com). TestPattern is a community project, not a Compl8 product.
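
## Example: evaluating the sample pattern in JavaScript

To make the schema fields more concrete, here is a minimal sketch of how a consumer might combine `pattern`, `corroborative_evidence`, and the `Checksum` validator from the Australian Tax File Number example. It is not the project's own code. The function names are hypothetical, and three interpretations are assumptions: `proximity` is treated as a character window on each side of the match, keywords are matched as case-insensitive substrings (simpler than `match_style: word`), and the checksum passes when the weighted digit sum is divisible by `mod`. The regex is run with JavaScript's `RegExp` rather than the `boost_regex` engine named in the pattern. The value `123 456 782` is an illustrative input chosen to satisfy the assumed checksum rule; it does not come from the repository.

```javascript
// Values copied from the example pattern; everything else is illustrative.
const pattern = /\b\d{3}[\s-]?\d{3}[\s-]?\d{3}\b/g;
const keywords = ["tax file number", "tfn"];
const proximity = 300;
const weights = [1, 4, 3, 7, 5, 8, 6, 9, 10];
const mod = 11;

// Assumption: a candidate is valid when the weighted digit sum is divisible by `mod`.
function passesChecksum(candidate) {
  const digits = candidate.replace(/\D/g, "").split("").map(Number);
  if (digits.length !== weights.length) return false;
  const sum = digits.reduce((acc, d, i) => acc + d * weights[i], 0);
  return sum % mod === 0;
}

// Assumption: `proximity` is a character window on each side of the match,
// and keywords are compared as lower-cased substrings.
function hasKeywordNearby(text, start, end) {
  const windowText = text
    .slice(Math.max(0, start - proximity), end + proximity)
    .toLowerCase();
  return keywords.some((k) => windowText.includes(k));
}

// Returns every regex hit with its corroboration and checksum status.
function findMatches(text) {
  const results = [];
  for (const m of text.matchAll(pattern)) {
    const start = m.index;
    const end = start + m[0].length;
    results.push({
      value: m[0],
      corroborated: hasKeywordNearby(text, start, end),
      checksumOk: passesChecksum(m[0]),
    });
  }
  return results;
}

console.log(findMatches("My TFN is 123 456 782."));
console.log(findMatches("Order ref 123 456 789"));
```

Keeping the three signals separate mirrors the tiered structure of the `purview` block: a regex hit with a nearby keyword can be reported at a higher confidence than a bare regex hit, and the checksum can be used to discard generic nine-digit numbers.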

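Tags: CMS security, DLP, DNS resolution, GNU General Public License, Homebrew installation, JavaScript, JSON, MITM proxy, Node.js, npm, PHI, PII, Regex, YAML, personal health information, personally identifiable information, keyword matching, compliance, security library, security rule library, open-source project, government identifiers, sensitive data detection, data classification, data visualization, dark UI, network security, custom scripts, financial data security, privacy protection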