huajielong/SensFinder

GitHub: huajielong/SensFinder

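An LLM-based sensitive information detection and classification system that automatically identifies entities such as person names, place names, and company names in text and outputs structured results with confidence scores.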


🔍 SensFinder

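LLM-based sensitive information detection and classification system

🛡️ Pre-masking checks · 📋 Data leak risk assessment · ✅ Compliance checks

🚀 Quick Start • 🏗️ System Architecture • ⚡ Core Features • 📊 Classification Standard • ⚙️ Configuration • ❓ FAQ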

## 🤔 Do you really know what sensitive data is hidden in your text?

Data masking and compliance checks are critical for every organization, but manually reviewing thousands of text fields is nearly impossible:

| Pain point | SensFinder solution |
|:------------|:--------------------|
| ❓ Thousands of records; manual review is impractical | ✅ **Automated detection**: LLM-based batch processing that finishes in minutes |
| ❓ Names, places, and companies are mixed together and hard to classify | ✅ **Smart classification**: 6 precise categories, with support for custom extensions |
| ❓ Switching LLM providers is a hassle | ✅ **Multi-engine**: OpenAI / DeepSeek / local models, one-click switching |
| ❓ Results need manual verification | ✅ **Confidence scoring**: automatically flags low-confidence fields for focused review |
| ❓ Batch processing may miss anomalies | ✅ **Multi-level verification**: rule conflict detection + low-confidence filtering |

### 🔥 Use Cases

## 🚀 Quick Start

### Prerequisites

| Dependency | Version |
|:-----------|:-------:|
| Python | 3.8+ |
| pandas | - |
| openai | - |

### One-Step Installation

```
# Clone the repo
git clone https://github.com/huajielong/SensFinder.git
cd SensFinder

# Create a virtual environment and install dependencies
python -m venv venv

# Windows
venv\Scripts\activate

# macOS / Linux
# source venv/bin/activate

pip install -r requirements.txt
```

### Configure the LLM Service

Create a `.env` file (or edit `config/config.py` directly):

```
# Service selection: OPENAI / DEEPSEEK / LOCAL
LLM_SERVICE=DEEPSEEK

# DeepSeek configuration
DEEPSEEK_API_KEY=sk-your-key
DEEPSEEK_BASE_URL=https://api.deepseek.com

# OpenAI configuration
OPENAI_API_KEY=sk-your-key
OPENAI_BASE_URL=https://api.openai.com/v1

# Local LLM configuration
LOCAL_LLM_URL=http://localhost:8000/v1/chat/completions
```

### Run

```
python script/sens_finder.py
```

**That's it.** The program runs the full pipeline automatically: data preprocessing → LLM classification → result verification → report generation.

## 🏗️ System Architecture

```
┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│                     │    │                     │    │                     │
│  Data Preprocess    │───>│  LLM Classification │───>│  Result Verify      │
│  data_preprocess    │    │  llm_classify       │    │  result_verify      │
│                     │    │                     │    │                     │
│  • Clean invalid    │    │  • Call LLM API     │    │  • Confidence score │
│  • Batch sharding   │    │  • Smart tagging    │    │  • Conflict detect  │
│  • Dedup optimize   │    │  • Multi-thread     │    │  • Problem summary  │
└─────────────────────┘    └─────────────────────┘    └─────────────────────┘
           │                          │                          │
           └──────────────────────────┼──────────────────────────┘
                                      ▼
                           ┌─────────────────────┐
                           │                     │
                           │  Output             │
                           │  • Results CSV      │
                           │  • Problem Report   │
                           │  • Detailed Log     │
                           └─────────────────────┘
```

### Module Responsibilities

| Module | File | Core responsibility |
|:-------|:-----|:--------------------|
| **Main controller** | `script/sens_finder.py` | Orchestrates the whole pipeline (preprocessing → classification → verification) |
| **Data preprocessing** | `script/data_preprocess.py` | Cleaning, deduplication, batch sharding |
| **LLM classification** | `script/llm_classify.py` | Calls the LLM to classify each field |
| **Local LLM client** | `script/local_llm_client.py` | Supports access to private/local models |
| **Result verification** | `script/result_verify.py` | Confidence scoring + conflict detection |
| **Configuration** | `config/config.py` | Centralizes system and LLM configuration |

## ⚡ Core Features

| Feature | Description |
|:--------|:------------|
| 🤖 **Multi-LLM support** | OpenAI GPT, DeepSeek, and local/private models, one-click switching |
| 📦 **Batch processing** | Automatic sharding for large text corpora (1000-2000 rows per batch) |
| 🎯 **Smart classification** | Accurately identifies names, places, companies, organizations, and products |
| ✅ **Confidence scoring** | Scores every result; automatically flags low-confidence items |
| 🔍 **Result verification** | Rule conflict detection + low-confidence filtering |
| ⚡ **Parallel processing** | Multi-threading for high throughput |
| 📊 **Structured output** | CSV format for easy review and integration |

## 📊 Classification Standard

| Category | Examples |
|:---------|:---------|
| 👤 **Person name** | John Smith, Zhang San (张三), Li Si (李四) |
| 🌍 **Place name** | London, California, Beijing |
| 🏢 **Company name** | Apple Inc, Alibaba, Tencent |
| 🏛️ **Organization** | UN, WHO, World Health Organization |
| 🔧 **Product/Technology** | iPhone 15, GPT-4, TensorFlow |
| 📧 **Other PII** | Email addresses, phone numbers, dates/times |

## ⚙️ Configuration

Config file: [`config/config.py`](config/config.py)

### LLM Configuration

```
CURRENT_LLM_SERVICE = "DEEPSEEK"   # OPENAI | DEEPSEEK | LOCAL
LLM_TEMPERATURE = 0.1              # Lower = more stable (0.1-0.3)
```

### Preprocessing

```
BATCH_SIZE = 1000        # Rows per batch
MIN_FIELD_LENGTH = 2     # Minimum field length filter
```

### Verification

```
LOW_CONFIDENCE_THRESHOLD = 80   # Below this → manual review
```

## 📁 Project Structure

```
SensFinder/
├── config/                      # Configuration
│   ├── config.py                # Main config
│   └── prompt_template.txt      # LLM prompt template
├── data/                        # Data directory
│   ├── input_raw/               # Raw input
│   ├── preprocessed_batches/    # Preprocessed batches
│   ├── classify_results/        # Classification results
│   └── verify_problems/         # Verification issues
├── script/                      # Core scripts
│   ├── sens_finder.py           # Main entry point
│   ├── data_preprocess.py       # Data preprocessing
│   ├── llm_classify.py          # LLM classification
│   ├── local_llm_client.py      # Local LLM client
│   └── result_verify.py         # Result verification
├── test/                        # Tests
├── 产品设计PRD.md                # Product PRD (Chinese)
├── 技术实现方案.md                # Tech design doc (Chinese)
├── requirements.txt             # Python dependencies
└── README.md                    # 💡 You are here
```

---

## ❓ FAQ

### Which LLM providers are supported?

OpenAI (GPT-4o-mini/GPT-4o), DeepSeek (deepseek-chat), and any OpenAI-compatible local model (via Ollama/vLLM, etc.).
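
As a rough illustration of how provider switching could work, the sketch below resolves endpoint settings from the environment variables shown in the `.env` example. The function name `resolve_llm_settings` and the returned dictionary layout are assumptions for illustration, not SensFinder's actual loader in `config/config.py`.

```python
import os
from typing import Dict, Mapping, Optional

# Default base URLs, copied from the README's .env example.
DEFAULT_BASE_URLS = {
    "OPENAI": "https://api.openai.com/v1",
    "DEEPSEEK": "https://api.deepseek.com",
}
DEFAULT_LOCAL_URL = "http://localhost:8000/v1/chat/completions"


def resolve_llm_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: pick endpoint settings for the service in LLM_SERVICE."""
    env = os.environ if env is None else env
    # Falling back to DEEPSEEK mirrors the README's example value (an assumption).
    service = env.get("LLM_SERVICE", "DEEPSEEK").upper()

    if service == "LOCAL":
        # A local model is addressed by a full chat-completions URL.
        return {"service": service, "url": env.get("LOCAL_LLM_URL", DEFAULT_LOCAL_URL)}

    if service not in DEFAULT_BASE_URLS:
        raise ValueError(f"Unsupported LLM_SERVICE: {service}")

    api_key = env.get(f"{service}_API_KEY")
    if not api_key:
        # Fail early instead of sending unauthenticated requests.
        raise ValueError(f"{service}_API_KEY is not set")

    base_url = env.get(f"{service}_BASE_URL", DEFAULT_BASE_URLS[service])
    return {"service": service, "api_key": api_key, "base_url": base_url}
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to test; for example, `resolve_llm_settings({"LLM_SERVICE": "OPENAI", "OPENAI_API_KEY": "sk-your-key"})` returns the OpenAI base URL from the table of defaults.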

### Processing is slow. What can I do?

1. Increase `BATCH_SIZE` (watch context limits)
2. Check LLM API response speed
3. Use a faster model (e.g., GPT-4o-mini)
4. Check network stability
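
To make the effect of `BATCH_SIZE` concrete, here is a minimal sketch of cleaning, deduplicating, and sharding fields into batches. It assumes the preprocessing step behaves roughly as the README describes; `make_batches` is a hypothetical name, and the real logic in `script/data_preprocess.py` may differ.

```python
from typing import Iterable, Iterator, List

BATCH_SIZE = 1000       # rows per batch, value from config/config.py
MIN_FIELD_LENGTH = 2    # minimum field length, value from config/config.py


def make_batches(
    fields: Iterable[str],
    batch_size: int = BATCH_SIZE,
    min_length: int = MIN_FIELD_LENGTH,
) -> Iterator[List[str]]:
    """Hypothetical helper: yield cleaned, unique fields in batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    seen = set()
    batch: List[str] = []
    for raw in fields:
        value = raw.strip()
        # Drop fields that are too short or already queued.
        if len(value) < min_length or value in seen:
            continue
        seen.add(value)
        batch.append(value)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller batch.
        yield batch


batches = list(make_batches(["Alice", "Bob", "Alice", "x", " Carol "], batch_size=2))
print(batches)  # [['Alice', 'Bob'], ['Carol']]
```

A larger batch size means fewer batches for the same input, which helps only as long as each batch still fits the model's context window.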

### How can I improve classification accuracy?

1. Lower `LLM_TEMPERATURE` to ~0.1
2. Use a more capable LLM
3. Add more examples to `prompt_template.txt`
4. Raise `LOW_CONFIDENCE_THRESHOLD` to widen the review scope (results scoring below the threshold are flagged for manual review)
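
The sketch below shows how the threshold value changes the number of results flagged for manual review, following the config comment that scores below `LOW_CONFIDENCE_THRESHOLD` go to review. The function name, the dictionary keys, and the sample rows are made up for illustration and are not taken from `script/result_verify.py`.

```python
from typing import Dict, List

LOW_CONFIDENCE_THRESHOLD = 80  # value from config/config.py


def flag_for_review(results: List[Dict], threshold: int = LOW_CONFIDENCE_THRESHOLD) -> List[Dict]:
    """Hypothetical helper: return results whose confidence is below the threshold."""
    return [row for row in results if row["confidence"] < threshold]


# Illustrative rows only; the key names are assumptions.
sample = [
    {"field": "John Smith", "category": "Person name", "confidence": 95},
    {"field": "Apple", "category": "Company name", "confidence": 72},
    {"field": "Jordan", "category": "Place name", "confidence": 85},
]

print(len(flag_for_review(sample)))      # 1 row scores below 80
print(len(flag_for_review(sample, 90)))  # 2 rows score below 90
```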

### How do I use the results?

The output is a CSV file containing each field's content, classification, confidence score, and reasoning. Typical uses: marking fields before masking, compliance audit evidence, and data leak risk assessment reports.
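
One possible way to consume the results for pre-masking is to group detected fields by category. This sketch uses the standard-library `csv` module to stay self-contained (the project itself depends on pandas); the column names `field`, `classification`, `confidence`, and `reasoning` and the sample rows are assumptions, so check them against the header of your actual output file.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Illustrative data; real output lives under data/classify_results/.
SAMPLE_CSV = """field,classification,confidence,reasoning
John Smith,Person name,95,Common English full name
Apple Inc,Company name,90,Corporate suffix Inc
Beijing,Place name,98,Capital city of China
"""


def fields_by_category(csv_text: str) -> Dict[str, List[str]]:
    """Hypothetical helper: map each classification to the fields assigned to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["classification"]].append(row["field"])
    return dict(groups)


for category, fields in fields_by_category(SAMPLE_CSV).items():
    print(category, fields)
```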

### Is my API key safe?

API keys are configured via `.env` or `config.py`. The project includes `.gitignore` to prevent accidental key commits.

---

## 🧪 Development and Extension

### Adding New Classification Types

1. Edit `config/prompt_template.txt`: add the definition and examples
2. Add corresponding validation rules in `result_verify.py`

### Adding a New LLM Service

1. Add a config item in `config/config.py`
2. Add API call logic in `llm_classify.py`

### Testing

```bash
python test/test_sens_finder.py
```

## 📄 License

MIT © [huajielong](https://github.com/huajielong)

⭐ If SensFinder helps you protect your data, please give it a Star!
