nemmusu/report-anonymizer

GitHub: nemmusu/report-anonymizer

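A local-LLM anonymization tool for penetration-test reports. It runs offline, automatically detects and replaces sensitive information such as customer brands, network credentials, and cloud resource IDs, and leaves the report's technical content intact.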

Stars: 1 | Forks: 0

# 🛡️ Report Anonymizer

**Local-LLM anonymizer for penetration-test reports.**

Import PDF, Office documents, Markdown, or code directly. The pipeline rewrites customer brands, real IPs, phone numbers, hard-coded credentials, advisory IDs, proprietary HTTP headers, app package names and bundle identifiers, AD SIDs, and cloud resource IDs. Exploit code, payloads, and shell output are left untouched. It runs locally on an ordinary laptop; no GPU is required.

[![CI](https://static.pigsec.cn/wp-content/uploads/repos/2026/05/50faee2e36201519.svg)](https://github.com/nemmusu/report-anonymizer/actions/workflows/ci.yml) [![Docs](https://img.shields.io/badge/docs-mkdocs--material-526CFE?logo=materialformkdocs&logoColor=white)](https://nemmusu.github.io/report-anonymizer/) [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue?logo=python&logoColor=white)](pyproject.toml) [![PySide6 6.11](https://img.shields.io/badge/UI-PySide6%206.11-41CD52?logo=qt&logoColor=white)](https://doc.qt.io/qtforpython/) [![llama.cpp](https://img.shields.io/badge/backend-llama.cpp-orange)](https://github.com/ggerganov/llama.cpp) [![Tests](https://img.shields.io/badge/tests-284%20passing-brightgreen)](tests/) [![License GPL-3.0](https://img.shields.io/badge/license-GPL--3.0-blue?logo=gnu&logoColor=white)](LICENSE) [![Privacy: local-only](https://img.shields.io/badge/privacy-local--only-success?logo=protondrive&logoColor=white)](#-privacy) [![Stars](https://img.shields.io/github/stars/nemmusu/report-anonymizer?style=social)](https://github.com/nemmusu/report-anonymizer/stargazers)

[Features](#-features) · [What it anonymizes](#-what-it-anonymizes) · [Quick start](#-quick-start) · [Benchmarks](BENCHMARKS.md) · [Docs site](https://nemmusu.github.io/report-anonymizer/) · [Roadmap](#-roadmap)

Side-by-side diff: original PDF on the left with NimbusBoard, the engagement code 2026-Q1-NimbusBoard-Web and author Lorenzo De Falco; the anonymized PDF on the right with VendorBoard, 2026-Q1-VendorBoard-Web and Marco Rossi. Layout, fonts and structure are preserved.

Side-by-side native render. Every customer-identifying value is rewritten with a plausible dummy, while layout and fonts are preserved.

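## ✨ Features

| Feature | What it does |
|---|---|
| 🧠 **Multi-tier LLM detection** | Tier-0 is deterministic regex, Tier-1 is the LLM detector, plus a critic and self-consistency voting. The vast majority of leaks are auto-promoted; only ambiguous items reach human Review. |
| 🖼️ **Image redaction** | New: every image embedded in a PDF / DOCX / PPTX gets a thumbnail in the Review tab. Open the editor and draw **blackout / blur / pixelate / text overlay** rectangles, with a colour picker. Live bake renders the actual pixels as you draw (not a translucent overlay). Same image_id across pages = one decision, applied everywhere; the output keeps the same xref / shape position, so the layout stays byte-faithful. |
| 🔍 **True final-state preview** | Three Review tabs: **Text candidates** (with live preview), **Images** (gallery + editor), and **Preview of build** (a PDF.js viewer with native text selection, Ctrl+C, and search). Highlights are baked in as PDF annotations, so the text beneath them stays selectable. |
| 📑 **Right-click "Add to substitution map"** | Drag-select text in the PDF viewer (or just right-click a word) and add it to the map in one click, together with a configurable placeholder. A toast confirms the action; the preview re-renders within a few seconds and highlights the new mapping. |
| 🧾 **Format-aware adapters** | Native support for `.pdf` (in-place redaction or regeneration), `.docx`, `.doc`, `.pptx`, `.odt`, `.rtf`, `.xlsx`, `.html`, `.md`, and 60+ text and code extensions. PDFs keep their original layout, fonts, and byte length. PDF / DOCX / PPTX also support in-place image redaction (same xref / shape position). |
| 🪄 **One-click run** | A single click orchestrates the whole pipeline (scan, detect, critic, classify). By default the run pauses at Review so the operator sees every candidate before Apply. After Apply, residual leaks are auto-resolved. |
| 📝 **Inline review** | Edit, approve, skip, or un-approve placeholders in a single tree (already-mapped, auto-promoted, and pending items), or enter a custom term to anonymize. Every change is persisted to disk immediately. |
| 🔁 **Crash-safe and resumable** | Atomic writes (`*.tmp` plus `os.replace`), per-stage checkpoints, and a full `run_manifest.json`. Reopen the project to resume. |
| ⚙️ **Server presets + auto-start** | Six curated llama.cpp profiles (BF16, Q4_K_M, Q5_K_M, 12K to 16K context), with VRAM footprint and runtime measured on a real corpus. Hot-editable from the GUI and saved at user or project scope. The sidebar shows connection state with a green/yellow/red dot; one click triggers a pre-flight check and starts the server, with a specific error dialog when something goes wrong. Optional **auto-start on launch** toggle. |
| 🤗 **Hugging Face integration** | Curated GGUF catalog, free-text search, resumable streaming downloads with progress and ETA, and a gated-repo access flow. |
| 🛑 **Stop any time** | Every long-running stage can be cooperatively cancelled with the global Stop button. |
| 🔒 **Privacy-first design** | No telemetry. The only network endpoint ever contacted is `huggingface.co`, and only when you explicitly download a model. |

## 🎯 What it anonymizes

The detector assigns each leak one of 12 categories. Every leak is replaced with a plausible dummy value; it is never blacked out and never replaced with `XXX`, `[REDACTED]`, or asterisks. Placeholders also preserve the original length and shape, so in-place PDF redaction needs no reflow. The full taxonomy, with examples and counter-examples, is in [docs/anonymization-scope.md](docs/anonymization-scope.md).

| Category | What it covers | Placeholder example |
|---|---|---|
| `brand` | Customer and product names, suites, vendors | `VendorApp`, `VendorVoice` |
| `network` | Real public IPs, customer hostnames and domains | `203.0.113.NN`, `vendor.example` |
| `phones` | E.164 numbers, any country | `+1 (415) 555-0001` |
| `emails` | Addresses on customer domains, or mailboxes of real people involved | `user01@vendor.example` |
| `credentials` | Plaintext usernames and passwords, `Authorization: Basic …`, session cookies, NTLM hash dumps | `u.demo` / `Aaaaaaa00!` |
| `keys` | Hard-coded tokens, hex hashes, JWTs, SAML or OAuth Bearer tokens, Apple Team IDs, APNs device tokens | Length-preserving hex string that keeps the first 8 characters |
| `headers` | Proprietary HTTP headers (`X-AcmeBank-Auth`, `X-ContosoServer-Token`) | `X-Vendor-Auth` |
| `app_packages` | Reverse-domain Android package names and iOS bundle IDs that carry the customer name (`com.acme.app`, `it.acmebank.mobile`) | `com.vendor.app` |
| `user_agents` | Client UA strings tied to the customer app, including the iOS CFNetwork form | `VendorApp/1.0-android`, `VendorApp/1.0 CFNetwork/0000.0 Darwin/0.0.0` |
| `ids` | Internal tracking and advisory IDs whose prefix contains the brand name (`ACME-VULN-12`) | `VENDOR-VULN-12` |
| `infra_ids` | Cloud, Active Directory, and infrastructure resource identifiers (AWS ARNs and account IDs, EC2 / EBS / S3 IDs, Azure UUIDs, GCP project IDs, AD SIDs and ObjectGUIDs, `DC=…` distinguished-name fragments that contain the brand) | `arn:aws:iam::000000000001:role/vendor-1`, `i-0a1b2c3d000000001`, `S-1-5-21-0000000001` |
| `other` | Proprietary URI schemes and deep links (`acme-app://…`) and other vendor-bound tokens | `vapp://x.vendor.example/foo` |

The pipeline never touches the report's technical content: exploit code, payloads, shell output, RFC ranges, well-known libraries (NaCl, OAuth, JWT, and so on), generic OS or SDK versions, dates, generic file names, code identifiers, and variable names. The [deep dive](docs/anonymization-scope.md) lists the complete exclusion set.

## 🎬 Demo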

- Pipeline. Visual stepper, live progress, log streaming for every stage. The run pauses at Approve and promote, then resumes through Apply, Build, Verify and Auto-resolve in one click.
- Review. Already-mapped, auto and pending rows in a single tree. The right pane is a live render of the anonymized output, so you see exactly what Apply will write.
- Server. Preset gallery with quality score, disk and VRAM fit per card, command preview, one-click start and stop, in-app Model Manager.
- Model Manager. Curated GGUF catalog with quality, VRAM and time-on-bench badges, recommended-file highlight, resumable streaming downloads with a Queue tab.
- Review » Images. Per-image editor for embedded screenshots: blackout, blur, pixelate, text overlay with colour picker. Live bake renders the actual pixels as you draw. Same image_id across pages = single decision applied everywhere.
- Preview of build. PDF.js viewer with native drag-select text + Ctrl+C. Right-click on the highlight to add to substitution map in one click. Final confirmation gate before Apply runs.

More screenshots: wizard, import, preset editor, settings

- First-run wizard. Hardware detection plus preset recommendation, with an optional one-click llama.cpp image pull.
- Model download. Resumable streaming with live speed and ETA. Skippable for offline-first installs.
- Import. Drag and drop or pick files, choose PDF strategy and export template up front.
- Preset editor. Every llama-server knob is exposed, with Save in project or Save as user-scope.
- Deployment chooser. Local binary, GUI-managed Docker, or attach to a server you already run.
- Settings. Pipeline thresholds, self-consistency, residual auto-resolve, optional final LLM audit pass.
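
The `keys` row of the anonymization-scope table describes placeholders as length-preserving hex strings that keep the first 8 characters. The snippet below is a minimal sketch of that idea, not the project's actual implementation: the function name `hex_placeholder`, the `index` counter, and the zero-padded filler are assumptions made for illustration.

```python
def hex_placeholder(original: str, index: int, keep: int = 8) -> str:
    """Build a dummy hex string with the same length as `original`.

    Hypothetical helper: the first `keep` characters are copied and the
    rest is a zero-padded hexadecimal counter, so each distinct secret
    can be given a distinct placeholder.
    """
    # Assumption: a token no longer than `keep` would otherwise be copied
    # unchanged, so in that case nothing is kept and all of it is filler.
    head = original[:keep] if len(original) > keep else ""
    tail_len = len(original) - len(head)
    # Zero-pad the counter; keep the rightmost digits if it overflows.
    tail = format(index, "x").zfill(tail_len)[-tail_len:] if tail_len else ""
    return head + tail


token = "deadbeefcafebabe0123456789abcdef"
placeholder = hex_placeholder(token, index=1)
print(placeholder)                     # deadbeef000000000000000000000001
print(len(placeholder) == len(token))  # True
```

Because the placeholder has the same length as the original, it can be written over the original text without shifting the surrounding layout, which is the property the scope section relies on for in-place PDF redaction.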

## 🚀 Quick start

There are four ways to install, ordered from least to most friction.

### Option A. AppImage (one-click, recommended for desktop)

```
# Download once, run forever. No system Python, pip, venv, or build tools required.
chmod +x Report-Anonymizer-x86_64.AppImage
./Report-Anonymizer-x86_64.AppImage
```

The AppImage bundles a portable Python interpreter, all Python dependencies (PySide6, WeasyPrint, and so on), `pandoc`, and `pdftotext`. The only system libraries it needs are Pango and Cairo, which ship with every modern desktop Linux. The latest build is attached to the [GitHub release](https://github.com/nemmusu/report-anonymizer/releases/latest).

### Option B. `.deb` (Debian, Ubuntu, Mint)

```
# Download from the latest GitHub release, then install with apt.
sudo apt install ./report-anonymizer__amd64.deb
report-anonymizer
```

The package installs to `/opt/report-anonymizer/` and creates a launcher at `/usr/bin/report-anonymizer`. It declares explicit `Depends:` on the system packages `pandoc`, `poppler-utils`, `libpango-1.0-0`, `libcairo2`, and `python3-venv`. The `postinst` hook builds a Python venv for the install and pulls the runtime dependencies from PyPI with pip (the package itself takes only about 230 KB on disk). Uninstall with `sudo apt remove report-anonymizer`.

### Option C. One-line install script (any distro)

```
curl -fsSL https://raw.githubusercontent.com/nemmusu/report-anonymizer/master/install.sh | bash
```

This sets up a per-user install under `~/.local/share/report-anonymizer` with a launcher at `~/.local/bin/report-anonymizer`. The script detects missing system tools (pandoc, poppler-utils, Pango) and offers to install them via `apt-get`, `dnf`, `pacman`, `zypper`, or `brew`. Uninstall with `report-anonymizer uninstall [--all]`.

### Option D. Build from source (developers)

```
git clone https://github.com/nemmusu/report-anonymizer
cd report-anonymizer
python3.12 -m venv .venv && . .venv/bin/activate
pip install -r requirements.txt
# Build llama.cpp (or point the preset at your existing build)
# https://github.com/ggerganov/llama.cpp#build
python -m gui.main
```

On first launch, a wizard walks you through hardware detection and preset selection. The recommended preset is highlighted automatically based on the detected GPU memory: `ministral-3-8b-reasoning-bf16` at 19 GB and above, `ministral-3-8b-reasoning-q5` at 10 GB and above, otherwise the built-in `default` profile (Jackrong Qwen3.5 4B Claude-Opus distill Q4_K_M, no GPU required, about 2.5 GB to download, quality score 78/100).

### Build the .deb or AppImage yourself

```
./packaging/build-all.sh                    # both
./packaging/build-all.sh --only deb         # just the .deb
./packaging/build-all.sh --only appimage    # just the AppImage
./packaging/build-all.sh --version 0.2.8    # bump deb version
./packaging/build-all.sh --clean            # wipe build caches first
```

Output lands in `packaging/deb/dist/` and `packaging/appimage/dist/`. The inner build scripts ([packaging/deb/build.sh](packaging/deb/build.sh), [packaging/appimage/build.sh](packaging/appimage/build.sh)) can also be run on their own.

### Command-line interface (CLI)

```
# Full pipeline on a single folder
python bin/anonymize-dossier all -o
# Individual stages
python bin/anonymize-dossier scan
python bin/anonymize-dossier promote
python bin/anonymize-dossier apply -o
python bin/anonymize-dossier build
python bin/anonymize-dossier verify
# One-off helpers
python bin/anonymize-dossier selftest     # probe pandoc, WeasyPrint, etc.
python bin/anonymize-dossier migrate-map  # upgrade legacy substitution_map.yml
python bin/anonymize-dossier export-pdf -t pentest_modern
# Server lifecycle
python bin/anonymize-dossier server {start,stop,status}
```

## 🧪 Benchmarks

The test set consists of 5 pentest PDFs with 44 hand-curated ground-truth values. Quality is computed as `F1 × 100`. The full leaderboard (34 models) is in [BENCHMARKS.md](BENCHMARKS.md).

### Curated presets (top 5)

| # | Profile | **Quality** | Precision | Recall | Disk | VRAM | Total time |
|---|---|---|---|---|---|---|---|
| 🥇 | `ministral-3-8b-reasoning-bf16` | **83** | 75.5 % | **90.9 %** | 16.0 GB | ~18.9 GB | 244 s |
| 🥈 | `rtila-qwen3.5-9b-q4` | 82 | 74.1 % | **90.9 %** | 5.2 GB | **~7.1 GB** | **79 s** |
| 🥉 ★ | `jackrong-qwen3.5-4b-distill-q4` | 78 | **79.1 %** | 77.3 % | **2.5 GB** | **~4.8 GB** | 185 s |
| 4 | `qwen3.5-9b-bf16` | 78 | **80.5 %** | 75.0 % | 18.4 GB | ~18.0 GB | 210 s |
| 5 | `ministral-3-8b-reasoning-q5` | 76 | 65.6 % | **90.9 %** | 5.8 GB | ~9.2 GB | 112 s |

**`jackrong-qwen3.5-4b-distill-q4`** is the **recommended starting point**: it has the smallest disk footprint (2.5 GB), fits a 6 GB GPU, and its precision (79.1 %) ranks second in the curated set (narrowly behind `qwen3.5-9b-bf16` at 80.5 %, which needs nearly 4× the VRAM). The built-in `default` preset uses the same weights but is configured CPU-only (`n_gpu_layers: 0`), so a single model covers both "I have a modest GPU" and "I have no GPU at all".

### How to choose

| Hardware or goal | Pick |
|---|---|
| 18 GB VRAM and above, highest quality | `ministral-3-8b-reasoning-bf16` |
| 18 GB VRAM and above, fewest false positives | `qwen3.5-9b-bf16` |
| About 7 GB VRAM, near-best quality | `rtila-qwen3.5-9b-q4` |
| About 6 GB VRAM, smallest "good" model | `jackrong-qwen3.5-4b-distill-q4` ★ recommended |
| About 10 GB VRAM, with reasoning quality | `ministral-3-8b-reasoning-q5` |
| No GPU | `default` (Jackrong 4B distill Q4_K_M on CPU) |

The other 27 benchmarked models sit outside the curated list, in [BENCHMARKS.md](BENCHMARKS.md). All of them can be found through the Model Manager's free-text search and carry the corresponding badges (low quality, incompatible).

## 🏗️ Architecture

```
flowchart LR
    scan["Scan & detect<br/>Tier-0 regex + Tier-1 LLM<br/>+ image inventory"]
    approve["Approve & promote<br/>Text review queue"]
    images["Review » Images<br/>4 tools, live bake"]
    bpreview["Preview of build<br/>PDF.js, selectable text"]
    apply["Apply<br/>text in-place + image bytes,<br/>atomic *.tmp + os.replace"]
    build["Build<br/>pandoc / WeasyPrint"]
    verify["Verify<br/>residual-leak sweep<br/>+ image inventory check"]
    auto["Auto-resolve<br/>re-apply on residuals"]
    smap[("substitution_map.yml")]
    iredact[("image_redactions.yml")]
    applied[("applied_substitutions.json")]
    llm[["llama-server (local)<br/>chunked HTTP, ~5 800 tok/req,<br/>cache_prompt = true"]]

    scan --> approve --> images --> bpreview --> apply --> build --> verify --> auto
    auto -. residuals .-> apply
    approve --> smap
    images --> iredact
    apply --> applied
    smap -.read.-> apply
    iredact -.read.-> apply
    scan -- detector / critic --> llm
```

- **Detector** chunks each segment with a Markdown-aware splitter and never breaks tables, code blocks, or headings.
- **Critic** re-checks every candidate with an independent LLM run. Rejections that carry `placeholder_safe: no` go to human review.
- **Applier** writes every output atomically via `*.tmp` plus `os.replace`.
- **Verifier** sweeps the output for residual leaks (NFKC normalization, entity decoding, zero-width character stripping).

The full data flow and on-disk schema are in [docs/architecture.md](docs/architecture.md).

## 📦 Repository layout

```
anonymize/                    engine package
  format_adapters/            docx / xlsx / pptx / odt / rtf / pdf / text
  pipeline.py                 high-level stage orchestration
  scanner.py                  file inventory (symlink-safe, gitignore-aware)
  detector.py / critic.py     LLM detection and critic (parallel)
  structure_chunker.py        Markdown-aware splitter
  applier.py                  deterministic substitution + atomic writes
  builder.py                  pandoc + WeasyPrint renderer
  verifier.py                 post-build residual-leak sweep
  image_inventory.py          per-format embedded-image catalog (PDF/DOCX/PPTX)
  image_redactor.py           PIL pixel-ops (blackout / blur / pixelate / text overlay)
  server_profile.py           llama-server profile schema
  server_manager.py           process supervisor with diagnose()
  hf_models.py                HF search / curated catalog / downloads
  app_settings.py             small key-value store (auto-start toggle, etc.)
  budget.py                   token-budget pre-flight check
  hardware.py                 GPU / CPU / RAM / disk + VRAM estimate
gui/                          PySide6 application
  app.py / main.py            MainWindow, splash, sidebar
  pipeline_view.py            stepper + progress card + residuals
  review_view.py              3-tab Review (Text candidates / Images / Preview of build)
  image_review_panel.py       embedded image gallery + thumbnail strip
  image_editor.py             per-image canvas editor with live bake + colour pickers
  build_preview_panel.py      final-state preview tab with native text selection
  diff_view.py                side-by-side rendered diff (synced scroll)
  server_panel.py             preset gallery + server controls + auto-start toggle
  preset_editor.py            preset editor with command preview
  model_manager_dialog.py     library + curated downloads + HF search
config/
  server_profiles.yml         built-in presets (6 profiles, 5 by quality + default)
  leak_patterns.yml           Tier-0 regex rules
  safe_terms.yml              whitelist
  substitution_map.yml        example empty schema (project-scope maps live elsewhere)
  prompts/                    system + user prompts (Jinja templates)
bin/anonymize-dossier         CLI entry point
tests/                        284 pytest tests, all passing
assets/                       app icon + splash + hero (SVG)
docs/
  index.md                    MkDocs landing page
  anonymization-scope.md      the 12 leak categories with examples
  architecture.md             full data flow and on-disk schema
  presets.md                  preset catalog and how to choose
  faq.md                      common questions
  contributing.md             development setup and PR guidelines
  screenshots/                PNGs referenced from this README
BENCHMARKS.md                 leaderboard and methodology
mkdocs.yml                    MkDocs site config (used by the Pages workflow)
```

## 🔒 Privacy

- **No telemetry.** The application never sends usage data anywhere.
- **No cloud LLM.** All inference runs on your local llama.cpp server.
- **The only optional network endpoint:** `huggingface.co`, contacted only when you explicitly download a model. It can be disabled in Settings (offline mode).
- **HF token** (if you set one) is stored in `~/.config/document-anonymizer/hf.token` with mode `0600`.
- **Substitution maps stay in your project folder.** They never leave the machine.

## 🧰 Self-check

```
make selftest
# or
python bin/anonymize-dossier selftest
```

Checks for `pandoc`, `WeasyPrint` (the Python package plus the Pango and Cairo system libraries), `poppler-utils`, `tesseract`, `ocrmypdf`, `qdf`, and a llama.cpp build, and prints install hints for anything that is missing. The check is CLI-only by design (the GUI does not need a self-check entry point).

## 🧪 Tests

```
make test                                   # full pytest suite (284 tests)
pytest -k chunker                           # one module
QT_QPA_PLATFORM=offscreen pytest tests/test_gui_smoke.py
```

The suite covers the Tier-0 rules, Tier-1 LLM mocks, format adapters (round-trip and consistency checks), pipeline stages, GUI worker threads, and the verifier's normalization layer.

## 🗺️ Roadmap

This project has no fixed roadmap. For the everyday workflow of "anonymize a batch of PDFs locally", it is functionally stable: scan, review, apply, build, verify, plus the Model Manager and the benchmark harness. Future changes will be driven by real usage (bug reports, misses on real corpora, performance bottlenecks exposed by large documents). If you have a feature request, open an issue with the `enhancement` label and describe the document or workflow you want supported. Concrete real-world use cases take priority over speculative features.

## 📜 License

[GNU GPL v3.0](LICENSE) © Report Anonymizer contributors. This is a copyleft license: you are free to use, study, modify, and redistribute Report Anonymizer. Derivative works (forks, redistributions, binaries built from modified copies) must also be released under the GPL, together with the corresponding source. If you want to integrate the engine into a product that cannot meet these terms, open an issue first to discuss it.
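
The Architecture notes say the applier writes each output atomically via `*.tmp` plus `os.replace`, and that the verifier normalizes text (NFKC, entity decoding, zero-width stripping) before sweeping for residual leaks. The following is a minimal sketch of both techniques using only the Python standard library. The function names, the order of the normalization steps, and the exact set of zero-width code points are assumptions for illustration and may differ from what `applier.py` and `verifier.py` actually do.

```python
import html
import os
import re
import unicodedata
from pathlib import Path

# Assumed set of invisible characters: zero-width space, non-joiner,
# joiner, word joiner, and the byte-order mark.
_ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def normalize_for_sweep(text: str) -> str:
    """Canonicalize text so an obfuscated leak compares equal to the original."""
    text = html.unescape(text)                  # "&amp;" -> "&", "&#8203;" -> U+200B
    text = unicodedata.normalize("NFKC", text)  # fullwidth letters, ligatures, etc.
    return _ZERO_WIDTH.sub("", text)            # drop invisible separators


def atomic_write_text(path: Path, data: str) -> None:
    """Write `data` to a sibling *.tmp file, then swap it into place."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes reach the disk first
    # os.replace overwrites the destination in one step, so a reader sees
    # either the old file or the new one, never a partial write.
    os.replace(tmp, path)


print(normalize_for_sweep("Nimbus&#8203;Board"))  # NimbusBoard
```

Writing to a temporary file in the same directory keeps the final rename on one filesystem, which is what allows `os.replace` to be atomic; a crash before the rename leaves the previous output untouched.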

Built for security teams that cannot send customer data to a cloud LLM.
