firecrawl/pdf-inspector

GitHub: firecrawl/pdf-inspector

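A fast Rust library for PDF classification and text extraction. It distinguishes scanned from text-based content to avoid unnecessary OCR, improving the speed and cost efficiency of large-scale document processing.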

Stars: 1634 | Forks: 149

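# pdf-inspector

A fast Rust library for PDF classification and text extraction. It detects whether a PDF is text-based or scanned, extracts text with position information, and converts it to clean Markdown, all without OCR. [Python](docs/python.md) and [Node.js](napi/README.md) bindings are available.

Built by [Firecrawl](https://firecrawl.dev) to process text-based PDFs locally in under 200 ms, skipping expensive OCR for the roughly 54% of PDFs that do not need it.

## Features

- **Smart classification**: detects TextBased, Scanned, ImageBased, or Mixed PDFs in about 10-50 ms by sampling content streams. Returns a confidence score (0.0-1.0) and per-page OCR routing suggestions.
- **Text extraction**: position-aware extraction with font information and X/Y coordinates, with automatic multi-column reading order.
- **Markdown conversion**: headings (H1-H4) from font size ratios, bulleted/numbered/lettered lists, code blocks via monospace font detection, rectangle-based and heuristic tables, bold/italic formatting, URL links, and page breaks.
- **Table detection**: dual mode, with rectangle detection from PDF drawing operations and heuristic detection from text alignment. Handles financial tables, footnotes, and tables that continue across pages.
- **CID font support**: ToUnicode CMap decoding for Type0/Identity-H fonts, with UTF-16BE, UTF-8, and Latin-1 encodings.
- **Multi-column layout**: automatic detection of newspaper-style columns, with sequential reading order and RTL text support.
- **Encoding issue detection**: automatically flags broken font encodings so the caller can fall back to OCR.
- **Single document load**: the document is parsed once and shared between the detection and extraction stages, avoiding repeated I/O.
- **Lightweight**: pure Rust, no ML models, no external services. PDF parsing depends only on `lopdf`.

## Benchmarks

Evaluated on the [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench) dataset (200 PDFs). Only direct text extraction engines are shown; none of them use OCR or ML models. Scores range from 0 to 1, and higher is better.

| Engine | Overall | Reading Order (NID) | Tables (TEDS) | Headings (MHS) | Speed (200 docs) |
|---|---|---|---|---|---|
| pdf-inspector | 0.78 | 0.87 | 0.59 | 0.57 | 4s |
| opendataloader | 0.84 | 0.91 | 0.49 | 0.74 | 11s |
| pymupdf4llm | 0.73 | 0.89 | 0.40 | 0.41 | 18s |
| markitdown | 0.58 | 0.88 | 0.00 | 0.00 | 8s |

For comparison, engines that use OCR/ML (docling, marker, mineru) score 0.83-0.88 overall but take 2 to 180 minutes to process the same dataset.

**Where we are strong:** speed (fastest of all engines), reading order, and table detection compared with other direct text tools.

**Where we can improve:** heading detection trails opendataloader, because many PDFs use bold text at body font size as headings, or headings only slightly larger than the body text. Table detection trails OCR-based engines, because they can see the visual table structure directly.

## Quick start

### Python

```bash
pip install maturin
maturin develop --release
```

```python
import pdf_inspector

result = pdf_inspector.process_pdf("document.pdf")
print(result.pdf_type)  # "text_based", "scanned", "image_based", "mixed"
print(result.markdown)  # Markdown string or None
```

### Node.js

```bash
npm install firecrawl-pdf-inspector
```

```javascript
import { readFileSync } from 'fs';
import { processPdf, classifyPdf } from 'firecrawl-pdf-inspector';

const result = processPdf(readFileSync('document.pdf'));
console.log(result.pdfType);   // "TextBased", "Scanned", "ImageBased", "Mixed"
console.log(result.markdown);  // Markdown string or null
```

### Rust

```toml
[dependencies]
pdf-inspector = { git = "https://github.com/firecrawl/pdf-inspector" }
```

```rust
use pdf_inspector::process_pdf;

let result = process_pdf("document.pdf")?;
println!("Type: {:?}", result.pdf_type);
if let Some(markdown) = &result.markdown {
    println!("{}", markdown);
}
```

### CLI

```bash
# Convert a PDF to Markdown
cargo run --bin pdf2md -- document.pdf

# JSON output (for piping)
cargo run --bin pdf2md -- document.pdf --json

# Raw Markdown only (no headers)
cargo run --bin pdf2md -- document.pdf --raw

# Insert page break markers
cargo run --bin pdf2md -- document.pdf --pages

# Process specific pages only
cargo run --bin pdf2md -- document.pdf --select-pages 1,3,5-10

# Detection only (no extraction)
cargo run --bin detect-pdf -- document.pdf
cargo run --bin detect-pdf -- document.pdf --json

# Detection + layout analysis (tables, columns)
cargo run --bin detect-pdf -- document.pdf --analyze --json
```

## Architecture

```
PDF bytes
    │
    ├─► detector        → PdfType (TextBased / Scanned / ImageBased / Mixed)
    │
    └─► extractor
         ├─ fonts           → font widths, encodings
         ├─ content_stream  → walk PDF operators → TextItems + PdfRects
         ├─ xobjects        → Form XObject text, image placeholders
         ├─ links           → hyperlinks, AcroForm fields
         └─ layout          → column detection → line grouping → reading order
              │
              ├─► tables
              │    ├─ detect_rects     → rectangle-based tables (union-find)
              │    ├─ detect_heuristic → alignment-based tables
              │    ├─ grid             → column/row assignment → cells
              │    └─ format           → cells → Markdown table
              │
              └─► markdown
                   ├─ analysis     → font stats, heading tiers
                   ├─ preprocess   → merge headings, drop caps
                   ├─ convert      → line loop + table/image insertion
                   ├─ classify     → captions, lists, code
                   └─ postprocess  → cleanup → final Markdown
```

The document is **loaded once** via `load_document_from_path` / `load_document_from_mem` and shared between the detection and extraction stages, so it is never parsed twice.

### Project structure

```
src/
  lib.rs           - Public API, PdfOptions builder, convenience functions
  python.rs        - PyO3 Python bindings
  types.rs         - Shared types: TextItem, TextLine, PdfRect, ItemType
  text_utils.rs    - Character/text helpers (CJK, RTL, ligatures, bold/italic)
  process_mode.rs  - ProcessMode enum (DetectOnly, Analyze, Full)
  detector.rs      - Fast PDF type detection without full document load
  glyph_names.rs   - Adobe Glyph List → Unicode mapping
  tounicode.rs     - ToUnicode CMap parsing for CID-encoded text
  extractor/       - Text extraction pipeline
  tables/          - Table detection and formatting
  markdown/        - Markdown conversion and structure detection
  bin/             - CLI tools (pdf2md, detect_pdf)
napi/              - Node.js/Bun bindings (napi-rs)
```

## How classification works

1. Parse the xref table and page tree (without loading full objects)
2. Select pages according to the `ScanStrategy` (default: scan all pages, with early exit)
3. Look for `Tj`/`TJ` (text operators) and `Do` (the XObject paint operator, used for images) in the content streams
4. Classify based on the presence of text operators in the sampled pages

This detects PDFs of 300+ pages in milliseconds. The result includes `pages_needing_ocr`, a list of the specific page numbers that lack text, which enables per-page OCR routing instead of an all-or-nothing decision.

### Scan strategies

| Strategy | Behavior | Best for |
|---|---|---|
| `EarlyExit` (default) | Scans pages in order and stops at the first non-text page | Routing text-based PDFs to fast extraction |
| `Full` | Scans all pages, no early exit | Accurately distinguishing Mixed from Scanned |
| `Sample(n)` | Samples `n` evenly distributed pages (first, last, middle) | Very large PDFs where speed matters more |
| `Pages(vec)` | Scans only the given 1-indexed page numbers | The caller already knows which pages to check |

## Markdown output

The converter handles:

| Element | How it is detected |
|---|---|
| Headings (H1-H4) | Font size tiers relative to body text, with 0.5pt clustering |
| Bold/italic | Font name patterns (Bold, Italic, Oblique) |
| Bulleted lists | `*`, `-`, `•`, `○`, `●`, `◦` prefixes |
| Numbered lists | `1.`, `1)`, `(1)` patterns |
| Lettered lists | `a.`, `a)`, `(a)` patterns |
| Code blocks | Monospace fonts (Courier, Consolas, Monaco, Menlo, Fira Code, JetBrains Mono) plus keyword detection |
| Tables | Rectangle detection from PDF drawing operations + heuristic detection from text alignment |
| Financial tables | Token splitting for consolidated numeric values |
| Captions | "Figure", "Table", "Source:" prefix detection |
| Superscript/subscript | Font size and Y offset relative to the baseline |
| URLs | Converted to Markdown links |
| Hyphenation | Words broken across lines are rejoined |
| Page numbers | Filtered from the output |
| Drop caps | Large initial letter merged with the following text |
| Dot leaders | TOC-style dot runs collapsed to " ... " |

## Use case: smart PDF routing

pdf-inspector is designed for large-scale PDF processing pipelines. Instead of sending every PDF to OCR:

```
PDF arrives
  → pdf-inspector classifies it (~20ms)
  → TextBased + high confidence?
      YES → extract locally (~150ms), done
      NO  → send to OCR service (2-10s)
```

This saves cost and latency for the majority of PDFs that are already text-based (reports, papers, invoices, legal documents).

## Debugging

See [docs/debugging.md](docs/debugging.md) for how to use the `RUST_LOG` environment variable.

## License

MIT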
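
The routing decision in the smart PDF routing use case can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the library's own code. The `Classification` dataclass is a hypothetical stand-in for a classification result, the `confidence` field name and the 0.9 threshold are assumptions (the README only says "high confidence"), and the `pdf_type` strings follow the values shown in the Python quick start.

```python
from dataclasses import dataclass

# Assumed cutoff: the README says "high confidence" without giving a number.
CONFIDENCE_THRESHOLD = 0.9


@dataclass
class Classification:
    """Hypothetical stand-in for a classification result."""
    pdf_type: str      # "text_based", "scanned", "image_based", or "mixed"
    confidence: float  # 0.0-1.0


def route(result: Classification, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return "local" for confident text-based PDFs, otherwise "ocr"."""
    if result.pdf_type == "text_based" and result.confidence >= threshold:
        return "local"
    return "ocr"


print(route(Classification("text_based", 0.97)))  # local
print(route(Classification("text_based", 0.60)))  # ocr
print(route(Classification("scanned", 0.99)))     # ocr
```

Keeping the threshold as a parameter lets a pipeline tune how aggressively it trusts local extraction. A finer-grained variant could route individual pages using `pages_needing_ocr` rather than the whole document.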
