andyhuo520/ppocrv6-studio

GitHub: andyhuo520/ppocrv6-studio

A local OCR workbench built on the three PP-OCRv6 model tiers, with CoreML acceleration on Apple Silicon, one-click tier switching, and OmniDocBench evaluation.

Stars: 2 | Forks: 0

# PP-OCRv6 Studio

A local OCR workbench built around **PP-OCRv6**, PaddlePaddle's latest three-tier OCR model family (Tiny / Small / Medium). Run all three tiers on your own machine, switch between them in one click, and benchmark against real-world edge cases and the [OmniDocBench](https://github.com/opendatalab/OmniDocBench) standard evaluation set.

Built and tested on **macOS with Apple Silicon** (M-series). CoreML acceleration is enabled automatically via ONNX Runtime.

## Screenshots

### Upload & Recognize

![Upload interface](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/74815e773b114457.png)

*Drag-and-drop upload zone. Supports batch processing and clipboard paste (⌘V).*

### History Grid

![History grid](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/55bf514a37114504.png)

*39-item history grid. Each card shows the thumbnail, filename, box count, and processing time.*

### OCR Result Detail

![OCR result detail](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/660febb62a114512.png)

*Newspaper page with 193 detection boxes. The right panel shows the full text transcript.*

### Settings

![Settings page](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/e3dbeb99c5114516.png)

*Model tier selector (Tiny / Small / Medium), CoreML toggle, and detection parameter sliders.*

## What's inside

| Component | Description |
|-----------|-------------|
| `webapp/` | FastAPI backend + single-page web UI. Upload images, switch models, export results as CSV / Markdown / Excel. |
| `ppocrv6_browser.html` | Zero-dependency browser demo: runs PP-OCRv6 Tiny entirely in-browser via ONNX Runtime Web, with no server. |
| `bench_local_v2.py` | Run OmniDocBench evaluation against the local API server. |
| `run_apple_vision.py` | Benchmark Apple Vision Framework (macOS built-in) on the same 18-image set. |
| `gen_result_vis.py` | Generate side-by-side detection + recognition visualisation panels for arbitrary images. |
| `assets/realworld_ocr/` | Four real-world test images + PP-OCRv6 Medium output panels. |

## Benchmark results

Evaluated on the OmniDocBench demo set (18 document pages). Metric: `text_block` edit distance; **lower is better**.

| Model | Text block edit distance ↓ | Notes |
|-------|----------------------------|-------|
| PP-OCRv6 Medium (34.5 MB) | **0.425** | Best accuracy; runs locally on Apple Silicon |
| PP-OCRv6 Small (7.7 MB) | 0.443 | Good balance; mobile-class size |
| PP-OCRv6 Tiny (1.5 MB) | 0.446 | Runs in browser via ONNX Runtime Web |
| Apple Vision (built-in) | 0.448 | Zero-setup; 0.16–0.54 s/image |
| PaddleOCR-VL (cloud API) | ~0.38* | Multimodal; cloud latency 6–16 s/image |

*PaddleOCR-VL was evaluated separately. It is optimised for document layout rather than isolated text blocks, so its figure is indicative only.

## Real-world test cases

Four images that push OCR beyond clean document scans: perspective angles, dot-matrix fonts, embossed low-contrast text, and seven-segment LED displays.

### Business card: perspective shot

![Business card OCR result panel](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/e605e57178114523.jpg)

Detection correctly locates all text regions despite the skewed angle and coloured background. Medium reads: **DESIGN SOUL / JONATHAN DOE / Creative Designer / phone / url**.

### Dot-matrix font: fragmented glyphs

![Dot matrix OCR result panel](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/6cba0ed51b114530.jpg)

Both lines are detected and recognised cleanly: **DotMatrix** (title) + **RETRO PRINT CHARM** (subtitle). The Small model performs best here due to its character-set coverage.

### Tire sidewall: low-contrast embossed text

![Tire sidewall OCR result panel](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/2dd22274b7114537.jpg)

Embossed text on curved metal, shot at a ~30° angle. Medium reads: **TREADWEAR 220 / PLACARD IN VEHICLE / TO SEAT BEADS / ME AXLE**. This is the hardest case: Apple Vision reads only "220", and most multimodal models struggle with the low contrast.

### Elevator LED display: seven-segment digits + reflective metal

![Elevator display OCR result panel](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/6791c51358114544.jpg)

All product codes (**BVY413HSW**, **BVY411HSW**), the brand name (**ORB ELEVATOR**), and the URL (**orbelevator.en.alibaba.com**) are correctly recognised across the four panels.

## Requirements

| Requirement | Minimum | Recommended |
|-------------|---------|-------------|
| OS | macOS 13 Ventura | macOS 14+ Sonoma / Sequoia |
| Chip | Apple M1 | Apple M2 / M3 / M4 |
| RAM | 8 GB | 16 GB |
| Python | 3.10 | 3.11 / 3.12 |
| Disk | ~500 MB (models + deps) | n/a |

## Setup

### 1. Clone

```bash
git clone https://github.com/andyhuo520/ppocrv6-studio.git
cd ppocrv6-studio
```

### 2. Create virtual environment

```bash
python3 -m venv .venv
source .venv/bin/activate
```

### 3. Install dependencies

```bash
pip install -r requirements-webapp.txt
```

For Apple Silicon CoreML acceleration (recommended):

```bash
pip uninstall onnxruntime -y
pip install onnxruntime-silicon
```

### 4. Download PP-OCRv6 ONNX models

```bash
bash scripts/download_models.sh all
```

This downloads three tarballs from the GitHub Releases page and extracts them:

```
ppocrv6_onnx/          ← Tiny (official size: 1.5 MB)
ppocrv6_small_onnx/    ← Small (7.7 MB)
ppocrv6_medium_onnx/   ← Medium (34.5 MB)
```

### 5. Start the studio

```bash
python webapp/server.py
```

Open **http://localhost:8765** in your browser.

## How it works

PP-OCRv6 splits OCR into two stages:

1. **Detection**: finds text regions using a DB (Differentiable Binarization) model with an LCNetV4 backbone. The receptive field is expanded from 3×3 to 7×7, which handles small and densely packed text more reliably.
2. **Recognition**: crops each detected region and reads its characters using a CTC model with a lightweight attention module. The three tiers share one backbone design and differ only in width and depth.

ONNX Runtime on Apple Silicon routes both stages through **CoreML**. Measured time is 3–52 s/image, depending on model tier and image resolution.

## License

MIT. See [LICENSE](LICENSE).

PP-OCRv6 model weights are released under the [Apache 2.0 license](https://github.com/PaddlePaddle/PaddleOCR/blob/main/LICENSE) by PaddlePaddle.

*Built as part of a hands-on benchmark series. See the full write-up in [`PP-OCRv6_Khazix.html`](PP-OCRv6_Khazix.html) (Chinese).*
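
To help read the `text_block` edit-distance scores in the benchmark table, here is a minimal sketch of a normalised edit distance between a predicted string and its reference. It assumes the Levenshtein distance is divided by the length of the longer string, so 0.0 is a perfect match and 1.0 means nothing matches. The function names `levenshtein` and `normalized_edit_distance` are hypothetical and not taken from this repository. OmniDocBench's own pipeline also matches predicted blocks to ground-truth blocks and may normalise text differently, so this illustrates the idea rather than reproducing the benchmark.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed normalisation)."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, reference) / longest


if __name__ == "__main__":
    # One wrong character out of 13 (letter O instead of digit 0).
    score = normalized_edit_distance("TREADWEAR 22O", "TREADWEAR 220")
    print(f"{score:.3f}")  # prints 0.077
```

Under this assumption, a score of 0.425 roughly means that a little over 40% of the characters in a text block would need editing to match the ground truth, which is why small differences between tiers in the table should be read with care on an 18-page set.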
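Tags: AV bypass, CoreML, FastAPI, OCR, PP-OCR, artificial intelligence, backend development, machine-learning evaluation, user-mode hook bypass, computer vision, reverse-engineering tools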