glmmedia-ocr
Self-contained command-line tool that converts PDFs and images to structured Markdown using a local GLM-OCR model through Ollama. Manages the full pipeline automatically — Ollama lifecycle, model pulling, PDF rendering, layout detection via PP-DocLayoutV3, and result merging. Dual distribution through npm and pip with 50+ configurable options.
Notes
A production-grade OCR CLI that runs entirely locally with no cloud dependencies.
Pipeline
PDF rendering (pypdfium2) → image normalization → layout detection (PP-DocLayoutV3 via HuggingFace) → OCR via Ollama's generate endpoint → structured Markdown output with page separators.
Highlights
- Zero-config: auto-starts Ollama, pulls the ~2.2GB model if missing, manages virtual environments
- Handles PDFs, PNG, JPEG, WebP, BMP, TIFF, GIF
- Error recovery — failed pages get placeholders, processing continues
- Optional cloud mode via Zhipu Cloud MaaS backend
- Clean shutdown with signal traps