Projects

glmmedia-ocr

CLI · Tool · Development · 2026

A self-contained command-line tool that converts PDFs and images to structured Markdown using a local GLM-OCR model served through Ollama. It manages the full pipeline automatically: Ollama lifecycle, model pulling, PDF rendering, layout detection via PP-DocLayoutV3, and result merging. It is distributed through both npm and pip and exposes 50+ configurable options.

python · cli · ollama · ocr · pytorch

Notes

A production-grade OCR CLI that runs entirely locally by default, with no cloud dependencies.

Pipeline

PDF rendering (pypdfium2) → image normalization → layout detection (PP-DocLayoutV3 via HuggingFace) → OCR via Ollama's generate endpoint → structured Markdown output with page separators.
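
As a rough illustration of the OCR step only, the sketch below sends one page image to Ollama's generate endpoint using the Python standard library. It is not the tool's actual code: the model tag `glm-ocr`, the prompt text, and the helper names `build_generate_payload` and `ocr_image` are assumptions made for the example. The request shape (a base64-encoded `images` list, `stream` set to false, text returned under `response`) follows Ollama's documented `/api/generate` API.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL_NAME = "glm-ocr"  # assumed model tag; the tool's real tag may differ


def build_generate_payload(image_bytes: bytes, prompt: str, model: str = MODEL_NAME) -> dict:
    """Build the JSON body for one non-streaming /api/generate request."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ocr_image(image_bytes: bytes, prompt: str = "Convert this page to Markdown.") -> str:
    """Send one image to a running Ollama server and return the generated text."""
    body = json.dumps(build_generate_payload(image_bytes, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["response"]
```

Calling `ocr_image(open("page.png", "rb").read())` would require a running Ollama server with the model already pulled; the payload builder can be exercised without one.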

Highlights

  • Zero-config: auto-starts Ollama, pulls the ~2.2 GB model if missing, manages virtual environments
  • Handles PDFs, PNG, JPEG, WebP, BMP, TIFF, GIF
  • Error recovery: failed pages get placeholders and processing continues
  • Optional cloud mode via Zhipu Cloud MaaS backend
  • Clean shutdown with signal traps
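
The error-recovery behaviour listed above can be sketched as a loop that catches per-page failures and substitutes a placeholder before joining the results. This is a minimal sketch, not the tool's implementation: the function name `ocr_pages`, the HTML-comment placeholder, and the `---` page separator are all assumptions chosen for illustration.

```python
from typing import Callable, Iterable

PAGE_SEPARATOR = "\n\n---\n\n"  # assumed separator; the tool's real format may differ


def ocr_pages(pages: Iterable[bytes], ocr_page: Callable[[bytes], str]) -> str:
    """OCR each page in order, substituting a placeholder when a page fails."""
    results = []
    for number, page in enumerate(pages, start=1):
        try:
            results.append(ocr_page(page))
        except Exception as error:
            # One bad page should not abort the run: record a placeholder and continue.
            results.append(f"<!-- page {number} failed: {error} -->")
    return PAGE_SEPARATOR.join(results)
```

Passing the OCR function in as a parameter keeps the recovery logic independent of the backend, so the same loop would work with a local or a cloud call.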