This is a simplified guide to an AI model called GLM-OCR maintained by zai-org. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.
Model overview
GLM-OCR is a multimodal optical character recognition model designed for complex document understanding. Built on the GLM-V encoder-decoder architecture, it combines a CogViT visual encoder with a lightweight cross-modal connector and a GLM-0.5B language decoder. The model ranks first overall on OmniDocBench V1.5 with a score of 94.62, demonstrating state-of-the-art performance across document understanding benchmarks. Unlike simpler OCR systems, GLM-OCR handles diverse document layouts through a two-stage pipeline: layout analysis with PP-DocLayout-V3, followed by parallel recognition of the detected regions. Similar models like glm-4v-9b and dots.ocr also tackle multimodal document understanding, though GLM-OCR distinguishes itself through its integration of Multi-Token Prediction loss and stable full-task reinforcement learning for improved training efficiency.
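The two-stage design is easiest to see as a sketch. The code below is a conceptual illustration only, assuming hypothetical `detect` and `generate` methods on the layout and recognition models; it is not the actual GLM-OCR API.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_document(image, layout_model, ocr_model):
    """Conceptual two-stage pipeline: layout analysis, then parallel recognition."""
    # Stage 1: detect layout regions (text blocks, tables, formulas, etc.)
    regions = layout_model.detect(image)  # hypothetical layout-analysis call

    # Stage 2: recognize each cropped region independently, in parallel
    def recognize(region):
        crop = image.crop(region.bbox)
        prompt = {"table": "Table Recognition:",
                  "formula": "Formula Recognition:"}.get(region.type, "Text Recognition:")
        return ocr_model.generate(crop, prompt)  # hypothetical recognition call

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(recognize, regions))

    # Reassemble region-level results in reading order
    return "\n\n".join(results)
```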
Model inputs and outputs
GLM-OCR accepts images of documents and text prompts describing the extraction task. The model outputs structured text content extracted from the document, formatted according to the specified prompt. This design enables both document parsing tasks and structured information extraction, making it flexible for various downstream applications.
Inputs
- Document images in standard formats like PNG or JPG
- Text prompts specifying the extraction task (text recognition, formula recognition, table recognition, or structured information extraction)
Outputs
- Extracted text content from documents in raw format
- Structured JSON data for information extraction tasks
- Formula and table recognition results when prompted for these specific document elements
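To make the input/output contract concrete, here is a minimal sketch of a single inference call. It assumes a Hugging Face-style interface; the repository ID `zai-org/GLM-OCR`, the auto classes, and the exact prompt strings are assumptions and may differ from the official SDK.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq  # assumed classes

model_id = "zai-org/GLM-OCR"  # assumed repository name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

image = Image.open("invoice.png")
prompt = "Text Recognition:"  # or "Table Recognition:", "Formula Recognition:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```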
Capabilities
The model excels at recognizing text across complex layouts, including formulas, tables, and multi-column documents. It maintains robust performance on challenging real-world scenarios such as code-heavy documents, official seals, and documents with varied formatting. The model can extract structured information by adhering to predefined JSON schemas, making it suitable for automated data collection from documents like identification cards or forms. With only 0.9B parameters, it supports efficient inference through vLLM, SGLang, and Ollama, reducing latency and computational requirements compared to larger alternatives.
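As a sketch of how schema-guided extraction might look in practice: the prompt wording and the `generate` callable below are assumptions rather than the model's documented interface; the point is the pattern of passing a target JSON schema in the prompt and validating the JSON that comes back.

```python
import json

def extract_fields(image_path: str, schema: dict, generate) -> dict:
    """Schema-guided extraction: build a prompt around the target schema,
    call the model, and parse the returned JSON.

    `generate` is whatever inference callable you already have (transformers,
    vLLM, SGLang, or Ollama); it takes (image_path, prompt) and returns text.
    """
    prompt = (
        "Structured Information Extraction:\n"
        "Return valid JSON matching this schema:\n" + json.dumps(schema, indent=2)
    )
    raw = generate(image_path, prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"_raw": raw}  # keep malformed output around for inspection

# Example schema for an ID card
id_card_schema = {"name": "string", "id_number": "string", "date_of_birth": "YYYY-MM-DD"}
```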
What can I use it for?
Organizations can deploy this model for automated document processing pipelines, such as digitizing paper records, extracting data from invoices or receipts, and automating form processing. Financial institutions might use it for loan application processing or document verification. Educational platforms could automate transcript processing. The model's efficiency makes it practical for high-concurrency services and edge deployments where computational resources are limited. The open-source nature and comprehensive SDK from zai-org enable developers to integrate it into production systems quickly.
Things to try
Experiment with different prompt types to extract specific information—use "Text Recognition:" for raw content extraction, "Table Recognition:" for structured table data, or custom JSON schemas for targeted information extraction. Test the model's performance on documents with mixed languages or unusual layouts to understand its generalization capabilities. Try deploying the model via different inference engines to evaluate which offers the best performance-to-latency tradeoff for your use case. Additionally, explore how the model handles documents with low-quality scans or unusual orientations to establish reliability boundaries for your application.
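One convenient way to compare prompts and engines is to serve the model behind an OpenAI-compatible endpoint and reuse the same client code everywhere. The sketch below assumes a vLLM server started with something like `vllm serve zai-org/GLM-OCR` (the repository ID is an assumption); the same loop works against any engine exposing that API.

```python
import base64
from openai import OpenAI

# Point the OpenAI client at the locally served, OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("scan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Cycle through prompt types on the same page to compare outputs
for prompt in ["Text Recognition:", "Table Recognition:", "Formula Recognition:"]:
    response = client.chat.completions.create(
        model="zai-org/GLM-OCR",  # assumed model ID
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        max_tokens=1024,
    )
    print(prompt, "->", response.choices[0].message.content)
```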