The OCR That LLMs Will Not Replace
There is a widespread narrative in the AI ecosystem that multimodal LLMs (GPT-4o, Gemini, Claude) make traditional OCR obsolete. Send a document image to GPT-4o and it extracts the text; problem solved. This narrative is wrong for a simple reason: generic LLMs do not understand layout.
When you send a complex financial table with merged cells to a multimodal LLM, it extracts text but loses structure. Columns get mixed up, merged cells disappear, heading hierarchies flatten. For a simple document with linear text, it works. For an annual report, an administrative form, or a scientific publication with equations, it is unusable.

Chandra OCR, developed by Datalab, a Brooklyn-based startup founded in 2024 by Vik Paruchuri and Sandy Kwon, is the first model to solve this problem convincingly in open source. With 85.9% on the olmOCR benchmark (the new de facto standard), Chandra 2 surpasses all alternatives, including dots.ocr (83.9%), olmOCR 2 (78.5%), and DeepSeek OCR (75.4%). The gaps are most pronounced on tables (89.9%), mathematics (89.3%), and headers/footers (92.5%).
The Technical Architecture: Full-Page Decoding Changes Everything
What distinguishes Chandra from traditional OCR is its full-page decoding approach with layout awareness. Classic OCR systems, including Datalab's earlier tools (Marker and Surya), follow a pipeline approach: segment the document into blocks, identify each block type, then process each block separately. This approach loses spatial relationships between elements.
Chandra uses a vision-language model (based on Qwen3 VL contributions) that processes the entire page in a single pass. It simultaneously identifies content types (text, tables, images, formulas, checkboxes), extracts and captions images, preserves table structures (including colspan and rowspan), reconstructs forms, and handles handwriting and mathematical equations (output as LaTeX).
The Chandra 2 model has 4 billion parameters and supports over 90 languages. It produces structured outputs in Markdown, HTML, or JSON with bounding box coordinates, enabling complete programmatic exploitation.
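To make the structured-output claim concrete, here is a minimal sketch of how JSON output with bounding boxes can be consumed programmatically. The field names (`blocks`, `type`, `text`, `bbox`) are an assumption for illustration, not Chandra's documented schema:

```python
import json

# Hypothetical sample of structured OCR output: typed blocks with text
# content and bounding-box coordinates. Field names are illustrative only.
sample_output = json.dumps({
    "blocks": [
        {"type": "heading", "text": "Q3 Results", "bbox": [40, 32, 560, 64]},
        {"type": "table",
         "text": "| Region | Revenue |\n|---|---|\n| EMEA | 1.2M |",
         "bbox": [40, 80, 560, 240]},
    ]
})

def blocks_of_type(raw_json: str, block_type: str) -> list[dict]:
    """Filter extracted blocks by content type (text, table, formula, ...)."""
    doc = json.loads(raw_json)
    return [b for b in doc["blocks"] if b["type"] == block_type]

tables = blocks_of_type(sample_output, "table")
```

Because each block carries a type and coordinates, downstream code can route tables, formulas, and running text to different handlers instead of treating the page as one flat string.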
In terms of inference performance, Chandra processes 4 pages per second on an H100 GPU, approximately 345,000 pages per day. With vLLM and 96 concurrent requests, measured throughput is 1.44 pages per second, which remains sufficient for most batch use cases.
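The daily figure follows directly from the per-second rate; a quick back-of-envelope check of the article's own numbers:

```python
# Sanity-check the quoted throughput: 4 pages/s on an H100 over a full day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

pages_per_sec_h100 = 4
pages_per_day = pages_per_sec_h100 * SECONDS_PER_DAY
print(pages_per_day)  # 345600, matching the ~345,000 pages/day figure
```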
| Metric | Chandra 2 | dots.ocr | olmOCR 2 | DeepSeek OCR | Gemini 2.5 Flash |
|---|---|---|---|---|---|
| Overall olmOCR score | 85.9% | 83.9% | 78.5% | 75.4% | 67.6% |
| Tables | 89.9% | N/A | N/A | N/A | N/A |
| Mathematics | 89.3% | N/A | N/A | N/A | N/A |
| Headers/footers | 92.5% | N/A | N/A | N/A | N/A |
| Multilingual (43 langs) | 77.8% | N/A | N/A | N/A | 67.6% |
Why Layout Understanding Is a Hot Topic Again
Chandra's timing is no accident. Three trends are converging to put layout understanding back at center stage.
The first trend is the explosion of RAG on enterprise documents. RAG systems indexing PDF documents (financial reports, legal contracts, technical documentation) need high-quality structured extraction. A RAG system that ingests a table as raw text loses the structural information that is often the most important information.
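The difference between raw text and structure is easy to demonstrate. A minimal sketch, assuming the OCR layer emits simple Markdown tables (no colspan/rowspan handling), of turning a table into per-row records a RAG index can store with their headers intact:

```python
def parse_markdown_table(md: str) -> list[dict]:
    """Turn a simple Markdown table (as emitted by structured OCR) into one
    record per row, so a RAG index keeps cells tied to their column headers
    instead of ingesting a flat blob of text. Sketch only: assumes a
    well-formed table without merged cells."""
    lines = [l.strip() for l in md.strip().splitlines() if l.strip()]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

table = """
| Year | Revenue | Margin |
|---|---|---|
| 2023 | 4.1M | 12% |
| 2024 | 5.3M | 15% |
"""
records = parse_markdown_table(table)
# records[1] -> {'Year': '2024', 'Revenue': '5.3M', 'Margin': '15%'}
```

A retriever that stores `records` can answer "what was the 2024 margin?" precisely; a retriever that stored the same table as unstructured text has no reliable link between "15%" and "2024".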
The second trend is the rise of AI agents that process documents. Autonomous agents processing invoices, administrative forms, or medical files require layout understanding to extract the right information from the right fields. Raw text OCR is not sufficient.
The third trend is growing demand for processing historical and handwritten documents. Archive digitization, handwritten note transcription, and processing of historical documents are expanding markets that require OCR capable of handling complex layouts and handwriting.
Installation and Practical Usage
One of Chandra's strengths is its installation simplicity. Running `pip install chandra-ocr` followed by `chandra input.pdf output/` is enough to get started. The Python API via `InferenceManager` allows finer integration into existing pipelines.
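For scripted batch runs, a thin wrapper around the CLI is often enough before reaching for the Python API. A minimal sketch (the command shape matches the CLI invocation above; everything else here is illustrative):

```python
import shutil
import subprocess

def build_chandra_cmd(input_pdf: str, output_dir: str) -> list[str]:
    # CLI shape from the quick start: `chandra input.pdf output/`
    return ["chandra", input_pdf, output_dir]

def run_chandra(input_pdf: str, output_dir: str) -> None:
    """Invoke the chandra CLI on one document, failing loudly if it is
    missing or exits non-zero. Sketch: no retry or logging."""
    if shutil.which("chandra") is None:
        raise RuntimeError("chandra CLI not found (pip install chandra-ocr)")
    subprocess.run(build_chandra_cmd(input_pdf, output_dir), check=True)
```

Wrapping the CLI this way keeps the integration trivial to swap out later if you migrate to the `InferenceManager` API or a hosted endpoint.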
For larger-scale deployments, the vLLM server enables higher throughput. Quantized versions (8B and 2B parameters) are available commercially for organizations needing performance on more modest hardware.
The model is available on Hugging Face under the identifier datalab-to/chandra. The code is Apache 2.0 licensed, and model weights are under a modified OpenRAIL-M license that is free for startups under $2 million annual revenue. Beyond that, a commercial license is required.
Datalab also offers a free hosted playground at datalab.to and hosted APIs for organizations that prefer not to manage infrastructure. The community is active on Discord.
Practical Applications for Agencies
For agencies like Bridgers, Chandra OCR opens possibilities in several domains.
The first domain is feeding RAG pipelines. If you are building a RAG system for a client in finance, legal, or healthcare, the quality of document extraction determines the quality of the entire system. Replacing basic OCR with Chandra for the ingestion phase can significantly improve answer relevance, particularly on documents containing tables and structured data.
The second domain is automating document workflows. Data extraction from invoices, purchase orders, administrative forms, or audit reports can be automated with Chandra. A use case reported by Purchaser.ai shows six-figure savings on purchase document processing.
The third domain is archive digitization. For clients with significant paper archives (law firms, cultural institutions, government agencies), Chandra offers superior extraction quality on historical, handwritten, and multi-column documents.
The fourth domain is multilingual processing. With 77.8% average score across 43 languages (versus 67.6% for Gemini 2.5 Flash), Chandra is particularly relevant for international clients processing documents in multiple languages.

Limitations and Alternatives to Consider
Chandra is not the universal solution for all OCR needs.
For simple documents with linear text without tables or formulas, a multimodal LLM like GPT-4o or Gemini may be sufficient and easier to integrate. Chandra's overhead is only justified on complex structured documents.
The modified OpenRAIL-M license imposes restrictions for companies above $2 million in revenue. Agencies deploying Chandra for large clients must evaluate the commercial license cost.
The need for an H100 GPU or equivalent for optimal performance can be a constraint for on-premise deployments. Quantized versions reduce hardware requirements but with quality impacts to document on a case-by-case basis.
Alternatives to keep on the radar include olmOCR 2 (open source, 78.5% on the benchmark), PaddleOCR (Apache 2.0, strong on tables), and dots.ocr (83.9%); organizations preferring a managed service can turn to hosted document-processing APIs.
Structured Documents as the Foundation of Enterprise AI
The ability to extract structured information from complex documents is a fundamental building block of enterprise AI that many underestimate. The most powerful language models in the world are useless if the data fed to them is poorly extracted.
Chandra OCR, with its full-page decoding approach and state-of-the-art performance on tables, formulas, and complex layouts, fills a critical gap in the ecosystem. For agencies building AI solutions for their clients, the quality of the OCR layer often determines the success or failure of the entire project.
The race for structured documents is just beginning, and Chandra has set the new standard. Agencies integrating this building block into their architectures today are building on stronger foundations than those that simply send images to an LLM and hope for the best.
A strategic point for agency decision-makers: OCR quality determines the performance ceiling of every downstream AI system. You can invest in the best language model on the market for your chatbot, research assistant, or analysis agent. If the documents it consults were poorly extracted, the answers will be flawed. Chandra OCR does not sell itself as a spectacular product. It is an infrastructure building block, invisible but decisive, that makes the difference between a RAG system that correctly answers questions about financial tables and one that hallucinates numbers because columns were mixed up during extraction.
The Datalab team's trajectory is also worth watching. Having progressed from Marker (a PDF-to-markdown tool) through Surya (a multi-language OCR) to Chandra (a full-page vision-language model), they have systematically climbed the complexity ladder of document understanding. With $3.5 million in seed funding from Pebblebed and a clear technical roadmap, Datalab is positioning itself as the definitive document intelligence layer for the AI ecosystem. For agencies building long-term partnerships with infrastructure providers, this is the kind of focused, technically excellent company that makes a reliable foundation.
For agencies working with clients in finance, legal, or healthcare, the recommendation is clear: evaluate Chandra OCR on a representative sample of your client's documents, compare the structured outputs with those from your current OCR, and measure the impact on your RAG system's answer quality. The results will speak for themselves.
The practical integration path is straightforward. Start with the `pip install` and CLI tool on a sample of your most challenging documents: the ones with complex tables, merged cells, multi-column layouts, or handwritten annotations. Compare the structured output (Markdown with table syntax, LaTeX for equations, JSON with bounding boxes) against what your current pipeline produces. If the quality difference is significant on your documents, proceed to integrating the Python API via `InferenceManager` into your ingestion pipeline. For production deployments handling thousands of pages daily, set up the vLLM server for throughput optimization.
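A cheap way to triage that comparison across a document sample is a rough text-similarity score between the two pipelines' outputs. This is a proxy for divergence, not a quality metric, but it flags which documents deserve a manual side-by-side review first:

```python
import difflib

def text_similarity(a: str, b: str) -> float:
    """Rough similarity ratio between two OCR outputs (0.0 to 1.0).
    A low ratio means the pipelines diverge heavily on this document,
    making it a priority for manual side-by-side inspection."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Illustrative outputs: a flat OCR that scrambled the columns vs. a
# structure-preserving extraction of the same table.
current = "Revenue 2024: 5.3M 4.1M"
chandra = "| Year | Revenue |\n| 2023 | 4.1M |\n| 2024 | 5.3M |"
score = text_similarity(current, chandra)
```

Sorting the sample by ascending `score` gives a review queue that concentrates human attention on the documents where switching OCR would change the most.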
One final consideration: the community around Chandra is active and growing. The Discord server provides direct access to the development team, and the open-source nature of the code means that bugs and edge cases get resolved quickly through community contributions. For agencies that need to customize the model for specific document types (medical forms, legal filings, engineering drawings), the open architecture makes this feasible in a way that proprietary OCR services simply do not allow.
