
Baidu Qianfan Team Releases Qianfan-OCR: 4B Integrated Document Intelligence Model





The Baidu Qianfan team has introduced Qianfan-OCR, an end-to-end 4B-parameter model designed to integrate document parsing, structure analysis, and document understanding within a single vision-language framework. Unlike typical multi-stage OCR pipelines that use separate modules for layout detection and text recognition, Qianfan-OCR performs direct image-to-Markdown conversion and supports downstream operations such as table extraction and document question answering.

Architectural and Technical Specifications

Qianfan-OCR adopts the multimodal bridge architecture of the Qianfan-VL framework and consists of three main components:

  • Vision Encoder (Qianfan-ViT): Employs an any-resolution design that tiles images into 448 x 448 patches. It supports input resolutions up to 4K and generates up to 4,096 visual tokens per image to preserve the spatial detail of small fonts and dense text.
  • Cross-Modal Adapter: A lightweight two-layer MLP with GELU activation that projects visual features into the embedding space of the language model.
  • Core Language Model (Qwen3-4B): A 4.0B-parameter model with 36 layers and a native 32K context window. It uses Grouped-Query Attention (GQA) to reduce KV-cache memory usage by 4x.
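The bridge design above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation: the vision feature width (1024) is an assumption, while the output width (2560) matches Qwen3-4B's published hidden size.

```python
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    """Two-layer MLP with GELU that projects ViT features into the
    language model's embedding space (dimensions are illustrative)."""
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 2560):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, vision_dim) from the encoder
        return self.proj(visual_tokens)

# Up to 4,096 visual tokens per image at 4K input resolution.
adapter = CrossModalAdapter()
features = torch.randn(1, 4096, 1024)   # placeholder ViT output
embeddings = adapter(features)          # ready to prepend to LM input
print(embeddings.shape)                 # torch.Size([1, 4096, 2560])
```

The adapter is the only trainable glue between the two pretrained towers, which is what keeps the bridge "lightweight."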

‘Structure-as-thinking’ Method

The model's defining feature is 'structure-as-thinking', an optional thinking phase triggered by special tokens. During this phase, the model generates a structured layout representation, including bounding boxes, element types, and reading order, before producing the final output.

  • Rationale: This technique restores the explicit layout-analysis capabilities (element localization and type classification) that are often lost in end-to-end paradigms.
  • Empirical behavior: Evaluation on OmniDocBench v1.5 shows that enabling the thinking phase yields a consistent advantage on documents with high “structural label entropy”, i.e., those containing a mix of elements such as text, formulas, and diagrams.
  • Efficiency: Bounding boxes are represented with dedicated special tokens, reducing the length of the thinking output by about 50% compared to plain digit sequences.
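The token-budget argument behind the efficiency point can be illustrated with a toy count. This sketch assumes (hypothetically) that each quantized coordinate maps to one special token, whereas a plain encoding spends one token per digit; the exact savings in Qianfan-OCR's thinking trace will differ.

```python
# Toy comparison: digit-by-digit vs. special-token bounding-box encoding.

def digit_token_count(box):
    """Plain encoding: one token per digit of each coordinate."""
    return sum(len(str(v)) for v in box)

def special_token_count(box):
    """Dedicated-token encoding: one token per coordinate."""
    return len(box)

# Two hypothetical (x1, y1, x2, y2) boxes on a page.
boxes = [(120, 48, 873, 512), (64, 530, 900, 760)]

d = sum(digit_token_count(b) for b in boxes)    # digit tokens used
s = sum(special_token_count(b) for b in boxes)  # special tokens used
print(d, s, f"reduction: {1 - s / d:.0%}")
```

Shorter thinking traces matter because the structural preamble is generated autoregressively before any Markdown is emitted, so its length directly adds to latency.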

Performance and Evaluation

Qianfan-OCR was evaluated against both specialized OCR systems and general-purpose vision-language models (VLMs).

Document Parsing and Standard OCR

The model ranks first among recent models on several key benchmarks:

  • OmniDocBench v1.5: Scores 93.12, outperforming DeepSeek-OCR-v2 (91.09) and Gemini-3 Pro (90.33).
  • OlmOCR-Bench: Scores 79.8, leading among end-to-end models.
  • OCRBench: Scores 880, ranking first among all tested models.

On public KIE (key information extraction) benchmarks, Qianfan-OCR achieved the highest average score (87.9), outperforming much larger models.

Model                 Overall Mean (KIE)   OCRBench KIE   Nanonets KIE (F1)
Qianfan-OCR (4B)      87.9                 95.0           86.5
Qwen3-4B-VL           83.5                 89.0           83.3
Qwen3-VL-235B-A22B    84.2                 94.0           83.8
Gemini-3.1-Pro        79.2                 96.0           76.1

Document Understanding

Comparative tests revealed that two-stage OCR+LLM pipelines often fail at tasks requiring spatial reasoning. For example, all two-stage systems scored 0.0 on the CharXiv benchmark, because the text-extraction stage discards the visual context (axis relationships, data-point positions) needed to interpret charts.

Deployment and Inference

Inference throughput was measured in Pages Per Second (PPS) on a single NVIDIA A100 GPU.

  • Quantization: With W8A8 (AWQ) quantization, Qianfan-OCR achieves 1.024 PPS, a 2x speedup over the W16A16 baseline with negligible accuracy loss.
  • Architectural advantage: Unlike pipeline systems that rely on CPU-based layout analysis, which can become a bottleneck, Qianfan-OCR is fully GPU-resident. This avoids inter-stage handoff delays and enables efficient batched inference.
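A quick back-of-envelope calculation shows what the reported speedup means for a large batch job. The W16A16 baseline throughput below is inferred from the stated 2x figure, and the 10,000-page workload is a hypothetical example.

```python
# Back-of-envelope throughput check using the reported figures.
pps_w8a8 = 1.024                  # pages/second with W8A8 (AWQ)
speedup = 2.0                     # reported speedup over W16A16
pps_w16a16 = pps_w8a8 / speedup   # inferred baseline: 0.512 pages/second

pages = 10_000                    # hypothetical document batch
hours_w8a8 = pages / pps_w8a8 / 3600
hours_w16a16 = pages / pps_w16a16 / 3600
print(f"W8A8: {hours_w8a8:.1f} h, W16A16: {hours_w16a16:.1f} h")
```

Under these assumptions, quantization roughly halves the wall-clock time for the batch, which is the practical payoff of the 2x PPS figure.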

Check out the Paper, Repo, and Model on Hugging Face.


Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.





