DeepSeek, the Chinese artificial intelligence research company that has repeatedly challenged assumptions about AI development costs, has released a new model that fundamentally reimagines how large language models process information, and the implications extend far beyond its modest branding as an optical character recognition tool.
The company’s DeepSeek-OCR model, released Monday with full open-source code and weights, achieves what researchers describe as a paradigm inversion: compressing text through visual representation up to 10 times more efficiently than traditional text tokens. The finding challenges a core assumption in AI development and could pave the way for language models with dramatically expanded context windows, potentially reaching tens of millions of tokens.
“We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping,” the research team wrote in their technical paper. “Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%.”
The implications have resonated across the AI research community. Andrej Karpathy, co-founder of OpenAI and former director of AI at Tesla, said in a post that the work raises fundamental questions about how AI systems should process information. “Maybe it makes more sense that all inputs to LLMs should only ever be images,” Karpathy wrote. “Even if you happen to have pure text input, maybe you’d prefer to render it and then feed that in.”
How DeepSeek achieved 10x compression by treating text as images
While DeepSeek marketed the release as an OCR model, a technology for converting images of text into digital characters, the research paper reveals more ambitious goals. The model demonstrates that visual representations can serve as a superior compression medium for textual information, inverting the traditional hierarchy in which text tokens were considered more efficient than vision tokens.
“Traditionally, vision LLM tokens almost seemed like an afterthought or ‘bolt on’ to the LLM paradigm,” wrote Jeffrey Emanuel, an AI researcher, in a detailed analysis of the paper. “And 10k words of English would take up far more room in a multimodal LLM when expressed as intelligible pixels than when expressed as tokens…But that gets inverted now from the ideas in this paper.”
The model’s architecture consists of two primary components: DeepEncoder, a novel 380-million-parameter vision encoder, and a 3-billion-parameter mixture-of-experts language decoder with 570 million activated parameters. DeepEncoder combines Meta’s Segment Anything Model (SAM) for local visual perception with OpenAI’s CLIP model for global visual understanding, connected through a 16x compression module.
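To make the token flow concrete, here is a minimal, hypothetical sketch of that pipeline: a SAM-style local stage yields patch tokens, a 16x convolutional compressor shrinks them, and only the compressed tokens reach the global encoder and the decoder. The layer shapes and module choices below are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of DeepEncoder's 16x token compression step (PyTorch).
# Two stride-2 convolutions each cut the token grid 4x, giving 16x overall;
# the exact compressor architecture in the release may differ.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Reduce a square grid of patch tokens by 16x (4x per stride-2 conv, twice)."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_patches, dim), assumed to lie on a square grid
        b, n, d = tokens.shape
        side = int(n ** 0.5)
        grid = tokens.transpose(1, 2).reshape(b, d, side, side)
        out = self.conv(grid)                     # (b, d, side/4, side/4)
        return out.flatten(2).transpose(1, 2)     # (b, n/16, d)

# A 1024x1024 page at patch size 16 gives 64x64 = 4,096 local tokens from the
# SAM-style stage; after 16x compression, only 256 vision tokens move on to the
# CLIP-style global encoder and the language decoder.
patch_tokens = torch.randn(1, 4096, 768)
print(TokenCompressor(768)(patch_tokens).shape)   # torch.Size([1, 256, 768])
```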
To validate their compression claims, DeepSeek researchers tested the model on the Fox benchmark, a dataset of diverse document layouts. The results were striking: using just 100 vision tokens, the model achieved 97.3% accuracy on documents containing 700-800 text tokens, an effective compression ratio of roughly 7.5x. Even at compression ratios approaching 20x, accuracy remained around 60%.
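A quick back-of-the-envelope check of those ratios; the accuracy figures are the paper’s, while the text-token counts below are illustrative midpoints rather than exact values.

```python
# Back-of-the-envelope check of the compression ratios quoted above.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many original text tokens each vision token stands in for."""
    return text_tokens / vision_tokens

print(compression_ratio(750, 100))    # ~7.5x, where decoding accuracy stays near 97%
print(compression_ratio(2000, 100))   # ~20x, where accuracy falls to roughly 60%
```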
The practical impact: Processing 200,000 pages per day on a single GPU
The efficiency gains translate directly into production capabilities. According to the company, a single Nvidia A100-40G GPU can process more than 200,000 pages per day using DeepSeek-OCR. Scaled to a cluster of 20 servers with eight GPUs each, throughput reaches 33 million pages per day, enough to rapidly build training datasets for other AI models.
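The cluster figure follows from simple multiplication, assuming the company’s reported per-GPU rate scales linearly across all GPUs:

```python
# Rough throughput arithmetic behind the figures above; linear scaling is assumed.
pages_per_gpu_per_day = 200_000
servers, gpus_per_server = 20, 8
total_gpus = servers * gpus_per_server                      # 160 GPUs
print(f"{total_gpus * pages_per_gpu_per_day:,} pages/day")  # 32,000,000, in line with the ~33 million cited
```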
On OmniDocBench, a comprehensive document parsing benchmark, DeepSeek-OCR outperformed GOT-OCR2.0 (which uses 256 tokens per page) while using only 100 vision tokens. More dramatically, it surpassed MinerU2.0, which requires more than 6,000 tokens per page on average, while using fewer than 800 vision tokens.
DeepSeek designed the model to support five distinct resolution modes, each optimized for different compression ratios and use cases. The “Tiny” mode operates at 512×512 resolution with just 64 vision tokens, while “Gundam” mode combines multiple resolutions dynamically for complex documents. “Gundam mode consists of n×640×640 tiles (local views) and a 1024×1024 global view,” the researchers wrote.
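A hedged sketch of what those token budgets imply in configuration form; the Tiny figures and the tile-plus-global-view layout come from the text above, while the per-tile and global-view token counts are assumptions for illustration.

```python
# Hedged sketch of the token budgets implied above; per-tile and global-view
# token counts are assumptions, not figures from the paper.
TINY_MODE = {"resolution": (512, 512), "vision_tokens": 64}

def gundam_tokens(n_tiles: int, tokens_per_tile: int = 100, global_tokens: int = 256) -> int:
    """Assumed Gundam-mode budget: n local 640x640 tiles plus one 1024x1024 global view."""
    return n_tiles * tokens_per_tile + global_tokens

print(TINY_MODE["vision_tokens"])  # 64 tokens for a 512x512 page
print(gundam_tokens(5))            # 756, consistent with the "fewer than 800 vision tokens" cited above
```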
Why this breakthrough could unlock 10 million token context windows
The compression breakthrough has immediate implications for one of the most pressing challenges in AI development: expanding the context windows that determine how much information language models can actively consider. Current state-of-the-art models typically handle context windows measured in the hundreds of thousands of tokens. DeepSeek’s approach suggests a path to windows ten times larger.
“The potential of having a frontier LLM with a 10 or 20 million token context window is pretty exciting,” Emanuel wrote. “You could basically cram all of a company’s key internal documents into a prompt preamble and cache this with OpenAI and then just add your specific query or prompt on top of that and not have to deal with search tools and still have it be fast and cost-effective.”
The researchers explicitly frame their work in terms of context compression for language models. “Through DeepSeek-OCR, we demonstrate that vision-text compression can achieve significant token reduction (7-20×) for different historical context stages, offering a promising direction for addressing long-context challenges in large language models,” they wrote.
The paper includes a speculative but intriguing diagram illustrating how the approach could implement memory decay mechanisms similar to human cognition. Older conversation rounds could be progressively downsampled to lower resolutions, consuming fewer tokens while maintaining key information, a form of computational forgetting that mirrors biological memory.
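A speculative sketch of that decay mechanism, assuming past rounds are rendered as images and resized on a fixed schedule before re-encoding; the schedule, the PIL-based rendering, and the function names are illustrative, not from the paper.

```python
# Speculative sketch of the memory-decay idea: older conversation rounds are
# re-encoded at progressively lower resolution so they consume fewer vision tokens.
# The schedule and image sizes are illustrative assumptions.
from PIL import Image

DECAY_SCHEDULE = [1024, 1024, 768, 640, 512]  # pixels per side, newest -> oldest

def downsample_history(rendered_rounds: list[Image.Image]) -> list[Image.Image]:
    """Resize rendered rounds (newest first) according to the decay schedule."""
    resized = []
    for age, img in enumerate(rendered_rounds):
        side = DECAY_SCHEDULE[min(age, len(DECAY_SCHEDULE) - 1)]
        resized.append(img.resize((side, side)))
    return resized

# Example: six blank "pages" standing in for rendered conversation turns.
history = [Image.new("RGB", (1024, 1024), "white") for _ in range(6)]
print([im.size for im in downsample_history(history)])
# [(1024, 1024), (1024, 1024), (768, 768), (640, 640), (512, 512), (512, 512)]
```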
How visual processing could eliminate the ‘ugly’ tokenizer problem
Beyond compression, Karpathy highlighted how the approach challenges fundamental assumptions about how language models should process text. Traditional tokenizers, the systems that break text into units for processing, have long been criticized for their complexity and limitations.
“I already ranted about how much I dislike the tokenizer,” Karpathy wrote. “Tokenizers are ugly, separate, not end-to-end stage. It ‘imports’ all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network.”
Visual processing of text could eliminate these issues while enabling new capabilities. The approach naturally handles formatting information lost in pure text representations: bold text, colors, layout, embedded images. “Input can now be processed with bidirectional attention easily and as default, not autoregressive attention – a lot more powerful,” Karpathy noted.
The implications resonate with human cognitive science. Emanuel drew a parallel to Hans Bethe, the renowned physicist who memorized vast amounts of reference data: “Having vast amounts of task-specific knowledge in your working memory is incredibly useful. This seems like a very clever and additive approach to potentially expanding that memory bank by 10x or more.”
The model’s training: 30 million PDF pages across 100 languages
The model’s capabilities rest on an extensive training regimen using diverse data sources. DeepSeek collected 30 million PDF pages covering roughly 100 languages, with Chinese and English accounting for 25 million pages. The training data spans nine document types, including academic papers, financial reports, textbooks, newspapers, and handwritten notes.
Beyond document OCR, the training incorporated what the researchers call “OCR 2.0” data: 10 million synthetic charts, 5 million chemical formulas, and 1 million geometric figures. The model also received 20% general vision data for tasks like image captioning and object detection, plus 10% text-only data to maintain language capabilities.
The training process employed pipeline parallelism across 160 Nvidia A100-40G GPUs (20 nodes with 8 GPUs each), with the vision encoder divided between two pipeline stages and the language model split across two others. “For multimodal data, the training speed is 70B tokens/day,” the researchers reported.
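The implied parallelism layout, under the assumption that the four pipeline stages (two for the encoder, two for the decoder) are replicated data-parallel across the cluster; the data-parallel degree below is derived, not stated in the article.

```python
# Rough sketch of the parallelism arithmetic described above.
nodes, gpus_per_node = 20, 8
total_gpus = nodes * gpus_per_node            # 160 A100-40G GPUs
pipeline_stages = 4                           # 2 vision-encoder stages + 2 decoder stages
data_parallel_replicas = total_gpus // pipeline_stages
print(f"{total_gpus} GPUs = {data_parallel_replicas} replicas x {pipeline_stages} pipeline stages")
```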
Open source release accelerates research and raises competitive questions
True to DeepSeek’s pattern of open development, the company released the complete model weights, training code, and inference scripts on GitHub and Hugging Face. The GitHub repository gained over 4,000 stars within 24 hours of release, according to Dataconomy.
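For readers who want to experiment, here is a minimal loading sketch using Hugging Face Transformers, assuming the weights are published under the repository id "deepseek-ai/DeepSeek-OCR" and that the model card ships custom code enabled via trust_remote_code; the exact inference entry point is defined by the released repository and may differ.

```python
# Minimal, hedged sketch of pulling the open-source release from Hugging Face.
from transformers import AutoModel, AutoTokenizer

repo_id = "deepseek-ai/DeepSeek-OCR"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
# The GitHub release's inference scripts document how to run OCR on a rendered page image.
```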
The breakthrough raises questions about whether other AI labs have developed similar techniques but kept them proprietary. Emanuel speculated that Google’s Gemini models, which feature large context windows and strong OCR performance, might employ comparable approaches. “For all we know, Google could have already discovered something like this, which could explain why Gemini has such an enormous context size and is so good and fast at OCR tasks,” Emanuel wrote.
Google’s Gemini 2.5 Pro offers a 1-million-token context window, with plans to expand to 2 million, though the company has not publicly detailed the technical approaches enabling this capability. OpenAI’s GPT-5 supports 400,000 tokens, while Anthropic’s Claude 4.5 offers 200,000 tokens, with a 1-million-token window available in beta for eligible organizations.
The unanswered question: Can AI reason over compressed visual tokens?
While the compression results are impressive, researchers acknowledge important open questions. “It’s not clear how exactly this interacts with the other downstream cognitive functioning of an LLM,” Emanuel noted. “Can the model reason as intelligently over those compressed visual tokens as it can using regular text tokens? Does it make the model less articulate by forcing it into a more vision-oriented modality?”
The DeepSeek paper focuses primarily on the compression-decompression capability, measured through OCR accuracy, rather than downstream reasoning performance. This leaves open whether language models can reason effectively over large contexts represented primarily as compressed visual tokens.
The researchers acknowledge their work represents “an initial exploration into the boundaries of vision-text compression.” They note that “OCR alone is insufficient to fully validate true context optical compression” and plan future work including “digital-optical text interleaved pretraining, needle-in-a-haystack testing, and other evaluations.”
DeepSeek has established a pattern of achieving competitive results with dramatically lower computational resources than Western AI labs. The company’s earlier DeepSeek-V3 model reportedly cost just $5.6 million to train, though this figure represents only the final training run and excludes R&D and infrastructure costs, compared with the hundreds of millions spent on comparable models from OpenAI and Anthropic.
Industry analysts have questioned the $5.6 million figure, with some estimates placing the company’s total infrastructure and operational costs closer to $1.3 billion, though that is still lower than American competitors’ spending.
The bigger picture: Should language models process text as images?
DeepSeek-OCR poses a fundamental question for AI development: should language models process text as text, or as images of text? The research demonstrates that, at least for compression purposes, visual representation offers significant advantages. Whether this translates into effective reasoning over vast contexts remains to be determined.
“From another perspective, optical contexts compression still offers substantial room for research and improvement, representing a promising new direction,” the researchers concluded in their paper.
For the AI industry, the work adds another dimension to the race for longer context windows, a competition that has intensified as language models are applied to increasingly complex tasks requiring vast amounts of information. The open-source release ensures the technique will be widely explored, tested, and potentially integrated into future AI systems.
As Karpathy framed the deeper implication: “OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa.” In other words, the path forward for AI might not run through better tokenizers; it might bypass text tokens altogether.

