
Chinese researchers introduce LLaVA-o1 to challenge OpenAI's o1 model

OpenAI’s o1 model has shown that inference time scaling, the use of more computing power during inference, can significantly improve a language model’s reasoning capabilities. LLaVA-o1, a new model developed by researchers at several universities in China, applies this paradigm to open-source vision-language models (VLMs).

Early open-source VLMs typically use a direct prediction approach, generating answers without reasoning about the prompt and the steps required to solve it. Without a structured thought process, they are less effective at tasks that require logical reasoning. Advanced prompting techniques such as chain-of-thought (CoT) prompting, where the model is encouraged to generate intermediate reasoning steps, result in modest improvements, but VLMs still often produce errors or hallucinations.

The researchers found that a key problem is that the reasoning process in existing VLMs is not sufficiently systematic and structured. The models do not produce coherent chains of reasoning and frequently get stuck in reasoning processes where they do not know what stage they have reached or what specific problem they need to solve.

“We observe that VLMs often initiate responses without adequately organizing the problem and the available information,” the researchers write. “Moreover, they frequently deviate from logical reasoning toward conclusions, instead prematurely stating a conclusion and then attempting to justify it. Since language models generate responses token by token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”

Multi-stage reasoning

OpenAI o1 uses inference time scaling to address this lack of systematic, structured reasoning, allowing the model to pause and review its results as it gradually solves the problem. Although OpenAI has not published many details about o1’s underlying mechanism, its results show promising ways to improve the reasoning capabilities of foundation models.

Inspired by o1, the researchers developed LLaVA-o1 to perform stage-by-stage reasoning. Instead of generating a direct chain of reasoning, LLaVA-o1 breaks the reasoning process into four distinct stages:

Summary: The model first provides a high-level summary of the question and outlines the core problem to be solved.

Caption: If an image is present, the model describes the relevant parts, focusing on elements related to the question.

Reasoning: Building on the summary, the model performs structured, logical reasoning to derive a preliminary answer.

Conclusion: Finally, the model presents a concise summary of the answer based on the preceding reasoning.

Only the conclusion stage is visible to the user. The other three stages represent the model’s internal reasoning process, similar to the hidden reasoning trace of o1. This structured approach allows LLaVA-o1 to manage its reasoning process independently, leading to improved performance on complex tasks.

“This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks,” the researchers write.
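To make the four-stage format concrete, here is a minimal Python sketch of what a staged response and a stage checker might look like. The XML-style tag names and the example content are illustrative assumptions, not taken from the paper:

```python
import re

# Hypothetical example of a four-stage LLaVA-o1-style response.
# The tag names (<SUMMARY> etc.) are illustrative assumptions.
response = """
<SUMMARY>The question asks which container holds more liquid.</SUMMARY>
<CAPTION>The image shows two glasses: a tall narrow one and a short wide one.</CAPTION>
<REASONING>Volume depends on both height and radius; the wide glass's
larger radius outweighs the tall glass's extra height here.</REASONING>
<CONCLUSION>The short, wide glass holds more liquid.</CONCLUSION>
"""

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_stages(text: str) -> dict[str, str]:
    """Extract each stage's content; raise if a stage is missing."""
    out = {}
    for tag in STAGES:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if m is None:
            raise ValueError(f"missing stage: {tag}")
        out[tag] = m.group(1).strip()
    return out

stages = parse_stages(response)
print(stages["CONCLUSION"])  # only the conclusion is shown to the user
```

Being able to cheaply check and extract individual stages like this is also what makes per-stage verification practical, which the stage-level beam search described below relies on.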

LLaVA-o1 also introduces a novel inference time scaling technique called “stage-level beam search”. Stage-level beam search generates multiple candidate outputs at each reasoning stage, then selects the best candidate at each stage to continue the generation process. This contrasts with the classic best-of-N approach, which asks the model to generate multiple complete answers before selecting one, as sketched in the example below.

“Notably, it is the structured output design of LLaVA-o1 that makes this approach feasible, enabling efficient and accurate verification at each stage,” the researchers write. “This validates the effectiveness of structured output in improving inference time scaling.”
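For intuition, here is a minimal Python sketch contrasting the two strategies. The generate_stage and score functions are hypothetical stand-ins for the model’s sampling step and the candidate-selection criterion, which the article does not detail:

```python
import random

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_stage(context: str, stage: str) -> str:
    """Hypothetical stand-in: sample one candidate continuation for a stage."""
    return f"<{stage} candidate {random.randint(0, 999)}>"

def score(candidate: str) -> float:
    """Hypothetical stand-in for the criterion that ranks candidates."""
    return random.random()

def best_of_n(question: str, n: int) -> str:
    # Classic approach: generate n complete answers, then keep the best one.
    full_answers = []
    for _ in range(n):
        ctx = question
        for stage in STAGES:
            ctx += generate_stage(ctx, stage)
        full_answers.append(ctx)
    return max(full_answers, key=score)

def stage_level_beam_search(question: str, beam_size: int) -> str:
    # LLaVA-o1 approach: branch and prune at every reasoning stage.
    ctx = question
    for stage in STAGES:
        candidates = [ctx + generate_stage(ctx, stage) for _ in range(beam_size)]
        ctx = max(candidates, key=score)  # keep only the best partial answer
    return ctx

print(best_of_n("Q: ...", n=2))
print(stage_level_beam_search("Q: ...", beam_size=2))
```

The key design difference: best-of-N prunes only once, at the very end, while stage-level beam search prunes after every stage, so flawed partial reasoning is discarded before it can propagate to later stages.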

Training LLaVA-o1

Image: LLaVA-o1 training data

To train LLaVA-o1, the researchers compiled a new dataset of around 100,000 image-question-answer pairs drawn from several widely used VQA datasets. The dataset covers a variety of tasks, from multi-turn question answering to diagram interpretation and geometric reasoning.

The researchers used GPT-4o to generate the detailed four-stage reasoning processes for each example, covering the summary, caption, reasoning, and conclusion stages.
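A rough sketch of how such an annotation pass could look with the OpenAI Python client is shown below. The system prompt wording, the tag format, and the annotate helper are assumptions for illustration; the paper’s actual generation prompts are not reproduced here:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt asking GPT-4o for a four-stage annotation.
SYSTEM_PROMPT = (
    "Answer the visual question in four tagged stages: "
    "<SUMMARY>, <CAPTION>, <REASONING>, <CONCLUSION>."
)

def annotate(question: str, image_url: str) -> str:
    """Generate one structured training example from a VQA pair."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    # The tagged response becomes the training target for that VQA pair.
    return resp.choices[0].message.content
```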

The researchers then fine-tuned Llama-3.2-11B-Vision-Instruct on this dataset to obtain the final LLaVA-o1 model. They have not yet released the model, but plan to publish the dataset, called LLaVA-o1-100k.

LLaVA-o1 in action

The researchers evaluated LLaVA-o1 on several multimodal reasoning benchmarks. Although LLaVA-o1 was trained on only 100,000 examples, it showed significant performance improvements over the base Llama model, with an average benchmark score increase of 6.9%.

Image: LLaVA-o1 results

Additionally, stage-level beam search led to further performance gains, demonstrating the effectiveness of inference time scaling. Due to computing resource limitations, the researchers were only able to test the technique with a beam size of 2, but they expect even greater improvements with larger beam sizes.

Impressively, LLaVA-o1 outperformed not only other open-source models of the same or larger size, but also some closed-source models such as GPT-4o-mini and Gemini 1.5 Pro.

“LLaVA-o1 establishes a new standard for multimodal reasoning in VLMs, offering robust performance and scalability, especially in inference time,” the researchers write. “Our work paves the way for future research on structured reasoning in VLMs, including possible extensions with external verifiers and the use of reinforcement learning to further enhance complex multimodal reasoning capabilities.”
