The recent release of OpenAI o1 has brought significant attention to Large Reasoning Models (LRMs), inspiring new models aimed at solving complex problems that classic language models often struggle with. Building on the success of o1 and the concept of LRMs, researchers at Alibaba introduced Marco-o1, which improves reasoning abilities and tackles problems with open-ended solutions that lack clear standards and quantifiable rewards.
OpenAI o1 uses “inference-time scaling” to enhance the model’s reasoning ability by giving it “time to think.” Essentially, during inference, the model uses more compute cycles to generate more tokens and review its answers, which improves its performance on tasks that require reasoning. o1 is known for its impressive reasoning capabilities, especially on tasks with standard answers, such as math, physics, and coding.
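As a rough illustration of the idea (not OpenAI’s proprietary method), inference-time scaling can be approximated with best-of-N sampling: generate several candidate answers and keep the one the model produces most often. In the sketch below, the `generate` callable is a hypothetical stand-in for any LLM call.

```python
from collections import Counter
from typing import Callable

def best_of_n(generate: Callable[[str], str], prompt: str, n: int = 16) -> str:
    """Spend extra compute at inference time: sample n answers, then majority-vote."""
    candidates = [generate(prompt) for _ in range(n)]  # n independent samples
    return Counter(candidates).most_common(1)[0][0]    # most frequent answer wins
```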
However, many applications involve open-ended problems for which there are no clear solutions or quantifiable rewards. “Our goal was to push the boundaries of LLMs even further and enhance their reasoning abilities to tackle complex, real-world challenges,” the Alibaba researchers write.
Marco-o1 is a fine-tuned version of Alibaba's Qwen2-7B-Instruct that integrates advanced techniques such as chain-of-thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), and reasoning action strategies.
The researchers trained Marco-o1 on a combination of datasets, including the Open-O1 CoT dataset; the Marco-o1 CoT dataset, a synthetic dataset generated using MCTS; and the Marco-o1 Instruction dataset, a collection of custom instruction-following data for reasoning tasks.
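The article does not spell out the data schema, but a CoT fine-tuning record typically pairs an instruction with a response that writes out the intermediate reasoning before the final answer. The example below is purely illustrative; the tag names are assumptions, not Marco-o1’s documented format.

```python
# Illustrative CoT fine-tuning record (schema and tags are assumed, not official).
cot_example = {
    "instruction": "Pencils come in packs of 12. How many packs are needed for 150 students, one pencil each?",
    "response": (
        "<Thought>150 pencils are needed. 150 / 12 = 12.5, and packs cannot be split, "
        "so round up to 13 packs (13 x 12 = 156 >= 150).</Thought>"
        "<Output>13</Output>"
    ),
}
```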
MCTS is a search algorithm that has proven effective in complex problem-solving scenarios. It intelligently explores different solution paths by repeatedly sampling options, simulating outcomes, and gradually building a decision tree. It has proven very effective on hard AI problems, such as mastering the game of Go.
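The generic loop is simple to sketch. The minimal, domain-agnostic version below assumes the caller supplies the problem-specific pieces (legal moves, a way to apply them, and a rollout that returns a reward); it is a textbook illustration of the select/expand/simulate/backpropagate cycle, not Marco-o1’s implementation.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    # Upper Confidence Bound: balance exploiting high-value nodes and exploring rarely visited ones.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root_state, legal_moves, apply_move, rollout, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: walk down the tree, always picking the child with the best UCB score.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: add a child for each legal move from this state.
        for move in legal_moves(node.state):
            node.children.append(Node(apply_move(node.state, move), parent=node))
        if node.children:
            node = random.choice(node.children)
        # 3. Simulation: play out a rollout from the new node and score it.
        reward = rollout(node.state)
        # 4. Backpropagation: push the reward back up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # The most-visited child of the root is the preferred next step.
    return max(root.children, key=lambda n: n.visits).state
```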
Marco-o1 uses MCTS to explore multiple reasoning paths as it generates response tokens. The model uses the confidence scores of candidate response tokens to build its decision tree and explore different branches. This enables the model to consider a wider range of possibilities and reach more informed and nuanced conclusions, especially in scenarios with open-ended solutions. The researchers also introduced a flexible reasoning action strategy that lets users adjust the granularity of the MCTS steps by defining the number of tokens generated at each node in the tree. This provides a trade-off between accuracy and computational cost, giving users the flexibility to balance performance and efficiency.
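The article describes using token-level confidence to guide the tree but not the exact formula. One natural way to turn per-token log-probabilities into a step score, which is an assumption rather than Marco-o1’s published method, is to softmax each chosen token against its top-k alternatives and average over the step; the step length (how many tokens make up one node) is the granularity knob the researchers describe.

```python
import math

def step_confidence(chosen_logprobs, alternative_logprobs):
    """Score one reasoning step (a chunk of tokens at an MCTS node).

    chosen_logprobs:      log-prob of each token the model actually emitted in the step
    alternative_logprobs: per position, log-probs of the top-k candidates (including the chosen token)
    """
    confidences = []
    for chosen, alts in zip(chosen_logprobs, alternative_logprobs):
        # How dominant was the chosen token among its top-k competitors?
        denom = sum(math.exp(a) for a in alts)
        confidences.append(math.exp(chosen) / denom)
    # The average per-token confidence serves as the reward that guides the search.
    return sum(confidences) / len(confidences)
```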
Another key innovation in Marco-o1 is the introduction of a reflection mechanism. During the reasoning process, the model is periodically prompted with the phrase, “Wait! Maybe I made some mistakes! I need to rethink from scratch.” This causes the model to re-evaluate its reasoning steps, identify potential errors, and refine its thought process.
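A mechanism like this can also be approximated outside the model by appending the reflection phrase to the model’s draft and letting it continue. The sketch below assumes a generic `generate` callable and is not Marco-o1’s internal implementation, where the cue is part of the trained reasoning process.

```python
REFLECTION_CUE = "Wait! Maybe I made some mistakes! I need to rethink from scratch."

def generate_with_reflection(generate, question: str, rounds: int = 2) -> str:
    """Ask the model to critique and revise its own draft before settling on an answer."""
    transcript = question
    answer = generate(transcript)
    for _ in range(rounds):
        # Append the reflection cue so the next pass re-examines the previous attempt.
        transcript = f"{transcript}\n{answer}\n{REFLECTION_CUE}\n"
        answer = generate(transcript)
    return answer
```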
“This approach allows the model to act as its own critic and identify potential errors in its reasoning,” the researchers write. “By explicitly prompting the model to question its initial conclusions, we encourage it to re-express and refine its thought process.”
To evaluate Marco-o1's performance, the researchers ran experiments on several tasks, including MGSM, a benchmark of multilingual grade-school math problems. Marco-o1 significantly outperformed the base Qwen2-7B-Instruct model, especially when the MCTS component was configured to single-token granularity.
However, the main goal of Marco-o1 was to address the challenges of reasoning in open-ended scenarios. To this end, the researchers tested the model on translating slang and colloquial expressions, a task that requires understanding subtle nuances of language, culture, and context. The experiments showed that Marco-o1 could capture and translate these expressions more effectively than traditional translation tools. For example, the model correctly translated a colloquial Chinese expression that literally means “This shoe offers a stepping-on-poo sensation” into the English equivalent, “This shoe has a comfortable sole.” The model's chain of reasoning shows how it evaluates different possible meanings and arrives at the correct translation.
This paradigm can prove useful for tasks such as product design and strategy, which require deep, contextual understanding and do not have clearly defined benchmarks and metrics.
A new wave of reasoning models
Since the release of o1, AI labs have been racing to release reasoning models. Last week, Chinese AI lab DeepSeek released R1-Lite-Preview, its o1 competitor, which is currently only available through the company's online chat interface. R1-Lite-Preview reportedly beats o1 on several key benchmarks.
The open-source community is also catching up with the private model market, releasing models and datasets that take advantage of inference-time scaling laws. The Alibaba team released Marco-o1 on Hugging Face along with a partial reasoning dataset that researchers can use to train their own reasoning models. Another recently released model is LLaVA-o1, developed by researchers from several universities in China, which brings the inference-time reasoning paradigm to open-source vision language models (VLMs).
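For readers who want to try the model, the released checkpoint should load with the standard Hugging Face transformers API. The repo id below is an assumption based on the announcement and should be checked against the actual model page.

```python
# Minimal sketch for loading and querying the released checkpoint with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-o1"  # assumed repo id; verify on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many r's are in the word 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```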
The release of these models comes amid uncertainty about the future of model scaling laws. Various reports suggest that the returns from training larger models are diminishing and may be reaching a limit. What is certain, however, is that we are only beginning to explore the possibilities of inference-time scaling.