Very small language models (SLMs) can exceed leading large language models (LLMs) in reasoning tasks, according to a new study from Shanghai AI Laboratory. The authors show that with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can outperform a 405B LLM on complicated math benchmarks.
The ability to deploy SLMs for complex reasoning tasks can be very useful as enterprises look for new ways to use these models in different environments and applications.
Test-time scaling, explained
Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use "internal TTS", which means they are trained to "think" slowly by generating a long chain-of-thought (CoT) string of tokens.
An alternative approach is "external TTS", where (as the name implies) model performance is enhanced with outside help. External TTS is suitable for repurposing existing models for reasoning tasks without further fine-tuning. An external TTS setup is usually composed of a "policy model", which is the main LLM that generates the answer, and a process reward model (PRM) that evaluates the policy model's answers. These two components are coupled together through a sampling or search method.
The simplest setup is "best-of-N", where the policy model generates several answers and the PRM selects one or more of the best answers to compose the final response. More advanced external TTS methods use search. In "beam search", the model breaks the answer down into multiple steps. For each step, it samples several candidate continuations and runs them through the PRM. It then chooses one or more suitable candidates and generates the next step of the answer. And in "diverse verifier tree search" (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before synthesizing them into a final answer.
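The sampling and search methods described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the `toy_policy` and `toy_prm` callables below are placeholder stand-ins for a real generator LLM and a real process reward model.

```python
import random

def best_of_n(policy, prm, question, n=8):
    """Sample N complete answers from the policy; return the one the PRM scores highest."""
    candidates = [policy(question) for _ in range(n)]
    return max(candidates, key=lambda ans: prm(question, ans))

def beam_search(policy, prm, question, beam_width=4, expand=4, max_steps=5):
    """Grow answers step by step, keeping only the top-scoring partial answers."""
    beams = [""]  # partial answers (concatenated steps)
    for _ in range(max_steps):
        # Expand each surviving beam with several candidate next steps
        expansions = [b + policy(question, prefix=b) for b in beams for _ in range(expand)]
        # Keep only the beam_width partial answers the PRM rates best
        beams = sorted(expansions, key=lambda b: prm(question, b), reverse=True)[:beam_width]
    return beams[0]

# Toy stand-ins: a "policy" that emits random digit steps and a "PRM" that rewards 7s
def toy_policy(question, prefix=""):
    return str(random.randint(0, 9))

def toy_prm(question, answer):
    return answer.count("7")

random.seed(0)
print(best_of_n(toy_policy, toy_prm, "toy question", n=16))
print(beam_search(toy_policy, toy_prm, "toy question"))
```

DVTS follows the same pattern as the beam-search loop, but partitions the expansions into independent subtrees so the candidate set stays diverse rather than collapsing onto one high-scoring branch.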
What is the best scaling strategy?
The choice of the right TTS strategy depends on several factors. The study's authors carried out a systematic investigation of how different policy models and PRMs affect the efficiency of TTS methods.
Their findings show that efficiency depends largely on the policy and PRM models. For example, for small policy models, search-based methods outperform best-of-N. For large policy models, however, best-of-N is more effective, because these models have better reasoning capabilities and don't need a reward model to verify every step of their reasoning.
Their findings also show that the right TTS strategy depends on the difficulty of the problem. For example, for small policy models with fewer than 7B parameters, best-of-N works better for easy problems, while beam search works better for harder problems. For policy models between 7B and 32B parameters, diverse verifier tree search performs well on easy and medium problems, and beam search is best suited to hard problems. For large policy models (72B parameters and more), best-of-N is the optimal method across all difficulty levels.
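As a rough illustration, the trends above could be encoded as a simple dispatch function. The size thresholds and strategy names are paraphrased from the reported results; the treatment of models between 32B and 72B as "large" is an assumption, not something the study specifies.

```python
def choose_tts_strategy(policy_params_b: float, difficulty: str) -> str:
    """Pick a TTS method from policy-model size (billions of parameters)
    and problem difficulty ('easy', 'medium' or 'hard'), following the
    study's reported trends (sketch, not the authors' code)."""
    if policy_params_b < 7:
        # Small policies: cheap sampling for easy problems, search for hard ones
        return "best-of-N" if difficulty == "easy" else "beam search"
    if policy_params_b <= 32:
        # Mid-size policies: DVTS for easy/medium, beam search for hard
        return "beam search" if difficulty == "hard" else "diverse verifier tree search"
    # Large policies (assumed >32B here; study reports 72B+): best-of-N everywhere
    return "best-of-N"

print(choose_tts_strategy(3, "hard"))   # beam search
print(choose_tts_strategy(32, "easy"))  # diverse verifier tree search
print(choose_tts_strategy(72, "hard"))  # best-of-N
```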
Why small models can beat large models

Based on these findings, developers can create compute-optimal TTS strategies that take into account the policy model, the PRM and the problem difficulty to make the best use of their compute budget when solving reasoning problems.
For example, the researchers found that a Llama-3.2-3B model with the compute-optimal TTS strategy outperforms Llama-3.1-405B on MATH-500 and AIME24, two complicated math benchmarks. This shows that an SLM can outperform a model 135X its size when using the compute-optimal TTS strategy.
In other experiments, they found that a Qwen2.5 model with 500 million parameters can outperform GPT-4o with the right compute-optimal TTS strategy. Using the same strategy, the 1.5B distilled version of DeepSeek-R1 outperformed o1-preview and o1-mini on MATH-500 and AIME24.
When accounting for both training and inference compute budgets, the findings show that with compute-optimal scaling strategies, SLMs can outperform larger models with 100-1000X fewer FLOPS.
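To see why sampling many answers from a small model can still be cheap, a back-of-the-envelope estimate helps. The sketch below uses the common rule of thumb of roughly 2 FLOPs per parameter per generated token; the sample counts and token lengths are illustrative assumptions, and the study's 100-1000X figure comes from its own specific configurations, including training compute.

```python
def inference_flops(params: float, tokens: float) -> float:
    # Rough rule of thumb: ~2 FLOPs per parameter per generated token
    return 2 * params * tokens

# 3B policy model sampling 64 candidate answers of 1,024 tokens each
slm = inference_flops(3e9, 64 * 1024)
# 405B model generating a single 1,024-token answer
llm = inference_flops(405e9, 1024)

print(f"{llm / slm:.1f}x")  # prints "2.1x"
```

Even with 64 full samples, the 3B model spends less inference compute than a single pass of the 405B model; the far larger savings the paper reports come from accounting for training FLOPs as well.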
The researchers' results show that compute-optimal TTS significantly enhances the reasoning capabilities of language models. However, as the policy model grows larger, the improvement from TTS gradually decreases.
"This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model," the researchers write. "Specifically, for models with weak reasoning abilities, scaling test-time compute leads to a substantial improvement, while for models with strong reasoning abilities, the gain is limited."
The study validates that SLMs can perform better than larger models when using compute-optimal scaling methods. While this study focuses on math benchmarks, the researchers plan to expand their work to other reasoning tasks such as coding and chemistry.

