Chain-of-thought (CoT) reasoning, the process by which models break problems down into manageable “thoughts” before producing answers, has become an integral part of the latest generation of frontier large language models (LLMs).
However, the inference costs of reasoning models can quickly pile up as models generate excess CoT tokens. In a new paper, researchers at Carnegie Mellon University propose an LLM training technique that gives developers more control over the length of the CoT.
The technique, called Length Controlled Policy Optimization (LCPO), conditions the model to give correct answers while keeping its “thoughts” within a given token budget. Experiments show that models trained with LCPO offer a smooth trade-off between accuracy and cost, and can surprisingly outperform larger models at equal reasoning lengths. LCPO can help dramatically reduce the cost of inference in enterprise applications by saving thousands of tokens in each round of conversation with an LLM.
LLMs perform better with longer reasoning
Reasoning models such as OpenAI o1 and DeepSeek-R1 are trained through reinforcement learning (RL) to use test-time scaling and generate CoT traces before producing an answer. Empirical evidence shows that when models “think” longer, they tend to perform better on reasoning tasks.
For example, R1 was initially trained with pure RL, without human-labeled examples. One of the findings was that as the model's performance improved, it also learned to generate longer CoT traces.
While long CoT chains generally lead to more accurate answers, they also create a compute bottleneck when reasoning models are deployed at scale. There is currently very little control over the test-time compute budget, and sequences can easily stretch to tens of thousands of tokens without delivering significant gains. There have been some efforts to control the length of reasoning chains, but they usually degrade the model's performance.
Length Controlled Policy Optimization (LCPO), explained
Classic RL trains LLMs only to reach the correct answer. LCPO changes this paradigm by introducing two training objectives: 1) get the correct result and 2) keep the CoT chain within a specified token length. If the model produces the correct answer but generates too many CoT tokens, it receives a penalty and is forced to find a reasoning chain that reaches the same answer with a smaller token budget.
“LCPO-trained models learn to satisfy length constraints while optimizing reasoning performance, rather than relying on hand-engineered heuristics,” the researchers write.
They propose two flavors of LCPO: (1) LCPO-Exact, which requires the generated reasoning to match the target length exactly, and (2) LCPO-Max, which requires the output to be no longer than the target length.
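As a rough illustration of the idea (not the paper's exact reward formulation), a length-penalized reward could look like the sketch below, where the penalty weight alpha is an assumed hyperparameter and a correctness indicator is combined with a term that punishes deviation from, or overshoot of, the target budget:

```python
# Illustrative sketch of length-penalized rewards in the spirit of LCPO.
# The actual reward terms and hyperparameters are defined in the paper
# and may differ from what is shown here.

def lcpo_exact_reward(is_correct: bool, n_generated: int, n_target: int,
                      alpha: float = 0.001) -> float:
    """Correctness reward minus a penalty proportional to how far the
    generated CoT length strays from the target length (LCPO-Exact-style)."""
    correctness = 1.0 if is_correct else 0.0
    return correctness - alpha * abs(n_target - n_generated)


def lcpo_max_reward(is_correct: bool, n_generated: int, n_target: int,
                    alpha: float = 0.001) -> float:
    """Correctness reward minus a penalty that applies only when the
    generated CoT exceeds the target budget (LCPO-Max-style)."""
    correctness = 1.0 if is_correct else 0.0
    overshoot = max(0, n_generated - n_target)
    return correctness - alpha * overshoot
```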
To test the technique, the researchers fine-tuned a 1.5B-parameter reasoning model (Qwen-Distilled-R1-1.5B) with the two proposed LCPO schemes to create the L1-Max and L1-Exact models. Training was based on mathematical problems with distinct and verifiable results, but the evaluation included both math problems and out-of-distribution tasks such as the Massive Multitask Language Understanding (MMLU) benchmark and the graduate-level Google-Proof Q&A benchmark (GPQA).
Their results show that L1 models can balance token budget against reasoning performance, smoothly interpolating between short, efficient reasoning and longer, more accurate reasoning when prompted with different length constraints. On some tasks, the L1 models can match the performance of the original reasoning model at a lower token budget.
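In practice, the length constraint is conveyed to the model at inference time through the prompt. The sketch below shows what such a call could look like with the Hugging Face transformers library; the model name and the exact wording of the length instruction are placeholders, and the real prompt template should be taken from the researchers' released code and model cards:

```python
# Hedged usage sketch: querying an LCPO-trained model with an explicit
# token budget stated in the prompt. Model id and instruction wording
# are placeholders, not the actual release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/l1-style-model"  # placeholder, not the real repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

problem = "What is the sum of the first 50 positive integers?"
token_budget = 512
# Assumed instruction format for conditioning on length:
prompt = f"{problem}\nThink for {token_budget} tokens."

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=token_budget + 128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```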
Compared to S1, the only other method that constrains the length of CoT, L1 models show up to 150% performance gains across different token budgets.
“This significant difference can be attributed to two key factors,” the researchers write. “(1) L1 intelligently adapts its CoT to fit within specified length constraints without disrupting the reasoning process, while S1 often truncates mid-reasoning; and (2) L1 is explicitly trained to generate high-quality reasoning chains of varying lengths, effectively distilling reasoning patterns from longer chains to shorter ones.”
L1 also outperforms its non-reasoning counterpart by 5% and GPT-4o by 2% at the same generation length. “To the best of our knowledge, this is the first demonstration that a 1.5B model can outperform frontier models such as GPT-4o, despite using the same generation length,” the researchers write.
Interestingly, the model's CoT shows that it learns to adjust its reasoning process to its token budget. For example, on longer budgets, the model is more likely to generate tokens associated with self-correction and verification (i.e., “but” and “wait”) and with drawing conclusions (“therefore” and “so”).
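This kind of behavioral analysis can be approximated with a simple keyword count over generated CoT traces. The sketch below is illustrative only; the keyword lists and example traces are assumptions, not the paper's exact methodology:

```python
# Illustrative sketch: count reasoning-related keywords in CoT traces
# generated under different token budgets.
from collections import Counter
import re

SELF_CHECK = {"but", "wait"}
CONCLUDE = {"therefore", "so"}

def keyword_counts(cot_text: str) -> Counter:
    """Tally self-correction and conclusion keywords in one CoT trace."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", cot_text.lower()):
        if word in SELF_CHECK:
            counts["self_check"] += 1
        elif word in CONCLUDE:
            counts["conclude"] += 1
    return counts

# traces_by_budget maps a token budget to a list of generated CoT strings
traces_by_budget = {512: ["..."], 4096: ["..."]}
for budget, traces in traces_by_budget.items():
    total = sum((keyword_counts(t) for t in traces), Counter())
    print(budget, dict(total))
```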

Beyond improved length control in the standard math reasoning setting, the L1 models generalize surprisingly well to out-of-distribution tasks, including GPQA and MMLU.
This new line of research into models that can adjust their reasoning budget could have important uses for real-world applications, giving enterprises the ability to scale reasoning models without runaway expenses. It is a powerful alternative to simply deploying larger, more expensive models, and could be a critical factor in making AI economically viable for high-volume, real-world applications.
The researchers have open-sourced the code for LCPO and the weights for the L1 models.