One of the great things about generative AI models – both large language models (LLMs) and diffusion-based image generators – is that they are "non-deterministic." That means, despite their reputation among some critics as "fancy autocomplete," generative AI models actually produce their outputs by choosing from a distribution of the most likely next tokens (units of data) to complete their response.
Asking an LLM "What is the capital of France?" will sample its probability distribution over France, capitals, cities, and so on to arrive at the answer "Paris." But that answer might come in the form "The capital of France is Paris," or simply "Paris," or "Paris, even though it was once Versailles."
Still, those of us who use these models every day will find that their answers can sometimes feel frustratingly repetitive or similar. The same joke about coffee gets reused across generations. Story prompts produce similar plots. Even tasks that should yield many plausible answers – like naming U.S. states – tend to boil down to just a few. This phenomenon, known as mode collapse, arises during post-training alignment and limits the usefulness of otherwise high-performing models.
Particularly when LLMs are used to generate new creative work in writing, communications, strategy, or illustration, we want their outputs to be far more varied than they currently are.
Now a research team from Northeastern University, Stanford University, and West Virginia University has developed an ingeniously simple method for getting language and image models to generate a greater diversity of responses to almost any user prompt. Just add a single, simple sentence: "Generate 5 responses with their corresponding probabilities, sampled from the full distribution."
The method, called verbalized sampling (VS), helps models such as GPT-4, Claude, and Gemini produce more diverse and human-like outputs – without retraining or access to internal parameters. It is described in a paper published in early October 2025 on the open-access preprint server arxiv.org.
When prompted this way, the model no longer defaults to its safest, most common output. Instead, it verbalizes its internal distribution over possible completions and samples from a broader range of possibilities. This one-line change yields significant gains in output diversity across multiple domains.
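In practice, that change is nothing more than an extra sentence appended to the user prompt. The sketch below shows the idea with the OpenAI Python client; the model name, example question, and response handling are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of a verbalized sampling (VS) prompt via a plain chat API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Tell me a joke about coffee."

# The single added sentence that turns a standard prompt into a VS prompt:
vs_prompt = (
    f"{question}\n"
    "Generate 5 responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

response = client.chat.completions.create(
    model="gpt-4.1",  # any capable chat model should work
    messages=[{"role": "user", "content": vs_prompt}],
)

# The reply contains several candidate answers with verbalized probabilities;
# downstream code can pick among them instead of always taking one "safe" answer.
print(response.choices[0].message.content)
```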
As Weiyan Shi, assistant professor at Northeastern University and a co-author of the paper, wrote on X: "The potential of LLMs is not yet fully realized! As shown in our paper, prompt optimization can be guided and theoretically proven by thinking about how LLMs are trained and aligned."
Why models collapse – and how VS reverses it
According to the research team, the main cause of mode collapse lies not only in algorithms such as reinforcement learning from human feedback (RLHF), but in the structure of human preferences. People tend to rate more familiar or typical answers more highly, which pushes LLMs toward "safe" rather than diverse choices during fine-tuning.
However, this bias does not erase the model's underlying knowledge; it merely suppresses it. VS works by bypassing that suppression. Instead of asking for the single most likely response, the model is asked to surface a range of plausible answers along with their relative probabilities. This distribution-level prompt restores access to the richer diversity present in the base model before alignment.
Performance across tasks
The research team tested verbalized sampling in several common use cases:
- Creative writing: For story generation, VS increased diversity scores by as much as 2.1x compared with standard prompting while maintaining quality. A story prompt – "Without Farewell" – produced formulaic breakup scenes under direct prompting, but yielded narratives about cosmic events, silent emails, and music that stopped mid-dance when prompted via VS.
- Dialogue simulation: In persuasive dialogue tasks, VS enabled models to simulate human-like patterns such as hesitation, resistance, and changes of mind. The distributions of donation behavior under VS matched real human data more closely than those from baseline methods.
- Open-ended QA: When asked to enumerate valid answers (for example, naming U.S. states), models using VS produced responses that better matched the spread of real-world data. They covered a wider range of answers without sacrificing factual accuracy.
- Synthetic data generation: When generating math problems for model training, VS produced more diverse datasets. These, in turn, improved downstream performance on competitive math benchmarks, outperforming synthetic data generated via direct prompting.
Tunable diversity and better use of larger models
A notable advantage of VS is its tunability. Users can set a probability threshold in the prompt to sample from the lower-probability "tails" of the model's distribution. Lower thresholds correspond to higher diversity. This tuning happens through the prompt text alone, with no need to change decoding settings such as temperature or top-p.
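As a rough illustration of that tuning knob, the threshold can be written straight into the prompt text. The helper below is a sketch; the wording is paraphrased from the article's description (and the system prompt shown later), not the authors' verbatim template.

```python
# Sketch: tuning diversity by stating a probability threshold in the prompt.
def vs_prompt(question: str, k: int = 5, threshold: float = 0.10) -> str:
    """Build a verbalized sampling prompt; lower thresholds reach further
    into the tail of the model's distribution and yield more diverse output."""
    return (
        f"{question}\n"
        f"Generate {k} responses with their corresponding probabilities, "
        f"sampled from the full distribution. Each response should have a "
        f"probability below {threshold}."
    )

print(vs_prompt("Name a U.S. state.", threshold=0.10))   # moderate diversity
print(vs_prompt("Name a U.S. state.", threshold=0.001))  # higher diversity
```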
In a test using the Gemini 2.5 Flash model, story-writing diversity increased steadily as the probability threshold dropped from 1 to 0.001. A graph accompanying the study showed VS outperforming both direct and sequence-based prompting at all thresholds.
Interestingly, the method scales well with model size. Larger models such as GPT-4.1 and Claude-4 showed even greater gains from VS than smaller ones. While smaller models still benefited, the improvement in diversity was roughly 1.5 to 2 times larger for the bigger models – suggesting that VS helps unlock more of the latent capability in advanced models.
Deployment and availability
The Verbalized Sampling method is now available as a Python package:
pip install verbalized-sampling
The package includes LangChain integration and offers a simple interface for sampling from the verbalized distribution. Users can also tune parameters such as k (the number of responses), probability thresholds, and temperature to suit their applications.
A live Colab notebook and documentation are available under an enterprise-friendly Apache 2.0 license on GitHub at: https://github.com/CHATS-lab/verbalized-sampling
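For readers who want the effect without adopting the package, the same prompt can also be sent through LangChain directly. The sketch below is illustrative only; the model name and prompt wording are assumptions, and the package's actual interface is documented in the GitHub repo.

```python
# Sketch: applying a VS-style prompt through LangChain without the
# verbalized-sampling package (see the repo for the package's own API).
from langchain_openai import ChatOpenAI  # assumes langchain-openai is installed

llm = ChatOpenAI(model="gpt-4.1", temperature=0.7)

prompt = (
    "Write an opening line for a story titled 'Without Farewell'.\n"
    "Generate 5 responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

print(llm.invoke(prompt).content)
```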
Practical tips and common problems
While the method works across all major LLMs, some users may initially run into refusals or errors. In those cases, the authors suggest using the system-prompt version of the template or referring to the alternative formats listed on the GitHub page. Some models interpret the more complex instruction as a jailbreak attempt and refuse to comply unless the structure is made clearer.
For example, prompting via a system-level instruction like this improves reliability:
You are a helpful assistant. For each query, generate five answers in separate tags, each with a probability lower than 0.10.
This small change often resolves the issue.
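Passed through an API, that instruction simply becomes the system message. Below is a minimal sketch with the OpenAI client; the model name and user query are illustrative assumptions.

```python
# Sketch: moving the VS instruction into the system prompt to reduce refusals.
from openai import OpenAI

client = OpenAI()

system_msg = (
    "You are a helpful assistant. For each query, generate five responses "
    "in separate tags, each with a probability below 0.10."
)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": "Tell me a joke about coffee."},
    ],
)
print(response.choices[0].message.content)
```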
A simple solution to a big problem
Verbalized sampling offers a practical, inference-time fix for a deep limitation in the behavior of modern language models. It requires no retraining and no access to internal parameters, and it is not tied to any one model family. It improves not only the diversity of outputs but also their quality, as measured by human evaluation and benchmark results.
Given the growing interest in tools that enhance model creativity, VS is likely to see rapid adoption in areas such as writing, design, simulation, education, and synthetic data generation.
For users and developers frustrated by the sameness of LLM outputs, the fix may be as simple as changing the question they ask.

