Researchers at the University of Illinois Urbana-Champaign and Google Cloud AI Research have developed a framework that enables large language model (LLM) agents to organize their experiences into a memory bank and thus better master complex tasks over time.
The framework, called ReasoningBank, distills "generalizable reasoning strategies" from an agent's successful and failed attempts to solve problems. The agent then draws on this memory during reasoning to avoid repeating past mistakes and to make better decisions when facing new problems. The researchers show that when combined with test-time scaling techniques, where an agent makes multiple attempts at a problem, ReasoningBank significantly improves the performance and efficiency of LLM agents.
Their results show that ReasoningBank consistently outperforms classic memory mechanisms on web browsing and software engineering benchmarks, offering a practical path to building more adaptable and reliable AI agents for enterprise applications.
The challenge of LLM agent memory
As LLM agents are deployed in applications that run for extended periods, they face a continuous stream of tasks. One of the main limitations of current LLM agents is that they don't learn from this accumulated experience. By approaching each task in isolation, they inevitably repeat past mistakes, discard valuable insights from related problems, and fail to develop skills that would make them more capable over time.
The answer to this limitation is to give agents some form of memory. Previous efforts have focused on storing past interactions for reuse, organizing the information in forms ranging from plain text to structured graphs. However, these approaches often fall short. Many rely on raw interaction logs or save only successful task examples. This means they can't distill high-level, transferable reasoning patterns and, more importantly, can't extract useful lessons from the agent's mistakes. As the researchers note in their paper, "existing memory designs are often limited to passive record keeping rather than providing actionable, generalizable guidance for future decisions."
How ReasoningBank works
ReasoningBank is a memory framework that aims to overcome these limitations. Its central idea is to distill useful strategies and reasoning hints from past experiences into structured memory items that can be stored and reused.
According to Jun Yan, a research scientist at Google and co-author of the paper, this marks a fundamental shift in how agents work. "Traditional agents operate statically – each task is processed in isolation," Yan explained. "ReasoningBank changes this by converting each task experience (successful or failed) into a structured, reusable reasoning memory. This means the agent doesn't start from scratch with each customer, but rather retrieves and adapts proven strategies from similar cases in the past."
The framework takes both successful and failed experiences and transforms them into a collection of useful strategies and preventive lessons. The agent judges success and failure using an LLM-as-a-judge scheme, avoiding the need for human labeling.
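The self-labeling step can be pictured with a short sketch. Everything here is hypothetical (the paper's actual judge prompt and parsing logic are not given in this article); the point is simply that a second LLM call, rather than a human, decides whether a finished trajectory counts as a success or a failure.

```python
# Minimal LLM-as-a-judge sketch. `build_judge_prompt` and `parse_verdict`
# are illustrative names, not the paper's implementation; a real system
# would send the prompt to an LLM and parse its reply.

def build_judge_prompt(task: str, trajectory: list[str]) -> str:
    """Format a finished trajectory so an LLM can label it, replacing
    human annotation of success/failure."""
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(trajectory))
    return (
        f"Task: {task}\n"
        f"Agent trajectory:\n{steps}\n"
        "Did the agent complete the task? Answer SUCCESS or FAILURE."
    )

def parse_verdict(llm_output: str) -> bool:
    """Map the judge's reply to a boolean success label.
    Checks only the leading token to avoid false positives."""
    return llm_output.strip().upper().startswith("SUCCESS")
```

Both successes (`True`) and failures (`False`) are kept, since ReasoningBank distills lessons from each.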
Yan offers a practical example of this process in action. An agent tasked with finding Sony headphones might fail because its broad search query returns over 4,000 irrelevant products. "ReasoningBank will first try to understand why this approach failed," Yan said. "Then strategies like 'refine search queries' and 'narrow products through category filtering' are distilled. These strategies will be extremely useful for successfully completing similar tasks in the future."
The process operates in a closed loop. When an agent faces a new task, it uses embedding-based search to retrieve relevant memories from ReasoningBank to guide its actions. These memories are inserted into the agent's system prompt, providing context for its decision making. Once the task is completed, the framework creates new memory items to capture insights from both successes and failures. This new knowledge is then analyzed, distilled, and merged into ReasoningBank, allowing the agent to continuously evolve and improve its skills.
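The retrieve-then-distill loop can be sketched in a few lines. This is a toy illustration under stated assumptions: the bag-of-words embedding, the memory schema (title plus strategy text), and all names are stand-ins for whatever the paper actually uses, which this article does not specify.

```python
# Toy sketch of the ReasoningBank loop: embedding-based retrieval of
# memory items before a task, plus appending newly distilled items after.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Stand-in bag-of-words embedding; a real system would use a
    learned text-embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ReasoningBank:
    def __init__(self):
        self.memories = []  # (title, strategy) pairs distilled from past tasks

    def add(self, title: str, strategy: str):
        """Store an insight distilled from a success or a failure."""
        self.memories.append((title, strategy))

    def retrieve(self, task: str, k: int = 2):
        """Embedding-based search: return the k memory items most
        similar to the new task, to be injected into the system prompt."""
        q = embed(task)
        ranked = sorted(self.memories,
                        key=lambda m: cosine(embed(m[1]), q),
                        reverse=True)
        return ranked[:k]

bank = ReasoningBank()
bank.add("Narrow queries", "refine search queries and filter products by category")
bank.add("Check login", "verify authentication state before submitting forms")
hits = bank.retrieve("search for sony headphones with category filter", k=1)
```

After the task finishes, the judged trajectory would be distilled into new `(title, strategy)` items and appended via `add`, closing the loop.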
Combining memory with test-time scaling
The researchers found a strong synergy between memory and test-time scaling. Classic test-time scaling involves generating multiple independent answers to the same query. However, the researchers argue that this "vanilla form is suboptimal because it does not leverage the inherent contrastive signal that arises from redundant examination of the same problem."
To address this, they propose Memory-aware Test-Time Scaling (MaTTS), which integrates scaling with ReasoningBank. MaTTS comes in two forms. With parallel scaling, the system generates multiple trajectories for the same query, then compares and contrasts them to identify consistent reasoning patterns. With sequential scaling, the agent iteratively refines its reasoning within a single attempt, with the intermediate notes and corrections also serving as useful memory signals.
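The parallel variant's self-contrast step can be sketched as follows. This is a simplified illustration, not the paper's method: trajectory generation is stubbed out with fixed lists, and "consistency" is reduced to counting steps that recur across rollouts.

```python
# Hedged sketch of parallel MaTTS: sample several trajectories for one
# query, then keep the reasoning steps that recur across independent
# rollouts as "consistent patterns" worth distilling into memory.
from collections import Counter

def consistent_patterns(trajectories: list[list[str]],
                        min_count: int = 2) -> list[str]:
    """Self-contrast across rollouts: a step seen in at least
    `min_count` independent trajectories is treated as reliable."""
    counts = Counter(step for traj in trajectories for step in set(traj))
    return [step for step, c in counts.items() if c >= min_count]

# Stub rollouts; a real agent would sample these from an LLM.
rollouts = [
    ["open site", "refine query", "apply category filter"],
    ["open site", "broad search", "refine query"],
    ["open site", "refine query", "apply category filter", "sort by price"],
]
stable = consistent_patterns(rollouts)
```

Steps unique to a single rollout ("broad search", "sort by price") are discarded as noise, while the recurring ones become candidate memory items.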
This creates a virtuous cycle: the existing memory in ReasoningBank steers the agent toward more promising solutions, while the diverse experiences generated by scaling let the agent form higher-quality memories that are stored back in ReasoningBank.
"This positive feedback loop positions memory-driven experience scaling as a new scaling dimension for agents," the researchers write.
ReasoningBank in action
The researchers tested their framework on web browsing (WebArena) and software engineering (SWE-Bench) benchmarks, using models such as Google's Gemini 2.5 Pro and Anthropic's Claude 3.7 Sonnet. They compared ReasoningBank to baselines including memory-free agents and agents that use trajectory-based or workflow-based memory frameworks.
The results show that ReasoningBank consistently outperforms these baselines across datasets and LLM backbones. On WebArena, it improved the overall success rate by up to 8.3 percentage points compared to a memory-free agent. It also generalizes better to harder, cross-domain tasks while reducing the number of interaction steps needed to complete tasks. When combined with MaTTS, both parallel and sequential scaling further boosted performance, consistently outperforming standard test-time scaling.
This efficiency gain has a direct impact on operating costs. Yan points to a case where a memory-free agent took eight trial-and-error steps to find the right product filter on a website. "These trial-and-error costs can be avoided by leveraging relevant insights from ReasoningBank," he noted. "In this case, we save almost double the operating costs," which also improves the user experience by resolving issues more quickly.
For businesses, ReasoningBank can help build cost-effective agents that learn from experience and adapt over time in complex workflows and domains such as software development, customer service, and data analysis. The paper concludes: "Our results suggest a practical path toward building adaptive and lifelong-learning agents."
Yan noted that these findings point toward true compositional intelligence. For example, a coding agent could learn individual skills such as API integration and database management through separate tasks. "Over time, these modular capabilities become building blocks that the agent can flexibly recombine to solve more complex tasks," he said, suggesting a future in which agents can autonomously assemble their knowledge to manage entire workflows with minimal human oversight.

