Retrieval-Augmented Generation (RAG) is the de facto method for adapting large language models (LLMs) to custom information. However, RAG has upfront technical costs and can be slow. Thanks to advances in long-context LLMs, companies can now bypass RAG by including all of their proprietary information in the prompt.
A recent study from National Chengchi University in Taiwan shows that by using long-context LLMs and caching techniques, you can build custom applications that outperform RAG pipelines. This approach, called Cache-Augmented Generation (CAG), can be a simple and efficient alternative to RAG in enterprise settings where the knowledge corpus fits within the model's context window.
Limitations of RAG
RAG is an effective method for handling open-ended questions and specialized tasks. It uses retrieval algorithms to gather documents relevant to the query and adds them as context to enable the LLM to compose more accurate responses.
However, RAG introduces several limitations for LLM applications. The added retrieval step introduces latency that can degrade the user experience. The result also depends on the quality of the document selection and ranking step. In many cases, the limitations of the models used for retrieval require documents to be broken into smaller chunks, which can harm the retrieval process.
More generally, RAG increases the complexity of the LLM application and requires the development, integration, and maintenance of additional components. This added overhead slows down the development process.
Cache-augmented generation
The alternative to developing a RAG pipeline is to put the entire document corpus into the prompt and let the model pick out the parts relevant to the request. This approach eliminates the complexity of the RAG pipeline and the problems caused by retrieval errors.
However, there are three key challenges to preloading all documents into the prompt. First, long prompts slow down the model and increase inference costs. Second, the length of the LLM's context window limits the number of documents that can fit in the prompt. Finally, adding irrelevant information to the prompt can confuse the model and reduce the quality of its responses. So simply cramming all of your documents into the prompt instead of selecting the most relevant ones can ultimately degrade the model's performance.
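As a rough sketch of this idea, assuming an OpenAI-compatible client and a hypothetical `docs/` folder of text files (none of this comes from the paper's code), preloading looks like a single prompt that carries the whole corpus:

```python
# Minimal sketch: load every knowledge document into one prompt and let the
# model decide what is relevant. File layout and model name are assumptions.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Concatenate the entire corpus into one context block.
corpus = "\n\n".join(p.read_text() for p in Path("docs").glob("*.txt"))

def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Answer using only these documents:\n\n{corpus}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```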
The proposed CAG approach leverages three key trends to address these challenges.
First, advanced caching techniques make processing prompt templates faster and cheaper. The premise of CAG is that the knowledge documents are included in every prompt sent to the model. Therefore, you can compute the attention values of their tokens in advance instead of doing so when requests arrive. This upfront computation reduces the time needed to process user requests.
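A minimal sketch of this precomputation, following Hugging Face transformers' documented prompt-reuse pattern (the model name, document text, and question below are placeholders, not the paper's code):

```python
# Precompute the KV cache for the static knowledge documents once, then reuse
# it for every query so the documents are not re-processed per request.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The knowledge documents are the part of the prompt shared by all requests.
knowledge = "<concatenation of all knowledge documents>\n\n"
knowledge_inputs = tokenizer(knowledge, return_tensors="pt").to(model.device)

# One forward pass over the documents stores their attention keys/values.
knowledge_cache = DynamicCache()
with torch.no_grad():
    knowledge_cache = model(
        **knowledge_inputs, past_key_values=knowledge_cache
    ).past_key_values

def answer(question: str) -> str:
    # generate() mutates the cache in place, so reuse a fresh copy per query.
    cache = copy.deepcopy(knowledge_cache)
    inputs = tokenizer(knowledge + question, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, past_key_values=cache, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```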
Leading LLM providers such as OpenAI, Anthropic, and Google offer prompt caching for the repetitive parts of your prompt, including the knowledge documents and instructions that you include at the beginning of the prompt. With Anthropic, you can reduce the cost of the cached portions of your prompt by up to 90% and latency by up to 85%. Equivalent caching capabilities have been developed for open-source LLM hosting platforms.
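With Anthropic's API, for example, a `cache_control` marker on a system block asks the service to cache that block across requests. The sketch below uses the Anthropic Python SDK; the document text, model string, and question are placeholder assumptions:

```python
# Sketch of provider-side prompt caching: the large document block is marked
# for caching so subsequent requests reuse it instead of re-processing it.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set
documents = "<full knowledge corpus>"  # placeholder

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "Answer questions using the documents below."},
        {
            "type": "text",
            "text": documents,
            "cache_control": {"type": "ephemeral"},  # cache this block
        },
    ],
    messages=[{"role": "user", "content": "What is our refund policy?"}],
)
print(message.content[0].text)
```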
Second, long-context LLMs make it possible to fit more documents and knowledge into prompts. Claude 3.5 Sonnet supports up to 200,000 tokens, GPT-4o supports 128,000 tokens, and Gemini supports up to 2 million tokens. This makes it possible to include multiple documents or entire books in the prompt.
Finally, advanced training methods enable models to better recall, reason, and answer questions over very long sequences. In the past year, researchers have developed several LLM benchmarks for long-sequence tasks, including BABILong, LongICLBench, and RULER. These benchmarks test LLMs on hard problems such as multiple retrieval and multi-hop question answering. There is still room for improvement in this area, but AI labs continue to make progress.
As newer generations of models continue to expand their context windows, they will be able to process larger knowledge collections. In addition, we can expect models' ability to extract and use relevant information from long contexts to keep improving.
“These two trends will significantly expand the usability of our approach and enable it to handle more complex and diverse applications,” the researchers write. “Therefore, our methodology is well-positioned to become a robust and versatile solution for knowledge-intensive tasks and leverage the growing capabilities of next-generation LLMs.”
RAG vs CAG
To compare RAG and CAG, the researchers ran experiments on two widely used question-answering benchmarks: SQuAD, which focuses on context-aware Q&A over single documents, and HotPotQA, which requires multi-hop reasoning across multiple documents.
They used a Llama 3.1 8B model with a 128,000-token context window. For RAG, they combined the LLM with two retrieval systems to obtain passages relevant to the question: the basic BM25 algorithm and OpenAI embeddings. For CAG, they inserted multiple documents from the benchmark into the prompt and let the model determine which passages to use to answer the question. Their experiments show that CAG outperformed both RAG systems in most situations.
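For context, a BM25 retrieval baseline of the kind used in the comparison can be put together in a few lines. This sketch uses the `rank_bm25` package with toy passages and naive whitespace tokenization, which are assumptions for illustration rather than the paper's setup:

```python
# Toy BM25 retriever: rank passages against a query, then the top hits would
# be placed in the prompt for the LLM (the RAG side of the comparison).
from rank_bm25 import BM25Okapi

passages = [
    "Cache-augmented generation preloads documents into the prompt.",
    "Retrieval-augmented generation fetches passages at query time.",
    "HotPotQA requires multi-hop reasoning across documents.",
]
bm25 = BM25Okapi([p.lower().split() for p in passages])

query = "How does CAG differ from RAG?"
top_passages = bm25.get_top_n(query.lower().split(), passages, n=2)
print(top_passages)  # passages that would be added to the prompt as context
```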
“By preloading the entire context from the test set, our system eliminates retrieval errors and ensures holistic reasoning over all relevant information,” the researchers write. “This advantage is particularly evident in scenarios where RAG systems may retrieve incomplete or irrelevant passages, leading to suboptimal response generation.”
CAG also significantly reduces the time needed to generate the answer, especially as the length of the reference text increases.
However, CAG is not a panacea and should be used with caution. It is well suited to settings where the knowledge base does not change often and is small enough to fit within the model's context window. Companies should also watch out for cases where their documents contain conflicting facts depending on their context, which could bias the model's conclusions.
The best way to determine whether CAG is right for your use case is to run a few experiments. Fortunately, implementing CAG is very easy and should always be considered as a first step before investing in more development-intensive RAG solutions.