Google researchers have developed a technique called Infini-attention that enables LLMs to process infinitely long texts without increasing memory and compute requirements.
The Transformer architecture of an LLM allows it to attend to every token in a prompt. The dot products and matrix multiplications it performs to do so scale quadratically with sequence length.
This means that doubling the number of tokens in your prompt requires roughly four times as much memory and processing power. This is why it is so difficult to build LLMs with large context windows without skyrocketing memory and compute requirements.
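To make the scaling concrete, the toy snippet below (illustrative only, not how a production LLM computes attention) builds the attention score matrix for 1,024 and 2,048 tokens and shows that its size quadruples when the sequence length doubles.

```python
import numpy as np

def attention_score_matrix(n, d=64):
    # The score matrix Q @ K^T has shape (n, n), so its size -- and the work
    # needed to compute it -- grows with the square of the token count n.
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((n, d))
    K = rng.standard_normal((n, d))
    return Q @ K.T / np.sqrt(d)

print(attention_score_matrix(1024).size)  # 1,048,576 entries
print(attention_score_matrix(2048).size)  # 4,194,304 entries: 4x, not 2x
```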
In a “standard” LLM, information at the beginning of the prompt is lost as soon as the prompt grows larger than the context window. Google's research paper explains how Infini-attention can retain information beyond the context window.
Google presents Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
The 1B model, optimized for passkey instances with a sequence length of up to 5K, solves the 1M length problem. https://t.co/zyHMt3inhi
— Aran Komatsuzaki (@arankomatsuzaki) April 11, 2024
How does Infini-attention work?
Infini-attention combines compressive memory techniques with a modified attention mechanism so that relevant older information is not lost.
Once the prompt extends beyond the model's context length, compressive memory stores information in a compressed format instead of discarding it.
This allows older, less immediately relevant information to be retained without storage and compute requirements growing indefinitely as the input grows.
Instead of attempting to retain all older input information, Infini-attention's compressive memory weights and summarizes the information deemed relevant and worth keeping.
Infini-attention then uses a “vanilla” attention mechanism, but reuses the key-value (KV) states from each segment when processing subsequent segments rather than discarding them.
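To make this more tangible, here is a minimal NumPy sketch of a compressive memory of this kind, loosely following the associative-memory formulation described in the paper (an ELU+1 feature map, a memory matrix M and a normalization vector z); the function names and toy usage are assumptions for illustration, not Google's implementation.

```python
import numpy as np

def elu_plus_one(x):
    # Feature map applied to queries and keys before memory reads and writes.
    return np.where(x > 0, x + 1.0, np.exp(x))

def memory_update(M, z, K, V):
    # Fold a segment's key-value states into the compressed memory:
    # M accumulates sigma(K)^T V, z accumulates the column sums of sigma(K).
    sK = elu_plus_one(K)                          # (segment_len, d_key)
    return M + sK.T @ V, z + sK.sum(axis=0)

def memory_retrieve(M, z, Q, eps=1e-6):
    # Read previously stored content back out for the current queries.
    sQ = elu_plus_one(Q)                          # (segment_len, d_key)
    return (sQ @ M) / ((sQ @ z)[:, None] + eps)   # (segment_len, d_value)

# Toy usage: stream two 4-token segments with key/value dimension 8.
rng = np.random.default_rng(0)
d = 8
M, z = np.zeros((d, d)), np.zeros(d)
for _ in range(2):
    Q, K, V = (rng.standard_normal((4, d)) for _ in range(3))
    A_mem = memory_retrieve(M, z, Q)   # attention over the compressed history
    M, z = memory_update(M, z, K, V)   # then write this segment into memory
```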
Here is a diagram showing the difference between Infini-attention and another extended-context model, Transformer-XL.
The result is an LLM that attends to the current input locally, but also carries continuously distilled, compressed historical information on which it can focus long-term attention.
The paper states: “This subtle but crucial modification of the attention layer enables LLMs to process infinitely long contexts with limited memory and computational resources.”
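Concretely, the modification the paper refers to blends the read-out from compressive memory with ordinary local dot-product attention through a learned gate. The sketch below shows only that blending step; beta stands in for the learned per-head gating parameter and is passed in here rather than learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine_attention(A_mem, A_local, beta):
    # sigmoid(beta) decides how much weight long-term memory content gets
    # relative to local dot-product attention over the current segment.
    g = sigmoid(beta)
    return g * A_mem + (1.0 - g) * A_local
```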
How good is it?
Google conducted benchmark tests with smaller Infini-attention models of 1B and 8B parameters. These were compared with other extended-context models such as Transformer-XL and Memorizing Transformers.
The Infini-Transformer achieved significantly lower perplexity scores than the other models when processing long-context content. A lower perplexity value means the model is more confident in its output predictions.
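As a reminder of what that metric measures: perplexity is the exponential of the average negative log-likelihood the model assigns to the actual next tokens, as in this illustrative snippet.

```python
import numpy as np

def perplexity(token_log_probs):
    # Exponential of the average negative log-likelihood; lower is better.
    return float(np.exp(-np.mean(token_log_probs)))

# A model that gives the true next tokens probabilities 0.5, 0.25 and 0.5:
print(perplexity(np.log([0.5, 0.25, 0.5])))  # ~2.52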
In the “passkey retrieval” tests, the Infini-attention models consistently found the random number hidden in text of up to 1 million tokens.
Other models often manage to find the passkey when it is near the end of the input, but struggle to find it in the middle or at the beginning of a long piece of content. Infini-attention had no problems with this test.
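For readers unfamiliar with the benchmark, a passkey-retrieval prompt looks roughly like the sketch below: a random number is buried at an arbitrary position in a long run of filler text and the model is asked to repeat it. The exact wording is an assumption for illustration, not the paper's prompt.

```python
import random

def build_passkey_prompt(n_filler_lines=10_000, seed=0):
    # Bury a random passkey somewhere inside many lines of filler text.
    rng = random.Random(seed)
    passkey = rng.randint(10_000, 99_999)
    filler = "The grass is green. The sky is blue. The sun is yellow."
    lines = [filler] * n_filler_lines
    lines.insert(rng.randrange(n_filler_lines), f"The pass key is {passkey}. Remember it.")
    lines.append("What is the pass key?")
    return "\n".join(lines), passkey
```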
The benchmark tests are quite technical, but the short version is that Infini-attention outperformed the baseline models at summarizing and handling long sequences while maintaining context over longer stretches of text.
Significantly, this superior retention was achieved while requiring 114 times less memory.
The benchmark results convince the researchers that Infini-attention could be scaled to handle extremely long input sequences while keeping memory and compute resources bounded.
The plug-and-play nature of Infini-attention means it can be used for continual pre-training and fine-tuning of existing Transformer models. This could effectively expand their context windows without requiring a complete retraining of the model.
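To illustrate what “plug-and-play” could mean in practice, the sketch below streams a long input through a hypothetical Infini-attention layer one segment at a time, carrying the compressed memory state forward; attend_segment and its signature are assumed for illustration and do not correspond to any real API.

```python
import numpy as np

def process_long_input(segments, attend_segment, d_key, d_value):
    # attend_segment(segment, M, z) -> (output, M, z) stands in for one forward
    # pass through an Infini-attention layer (an assumed interface). The memory
    # state (M, z) stays a fixed size no matter how many segments are streamed.
    M = np.zeros((d_key, d_value))
    z = np.zeros(d_key)
    outputs = []
    for segment in segments:
        out, M, z = attend_segment(segment, M, z)
        outputs.append(out)
    return outputs
```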
Context windows will continue to grow, but this approach shows that an efficient memory could be a better solution than an ever-bigger library.