
Meta proposes new scalable memory layers that improve knowledge and reduce hallucinations

As companies continue to deploy large language models (LLMs) in various applications, one of their biggest challenges is improving the models' factual knowledge and reducing hallucinations. In a new paper, researchers at Meta AI propose "scalable memory layers," which could be one of several possible solutions to this problem.

Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute resources. The architecture is useful for applications where you want to set aside extra capacity for factual knowledge but also want the inference speed of nimbler models.

Dense layers and memory layers

Traditional language models use "dense layers" to encode large amounts of information in their parameters. In dense layers, all parameters are used at full capacity and are mostly activated at the same time during inference. Dense layers can learn complex functions, but growing them requires additional compute and energy resources.

In contrast, for simple factual knowledge, much simpler layers with associative memory architectures can be more efficient and interpretable. That is what memory layers do: they use simple, sparse activations and key-value lookup mechanisms to encode and retrieve knowledge. Memory layers take up more memory than dense layers, but they use only a small fraction of their parameters at a time, which makes them much more compute-efficient.
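To make the idea concrete, here is a minimal, hypothetical sketch of a sparsely activated key-value memory layer in PyTorch. The class name, slot count and top-k value are illustrative assumptions for this article, not details taken from Meta's paper.

```python
# Hedged sketch of a key-value memory layer with sparse (top-k) activation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValueMemoryLayer(nn.Module):
    def __init__(self, dim: int, num_slots: int = 65536, top_k: int = 32):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)                     # hidden state -> query
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        q = self.query_proj(x)
        scores = q @ self.keys.T                                  # similarity to every key
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)     # keep only k slots
        weights = F.softmax(top_scores, dim=-1)
        selected = self.values[top_idx]                           # (batch, seq, k, dim)
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)

# Usage: a residual lookup around a hidden state. Only top_k of the
# 65,536 value slots contribute to each token.
layer = KeyValueMemoryLayer(dim=512)
h = torch.randn(2, 16, 512)
out = h + layer(h)
```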

Memory layers have been around for several years, but they are rarely used in modern deep learning architectures because they are not optimized for current hardware accelerators.

Current frontier LLMs typically use some form of "mixture of experts" (MoE) architecture, which relies on a mechanism vaguely similar to memory layers. MoE models consist of many smaller expert components that specialize in specific tasks. At inference time, a routing mechanism determines which expert to activate based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts and provides more granular control over which parameters are activated during inference.
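For comparison, below is a rough sketch of top-k MoE routing. The router, expert count and layer sizes are illustrative assumptions rather than the design of any particular production model.

```python
# Hedged sketch of mixture-of-experts routing: a small router scores the
# experts and only the top-k experts run for each token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim); flatten batch and sequence before calling.
        gate_scores, expert_idx = self.router(x).topk(self.top_k, dim=-1)
        gate = F.softmax(gate_scores, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += gate[mask, k:k+1] * expert(x[mask])
        return out
```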

Upgrading memory layers

Memory layers are light on compute but heavy on memory, which presents particular challenges for current hardware and software frameworks. In their paper, the Meta researchers propose several modifications that address these challenges and make it possible to deploy memory layers at scale.

First, the researchers configured the memory layers for parallelization, distributing them across multiple GPUs to store millions of key-value pairs without changing the other layers in the model. They also implemented a dedicated CUDA kernel to handle high-memory-bandwidth operations. And they developed a parameter-sharing mechanism that supports a single set of memory parameters across several memory layers within a model. This means that the keys and values used for lookups are shared across multiple layers.
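One way to picture the parameter-sharing idea (a hedged sketch, not the paper's implementation) is a single key-value pool referenced by several memory layers, each keeping only its own small query projection. The names below are variants of the hypothetical KeyValueMemoryLayer sketched earlier.

```python
# Hedged sketch of parameter sharing: several memory layers at different depths
# reuse one pool of keys and values, so memory capacity is not duplicated.
import torch
import torch.nn as nn

class SharedMemoryPool(nn.Module):
    def __init__(self, dim: int, num_slots: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)

class SharedMemoryLayer(nn.Module):
    """Looks up a shared key/value pool; only the query projection is layer-specific."""
    def __init__(self, dim: int, pool: SharedMemoryPool, top_k: int = 32):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.pool = pool
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query_proj(x)
        scores = q @ self.pool.keys.T
        w, idx = scores.topk(self.top_k, dim=-1)
        w = w.softmax(dim=-1)
        return (w.unsqueeze(-1) * self.pool.values[idx]).sum(dim=-2)

# One pool, several layers: the key/value parameters are stored only once.
pool = SharedMemoryPool(dim=512, num_slots=65536)
mem_layers = nn.ModuleList([SharedMemoryLayer(512, pool) for _ in range(3)])
```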

These modifications make it possible to implement memory layers within LLMs without slowing down the model.

"Memory layers, with their sparse activations, complement dense networks well and provide increased capacity for knowledge acquisition while keeping compute requirements low," the researchers write. "They scale efficiently and offer practitioners an attractive new way to trade off memory and compute."

To test memory layers, the researchers modified Llama models by replacing one or more dense layers with a shared memory layer. They compared the memory-enhanced models with dense LLMs, as well as MoE and PEER models, on various tasks, including factual question answering, scientific and common-sense world knowledge, and coding.
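A hypothetical sketch of this kind of swap might look as follows. The block attribute name and layer indices are assumptions for illustration; the paper's exact Llama integration is not reproduced here.

```python
# Hedged sketch: swap the feed-forward sub-layer of selected transformer blocks
# for a shared memory layer (see SharedMemoryLayer above).
import torch.nn as nn

def replace_ffn_with_memory(blocks: nn.ModuleList,
                            layer_indices: list[int],
                            make_memory_layer) -> None:
    """Replace the `ffn` sub-module of the chosen blocks in place.

    `make_memory_layer` is any callable returning a module with the same
    (batch, seq, dim) -> (batch, seq, dim) signature as the original FFN.
    """
    for i in layer_indices:
        blocks[i].ffn = make_memory_layer()

# Example (assumed attribute names): share one pool across every replaced block.
# pool = SharedMemoryPool(dim=4096, num_slots=1_000_000)
# replace_ffn_with_memory(model.blocks, [4, 12, 20],
#                         lambda: SharedMemoryLayer(4096, pool))
```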

Memory models vs. dense layers

Their results show that memory models improve significantly over dense baselines and compete with models that use two to four times more compute. They also match the performance of MoE models with the same compute budget and parameter count. The models' performance is particularly notable on tasks that require factual knowledge. For example, on factual question answering, a memory model with 1.3 billion parameters approaches the performance of Llama-2-7B, which was trained on twice as many tokens and with ten times more compute.

Additionally, the researchers found that the benefits of memory models remain consistent as model size grows, having scaled their experiments from 134 million to 8 billion parameters.

"Given these findings, we strongly advocate that memory layers should be integrated into all next-generation AI architectures," the researchers write, adding that there is still plenty of room for improvement. "In particular, we hope that new learning methods can be developed to push the effectiveness of these layers even further, enabling less forgetting, fewer hallucinations and continual learning."
