Matrix multiplications (MatMul) are by far the most computationally intensive operations in large language models (LLMs) built on the Transformer architecture. As LLMs grow larger, the cost of MatMul rises sharply, driving up memory requirements and latency during training and inference.
Now researchers at the University of California, Santa Cruz, Soochow University, and the University of California, Davis have developed a new architecture that completely eliminates matrix multiplications from language models while maintaining strong performance at large scales.
In their paper, the researchers present MatMul-free language models that achieve performance comparable to state-of-the-art transformers while requiring significantly less memory for inference.
MatMul
Matrix multiplication is a fundamental operation in deep learning, where it is used to combine data and weights in neural networks. MatMul is crucial for tasks such as transforming input data through the layers of a neural network to make predictions during training and inference.
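For a rough sense of what this means in code, here is a toy NumPy example (the layer sizes are arbitrary) of a single dense layer combining its inputs and weights through one matrix multiplication:

```python
import numpy as np

# A single dense layer: every output feature is a weighted sum of every
# input feature, computed as one matrix multiplication.
batch, d_in, d_out = 4, 512, 1024      # illustrative sizes
x = np.random.randn(batch, d_in)       # input activations
w = np.random.randn(d_in, d_out)       # learned weights

y = x @ w                              # MatMul: (4, 512) @ (512, 1024) -> (4, 1024)
print(y.shape)                         # (4, 1024)
```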
Thanks to their highly parallel architecture, GPUs are designed to perform many MatMul operations concurrently. This parallelism allows GPUs to perform the large-scale computations required for deep learning much faster than traditional CPUs, making them indispensable for efficiently training and running complex neural network models.
However, as LLMs scale to hundreds of billions of parameters, MatMul operations have become a bottleneck, requiring very large GPU clusters during both the training and inference phases. Replacing MatMul with a simpler operation can therefore yield huge savings in memory and compute. But previous attempts to replace MatMul operations have had mixed results: they reduced memory consumption but slowed computation because the replacements do not run well on GPUs.
Replacing MatMul with ternary operations
In the new paper, the researchers propose replacing the traditional 16-bit floating-point weights used in Transformers with ternary weights that can take one of three values: -1, 0, and +1. They also replace MatMul with additive operations that produce equally good results at a much lower computational cost. The models are built from “BitLinear layers” that use these ternary weights.
“By restricting the weights to the set {−1, 0, +1} and applying additional quantization techniques, MatMul operations are replaced by addition and negation operations,” the researchers write.
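To make that concrete, here is a minimal NumPy sketch (not the paper's optimized kernel) showing that a product with ternary weights reduces to selecting, adding, and negating inputs:

```python
import numpy as np

def ternary_matmul_free(x, w_ternary):
    """Compute x @ w_ternary using only addition and negation.

    With weights restricted to {-1, 0, +1}, each output feature is the sum
    of the inputs whose weight is +1 minus the sum of those whose weight is
    -1; zero-weight inputs are skipped, so no multiplications are needed.
    """
    out = np.zeros((x.shape[0], w_ternary.shape[1]))
    for j in range(w_ternary.shape[1]):
        plus = x[:, w_ternary[:, j] == 1].sum(axis=1)
        minus = x[:, w_ternary[:, j] == -1].sum(axis=1)
        out[:, j] = plus - minus
    return out

x = np.random.randn(2, 8)
w = np.random.choice([-1, 0, 1], size=(8, 4))
assert np.allclose(ternary_matmul_free(x, w), x @ w)  # same result, no multiplies
```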
They also make more profound changes to the language model architecture. Transformer blocks consist of two main components: a token mixer and a channel mixer. The token mixer is responsible for integrating information across the different tokens in a sequence. In traditional Transformer models, this is typically achieved with self-attention mechanisms, which use MatMul operations to compute relationships between all token pairs, capturing dependencies and contextual information.
However, in the MatMul-free architecture described in the paper, the token mixer is implemented with a MatMul-free Linear Gated Recurrent Unit (MLGRU). The GRU is a deep learning architecture for sequence modeling that was popular before the advent of Transformers. The MLGRU processes the sequence of tokens by updating hidden states through simple ternary operations, without the need for expensive matrix multiplications.
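The sketch below gives a simplified flavor of such a recurrence. The `proj_*` callables stand in for the paper's ternary-weight BitLinear projections, and the gating follows a generic GRU-style update rather than the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def mlgru_step(x_t, h_prev, proj_f, proj_c, proj_g):
    """One recurrent step in the spirit of the MLGRU token mixer.

    proj_f / proj_c / proj_g stand in for ternary-weight ("BitLinear")
    projections; the recurrence itself uses only element-wise gates,
    products, and sums -- no attention over token pairs.
    """
    f_t = torch.sigmoid(proj_f(x_t))        # forget gate
    c_t = F.silu(proj_c(x_t))               # candidate hidden state
    h_t = f_t * h_prev + (1.0 - f_t) * c_t  # element-wise interpolation of states
    g_t = torch.sigmoid(proj_g(x_t))        # output gate
    return h_t, g_t * h_t                   # (new hidden state, gated output)

# Toy usage, with ordinary nn.Linear layers standing in for ternary ones.
dim = 16
projections = [torch.nn.Linear(dim, dim) for _ in range(3)]
h = torch.zeros(1, dim)
for x_t in torch.randn(5, 1, dim):          # a short sequence of token embeddings
    h, out = mlgru_step(x_t, h, *projections)
```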
The channel mixer is responsible for integrating information across the different feature channels within the representation of a single token. The researchers implemented their channel mixer with a Gated Linear Unit (GLU), which is also used in Llama-2 and Mistral. However, they modified the GLU so that it works with ternary weights instead of full MatMul operations. This allowed them to reduce computational complexity and memory consumption while preserving the effectiveness of feature integration.
“By combining the MLGRU token mixer and the GLU channel mixer with ternary weights, our proposed architecture relies solely on addition and element-wise products,” the researchers write.
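As a rough sketch of the idea (not the authors' implementation), a GLU-style channel mixer with ternary projections could be emulated in PyTorch as follows. The `@` products are a software stand-in: with weights restricted to {-1, 0, +1}, they reduce to additions and negations on suitable kernels or hardware:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryGLU(nn.Module):
    """GLU-style channel mixer whose projections use ternary weights.

    Full-precision weights are quantized to {-1, 0, +1} at forward time
    purely for illustration; training such a layer in practice requires
    a straight-through estimator, which is omitted here.
    """
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Parameter(torch.randn(dim, hidden) * 0.02)
        self.w_up = nn.Parameter(torch.randn(dim, hidden) * 0.02)
        self.w_down = nn.Parameter(torch.randn(hidden, dim) * 0.02)

    @staticmethod
    def ternarize(w):
        # Scale by the mean absolute value, then round to the nearest of
        # {-1, 0, +1} (one common ternary-quantization recipe).
        scale = w.abs().mean().clamp(min=1e-5)
        return (w / scale).round().clamp(-1, 1)

    def forward(self, x):
        g = F.silu(x @ self.ternarize(self.w_gate))   # gate branch
        u = x @ self.ternarize(self.w_up)             # up-projection branch
        return (g * u) @ self.ternarize(self.w_down)  # element-wise product, then down-projection

mixer = TernaryGLU(dim=64, hidden=256)
y = mixer(torch.randn(2, 64))                         # (batch, dim) -> (batch, dim)
```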
Evaluating MatMul-free language models
The researchers compared two variants of their MatMul-free LM against the advanced Transformer++ architecture used in Llama-2, at several model sizes.
Interestingly, their scaling projections show that the MatMul-free LM makes more efficient use of additional compute to improve performance than the Transformer++ architecture does.
The researchers also evaluated the quality of the models on several language tasks. The 2.7B MatMul-free LM outperformed its Transformer++ counterpart on two advanced benchmarks, ARC-Challenge and OpenbookQA, while performance on the other tasks remained comparable.
“These results underscore that MatMul-free architectures are capable of achieving strong zero-shot performance on a wide range of language tasks, from question answering and common sense to physics understanding,” the researchers write.
As expected, the MatMul-free LM has lower memory consumption and latency than Transformer++, and its memory and latency advantages become more pronounced as the model size increases. For the 13B model, the MatMul-free LM used only 4.19 GB of GPU memory with a latency of 695.48 ms, while Transformer++ required 48.50 GB of memory with a latency of 3183.10 ms.
Optimized implementations
The researchers created an optimized GPU implementation and a custom FPGA configuration for MatMul-free language models. With the GPU implementation of the ternary dense layers, they were able to accelerate training by 25.6% and reduce memory consumption by up to 61.0% compared to an unoptimized baseline implementation.
“This work goes beyond software-only implementations of lightweight models and shows how scalable yet lightweight language models can reduce both computational effort and energy consumption in the real world,” the researchers write.
The researchers believe their work can pave the way for the development of more efficient and hardware-friendly deep learning architectures.
Due to limited computational resources, they were unable to test the MatMul-free architecture on very large models with more than 100 billion parameters. However, they hope their work will encourage institutions and organizations that have the resources to build the largest language models to invest in accelerating lightweight models.
Ideally, this architecture will make language models much less dependent on high-end GPUs like those from Nvidia and allow researchers to run high-performance models on other, cheaper and less supply-constrained types of processors. The researchers have released the code for the algorithm and models for the research community to build on.
“By prioritizing the development and deployment of MatMul-free architectures like this one, the LLMs of the future will only become more accessible, efficient, and sustainable,” the researchers write.