One-bit large language models (LLMs) have emerged as a promising approach to making generative AI more accessible and affordable. By representing model weights with a very limited number of bits, 1-bit LLMs dramatically reduce the memory and computational resources required to run them.
Microsoft Research has been pushing the boundaries of 1-bit LLMs with its BitNet architecture. In a recent paper, the researchers introduce BitNet a4.8, a new technique that further improves the efficiency of 1-bit LLMs without compromising their performance.
The rise of 1-bit LLMs
Traditional LLMs use 16-bit floating point numbers (FP16) to represent their parameters. This requires a lot of memory and computing resources, which limits the accessibility and deployment options for LLMs. One-bit LLMs address this challenge by drastically reducing the precision of model weights while aiming to match the performance of full-precision models.
Previous BitNet models used 1.58-bit values (-1, 0, 1) to represent model weights and 8-bit values for activations. This approach significantly reduced memory and I/O costs, but the computational cost of matrix multiplications remained a bottleneck, and optimizing neural networks with extremely low-bit parameters is difficult.
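As a rough illustration of how such low-bit representations are produced, here is a minimal PyTorch sketch of an absmean-style ternary weight quantizer and an absmax 8-bit activation quantizer; the exact formulas in the BitNet papers may differ in detail.

```python
import torch

def ternary_quantize_weights(w: torch.Tensor, eps: float = 1e-5):
    """Round weights to {-1, 0, +1} with a per-tensor absmean scale,
    roughly in the spirit of the 1.58-bit weights in BitNet b1.58."""
    scale = w.abs().mean().clamp(min=eps)    # absmean scaling factor
    w_q = (w / scale).round().clamp(-1, 1)   # ternary values -1, 0, +1
    return w_q, scale                        # dequantize as w_q * scale

def quantize_activations_int8(x: torch.Tensor, eps: float = 1e-5):
    """Symmetric per-token 8-bit (absmax) activation quantization."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 127.0
    x_q = (x / scale).round().clamp(-128, 127)
    return x_q, scale
```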
Two techniques help address this problem. Sparsification reduces the number of computations by pruning activations with smaller magnitudes. This is especially useful in LLMs because activation values tend to have a long-tailed distribution, with a few very large values and many small ones.
Quantization, on the other hand, uses a smaller number of bits to represent activations, reducing the computational and memory cost of processing them. However, simply reducing activation precision can lead to significant quantization errors and performance degradation.
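To make these two ideas concrete, the following is a minimal, illustrative PyTorch sketch of magnitude-based activation sparsification and symmetric 4-bit activation quantization; it is not the actual BitNet a4.8 implementation.

```python
import torch

def topk_sparsify(x: torch.Tensor, keep_ratio: float = 0.55) -> torch.Tensor:
    """Keep only the largest-magnitude activations per token; zero out the rest."""
    k = max(1, int(x.shape[-1] * keep_ratio))
    threshold = x.abs().topk(k, dim=-1).values[..., -1:]  # k-th largest magnitude
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

def quantize_activations_int4(x: torch.Tensor, eps: float = 1e-5):
    """Symmetric per-token 4-bit (absmax) activation quantization: values in [-8, 7]."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 7.0
    x_q = (x / scale).round().clamp(-8, 7)
    return x_q, scale
```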
Furthermore, combining sparsification and quantization is challenging and poses particular problems when training 1-bit LLMs.
“Both quantization and sparsification introduce non-differentiable operations, which makes gradient computation during training particularly difficult,” Furu Wei, partner research manager at Microsoft Research, told VentureBeat.
Gradient computation is crucial for calculating errors and updating parameters when training neural networks. The researchers also had to make sure that their techniques could be implemented efficiently on existing hardware while retaining the benefits of both sparsification and quantization.
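A common workaround for such non-differentiable rounding and thresholding steps is the straight-through estimator (STE), which applies the quantized values in the forward pass but treats the operation as the identity when backpropagating gradients. Here is a minimal sketch, illustrative rather than the exact recipe used in BitNet a4.8:

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass; pass gradients straight through in backward."""
    @staticmethod
    def forward(ctx, x):
        return x.round()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # pretend rounding was the identity

def fake_quantize_ste(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Quantize/dequantize in the forward pass; identity gradient in the backward pass.
    return RoundSTE.apply(x / scale) * scale
```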
BitNet a4.8
BitNet a4.8 addresses the challenges of optimizing 1-bit LLMs through what the researchers call “hybrid quantization and sparsification.” They achieved this by designing an architecture that selectively applies quantization or sparsification to different components of the model, based on the specific distribution pattern of the activations. The architecture uses 4-bit activations for the inputs to attention and feed-forward network (FFN) layers. For intermediate states, it uses sparsification with 8-bit quantization, retaining only the top 55% of the parameters. The architecture is also optimized to take advantage of existing hardware.
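As a purely illustrative sketch of how such a hybrid scheme might look for one feed-forward block, assuming a hypothetical absmax fake-quantization helper and weights that have already been ternarized in the BitNet b1.58 style:

```python
import torch

def absmax_quant(x: torch.Tensor, bits: int, eps: float = 1e-5) -> torch.Tensor:
    """Symmetric per-token absmax fake-quantization to the given bit width."""
    q_max = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / q_max
    return (x / scale).round().clamp(-q_max - 1, q_max) * scale

def hybrid_ffn_forward(x, w_up, w_down, keep_ratio: float = 0.55):
    """Hypothetical hybrid treatment: 4-bit quantization on the layer input,
    top-55% sparsification plus 8-bit quantization on the intermediate state."""
    hidden = absmax_quant(x, bits=4) @ w_up                  # 4-bit input activations
    k = max(1, int(hidden.shape[-1] * keep_ratio))
    thresh = hidden.abs().topk(k, dim=-1).values[..., -1:]   # k-th largest magnitude
    hidden = torch.where(hidden.abs() >= thresh, hidden, torch.zeros_like(hidden))
    return absmax_quant(hidden, bits=8) @ w_down             # 8-bit intermediate state
```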
“With BitNet b1.58, the inference bottleneck of 1-bit LLMs switches from memory/IO to computation, which is constrained by the activation bits (i.e., 8 bits in BitNet b1.58),” Wei said. “In BitNet a4.8, we push the activation bits to 4-bit so that we can use 4-bit kernels (e.g., INT4/FP4) to speed up LLM inference by 2x on GPU devices. The combination of 1-bit model weights from BitNet b1.58 and 4-bit activations from BitNet a4.8 effectively addresses both the memory/IO and computational constraints in LLM inference.”
BitNet a4.8 also uses 3-bit values to represent the key (K) and value (V) states in the attention mechanism. The KV cache is a crucial component of transformer models: it stores the representations of previous tokens in the sequence. By reducing the precision of KV cache values, BitNet a4.8 further cuts memory requirements, especially when processing long sequences.
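For illustration, a symmetric absmax-style 3-bit KV-cache quantizer could look like the following sketch; BitNet a4.8's exact scheme may differ:

```python
import torch

def quantize_kv_3bit(kv: torch.Tensor, eps: float = 1e-5):
    """Quantize key/value states to 3-bit integers in [-4, 3], per token."""
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=eps) / 3.0
    kv_q = (kv / scale).round().clamp(-4, 3).to(torch.int8)  # packed more tightly in practice
    return kv_q, scale

def dequantize_kv(kv_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover approximate full-precision key/value states for attention."""
    return kv_q.float() * scale
```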
The promise of BitNet a4.8
Experimental results show that BitNet a4.8 delivers comparable performance to its predecessor BitNet b1.58 while consuming less computing power and memory.
Compared to full-precision Llama models, BitNet a4.8 reduces memory usage by a factor of 10 and achieves a 4x speedup. Compared to BitNet b1.58, it achieves a 2x speedup thanks to 4-bit activation kernels. And the design can deliver much more.
“The estimated computational improvement is based on the existing hardware (GPU),” Wei said. “With hardware specifically optimized for 1-bit LLMs, the computational improvements can be significantly increased. BitNet introduces a new computation paradigm that minimizes the need for matrix multiplication, a primary focus in current hardware design optimization.”
Thanks to its efficiency, BitNet a4.8 is particularly well suited to deploying LLMs at the edge and on resource-constrained devices. This can have significant privacy and security implications: by running LLMs on-device, users can benefit from the power of these models without having to send their data to the cloud.
Wei and his team are continuing their work on 1-bit LLMs.
“We continue to advance our research and our vision for the era of 1-bit LLMs,” Wei said. “While our current focus is on model architecture and software support (e.g., bitnet.cpp), we aim to explore the co-design and co-evolution of model architecture and hardware to fully realize the potential of 1-bit LLMs.”