Large language models (LLMs) have made remarkable progress in recent years, but understanding how they work remains a challenge, and scientists at artificial intelligence labs are trying to peek inside the black box.
One promising approach is the sparse autoencoder (SAE), a deep learning architecture that breaks down the complex activations of a neural network into smaller, comprehensible components that can be linked to human-readable concepts.
In a new paper, researchers from Google DeepMind present JumpReLU SAE, a new architecture that improves the performance and interpretability of SAEs for LLMs. JumpReLU makes it easier to identify and track individual features in LLM activations, which could be a step toward understanding how LLMs learn and reason.
The challenge of interpreting LLMs
The basic building blocks of a neural network are individual neurons, tiny mathematical functions that process and transform data. During training, neurons are tuned to become active when they encounter certain patterns in the data.
However, individual neurons don’t necessarily correspond to specific concepts. A single neuron can be activated by thousands of different concepts, and a single concept can activate a wide range of neurons across the network. This makes it very difficult to understand what each neuron represents and how it contributes to the overall behavior of the model.
This problem is especially pronounced for LLMs, which have billions of parameters and are trained on huge datasets. As a result, the activation patterns of neurons in LLMs are extremely complex and difficult to interpret.
Sparse autoencoder
Autoencoders are neural networks that learn to encode some kind of input into an intermediate representation and then decode it back to its original form. Autoencoders come in several flavors and are used for various applications, including compression, image denoising, and style transfer.
Sparse autoencoders (SAEs) use the concept of the autoencoder with a slight modification: during the encoding phase, the SAE is forced to activate only a small number of the neurons in the intermediate representation.
This mechanism allows SAEs to compress a large number of activations into a small number of intermediate neurons. During training, the SAE receives activations from layers inside the target LLM as input.
The SAE attempts to encode these dense activations through a layer of sparse features. It then attempts to decode the learned sparse features and reconstruct the original activations. The goal is to minimize the difference between the original and reconstructed activations while using the smallest possible number of intermediate features.
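To make this concrete, here is a minimal sketch of such an SAE in PyTorch. The layer sizes, variable names, and the L1 sparsity penalty (a common choice for classic SAEs) are illustrative assumptions, not DeepMind’s actual implementation:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: reconstruct LLM activations through sparse features."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Encoder maps dense LLM activations into a wider feature space.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder maps the sparse features back to the activation space.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU zeroes out negative pre-activations, which is the sparsity
        # mechanism in the original SAE architecture.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction term: stay faithful to the original activations.
    reconstruction_error = (reconstruction - activations).pow(2).sum(-1).mean()
    # Sparsity term: penalize the total magnitude of active features.
    sparsity_penalty = l1_coeff * features.abs().sum(-1).mean()
    return reconstruction_error + sparsity_penalty
```

The `l1_coeff` knob embodies the tradeoff discussed next: raising it makes the feature vector sparser at the cost of reconstruction fidelity.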
The challenge with SAEs is to find the right balance between sparsity and reconstruction fidelity. If the SAE is too sparse, it cannot capture all the essential information in the activations. Conversely, if the SAE isn’t sparse enough, it’s just as difficult to interpret as the original activations.
JumpReLU SAE
SAEs use an “activation function” to enforce sparsity in their intermediate layer. The original SAE architecture uses the Rectified Linear Unit (ReLU) function, which zeroes out all features whose activation value is below a certain threshold (usually zero). The problem with ReLU is that it can compromise sparsity by keeping irrelevant features with very small values.
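A toy example illustrates the issue (the values are hypothetical):

```python
import torch

pre_activations = torch.tensor([-0.8, 0.03, 0.01, 2.5])
print(torch.relu(pre_activations))
# tensor([0.0000, 0.0300, 0.0100, 2.5000])
# Only the negative value is removed; the weak 0.03 and 0.01 activations
# survive, leaving the feature vector less sparse than intended.
```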
DeepMind’s JumpReLU SAE aims to address the limitations of previous SAE techniques through a small change to the activation function. Instead of using a global threshold, JumpReLU can determine a separate threshold for each neuron in the sparse feature vector.
This dynamic feature selection makes the training of the JumpReLU SAE somewhat more complicated, but it allows the model to find a better balance between sparsity and reconstruction fidelity.
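Conceptually, JumpReLU zeroes out any feature whose pre-activation falls below that feature’s own learned threshold. Here is a sketch under those assumptions; the straight-through estimators the paper uses to train the thresholds are omitted for brevity:

```python
import torch
import torch.nn as nn

class JumpReLU(nn.Module):
    """Sketch of a JumpReLU activation with a learnable per-feature threshold."""

    def __init__(self, d_features: int):
        super().__init__()
        # One threshold per feature; exp() keeps thresholds positive.
        self.log_threshold = nn.Parameter(torch.zeros(d_features))

    def forward(self, pre_activations: torch.Tensor) -> torch.Tensor:
        threshold = self.log_threshold.exp()
        # Keep a feature only if it clears its own threshold; otherwise
        # zero it out entirely (the "jump" in JumpReLU).
        return pre_activations * (pre_activations > threshold)
```

In the earlier sketch, this module would take the place of the plain `torch.relu` call.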
The researchers evaluated JumpReLU SAE on DeepMind’s Gemma 2 9B LLM. They compared its performance against two other state-of-the-art SAE architectures: DeepMind’s own Gated SAE and OpenAI’s TopK SAE. They trained the SAEs on the residual stream, attention outputs, and dense layer outputs of different layers of the model.
The results show that the reconstruction fidelity of JumpReLU SAE is superior to that of Gated SAE and at least as good as that of TopK SAE across different sparsity levels. JumpReLU SAE was also very effective at minimizing “dead features,” which never activate, as well as overly active features that don’t provide a useful signal for specific concepts the LLM has learned.
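As an illustration of the dead-feature metric, a hypothetical helper might simply count features that never fire across evaluation batches (this is not the paper’s exact evaluation code):

```python
import torch

def dead_feature_fraction(feature_batches):
    """Fraction of SAE features that never fire across evaluation batches.

    feature_batches: iterable of (batch_size, d_features) tensors holding
    the SAE's sparse feature activations.
    """
    ever_fired = None
    for features in feature_batches:
        fired = (features > 0).any(dim=0)
        ever_fired = fired if ever_fired is None else ever_fired | fired
    return 1.0 - ever_fired.float().mean().item()
```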
In their experiments, the researchers found that JumpReLU SAE’s features were as interpretable as those of other state-of-the-art architectures, which is crucial for understanding how LLMs work.
In addition, JumpReLU SAE was very efficient to train, making it suitable for use with large language models.
Understanding and managing LLM behavior
SAEs can provide a more accurate and efficient way to decompose LLM activations, helping researchers identify and understand the features that LLMs use to process and generate language. This could open the door to techniques that steer LLM behavior in desired directions and mitigate some of their shortcomings, such as bias and toxicity.
For example, a recent study from Anthropic found that SAEs trained on Claude Sonnet’s activations were able to find features that activate on text and images related to the Golden Gate Bridge and popular tourist attractions. This kind of concept visibility could allow scientists to develop techniques that prevent the model from generating harmful content, such as malicious code, even when users manage to bypass prompt-level safeguards through jailbreaks.
SAEs can also allow for more granular control over the model’s responses. For example, by modifying the sparse activations and decoding them back into the model, users may be able to adjust aspects of the output, such as making the responses funnier, easier to read, or more technical (see the sketch below). Studying the activations of LLMs has become a vibrant field of research, and there is still much to learn.
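Reusing the hypothetical `SparseAutoencoder` from earlier, such feature-level steering might look like this (the feature index and scaling factor are placeholders, not a real API):

```python
import torch

def steer_activations(sae, activations, feature_idx, scale=5.0):
    """Scale one learned feature, then decode back into activation space."""
    with torch.no_grad():
        _, features = sae(activations)       # encode into sparse features
        features[..., feature_idx] *= scale  # boost the concept of interest
        return sae.decoder(features)         # decoded activations for the model
```

Scaling a single feature up or down before decoding nudges the reconstructed activations, and thus the model’s output, toward or away from the concept that feature represents.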