After years of dominance by the form of AI known as the transformer, the hunt for new architectures is now underway.
Transformers form the basis of OpenAI's video generation model Sora and are at the heart of text generation models such as Anthropic's Claude, Google's Gemini and GPT-4o. But they are increasingly running into technical obstacles, particularly computational ones.
Transformers aren't particularly efficient at processing and analyzing large amounts of data, at least not when running on commercially available hardware. This leads to steep, and perhaps unsustainable, increases in electricity demand as companies build and expand infrastructure to meet transformers' requirements.
A promising architecture proposed this month is test-time training (TTT), developed over the course of a year and a half by researchers at Stanford, UC San Diego, UC Berkeley, and Meta. The research team claims that TTT models can not only process far more data than transformers, but can do so without using nearly as much computing power.
The hidden state in transformers
A fundamental component of transformers is the "hidden state," which is essentially a long list of data. When a transformer processes something, it adds entries to the hidden state to "remember" what it just processed. For example, when the model works through a book, the hidden state values are things like representations of words (or parts of words).
"If you think of a transformer as an intelligent entity, then the lookup table, its hidden state, is the transformer's brain," Yu Sun, a postdoctoral fellow at Stanford and co-author of the TTT research, told TechCrunch. "This specialized brain enables the well-known capabilities of transformers, such as in-context learning."
The hidden state is part of what makes transformers so powerful. But it also holds them back. To "say" even a single word about a book a transformer has just read, the model would have to search its entire lookup table, a task as computationally demanding as rereading the whole book.
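To make that mechanic concrete, here is a deliberately simplified Python sketch, not the researchers' code or a real transformer: the hidden state is modeled as a cache of per-token vectors that grows with everything the model reads, and producing any output means scoring a query against every cached entry. All names and sizes are illustrative assumptions.

```python
import numpy as np

D = 64               # size of each token representation (illustrative)
hidden_state = []    # the "lookup table": grows without bound as text is read

def process_token(token_vec):
    # "Remember" what was just processed by appending it to the hidden state.
    hidden_state.append(token_vec)

def answer_query(query_vec):
    # To produce even one output, score the query against every stored entry,
    # so the cost scales with everything read so far.
    cache = np.stack(hidden_state)        # shape: (tokens_seen, D)
    scores = cache @ query_vec            # one score per remembered token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache                # weighted summary of the whole cache

for _ in range(10_000):                   # "reading a book"
    process_token(np.random.randn(D))

out = answer_query(np.random.randn(D))
print(len(hidden_state))                  # 10,000 entries scanned per query
```

The point of the toy example is the scaling behavior: the cache, and the work per query, keep growing with the amount of data processed.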
So Sun and his team came up with the idea of replacing the hidden state with a machine learning model, a kind of AI matryoshka, a model within a model.
It's a bit technical, but the core idea is that, unlike a transformer's lookup table, the TTT model's internal machine learning model doesn't keep growing as it processes additional data. Instead, it encodes the data it processes into representative variables called weights, which is what makes TTT models so powerful. No matter how much data a TTT model processes, the size of its internal model doesn't change.
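A loose sketch of that idea, with the loss and update rule chosen for illustration rather than taken from the paper: the hidden state is the weight matrix of a tiny inner model, and each incoming input is folded into those weights with a small self-supervised update, so the state stays the same size no matter how long the stream is.

```python
import numpy as np

D = 64
rng = np.random.default_rng(0)

W = np.zeros((D, D))   # fixed-size internal state: the inner model's weights
lr = 0.01              # learning rate for the inner update (assumed value)

def update_state(x):
    """Fold one input vector into the weights via a small self-supervised
    gradient step (here: reconstruct x from itself through W)."""
    global W
    pred = W @ x
    grad = np.outer(pred - x, x)   # gradient of 0.5*||W x - x||^2 w.r.t. W
    W -= lr * grad

def read_state(query):
    """Produce an output from the compressed state, without rescanning inputs."""
    return W @ query

for _ in range(100_000):           # process a long stream of data
    update_state(rng.standard_normal(D))

print(W.shape)                     # still (64, 64): the state never grows
```

Contrast this with the previous sketch: the work done to answer a query no longer depends on how much data has already been processed, only on the fixed size of the internal model.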
Sun believes that future TTT models could efficiently process billions of pieces of data, from words to images, from audio recordings to videos, far exceeding the capabilities of today's models.
"Our system can say X words about a book without having to reread the book X times," Sun said. "Large transformer-based video models like Sora can only process 10 seconds of video because they only have a lookup-table 'brain.' Our ultimate goal is to develop a system that can process a long video resembling the visual experience of a human lifetime."
Skepticism toward TTT models
So will TTT models eventually replace transformers? They could, but it's too early to say for sure.
TTT models aren't a drop-in replacement for transformers. And the researchers only developed two small models for their study, so TTT as a method is currently difficult to compare with some of the larger transformer implementations.
"I think it's a perfectly interesting innovation, and if the data backs up the claims that it delivers efficiency gains, then that's great news, but I couldn't tell you whether it's better than existing architectures or not," said Mike Cook, a lecturer in the computer science department at King's College London who was not involved in the TTT research. "An old professor of mine used to tell a joke when I was a student: how do you solve any problem in computer science? By adding another layer of abstraction. Adding a neural network inside a neural network definitely reminds me of that."
Regardless, the accelerating pace of research into transformer alternatives points to a growing recognition of the need for a breakthrough.
This week, AI startup Mistral released a model called Codestral Mamba, which is based on another alternative to the transformer known as the state space model (SSM). Like TTT models, SSMs appear to be more computationally efficient than transformers and can scale to larger amounts of data.
AI21 Labs is also researching SSMs, as is Cartesia, which developed some of the first SSMs and Codestral Mamba's namesakes, Mamba and Mamba-2.
If these efforts succeed, generative AI could become even more accessible and widespread than it is now, for better or for worse.