Most languages use word order and sentence structure to convey meaning. For example, "The cat was sitting on the box" is not the same as "The box was sitting on the cat." Over the course of a long text, such as a financial document or a novel, the state conveyed by those words is likely to evolve.
Likewise, a person might track variables in a piece of code or follow instructions that involve conditional actions. These are examples of state changes and sequential reasoning that we expect state-of-the-art artificial intelligence systems to excel at. However, the current state-of-the-art attention mechanism inside transformers, the primary architecture used in large language models (LLMs) to determine the meaning of words, has theoretical and empirical limitations when it comes to such capabilities.
An attention mechanism allows an LLM to look back at earlier parts of a query or document and use its training to determine which details and words are most important. However, this mechanism on its own does not understand word order: it "sees" all of the input words, also called tokens, at the same time, regardless of the order in which they appear. Researchers have therefore developed techniques for encoding positional information, which is crucial for highly structured domains such as language. The prevailing position-encoding method, called Rotary Position Embedding (RoPE), considers only the relative distance between tokens in a sequence and is independent of the input data. This means that, for instance, any two words separated by the same distance, like "cat" and "box" in the example above, receive the same fixed mathematical rotation specific to that relative distance.
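To make that property concrete, here is a minimal, illustrative sketch (not the official RoPE implementation) of a two-dimensional rotary embedding: the rotation applied to each token depends only on its position index, so the attention score between two tokens ends up depending only on how far apart they are, never on what they say.

```python
# Hedged sketch: a 2-D rotary position embedding, illustrating that the
# rotation depends only on the position index, never on token content.
import numpy as np

def rope_rotate(vec, position, base=10000.0):
    """Rotate a 2-D query/key vector by an angle set solely by its position."""
    theta = position / base                      # angle for this single dimension pair
    cos, sin = np.cos(theta), np.sin(theta)
    x, y = vec
    return np.array([x * cos - y * sin, x * sin + y * cos])

q = rope_rotate(np.array([1.0, 0.0]), position=2)   # e.g. "cat"
k = rope_rotate(np.array([1.0, 0.0]), position=7)   # e.g. "box"
# The attention score q @ k works out to cos((2 - 7) / base): a function of
# the relative distance alone, so any pair of tokens the same distance apart
# receives the same rotation regardless of their content.
print(q @ k)
```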
Now, research led by MIT and the MIT-IBM Watson AI Lab has produced an encoding technique called "PaTH Attention" that makes position information adaptive and context-aware, rather than static, as in RoPE.
"Transformers enable accurate and scalable modeling of many domains, but have these limitations around state tracking, a class of phenomena that is thought to underlie the basic capabilities we want in our AI systems. So the key question is: How can we maintain the scalability and efficiency of transformers while enabling state tracking?" says the paper's senior author Yoon Kim, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a researcher at the MIT-IBM Watson AI Lab.
A new paper on this work was presented at the Conference on Neural Information Processing Systems (NeurIPS) earlier this month. Kim's co-authors include lead author Songlin Yang, an EECS graduate student and former MIT-IBM Watson AI Lab Summer Program intern; Kaiyue Wen of Stanford University; Liliang Ren of Microsoft; and Yikang Shen, Shawn Tan, Mayank Mishra, and Rameswar Panda of IBM Research and the MIT-IBM Watson AI Lab.
Path to understanding
Instead of assigning each pair of words a fixed rotation based on the relative distance between tokens, as RoPE does, PaTH Attention is flexible and treats the words in between as a path made up of small, data-dependent transformations. Each transformation, based on a mathematical operation called a Householder reflection, acts like a small mirror that adjusts depending on the content of each token it passes. Each step in a sequence can thus influence how the model later interprets information, and the cumulative effect allows the system to model how meaning changes along the path between words, not just how far apart they are. This approach lets transformers track how entities and relationships change over time, providing a kind of "positional memory." Think of it as walking down a path while experiencing your surroundings and their effects on you. The team also developed a hardware-efficient algorithm for computing the attention values between each pair of tokens, compressing PaTH Attention's cumulative mathematical transformation and breaking it down into smaller calculations so that it is compatible with fast processing on GPUs.
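The sketch below is a loose illustration based only on the description above, not the paper's actual algorithm or kernel: each token contributes a data-dependent Householder reflection, and the transform relating two positions is the accumulated product of the reflections along the path between them, so two pairs of tokens at the same distance generally end up with different transforms.

```python
# Hedged sketch of path-style, data-dependent position transforms built from
# Householder reflections; the vectors standing in for token content are
# random here and would be produced by the model in a real system.
import numpy as np

def householder(v):
    """Reflection across the hyperplane orthogonal to unit vector v."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

rng = np.random.default_rng(0)
d = 4
token_vectors = rng.normal(size=(6, d))    # placeholder for content-derived vectors

def path_transform(i, j):
    """Accumulate the reflections contributed by tokens i+1 through j."""
    P = np.eye(d)
    for t in range(i + 1, j + 1):
        P = householder(token_vectors[t]) @ P
    return P

# Unlike RoPE, two pairs with the same relative distance get different
# transforms, because the tokens on the paths between them differ.
print(np.allclose(path_transform(0, 3), path_transform(2, 5)))  # False (almost surely)
```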
The MIT-IBM researchers then examined PaTH Attention's performance on synthetic and real-world tasks, including reasoning, long-context benchmarks, and full LLM training, to see whether it improves a model's ability to track information over time. The team tested its ability to follow the final "write" command despite many distracting steps, as well as multi-stage recall tests, tasks that are difficult for traditional positional encoding methods such as RoPE. The researchers also trained medium-sized LLMs and compared them with other methods. PaTH Attention improved perplexity and outperformed other methods on reasoning benchmarks on which it was not trained. They also evaluated recall, reasoning, and stability across inputs of tens of thousands of tokens. PaTH Attention proved to be content-aware throughout.
"We found that on both diagnostic tasks aimed at testing the limits of transformers and real-world language modeling tasks, our new approach was able to outperform existing attention mechanisms while maintaining their efficiency," says Kim. He adds, "I would be interested to see if these kinds of data-dependent positional encodings like PaTH improve the performance of transformers in structured areas like biology, in (analyzing) proteins or DNA."
Think bigger and more efficiently
The researchers then examined how the PaTH attention mechanism would behave if it more closely mimicked human cognition, in which we ignore old or less relevant information when making decisions. To do this, they combined PaTH Attention with another position-encoding scheme called the Forgetting Transformer (FoX), which allows models to selectively "forget." The resulting PaTH-FoX system provides a way to weight information in a data-dependent manner, producing strong results on reasoning, long-context understanding, and language modeling benchmarks. In this way, PaTH Attention expands the expressiveness of transformer architectures.
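As a rough, hedged sketch of the "forgetting" idea in the spirit of what is described above (not FoX's exact formulation), imagine each token emitting a data-dependent forget gate between 0 and 1; attention to an older token is then down-weighted by the product of the gates of the tokens in between, so information behind a small gate fades from consideration.

```python
# Hedged sketch: a data-dependent forgetting bias added to attention scores.
import numpy as np

def forgetting_bias(gates):
    """Log-space bias D[i, j] = sum of log-gates for tokens between j and i (j <= i)."""
    n = len(gates)
    log_g = np.log(gates)
    bias = np.full((n, n), -np.inf)                 # causal mask: no attention to the future
    for i in range(n):
        for j in range(i + 1):
            bias[i, j] = log_g[j + 1:i + 1].sum()   # decay accumulated over the gap
    return bias

gates = np.array([0.9, 0.99, 0.5, 0.95])            # would be content-dependent in a real model
scores = np.zeros((4, 4)) + forgetting_bias(gates)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(3))   # each row attends less to tokens sitting behind a small gate
```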
Kim says research like this is part of a broader effort to develop the "next big thing" in AI. He explains that a key driver of both the deep learning and generative AI revolutions has been the creation of "general building blocks that can be applied across broad domains," such as "convolutional layers, recurrent neural network (RNN) layers," and, more recently, transformers. Looking ahead, Kim notes that considerations such as accuracy, expressiveness, flexibility, and hardware scalability have been, and will continue to be, critical. As he puts it, "the core concern of modern architectural research is to develop these new basic elements that maintain or enhance expressiveness while being scalable."
This work was supported, in part, by the MIT-IBM Watson AI Lab and the AI2050 program at Schmidt Sciences.

