The Transformer architecture is the basis of today's leading proprietary and open-source AI models. So we ask ourselves: what's next? Is this the architecture that will lead to better reasoning? What could come after the Transformer? To embed intelligence, models today require large amounts of data, GPU compute, and scarce talent. This makes them generally expensive to build and maintain.
The use of AI began on a small scale, making simple chatbots smarter. Now startups and enterprises have figured out how to package intelligence in the form of copilots that augment human knowledge and skills. The natural next step is to package things like multi-step workflows, memory, and personalization in the form of agents that can solve use cases across multiple functions, including sales and engineering. The expectation is that a simple prompt from a user will let an agent classify the intent, break the goal down into multiple steps, and complete the task, whether that involves web searches, authenticating to multiple tools, or learning from past repeated behaviors.
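To make that workflow concrete, here is a minimal, hypothetical sketch of such an agent loop. Every name in it (classify_intent, plan_steps, the placeholder tools) is an illustrative stand-in, not the API of any real agent framework:

```python
# Hypothetical stand-ins; a real agent would back each of these with
# LLM calls, authentication, and real tool integrations.
TOOLS = {
    "web_search": lambda query: f"search results for {query!r}",
    "book_restaurant": lambda query: f"reservation made: {query!r}",
}

def classify_intent(prompt: str) -> str:
    # Stub classifier: a real agent would ask an LLM to label the intent.
    return "book_trip" if "book" in prompt.lower() else "search"

def plan_steps(intent: str, prompt: str) -> list[tuple[str, str]]:
    # Stub planner: a real agent would decompose the goal with an LLM.
    if intent == "book_trip":
        return [("web_search", "flights to Hawaii"),
                ("book_restaurant", "dinner near Waikiki")]
    return [("web_search", prompt)]

def run_agent(prompt: str) -> list[str]:
    intent = classify_intent(prompt)    # 1. classify the intent
    steps = plan_steps(intent, prompt)  # 2. break the goal into steps
    return [TOOLS[name](args) for name, args in steps]  # 3. execute

print(run_agent("Book a trip to Hawaii"))
```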
When these agents are applied to consumer use cases, we start to see a future where everyone can have a personal Jarvis-like agent on their phone that understands them. Want to book a trip to Hawaii, order food from your favorite restaurant, or manage your personal finances? You and I may one day be able to safely perform these tasks using personalized agents, but from a technology perspective, we're still a long way from that future.
Is the Transformer architecture the final frontier?
The self-attention mechanism of the Transformer architecture allows a model to simultaneously weigh the importance of every input token against all other tokens in the input sequence. This improves a model's understanding of language and computer vision by capturing long-range dependencies and complex token relationships. However, it means that the computational cost grows quadratically with sequence length (think of long sequences such as DNA), leading to slow performance and high memory consumption. Some solutions and research approaches to the long-sequence problem are listed below, followed by a minimal sketch of the quadratic attention computation:
- FlashAttention: A promising technique here is FlashAttention. The paper claims that Transformer performance can be improved by carefully managing reads and writes across the GPU's tiers of fast and slow memory. This is achieved by making attention algorithms IO-aware, which reduces the number of reads/writes between the GPU's high-bandwidth memory (HBM) and its on-chip static random-access memory (SRAM).
- Approximate attention: Self-attention mechanisms have a complexity of O(n^2), where n is the length of the input sequence. Is there a way to reduce this quadratic computational complexity to linear so that Transformers can better handle long sequences? Optimizations here include techniques such as Reformer, Performer, Skyformer, and others.
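The sketch below shows where the O(n^2) cost comes from: vanilla scaled dot-product attention materializes an n x n scores matrix. The final call uses PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention-style, IO-aware kernel on supported hardware:

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    """Vanilla scaled dot-product self-attention.

    q, k, v have shape (batch, seq_len, d). The scores tensor is
    (batch, seq_len, seq_len) -- exactly where the O(n^2) time and
    memory cost for long sequences comes from.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, n, n)
    return F.softmax(scores, dim=-1) @ v

# The fused kernel computes the same result but, with a
# FlashAttention-style backend, avoids writing the full n x n
# scores matrix to high-bandwidth memory.
q = k = v = torch.randn(1, 4096, 64)
out = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(naive_attention(q, k, v), out, atol=1e-4))
```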
In addition to these optimizations that reduce the complexity of Transformers, there are some alternative models that challenge the dominance of Transformers (although it is still too early for most of them):
- State space models: This is a class of models related to recurrent (RNN) and convolutional (CNN) neural networks that compute with linear or near-linear complexity in sequence length. State space models (SSMs) such as Mamba can handle long-range relationships better, but still lag behind Transformers in quality.
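The toy recurrence below (not Mamba itself, which additionally makes its parameters input-dependent and uses a hardware-aware parallel scan) shows why SSMs scale linearly: each step updates a fixed-size hidden state instead of attending over all previous tokens.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal discrete state-space recurrence:
        h_t = A @ h_{t-1} + B @ x_t
        y_t = C @ h_t
    One pass over the sequence: O(n) in sequence length, with a
    fixed-size state instead of an n x n attention matrix.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:              # x has shape (seq_len, d_in)
        h = A @ h + B @ x_t    # update the fixed-size state
        ys.append(C @ h)       # read out the output for this step
    return np.stack(ys)        # (seq_len, d_out)

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
y = ssm_scan(rng.normal(size=(1000, 8)),
             A=0.9 * np.eye(16),
             B=0.1 * rng.normal(size=(16, 8)),
             C=rng.normal(size=(4, 16)))
print(y.shape)  # (1000, 4)
```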
These research approaches are no longer confined to university labs; they are publicly available in the form of new models for anyone to try out. In addition, the latest model releases tell us about the state of the underlying technology and the viability of Transformer alternatives.
Notable model launches
We continue to hear about the latest and greatest model launches from the usual suspects like OpenAI, Cohere, Anthropic, and Mistral. Meta's foundation model for compiler optimization stands out for its effectiveness in code and compiler optimization tasks.
In addition to the dominant Transformer architecture, there are now production-grade state space models (SSM), hybrid SSM-Transformer models, mixture-of-experts (MoE) models, and composition-of-experts (CoE) models. These appear to perform well against the latest open-source models on several benchmarks. Standout models include:
- Databricks' open-source DBRX model: This MoE model has 132 billion parameters. It has 16 experts, 4 of which are active at any one time during inference or training. It supports a 32K context window, and the model was trained on 12T tokens. Some other interesting details: it took 3 months, $10 million, and 3,072 Nvidia GPUs connected over 3.2 Tbps InfiniBand to complete pre-training, post-training, evaluation, red-teaming, and refinement of the model.
- SambaNova's Samba CoE v0.2: This CoE model is a composition of five 7B-parameter experts, only one of which is active at inference time. The experts are all open-source models, and alongside the experts the model has a router, which recognizes which model is best suited to a particular query and forwards the request to it. It is blazing fast, generating 330 tokens/second.
- AI21's Jamba: This is a hybrid Transformer-Mamba MoE model. It is the first production-grade Mamba-based model with elements of the traditional Transformer architecture. "Transformer models have two disadvantages: first, their high memory and compute requirements hinder the processing of long contexts, where the size of the key-value (KV) cache becomes the limiting factor. Second, the lack of a single summary state makes inference slow and throughput low, since each generated token performs a computation over the entire context." SSMs like Mamba handle long-range relationships better but lag behind Transformers in quality. Jamba compensates for the inherent limitations of a pure SSM model, offering a 256K context window and fitting 140K tokens of context on a single GPU.
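A back-of-the-envelope calculation shows why the KV cache is the limiting factor for long contexts. The configuration below is an assumed, illustrative 7B-class Transformer, not the published configuration of any model above:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for one sequence, in GiB.

    Factor of 2 covers keys and values; bytes_per_elem=2 assumes fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Assumed config: 32 layers, 32 KV heads, head_dim 128, fp16.
for ctx in (4_096, 32_768, 262_144):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(32, 32, 128, ctx):6.1f} GiB")
# ->   4096 tokens ->    2.0 GiB
# ->  32768 tokens ->   16.0 GiB
# -> 262144 tokens ->  128.0 GiB
```

Under these assumptions, a 256K context alone would exceed a single GPU's memory; replacing most attention layers with SSM layers, as a hybrid like Jamba does, shrinks the cache roughly in proportion to the attention layers removed.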
Challenges of enterprise adoption
While the latest research and models are promising and point toward the next frontier beyond the Transformer architecture, we must also consider the technical challenges that prevent enterprises from realizing these benefits:
- Lack of enterprise features: Imagine selling to CXOs without basics like role-based access control (RBAC), single sign-on (SSO), or access to logs (both prompts and outputs). Today's models may not be enterprise-ready yet, but enterprises are creating separate budgets to make sure they don't miss out on the next big trend.
- New security challenges: AI copilots and agents make it more complex to secure data and applications. Consider a simple use case: a video conferencing app you use every day introduces AI summarization features. As a user, you might welcome the ability to receive transcripts after a meeting, but in regulated industries this advanced feature can suddenly become a nightmare for CISOs. The bottom line is that what worked perfectly well until now is broken and needs additional security scrutiny. Companies must put safeguards in place to ensure data privacy and compliance when SaaS apps introduce such features.
- RAG vs. fine-tuning: It is possible to use both together, or neither, without sacrificing much. One can think of retrieval-augmented generation (RAG) as a way to make sure facts are represented accurately and data is up to date, while fine-tuning can be thought of as yielding the best model quality. Fine-tuning is hard, which is why some model providers advise against it. It also brings the challenge of overfitting, which hurts model quality. Fine-tuning is under pressure from several sides: as model context windows grow and token costs fall, RAG may become the better deployment option for enterprises. In the context of RAG, the recently released Command R+ model from Cohere is the first open-weights model to beat GPT-4 in the Chatbot Arena. Command R+ is a state-of-the-art RAG-optimized model designed for enterprise-grade workflows. (A minimal sketch of the RAG pattern follows this list.)
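Here is a minimal sketch of the retrieve-then-generate pattern. The embed() function is a random placeholder whose retrieval quality is meaningless; a real pipeline would use an actual embedding model and send the final prompt to an LLM (Cohere, OpenAI, or similar):

```python
import numpy as np

DOCS = [
    "Q1 revenue grew 12% year over year.",
    "The refund policy allows returns within 30 days.",
    "Our API rate limit is 100 requests per minute.",
]

def embed(text: str) -> np.ndarray:
    # Placeholder: a random unit vector seeded by the text's hash.
    # Swap in a real embedding model for meaningful retrieval.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def rag_prompt(question: str, k: int = 2) -> str:
    # Retrieve: rank documents by similarity to the question.
    doc_vecs = np.stack([embed(d) for d in DOCS])
    scores = doc_vecs @ embed(question)
    context = "\n".join(DOCS[i] for i in np.argsort(scores)[-k:])
    # Generate: a real pipeline would send this prompt to an LLM.
    return f"Answer using only this context:\n{context}\n\nQ: {question}"

print(rag_prompt("What is the API rate limit?"))
```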
I recently spoke to an AI leader at a large financial institution who claimed that the future doesn't belong to software developers, but to creative English and art majors who can craft an effective prompt. There may be some truth to this statement. With a simple sketch and multimodal models, non-experts can create simple applications without much effort. Knowing how to use such tools is a superpower and will help anyone who wants to get ahead in their career.
The same is true for researchers, practitioners, and founders. Today, they have several architectures to choose from as they try to make their underlying models cheaper, faster, and more accurate. There are also many ways to adapt models for specific use cases, including fine-tuning techniques and recent breakthroughs such as direct preference optimization (DPO), an algorithm that can be considered an alternative to reinforcement learning from human feedback (RLHF).
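The DPO objective is compact enough to show in a few lines. This is a minimal sketch of the published loss (Rafailov et al., 2023), assuming the summed per-response log-probabilities have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss.

    Inputs are log-probabilities of the chosen/rejected responses under
    the policy being trained and a frozen reference model. The loss pushes
    the policy to prefer the chosen response more strongly than the
    reference does, scaled by beta, with no reward model or RL loop.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probs for a batch of 3 preference pairs.
loss = dpo_loss(torch.tensor([-10.0, -12.0, -9.0]),
                torch.tensor([-11.0, -11.0, -13.0]),
                torch.tensor([-10.5, -12.0, -10.0]),
                torch.tensor([-10.5, -11.5, -12.0]))
print(loss.item())
```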
With so many rapid changes in the generative AI space, it can be overwhelming for founders and buyers alike to decide what to prioritize. I'm excited to see what comes next from everyone building something new.