In a surprising benchmark result that could shake up the competitive landscape for AI inference, startup chip company Groq appears to have confirmed through a series of retweets that its system serves Meta's newly released LLaMA 3 large language model at more than 800 tokens per second.
“We tested their API a bit and the service is definitely not as fast as the hardware demos have shown. Probably more of a software issue – we’re still looking forward to wider use of Groq,” Dan Jakaitis, an engineer who has been benchmarking LLaMA 3’s performance, posted on X (formerly known as Twitter).
If independently verified, this would be a major advance over existing cloud AI services. VentureBeat's own initial tests suggest the claim appears to be true. (You can test it for yourself right here.)
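For readers who want to run a similar spot check, here is a minimal sketch of timing token throughput against an OpenAI-compatible chat completions endpoint. The endpoint URL, model name, and GROQ_API_KEY environment variable below are assumptions for illustration, not an official Groq example.

```python
# Minimal sketch: time token throughput against an OpenAI-compatible endpoint.
# The URL, model name, and API-key variable are assumptions for illustration.
import os
import time

import requests

API_URL = "https://api.groq.com/openai/v1/chat/completions"  # assumed endpoint
MODEL = "llama3-70b-8192"  # assumed model identifier

payload = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": "Explain how GPUs accelerate matrix multiplication."}
    ],
    "max_tokens": 512,
}
headers = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

start = time.time()
resp = requests.post(API_URL, json=payload, headers=headers, timeout=60)
elapsed = time.time() - start
resp.raise_for_status()

# Assumes an OpenAI-style response with a usage.completion_tokens field.
completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"-> {completion_tokens / elapsed:.0f} tokens/sec")
# Note: wall-clock timing includes network latency and prompt processing,
# so it will understate the raw generation speed reported in demos.
```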
A novel processor architecture optimized for AI
Groq, a well-funded Silicon Valley startup, has developed a novel processor architecture optimized for the matrix multiplication operations that lie at the computational heart of deep learning. The company's Tensor Streaming Processor forgoes the caches and sophisticated control logic of traditional CPUs and GPUs and instead relies on a simplified, deterministic execution model tailored to AI workloads.
Groq claims that by avoiding the overhead and memory bottlenecks of general-purpose processors, it can achieve much higher performance and efficiency for AI inference. The 800-tokens-per-second LLaMA 3 result, if it holds up, would lend credence to that claim.
Groq's architecture differs significantly from the designs used by Nvidia and other established chipmakers. Instead of adapting general-purpose processors for AI, Groq designed its Tensor Streaming Processor to accelerate the specific computational patterns of deep learning.
This “clean sheet” approach allows the company to strip out redundant circuitry and optimize data flow for highly repetitive, parallelizable AI inference workloads. The result, Groq says, is a dramatic reduction in the latency, power consumption and cost of running large neural networks compared with mainstream alternatives.
The need for fast and efficient AI inference
A rate of 800 tokens per second equates to about 48,000 tokens per minute, fast enough to generate roughly 500 words of text per second. That is nearly an order of magnitude faster than the typical inference speeds of large language models served in the cloud on conventional GPUs today.
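As a rough sanity check of those figures (the words-per-token ratio here is an assumption; real ratios vary by tokenizer and text):

```python
# Back-of-the-envelope check of the throughput figures above.
tokens_per_second = 800
tokens_per_minute = tokens_per_second * 60   # 48,000 tokens per minute
words_per_token = 0.65                       # assumed rough ratio; varies by tokenizer
words_per_second = tokens_per_second * words_per_token

print(f"{tokens_per_minute:,} tokens/min, ~{words_per_second:.0f} words/sec")
# -> 48,000 tokens/min, ~520 words/sec
```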
Fast and efficient AI inference is becoming increasingly important as language models grow to hundreds of billions of parameters. While training these massive models is enormously compute-intensive, deploying them cost-effectively requires hardware that can run them quickly without consuming enormous amounts of energy. This is especially true for latency-sensitive applications such as chatbots, virtual assistants, and interactive experiences.
The energy efficiency of AI inference is also coming under increasing scrutiny as the technology becomes more widely used. Data centers already consume significant power, and the computational demands of large-scale AI threaten to increase that consumption dramatically. Hardware that can deliver the required inference performance while minimizing energy use will be key to making AI sustainable at scale. Groq's Tensor Streaming Processor was designed with this efficiency imperative in mind and promises to significantly reduce the power cost of running large neural networks compared with general-purpose processors.
Challenging Nvidia's dominance
Nvidia currently dominates the AI processor market, with its A100 and H100 GPUs powering the vast majority of cloud AI services. But a number of well-funded startups such as Groq, Cerebras, SambaNova and Graphcore are challenging that dominance with new architectures designed specifically for AI.
Of these challengers, Groq has been among the most vocal about targeting inference rather than training. CEO Jonathan Ross has boldly predicted that by the end of 2024, most AI startups will be using Groq's low-precision Tensor Streaming Processors for inference.
Meta's release of LLaMA 3, described as one of the most powerful open source language models available, gives Groq a high-profile opportunity to showcase the inference capabilities of its hardware. The model, which Meta claims is on par with the best closed-source offerings, is expected to be widely used for benchmarking and deployed in many AI applications.
If Groq's hardware can run LLaMA 3 significantly faster and more efficiently than mainstream alternatives, it would bolster the startup's claims and could accelerate adoption of its technology. Groq recently launched a new business unit to make its chips more accessible to customers through a cloud service and partnerships.
The combination of powerful open models like LLaMA and highly efficient “AI-first” inference hardware like Groq's could make advanced language AI cheaper and accessible to a wider range of companies and developers. But Nvidia won't give up its lead that quickly, and other challengers are also waiting in the wings.
What is certain is that there is a race to build infrastructure that can keep pace with the explosive advances in AI model development and scale the technology to meet the needs of a rapidly expanding range of applications. Near-real-time AI inference at a reasonable cost could unlock transformative opportunities in areas such as e-commerce, education, finance, healthcare and more.
As one X.com user responded to Groq's LLaMA 3 benchmark claim: “Speed + low_cost + quality = it makes no sense (at the moment) to use anything else.” The coming months will show whether that bold equation holds, but it is clear that the hardware foundations of AI are far from settled as a new wave of architectures challenges the status quo.