It is becoming increasingly difficult to run today's artificial intelligence (AI) systems at the scale required for further advances. They need enormous amounts of memory to ensure that all of their processing chips can quickly share the data they generate and operate as a single unit.
The chips that have driven the deep learning boom of the last decade are called graphics processing units (GPUs). They were originally designed for games, not for AI models, where every step of the reasoning process must happen in well under a millisecond.
Each chip contains only a modest amount of memory, so the large language models (LLMs) that underlie our AI systems must be split across many GPUs connected via high-speed networks. LLMs work by training an AI on massive amounts of text, and every part of their operation involves moving data between chips – a process that is not only slow and energy intensive but also requires more and more chips as the models get larger.
For example, OpenAI used around 200,000 GPUs to develop its latest model, GPT-5, about 20 times the number used for the GPT-3 model behind the original version of ChatGPT three years ago.
To overcome the limitations of GPUs, companies like California-based Cerebras have begun developing a different kind of chip called the wafer-scale processor. These are the size of a dinner plate, about five times larger than GPUs, and have only recently become commercially viable. Each contains huge on-chip memory and hundreds of thousands of individual processors (called cores).
The idea behind them is straightforward. Instead of coordinating dozens of small chips, keep everything on one piece of silicon so that data does not need to travel across hardware networks. This matters because when an AI model generates an answer – a step called inference – any delay adds up.
The time it takes for the model to respond is known as latency. Reducing this latency is critical for applications that operate in real time, such as chatbots, scientific analysis engines and fraud detection systems.
However, wafer-scale chips alone are not enough. Without a software system designed specifically for their architecture, much of their theoretical performance gain will simply never materialize.
The deeper challenge
Wafer-scale processors have an unusual combination of features. Each core has very limited memory, so there is a great need for data sharing within the chip. Cores can access their own data in nanoseconds, but each chip has so many cores spread over such a large area that reading memory on the other side of the wafer can be hundreds of times slower.
Limitations in the routing network on each chip also mean that it cannot handle all possible communications between cores at the same time. In short, cores cannot access memory fast enough, cannot communicate freely, and end up spending most of their time waiting.
We recently worked on a solution called WaferLLM, a joint project between the University of Edinburgh and Microsoft Research that aims to run the largest LLMs efficiently on wafer-scale chips. The vision is to reorganize the way an LLM works so that each core on the chip mainly processes locally stored data.
In what is the first paper to examine this problem from a software perspective, we developed three new algorithms that essentially break the model's large mathematical operations down into much smaller pieces.
These pieces are then arranged so that neighboring cores can process them together, passing only tiny fragments of data on to the next core. This keeps information moving locally across the wafer and avoids the long-distance communication that slows down the whole chip.
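To give a flavor of the general idea, here is a minimal Python sketch of tiled matrix multiplication with neighbor-only communication. It is purely illustrative: it simulates a small grid of "cores" with plain arrays and uses the classic Cannon's algorithm rather than WaferLLM's actual algorithms, but it shows how a large multiplication can be split into tiles that only ever move one hop at a time.

```python
# Illustrative sketch only: a simulated grid of "cores", each holding one
# tile of matrices A and B. Every round, each core multiplies its local
# tiles, then passes its A-tile to the neighbor on its left and its B-tile
# to the neighbor above it, so no data travels further than one hop.
import numpy as np

GRID = 4   # 4x4 grid of simulated cores
TILE = 8   # each core holds an 8x8 tile

rng = np.random.default_rng(0)
A = rng.standard_normal((GRID * TILE, GRID * TILE))
B = rng.standard_normal((GRID * TILE, GRID * TILE))

def tile(M, i, j):
    return M[i*TILE:(i+1)*TILE, j*TILE:(j+1)*TILE]

# Initial skew (Cannon's algorithm): core (i, j) starts with
# A-tile (i, (j+i) % GRID) and B-tile ((i+j) % GRID, j).
a_tiles = [[tile(A, i, (j + i) % GRID) for j in range(GRID)] for i in range(GRID)]
b_tiles = [[tile(B, (i + j) % GRID, j) for j in range(GRID)] for i in range(GRID)]
c_tiles = [[np.zeros((TILE, TILE)) for _ in range(GRID)] for _ in range(GRID)]

for _ in range(GRID):
    # Each "core" multiplies the tiles it currently holds.
    for i in range(GRID):
        for j in range(GRID):
            c_tiles[i][j] += a_tiles[i][j] @ b_tiles[i][j]
    # Single-hop communication: A-tiles shift left, B-tiles shift up.
    a_tiles = [[a_tiles[i][(j + 1) % GRID] for j in range(GRID)] for i in range(GRID)]
    b_tiles = [[b_tiles[(i + 1) % GRID][j] for j in range(GRID)] for i in range(GRID)]

# Stitch the result back together and check it matches the direct product.
C = np.block(c_tiles)
assert np.allclose(C, A @ B)
print("distributed result matches A @ B")
```

The point of the sketch is the communication pattern: every simulated core only ever exchanges tiles with an adjacent core, which is the property that keeps data movement local on the wafer.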
We also introduced new strategies for distributing different portions (or layers) of the LLM across hundreds of thousands of cores without leaving large parts of the wafer unused. This involves coordinating processing and communication so that one group of cores processes data while another group moves data and a third prepares its next task.
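The scheduling idea can be illustrated with another small, purely illustrative Python sketch (the group names and "chunks" are invented for the example, not taken from WaferLLM): three groups of simulated cores work on different chunks at once, so computation, data movement and preparation overlap instead of happening one after another.

```python
# Illustrative pipeline sketch: group A prepares (prefetches) a chunk,
# group B computes on the chunk prepared in the previous step, and
# group C moves the results of the chunk computed the step before that.
from collections import deque

chunks = [f"chunk{k}" for k in range(5)]
prefetch_q, compute_q, transfer_q = deque(chunks), deque(), deque()

step = 0
while prefetch_q or compute_q or transfer_q:
    actions = []
    if transfer_q:
        actions.append(f"group C moves results of {transfer_q.popleft()}")
    if compute_q:
        c = compute_q.popleft()
        actions.append(f"group B computes {c}")
        transfer_q.append(c)          # ready to be moved next step
    if prefetch_q:
        c = prefetch_q.popleft()
        actions.append(f"group A prepares {c}")
        compute_q.append(c)           # ready to be computed next step
    print(f"step {step}: " + "; ".join(actions))
    step += 1
```

After a short warm-up, every step keeps all three groups busy at once, which is what stops large portions of the wafer from sitting idle.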
These adaptations were tested on LLMs such as Meta's Llama and Alibaba's Qwen using Europe's largest wafer-scale AI facility, at the Edinburgh International Data Facility. WaferLLM made the wafer-scale chips generate text about 100 times faster than before.
Compared to a cluster of 16 GPUs, this meant a tenfold reduction in latency and twice the energy efficiency. So while some argue that the next leap in AI performance will come from chips designed specifically for LLMs, our results suggest that software can instead be developed to fit the structure of existing hardware.
In the short term, faster inference at lower cost raises the prospect of more responsive AI tools capable of evaluating many more hypotheses per second. This would improve everything from reasoning assistants to scientific analysis engines. Even more data-intensive applications such as fraud detection and hypothesis testing through simulations would be able to handle significantly larger workloads without the need for huge GPU clusters.
The future
GPUs remain flexible, widely available, and supported by a mature software ecosystem, so wafer-scale chips will not replace them. Instead, they will likely serve workloads that depend on ultra-low latency, extremely large models, or high energy efficiency, such as drug discovery and financial trading.
GPUs, however, do not stand still: better software and continuous improvements in chip design help them work more efficiently and deliver more speed. And if even greater efficiency is required, some GPU architectures may also adopt wafer-scale ideas over time.

The broader lesson is that AI infrastructure is becoming a co-design problem: hardware and software must evolve together. As models grow, simply scaling up with more GPUs is no longer enough. Systems like WaferLLM show that rethinking the software stack is essential to unlocking the next generation of AI performance.
For the public, the benefits will appear not as new chips on shelves, but as AI systems that support applications that were previously too slow or too expensive to run. Whether in scientific research, public sector services or mass analytics, the shift to wafer-scale computing signals a new phase in how AI systems are built – and what they can achieve.

