
Wafer Scale Integration: The Cerebras Story Told as a Single Story

  • Writer: Kumar Venkatramani
  • Feb 23
  • 46 min read

Updated: Feb 24

Note to Reader

This article is written for the technically curious, not the technically specialist. You do not need a degree in computer science or electrical engineering to follow what is written here — though if you have one, you may find some descriptions in more depth than most common publications provide. What you do need is an appetite for understanding how things work, and a comfort with the kind of mathematics you last encountered in high school — ratios, orders of magnitude, and the occasional multiplication of large numbers.

Specifically, the article builds on three mathematical concepts that will feel familiar to anyone who has studied mathematics beyond basic arithmetic: scalars (a single number), vectors (an ordered list of numbers), and matrices (a grid of numbers arranged in rows and columns). These are introduced gently and with examples when they first appear, but a passing familiarity with what they are will make the architecture sections considerably easier to follow.

The subject — wafer scale integration and the engineering choices Cerebras made to achieve it — is genuinely complex. But complexity is not the same as inaccessibility. Every concept in this article is built from first principles, with analogies chosen for clarity rather than precision, and with technical terms introduced only when they earn their place. Where background knowledge is helpful, it is provided in clearly marked insets and appendices that can be read alongside the main article or skipped by those who do not need them.

Editorial Note

This article was written by the author with assistance from an AI system and reflects his best understanding of a technically complex subject. While every effort has been made to ensure accuracy, readers who identify errors or imprecisions are encouraged to raise them. The goal is clarity and honesty, not the appearance of expertise.


Introduction

I am writing this article because in the 40+ years that I have worked in the semiconductor industry, there has always been one holy grail that we strived toward, long thought to be unachievable. It was called Wafer Scale Integration.

I have never seen a company not only attempt to accomplish this on its own, but also innovate along at least three different axes while simultaneously increasing the size of on-chip memory — all at once. I began researching this partly to convince myself that what this company claims to have achieved is actually real.

The Basics

The company I am about to describe is Cerebras. Founded in 2016 with a single purpose — to build a semiconductor chip and system specifically designed to address the AI hardware infrastructure problem — Cerebras has evolved its solution over three generations to its latest chip, announced in March 2024, called the WSE-3 (Wafer Scale Engine, generation 3).

The company has raised over $2.8 billion across approximately eight rounds of venture capital funding, including a $1 billion Series H led by Tiger Global in February 2026. Its valuation has risen sharply as its technology has gained commercial validation — from approximately $8 billion in mid-2025 to $23 billion in February 2026, following the announcement of its landmark partnership with OpenAI. (Source: Cerebras Systems press releases, February 2026)

The remainder of this article answers what Cerebras did and what it means.


Wafer Scale Integration: A Brief History


In 1979–1980, Gene Amdahl — a pioneer in semiconductor development who had left IBM in 1970 to start Amdahl Corporation — departed Amdahl Corporation and embarked on a mission to achieve Wafer Scale Integration. The company he founded was called Trilogy Systems. Trilogy was the talk of the semiconductor frontier, and in their attempt to make this a reality, they decided not only to build the chip but also to construct the fabrication facility — the "fab" — that would manufacture it. They raised over $230 million (close to $1 billion in today's dollars), making them by far the best-funded startup in Silicon Valley at the time.

To appreciate what Gene was attempting, consider that the average chip size in the late 1970s was roughly 0.25" × 0.25" (approximately 6mm × 6mm), while Trilogy's chip was designed to be 2.5" × 2.5" — about 100 times larger by area than what could be reliably manufactured at the time. The way they did this was not to dice the semiconductor wafer into many small dies, but instead to use the entire silicon wafer at once, coining the term "Wafer Scale Integration". (There is a more detailed explanation of what wafers and dies are in a later section.) The design incorporated built-in redundancy, sealed heat exchangers to handle the massive cooling requirements, and fabrication complexity that pushed well beyond the state of the art.

Trilogy's downfall came not from a single cause but from a compounding series of misfortunes — some technical, some human, and some simply bad luck. Gene Amdahl was distracted by a lawsuit following a serious car accident. Co-founder Clifford Madden was diagnosed with a brain tumor and died in 1982. The fabrication plant was damaged during construction by a winter storm. And the separate engineering teams never cohered into a unified effort. Meanwhile, Gordon Moore had articulated what became known as Moore's Law — predicting that chip density would roughly double every eighteen months through standard manufacturing progress — meaning that building a larger chip than the state of the art would simply become feasible through normal progress within a few years anyway. By 1985, Gene Amdahl was forced to pivot, and Trilogy eventually merged its remaining assets with a small startup called Elxsi.

Fast forward to 2015 — essentially 30 years later — and the dream of creating a single wafer-sized chip was considered foolhardy. As Cerebras CEO Andrew Feldman put it in an interview with The New Yorker:


"Trilogy cast a long shadow. People stopped thinking and started saying, 'It's impossible.'"


Enter Cerebras.


How Chips Are Made — and How Cerebras Broke the Rules

To appreciate what Cerebras accomplished, it helps to understand how chips are conventionally manufactured. In today’s technology landscape, you start with a circular semiconductor wafer — typically 300mm (12 inches) in diameter, roughly the size of a dinner plate — and lay out a grid of identical units on its surface.

The maximum unit size, known as the reticle limit, is approximately 26mm × 33mm, or roughly 858mm². This limit is set by the photolithography equipment: a lithographic scanner can only expose that much area in a single "shot." After etching the circuit patterns onto each grid unit (called a die), the wafer is cut apart — using diamond blades, lasers, or in advanced processes, plasma etching — yielding roughly 80 dies per wafer at the reticle limit (fewer after accounting for partial dies at the wafer's edge), or several hundred to several thousand for smaller, more typical die sizes. Each die is then packaged to form what we commonly call an integrated circuit, or chip.

For reference, in 2015 the largest single-die chip in production measured about 600mm² — roughly 70% of the reticle limit. A decade later, the largest single dies produced by leading manufacturers measure approximately 750–800mm². After a decade of engineering progress, the industry has reached the edge of what single-exposure lithography can physically deliver — and stopped there. The reticle limit has become a ceiling that conventional chip design accepts as fixed. What Cerebras did was elegantly simple in concept and extraordinarily difficult in practice: they skipped the dicing step entirely, leaving the full wafer intact as one single, giant chip.
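For readers who want to check the arithmetic themselves, here is a short back-of-the-envelope sketch in Python. It uses only the figures quoted above, and the field count deliberately ignores edge loss at the wafer's circular boundary, so it is an upper bound rather than a manufacturing figure:

```python
import math

wafer_d_mm = 300                 # standard 300mm wafer
reticle_mm2 = 26 * 33            # ~858 mm^2: the single-exposure reticle limit
largest_die_mm2 = 800            # approx. largest conventional die today
wse3_mm2 = 46_225                # Cerebras WSE-3 area

wafer_area = math.pi * (wafer_d_mm / 2) ** 2          # ~70,686 mm^2

# Upper bound on reticle-limit fields per wafer (ignores the circular edge)
fields = wafer_area / reticle_mm2

print(f"wafer area: {wafer_area:,.0f} mm^2")
print(f"reticle-limit fields (upper bound): {fields:.0f}")
print(f"WSE-3 vs. largest conventional die: {wse3_mm2 / largest_die_mm2:.0f}x")
```

Run as-is, this prints a wafer area of about 70,686 mm², an upper bound of roughly 82 reticle-limit fields, and an area ratio of about 58x, consistent with the "more than 50 times" figure above.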


Voilà — Wafer Scale Integration.



Figure 1: Understanding what scale means in the era of Wafer Scale Integration


In Cerebras's design, the full wafer aggregates to approximately 900,000 AI-optimized compute cores on a single chip. While the most advanced conventional accelerator dies press against the physical ceiling of roughly 800mm² — the maximum a single lithography exposure can produce — the Cerebras WSE-3 measures approximately 46,225mm². That is more than 50 times the area of the largest die conventional manufacturing can deliver.

 

The Problem: What Deep Learning Actually Requires


The systems most people refer to today as “AI” — the Large Language Models (or LLMs for short) that answer questions, write code, translate languages, and summarize documents — belong to a class of mathematical systems called deep learning networks. Understanding what these networks are, at a mathematical level, helps explain why the hardware required to train and run them looks the way it does.

Before going further, it is worth pausing on terminology — because the words used to describe these systems are often used interchangeably in public discourse, and the distinctions matter for what follows. The diagram below illustrates the relationship between three terms that will appear throughout this article: machine learning, neural networks, and deep learning. They are not synonyms. Traditional software follows explicit rules written by programmers; machine learning is the broadest category — any system that learns patterns from data rather than following explicit rules. Neural networks are a specific family within machine learning, characterized by their layered architecture of weighted connections. Deep learning is a subset of neural networks specifically — networks with many layers, where the depth of the architecture is what gives the approach its name and much of its power. The systems at the frontier of modern AI, including the large language models this article is ultimately about, are deep learning systems. The diagram makes the nesting visible at a glance.

 

 

[Figure 2: ML / Neural Networks / Deep Learning hierarchy diagram]

  

A deep learning network is built from layers. Each layer is usually represented as a large grid of numbers — a matrix — representing the current state of whatever the network is processing: a word, a phrase, an input sentence, an intermediate calculation, a partially formed answer —  any and all of which are called tokens. What makes the network “learn” is a second set of numbers sitting between those layers: the weights. Weights are the learned multipliers that determine how strongly the output of each element in one layer should influence each element in the next. Training a network means adjusting those weights, gradually, until the network’s outputs match what we want.  See below for a graphic that shows how that looks.

 

Figure 3: Deep Learning Network – Layers and Weights


A note on terminology. Throughout this article, "AI" is used as shorthand for the specific class of deep learning systems the article describes — large language models and the training and inference workloads they require. This reflects the language of the industry, including trade publications and the companies whose products are discussed. The precise distinctions between AI, machine learning, neural networks, and deep learning are the subject of the diagram above, and are noted here so the reader can appreciate the nuance of these terms.

 

Large language models — the specific class of deep learning network behind tools like ChatGPT, Claude, and Gemini — are simply deep learning networks trained on very large quantities of text, image, or other forms of data, with a particular architectural design well suited to understanding and generating language. A large language model at the frontier of current capability contains well over a trillion weights. Even a modest deployed model contains tens of billions. These are not metaphorical large numbers: a trillion is one million millions, and every one of those values must be stored, loaded, multiplied, and accumulated, continuously, at speed.

During operation — whether generating a response or training on new data — the network applies those weights to the current values in each layer, across hundreds of layers in sequence, performing vast numbers of matrix multiplications and accumulations, repeatedly. The scale of this arithmetic is difficult to overstate: a single training run for a large model performs these operations billions of times, across datasets containing trillions of words, over weeks or months of continuous computation.
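To make "applying weights" concrete, here is a deliberately tiny sketch of one layer's forward pass in Python. The numbers are invented for illustration (a real layer has thousands of inputs and outputs), but the multiply-accumulate pattern is exactly the operation described above:

```python
# A toy layer: 3 outputs computed from 4 inputs, weights as a 3x4 grid.
x = [1.0, -2.0, 0.5, 3.0]                 # current values (a vector)
W = [[0.2, -0.1, 0.0, 0.5],               # learned weights (a matrix)
     [0.7,  0.3, -0.4, 0.1],
     [-0.2, 0.6, 0.9, -0.3]]
b = [0.1, 0.0, -0.1]                      # learned bias terms

y = []
for row, bias in zip(W, b):               # one output element per weight row
    acc = bias
    for w, v in zip(row, x):              # multiply-accumulate: the core op
        acc += w * v
    y.append(max(acc, 0.0))               # a simple nonlinearity (ReLU)

print(y)                                  # roughly [2.0, 0.2, 0.0]
```

Every one of a model's billions of weights participates in a multiply-accumulate like the one in the inner loop, which is why the hardware story is dominated by how fast those weights can be fetched and multiplied.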

Two terms appear constantly in any discussion of AI hardware, and it is worth pausing on them now before the article depends on them.

Training is the process by which a deep learning network learns — the computationally intensive phase where the model is exposed to vast quantities of data, its weights are repeatedly adjusted to reduce errors, and the billions or trillions of parameters that define its behavior are gradually shaped into something useful. Training a frontier model happens once, or a small number of times, and takes weeks or months on thousands of accelerators running continuously.

Inference is everything that happens after training is complete — every time the model is asked a question, generates a response, writes code, or translates a sentence.

Think of training a neural network like earning a professional qualification: studying, preparing for, and taking the exams — grueling, expensive, done once. Inference is the career that follows — performed constantly, for years, in front of real people who expect a timely answer.

The two workloads have different hardware demands: training requires sustained, high-throughput parallel computation across enormous datasets; inference requires quick responses — the model must answer each individual request fast enough that the person on the other end doesn't notice a delay. A system optimized for one is not necessarily optimized for the other, and much of the commercial and architectural story of AI hardware — including the OpenAI-Cerebras deal discussed later in this article — only makes sense once that distinction is clear.


The first hardware requirement is fast memory, close to compute. The weights must move continuously from wherever they are stored into the arithmetic units that will use them. The speed at which this can happen is called memory bandwidth — and it is worth pausing on what that term means, because it appears throughout this article. Bandwidth is a rate, not an amount.

Think of memory as a tank of water and bandwidth as the width of the pipe draining it. A large tank with a narrow pipe holds a lot but delivers it slowly. A smaller tank with a wide pipe empties faster.

For AI training, the pipe width — the rate at which data can move from storage into computation — matters more than the tank size.
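A hypothetical calculation makes the point. Suppose a trillion-weight model stored at 16-bit precision (two bytes per weight) must be streamed through the arithmetic units once. The bandwidth figures below are illustrative ballparks, not vendor specifications, and this compares rates only; it says nothing about where the weights actually live:

```python
# Back-of-the-envelope: how long does one pass over a model's weights take?
weights = 1e12                   # a trillion-parameter model
bytes_per_weight = 2             # 16-bit precision
model_bytes = weights * bytes_per_weight          # 2 TB of weights

hbm_bandwidth = 3e12             # ~3 TB/s: ballpark for high-end off-chip DRAM
wafer_bandwidth = 21e15          # 21 PB/s: the WSE-3's aggregate SRAM figure

print(f"off-chip DRAM: {model_bytes / hbm_bandwidth:.3f} s per pass")
print(f"on-wafer SRAM: {model_bytes / wafer_bandwidth * 1e6:.1f} us per pass")
```

At roughly 3 TB/s the pass takes about two-thirds of a second; at 21 PB/s it takes on the order of 100 microseconds. Training repeats such passes billions of times, which is why pipe width dominates.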


The second hardware requirement is massive independent parallelism. Matrix operations lend themselves naturally to parallel execution — many of the individual multiplications within them have no dependency on one another and can be performed simultaneously by independent arithmetic units. The more of this parallelism a system can exploit, the faster it moves through each layer and on to the next. We will see more about this in the sections below, with a deeper dive available in Inset B for those who want it.


The Accelerator Model — Why No Single Chip Does Everything


The processor inside a laptop or server — the CPU (Central Processing Unit) — is a general-purpose machine, designed to handle anything: running a spreadsheet, playing a video, responding to a mouse click. That generality is its strength and its limitation. A CPU is optimized for sequential tasks — doing one thing very well, then the next. For the continuous, high-volume parallel arithmetic that matrix multiplication requires, it is categorically the wrong tool. The arithmetic itself is trivially simple. The problem is the volume, and the need to do it in parallel.

The answer the computing industry settled on is the accelerator model: rather than asking the CPU to do everything, pair it with a specialist co-processor designed from the ground up for one class of problem and divide the work between them. The CPU handles what it is good at — serial orchestration, control logic, managing the training loop, deciding what to compute next. Think of the CPU as a conductor — directing each section of the orchestra, deciding what plays next and when. The accelerator handles what the CPU cannot — executing the same arithmetic operation across millions of data elements simultaneously. Neither can do the other's job. Together, they can train a language model and provide quick answers when requested. A complete system will often include multiple accelerators (GPUs, TPUs, FPGAs, etc.) as well as one or more CPUs working in concert.

GPUs (Graphics Processing Units — originally designed by Nvidia to render video game frames in real time) became the dominant AI accelerator because their massively parallel architecture happened to fit the first requirement — performing a series (a vector) of multiplications very fast — and because Nvidia's CUDA software platform, launched in 2006, gave researchers a practical way to use them. A GPU is not a general-purpose computer: it is an accelerator, permanently paired with a host CPU that handles everything a GPU cannot. To understand this better, see the SIMD section of the Architecture Primer in Inset B.

TPUs (Tensor Processing Units — Google’s custom accelerator, designed specifically for the matrix operations that neural networks require) occupy the same slot in the same architecture: specialist parallel compute engine, paired with a host CPU for orchestration. So does every other accelerator in this category — including the Cerebras WSE-3.


This is the architectural framework within which Cerebras operates. The WSE-3 is not a replacement for the CPU. It is a better answer to the question GPUs and TPUs are already answering: how do you deliver fast memory and massive parallelism to the arithmetic units that need them, at the scale that frontier AI demands?


Different hardware designers have addressed those two demands in different ways, each reflecting a set of engineering judgements about how best to manage the constraints of power, cost, size, and the specific profile of the workload. The choices available — and their trade-offs — are explored in Inset B. Cerebras made its own set of judgements. The rest of this article describes what those were, and what it took to build them into silicon.


 

The WSE-3 — What Was Actually Built


Deep learning training and inference demand two things from hardware:

1.     The ability to move enormous quantities of weights at high speed

2.     The ability to process them with massive parallelism.

Cerebras’s answer to both demands begins with a single architectural decision: put large amounts of fast memory directly beside compute, and do it at wafer scale.

Everything else follows from that choice.


Physical Scale — A Wafer full of processors acting as one


The WSE-3 — the third-generation flagship processor — occupies the entire surface of a standard 300mm silicon wafer: 46,225mm² of active silicon. The physical ceiling for a conventional accelerator die — the maximum area a lithography tool can expose in a single shot — is approximately 800mm². The WSE-3, at 46,225mm², is more than 50 times that area. It is not a larger version of a conventional chip. It is a different category of object entirely.

On that single piece of silicon reside:

  • 900,000 independent compute cores

  • 44GB of on-chip SRAM

  • 21 petabytes per second of aggregate memory bandwidth

  • Approximately 15 kilowatts of peak power draw

Those numbers are not incremental improvements. This is by far the largest processor ever manufactured for commercial deployment. It reflects a different architectural category.


Memory Architecture — 44GB of On-Chip SRAM


Conventional accelerators rely on off-chip DRAM for their working memory. Even high-bandwidth DRAM requires hundreds of clock cycles to fetch data.

The WSE-3 instead integrates 44 gigabytes of SRAM directly on the wafer, distributed evenly across its entire surface. Each compute core sits beside its own slice of local SRAM.

This changes the memory equation fundamentally:

  • DRAM access: hundreds of clock cycles

  • On-chip SRAM access: one to a few clock cycles

In ratio terms, that is roughly a 20–50× reduction in latency for every memory read.
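To translate cycle counts into time, assume an illustrative clock of about 1 GHz (an assumption for arithmetic's sake, not a published WSE-3 specification):

```python
clock_ghz = 1.0                  # illustrative clock; not a published spec
cycle_ns = 1 / clock_ghz

dram_cycles = 200                # "hundreds of cycles" for off-chip DRAM
sram_cycles = 5                  # "one to a few" for adjacent on-chip SRAM

print(f"DRAM fetch: ~{dram_cycles * cycle_ns:.0f} ns")   # ~200 ns
print(f"SRAM fetch: ~{sram_cycles * cycle_ns:.0f} ns")   # ~5 ns
print(f"ratio: ~{dram_cycles / sram_cycles:.0f}x")       # ~40x
```

With these assumed figures the ratio lands at about 40x, inside the 20–50× range quoted above.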


Compute Architecture — MIMD at Scale


The wafer contains 900,000 active compute cores, each executing its own instruction stream on its own data — a Multiple-Instruction, Multiple-Data (MIMD) architecture.

Unlike SIMD systems, where all cores execute the same instruction simultaneously, a MIMD design allows each core to operate independently. Neural networks — particularly during training — involve irregular communication and synchronization patterns. Independence reduces idle waiting.
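The distinction can be caricatured in a few lines of Python. This is purely conceptual (real cores execute machine instructions, not Python functions), but it captures the difference between lockstep and independent execution:

```python
# SIMD flavor: one instruction applied to many data elements at once.
data = [1, 2, 3, 4]
simd_result = [x * 2 for x in data]        # every lane does "multiply by 2"

# MIMD flavor: each core runs its own instruction stream on its own data.
def core_a(x): return x * 2                # one core multiplies
def core_b(x): return x + 10               # another adds
def core_c(x): return max(x, 0)            # a third does something else entirely

cores = [core_a, core_b, core_c]
inputs = [5, 5, -5]
mimd_result = [f(x) for f, x in zip(cores, inputs)]

print(simd_result, mimd_result)            # [2, 4, 6, 8] [10, 15, 0]
```

In the SIMD case, a core with nothing useful to multiply must still wait for the shared instruction; in the MIMD case, each core proceeds at its own pace, which is the independence the paragraph above describes.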

Each core is small — approximately 0.05 mm² — and tightly coupled to local SRAM. The result is not just parallelism, but locality.

The architecture is designed so that computation happens where the data already resides.


Scribe-Line Stitching — Making the Wafer Behave as One Die


Lithography equipment cannot expose a 300mm wafer in a single shot. The wafer necessarily spans multiple reticle fields. In conventional manufacturing, these fields are separated by scribe lines — narrow strips of silicon that are later cut and discarded when the wafer is diced into individual dies.

Cerebras did not dice the wafer.

Instead, they worked with TSMC to route approximately 20,000 metal interconnect wires across each scribe line, electrically stitching adjacent reticle fields together.

This is not a cosmetic detail. It is what makes wafer-scale integration electrically viable.

From the perspective of any compute core:

  • Communicating across a reticle boundary is as fast as communicating within one.

  • The wafer behaves as a single, continuous chip.

Without this stitching, the wafer would be a collection of isolated regions. With it, it becomes a unified compute fabric.

Figure 4: Understanding a Scribe Line


Manufacturing Yield — Designing Around Defects


In conventional chip manufacturing, defective dies are discarded. When the chip is the entire wafer, that option disappears.

Each fabricated wafer contains approximately 970,000 physical cores:

  • 900,000 are activated

  • Roughly 70,000 serve as distributed spares

After fabrication, the wafer is mapped for defects. Faulty cores are bypassed, and spare cores are activated in their place. The interconnect fabric is dynamically reconfigured to route around imperfections.

The yield problem is not eliminated — it is engineered around. From the perspective of software,  every shipped wafer presents a clean, fully functional 900,000-core surface.
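The sparing arithmetic, and the remapping idea behind it, can be sketched as a toy model. Cerebras's actual defect-mapping and routing logic is proprietary; this illustrates only the principle that software sees a fixed logical core count regardless of where defects fall:

```python
import random

random.seed(42)

PHYSICAL = 970_000               # cores fabricated on each wafer
ACTIVE = 900_000                 # cores presented to software
SPARES = PHYSICAL - ACTIVE       # ~70,000 distributed spares

# Toy defect map: mark ~0.5% of physical cores as faulty at random.
faulty = set(random.sample(range(PHYSICAL), k=PHYSICAL // 200))

# Logical -> physical map: walk the wafer, skipping faulty cores.
good_cores = (c for c in range(PHYSICAL) if c not in faulty)
logical_to_physical = [next(good_cores) for _ in range(ACTIVE)]

# Software always sees exactly 900,000 working cores.
print(len(logical_to_physical), f"spares held back: {SPARES / PHYSICAL:.1%}")
```

The spare budget works out to about 7% of physical cores, which is the margin that lets every wafer ship as a clean 900,000-core device.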


Power and Cooling — The Physical Consequence


At full load, the WSE-3 consumes approximately 15 kilowatts — an order of magnitude greater than a conventional accelerator card.

This required:

  • A custom vertical power delivery system

  • Direct liquid cooling integrated into the assembly

  • A mechanical enclosure closer to industrial equipment than server hardware

The chip is not just large in area; it is dense in power.



Figure 5: The engine block, showing the power and cooling tubes on the right and the vertical power delivery system on the left


What Makes This Architecturally Distinct


The WSE-3 is not simply a larger GPU. It represents a different answer to a specific question: If AI computation is memory-bound, what happens if memory is no longer off-chip?

By integrating 44GB of SRAM directly onto a wafer containing 900,000 independent cores — and by electrically stitching reticle fields into a single fabric — Cerebras removes the traditional boundaries between die, package, and memory hierarchy.

Whether this architectural bet becomes commercially dominant remains to be seen.


From Chip to System — The CS-3 and the Max Cluster


The Gap Between a Chip and a Deployable System


A wafer-scale processor does not fit into a conventional accelerator slot. Once the silicon area approaches the size of an entire wafer and power consumption reaches kilowatt levels, the problem shifts from chip design to system design.

The WSE-3 therefore cannot be considered in isolation. It exists as part of an integrated system. As established earlier, the WSE-3 is an accelerator: a specialist parallel compute engine that handles the matrix arithmetic of AI training and inference, paired with a host CPU cluster that handles everything an accelerator cannot — orchestration, data loading, training loop management, and I/O. In addition to that host CPU relationship, the WSE-3 needs to be powered and cooled reliably, connected to external memory large enough to hold the full weight matrices of a frontier AI model, and networked with other systems for the largest training runs in the world.

The CS-3 is Cerebras's answer to everything except the host CPU: a standard 19-inch rack-mountable unit that houses the WSE-3 chip along with its power delivery system, water cooling infrastructure, and high-speed I/O connections. Think of it as the accelerator system — the chip, its life support, and its interface to the world — ready to be paired with a host and put to work.


The CS-3 — One System, One Chip


A single CS-3 delivers 125 petaflops of peak AI compute — where one petaflop means 10¹⁵ floating-point operations per second. That number is not the most important thing about it.

The CS-3's most important capability is how it handles the memory problem for frontier AI models. Cerebras's MemoryX external memory system connects directly to the CS-3 and scales up to 1.2 petabytes (1.2 x 10¹⁵ bytes) of storage — enough to hold the full weight matrices of models far larger than anything currently deployed. Rather than requiring the weights to be partitioned across thousands of accelerator chips with the synchronization overhead that entails, a single CS-3 paired with MemoryX can hold and serve an entire model from one location.
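The capacity claim is easy to sanity-check. Assuming 16-bit weights (two bytes each, an illustrative assumption; actual storage formats vary):

```python
memoryx_bytes = 1.2e15           # 1.2 PB: the quoted MemoryX ceiling
bytes_per_weight = 2             # 16-bit weights; an illustrative assumption

max_weights = memoryx_bytes / bytes_per_weight        # 6e14 = 600 trillion
trillion_model_bytes = 1e12 * bytes_per_weight        # 2 TB for 1T weights

print(f"weights that fit: {max_weights:.0e}")
print(f"fraction used by a 1T-weight model: "
      f"{trillion_model_bytes / memoryx_bytes:.2%}")
```

A trillion-weight model, frontier scale today, occupies well under 1% of that ceiling; that is what "far larger than anything currently deployed" means in concrete terms.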

The practical consequence is a dramatic reduction in software complexity. The distributed training frameworks, gradient synchronization logic, and inter-chip communication management that GPU clusters require — typically tens of thousands of lines of specialized code — are largely unnecessary. Training a model on a CS-3 can be expressed in a few hundred lines of standard PyTorch. The host CPU handles what it always handles; the CS-3 handles the rest without requiring the programmer to manage parallelism manually.

Once the decision is made to place tens of gigabytes of SRAM and nearly a million cores on a full 300 mm wafer, the dominant engineering problem shifts from transistor design to physical survivability. A silicon wafer is thin and mechanically fragile; at this scale it cannot simply be packaged like a conventional die. Delivering on the order of kilowatts uniformly across its surface requires low-impedance power distribution to avoid voltage droop and localized heating. That heat cannot be removed with air alone, so liquid cooling becomes a practical necessity rather than a performance enhancement. The resulting thermal gradients introduce mechanical stress due to differential expansion between silicon, interconnect metals, substrates, and cooling structures. At wafer dimensions, even small mismatches accumulate. The chip therefore exists only as part of a tightly integrated electro-mechanical system in which power delivery, cooling, structural support, and reliability are inseparable from the silicon itself.  

All these problems had to be solved.  Here is what the power delivery and liquid cooling system looks like from the outside.


Figure 6: An inside view of the CS-3's power delivery and cooling pipes. Courtesy of Cerebras


The Price Question


No official price has been confirmed by Cerebras for either the WSE-3 chip or the CS-3 system. Industry observers and analyst estimates put the CS-3 system in the range of $2-3 million per unit, though Cerebras has declined to publicly confirm this figure. (Source: industry analyst estimates; Cerebras has not confirmed official pricing)

At that price point the CS-3 is unquestionably a premium product — accessible to national laboratories, large research institutions, hyperscalers (a term commonly used for the largest cloud providers: Amazon Web Services, Google Cloud, Microsoft Azure, and their ilk), and frontier AI labs (such as OpenAI, Anthropic, Meta, and Perplexity), but beyond the reach of most organizations. This is a real constraint on Cerebras's addressable market, and one the company is addressing through its cloud inference API, which opens access to Cerebras compute on a pay-per-token basis without requiring a capital purchase.


The Max Cluster — 2,048 CS-3 Systems


Cerebras's SwarmX interconnect fabric links up to 2,048 CS-3 systems together, presenting the entire cluster to the programmer as a single unified computing device. This is the Cerebras Max Cluster.

A fully configured Max Cluster delivers approximately 256 exaflops of aggregate AI compute — where one exaflop (10¹⁸ floating-point operations per second) equals 1,000 petaflops, roughly the estimated computational power of the human brain. To put 256 exaflops in physical terms: the US Department of Energy's Frontier supercomputer — housed at Oak Ridge National Laboratory — achieved the world's first exaflop in 2022, filling an entire football-field-sized facility and consuming 20 megawatts of power. The Cerebras Max Cluster delivers 256 times that compute. (Source: Wikipedia, Exascale computing; Cerebras Systems product specifications)


The Apples-to-Apples Cost Comparison


Meta's AI infrastructure — built around approximately 350,000 Nvidia H100 GPUs — cost approximately $25 billion and delivers roughly 1 zettaflop (10²¹ floating-point operations per second) of aggregate compute — where one zettaflop equals 1,000 exaflops, or roughly one thousand times the estimated computational power of the human brain. (Source: multiple industry analyst reports, 2024)

A fully built Cerebras Max Cluster of 2,048 CS-3 systems, at the estimated $2–3 million per unit, would cost approximately $4–6 billion and deliver approximately 256 exaflops (256 × 10¹⁸ floating-point operations per second) — one quarter of Meta's cluster compute, at roughly one quarter of the cost. On a raw compute-per-dollar basis, the two approaches are broadly comparable.
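The compute-per-dollar comparison in the paragraph above can be reproduced directly from the quoted figures (all of them estimates, as noted):

```python
# Compute-per-dollar, using the figures quoted in the text.
meta_flops = 1e21                 # ~1 zettaflop aggregate
meta_cost = 25e9                  # ~$25B

cs3_flops = 125e15                # 125 petaflops per CS-3
cluster_flops = 2048 * cs3_flops  # 256 exaflops for a full Max Cluster
cluster_cost = 2048 * 2.5e6       # midpoint of the $2-3M estimate: ~$5.1B

print(f"Meta:     {meta_flops / meta_cost:.1e} flops per dollar")
print(f"Cerebras: {cluster_flops / cluster_cost:.1e} flops per dollar")
```

Both figures land within a factor of two of each other, which is the basis for "broadly comparable."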

But raw compute per dollar tells only part of the story. The hidden costs of operating a 350,000-GPU cluster are substantial: the network infrastructure connecting all those GPUs, the engineering teams required to write and maintain distributed training frameworks, the idle time from gradient synchronization delays, and the months of debugging that large distributed systems inevitably require. None of those costs appear in the $25 billion hardware figure — but all of them are real, and all of them are dramatically reduced or eliminated on a CS-3 cluster.

The honest summary: the Cerebras Max Cluster offers comparable compute, at comparable cost, with a fraction of the engineering complexity, faster time-to-trained-model, and a fundamentally different architectural approach that eliminates the memory wall bottleneck that limits GPU clusters on the workloads that matter most for AI.


The Competition


The Challengers — Different Bets, Mixed Outcomes


Before discussing Nvidia — the dominant incumbent — it is worth understanding the landscape of architectural challengers that emerged alongside Cerebras, each placing a different bet on how to solve the AI hardware problem.

Graphcore built its Intelligence Processing Unit (IPU) around a bulk synchronous parallel architecture — a genuinely innovative approach that showed promise in research settings but struggled to find production-scale adoption. After reaching a peak valuation of approximately $2.8 billion, Graphcore was acquired by SoftBank in July 2024 and folded into their broader AI and hardware portfolio alongside ARM.

Groq optimized aggressively for inference latency, building a deterministic Language Processing Unit (LPU) that eliminated the unpredictability of GPU-based inference. It demonstrated remarkable speeds on specific workloads and reached a peak valuation of $6.9 billion. In December 2025, Nvidia acquired Groq's core assets for approximately $20 billion in a deal structured as a non-exclusive licensing agreement — with Groq's founder, CEO, and approximately 80% of its engineering team joining Nvidia — while Groq continued to operate nominally as an independent company, retaining its GroqCloud inference service.

SambaNova built its Reconfigurable Dataflow Unit (RDU) around a dataflow architecture well suited to running very large models efficiently on premises. It continues to operate independently, focused primarily on government, national laboratory, and enterprise customers. It remains a credible but niche player.

The pattern across these three companies is instructive. Each identified a real architectural limitation in Nvidia's approach and built a genuinely innovative solution. Each raised significant funding and demonstrated real performance advantages in specific workloads. And yet none achieved the scale necessary to mount a sustained challenge. Building a semiconductor company capable of challenging an incumbent with Nvidia's scale, software ecosystem, and manufacturing relationships is an extraordinarily difficult undertaking — and the history of well-funded challengers is a sobering reminder of that reality.


Nvidia and AMD — The Incumbent and the Established Challenger


Against this backdrop, Nvidia's position rests on two foundations worth distinguishing carefully.

The first is hardware at scale. Nvidia has invested decades building manufacturing and supply chain relationships with TSMC that no startup can replicate quickly. Their Blackwell architecture and the forthcoming Rubin platform represent the state of the art in GPU performance, and their NVLink interconnect fabric allows thousands of GPUs to work together.

The second — and more durable — foundation is software. Nvidia's CUDA platform has become so deeply embedded in AI development over fifteen years that nearly every major AI framework, research codebase, and production pipeline is optimized for it. Switching away from CUDA requires rewriting code, retraining engineers, revalidating results, and accepting performance risk. This software moat is what keeps customers on Nvidia even when alternative hardware demonstrates compelling advantages.

AMD occupies the role of established hardware challenger — offering competitive raw performance and growing compatibility with PyTorch and TensorFlow workflows, without requiring a fundamental architectural departure.


The Frontier AI Labs — and Their Enormous Commitment to Nvidia


The companies that develop and train the world's most capable AI models are known in the industry as Frontier AI Labs — among them OpenAI, Anthropic, Google DeepMind, Meta AI, Mistral, and xAI. Their commitment to Nvidia has been staggering: Nvidia announced in 2025 a commitment to support OpenAI as it builds at least 10 gigawatts of Nvidia systems — equivalent to between 4 and 5 million GPUs. (Source: Nvidia press release, September 2025)

The aggregate financial commitment of frontier AI labs to Nvidia hardware is measured in the hundreds of billions of dollars — a level of investment that creates enormous switching costs and makes Nvidia's position appear almost unassailable.

Almost.


The OpenAI-Cerebras Deal — The First Significant Signal


On January 14, 2026, OpenAI signed a multi-year agreement with Cerebras to deploy 750 megawatts of Cerebras wafer-scale systems — coming online in multiple tranches through 2028 — in a deal worth more than $10 billion. (Source: industry sources familiar with the matter)

The precise framing from OpenAI's own press release is worth quoting directly.


Sachin Katti of OpenAI stated:


"OpenAI's compute strategy is to build a resilient portfolio that matches the right systems to the right workloads. Cerebras adds a dedicated low-latency inference solution to our platform. That means faster responses, more natural interactions, and a stronger foundation to scale real-time AI to many more people."


And from Andrew Feldman, CEO of Cerebras:


"We are delighted to partner with OpenAI, bringing the world's leading AI models to the world's fastest AI processor. Just as broadband transformed the internet, real-time inference will transform AI, enabling entirely new ways to build and interact with AI models."


Two things in these statements deserve careful attention. First, OpenAI is explicitly positioning this as an inference partnership — not training. Second, OpenAI frames it as portfolio diversification rather than a departure from Nvidia.

There are two reasons this inference-first positioning makes sense, and they reinforce each other. The first is switching cost: OpenAI's training infrastructure is deeply embedded in Nvidia's CUDA ecosystem, and moving training workloads to a new architecture carries real engineering risk — revalidating results, retraining teams, rewriting pipelines. Inference is a lower-stakes entry point for a new hardware relationship. The second reason is commercial: inference is the revenue engine. Every token delivered to a user is a billable event. Training the next model is a capital expenditure — an internal cost that produces a future asset, but does not itself generate revenue. When OpenAI routes live customer traffic through Cerebras hardware, faster inference at lower cost per token improves margins on every query served, immediately and measurably. A training speed improvement, however significant, does not appear on the revenue side of the ledger in the same way. The decision to deploy Cerebras for inference first is therefore not a hedge on the technology — it is the commercially rational sequence for any company evaluating a new accelerator platform.

It is worth noting that Cerebras's architectural advantages — the MIMD design, the 44GB of on-chip SRAM, the elimination of inter-chip communication overhead, and the self-reported 28× compute advantage over the Nvidia B200 — are, if anything, more pronounced for training than for inference. The reason OpenAI chose to deploy Cerebras first for inference rather than training is almost certainly commercial rather than technical — inference latency is immediately visible to hundreds of millions of ChatGPT users, making it the highest-impact place to demonstrate speed advantages publicly. Training runs happen behind closed doors and take months; inference happens billions of times a day in real time.

OpenAI is also just one frontier AI lab among many. Anthropic, Google DeepMind, Meta, xAI, and Mistral have not yet made equivalent public commitments to Cerebras hardware. The full test of Cerebras's commercial viability lies in whether those commitments follow.


 

Conclusion — An Open Question, and a Name to Watch


Forty years is a long time for an idea to wait. Wafer scale integration was not forgotten; it was remembered as a cautionary tale, a symbol of ambition outrunning engineering reality. The industry did not abandon it because the idea became useless, but because the last attempt failed spectacularly, expensively, and publicly, casting a shadow long enough to deter a generation of engineers and investors from asking whether the failure was permanent or merely premature. Cerebras decided that one dramatic data point was not enough to close the question. What Andrew Feldman and his team built between 2016 and 2024 was not a single breakthrough but a carefully orchestrated stack of innovations, each one making the next possible.

Wafer scale integration required solving power delivery, which meant flipping the chip and routing current vertically through the assembly rather than laterally across a board. It required solving inter‑die communication, which meant working with TSMC to stitch metal interconnects across scribe lines that had never carried signals before. It required solving manufacturing yield, which meant making cores small enough that defects became cheap, building redundancy into the wafer fabric, and writing software that could remap around failures at boot time. Underlying all of it was a single clear architectural conviction: the memory wall is the central bottleneck in AI computing, and only a MIMD architecture with on‑chip SRAM beside every compute core could eliminate it at the scale frontier AI demands.

 

The result is a chip more than 50 times the area of the largest die conventional lithography can produce in a single exposure, with 44 gigabytes of on-chip SRAM against the handful of megabytes a conventional accelerator core can access locally, delivering memory bandwidth measured in petabytes per second against the terabytes per second of leading accelerator systems — and it allows a single rack-mounted system to train models of up to 24 trillion parameters without the distributed computing complexity that GPU clusters require.

 None of that guarantees that Cerebras wins. OpenAI has committed around ten billion dollars to deploy Cerebras hardware for inference workloads — a remarkable validation from the world’s most prominent frontier AI lab — but OpenAI is one lab among many. Anthropic, Google DeepMind, Meta, xAI, and Mistral have not yet made equivalent public commitments. Training workloads, where Cerebras’s architectural advantages are arguably even more pronounced than in inference, remain dominated by Nvidia, and the CUDA software ecosystem is still the most durable moat in the industry. The estimated 2–3 million dollars per CS‑3 unit keeps Cerebras out of reach for all but the largest and best‑funded organizations, and the history of AI hardware challengers — Graphcore acquired, Groq absorbed, others still searching for scale — is a standing reminder that technological excellence and commercial success are not the same thing.

What Cerebras has demonstrated is that wafer scale integration works in practice: the three hard problems can be solved, a 300 mm wafer can become a single functioning chip, and that chip can outperform clusters of thousands of conventional GPUs on the AI workloads it targets most directly. That demonstration alone ranks among the most significant engineering achievements in the semiconductor industry in the past twenty years, and the market has reflected it, with Cerebras’s valuation rising from $8 billion to $23 billion in the six months following the OpenAI partnership announcement.

 

Whether that achievement translates into a generation‑defining company will depend on manufacturing scale, software ecosystem development, continued customer wins beyond OpenAI, and whether the frontier AI labs that have not yet committed decide that Cerebras’s performance advantages justify the switching costs and the price. Those questions are genuinely open, and anyone who claims to know the answers is overconfident. What is not in doubt is what Cerebras has already accomplished, and on that point the final word belongs to the engineering record.


The semiconductor industry spent forty years saying wafer scale integration was impossible. Cerebras spent eight years proving it wasn't — and the world's most powerful AI lab just wrote them a ten billion dollar check to say thank you.



 

Insets

The following insets provide foundational background on memory and computer architecture for readers who would like context before or while reading the main article.

 

📘 Inset A: Memories

There is a hierarchy of memory in every computer, arranged by speed and cost. At the top sits the fastest memory — small in capacity, expensive, and physically close to the processor. At the bottom sits the slowest — vast in capacity, cheap, and capable of holding data for years without power. Understanding that hierarchy is the foundation for understanding why the Cerebras architecture matters.

Before climbing the ladder rung by rung, one distinction shapes everything else: the difference between volatile and non-volatile memory.

Volatile memory holds data only as long as power is applied. The moment the electricity stops, the data vanishes. Non-volatile memory retains its contents without power — an SSD or a hard drive can sit unplugged for years and still hold everything it stored. The fast tiers at the top of the hierarchy are volatile. The slow storage tiers at the bottom are non-volatile. Within the volatile category, the most important distinction for understanding AI hardware is the difference between two technologies: SRAM and DRAM.


SRAM — Static Random Access Memory — stores each bit using six transistors. That six-transistor circuit is inherently stable: it holds its state without any external intervention as long as power is applied. The cost of that stability is silicon area — six transistors per bit is expensive to manufacture, which is why SRAM appears only in small quantities inside processors.


DRAM — Dynamic Random Access Memory — stores each bit using just one transistor and one capacitor. One transistor per bit instead of six means far denser and cheaper memory, which is why DRAM is used for the main memory in virtually every computer — the 8, 16, or 32 gigabytes in a laptop or desktop. But that single capacitor has a fundamental physical weakness: it leaks its charge over time, like a bucket with a slow drip at the bottom. To prevent the data from disappearing, every cell must be periodically read and rewritten — a process called the refresh cycle. This continuous background refresh adds a small but persistent overhead that further reduces effective bandwidth.

DRAM’s higher latency arises from its array architecture — rows must be activated, sensed, and precharged before access — while SRAM cells are directly addressable flip-flop structures. A modern SRAM cache (specifically the level-one cache) can deliver data to the processor in one to four nanoseconds (10⁻⁹ seconds — billionths of a second). DRAM has a typical access latency of 60 to 100 nanoseconds — roughly twenty to fifty times slower. To put that in human terms: if an SRAM access took one second, the equivalent DRAM access would take almost a full minute. Translated into processor clock cycles, an SRAM access costs roughly four to twelve cycles; a DRAM access costs 200 to 300 cycles — cycles during which the processor sits idle, waiting. This waiting is so pervasive that engineers gave it a name: the memory wall.
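The cycle counts above follow directly from multiplying latency by clock frequency: one cycle at f GHz lasts 1/f nanoseconds. A minimal sketch of that conversion, using the representative latency ranges quoted above rather than measurements of any specific processor:

```python
def latency_to_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert an access latency in nanoseconds to processor clock cycles.

    One cycle at f GHz lasts 1/f nanoseconds, so the cycle count is
    simply the latency multiplied by the frequency.
    """
    return latency_ns * clock_ghz

# SRAM L1 cache: 1-4 ns at 3-4 GHz -> roughly 4-12 cycles
print(latency_to_cycles(1, 4.0))    # 4.0
print(latency_to_cycles(4, 3.0))    # 12.0

# DRAM: 60-100 ns -> roughly 200-300 cycles of idle waiting
print(latency_to_cycles(100, 3.0))  # 300.0
```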

 

The Memory Ladder


CPU Registers. The topmost rung. A tiny number of storage locations built directly into the processor, holding the values it is actively computing with at this exact moment. Access is instantaneous — registers are usually available within one clock cycle at full processor frequency. Measured in bytes rather than gigabytes.

SRAM — Static Random Access Memory. Just below registers. Fast, stable, and volatile. Six transistors per bit makes it expensive in silicon area, so it appears in relatively small quantities as the processor's on-chip cache — the working scratchpad the processor checks before reaching out to slower memory.

DRAM — Dynamic Random Access Memory. The main memory of a computer. One transistor per bit makes it far denser and cheaper than SRAM, at the cost of higher latency. The aforementioned leakage explains the need for refresh; the array design explains the latency gap. That access latency — tens to hundreds of nanoseconds — combined with the continuous refresh overhead, makes DRAM the dominant memory bottleneck in most computing systems.

SSD — Solid State Drive. The first non-volatile tier. The storage medium inside virtually every SSD is NAND flash memory — a technology that locks electrical charge inside microscopic cells to represent data. Unlike DRAM's leaky capacitor, that charge stays in place without power, which is what makes an SSD non-volatile. Access latency is measured in microseconds (10⁻⁶ seconds — millionths of a second) — orders of magnitude slower than DRAM for individual reads, but fast enough for storing and retrieving files, programs, and large datasets.

HDD — Hard Disk Drive. The slowest storage in common use. Data is stored as magnetic patterns on spinning metal platters — a fundamentally mechanical technology. HDDs remain the most cost-effective option for very large storage volumes, and the magnetic patterns persist without power indefinitely, but seek times measured in milliseconds (10⁻³ seconds — thousandths of a second) make them orders of magnitude slower than any tier above.

 

Emerging Memory Technologies

The five tiers above represent the memory landscape most people encounter in computing systems today, but the boundaries are not fixed. A number of newer technologies occupy the gaps between these rungs, each attempting a different trade-off between speed, cost, and non-volatility. MRAM (Magnetoresistive RAM), for example, stores data using magnetic states rather than electrical charge, making it non-volatile while offering access latencies closer to SRAM than to flash — a combination that neither DRAM nor SSD can offer. Other technologies — including FRAM (Ferroelectric RAM) and RRAM (Resistive RAM) — are in various stages of research and commercial development. None has yet displaced DRAM or SRAM in mainstream high-performance computing, but the five-rung ladder is a simplification of a field that continues to evolve.

 

The Numbers, Side by Side

DRAM is used as the baseline — the reference point against which the other tiers are compared — because it is the memory tier that conventional AI accelerators use as their primary working memory.

| Memory type | Typical capacity | Access latency | Clock cycles (3–4 GHz) | Speed vs DRAM (baseline) |
| --- | --- | --- | --- | --- |
| CPU registers | Bytes | < 1 ns | 1 | ~200–300× faster |
| SRAM (on-chip cache) | 10’s of MB | 1–4 ns | 4–12 | ~20–50× faster |
| DRAM (baseline) | 8–32 GB | 60–100 ns | 200–300 | Baseline |
| SSD (NAND flash) | 256 GB–4 TB | 50–100 µs | 150,000–400,000 | ~500–1,500× slower |
| HDD | 1–20 TB | 5–10 ms | 15–40 million | ~50,000–150,000× slower |

 

Why This Matters for AI


A large language model contains hundreds of billions of weights — numerical values that must be loaded into arithmetic units before any computation can take place. The speed at which those weights can move from storage into compute can sometimes be the binding constraint on AI performance. In practice, the limiting factor is usually memory bandwidth — how many bytes per second can be delivered — rather than the latency of a single access. It is the width of the pipe that matters, not the size of the tank.
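To make the width-of-the-pipe point concrete, here is a rough sketch of how long a single pass over a model's weights takes when memory bandwidth is the only limit. Every number in it is an illustrative assumption for this sketch — the model size, the byte width, and both bandwidth figures are representative orders of magnitude, not specifications of any real device:

```python
# How long does one full pass over a model's weights take if the only
# limit is memory bandwidth?  Illustrative assumptions throughout.

params = 8e9            # hypothetical 8-billion-parameter model
bytes_per_param = 2     # 16-bit weights
weight_bytes = params * bytes_per_param          # 16 GB of weights

dram_bandwidth = 3e12   # ~3 TB/s: representative of high-end off-chip DRAM
sram_bandwidth = 20e15  # ~20 PB/s: the on-wafer SRAM regime

dram_ms = weight_bytes / dram_bandwidth * 1e3    # milliseconds per pass
sram_us = weight_bytes / sram_bandwidth * 1e6    # microseconds per pass

print(f"DRAM-class bandwidth: {dram_ms:.1f} ms per pass")  # ~5.3 ms
print(f"SRAM-class bandwidth: {sram_us:.1f} µs per pass")  # ~0.8 µs
```

The three-orders-of-magnitude gap between the two answers is the whole argument of this inset in a single division.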

Conventional AI accelerators store their weights in DRAM. When weights must be fetched from off-chip DRAM, the access latency is on the order of hundreds of cycles. Modern accelerators hide much of this latency through massive parallelism, but the underlying DRAM bandwidth remains the limiting factor. On a GPU cluster running a large model, this memory latency is not an occasional nuisance; it is the dominant limiter on how fast the system can train or serve inference.

The Cerebras WSE-3 stores all 44 gigabytes of working memory as SRAM, distributed directly across the wafer surface beside the compute cores that use it. There is no off-chip DRAM in the execution data path. There are no refresh cycles consuming bandwidth. Each memory access costs one to a few clock cycles rather than hundreds. The leaky bucket is replaced by a tank with no drip at all — and the pipe connecting it to compute is as wide as the processor needs.

 

📘 Inset B: Computer Architecture — A Primer

In 1966, computer scientist Michael Flynn of Stanford University proposed a classification of processor architectures based on two variables: how many instruction streams the processor executes simultaneously, and how many data streams those instructions operate on. Arrange those two variables as the axes of a grid — instruction streams along one axis, data streams along the other — and the result is a 2×2 matrix of four possible architectures. Three of them have significant commercial implementations; one does not. All four are described here, because understanding why one cell of the matrix remains largely empty is itself instructive.

The three architectures with practical implementations — SISD, SIMD, and MIMD — map onto a natural progression in the scale of computation: from operating on a single number, to operating on a list of numbers all at once, to operating on a grid of numbers simultaneously. That progression — scalar (1), vector (N), matrix (N×M) — is not just an abstract taxonomy. It is the architectural story of how computing hardware evolved to meet the demands of modern AI. And as we will see at the end of this inset, that progression does not stop at two dimensions — but that is a story for after we have understood the architecture itself.

 

A Note on von Neumann Architecture and the Bus

All three practical architectures in Flynn’s taxonomy are built on a foundation established in the late 1940s by mathematician John von Neumann. The von Neumann architecture has two defining characteristics: first, both the instructions that tell the processor what to do and the data those instructions operate on live together in the same memory; and second, the processor and that memory are separate physical components connected by a shared communication channel called a bus.

The name is no accident. Think of a city bus: it runs along a fixed route, stops are predetermined, and anyone who wants to travel that route has to wait for the bus to arrive, get on, ride to their stop, and get off. Multiple passengers can ride at once, but they all share the same vehicle — and if the bus is full, or hasn’t arrived yet, you wait. You cannot travel until the bus comes to you.

A computer bus works the same way — and the analogy holds more precisely than it might first appear. A bus in a computer is a shared communication channel, but it does not simply run between two stops. It connects multiple destinations: the processor, the main memory, the graphics card, the storage controller, the network interface, and others depending on the system. Each of those destinations is, in effect, a bus stop — it has an address, and data travelling on the bus is labelled with the address of where it is going. Just as a passenger on a city bus needs to know which stop to get off at, a signal on a computer bus carries information about its destination so that only the right endpoint reads it and acts on it. Everything else on the bus sees the signal pass by and ignores it.

Every time the processor needs an instruction or a piece of data, it puts a request on the bus, labelled with the address of the memory location it wants, and waits for the response to arrive back along the same shared channel. If something else is already using the bus — another request in flight, a memory refresh cycle completing, a write from the graphics card — the processor waits in line. No matter how fast the processor itself is, it cannot work any faster than the bus will allow. This is the origin of the memory wall described in Inset A: the processor is almost always capable of computing faster than the bus can feed it.

This bottleneck is present in all three practical architectures, though each handles it differently. Conceptually, SISD machines operate through a single processor-memory path, even though modern implementations use multiple internal channels and caches to mitigate that bottleneck. SIMD amortizes the instruction side through broadcast, but each processor still has its own dedicated channel to its own data. MIMD goes further — each processor has its own dedicated channel to its own memory from the start. This arrangement is called a “distributed memory” MIMD system, and it reduces (and sometimes eliminates) shared-memory contention. In the variant known as a “shared memory” MIMD system, processors still contend for common memory resources. The progression from one architecture to the next is, in part, the story of how that bottleneck was progressively narrowed.

One further note before the diagrams: the illustrations in this inset are conceptual. They are not representations of how processors are physically laid out on silicon, nor do they imply that real systems have exactly the number of components shown. What they convey is the architectural principle — the logical relationship between instructions, data, processors, and memory — which is the same whether the system has four processors or four million.

 

 

SISD — Single Instruction, Single Data

The scalar tier: one operation on one number at a time

 

 

[ Figure 6: SISD diagram — conceptual illustration only, not a physical layout]

One processor. One memory holding both instructions and data. One operation at a time. This is the pure von Neumann architecture — the design of the earliest computers and the conceptual baseline against which everything else is measured. It handles scalar computation: operations on individual numbers, one at a time, in sequence.

The diagram captures the essential structure: a single memory box and a single processor box, connected by a bus. Every operation involves the processor crossing that bus at least twice before any arithmetic takes place — once to fetch the instruction, and at least once more to fetch the data. The processor is idle during every one of those crossings.

Worked example. Consider the scalar operation a = b × c. The processor would typically:

1. Fetch the multiply instruction from memory

2. Read b from memory

3. Read c from memory

4. Multiply b × c [1 arithmetic operation]

5. Write the result a back to memory

 

That is 4 memory operations and 1 arithmetic operation, performed sequentially. For the computations of an earlier era, this was adequate. For modern AI, where a single forward pass through a large language model involves trillions of individual operations, working through every one of them serially through a single bus is not a viable path.
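The five-step sequence above can be mimicked with a toy simulator that counts how many of the steps are memory traffic versus arithmetic. This is a conceptual sketch of the von Neumann bottleneck, not a model of any real processor:

```python
# Toy SISD machine for a = b * c.  Every instruction fetch, operand
# read, and result write crosses the single shared bus; only step 4
# performs arithmetic.
memory = {"instr": "MUL", "b": 6, "c": 7}
mem_ops, alu_ops = 0, 0

op = memory["instr"]; mem_ops += 1    # 1. fetch the multiply instruction
b = memory["b"];      mem_ops += 1    # 2. read b from memory
c = memory["c"];      mem_ops += 1    # 3. read c from memory
result = b * c;       alu_ops += 1    # 4. multiply (the only arithmetic)
memory["a"] = result; mem_ops += 1    # 5. write the result a back

print(mem_ops, alu_ops, memory["a"])  # 4 1 42
```

Four bus crossings for one multiplication: the ratio that makes serial scalar execution a non-starter for workloads with trillions of operations.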

 

 

SIMD — Single Instruction, Multiple Data

The vector tier: one operation applied to many numbers simultaneously

 


 

[ Figure 7: SIMD diagram — conceptual illustration only, not a physical layout]

Multiple processors, all executing the same instruction simultaneously, each operating on its own slice of data. This is the architectural foundation of the modern GPU — and the step from scalar to vector computation.

The diagram is a conceptual illustration — it does not represent how processors are physically arranged on silicon, nor does it imply that real SIMD systems have exactly this number of levels or nodes. What it shows is the essential principle: a single instruction source at the top, a broadcast tree carrying that instruction downward to every processor simultaneously, and independent data memory attached locally at each node.

In a SIMD system, processors generally execute the same instruction at the same time, applied to different data items. To see why this shape of work suits neural networks, it helps to recognize that a neural network — for all the mystique that surrounds the term — is at its core nothing more than a very large collection of arithmetic operations on arrays of numbers. The same operations that appear in school mathematics appear here too, just applied millions or billions of times. Scaling an array by a constant: k × B[0..n] — the same multiplication, applied independently to every element. Applying a transformation to every element: f(B[0..n]) — for example, clipping every value to zero if it is negative and leaving it unchanged if it is positive: max(0, B[0..n]). The same rule, swept uniformly across the array. Computing a dot product: A[0..n] · B[0..n] — the same multiply-and-accumulate operation, applied to successive pairs and summed. In every case, the operation is identical, only the data changes. This is the shape of work SIMD is designed for, and it describes a substantial fraction of what a neural network does during both training and inference.
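Each of the element-wise operations named above maps onto a single vectorized call. A sketch using NumPy, whose array operations are a software analogue of SIMD execution (and on most CPUs actually compile down to SIMD instructions); the arrays here are arbitrary example values:

```python
import numpy as np

A = np.array([ 1.0, 2.0, 3.0,  4.0])
B = np.array([-2.0, 1.0, 3.0, -4.0])
k = 2.0

scaled = k * B                 # k x B[0..n]: one multiply, every element
clipped = np.maximum(0.0, B)   # max(0, B[0..n]): the clip-negatives rule
dot = np.dot(A, B)             # A . B: multiply-and-accumulate over pairs

print(scaled)   # [-4.  2.  6. -8.]
print(clipped)  # [0. 1. 3. 0.]
print(dot)      # -7.0
```

In each call the operation is written once and swept uniformly across the array — exactly the single-instruction, multiple-data shape the text describes.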

It is also worth noting that matrix multiplication can be expressed as a series of vector operations — each row of a result matrix is a dot product, and dot products are vector multiplications followed by a sum. With enough processors and careful organization of the data, a SIMD architecture can handle large matrix operations efficiently. This is broadly what GPUs do: they do not need a fundamentally different architecture to tackle matrix work; they apply vector-style parallelism at very large scale, with many thousands of processors running simultaneously.

The instruction bus bottleneck is partially relieved in SIMD: the same instruction travels to all processors via broadcast, so the cost of fetching it is shared across however many processors are running. Each processor still has its own data memory and its own dedicated channel to that memory, but the instruction overhead is amortized rather than replicated.

Worked example. Extend the same operation to a vector: A(i) = B(i) × C(i) for i = 1 to 2. Two processors, instruction broadcast once:

1. Receive the broadcast multiply instruction [shared — no individual bus cost]

2. Read own B(i) from local memory

3. Read own C(i) from local memory

4. Multiply [1 arithmetic operation]

5. Write own A(i) back to local memory

 

Both processors do this simultaneously. We have doubled the processor count and completed both elements in the time it takes to compute one.

 

SIMD in Practice: a note on SIMT

Real GPU architectures tend not to implement pure SIMD exactly as Flynn originally defined it. Nvidia’s approach, used across its GPU families, is called SIMT — Single Instruction, Multiple Thread. A detailed treatment of SIMT is beyond the scope of this primer — and the internal architecture of any particular GPU is a proprietary matter that manufacturers do not fully disclose — but the broad idea is that SIMT relaxes the strict lockstep requirement of pure SIMD. Processors within a group generally execute the same instruction together, but individual threads can diverge when needed — for example when following different branches of a conditional — and later reconverge. The result tends to behave like SIMD when the workload is uniform, which most matrix operations are, while handling more varied workloads more gracefully than a strict SIMD design would. SIMT is not a separate category in Flynn’s taxonomy; it is broadly an engineering refinement within the SIMD family, and mentioning it here is simply to acknowledge that the clean lines of the taxonomy become somewhat blurred in real commercial implementations.

 

 

MISD — Multiple Instructions, Single Data

 

Multiple processors, each executing a different instruction, all operating on the same single data stream simultaneously. This is the fourth cell of Flynn’s 2×2 matrix, and it is the one which has no widely adopted general-purpose commercial implementation. The scenario it describes — many different operations all converging on a single shared data stream at once — maps onto very few real computational problems, and the shared data stream itself would tend to create a different kind of bottleneck.

Its place in this primer is completeness rather than practical relevance: a full taxonomy has four cells and understanding why one of them remains largely empty is instructive. Computation in practice generally involves either the same operation applied to many data items, or many independent processors each handling their own part of a larger calculation — not many different operations converging on one shared stream.

 

 

MIMD — Multiple Instructions, Multiple Data

The matrix tier: many independent processors, each following its own path


[ Figure 8: MIMD diagram — conceptual illustration only, not a physical layout]

Multiple processors, each executing its own independent instruction stream on its own independent data, all simultaneously. This is the most general form of parallel computation in Flynn’s taxonomy.

The diagram is a conceptual illustration. What it conveys is the essential principle: each processor is paired with its own memory, each pair operates independently, and coordination between processors happens through a shared interconnect rather than through a shared instruction stream.

The key characteristic of MIMD is autonomy. Each processor is free to follow its own instruction sequence, reading from its own memory, without waiting for or synchronizing with others unless the application requires it. There is no broadcast, no lockstep — each processor simply gets on with its own calculation.

To see what that means in practice, return to the same idea introduced in the SIMD section: a neural network is, at its core, a large collection of arithmetic operations on arrays of numbers. When those operations take the form of a matrix multiplication — which is the central operation in every layer of a modern neural network — the arithmetic looks like this. To compute the output matrix C from input matrices A and B, every element of C requires its own independent accumulation:


C[0,0] = A[0,0]×B[0,0] + A[0,1]×B[1,0]

C[0,1] = A[0,0]×B[0,1] + A[0,1]×B[1,1]

C[1,0] = A[1,0]×B[0,0] + A[1,1]×B[1,0]

C[1,1] = A[1,0]×B[0,1] + A[1,1]×B[1,1]

 

Each of these is a dot product — the same kind of multiply-and-accumulate we saw in the SIMD section. A SIMD machine can compute them by decomposing the matrix into vectors and processing those vectors in parallel. A MIMD machine can assign one processor to each output element and let each run its own independent accumulation sequence simultaneously, without any of them needing to wait for a broadcast or stay in step with the others. Neither approach is the only way to do it; they are different expressions of the same underlying arithmetic. MIMD simply imposes fewer constraints on how each processor gets to its answer.

The bottleneck that exists between each processor and its memory is not shared across the system. Each processor has its own dedicated channel to its own memory, which eliminates shared-memory contention, though processors may still contend for interconnect bandwidth when exchanging data. This holds for truly distributed-memory systems like the one considered here; there are also shared-memory MIMD systems for which it is not strictly true.

Worked example. Four processors, one per output element of the 2×2 matrix above. Each has its own local memory and its own instruction stream, and each computes one element of C[i,j] = Σₖ A[i,k] × B[k,j] by running the following sequence:


1. Fetch first multiply instruction from own local memory

2. Read own A[row, k] from local memory

3. Read own B[k, col] from local memory

4. Multiply them [1 arithmetic operation]

5. Fetch add instruction from own local memory

6. Accumulate into running total [1 arithmetic operation]

… repeat for second pair …

7. Write result C[row, col] back to local memory

 

All four run simultaneously. The four output elements are computed in the time it takes one processor to compute one. Each followed its own instruction sequence on its own data — no broadcast, no lockstep.
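The worked example above can be sketched in code. In the sketch below, Python threads stand in for the four independent processors: each thread runs its own accumulation on its own output element, with no broadcast and no lockstep. The thread arrangement and helper function are invented for this illustration and make no performance claim.

```python
import threading

# One "processor" per output element of C. Each thread follows its own
# instruction sequence (multiply, accumulate, write back), independent
# of the others.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]

def compute_element(row, col):
    total = 0
    for k in range(2):                  # multiply, then accumulate
        total += A[row][k] * B[k][col]
    C[row][col] = total                 # write result back to "local memory"

threads = [threading.Thread(target=compute_element, args=(r, c))
           for r in (0, 1) for c in (0, 1)]
for t in threads:
    t.start()                           # all four run simultaneously
for t in threads:
    t.join()

print(C)
```

Each output element is written by exactly one thread, so no coordination is needed beyond waiting for all four to finish, which mirrors the "no broadcast, no lockstep" property described above.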

 

 

The WSE-3: MIMD Taken to Its Logical Conclusion

 

[ Figure: WSE-3 conceptual diagram — illustrates the architectural principle only, not the physical layout of the chip]

The Cerebras WSE-3 carries the MIMD architecture to a scale not previously attempted in a commercial AI accelerator: 900,000 independent cores, each paired with its own local SRAM, on a single continuous silicon wafer.

The diagram above is a conceptual illustration only — it does not represent the actual physical layout of the WSE-3, which Cerebras has not publicly disclosed in detail. What it conveys is the architectural principle: each core is paired with its own local SRAM, physically adjacent on the same piece of silicon, with no off-chip bus in the data path.

In the WSE-3, the SRAM sits physically adjacent to each core — the distance is measured in micrometers, and the access time is one to a few clock cycles rather than the hundreds of clock cycles that an off-chip DRAM access would typically cost. Von Neumann’s separation of processor and memory is not abolished — each core still fetches its instructions and data from its local SRAM. But the dedicated channel between them has been reduced to the on-wafer distances, and the memory has been changed from leaky-bucket DRAM to stable, fast SRAM. The bottleneck that has characterized computer architecture since the 1940s has not been eliminated — but it has been reduced to the point where off-chip memory access is no longer the binding constraint on AI computation.

 

 

The Pattern


The scalar-to-vector-to-matrix progression loosely mirrors the architectural sequence. SISD handles one number at a time — the scalar tier. SIMD handles many numbers simultaneously under one instruction — the vector tier, where GPUs broadly live. MIMD handles a grid of numbers across many independent processors each following its own path — the matrix tier, and the natural home of the WSE-3.

The instruction overhead runs through all three practical tiers but is handled differently in each. In SISD it adds a memory crossing to every operation. In SIMD it is amortized across all processors sharing the same broadcast — the cost per processor falls as the processor count rises. In MIMD it is carried independently by every processor: the price of full autonomy, paid individually, offset by the fact that each processor has its own dedicated channel to its own memory and shares no memory with its neighbors.
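A toy accounting makes the amortization concrete. The numbers below are invented for illustration (one fetch per operation, a 64-lane SIMD width); they are not measurements of any real machine.

```python
# Toy accounting of instruction-fetch cost across the three tiers,
# for n_ops multiply-accumulate operations. Figures are illustrative.
def total_fetches(model, n_ops):
    if model == "SISD":
        return n_ops            # one fetch drives one operation
    if model == "SIMD":
        return n_ops // 64      # one broadcast fetch drives, say, 64 lanes
    if model == "MIMD":
        # Every core fetches for itself, so the total count matches SISD,
        # but the fetches happen concurrently across cores rather than
        # serializing on one processor.
        return n_ops

for model in ("SISD", "SIMD", "MIMD"):
    print(model, total_fetches(model, 1_000_000))
```

The SIMD line is the amortization described above: the fetch total falls in proportion to the lane count. The MIMD line shows the price of autonomy, paid in parallel rather than eliminated.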

Throughout this progression, the consistent trade-off is silicon area exchanged for computation speed — more processors, smaller memories per processor, shorter channels, faster memory. Each step reduces the time the arithmetic hardware spends waiting. And as we will see in the Final Thought, this progression does not stop at matrices — the same logic extends to structures with more dimensions, which is where modern AI computation lives.

 

Beyond Flynn: VLIW and FPGAs

Flynn’s taxonomy describes the landscape of general-purpose processor architectures, but it does not exhaust the approaches applied to the specific problem of accelerating neural network computation. Two others are worth noting.

VLIW — Very Long Instruction Word. VLIW processors express multiple independent operations in a single wide instruction, allowing the compiler to schedule parallelism explicitly at compile time rather than leaving the hardware to figure it out at runtime. This shifts the instruction-fetch overhead: rather than fetching many short instructions sequentially, the processor fetches one wide instruction that encodes several operations simultaneously. This tends to achieve high hardware utilization in workloads where the instruction mix is predictable in advance, which inference often is. Cadence’s Tensilica processor family is a well-known example, widely used in signal processing and neural network inference.
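The bundling idea can be sketched in a few lines. The register names, the three-slot bundle width, and the tiny instruction set below are all invented for this illustration; real VLIW encodings are considerably richer.

```python
# Toy VLIW sketch: each "wide instruction" is a bundle of independent
# operations that the compiler has scheduled to issue together. The
# processor performs one fetch per bundle, not one fetch per operation.
regs = {"r0": 2, "r1": 3, "r2": 5, "r3": 0, "r4": 0, "r5": 0}

program = [
    # slot 1                     slot 2                     slot 3
    [("mul", "r3", "r0", "r1"), ("add", "r4", "r0", "r2"), ("nop",)],
    [("add", "r5", "r3", "r4"), ("nop",),                  ("nop",)],
]

def run(program, regs):
    for bundle in program:          # one fetch covers the whole bundle
        for op in bundle:           # slots execute "simultaneously"
            if op[0] == "mul":
                _, dst, a, b = op
                regs[dst] = regs[a] * regs[b]
            elif op[0] == "add":
                _, dst, a, b = op
                regs[dst] = regs[a] + regs[b]
    return regs

print(run(program, regs)["r5"])   # (2*3) + (2+5)
```

Note where the burden falls: the hardware stays simple because the compiler proved, ahead of time, that the operations in each bundle do not depend on one another. That is exactly the trade described above, and why VLIW suits workloads whose instruction mix is predictable.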

FPGAs — Field-Programmable Gate Arrays. FPGAs allow the hardware circuit itself to be reconfigured for a specific computation after manufacture. A designer can, in effect, build a custom chip in software and deploy it without going through a fabrication process. This flexibility makes FPGAs well-suited to low-latency inference at the edge, and to workloads where the model architecture is stable enough to justify the engineering effort. Their reconfigurability is both their strength and their constraint: they tend to be slower and less dense than a purpose-built chip, but they can be updated as requirements change.

 

A Note on Optimizations

Flynn’s taxonomy, and the VLIW and FPGA approaches above, represent broad architectural families rather than the full catalog of processor design choices. Decades of processor engineering have produced a rich set of additional techniques — out-of-order execution, speculative execution, branch prediction, instruction prefetching, superscalar pipelines, and many others — each representing a different strategy for squeezing more useful work from available silicon within a given architectural family. These are genuine and important innovations that have shaped every processor manufactured in the last thirty years. They are not described here because they operate as optimizations within an architectural family rather than as alternative answers to the fundamental question of how parallelism should be organized.

This primer has focused on the architectural groundwork most relevant to understanding what Cerebras built. But architecture does not exist in isolation — it is always a response to a problem. The problem Cerebras is addressing is the deep learning problem: how do you train and run neural networks at a scale where the sheer volume of matrix operations, and the speed at which weights must be fed to arithmetic units, overwhelms conventional hardware? The two threads running through this inset — fast memory access and independent parallel computation — are not incidental background. They are precisely the properties that deep learning demands and that conventional GPU-based systems struggle to deliver at scale. That context is what makes the WSE-3’s architectural choices — all-SRAM memory, 900,000 independent cores, dedicated channels measured in micrometers — read not as engineering curiosities but as direct answers to a well-defined problem.

 

A Final Thought: The Real Innovation Was the Mapping

Once you understand the architecture, and recognize that the underlying mathematics has been known for decades — matrix multiplication, dot products, the rules for propagating errors back through a network — a natural question arises: why did all of this happen now, and not thirty years ago?

The hardware is part of the answer. But the deeper innovation was not mathematical and not architectural. It was representational.

It helps to follow a simple progression. A single number — one pixel’s brightness — is a scalar. A row of numbers — one row of pixels, or one set of measurements — is a vector. A rectangular grid of numbers — a full greyscale image, or a table of values — is a matrix. And when you need more than two dimensions — a color image, where each pixel carries red, green, and blue values simultaneously, or a video, which adds time as a further dimension — the structure is called a tensor. A tensor is simply a generalization of a matrix to as many dimensions as the problem requires.

The graphics industry showed the way. Once engineers recognized that an image was just a matrix of numbers — or a tensor, when color was included — the entire machinery of linear algebra became available to process it. Processors could be designed specifically to perform matrix and tensor operations at speed. The hardware that resulted turned out to be equally well suited to manipulating any collection of numbers arranged in that form, regardless of what those numbers represented.

The breakthrough in neural networks was applying the same insight to a far harder representational problem. How do you turn a word into a number? How do you turn a sentence, or a document, or the meaning of a concept, into a tensor that a processor can operate on? That mapping — from the messy, ambiguous, context-dependent world of human language, or sound, or protein structure, into the clean rows and columns of linear algebra — is what took decades to work out. Once it existed, the rest followed from mathematics that was already understood and hardware that was already being built for other reasons. The equations were ready. The processors were ready. What was missing was the map.

This progression — scalar, vector, matrix, tensor — will reappear when we discuss the TPU, whose name, Tensor Processing Unit, is a direct acknowledgement that tensor operations are the heartbeat of modern AI computation.


Further Reading


Readers wishing to explore processor architecture in depth will find the Wikipedia articles on Flynn’s taxonomy, VLIW, and FPGA a reliable starting point. The standard academic reference for the field is: Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann (6th edition, 2017). Its opening chapters on instruction-level parallelism and memory hierarchy are accessible to a determined non-specialist and repay the effort.

 

 

 

📎 Appendix A: How Can So Much Computation Fit Into 0.05mm²?


Source: Cerebras Engineering Blog, March 2025; Cerebras WSE-3 Product Page

A natural question arises from the manufacturing yield section: if each WSE-3 core is so tiny at 0.05mm², how can it be computationally useful at all?

The answer lies in what each core is — and crucially, what it is not. A conventional CPU or GPU core is a general-purpose computing engine. It must handle any instruction a programmer might throw at it — branching logic, floating point arithmetic of many types, memory management, cache coherence, branch prediction, and dozens of other capabilities. All of that generality requires silicon area. A modern Nvidia streaming multiprocessor is large precisely because it must be flexible.

A Cerebras WSE core, by contrast, is not general purpose. Each core is independently programmable but narrowly optimized for tensor-based, sparse linear algebra operations — the specific class of matrix multiplication and addition that underpins virtually every neural network computation. It does not need branch prediction hardware, complex cache hierarchies, or the general-purpose instruction set of a CPU. It needs to multiply, accumulate, and pass results to its neighbors — and it does those three things with extreme efficiency in a very small area of silicon.

Furthermore, each core does not need to carry large local memory because the WSE-3's 44GB of on-chip SRAM is distributed evenly across the entire wafer surface, giving every core single-clock-cycle access to fast memory at extremely high bandwidth. The memory is spread uniformly — each core has just enough fast memory immediately beside it.
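The per-core budget implied by those figures is easy to check. The arithmetic below uses decimal gigabytes, as memory capacity figures typically do; it is a back-of-envelope estimate, not a published specification.

```python
# Back-of-envelope: how much of the 44 GB of on-chip SRAM sits beside
# each of the 900,000 cores, assuming an even distribution.
total_sram_bytes = 44e9          # 44 GB, decimal
num_cores = 900_000
per_core_bytes = total_sram_bytes / num_cores
print(f"{per_core_bytes / 1000:.0f} KB per core")   # roughly 49 KB each
```

A few tens of kilobytes per core sounds small until you recall what each core does: with only multiply-accumulate state and in-flight operands to hold, that slice of SRAM is "just enough fast memory immediately beside it."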

It is also worth noting an important distinction from conventional chip design. When a die stands alone as a conventional packaged chip, it must carry significant overhead circuitry to manage its relationship with the outside world — I/O serializers and deserializers to send data across package boundaries, memory controllers to manage the interface with external DRAM, power management circuits, error correction logic for signals traveling long distances off-chip, and interrupt controllers to communicate with the rest of the system. All of this overhead exists because a conventional chip is an island — it must manage every interaction with the world outside its package boundary.

On the Cerebras wafer, each die is not an island — it is a node in a fabric. It has direct metal connections to its neighbors through the scribe line stitching. It shares a unified power delivery system through the chip's vertical power infrastructure. It has on-chip SRAM immediately beside it. It receives control signals through the overlay network. The external world does not exist at the die level — only at the wafer level. All of that island-management overhead is eliminated, freeing silicon area for what actually matters: compute.

 


 

 

 
 
 
