
Wafer Scale Integration — The Cerebras Story

  • Writer: Kumar Venkatramani
  • Mar 3
  • 23 min read

Updated: Mar 14


Note to Reader

This article is written for the technically curious, not the technically specialist. You do not need a degree in computer science or electrical engineering to follow what is written here, though if you have one, you may find some descriptions in more depth than most common publications provide. What you do need is an appetite for understanding how things work, and a willingness to engage with the ideas that follow.

This article stands on its own. Where the underlying mathematics or hardware architecture would help a reader go deeper, those topics are summarized briefly in-line and companion posts are signposted. The goal of this article is the Cerebras story — what was built, why it matters, and what remains uncertain. The companion posts exist for readers who want the full foundation; they are not prerequisites for following this one.

To make the ideas here accessible to a general reader, some concepts are described in simplified form. The goal is to build intuition and understanding, not to serve as a technical reference. Where simplifications have been made, they are in the service of clarity rather than completeness.


Editorial Note

This article was written by the author with assistance from an AI system and reflects his best understanding of a technically complex subject. While every effort has been made to ensure accuracy, readers who identify errors or imprecisions are encouraged to raise them. The goal is clarity and honesty, not the appearance of expertise.


Introduction


I am writing this article because, in the 40+ years that I have worked in the semiconductor industry, there has always been one holy grail that we strived toward, long thought to be unachievable. It was called Wafer Scale Integration.

I have never seen a company attempt not just to accomplish this on its own, but also to innovate along multiple axes, all while dramatically increasing the size of on-chip memory. I began researching this partly to convince myself that what this company claims to have achieved is actually real.


The Problem — What Deep Learning Actually Demands


Modern AI systems — the large language models behind tools like ChatGPT, Claude, and Gemini — are built on a foundation of linear algebra. A neural network is organized into layers of numbers, connected by learned multipliers called weights. Training and running such a network means performing matrix multiplications continuously across hundreds of layers, billions of times, over datasets containing trillions of words. A capable deployed model carries somewhere between 7 and 70 billion (10⁹) weights; a frontier model sits above a trillion (10¹²). The mathematics is not exotic. The volume is.

Two hardware requirements follow directly from this. The first is fast memory, close to the compute — the weights must move continuously from wherever they are stored into the arithmetic units that will use them, and the speed of that movement is often the binding constraint. The second is massive independent parallelism — many of the individual multiplications within a matrix operation have no dependency on one another and can be performed simultaneously.
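To make the "volume, not exotic math" point concrete, here is a minimal sketch of a neural-network layer as a matrix multiply, in plain Python. The tiny sizes and the 70-billion-weight figure used for the operation count are illustrative, not a description of any particular model.

```python
# A neural-network layer is, at its core, a matrix multiply: each output
# value is a weighted sum of all the inputs. Toy 3-input, 2-output layer
# (real layers have thousands of inputs and outputs).

def layer(x, weights):
    """Multiply input vector x by a weights matrix (given as columns)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in weights]

x = [1.0, 2.0, 3.0]
weights = [[0.1, 0.2, 0.3],   # column producing output 0
           [0.4, 0.5, 0.6]]   # column producing output 1

y = layer(x, weights)
print(y)   # [1.4, 3.2]

# The volume is the problem: a 70-billion-weight model performs roughly
# one multiply and one add per weight for every token it processes.
ops_per_token = 2 * 70e9
print(f"{ops_per_token:.1e}")  # 1.4e+11 operations per generated word
```

Nothing in that arithmetic is hard; the challenge is performing it hundreds of billions of times per second, which is exactly where the two hardware requirements above come from.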


For readers who would like to understand the mathematics of AI to a first order — scalars, vectors, matrices, how a neural network learns, and what a large language model actually is — the companion post Making Sense of AI Through Linear Algebra covers this from first principles and is worth the read. I have often felt that most public articles on AI either fail to explain this without jargon or don't make the effort to explain it in terms of mathematical constructs that more people can understand. This is my attempt at it.


The Basics


The company I am about to describe is Cerebras. Founded in 2016 with a single purpose — to build a semiconductor chip and system specifically designed to address the AI hardware infrastructure problem — Cerebras has evolved its solution over three generations to its latest chip, announced in March 2024, called the WSE-3 (Wafer Scale Engine, generation 3).

The company has raised over $2.8 billion across approximately eight rounds of venture capital funding, including a $1 billion Series H led by Tiger Global in February 2026. Its valuation has risen sharply as its technology has gained commercial validation — from approximately $8 billion in mid-2025 to $23 billion in February 2026, following the announcement of its landmark partnership with OpenAI. (Source: Cerebras Systems press releases, February 2026)

The remainder of this article answers what Cerebras did and what it means.


Wafer Scale Integration: A Brief History


In 1979–1980, Gene Amdahl — a pioneer in semiconductor development who had left IBM in 1970 to start Amdahl Corporation — departed Amdahl Corporation and embarked on a mission to achieve Wafer Scale Integration. The company he founded was called Trilogy Systems. Trilogy was the talk of the semiconductor frontier, and in their attempt to make this a reality, they decided not only to build the chip but also to construct the fabrication facility — the "fab" — that would manufacture it. They raised over $230 million (close to $1 billion in today's dollars), making them by far the most well-funded startup in Silicon Valley at the time.

To appreciate what Gene was attempting, consider that the average chip size in the late 1970s was roughly 0.25" × 0.25" (approximately 6mm × 6mm), while Trilogy's chip was designed to be 2.5" × 2.5" — about 100 times larger by area than what could be reliably manufactured at the time. The way they did that was to not dice the semiconductor wafer into many small dies, but instead use the entire silicon wafer all at once, coining the term Wafer Scale Integration. It incorporated built-in redundancy, sealed heat exchangers to handle the massive cooling requirements, and fabrication complexity that pushed well beyond the state of the art.

Trilogy's downfall came not from a single cause but from a compounding series of misfortunes — some technical, some human, and some simply bad luck. Gene Amdahl was distracted by a lawsuit following a serious car accident. Co-founder Clifford Madden was diagnosed with a brain tumor and died in 1982. The fabrication plant was damaged during construction by a winter storm. And the separate engineering teams never cohered into a unified effort. Meanwhile, Gordon Moore had articulated what became known as Moore's Law — predicting that chip density would roughly double every eighteen months through standard manufacturing progress — meaning that building a larger chip than the state of the art would simply become feasible through normal progress within a few years anyway. By 1985, Gene Amdahl was forced to pivot, and Trilogy eventually merged its remaining assets with a small startup called Elxsi.

Fast forward to 2015 — essentially 30 years later — and the dream of creating a single wafer-sized chip was considered foolhardy. As Cerebras CEO Andrew Feldman put it in an interview with The New Yorker:

"Trilogy cast a long shadow. People stopped thinking and started saying, 'It's impossible.'"

Enter Cerebras.


How Chips Are Made — and How Cerebras Broke the Rules


To appreciate what Cerebras accomplished, it helps to understand how chips are conventionally manufactured. In today's technology landscape, you start with a circular semiconductor wafer — typically 300mm (12 inches) in diameter, roughly the size of a dinner plate — and lay out a grid of identical units on its surface.


The maximum unit size, known as the reticle limit, is approximately 26mm × 33mm, or roughly 858mm². This limit is set by the photolithography equipment: a lithographic scanner can only expose that much area in a single "shot." After etching the circuit patterns onto each grid unit (called a die), the wafer is cut apart — using diamond blades, lasers, or, in advanced processes, plasma etching — yielding around 90 dies per wafer at the reticle limit, or several hundred to several thousand for smaller, more typical die sizes. Each die is then packaged to form what we commonly call an integrated circuit, or chip.

For reference, in 2015, the largest single-die chip in production measured about 600mm² — roughly 70% of the reticle limit. A decade later, the largest single dies produced by leading manufacturers measure approximately 750–800mm². After a decade of engineering progress, the industry has reached the edge of what single-exposure lithography can physically deliver — and stopped there. The reticle limit has become a ceiling that conventional chip design accepts as fixed. What Cerebras did was elegantly simple in concept and extraordinarily difficult in practice: they skipped the dicing step entirely, leaving the full wafer intact as one single, giant chip.
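The geometry behind these numbers can be checked to first order in a few lines. This sketch ignores edge loss, scribe-line spacing, and defects, so real die counts differ somewhat from the idealized figures.

```python
import math

# First-order arithmetic behind the reticle-limit numbers.
WAFER_DIAMETER_MM = 300
RETICLE_MM2 = 26 * 33          # ~858 mm^2, the single-exposure limit

wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
print(round(wafer_area))        # ~70686 mm^2 of silicon on a 300mm wafer

# How many reticle-sized fields fit (idealized; real layouts lose some
# area to the circular edge and gain some from tighter packing):
print(round(wafer_area / RETICLE_MM2))   # ~82 fields

# Cerebras keeps the (squared-off) wafer as one chip:
WSE3_MM2 = 46_225
print(round(WSE3_MM2 / 800))    # ~58x the largest conventional die
```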


Voilà — Wafer Scale Integration.



[Figure 1: Understanding what scale means in the era of Wafer Scale Integration]


In Cerebras's design, the full wafer aggregates to approximately 900,000 AI-optimized compute cores on a single chip. While the most advanced conventional accelerator dies press against the physical ceiling of roughly 800mm² — the maximum a single lithography exposure can produce — the Cerebras WSE-3 measures approximately 46,225mm². That is more than 50 times the area of the largest die conventional manufacturing can deliver.


The WSE-3 — What Was Actually Built


The two hardware demands described above — fast memory close to compute, and massive independent parallelism — have shaped the evolution of processor architecture over decades. The industry's response has followed a clear progression: from processors that handle one operation at a time, to processors that apply the same operation to many data elements simultaneously, to processors where each core operates fully independently on its own data and its own local memory. This last category — known as MIMD, or Multiple Instruction Multiple Data — is the architectural family the WSE-3 belongs to.


Memory itself is arranged in a hierarchy, from fast and expensive on-chip Static Random Access Memory or SRAM, to slower and cheaper off-chip Dynamic Random Access Memory or DRAM, and the gap between the two has historically been one of the central constraints on AI performance.


For readers who want background on the tradeoffs between SRAM and DRAM, memory hierarchies, and how processor architectures evolved to shape AI hardware design, the companion post Evolving Hardware to Meet AI Needs covers this and provides useful context for what follows.


Cerebras's answer to both demands begins with a single architectural decision: place large amounts of fast memory directly beside compute, and do it at wafer scale.

Everything else follows from that choice.


Physical Scale — A Wafer Full of Processors Acting as One


The WSE-3 occupies the entire surface of a standard 300mm silicon wafer — 46,225mm² of active silicon. The physical ceiling for a conventional accelerator die — the maximum area a lithography tool can expose in a single shot — is approximately 800mm². The WSE-3, at 46,225mm², is more than 50 times that area. It is not a larger version of a conventional chip. It is a different category of object entirely.


On that single piece of silicon reside:

  • 900,000 independent compute cores

  • 44GB of on-chip SRAM

  • 21 petabytes per second of aggregate memory bandwidth

  • Approximately 15 kilowatts of peak power draw


Those numbers are not incremental improvements over a conventional accelerator. They reflect a different architectural category.


Memory Architecture — 44GB of On-Chip SRAM


Conventional accelerators rely on off-chip DRAM for their working memory, which carries an access latency of hundreds of clock cycles per read — a consequence of DRAM's array architecture, its refresh overhead, and the contention that arises when many cores compete to pass data across shared interconnects. The WSE-3 instead integrates 44 gigabytes of SRAM directly on the wafer, distributed evenly across its entire surface. Each compute core sits beside its own slice of local SRAM.

This changes the memory equation directly. An on-chip SRAM access costs one to a few clock cycles, rather than the hundreds a DRAM-based system might require. There are no refresh cycles, as there are in DRAM, to consume bandwidth. Both access latency and bandwidth improve, and the pipe connecting memory to compute is as wide as the processor needs.
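To see why bandwidth is "the binding constraint," consider the time to stream the wafer's full 44 GB through the arithmetic units once. The off-chip figure below is an assumed HBM-class bandwidth chosen for illustration, not a specific product specification.

```python
# Time to move 44 GB of weights/activations into compute, once.
BYTES = 44e9

sram_bw = 21e15   # WSE-3 aggregate on-wafer SRAM bandwidth, bytes/s
hbm_bw = 8e12     # assumed off-chip HBM-class bandwidth, bytes/s

t_sram = BYTES / sram_bw
t_hbm = BYTES / hbm_bw
print(f"on-wafer SRAM: {t_sram * 1e6:.1f} microseconds")  # ~2.1 us
print(f"off-chip DRAM: {t_hbm * 1e3:.1f} milliseconds")   # ~5.5 ms
print(f"ratio: ~{t_hbm / t_sram:.0f}x")                   # ~2625x
```

The three-orders-of-magnitude gap is the whole argument for putting memory beside compute.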


Compute Architecture — MIMD at Scale


The wafer contains 900,000 active compute cores, each executing its own instruction stream on its own data — a Multiple Instruction, Multiple Data (MIMD) architecture, as described in the post, Evolving Hardware To Meet AI Needs. Unlike SIMD systems, where all cores execute the same instruction simultaneously, a MIMD design allows each core to operate independently. Neural networks — particularly during training — involve irregular communication and synchronization patterns. Independence reduces idle waiting.

Each core is small — approximately 0.05mm² — and tightly coupled to local SRAM. The result is not just parallelism, but locality: computation happens where the data already resides.


Scribe-Line Stitching — Making the Wafer Behave as One Die


Lithography equipment cannot expose a 300mm wafer in a single shot. The wafer necessarily spans multiple reticle fields. In conventional manufacturing, these fields are separated by scribe lines — narrow strips of silicon that are later cut and discarded when the wafer is diced into individual dies.

Cerebras did not dice the wafer.

Instead, they worked with TSMC to route approximately 20,000 metal interconnect wires across each scribe line, electrically stitching adjacent reticle fields together.

This is not a cosmetic detail. It is what makes wafer-scale integration electrically viable. From the perspective of any compute core, communicating across a reticle boundary is as fast as communicating within one. The wafer behaves as a single, continuous chip. Without this stitching, the wafer would be a collection of isolated regions. With it, it becomes a unified compute fabric.



[Figure 2: Understanding a Scribe Line]


Manufacturing Yield — Designing Around Defects


In conventional chip manufacturing, defective dies are discarded. When the chip is the entire wafer, that option disappears.

Each fabricated wafer contains approximately 970,000 physical cores:

  • 900,000 are activated

  • Roughly 70,000 serve as distributed spares

After fabrication, the wafer is mapped for defects. Faulty cores are bypassed, and spare cores are activated in their place. The interconnect fabric is dynamically reconfigured to route around imperfections. The yield problem is not eliminated — it is engineered around. From the perspective of software, every shipped wafer presents a clean, fully functional 900,000-core surface.
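The bookkeeping behind defect mapping can be sketched in miniature. This is a toy one-dimensional illustration of the idea only; the real remapping operates on the wafer's two-dimensional interconnect fabric, and the function name and numbers here are hypothetical.

```python
# Toy sketch of defect mapping: mark faulty cores, then draw on spares
# so software always sees a fixed number of good cores.

def build_core_map(physical_cores, faulty, target):
    """Return indices of 'target' working cores, skipping faulty ones."""
    good = [i for i in range(physical_cores) if i not in faulty]
    if len(good) < target:
        raise RuntimeError("not enough spares to cover defects")
    return good[:target]

# 10 physical cores, 8 to be exposed, cores 2 and 5 defective:
core_map = build_core_map(10, faulty={2, 5}, target=8)
print(core_map)   # [0, 1, 3, 4, 6, 7, 8, 9]
```

Scale the same idea up to 970,000 physical cores with a 900,000-core target, and every shipped wafer presents the same clean surface regardless of where its defects fell.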


Power and Cooling — The Physical Consequence


At full load, the WSE-3 consumes approximately 15 kilowatts — an order of magnitude greater than a conventional accelerator card. This required:

  • A custom vertical power delivery system

  • Direct liquid cooling integrated into the assembly

  • A mechanical enclosure closer to industrial equipment than server hardware

The chip is not just large in area; it is dense in power.




[Figure 3: The Engine Block showing the power cooling tubes and vertical power delivery system]


To put the WSE-3's power draw in context, it helps to compare it with a state-of-the-art GPU. Many of these numbers carry caveats (peak versus average power, load conditions, and so on), but for a first-order understanding it is fair to say that a single GB200 Superchip, Nvidia's Grace Blackwell unit comprising one Grace CPU and two Blackwell GPUs, draws approximately 2.7 kW. The WSE-3 therefore consumes about six times the power of the GB200. But that is only one side of the picture. Because the WSE-3 is much larger and carries far more compute capability, a more appropriate measure is performance per unit of power.


Looking at both the power drawn and the floating-point operations per second (FLOPS) delivered, the WSE-3 produces 125 × 10¹⁵ FLOPS while consuming 15 kW, or about 8.3 × 10¹⁵ FLOPS per kW. Correspondingly, the GB200 delivers roughly 40 × 10¹⁵ FLOPS while consuming approximately 2.7 kW, or about 14.8 × 10¹⁵ FLOPS per kW.


So in this comparison, the GPU-based chip delivers nearly twice as many FLOPS per kW. This should make sense: the GB200 is far smaller and is optimized for densely packed compute elements.

(Source: Nvidia GB200 product specifications; Cerebras WSE-3 product page)

What Makes This Architecturally Distinct

The WSE-3 is not simply a larger GPU. It represents a different answer to a specific question: if AI computation is constrained by how fast weights can move from memory into arithmetic units, what happens when memory is no longer off-chip?

By integrating 44GB of SRAM directly onto a wafer containing 900,000 independent cores — and by electrically stitching reticle fields into a single fabric — Cerebras removes the traditional boundaries between die, package, and memory hierarchy. Whether this architectural bet becomes commercially dominant remains to be seen.


From Chip to System — The CS-3 and the Max Cluster


The Gap Between a Chip and a Deployable System


A wafer-scale processor does not fit into a conventional accelerator slot. Once the silicon area approaches the size of an entire wafer and power consumption reaches kilowatt levels, the problem shifts from chip design to system design.

The WSE-3, therefore, cannot be considered in isolation. It exists as part of an integrated system. As explained in Making Sense of AI Through Linear Algebra, the WSE-3 is an accelerator: a specialist parallel compute engine that handles the matrix arithmetic of AI training and inference, paired with a host CPU cluster that handles everything an accelerator cannot — orchestration, data loading, training loop management, and I/O. In addition to that host CPU relationship, the WSE-3 needs to be powered and cooled reliably, connected to external memory large enough to hold the full weight matrices of a frontier AI model, and networked with other systems for the largest training runs in the world.

The CS-3 is Cerebras's answer to everything except the host CPU: a standard 19-inch rack-mountable unit that houses the WSE-3 chip along with its power delivery system, water cooling infrastructure, and high-speed I/O connections. Think of it as the accelerator system — the chip, its life support, and its interface to the world — ready to be paired with a host and put to work.


The CS-3 — One System, One Chip


A single CS-3 delivers 125 petaflops of peak AI compute — where one petaflop means 10¹⁵ floating-point operations per second. That number is not the most important thing about it.

The CS-3's most important capability is how it handles the memory problem for frontier AI models. Cerebras's MemoryX external memory system connects directly to the CS-3 and scales up to 1.2 petabytes (1.2 × 10¹⁵ bytes) of storage — enough to hold the full weight matrices of models far larger than anything currently deployed. Rather than requiring the weights to be partitioned across thousands of accelerator chips with the synchronization overhead that entails, a single CS-3 paired with MemoryX can hold and serve an entire model from one location.
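To put 1.2 petabytes in perspective, consider the storage the weights alone require, assuming 2 bytes per weight (16-bit precision), a common though not universal choice. The trillion-weight figure is the frontier-model scale mentioned earlier, not a specific model.

```python
# Storage for weights alone, at 2 bytes per weight (16-bit precision).
frontier_weights = 1e12             # a ~1-trillion-weight frontier model
print(frontier_weights * 2 / 1e12)  # 2.0 terabytes of weights

# MemoryX scales to 1.2 petabytes, so a single CS-3 + MemoryX pair could
# in principle hold a model hundreds of times larger:
print(1.2e15 / (frontier_weights * 2))  # 600.0 -- headroom factor
```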

The practical consequence is a meaningful reduction in software complexity. The distributed training frameworks, gradient synchronization logic, and inter-chip communication management that GPU clusters require — typically tens of thousands of lines of specialized code — are largely unnecessary. Training a model on a CS-3 can be expressed in a few hundred lines of standard PyTorch. The host CPU handles what it always handles; the CS-3 handles the rest without requiring the programmer to manage parallelism manually.

Once the decision is made to place tens of gigabytes of SRAM and nearly a million cores on a full 300mm wafer, the dominant engineering problem shifts from transistor design to physical survivability. A silicon wafer is thin and mechanically fragile; at this scale, it cannot simply be packaged like a conventional die. Delivering kilowatts uniformly across its surface requires low-impedance power distribution to avoid voltage droop and localized heating. That heat cannot be removed with air alone, so liquid cooling becomes a practical necessity rather than a performance enhancement. The resulting thermal gradients introduce mechanical stress due to differential expansion between silicon, interconnect metals, substrates, and cooling structures. At wafer dimensions, even small mismatches accumulate. The chip, therefore, exists only as part of a tightly integrated electro-mechanical system in which power delivery, cooling, structural support, and reliability are inseparable from the silicon itself. All these problems had to be solved.


In the previous section, we discussed the power consumed by the WSE-3 and compared it with a corresponding GPU chip. When we make the same comparison at the system level, the numbers change.


To make the corresponding case at the system level, the CS-3 delivers 125 × 10¹⁵ FLOPS while consuming 23 kW, or about 5.4 × 10¹⁵ FLOPS per kW. Correspondingly, Nvidia's GB200 NVL72 rack delivers roughly 360 × 10¹⁵ FLOPS while consuming approximately 120 kW, or about 3 × 10¹⁵ FLOPS per kW.


Notice that at the system level the ratio reverses: the CS-3 generates nearly twice as many FLOPS per kW as the GPU-based system.

(Source: Nvidia GB200 NVL72 product page and HPE QuickSpecs; Cerebras CS-3 product specifications; Cerebras CS-3 vs. Nvidia DGX B200, cerebras.ai, 2024)
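The efficiency arithmetic at both levels can be spelled out in a few lines, using only the first-order figures quoted in this article, to show where the ratio reverses.

```python
# Performance per unit of power at the chip level and the system level.
def eff(flops, kw):
    """Petaflops delivered per kilowatt consumed."""
    return flops / 1e15 / kw

# Chip level: the GB200 Superchip comes out ahead.
print(round(eff(125e15, 15), 1))    # WSE-3  : 8.3 petaflops/kW
print(round(eff(40e15, 2.7), 1))    # GB200  : 14.8 petaflops/kW

# System level: the CS-3 comes out ahead.
print(round(eff(125e15, 23), 1))    # CS-3   : 5.4 petaflops/kW
print(round(eff(360e15, 120), 1))   # NVL72  : 3.0 petaflops/kW
```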


The Price Question


No official price has been confirmed by Cerebras for either the WSE-3 chip or the CS-3 system. Industry observers and analyst estimates put the CS-3 system in the range of $2–3 million per unit, though Cerebras has declined to publicly confirm this figure. (Source: Estimate from Industry Analysts at TheNextPlatform; Cerebras has not confirmed official pricing)

At that price point, the CS-3 is unquestionably a premium product — accessible to national laboratories, large research institutions, hyperscalers (companies such as Amazon Web Services, Google Cloud, and Microsoft Azure), and frontier AI labs — but beyond the reach of most organizations. This is a real constraint on Cerebras's addressable market, and one the company is addressing through its cloud inference API, which opens access to Cerebras compute on a pay-per-token basis without requiring a capital purchase.


The Max Cluster — 2,048 CS-3 Systems


Cerebras's SwarmX interconnect fabric links up to 2,048 CS-3 systems together, presenting the entire cluster to the programmer as a single unified computing device. This is the Cerebras Max Cluster.

A fully configured Max Cluster delivers approximately 256 exaflops of aggregate AI compute, where one exaflop (10¹⁸ floating-point operations per second) equals 1,000 petaflops, roughly the estimated computational power of the human brain. To put 256 exaflops in physical terms: the US Department of Energy's Frontier supercomputer — housed at Oak Ridge National Laboratory — achieved the world's first exaflop in 2022, filling an entire football-field-sized facility and consuming 20 megawatts of power. The Cerebras Max Cluster delivers 256 times that compute. (Source: Wikipedia, Exascale computing; Cerebras Systems product specifications)


Considering the power consumed by this kind of cluster: a fully configured Max Cluster draws approximately 47 MW at peak load. An equivalent GPU-based cluster, built to the same compute capability, would consume roughly 216–240 MW, approximately five times as much. So as you scale this solution, the power consumption works in your favor.
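The cluster-level figures follow directly from the per-system numbers quoted earlier, and are easy to sanity-check:

```python
# Aggregate compute and power for a fully configured Max Cluster.
CS3_PETAFLOPS = 125
CS3_KW = 23
SYSTEMS = 2048

total_exaflops = SYSTEMS * CS3_PETAFLOPS / 1000
print(total_exaflops)          # 256.0 exaflops of aggregate AI compute

total_mw = SYSTEMS * CS3_KW / 1000
print(round(total_mw, 1))      # 47.1 MW at peak, matching the ~47 MW figure
```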


The Apples-to-Apples Cost Comparison


Meta's AI infrastructure — built around approximately 350,000 Nvidia H100 GPUs — cost approximately $25 billion and delivers roughly 1 zettaflop (10²¹ floating-point operations per second) of aggregate compute, where one zettaflop equals 1,000 exaflops. (Source: Estimates from industry analyst reports in TheNextPlatform, 2024)

A fully built Cerebras Max Cluster of 2,048 CS-3 systems, at the estimated $2–3 million per unit, would cost approximately $4–6 billion and deliver approximately 256 exaflops: one quarter of Meta's cluster compute, at roughly one fifth to one quarter of the cost. On a raw compute-per-dollar basis, the two approaches are broadly comparable.

But raw compute per dollar tells only part of the story. The hidden costs of operating a cluster of hundreds of thousands of GPUs are substantial: the network infrastructure connecting them, the engineering teams required to write and maintain distributed training frameworks, the idle time from gradient synchronization delays, and the months of debugging that large distributed systems inevitably require. None of those costs appear in the hardware figure — but all of them are real, and all of them are meaningfully reduced on a CS-3 cluster.

The honest summary: the Cerebras Max Cluster offers comparable compute, at comparable cost, with substantially less engineering complexity, faster time-to-trained-model, and a fundamentally different architectural approach that addresses the memory bottleneck that constrains GPU clusters on the workloads that matter most for AI.


The Competition


The Challengers — Different Bets, Mixed Outcomes

Before discussing Nvidia — the dominant incumbent — it is worth understanding the landscape of architectural challengers that emerged alongside Cerebras, each placing a different bet on how to solve the AI hardware problem.

Graphcore built its Intelligence Processing Unit (IPU) around a bulk synchronous parallel architecture — a genuinely innovative approach that showed promise in research settings but struggled to find production-scale adoption. After reaching a peak valuation of approximately $2.8 billion, Graphcore was acquired by SoftBank in July 2024 and folded into their broader AI and hardware portfolio alongside ARM.

Groq optimized aggressively for inference latency, building a deterministic Language Processing Unit (LPU) that eliminated the unpredictability of GPU-based inference. It demonstrated remarkable speeds on specific workloads and reached a peak valuation of $6.9 billion. In December 2025, Nvidia acquired Groq's core assets for approximately $20 billion in a deal structured as a non-exclusive licensing agreement — with Groq's founder, CEO, and approximately 80% of its engineering team joining Nvidia — while Groq continued to operate nominally as an independent company, retaining its GroqCloud inference service.

SambaNova built its Reconfigurable Dataflow Unit (RDU) around a dataflow architecture well suited to running very large models efficiently on premises. It continues to operate independently, focused primarily on government, national laboratory, and enterprise customers. It remains a credible but niche player.

The pattern across these three companies is instructive. Each identified a real architectural limitation in Nvidia's approach and built a genuinely innovative solution. Each raised significant funding and demonstrated real performance advantages in specific workloads. And yet none achieved the scale necessary to mount a sustained challenge. Building a semiconductor company capable of challenging an incumbent with Nvidia's scale, software ecosystem, and manufacturing relationships is an extraordinarily difficult undertaking — and the history of well-funded challengers is a sobering reminder of that reality.


NVIDIA and AMD — The Incumbent and the Established Challenger


Against this backdrop, Nvidia's position rests on two foundations worth distinguishing carefully.

The first is hardware at scale. NVIDIA has invested decades building manufacturing and supply chain relationships with TSMC that no startup can replicate quickly. Their Blackwell architecture and the forthcoming Rubin platform represent the state of the art in GPU performance, and their NVLink interconnect fabric allows thousands of GPUs to work together.

The second — and more durable — foundation is software. NVIDIA's CUDA platform has become so deeply embedded in AI development over fifteen years that nearly every major AI framework, research codebase, and production pipeline is optimized for it. Switching away from CUDA requires rewriting code, retraining engineers, revalidating results, and accepting performance risk. This software ecosystem is what keeps customers on Nvidia even when alternative hardware demonstrates compelling advantages.

AMD occupies the role of established hardware challenger — offering competitive raw performance and growing compatibility with PyTorch and TensorFlow workflows, without requiring a fundamental architectural departure.


The Frontier AI Labs — and Their Enormous Commitment to Nvidia


The companies that develop and train the world's most capable AI models — among them OpenAI, Anthropic, Google DeepMind, Meta AI, Mistral, and xAI — have made staggering commitments to Nvidia hardware. Nvidia announced in 2025 a commitment to support OpenAI as it builds at least 10 gigawatts of Nvidia systems — equivalent to between 4 and 5 million GPUs. (Source: Nvidia press release, September 2025)

The aggregate financial commitment of frontier AI labs to Nvidia hardware is measured in the hundreds of billions of dollars — a level of investment that creates enormous switching costs and makes Nvidia's position appear difficult to displace.


The OpenAI-Cerebras Deal — A Signal Worth Noting


On January 14, 2026, OpenAI signed a multi-year agreement with Cerebras to deploy 750 megawatts of Cerebras wafer-scale systems — coming online in multiple tranches through 2028 — in a deal worth more than $10 billion. (Source: industry sources familiar with the matter)

Sachin Katti of OpenAI stated:


"OpenAI's compute strategy is to build a resilient portfolio that matches the right systems to the right workloads. Cerebras adds a dedicated low-latency inference solution to our platform. That means faster responses, more natural interactions, and a stronger foundation to scale real-time AI to many more people."


And from Andrew Feldman, CEO of Cerebras:


"We are delighted to partner with OpenAI, bringing the world's leading AI models to the world's fastest AI processor. Just as broadband transformed the internet, real-time inference will transform AI, enabling entirely new ways to build and interact with AI models."


Two things in these statements deserve attention. First, OpenAI is explicitly positioning this as an inference partnership — not training. Second, OpenAI frames it as portfolio diversification rather than a departure from Nvidia.

There are two reasons this inference-first positioning makes sense, and they reinforce each other. The first is switching cost: OpenAI's training infrastructure is deeply embedded in Nvidia's CUDA ecosystem, and moving training workloads to a new architecture carries real engineering risk. Inference is a lower-stakes entry point for a new hardware relationship. The second reason is commercial: inference is the revenue engine. Every token delivered to a user is a billable event. Training the next model is a capital expenditure — an internal cost that produces a future asset, but does not itself generate revenue. When OpenAI routes live customer traffic through Cerebras hardware, faster inference at lower cost per token improves margins on every query served, immediately and measurably.

It is worth noting that Cerebras's architectural advantages — the MIMD design, the 44GB of on-chip SRAM, the elimination of inter-chip communication overhead — are, if anything, more pronounced for training than for inference. The reason OpenAI chose to deploy Cerebras first for inference rather than training is almost certainly commercial rather than technical: inference latency is immediately visible to hundreds of millions of ChatGPT users, making it the highest-impact place to demonstrate speed advantages publicly. Training runs happen behind closed doors and take months; inference happens billions of times a day in real time.

OpenAI is also just one frontier AI lab among many. Anthropic, Google DeepMind, Meta, xAI, and Mistral have not yet made equivalent public commitments to Cerebras hardware. The full test of Cerebras's commercial viability lies in whether those commitments follow.


Conclusion — An Open Question, and a Name to Watch


Forty years is a long time for an idea to wait. Wafer scale integration was not forgotten; it was remembered as a cautionary tale, a symbol of ambition outrunning engineering reality. The industry did not abandon it because the idea became useless, but because the last attempt failed spectacularly, expensively, and publicly, casting a shadow long enough to deter a generation of engineers and investors from asking whether the failure was permanent or merely premature. Cerebras decided that one dramatic data point was not enough to close the question. What Andrew Feldman and his team built between 2016 and 2024 was not a single breakthrough but a carefully orchestrated stack of innovations, each one making the next possible.


Wafer scale integration required solving power delivery, which meant routing current vertically through the assembly rather than laterally across a board. It required solving inter-die communication, which meant working with TSMC to stitch metal interconnects across scribe lines that had never carried signals before. It required solving manufacturing yield, which meant making cores small enough that defects became cheap, building redundancy into the wafer fabric, and writing software that could remap around failures at boot time. Underlying all of it was a single clear architectural conviction: the memory wall is a central constraint in AI computing, and a MIMD architecture with on-chip SRAM beside every compute core addresses it at the scale frontier AI demands.


The result is a chip more than 50 times the area of the largest die conventional lithography can produce in a single exposure, with 44 gigabytes of on-chip SRAM against the handful of megabytes a conventional accelerator core can access locally, and memory bandwidth measured in petabytes per second. It allows a single rack-mounted system to train models of up to 24 trillion parameters without the distributed computing complexity that GPU clusters require.


None of that guarantees that Cerebras wins. OpenAI has committed around ten billion dollars to deploy Cerebras hardware for inference workloads — a meaningful validation from one of the world's most prominent frontier AI labs — but OpenAI is one lab among many. Anthropic, Google DeepMind, Meta, xAI, and Mistral have not yet made equivalent public commitments. Training workloads, where Cerebras's architectural advantages are arguably even more pronounced than in inference, remain dominated by Nvidia, and the CUDA software ecosystem is still the most durable competitive moat in the industry. The estimated $2–3 million per CS-3 unit keeps Cerebras out of reach for all but the largest and best-funded organizations, and the history of AI hardware challengers — Graphcore acquired, Groq absorbed, others still searching for scale — is a standing reminder that technological excellence and commercial success are not the same thing.


What Cerebras has demonstrated is that wafer-scale integration works in practice: the three hard problems can be solved, a 300mm wafer can become a single functioning chip, and that chip can outperform clusters of thousands of conventional GPUs on the AI workloads it targets most directly. That demonstration alone ranks among the most significant engineering achievements in the semiconductor industry in the past twenty years, and the market has reflected it, with Cerebras's valuation rising from $8 billion to $23 billion in the six months following the OpenAI partnership announcement.


Whether that achievement translates into a generation-defining company will depend on manufacturing scale, software ecosystem development, continued customer wins beyond OpenAI, and whether the frontier AI labs that have not yet committed decide that Cerebras's performance advantages justify the switching costs and the price. Those questions are genuinely open, and anyone who claims to know the answers is overconfident. What is not in doubt is what Cerebras has already accomplished — and on that point, the engineering record speaks clearly.



The semiconductor industry spent forty years saying wafer scale integration was impossible. Cerebras spent eight years proving it wasn't — and the world's most powerful AI lab just wrote them a ten billion dollar check to say thank you.



Appendix A: How Can So Much Computation Fit Into 0.05mm²?


Source: Cerebras Engineering Blog, March 2025; Cerebras WSE-3 Product Page

A natural question arises from the manufacturing yield section: if each WSE-3 core is so tiny at 0.05mm², how can it be computationally useful at all?

The answer lies in what each core is — and crucially, what it is not. A conventional CPU or GPU core is a general-purpose computing engine. It must handle any instruction a programmer might throw at it — branching logic, floating point arithmetic of many types, memory management, cache coherence, branch prediction, and dozens of other capabilities. All of that generality requires silicon area. A modern Nvidia streaming multiprocessor is large precisely because it must be flexible.

A Cerebras WSE core, by contrast, is not general-purpose. Each core is independently programmable but narrowly optimized for tensor-based, sparse linear algebra operations — the specific class of matrix multiplication and addition that underpins virtually every neural network computation. It does not need branch prediction hardware, complex cache hierarchies, or the general-purpose instruction set of a CPU. It needs to multiply, accumulate, and pass results to its neighbors — and it does those three things with extreme efficiency in a very small area of silicon.
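To build intuition for "multiply, accumulate, and pass results to a neighbor," here is a deliberately simplified sketch in Python. This is not Cerebras's actual microarchitecture or instruction set — it is a toy model showing how a chain of cores, each holding one weight, can compute a dot product by forwarding a running partial sum from one neighbor to the next.

```python
# Toy illustration only -- not Cerebras hardware or its programming model.
# Each "core" in the chain holds one weight. An activation arrives, the core
# multiplies and accumulates, then passes the partial sum to its neighbor.

def chain_dot_product(weights, activations):
    """Compute a dot product as a chain of multiply-accumulate steps."""
    partial_sum = 0
    for core_weight, activation in zip(weights, activations):
        partial_sum += core_weight * activation  # multiply + accumulate
        # Handing partial_sum to the next loop iteration stands in for
        # "pass the result to the neighboring core."
    return partial_sum

print(chain_dot_product([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

The real wafer runs hundreds of thousands of such cores in parallel across two dimensions, but the essential operation each core performs is no more complicated than the loop body above — which is why so little silicon area suffices.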

Furthermore, each core does not need to carry large local memory. The WSE-3's 44GB of on-chip SRAM is distributed evenly across the entire wafer surface, so each core has just enough fast memory immediately beside it, with single-clock-cycle access at extremely high aggregate bandwidth.

It is also worth noting an important distinction from conventional chip design. When a die stands alone as a conventional packaged chip, it must carry significant overhead circuitry to manage its relationship with the outside world — I/O serializers and deserializers, memory controllers, power management circuits, error correction logic for signals traveling long distances off-chip, and interrupt controllers to communicate with the rest of the system. All of this overhead exists because a conventional chip is an island — it must manage every interaction with the world outside its package boundary.

On the Cerebras wafer, each die is not an island — it is a node in a fabric. It has direct metal connections to its neighbors through the scribe line stitching. It shares a unified power delivery system. It has on-chip SRAM immediately beside it. It receives control signals through the overlay network. The external world does not exist at the die level — only at the wafer level. All of that island-management overhead is eliminated, freeing silicon area for what actually matters: compute.
