
Evolving Hardware to Meet AI Needs

  • Writer: Kumar Venkatramani
  • Feb 24
  • 21 min read

Updated: Feb 24

This article is a companion to Wafer Scale Integration — The Cerebras Story. It can be read on its own — but if you arrived here from that article, this is the deeper treatment of memory and processor architecture it pointed you toward.


Note to Reader

This article is written for the technically curious, not the technically specialist. You do not need a degree in computer science or electrical engineering to follow what is written here — though if you have one, you may find some descriptions in more depth than most common publications provide. What you do need is an appetite for understanding how things work, and a willingness to engage with the ideas that follow.


This article stands on its own. It opens with a brief recap of the AI mathematics that motivates the hardware discussion — enough to follow what comes next without requiring any prior reading. Readers who want the full mathematical foundation will find it in Making Sense of AI Through Linear Algebra, but it is not a prerequisite for this article.


To make the ideas here accessible to a general reader, some concepts are described in simplified form. The goal is to build intuition and understanding, not to serve as a technical reference. Where simplifications have been made, they are in the service of clarity rather than completeness.


Editorial Note

This article was written by the author with assistance from an AI system and reflects his best understanding of a technically complex subject. While every effort has been made to ensure accuracy, readers who identify errors or imprecisions are encouraged to raise them. The goal is clarity and honesty, not the appearance of expertise.


The Mathematics Behind AI — What This Post Builds On


Modern AI — specifically the deep learning networks behind tools like ChatGPT, Claude, and Gemini — is built on linear algebra. A neural network is organized into layers, each represented as a matrix of numbers. Between those layers sit the weights: learned multipliers that determine how strongly the output of each element in one layer influences each element in the next. The operation connecting each layer to the next is a matrix multiplication, expressed as:


C[i,j] = Σₖ A[i,k] × B[k,j]


where A is the current layer's values, B is the weight matrix, and C becomes the input to the next layer. Training the network means adjusting those weights iteratively — feeding data in, measuring the error, and nudging the weights slightly in the right direction — until the error falls below an acceptable threshold. The scale of this process is what makes the hardware question consequential: a capable deployed model today carries somewhere between 7 and 70 billion (10⁹) weights; a frontier model sits above a trillion (10¹²) weights. A single forward pass — generating one word of a response — involves billions of individual multiply-and-add operations. Training requires doing this billions of times across datasets containing trillions of words.


The mathematics is not exotic. The volume is.
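The core operation can be written as a short loop. Below is a minimal sketch in plain Python, purely illustrative; real AI frameworks run this through heavily optimized kernels, not interpreted loops:

```python
# Naive matrix multiplication: C[i,j] = sum over k of A[i,k] * B[k,j]
# A direct transcription of the formula above.

def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):              # the sum over k in the formula
                C[i][j] += A[i][p] * B[p][j]
    return C

layer = [[1.0, 2.0]]                        # a 1x2 "activation" row
weights = [[3.0, 4.0], [5.0, 6.0]]          # a 2x2 weight matrix
print(matmul(layer, weights))               # [[13.0, 16.0]]
```

Scale those three loops up to matrices with tens of thousands of rows and columns, repeated across dozens of layers, and the volume problem described above comes into focus.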


For readers who would like more depth on the mathematics, Making Sense of AI Through Linear Algebra covers it fully.


Why Computer Architecture Matters Here


To understand how the computation described in Post 1 is actually performed — and why doing it efficiently at scale is a hard problem — it helps to understand some basic computer architecture. This section is not intended as a comprehensive treatment of the subject; it focuses on the concepts most relevant to understanding AI hardware, and simplifies where necessary to keep the ideas accessible. Readers who want to go deeper will find pointers at the end of this post.

Every computer, at its most fundamental level, consists of two large blocks: a processor that performs computation, and memory that stores the data and instructions the processor needs. This separation was formalized by mathematician John von Neumann in the late 1940s, and it remains the foundation on which virtually every practical computing system is built. The two threads this post follows — how memory evolved, and how processor architecture evolved — both trace their origins back to that same two-block model. We will take each in turn, starting with memory, because understanding what memory can and cannot do is the essential context for understanding why processor architectures developed the way they did.


Memory — Speed, Cost, and the Hierarchy


There is a hierarchy of memory in every computer, arranged by speed and cost. At the top sits the fastest memory — small in capacity, expensive, and physically close to the processor. At the bottom sits the slowest — vast in capacity, cheap, and capable of holding data for years without power. Understanding that hierarchy is critical to understanding why architecture matters.


Before climbing the hierarchy rung by rung, one distinction shapes everything else: the difference between volatile and non-volatile memory.


Volatile memory holds data only as long as power is applied. The moment the electricity stops, the data vanishes. Non-volatile memory retains its contents without power — an SSD or a hard drive can sit unplugged for years and still hold everything it stored. The fast tiers at the top of the hierarchy are volatile. The slow storage tiers at the bottom are non-volatile. Within the volatile category, the most important distinction for understanding AI hardware is the difference between two technologies: SRAM and DRAM.


SRAM — Static Random Access Memory — stores each bit using six transistors. That circuit is inherently stable: it holds its state without any external intervention for as long as power is applied. The cost of that stability is silicon area: six transistors per bit makes SRAM expensive to manufacture, which is why it usually appears only in relatively small quantities, physically close to the processor.

DRAM — Dynamic Random Access Memory — stores each bit using just one transistor and one capacitor. One transistor per bit instead of six means far denser and cheaper memory, which is why DRAM is used for the main memory in virtually every computer — the 8, 16, or 32 gigabytes in a laptop or desktop. But that single capacitor has a fundamental physical weakness: it leaks its charge over time, like a bucket with a small hole at the bottom. To prevent the data from disappearing, the memory must be periodically rewritten in a continuous background process called a refresh cycle, which adds a small but persistent overhead and reduces effective bandwidth.
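The refresh overhead can be estimated from two timing numbers. The figures below are typical DDR4-era values assumed for illustration, not taken from any particular datasheet:

```python
# Rough estimate of DRAM refresh overhead.
# Assumed typical values: one refresh command roughly every
# 7.8 microseconds (tREFI), each occupying the memory for
# about 350 nanoseconds (tRFC).

tREFI_ns = 7800   # average interval between refresh commands
tRFC_ns = 350     # time each refresh keeps the memory busy

overhead = tRFC_ns / tREFI_ns
print(f"~{overhead:.1%} of DRAM time spent refreshing")   # ~4.5%
```

A few percent sounds small, but it is paid continuously, on top of the access latency discussed next.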


DRAM's higher latency arises from its array architecture — rows must be activated, sensed, and precharged before access — while SRAM cells are directly addressable flip-flop structures (a particular stable arrangement of transistors). A modern SRAM cache (specifically the Level 1 cache) can deliver data to the processor in one to four nanoseconds (10⁻⁹ seconds — billionths of a second). DRAM has a typical access latency of 60 to 100 nanoseconds — roughly twenty to fifty times slower. To put that in human terms: if an SRAM access took one second, the equivalent DRAM access would take almost a full minute. Translated into processor clock cycles, an SRAM access costs roughly four to twelve cycles; a DRAM access costs 200 to 300 cycles — cycles during which the processor sits idle, waiting. This waiting is so pervasive that engineers gave it a name: the memory wall.
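The cycle counts above are simple arithmetic: latency in nanoseconds multiplied by clock frequency in gigahertz, since one cycle at f GHz lasts 1/f nanoseconds. A sketch:

```python
# Converting access latency to idle clock cycles at a given
# clock frequency -- the arithmetic behind the figures above.

def cycles(latency_ns, clock_ghz):
    # one cycle at f GHz lasts 1/f nanoseconds,
    # so latency_ns nanoseconds spans latency_ns * f cycles
    return latency_ns * clock_ghz

print(cycles(1, 4))     # best-case SRAM at 4 GHz: 4 cycles
print(cycles(100, 3))   # worst-case DRAM at 3 GHz: 300 cycles
```

The same conversion underlies the "Clock cycles" column in the table later in this section.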


Now let's look at the whole memory hierarchy.


The Memory Hierarchy


CPU Registers. The topmost rung. A tiny number of storage locations built directly into the processor, holding the values it is actively computing with at this exact moment. Access is instantaneous — registers are typically available within one clock cycle at full processor frequency. Measured in bytes rather than gigabytes.

SRAM — Static Random Access Memory. Just below registers. Fast, stable, and volatile. Six transistors per bit makes it expensive in silicon area, so it appears in relatively small quantities as the processor's on-chip cache — the working scratchpad the processor checks before reaching out to slower memory.

DRAM — Dynamic Random Access Memory. The main memory of a computer. One transistor per bit makes it far denser and cheaper than SRAM, but the leaky capacitor design requires continuous refresh cycles, and the array architecture adds access latency. This access latency — measured in tens to hundreds of nanoseconds — and the refresh overhead together constitute the dominant memory bottleneck in most computing systems.

SSD — Solid State Drive. The first non-volatile tier. The storage medium inside virtually every SSD is NAND flash memory — a technology that locks electrical charge inside microscopic cells to represent data. Unlike DRAM's leaky capacitor, that charge stays in place without power. Access latency is measured in microseconds (10⁻⁶ seconds — millionths of a second) — orders of magnitude slower than DRAM for individual reads, but fast enough for storing and retrieving files, programs, and large datasets.

HDD — Hard Disk Drive. The slowest storage in common use. Data is stored as magnetic patterns on spinning metal platters — a fundamentally mechanical technology. HDDs remain the most cost-effective option for very large storage volumes, and the magnetic patterns persist without power indefinitely, but seek times measured in milliseconds (10⁻³ seconds) make them orders of magnitude slower than any tier above.

The full hierarchy, side by side, with DRAM as the baseline — since it is the memory tier that conventional AI accelerators use as their primary working memory:


| Memory type | Typical capacity | Access latency | Clock cycles (3–4 GHz) | Speed vs DRAM |
| --- | --- | --- | --- | --- |
| CPU Registers | Bytes | < 1 ns | 1 | ~200–300× faster |
| SRAM (on-chip cache) | Tens of MB | 1–4 ns | 4–12 | ~20–50× faster |
| DRAM (main memory) | 8–64 GB | 60–100 ns | 200–300 | Baseline |
| SSD (flash storage) | 256 GB–4 TB | 50–100 µs | 150,000–400,000 | ~500–1,500× slower |
| HDD (hard disk) | 1–20 TB | 5–10 ms | 15–40 million | ~50,000–150,000× slower |

Table 1: Memory hierarchies and their tradeoffs


Emerging Memory Technologies


The five tiers above represent the memory landscape most people encounter in computing systems today, but the boundaries are not fixed. A number of newer technologies occupy the gaps between these rungs, each attempting a different trade-off between speed, cost, and non-volatility. MRAM (Magnetoresistive RAM), for example, stores data using magnetic states rather than electrical charge, making it non-volatile while offering access latencies closer to SRAM than to flash — a combination that neither DRAM nor SSD can offer. Other technologies — including FRAM (Ferroelectric RAM) and RRAM (Resistive RAM) — are in various stages of research and commercial development. None has yet displaced DRAM or SRAM in mainstream high-performance computing, but the five-rung ladder is a simplification of a field that continues to evolve.


Why This Matters for AI


A large language model contains hundreds of billions of weights — numerical values that must be loaded into arithmetic units before any computation can take place. The speed at which those weights can move from storage into compute is often the binding constraint on AI performance. In practice, the limiting factor is usually memory bandwidth — how many bytes per second can be delivered — rather than the latency of a single access. It is the width of the pipe that matters, not just the size of the tank.


Memory bandwidth — the rate at which data can move from storage into computation — is a rate, not an amount. A memory module might be labeled 100 gigabytes: that is a capacity, an amount. Its bandwidth would be quoted in gigabytes per second: a rate. The two figures look similar but measure different things.


Think of memory as a tank of water and bandwidth as the width of the pipe draining it. A large tank with a narrow pipe holds a lot but delivers it slowly. A smaller tank with a wide pipe empties faster. For AI training, the pipe width matters alongside the volume of computation itself.
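The analogy can be put in rough numbers. The sketch below uses assumed, purely illustrative figures: a 70-billion-weight model stored at 16-bit precision, streamed through two hypothetical pipe widths:

```python
# How long does it take just to read a model's weights once?
# All numbers here are illustrative assumptions, not benchmarks.

params = 70e9                 # 70 billion weights
bytes_per_weight = 2          # 16-bit precision
total_bytes = params * bytes_per_weight      # 140 GB of weights

for name, gb_per_s in [("narrow pipe", 100), ("wide pipe", 3000)]:
    seconds = total_bytes / (gb_per_s * 1e9)
    print(f"{name}: {seconds:.3f} s per full pass over the weights")
```

Generating each word of a response requires touching the weights again, so the difference between those two pipes compounds across an entire conversation.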


Conventional AI accelerators store their weights in DRAM. When weights must be fetched from off-chip DRAM, the access latency runs to hundreds of clock cycles — a consequence of DRAM's array architecture, its refresh overhead, and the contention that arises when many cores compete to pass data across shared interconnects. On a system running a large model, this is not an occasional nuisance. It is a persistent constraint on how fast the system can work — and one that the compute architecture must contend with at every step.


With that foundation in place, we can turn to the compute side of von Neumann's two-block model — how processor architectures evolved to perform the arithmetic that AI demands, and how each generation handled the memory bottleneck differently.


Flynn's Taxonomy — A Framework for Processor Architectures


In 1966, computer scientist Michael Flynn of Stanford University proposed a classification of processor architectures based on two variables: how many instruction streams the processor executes simultaneously, and how many data streams those instructions operate on. Arrange those two variables as the axes of a grid and the result is a 2×2 matrix of four possible architectures. Three of them have significant commercial implementations; one does not. All four are described below, because understanding why one cell of the matrix remains largely empty is itself instructive.

The three architectures with practical implementations — SISD, SIMD, and MIMD — map onto a natural progression in the scale of computation: from operating on a single number, to operating on a list of numbers all at once, to operating on a grid of numbers simultaneously. That progression — scalar (1), vector (N), matrix (N×M) — mirrors the same mathematical progression described in Making Sense of AI Through Linear Algebra, and is the architectural story of how computing hardware evolved to meet the demands of modern AI.

All three practical architectures share the von Neumann foundation: processor and memory as separate blocks, connected by a bus — the shared communication channel between them.

The name is no accident. Think of a city bus: it runs along a fixed route, stops are predetermined, and anyone who wants to travel that route has to wait for the bus to arrive, get on, ride to their stop, and get off. Multiple passengers can ride at once, but they all share the same vehicle — and if the bus is full, or hasn't arrived yet, you wait. You cannot travel until the bus comes to you.

A computer bus works the same way — and the analogy holds more precisely than it might first appear. A bus in a computer is a shared communication channel, but it does not simply run between two stops. It connects multiple destinations: the processor, the main memory, the graphics card, the storage controller, the network interface, and others depending on the system. Each of those destinations is, in effect, a bus stop — it has an address, and data traveling on the bus is labeled with the address of where it is going. Just as a passenger on a city bus needs to know which stop to get off at, a signal on a computer bus carries information about its destination so that only the right endpoint reads it and acts on it. Everything else on the bus sees the signal pass by and ignores it.

Every time the processor needs an instruction or a piece of data, it puts a request on the bus, labeled with the address of the memory location it wants, and waits for the response to arrive back along the same shared channel. If something else is already using the bus — another request in flight, a memory refresh cycle completing, a write from the graphics card — the processor waits in line. No matter how fast the processor itself is, it cannot work any faster than the bus will allow. This bottleneck is present in all three practical architectures, though each handles it differently.


SISD — The Scalar Era

Single Instruction (SI), Single Data (SD).


One processor. One memory. One operation at a time. This is the pure von Neumann architecture — the design of the earliest computers and the conceptual baseline against which everything else is measured. It handles scalar computation: operations on individual numbers, one at a time, in sequence — the same scalar concept introduced in Post 1.

Every operation involves the processor crossing that shared bus at least twice before any arithmetic takes place — once to fetch the instruction, and at least once more to fetch the data. The processor is idle during every one of those crossings.



Figure 1: A typical SISD-based von Neumann machine architecture


The diagram for this section is a conceptual illustration — it shows a single memory box and a single processor box connected by a bus. It is not a representation of how processors are physically laid out on silicon, nor does it imply that real systems have exactly the number of components shown. What it conveys is the architectural principle — the logical relationship between instructions, data, processor, and memory.


Worked example. Consider the scalar operation a = b × c. The processor would typically:

  1. Fetch the multiply instruction from memory

  2. Read b from memory

  3. Read c from memory

  4. Multiply b × c — one arithmetic operation

  5. Write the result a back to memory

That is four memory operations for one arithmetic operation, performed sequentially. This architecture served earlier computing needs well. For modern AI — where a single forward pass through a large language model involves trillions (10¹²) of individual operations — working through every one of them serially through a single bus is not a practical path. The arithmetic operations themselves are straightforward, as Post 1 described. The challenge is the sheer volume of them.
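The four-to-one ratio can be made explicit with a toy tally of the five steps above:

```python
# A toy tally of the five SISD steps for a = b * c.
# Each step is labeled as a memory ("mem") or arithmetic ("alu")
# operation, making the 4-to-1 ratio explicit.

steps = [
    ("fetch multiply instruction", "mem"),
    ("read b",                     "mem"),
    ("read c",                     "mem"),
    ("multiply b * c",             "alu"),
    ("write a",                    "mem"),
]

mem_ops = sum(1 for _, kind in steps if kind == "mem")
alu_ops = sum(1 for _, kind in steps if kind == "alu")
print(mem_ops, alu_ops)   # 4 memory operations per arithmetic operation
```

Every one of those memory operations is a bus crossing, which is exactly where the architectures that follow look for savings.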


SIMD — The Vector Era

Single Instruction (SI), Multiple Data (MD).


Multiple processors, all executing the same instruction simultaneously, each operating on its own slice of data. This is the architectural foundation of the modern Graphics Processing Unit (GPU) — and the step from scalar to vector computation.

Just as a vector operation can be thought of as a loop over a series of scalar operations — applying the same computation to each element one after the other — SIMD executes that loop in parallel, with each processor handling one element simultaneously. What takes N sequential steps in SISD can, in principle, be done in one step across N processors.



Figure 2: SIMD Diagram (for Illustrative Purposes, not actual physical layout)


The diagram for this section is a conceptual illustration — it does not represent how processors are physically arranged on silicon, nor does it imply that real SIMD systems have exactly this number of levels or nodes. What it shows is the essential principle: a single instruction source, a broadcast tree carrying that instruction to every processor simultaneously, and independent data memory attached locally at each node.


To see why this suits neural networks, recall from Post 1 that the weights connecting each layer to the next are applied as matrix multiplications — and that those matrix multiplications can be expressed as a series of vector dot products. Scaling an array by a constant, applying a transformation to every element, computing a dot product — in every case, the operation is identical across all elements; only the data changes. This is exactly the shape of work SIMD is designed for, and it describes a substantial fraction of what a neural network does during both training and inference.

It is also worth noting that matrix multiplication can be expressed as a series of vector operations — each element of the result matrix is a dot product of a row of one matrix with a column of the other, and a dot product is an elementwise vector multiplication followed by a sum. With enough processors and careful organization of the data, a SIMD architecture can handle large matrix operations efficiently. This is broadly what GPUs do: they do not need a fundamentally different architecture to tackle matrix work; they apply vector-style parallelism at very large scale, with many thousands of processors running simultaneously.
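That decomposition — every output element as a dot product of a row with a column — can be sketched directly in plain Python, for illustration only:

```python
# Matrix multiplication rebuilt from dot products: each element of
# the result is the dot product of a row of A with a column of B.

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def matmul_by_dots(A, B):
    cols = list(zip(*B))                     # the columns of B
    return [[dot(row, col) for col in cols] for row in A]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_by_dots(A, B))   # [[19, 22], [43, 50]]
```

Each `dot` call is exactly the uniform, data-parallel work SIMD hardware is built for.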


GPUs — Graphics Processing Units — were originally designed to render video game frames in real time. They became the dominant AI accelerator because their massively parallel architecture fit vector-style computation well, and because Nvidia's CUDA (Compute Unified Device Architecture) software platform, launched in 2006, gave researchers a practical way to use them. A GPU is not a general-purpose computer: it is an accelerator, permanently paired with a host CPU that handles everything a GPU cannot.


The instruction bus bottleneck is partially relieved in SIMD: the same instruction travels to all processors via broadcast, so the cost of fetching it is shared across however many processors are running. Each processor still has its own data memory and its own dedicated channel to that memory, but the instruction overhead is amortized rather than replicated.

Worked example. Extend the same operation to a vector: A(i) = B(i) × C(i) for i = 1 to 2. Two processors, instruction broadcast once:


  1. Receive the broadcast multiply instruction — shared, no individual bus cost

  2. Read own B(i) from local memory

  3. Read own C(i) from local memory

  4. Multiply — one arithmetic operation

  5. Write own A(i) back to local memory

Both processors do this simultaneously. The processor count doubles and both elements complete in the time it takes to compute one.
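The broadcast pattern can be sketched as a toy simulation, where each list index stands in for one processor and its local memory (illustrative only; real SIMD lanes are hardware, not a Python loop):

```python
# A toy SIMD step: one instruction is "broadcast" once, then every
# lane applies it to its own slice of local data.

def simd_step(instruction, B, C):
    # index i plays the role of processor i reading its own B(i), C(i)
    return [instruction(b, c) for b, c in zip(B, C)]

multiply = lambda b, c: b * c       # the single broadcast instruction
A = simd_step(multiply, [2, 3], [10, 20])
print(A)   # [20, 60] -- both elements complete in one "step"
```

The instruction appears once; only the data differs per lane — the defining shape of SIMD work.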


SIMD in Practice: A Note on SIMT


Real GPU architectures tend not to implement pure SIMD exactly as Flynn originally defined it. Nvidia's approach, used across its GPU families, is called SIMT — Single Instruction, Multiple Thread. A detailed treatment of SIMT is beyond the scope of this post — and the internal architecture of any particular GPU is a proprietary matter that manufacturers do not fully disclose — but the broad idea is that SIMT relaxes the strict lockstep requirement of pure SIMD. Processors within a group generally execute the same instruction together, but individual threads can diverge when needed — for example when following different branches of a conditional — and later reconverge. The result tends to behave like SIMD when the workload is uniform, which most matrix operations are, while handling more varied workloads more gracefully than a strict SIMD design would. SIMT is not a separate category in Flynn's taxonomy; it is broadly an engineering refinement within the SIMD family, and mentioning it here is simply to acknowledge that the clean lines of the taxonomy become somewhat blurred in real commercial implementations.


MISD — Multiple Instructions, Single Data


Multiple processors, each executing a different instruction, all operating on the same single data stream simultaneously. This is the fourth cell of Flynn's 2×2 matrix, and the one with no widely adopted general-purpose commercial implementation. The scenario it describes — many different operations all converging on a single shared data stream at once — maps onto very few real computational problems, and the shared data stream itself would tend to create a different kind of bottleneck.

Its place here is completeness rather than practical relevance: a full taxonomy has four cells, and understanding why one of them remains largely empty is instructive. Computation in practice generally involves either the same operation applied to many data items, or many independent processors each handling their own part of a larger calculation — not many different operations converging on one shared stream.


MIMD — The Matrix Era

Multiple Instructions (MI), Multiple Data (MD).


Multiple processors, each executing its own independent instruction stream on its own independent data, all simultaneously. This is the most general form of parallel computation in Flynn's taxonomy.



Figure 3: MIMD (conceptual illustration)


The diagram for this section is a conceptual illustration. What it conveys is the essential principle: each processor is paired with its own memory, each pair operates independently, and coordination between processors happens through a shared interconnect rather than through a shared instruction stream. It is not a representation of the physical layout of any real chip.


The key characteristic of MIMD is autonomy. Each processor is free to follow its own instruction sequence, reading from its own memory, without waiting for or synchronizing with others unless the application requires it. There is no broadcast, no lockstep — each processor simply gets on with its own calculation.

Recall from Post 1 that the full matrix multiplication C[i,j] = Σₖ A[i,k] × B[k,j] produces each output element independently — C[1,1] has no dependency on C[1,2] during computation. A MIMD architecture exploits this directly: assign one processor to each output element and let each run its own independent accumulation simultaneously, without any of them needing to wait for a broadcast or stay in step with the others.

Worked example. Four processors, one per output element of a 2×2 matrix multiplication. Each has its own local memory and its own instruction stream:

  1. Fetch first multiply instruction from own local memory

  2. Read own A[row, k] from local memory

  3. Read own B[k, col] from local memory

  4. Multiply — one arithmetic operation

  5. Fetch add instruction from own local memory

  6. Accumulate into running total — one arithmetic operation

  7. (repeat for second pair)

  8. Write result C[row, col] back to local memory

All four processors run simultaneously. The four output elements are computed in the time it takes one processor to compute one. Each followed its own instruction sequence on its own data — no broadcast, no lockstep.
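The one-processor-per-output-element idea can be sketched with threads standing in for independent cores. This is a toy illustration only; Python threads are not real MIMD hardware, but they capture the shape of the work:

```python
# A toy MIMD sketch: one independent worker per output element of a
# 2x2 matrix product, each running its own accumulation with no
# broadcast and no lockstep.

from concurrent.futures import ThreadPoolExecutor

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

def worker(row, col):
    # each worker accumulates its own C[row][col] independently
    total = 0
    for k in range(2):
        total += A[row][k] * B[k][col]
    return total

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {(i, j): pool.submit(worker, i, j)
               for i in range(2) for j in range(2)}

C = [[futures[(i, j)].result() for j in range(2)] for i in range(2)]
print(C)   # [[19, 22], [43, 50]]
```

No worker waits on any other until the final results are collected — the autonomy that defines MIMD.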

Each processor has its own dedicated channel to its own memory, which reduces shared-memory contention — the bottleneck that arises when many processors compete for a single shared bus. Strictly speaking, in a fully distributed-memory MIMD system, giving each processor its own memory eliminates shared-memory contention, though processors may still contend for bandwidth on the interconnect when exchanging data with each other. There are also shared-memory MIMD variants for which this does not hold.

Crucially, the local memory beside each MIMD processor is typically fast SRAM — not off-chip DRAM. For computations that fit within that local store, the memory wall described earlier is substantially reduced: each processor draws data in one to four nanoseconds — a handful of clock cycles — rather than paying the 200 to 300 cycles an off-chip DRAM access would cost. The bottleneck that has characterized computer architecture since the 1940s has not been eliminated — but it has been narrowed considerably for the workloads that matter most.

A number of commercially significant chips have been built on MIMD principles, spanning supercomputers, high-performance networking processors, and specialized AI accelerators. The architecture is not a theoretical ideal — it is a practical design that has been refined across decades. Wafer Scale Integration — The Cerebras Story describes one of the most ambitious implementations of it: the Cerebras WSE-3, which takes the MIMD principle — each processor paired with its own local SRAM — and extends it across a single piece of silicon the size of an entire wafer, with 900,000 independent cores each sitting beside its own slice of fast memory.


Beyond Flynn — VLIW and FPGAs


Flynn's taxonomy describes the landscape of general-purpose processor architectures well, but it does not exhaust the approaches applied to accelerating neural network computation. Two others are worth noting.

VLIW — Very Long Instruction Word. VLIW processors express multiple independent operations in a single wide instruction, allowing the compiler to schedule parallelism explicitly at compile time rather than leaving the hardware to figure it out at runtime. Rather than fetching many short instructions sequentially, the processor fetches one wide instruction that encodes several operations simultaneously. This tends to achieve high hardware utilization in workloads where the instruction mix is predictable in advance — which inference often is, since the network architecture is fixed and the same operations run in the same order every time. Cadence's Tensilica processor family is a well-known example, widely used in signal processing and neural network inference.
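The wide-instruction idea can be sketched as a toy interpreter. The instruction format and register names below are invented purely for illustration:

```python
# A toy VLIW step: the "compiler" has already packed several
# independent operations into one wide instruction; the hardware
# issues all of its slots from a single fetch.

def execute_wide(wide_instruction, regs):
    # every slot of the wide instruction issues together
    for op, dst, a, b in wide_instruction:
        if op == "mul":
            regs[dst] = regs[a] * regs[b]
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
    return regs

regs = {"r0": 2, "r1": 3, "r2": 4, "r3": 5, "r4": 0, "r5": 0}
# one wide instruction: two independent operations, one fetch
wide = [("mul", "r4", "r0", "r1"), ("add", "r5", "r2", "r3")]
execute_wide(wide, regs)
print(regs["r4"], regs["r5"])   # 6 9
```

The point is the packing: two operations, one instruction fetch — the scheduling decision made once, at compile time.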

FPGAs — Field-Programmable Gate Arrays. FPGAs allow the hardware circuit itself to be reconfigured for a specific computation after manufacture. A designer can, in effect, build a custom chip in software and deploy it without going through a fabrication process. This flexibility makes FPGAs well suited to low-latency inference at the edge, and to workloads where the model architecture is stable enough to justify the engineering effort. Their reconfigurability is both their strength and their constraint: they tend to be slower and less dense than a purpose-built chip, but they can be updated as requirements change.


The Pattern


The scalar-to-vector-to-matrix progression of the mathematics described in Post 1 loosely mirrors the architectural sequence described here. SISD handles one number at a time — the scalar tier. SIMD handles many numbers simultaneously under one instruction — the vector tier, where GPUs broadly live. MIMD handles a grid of numbers across many independent processors each following its own path — the matrix tier, and the natural home of the most advanced AI accelerators.

The instruction overhead runs through all three practical tiers but is handled differently in each. In SISD it adds a memory crossing to every operation. In SIMD it is amortized across all processors sharing the same broadcast — the cost per processor falls as the processor count rises. In MIMD it is carried independently by every processor: the price of full autonomy, paid individually, offset by the fact that each processor has its own dedicated channel to its own memory and shares no memory with its neighbors.

Throughout this progression, the consistent trade-off is silicon area exchanged for computation speed — more processors, smaller memories per processor, shorter channels, faster memory. Each step reduces the time the arithmetic hardware spends waiting.


A Note on Optimizations


Flynn's taxonomy, and the VLIW and FPGA approaches above, represent broad architectural families rather than the full catalog of processor design choices. Decades of processor engineering have produced a rich set of additional techniques — out-of-order execution, speculative execution, branch prediction, instruction prefetching, superscalar pipelines, and many others — each representing a different strategy for squeezing more useful work from available silicon within a given architectural family. These are genuine and important innovations that have shaped every processor manufactured in the last thirty years. They are not described here because they operate as optimizations within an architectural family rather than as alternative answers to the fundamental question of how parallelism should be organized.


A Note on Multicore CPUs


Before leaving the taxonomy, it is worth acknowledging a class of processor that sits between the SISD baseline and the fully parallel architectures described above: the multicore CPU. Modern CPUs routinely contain between 8 and 128 cores on a single chip, each capable of executing its own instruction stream independently. For general-purpose computing — running multiple applications simultaneously, handling diverse workloads, managing an operating system — this works well, and the combination of multiple cores with the rich optimization techniques described above makes modern CPUs remarkably capable machines. The limitation in the context of AI is one of degree. A 64-core CPU is a powerful general-purpose machine, but each of those cores is large, complex, and expensive in silicon area — designed for flexibility rather than throughput. A modern GPU, by contrast, packs thousands of simpler cores onto the same die area, each optimized for the narrow arithmetic that matrix multiplication requires. A MIMD AI accelerator may carry hundreds of thousands. The multicore CPU excels at the serial orchestration work — managing the training loop, handling I/O, coordinating the overall computation — which is precisely the role it plays in the accelerator model described in the next section. For the matrix arithmetic itself, the ratio of useful arithmetic cores per unit of silicon area means the CPU is rarely the right tool.


The Accelerator Model — A Summary


The accelerator model is a well-established pattern in computing: rather than asking the CPU to do everything, pair it with a specialist co-processor designed from the ground up for one class of problem. The CPU handles what it is good at — serial orchestration, control logic, deciding what to compute next. Think of it as the conductor of an orchestra, directing each section and deciding what plays next and when. The accelerator handles what the CPU cannot do as efficiently — executing the same arithmetic operation across millions of data elements simultaneously. Neither replaces the other; they work in combination. Graphics rendering, signal processing, and scientific simulation have all made use of this model. AI training and inference are the most recent — and most demanding — workloads to benefit from it.


As computation scaled from scalar to vector to matrix operations, hardware architecture evolved along the same path. GPUs brought SIMD parallelism. MIMD architectures took that further, giving each processor its own instruction stream and its own dedicated memory channel — and crucially, its own local SRAM. As described in the memory section above, the local SRAM sits hundreds of clock cycles closer to the arithmetic units than off-chip DRAM does, substantially reducing the latency and contention that had constrained earlier architectures.


What this post has established is that the two central challenges of AI hardware — computational parallelism and memory speed — are both well understood. MIMD architectures address the parallelism side directly. The memory hierarchy, and the space, speed, and cost tradeoffs within it, define what is available on the memory side.


The industry has responded to this in several ways — through faster interconnects, smarter distributed training frameworks, and architectural innovations in how memory and compute are organized at the chip and system level. These are active areas of engineering, and no single approach has emerged as the definitive answer. Wafer Scale Integration — The Cerebras Story describes one particular architectural response to this challenge — and the engineering story behind it.


Further Reading


Readers wishing to explore processor architecture in depth will find the Wikipedia articles on Flynn's taxonomy, VLIW, and FPGA a reliable starting point. The standard academic reference for the field is: Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann (6th edition, 2017). Its opening chapters on instruction-level parallelism and memory hierarchy are accessible to a determined non-specialist and repay the effort.


 
 
 


K's Musings


©2020 by K's Musings. Proudly created with Wix.com
