Making Sense of AI Through Linear Algebra
- Kumar Venkatramani
- Feb 23
- 13 min read
Updated: Mar 18
This article is a companion to Wafer Scale Integration — The Cerebras Story. It can be read on its own — but if you arrived here from that article, this is the deeper mathematical foundation it pointed you toward.
Note to Reader
This article is written for the technically curious, not the technically specialist. You do not need a degree in computer science or electrical engineering to follow what is written here, though if you have one, you may find that some descriptions go deeper than most popular accounts do. What you do need is an appetite for understanding how things work, and a willingness to engage with some mathematics: specifically linear algebra, probability, and the idea of iterative error correction. None of these requires specialist knowledge at the level this post uses them, and all of them are introduced through concrete examples before being used.
Specifically, this article builds on three mathematical concepts that will feel familiar to anyone who has studied mathematics beyond basic arithmetic: scalars (a single number), vectors (an ordered list of numbers), and matrices (a grid of numbers arranged in rows and columns). These are introduced gently and with examples when they first appear. Readers who go on to explore the companion articles on hardware architecture will find this foundation useful, though neither article requires it as a prerequisite.
Editorial Note
This article was written by the author with assistance from an AI system and reflects his best understanding of a technically complex subject. While every effort has been made to ensure accuracy, readers who identify errors or imprecisions are encouraged to raise them. The goal is clarity and honesty, not the appearance of expertise.
What Deep Learning Actually Does, Mathematically
First Order Framing
This essay explains modern AI through the lens of linear algebra — intentionally emphasizing first-order ideas to build intuition. In doing so, some concepts are highlighted for clarity (vectors, matrices, dot products, matrix multiplication), while others that are equally essential in practice — optimization, probability, calculus, numerical stability, hardware constraints — are intentionally glossed over. The goal is not completeness, but coherence: to show the structural backbone of AI systems before layering in the full mathematical and engineering machinery that makes them work at scale.
AI, Neural Networks and Deep Learning Networks
The systems most people refer to today as "AI" — the Large Language Models (or LLMs for short) that answer questions, write code, translate languages, and summarize documents — belong to a class of mathematical systems called deep learning networks. Understanding what these networks are, at a mathematical level, helps explain why the hardware required to train and run them looks the way it does.
Before going further, it is worth pausing on terminology — because the words used to describe these systems are often used interchangeably in public discourse, and the distinctions matter for what follows.
Traditional software follows explicit rules written by programmers. Machine learning is the broadest category — any system that learns patterns from data rather than following explicit rules. Neural networks are a specific family within machine learning, characterized by their layered architecture of weighted connections. Deep learning is a subset of neural networks specifically — networks with many layers, where the depth of the architecture is what gives the approach its name and much of its power. The systems at the frontier of modern AI, including the large language models this series is ultimately about, are deep learning systems. See Figure 1 below for an explanation.

Figure 1: The hierarchy of terms in machine learning, neural networks, and deep learning networks
Scalars, Vectors, and Matrices
A useful way to ground these concepts is through a single analogy that makes all three visible at once: a digital image.
A scalar is a single number — think of one pixel on a greyscale screen, with a brightness value of, say, 137 on a scale of 0 to 255.
A vector is an ordered list of numbers. Extend that pixel across a row — one brightness value per pixel, left to right — and you have a vector. Order matters: each position corresponds to a specific location on the screen.
A matrix is a grid of numbers arranged in rows and columns. Stack every row of pixels across the entire image from top to bottom and you have a matrix. The full greyscale image, as a mathematical object, is a matrix.
And when you add color — red, green, and blue values at every pixel — you need one matrix per color channel, stacked in three layers. That three-dimensional structure is called a tensor: a generalization of a matrix to as many dimensions as the problem requires. The name is not accidental. Google's AI accelerator is called a Tensor Processing Unit (TPU) precisely because tensor operations are the heartbeat of modern AI computation.
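To make these objects concrete, here is a minimal NumPy sketch of all four: a scalar, a vector, a matrix, and a tensor. The pixel values are invented for illustration.

```python
import numpy as np

# A scalar: one greyscale pixel's brightness (0-255).
pixel = 137

# A vector: one row of pixels, left to right. Order matters.
row = np.array([40, 52, 137, 210, 198])

# A matrix: a tiny 3x5 greyscale image, rows stacked top to bottom.
image = np.array([
    [40, 52, 137, 210, 198],
    [38, 60, 150, 205, 190],
    [35, 58, 145, 200, 185],
])
print(image.shape)        # (3, 5): 3 rows, 5 columns

# A tensor: the same image in colour, one matrix per RGB channel.
color_image = np.stack([image, image, image])
print(color_image.shape)  # (3, 3, 5): 3 channels, 3 rows, 5 columns
```

The `.shape` attribute makes the dimensionality visible: the matrix has two axes, the tensor three, and the same pattern extends to as many axes as a problem requires.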
Why does this matter for AI? Almost every modern AI system – from movie recommendation engines to speech recognition and image classification – boils down to applying large numbers of these vector and matrix operations to data.
How a Neural Network Is Built From This
A neural network is built from layers. The data flowing through each layer is usually represented as a large grid of numbers — a matrix — holding the current state of whatever the network is processing: a word, a phrase, an input sentence, an intermediate calculation, a partially formed answer. In language models, the units of text being processed are called tokens.
What makes the network "learn" is a second set of numbers sitting between those layers: the weights. Weights are the learned multipliers that determine how strongly the output of each element in one layer should influence each element in the next. Training a network means adjusting those weights, gradually, until the network's outputs match what we want.

Figure 2: Weights and Layers
The operation that connects one layer to the next is a matrix multiplication — followed by a simple but essential step: every negative number in the result is set to zero, and every positive number is left unchanged. This operation is called ReLU, and its role is to act as a threshold — allowing signals that cross it to pass forward, and suppressing those that don't.
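ReLU is simple enough to state in one line of NumPy. This is a minimal sketch; the layer values below are invented for illustration.

```python
import numpy as np

def relu(x):
    # Set every negative value to zero; leave positives unchanged.
    return np.maximum(x, 0)

layer_output = np.array([[-2.0, 0.5],
                         [ 3.0, -1.5]])
print(relu(layer_output))
# [[0.  0.5]
#  [3.  0. ]]
```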
At the heart of every layer is a matrix multiplication. For those comfortable with notation, this can be expressed precisely: each output element C[i,j] is computed as
C[i,j] = Σₖ A[i,k] × B[k,j]
where A is the input data (the current layer's values) and B is the weight matrix (the learned multipliers the network has built up through training). Each output value C[i,j] is therefore a weighted combination of every input value in row i, where the weights in column j of B determine how much each input contributes. This sum of element-by-element products is the dot product of a row of A with a column of B.
To see how the layers connect, the notation makes it concrete. In the first layer, A₁ is the input data — the matrix of pixel values — and B₁ is that layer's weight matrix, learned through training. Their multiplication produces C₁[i,j] = Σₖ A₁[i,k] × B₁[k,j]. That result, C₁, does not stop there — it becomes A₂, the input to the next layer, which has its own weight matrix B₂, producing C₂[i,j] = Σₖ A₂[i,k] × B₂[k,j]. And so on, layer after layer. Each weight matrix has been trained to extract something progressively more abstract from whatever the previous layer passed forward — pixel differences, then edges, then shapes, then features. The final layer's output is no longer a matrix of image values but a set of probabilities: this is 0.73 cat, 0.27 dog. The entire chain — raw input to confident conclusion — is nothing more than a sequence of matrix multiplications, each one transforming the representation one step further.
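The chain of layers can be sketched in a few lines of NumPy. The shapes and random weight values below are illustrative stand-ins, not trained values; the point is the structure, where each layer's output becomes the next layer's input.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Threshold step: negatives become zero, positives pass through.
    return np.maximum(x, 0)

# A1: a batch of 4 inputs, each with 6 features (e.g. flattened pixels).
A1 = rng.standard_normal((4, 6))

# B1, B2: weight matrices (random stand-ins for trained weights).
B1 = rng.standard_normal((6, 8))
B2 = rng.standard_normal((8, 2))

# Layer 1: C1[i,j] = sum over k of A1[i,k] * B1[k,j], then ReLU.
C1 = relu(A1 @ B1)

# C1 becomes A2, the input to the next layer.
C2 = C1 @ B2
print(C2.shape)  # (4, 2): two output scores per input
```

The `@` operator is exactly the Σₖ A[i,k] × B[k,j] formula applied to every (i, j) pair at once.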
To explain this another way: The matrix of pixel values is, in effect, an encoding of everything the image contains — every edge, every texture, every shape, captured as numbers. When the network assigns weights to those values, it is essentially asking: which of these encoded features matter most for identifying what this image is? By emphasizing certain features through higher weights, comparing the result against the known answer, and correcting the error iteratively, the network gets progressively closer to a set of weights that reliably extracts the right information from the encoding. The matrix multiplication is the mechanism that does that extraction — combining the encoded features according to the current weights to produce a prediction.
Training and Inference — Two Very Different Jobs
Two terms appear constantly in any discussion of Neural Networks, and it is worth defining them clearly before the rest of the post depends on them.
Training is the process by which a deep learning network learns — the computationally intensive phase where the model is exposed to vast quantities of data, its weights are repeatedly adjusted to reduce errors, and the billions or trillions of parameters that define its behavior are gradually shaped into something useful. Training a frontier model happens once, or a small number of times, and takes weeks or months on thousands of accelerators running continuously.
Inference is everything that happens after training is complete — every time the model is asked a question, generates a response, writes code, or translates a sentence.
Think of training like getting a professional qualification: studying, preparing for and taking the exams — grueling, expensive, done once. Inference is the career that follows — performed constantly, for years, in front of real people who expect a timely answer.
The two workloads have different hardware demands: training requires sustained, high-throughput parallel computation across enormous datasets; inference requires quick responses — the model must answer each individual request fast enough that the person on the other end doesn't notice a delay.
It's Raining Cats and Dogs — Learning in Practice
To see how all of this works in practice, consider the problem of recognizing whether an image contains a cat or a dog. The image arrives as a matrix of pixel brightness values. What the network learns to look for, in its earliest layers, are edges — places where brightness values change sharply from one pixel to the next. A sudden jump from 40 to 210 across adjacent pixels signals a boundary: the edge of an ear, the curve of a jaw, the outline of a tail. These edges are detected by comparing neighboring values in the matrix — a straightforward numerical operation, repeated across the entire image.
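The edge-detection idea can be shown with a one-row sketch, comparing neighboring brightness values. The numbers are invented for illustration; real networks learn such comparisons as weights rather than hard-coding them.

```python
import numpy as np

# One row of greyscale brightness values crossing an object boundary.
row = np.array([40, 42, 45, 210, 212, 208])

# An edge shows up as a large difference between neighboring pixels.
diffs = np.abs(np.diff(row))
print(diffs)            # [  2   3 165   2   4]

# Flag positions where the jump exceeds a threshold.
edge_positions = np.where(diffs > 50)[0]
print(edge_positions)   # [2]: the jump from 45 to 210
```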
This is also where the term "deep" in deep learning earns its meaning. Each successive layer of the network learns to detect increasingly subtle and abstract features — built on top of what the previous layer found. The first layers detect raw edges and gradients. The next layers combine those edges into shapes: a curve, a pointed outline, a particular texture of fur. Deeper still, the network begins to respond to higher-order arrangements — the relative position of ears and eyes, the proportions of a snout, the overall structure of a face.
Each layer applies finer refinements to what came before, and it is this hierarchy of progressively more abstract representations, built up across many layers, that gives deep learning both its name and much of its power.
Think of each layer as a specialist. One layer, once its weights are trained, becomes good at detecting a particular curve — the arc of an ear. Another learns to recognise an eye. Another, the length and shape of a snout. No single layer knows what a cat is. But when you stack enough of them, the final layers receive a summary of everything the earlier layers found — two eyes, two pointed ears, a particular nose — and can combine those signals into a confident conclusion. Each layer adds one piece to the recognition. Together, they complete it. This is why the number of layers matters: a shallow network can only detect simple features, while a deeper one can build up increasingly complex descriptions of what it is looking at, one layer at a time.
But how does the network learn those associations in the first place? This is where probability and human feedback enter the picture — and where the iterative nature of training becomes important.
During training, the network is shown a large collection of labelled images — a human has already marked each one as "cat" or "dog." That labelling, it turns out, is not always done by specialists. When you complete a CAPTCHA — those prompts that ask you to "click all the squares containing a traffic light" or "identify the images with a bus" — you are, in many cases, doing exactly this work. You and I, without necessarily realizing it, have contributed to the labelled datasets that train these systems. In other cases, scores of human workers sit in front of computer screens doing this by hand, image by image, hour after hour — a largely invisible but essential step in building any AI system that learns from labelled examples.
For each image, the network produces not a definitive answer but a probability: perhaps 0.73 that it is a cat, 0.27 that it is a dog. When that probability is wrong — when the network says 0.73 cat and the label says dog — the error is measured and fed back through the network, nudging the weights very slightly in the direction that would have produced a better answer.
This process — feed a dataset of images into the network, have humans identify what is in each image, measure the error between the network's guess and the correct answer, adjust the weights, and repeat — is then iterated until the error falls below an acceptably small threshold. Not once or twice, but across the entire collection of images, again and again. Each complete pass through the training data is one iteration — working through every labelled image in the set, one by one, updating the weights each time. In the technical literature this is often called an epoch, but the idea is the same: one full cycle through the data. A practical cat-and-dog classifier might require tens of such iterations across tens of thousands of images, meaning the network has seen and learned from millions of individual examples before its probability estimates become reliable. Larger, more complex models require correspondingly more data and more passes.
One important distinction: the number of layers is fixed before training begins. Depth is an architectural decision made by the designer, informed by the requirements of the task, and held constant for the duration of training. The model learns within that structure; it does not choose its own number of layers. What training adjusts, exclusively, are the weights. Training then reveals how well the chosen architecture performs, and if the results are poor, the designer may go back, redesign with more or fewer layers, and try again.
With each iterative pass, the probability estimates improve. The weights settle into configurations that correctly distinguish pointed ears from floppy ones, wet noses from dry, the particular texture of fur from the particular texture of grass. At some point, the network's answers become accurate enough — reliably above some acceptable threshold — and training stops. The network can now be said to "recognise" a cat or a dog.
And critically, the same process generalises. Once a network has learned to detect edges, shapes, and arrangements of features for cats and dogs, those same capabilities transfer. Train it on cars, and it learns to distinguish headlights from wheels from windscreens. Train it on medical images, and it learns to distinguish healthy tissue from abnormal tissue. Train it on enough different objects, with enough labelled examples, and the network develops a rich general ability to recognise and categorise the visual world — all from the same underlying mathematics, applied iteratively, at enormous scale.
The linear algebra — matrix multiplications, dot products, the movement of numbers through layers — is what makes the computation tractable at scale. The probability and the human feedback are what give it the ability to learn. Neither works without the other.
From Images to Language — The Same Process, Applied Differently
The cat-and-dog example works because images translate naturally into matrices of numbers. But what about language? Words are not pixels. A sentence does not arrive with a brightness value attached to each character.
This is where the same iterative refinement process is applied to an entirely different representational challenge. Just as images were broken down into pixels and arranged into matrices, language must be broken down into its own units — syllables, words, phrases, sentences — and each must be assigned a numerical representation that a network can operate on.
But applying this to language required solving a harder problem first. Images arrive as pixel values — numbers already. Language does not. How do you turn a word into a number? How do you turn a sentence, or the meaning of a concept, into something a processor can operate on? That mapping — from the messy, ambiguous, context-dependent world of human language into numerical structures — is what took decades to work out. It required figuring out that words could be represented as points in a high-dimensional space, where proximity meant similarity of meaning, and that a sentence could be encoded as a sequence of vectors, each capturing not just a word but its relationship to every other word around it.
Once that representational problem was solved, the same training process that worked for images could be applied to language. Feed examples through the network — a word and its context, a sentence and its likely continuation, a question and its answer. Measure the error. Adjust the weights. Repeat, across billions of examples, millions of iterations. With each pass, the numerical representations become more refined — words that appear in similar contexts drift closer together; words with opposite meanings drift apart. The network learns not just the symbol but something approximating the meaning, purely through the statistics of how words appear alongside each other at enormous scale. A Large Language Model is, at its core, the result of applying this same process to language — at a scale that would have been computationally impossible a decade ago.
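The idea that proximity in the vector space means similarity of meaning can be illustrated with cosine similarity. The three-dimensional "embeddings" below are hand-made toys; real models learn vectors with hundreds or thousands of dimensions.

```python
import numpy as np

# Hand-made toy word vectors (real embeddings are learned, not written down).
vectors = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 means pointing the same way; values near 0 mean unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["cat"], vectors["dog"]))  # high: similar meaning
print(cosine_similarity(vectors["cat"], vectors["car"]))  # lower: different meaning
```

Note that the similarity measure is itself a dot product, normalized by vector length: the same operation that drives every layer of the network also defines "closeness" in the embedding space.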
The equations were ready, and the representational map, once found, unlocked everything. But the processors were not ready yet. They had been improving steadily, yet running these computations at the scale frontier AI demands became feasible only recently, and that frontier is still being pushed forward.
Readers curious about how hardware evolved to meet this challenge will find that story in Evolving Hardware to Meet AI Needs.
How Large Is a Large Language Model?
Before moving to the hardware, it is worth pausing on what "large" actually means in a Large Language Model — in terms the preceding sections have already introduced.
A modest but capable deployed language model today might have somewhere in the range of 7 to 70 billion (10⁹) weights — each one a number that was arrived at through the iterative training process described above. A frontier model sits above a trillion. These are not metaphorical large numbers: a trillion is 1 x 10¹², and every one of those values must be stored, retrieved, and used in a computation every time the model processes a word.
Those computations are matrix multiplications — the same dot products described earlier, now applied across hundreds of layers, each containing millions of values, for every token the model processes. A single forward pass through a large model — generating one word of a response — involves billions of individual multiply-and-add operations. Training the model, which requires doing this billions of times across a dataset of trillions of words, with weight adjustments after each pass, demands a scale of computation that strains the most powerful hardware available.
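The scale claim can be checked with back-of-envelope arithmetic: multiplying an m×k matrix by a k×n matrix takes roughly 2·m·k·n multiply-and-add operations. The layer count and dimensions below are illustrative round numbers, not those of any specific model.

```python
# Back-of-envelope cost of one matrix multiplication:
# an (m x k) by (k x n) multiply takes about 2*m*k*n operations
# (one multiply and one add per term in each dot product).

def matmul_flops(m, k, n):
    return 2 * m * k * n

# Illustrative only: 100 layers, each multiplying a 1 x 4096 activation
# vector by a 4096 x 4096 weight matrix, for every token generated.
layers = 100
d = 4096
flops_per_token = layers * matmul_flops(1, d, d)
print(f"{flops_per_token:.2e} operations per token")  # ~3.36e9
```

Billions of operations per generated word, before attention and other layers are even counted, is why the hardware story matters.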
Where to go next
If this perspective resonates, the next step is to expand beyond linear algebra into the broader mathematical and computational ecosystem that powers modern AI. That includes multivariable calculus (for gradients and backpropagation), probability and statistics (for modeling uncertainty and learning from data), numerical optimization (for training stability and convergence), and computer architecture (for understanding why matrix operations dominate hardware design). Good next steps include:
[1]. A standard linear algebra text, such as Linear Algebra and Its Applications by Gilbert Strang.
[2]. An introductory machine learning text, such as Pattern Recognition and Machine Learning by Christopher M. Bishop.
[3]. A systems-oriented perspective, such as Computer Architecture: A Quantitative Approach by John L. Hennessy and David A. Patterson.
Linear algebra may form the skeleton — but these disciplines provide the muscle, circulation, and metabolism of real AI systems.
