The Essence of LLMs: Functions
A while back, my son asked me: “Dad, how does ChatGPT know what to say?”
I decided to give a real answer. Not a hand-wavy “it’s very smart,” but actually break down how LLMs work for him. So I made a slide deck – LLM for Kids – walking through Token, Embedding, Attention, and Transformer, using “the cat sat on the mat” as an example, “report cards” and “pie charts” as analogies.
Making that deck taught me more than I expected. When you’re forced to explain a concept so that an elementary schooler can understand it, you’re forced to strip away all the jargon and confront the essence.
And that essence is surprisingly simple:
An LLM is a function.
Not a metaphor. Not an analogy. A function in the mathematical sense. It takes a sequence of tokens as input and outputs a probability distribution. Every behavior that makes people think “AI seems to be thinking” is just this function calling itself repeatedly.
Starting from a d-Dimensional Space
Training an LLM begins with positing a d-dimensional space. d might be 4096 or 8192 – the exact number depends on the model design.
Each token – a word, a subword, a punctuation mark – is mapped to a vector in this space. This operation is called Embedding, and it’s essentially a lookup table: token ID in, d-dimensional vector out.
Before training, these vectors are randomly initialized. “Cat” and “dog” might be far apart. “Cat” and “interest rate” might be right next to each other. But after training, semantically similar words get pulled closer together – not by human design, but by gradient descent tuning it on its own.
A word’s “meaning” is its position in high-dimensional space.
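A toy sketch makes the lookup-table nature concrete. The vocabulary, dimensions, and random initialization below are illustrative only – real models use vocabularies of tens of thousands of tokens and d in the thousands:

```python
import numpy as np

# Toy embedding table: 5 tokens, d = 4. Sizes are illustrative, not real.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
d = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d))  # random before training

def embed(token_ids):
    """Embedding is just a row lookup: token ID in, d-dimensional vector out."""
    return embedding_table[token_ids]

def cosine_similarity(a, b):
    """'Meaning' as position: after training, semantically similar words
    have similar vectors, measured as the cosine of the angle between them."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

ids = [vocab[w] for w in ["the", "cat", "sat"]]
vectors = embed(ids)  # shape (3, 4): one vector per token
sim = cosine_similarity(embed(vocab["cat"]), embed(vocab["mat"]))
```

With random initialization, `sim` is arbitrary; training is what pulls related words together.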
Attention: Dynamic Routing
But there’s a problem: Embedding gives each token a static, context-independent position. Whether “apple” appears in “I ate an apple” or “Apple released a new iPhone,” the lookup table returns the same vector – encoding only the average semantics of “apple,” with no idea whether it’s a fruit or a company in the current sentence.
What Attention does is: dynamically adjust each token’s representation based on context. Embedding assigns each token a “default identity.” Attention lets them communicate with each other and then adjust according to context. Without Attention, every word lives in its own world, unaware of its neighbors.
For each position in the sequence, Attention answers one question: Who should I pay attention to, and how much?
Mathematically, it transforms each vector into three roles:
- Q (Query): What am I looking for?
- K (Key): What can I offer?
- V (Value): My actual content
Then a single formula handles matching and aggregation:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Each token's Query is matched against every Key; the resulting weights decide how much of each Value gets mixed into that token's new representation.
In one sentence: Attention is a learnable, dynamic weighted sum.
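That weighted sum fits in a few lines of NumPy. This is a minimal single-head sketch – real implementations add batching, causal masking, and learned projections that produce Q, K, and V from the token representations:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how well each query matches each key
    weights = softmax(scores, axis=-1)   # each row sums to 1: an attention distribution
    return weights @ V                   # dynamic weighted sum of the values

# Three tokens, d_k = 4. In a real model Q, K, V come from learned
# linear projections; random values here just exercise the math.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = attention(Q, K, V)  # shape (3, 4): one updated vector per token
```

Note the degenerate case: if all scores are equal, every token simply averages all the Values – the "dynamic" part is precisely that trained Q and K make the weights input-dependent.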
Multi-Head Attention runs multiple such operations in parallel. Each head learns a different attention pattern – some focus on syntactic dependencies, some on semantic similarity, some on positional distance. The results from all heads are concatenated and passed through a linear transformation.
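The head-splitting, concatenation, and final linear transformation can be sketched as below. The loop over heads is for clarity; production code reshapes and runs all heads in one batched operation, and adds masking and biases:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Split d into n_heads slices, run attention per head, concat, project."""
    n, d = X.shape
    d_h = d // n_heads                      # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # learned projections
    heads = []
    for h in range(n_heads):
        q = Q[:, h * d_h:(h + 1) * d_h]
        k = K[:, h * d_h:(h + 1) * d_h]
        v = V[:, h * d_h:(h + 1) * d_h]
        w = softmax(q @ k.T / np.sqrt(d_h))  # each head learns its own pattern
        heads.append(w @ v)
    # concatenate all heads, then pass through a final linear transformation
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n, d, n_heads = 3, 8, 2                     # toy sizes for illustration
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)  # shape (3, 8)
```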
FFN: The Knowledge Store
Inside each Transformer block, right after Attention, there's a Feed-Forward Network (FFN):

FFN(x) = W₂ · f(W₁x + b₁) + b₂

Two fully connected layers with an activation function f (ReLU or GELU in practice) in between. It looks unremarkable, but recent Mechanistic Interpretability research has revealed an interesting division of labor:
Attention handles information routing – deciding where to pull information from. FFN handles knowledge storage – the “facts” the model has memorized are largely encoded in FFN parameters.
This means when you ask an LLM “What is the capital of France?”, Attention connects “France” and “capital,” while FFN “recalls” “Paris” from its parameters.
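The FFN's forward pass is genuinely this small. Sizes below are toy values – real models typically make the hidden layer about 4× wider than d (e.g. 4096 in, 16384 hidden):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, a common Transformer activation.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Two fully connected layers with an activation in between:
    expand to the hidden width, apply the nonlinearity, project back to d."""
    return gelu(x @ W1 + b1) @ W2 + b2

d, d_hidden = 8, 32  # toy sizes; real models use e.g. 4096 / 16384
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d)), np.zeros(d)
x = rng.normal(size=(3, d))
out = ffn(x, W1, b1, W2, b2)  # shape (3, 8): width d in, width d out
```

The "knowledge storage" claim is about where trained weights end up encoding facts – the architecture itself is just this expand-activate-contract pattern.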
Training Objective: Almost Too Simple
The entire training process has a single objective: Next Token Prediction.
Given the first n tokens, predict the (n+1)th. Compute the cross-entropy loss between the predicted probability distribution and the ground truth, backpropagate, and update the parameters.
That’s the only objective. Nobody teaches it grammar, logic, or how to write code. Yet when the model is large enough and the data abundant enough, these capabilities “emerge.”
Why? Because to accurately predict the next token, you must understand context. To understand context, you implicitly learn grammar, semantics, logic, common sense, and even world knowledge. Predicting the next word is the ultimate compression of language understanding.
So What Is “Intelligence”?
Back to the opening thesis: an LLM is a function.
A function with billions to hundreds of billions of parameters – but still just a function: tokens in, probability distribution out.
What about conversation? It’s just autoregressive invocation of this function – append the previous output to the input and call it again. Temperature and Top-p sampling introduce randomness, but that’s an inference-stage engineering choice, not a property of the model itself.
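The autoregressive loop, with Temperature and Top-p as inference-time knobs, can be sketched as follows. The `model` argument is a hypothetical stand-in for any function mapping a token sequence to logits:

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature rescales the logits; Top-p (nucleus) sampling keeps the
    smallest set of tokens whose cumulative probability reaches p.
    Both are inference-stage choices, not properties of the model."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # tokens by descending probability
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # the "nucleus" of tokens
    p = np.zeros_like(probs)
    p[keep] = probs[keep]
    p /= p.sum()                                   # renormalize over kept tokens
    return int(rng.choice(len(probs), p=p))

def generate(model, tokens, n_new, **kwargs):
    """Autoregression: call the function, append its output, call it again."""
    for _ in range(n_new):
        tokens = tokens + [sample_next(model(tokens), **kwargs)]
    return tokens

# A toy "model" that always favors token 0 – just to show the loop's shape.
toy_model = lambda tokens: np.array([3.0, 1.0, 0.5])
seq = generate(toy_model, [2, 1], n_new=3, temperature=0.5)
```

Lowering the temperature sharpens the distribution toward the argmax; shrinking `top_p` truncates the long tail. Neither changes a single model parameter.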
This isn’t diminishing LLMs. Quite the opposite. The fact that a system “merely” doing function approximation can exhibit behavior that looks like reasoning, like creativity, like understanding – that is what’s truly awe-inspiring.
Conway’s Game of Life is also a function – a few simple rules that evolve into infinitely complex patterns. LLMs are similar: a simple training objective, through a sufficiently large parameter space and enough data, gives rise to capabilities that exceed intuition.
The Value of Demystification
Understanding “LLMs are functions” has practical value.
It lets you stop treating LLM errors as “AI is unreliable” and instead understand them as the function fitting poorly in certain input regions. It helps you see what Prompt Engineering actually does – adjusting the input’s position in high-dimensional space so it lands in a region where the function fits well. It helps you understand why the Context Window has a limit – it’s not just a technical constraint, but a consequence of Attention’s computational cost, which grows quadratically with sequence length.
No need for reverence. No need for fear. What’s needed is understanding. When you know what’s under the hood, you can push it to its limits.
- Blog Link: https://johnsonlee.io/2026/02/25/llm-is-a-function.en/
- Copyright Declaration: Copyright belongs to the author. For commercial reproduction, please contact the author for authorization; for non-commercial reproduction, please credit the source.
