How LLMs Work

AI Foundations

Module 02

Foundation

Choosing the Right Model: Maya discovers that writing a product description costs ~800 tokens (Haiku: $0.0006) and answering a customer email costs ~400 tokens (Haiku: $0.0003). At 2,000 products and 500 emails/week, ShopMate's monthly bill will be under $50 -- far less than hiring a copywriter. Understanding tokens and costs makes the business case obvious.

Large Language Models are transformer-based neural networks trained to predict the next token in a sequence. Understanding the core mechanics helps you prompt more effectively, interpret outputs more accurately, and know when to trust -- or question -- a model's response.

The Transformer Architecture

Nearly every modern LLM is built on the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need." Understanding its components demystifies how these models process and generate language.

Input Embedding

The model converts each token into a high-dimensional vector (typically 4,096 to 12,288 dimensions). These embeddings capture semantic relationships: tokens with similar meanings end up close together in this vector space. Positional encodings are added so the model knows the order of tokens -- without this, "the dog bit the man" and "the man bit the dog" would look identical.
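A minimal sketch of why position information matters, using the sinusoidal encoding from the original Transformer paper. The three-word vocabulary and the 4-dimensional embeddings are made-up values for illustration, and many current LLMs use learned or rotary position schemes instead.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# Hypothetical 4-dimensional embeddings for a three-word vocabulary.
EMBEDDINGS = {
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bit": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
}

def embed(tokens, with_position):
    """Look up each token's vector, optionally adding its positional encoding."""
    vectors = []
    for position, token in enumerate(tokens):
        vector = list(EMBEDDINGS[token])
        if with_position:
            encoding = positional_encoding(position, len(vector))
            vector = [v + e for v, e in zip(vector, encoding)]
        vectors.append(vector)
    return vectors

a = "dog bit man".split()
b = "man bit dog".split()

# Without positions, both sentences are the same collection of vectors.
same_without = sorted(embed(a, False)) == sorted(embed(b, False))
# With positions added, the two collections differ.
same_with = sorted(embed(a, True)) == sorted(embed(b, True))
print(same_without, same_with)
```

Sorting the vectors is only a way to compare the two sentences while ignoring order; it is not something the model does.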

Self-Attention Layers

The core innovation of the Transformer. Each token computes three vectors: Query (what am I looking for?), Key (what do I contain?), and Value (what information do I carry?). Attention scores are computed by comparing every Query against every Key, then using those scores to create a weighted sum of Values. This allows each token to "attend to" every other token in the context.
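The Query/Key/Value computation can be sketched in plain Python. This is a single-head toy, not a production implementation: the 2-dimensional vectors are hypothetical, and in a real model the Queries, Keys, and Values are produced by learned projections of the embeddings. The softmax step and the division by the square root of the key dimension follow the original paper and are not spelled out in the text above.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Single-head attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this token's Query against every token's Key.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of Values, one output dimension at a time.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors for three tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = self_attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence, and each row sums to 1.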

Multi-Head Attention

Rather than computing attention once, the Transformer splits it into multiple "heads" (typically 32-128). Each head learns to focus on different types of relationships: one head might track subject-verb agreement, another might track pronoun references, another might capture semantic similarity. The outputs are concatenated and projected back to the model's dimension.

Feed-Forward Networks

After attention, each token passes through a feed-forward neural network (two linear transformations with a non-linearity between them). This is where much of the model's "knowledge" is thought to be stored -- factual associations, learned patterns, and reasoning shortcuts are encoded in these weight matrices. The inner layer is typically about 4x as wide as the model's hidden dimension.
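A rough sketch of one feed-forward block, assuming a ReLU non-linearity (as in the original Transformer paper; many newer models use other activations) and a 4x expansion. The weights are random placeholders, and the dimensions are chosen only to keep the example small.

```python
import random

def linear(vector, weights, bias):
    """Multiply a vector by a weight matrix (one row per output) and add a bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def feed_forward(vector, w1, b1, w2, b2):
    """Expand to the inner width, apply ReLU, project back to the model width."""
    hidden = [max(0.0, h) for h in linear(vector, w1, b1)]
    return linear(hidden, w2, b2)

def ffn_parameter_count(d_model, expansion=4):
    """Weights and biases in one feed-forward block."""
    d_hidden = expansion * d_model
    weights = 2 * d_model * d_hidden
    biases = d_hidden + d_model
    return weights + biases

# Tiny hypothetical layer: model dimension 2, inner dimension 4 * 2 = 8.
rng = random.Random(0)
d_model, d_hidden = 2, 8
w1 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
b1 = [0.0] * d_hidden
w2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
b2 = [0.0] * d_model

output = feed_forward([0.5, -0.25], w1, b1, w2, b2)
print(len(output))                # 2: same width going in and coming out
print(ffn_parameter_count(4096))  # 134238208
```

At a hidden dimension of 4,096, a single block of this shape already holds roughly 134 million parameters, which gives a sense of why these layers account for so much of a model's size.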

How Attention Works -- An Analogy

Imagine reading a sentence and highlighting the most relevant words for understanding each word. In "The cat sat on the mat because it was tired," when processing "it," attention heavily weights "cat" (the referent) and lightly weights "mat" (less relevant). The model learns these weighting patterns from billions of examples during training. This is why LLMs can handle long-range dependencies that defeated earlier architectures.

Tokenization: How Models See Text

LLMs do not process raw text -- they process tokens. Tokenization is the process of breaking text into the sub-word units that the model actually operates on.

Byte-Pair Encoding (BPE)

Most LLMs use BPE or a variant. The algorithm starts with individual characters and iteratively merges the most frequent adjacent pairs. Common words like "the" become single tokens. Rare words are split: "tokenization" might become ["token", "ization"]. This balances vocabulary size against sequence length.
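The merge loop can be sketched on a toy corpus. This is a simplified illustration, assuming whitespace-separated words and character-level starting symbols; production tokenizers typically operate on bytes and handle spaces and special tokens differently.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Start from characters and apply the most frequent merge repeatedly."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, vocab = train_bpe("the the the the the then there", num_merges=2)
print(merges)
print(vocab)
```

After two merges the frequent word "the" has become a single symbol, while the rarer "then" and "there" are still split into "the" plus leftover characters.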

The ~0.75 Rule

A rough heuristic: 1 token is approximately 0.75 English words, or 4 characters. "Hello world" is 2 tokens. But this varies dramatically: code often has more tokens per line than prose, and non-English languages may use 2-3x more tokens per word. Always use your model provider's tokenizer tool for accurate counts.

Surprising Tokenization

"ChatGPT" tokenizes as ["Chat", "G", "PT"] (3 tokens)

" indivisible" tokenizes as [" ind", "ivis", "ible"] (3 tokens)

"123456789" tokenizes as ["123", "456", "789"] (3 tokens)

Why This Matters

Models struggle with tasks that require character-level reasoning (spelling, counting letters) because they never see individual characters -- they see tokens. This explains why LLMs can fail at "How many R's in 'strawberry'?" -- the model might see something like ["str", "aw", "berry"], depending on the tokenizer, and must reason about characters it cannot directly observe.

Training Pipeline

From Raw Data to Deployed Model

Training Stages Explained

Stage 1: Pre-training

The model reads trillions of tokens from the internet, books, code, and other sources. Its only objective: predict the next token. This self-supervised task requires no human labels. After pre-training, the model is a powerful text completer but not a useful assistant -- it will continue any text you give it, including harmful content, without judgment. Pre-training is the most expensive stage; for frontier models it can cost tens of millions of dollars or more in compute.
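To make the next-token objective concrete, here is a toy sketch in which a bigram count table stands in for the neural network. Real pre-training adjusts transformer weights by gradient descent, but the quantity being minimised has the same shape: the average negative log-probability the model assigns to the token that actually comes next.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Predicted distribution over the next token, given the previous one."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def average_loss(counts, tokens):
    """Mean negative log-probability of each actual next token."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_probs(counts, prev).get(nxt, 0.0)
        losses.append(-math.log(p) if p > 0 else math.inf)
    return sum(losses) / len(losses)

tokens = "the cat sat on the mat".split()
counts = train_bigram(tokens)

print(next_token_probs(counts, "the"))  # "cat" and "mat" are equally likely
print(round(average_loss(counts, tokens), 4))
```

No labels were needed: the text itself supplies both the input (the previous token) and the target (the next one).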

Stage 2: Supervised Fine-Tuning (SFT)

Human annotators write thousands of ideal (prompt, response) pairs. The model is fine-tuned on these examples, learning the format and style of a helpful assistant. This is where "chat" behaviour emerges. The model learns to follow instructions, provide structured responses, and refuse harmful requests. SFT transforms a raw text completer into something that feels like a conversation partner.

Stage 3: RLHF / RLAIF

RLHF stands for Reinforcement Learning from Human Feedback. Humans rank multiple model outputs for the same prompt. A reward model learns these preferences, then the LLM is optimised to produce outputs the reward model scores highly. This sharpens quality: the model learns nuanced preferences like "be concise but thorough" and "acknowledge uncertainty rather than guessing." RLAIF (Reinforcement Learning from AI Feedback) uses another model instead of humans to provide the rankings, which scales more easily.
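One common way to train the reward model, sketched below, is to turn each ranking into pairs and apply a pairwise (Bradley-Terry style) loss. This formulation is an assumption for illustration; the text above does not say which loss is used, and the reward scores here are invented.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_outputs, 2))

def preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(margin)): small when the preferred output scores higher."""
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# A human ranked three outputs for one prompt, best first.
pairs = ranking_to_pairs(["A", "B", "C"])
print(pairs)

# Hypothetical scores from a reward model under training.
scores = {"A": 2.0, "B": 0.5, "C": -1.0}
total = sum(preference_loss(scores[good], scores[bad]) for good, bad in pairs)
print(round(total, 4))
```

Minimising this loss pushes the reward model to score preferred outputs above rejected ones; the LLM is then optimised against those scores.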

Stage 4: Constitutional AI (Claude-specific)

Anthropic's approach: define a set of principles (a "constitution") that the model should follow. The model critiques and revises its own outputs against these principles. This reduces reliance on human labellers for safety training and makes the model's values more explicit and auditable. The constitution includes rules like "choose the response that is most helpful while being honest and avoiding harm."

Fine-Tuning vs Prompting: When to Use Each

One of the most important decisions in applied AI: should you fine-tune a model or just prompt it well?

Dimension	Prompting (+ RAG)	Fine-Tuning
Cost to start	Near zero -- just write prompts	$100 - $10,000+ for training data + compute
Time to deploy	Minutes to hours	Days to weeks
Best for	General tasks, varied use cases, rapid iteration	Consistent style/format, domain-specific terminology, latency-sensitive apps
Knowledge injection	Via context window (RAG) -- ephemeral per request	Baked into weights -- persistent but expensive to update
Model updates	Instantly benefit from provider's new model versions	Must re-fine-tune when base model updates
Flexibility	Change behaviour by editing prompt text	Locked into trained behaviour; changes require retraining

Rule of Thumb

Start with prompting. Add RAG if the model needs access to your data. Only fine-tune if you have proven that prompting cannot achieve your quality bar despite extensive optimisation. Most production applications today use prompting + RAG, not fine-tuning.

Scaling Laws: Why Bigger Models Are Better

Research from OpenAI, DeepMind, and Anthropic has revealed predictable relationships between model performance and three variables: parameter count, training data size, and compute budget.

The Core Finding

Model performance (measured by loss on next-token prediction) follows a power law: each doubling of compute reduces loss by a predictable, roughly constant fraction. This relationship holds across many orders of magnitude, which is why labs continue investing in larger training runs -- the returns are predictable.
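A small numerical sketch of what a power law implies. The scale and exponent below are hypothetical values chosen for illustration, not fitted constants from any paper, and real curves also include an irreducible loss term that is omitted here.

```python
def loss_at(compute, scale=10.0, exponent=0.05):
    """Hypothetical power law: loss = scale * compute ** -exponent."""
    return scale * compute ** -exponent

for compute in (1e18, 1e20, 1e22):
    ratio = loss_at(2 * compute) / loss_at(compute)
    print(f"compute={compute:.0e}  loss={loss_at(compute):.3f}  "
          f"ratio after doubling={ratio:.4f}")
```

Under a pure power law the ratio after doubling is the same at every scale, which is what makes the return on a larger training run forecastable in advance.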

Chinchilla Scaling

DeepMind's Chinchilla paper (2022) showed that most large models of the time were over-parameterised and under-trained for their compute budget. The optimal strategy is to scale parameters and training tokens together. Chinchilla, a 70B-parameter model trained on 1.4T tokens, outperformed larger models trained on far less data, such as a 175B model trained on 300B tokens. This finding reshaped how labs train models.
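The two configurations above can be compared with a few lines of arithmetic. The sketch assumes two widely quoted approximations that are not stated in the text: roughly 20 training tokens per parameter for a compute-optimal model, and training compute of about 6 x parameters x tokens.

```python
TOKENS_PER_PARAMETER = 20  # assumption: commonly quoted Chinchilla rule of thumb

def compute_optimal_tokens(parameters):
    """Approximate compute-optimal training set size for a given model size."""
    return parameters * TOKENS_PER_PARAMETER

def training_flops(parameters, tokens):
    """Approximate training compute, using the common 6 * N * D estimate."""
    return 6 * parameters * tokens

models = {
    "70B on 1.4T tokens": (70e9, 1.4e12),
    "175B on 300B tokens": (175e9, 300e9),
}
for name, (parameters, tokens) in models.items():
    print(f"{name}: {tokens / parameters:.1f} tokens per parameter, "
          f"~{training_flops(parameters, tokens):.2e} training FLOPs")

print(f"{compute_optimal_tokens(70e9):.2e}")  # tokens suggested for a 70B model
```

The 70B model sits at about 20 tokens per parameter, while the 175B model sees fewer than 2, which is the sense in which the larger model was under-trained.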

Emergent Capabilities

Some abilities seem to appear suddenly at specific scale thresholds rather than improving gradually. Chain-of-thought reasoning, multilingual transfer, and in-context learning have all been reported to "turn on" at certain model sizes. This makes it difficult to predict what the next generation of models will be capable of based on current models.

Inference-Time Scaling

Recent research (2024-25) shows you can also improve performance by giving models more compute at inference time -- allowing them to "think longer" before answering. Techniques like chain-of-thought, tree search, and extended thinking trade latency and cost for significantly better accuracy on reasoning-heavy tasks.

Key Operational Concepts

Tokens, Not Words

LLMs process text as tokens -- roughly 0.75 words each. A token is a byte-pair encoding unit. Understanding tokens explains why models sometimes split words oddly, why context windows are measured in tokens, and why code often costs more tokens than prose.

Attention Mechanism

The transformer's self-attention mechanism lets every token attend to every other token in the context window. This is why LLMs can reason over long documents -- but also why the cost of the attention computation grows quadratically with context length.
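A quick way to see the quadratic growth: if every token is compared with every other token, the number of Query-Key comparisons is the context length squared. The sketch counts comparisons for one head in one layer and ignores optimisations that real inference stacks apply.

```python
def attention_comparisons(context_tokens):
    """Query-Key comparisons for one attention head in one layer."""
    return context_tokens ** 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_comparisons(n):,} comparisons")
```

Making the context 10x longer makes this part of the work 100x larger.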

Temperature and Sampling

After computing a probability distribution over possible next tokens, the model samples from it. Temperature controls the sharpness of that distribution: 0 = always pick the most likely token (effectively deterministic), 1 = sample in proportion to the model's predicted probabilities. Top-p (nucleus sampling) is another control: it restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. Use low temperature for factual tasks, higher for creative work.
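Both controls can be sketched in a few functions. This is an illustrative implementation, not any provider's actual sampler; the token list and raw scores (logits) are invented, and treating temperature 0 as a pure "pick the top token" rule is a simplifying assumption.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    if temperature == 0:
        # Greedy decoding: all probability on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set reaching p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

def sample(tokens, logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token after applying temperature and nucleus filtering."""
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["jacket", "coat", "banana"]  # hypothetical candidates
logits = [2.0, 1.0, 0.0]               # hypothetical raw scores

for temperature in (0, 0.5, 1.0):
    probs = apply_temperature(logits, temperature)
    print(temperature, [round(x, 3) for x in probs])

print(sample(tokens, logits, temperature=0))
```

At temperature 0 the call always returns the top-scoring token; raising the temperature spreads probability onto the alternatives, and a top_p below 1 removes the least likely tail before sampling.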

Context Windows

The context window is the maximum number of tokens a model can process in a single request (input + output combined). Claude supports up to 200K tokens; some models offer 1M+. Longer context windows enable processing entire codebases or books, but cost and latency increase. The model's ability to attend to information also degrades in the middle of very long contexts ("lost in the middle" effect).

Model Tiers: Choosing the Right Size

Model providers offer multiple tiers. Choosing the right one balances quality, speed, and cost.

Tier	Example Models	Best For	Relative Cost
Frontier / Large	Claude Opus, GPT-4o, Gemini Ultra	Complex reasoning, nuanced writing, multi-step analysis	$$$
Mid-tier	Claude Sonnet, GPT-4o-mini, Gemini Pro	Most production workloads -- good quality at reasonable cost	$$
Small / Fast	Claude Haiku, Gemini Flash, Llama 8B	High-volume, low-latency tasks: classification, extraction, routing	$

ThreadCo Cost Strategy

Maya's plan: use Haiku for high-volume tasks (product descriptions, email classification) and Sonnet for tasks requiring nuance (handling customer complaints, writing marketing copy). This keeps the monthly bill under $50 while maintaining quality where it matters most.
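The under-$50 claim can be checked with the per-call figures quoted at the top of this module ($0.0006 per product description, $0.0003 per email). The sketch assumes every call runs on Haiku, that all 2,000 descriptions are generated within a single month, and that a month averages 52/12 weeks; none of these assumptions come from a provider's price list.

```python
WEEKS_PER_MONTH = 52 / 12  # assumption: average month length

def monthly_cost(product_descriptions, emails_per_week,
                 cost_per_description=0.0006, cost_per_email=0.0003):
    """Estimated monthly spend in dollars from per-call costs."""
    description_cost = product_descriptions * cost_per_description
    email_cost = emails_per_week * WEEKS_PER_MONTH * cost_per_email
    return description_cost + email_cost

total = monthly_cost(product_descriptions=2000, emails_per_week=500)
print(f"${total:.2f} per month")
```

On these figures the Haiku-only workload comes to a couple of dollars a month, which leaves most of the $50 budget available for routing the more nuanced tasks to Sonnet.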

The Hallucination Problem

LLMs generate plausible-sounding text by predicting likely next tokens -- they are not retrieval systems and have no concept of "truth." This means they can confidently generate false facts, fake citations, or incorrect code. The root cause is fundamental to how these models work: they optimise for likelihood, not correctness. Always verify outputs for high-stakes decisions. Retrieval-Augmented Generation (RAG) can ground responses in verified sources.

Hands-On Exercises

Exercise 1: Token Counting

Go to OpenAI's Tokenizer tool (or Anthropic's token counter in the API docs). Paste these inputs and record the token count for each: (a) A 100-word paragraph of English prose. (b) The same paragraph translated to Japanese. (c) A 20-line Python function. (d) A JSON object with 10 key-value pairs. What patterns do you notice about which content uses more tokens?

Exercise 2: Temperature Experiment

Using any LLM with adjustable temperature, send the exact same prompt at temperature 0, 0.5, and 1.0. Prompt: "Write a one-paragraph product description for a vintage leather jacket." Run each temperature setting three times. Compare: How much do outputs vary within each temperature? How does creativity change across temperatures? At what point does quality degrade?

Exercise 3: Model Tier Comparison

Take a task relevant to your work (e.g., summarise a document, write a code function, draft an email). Run the same prompt through a small model (e.g., Haiku or GPT-4o-mini) and a large model (e.g., Sonnet/Opus or GPT-4o). Compare: quality, speed, and cost (use the provider's pricing page). For your specific task, is the quality difference worth the cost difference?

Exercise 4: Hallucination Detection

Ask an LLM to "List five academic papers about the impact of AI on small business productivity, with authors and publication years." Then verify each citation. How many are real? How many are fabricated? Now try the same prompt but add "If you are unsure about a citation, say so rather than guessing." Does the instruction change the hallucination rate?

Exercise 5: Cost Calculator

Estimate the monthly AI cost for a real or hypothetical application. Define: (a) Number of requests per day. (b) Average input tokens per request. (c) Average output tokens per request. (d) Which model tier you would use. Calculate the monthly cost using your chosen provider's pricing. Then calculate: what is the cost of doing this task manually (hours x hourly rate)? What is the ROI?
