A
Accuracy / Performance Metric

In the paper, accuracy is the percentage of problems solved correctly out of a total.

Paper 14
Activation Function

A non-linear function applied to a neuron's weighted sum before passing the result to the next layer.

Paper 03
Add & Norm (Residual + Layer Norm)

The wrapping applied around every sub-layer: `output = LayerNorm(x + SubLayer(x))`.
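
A minimal pure-Python sketch of the wrapper, assuming a layer norm without learned gain and bias; `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, not part of any paper's code.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer):
    """output = LayerNorm(x + SubLayer(x))"""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    # Hypothetical sub-layer used only for illustration.
    return [2.0 * a for a in x]

out = add_and_norm([1.0, 2.0, 3.0], toy_sublayer)
```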

Paper 08
Advantage / Advantage Estimation

In policy gradient RL, the difference between actual return and baseline: A = reward - V(prompt).

Paper 15
AI Winter

A period of reduced funding and interest in AI research, typically following a wave of over-promising and under-delivering.

Paper 02
AIME (American Invitational Mathematics Examination)

A competition mathematics exam (15 problems, 3 hours).

Paper 24
ALBERT

A BERT variant that reduces parameters by factorising the embedding matrix and sharing weights across Transformer layers.

Paper 11
Alignment

The process of training an AI system to behave in ways that align with human values, intentions, and safety constraints.

Paper 22
Alignment / Aligning Language Models

Making language models behave in accordance with human values and preferences.

Paper 15
Alignment matrix (attention heatmap)

A grid where each row corresponds to a target word and each column to a source word.

Paper 07
Alignment model

The small neural network inside the attention mechanism that scores how well a decoder state matches each encoder hidden state.

Paper 07
All-to-All Communication

A communication pattern where every GPU sends data to every other GPU.

Paper 19
AMC (American Mathematics Competitions)

A sequence of mathematics competitions for students (AMC 8, 10, 12).

Paper 24
Annotator Burnout

Psychological harm experienced by humans who repeatedly review harmful content (violence, abuse, self-harm).

Paper 22
Apache 2.0 License

A permissive open-source license allowing commercial use without restriction.

Paper 18
Attention

A mechanism introduced the same year as seq2seq (Bahdanau et al., 2014).

Paper 06
Attention Complexity

The computational cost of attention, typically measured in FLOPs (floating point operations).

Paper 18
Attention Mechanism

The key innovation in Transformers that allows each token to consider the relevance of all other tokens in the sequence.

3 papers
Attention weight (αₜᵢ)

The probability-like number, between 0 and 1, representing how much the decoder at decoding step t focuses on source position i.

2 papers
Autoregressive generation

The decoder's mode of operation: generate one token at a time, feed the generated token back as input, generate the next.

Paper 08
Autoregressive language model

A model that generates a sequence by predicting one token at a time, conditioning each prediction on all previously generated tokens.

Paper 10
Autoregressive Language Modeling

A training objective where the model learns to predict the next token given all previous tokens.

Paper 12
Auxiliary balancing loss (L_balance)

An additional loss term added to the main cross-entropy language modelling loss during MoE training.

Paper 09
B
Back-propagation (MCTS)

The process of updating node statistics (visit counts, accumulated rewards) as you trace back from a leaf node to the root after a rollout.

Paper 24
Backpropagation

Short for "backward propagation of errors." The algorithm that computes the gradient of the loss function with respect to every weight in a multi-layer network, by applying the chain rule backwards from the output layer to the input layer.

Paper 03
Backpropagation Through Time (BPTT)

The standard way of training RNNs and LSTMs.

Paper 04
Batch Size

The number of training examples (or sequences) processed in a single gradient update.

Paper 13
Beam search

An inference-time decoding algorithm.

2 papers
Benchmark

Standardised tests for evaluating language model quality (e.g., MMLU for general reasoning, GSM8k for math, HumanEval for coding).

Paper 18
Best-of-N (BoN)

A strategy where you generate N independent solutions to the same problem and select the best one according to some criterion (e.g., a Process Reward Model score).

Paper 23
Bidirectional context

A representation built from both left and right context simultaneously.

Paper 11
Bidirectional LSTM

An LSTM that processes a sequence in both directions, forward and backward.

Paper 04
Bidirectional RNN (BiRNN)

Two recurrent networks — one reading left to right, one right to left — whose hidden states are concatenated at each position.

Paper 07
Binary Classification

The task of deciding whether an input belongs to one of two categories — yes or no, 0 or 1, cat or dog.

Paper 02
BLEU score

Bilingual Evaluation Understudy.

2 papers
Blockwise Attention

Computing attention in blocks (query chunk × KV chunk) rather than all-at-once.

Paper 19
BooksCorpus

The training dataset for GPT-1: approximately 7,000 unpublished novels scraped from the web, totalling ~800 million words.

Paper 10
Bootstrapping

A process where improvement in one component (the model) enables improvement in another (the data), which feeds back to improve the first component further.

Paper 24
Bottleneck (context vector bottleneck)

The design flaw in plain seq2seq (Paper 06): the entire source sentence's meaning must be compressed into a single fixed-size vector before decoding can begin.

Paper 07
BPE (Byte Pair Encoding)

A subword tokenisation algorithm that splits words into common subunits.

Paper 10
Bradley-Terry Model

A probabilistic ranking model from statistics, used here to model human preferences.

2 papers
C
Candidate vector (c̃ₜ)

A vector of proposed updates to the cell state, produced by a tanh layer.

Paper 04
Capacity factor

A multiplier that sets the maximum number of tokens each expert can process per batch: `capacity = (batch_tokens / n_experts) × capacity_factor`.
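
The formula can be sketched as a small helper. Rounding up to a whole number of tokens is an assumption here; the formula above does not say how fractional capacities are handled.

```python
import math

def expert_capacity(batch_tokens, n_experts, capacity_factor):
    """capacity = (batch_tokens / n_experts) x capacity_factor, rounded up (assumed)."""
    return math.ceil(batch_tokens / n_experts * capacity_factor)

def overflow(tokens_routed, capacity):
    """Number of tokens routed to one expert beyond its capacity."""
    return max(0, tokens_routed - capacity)

capacity = expert_capacity(batch_tokens=1024, n_experts=8, capacity_factor=1.25)
excess = overflow(tokens_routed=200, capacity=capacity)
```

With a capacity factor of 1.0 each expert has room for exactly its even share of the batch; values above 1.0 leave slack for uneven routing.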

Paper 09
Catastrophic Forgetting

When a model loses knowledge from pretraining while being fine-tuned on new data.

Paper 15
Causal (masked) self-attention

Self-attention where position i is prevented from attending to positions j > i (future tokens).

2 papers
Causal Language Modeling

Same as autoregressive language modeling: predict the next token given previous tokens.

Paper 12
Causal mask (autoregressive mask)

A (T × T) upper-triangular boolean mask applied in the decoder's self-attention.
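
A small sketch of the mask, assuming the convention that `True` marks a blocked position (column j is a future token for row i when j > i). Some libraries use the opposite convention.

```python
def causal_mask(T):
    """Return a T x T boolean mask; mask[i][j] is True where j > i (blocked)."""
    return [[j > i for j in range(T)] for i in range(T)]

mask = causal_mask(3)
```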

Paper 08
Causal Masking

In autoregressive language modelling, preventing the model from attending to future tokens (tokens that come after the current position).

2 papers
CBOW (Continuous Bag of Words)

One of the two Word2Vec training tasks.

Paper 05
Cell state (c)

The "notebook" of an LSTM: a vector that flows from one time step to the next.

Paper 04
Chain rule

A rule in calculus for differentiating composed functions.

Paper 04
Chain-of-Thought (CoT) Prompting

A prompting technique where intermediate reasoning steps are shown in few-shot examples, causing language models to generate their own step-by-step reasoning before producing a final answer.

2 papers
Chinchilla Ratio

An improvement on the compute-optimal frontier (from DeepMind's Chinchilla paper, 2022).

Paper 13
Chinese Room

A thought experiment proposed by philosopher John Searle in 1980 as a critique of the Turing Test.

Paper 01
Cloze test

A reading comprehension exercise invented in 1953 where words are systematically removed from a passage and the reader must fill them in.

Paper 11
Combined loss (L₃)

The total fine-tuning loss: L₃ = L_task + λ · L_language_model.

Paper 10
Communication Complexity

The amount of data that must be transferred between GPUs.

Paper 19
Computability

A mathematical property of problems: a problem is "computable" if it can be solved by a Turing Machine.

Paper 01
Compute Budget (C)

The total computational resources available for training, measured in FLOPs (floating point operations).

Paper 13
Compute-Communication Overlap

The simultaneous execution of computation and communication.

Paper 19
Compute-Optimal

Achieving the best accuracy for a given computational budget.

Paper 24
Compute-Optimal Frontier

The boundary of efficient training allocations: the curve of (N, D) pairs that minimize loss for a given compute budget C.

Paper 13
Compute-Optimal Strategy

The choice of which inference-time strategy (Best-of-N vs.

Paper 23
Connectionism

The school of thought in AI that believes intelligence emerges from the interactions of many simple connected units (like neurons), rather than from explicit symbolic rules.

Paper 02
Consciousness

The subjective experience of being aware, of having an inner life.

Paper 01
Constant error carousel

The original paper's name for the additive structure of the cell state.

Paper 04
Constitution

A written document specifying principles that an AI should follow.

Paper 22
Constitutional AI (CAI)

An alignment methodology that replaces human feedback with AI feedback.

Paper 22
Context Length

The maximum number of tokens a model can process in a single input.

Paper 18
Context Parallelism

Distributing a long sequence across multiple GPUs along the sequence dimension (distinct from data, tensor, or pipeline parallelism).

Paper 19
Context vector

Also called the **thought vector**.

2 papers
Context window

The maximum number of tokens the model can see at once.

2 papers
Convergence

In machine learning, a model has converged when its weights have settled and further training produces no improvement.

Paper 02
Convolutional Mode (Training)

During training, the recurrence x_t = Āx_{t-1} + B̄u_t can be unrolled and rearranged as a convolution: output = conv(input, kernel).
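
To illustrate the equivalence, here is a sketch for the scalar case, assuming a zero initial state and plain floats `a` and `b` standing in for Ā and B̄. Unrolling the recurrence gives the kernel entries a^k · b.

```python
def recurrent(a, b, u):
    """Apply x_t = a * x_{t-1} + b * u_t step by step, starting from x = 0."""
    x, xs = 0.0, []
    for u_t in u:
        x = a * x + b * u_t
        xs.append(x)
    return xs

def convolutional(a, b, u):
    """Compute the same states as a causal convolution with kernel[k] = a**k * b."""
    kernel = [a ** k * b for k in range(len(u))]
    return [sum(kernel[k] * u[t - k] for k in range(t + 1)) for t in range(len(u))]

u = [1.0, 2.0, 3.0]
states_rec = recurrent(0.5, 1.0, u)
states_conv = convolutional(0.5, 1.0, u)
```

Both functions return the same sequence; the convolutional form has no step-to-step dependency, which is what makes it parallelisable during training.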

Paper 21
Corpus

A body of text used to train a model.

Paper 05
Cosine similarity

A measure of similarity between two vectors based on the angle between them.

Paper 05
Critique Prompt

A prompt asking an AI model to evaluate its own or another model's output against a specific principle.

Paper 22
Cross-attention

Attention where the query comes from one sequence (the decoder) and the keys and values come from another sequence (the encoder).

2 papers
Cross-Entropy Loss

The standard loss function for language modeling.

2 papers
D
d_model (model dimension)

The dimension of all input and output vectors in the Transformer.

Paper 08
Data Annotation / Labeling

The process of having humans provide labels (e.g., preference comparisons) for training data.

Paper 15
Dataset Size (D)

The number of training tokens (or training examples).

Paper 13
Decoder

The second half of the seq2seq architecture.

2 papers
Decoder-only Transformer

A Transformer that uses only the decoder stack (masked self-attention + feed-forward), without the encoder-decoder cross-attention.

2 papers
Decomposition / Reasoning Decomposition

Breaking a complex problem into smaller, simpler subproblems and solving each one before combining results.

Paper 14
Dense computation

The standard approach in neural networks: every parameter fires for every input.

Paper 09
Discretisation

The process of converting a continuous-time differential equation (dx/dt = Ax + Bu) into a discrete recurrence (x_t = Āx_{t-1} + B̄u_t).

Paper 21
DistilBERT

A compressed version of BERT created by knowledge distillation (training a small model to mimic the outputs of a larger one).

Paper 11
Distributional hypothesis

The linguistic claim (often attributed to Firth, 1957) that words appearing in similar contexts tend to have similar meanings.

Paper 05
Distributional Shift

When the RL policy generates responses very different from the distribution the reward model was trained on.

Paper 15
dₖ (key dimension)

The dimension of the Query and Key vectors.

Paper 08
Dualism

The philosophical position, associated with René Descartes, that mind and matter are fundamentally different kinds of thing.

Paper 01
E
Efficient Attention

Modified attention patterns (sliding window, global, sparse) that reduce computation from O(n²) to O(n log n) or O(n), making long sequences tractable.

Paper 20
Eigenvalue (of State Matrix A)

A scalar λ such that Av = λv for some eigenvector v.

Paper 21
Element-wise product (Hadamard product, ⊙)

Multiplication of two vectors of equal length, slot by slot.

Paper 04
ELIZA

A chatbot created in 1966 by Joseph Weizenbaum at MIT.

Paper 01
Embedding

A dense, low-dimensional vector representation of something — a word,

3 papers
Embedding dimension (d)

The length of each word vector.

Paper 05
Embedding matrix (W)

The (V × d) matrix whose rows are the embeddings for each vocabulary word.

Paper 05
Emergent Abilities

Capabilities that appear when a language model reaches a certain scale, but were not present (or were very weak) in smaller versions.

Paper 12
Emergent Capability

A capability that appears in large language models above a certain size threshold, even though it was not explicitly trained for.

Paper 14
Encoder

The first half of the seq2seq architecture.

2 papers
Encoder hidden state (hᵢ)

The vector produced by the bidirectional encoder for source position i.

Paper 07
Encoder-decoder

The Transformer's two-part structure for seq2seq tasks (e.g., translation).

Paper 08
End-to-end learning

A philosophy, popularised by this paper, where a single neural network

Paper 06
Entropy Regularization

A term in RL that encourages exploration by rewarding policy entropy (randomness).

Paper 15
Epoch

One complete pass through all training examples.

2 papers
Expert

One of n specialised feed-forward networks in an MoE layer.

Paper 09
Expert collapse

A training failure mode where the gating network routes most tokens to a small number of popular experts, leaving the rest undertrained and effectively unused.

Paper 09
Exploding gradient problem

The opposite failure mode: the gradient grows without bound when

Paper 04
Exploration-Exploitation Trade-off

The fundamental challenge in search and learning: should you exploit what you've learned (focus on high-reward nodes) or explore new options (try under-explored nodes)?

Paper 24
Exponent (Alpha, Beta, Gamma)

The power in a power law.

Paper 13
Extrapolation

Extending a fitted line to predict values beyond the observed range.

Paper 13
G
Gated Recurrent Unit (GRU)

A simpler variant of the LSTM proposed by Cho et al.

Paper 04
Gating network

The learned routing function `G(x) = Softmax(TopK(H(x), k))`.
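
A pure-Python sketch of the routing function, assuming the gating logits H(x) have already been computed and are passed in as the list `h` (a hypothetical input). TopK keeps the k largest logits and sets the rest to −∞, so softmax assigns them zero weight.

```python
import math

def top_k(scores, k):
    """Keep the k largest scores; replace all others with -inf."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [s if i in keep else -math.inf for i, s in enumerate(scores)]

def softmax(scores):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate(h, k):
    """G(x) = Softmax(TopK(H(x), k)) for precomputed logits h."""
    return softmax(top_k(h, k))

weights = gate([2.0, 1.0, 0.5, 3.0], k=2)
```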

Paper 09
GELU activation

Gaussian Error Linear Unit: GELU(x) = x · Φ(x), where Φ is the Gaussian CDF.
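
A short sketch of the exact form, writing the Gaussian CDF with the error function: Φ(x) = ½ · (1 + erf(x / √2)). Many implementations use a tanh-based approximation instead; this version does not.

```python
import math

def gelu(x):
    """GELU(x) = x * Phi(x), with Phi the standard Gaussian CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi
```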

Paper 10
Gemini Nano

The smallest variant (~2–7B parameters) of Gemini, optimized for on-device inference on mobile phones and edge devices.

Paper 20
Gemini Pro

The balanced variant (~50B parameters estimated) of Gemini, deployed for most production use (Google Bard, Workspace, Search).

Paper 20
Gemini Ultra

The largest variant (~1.3T parameters estimated) of Gemini, achieving the highest benchmarks (90.04% MMLU) but requiring significant compute.

Paper 20
Generalization

The ability of a model to perform well on new, unseen data (test set).

Paper 13
Generalization / Generalizing to New Domains

Whether a trained reward model (or policy) performs well on new, unseen tasks or domains.

Paper 15
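Outcome Reward Model (ORM)

A reward model that scores only the final answer of a solution, in contrast to a Process Reward Model, which scores each intermediate step (from the glossary for "Let's Verify Step by Step").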

2 papers
GloVe

A 2014 alternative to Word2Vec, from Stanford.

Paper 05
GLUE benchmark

General Language Understanding Evaluation.

Paper 11
Gödel's Incompleteness Theorem

A 1931 result by mathematician Kurt Gödel: in any formal mathematical system powerful enough to describe arithmetic, there are true statements that cannot be proved within that system.

Paper 01
Gradient

The vector of partial derivatives of the loss with respect to every weight in the network.

Paper 04
Gradient Checkpointing

A memory optimisation technique where intermediate activations are discarded during forward pass and recomputed during backward pass.

Paper 19
Gradient Descent

The optimisation algorithm that trains neural networks.

Paper 03
Greedy decoding

The simplest decoder strategy.

Paper 06
Grouped Query Attention (GQA)

A variant of Multi-Head Attention where multiple query heads share the same key-value head.

Paper 18
GSM8K

A benchmark dataset of 8,500 grade-school math word problems, ranging from simple arithmetic to multi-step reasoning.

Paper 14
H
Hallucination

The phenomenon where a language model generates plausible-sounding but factually incorrect or entirely made-up information.

Paper 12
Hard alignment

In older statistical machine translation, each target word was explicitly assigned to exactly one source word.

Paper 07
Hardware-Aware Algorithm

An algorithm designed with GPU memory hierarchy in mind.

Paper 21
Harm Prevention

A principle in many AI constitutions stating that the AI should avoid providing information, advice, or assistance that could lead to physical, financial, psychological, or emotional harm.

Paper 22
HBM (High Bandwidth Memory)

High-capacity GPU memory (e.g., 80GB on an H100), with lower bandwidth than SRAM.

Paper 21
Head (attention head)

One of h = 8 parallel attention computations in multi-head attention, each operating in a lower-dimensional subspace (dₖ = d_model / h).

Paper 08
Hebbian Learning

The biological learning rule proposed by psychologist Donald Hebb in 1949: "neurons that fire together, wire together." When two neurons are active simultaneously, the connection between them strengthens.

Paper 02
Helpful, Harmless, Honest (HHH)

The alignment criteria used to train InstructGPT: helpful (answers user queries well), harmless (doesn't enable or encourage harmful acts), honest (doesn't hallucinate or mislead).

Paper 15
Helpfulness

A principle stating that the AI should genuinely assist the human in achieving their goals.

Paper 22
Hidden Layer

A layer of neurons between the input layer and the output layer.

Paper 03
Hidden state (h)

The LSTM's "spoken" output at each step.

2 papers
Honesty

A principle stating that the AI should be truthful and not deliberately mislead the human.

Paper 22
Human Feedback (HF)

Labels provided by humans comparing two AI outputs and indicating which one is better.

Paper 22
Human Preference / Human Feedback

Judgments by human raters about which model outputs are better.

Paper 15
Human Rater Agreement / Inter-Rater Reliability

Measure of how often different human raters agree on which output is better.

Paper 15
I
L
Language model

A probability distribution over sequences of tokens.

Paper 10
Large Language Model (LLM)

A neural network trained to predict the next token in a sequence, using next-token prediction as the training objective.

Paper 14
Latency

The time taken to produce a response.

Paper 23
Latency Hiding

Making communication latency disappear by overlapping it with computation.

Paper 19
Layer Normalisation (Layer Norm)

Applied after each sub-layer.

Paper 08
Learning Rate

A small positive number.

3 papers
Linear Recurrence

A recurrence relation of the form x_t = Āx_{t-1} + B̄u_t where future x_t depends only on previous x_{t-1}, not on all past history.

Paper 21
Linear Regression

A statistical method to fit a straight line through data points.

Paper 13
Linear Separability

A property of a dataset: two classes are linearly separable if you can draw a straight line (in 2D) or a flat hyperplane (in higher dimensions) that perfectly separates all examples of one class from all examples of the other.

Paper 02
Load balancing

The goal of ensuring that all n experts receive roughly equal numbers of tokens over training.

2 papers
Log-Log Plot

A graph where both axes are logarithmic.

Paper 13
Logit

A raw, unnormalised score before softmax.

Paper 08
Logits

Raw, unnormalised scores output by a neural network before applying softmax.

Paper 07
Long short-term memory

The full name of the paper.

Paper 04
Loss Function

A mathematical function that measures how wrong the network's prediction is.

Paper 03
LSTM (Long Short-Term Memory)

The RNN variant used by both the encoder and decoder in this paper.

Paper 06
M
Masked Language Modelling (MLM)

BERT's primary pre-training objective.

Paper 11
MATH Benchmark

A dataset of 12,500 competition-level math problems from AMC (American Mathematics Competitions) and AIME (American Invitational Mathematics Examination).

2 papers
Memory Scaling

With P GPUs using Ring Attention, per-GPU memory is O((n/P) × d), so it shrinks in proportion to 1/P and the maximum sequence length grows linearly with the number of GPUs.

Paper 19
Mixture of Experts (MoE)

A layer type where multiple expert networks are available, and a router learns which expert(s) to use for each input.

Paper 18
MMLU (Massive Multitask Language Understanding)

A benchmark of 57 diverse academic subjects (history, law, science, medicine) with 14,042 multiple-choice questions.

Paper 20
Model Scaling / Emergent Threshold

The observation that CoT prompting's effectiveness depends critically on model size.

Paper 14
Model Size (N)

The number of parameters in a neural network.

Paper 13
MoE (Mixture of Experts) layer

A drop-in replacement for the FFN sub-layer in a Transformer.

Paper 09
Monte Carlo Tree Search (MCTS)

A search algorithm that explores a decision tree by: (1) selecting promising nodes using UCB, (2) expanding the tree with new candidate moves, (3) running rollouts to simulate outcomes, (4) backing up the results to update node statistics.

Paper 24
Multi-head attention (MHA)

`Concat(head₁, ..., headₕ) · W^O`.

2 papers
Multi-Query Attention (MQA)

An attention variant where all query heads share a single key-value head.

Paper 18
Multimodal

Capable of processing and reasoning over multiple modalities (text, images, audio, video) simultaneously.

Paper 20
P
Parallel Scan

A hardware-friendly algorithm (e.g., Blelloch scan) that computes a recurrence x_t = f(x_{t-1}, u_t) in parallel by decomposing it into a tree of subproblems.

Paper 21
Parameters

The learnable weights in a neural network model.

Paper 18
Pass@K

A metric that evaluates whether at least one out of K generated solutions is correct.
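
A direct sketch of the definition, assuming each problem comes with a list of booleans marking which generated solutions were correct (a hypothetical data layout). It scores the first K samples per problem and does not implement the unbiased estimator some papers use.

```python
def pass_at_k(results_per_problem, k):
    """Fraction of problems where at least one of the first k solutions is correct."""
    solved = sum(any(results[:k]) for results in results_per_problem)
    return solved / len(results_per_problem)

results = [
    [False, True, False],   # solved on the second attempt
    [False, False, False],  # never solved
]
score_1 = pass_at_k(results, 1)
score_3 = pass_at_k(results, 3)
```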

Paper 23
Patch (Image Patch)

A small rectangular region of an image, typically 14×14 pixels.

Paper 20
Peephole connections

An extension to LSTMs (Gers & Schmidhuber, 2000) in which the gates can also read the cell state.

Paper 04
Perceptron

The artificial neuron Rosenblatt described: takes several inputs, multiplies each by a weight, sums the results, and outputs 1 if the sum exceeds a threshold, 0 otherwise.
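
The rule above fits in a few lines. The weights and threshold below are illustrative values chosen so the unit computes logical AND; they are not taken from the paper.

```python
def perceptron(inputs, weights, threshold):
    """Output 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# Illustrative AND gate: both inputs must be 1 for the sum (2.0) to exceed 1.5.
and_weights, and_threshold = [1.0, 1.0], 1.5
outputs = [perceptron([a, b], and_weights, and_threshold) for a in (0, 1) for b in (0, 1)]
```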

Paper 02
Perplexity

A metric for language models derived from cross-entropy loss.

Paper 12
Phrase table

A core data structure in pre-2014 statistical translation.

Paper 06
Policy Gradient / Policy Optimization

RL algorithms that improve a policy (probability distribution) by taking gradient steps that increase expected reward.

Paper 15
Policy Model

The language model being trained and improved across rounds.

Paper 24
Polysemy

The property of a word having multiple meanings.

Paper 05
Positional encoding (PE)

A fixed vector added to each input embedding to inject position information.

2 papers
Power Law

A mathematical relationship where one variable is proportional to another raised to a power: y = a * x^b.
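
Taking logarithms turns the power law into a straight line, log y = log a + b · log x, so the exponent can be recovered by ordinary least squares in log space. The sketch below assumes positive, noise-free data and uses synthetic values generated for illustration.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((p - mx) * (q - my) for p, q in zip(lx, ly)) / sum((p - mx) ** 2 for p in lx)
    a = math.exp(my - b * mx)
    return a, b

xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x ** -0.5 for x in xs]   # synthetic data with a = 3, b = -0.5
a, b = fit_power_law(xs, ys)
```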

Paper 13
PPO (Proximal Policy Optimization)

A stable reinforcement learning algorithm used in the RL stage.

Paper 15
Pre-training

Training a model on large-scale, typically unlabelled data before fine-tuning.

3 papers
Pretrained vectors

Word vectors trained on a large corpus by someone else and then

Paper 05
Process Reward Model (PRM)

A machine-learning model trained to evaluate the quality of individual steps in a multi-step reasoning process.

2 papers
Program-of-Thought (PoT)

Solving problems by writing Python code instead of natural language reasoning.

Paper 24
Prompt Engineering

The practice of carefully designing the text prompt to get better outputs from a language model.

2 papers
Prompt Format

The specific structure and wording of a prompt.

Paper 12
Python Verification

The process of executing Python code to check if a solution is correct.

Paper 24
R
Reasoning / Multi-Step Reasoning

The cognitive process of chaining ideas together across multiple steps to arrive at a conclusion.

Paper 14
Receptive Field

In deep networks, the range of input positions that influence a given output position.

2 papers
Recurrent Mode (Inference)

During token-by-token generation, apply the recurrence directly: x_t = Āx_{t-1} + B̄u_t.

Paper 21
Recurrent Neural Network (RNN)

A neural network that processes inputs one step at a time, feeding its hidden state back in at the next step.

2 papers
Reinforcement Learning from Human Feedback (RLHF)

A three-stage training pipeline for aligning language models: (1) Supervised Fine-Tuning on human demonstrations, (2) training a Reward Model on human preference comparisons, (3) using Reinforcement Learning (PPO) to optimize the policy against the reward model.

Paper 15
Rejection Sampling

A data generation strategy: generate many candidate solutions, keep only the correct ones, discard the rest.

Paper 24
Representation learning

The broader idea — Word2Vec is an early instance — that useful

Paper 05
Residual connection

Adding the sub-layer's input directly to its output: `x + SubLayer(x)`.

Paper 08
Reverse-input trick

Sutskever's empirical hack: feed the source sentence to the encoder in reverse order.

Paper 06
Revision Prompt

A prompt asking an AI to rewrite its own output to address a critique.

Paper 22
Reward Function

In MCTS, the function that assigns a reward to a rollout outcome.

Paper 24
Reward Hacking / Gaming the Reward Model

When the RL policy finds ways to get high reward scores without actually being helpful.

Paper 15
Reward Model (RM)

A neural network trained in the second stage of RLHF to predict which of two responses humans prefer.

2 papers
Ring Attention

A distributed attention algorithm where P GPUs are arranged in a ring topology.

Paper 19
Ring Topology

An arrangement of P GPUs in a logical circle where GPU i communicates with GPU i-1 (receives data) and GPU i+1 (sends data).
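
A small simulation of the ring, assuming indices wrap around modulo P and that each GPU starts with one data block labelled by its own index (both are illustrative assumptions). After P − 1 send/receive steps every GPU has seen every block.

```python
def ring_neighbours(i, P):
    """Return (source, destination): GPU i receives from i-1 and sends to i+1."""
    return (i - 1) % P, (i + 1) % P

def blocks_seen(P):
    """Rotate one block per GPU around the ring and record what each GPU sees."""
    held = list(range(P))            # GPU i initially holds block i
    seen = [{i} for i in range(P)]
    for _ in range(P - 1):
        # Every GPU passes its block forward and receives the block from GPU i-1.
        held = [held[ring_neighbours(i, P)[0]] for i in range(P)]
        for i in range(P):
            seen[i].add(held[i])
    return seen
```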

Paper 19
RL-CAI (Reinforcement Learning Constitutional AI)

The second stage of Constitutional AI.

Paper 22
RLAIF (Reinforcement Learning from AI Feedback)

The stage of Constitutional AI where an AI (rather than a human) provides feedback on which response better follows the constitution.

Paper 22
RoBERTa

Robustly Optimized BERT Pretraining Approach.

Paper 11
Rollout

In MCTS, a simulation of completing a partial solution to a full solution.

Paper 24
Rotary Position Embeddings (RoPE)

A method of encoding token position information by rotating query and key vectors.

Paper 18
S
Sampling Temperature

A hyperparameter in language model decoding that controls randomness.
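
A sketch of the usual formulation, in which logits are divided by the temperature before softmax. The logits below are made-up values for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature gives a sharper distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)   # closer to greedy decoding
flat = softmax_with_temperature(logits, 2.0)    # closer to uniform sampling
```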

Paper 23
Scaled dot-product attention

`Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V`.
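
A dependency-free sketch of the formula, with Q, K, and V given as lists of row vectors. It omits masking and batching, so it is meant for reading rather than speed.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))])
    return out

# A zero query scores every key equally, so the output is the mean of the values.
result = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```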

2 papers
Scaling

Increasing the size of neural networks (more parameters, more data, more compute).

Paper 13
Scaling Laws / Emergent Capabilities

The observation that larger models have qualitatively different capabilities (reasoning, instruction-following) that smaller models lack.

Paper 15
Segment embedding

One of three embeddings summed to form each token's input representation.

Paper 11
Selective SSM

An SSM where the input projection (B), output projection (C), and step size (Δ) are functions of the input u_t, not fixed constants.

Paper 21
Self-attention

Attention where Q, K, and V all come from the same sequence.

Paper 08
Self-Consistency

A technique that improves chain-of-thought reasoning by sampling multiple independent reasoning chains from the same prompt and taking a majority vote on the final answer.
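
The voting step can be sketched as follows, assuming the final answers have already been extracted from the sampled reasoning chains (the strings below are made up). Ties go to whichever answer appeared first.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the sampled chains."""
    return Counter(answers).most_common(1)[0][0]

sampled_answers = ["42", "41", "42", "17", "42"]
final_answer = majority_vote(sampled_answers)
```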

2 papers
Self-Critique

The process of an AI model reading its own output and identifying whether it violates constitutional principles.

Paper 22
Self-Evolution

A bootstrapping process where: (1) a model generates candidate solutions using search, (2) solutions are verified automatically, (3) correct, high-quality solutions become training data, (4) the model is trained on this data, improving for the next round.

Paper 24
SentencePiece

A subword tokenizer that converts text into tokens using a learned vocabulary.

Paper 20
Seq2seq (sequence-to-sequence)

The encoder-decoder architecture from Paper 06 (Sutskever et al., 2014).

Paper 07
Sequence Parallelism

Parallelising the sequence dimension of tensors.

Paper 19
Sequential Revision

A strategy where you iteratively refine a solution, using feedback from one attempt to improve the next.

Paper 23
Sigmoid

The activation function σ(z) = 1/(1+e⁻ᶻ).

Paper 03
Sigmoid function (σ)

A function that squashes any real number into the interval (0, 1).

2 papers
Skip-gram

The other Word2Vec training task, and the one people usually mean

Paper 05
SL-CAI (Supervised Learning Constitutional AI)

The first stage of Constitutional AI.

Paper 22
Sliding Window Attention (SWA)

An attention variant where each token attends only to the last W tokens (a sliding window), not all previous tokens.
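
A sketch of the corresponding mask, assuming the window of W tokens includes the current token and that `True` marks an allowed position; implementations differ on both conventions.

```python
def sliding_window_mask(T, W):
    """allowed[i][j] is True when token i may attend to token j (i - W < j <= i)."""
    return [[i - W < j <= i for j in range(T)] for i in range(T)]

allowed = sliding_window_mask(T=4, W=2)
```

Each row has at most W allowed positions, so the attention cost per token is bounded by W rather than by the sequence length.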

Paper 18
Slope

On a log-log plot, the slope of a line is the exponent of the power law.

Paper 13
Soft alignment

Attention's approach: each target word is generated using a *weighted blend* of multiple source words, not a hard assignment to one.

Paper 07
Softmax

A function that turns a vector of raw scores into a probability distribution.

5 papers
Softplus Function

Smooth approximation of ReLU: softplus(x) = log(1 + e^x).

Paper 21
Sparse computation

The opposite of dense: only a fraction of parameters are active for any given input.

Paper 09
Specification Gaming

The problem where an AI system finds a way to satisfy the letter of a specification while violating its spirit.

Paper 22
SQuAD

Stanford Question Answering Dataset.

Paper 11
SRAM (Static RAM)

Tiny, ultra-fast on-GPU cache (e.g., 192KB per core).

Paper 21
State Space Model (SSM)

A continuous or discrete linear dynamical system.

2 papers
State Transition Matrix (A)

An n×n matrix governing how the hidden state x evolves over time.

Paper 21
Statistical Machine Translation (SMT)

The dominant pre-2014 translation approach.

Paper 06
Step Size (Δ)

A scalar (or per-head scalar) that controls the discretisation rate.

Paper 21
Straggler Problem

When one GPU is slower than others (older hardware, thermal throttling, interference), it becomes the bottleneck.

Paper 19
Structured State Space (S4)

A prior SSM architecture (Gu et al., 2021) that imposes structure on the A matrix (e.g., diagonal, plus rank-1 update) for efficiency.

Paper 21
Subsampling

A Word2Vec training trick where very common words (like "the", "of",

Paper 05
Supervised Fine-Tuning (SFT)

The first stage of RLHF.

2 papers
Switch Transformer

Google's 2021 simplification of MoE: k=1 routing (route each token to exactly one expert, no blending).

Paper 09
Sycophancy / Sycophantic Behavior

When a model agrees with users even when the user is wrong, in order to be pleasing.

Paper 15
Synchronisation Barrier

A point where all P GPUs pause and wait for the slowest GPU to finish.

Paper 19
T
tanh (hyperbolic tangent)

A function that squashes any real number into the interval (−1, +1).

Paper 04
Teacher forcing

A training technique.

Paper 06
Technical Report

A publication style (unlike peer-reviewed research papers) that allows companies to present results without the formal review process.

Paper 20
Temperature (in Generation)

A hyperparameter controlling randomness in generation.

Paper 12
Test-Time Compute

Spending additional computation at inference time (rather than training time) to improve performance.

2 papers
Thought vector

Another name for the **context vector** — Hinton's evocative label for

Paper 06
Threshold

The minimum weighted sum required for the Perceptron to output 1.

Paper 02
Throughput

The number of queries a system can handle per unit time.

Paper 23
Token

A unit of text, roughly a word or subword.

2 papers
Token Budget

The total number of tokens (words or subwords) available for generating a solution.

Paper 23
Token dropping

When an expert receives more tokens than its capacity allows, excess tokens skip the MoE layer and pass through the residual connection unchanged.

Paper 09
Token Position

The index of a token in the sequence (0 to n-1).

Paper 19
Tokenization

The process of converting input (text, images, audio) into discrete tokens.

Paper 20
Top-k selection (TopK)

The operation that keeps the k largest values in a vector and sets all others to −∞.

Paper 09
Training Data

The text corpus used to train a language model.

Paper 18
Training-Time Compute

Computation used to train the model initially.

Paper 23
Transfer learning

The reuse of knowledge (model weights) learned on one task/dataset for a different but related task.

Paper 10
Transformer

A neural network architecture based on self-attention, introduced in "Attention Is All You Need" (2017).

Paper 20
Transformer Decoder

The architecture used in GPT models: a stack of self-attention and feedforward layers that process tokens left-to-right (causally).

Paper 13
Transparency

A key benefit of Constitutional AI: the principles are written in human-readable natural language, making the intended values explicit and auditable.

Paper 22
Turing Machine

An abstract mathematical machine Turing described in 1936 — not a real physical device, but a thought experiment.

Paper 01
Turing Test

The test proposed by Turing: a machine passes if a human interrogator, communicating only by typed text, cannot reliably distinguish it from a human.

Paper 01