In the paper, accuracy is the percentage of problems solved correctly out of a total.
A non-linear function applied to a neuron's weighted sum before passing the result to the next layer.
The wrapping applied around every sub-layer: `output = LayerNorm(x + SubLayer(x))`.
In policy gradient RL, the difference between actual return and baseline: A = reward - V(prompt).
A period of reduced funding and interest in AI research, typically following a wave of over-promising and under-delivering.
A competition mathematics exam (15 problems, 3 hours).
A BERT variant that reduces parameters by factorising the embedding matrix and sharing weights across Transformer layers.
The process of training an AI system to behave in ways that align with human values, intentions, and safety constraints.
Making language models behave in accordance with human values and preferences.
A grid where each row corresponds to a target word and each column to a source word.
The small neural network inside the attention mechanism that scores how well a decoder state matches each encoder hidden state.
A communication pattern where every GPU sends data to every other GPU.
A sequence of mathematics competitions for students (AMC 8, 10, 12).
Psychological harm experienced by humans who repeatedly review harmful content (violence, abuse, self-harm).
A permissive open-source license allowing commercial use without restriction.
A mechanism introduced the same year as seq2seq (Bahdanau et al., 2014)
The computational cost of attention, typically measured in FLOPs (floating point operations).
The key innovation in Transformers that allows each token to consider the relevance of all other tokens in the sequence.
The probability-like number, between 0 and 1, representing how much the decoder at decoding step t focuses on source position i.
The decoder's mode of operation: generate one token at a time, feed the generated token back as input, generate the next.
A model that generates a sequence by predicting one token at a time, conditioning each prediction on all previously generated tokens.
A training objective where the model learns to predict the next token given all previous tokens.
An additional loss term added to the main cross-entropy language modelling loss during MoE training.
The process of updating node statistics (visit counts, accumulated rewards) as you trace back from a leaf node to the root after a rollout.
Short for "backward propagation of errors." The algorithm that computes the gradient of the loss function with respect to every weight in a multi-layer network, by applying the chain rule backwards from the output layer to the input layer.
The standard way of training RNNs and LSTMs.
The number of training examples (or sequences) processed in a single gradient update.
An inference-time decoding algorithm.
Standardised tests for evaluating language model quality (e.g., MMLU for general reasoning, GSM8k for math, HumanEval for coding).
A strategy where you generate N independent solutions to the same problem and select the best one according to some criterion (e.g., a Process Reward Model score).
A representation built from both left and right context simultaneously.
An LSTM that processes a sequence in both directions — forward and
Two recurrent networks — one reading left to right, one right to left — whose hidden states are concatenated at each position.
The task of deciding whether an input belongs to one of two categories — yes or no, 0 or 1, cat or dog.
Bilingual Evaluation Understudy.
Computing attention in blocks (query chunk × KV chunk) rather than all-at-once.
The training dataset for GPT-1: approximately 7,000 unpublished novels scraped from the web, totalling ~800 million words.
A process where improvement in one component (the model) enables improvement in another (the data), which feeds back to improve the first component further.
The design flaw in plain seq2seq (Paper 06): the entire source sentence's meaning must be compressed into a single fixed-size vector before decoding can begin.
A subword tokenisation algorithm that splits words into common subunits.
A probabilistic ranking model from statistics, used here to model human preferences.
A vector of proposed updates to the cell state, produced by a tanh
A multiplier that sets the maximum number of tokens each expert can process per batch: `capacity = (batch_tokens / n_experts) × capacity_factor`.
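The capacity formula above can be sketched in a few lines. The function name `expert_capacity` and the decision to round up are assumptions made for illustration; real implementations differ in how they round.

```python
import math

def expert_capacity(batch_tokens: int, n_experts: int, capacity_factor: float) -> int:
    """Maximum tokens one expert may process per batch (rounding up is an assumption)."""
    return math.ceil(batch_tokens / n_experts * capacity_factor)

# 1024 tokens over 8 experts is 128 per expert at perfect balance;
# a factor of 1.25 leaves 25% headroom.
print(expert_capacity(1024, 8, 1.25))  # 160
```

A factor of 1.0 leaves no slack, so any imbalance in routing causes some tokens to exceed an expert's capacity.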
When a model loses knowledge from pretraining while being fine-tuned on new data.
Self-attention where position i is prevented from attending to positions j > i (future tokens).
Same as autoregressive language modeling: predict the next token given previous tokens.
A (T × T) upper-triangular boolean mask applied in the decoder's self-attention.
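A minimal sketch of such a mask in plain Python follows. It assumes the convention that `True` marks a blocked position; some libraries use the opposite convention, so treat the polarity as an assumption.

```python
def causal_mask(T):
    """Return a T x T mask where True marks a blocked (future) position j > i."""
    return [[j > i for j in range(T)] for i in range(T)]

# Row i is the query position, column j the key position.
for row in causal_mask(4):
    print("".join("x" if blocked else "." for blocked in row))
```

Blocked entries are typically set to −∞ in the score matrix before softmax, so they receive zero attention weight.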
In autoregressive language modelling, preventing the model from attending to future tokens (tokens that come after the current position).
One of the two Word2Vec training tasks.
The "notebook" of an LSTM — a vector that flows from one time step to
A rule in calculus for differentiating composed functions.
A prompting technique where intermediate reasoning steps are shown in few-shot examples, causing language models to generate their own step-by-step reasoning before producing a final answer.
An improvement on the compute-optimal frontier (from DeepMind's Chinchilla paper, 2022).
A thought experiment proposed by philosopher John Searle in 1980 as a critique of the Turing Test.
A reading comprehension exercise invented in 1953 where words are systematically removed from a passage and the reader must fill them in.
The total fine-tuning loss: L₃ = L_task + λ · L_language_model.
The amount of data that must be transferred between GPUs.
A mathematical property of problems: a problem is "computable" if it can be solved by a Turing Machine (i.e.
The total computational resources available for training, measured in FLOPs (floating point operations).
The simultaneous execution of computation and communication.
Achieving the best accuracy for a given computational budget.
The boundary of efficient training allocations: the curve of (N, D) pairs that minimize loss for a given compute budget C.
The choice of which inference-time strategy (Best-of-N vs.
The school of thought in AI that believes intelligence emerges from the interactions of many simple connected units (like neurons), rather than from explicit symbolic rules.
The subjective experience of being aware, of having an inner life.
The original paper's name for the additive structure of the cell state.
A written document specifying principles that an AI should follow.
An alignment methodology that replaces human feedback with AI feedback.
The maximum number of tokens a model can process in a single input.
Distributing a long sequence across multiple GPUs along the sequence dimension (distinct from data, tensor, or pipeline parallelism).
Also called the **thought vector**.
The maximum number of tokens the model can see at once.
In machine learning, a model has converged when its weights have settled and further training produces no improvement.
During training, the recurrence x_t = Āx_{t-1} + B̄u_t can be unrolled and rearranged as a convolution: output = conv(input, kernel).
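To make the equivalence concrete, here is a sketch for a scalar state. It assumes a zero initial state and an output `y_t = C * x_t`; the function names and the scalar simplification are illustrative, not taken from any particular implementation.

```python
def ssm_recurrent(A_bar, B_bar, C, u):
    """Scalar SSM run step by step: x_t = A_bar*x_{t-1} + B_bar*u_t, y_t = C*x_t."""
    x, ys = 0.0, []
    for u_t in u:
        x = A_bar * x + B_bar * u_t
        ys.append(C * x)
    return ys

def ssm_convolution(A_bar, B_bar, C, u):
    """Same outputs via a causal convolution with kernel K_j = C * A_bar**j * B_bar."""
    kernel = [C * (A_bar ** j) * B_bar for j in range(len(u))]
    return [sum(kernel[j] * u[t - j] for j in range(t + 1)) for t in range(len(u))]

u = [1.0, 2.0, 3.0, 4.0]
print(ssm_recurrent(0.5, 1.0, 2.0, u))
print(ssm_convolution(0.5, 1.0, 2.0, u))
```

Both functions produce the same outputs. The convolution form has no step-to-step dependency, which is what allows it to be computed in parallel during training.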
A body of text used to train a model.
A measure of similarity between two vectors based on the angle between
A prompt asking an AI model to evaluate its own or another model's output against a specific principle.
Attention where the query comes from one sequence (the decoder) and the keys and values come from another sequence (the encoder).
The standard loss function for language modeling.
The dimension of all input and output vectors in the Transformer.
The process of having humans provide labels (e.g., preference comparisons) for training data.
The number of training tokens (or training examples).
The second half of the seq2seq architecture.
A Transformer that uses only the decoder stack (masked self-attention + feed-forward), without the encoder-decoder cross-attention.
Breaking a complex problem into smaller, simpler subproblems and solving each one before combining results.
The standard approach in neural networks: every parameter fires for every input.
The process of converting a continuous-time differential equation (dx/dt = Ax + Bu) into a discrete recurrence (x_t = Āx_{t-1} + B̄u_t).
A compressed version of BERT created by knowledge distillation (training a small model to mimic the outputs of a larger one).
The linguistic claim (often attributed to Firth, 1957) that words
When the RL policy generates responses very different from the distribution the reward model was trained on.
The dimension of the Query and Key vectors.
The philosophical position, associated with René Descartes, that mind and matter are fundamentally different kinds of thing.
Modified attention patterns (sliding window, global, sparse) that reduce computation from O(n²) to O(n log n) or O(n), making long sequences tractable.
A scalar λ such that Av = λv for some nonzero vector v (the corresponding eigenvector).
Multiplication of two vectors of equal length, slot by slot.
A chatbot created in 1966 by Joseph Weizenbaum at MIT.
A dense, low-dimensional vector representation of something — a word,
The length of each word vector.
The (V × d) matrix whose rows are the embeddings for each vocabulary
Capabilities that appear when a language model reaches a certain scale, but were not present (or were very weak) in smaller versions.
A capability that appears in large language models above a certain size threshold, even though it was not explicitly trained for.
The first half of the seq2seq architecture.
The vector produced by the bidirectional encoder for source position i.
The Transformer's two-part structure for seq2seq tasks (e.g., translation).
A philosophy, popularised by this paper, where a single neural network
A term in RL that encourages exploration by rewarding policy entropy (randomness).
One complete pass through all training examples.
One of n specialised feed-forward networks in an MoE layer.
A training failure mode where the gating network routes most tokens to a small number of popular experts, leaving the rest undertrained and effectively unused.
The opposite failure mode: the gradient grows without bound when
The fundamental challenge in search and learning: should you exploit what you've learned (focus on high-reward nodes) or explore new options (try under-explored nodes)?
The power in a power law.
Extending a fitted line to predict values beyond the observed range.
A 2016 extension of Word2Vec (by Bojanowski et al.) that represents
A two-layer MLP applied position-wise after attention: `FFN(x) = max(0, xW₁ + b₁)W₂ + b₂`.
The ability to perform a task from a small number of examples — typically tens to hundreds.
A technique where a language model is shown a small number of examples (typically 2-8) before being asked to solve a new problem.
Adapting a pre-trained model to a specific task by continuing training on labelled task data with a small learning rate.
A sigmoid-valued vector that decides which slots of the previous cell
The computation that flows from input to output through the network: input → layer 1 → layer 2 → ...
A large model pre-trained on broad data that can be adapted to many downstream tasks.
A simpler variant of the LSTM proposed by Cho et al.
The learned routing function `G(x) = Softmax(TopK(H(x), k))`.
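A plain-Python sketch of this gate follows. It takes the router logits `H(x)` as an already-computed list (an assumption; in a real layer they come from a learned projection) and returns one weight per expert.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other experts get weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

print(top_k_gate([1.0, 3.0, 2.0, 0.0], k=2))
```

Dropping the non-selected experts before the softmax is equivalent to setting their logits to −∞, since exp(−∞) = 0.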
Gaussian Error Linear Unit: GELU(x) = x · Φ(x), where Φ is the Gaussian CDF.
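The formula can be evaluated exactly with the standard library's error function, using Φ(x) = ½(1 + erf(x/√2)). Many frameworks also offer a tanh-based approximation; the sketch below uses the exact form.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, 0.0, 1.0, 3.0):
    print(x, gelu(x))
```

For large positive inputs the function approaches x, and for large negative inputs it approaches 0.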
The smallest variant of Gemini (1.8B and 3.25B parameters for Nano-1 and Nano-2), optimized for on-device inference on mobile phones and edge devices.
The balanced variant (~50B parameters estimated) of Gemini, deployed for most production use (Google Bard, Workspace, Search).
The largest variant (~1.3T parameters estimated) of Gemini, achieving the highest benchmarks (90.04% MMLU) but requiring significant compute.
The ability of a model to perform well on new, unseen data (test set).
Whether a trained reward model (or policy) performs well on new, unseen tasks or domains.
### Outcome Reward Model (ORM)
A 2014 alternative to Word2Vec, from Stanford.
General Language Understanding Evaluation.
A 1931 result by mathematician Kurt Gödel: in any formal mathematical system powerful enough to describe arithmetic, there are true statements that cannot be proved within that system.
The vector of partial derivatives of the loss with respect to every
A memory optimisation technique where intermediate activations are discarded during forward pass and recomputed during backward pass.
The optimisation algorithm that trains neural networks.
The simplest decoder strategy.
A variant of Multi-Head Attention where multiple query heads share the same key-value head.
A benchmark dataset of 8,500 grade-school math word problems, ranging from simple arithmetic to multi-step reasoning.
The phenomenon where a language model generates plausible-sounding but factually incorrect or entirely made-up information.
In older statistical machine translation, each target word was explicitly assigned to exactly one source word.
An algorithm designed with GPU memory hierarchy in mind.
A principle in many AI constitutions stating that the AI should avoid providing information, advice, or assistance that could lead to physical, financial, psychological, or emotional harm.
High-capacity GPU memory (e.g., 80GB on an H100), with lower bandwidth than SRAM.
One of h = 8 parallel attention computations in multi-head attention, each operating in a lower-dimensional subspace (dₖ = d_model / h).
The biological learning rule proposed by psychologist Donald Hebb in 1949: "neurons that fire together, wire together." When two neurons are active simultaneously, the connection between them strengthens.
The alignment criteria used to train InstructGPT: helpful (answers user queries well), harmless (doesn't enable or encourage harmful acts), honest (doesn't hallucinate or mislead).
A principle stating that the AI should genuinely assist the human in achieving their goals.
A layer of neurons between the input layer and the output layer.
The LSTM's "spoken" output at each step.
A principle stating that the AI should be truthful and not deliberately mislead the human.
Labels provided by humans comparing two AI outputs and indicating which one is better.
Judgments by human raters about which model outputs are better.
Measure of how often different human raters agree on which output is better.
Performing a task by providing examples in the prompt, without updating model weights.
The ability to retrieve specific facts from long context.
The process of running a trained model on new inputs to generate predictions.
The time to generate a single token during inference.
Inference: generating tokens one-by-one (autoregressive).
The broader principle of improving model performance by allocating more compute at inference time, rather than only at training time.
A high-speed network fabric used in data centres (200+ Gb/s per port).
A sigmoid-valued vector that decides how much of the candidate vector
A matrix or function that projects the input u_t into the state space.
GPT-1's technique for reformatting any NLP task's input as a flat token sequence wrapped in special tokens, allowing the unmodified pre-trained model to handle diverse task formats.
Teaching language models to follow user instructions accurately and safely.
The ability of a language model to accurately follow user instructions and respond helpfully.
One complete cycle of: MCTS search → solution verification → data collection → model training.
The number of experts selected per token.
The "advertisement" projection of each position.
A regularization term in the RL objective that constrains the policy to stay close to the SFT baseline: β · KL[π_RL || π_SFT].
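As a toy illustration, the penalty can be computed exactly for a small discrete distribution. In practice the KL term is usually estimated from sampled token log-probabilities rather than summed over a vocabulary; the function names below are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL[p || q] for two discrete distributions over the same tokens (q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalised_reward(reward, pi_rl, pi_sft, beta):
    """Reward-model score minus beta * KL[pi_RL || pi_SFT]."""
    return reward - beta * kl_divergence(pi_rl, pi_sft)

pi_sft = [0.5, 0.3, 0.2]
pi_rl = [0.8, 0.1, 0.1]
print(penalised_reward(1.0, pi_rl, pi_sft, beta=0.1))
```

The further the RL policy drifts from the SFT baseline, the larger the KL term and the more reward is subtracted.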
The memory buffer storing Key and Value vectors from all previous tokens during autoregressive (token-by-token) generation.
A segment of the key and value matrices corresponding to a subset of the sequence.
A probability distribution over sequences of tokens.
A neural network trained to predict the next token in a sequence, using next-token prediction as the training objective.
The time taken to produce a response.
Making communication latency disappear by overlapping it with computation.
Applied after each sub-layer.
A small positive number (e.g.
A recurrence relation of the form x_t = Āx_{t-1} + B̄u_t where each state x_t depends only on the previous state x_{t-1} and the current input u_t, not on all past history.
A statistical method to fit a straight line through data points.
A property of a dataset: two classes are linearly separable if you can draw a straight line (in 2D) or a flat hyperplane (in higher dimensions) that perfectly separates all examples of one class from all examples of the other.
The goal of ensuring that all n experts receive roughly equal numbers of tokens over training.
A graph where both axes are logarithmic.
A raw, unnormalised score before softmax.
Raw, unnormalised scores output by a neural network before applying softmax.
The full name of the paper.
A mathematical function that measures how wrong the network's prediction is.
The RNN variant used by both the encoder and decoder in this paper.
BERT's primary pre-training objective.
A dataset of 12,500 competition-level math problems from AMC (American Mathematics Competitions) and AIME (American Invitational Mathematics Examination).
With P GPUs using Ring Attention, per-GPU memory is O((n/P) × d), so for a fixed per-GPU memory budget the maximum sequence length n scales linearly with the number of GPUs.
A layer type where multiple expert networks are available, and a router learns which expert(s) to use for each input.
A benchmark of 57 diverse academic subjects (history, law, science, medicine) with 14,042 multiple-choice questions.
The observation that CoT prompting's effectiveness depends critically on model size.
The number of parameters in a neural network.
A drop-in replacement for the FFN sub-layer in a Transformer.
A search algorithm that explores a decision tree by: (1) selecting promising nodes using UCB, (2) expanding the tree with new candidate moves, (3) running rollouts to simulate outcomes, (4) backing up the results to update node statistics.
`Concat(head₁, ..., headₕ) · W^O`.
An attention variant where all query heads share a single key-value head.
Capable of processing and reasoning over multiple modalities (text, images, audio, video) simultaneously.
Total number of expert networks in one MoE layer.
Training a single model jointly on multiple modalities from the start, rather than training text-first and bolting on vision later.
Human language as it is actually spoken and written — English, Hindi, Tamil, etc.
The training trick that made Word2Vec fast.
The umbrella term for translation systems built entirely from neural
BERT's second pre-training objective.
The pre-training objective: given all previous tokens, predict the probability distribution over the next token.
The specific gating formulation from the 2017 paper: raw logits have Gaussian noise added before top-k selection during training.
A generation strategy where you sample only from the smallest set of highest-probability tokens whose cumulative probability reaches a threshold (e.g., top_p=0.9 means keep adding tokens, most likely first, until their probabilities sum to at least 90%).
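A minimal sketch of the filtering step, assuming `probs` is an already-normalised list of token probabilities. The function name `nucleus_filter` is hypothetical, and the sampling step itself is left out.

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Map token index -> renormalised probability among the kept tokens.
    return {i: probs[i] / mass for i in kept}

print(nucleus_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

Here the three most likely tokens are kept (0.5 + 0.3 + 0.15 ≥ 0.9) and the 0.05 tail token is never sampled.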
Ensuring computed values don't overflow, underflow, or lose precision.
NVIDIA's high-speed GPU interconnect, offering hundreds of GB/s of aggregate bandwidth per GPU (e.g., 900 GB/s on an H100).
Off-policy: Learning from data generated by other policies (e.g., supervised data).
A vector of length V with a single 1 and the rest zeros.
Performing a task with exactly one example in the prompt.
An incremental softmax computation (using logsumexp trick) that maintains running statistics (max, sum of exponentials) as you process blocks.
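The running-statistics idea can be sketched as follows. The function assumes each block is a non-empty list of scores and returns the pair `(max, sum of exponentials)`; the name `online_softmax_stats` is illustrative.

```python
import math

def online_softmax_stats(blocks):
    """Process score blocks one at a time, keeping a running max m and sum s of exp(score - m)."""
    m, s = float("-inf"), 0.0
    for block in blocks:
        new_m = max(m, max(block))
        # Rescale the old sum to the new max before adding this block's terms.
        s = s * math.exp(m - new_m) + sum(math.exp(x - new_m) for x in block)
        m = new_m
    return m, s

m, s = online_softmax_stats([[1.0, 2.0], [3.0, 0.5]])
# The softmax weight of any score x is then exp(x - m) / s.
print(math.exp(3.0 - m) / s)
```

The result matches a softmax computed over all scores at once, but only one block needs to be in fast memory at a time.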
For a given compute budget C, the best way to split resources between model size (N) and data size (D) to minimize loss.
A model that scores only the final output (right or wrong), without evaluating intermediate steps.
A sigmoid-valued vector that decides which slots of the cell state
A matrix or function that projects the hidden state x_t back to the output space.
Training error decreases, but test error increases.
A hardware-friendly algorithm (e.g., Blelloch scan) that computes a recurrence x_t = f(x_{t-1}, u_t) in parallel by decomposing it into a tree of subproblems, which works when the per-step updates can be combined associatively.
The learnable weights in a neural network model.
A metric that evaluates whether at least one out of K generated solutions is correct.
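When n ≥ K samples are drawn per problem and c of them are correct, a commonly used unbiased estimator is 1 − C(n−c, K)/C(n, K). The sketch below assumes that estimator; the source does not say which computation it uses.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that a random size-k subset of n samples (c correct) contains a correct one."""
    if n - c < k:
        return 1.0  # every size-k subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=5, k=1))  # 0.5
```

Averaging this value over all problems in a benchmark gives the reported pass@K score.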
A small rectangular region of an image, typically 14×14 or 16×16 pixels.
An extension to LSTMs (Gers & Schmidhuber, 2000) in which the gates
The artificial neuron Rosenblatt described: takes several inputs, multiplies each by a weight, sums the results, and outputs 1 if the sum exceeds a threshold, 0 otherwise.
A metric for language models derived from cross-entropy loss.
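Concretely, perplexity is the exponential of the average per-token cross-entropy. The sketch below assumes `token_probs` holds the probability the model assigned to each correct token, using natural logarithms.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the correct token probability 1/4 is as
# uncertain as a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower is better: a perplexity of 1 means the model assigned probability 1 to every correct token.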
A core data structure in pre-2014 statistical translation.
RL algorithms that improve a policy (probability distribution) by taking gradient steps that increase expected reward.
The language model being trained and improved across rounds.
The property of a word having multiple meanings.
A fixed vector added to each input embedding to inject position information.
A mathematical relationship where one variable is proportional to another raised to a power: y = a * x^b.
A stable reinforcement learning algorithm used in the RL stage.
Training a model on large-scale, typically unlabelled data before fine-tuning.
Word vectors trained on a large corpus by someone else and then
A machine-learning model trained to evaluate the quality of individual steps in a multi-step reasoning process.
Solving problems by writing Python code instead of natural language reasoning.
The practice of carefully designing the text prompt to get better outputs from a language model.
The specific structure and wording of a prompt.
The process of executing Python code to check if a solution is correct.
The cognitive process of chaining ideas together across multiple steps to arrive at a conclusion.
In deep networks, the range of input positions that influence a given output position.
During token-by-token generation, apply the recurrence directly: x_t = Āx_{t-1} + B̄u_t.
A neural network that processes inputs one step at a time, feeding its
A three-stage training pipeline for aligning language models: (1) Supervised Fine-Tuning on human demonstrations, (2) training a Reward Model on human preference comparisons, (3) using Reinforcement Learning (PPO) to optimize the policy against the reward model.
A data generation strategy: generate many candidate solutions, keep only the correct ones, discard the rest.
The broader idea — Word2Vec is an early instance — that useful
Adding the sub-layer's input directly to its output: `x + SubLayer(x)`.
Sutskever's empirical hack: feed the source sentence to the encoder in
A prompt asking an AI to rewrite its own output to address a critique.
In MCTS, the function that assigns a reward to a rollout outcome.
When the RL policy finds ways to get high reward scores without actually being helpful.
A neural network trained in the second stage of RLHF to predict which of two responses humans prefer.
A distributed attention algorithm where P GPUs are arranged in a ring topology.
An arrangement of P GPUs in a logical circle where GPU i communicates with GPU i-1 (receives data) and GPU i+1 (sends data), with indices taken modulo P so that the last GPU wraps around to the first.
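The neighbour relationship reduces to modular arithmetic. The helper name `ring_neighbours` is hypothetical, and GPU indices are assumed to run from 0 to P−1.

```python
def ring_neighbours(i, P):
    """Return (receive_from, send_to) for GPU i in a ring of P GPUs."""
    return (i - 1) % P, (i + 1) % P

for gpu in range(4):
    print(gpu, ring_neighbours(gpu, 4))
```

After P−1 send/receive steps, a block that started on one GPU has visited every other GPU in the ring.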
The second stage of Constitutional AI.
The stage of Constitutional AI where an AI (rather than a human) provides feedback on which response better follows the constitution.
Robustly Optimized BERT Pretraining Approach.
In MCTS, a simulation of completing a partial solution to a full solution.
A method of encoding token position information by rotating query and key vectors.
A hyperparameter in language model decoding that controls randomness.
`Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V`.
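A dependency-free sketch of this formula is shown below, with `Q`, `K`, and `V` given as lists of row vectors. It is written for readability rather than speed, and it omits masking and batching.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(attention([[0.0, 0.0]], K, V))  # equal scores, so a uniform blend of the values
```

Dividing by √dₖ keeps the scores from growing with the vector dimension, which would otherwise push the softmax towards a one-hot output.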
Increasing the size of neural networks (more parameters, more data, more compute).
The observation that larger models have qualitatively different capabilities (reasoning, instruction-following) that smaller models lack.
One of three embeddings summed to form each token's input representation.
An SSM where the input projection (B), output projection (C), and step size (Δ) are functions of the input u_t, not fixed constants.
Attention where Q, K, and V all come from the same sequence.
A technique that improves chain-of-thought reasoning by sampling multiple independent reasoning chains from the same prompt and taking a majority vote on the final answer.
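The voting step is simple once the final answers have been extracted from each sampled chain. The sketch below assumes that extraction has already happened and that answers are comparable as strings.

```python
from collections import Counter

def majority_vote(final_answers):
    """Return the most common final answer among sampled reasoning chains."""
    return Counter(final_answers).most_common(1)[0][0]

# Five sampled chains, three of which agree.
print(majority_vote(["42", "41", "42", "42", "7"]))  # 42
```

Chains that reach the same answer by different routes reinforce each other, while isolated arithmetic slips are outvoted.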
The process of an AI model reading its own output and identifying whether it violates constitutional principles.
A bootstrapping process where: (1) a model generates candidate solutions using search, (2) solutions are verified automatically, (3) correct, high-quality solutions become training data, (4) the model is trained on this data, improving for
A subword tokenizer that converts text into tokens using a learned vocabulary.
The encoder-decoder architecture from Paper 06 (Sutskever et al., 2014).
Parallelising the sequence dimension of tensors.
A strategy where you iteratively refine a solution, using feedback from one attempt to improve the next.
The activation function σ(z) = 1/(1+e⁻ᶻ).
A function that squashes any real number into the interval (0, 1).
The other Word2Vec training task, and the one people usually mean
The first stage of Constitutional AI.
An attention variant where each token attends only to the last W tokens (a sliding window), not all previous tokens.
On a log-log plot, the slope of a line is the exponent of the power law.
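This is why a power law y = a · x^b can be fitted with ordinary linear regression on the logarithms: log y = log a + b · log x. A minimal sketch, assuming all values are positive and using a hypothetical helper name:

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (a, b) for y = a * x**b."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((p - mx) * (q - my) for p, q in zip(lx, ly)) / sum((p - mx) ** 2 for p in lx)
    a = math.exp(my - b * mx)
    return a, b

xs = [1.0, 10.0, 100.0, 1000.0]
ys = [3.0 * x ** -0.5 for x in xs]  # synthetic data, not measured results
print(fit_power_law(xs, ys))
```

On this synthetic data the fit recovers a ≈ 3 and b ≈ −0.5, the slope of the line on the log-log plot.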
Attention's approach: each target word is generated using a *weighted blend* of multiple source words, not a hard assignment to one.
A function that turns a vector of raw scores into a probability
Smooth approximation of ReLU: softplus(x) = log(1 + e^x).
The opposite of dense: only a fraction of parameters are active for any given input.
The problem where an AI system finds a way to satisfy the letter of a specification while violating its spirit.
Stanford Question Answering Dataset.
Tiny, ultra-fast on-chip GPU memory (e.g., 192KB per streaming multiprocessor on an A100).
A continuous or discrete linear dynamical system.
An n×n matrix governing how the hidden state x evolves over time.
The dominant pre-2014 translation approach.
A scalar (or per-head scalar) that controls the discretisation rate.
When one GPU is slower than others (older hardware, thermal throttling, interference), it becomes the bottleneck.
A prior SSM architecture (Gu et al., 2021) that imposes structure on the A matrix (e.g., diagonal, plus rank-1 update) for efficiency.
A Word2Vec training trick where very common words (like "the", "of",
The first stage of RLHF.
Google's 2021 simplification of MoE: k=1 routing (route each token to exactly one expert, no blending).
When a model agrees with users even when the user is wrong, in order to be pleasing.
A point where all P GPUs pause and wait for the slowest GPU to finish.
A function that squashes any real number into the interval (−1, +1).
A training technique.
A publication style (unlike peer-reviewed research papers) that allows companies to present results without the formal review process.
A hyperparameter controlling randomness in generation.
Spending additional computation at inference time (rather than training time) to improve performance.
Another name for the **context vector** — Hinton's evocative label for
The minimum weighted sum required for the Perceptron to output 1.
The number of queries a system can handle per unit time.
A unit of text, roughly a word or subword.
The total number of tokens (words or subwords) available for generating a solution.
When an expert receives more tokens than its capacity allows, excess tokens skip the MoE layer and pass through the residual connection unchanged.
The index of a token in the sequence (0 to n-1).
The process of converting input (text, images, audio) into discrete tokens.
The operation that keeps the k largest values in a vector and sets all others to −∞.
The text corpus used to train a language model.
Computation used to train the model initially.
The reuse of knowledge (model weights) learned on one task/dataset for a different but related task.
A neural network architecture based on self-attention, introduced in "Attention Is All You Need" (2017).
The architecture used in GPT models: a stack of self-attention and feedforward layers that process tokens left-to-right (causally).
A key benefit of Constitutional AI: the principles are written in human-readable natural language, making the intended values explicit and auditable.
An abstract mathematical machine Turing described in 1936 — not a real physical device, but a thought experiment.
The test proposed by Turing: a machine passes if a human interrogator, communicating only by typed text, cannot reliably distinguish it from a human.
When a language model generates intermediate reasoning steps that sound logical and plausible but don't actually reflect how the model arrived at its answer.
A formula that balances exploitation (choosing nodes with high average reward) and exploration (trying under-explored nodes).
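One widely used form is UCB1: average reward plus c · √(ln N / n), where N is the parent's visit count and n the node's. The sketch below assumes that form and a default c = √2; the source does not state which variant or constant it uses.

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Average reward plus an exploration bonus that shrinks as the node is visited more."""
    if visits == 0:
        return float("inf")  # unvisited nodes are tried first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Same average reward (0.5), but the rarely visited node scores higher.
print(ucb1(1.0, 2, 100), ucb1(50.0, 100, 100))
```

A larger c favours exploration; a smaller c favours exploiting nodes that already look good.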
The "what I send when selected" projection of each position.
In RL, an estimate of expected future reward used to reduce gradient variance.
The process of encoding organizational or societal values into an AI system.
The problem where gradients shrink toward zero as they propagate backwards through many layers (especially through sigmoid activations).
The phenomenon where, during BPTT on a long sequence, the gradient
In the context of scaling laws, the spread of loss values across multiple runs or models.
A component that evaluates whether a proposed solution is correct.
A Transformer applied to images by dividing them into patches and treating patches as tokens.
The set of words the model knows about.
A number attached to an input connection in a neural network.
The values given to weights before training begins.
Using the same weight matrix for both the token input embedding and the output projection (Wₑ on the way in, Wₑᵀ on the way out).
How many words on each side of the target count as "context" in
An evaluation task of the form "A is to B as C is to ___".
Another name for a word embedding — a dense, low-dimensional vector
Collective name for the two 2013 papers (Mikolov et al.) and the
BERT's subword tokenisation algorithm.
A special token prepended to every BERT input.
Special token inserted between input segments (premise and hypothesis, question and answer) during fine-tuning.
Special token appended at the end of the input sequence during fine-tuning.
The special token used to replace selected tokens during MLM pre-training.
A separator token appended after each sentence in BERT's input.
Special token prepended to every input sequence during fine-tuning.