AI Dictionary — Ainiketan

Accuracy / Performance Metric

In the paper, accuracy is the percentage of problems solved correctly out of a total.

Paper 14

Activation Function

A non-linear function applied to a neuron's weighted sum before passing the result to the next layer.

Paper 03

Add & Norm (Residual + Layer Norm)

The wrapping applied around every sub-layer: `output = LayerNorm(x + SubLayer(x))`.

Paper 08
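
As a rough illustration of the formula above, here is a minimal sketch in plain Python. It assumes a layer norm without the learned scale and shift parameters, and the doubling sub-layer is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer):
    """output = LayerNorm(x + SubLayer(x))"""
    residual = [a + b for a, b in zip(x, sublayer(x))]  # residual connection
    return layer_norm(residual)

# Hypothetical sub-layer that doubles each element.
out = add_and_norm([1.0, 2.0, 3.0], lambda v: [2.0 * a for a in v])
```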

Advantage / Advantage Estimation

In policy gradient RL, the difference between actual return and baseline: A = reward - V(prompt).

Paper 15

AI Winter

A period of reduced funding and interest in AI research, typically following a wave of over-promising and under-delivering.

Paper 02

AIME (American Invitational Mathematics Examination)

A competition mathematics exam (15 problems, 3 hours).

Paper 24

ALBERT

A BERT variant that reduces parameters by factorising the embedding matrix and sharing weights across Transformer layers.

Paper 11

Alignment

The process of training an AI system to behave in ways that align with human values, intentions, and safety constraints.

Paper 22

Alignment / Aligning Language Models

Making language models behave in accordance with human values and preferences.

Paper 15

Alignment matrix (attention heatmap)

A grid where each row corresponds to a target word and each column to a source word.

Paper 07

Alignment model

The small neural network inside the attention mechanism that scores how well a decoder state matches each encoder hidden state.

Paper 07

All-to-All Communication

A communication pattern where every GPU sends data to every other GPU.

Paper 19

AMC (American Mathematics Competitions)

A sequence of mathematics competitions for students (AMC 8, 10, 12).

Paper 24

Annotator Burnout

Psychological harm experienced by humans who repeatedly review harmful content (violence, abuse, self-harm).

Paper 22

Apache 2.0 License

A permissive open-source license that allows commercial use, modification, and redistribution, subject to attribution and license-notice requirements.

Paper 18

Attention

A mechanism introduced the same year as seq2seq (Bahdanau et al., 2014)

Paper 06

Attention Complexity

The computational cost of attention, typically measured in FLOPs (floating point operations).

Paper 18

Attention Mechanism

The key innovation in Transformers that allows each token to consider the relevance of all other tokens in the sequence.

3 papers

Attention weight (αₜᵢ)

The probability-like number, between 0 and 1, representing how much the decoder at decoding step t focuses on source position i.

2 papers

Autoregressive generation

The decoder's mode of operation: generate one token at a time, feed the generated token back as input, generate the next.

Paper 08

Autoregressive language model

A model that generates a sequence by predicting one token at a time, conditioning each prediction on all previously generated tokens.

Paper 10

Autoregressive Language Modeling

A training objective where the model learns to predict the next token given all previous tokens.

Paper 12

Auxiliary balancing loss (L_balance)

An additional loss term added to the main cross-entropy language modelling loss during MoE training.

Paper 09

Back-propagation (MCTS)

The process of updating node statistics (visit counts, accumulated rewards) as you trace back from a leaf node to the root after a rollout.

Paper 24

Backpropagation

Short for "backward propagation of errors." The algorithm that computes the gradient of the loss function with respect to every weight in a multi-layer network, by applying the chain rule backwards from the output layer to the input layer.

Paper 03

Backpropagation Through Time (BPTT)

The standard way of training RNNs and LSTMs.

Paper 04

Batch Size

The number of training examples (or sequences) processed in a single gradient update.

Paper 13

Beam search

An inference-time decoding algorithm.

2 papers

Benchmark

Standardised tests for evaluating language model quality (e.g., MMLU for general reasoning, GSM8k for math, HumanEval for coding).

Paper 18

Best-of-N (BoN)

A strategy where you generate N independent solutions to the same problem and select the best one according to some criterion (e.g., a Process Reward Model score).

Paper 23

Bidirectional context

A representation built from both left and right context simultaneously.

Paper 11

Bidirectional LSTM

An LSTM that processes a sequence in both directions — forward and

Paper 04

Bidirectional RNN (BiRNN)

Two recurrent networks — one reading left to right, one right to left — whose hidden states are concatenated at each position.

Paper 07

Binary Classification

The task of deciding whether an input belongs to one of two categories — yes or no, 0 or 1, cat or dog.

Paper 02

BLEU score

Bilingual Evaluation Understudy.

2 papers

Blockwise Attention

Computing attention in blocks (query chunk × KV chunk) rather than all-at-once.

Paper 19

BooksCorpus

The training dataset for GPT-1: approximately 7,000 unpublished novels scraped from the web, totalling ~800 million words.

Paper 10

Bootstrapping

A process where improvement in one component (the model) enables improvement in another (the data), which feeds back to improve the first component further.

Paper 24

Bottleneck (context vector bottleneck)

The design flaw in plain seq2seq (Paper 06): the entire source sentence's meaning must be compressed into a single fixed-size vector before decoding can begin.

Paper 07

BPE (Byte Pair Encoding)

A subword tokenisation algorithm that splits words into common subunits.

Paper 10

Bradley-Terry Model

A probabilistic ranking model from statistics, used here to model human preferences.

2 papers

Candidate vector (c̃ₜ)

A vector of proposed updates to the cell state, produced by a tanh

Paper 04

Capacity factor

A multiplier that sets the maximum number of tokens each expert can process per batch: `capacity = (batch_tokens / n_experts) × capacity_factor`.

Paper 09
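
The capacity formula can be sketched as a small helper. Rounding up to a whole number of tokens is an assumption made here for illustration; the entry itself does not say how fractional capacities are handled.

```python
import math

def expert_capacity(batch_tokens, n_experts, capacity_factor):
    """capacity = (batch_tokens / n_experts) * capacity_factor, rounded up (assumed)."""
    return math.ceil(batch_tokens / n_experts * capacity_factor)

# With 1024 tokens and 8 experts, a factor of 1.25 leaves 25% headroom per expert.
cap = expert_capacity(1024, 8, 1.25)
```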

Catastrophic Forgetting

When a model loses knowledge from pretraining while being fine-tuned on new data.

Paper 15

Causal (masked) self-attention

Self-attention where position i is prevented from attending to positions j > i (future tokens).

2 papers

Causal Language Modeling

Same as autoregressive language modeling: predict the next token given previous tokens.

Paper 12

Causal mask (autoregressive mask)

A (T × T) upper-triangular boolean mask applied in the decoder's self-attention.

Paper 08
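
A minimal sketch of building such a mask with nested lists. It assumes the convention that `True` marks a blocked position (j > i); some libraries use the opposite convention.

```python
def causal_mask(T):
    """mask[i][j] is True where position i must NOT attend to position j (j > i)."""
    return [[j > i for j in range(T)] for i in range(T)]

mask = causal_mask(3)
# Row 0 blocks positions 1 and 2; the last row blocks nothing.
```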

Causal Masking

In autoregressive language modelling, preventing the model from attending to future tokens (tokens that come after the current position).

2 papers

CBOW (Continuous Bag of Words)

One of the two Word2Vec training tasks.

Paper 05

Cell state (c)

The "notebook" of an LSTM — a vector that flows from one time step to

Paper 04

Chain rule

A rule in calculus for differentiating composed functions.

Paper 04

Chain-of-Thought (CoT) Prompting

A prompting technique where intermediate reasoning steps are shown in few-shot examples, causing language models to generate their own step-by-step reasoning before producing a final answer.

2 papers

Chinchilla Ratio

An improvement on the compute-optimal frontier (from DeepMind's Chinchilla paper, 2022).

Paper 13

Chinese Room

A thought experiment proposed by philosopher John Searle in 1980 as a critique of the Turing Test.

Paper 01

Cloze test

A reading comprehension exercise invented in 1953 where words are systematically removed from a passage and the reader must fill them in.

Paper 11

Combined loss (L₃)

The total fine-tuning loss: L₃ = L_task + λ · L_language_model.

Paper 10

Communication Complexity

The amount of data that must be transferred between GPUs.

Paper 19

Computability

A mathematical property of problems: a problem is "computable" if it can be solved by a Turing Machine (i.e.

Paper 01

Compute Budget (C)

The total computational resources available for training, measured in FLOPs (floating point operations).

Paper 13

Compute-Communication Overlap

The simultaneous execution of computation and communication.

Paper 19

Compute-Optimal

Achieving the best accuracy for a given computational budget.

Paper 24

Compute-Optimal Frontier

The boundary of efficient training allocations: the curve of (N, D) pairs that minimize loss for a given compute budget C.

Paper 13

Compute-Optimal Strategy

The choice of which inference-time strategy (Best-of-N vs.

Paper 23

Connectionism

The school of thought in AI that believes intelligence emerges from the interactions of many simple connected units (like neurons), rather than from explicit symbolic rules.

Paper 02

Consciousness

The subjective experience of being aware, of having an inner life.

Paper 01

Constant error carousel

The original paper's name for the additive structure of the cell state.

Paper 04

Constitution

A written document specifying principles that an AI should follow.

Paper 22

Constitutional AI (CAI)

An alignment methodology that replaces human feedback with AI feedback.

Paper 22

Context Length

The maximum number of tokens a model can process in a single input.

Paper 18

Context Parallelism

Distributing a long sequence across multiple GPUs along the sequence dimension (distinct from data, tensor, or pipeline parallelism).

Paper 19

Context vector

Also called the **thought vector**.

2 papers

Context window

The maximum number of tokens the model can see at once.

2 papers

Convergence

In machine learning, a model has converged when its weights have settled and further training produces no improvement.

Paper 02

Convolutional Mode (Training)

During training, the recurrence x_t = Āx_{t-1} + B̄u_t can be unrolled and rearranged as a convolution: output = conv(input, kernel).

Paper 21
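
To make the equivalence concrete, the sketch below compares the two modes for a scalar state. It assumes a zero initial state and tracks the state x_t only (no output projection); the function names are illustrative.

```python
def recurrent(a_bar, b_bar, u):
    """Step-by-step recurrence x_t = a_bar * x_{t-1} + b_bar * u_t, starting from 0."""
    x, states = 0.0, []
    for u_t in u:
        x = a_bar * x + b_bar * u_t
        states.append(x)
    return states

def convolutional(a_bar, b_bar, u):
    """Same states via a causal convolution with kernel[k] = a_bar**k * b_bar."""
    kernel = [a_bar ** k * b_bar for k in range(len(u))]
    return [sum(kernel[k] * u[t - k] for k in range(t + 1)) for t in range(len(u))]

u = [1.0, 2.0, 3.0]
rec = recurrent(0.5, 1.0, u)
conv = convolutional(0.5, 1.0, u)
```

Both lists hold the same values; the convolutional form can be computed for all positions at once, which is what makes it attractive for training.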

Corpus

A body of text used to train a model.

Paper 05

Cosine similarity

A measure of similarity between two vectors based on the angle between

Paper 05
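
A minimal sketch of the usual formula, the dot product divided by the product of the vector lengths, in plain Python:

```python
import math

def cosine_similarity(a, b):
    """cos(angle) = (a . b) / (|a| * |b|); ranges from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score close to 1, perpendicular vectors score 0, and opposite vectors score -1, regardless of their lengths.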

Critique Prompt

A prompt asking an AI model to evaluate its own or another model's output against a specific principle.

Paper 22

Cross-attention

Attention where the query comes from one sequence (the decoder) and the keys and values come from another sequence (the encoder).

2 papers

Cross-Entropy Loss

The standard loss function for language modeling.

2 papers

d_model (model dimension)

The dimension of all input and output vectors in the Transformer.

Paper 08

Data Annotation / Labeling

The process of having humans provide labels (e.g., preference comparisons) for training data.

Paper 15

Dataset Size (D)

The number of training tokens (or training examples).

Paper 13

Decoder

The second half of the seq2seq architecture.

2 papers

Decoder-only Transformer

A Transformer that uses only the decoder stack (masked self-attention + feed-forward), without the encoder-decoder cross-attention.

2 papers

Decomposition / Reasoning Decomposition

Breaking a complex problem into smaller, simpler subproblems and solving each one before combining results.

Paper 14

Dense computation

The standard approach in neural networks: every parameter fires for every input.

Paper 09

Discretisation

The process of converting a continuous-time differential equation (dx/dt = Ax + Bu) into a discrete recurrence (x_t = Āx_{t-1} + B̄u_t).

Paper 21

DistilBERT

A compressed version of BERT created by knowledge distillation (training a small model to mimic the outputs of a larger one).

Paper 11

Distributional hypothesis

The linguistic claim (often attributed to Firth, 1957) that words

Paper 05

Distributional Shift

When the RL policy generates responses very different from the distribution the reward model was trained on.

Paper 15

dₖ (key dimension)

The dimension of the Query and Key vectors.

Paper 08

Dualism

The philosophical position, associated with René Descartes, that mind and matter are fundamentally different kinds of thing.

Paper 01

Efficient Attention

Modified attention patterns (sliding window, global, sparse) that reduce computation from O(n²) to O(n log n) or O(n), making long sequences tractable.

Paper 20

Eigenvalue (of State Matrix A)

A scalar λ such that Av = λv for some eigenvector v.

Paper 21

Element-wise product (Hadamard product, ⊙)

Multiplication of two vectors of equal length, slot by slot.

Paper 04

ELIZA

A chatbot created in 1966 by Joseph Weizenbaum at MIT.

Paper 01

Embedding

A dense, low-dimensional vector representation of something — a word,

3 papers

Embedding dimension (d)

The length of each word vector.

Paper 05

Embedding matrix (W)

The (V × d) matrix whose rows are the embeddings for each vocabulary

Paper 05

Emergent Abilities

Capabilities that appear when a language model reaches a certain scale, but were not present (or were very weak) in smaller versions.

Paper 12

Emergent Capability

A capability that appears in large language models above a certain size threshold, even though it was not explicitly trained for.

Paper 14

Encoder

The first half of the seq2seq architecture.

2 papers

Encoder hidden state (hᵢ)

The vector produced by the bidirectional encoder for source position i.

Paper 07

Encoder-decoder

The Transformer's two-part structure for seq2seq tasks (e.g., translation).

Paper 08

End-to-end learning

A philosophy, popularised by this paper, where a single neural network

Paper 06

Entropy Regularization

A term in RL that encourages exploration by rewarding policy entropy (randomness).

Paper 15

Epoch

One complete pass through all training examples.

2 papers

Expert

One of n specialised feed-forward networks in an MoE layer.

Paper 09

Expert collapse

A training failure mode where the gating network routes most tokens to a small number of popular experts, leaving the rest undertrained and effectively unused.

Paper 09

Exploding gradient problem

The opposite failure mode: the gradient grows without bound when

Paper 04

Exploration-Exploitation Trade-off

The fundamental challenge in search and learning: should you exploit what you've learned (focus on high-reward nodes) or explore new options (try under-explored nodes)?

Paper 24

Exponent (Alpha, Beta, Gamma)

The power in a power law.

Paper 13

Extrapolation

Extending a fitted line to predict values beyond the observed range.

Paper 13

fastText

A 2016 extension of Word2Vec (by Bojanowski et al.) that represents

Paper 05

Feed-forward network (FFN)

A two-layer MLP applied position-wise after attention: `FFN(x) = max(0, xW₁ + b₁)W₂ + b₂`.

Paper 08

Few-shot learning

The ability to perform a task from a small number of examples — typically tens to hundreds.

2 papers

Few-Shot Prompting

A technique where a language model is shown a small number of examples (typically 2-8) before being asked to solve a new problem.

Paper 14

Fine-tuning

Adapting a pre-trained model to a specific task by continuing training on labelled task data with a small learning rate.

4 papers

Forget gate (fₜ)

A sigmoid-valued vector that decides which slots of the previous cell

Paper 04

Forward Pass

The computation that flows from input to output through the network: input → layer 1 → layer 2 → ...

Paper 03

Foundation model

A large model pre-trained on broad data that can be adapted to many downstream tasks.

Paper 10

Gated Recurrent Unit (GRU)

A simpler variant of the LSTM proposed by Cho et al.

Paper 04

Gating network

The learned routing function `G(x) = Softmax(TopK(H(x), k))`.

Paper 09
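
One way to read the formula: keep the k largest logits, apply softmax over those, and give every other expert a weight of zero. The sketch below assumes the logits H(x) have already been computed; the example values are made up.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other experts get weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)              # subtract max for stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

# Four experts, route to the best two (experts 3 and 0 here).
gates = top_k_gate([2.0, 1.0, 0.1, 3.0], k=2)
```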

GELU activation

Gaussian Error Linear Unit: GELU(x) = x · Φ(x), where Φ is the Gaussian CDF.

Paper 10
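
The exact form can be sketched with the standard library, using the identity Φ(x) = 0.5 · (1 + erf(x / √2)). Many implementations use a tanh approximation instead; this sketch does not.

```python
import math

def gelu(x):
    """GELU(x) = x * Phi(x), with Phi the standard Gaussian CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi
```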

Gemini Nano

The smallest variant of Gemini (released in 1.8B and 3.25B parameter versions), optimized for on-device inference on mobile phones and edge devices.

Paper 20

Gemini Pro

The balanced variant (~50B parameters estimated) of Gemini, deployed for most production use (Google Bard, Workspace, Search).

Paper 20

Gemini Ultra

The largest variant (~1.3T parameters estimated) of Gemini, achieving the highest benchmarks (90.04% MMLU) but requiring significant compute.

Paper 20

Generalization

The ability of a model to perform well on new, unseen data (test set).

Paper 13

Generalization / Generalizing to New Domains

Whether a trained reward model (or policy) performs well on new, unseen tasks or domains.

Paper 15

GloVe

A 2014 alternative to Word2Vec, from Stanford.

Paper 05

GLUE benchmark

General Language Understanding Evaluation.

Paper 11

Gödel's Incompleteness Theorem

A 1931 result by mathematician Kurt Gödel: in any formal mathematical system powerful enough to describe arithmetic, there are true statements that cannot be proved within that system.

Paper 01

Gradient

The vector of partial derivatives of the loss with respect to every

Paper 04

Gradient Checkpointing

A memory optimisation technique where intermediate activations are discarded during forward pass and recomputed during backward pass.

Paper 19

Gradient Descent

The optimisation algorithm that trains neural networks.

Paper 03

Greedy decoding

The simplest decoder strategy.

Paper 06

Grouped Query Attention (GQA)

A variant of Multi-Head Attention where multiple query heads share the same key-value head.

Paper 18

GSM8K

A benchmark dataset of 8,500 grade-school math word problems, ranging from simple arithmetic to multi-step reasoning.

Paper 14

Hallucination

The phenomenon where a language model generates plausible-sounding but factually incorrect or entirely made-up information.

Paper 12

Hard alignment

In older statistical machine translation, each target word was explicitly assigned to exactly one source word.

Paper 07

Hardware-Aware Algorithm

An algorithm designed with GPU memory hierarchy in mind.

Paper 21

Harm Prevention

A principle in many AI constitutions stating that the AI should avoid providing information, advice, or assistance that could lead to physical, financial, psychological, or emotional harm.

Paper 22

HBM (High Bandwidth Memory)

High-capacity GPU memory (e.g., 80GB on an H100), with lower bandwidth than SRAM.

Paper 21

Head (attention head)

One of h = 8 parallel attention computations in multi-head attention, each operating in a lower-dimensional subspace (dₖ = d_model / h).

Paper 08

Hebbian Learning

The biological learning rule proposed by psychologist Donald Hebb in 1949: "neurons that fire together, wire together." When two neurons are active simultaneously, the connection between them strengthens.

Paper 02

Helpful, Harmless, Honest (HHH)

The alignment criteria used to train InstructGPT: helpful (answers user queries well), harmless (doesn't enable or encourage harmful acts), honest (doesn't hallucinate or mislead).

Paper 15

Helpfulness

A principle stating that the AI should genuinely assist the human in achieving their goals.

Paper 22

Hidden Layer

A layer of neurons between the input layer and the output layer.

Paper 03

Hidden state (h)

The LSTM's "spoken" output at each step.

2 papers

Honesty

A principle stating that the AI should be truthful and not deliberately mislead the human.

Paper 22

Human Feedback (HF)

Labels provided by humans comparing two AI outputs and indicating which one is better.

Paper 22

Human Preference / Human Feedback

Judgments by human raters about which model outputs are better.

Paper 15

Human Rater Agreement / Inter-Rater Reliability

Measure of how often different human raters agree on which output is better.

Paper 15

In-context learning

Performing a task by providing examples in the prompt, without updating model weights.

4 papers

In-Context Recall

The ability to retrieve specific facts from long context.

Paper 21

Inference

The process of running a trained model on new inputs to generate predictions.

2 papers

Inference Latency

The time to generate a single token during inference.

Paper 18

Inference vs Training

Inference: generating tokens one-by-one (autoregressive).

Paper 19

Inference-Time Scaling

The broader principle of improving model performance by allocating more compute at inference time, rather than only at training time.

Paper 23

InfiniBand

A high-speed network fabric used in data centres (200+ Gb/s per port).

Paper 19

Input gate (iₜ)

A sigmoid-valued vector that decides how much of the candidate vector

Paper 04

Input Projection (B)

A matrix or function that projects the input u_t into the state space.

Paper 21

Input transformation

GPT-1's technique for reformatting any NLP task's input as a flat token sequence wrapped in special tokens, allowing the unmodified pre-trained model to handle diverse task formats.

Paper 10

Instruction Following / Alignment

Teaching language models to follow user instructions accurately and safely.

Paper 14

Instruction-Following

The ability of a language model to accurately follow user instructions and respond helpfully.

Paper 15

Iteration / Round

One complete cycle of: MCTS search → solution verification → data collection → model training.

Paper 24

Jamba

An LLM from AI21 Labs (2024) that alternates Mamba and Attention blocks in its architecture.

Paper 21

k (top-k experts)

The number of experts selected per token.

Paper 09

Key (K)

The "advertisement" projection of each position.

Paper 08

KL Divergence Penalty

A regularization term in the RL objective that constrains the policy to stay close to the SFT baseline: β · KL[π_RL || π_SFT].

Paper 15

KV Cache

The memory buffer storing Key and Value vectors from all previous tokens during autoregressive (token-by-token) generation.

Paper 18

KV Chunk (Key-Value Chunk)

A segment of the key and value matrices corresponding to a subset of the sequence.

Paper 19

Language model

A probability distribution over sequences of tokens.

Paper 10

Large Language Model (LLM)

A large neural network trained on text with next-token prediction as its objective: given the tokens so far, predict the next token in the sequence.

Paper 14

Latency

The time taken to produce a response.

Paper 23

Latency Hiding

Making communication latency disappear by overlapping it with computation.

Paper 19

Layer Normalisation (Layer Norm)

Applied after each sub-layer.

Paper 08

Learning Rate

A small positive number (e.g.

3 papers

Linear Recurrence

A recurrence relation of the form x_t = Āx_{t-1} + B̄u_t, where each state x_t depends only on the previous state x_{t-1} and the current input u_t, not on the full past history.

Paper 21

Linear Regression

A statistical method to fit a straight line through data points.

Paper 13

Linear Separability

A property of a dataset: two classes are linearly separable if you can draw a straight line (in 2D) or a flat hyperplane (in higher dimensions) that perfectly separates all examples of one class from all examples of the other.

Paper 02

Load balancing

The goal of ensuring that all n experts receive roughly equal numbers of tokens over training.

2 papers

Log-Log Plot

A graph where both axes are logarithmic.

Paper 13

Logit

A raw, unnormalised score before softmax.

Paper 08

Logits

Raw, unnormalised scores output by a neural network before applying softmax.

Paper 07

Long short-term memory

The full name of the paper.

Paper 04

Loss Function

A mathematical function that measures how wrong the network's prediction is.

Paper 03

LSTM (Long Short-Term Memory)

The RNN variant used by both the encoder and decoder in this paper.

Paper 06

Masked Language Modelling (MLM)

BERT's primary pre-training objective.

Paper 11

MATH Benchmark

A dataset of 12,500 competition-level math problems from AMC (American Mathematics Competitions) and AIME (American Invitational Mathematics Examination).

2 papers

Memory Scaling

With P GPUs using Ring Attention, per-GPU memory is O((n/P) × d), so the memory needed on each GPU shrinks in proportion to 1/P and the maximum sequence length grows linearly with the number of GPUs.

Paper 19

Mixture of Experts (MoE)

A layer type where multiple expert networks are available, and a router learns which expert(s) to use for each input.

Paper 18

MMLU (Massive Multitask Language Understanding)

A benchmark of 57 diverse academic subjects (history, law, science, medicine) with 14,042 multiple-choice questions.

Paper 20

Model Scaling / Emergent Threshold

The observation that CoT prompting's effectiveness depends critically on model size.

Paper 14

Model Size (N)

The number of parameters in a neural network.

Paper 13

MoE (Mixture of Experts) layer

A drop-in replacement for the FFN sub-layer in a Transformer.

Paper 09

Monte Carlo Tree Search (MCTS)

A search algorithm that explores a decision tree by: (1) selecting promising nodes using UCB, (2) expanding the tree with new candidate moves, (3) running rollouts to simulate outcomes, (4) backing up the results to update node statistics.

Paper 24

Multi-head attention (MHA)

`Concat(head₁, ..., headₕ) · W^O`.

2 papers

Multi-Query Attention (MQA)

An attention variant where all query heads share a single key-value head.

Paper 18

Multimodal

Capable of processing and reasoning over multiple modalities (text, images, audio, video) simultaneously.

Paper 20

n (number of experts)

Total number of expert networks in one MoE layer.

Paper 09

Native Multimodality

Training a single model jointly on multiple modalities from the start, rather than training text-first and bolting on vision later.

Paper 20

Natural Language

Human language as it is actually spoken and written — English, Hindi, Tamil, etc.

Paper 01

Negative sampling

The training trick that made Word2Vec fast.

Paper 05

Neural Machine Translation (NMT)

The umbrella term for translation systems built entirely from neural

Paper 06

Next Sentence Prediction (NSP)

BERT's second pre-training objective.

Paper 11

Next-token prediction

The pre-training objective: given all previous tokens, predict the probability distribution over the next token.

Paper 10

Noisy top-k gating

The specific gating formulation from the 2017 paper: raw logits have Gaussian noise added before top-k selection during training.

Paper 09

Nucleus Sampling (Top-P Sampling)

A generation strategy that samples only from the smallest set of highest-probability tokens whose cumulative probability reaches a threshold (e.g., top_p=0.9 keeps the most likely tokens until their probabilities sum to 90%).

Paper 12
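
A minimal sketch of the filtering step, assuming the model's next-token probabilities are already available as a list. The function name and the example distribution are illustrative; sampling from the returned distribution is left out.

```python
def nucleus_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}  # token index -> renormalised prob

# With top_p = 0.9 the least likely token (index 3) is dropped.
nucleus = nucleus_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
```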

Numerical Stability

Ensuring computed values don't overflow, underflow, or lose precision.

Paper 19

NVLink

NVIDIA's high-speed GPU-to-GPU interconnect (up to 900 GB/s of total bandwidth per GPU on an H100).

Paper 19

Off-Policy vs. On-Policy RL

Off-policy: Learning from data generated by other policies (e.g., supervised data).

Paper 15

One-hot vector

A vector of length V with a single 1 and the rest zeros.

Paper 05

One-Shot Learning

Performing a task with exactly one example in the prompt.

Paper 12

Online Softmax

An incremental softmax computation (using logsumexp trick) that maintains running statistics (max, sum of exponentials) as you process blocks.

Paper 19
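
The running statistics can be sketched as follows. This is an illustration of the rescaling idea only, assuming scores arrive as a list of blocks; it is not the kernel used by any particular library.

```python
import math

def online_softmax_stats(blocks):
    """Return (running max, running sum of exp(x - max)) after streaming over blocks."""
    m, s = float("-inf"), 0.0
    for block in blocks:
        new_m = max(m, max(block))
        # Rescale the old sum to the new max before adding the new block's terms.
        s = s * math.exp(m - new_m) + sum(math.exp(x - new_m) for x in block)
        m = new_m
    return m, s

m, s = online_softmax_stats([[1.0, 3.0], [2.0, 5.0]])
# softmax(x_i) over the whole sequence is then exp(x_i - m) / s.
```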

Optimal Allocation

For a given compute budget C, the best way to split resources between model size (N) and data size (D) to minimize loss.

Paper 13

Outcome Reward Model (ORM)

A model that scores only the final output (right or wrong), without evaluating intermediate steps.

2 papers

Output gate (oₜ)

A sigmoid-valued vector that decides which slots of the cell state

Paper 04

Output Projection (C)

A matrix or function that projects the hidden state x_t back to the output space.

Paper 21

Overfitting

Training error decreases, but test error increases.

Paper 13

Parallel Scan

A hardware-friendly algorithm (e.g., Blelloch scan) that computes a recurrence x_t = f(x_{t-1}, u_t) in parallel by decomposing it into a tree of subproblems.

Paper 21

Parameters

The learnable weights in a neural network model.

Paper 18

Pass@K

A metric that evaluates whether at least one out of K generated solutions is correct.

Paper 23
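
Read literally, the metric can be computed as below. This is the simple empirical version, assuming each problem comes with exactly K pass/fail results; the unbiased estimator used in some code-generation papers is a different calculation and is not shown.

```python
def pass_at_k(results_per_problem):
    """Fraction of problems for which at least one of the K attempts is correct."""
    solved = sum(1 for attempts in results_per_problem if any(attempts))
    return solved / len(results_per_problem)

# Four problems, K = 2 attempts each (made-up results).
score = pass_at_k([[False, True], [False, False], [True, True], [False, False]])
```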

Patch (Image Patch)

A small rectangular region of an image, typically 14×14 pixels.

Paper 20

Peephole connections

An extension to LSTMs (Gers & Schmidhuber, 2000) in which the gates

Paper 04

Perceptron

The artificial neuron Rosenblatt described: takes several inputs, multiplies each by a weight, sums the results, and outputs 1 if the sum exceeds a threshold, 0 otherwise.

Paper 02

Perplexity

A metric for language models derived from cross-entropy loss.

Paper 12
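
The usual relationship is perplexity = exp(average cross-entropy), with the cross-entropy measured in nats. A minimal sketch, assuming we already have the probability the model assigned to each correct token:

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns probability 0.25 to every correct token has perplexity 4: it is as uncertain as a uniform choice among four options.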

Phrase table

A core data structure in pre-2014 statistical translation.

Paper 06

Policy Gradient / Policy Optimization

RL algorithms that improve a policy (probability distribution) by taking gradient steps that increase expected reward.

Paper 15

Policy Model

The language model being trained and improved across rounds.

Paper 24

Polysemy

The property of a word having multiple meanings.

Paper 05

Positional encoding (PE)

A fixed vector added to each input embedding to inject position information.

2 papers

Power Law

A mathematical relationship where one variable is proportional to another raised to a power: y = a * x^b.

Paper 13
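
Taking logarithms turns the relationship into a straight line, log y = log a + b · log x, so a and b can be recovered with ordinary least squares in log-log space. The sketch below uses synthetic data generated from y = 3 · x^(-0.5); the values are illustrative only.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - b * mx)
    return a, b

xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x ** -0.5 for x in xs]
a, b = fit_power_law(xs, ys)
```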

PPO (Proximal Policy Optimization)

A stable reinforcement learning algorithm used in the RL stage.

Paper 15

Pre-training

Training a model on large-scale, typically unlabelled data before fine-tuning.

3 papers

Pretrained vectors

Word vectors trained on a large corpus by someone else and then

Paper 05

Process Reward Model (PRM)

A machine-learning model trained to evaluate the quality of individual steps in a multi-step reasoning process.

2 papers

Program-of-Thought (PoT)

Solving problems by writing Python code instead of natural language reasoning.

Paper 24

Prompt Engineering

The practice of carefully designing the text prompt to get better outputs from a language model.

2 papers

Prompt Format

The specific structure and wording of a prompt.

Paper 12

Python Verification

The process of executing Python code to check if a solution is correct.

Paper 24

Query (Q)

The "what am I looking for?" projection of each position.

Paper 08

Query Chunk (Q Chunk)

The subset of query vectors on a given GPU.

Paper 19

Query Head

In multi-head attention, one of n_heads independent attention mechanisms.

Paper 18

Reasoning / Multi-Step Reasoning

The cognitive process of chaining ideas together across multiple steps to arrive at a conclusion.

Paper 14

Receptive Field

In deep networks, the range of input positions that influence a given output position.

2 papers

Recurrent Mode (Inference)

During token-by-token generation, apply the recurrence directly: x_t = Āx_{t-1} + B̄u_t.

Paper 21

Recurrent Neural Network (RNN)

A neural network that processes inputs one step at a time, feeding its

2 papers

Reinforcement Learning from Human Feedback (RLHF)

A three-stage training pipeline for aligning language models: (1) Supervised Fine-Tuning on human demonstrations, (2) training a Reward Model on human preference comparisons, (3) using Reinforcement Learning (PPO) to optimize the policy aga

Paper 15

Rejection Sampling

A data generation strategy: generate many candidate solutions, keep only the correct ones, discard the rest.

Paper 24

Representation learning

The broader idea — Word2Vec is an early instance — that useful

Paper 05

Residual connection

Adding the sub-layer's input directly to its output: `x + SubLayer(x)`.

Paper 08

Reverse-input trick

Sutskever's empirical hack: feed the source sentence to the encoder in

Paper 06

Revision Prompt

A prompt asking an AI to rewrite its own output to address a critique.

Paper 22

Reward Function

In MCTS, the function that assigns a reward to a rollout outcome.

Paper 24

Reward Hacking / Gaming the Reward Model

When the RL policy finds ways to get high reward scores without actually being helpful.

Paper 15

Reward Model (RM)

A neural network trained in the second stage of RLHF to predict which of two responses humans prefer.

2 papers

Ring Attention

A distributed attention algorithm where P GPUs are arranged in a ring topology.

Paper 19

Ring Topology

An arrangement of P GPUs in a logical circle where GPU i communicates with GPU i-1 (receives data) and GPU i+1 (sends data).

Paper 19

RL-CAI (Reinforcement Learning Constitutional AI)

The second stage of Constitutional AI.

Paper 22

RLAIF (Reinforcement Learning from AI Feedback)

The stage of Constitutional AI where an AI (rather than a human) provides feedback on which response better follows the constitution.

Paper 22

RoBERTa

Robustly Optimized BERT Pretraining Approach.

Paper 11

Rollout

In MCTS, a simulation of completing a partial solution to a full solution.

Paper 24

Rotary Position Embeddings (RoPE)

A method of encoding token position information by rotating query and key vectors.
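
A sketch of the rotation, assuming the commonly used formulation in which consecutive pairs of dimensions are rotated by an angle proportional to the position, with frequency base 10000. The base, the pairing scheme, and the helper names are assumptions, not details from this entry:

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions are 2 apart, so the two scores should match.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 2), rope(k, 0))
```

Because both vectors are rotated, the query-key dot product depends only on the relative offset between the two positions.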

Paper 18

Sampling Temperature

A hyperparameter in language model decoding that controls randomness.
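
One common way to apply temperature, sketched here as an assumption rather than as any paper's exact method, is to divide the logits by T before the softmax. Lower T sharpens the distribution and higher T flattens it:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # more peaked on the top logit
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform
```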

Paper 23

Scaled dot-product attention

`Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V`.
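
The formula can be sketched in plain Python for a single head with no masking, treating Q, K, and V as lists of row vectors (the helper names are illustrative):

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# A zero query scores every key equally, so the output is the mean of V's rows.
result = attention(Q=[[0.0, 0.0]],
                   K=[[1.0, 0.0], [0.0, 1.0]],
                   V=[[1.0, 2.0], [3.0, 4.0]])  # [[2.0, 3.0]]
```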

2 papers

Scaling

Increasing the size of neural networks (more parameters, more data, more compute).

Paper 13

Scaling Laws / Emergent Capabilities

The observation that larger models have qualitatively different capabilities (reasoning, instruction-following) that smaller models lack.

Paper 15

Segment embedding

One of three embeddings summed to form each token's input representation.

Paper 11

Selective SSM

An SSM where the input projection (B), output projection (C), and step size (Δ) are functions of the input u_t, not fixed constants.

Paper 21

Self-attention

Attention where Q, K, and V all come from the same sequence.

Paper 08

Self-Consistency

A technique that improves chain-of-thought reasoning by sampling multiple independent reasoning chains from the same prompt and taking a majority vote on the final answer.
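
The voting step can be sketched as follows, assuming the final answer has already been extracted from each sampled chain as a string (the function name is hypothetical):

```python
from collections import Counter


def majority_vote(final_answers):
    """Return the most frequent final answer among the sampled chains."""
    return Counter(final_answers).most_common(1)[0][0]


# Five sampled chains, three of which end in the same answer.
answer = majority_vote(["18", "18", "20", "18", "17"])  # "18"
```

In this sketch, ties go to whichever answer appeared first in the list.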

2 papers

Self-Critique

The process of an AI model reading its own output and identifying whether it violates constitutional principles.

Paper 22

Self-Evolution

A bootstrapping process where: (1) a model generates candidate solutions using search, (2) solutions are verified automatically, (3) correct, high-quality solutions become training data, (4) the model is trained on this data, improving for

Paper 24

SentencePiece

A subword tokenizer that converts text into tokens using a learned vocabulary.

Paper 20

Seq2seq (sequence-to-sequence)

The encoder-decoder architecture from Paper 06 (Sutskever et al., 2014).

Paper 07

Sequence Parallelism

Parallelising the sequence dimension of tensors.

Paper 19

Sequential Revision

A strategy where you iteratively refine a solution, using feedback from one attempt to improve the next.

Paper 23

Sigmoid

The activation function σ(z) = 1/(1+e⁻ᶻ).
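
A small sketch of the formula. It branches on the sign of z so that `math.exp` is never called with a large positive argument; that is an implementation choice made here, not part of the definition:

```python
import math


def sigmoid(z):
    """Compute 1 / (1 + e^-z) without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


mid = sigmoid(0.0)  # 0.5
```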

Paper 03

Sigmoid function (σ)

A function that squashes any real number into the interval (0, 1).

2 papers

Skip-gram

The other Word2Vec training task, and the one people usually mean

Paper 05

SL-CAI (Supervised Learning Constitutional AI)

The first stage of Constitutional AI.

Paper 22

Sliding Window Attention (SWA)

An attention variant where each token attends only to the last W tokens (a sliding window), not all previous tokens.
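
The pattern can be sketched as a boolean mask. This version assumes the window of W tokens includes the current token, which is one possible reading of "the last W tokens":

```python
def sliding_window_mask(n, W):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(i - W < j <= i) for j in range(n)] for i in range(n)]


mask = sliding_window_mask(4, 2)
# Row 3 is [False, False, True, True]: token 3 sees only tokens 2 and 3.
```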

Paper 18

Slope

On a log-log plot, the slope of a line is the exponent of the power law.
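
This follows from taking logs of y = a·x^b, which gives log y = log a + b·log x, a straight line with slope b. A quick numerical check, using an arbitrary illustrative power law rather than values from the paper:

```python
import math


def loglog_slope(x1, y1, x2, y2):
    """Slope of the line through two points on a log-log plot."""
    return (math.log(y2) - math.log(y1)) / (math.log(x2) - math.log(x1))


def power_law(x, a=3.0, b=-0.5):
    """Illustrative power law y = a * x**b."""
    return a * x ** b


slope = loglog_slope(10.0, power_law(10.0), 1000.0, power_law(1000.0))
# slope comes out at approximately -0.5, the exponent b
```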

Paper 13

Soft alignment

Attention's approach: each target word is generated using a *weighted blend* of multiple source words, not a hard assignment to one.

Paper 07

Softmax

A function that turns a vector of raw scores into a probability

5 papers

Softplus Function

Smooth approximation of ReLU: softplus(x) = log(1 + e^x).
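
A sketch of the formula, rewritten as max(x, 0) + log(1 + e^(-|x|)). This is algebraically the same function and avoids overflow for large x; the rewrite is a choice made here, not something stated in the entry:

```python
import math


def softplus(x):
    """log(1 + e^x), computed in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


at_zero = softplus(0.0)  # log(2)
```

For large positive x the result approaches x, and for large negative x it approaches 0, which is the sense in which it smooths ReLU.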

Paper 21

Sparse computation

The opposite of dense: only a fraction of parameters are active for any given input.

Paper 09

Specification Gaming

The problem where an AI system finds a way to satisfy the letter of a specification while violating its spirit.

Paper 22

SQuAD

Stanford Question Answering Dataset.

Paper 11

SRAM (Static RAM)

Tiny, ultra-fast on-GPU cache (e.g., 192 KB per core).

Paper 21

State Space Model (SSM)

A continuous or discrete linear dynamical system.

2 papers

State Transition Matrix (A)

An n×n matrix governing how the hidden state x evolves over time.

Paper 21

Statistical Machine Translation (SMT)

The dominant pre-2014 translation approach.

Paper 06

Step Size (Δ)

A scalar (or per-head scalar) that controls the discretisation rate.

Paper 21

Straggler Problem

When one GPU is slower than others (older hardware, thermal throttling, interference), it becomes the bottleneck.

Paper 19

Structured State Space (S4)

A prior SSM architecture (Gu et al., 2021) that imposes structure on the A matrix (e.g., diagonal plus a rank-1 update) for efficiency.

Paper 21

Subsampling

A Word2Vec training trick where very common words (like "the", "of",

Paper 05

Supervised Fine-Tuning (SFT)

The first stage of RLHF.

2 papers

Switch Transformer

Google's 2021 simplification of MoE: k=1 routing (route each token to exactly one expert, no blending).

Paper 09

Sycophancy / Sycophantic Behavior

When a model agrees with a user even when the user is wrong, in order to please them.

Paper 15

Synchronisation Barrier

A point where all P GPUs pause and wait for the slowest GPU to finish.

Paper 19

tanh (hyperbolic tangent)

A function that squashes any real number into the interval (−1, +1).

Paper 04

Teacher forcing

A training technique.

Paper 06

Technical Report

A publication style that, unlike a peer-reviewed research paper, lets companies present results without going through a formal review process.

Paper 20

Temperature (in Generation)

A hyperparameter controlling randomness in generation.

Paper 12

Test-Time Compute

Spending additional computation at inference time (rather than training time) to improve performance.

2 papers

Thought vector

Another name for the **context vector** — Hinton's evocative label for

Paper 06

Threshold

The minimum weighted sum required for the Perceptron to output 1.

Paper 02

Throughput

The number of queries a system can handle per unit time.

Paper 23

Token

A unit of text, roughly a word or subword.

2 papers

Token Budget

The total number of tokens (words or subwords) available for generating a solution.

Paper 23

Token dropping

When an expert receives more tokens than its capacity allows, excess tokens skip the MoE layer and pass through the residual connection unchanged.
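
A minimal sketch of the capacity check, assuming tokens are handled in order and each token has already been assigned one expert index (the function and variable names are hypothetical):

```python
def route_with_capacity(assignments, num_experts, capacity):
    """Split tokens into those an expert accepts and those that overflow."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append((token_idx, expert))
        else:
            dropped.append(token_idx)  # would pass through the residual path
    return kept, dropped


# Expert 0 is chosen three times but can only take two tokens.
kept, dropped = route_with_capacity([0, 0, 1, 0, 1], num_experts=2, capacity=2)
```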

Paper 09

Token Position

The index of a token in the sequence (0 to n-1).

Paper 19

Tokenization

The process of converting input (text, images, audio) into discrete tokens.

Paper 20

Top-k selection (TopK)

The operation that keeps the k largest values in a vector and sets all others to −∞.
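
The operation can be sketched on a plain list. Ties are broken in favour of the earlier index, which is an arbitrary choice made for this example:

```python
def top_k_mask(values, k):
    """Keep the k largest entries and replace every other entry with -inf."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else float("-inf") for i, v in enumerate(values)]


masked = top_k_mask([1.0, 3.0, 2.0, 0.5], k=2)
# [-inf, 3.0, 2.0, -inf]
```

A softmax applied afterwards gives the masked entries exactly zero weight, since e^(−∞) = 0.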

Paper 09

Training Data

The text corpus used to train a language model.

Paper 18

Training-Time Compute

Computation used to train the model initially.

Paper 23

Transfer learning

The reuse of knowledge (model weights) learned on one task/dataset for a different but related task.

Paper 10

Transformer

A neural network architecture based on self-attention, introduced in "Attention Is All You Need" (2017).

Paper 20

Transformer Decoder

The architecture used in GPT models: a stack of self-attention and feedforward layers that process tokens left-to-right (causally).

Paper 13

Transparency

A key benefit of Constitutional AI: the principles are written in human-readable natural language, making the intended values explicit and auditable.

Paper 22

Turing Machine

An abstract mathematical machine Turing described in 1936 — not a real physical device, but a thought experiment.

Paper 01

Turing Test

The test proposed by Turing: a machine passes if a human interrogator, communicating only by typed text, cannot reliably distinguish it from a human.

Paper 01

Unfaithful Reasoning

When a language model generates intermediate reasoning steps that sound logical and plausible but don't actually reflect how the model arrived at its answer.

Paper 14

Upper Confidence Bound (UCB)

A formula that balances exploitation (choosing nodes with high average reward) and exploration (trying under-explored nodes).
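
The entry does not give the formula itself. One widely used form is UCB1, sketched below as an assumption; the exploration constant `c` and the exact variant used in the paper may differ:

```python
import math


def ucb_score(total_reward, visits, parent_visits, c=1.4):
    """Average reward (exploitation) plus a bonus for rarely visited nodes."""
    if visits == 0:
        return float("inf")  # unvisited nodes are tried first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


# Both nodes average 0.5 reward, but the rarely visited one scores higher.
rare = ucb_score(total_reward=1.0, visits=2, parent_visits=100)
common = ucb_score(total_reward=50.0, visits=100, parent_visits=100)
```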

Paper 24

Value (V)

The "what I send when selected" projection of each position.

Paper 08

Value Function / Baseline

In RL, an estimate of expected future reward used to reduce gradient variance.

Paper 15

Values Specification

The process of encoding organizational or societal values into an AI system.

Paper 22

Vanishing Gradient

The problem where gradients shrink toward zero as they propagate backwards through many layers (especially through sigmoid activations).
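
A rough illustration of the sigmoid case: the derivative σ'(z) = σ(z)(1 − σ(z)) never exceeds 0.25, so a chain of sigmoid layers scales the gradient by at most 0.25 per layer. This sketch ignores the weight matrices, which also affect the gradient:

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def best_case_gradient_scale(num_layers):
    """Upper bound on the activation-derivative product through sigmoid layers."""
    return sigmoid_derivative(0.0) ** num_layers  # 0.25 per layer at best


scale_10 = best_case_gradient_scale(10)  # below 1e-5 after only ten layers
```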

Paper 03

Vanishing gradient problem

The phenomenon where, during BPTT on a long sequence, the gradient

Paper 04

Variance

In the context of scaling laws, the spread of loss values across multiple runs or models.

Paper 13

Verifier

A component that evaluates whether a proposed solution is correct.

2 papers

Vision Transformer (ViT)

A Transformer applied to images by dividing them into patches and treating patches as tokens.

Paper 20

Vocabulary (V)

The set of words the model knows about.

2 papers

Weight

A number attached to an input connection in a neural network.

Paper 02

Weight Initialisation

The values given to weights before training begins.

Paper 03

Weight tying

Using the same weight matrix for both the token input embedding and the output projection (UW and UWᵀ).

Paper 10

Window size (c)

How many words on each side of the target count as "context" in

Paper 05

Word analogy task

An evaluation task of the form "A is to B as C is to ___".

Paper 05

Word vector

Another name for a word embedding — a dense, low-dimensional vector

Paper 05

Word2Vec

Collective name for the two 2013 papers (Mikolov et al.) and the

Paper 05

WordPiece

BERT's subword tokenisation algorithm.

Paper 11

XOR Problem

The function that the single-layer Perceptron cannot learn — because XOR is not linearly separable.
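
A brute-force illustration over a small grid of weights: a single threshold unit can reproduce AND but finds no setting that reproduces XOR. The grid search only illustrates the point (the linear-separability argument is what rules out every possible weight setting), and the helper names are hypothetical:

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def perceptron(w1, w2, b, x1, x2):
    """Single threshold unit: output 1 if the weighted sum is positive."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


def grid_can_represent(target, grid):
    """True if some (w1, w2, b) on the grid matches the target on all inputs."""
    return any(
        all(perceptron(w1, w2, b, x1, x2) == target[(x1, x2)] for x1, x2 in INPUTS)
        for w1, w2, b in product(grid, repeat=3)
    )


grid = [x / 2 for x in range(-4, 5)]  # -2.0 to 2.0 in steps of 0.5
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_found = grid_can_represent(AND, grid)  # True, e.g. w1 = w2 = 1, b = -1.5
xor_found = grid_can_represent(XOR, grid)  # False
```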

Paper 02

Zero-Shot CoT

A variant of chain-of-thought prompting where no examples are provided.

Paper 14

Zero-shot learning

Performing a task without any task-specific examples or fine-tuning.

2 papers

[

[CLS] token

A special token prepended to every BERT input.

Paper 11

[DELIM] token

Special token inserted between input segments (premise and hypothesis, question and answer) during fine-tuning.

Paper 10

[EXTRACT] token

Special token appended at the end of the input sequence during fine-tuning.

Paper 10

[MASK] token

The special token used to replace selected tokens during MLM pre-training.

Paper 11

[SEP] token

A separator token appended after each sentence in BERT's input.

Paper 11

[START] token

Special token prepended to every input sequence during fine-tuning.

Paper 10

`<EOS>` token

End-of-sentence token.

Paper 06

`<SOS>` token

Start-of-sentence token.

Paper 06
