Gradient Checkpointing

Appears in 1 paper

A memory optimisation technique in which intermediate activations are discarded during the forward pass and recomputed during the backward pass.

As used in Paper 19: Ring Attention with Blockwise Transformers for Near-Infinite Context

Gradient checkpointing reduces activation memory at the cost of extra computation, because discarded activations must be recomputed during the backward pass. It complements Ring Attention during training.
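
The trade-off between memory and recomputation can be illustrated with a small sketch. This is a toy example using only the Python standard library, not the implementation from the paper: the scalar tanh "layers" and the names layer, layer_grad, grad_full, grad_checkpointed, and segment are all hypothetical, chosen only to show the idea of keeping a few checkpoints and recomputing everything else one segment at a time.

```python
import math


def layer(x, w):
    """One toy scalar layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grad(x, w):
    """Derivative dy/dx of the toy layer, evaluated at its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full(x0, weights):
    """Baseline: store every layer input during the forward pass."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer(x, w)
    g = 1.0
    # Backward pass: chain rule over the stored inputs, last layer first.
    for x_in, w in zip(reversed(inputs), reversed(weights)):
        g *= layer_grad(x_in, w)
    return g, len(inputs)


def grad_checkpointed(x0, weights, segment):
    """Store only the input of every `segment`-th layer, recompute the rest."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # kept; all other inputs are discarded
        x = layer(x, w)
    g = 1.0
    peak = len(checkpoints)
    # Backward pass: visit the segments from last to first.
    for start in sorted(checkpoints, reverse=True):
        seg_weights = weights[start:start + segment]
        # Recompute this segment's inputs from its checkpoint.
        inputs = []
        x = checkpoints[start]
        for w in seg_weights:
            inputs.append(x)
            x = layer(x, w)
        # Upper bound on values held at once: checkpoints plus one segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for x_in, w in zip(reversed(inputs), reversed(seg_weights)):
            g *= layer_grad(x_in, w)
    return g, peak


weights = [0.5 + 0.1 * i for i in range(16)]
g_full, stored_full = grad_full(0.3, weights)
g_ckpt, stored_ckpt = grad_checkpointed(0.3, weights, segment=4)
print(g_full, stored_full)
print(g_ckpt, stored_ckpt)
```

With 16 layers and a segment length of 4, the baseline holds 16 layer inputs while the checkpointed version holds at most 8 at any time, and both return the same gradient. The price is one extra forward computation per layer during the backward pass. Real frameworks apply the same idea to tensors and full layers rather than scalars.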