Gradient Checkpointing

Appears in 1 paper

A memory optimisation technique in which intermediate activations are discarded during the forward pass and recomputed during the backward pass.

As used in Paper 19 — Ring Attention with Blockwise Transformers for Near-Infinite Context →

Checkpointing reduces peak activation memory at the cost of extra computation, because the discarded activations have to be recomputed when the backward pass needs them. It complements Ring Attention during training.
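To make the trade-off concrete, here is a minimal pure-Python sketch of the idea on a toy chain of scalar layers. It is not taken from the Ring Attention paper: the layer `y = tanh(w * x)`, the function names `backward_full` and `backward_checkpointed`, and the `segment` parameter are all assumptions chosen for illustration. The baseline keeps every layer input; the checkpointed version keeps only one input per segment and recomputes the rest during the backward pass.

```python
import math


def layer(w, x):
    # Toy layer (an assumption for this sketch): y = tanh(w * x).
    return math.tanh(w * x)


def layer_grads(w, x, grad_out):
    # Return (dL/dw, dL/dx) for y = tanh(w * x), given dL/dy = grad_out.
    y = math.tanh(w * x)
    local = (1.0 - y * y) * grad_out
    return local * x, local * w


def backward_full(weights, x0):
    # Baseline: store the input of every layer during the forward pass.
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0  # gradient of the output with respect to itself
    grad_w = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        grad_w[i], grad = layer_grads(weights[i], inputs[i], grad)
    return x, grad_w, len(inputs)


def backward_checkpointed(weights, x0, segment):
    # Forward: keep only the input of every `segment`-th layer.
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    out = x

    grad = 1.0
    grad_w = [0.0] * len(weights)
    peak = 0  # largest number of stored values held at once
    # Backward: visit segments from last to first.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        # Recompute this segment's layer inputs from its checkpoint.
        inputs = []
        x = checkpoints[start]
        for i in range(start, end):
            inputs.append(x)
            x = layer(weights[i], x)
        peak = max(peak, len(checkpoints) + len(inputs))
        # Backpropagate through the segment using the recomputed inputs.
        for i in reversed(range(start, end)):
            grad_w[i], grad = layer_grads(weights[i], inputs[i - start], grad)
    return out, grad_w, peak


if __name__ == "__main__":
    weights = [0.5 + 0.1 * i for i in range(16)]
    _, grads_full, stored_full = backward_full(weights, 0.3)
    _, grads_ckpt, stored_ckpt = backward_checkpointed(weights, 0.3, segment=4)
    print("stored values, full:", stored_full)
    print("stored values, checkpointed:", stored_ckpt)
    print("max gradient difference:",
          max(abs(a - b) for a, b in zip(grads_full, grads_ckpt)))
```

Both functions return the same gradients; the checkpointed one holds fewer values at a time but runs each layer's forward computation a second time. In practice this is usually done through a framework utility such as `jax.checkpoint` or `torch.utils.checkpoint.checkpoint` rather than written by hand.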
