Memory Scaling
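With P GPUs using Ring Attention, each GPU holds only its own block of the sequence, so per-GPU memory is O((n/P) × d) for sequence length n and feature dimension d. Per-GPU memory therefore shrinks in proportion to 1/P, and the longest sequence that fits in a fixed per-GPU memory budget grows linearly with the number of GPUs. By contrast, single-GPU attention requires O(n × d). In principle, this allows sequences of arbitrary length to be processed by adding GPUs.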
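
The scaling can be checked with a little arithmetic. The sketch below is only an illustration, not a memory profiler: the function names per_gpu_elements and max_seq_len are hypothetical, and it counts the elements of a single (n/P) × d block per GPU, ignoring constant factors such as the number of tensors held, communication buffers, batch size, and bytes per element.

```python
import math


def per_gpu_elements(n: int, d: int, num_gpus: int) -> int:
    """Elements in one (n/P) x d block held by each GPU (hypothetical helper)."""
    block_len = math.ceil(n / num_gpus)  # sequence positions assigned to one GPU
    return block_len * d


def max_seq_len(budget_elements: int, d: int, num_gpus: int) -> int:
    """Longest sequence whose per-GPU block fits in budget_elements (hypothetical helper)."""
    positions_per_gpu = budget_elements // d  # positions one GPU can hold
    return positions_per_gpu * num_gpus


if __name__ == "__main__":
    n, d = 1_048_576, 128  # assumed example sizes, not taken from the text
    for p in (1, 2, 4, 8):
        print(p, per_gpu_elements(n, d, p))
```

With these assumed sizes, a single GPU holds 134,217,728 elements for the block, while each of 8 GPUs holds 16,777,216, an 8x reduction. Conversely, a fixed per-GPU budget of 16,777,216 elements supports 131,072 positions on one GPU and 1,048,576 positions across 8.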