Receptive Field
In deep networks, the range of input positions that can influence a given output position. With sliding window attention (SWA) and k layers, Mistral's theoretical receptive field grows to approximately k × W tokens, where W is the window size. With 32 layers and W = 4,096, that is 32 × 4,096 = 131,072 (~131K) tokens, meaning the model can implicitly draw on information from up to ~131K tokens back even though each layer explicitly attends to only ~4K.
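The k × W estimate can be checked with a small sketch. It assumes each layer lets a position attend to itself and to at most W earlier positions, so the exact reach may differ by a few tokens per layer under other window conventions. The function names `swa_receptive_field` and `simulated_reach` are illustrative and do not come from any library.

```python
def swa_receptive_field(num_layers: int, window: int) -> int:
    """Approximate reach (in tokens) of stacked sliding-window layers: k * W."""
    return num_layers * window


def simulated_reach(num_layers: int, window: int, seq_len: int) -> int:
    """Propagate influence layer by layer and report how far back the last
    position can see. Assumes position i attends to positions i-window..i."""
    # earliest[i] = earliest input position that can influence position i so far
    earliest = list(range(seq_len))
    for _ in range(num_layers):
        # earliest is non-decreasing, so the minimum over the window
        # is found at the window's left edge (clamped to position 0).
        earliest = [earliest[max(0, i - window)] for i in range(seq_len)]
    return (seq_len - 1) - earliest[-1]


if __name__ == "__main__":
    # Figures from the entry: 32 layers, W = 4,096
    print(swa_receptive_field(32, 4096))
    # Toy check: 3 layers with a window of 4 reach 12 tokens back
    print(simulated_reach(num_layers=3, window=4, seq_len=100))
```

The simulation also shows that the reach is capped by the sequence length: a sequence shorter than k × W is covered in full.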
In Ring Attention, the range of input tokens that can influence an output position. Unlike Mistral's SWA, whose reach builds up indirectly across layers, Ring Attention computes exact full attention, so the receptive field covers all n tokens at every layer.