Hardware-Aware Algorithm
An algorithm designed with GPU memory hierarchy in mind.
An algorithm designed with GPU memory hierarchy in mind. Mamba's forward pass uses specific memory access patterns to be cache-efficient. Doesn't just call PyTorch ops; requires custom CUDA kernels for good performance.