Inference
The process of running a trained model on new inputs to generate predictions.
The process of running a trained model on new inputs to generate predictions. For LLMs, inference means generating tokens one at a time (autoregressive decoding). Inference is where the KV cache problem manifests — your compute is bounded by memory bandwidth, not GPU compute power.
The process of running a trained model to produce outputs for new inputs. Inference is when the model "answers" — in contrast to training, where parameters are updated.