Inference

Appears in 2 papers

The process of running a trained model on new inputs to generate predictions.

For LLMs, inference means generating tokens one at a time (autoregressive decoding). Inference is where the KV cache problem manifests: every new token requires reading the model weights and the growing cache of past keys and values from memory, so decoding speed is typically bounded by memory bandwidth, not GPU compute power.
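The following is a minimal sketch of the two ideas above: a greedy autoregressive decoding loop, and a back-of-the-envelope estimate of why decoding tends to be memory-bound. Everything here is hypothetical and for illustration only: `toy_next_token` stands in for a real model, and the layer count, head count, weight size, and bandwidth figures are assumed values, not numbers taken from either paper.

```python
from typing import Callable, List


def greedy_decode(next_token: Callable[[List[int]], int],
                  prompt: List[int],
                  max_new_tokens: int,
                  eos_id: int) -> List[int]:
    """Generate tokens one at a time, feeding each output back in as input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one model call per generated token
        tokens.append(tok)
        if tok == eos_id:
            break
    return tokens


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model: counts upward, then emits EOS (0)."""
    last = tokens[-1]
    return 0 if last >= 5 else last + 1


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch: int = 1) -> int:
    """Size of the cached keys and values for a given sequence length."""
    # Leading 2: one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


def min_seconds_per_token(weight_bytes: float, cache_bytes: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on step time if weights and cache are each read once."""
    return (weight_bytes + cache_bytes) / bandwidth_bytes_per_s


generated = greedy_decode(toy_next_token, prompt=[3], max_new_tokens=10, eos_id=0)
print(generated)  # [3, 4, 5, 0]

# Assumed configuration: 32 layers, 8 KV heads of dimension 128, 16-bit values.
cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(cache / 2**20)  # 512.0 (MiB)

# Assumed 14 GB of weights and 1 TB/s of memory bandwidth.
print(min_seconds_per_token(14e9, cache, 1e12))
```

The estimate ignores arithmetic entirely. It only says that a decode step cannot finish faster than the time needed to stream the weights and the cache through memory once, which is why a larger cache slows generation even when the GPU has spare compute.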

As used in Paper 23 — Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters →

The process of running a trained model to produce outputs for new inputs. Inference is when the model "answers" — in contrast to training, where parameters are updated.
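A toy sketch of that contrast, assuming a hypothetical one-parameter-pair linear model rather than an LLM: `predict` only reads the parameters, while `train_step` returns updated ones after a single gradient step on squared error.

```python
from typing import Tuple


def predict(w: float, b: float, x: float) -> float:
    """Inference: read the parameters, produce an output, change nothing."""
    return w * x + b


def train_step(w: float, b: float, x: float, y: float,
               lr: float = 0.1) -> Tuple[float, float]:
    """Training: one gradient step on 0.5 * (prediction - y) ** 2."""
    err = predict(w, b, x) - y
    return w - lr * err * x, b - lr * err


w, b = 1.0, 0.0

# Inference leaves the parameters exactly as they were.
answer = predict(w, b, 2.0)
params_after_inference = (w, b)

# A training step moves the parameters toward the target y = 5.0.
new_w, new_b = train_step(w, b, x=2.0, y=5.0)
```

Here `answer` is `2.0` and `(w, b)` is unchanged after the call to `predict`, while `train_step` returns parameters close to `(1.6, 0.3)`.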

Appears in papers: Paper 18 (Mistral 7B) and Paper 23 (Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters)