The LLM Inference Performance Restaurant

In the constantly shifting landscape of artificial intelligence, Large Language Models (LLMs) stand out as a remarkable step forward.

But how do we take user interactions and put scarce, expensive GPU resources to good use in serving them?

Imagine our GPU as a high-end, modern kitchen, equipped with everything needed to serve up a perfect dining experience.

  • Prefill Phase: Reading the Order. In the first step, the model absorbs the full input prompt in a single forward pass to produce the Key/Value (KV) tensors – the model’s ‘memory’ of the prompt. Think of it as the “Chef Reading the Order”. This stage is compute-heavy and determines the Time To First Token (TTFT) – how long diners wait before their first course hits the table.
  • Decode Phase: The Plating Stage. The output arrives step by step; every token depends on the tokens that came before it. Think of it as the “Plating of the Risotto”: the chef sets down one delicious grain at a time, and since the position of each grain depends on the last, he has to look at the plate, place a grain, and look again. This stage is bound by how quickly the chef can reach the ingredients (memory bandwidth), which determines the Time Per Output Token (TPOT) – the “typing speed” at which each token is added as the meal is plated.
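The two phases above can be sketched with a toy latency model. The hardware numbers here (8,000 prompt tokens/s for prefill, 20 ms per decoded token) are illustrative assumptions, not measurements from any real GPU:

```python
# Toy model: compute-bound prefill sets TTFT; memory-bound decode sets TPOT.
# All throughput/latency figures below are hypothetical, for illustration only.

def prefill_time(prompt_tokens: int, prefill_tokens_per_sec: float) -> float:
    """Prefill processes the whole prompt in one forward pass (compute-bound)."""
    return prompt_tokens / prefill_tokens_per_sec

def decode_time(output_tokens: int, sec_per_output_token: float) -> float:
    """Decode emits tokens one at a time (memory-bandwidth-bound)."""
    return output_tokens * sec_per_output_token

ttft = prefill_time(prompt_tokens=512, prefill_tokens_per_sec=8000)  # assumed rate
tpot = 0.02  # assumed: 20 ms per output token
total = ttft + decode_time(output_tokens=100, sec_per_output_token=tpot)
print(f"TTFT ≈ {ttft:.3f}s, total request latency ≈ {total:.3f}s")
```

The point of the split is visible even in this sketch: TTFT is a one-off cost proportional to prompt length, while total latency is dominated by the per-token decode loop.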

Performance Metrics: The Kitchen’s Lifeblood.

Meeting Capacity Challenges: The Counter Space Problem. How far can we really scale our kitchen?

Little’s Law tells us the limit has less to do with the chef’s speed than with the size of the kitchen: the number of in-flight requests L equals the arrival rate λ times the time W each request spends in the system (L = λ × W).

The Math of the Kitchen:

  • Chef’s Speed: Able to produce 1000 “grains” (tokens) per second.
  • Average Order: 100 grains per plate of Risotto.
  • Throughput: 10 Requests Per Second (RPS).

If a complete meal takes 2.3 seconds to finish, the kitchen must be serving around 23 requests at the same time (10 RPS × 2.3 s) to sustain that throughput.

In other words, we need enough GPU VRAM (Memory = Counter Space) to hold 23 plates at once.
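The kitchen math above is a direct application of Little’s Law:

```python
# Little's Law: L = λ × W
# L = concurrent requests, λ = arrival rate (RPS), W = time in system (s).
# Numbers taken from the kitchen example in the text.

throughput_rps = 10   # λ: requests per second
latency_s = 2.3       # W: seconds per complete request (prefill + decode)

concurrent_requests = throughput_rps * latency_s  # L
print(f"Concurrent requests to provision for: {round(concurrent_requests)}")
```

Capacity planning then follows: multiply the concurrency L by the per-request KV-cache footprint to size the VRAM “counter space”.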

There is light at the end of the tunnel. The following mechanisms resolve these bottlenecks and improve the kitchen experience in three ways.

  • Disaggregated serving shifts GPU inference into production mode by decoupling compute-heavy prefill (“reading”) from memory-heavy decode (“plating”) across multiple GPUs or servers, so each stage can scale separately.
  • Runtime efficiency comes from iteration-level scheduling: new requests join the running batch as they arrive instead of waiting for a full batch to drain, and global routing picks the best node based on load and cached context.
  • Large context windows are handled by distributed KV caches that spill beyond a single GPU into system memory or storage, lifting the VRAM limit. Fast, RDMA-based data transfer then links the prefill and decode stages, allowing near-instant hand-offs.
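The iteration-level scheduling idea can be shown with a minimal toy scheduler. This is a hypothetical sketch, not how vLLM or SGLang actually implement it; it only demonstrates that requests are admitted at every decode iteration rather than between full batches:

```python
# Toy continuous (iteration-level) batching: each loop iteration is one decode
# step, and waiting requests are admitted as soon as a batch slot frees up.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    remaining_tokens: int  # output tokens still to decode

def run(requests, max_batch=4):
    waiting = deque(requests)
    running = []
    steps = 0
    while waiting or running:
        # Admit new requests at every iteration, not only when the batch drains.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running request emits exactly one token.
        for r in running:
            r.remaining_tokens -= 1
        running = [r for r in running if r.remaining_tokens > 0]
        steps += 1
    return steps

reqs = [Request(i, n) for i, n in enumerate([3, 1, 2, 5, 2])]
print(run(reqs))  # → 5 decode steps for this workload
```

With static batching, the fifth request would wait until the first batch of four finished entirely; here it slips in the moment the one-token request completes, keeping the GPU’s “burners” full.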

Combined, these mechanisms avoid conflicts between resources, boost throughput, and reduce latency, enabling large-scale, data-center-level inference. vLLM or SGLang is usually all you need …
