H100 Benchmarks Hide a 27x Cold Start Penalty



Part 5 of 6 in the Benchmark Reality Checks series.

TensorRT-LLM finishes first in every throughput test in the most detailed H100 inference benchmarks published this month. It also needs 28 minutes before it can serve a single token. Nobody puts that number on the benchmark slide.

Spheron’s March comparison pits vLLM v0.18.0, TensorRT-LLM v1.2.0, and SGLang v0.5.9 against each other on identical hardware: one H100 SXM5 80GB running Llama 3.3 70B in FP8. At peak concurrency, TensorRT-LLM pushes nearly 16% more tokens per second than its closest rival. NVIDIA wins every metric Spheron tests.

Then a Kubernetes auto-scaler fires. The engine compilation that powers TensorRT-LLM’s speed — converting model weights into hardware-optimized execution plans — takes roughly 28 minutes on a single H100. vLLM and SGLang each load in just over a minute. That is a 27x cold start penalty buried behind the throughput numbers. What the documentation calls “one-time setup” becomes a recurring tax every time a new pod spins up, every time a spot instance gets reclaimed, every time traffic spikes past provisioned capacity.

This penalty hid for a year because inference benchmarks assumed it away. In early 2025, deployment meant “configure it once and leave it running” — vLLM’s PagedAttention was the headline innovation, SGLang was primarily a research project, and no major benchmark tested cold start as a first-class metric. The shift toward auto-scaled, spot-instance inference outpaced the performance tests, and a 28-minute penalty survived in plain sight while the infrastructure it breaks became the industry default.

Eight Percent to Fifteen — and Then the Clock Starts

At a single concurrent request, TensorRT-LLM’s throughput edge is modest: 130 tokens per second versus vLLM’s 120 — an 8% margin. Scale to 50 concurrent requests and the gap widens to 13%. At 100, it reaches 15.8%. Latency follows the same curve: p95 time-to-first-token at 100 concurrent requests hits 1,280 ms for TensorRT-LLM versus 1,450 ms for vLLM (Spheron benchmarks). The compiled engine’s advantage does not plateau under load — it accelerates. That is the throughput case for compilation, and under sustained serving it is real.

“It has been an absolute breeze,” said Naveen Rao, VP of Engineering at Databricks, describing TensorRT-LLM integration (NVIDIA Developer Blog). Rao’s team runs fixed infrastructure with pre-provisioned GPUs — the one environment where a 28-minute compilation vanishes into setup and never surfaces again.

But Databricks is not the median deployment.

TensorRT-LLM’s compilation runs deeper than a cache warm-up. The engine locks to the exact GPU, driver version, and checkpoint that built it — produced through CUDA graph optimizations, fused kernels, and Tensor Core acceleration. Change any variable and the 28-minute clock restarts. vLLM loads in 62 seconds; SGLang in 58 (Spheron benchmarks). Their weights are portable. TensorRT-LLM’s are not.

What Spheron’s benchmarks and NVIDIA’s own documentation reveal is what this analysis calls the Compilation Cliff — the point where a framework’s optimization step outlasts the infrastructure’s patience, converting peak throughput into a deployment liability. During a typical 30-minute traffic spike, compiled TensorRT-LLM spends 28 of those minutes unable to serve a single token. The fastest framework on the performance test cannot participate in the scaling event that triggered the need for more capacity.

During those 28 minutes, a TensorRT-LLM pod burns 1,680 seconds of GPU time at zero utilization. At the benchmark’s peak rate of 2,780 tok/s, that equals 4,670,400 tokens of unrealized capacity per scale-up event. A vLLM pod loading in 62 seconds forfeits 148,800 tokens at its own peak rate.

A single cold start on the faster framework wastes 31x more serving capacity than on the one it beats by 15.8%. But throughput tests do not track where production teams place their actual bets — and the gap between benchmark rankings and deployment choices turns out to be enormous.
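The waste figures above can be checked in a few lines of Python, using the peak throughput and load times quoted from the Spheron benchmarks:

```python
# Back-of-envelope check of the cold start waste figures.
# All inputs are the Spheron benchmark numbers quoted in the text.

TRT_PEAK_TOKS = 2_780    # TensorRT-LLM peak throughput, tok/s (compiled)
VLLM_PEAK_TOKS = 2_400   # vLLM peak throughput, tok/s
TRT_COLD_S = 28 * 60     # 1,680 s of compilation before the first token
VLLM_COLD_S = 62         # vLLM load time, seconds

trt_forfeit = TRT_PEAK_TOKS * TRT_COLD_S      # tokens unserved while compiling
vllm_forfeit = VLLM_PEAK_TOKS * VLLM_COLD_S   # tokens unserved while loading

print(trt_forfeit)                         # 4670400
print(vllm_forfeit)                        # 148800
print(round(trt_forfeit / vllm_forfeit))   # 31 -> ~31x more wasted capacity
```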

Where 74,200 Stars Outweigh a Benchmark Win

vLLM counts 74,200 GitHub stars and 7,700 dependent projects — codebases that import it as a serving dependency. Stars measure attention. Dependent projects measure lock-in: 7,700 production systems that would need to rewrite their inference layer to switch frameworks. SGLang, backed by the LMSYS research group, reports powering over 400,000 GPUs worldwide with adoption spanning xAI, AMD, NVIDIA, and Intel. Neither framework leads the throughput rankings. Both dominate production.

Even Mitrasish, Co-founder at Spheron and author of the performance comparisons that crowned TensorRT-LLM the throughput winner, recommends vLLM for most teams. He cites broader model support, an active community, and competitive throughput without a compilation step (Spheron).

SGLang’s case rests on a different architectural bet. Its cache-aware load balancer routes requests through a RadixAttention layer that delivers 1.9x throughput gains on shared-prefix workloads — the exact traffic pattern chatbots, RAG pipelines, and multi-turn conversations generate. In testing, that routing pushed output from 82,665 to 158,596 tokens per second on eight A100 GPUs. Standard inference benchmarks of single-prompt throughput cannot capture this advantage. Production deployments where 60-80% of requests share system prompts can (LMSYS research).

Run the estimate. If SGLang’s 1.9x shared-prefix multiplier holds proportionally on H100 hardware, and 70% of a production workload shares system prompts — the midpoint of the 60-80% range typical for chatbot and RAG deployments — the weighted effective throughput is 0.70 × (base × 1.9) + 0.30 × base, or roughly 1.63x the framework’s baseline rate. Applied to SGLang’s H100 throughput in the Spheron tests, that estimate exceeds TensorRT-LLM’s compiled 2,780 tok/s — without a single second of compilation. The 1.9x figure comes from A100 testing, so the H100 multiplier will differ. But the direction is clear: for prefix-heavy workloads, the “slower” framework may already be faster in practice than the benchmark winner.
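As a sketch, the weighted-throughput estimate looks like this in Python. The 1.9x multiplier is the A100-measured figure, so treating it as an H100 value is an extrapolation, not a measurement:

```python
# Weighted effective throughput for a prefix-heavy workload.
# ASSUMPTION: SGLang's 1.9x shared-prefix multiplier (measured on A100s)
# carries over proportionally to other hardware.

def effective_throughput(base_toks: float, prefix_share: float,
                         multiplier: float = 1.9) -> float:
    """Blend boosted shared-prefix traffic with unboosted traffic."""
    return prefix_share * (base_toks * multiplier) + (1 - prefix_share) * base_toks

# At 70% prefix sharing, the blended multiplier is 0.70 * 1.9 + 0.30 = 1.63x.
print(round(effective_throughput(1.0, 0.70), 2))  # 1.63
```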

Currently, the benchmark winner sits third in production adoption. NVIDIA’s engineering team noticed — and their fix reveals how deep the tradeoff runs.

NVIDIA’s Fix Erases Its Own Lead

NVIDIA describes TensorRT-LLM as “architected on PyTorch” with support for loading checkpoints directly from HuggingFace format. This pathway reportedly cuts cold start to roughly 60-90 seconds by skipping the full compilation step. The tell is not the feature — it is that NVIDIA built it at all. You do not engineer around a number you can dismiss as a one-time cost.

Run the subtraction. If the reduced-compilation pathway’s throughput approximates vLLM’s 2,400 tok/s (a reasonable floor, since both skip the full engine build), the sacrifice is 380 tokens per second: 2,780 minus 2,400. Over an 8-hour peak serving window on a single H100, that gap equals 10.9 million fewer tokens per day, the steady-state cost of eliminating a 28-minute cold start. TensorRT-LLM in this mode starts like vLLM and serves like vLLM, but carries additional framework complexity and a hardware compatibility list that narrows to NVIDIA enterprise GPUs. NVIDIA’s fix for TensorRT-LLM’s deployment problem is, functionally, a second vLLM with a smaller hardware compatibility list.

“LLM inference is memory-IO bound, not compute bound,” wrote Eric Liang of Anyscale in foundational research on continuous batching (arXiv preprint). But Spheron’s concurrency sweep tells a different story. TensorRT-LLM’s compiled compute advantage widens from 8% at one concurrent request to 15.8% at 100. If inference were purely memory-IO bound, compiled kernel optimizations should deliver flat or shrinking gains under load — not growing ones.

Liang is right that KV-cache management gates single-request latency, but the claim that compute optimizations deliver only diminishing returns at production concurrency is partially contradicted by Spheron’s independent throughput data. Memory-first architectures like vLLM’s PagedAttention and SGLang’s RadixAttention compete at structural advantage without compilation — but they compete against an advantage that is real and growing under load, not marginal and fading.

The mechanism behind TensorRT-LLM’s benchmark dominance is the mechanism behind its production friction — and NVIDIA’s own fix dissolves the advantage that justified the choice. That leaves one question: when is the compilation cost still worth paying?

When 28 Minutes Is Worth It

Compile once. Persist the engine to disk. Reload in under two minutes on every subsequent start. For teams running a single model version on dedicated hardware with predictable traffic, the Compilation Cliff is irrelevant — TensorRT-LLM’s 15.8% throughput advantage compounds across every served request for months, and 28 minutes of initial compilation amortizes to noise.

Daniel Adeboye, writing for Northflank, noted that TensorRT-LLM is “designed specifically for NVIDIA enterprise GPUs” while vLLM “runs on most CUDA GPUs, from consumer cards to datacenter hardware” (Northflank). For organizations already locked into H100 clusters, the 15.8% throughput advantage translates directly to fewer GPUs per SLA tier (Spheron benchmarks) — and fewer GPUs means six-figure annual savings that dwarf a half-hour compilation wait.

But how long must TensorRT-LLM serve continuously before its throughput edge repays the cold start deficit? The performance data contains the answer. From the moment both frameworks receive a scale-up signal, vLLM begins serving at 2,400 tok/s after 62 seconds while TensorRT-LLM serves nothing for 1,680 seconds. Set total tokens equal — 2,400 × (T − 62) = 2,780 × (T − 1,680) — and solve:

4,670,400 − 148,800 = 380T
T = 11,899 seconds ≈ 3.3 hours

TensorRT-LLM needs 3.3 hours of continuous serving after a cold start to deliver more total tokens than vLLM. A 15.8% per-second advantage takes nearly 200 minutes to overcome vLLM’s 1,618-second head start. Any scaling event shorter than 3.3 hours — and most traffic spikes are — produces fewer total tokens on the faster framework than on the slower one.
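The breakeven solve is a one-liner once the equation is rearranged, with all inputs taken from the benchmark figures above:

```python
# Breakeven point: total tokens equal when
#   2,400 * (T - 62) = 2,780 * (T - 1,680)
# Rearranged: T = (2,780 * 1,680 - 2,400 * 62) / (2,780 - 2,400)

VLLM_TOKS, VLLM_COLD = 2_400, 62      # vLLM tok/s and load time (s)
TRT_TOKS, TRT_COLD = 2_780, 1_680     # TensorRT-LLM tok/s and compile time (s)

T = (TRT_TOKS * TRT_COLD - VLLM_TOKS * VLLM_COLD) / (TRT_TOKS - VLLM_TOKS)

print(round(T))            # 11899 seconds
print(round(T / 3600, 1))  # 3.3 hours
```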

As it stands, this is the steelman’s own math turning against it. The compilation advantage is real, but it amortizes across hours, not minutes. For the fixed-infrastructure use case — months of uninterrupted serving, no spot reclamation, no model version rotation — 3.3 hours is a rounding error. For everyone else, it is the deployment pattern itself.

And that deployment pattern is shrinking. Cloud inference increasingly runs on spot instances with auto-scaling as a default primitive. Model release cadence has compressed from quarterly to monthly. Multi-model serving — routing requests across versioned checkpoints for A/B testing — means “compile once” breaks every few weeks. Performance tests that favor TensorRT-LLM evaluate a world fewer teams inhabit each quarter — which means the decision needs a metric that benchmarks refuse to measure.

Calculate Before You Compile

Arguing about framework superiority without specifying the deployment context is benchmarking theater. What resolves the debate is the Elasticity Ratio: cold start time divided by mean traffic spike duration.

Elasticity Ratio = cold_start_seconds / mean_spike_duration_seconds

ER > 0.10  →  framework cannot scale into demand
ER < 0.05  →  cold start is operationally negligible

Run the numbers against a typical 30-minute traffic spike:

Framework                            Cold Start (s)   ER (30-min spike)   Verdict
TensorRT-LLM (compiled)              1,680            0.93                Misses 93% of spike
TensorRT-LLM (reduced compilation)   ~75              0.04                Scales, reduced throughput
vLLM                                 62               0.03                Scales at full throughput
SGLang                               58               0.03                Scales at full throughput

Cold start data for compiled, vLLM, and SGLang from Spheron benchmarks. Reduced-compilation estimate based on NVIDIA documentation. Elasticity Ratios derived.
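A minimal sketch of the calculation behind the table, using the cold start figures above (the ~75-second reduced-compilation value is NVIDIA’s documented estimate, not an independent measurement):

```python
# Elasticity Ratio = cold_start_seconds / mean_spike_duration_seconds
# Reproduces the table above for a 30-minute (1,800 s) traffic spike.

def elasticity_ratio(cold_start_s: float, spike_s: float) -> float:
    return cold_start_s / spike_s

SPIKE_S = 30 * 60  # 1,800 s

frameworks = {
    "TensorRT-LLM (compiled)": 1_680,
    "TensorRT-LLM (reduced compilation)": 75,  # NVIDIA-documented estimate
    "vLLM": 62,
    "SGLang": 58,
}

for name, cold in frameworks.items():
    er = elasticity_ratio(cold, SPIKE_S)
    if er > 0.10:
        verdict = "cannot scale into demand"
    elif er < 0.05:
        verdict = "cold start operationally negligible"
    else:
        verdict = "borderline"
    print(f"{name}: ER={er:.2f} ({verdict})")
```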

A caveat: this analysis rests on Spheron’s single-hardware-configuration test running one model at one quantization level, plus NVIDIA’s self-reported documentation for the reduced-compilation pathway. No publicly available benchmark has yet tested across multiple GPU SKUs, driver versions, and production traffic patterns.

Compiled TensorRT-LLM’s Elasticity Ratio drops below 0.10 only when spikes last longer than 4.7 hours — 1,680 divided by 0.10 equals 16,800 seconds. Cross-reference the breakeven calculation: even after the spike exceeds 4.7 hours and the ER looks acceptable, the framework still needs 3.3 total hours of serving before it outproduces vLLM in cumulative tokens. The Elasticity Ratio tells you whether the framework can participate. The breakeven threshold tells you whether participation was worth it.

Now calculate what doing nothing costs. A team unable to auto-scale must over-provision. Assume peak demand requires 8 H100s, average demand requires 5, and cloud GPU rates run $2.50 per H100-hour — a mid-range figure across providers like RunPod, which offers on-demand H100 instances that scale up and down without long-term commitments.

Elastic scaling with vLLM costs approximately $120,000 per year — baseline compute for 5 GPUs plus burst capacity. Static provisioning with compiled TensorRT-LLM, unable to scale down, costs approximately $175,000 per year for 8 GPUs around the clock. Choosing the wrong framework based on throughput charts alone costs roughly $55,000 per year in idle GPU capacity — per cluster. That number only grows as the gap between the framework you benchmarked and the infrastructure you actually run widens.
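The annual figures follow directly from the stated assumptions. The elastic total quoted in the text adds roughly $10,000 of burst spend on top of the 5-GPU baseline computed here:

```python
# Annual cost sketch: static over-provisioning vs elastic baseline,
# under the stated assumptions (8 H100s at peak, 5 on average,
# $2.50 per H100-hour as a mid-range on-demand rate).

RATE_PER_H100_HR = 2.50
HOURS_PER_YEAR = 8_760

static_cost = 8 * RATE_PER_H100_HR * HOURS_PER_YEAR   # cannot scale down
elastic_baseline = 5 * RATE_PER_H100_HR * HOURS_PER_YEAR  # burst billed on top

print(round(static_cost))       # 175200 -> ~$175,000/yr
print(round(elastic_baseline))  # 109500 -> ~$120,000/yr with burst included
```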

Verdict: Match the Framework to the Infrastructure, Not the Benchmark

Calculate your Elasticity Ratio. If it exceeds 0.10, the fastest framework on the benchmark cannot serve your scaling pattern. Then check the breakeven: if your average scaling event lasts under 3.3 hours, the slower framework delivers more total tokens.

For teams running dedicated H100 clusters with fixed model versions and predictable traffic, TensorRT-LLM in compiled mode remains the right choice. The 15.8% throughput advantage at high concurrency is real, it compounds over months of serving, and the 28-minute compilation is a one-time cost that never recurs.

For teams running auto-scaled or spot-instance infrastructure, vLLM is the default recommendation. It starts in 62 seconds, delivers competitive throughput, supports the broadest range of hardware, and carries the largest community of production-tested integrations — 74,200 stars and 7,700 dependent projects.

For teams running mixed workloads with shared prefixes — chatbots, RAG pipelines, multi-turn agents — SGLang deserves evaluation alongside vLLM. Its RadixAttention and cache-aware routing deliver gains that standard single-prompt benchmarks cannot capture, and its 58-second cold start matches the best in this comparison. With 1.9x throughput on shared-prefix traffic and a 60-80% prefix-sharing rate in typical production, SGLang’s effective throughput on mixed workloads can exceed TensorRT-LLM’s compiled peak — no compilation required. SGLang is the framework most likely to be undervalued by teams that choose based on published inference benchmarks alone.

The Compilation Cliff will narrow as NVIDIA iterates. But framework competition is already shifting from raw throughput to deployment elasticity, multi-model routing, and prefix-aware scheduling — dimensions where SGLang and vLLM already lead. The teams that build around today’s throughput leaderboard risk optimizing for a deployment model that is already receding.
