Why Your Load Balancer Is Wasting Your GPUs

2026-03-07T00:00:00+00:00

Your load balancer is blind. It routes LLM requests based on server availability, completely ignoring the fact that GPU-1 already holds the context your request needs. The result: redundant computation, wasted money, and slower responses than necessary.

The Problem: KV Cache Thrashing

Modern LLM inference engines like vLLM and TensorRT-LLM use KV caching to avoid recomputing the attention keys and values for tokens they’ve already processed. When a user sends a follow-up question about a document, the engine can skip straight to processing the new question and generating the answer, as long as that document’s context is still in memory.

Here’s what actually happens with standard load balancing:

Request A loads a 4,000-token PDF into GPU-1. Request B, asking a question about that same PDF, arrives moments later. Your load balancer, following Least Connections or Round Robin, routes it to GPU-2. GPU-2 has no idea this document exists, so it recomputes the entire 4,000-token prefill from scratch.
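The scenario above can be sketched as a toy accounting of prefill work. This is an illustration, not a benchmark: the names `prefill_tokens`, `round_robin`, and `sticky` are hypothetical, every backend is assumed to have an unbounded cache, and only the document tokens are counted (question tokens and evictions are ignored).

```python
from itertools import cycle


def prefill_tokens(requests, pick_backend):
    """Total tokens prefilled for a stream of (doc_id, doc_tokens) requests."""
    cached = {}  # backend name -> set of doc_ids assumed to be in its KV cache
    total = 0
    for doc_id, doc_tokens in requests:
        backend = pick_backend(doc_id)
        docs = cached.setdefault(backend, set())
        if doc_id not in docs:  # cache miss: the whole document is prefilled again
            total += doc_tokens
            docs.add(doc_id)
    return total


def round_robin(backends):
    """Policy that ignores content and simply rotates through the backends."""
    rotation = cycle(backends)
    return lambda doc_id: next(rotation)


def sticky(backends):
    """Policy that sends a document back to the backend that first received it."""
    rotation = cycle(backends)
    assigned = {}  # doc_id -> backend chosen the first time the document was seen

    def pick(doc_id):
        if doc_id not in assigned:
            assigned[doc_id] = next(rotation)
        return assigned[doc_id]

    return pick


backends = ["gpu-1", "gpu-2"]
# Request A loads the 4,000-token PDF; request B asks about the same PDF.
requests = [("pdf", 4000), ("pdf", 4000)]

rr_total = prefill_tokens(requests, round_robin(backends))
sticky_total = prefill_tokens(requests, sticky(backends))
print(rr_total, sticky_total)  # 8000 4000
```

Under these assumptions, round robin prefills the document twice while the content-aware policy prefills it once.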

At scale, this pattern can waste 40-50% of your inference capacity. You’re paying for GPUs to redo work that’s already been done.

The Insight: Content-Aware Routing

What if your load balancer understood tokens, not just packets?

The fix is conceptually simple: route each request to the GPU that already holds the relevant KV cache, and skip the prefill for the shared prefix. Implementing it means solving a hard problem, though: you need sub-millisecond lookups across potentially millions of cached prefixes, with the index updated in real time as caches fill and evict.

This is the problem Ranvier solves.

Introducing Ranvier

Ranvier is a Layer 7+ load balancer purpose-built for LLM inference. Instead of treating requests as opaque HTTP payloads, it inspects the token sequence and routes each request to the backend most likely to have that prefix cached.

How it works: Ranvier maintains an Adaptive Radix Tree (ART) that maps token prefixes to backend GPUs. When a request arrives, Ranvier tokenizes the prompt, performs an O(L) prefix lookup (where L is the prefix length, not the total number of keys), and routes to the matching backend. When backends report cache evictions via a gossip protocol, Ranvier updates its routing table.
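To make the lookup concrete, here is a minimal sketch of longest-prefix routing over token IDs. It uses a plain dictionary-based trie in Python, not an Adaptive Radix Tree and not Ranvier’s C++ code, so it has none of ART’s node compression or adaptive node sizes. The class and method names (`PrefixRouter`, `record`, `lookup`, `evict`) are hypothetical. What it does share with the description above is the cost model: a lookup walks at most one node per token, so the work grows with the prefix length rather than with the number of stored prefixes.

```python
class PrefixRouter:
    """Plain token trie mapping cached prefixes to backends (a stand-in for an ART)."""

    def __init__(self):
        self.root = {"children": {}, "backend": None}

    def record(self, tokens, backend):
        """Remember that `backend` is believed to hold the KV cache for `tokens`."""
        node = self.root
        for tok in tokens:
            node = node["children"].setdefault(tok, {"children": {}, "backend": None})
        node["backend"] = backend

    def lookup(self, tokens):
        """Return (backend, matched_length) for the longest recorded prefix of `tokens`."""
        node = self.root
        best = (None, 0)
        for depth, tok in enumerate(tokens, start=1):
            node = node["children"].get(tok)
            if node is None:
                break  # no recorded prefix extends this far
            if node["backend"] is not None:
                best = (node["backend"], depth)
        return best

    def evict(self, tokens):
        """Forget a prefix, for example after an eviction report (empty nodes are left in place)."""
        node = self.root
        for tok in tokens:
            node = node["children"].get(tok)
            if node is None:
                return
        node["backend"] = None


router = PrefixRouter()
document = [101, 7, 7, 42, 9]            # token IDs of a shared document prefix
router.record(document, "gpu-1")

follow_up = document + [55, 56]          # same document plus a new question
hit = router.lookup(follow_up)           # ("gpu-1", 5)
miss = router.lookup([101, 8])           # (None, 0)

router.evict(document)
after_evict = router.lookup(follow_up)   # (None, 0)
```

When `lookup` returns no backend, a real router still needs a fallback policy such as least connections; that part is left out of the sketch.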

Why it’s fast: Built on C++20 and the Seastar framework—the same shared-nothing, thread-per-core architecture that powers ScyllaDB. No locks, no atomics, no cross-core coordination in the hot path. Total routing overhead is 6-8ms with server-side tokenization. For production deployments where clients send pre-tokenized requests, overhead drops to <1ms.

The name comes from the Nodes of Ranvier in neuroscience—the gaps in myelin sheath that allow nerve signals to “jump” between points (saltatory conduction), dramatically increasing signal speed. Ranvier does the same for inference: jumping past redundant computation to reach cached state.

The Results

We tested Ranvier on 8x A100 clusters running vLLM across multiple instances, comparing prefix-aware routing against standard round-robin:

Metric	Round Robin	Ranvier	Improvement
Cache Hit Rate	~12% (1-in-8 chance)	58-98%	5-8x
P99 Latency (13B, 30 users, 30 min)	4.7-6.8s	0.9-1.0s	79-85% lower
Throughput (13B, 30 users, 30 min)	36.3 req/s	44.4 req/s	+13-22%

Test setup: CodeLlama-13b, 10-30 concurrent users, runs of 10-30 minutes, stress-distribution prompts (70% large/xlarge prefixes). Results were validated across four Lambda Labs instances (Feb-March 2026). Per request, XLarge prompts (4K-8K tokens) saw 33-39% faster time-to-first-token (TTFT) on cache hits. Full methodology and raw data are in the benchmark guide.

Larger models benefit more. 70B models show the highest per-request improvement (44-49% faster XLarge TTFT on cache hit) because the KV cache saves more compute per token. 13B models show the strongest aggregate gains—P99 tail latency drops 60-85% and throughput increases 12-22%—because backend queuing under load amplifies the benefit of cache-aware routing. 8B models are routing-neutral: inference is fast enough (~420ms) that cache savings don’t affect aggregate latency, though per-request XLarge improvement remains strong at 28-40%.

Routing overhead is 6-8ms with server-side tokenization, <1ms with pre-tokenized requests. You’re trading milliseconds of routing for seconds of tail latency improvement.

Where Ranvier shines: RAG pipelines, multi-turn chat with system prompts, few-shot learning—any workload where requests share common prefixes of 2K+ tokens.

Landscape and Positioning

The inference ecosystem is moving fast. vLLM shipped their own router in December 2025. Red Hat’s llm-d takes a different approach—introspecting vLLM’s KV cache directly via live events. SGLang builds prefix caching into the engine itself via RadixAttention.

Ranvier takes a different path: engine-agnostic, external routing based on token-level prefix matching. It works with any OpenAI-compatible backend (vLLM, SGLang, TensorRT-LLM, Ollama) without requiring engine modifications or instrumentation hooks. The tradeoff is that Ranvier infers cache state from routing history rather than observing it directly, which means slightly lower cache hit precision in exchange for portability.
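To illustrate what inferring cache state from routing history can look like, here is one possible sketch. It is not Ranvier’s actual mechanism. It assumes the router knows a rough token budget for each backend’s KV cache and that backends evict least-recently-used prefixes first. Both are assumptions, and the names `InferredCache`, `note_routed`, and `probably_cached` are hypothetical.

```python
from collections import OrderedDict


class InferredCache:
    """Router-side guess at what one backend has cached, based only on what was sent to it."""

    def __init__(self, capacity_tokens):
        self.capacity_tokens = capacity_tokens  # assumed KV cache budget (hypothetical knob)
        self.prefixes = OrderedDict()           # prefix key -> token count, oldest first
        self.used = 0

    def note_routed(self, key, n_tokens):
        """Record that a request with this prefix was routed to the backend."""
        if key in self.prefixes:
            self.prefixes.move_to_end(key)      # recently used, so least likely to be evicted
            return
        self.prefixes[key] = n_tokens
        self.used += n_tokens
        while self.used > self.capacity_tokens:  # guess: the backend evicts oldest first
            _, evicted_tokens = self.prefixes.popitem(last=False)
            self.used -= evicted_tokens

    def probably_cached(self, key):
        """Best guess only; the backend may have evicted the prefix for other reasons."""
        return key in self.prefixes


gpu1 = InferredCache(capacity_tokens=10_000)
gpu1.note_routed("doc-a", 4_000)
gpu1.note_routed("doc-b", 4_000)
gpu1.note_routed("doc-a", 4_000)  # doc-a is touched again, so doc-b is now the oldest
gpu1.note_routed("doc-c", 4_000)  # over budget: the guess is that doc-b was evicted

guesses = {k: gpu1.probably_cached(k) for k in ("doc-a", "doc-b", "doc-c")}
```

The guess can drift from reality whenever the backend evicts in a different order or receives traffic the router never saw, which is the precision cost described above.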

We think there’s value in a routing layer that isn’t coupled to a specific inference engine, especially as the ecosystem fragments.

Full roadmap: VISION.md

Code and Docs

Ranvier is open source under Apache 2.0: github.com/Ranvier-Systems/ranvier-core

If you’re interested in the internals—why an Adaptive Radix Tree instead of a hash map, how the gossip protocol tracks distributed cache state, what we learned building on Seastar’s shared-nothing architecture—start with the system design doc.

Issues and feedback welcome—especially if you run it against a workload we haven’t tested.

Ranvier is a project of Minds Aspire, LLC.