What Happens When Your LLM Load Balancer Learns

A load balancer makes the same decision millions of times a day: which backend gets this request. Most load balancers make that decision the same way on request one million as they did on request one. Round-robin doesn’t learn, least-connections doesn’t learn, and even consistent hashing doesn’t learn. The mapping is fixed by an algorithm chosen before the first packet arrived.

But LLM traffic has structure that fixed algorithms can’t see. Conversations continue. Conversations branch. The same system prompt fans out into a thousand different user queries, and every one of them shares thousands of tokens of prefill with requests that came before. A load balancer that remembers where it sent those earlier requests can route the later ones to the GPU that already has the work cached.

This post is about what “remembering” actually looks like: how a routing table learns from traffic, and why one route per request isn’t enough. The second half is about the trade-offs that show up once your load balancer’s memory becomes a data structure you have to manage.

The Shape of Conversational Traffic

Consider a typical multi-turn chat request. By turn three, the request body contains the entire conversation so far:

[system: 256 tokens][user1: 50][assistant1: 100][user2: 40][assistant2: 80][user3: 30]

Two properties make this traffic special:

Conversations grow by appending. Turn N+1 contains turn N as an exact token prefix. The KV cache computed for turn N is reusable for turn N+1—if the request lands on the same GPU.

Conversations branch from shared prefixes. A hundred users hitting the same RAG application send a hundred different questions behind the same 256-token system prompt. Those requests share a prefix with each other even though no two are identical.

A learning router exploits both properties. After it forwards a request and sees a successful response, it records: this token sequence lives on that backend. The next request that starts with the same tokens routes to the same place, and the GPU skips the prefill it already did.

The question is: which token sequence do you record?

The Naive Answer Fails Quietly

The obvious approach is to record one route per request: hash or store the first N tokens (say, 128) and map them to the backend that served them.

This works for exact repetition and fails for everything else. Watch what happens with two requests that share a system prompt:

Request 1: [system: 256 tokens][user: "Summarize this contract"]
Request 2: [system: 256 tokens][user: "What's the termination clause?"]

If your route key is “the first 128 tokens,” both requests match. Great. But if your route key is the full request prefix, or a fixed length that straddles the user message, the different user queries produce different keys. Request 2 misses the route that Request 1 learned, falls back to hash routing, and the hash, computed over tokens that include the unique user query, sends it to a different backend. The system prompt’s KV cache sits warm on Backend 1 while Backend 2 recomputes it from scratch.

A fixed-length route key is always wrong for someone. Too short, and you can’t distinguish conversations that diverge after the system prompt. Too long, and you can’t match conversations that diverge before your key ends. The right boundary isn’t a length: it’s wherever the messages end, and that’s different for every request.

Learning at Every Boundary

The fix is to stop learning one route per request and start learning one route per message boundary. When a request succeeds, the router stores a route at every point where a message ends:

[system: 256][user1: 50][assistant1: 100][user2: 40]
            ↑          ↑                ↑          ↑
       route @256  route @306      route @406  route @446

                    (all → Backend 1)

One successful request produces four routes instead of one. Each route is a prefix of the next. Now look at what the router can match:

A branching conversation. Different user query, same system prompt:

[system: 256][user1': "different question"]
  → longest prefix match: route @256 → Backend 1 ✓

A continuing conversation. The next turn of the original thread:

[system: 256][user1: 50][assistant1: 100][user2: 40][assistant2: 80][user3: 30]
  → longest prefix match: route @446 → Backend 1 ✓

A conversation resumed from the middle. A client that truncated history:

[system: 256][user1: 50][assistant1: 100][new user turn]
  → longest prefix match: route @406 → Backend 1 ✓

Every one of these routes to the backend with the warm cache, and every one would have been a miss under single-route learning. A ten-turn conversation creates ten opportunities to match instead of one.

The data structure that makes this practical is a radix tree with longest-prefix-match semantics. The lookup doesn’t ask “is this exact sequence in the tree?” It asks “what is the deepest stored route that prefixes this request?” Storing routes at multiple depths costs nothing extra at lookup time: the traversal is O(L) in the prefix length either way, and it naturally passes through every stored boundary on its way down.

Where the Boundaries Come From

Learning at message boundaries requires knowing where the messages are, in token space rather than character space. The request JSON tells you where each message ends as a character offset; the routing tree needs token positions. Two strategies, tried in order:

Marker scan. Chat templates like llama3 and chatml inject a distinct token at the start of every message (<|start_header_id|>, <|im_start|>). Scan the token array for those markers and you have exact boundaries. O(n) in token count, precise, but only works when the template uses single-token markers.

Proportional estimation. For templates without scannable markers, map each message’s character offset to a token position using the overall token-per-character ratio. Accuracy is ±2-5 tokens. That sounds sloppy for a routing key, but it isn’t. The next section is why.

If both strategies fail, the router falls back to a fixed prefix length. Everything still works; you just get single-depth behavior instead of multi-depth.

Block Alignment Forgives Small Errors

vLLM’s PagedAttention manages the KV cache in fixed-size blocks, 16 tokens by default. Cache reuse happens at block granularity: two requests that agree on the first 240 tokens and differ at token 250 share exactly 15 blocks of cache, because tokens 240-255 form a partial block that can’t be reused.

So the router aligns every learned route down to a block boundary before storing it. A boundary detected at token 306 is stored as a route at token 304. This has two useful consequences:

Estimation error stops mattering. A boundary that’s off by ±5 tokens usually lands in the same 16-token block as the true boundary, and aligns to the same route. The routing key matches the granularity of the thing it’s actually predicting (cache blocks) rather than pretending to a precision the cache can’t use.

Nearby boundaries collapse. Boundaries at tokens 256, 260, and 262 all align to 256 and deduplicate into a single route. Short messages (a “yes,” an emoji, a one-line tool result) don’t bloat the tree with routes that could never produce distinct cache hits anyway.

The Cost: Your Routing Table Is Now a Cache Too

Multi-depth learning multiplies route count by the average messages per conversation. A tree sized for 100,000 routes holds 100,000 conversations under single-depth learning, but only ~10,000 ten-message conversations under multi-depth. The router’s memory is now itself a cache with an eviction policy, sitting in front of the GPU cache it’s trying to model.

Three mechanisms keep it bounded:

LRU eviction. Every route lives on an intrusive LRU list. When the tree hits its cap, the least-recently-matched route is evicted in O(1) via a tail pointer. Active conversations stay hot; abandoned ones age out. Note what this means for multi-depth: an idle conversation’s deepest routes expire first, while its system-prompt route survives because other traffic is still matching it. The tree automatically keeps the prefixes that are earning their memory.

Trust-ranked eviction. In a cluster, routes learned from your own traffic are tagged LOCAL; routes learned from peer gossip are tagged REMOTE. Under capacity pressure, REMOTE routes evict first. Your own observations are ground truth; a peer’s observations are secondhand and possibly stale.

TTL expiry. Routes older than an hour (configurable) expire regardless of position. A route’s claim—”this prefix is cached on that GPU”—decays with time, because the GPU’s own cache evicted it long ago.

The practical guidance: when you turn on multi-depth learning, scale your route capacity by your expected messages per conversation, or accept that eviction starts sooner. Watch the route-count metric. The failure mode is gentle (evicted routes mean hash fallback rather than errors), but it silently caps your cache hit rate.

What Learning Doesn’t Fix

A learned route is a prediction. The router believes the prefix is cached on that backend; the backend’s own memory pressure decides whether it still is. Three honest limitations:

The router can over-commit a backend. If 80% of traffic shares one system prompt, pure prefix affinity sends 80% of traffic to one GPU. A learning router needs a disobedience mechanism: when a backend’s in-flight count exceeds twice the median, divert to the least-loaded backend and eat the cache miss. In our benchmarks, that trade costs about 5 points of cache hit rate and buys a 45% improvement in P99 latency. Affinity is a preference the router has to be willing to break.

Learned state must propagate. In a cluster, a route learned on node A is useless to node B until gossip delivers it. Until then, B’s consistent-hash fallback sends same-prefix requests to a deterministic backend, so even unlearned traffic converges, just less cleverly.

The tree learns where traffic went, not where it should have gone. If the first request for a prefix lands on a bad backend (by hash), every subsequent request follows it there. Learning amplifies the initial placement, good or bad. Health checks and the load-pressure override are the correctives.

The Takeaway

The interesting shift isn’t the radix tree or the boundary detection. It’s that the load balancer has state that improves with traffic. Request one routes by hash. Request one thousand routes by accumulated knowledge of where every shared prefix in your workload already lives.

That only works because the router never fully trusts its own memory. The tree is capped, routes learned from gossip evict first, everything expires after an hour, and load pressure can override any match. Otherwise the tree fills with routes that outlive the caches they point to, and all that learning is just a memory leak.

We built this as part of Ranvier, a Layer 7 traffic controller for LLM inference. Earlier posts in this series covered why load balancers waste GPUs, what KV cache misses cost, and the tokenization bottleneck that feeds the router its tokens.

Ranvier is a project of Minds Aspire, LLC.