<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ranvier.systems/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ranvier.systems/" rel="alternate" type="text/html" /><updated>2026-06-10T22:24:18+00:00</updated><id>https://ranvier.systems/feed.xml</id><title type="html">Ranvier</title><subtitle>Intelligence Layer for LLM Inference</subtitle><author><name>Minds Aspire, LLC</name></author><entry><title type="html">What Happens When Your LLM Load Balancer Learns</title><link href="https://ranvier.systems/2026/06/10/what-happens-when-your-llm-load-balancer-learns.html" rel="alternate" type="text/html" title="What Happens When Your LLM Load Balancer Learns" /><published>2026-06-10T00:00:00+00:00</published><updated>2026-06-10T00:00:00+00:00</updated><id>https://ranvier.systems/2026/06/10/what-happens-when-your-llm-load-balancer-learns</id><content type="html" xml:base="https://ranvier.systems/2026/06/10/what-happens-when-your-llm-load-balancer-learns.html"><![CDATA[<p>A load balancer makes the same decision millions of times a day: which
backend gets this request. Most load balancers make that decision the same
way on request one million as they did on request one. Round-robin doesn’t
learn, least-connections doesn’t learn, and even consistent hashing doesn’t
learn. The mapping is fixed by an algorithm chosen before the first packet
arrived.</p>

<p>But LLM traffic has structure that fixed algorithms can’t see. Conversations
continue. Conversations branch. The same system prompt fans out into a
thousand different user queries, and every one of them shares thousands of
tokens of prefill with requests that came before. A load balancer that
remembers where it sent those earlier requests can route the later ones to
the GPU that already has the work cached.</p>

<p>This post is about what “remembering” actually looks like: how a routing
table learns from traffic, and why one route per request isn’t enough. The
second half is about the trade-offs that show up once your load balancer’s
memory becomes a data structure you have to manage.</p>

<h2 id="the-shape-of-conversational-traffic">The Shape of Conversational Traffic</h2>

<p>Consider a typical multi-turn chat request. By turn three, the request body
contains the entire conversation so far:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[system: 256 tokens][user1: 50][assistant1: 100][user2: 40][assistant2: 80][user3: 30]
</code></pre></div></div>

<p>Two properties make this traffic special:</p>

<p><strong>Conversations grow by appending.</strong> Turn N+1 contains turn N as an exact
token prefix. The KV cache computed for turn N is reusable for turn N+1—if
the request lands on the same GPU.</p>

<p><strong>Conversations branch from shared prefixes.</strong> A hundred users hitting the
same RAG application send a hundred different questions behind the same
256-token system prompt. Those requests share a prefix with each other even
though no two are identical.</p>

<p>A learning router exploits both properties. After it forwards a request and
sees a successful response, it records: <em>this token sequence lives on that
backend.</em> The next request that starts with the same tokens routes to the
same place, and the GPU skips the prefill it already did.</p>

<p>The question is: which token sequence do you record?</p>

<h2 id="the-naive-answer-fails-quietly">The Naive Answer Fails Quietly</h2>

<p>The obvious approach is to record one route per request: hash or store the
first N tokens (say, 128) and map them to the backend that served them.</p>

<p>This works for exact repetition and fails for everything else. Watch what
happens with two requests that share a system prompt:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Request 1: [system: 256 tokens][user: "Summarize this contract"]
Request 2: [system: 256 tokens][user: "What's the termination clause?"]
</code></pre></div></div>

<p>If your route key is “the first 128 tokens,” both requests match. Great. But
if your route key is the full request prefix, or a fixed length that
straddles the user message, the different user queries produce different
keys. Request 2 misses the route that Request 1 learned, falls back to hash
routing, and the hash, computed over tokens that include the unique user
query, sends it to a different backend. The system prompt’s KV cache sits
warm on Backend 1 while Backend 2 recomputes it from scratch.</p>

<p>A fixed-length route key is always wrong for someone. Too short, and you
can’t distinguish conversations that diverge after the system prompt. Too
long, and you can’t match conversations that diverge before your key ends.
The right boundary isn’t a length: it’s wherever the <em>messages</em> end, and
that’s different for every request.</p>

<h2 id="learning-at-every-boundary">Learning at Every Boundary</h2>

<p>The fix is to stop learning one route per request and start learning one
route per <em>message boundary</em>. When a request succeeds, the router stores a
route at every point where a message ends:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[system: 256][user1: 50][assistant1: 100][user2: 40]
            ↑          ↑                ↑          ↑
       route @256  route @306      route @406  route @446

                    (all → Backend 1)
</code></pre></div></div>

<p>One successful request produces four routes instead of one. Each route is a
prefix of the next. Now look at what the router can match:</p>

<p><strong>A branching conversation.</strong> Different user query, same system prompt:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[system: 256][user1': "different question"]
  → longest prefix match: route @256 → Backend 1 ✓
</code></pre></div></div>

<p><strong>A continuing conversation.</strong> The next turn of the original thread:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[system: 256][user1: 50][assistant1: 100][user2: 40][assistant2: 80][user3: 30]
  → longest prefix match: route @446 → Backend 1 ✓
</code></pre></div></div>

<p><strong>A conversation resumed from the middle.</strong> A client that truncated history:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[system: 256][user1: 50][assistant1: 100][new user turn]
  → longest prefix match: route @406 → Backend 1 ✓
</code></pre></div></div>

<p>Every one of these routes to the backend with the warm cache, and every one
would have been a miss under single-route learning. A ten-turn conversation
creates ten opportunities to match instead of one.</p>

<p>The data structure that makes this practical is a radix tree with
longest-prefix-match semantics. The lookup doesn’t ask “is this exact
sequence in the tree?” It asks “what is the deepest stored route that
prefixes this request?” Storing routes at multiple depths costs nothing
extra at lookup time: the traversal is O(L) in the prefix length either way,
and it naturally passes through every stored boundary on its way down.</p>
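<p>A minimal sketch of that lookup, under stated assumptions: it uses a plain per-token trie rather than a compressed radix tree, and the names <code class="language-plaintext highlighter-rouge">RouteTree</code>, <code class="language-plaintext highlighter-rouge">learn</code>, and <code class="language-plaintext highlighter-rouge">lookup</code> are hypothetical, not Ranvier’s API. It only illustrates how routes stored at several depths fall out of a single longest-prefix walk.</p>

```python
from typing import Dict, List, Optional


class Node:
    """One token step in the tree; `backend` is set only where a route is stored."""

    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}
        self.backend: Optional[str] = None


class RouteTree:
    """Illustrative per-token trie (a real radix tree would compress chains)."""

    def __init__(self) -> None:
        self.root = Node()

    def learn(self, tokens: List[int], boundaries: List[int], backend: str) -> None:
        """Store a route at every message boundary of a successful request."""
        wanted = set(boundaries)
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.setdefault(tok, Node())
            if depth in wanted:
                node.backend = backend

    def lookup(self, tokens: List[int]) -> Optional[str]:
        """Return the backend of the deepest stored route that prefixes `tokens`."""
        node = self.root
        best = None
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            if node.backend is not None:
                best = node.backend  # remember the deepest route seen so far
        return best
```

<p>The walk visits each request token at most once and records every stored boundary it passes, so a request that diverges after the system prompt still returns the system-prompt route.</p>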

<h2 id="where-the-boundaries-come-from">Where the Boundaries Come From</h2>

<p>Learning at message boundaries requires knowing where the messages are, in
<em>token</em> space rather than character space. The request JSON tells you where
each message ends as a character offset; the routing tree needs token
positions. Two strategies, tried in order:</p>

<p><strong>Marker scan.</strong> Chat templates like llama3 and chatml inject a distinct
token at the start of every message (<code class="language-plaintext highlighter-rouge">&lt;|start_header_id|&gt;</code>,
<code class="language-plaintext highlighter-rouge">&lt;|im_start|&gt;</code>). Scan the token array for those markers and you have exact
boundaries. O(n) in token count, precise, but only works when the template
uses single-token markers.</p>

<p><strong>Proportional estimation.</strong> For templates without scannable markers, map
each message’s character offset to a token position using the overall
token-per-character ratio. Accuracy is ±2-5 tokens. That sounds sloppy for a
routing key, but it isn’t. The next section is why.</p>

<p>If both strategies fail, the router falls back to a fixed prefix length.
Everything still works; you just get single-depth behavior instead of
multi-depth.</p>
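<p>The two strategies and the fallback can be sketched roughly as follows. This is an illustration, not the shipping code: the function names are hypothetical, marker token IDs are assumed to be known in advance, and the 128-token fixed prefix is borrowed from the earlier example rather than from any documented default.</p>

```python
from typing import List, Set


def boundaries_from_markers(tokens: List[int], marker_ids: Set[int]) -> List[int]:
    """Exact boundaries: a message ends where the next message's marker begins."""
    starts = [i for i, tok in enumerate(tokens) if tok in marker_ids]
    if not starts:
        return []  # template has no scannable single-token markers
    # Every marker after the first closes the previous message;
    # the final message ends with the sequence itself.
    return starts[1:] + [len(tokens)]


def boundaries_from_ratio(char_ends: List[int], total_chars: int,
                          total_tokens: int) -> List[int]:
    """Approximate boundaries: scale character offsets by tokens per character."""
    if total_chars == 0:
        return []
    ratio = total_tokens / total_chars
    return [min(total_tokens, round(end * ratio)) for end in char_ends]


def detect_boundaries(tokens: List[int], marker_ids: Set[int],
                      char_ends: List[int], total_chars: int,
                      fixed_prefix: int = 128) -> List[int]:
    """Try the marker scan, then estimation, then a single fixed-length prefix."""
    found = boundaries_from_markers(tokens, marker_ids)
    if found:
        return found
    found = boundaries_from_ratio(char_ends, total_chars, len(tokens))
    if found:
        return found
    return [min(fixed_prefix, len(tokens))]
```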

<h2 id="block-alignment-forgives-small-errors">Block Alignment Forgives Small Errors</h2>

<p>vLLM’s PagedAttention manages the KV cache in fixed-size blocks, 16 tokens by
default. Cache reuse happens at block granularity: two requests that agree on
tokens 0-249 and first differ at token 250 share exactly 15 blocks (240
tokens) of cache, because the block covering tokens 240-255 contains the
divergence and can’t be reused.</p>

<p>So the router aligns every learned route down to a block boundary before
storing it. A boundary detected at token 306 is stored as a route at
token 304. This has two useful consequences:</p>

<p><strong>Estimation error stops mattering.</strong> A boundary that’s off by ±5 tokens
usually lands in the same 16-token block as the true boundary, and aligns to
the same route. The routing key matches the granularity of the thing it’s
actually predicting (cache blocks) rather than pretending to a precision the
cache can’t use.</p>

<p><strong>Nearby boundaries collapse.</strong> Boundaries at tokens 256, 260, and 262 all
align to 256 and deduplicate into a single route. Short messages (a “yes,”
an emoji, a one-line tool result) don’t bloat the tree with routes that
could never produce distinct cache hits anyway.</p>
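<p>The alignment arithmetic is small enough to write out. A sketch, assuming the 16-token default block size; the helper names are made up for illustration:</p>

```python
from typing import List

BLOCK_SIZE = 16  # assumed: the vLLM default block size mentioned above


def align_down(position: int, block_size: int = BLOCK_SIZE) -> int:
    """Round a token position down to the nearest block edge."""
    return position - (position % block_size)


def aligned_routes(boundaries: List[int], block_size: int = BLOCK_SIZE) -> List[int]:
    """Align each boundary down to a block edge, then drop duplicates and zero."""
    aligned = {align_down(b, block_size) for b in boundaries}
    aligned.discard(0)  # a zero-length prefix carries no cache to reuse
    return sorted(aligned)


def shared_blocks(first_divergence: int, block_size: int = BLOCK_SIZE) -> int:
    """Whole blocks reusable when two requests first differ at this token index."""
    return first_divergence // block_size


print(align_down(306))                  # 304
print(aligned_routes([256, 260, 262]))  # [256]
print(shared_blocks(250))               # 15
```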

<h2 id="the-cost-your-routing-table-is-now-a-cache-too">The Cost: Your Routing Table Is Now a Cache Too</h2>

<p>Multi-depth learning multiplies route count by the average messages per
conversation. A tree sized for 100,000 routes holds 100,000 conversations
under single-depth learning, but only ~10,000 ten-message conversations under
multi-depth. The router’s memory is now itself a cache with an eviction
policy, sitting in front of the GPU cache it’s trying to model.</p>

<p>Three mechanisms keep it bounded:</p>

<p><strong>LRU eviction.</strong> Every route lives on an intrusive LRU list. When the tree
hits its cap, the least-recently-matched route is evicted in O(1) via a tail
pointer. Active conversations stay hot; abandoned ones age out. Note what
this means for multi-depth: an idle conversation’s deepest routes expire
first, while its system-prompt route survives because other traffic is still
matching it. The tree automatically keeps the prefixes that are earning
their memory.</p>

<p><strong>Trust-ranked eviction.</strong> In a cluster, routes learned from your own
traffic are tagged LOCAL; routes learned from peer gossip are tagged REMOTE.
Under capacity pressure, REMOTE routes evict first. Your own observations
are ground truth; a peer’s observations are secondhand and possibly stale.</p>

<p><strong>TTL expiry.</strong> Routes older than an hour (configurable) expire regardless
of their LRU position. A route’s claim (“this prefix is cached on that GPU”)
decays with time, because after long enough the GPU’s own cache has most
likely evicted the prefix.</p>
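<p>Here is one way the three mechanisms could compose, as a sketch under assumptions rather than a description of the real tree. It keys routes by exact tuple instead of by prefix, it scans for a REMOTE victim in O(n) where an intrusive list would keep eviction O(1), and <code class="language-plaintext highlighter-rouge">RouteTable</code> and its methods are hypothetical names.</p>

```python
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple

LOCAL, REMOTE = "LOCAL", "REMOTE"


class RouteTable:
    def __init__(self, capacity: int, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock
        # key -> (backend, origin, learned_at); least recently matched comes first
        self.routes: "OrderedDict[Tuple[int, ...], Tuple[str, str, float]]" = OrderedDict()

    def learn(self, key: Tuple[int, ...], backend: str, origin: str = LOCAL) -> None:
        self.routes[key] = (backend, origin, self.clock())
        self.routes.move_to_end(key)
        while len(self.routes) > self.capacity:
            self._evict_one()

    def match(self, key: Tuple[int, ...]) -> Optional[str]:
        entry = self.routes.get(key)
        if entry is None:
            return None
        backend, _origin, learned_at = entry
        if self.clock() - learned_at > self.ttl:
            del self.routes[key]  # expired regardless of LRU position
            return None
        self.routes.move_to_end(key)  # a match refreshes recency
        return backend

    def _evict_one(self) -> None:
        # Prefer the least recently matched REMOTE route; otherwise plain LRU.
        for key, (_backend, origin, _learned_at) in self.routes.items():
            if origin == REMOTE:
                del self.routes[key]
                return
        self.routes.popitem(last=False)
```

<p>Injecting the clock keeps the TTL path testable without sleeping for an hour.</p>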

<p>The practical guidance: when you turn on multi-depth learning, scale your
route capacity by your expected messages per conversation, or accept that
eviction starts sooner. Watch the route-count metric. The failure mode is
gentle (evicted routes mean hash fallback rather than errors), but it
silently caps your cache hit rate.</p>

<h2 id="what-learning-doesnt-fix">What Learning Doesn’t Fix</h2>

<p>A learned route is a prediction. The router believes the prefix is cached on
that backend; the backend’s own memory pressure decides whether it still is.
Three honest limitations:</p>

<p><strong>The router can over-commit a backend.</strong> If 80% of traffic shares one
system prompt, pure prefix affinity sends 80% of traffic to one GPU. A
learning router needs a disobedience mechanism: when a backend’s in-flight
count exceeds twice the median, divert to the least-loaded backend and eat
the cache miss. In our benchmarks, that trade costs about 5 points of cache
hit rate and buys a 45% improvement in P99 latency. Affinity is a
preference the router has to be willing to break.</p>
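<p>The override rule fits in a few lines. A sketch, assuming the router tracks an in-flight count per backend; <code class="language-plaintext highlighter-rouge">choose_backend</code> and <code class="language-plaintext highlighter-rouge">overload_factor</code> are illustrative names, with the factor of two taken from the rule above:</p>

```python
from statistics import median
from typing import Dict


def choose_backend(preferred: str, in_flight: Dict[str, int],
                   overload_factor: float = 2.0) -> str:
    """Follow the learned route unless its backend is overloaded."""
    if in_flight[preferred] > overload_factor * median(in_flight.values()):
        # Divert to the least-loaded backend and accept the cache miss.
        return min(in_flight, key=in_flight.get)
    return preferred


print(choose_backend("a", {"a": 10, "b": 3, "c": 4}))  # b
print(choose_backend("a", {"a": 8, "b": 3, "c": 4}))   # a
```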

<p><strong>Learned state must propagate.</strong> In a cluster, a route learned on node A is
useless to node B until gossip delivers it. Until then, B’s consistent-hash
fallback sends same-prefix requests to a deterministic backend, so even
unlearned traffic converges, just less cleverly.</p>

<p><strong>The tree learns where traffic went, not where it should have gone.</strong> If
the first request for a prefix lands on a bad backend (by hash), every
subsequent request follows it there. Learning amplifies the initial
placement, good or bad. Health checks and the load-pressure override are
the correctives.</p>

<h2 id="the-takeaway">The Takeaway</h2>

<p>The interesting shift isn’t the radix tree or the boundary detection. It’s
that the load balancer has <em>state that improves with traffic</em>. Request one
routes by hash. Request one thousand routes by accumulated knowledge of
where every shared prefix in your workload already lives.</p>

<p>That only works because the router never fully trusts its own memory. The
tree is capped, routes learned from gossip evict first, everything expires
after an hour, and load pressure can override any match. Otherwise the tree
fills with routes that outlive the caches they point to, and all that
learning is just a memory leak.</p>

<p>We built this as part of <a href="https://github.com/Ranvier-Systems/ranvier-core">Ranvier</a>,
a Layer 7 traffic controller for LLM inference. Earlier posts in this series
covered <a href="https://ranvier.systems/2026/03/16/why-your-load-balancer-is-wasting-your-gpus.html">why load balancers waste GPUs</a>,
<a href="https://ranvier.systems/2026/04/30/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost.html">what KV cache misses cost</a>,
and <a href="https://ranvier.systems/2026/05/25/tokenization-is-the-bottleneck-youre-not-measuring.html">the tokenization bottleneck</a>
that feeds the router its tokens.</p>

<hr />

<p><em>Ranvier is a project of Minds Aspire, LLC.</em></p>]]></content><author><name>Minds Aspire</name></author><summary type="html"><![CDATA[A load balancer makes the same decision millions of times a day: which backend gets this request. Most load balancers make that decision the same way on request one million as they did on request one. Round-robin doesn’t learn, least-connections doesn’t learn, and even consistent hashing doesn’t learn. The mapping is fixed by an algorithm chosen before the first packet arrived.]]></summary></entry><entry><title type="html">Tokenization Is the Bottleneck You’re Not Measuring</title><link href="https://ranvier.systems/2026/05/25/tokenization-is-the-bottleneck-youre-not-measuring.html" rel="alternate" type="text/html" title="Tokenization Is the Bottleneck You’re Not Measuring" /><published>2026-05-25T00:00:00+00:00</published><updated>2026-05-25T00:00:00+00:00</updated><id>https://ranvier.systems/2026/05/25/tokenization-is-the-bottleneck-youre-not-measuring</id><content type="html" xml:base="https://ranvier.systems/2026/05/25/tokenization-is-the-bottleneck-youre-not-measuring.html"><![CDATA[<p>You’ve optimized your GPU serving stack. You’ve tuned vLLM’s batch size,
configured PagedAttention, maybe even set up prefix-aware routing for KV cache
locality. Your P99 looks good. Your throughput is climbing. And somewhere in
your proxy layer, every single request is blocking for 5-13 milliseconds while
a tokenizer turns text into integers.</p>

<p>You’re probably not measuring it. Most LLM proxies treat tokenization as
instantaneous—call the function, get the tokens, move on. But on an
event-loop architecture, 5-13ms isn’t a rounding error. It’s an eternity.
Every millisecond your event loop spends inside a tokenizer FFI call is a
millisecond where no other request is read, no response is forwarded, no
health check is answered, no connection is accepted.</p>

<p>This post is about a bottleneck hiding in the gap between “fast enough” and
“actually non-blocking.”</p>

<h2 id="why-tokenization-blocks">Why Tokenization Blocks</h2>

<p>If you’re doing prefix-aware routing, request rewriting, cost estimation, or
priority classification, your proxy needs to tokenize the input before
forwarding it. That means calling a tokenizer, usually HuggingFace’s
<code class="language-plaintext highlighter-rouge">tokenizers</code> library, the same BPE implementation used by most serving
engines.</p>

<p>The problem is that tokenization is CPU-bound work executed through an FFI
boundary. The Rust <code class="language-plaintext highlighter-rouge">tokenizers</code> crate does the actual BPE encoding. Your
proxy calls it through a C binding. The call takes 5-13ms depending on input
length. During that call, your thread is gone.</p>

<p>In a thread-per-request architecture (Go, Java, threaded Python), this is
fine. One thread blocks; the others keep working. In an event-loop
architecture—Node.js, Seastar, anything built on epoll/io_uring with
cooperative scheduling—it’s a disaster. The event loop processes everything
sequentially. While it’s inside the tokenizer, it processes nothing else.</p>

<p>Let’s make this concrete. You have an event loop handling 1,000 requests per
second. Each tokenization call takes 10ms. If you tokenize synchronously on
the event loop, you can process at most 100 tokenizations per second on that
core. Your other 900 requests are queued, their latency inflating by 10ms for
each request ahead of them in line.</p>

<p>At 20 concurrent users, we measured tokenization accounting for <strong>10.6ms</strong> of
total routing overhead, while the actual routing decision (a radix tree
lookup) took <strong>0.01ms</strong>. The tokenizer was roughly 1,000x slower than the
thing it was feeding.</p>

<h2 id="the-caching-layer-that-actually-works">The Caching Layer That Actually Works</h2>

<p>The first optimization is the most obvious: don’t tokenize the same text
twice.</p>

<p>LLM traffic has a property that makes caching extraordinarily effective:
repetition. Every request to a RAG application includes the same system
prompt. Every multi-turn conversation starts with the same instruction
prefix. Every API call from the same client sends the same role tags
(<code class="language-plaintext highlighter-rouge">&lt;|system|&gt;\n</code>, <code class="language-plaintext highlighter-rouge">&lt;|user|&gt;\n</code>).</p>

<p>We added an LRU cache in front of the tokenizer. Hit rates depend entirely on
content type, and the spread is dramatic. Here’s what we expect:</p>

<table>
  <thead>
    <tr>
      <th>Content type</th>
      <th>Cache hit rate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Role tags (<code class="language-plaintext highlighter-rouge">&lt;|system|&gt;\n</code>)</td>
      <td>95%+</td>
    </tr>
    <tr>
      <td>System messages</td>
      <td>80-90%</td>
    </tr>
    <tr>
      <td>User queries</td>
      <td>10-30%</td>
    </tr>
  </tbody>
</table>

<p>That 80-90% hit rate on system messages means that for most requests, the
expensive part—tokenizing the 2,000-4,000 token system prompt—is a hash
table lookup returning in microseconds instead of a 10ms FFI call.</p>

<p>The implementation is straightforward: a hash map keyed on the input text,
with an LRU eviction list capped at a configured maximum (we use 1,000
entries). On hit, move the entry to the front. On miss, tokenize, insert at
the front, evict from the tail if full. No locks needed: in a sharded
architecture, each core has its own cache.</p>

<p>Two details matter:</p>

<p><strong>Cap cached text—but don’t cap it too low.</strong> Our first instinct was to cap at
8KB: surely a long RAG document won’t repeat verbatim often enough to earn its
memory. That was a mistake. The long, stable system prefixes we most wanted to
cache routinely exceed 8KB, and refusing them reintroduced a 5-7ms P50
regression at 20+ concurrent users, exactly the cost we were trying to delete.
We raised the cap to 64KB. Worst case is 1,000 entries × 64KB ≈ 64MB per
shard, which is cheap insurance. The cache key is the full input text (the
tokens themselves are a small vector of int32s), so the cap is really about
bounding key memory. And the texts most worth caching are precisely the long
ones. (A separate, tighter 32KB limit applies to the cross-core dispatch path,
because that one copies the string across a core boundary, where large copies
aren’t free.)</p>

<p><strong>Don’t special-case unique content.</strong> User queries have a 10-30% hit rate;
most are unique. The cache handles this naturally through LRU eviction: unique
queries enter the cache, never get hit, and fall off the tail. The system
prompt stays hot at the front.</p>
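<p>A sketch of the cache described above, using the 1,000-entry and 64KB figures from the text as defaults. <code class="language-plaintext highlighter-rouge">TokenCache</code> is a hypothetical name, and the <code class="language-plaintext highlighter-rouge">tokenize</code> callable stands in for whatever tokenizer binding you use. Note that Python’s <code class="language-plaintext highlighter-rouge">OrderedDict</code> keeps the most recent entry at the end, so “front” and “tail” are mirrored relative to the prose.</p>

```python
from collections import OrderedDict
from typing import Callable, List


class TokenCache:
    """Per-shard LRU cache keyed on input text; no locks, one instance per core."""

    def __init__(self, tokenize: Callable[[str], List[int]],
                 max_entries: int = 1000, max_text_bytes: int = 64 * 1024) -> None:
        self.tokenize = tokenize
        self.max_entries = max_entries
        self.max_text_bytes = max_text_bytes
        self.entries: "OrderedDict[str, List[int]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> List[int]:
        cached = self.entries.get(text)
        if cached is not None:
            self.hits += 1
            self.entries.move_to_end(text)  # mark as most recently used
            return cached
        self.misses += 1
        tokens = self.tokenize(text)
        if len(text.encode("utf-8")) > self.max_text_bytes:
            return tokens  # too large to store; the cap bounds key memory
        self.entries[text] = tokens
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently used
        return tokens
```

<p>Tracking <code class="language-plaintext highlighter-rouge">hits</code> and <code class="language-plaintext highlighter-rouge">misses</code> in the cache itself makes the hit-rate metric free to export.</p>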

<h2 id="when-caching-isnt-enough">When Caching Isn’t Enough</h2>

<p>A 90% cache hit rate sounds great until you think about what happens on the
other 10%. At 1,000 requests per second, 10% misses means 100 tokenizer calls
per second, each blocking for 10ms. Your event loop can handle exactly 100 of
those per second. You’re at capacity with zero headroom. And that’s assuming
uniform arrival, which real traffic never is.</p>

<p>A burst of 20 cache misses in a row blocks your event loop for 200ms. Every
request that arrives during those 200ms—including the ones that would have
been cache hits—waits.</p>

<p>You need a way to tokenize without blocking the event loop.</p>

<h3 id="option-1-thread-pool-offload">Option 1: Thread Pool Offload</h3>

<p>The most direct solution: move the FFI call to a dedicated worker thread.
The event loop submits a job, gets a future back immediately, and continues
processing other requests. When the worker thread finishes tokenizing, it
signals the event loop to resume the request.</p>

<p>The implementation needs care:</p>

<p><strong>One thread per core.</strong> The HuggingFace tokenizer isn’t thread-safe for
concurrent calls on the same instance, so you need one tokenizer instance per
worker thread. In a sharded architecture, that means one worker thread per
shard: no contention, no locks on the tokenizer itself.</p>

<p><strong>Lock-free job queue.</strong> The event loop (producer) and worker thread
(consumer) communicate through a bounded SPSC (single-producer,
single-consumer) queue. No mutexes on the hot path. When the queue is full,
the event loop falls back to other strategies rather than blocking.</p>

<p><strong>Memory isolation across thread boundaries.</strong> This is the subtle one. If your
event loop uses a per-core memory allocator (Seastar does), memory allocated
on one thread is owned by that thread’s allocator. Handing heap-allocated
objects from the event loop thread to the worker thread and back risks
corrupting allocator metadata when the other side frees or resizes them. The
input string must be reallocated on the worker thread before calling the
tokenizer. The output tokens must be reallocated on the event loop thread when
the result returns. Two copies that feel wasteful but prevent silent memory
corruption.</p>

<p>The overhead is ~50-200μs per call—negligible compared to the 5-13ms it
keeps off the event loop.</p>

<h3 id="option-2-cross-core-dispatch">Option 2: Cross-Core Dispatch</h3>

<p>If you’re running a multi-core architecture with per-core sharding, you can
hand the tokenization to a different core. Instead of tokenizing locally
(blocking this core’s event loop), dispatch it elsewhere.</p>

<p>This doesn’t eliminate the blocking: the target core’s event loop still blocks
for 5-13ms. But it moves the blocking away from the core that’s serving the
request, keeping the request-handling core responsive.</p>

<p>The selection algorithm is a knob, and how far you turn it depends on how
often this path fires. Our shipping implementation keeps it as simple as
possible: rotate to the next core (round-robin). It costs nothing to compute,
and because cross-core dispatch is only a fallback (it fires when the thread
pool is saturated, which is rare), even distribution is good enough and load
skew rarely has time to matter. Round-robin spreads cost evenly; it does not
look at which cores are actually busy.</p>

<p>If your dispatch path fires often enough that even spreading isn’t good
enough, the next step up is <strong>Power-of-Two-Choices (P2C)</strong>: sample two cores
at random, pick the less loaded one. It’s O(1), avoids thundering herd, and
produces near-optimal distribution. We run a P2C balancer elsewhere in the
system, and the cross-core tokenization path is wired to adopt it, but today
the selector still ignores load and just rotates. The rule of thumb: match the
selector’s sophistication to the frequency of the path. Don’t pay for P2C on a
path that fires once in a thousand requests.</p>
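<p>For comparison, here are both selectors side by side. This is a sketch: the class and function names are invented for illustration, it assumes at least two cores, and <code class="language-plaintext highlighter-rouge">load</code> is a hypothetical callback returning some per-core load figure such as queue depth.</p>

```python
import random
from typing import Callable, List, Optional


class RoundRobin:
    """Rotate through the other cores; ignores load entirely."""

    def __init__(self, num_cores: int, this_core: int) -> None:
        self.others = [c for c in range(num_cores) if c != this_core]
        self.next_index = 0

    def pick(self) -> int:
        core = self.others[self.next_index]
        self.next_index = (self.next_index + 1) % len(self.others)
        return core


def pick_p2c(cores: List[int], load: Callable[[int], int],
             rng: Optional[random.Random] = None) -> int:
    """Power-of-Two-Choices: sample two distinct cores, keep the less loaded."""
    chooser = rng if rng is not None else random.Random()
    first, second = chooser.sample(cores, 2)
    return first if load(second) >= load(first) else second
```

<p>P2C never selects the single most loaded core, because that core always loses the pairwise comparison. Round-robin offers no such guarantee, which is the trade-off described above.</p>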

<p>The cross-core dispatch has its own memory safety requirement: the input text
must be copied into an owned string before crossing the core boundary, and the
output tokens must be copied again when returning. Same principle as the
thread pool—allocator domains don’t mix.</p>

<h3 id="option-3-both">Option 3: Both</h3>

<p>The strategies compose. On a cache miss:</p>

<ol>
  <li><strong>Try the thread pool</strong> — if the SPSC queue has space, submit and
continue. Event loop never blocks. Best case.</li>
  <li><strong>Try cross-core dispatch</strong> — if the thread pool is full, hand the work to
another core (we rotate round-robin). Calling core stays unblocked. Target
core blocks briefly.</li>
  <li><strong>Local fallback</strong> — if everything else fails, tokenize on the local event
loop, gated by a semaphore that limits concurrent blocking tokenizations to
one per core. This caps the worst case: at most one 5-13ms stall at a time,
rather than unbounded stacking.</li>
</ol>

<p>In practice, the cache handles 80-90% of requests. The thread pool handles
most of the rest. Cross-core dispatch and local fallback are rarely needed but
prevent pathological behavior under burst traffic.</p>
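<p>The decision order in that list can be sketched as below. It models only which path is chosen, not the tokenization itself. Several pieces are assumptions: <code class="language-plaintext highlighter-rouge">queue.Queue</code> is a locking stand-in for the lock-free SPSC queue, the 32KB dispatch limit is treated as the reason cross-core dispatch can be unavailable, and a request that finds the semaphore taken skips tokenization instead of waiting.</p>

```python
import queue
import threading
from enum import Enum


class Path(Enum):
    THREAD_POOL = "thread_pool"
    CROSS_CORE = "cross_core"
    LOCAL = "local"
    SKIPPED = "skipped"


class MissHandler:
    """Hypothetical per-core handler that picks a strategy for one cache miss."""

    def __init__(self, queue_size: int, dispatch_limit_bytes: int = 32 * 1024) -> None:
        # Bounded job queue feeding this core's worker thread (queue_size of 1 or more).
        self.jobs: "queue.Queue[str]" = queue.Queue(maxsize=queue_size)
        self.dispatch_limit_bytes = dispatch_limit_bytes
        self.local_permit = threading.Semaphore(1)  # one blocking call per core

    def choose(self, text: str) -> Path:
        try:
            self.jobs.put_nowait(text)  # 1. worker thread; event loop never blocks
            return Path.THREAD_POOL
        except queue.Full:
            pass
        if self.dispatch_limit_bytes >= len(text.encode("utf-8")):
            return Path.CROSS_CORE  # 2. another core absorbs the stall
        if self.local_permit.acquire(blocking=False):
            return Path.LOCAL  # 3. bounded local stall; call local_done() after
        return Path.SKIPPED  # route without tokens (hash fallback)

    def local_done(self) -> None:
        """Release the permit once the local tokenization has finished."""
        self.local_permit.release()
```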

<h2 id="the-semaphore-matters-more-than-you-think">The Semaphore Matters More Than You Think</h2>

<p>That local fallback semaphore deserves a closer look. Without it, a burst of
cache misses can compound: five concurrent misses on the same core means five
sequential tokenizations, 25-65ms of total blocking at 5-13ms each. Every
other request on that core is frozen for the duration.</p>

<p>A semaphore with one permit ensures at most one blocking tokenization at a
time. Any further concurrent misses either wait for the semaphore (adding
latency to those specific requests) or bail out and route without tokens
(falling back to hash-based routing instead of prefix-aware routing).</p>

<p>The choice between “wait for the semaphore” and “bail out” depends on your
architecture. If tokenization is required for correctness (e.g., token
counting for billing), wait. If it’s an optimization (e.g., prefix routing),
bail out. A request routed by hash instead of prefix is slightly less optimal
but doesn’t block the event loop.</p>

<h2 id="measuring-the-problem-in-your-stack">Measuring the Problem in Your Stack</h2>

<p>Most LLM proxies don’t instrument tokenization latency. Here’s what to look
for:</p>

<p><strong>Histogram the tokenization call.</strong> Wrap your tokenizer call with a timer.
Bucket at 100μs, 500μs, 1ms, 5ms, 10ms, 50ms, 100ms. If you see significant
mass above 5ms, you have a blocking problem.</p>

<p><strong>Track cache hit rate.</strong> If you have a cache, measure it. Anything above 70%
means caching is working. Below 50% means your traffic patterns don’t repeat
enough for caching to help, and you need the thread pool or dispatch
strategies.</p>

<p><strong>Correlate with tail latency.</strong> If your P99 latency spikes correlate with
low cache hit periods, tokenization blocking is likely the cause. The
correlation is indirect—the blocking doesn’t slow the tokenized request
much, but it freezes every <em>other</em> request on that core.</p>

<p><strong>Monitor event loop stalls.</strong> If your framework reports reactor stalls or
event loop delays (Seastar does, Node.js has <code class="language-plaintext highlighter-rouge">monitorEventLoopDelay</code>), check
whether they correlate with tokenization activity. A 10ms stall that appears
under load and disappears when you reduce traffic is a classic sign of a
synchronous call that shouldn’t be synchronous.</p>
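<p>The histogram suggested above takes very little code. A sketch using the bucket bounds from the text; <code class="language-plaintext highlighter-rouge">LatencyHistogram</code> and <code class="language-plaintext highlighter-rouge">timed</code> are illustrative names, and in production you would more likely feed an existing metrics library than roll your own.</p>

```python
import bisect
import time
from typing import Any, Callable

# Upper bucket bounds in seconds: 100us, 500us, 1ms, 5ms, 10ms, 50ms, 100ms.
BOUNDS = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]


class LatencyHistogram:
    def __init__(self) -> None:
        self.counts = [0] * (len(BOUNDS) + 1)  # final slot is the overflow bucket

    def observe(self, seconds: float) -> None:
        """Count the sample in the first bucket whose bound is not below it."""
        self.counts[bisect.bisect_left(BOUNDS, seconds)] += 1

    def fraction_above(self, seconds: float) -> float:
        """Share of samples that landed in buckets entirely above `seconds`."""
        total = sum(self.counts)
        if total == 0:
            return 0.0
        start = bisect.bisect_left(BOUNDS, seconds) + 1
        return sum(self.counts[start:]) / total


def timed(hist: LatencyHistogram, fn: Callable[..., Any], *args: Any) -> Any:
    """Run `fn(*args)` and record its wall-clock duration in `hist`."""
    start = time.perf_counter()
    try:
        return fn(*args)
    finally:
        hist.observe(time.perf_counter() - start)
```

<p>Wrap the tokenizer call as <code class="language-plaintext highlighter-rouge">timed(hist, tokenize, text)</code> and watch <code class="language-plaintext highlighter-rouge">hist.fraction_above(0.005)</code>: that is the mass above 5ms.</p>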

<h2 id="when-you-dont-need-any-of-this">When You Don’t Need Any of This</h2>

<p>If your proxy doesn’t tokenize, none of this applies. Many LLM proxies are
pure HTTP forwarders—they pass the request body through without parsing it.
No tokenization, no bottleneck.</p>

<p>You need to tokenize if you’re doing any of:</p>
<ul>
  <li><strong>Prefix-aware routing</strong> (matching token sequences for KV cache locality)</li>
  <li><strong>Token counting</strong> (enforcing context window limits before hitting the
backend)</li>
  <li><strong>Cost estimation</strong> (pricing based on input token count)</li>
  <li><strong>Request priority classification</strong> (longer inputs get different priority)</li>
  <li><strong>Request rewriting</strong> (injecting or modifying tokens before forwarding)</li>
</ul>

<p>If you’re doing these in a thread-per-request architecture (Go, Java), the
blocking is absorbed by the thread model. The problem is specific to
event-loop architectures, and it scales with the number of requests that miss
the cache.</p>

<h2 id="the-takeaway">The Takeaway</h2>

<p>Tokenization is a 5-13ms synchronous operation hiding inside systems that
assume everything is asynchronous. At low traffic, it’s invisible. At high
traffic, it’s the bottleneck. The tokenizer itself isn’t especially slow; the
damage is that it freezes every other request on the core while it runs.</p>

<p>The fix is layered defense: cache the common cases (80-90% of requests), 
offload the rest to worker threads (event loop never blocks), and fall back to
cross-core dispatch when the thread pool is full. Each layer handles a
different failure mode. Together, they keep a 5-13ms FFI call from becoming a
200ms tail latency event.</p>

<p>If you’re building an LLM proxy that does anything with tokens before
forwarding, wrap your tokenizer call in a histogram before you reach for any
of these fixes. You can’t budget for the mass above 5ms until you can see it.</p>

<hr />

<p><em>This is the fourth post in a series on LLM infrastructure performance. The
first covered
<a href="https://ranvier.systems/2026/03/16/why-your-load-balancer-is-wasting-your-gpus.html">why your load balancer is wasting your GPUs</a>.
The second covered
<a href="https://ranvier.systems/2026/03/29/24-hard-rules-for-writing-correct-async-cpp.html">24 hard rules for writing correct async C++</a>.
The third covered
<a href="https://ranvier.systems/2026/04/30/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost.html">KV cache locality, the hidden variable in your LLM serving cost</a>.
The next will cover how routing decisions improve when the load balancer
learns from every request it forwards.</em></p>

<p><em>Ranvier is a project of Minds Aspire, LLC.</em></p>]]></content><author><name>Minds Aspire</name></author><summary type="html"><![CDATA[You’ve optimized your GPU serving stack. You’ve tuned vLLM’s batch size, configured PagedAttention, maybe even set up prefix-aware routing for KV cache locality. Your P99 looks good. Your throughput is climbing. And somewhere in your proxy layer, every single request is blocking for 5-13 milliseconds while a tokenizer turns text into integers.]]></summary></entry><entry><title type="html">KV Cache Locality: The Hidden Variable in Your LLM Serving Cost</title><link href="https://ranvier.systems/2026/04/30/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost.html" rel="alternate" type="text/html" title="KV Cache Locality: The Hidden Variable in Your LLM Serving Cost" /><published>2026-04-30T00:00:00+00:00</published><updated>2026-04-30T00:00:00+00:00</updated><id>https://ranvier.systems/2026/04/30/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost</id><content type="html" xml:base="https://ranvier.systems/2026/04/30/kv-cache-locality-the-hidden-variable-in-your-llm-serving-cost.html"><![CDATA[<p>Every time your load balancer sends a request to the wrong GPU, that GPU
recomputes a prefill it already computed somewhere else. The KV cache for that
4,000-token system prompt exists. It’s just sitting on a different card. Your
load balancer doesn’t know. It can’t know. It’s counting connections, not
tokens.</p>

<p>That recomputation takes real time and real money. On a Llama 3.1 70B at
half precision, a 4,000-token prefill takes over a second. If eight GPUs each
recompute the same system prompt independently because round-robin sent one
request to each, you just paid for the same work eight times. Multiply by
every request, every hour, every day.</p>

<p>This post is about the cost of that mistake, how to measure it, and what
changes when your load balancer understands token locality.</p>

<h2 id="what-the-kv-cache-actually-saves-you">What the KV Cache Actually Saves You</h2>

<p>A transformer processes input tokens in two phases. <strong>Prefill</strong> computes the
key-value pairs for every input token: the system prompt, the conversation
history, the RAG context. This is the expensive part. It scales with token
count and model size, and it’s compute-bound on the GPU. <strong>Decode</strong> generates
output tokens one at a time, each one reusing the key-value pairs from
prefill. This is the cheap part.</p>

<p>vLLM and other serving engines cache the key-value pairs from prefill in GPU
memory. When a new request arrives with the same token prefix, the engine
skips prefill for the cached tokens, computes only what is new, and moves on
to decode. This is the KV cache hit.</p>

<p>On our benchmarks, a cache hit on CodeLlama 13B returns in 18ms at P50. A
cache miss takes around 500ms. That’s a 28x gap in time-to-first-token,
decided entirely by whether the tokens were already on that GPU.</p>

<p>But here’s the thing: the KV cache is <strong>per-GPU</strong>. GPU 0’s cache doesn’t help
GPU 3. If your load balancer sends Request A to GPU 0 and the identical
Request B to GPU 3, Request B pays full prefill cost even though the work was
already done. The cache exists. It’s just on the wrong card.</p>

<h2 id="the-math-on-wasted-prefill">The Math on Wasted Prefill</h2>

<p>Let’s make this concrete. You’re running a RAG application with a 4,000-token
system prompt. You have 8 GPUs serving CodeLlama 13B. You’re handling 30
concurrent users with a stress workload (heavy on large and extra-large
prefixes). Here’s what we measured on 8x A100s:</p>

<p>Round-robin routing:</p>
<ul>
  <li>Cache hit rate: 12.5%</li>
  <li>P99 TTFT: 6,800ms</li>
  <li>Throughput: 36.3 req/s</li>
</ul>

<p>With 8 backends and random routing, you’d expect ~12.5% cache hits by chance.
One in eight requests happens to land on the GPU that already has its prefix
cached. The other 87.5% recompute from scratch.</p>

<p>Prefix-aware routing:</p>
<ul>
  <li>Cache hit rate: 97.5%</li>
  <li>P99 TTFT: 1,000ms</li>
  <li>Throughput: 44.4 req/s</li>
</ul>

<p>Same GPUs. Same model. Same workload. The only change is which GPU receives
which request.</p>

<p>That throughput difference, 36.3 vs 44.4 requests per second, is a 22.3%
improvement. On an 8-GPU node costing ~$10/hour, that’s either 22% more
throughput for free or the same throughput on fewer GPUs. Over a month of
continuous operation on that single node, the wasted prefill in round-robin
comes to roughly $1,200–$1,800 in GPU-hours (22% of ~$7,300/month at $10/hr
is about $1,600) that produce no useful work. Multiply by the number of nodes
in your cluster.</p>

<h2 id="where-the-savings-compound">Where the Savings Compound</h2>

<p>The benefit scales with three variables: <strong>model size</strong>, <strong>prefix length</strong>,
and <strong>prefix sharing ratio</strong>.</p>

<h3 id="model-size">Model size</h3>

<p>Larger models have more expensive prefill, so cache misses cost more.</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>XLarge Cache Hit Improvement</th>
      <th>Aggregate Throughput Gain</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Llama 3.1 8B</td>
      <td>31.6%</td>
      <td>~0% (inference too fast)</td>
    </tr>
    <tr>
      <td>CodeLlama 13B</td>
      <td>35.9%</td>
      <td>+13.7% to +22.3%</td>
    </tr>
    <tr>
      <td>Llama 3.1 70B</td>
      <td>43.8%</td>
      <td>~0% (compute-bound)</td>
    </tr>
  </tbody>
</table>

<p>The 8B numbers are the warning case. When prefill is already fast (~420ms
total inference), the 7-10ms routing overhead eats into the savings. If
prefill isn’t your bottleneck, prefix-aware routing doesn’t help.</p>

<p>The 70B numbers tell a different story. Aggregate throughput doesn’t change
because the GPUs are already compute-saturated. But individual requests are
44% faster on cache hit (P50: 1,498ms hit vs 2,665ms miss). Your users feel
the difference even if your throughput dashboard doesn’t.</p>

<p>The sweet spot is 13B-70B models where prefill is expensive enough to matter
but the GPUs aren’t so saturated that they can’t benefit from skipping it.</p>

<h3 id="prefix-length">Prefix length</h3>

<p>Longer shared prefixes mean more wasted compute per cache miss.</p>

<table>
  <thead>
    <tr>
      <th>Max Prefix Tokens</th>
      <th>Cache Miss P50</th>
      <th>Cache Hit P50</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>8,192 (default)</td>
      <td>638ms</td>
      <td>448ms</td>
      <td>29.7%</td>
    </tr>
    <tr>
      <td>16,384</td>
      <td>817ms</td>
      <td>461ms</td>
      <td>43.6%</td>
    </tr>
  </tbody>
</table>

<p>At 16K tokens, a cache miss wastes roughly 350ms of GPU compute (817ms vs
461ms at P50) that a hit avoids. As context windows keep growing, this gap
widens.</p>

<h3 id="prefix-sharing-ratio">Prefix sharing ratio</h3>

<p>This is the percentage of tokens shared across requests. A RAG application
where every request includes the same 4,000-token knowledge base has a high
sharing ratio. A chat application where every conversation is unique has a
low one.</p>

<table>
  <thead>
    <tr>
      <th>Sharing Ratio</th>
      <th>Round-Robin Hits</th>
      <th>Prefix-Aware Hits</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>50%</td>
      <td>~11%</td>
      <td>91%</td>
      <td>+80pp</td>
    </tr>
    <tr>
      <td>70%</td>
      <td>~13%</td>
      <td>90%</td>
      <td>+77pp</td>
    </tr>
    <tr>
      <td>90%</td>
      <td>~12%</td>
      <td>97-98%</td>
      <td>+85pp</td>
    </tr>
  </tbody>
</table>

<p>Even at 50% sharing, where half the tokens are unique, prefix-aware routing
still achieves 91% cache hits. A consistent hash fallback (deterministic
routing based on prefix when no learned route exists yet) ensures that
requests with the same prefix land on the same GPU even before the system
has observed them.</p>
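<p>To make the fallback idea concrete, here’s a rough sketch of deterministic prefix-to-backend routing. It uses rendezvous (highest-random-weight) hashing as a compact stand-in for a consistent-hash scheme; it is not Ranvier’s implementation, and <code class="language-plaintext highlighter-rouge">prefix_hash</code>, <code class="language-plaintext highlighter-rouge">fallback_backend</code>, and <code class="language-plaintext highlighter-rouge">max_tokens</code> are hypothetical names.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// FNV-1a over a backend id followed by the first `max_tokens` token ids.
std::uint64_t prefix_hash(const std::vector<std::uint32_t>& tokens,
                          std::size_t max_tokens, std::uint32_t backend) {
    std::uint64_t h = 14695981039346656037ull;  // FNV-1a offset basis
    auto mix = [&h](std::uint32_t v) {
        for (int i = 0; i < 4; ++i) {
            h ^= (v >> (8 * i)) & 0xffu;
            h *= 1099511628211ull;  // FNV-1a prime
        }
    };
    mix(backend);
    const std::size_t n = tokens.size() < max_tokens ? tokens.size() : max_tokens;
    for (std::size_t i = 0; i < n; ++i) {
        mix(tokens[i]);
    }
    return h;
}

// Rendezvous hashing: score every backend against the prefix and pick the
// highest score. Requests sharing the first `max_tokens` tokens always get
// the same answer. Precondition: backends > 0.
std::uint32_t fallback_backend(const std::vector<std::uint32_t>& tokens,
                               std::size_t max_tokens, std::uint32_t backends) {
    std::uint32_t best = 0;
    std::uint64_t best_score = 0;
    for (std::uint32_t b = 0; b < backends; ++b) {
        const std::uint64_t score = prefix_hash(tokens, max_tokens, b);
        if (b == 0 || score > best_score) {
            best = b;
            best_score = score;
        }
    }
    return best;
}
```

<p>Two requests that share a system prompt but differ in the user query hash to the same backend as long as the shared part covers the first <code class="language-plaintext highlighter-rouge">max_tokens</code> tokens. No routing state is needed for that to hold.</p>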

<h2 id="the-p99-story">The P99 Story</h2>

<p>Cost isn’t just GPU-hours. It’s also the cost of slow responses.</p>

<p>At 30 concurrent users on CodeLlama 13B over 30 minutes of sustained load,
round-robin routing produced a P99 TTFT of 6,800ms. That’s 6.8 seconds before
the first token appears. For an interactive application like code completion
or chat, that’s a broken experience. Users don’t wait 6.8 seconds.</p>

<p>Prefix-aware routing brought that same P99 down to 1,000ms. Same hardware,
same model, same concurrency. An 85.3% improvement on tail latency.</p>

<p>Why does the tail improve so much? Because tail latency in LLM serving is
driven by cache misses under load. When the GPU is busy generating tokens for
other requests, a new request that requires full prefill gets queued behind
them. With round-robin, 87.5% of requests need full prefill, so the queue is
always full of expensive work. With prefix-aware routing, 97.5% of requests
skip prefill entirely, so the queue drains faster and the few remaining
misses get processed sooner.</p>

<p>This is the strongest argument for KV cache locality. Throughput improvements
look good on a dashboard. Tail latency is what users actually experience.</p>

<h2 id="what-doesnt-work">What Doesn’t Work</h2>

<p>Prefix-aware routing isn’t free, and it doesn’t help everywhere.</p>

<p><strong>Small models (≤8B):</strong> Inference is already fast enough that the routing
overhead (~10ms for tokenization + tree lookup) approaches the prefill
savings. The net effect is roughly zero.</p>

<p><strong>Short prefixes (&lt;500 tokens):</strong> The prefill cost for short sequences is
small enough that cache misses don’t meaningfully hurt. The routing overhead
(~3ms minimum) can exceed the savings.</p>

<p><strong>Unique conversations:</strong> If every request has a completely different prefix
(no shared system prompt, no shared context), there’s nothing to cache. The
routing tree learns routes that are never reused.</p>

<p><strong>Load imbalance:</strong> Strict prefix affinity can create hot spots. If 80% of
your traffic shares the same system prompt, prefix-aware routing sends 80% of
traffic to one GPU. We handle this with a load-aware fallback that diverts
requests when a backend’s in-flight count exceeds twice the median. This
trades a cache miss for a balanced GPU, reducing P95 by 36% and P99 by 45%
compared to strict affinity. The cache hit rate drops about 5 points, which
is the right trade.</p>
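<p>The diversion rule is simple enough to sketch. This is an illustration under assumptions, not Ranvier’s code: the function names are hypothetical, the median is the upper median for even-sized clusters, and diverting to the least-loaded backend is one plausible choice of target.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of per-backend in-flight counts (upper median for even sizes).
// Takes a copy because nth_element reorders its input.
// Precondition: counts is non-empty.
std::size_t median_in_flight(std::vector<std::size_t> counts) {
    auto mid = counts.begin() + static_cast<std::ptrdiff_t>(counts.size() / 2);
    std::nth_element(counts.begin(), mid, counts.end());
    return *mid;
}

// Keep the prefix-affine backend unless its in-flight count exceeds twice
// the median; in that case accept a cache miss and divert to the
// least-loaded backend (assumed target).
// Precondition: preferred < in_flight.size().
std::size_t pick_backend(const std::vector<std::size_t>& in_flight,
                         std::size_t preferred) {
    if (in_flight[preferred] > 2 * median_in_flight(in_flight)) {
        auto least = std::min_element(in_flight.begin(), in_flight.end());
        return static_cast<std::size_t>(least - in_flight.begin());
    }
    return preferred;
}
```

<p>With in-flight counts of 4, 5, 3, and 20, the median is 5, so a request that prefers the fourth backend gets diverted to the third. Drop that 20 to 9 and the request stays put.</p>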

<h2 id="measuring-your-own-cache-locality">Measuring Your Own Cache Locality</h2>

<p>Before you change anything, measure your current cache hit rate. Most vLLM
deployments expose this via Prometheus:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">vllm:gpu_prefix_cache_hit_rate</code> (or the <code class="language-plaintext highlighter-rouge">vllm:gpu_prefix_cache_queries_total</code>
and <code class="language-plaintext highlighter-rouge">_hits_total</code> counters, depending on your vLLM version; check your <code class="language-plaintext highlighter-rouge">/metrics</code> endpoint)</li>
  <li>Compare TTFT distributions between requests with shared vs unique prefixes</li>
  <li>Look at your P99/P50 ratio. A ratio above 5x suggests cache thrashing</li>
</ul>
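<p>If your deployment only exposes the raw counters, the hit rate over a window is the ratio of the two deltas between scrapes. A small sketch, with hypothetical function names and the 5x threshold taken from the list above:</p>

```cpp
#include <cstdint>

// Cache hit rate over a scrape window, from two samples of monotonically
// increasing query and hit counters. Returns 0 when there was no traffic.
double hit_rate(std::uint64_t queries_before, std::uint64_t hits_before,
                std::uint64_t queries_after, std::uint64_t hits_after) {
    const std::uint64_t queries = queries_after - queries_before;
    if (queries == 0) {
        return 0.0;
    }
    const std::uint64_t hits = hits_after - hits_before;
    return static_cast<double>(hits) / static_cast<double>(queries);
}

// Rule of thumb from the list above: a P99/P50 ratio above 5x is a hint
// (not proof) of cache thrashing.
bool suggests_thrashing(double p50_ms, double p99_ms) {
    return p50_ms > 0.0 && p99_ms / p50_ms > 5.0;
}
```

<p>Computing the rate from deltas rather than lifetime totals keeps a long warm-up period from hiding what the cache is doing right now.</p>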

<p>If your cache hit rate is already above 80%, you’re either lucky or your
traffic naturally clusters. If it’s below 30%, you’re leaving performance on
the table.</p>

<p>The variables that matter most:</p>

<ol>
  <li><strong>How many GPUs are you routing across?</strong> More GPUs = lower chance of
random cache hits. With 8 GPUs, random routing gives ~12.5% hits.</li>
  <li><strong>How long are your shared prefixes?</strong> Longer = more wasted compute per
miss.</li>
  <li><strong>What’s your prefix sharing ratio?</strong> Higher = more opportunity for reuse.</li>
  <li><strong>What model size are you serving?</strong> Larger = more expensive prefill per
miss.</li>
</ol>

<p>If you have many GPUs, long shared prefixes, high sharing ratios, and
large models, you’re likely wasting 20-40% of your GPU compute on redundant
prefill.</p>

<h2 id="the-takeaway">The Takeaway</h2>

<p>KV cache locality is not a tuning knob. It’s a multiplier on your existing
hardware. The same GPUs, serving the same model, handling the same traffic,
produce measurably different throughput and latency depending on one decision:
which GPU gets which request.</p>

<p>Round-robin doesn’t make that decision. Least-connections doesn’t make that
decision. They balance load without understanding what the load <em>is</em>. When
every request carries thousands of tokens that might already be cached
somewhere in your cluster, “balanced” and “efficient” are not the same thing.</p>

<p>We built <a href="https://github.com/Ranvier-Systems/ranvier-core">Ranvier</a> to make
that decision. It routes requests to the GPU that already has their token
prefix cached, using an adaptive radix tree that learns routes in real time.
The first post in this series covered
<a href="https://ranvier.systems/2026/03/16/why-your-load-balancer-is-wasting-your-gpus.html">why your load balancer is wasting your GPUs</a>.
This post covered what that waste costs. The next one will cover how we
tokenize 50,000 requests per second without blocking the event loop.</p>

<hr />

<p><em>All benchmarks run on 8x A100 GPUs (Lambda Labs), February 2026. Workloads
use the stress distribution (10% small, 20% medium, 30% large, 40% xlarge
prefixes) with 90% prefix sharing ratio unless noted. Full methodology and raw
data available in the
<a href="https://github.com/Ranvier-Systems/ranvier-core/tree/main/docs/benchmarks">benchmark guide</a>.</em></p>

<p><em>Ranvier is a project of Minds Aspire, LLC.</em></p>]]></content><author><name>Minds Aspire</name></author><summary type="html"><![CDATA[Every time your load balancer sends a request to the wrong GPU, that GPU recomputes a prefill it already computed somewhere else. The KV cache for that 4,000-token system prompt exists. It’s just sitting on a different card. Your load balancer doesn’t know. It can’t know. It’s counting connections, not tokens.]]></summary></entry><entry><title type="html">24 Hard Rules for Writing Correct Async C++</title><link href="https://ranvier.systems/2026/03/29/24-hard-rules-for-writing-correct-async-cpp.html" rel="alternate" type="text/html" title="24 Hard Rules for Writing Correct Async C++" /><published>2026-03-29T00:00:00+00:00</published><updated>2026-03-29T00:00:00+00:00</updated><id>https://ranvier.systems/2026/03/29/24-hard-rules-for-writing-correct-async-cpp</id><content type="html" xml:base="https://ranvier.systems/2026/03/29/24-hard-rules-for-writing-correct-async-cpp.html"><![CDATA[<p>Async C++ will let you write a use-after-free that only manifests under load, on the third Tuesday of the month, in a stack frame that has nothing to do with the bug. The compiler won’t warn you. Your tests will pass. Your sanitizers will shrug. And then production will teach you what you missed.</p>

<p>I maintain a ~50K LOC C++20 service built on Seastar. I catalogued every class of bug that burned me and turned them into 24 rules I enforce on every commit. Each one cost at least a day to diagnose. Here they are.</p>

<h2 id="memory-that-isnt-yours-anymore">Memory That Isn’t Yours Anymore</h2>

<p>Most of the worst async C++ bugs are lifetime bugs. In synchronous code, if the object exists, the scope that created it still exists. In async code, that’s not true. The object is alive but the scope that created it finished long ago.</p>

<p><strong>Rule 16 — Lambda coroutines in <code class="language-plaintext highlighter-rouge">.then()</code> are use-after-free.</strong> This is the scariest bug on the list because it looks completely correct. You write a lambda that contains <code class="language-plaintext highlighter-rouge">co_await</code>, pass it to <code class="language-plaintext highlighter-rouge">.then()</code>, and everything compiles. Here’s what actually happens: <code class="language-plaintext highlighter-rouge">.then()</code> moves the lambda into internal storage. The lambda’s <code class="language-plaintext highlighter-rouge">operator()</code> is called, which creates a coroutine frame on the heap. The coroutine suspends at <code class="language-plaintext highlighter-rouge">co_await</code>. <code class="language-plaintext highlighter-rouge">.then()</code> is done with the lambda and destroys it. The coroutine resumes into freed memory.</p>

<p>The broken version:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// BROKEN — use-after-free when the coroutine suspends</span>
<span class="n">future</span><span class="o">&lt;&gt;</span> <span class="n">handle</span><span class="p">(</span><span class="n">request</span> <span class="n">req</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">async_lookup</span><span class="p">(</span><span class="n">req</span><span class="p">.</span><span class="n">key</span><span class="p">()).</span><span class="n">then</span><span class="p">([</span><span class="n">req</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">req</span><span class="p">)](</span><span class="k">auto</span> <span class="n">val</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">future</span><span class="o">&lt;&gt;</span> <span class="p">{</span>
        <span class="k">co_await</span> <span class="n">async_log</span><span class="p">(</span><span class="n">req</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span>   <span class="c1">// .then() has already freed this lambda</span>
        <span class="k">co_return</span><span class="p">;</span>
    <span class="p">});</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The fix:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// FIXED: seastar::coroutine::lambda() only references the lambda, so co_await the chain to keep it alive</span>
<span class="n">future</span><span class="o">&lt;&gt;</span> <span class="n">handle</span><span class="p">(</span><span class="n">request</span> <span class="n">req</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">co_await</span> <span class="n">async_lookup</span><span class="p">(</span><span class="n">req</span><span class="p">.</span><span class="n">key</span><span class="p">()).</span><span class="n">then</span><span class="p">(</span><span class="n">seastar</span><span class="o">::</span><span class="n">coroutine</span><span class="o">::</span><span class="n">lambda</span><span class="p">([</span><span class="n">req</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">req</span><span class="p">)](</span><span class="k">auto</span> <span class="n">val</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">future</span><span class="o">&lt;&gt;</span> <span class="p">{</span>
        <span class="k">co_await</span> <span class="n">async_log</span><span class="p">(</span><span class="n">req</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span>
        <span class="k">co_return</span><span class="p">;</span>
    <span class="p">}));</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The compiler will never warn you about this. I found it after three days of chasing a heap corruption that only appeared under sustained load.</p>

<p><strong>Rule 5 — Timer callbacks need gate guards.</strong> A repeating timer fires after <code class="language-plaintext highlighter-rouge">stop()</code> has already begun destroying <code class="language-plaintext highlighter-rouge">this</code>. The callback dereferences member variables that no longer exist. The fix is <code class="language-plaintext highlighter-rouge">seastar::gate</code>, but the gate holder must outlive the <em>entire</em> async operation, not just the try block.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// BROKEN — gate guard scoped to try body; catch runs outside the gate</span>
<span class="n">future</span><span class="o">&lt;&gt;</span> <span class="n">on_timer</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">try</span> <span class="p">{</span>
        <span class="k">auto</span> <span class="n">holder</span> <span class="o">=</span> <span class="n">_gate</span><span class="p">.</span><span class="n">hold</span><span class="p">();</span>
        <span class="k">co_await</span> <span class="n">do_work</span><span class="p">();</span>
    <span class="p">}</span> <span class="k">catch</span> <span class="p">(...)</span> <span class="p">{</span>
        <span class="n">_logger</span><span class="p">.</span><span class="n">warn</span><span class="p">(</span><span class="s">"failed"</span><span class="p">);</span>  <span class="c1">// _logger is destroyed during shutdown</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// FIXED — gate guard covers the entire operation including error handling</span>
<span class="n">future</span><span class="o">&lt;&gt;</span> <span class="n">on_timer</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">auto</span> <span class="n">holder</span> <span class="o">=</span> <span class="n">_gate</span><span class="p">.</span><span class="n">hold</span><span class="p">();</span>
    <span class="k">try</span> <span class="p">{</span>
        <span class="k">co_await</span> <span class="n">do_work</span><span class="p">();</span>
    <span class="p">}</span> <span class="k">catch</span> <span class="p">(...)</span> <span class="p">{</span>
        <span class="n">_logger</span><span class="p">.</span><span class="n">warn</span><span class="p">(</span><span class="s">"failed"</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>During shutdown, <code class="language-plaintext highlighter-rouge">_gate.close()</code> waits for outstanding holders. If the holder is scoped inside the try, the catch path runs unguarded, touching members that <code class="language-plaintext highlighter-rouge">stop()</code> has already destroyed.</p>

<p><strong>Rule 21 — Coroutine reference parameters dangle.</strong> Just take coroutine parameters by value. Always. A coroutine that takes <code class="language-plaintext highlighter-rouge">const std::string&amp;</code> looks correct, compiles fine, passes every unit test, and breaks under load. The caller’s string goes out of scope, the coroutine suspends, and when it resumes the reference points to freed memory.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// BROKEN — reference dangles when caller's scope ends before coroutine resumes</span>
<span class="n">future</span><span class="o">&lt;&gt;</span> <span class="n">process</span><span class="p">(</span><span class="k">const</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">&amp;</span> <span class="n">key</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">co_await</span> <span class="n">db</span><span class="p">.</span><span class="n">lookup</span><span class="p">(</span><span class="n">key</span><span class="p">);</span>  <span class="c1">// key may be freed by now</span>
<span class="p">}</span>

<span class="c1">// FIXED — take by value, always</span>
<span class="n">future</span><span class="o">&lt;&gt;</span> <span class="n">process</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">string</span> <span class="n">key</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">co_await</span> <span class="n">db</span><span class="p">.</span><span class="n">lookup</span><span class="p">(</span><span class="n">key</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The cost of a copy is nothing compared to debugging a dangling reference that only shows up at the 99.9th percentile.</p>

<p><strong>Rule 20 — Missing <code class="language-plaintext highlighter-rouge">&amp;</code> in <code class="language-plaintext highlighter-rouge">do_with</code> lambdas.</strong> <code class="language-plaintext highlighter-rouge">seastar::do_with</code> allocates objects on the heap and passes them by reference to your lambda. Forget a single <code class="language-plaintext highlighter-rouge">&amp;</code> and you get a copy instead.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// BROKEN: the lambda takes buf by value; the copy dies when the lambda returns</span>
<span class="k">return</span> <span class="n">seastar</span><span class="o">::</span><span class="n">do_with</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">buf</span><span class="p">),</span> <span class="p">[](</span><span class="k">auto</span> <span class="n">buf</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">async_write</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>  <span class="c1">// dangling reference to destroyed copy</span>
<span class="p">});</span>

<span class="c1">// FIXED: take the parameter by reference; do_with owns the object for the future's lifetime</span>
<span class="k">return</span> <span class="n">seastar</span><span class="o">::</span><span class="n">do_with</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">buf</span><span class="p">),</span> <span class="p">[](</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">buf</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">async_write</span><span class="p">(</span><span class="n">buf</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div></div>

<p>One missing character. The copy is destroyed when the lambda returns, but the future it spawned is still running, still holding a reference to the now-dead copy. Heap corruption that shows up in completely unrelated code, sometimes minutes later.</p>

<p>Two ways to keep this from biting you in the first place. If you control the type, make it move-only — the broken version won’t compile. And in C++20 coroutine style you rarely need <code class="language-plaintext highlighter-rouge">do_with</code> at all; a local variable in a coroutine lives across <code class="language-plaintext highlighter-rouge">co_await</code> suspensions, which is the whole point.</p>

<p><strong>Rule 23 — <code class="language-plaintext highlighter-rouge">share()</code> on <code class="language-plaintext highlighter-rouge">temporary_buffer</code> pins the whole allocation.</strong> You call <code class="language-plaintext highlighter-rouge">.share()</code> to grab a 32-byte header from a 64KB network buffer. Both shared views now pin the same underlying allocation. You cache the header, but the “temporary” buffer lives forever. The result is unexplained memory growth that doesn’t correlate with logical data sizes. The fix: copy the bytes you need into a new buffer, then release the shared view.</p>
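<p>The pinning effect is easy to reproduce without Seastar. Here’s a sketch that uses <code class="language-plaintext highlighter-rouge">std::shared_ptr</code> as a stand-in for the shared ownership behind <code class="language-plaintext highlighter-rouge">temporary_buffer::share()</code>. The types and function names are hypothetical, and this is an analogy rather than Seastar’s API.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

using big_buffer = std::shared_ptr<std::vector<std::uint8_t>>;

// A view that shares ownership of the whole allocation, the way a shared
// temporary_buffer view does. Holding it keeps every byte of *owner alive.
struct shared_view {
    big_buffer owner;
    std::size_t offset;
    std::size_t length;
};

// Cheap, but pins the full buffer for as long as the view is cached.
shared_view share_header(const big_buffer& buf, std::size_t len) {
    return shared_view{buf, 0, len};
}

// Copies only the bytes that must outlive the network buffer.
// Precondition: len <= buf->size().
std::vector<std::uint8_t> copy_header(const big_buffer& buf, std::size_t len) {
    return std::vector<std::uint8_t>(
        buf->begin(), buf->begin() + static_cast<std::ptrdiff_t>(len));
}
```

<p>Cache the result of <code class="language-plaintext highlighter-rouge">share_header</code> and the 64KB allocation stays alive after the caller drops its own reference. Cache the result of <code class="language-plaintext highlighter-rouge">copy_header</code> and only the 32 bytes survive.</p>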

<h2 id="the-reactor-is-not-a-thread">The Reactor Is Not a Thread</h2>

<p>Seastar is cooperative. There is no kernel to preempt you. Every microsecond you block is a microsecond where that core serves zero requests.</p>

<p><strong>Rule 2 — No <code class="language-plaintext highlighter-rouge">co_await</code> in unbounded loops over external resources.</strong> The pattern <code class="language-plaintext highlighter-rouge">for (auto&amp; item : items) { co_await process(item); }</code> is O(n) latency for that request. 100 items at 10ms each means a full second before the caller hears back. The shard keeps serving other connections in the meantime — it isn’t idle — but this one request is needlessly serialized. The point of async is that you get to pick the execution shape: serial, fully concurrent, or bounded concurrency. Pick deliberately. When you want parallelism, use <code class="language-plaintext highlighter-rouge">seastar::parallel_for_each</code> or <code class="language-plaintext highlighter-rouge">seastar::max_concurrent_for_each</code> with a cap so you don’t exhaust memory or downstream connections.</p>

<p><strong>Rule 12 — No <code class="language-plaintext highlighter-rouge">std::ifstream</code> in coroutines.</strong> It compiles. It works in testing with SSDs. In production, one 10ms disk stall freezes the entire shard. Every connection on that core drops packets for 10ms. Use Seastar’s file I/O, which goes through the reactor and yields properly. And don’t reach for <code class="language-plaintext highlighter-rouge">seastar::thread</code> as an escape hatch — it’s a stackful coroutine that runs <em>on the reactor</em>, so blocking inside it stalls the shard exactly the same way. If you genuinely have no async-friendly alternative, run the blocking call on a dedicated OS thread outside the reactor and ferry the result back through the alien-thread API.</p>

<p><strong>Rule 17 — Preemption points in hot loops.</strong> A tight loop that runs for 500μs without yielding starves everything else on that core. Insert <code class="language-plaintext highlighter-rouge">co_await seastar::coroutine::maybe_yield()</code> every ~100 iterations. The cost is a branch that’s almost never taken. The cost of <em>not</em> doing it is a reactor stall warning in your logs and a mystery latency spike that disappears when you reduce load.</p>

<h2 id="cross-shard-is-cross-universe">Cross-Shard Is Cross-Universe</h2>

<p>Each core in Seastar has its own memory allocator. This isn’t an implementation detail you can ignore. It’s a load-bearing invariant, and violating it corrupts allocator state silently.</p>

<p><strong>Rule 0 — <code class="language-plaintext highlighter-rouge">std::shared_ptr</code> destructs on the wrong shard.</strong> The refcount is atomic, so the decrement is “safe” from any core. But the destructor runs on whichever core decrements last. That destructor frees memory through the wrong core’s allocator.</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// BROKEN — destructor runs on whichever shard releases last</span>
<span class="n">std</span><span class="o">::</span><span class="n">shared_ptr</span><span class="o">&lt;</span><span class="n">session</span><span class="o">&gt;</span> <span class="n">s</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">make_shared</span><span class="o">&lt;</span><span class="n">session</span><span class="o">&gt;</span><span class="p">();</span>
<span class="c1">// ... shared across shards via submit_to() ...</span>
<span class="c1">// shard 3 drops the last reference; ~session() frees memory</span>
<span class="c1">// allocated by shard 0's allocator. Silent corruption.</span>

<span class="c1">// FIXED — foreign_ptr ensures destruction on the owning shard</span>
<span class="n">seastar</span><span class="o">::</span><span class="n">foreign_ptr</span><span class="o">&lt;</span><span class="n">seastar</span><span class="o">::</span><span class="n">lw_shared_ptr</span><span class="o">&lt;</span><span class="n">session</span><span class="o">&gt;&gt;</span> <span class="n">s</span><span class="p">;</span>
</code></pre></div></div>

<p>Use <code class="language-plaintext highlighter-rouge">seastar::lw_shared_ptr</code> (non-atomic refcount, shard-local only) for objects that stay on one shard. Wrap cross-shard pointers in <code class="language-plaintext highlighter-rouge">seastar::foreign_ptr</code>, which ensures the destructor runs on the owning shard. This was the first bug that burned me and the last one I expected, which is why it’s Rule 0.</p>

<p><strong>Rule 14 — Cross-shard heap data must be reallocated locally.</strong> You <code class="language-plaintext highlighter-rouge">submit_to()</code> another shard with a <code class="language-plaintext highlighter-rouge">std::string</code>. The target shard reads memory allocated by the source shard’s allocator. Maybe it works today. Maybe the allocator metadata is adjacent and you corrupt it on the next allocation. Copy on receive. Always. It feels wasteful but it prevents silent corruption.</p>

<p><strong>Rule 15 — FFI across shard boundaries needs reallocation in both directions.</strong> Passing Seastar-allocated memory to an FFI boundary (Rust, C libraries) means the foreign code may free or reallocate through a different allocator. Reallocate into standard <code class="language-plaintext highlighter-rouge">malloc</code> memory before calling FFI. Reallocate the result back into Seastar’s allocator before returning to the reactor.</p>

<h2 id="futures-are-not-exceptions">Futures Are Not Exceptions</h2>

<p>C++ effectively has two error propagation systems now: exceptions and future chains. Code that mixes them has gaps where errors fall through.</p>

<p><strong>Rule 18 — Discarded futures silently swallow errors.</strong> Calling an async function without <code class="language-plaintext highlighter-rouge">co_await</code> means the returned future is destroyed immediately. If that future eventually resolves with an exception, nobody sees it. Seastar logs a warning at runtime, but by then the damage is done: a write that didn’t complete, a cleanup that never ran. Every future must be <code class="language-plaintext highlighter-rouge">co_await</code>ed, returned, or explicitly discarded with a comment explaining why.</p>

<p><strong>Rule 22 — Throwing before returning a future bypasses <code class="language-plaintext highlighter-rouge">.finally()</code>.</strong> If an exception is thrown synchronously before the function returns a future, it propagates as a regular C++ exception. Any <code class="language-plaintext highlighter-rouge">.finally()</code> attached to the expected return value never executes. Cleanup is skipped. Resources leak. Use <code class="language-plaintext highlighter-rouge">seastar::futurize_invoke()</code> to wrap the call, which catches synchronous exceptions and converts them into failed futures. Or just use coroutines, which handle this naturally.</p>
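<p>The failure mode is easy to reproduce without Seastar. The following is a minimal sketch built on a toy, already-resolved future; <code class="language-plaintext highlighter-rouge">toy_future</code>, <code class="language-plaintext highlighter-rouge">write_block</code>, and <code class="language-plaintext highlighter-rouge">invoke_as_future</code> are hypothetical names for illustration, and the helper only mimics the idea behind <code class="language-plaintext highlighter-rouge">seastar::futurize_invoke()</code>, not its implementation.</p>

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <utility>

// Toy, already-resolved future: holds either success or a captured exception.
// A stand-in for illustration only; it is not Seastar's future type.
class toy_future {
public:
    toy_future() = default;
    explicit toy_future(std::exception_ptr error) : error_(std::move(error)) {}

    bool failed() const { return error_ != nullptr; }

    // Runs `cleanup` whether the future succeeded or failed, then passes the
    // original outcome along, like a .finally() continuation.
    template <typename Cleanup>
    toy_future finally(Cleanup&& cleanup) {
        cleanup();
        return std::move(*this);
    }

private:
    std::exception_ptr error_;
};

// Hypothetical operation that validates its input synchronously and throws
// before any future has been created.
toy_future write_block(int size) {
    if (size < 0) {
        throw std::invalid_argument("negative size");
    }
    return toy_future{};
}

// Hypothetical helper in the spirit of futurize_invoke(): a synchronous throw
// is captured and returned as a failed future.
template <typename Func>
toy_future invoke_as_future(Func&& func) {
    try {
        return func();
    } catch (...) {
        return toy_future{std::current_exception()};
    }
}

// Broken: the throw leaves write_block() before .finally() is attached.
bool cleanup_ran_unwrapped() {
    bool cleaned = false;
    try {
        write_block(-1).finally([&] { cleaned = true; });
    } catch (const std::invalid_argument&) {
        // The error arrived as a plain exception; the continuation never ran.
    }
    return cleaned;
}

// Fixed: the failure travels inside the future, so .finally() still runs.
bool cleanup_ran_wrapped() {
    bool cleaned = false;
    toy_future result = invoke_as_future([] { return write_block(-1); })
                            .finally([&] { cleaned = true; });
    return cleaned && result.failed();
}

// Usage: call from a test or from main().
void demo() {
    assert(!cleanup_ran_unwrapped());
    assert(cleanup_ran_wrapped());
}
```

<p>In the unwrapped call the exception escapes before <code class="language-plaintext highlighter-rouge">.finally()</code> is ever attached, so the cleanup lambda never runs. Wrapping the call gives the error a single path: through the future.</p>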

<p><strong>Rule 19 — Raw <code class="language-plaintext highlighter-rouge">semaphore::wait()/signal()</code> leaks units on throw.</strong> You call <code class="language-plaintext highlighter-rouge">wait()</code>, do work, call <code class="language-plaintext highlighter-rouge">signal()</code> in a <code class="language-plaintext highlighter-rouge">.finally()</code>. But if the work throws synchronously before you attach <code class="language-plaintext highlighter-rouge">.finally()</code>, the units are never returned. The semaphore’s available count decreases monotonically until everything deadlocks. Use <code class="language-plaintext highlighter-rouge">seastar::with_semaphore()</code>, which handles the lifecycle correctly regardless of how the operation fails.</p>
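<p>The leak and the fix can be sketched in plain C++. This assumes a toy, single-threaded counter in place of a real semaphore; <code class="language-plaintext highlighter-rouge">toy_semaphore</code> and <code class="language-plaintext highlighter-rouge">units_guard</code> are hypothetical names, and the guard only illustrates the scope-bound ownership that <code class="language-plaintext highlighter-rouge">seastar::with_semaphore()</code> provides.</p>

```cpp
#include <cassert>
#include <stdexcept>

// Toy single-threaded semaphore: just a counter, standing in for a
// shard-local semaphore. Not Seastar's API, and it never blocks.
struct toy_semaphore {
    int available;
    void wait() { --available; }
    void signal() { ++available; }
};

// Hypothetical RAII guard: the unit taken in the constructor is always
// returned in the destructor, including during stack unwinding.
class units_guard {
public:
    explicit units_guard(toy_semaphore& sem) : sem_(sem) { sem_.wait(); }
    ~units_guard() { sem_.signal(); }
    units_guard(const units_guard&) = delete;
    units_guard& operator=(const units_guard&) = delete;

private:
    toy_semaphore& sem_;
};

void failing_work() { throw std::runtime_error("backend unavailable"); }

// Broken: signal() is skipped when the work throws.
void raw_wait_signal(toy_semaphore& sem) {
    sem.wait();
    failing_work();
    sem.signal();  // never reached on the error path
}

// Fixed: the guard's destructor returns the unit on every exit path.
void guarded(toy_semaphore& sem) {
    units_guard units(sem);
    failing_work();
}

// Usage: call from a test or from main().
void demo() {
    toy_semaphore leaky{1};
    try { raw_wait_signal(leaky); } catch (const std::runtime_error&) {}
    assert(leaky.available == 0);  // one unit lost for good

    toy_semaphore safe{1};
    try { guarded(safe); } catch (const std::runtime_error&) {}
    assert(safe.available == 1);   // unit returned during unwinding
}
```

<p>Each failed call through the raw path costs one unit permanently, which is how the count drifts to zero in production.</p>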

<h2 id="the-rules-i-didnt-expect">The Rules I Didn’t Expect</h2>

<p>Some rules aren’t about the language at all.</p>

<p><strong>Rule 4 — Every growing container needs MAX_SIZE.</strong> No unbounded buffers, ever. A single malicious peer sending oversized messages will OOM your process if nothing caps the queue. Every <code class="language-plaintext highlighter-rouge">std::vector</code> and <code class="language-plaintext highlighter-rouge">std::deque</code>, every ring buffer gets a configured maximum.</p>
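<p>A minimal sketch of what a capped container can look like, assuming a drop-and-report policy when the cap is hit. <code class="language-plaintext highlighter-rouge">bounded_queue</code> is a hypothetical name, and the cap of 2 is chosen only to keep the example short.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical bounded queue: refuses new items once max_size is reached
// instead of growing without limit.
template <typename T>
class bounded_queue {
public:
    explicit bounded_queue(std::size_t max_size) : max_size_(max_size) {}

    // Returns false (and drops the item) when the queue is full, so the
    // caller can apply backpressure or disconnect the peer.
    bool try_push(T item) {
        if (items_.size() >= max_size_) {
            ++rejected_;
            return false;
        }
        items_.push_back(std::move(item));
        return true;
    }

    std::size_t size() const { return items_.size(); }
    std::size_t rejected() const { return rejected_; }

private:
    std::size_t max_size_;
    std::size_t rejected_ = 0;
    std::deque<T> items_;
};

// Usage: call from a test or from main().
void demo() {
    bounded_queue<std::string> inbox(2);
    bool first = inbox.try_push("a");
    bool second = inbox.try_push("b");
    bool third = inbox.try_push("c");  // refused, not buffered
    assert(first && second && !third);
    assert(inbox.size() == 2);
    assert(inbox.rejected() == 1);
}
```

<p>Counting rejections matters as much as enforcing the cap: a rising counter is the signal that a peer is misbehaving or that the limit is too low.</p>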

<p><strong>Rule 9 — Every catch block logs at warn level.</strong> A silent <code class="language-plaintext highlighter-rouge">catch(...)</code> is the number one cause of “it works but something is wrong” in production. If you’re catching an exception, something unexpected happened. Log it. If it’s too noisy, fix the root cause instead of silencing the symptom.</p>

<p><strong>Rule 7 — Persistence only stores, never validates.</strong> This is a design rule, not a language rule. When the persistence layer also validates, you can’t test business logic without spinning up storage. When it only stores, you can test validation in isolation and reason about correctness without thinking about I/O.</p>
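<p>One way to picture the split, as a sketch with hypothetical names (<code class="language-plaintext highlighter-rouge">backend_record</code>, <code class="language-plaintext highlighter-rouge">is_valid</code>, <code class="language-plaintext highlighter-rouge">in_memory_store</code>); the in-memory map stands in for whatever storage you actually use.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

struct backend_record {
    std::string name;
    int port;
};

// The business rule is a pure function: testable with no storage at all.
bool is_valid(const backend_record& record) {
    return !record.name.empty() && record.port > 0 && record.port <= 65535;
}

// The persistence layer stores whatever it is given and validates nothing.
class in_memory_store {
public:
    void save(const backend_record& record) { rows_[record.name] = record; }
    std::size_t size() const { return rows_.size(); }

private:
    std::map<std::string, backend_record> rows_;
};

// The caller composes the two: validate first, then store.
bool register_backend(in_memory_store& store, const backend_record& record) {
    if (!is_valid(record)) {
        return false;
    }
    store.save(record);
    return true;
}

// Usage: call from a test or from main().
void demo() {
    // Validation is checked in isolation, with no store in sight.
    assert(is_valid({"gpu-1", 8000}));
    assert(!is_valid({"gpu-1", 70000}));

    in_memory_store store;
    bool accepted = register_backend(store, {"gpu-1", 8000});
    bool refused = register_backend(store, {"", 8000});
    assert(accepted && !refused);
    assert(store.size() == 1);
}
```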

<h2 id="the-remaining-rules">The Remaining Rules</h2>

<p>For completeness, here are the rules not covered in full above:</p>

<ul>
  <li><strong>Rule 1</strong> — Metrics accessors must be lock-free, no <code class="language-plaintext highlighter-rouge">std::mutex</code> in query methods.</li>
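  <li><strong>Rule 3</strong> — Null-guard all C string returns. <code class="language-plaintext highlighter-rouge">sqlite3_column_text()</code> returns NULL when the column value is SQL NULL; dereferencing that pointer, or constructing a <code class="language-plaintext highlighter-rouge">std::string</code> from it, is undefined behavior.</li>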
  <li><strong>Rule 6</strong> — Deregister metrics first in <code class="language-plaintext highlighter-rouge">stop()</code>. Prometheus scrape lambdas capture <code class="language-plaintext highlighter-rouge">this</code>; if <code class="language-plaintext highlighter-rouge">this</code> is destroyed first, the next scrape is a use-after-free.</li>
  <li><strong>Rule 8</strong> — Single <code class="language-plaintext highlighter-rouge">ShardLocalState</code> struct per service, no scattered <code class="language-plaintext highlighter-rouge">thread_local</code> variables.</li>
  <li><strong>Rule 10</strong> — Validating helpers for string-to-number conversions. <code class="language-plaintext highlighter-rouge">std::stoi()</code> throws on bad input; raw calls in request parsing are a crash waiting to happen.</li>
  <li><strong>Rule 11</strong> — <code class="language-plaintext highlighter-rouge">std::call_once</code> or <code class="language-plaintext highlighter-rouge">std::atomic</code> for one-time global initialization, never a bare static with lazy init.</li>
  <li><strong>Rule 13</strong> — Thread-local <code class="language-plaintext highlighter-rouge">new</code> needs an explicit destroy function registered with the allocator, or the memory leaks on shard shutdown. (Switching to <code class="language-plaintext highlighter-rouge">std::make_unique</code> doesn’t fix this if the smart pointer itself has <code class="language-plaintext highlighter-rouge">thread_local</code> storage — the hazard is the destruction order at shard teardown, not the allocation syntax.)</li>
</ul>
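<p>Rules 3 and 10 both come down to small validating helpers. The sketch below assumes C++17; <code class="language-plaintext highlighter-rouge">parse_int</code> and <code class="language-plaintext highlighter-rouge">to_string_or_empty</code> are hypothetical names, not helpers from the Ranvier codebase.</p>

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Rule 10: parse the whole input as a base-10 int without throwing.
// Returns std::nullopt on empty input, trailing junk, or overflow.
std::optional<int> parse_int(std::string_view text) {
    int value = 0;
    const char* first = text.data();
    const char* last = text.data() + text.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last) {
        return std::nullopt;
    }
    return value;
}

// Rule 3: turn a possibly-null C string into an owned std::string.
std::string to_string_or_empty(const char* text) {
    return text ? std::string(text) : std::string();
}

// Usage: call from a test or from main().
void demo() {
    assert(parse_int("8080") == 8080);
    assert(parse_int("-7") == -7);
    assert(!parse_int("").has_value());
    assert(!parse_int("42abc").has_value());
    assert(!parse_int("99999999999999999999").has_value());
    assert(to_string_or_empty(nullptr).empty());
    assert(to_string_or_empty("ok") == "ok");
}
```

<p>Returning <code class="language-plaintext highlighter-rouge">std::optional</code> forces the caller to handle the failure case at the call site, instead of relying on an exception handler somewhere up the stack.</p>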

<h2 id="how-i-enforce-them">How I Enforce Them</h2>

<p>These rules live in a reference document I consult on every commit. They’re enforced by discipline, not tooling. No linter can tell you that a lambda coroutine in <code class="language-plaintext highlighter-rouge">.then()</code> is a use-after-free.</p>

<p>Numbering them matters. “Rule 16” is a faster shorthand than re-deriving the coroutine frame lifetime problem each time you encounter it.</p>

<p>The list started at Rule 0 and grew to 24. I add a rule only when a bug burns me. Never speculatively. If you’re building something similar, start your own list. The specific rules matter less than the habit of writing them down.</p>

<h2 id="the-takeaway">The Takeaway</h2>

<p>Async C++ gives you performance that no garbage-collected language can match, but it takes away the safety nets. You have to build your own. Write your rules down.</p>

<p>I’m building <a href="https://github.com/Ranvier-Systems/ranvier-core">Ranvier</a>, a Layer 7 load balancer for LLM inference on Seastar. If this kind of systems work interests you, check out the <a href="https://github.com/Ranvier-Systems/ranvier-core">source</a>.</p>

<hr />

<p><em>Updated 2026-05-04 with corrections from <a href="https://www.reddit.com/r/cpp/comments/1s85qf1/24_hard_rules_for_writing_correct_async_c_lessons/">reader feedback on r/cpp</a>: Rule 12 no longer recommends <code class="language-plaintext highlighter-rouge">seastar::thread</code> for blocking I/O (it’s a stackful coroutine on the reactor, not an escape hatch), Rule 2 was reworded to clarify that the cost is per-request latency rather than shard-wide idleness, and Rule 20 notes that move-only types and C++20 coroutines largely sidestep the <code class="language-plaintext highlighter-rouge">do_with</code> hazard. Thanks to the commenters who flagged these.</em></p>

<p><em>Ranvier is a project of Minds Aspire, LLC.</em></p>]]></content><author><name>Minds Aspire</name></author><summary type="html"><![CDATA[Async C++ will let you write a use-after-free that only manifests under load, on the third Tuesday of the month, in a stack frame that has nothing to do with the bug. The compiler won’t warn you. Your tests will pass. Your sanitizers will shrug. And then production will teach you what you missed.]]></summary></entry><entry><title type="html">Why Your Load Balancer Is Wasting Your GPUs</title><link href="https://ranvier.systems/2026/03/16/why-your-load-balancer-is-wasting-your-gpus.html" rel="alternate" type="text/html" title="Why Your Load Balancer Is Wasting Your GPUs" /><published>2026-03-16T00:00:00+00:00</published><updated>2026-03-16T00:00:00+00:00</updated><id>https://ranvier.systems/2026/03/16/why-your-load-balancer-is-wasting-your-gpus</id><content type="html" xml:base="https://ranvier.systems/2026/03/16/why-your-load-balancer-is-wasting-your-gpus.html"><![CDATA[<p>Your load balancer is blind. It routes LLM requests based on server availability, completely ignoring the fact that GPU-1 already holds the context your request needs. The result: redundant computation, wasted money, and slower responses than necessary.</p>

<h2 id="the-problem-kv-cache-thrashing">The Problem: KV Cache Thrashing</h2>

<p>Modern LLM inference engines like vLLM and TensorRT-LLM use KV caching to avoid recomputing attention for tokens they’ve already seen. When a user sends a follow-up question about a document, the engine can skip directly to generating the answer—if that document’s context is still in memory.</p>

<p>Here’s what actually happens with standard load balancing:</p>

<p>Request A loads a 4,000-token PDF into GPU-1. Request B, asking a question about that same PDF, arrives moments later. Your load balancer, following Least Connections or Round Robin, routes it to GPU-2. GPU-2 has no idea this document exists, so it recomputes the entire 4,000-token prefill from scratch.</p>

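<p>At scale, this pattern wastes inference capacity on most follow-up requests: with eight backends and cache-blind routing, a request lands on the GPU that holds its prefix only about one time in eight. You’re paying for GPUs to redo work that’s already been done.</p>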

<h2 id="the-insight-content-aware-routing">The Insight: Content-Aware Routing</h2>

<p>What if your load balancer understood tokens, not just packets?</p>

<p>The fix is conceptually simple: route each request to the GPU that already holds the relevant KV cache, and skip the prefill for the shared prefix. Implementing it means solving a hard problem: sub-millisecond lookups across potentially millions of cached prefixes, kept up to date in real time as caches fill and evict.</p>

<p>This is the problem Ranvier solves.</p>

<h2 id="introducing-ranvier">Introducing Ranvier</h2>

<p><a href="https://github.com/Ranvier-Systems/ranvier-core">Ranvier</a> is a Layer 7+ load balancer purpose-built for LLM inference. Instead of treating each request as an opaque HTTP payload, it inspects the prompt’s token sequence and routes to the backend most likely to have that prefix cached.</p>

<p><strong>How it works:</strong> Ranvier maintains an Adaptive Radix Tree (ART) that maps token prefixes to backend GPUs. When a request arrives, Ranvier tokenizes the prompt, performs an O(L) prefix lookup (where L is prefix length, not total keys), and routes to the matching backend. When backends report cache evictions via gossip protocol, Ranvier updates its routing table.</p>
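<p>The lookup shape can be sketched with a plain trie keyed by token ID. This is not Ranvier’s implementation: a real ART compresses paths and adapts node sizes, and <code class="language-plaintext highlighter-rouge">prefix_router</code>, <code class="language-plaintext highlighter-rouge">token_id</code>, and <code class="language-plaintext highlighter-rouge">backend_id</code> are hypothetical names. The sketch also leaves out eviction handling.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <optional>
#include <unordered_map>
#include <vector>

using token_id = std::int32_t;
using backend_id = int;

// Plain trie keyed by token ID, one node per token.
class prefix_router {
    struct node {
        std::unordered_map<token_id, std::unique_ptr<node>> children;
        std::optional<backend_id> backend;  // set if a recorded prefix ends here
    };

public:
    // Records that `backend` is believed to hold this token prefix in cache.
    void insert(const std::vector<token_id>& prefix, backend_id backend) {
        node* current = &root_;
        for (token_id token : prefix) {
            auto& child = current->children[token];
            if (!child) {
                child = std::make_unique<node>();
            }
            current = child.get();
        }
        current->backend = backend;
    }

    // Walks the prompt token by token and returns the backend attached to the
    // longest recorded prefix, or std::nullopt if none matches. The walk
    // visits at most prompt.size() nodes, however many prefixes are stored.
    std::optional<backend_id> longest_match(const std::vector<token_id>& prompt) const {
        const node* current = &root_;
        std::optional<backend_id> best = current->backend;
        for (token_id token : prompt) {
            auto it = current->children.find(token);
            if (it == current->children.end()) {
                break;
            }
            current = it->second.get();
            if (current->backend) {
                best = current->backend;
            }
        }
        return best;
    }

private:
    node root_;
};

// Usage: call from a test or from main().
void demo() {
    prefix_router router;
    router.insert({1, 2, 3}, 0);        // e.g. a shared system prompt on backend 0
    router.insert({1, 2, 3, 4, 5}, 1);  // a longer conversation on backend 1
    assert(router.longest_match({1, 2, 3, 4, 5, 6}) == 1);
    assert(router.longest_match({1, 2, 3, 9}) == 0);
    assert(!router.longest_match({7, 8}).has_value());
}
```

<p>A request with no matching prefix returns no backend, at which point a router would fall back to some default policy and record the new prefix once a backend has been chosen.</p>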

<p><strong>Why it’s fast:</strong> Built on C++20 and the Seastar framework—the same shared-nothing, thread-per-core architecture that powers ScyllaDB. No locks, no atomics, no cross-core coordination in the hot path. Total routing overhead is 6-8ms with server-side tokenization. For production deployments where clients send pre-tokenized requests, overhead drops to &lt;1ms.</p>

<p>The name comes from the Nodes of Ranvier in neuroscience—the gaps in myelin sheath that allow nerve signals to “jump” between points (saltatory conduction), dramatically increasing signal speed. Ranvier does the same for inference: jumping past redundant computation to reach cached state.</p>

<h2 id="the-results">The Results</h2>

<p>We tested Ranvier on 8x A100 clusters running vLLM across multiple instances, comparing prefix-aware routing against standard round-robin:</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Round Robin</th>
      <th>Ranvier</th>
      <th>Improvement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cache Hit Rate</td>
      <td>12% (1/8 random)</td>
      <td>58-98%</td>
      <td>5-8x</td>
    </tr>
    <tr>
      <td>P99 Latency (13B, 30 users, 30 min)</td>
      <td>4.7-6.8s</td>
      <td>0.9-1.0s</td>
      <td>79-85% lower</td>
    </tr>
    <tr>
      <td>Throughput (13B, 30 users, 30 min)</td>
      <td>36.3 req/s</td>
      <td>44.4 req/s</td>
      <td>+13-22%</td>
    </tr>
  </tbody>
</table>

<p>Test setup: CodeLlama-13b, 10-30 concurrent users, 10-30 minute runs, stress-distribution prompts (70% large/xlarge prefixes). Results were validated across four Lambda Labs instances (Feb-March 2026). Per request, XLarge prompts (4K-8K tokens) saw 33-39% faster time-to-first-token on cache hits. Full methodology and raw data are in the <a href="https://github.com/Ranvier-Systems/ranvier-core/blob/main/docs/benchmarks/benchmark-guide-8xA100.md">benchmark guide</a>.</p>

<p>Larger models benefit more. 70B models show the highest per-request improvement (44-49% faster XLarge TTFT on cache hit) because the KV cache saves more compute per token. 13B models show the strongest aggregate gains—P99 tail latency drops 60-85% and throughput increases 12-22%—because backend queuing under load amplifies the benefit of cache-aware routing. 8B models are routing-neutral: inference is fast enough (~420ms) that cache savings don’t affect aggregate latency, though per-request XLarge improvement remains strong at 28-40%.</p>

<p>Routing overhead is 6-8ms with server-side tokenization, &lt;1ms with pre-tokenized requests. You’re trading milliseconds of routing for seconds of tail latency improvement.</p>

<p><strong>Where Ranvier shines:</strong> RAG pipelines, multi-turn chat with system prompts, few-shot learning—any workload where requests share common prefixes of 2K+ tokens.</p>

<h2 id="landscape-and-positioning">Landscape and Positioning</h2>

<p>The inference ecosystem is moving fast. vLLM shipped their <a href="https://github.com/vllm-project/production-stack">own router</a> in December 2025. Red Hat’s <a href="https://github.com/llm-d/llm-d">llm-d</a> takes a different approach—introspecting vLLM’s KV cache directly via live events. <a href="https://github.com/sgl-project/sglang">SGLang</a> builds prefix caching into the engine itself via RadixAttention.</p>

<p>Ranvier takes yet another path: engine-agnostic, external routing based on token-level prefix matching. It works with any OpenAI-compatible backend (vLLM, SGLang, TensorRT-LLM, Ollama) without requiring engine modifications or instrumentation hooks. The tradeoff is that Ranvier infers cache state from routing history rather than observing it directly, which means slightly lower cache hit precision in exchange for portability.</p>

<p>We think there’s value in a routing layer that isn’t coupled to a specific inference engine, especially as the ecosystem fragments.</p>

<p>Full roadmap: <a href="https://github.com/Ranvier-Systems/ranvier-core/blob/main/docs/architecture/VISION.md">VISION.md</a></p>

<h2 id="code-and-docs">Code and Docs</h2>

<p>Ranvier is open source under Apache 2.0: <a href="https://github.com/Ranvier-Systems/ranvier-core">github.com/Ranvier-Systems/ranvier-core</a></p>

<p>If you’re interested in the internals—why an Adaptive Radix Tree instead of a hash map, how the gossip protocol tracks distributed cache state, what we learned building on Seastar’s shared-nothing architecture—start with the <a href="https://github.com/Ranvier-Systems/ranvier-core/blob/main/docs/architecture/system-design.md">system design doc</a>.</p>

<ul>
  <li><a href="https://github.com/Ranvier-Systems/ranvier-core/blob/main/docs/benchmarks/benchmark-guide-8xA100.md">Benchmark methodology</a></li>
  <li><a href="https://github.com/Ranvier-Systems/ranvier-core/blob/main/docs/deployment/kubernetes.md">Deployment guide</a></li>
</ul>

<p>Issues and feedback welcome—especially if you run it against a workload we haven’t tested.</p>

<hr />

<p><em>Ranvier is a project of Minds Aspire, LLC.</em></p>]]></content><author><name>Minds Aspire</name></author><summary type="html"><![CDATA[Your load balancer is blind. It routes LLM requests based on server availability, completely ignoring the fact that GPU-1 already holds the context your request needs. The result: redundant computation, wasted money, and slower responses than necessary.]]></summary></entry></feed>