Tokenization Is the Bottleneck You’re Not Measuring

You’ve optimized your GPU serving stack. You’ve tuned vLLM’s batch size, configured PagedAttention, maybe even set up prefix-aware routing for KV cache locality. Your P99 looks good. Your throughput is climbing. And somewhere in your proxy layer, every single request is blocking for 5-13 milliseconds while a tokenizer turns text into integers.

You’re probably not measuring it. Most LLM proxies treat tokenization as instantaneous—call the function, get the tokens, move on. But on an event-loop architecture, 5-13ms isn’t a rounding error. It’s an eternity. Every millisecond your event loop spends inside a tokenizer FFI call is a millisecond where no other request is read, no response is forwarded, no health check is answered, no connection is accepted.

This post is about a bottleneck hiding in the gap between “fast enough” and “actually non-blocking.”

Why Tokenization Blocks

If you’re doing prefix-aware routing, request rewriting, cost estimation, or priority classification, your proxy needs to tokenize the input before forwarding it. That means calling a tokenizer, usually HuggingFace’s tokenizers library, the same BPE implementation used by most serving engines.

The problem is that tokenization is CPU-bound work executed through an FFI boundary. The Rust tokenizers crate does the actual BPE encoding. Your proxy calls it through a C binding. The call takes 5-13ms depending on input length. During that call, your thread is gone.

In a thread-per-request architecture (Go, Java, threaded Python), this is fine. One thread blocks; the others keep working. In an event-loop architecture—Node.js, Seastar, anything built on epoll/io_uring with cooperative scheduling—it’s a disaster. The event loop processes everything sequentially. While it’s inside the tokenizer, it processes nothing else.

Let’s make this concrete. You have an event loop handling 1,000 requests per second. Each tokenization call takes 10ms. If you tokenize synchronously on the event loop, you can process at most 100 tokenizations per second on that core. The other 900 requests that arrive each second queue up, and each one waits an extra 10ms for every request ahead of it in line. At a sustained 1,000 requests per second, that backlog never drains.

At 20 concurrent users, we measured tokenization at 10.6ms of the total routing overhead, while the actual routing decision (a radix tree lookup) took 0.01ms. The tokenizer was roughly 1,000x slower than the thing it was feeding.

The Caching Layer That Actually Works

The first optimization is the most obvious: don’t tokenize the same text twice.

LLM traffic has a property that makes caching extraordinarily effective: repetition. Every request to a RAG application includes the same system prompt. Every multi-turn conversation starts with the same instruction prefix. Every API call from the same client sends the same role tags (<|system|>\n, <|user|>\n).

We added an LRU cache in front of the tokenizer. Hit rates depend entirely on content type, and the spread is dramatic. Here’s what we expect:

Content type	Cache hit rate
Role tags (<|system|>\n)	95%+
System messages	80-90%
User queries	10-30%

That 80-90% hit rate on system messages means that for most requests, the expensive part—tokenizing the 2,000-4,000 token system prompt—is a hash table lookup returning in microseconds instead of a 10ms FFI call.

The implementation is straightforward: a hash map keyed on the input text, with an LRU eviction list capped at a configured maximum (we use 1,000 entries). On hit, move the entry to the front. On miss, tokenize, insert at the front, evict from the tail if full. No locks needed: in a sharded architecture, each core has its own cache.

Two details matter:

Cap cached text—but don’t cap it too low. Our first instinct was to cap at 8KB: surely a long RAG document won’t repeat verbatim often enough to earn its memory. That was a mistake. The long, stable system prefixes we most wanted to cache routinely exceed 8KB, and refusing them reintroduced a 5-7ms P50 regression at 20+ concurrent users, exactly the cost we were trying to delete. We raised the cap to 64KB. Worst case is 1,000 entries × 64KB ≈ 64MB per shard, which is cheap insurance. The cache key is the full input text (the tokens themselves are a small vector of int32s), so the cap is really about bounding key memory. And the texts most worth caching are precisely the long ones. (A separate, tighter 32KB limit applies to the cross-core dispatch path, because that one copies the string across a core boundary, where large copies aren’t free.)

Don’t special-case unique content. User queries have a 10-30% hit rate; most are unique. LRU eviction handles this without any explicit filtering: unique queries enter the cache, never get hit, and fall off the tail, while the system prompt stays hot at the front.
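The cache described above fits in a few lines. This is a minimal sketch, not our production code: the class name TokenCache, its parameters, and the hits/misses counters are illustrative, and tokenize stands in for whatever blocking tokenizer call your proxy makes.

```python
from collections import OrderedDict
from typing import Callable, List


class TokenCache:
    """Per-shard LRU cache in front of a tokenizer (one instance per core, so no locks)."""

    def __init__(self, tokenize: Callable[[str], List[int]],
                 max_entries: int = 1000, max_text_bytes: int = 64 * 1024):
        self._tokenize = tokenize            # hypothetical: the blocking tokenizer call
        self._max_entries = max_entries
        self._max_text_bytes = max_text_bytes
        self._entries: "OrderedDict[str, List[int]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def encode(self, text: str) -> List[int]:
        cached = self._entries.get(text)
        if cached is not None:
            self._entries.move_to_end(text, last=False)   # hit: move to the front
            self.hits += 1
            return cached
        self.misses += 1
        tokens = self._tokenize(text)                      # miss: pay the full cost
        # Only keys under the size cap are stored, which bounds key memory.
        if len(text.encode("utf-8")) <= self._max_text_bytes:
            self._entries[text] = tokens
            self._entries.move_to_end(text, last=False)   # insert at the front
            if len(self._entries) > self._max_entries:
                self._entries.popitem(last=True)           # evict from the tail
        return tokens
```

Dividing hits by hits plus misses gives the hit rate per shard, which is the number the table above is about.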

When Caching Isn’t Enough

A 90% cache hit rate sounds great until you think about what happens on the other 10%. At 1,000 requests per second, 10% misses means 100 tokenizer calls per second, each blocking for 10ms. Your event loop can handle exactly 100 of those per second. You’re at capacity with zero headroom. And that’s assuming uniform arrival, which real traffic never is.

A burst of 20 cache misses in a row blocks your event loop for 200ms. Every request that arrives during those 200ms—including the ones that would have been cache hits—waits.
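The arithmetic above is easy to redo for your own traffic. The two helpers below are a sketch with made-up names; they assume a single core and a fixed tokenization time, which real traffic will not give you.

```python
def event_loop_utilization(requests_per_sec: float, miss_rate: float,
                           tokenize_ms: float) -> float:
    """Fraction of each second one core spends blocked inside the tokenizer.

    1.0 means the loop has no headroom left; above 1.0 the backlog grows.
    """
    return requests_per_sec * miss_rate * tokenize_ms / 1000.0


def burst_stall_ms(consecutive_misses: int, tokenize_ms: float) -> float:
    """How long the loop is frozen when misses arrive back to back."""
    return consecutive_misses * tokenize_ms
```

With the numbers from this section, event_loop_utilization(1000, 0.10, 10) comes out at 1.0 and burst_stall_ms(20, 10) at 200.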

You need a way to tokenize without blocking the event loop.

Option 1: Thread Pool Offload

The most direct solution: move the FFI call to a dedicated worker thread. The event loop submits a job, gets a future back immediately, and continues processing other requests. When the worker thread finishes tokenizing, it signals the event loop to resume the request.

The implementation needs care:

One thread per core. The HuggingFace tokenizer isn’t thread-safe for concurrent calls on the same instance, so you need one tokenizer instance per worker thread. In a sharded architecture, that means one worker thread per shard: no contention, no locks on the tokenizer itself.

Lock-free job queue. The event loop (producer) and worker thread (consumer) communicate through a bounded SPSC (single-producer, single-consumer) queue. No mutexes on the hot path. When the queue is full, the event loop falls back to other strategies rather than blocking.

Memory isolation across thread boundaries. This is the subtle one. If your event loop uses a per-core memory allocator (Seastar does), you can’t safely pass heap-allocated objects from the event loop thread to the worker thread and back: memory freed on a thread that doesn’t own it can corrupt allocator metadata. The input string must be reallocated on the worker thread before calling the tokenizer. The output tokens must be reallocated on the event loop thread when the result returns. That is two copies that feel wasteful but prevent silent memory corruption.

The overhead is ~50-200μs per call—negligible compared to the 5-13ms it keeps off the event loop.
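The shape of the offload can be shown with asyncio. This is a sketch under loud assumptions: a single-thread ThreadPoolExecutor plus a pending counter stands in for the lock-free SPSC queue, and TokenizerWorker, try_submit, max_pending, and slow_tokenize are all hypothetical names. The allocator-domain copies described above have no Python equivalent, so they are not shown.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


class TokenizerWorker:
    """One worker thread with a bounded backlog (stand-in for the SPSC queue)."""

    def __init__(self, tokenize: Callable[[str], List[int]], max_pending: int = 64):
        self._tokenize = tokenize                 # hypothetical blocking tokenizer call
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._max_pending = max_pending
        self._pending = 0                         # only touched on the event loop thread

    def try_submit(self, text: str) -> Optional["asyncio.Future[List[int]]"]:
        """Return a future, or None when the backlog is full so the caller can fall back."""
        if self._pending >= self._max_pending:
            return None
        self._pending += 1
        loop = asyncio.get_running_loop()
        future = loop.run_in_executor(self._pool, self._tokenize, text)
        future.add_done_callback(self._on_done)   # callback runs on the event loop thread
        return future

    def _on_done(self, _future) -> None:
        self._pending -= 1

    def close(self) -> None:
        self._pool.shutdown(wait=True)


async def demo() -> dict:
    def slow_tokenize(text: str) -> List[int]:
        time.sleep(0.05)                          # simulate a blocking FFI call
        return [len(word) for word in text.split()]

    worker = TokenizerWorker(slow_tokenize, max_pending=1)
    first = worker.try_submit("hello tokenizer world")
    rejected = worker.try_submit("queue is full")  # None: backlog already at capacity
    ticks = 0
    while not first.done():                       # the loop keeps serving other work
        ticks += 1
        await asyncio.sleep(0.005)
    tokens = await first
    worker.close()
    return {"tokens": tokens, "rejected": rejected is None, "ticks": ticks}
```

Running asyncio.run(demo()) shows the two properties that matter: the loop keeps ticking while the worker is busy, and a full backlog is reported to the caller instead of blocking it.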

Option 2: Cross-Core Dispatch

If you’re running a multi-core architecture with per-core sharding, you can hand the tokenization to a different core. Instead of tokenizing locally (blocking this core’s event loop), dispatch it elsewhere.

This doesn’t eliminate the blocking: the target core’s event loop still blocks for 5-13ms. But it moves the blocking away from the core that’s serving the request, keeping the request-handling core responsive.

The selection algorithm is a knob, and how far you turn it depends on how often this path fires. Our shipping implementation keeps it as simple as possible: rotate to the next core (round-robin). It costs nothing to compute, and because cross-core dispatch is only a fallback (it fires when the thread pool is saturated, which is rare), even distribution is good enough and load skew rarely has time to matter. Round-robin spreads cost evenly; it does not look at which cores are actually busy.

If your dispatch path fires often enough that even spreading isn’t good enough, the next step up is Power-of-Two-Choices (P2C): sample two cores at random, pick the less loaded one. It’s O(1), avoids thundering herd, and produces near-optimal distribution. We run a P2C balancer elsewhere in the system, and the cross-core tokenization path is wired to adopt it, but today the selector still ignores load and just rotates. The rule of thumb: match the selector’s sophistication to the frequency of the path. Don’t pay for P2C on a path that fires once in a thousand requests.
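To make the difference between the two selectors concrete, here is a sketch of both. The names are illustrative, and two details are assumptions of the sketch rather than statements about our implementation: the rotation skips the calling core, and load is a plain per-core integer such as queue depth.

```python
import random
from typing import List, Optional


class RoundRobinSelector:
    """Rotate through the other cores without looking at load (needs at least 2 cores)."""

    def __init__(self, num_cores: int, self_core: int):
        self._num_cores = num_cores
        self._self_core = self_core
        self._next = self_core

    def pick(self) -> int:
        self._next = (self._next + 1) % self._num_cores
        if self._next == self._self_core:         # assumption: never dispatch to ourselves
            self._next = (self._next + 1) % self._num_cores
        return self._next


def pick_p2c(loads: List[int], self_core: int,
             rng: Optional[random.Random] = None) -> int:
    """Power-of-Two-Choices: sample two distinct other cores, keep the less loaded one."""
    rng = rng or random
    # Building this list is O(cores) and is done only for clarity;
    # a production selector would sample two indices directly.
    candidates = [core for core in range(len(loads)) if core != self_core]
    if len(candidates) == 1:
        return candidates[0]
    a, b = rng.sample(candidates, 2)
    return a if loads[a] <= loads[b] else b
```

The round-robin selector needs no shared state beyond a counter. P2C needs a load figure for every core that is cheap to read from any other core, and that bookkeeping is the real cost of adopting it.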

The cross-core dispatch has its own memory safety requirement: the input text must be copied into an owned string before crossing the core boundary, and the output tokens must be copied again when returning. Same principle as the thread pool—allocator domains don’t mix.

Option 3: Both

The strategies compose. On a cache miss:

1. Try the thread pool: if the SPSC queue has space, submit and continue. Event loop never blocks. Best case.
2. Try cross-core dispatch: if the thread pool is full, hand the work to another core (we rotate round-robin). Calling core stays unblocked. Target core blocks briefly.
3. Local fallback: if everything else fails, tokenize on the local event loop, gated by a semaphore that limits concurrent blocking tokenizations to one per core. This caps the worst case: at most one 5-13ms stall at a time, rather than unbounded stacking.
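The three steps above reduce to a priority-ordered chain. The sketch below assumes each layer is an async callable that returns None when it cannot take the job; the Strategy type and the function name tokenize_on_miss are illustrative, not part of any library.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

Tokens = List[int]
# Hypothetical contract: a layer returns tokens, or None if it declines the job.
Strategy = Callable[[str], Awaitable[Optional[Tokens]]]


async def tokenize_on_miss(
    text: str, layers: List[Tuple[str, Strategy]]
) -> Tuple[Optional[str], Optional[Tokens]]:
    """Try each layer in priority order and report which one produced the tokens."""
    for name, attempt in layers:
        tokens = await attempt(text)
        if tokens is not None:
            return name, tokens
    return None, None                             # every layer declined
```

Returning the layer name alongside the tokens makes it cheap to count how often each fallback fires, which is how you confirm that the lower layers really are rare.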

In practice, the cache handles 80-90% of requests. The thread pool handles most of the rest. Cross-core dispatch and local fallback are rarely needed but prevent pathological behavior under burst traffic.

The Semaphore Matters More Than You Think

That local fallback semaphore deserves a closer look. Without it, a burst of cache misses can compound: five concurrent misses on the same core means five back-to-back tokenizations, 25-65ms of total blocking at 5-13ms each. Every other request on that core is frozen for the duration.

A semaphore with one permit ensures at most one blocking tokenization at a time. The remaining concurrent misses either wait for the semaphore (adding latency to those specific requests) or bail out and route without tokens (falling back to hash-based routing instead of prefix-aware routing).

The choice between “wait for the semaphore” and “bail out” depends on your architecture. If tokenization is required for correctness (e.g., token counting for billing), wait. If it’s an optimization (e.g., prefix routing), bail out. A request routed by hash instead of prefix is slightly less optimal but doesn’t block the event loop.
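Both policies fit behind one flag. This sketch uses asyncio.Semaphore and hypothetical names (LocalFallback, required). One detail is an assumption of the sketch, not something described above: the permit holder yields to the loop once before tokenizing, so already-queued work drains between stalls and concurrent misses actually see the permit as taken.

```python
import asyncio
from typing import Callable, List, Optional


class LocalFallback:
    """Last-resort tokenization on the event loop, limited to one stall at a time."""

    def __init__(self, tokenize: Callable[[str], List[int]], required: bool):
        self._tokenize = tokenize                 # hypothetical blocking tokenizer call
        self._required = required                 # True: wait (billing). False: bail out (routing).
        self._permit = asyncio.Semaphore(1)

    async def encode(self, text: str) -> Optional[List[int]]:
        if self._permit.locked() and not self._required:
            return None                           # bail out: caller routes by hash instead
        async with self._permit:
            await asyncio.sleep(0)                # assumption: let queued work run first
            return self._tokenize(text)           # this call blocks the loop
```

With required=False, three simultaneous misses produce one tokenization and two None results. With required=True, all three are tokenized one after another, with the loop getting a turn between stalls.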

Measuring the Problem in Your Stack

Most LLM proxies don’t instrument tokenization latency. Here’s what to look for:

Histogram the tokenization call. Wrap your tokenizer call with a timer. Bucket at 100μs, 500μs, 1ms, 5ms, 10ms, 50ms, 100ms. If you see significant mass above 5ms, you have a blocking problem.

Track cache hit rate. If you have a cache, measure it. Anything above 70% means caching is working. Below 50% means your traffic patterns don’t repeat enough for caching to help, and you need the thread pool or dispatch strategies.

Correlate with tail latency. If your P99 latency spikes correlate with low cache hit periods, tokenization blocking is likely the cause. The correlation is indirect—the blocking doesn’t slow the tokenized request much, but it freezes every other request on that core.

Monitor event loop stalls. If your framework reports reactor stalls or event loop delays (Seastar does, Node.js has perf_hooks.monitorEventLoopDelay), check whether they correlate with tokenization activity. A 10ms stall that appears under load and disappears when you reduce traffic is a classic sign of a synchronous call that shouldn’t be synchronous.
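Two of these checks are small enough to sketch: the latency histogram and a loop-lag monitor. The code assumes an asyncio proxy, and every name in it (LatencyHistogram, timed, monitor_loop_lag, simulate_stall) is illustrative. A real deployment would export these through its metrics library instead of holding them in memory.

```python
import asyncio
import bisect
import time
from typing import Callable, List

# Bucket upper bounds in microseconds: 100us, 500us, 1ms, 5ms, 10ms, 50ms, 100ms.
BUCKET_UPPER_BOUNDS_US = [100, 500, 1_000, 5_000, 10_000, 50_000, 100_000]


class LatencyHistogram:
    def __init__(self, bounds_us: List[int] = BUCKET_UPPER_BOUNDS_US):
        self.bounds_us = list(bounds_us)
        self.counts = [0] * (len(self.bounds_us) + 1)    # last slot is overflow

    def observe_us(self, micros: float) -> None:
        self.counts[bisect.bisect_left(self.bounds_us, micros)] += 1

    def fraction_above_us(self, threshold_us: int) -> float:
        """Share of observations above a bucket bound (must be one of bounds_us)."""
        total = sum(self.counts)
        start = self.bounds_us.index(threshold_us) + 1
        return sum(self.counts[start:]) / total if total else 0.0


def timed(tokenize: Callable[[str], List[int]], histogram: LatencyHistogram):
    """Wrap a tokenizer call so every invocation lands in the histogram."""
    def wrapper(text: str) -> List[int]:
        start = time.perf_counter()
        try:
            return tokenize(text)
        finally:
            histogram.observe_us((time.perf_counter() - start) * 1e6)
    return wrapper


async def monitor_loop_lag(interval_s: float, samples: List[float]) -> None:
    """Record how late each wakeup is; large values mean something blocked the loop."""
    while True:
        start = time.perf_counter()
        await asyncio.sleep(interval_s)
        samples.append(time.perf_counter() - start - interval_s)


async def simulate_stall() -> float:
    samples: List[float] = []
    task = asyncio.create_task(monitor_loop_lag(0.005, samples))
    await asyncio.sleep(0.02)                     # healthy loop
    time.sleep(0.05)                              # stand-in for a synchronous tokenizer call
    await asyncio.sleep(0.02)
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return max(samples)                           # worst observed lag, in seconds
```

If fraction_above_us(5_000) is far from zero, or the worst lag rises and falls with tokenizer activity, the tokenizer is blocking the loop.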

When You Don’t Need Any of This

If your proxy doesn’t tokenize, none of this applies. Many LLM proxies are pure HTTP forwarders—they pass the request body through without parsing it. No tokenization, no bottleneck.

You need to tokenize if you’re doing any of:

Prefix-aware routing (matching token sequences for KV cache locality)
Token counting (enforcing context window limits before hitting the backend)
Cost estimation (pricing based on input token count)
Request priority classification (longer inputs get different priority)
Request rewriting (injecting or modifying tokens before forwarding)

If you’re doing these in a thread-per-request architecture (Go, Java), the blocking is absorbed by the thread model. The problem is specific to event-loop architectures, and it scales with the number of requests that miss the cache.

The Takeaway

Tokenization is a 5-13ms synchronous operation hiding inside systems that assume everything is asynchronous. At low traffic, it’s invisible. At high traffic, it’s the bottleneck. The tokenizer itself isn’t especially slow; the damage is that it freezes every other request on the core while it runs.

The fix is layered defense: cache the common cases (80-90% of requests), offload the rest to worker threads (event loop never blocks), and fall back to cross-core dispatch when the thread pool is full. Each layer handles a different failure mode. Together, they keep a 5-13ms FFI call from becoming a 200ms tail latency event.

If you’re building an LLM proxy that does anything with tokens before forwarding, wrap your tokenizer call in a histogram before you reach for any of these fixes. You can’t budget for the mass above 5ms until you can see it.

This is the fourth post in a series on LLM infrastructure performance. The first covered why your load balancer is wasting your GPUs. The second covered 24 hard rules for writing correct async C++. The third covered KV cache locality, the hidden variable in your LLM serving cost. The next will cover how routing decisions improve when the load balancer learns from every request it forwards.

Ranvier is a project of Minds Aspire, LLC.