Adds unified_nixl_both to elastic_migration_v2: same picker as unified_kv_both (never triggers PD-sep), but launches vLLM with NixlConnector instead of MooncakeConnector. Compared against plain unified and unified_kv_both (Mooncake) we can now attribute the substrate overhead between "v1 connector framework irreducible cost" (proxied by the leaner NIXL) and "Mooncake implementation extra" (Mooncake - NIXL).

Result (vs plain unified, both substrates never PD-sep):

| metric | plain | NIXL | Mooncake | note |
|---|---|---|---|---|
| TTFT p90 | 7.35s | +37.9% | +45.3% | NIXL: +7pp better |
| TPOT p90 | 17.1ms | +15.5% | +24.5% | NIXL: +9pp better |
| E2E p90 | 18.03s | +17.4% | +27.0% | NIXL: +10pp better |
| hotspot | 3.667 | +0.2% | +19.0% | NIXL: keeps it flat |
| APC | 79.4% | -0.3pp | -1.1pp | |
| interference | - | 5.58 | 8.57 | NIXL: ~35% lower |

The cleanest signal is hotspot: NIXL preserves plain-unified's distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step O(|cache|) `set(self._block_pool.cache.keys())` diff against _known_hash_keys (mooncake_connector.py:432-456) inflates routing imbalance by 19%. The hash sync runs unconditionally even when no direct_read consumer is present.

Attribution:
- NIXL-plain ~= v1 framework irreducible cost (kv_buffer GPU memory, per-step SchedulerOutput.kv_connector_metadata round-trip, altered kv_cache_manager block-lifecycle).
- Mooncake-NIXL ~= Mooncake-specific overhead (the hash-sync loop and stricter delay_free semantics).

Practical implication: NIXL is meaningfully better than Mooncake on this stack, but even NIXL imposes 16-38% across metrics, which is too expensive for selective-PD-sep on agentic workloads where the trigger rate is < 0.5%.

Launch fixes required for NIXL multi-instance:
- VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default 5600; we use 5600+i). Without this, 7 of 8 instances silently hang in `zmq.error.ZMQError: Address already in use` and the launcher trap kills all of them at health-check timeout.
- Health-check timeout raised from 180s to 360s; NIXL initialization (UCX agent + memory registration) is ~100-150s per instance under 8-way concurrent load, vs Mooncake's ~30-60s.

New figure: fig_connector_substrate_attribution.png stacks plain / framework / Mooncake-extra / v2-branch overhead per metric. Existing figures (fig_kv_both_overhead, fig_three_way_hotspot) updated to include NIXL as a fourth bar.

README updated with 4-way table, Result 1 reframed as "the cost is mostly framework, not Mooncake, but Mooncake adds the hotspot penalty", and the substrate-vs-PD-sep tradeoff math.

Refs: nixl_connector.py:700 handshake listener bind, factory.py register_connector for the NixlConnector entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Elastic Migration v2: Selective PD-Separation via Mooncake
Date: 2026-05-26
Trace: traces/w600_r0.0015_st30.jsonl (1214 reqs, 274 sessions, 53.3 M tokens)
Model: Qwen3-Coder-30B-A3B-Instruct, 8 × TP1 on H20
TL;DR
This section explores whether the B2-confirmed same-worker
prefill–decode interference can be relieved by selectively
migrating prefill to a different worker for the requests where the
interference cost would dominate the transfer cost. We implement
two flavors of the routing policy (strict gates, then relaxed
gates) and two isolation controls that use the unified picker
but launch vLLMs in kv_role=kv_both so the connector substrate
is on but never PD-seps:
- `unified_kv_both`: with MooncakeConnector
- `unified_nixl_both`: with NixlConnector (NVIDIA's official v1 connector; isolates connector implementation from policy)
Four findings:
- `kv_role=kv_both` imposes a substantial always-on tax even when no PD-sep ever fires: with Mooncake it's TTFT p90 +45%, TPOT p90 +25%, hotspot +19%; with NIXL it's TTFT p90 +38%, TPOT p90 +16%, hotspot +0.2%.
- Most of the substrate latency cost (roughly 63-84% of the p90 inflation, depending on the metric) is generic v1-connector framework overhead (proxied by NIXL since it's the leanest implementation): KV buffer GPU memory cut from the model's working budget, the `SchedulerOutput.kv_connector_metadata` round-trip, and altered `kv_cache_manager` block-lifecycle semantics. NIXL is meaningfully better than Mooncake but still imposes a 16-38% tax vs no connector.
- PD-sep almost never triggers on a real agentic workload: 0.16% with strict gates, 0.41% with relaxed gates. Agentic workloads have 93% intra-session reuse, so most requests land on workers that already hold cache, and the uncached tail is too small to be worth migrating.
- When PD-sep does fire, the cost model is wrong by ~10–20×: the calibrated `0.3s + bytes / 2.7 GB/s` predicts 1–2 s migrate cost; observed TTFT on triggered requests is 12–45 s.
The net latency of unified_v2 is not better than plain
unified under either Mooncake or NIXL substrate. Improving
agentic PD-sep requires (a) using the leaner connector (NIXL beats
Mooncake by 7-19 pp on the p90 latency metrics and hotspot), and
(b) fixing the underlying transfer mechanism (E2 patches 6.1 lazy
block reservation and 6.3 layerwise pipelining), not just the
routing decision.
Substrate
We compare four policies on identical traces:
| policy | picker | vLLM launch mode | what's it for |
|---|---|---|---|
| unified | hybrid affinity + LMetric | plain (no connector) | the headline baseline |
| unified_kv_both | same as unified | MooncakeConnector + kv_both | substrate control: Mooncake cost without PD-sep |
| unified_nixl_both | same as unified | NixlConnector + kv_both | substrate control: NIXL cost without PD-sep, attributes overhead to "framework vs Mooncake" |
| unified_v2 | unified + selective PD-sep | MooncakeConnector + kv_both + bootstrap | the actual experiment |
All four use the same trace, the same 8-instance topology, the same
shadow-drift–corrected proxy (scripts/cache_aware_proxy.py post-fix
95c8ef8). Plain unified was rerun on the patched proxy
(b3_sweep_20260525_095043/unified) under the same conditions.
NIXL required two launch fixes beyond Mooncake:
- `VLLM_NIXL_SIDE_CHANNEL_PORT` must be unique per instance (default 5600 → 5600..5607); otherwise instances 2..8 silently hang in `zmq.error.ZMQError: Address already in use`.
- Health-check timeout had to be raised from 180 s to 360 s because NIXL initialization (UCX agent + memory registration) takes ~100-150 s per instance under 8-way concurrent launch.
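The port assignment described above can be sketched as follows. This is a minimal illustration rather than the launcher's actual code: `nixl_instance_env` and the two constants are hypothetical names, and only `VLLM_NIXL_SIDE_CHANNEL_PORT`, the base port 5600, and the 360 s timeout come from the text.

```python
import os
from typing import Dict, Optional

BASE_SIDE_CHANNEL_PORT = 5600   # default port quoted in the text
HEALTH_CHECK_TIMEOUT_S = 360    # raised from 180 s to cover NIXL start-up


def nixl_instance_env(index: int,
                      base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build the environment for instance `index` (0-based).

    Each instance gets its own side-channel port so that no two
    instances try to bind the same address.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(BASE_SIDE_CHANNEL_PORT + index)
    return env


# Eight TP1 instances, as in the topology used here.
envs = [nixl_instance_env(i, base_env={}) for i in range(8)]
ports = [int(e["VLLM_NIXL_SIDE_CHANNEL_PORT"]) for e in envs]
print(ports)
```

Each returned dict could be passed as `env=` to `subprocess.Popen` when starting an instance; how the real launcher spawns vLLM is not shown in this section.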
Result 1 — kv_both is expensive by itself, and only partly Mooncake's fault
Switching the vLLM launch from plain to kv_role=kv_both without
ever triggering PD-sep imposes a substrate tax. We compare the two
connectors available in vendored vLLM:
| metric | plain unified | unified_nixl_both | unified_kv_both (Mooncake) |
|---|---|---|---|
| TTFT p50 | 0.50 s | 0.51 s (+1%) | 0.50 s (+0%) |
| TTFT p90 | 7.35 s | 10.13 s (+38%) | 10.67 s (+45%) |
| TTFT p99 | 42.34 s | 44.58 s (+5%) | 45.19 s (+7%) |
| TPOT p90 | 17.1 ms | 19.8 ms (+16%) | 21.3 ms (+25%) |
| E2E p90 | 18.03 s | 21.18 s (+17%) | 22.89 s (+27%) |
| APC | 79.4% | 79.1% (−0.3 pp) | 78.3% (−1.1 pp) |
| hotspot index | 3.667 | 3.674 (+0.2%) | 4.363 (+19%) |
| interference index | n/a | 5.58 | 8.57 |
Reading the table from left to right gives a clean attribution:
- NIXL−plain = the v1-connector framework's irreducible cost
  (TTFT p90 +38%, TPOT p90 +16%, E2E p90 +17%). This is the cost
  any v1 KV connector imposes:
  - the 1 GB `kv_buffer_size` carved from `gpu-memory-utilization`, reducing the KV cache budget;
  - per-step `SchedulerOutput.kv_connector_metadata` serialization and round-trip through the connector worker;
  - altered block-lifecycle semantics in `kv_cache_manager` (`delay_free_blocks=True` is the default once any connector is loaded, slowing LRU eviction).
- Mooncake−NIXL = the Mooncake-implementation-specific extra
  (TTFT p90 +7 pp, TPOT p90 +9 pp, E2E p90 +10 pp, hotspot +19 pp).
  This is the cost Mooncake's design choices add on top of the
  generic framework:
  - per-scheduler-step `set(self._block_pool.cache.keys())` diff against `_known_hash_keys` (mooncake_connector.py:432-456) walks O(|cache|) on every step on every engine, costing ~4 M set operations per second on a 200 k-block cache;
  - the hash sync runs even when no `direct_read` consumer is present, so the cost is paid unconditionally;
  - block-lifecycle is further constrained because Mooncake requires `delay_free` until the explicit `finished_sending` arrives, vs NIXL which can release blocks earlier.
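The two-step subtraction can be reproduced from the rounded values in the Result 1 table. The sketch below is only arithmetic over those published numbers; `attribute` and `RESULTS` are illustrative names, and because the inputs are rounded the outputs can differ from the report's percentages by a few tenths.

```python
# (plain, NIXL kv_both, Mooncake kv_both), copied from the Result 1 table.
RESULTS = {
    "TTFT p90 (s)": (7.35, 10.13, 10.67),
    "TPOT p90 (ms)": (17.1, 19.8, 21.3),
    "E2E p90 (s)": (18.03, 21.18, 22.89),
    "hotspot index": (3.667, 3.674, 4.363),
}


def attribute(plain: float, nixl: float, mooncake: float):
    """Split the Mooncake overhead into framework and Mooncake-specific parts.

    Returns (framework %, Mooncake extra in percentage points, total %),
    all relative to the plain baseline.
    """
    framework_pct = 100.0 * (nixl - plain) / plain
    total_pct = 100.0 * (mooncake - plain) / plain
    return framework_pct, total_pct - framework_pct, total_pct


attribution = {name: attribute(*vals) for name, vals in RESULTS.items()}
for name, (framework, extra, total) in attribution.items():
    print(f"{name:14s} framework {framework:+6.1f}%  "
          f"Mooncake extra {extra:+6.1f} pp  total {total:+6.1f}%")
```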
The most striking gap is hotspot: Mooncake's per-step hash sync runs on the scheduler's GIL and disrupts the timeliness of routing decisions, amplifying load imbalance by 19%. NIXL has no equivalent global-state maintenance and preserves the plain-unified hotspot to within 0.2%.
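To make the per-step cost concrete, here is a toy version of the full-rebuild diff pattern named above. It is not the Mooncake source: `full_diff_sync` and the toy cache are hypothetical, and the step rate is only what the quoted figures (~4 M set operations per second on a 200 k-block cache) imply when divided.

```python
def full_diff_sync(cache: dict, known_keys: set):
    """Rebuild the key set and diff it against the last known set.

    Building `current` touches every cache entry, so the cost is
    O(|cache|) per call even when nothing changed.
    """
    current = set(cache.keys())
    added = current - known_keys
    removed = known_keys - current
    return added, removed, current


# Toy cache: five block hashes, two already known, one stale entry.
cache = {f"hash{i}": i for i in range(5)}
known = {"hash0", "hash1", "stale"}
added, removed, known = full_diff_sync(cache, known)
print(sorted(added), sorted(removed))

# Step rate implied by the figures quoted in the text.
CACHE_BLOCKS = 200_000
SET_OPS_PER_SECOND = 4_000_000
implied_steps_per_second = SET_OPS_PER_SECOND / CACHE_BLOCKS
print(implied_steps_per_second)
```

An event-driven design that records only inserted and evicted keys would cost O(changes) per step instead; whether that fits Mooncake's internals is outside what this section establishes.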
Practical implication: you don't enable any v1 KV connector for free, but if you have to enable one, NIXL is meaningfully cheaper than Mooncake. Even NIXL's 38% TTFT p90 tax is large enough that PD-sep needs to recover it on a non-trivial fraction of requests before being worth it.
Result 2 — PD-sep rarely fires on a real agentic trace
We log every routing decision's v2_reason (why we did or did not
PD-sep). Two runs with different gate thresholds:
| fall-through bucket | v2.0 strict | v2.1 relaxed | what it means |
|---|---|---|---|
| new_local < threshold | 1077 (88.7%) | 924 (76.1%) | uncached tail too small to justify transfer |
| chosen_no_active_decode | 115 (9.5%) | 229 (18.9%) | no decode on chosen to protect |
| src_cache_below_threshold | 14 (1.2%) | 36 (3.0%) | no alt instance holds enough cache |
| src_not_meaningfully_more_cache | 6 (0.5%) | 16 (1.3%) | alt instance doesn't help vs chosen |
| cost_benefit not enough margin | 0 | 4 (0.3%) | model says transfer cost + interference on src ≥ local interference |
| PD-sep TRIGGERED | 2 (0.16%) | 5 (0.41%) | passed all gates and cost-benefit favored migrate |
The dominant filter is new_local < threshold. Even with the
threshold dropped from 16 k to 8 k tokens, three out of four requests
have less than 8 k uncached tokens at the chosen worker. This is
structural: with intra-session reuse measured at 93% on the same
trace (window_1_results.md), most turns hit prefix cache on the
session's previous worker.
The second filter, chosen_no_active_decode, kills another fifth.
This is a snapshot-time phenomenon: at the moment the picker runs,
the chosen worker often has its previous request still in prefill,
not yet decoding. The gate's intent ("don't migrate if no decode is
being hurt by the prefill we're routing") is correct, but it ends up
suppressing PD-sep for a real situation where decode is about to
start.
Even after these two filters, the cost-benefit step itself rejects nearly half of remaining candidates (4 out of 9 in relaxed). So the final trigger rate of 0.41% is a structural property, not a parameter-tuning problem.
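The fall-through buckets suggest a gate cascade along the following lines. This is a hedged sketch, not the proxy's picker: the gate order is assumed to follow the table rows, the `Candidate` fields and every threshold except the 8 k / 16 k `new_local` values are hypothetical, and the real cost-benefit test may include terms not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    new_local: int              # uncached tokens at the chosen worker
    chosen_active_decodes: int  # decodes currently running on the chosen worker
    src_cached: int             # prompt tokens cached on the best alternative worker
    chosen_cached: int          # prompt tokens cached on the chosen worker
    cost_local_s: float         # modeled cost of prefilling locally
    cost_migrate_s: float       # modeled cost of prefilling on src and transferring


def v2_reason(c: Candidate, *,
              new_local_threshold: int = 8_000,   # relaxed gate; strict used 16 k
              src_cache_threshold: int = 4_000,   # hypothetical value
              min_extra_cache: int = 2_000,       # hypothetical value
              margin_s: float = 0.0) -> str:      # hypothetical value
    """Return the first gate that rejects the candidate, or the trigger label."""
    if c.new_local < new_local_threshold:
        return "new_local < threshold"
    if c.chosen_active_decodes == 0:
        return "chosen_no_active_decode"
    if c.src_cached < src_cache_threshold:
        return "src_cache_below_threshold"
    if c.src_cached - c.chosen_cached < min_extra_cache:
        return "src_not_meaningfully_more_cache"
    if c.cost_migrate_s + margin_s >= c.cost_local_s:
        return "cost_benefit not enough margin"
    return "PD-sep TRIGGERED"


# Illustrative candidate; the decode count and cache sizes are made up.
example = Candidate(new_local=8706, chosen_active_decodes=1,
                    src_cached=6656, chosen_cached=0,
                    cost_local_s=1.09, cost_migrate_s=0.73)
print(v2_reason(example))
print(v2_reason(example, new_local_threshold=16_000))
```

Returning the first failing gate is what makes the buckets mutually exclusive, which matches the way the per-run counts sum to 1214.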
Result 3 — when PD-sep fires, the cost model is wrong by 10–20×
The 5 PD-sep-triggered requests in v2.1 relaxed:
| input | new_local | new_src | src→dst | cost_local | cost_migrate (model) | actual TTFT | actual E2E |
|---|---|---|---|---|---|---|---|
| 21963 | 21963 | 9163 | 6→5 | 4.39 s | 4.17 s | 3.69 s | 8.48 s |
| 8706 | 8706 | 2050 | 5→7 | 1.09 s | 0.73 s | 12.48 s | 14.31 s |
| 13616 | 13616 | 2352 | 4→0 | 1.70 s | 1.03 s | 18.33 s | 19.50 s |
| 49483 | 49483 | 843 | 3→4 | 11.75 s | 2.16 s | 45.13 s | 53.55 s |
| 19806 | 19806 | 350 | 3→6 | 3.96 s | 1.06 s | 20.06 s | 31.98 s |
For four of the five requests the cost model predicts a migrate path
of 0.7–2.2 s, while the actual TTFT is 12–45 s; only the first row
(modeled 4.17 s, observed 3.69 s) lands near its prediction. The
model's `0.3 s + bytes / 2.7 GB/s` calibration captures pure RDMA bandwidth in
isolation but misses everything else that happens on the
decode_sent → first_token clock: D-side scheduler step latency,
block reservation before KV arrives (so D's cache pressure
increases for the entire wait), the per-layer scatter of
batch_transfer_sync_write, and the next-step scheduler promotion
after finished_recving. The E2 audit measured this end-to-end at
p50 = 1.1 s and p90 = 6.7 s on production runs; the v2.1
triggered requests landed in the p99 tail of that distribution
because their dst was already loaded.
The first-token clock for the 49 k request is 21× the model's prediction. This is not a small mis-tuning — it's a structurally different model.
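The size of the gap can be checked directly from the table of triggered requests. The sketch below restates the calibrated formula and divides observed TTFT by the modeled migrate cost. It assumes "2.7 GB/s" means 2.7e9 bytes per second; `modeled_migrate_cost_s` and `TRIGGERED` are illustrative names. The byte count per request is not given in this section, so the formula is shown but not applied to the rows.

```python
def modeled_migrate_cost_s(nbytes: float,
                           fixed_s: float = 0.3,
                           bandwidth_bytes_per_s: float = 2.7e9) -> float:
    """Calibrated transfer model: fixed overhead plus bytes over bandwidth."""
    return fixed_s + nbytes / bandwidth_bytes_per_s


# (input tokens, modeled cost_migrate in s, observed TTFT in s) per table row.
TRIGGERED = [
    (21963, 4.17, 3.69),
    (8706, 0.73, 12.48),
    (13616, 1.03, 18.33),
    (49483, 2.16, 45.13),
    (19806, 1.06, 20.06),
]

# Observed-to-modeled ratio, keyed by input length.
ratios = {tokens: observed / modeled for tokens, modeled, observed in TRIGGERED}
for tokens, ratio in ratios.items():
    print(f"{tokens:6d} tokens: observed / modeled = {ratio:5.1f}x")
```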
Result 4 — four-way comparison
The full table:
| metric | unified (plain) | unified_nixl_both | unified_kv_both (Mooncake) | unified_v2 (relaxed) |
|---|---|---|---|---|
| n_ok | 1214 | 1214 | 1214 | 1214 |
| TTFT p50 | 0.50 s | 0.51 s | 0.50 s | 0.49 s |
| TTFT p90 | 7.35 s | 10.13 s | 10.67 s | 10.98 s |
| TTFT p99 | 42.34 s | 44.58 s | 45.19 s | 49.45 s |
| TPOT p90 | 17.1 ms | 19.8 ms | 21.3 ms | 18.4 ms |
| E2E p90 | 18.03 s | 21.18 s | 22.89 s | 22.53 s |
| APC | 79.4% | 79.1% | 78.3% | 77.6% |
| interference index | n/a | 5.58 | 8.57 | 8.46 |
| hotspot index | 3.667 | 3.674 | 4.363 | 3.910 |
| n_slow | 189 | 192 | 198 | 198 |
v2 vs the kv_both control (the right comparison)
Compared to the kv_both control — same substrate, no PD-sep — the 5 PD-sep triggers in v2:
- slightly improve TPOT p90 (−14%) and hotspot (−10%)
- slightly worsen TTFT p90 (+3%) and TTFT p99 (+9%), because the triggered requests themselves take ~20× the predicted transfer time
The net effect against the kv_both control is in the noise. The hotspot improvement is within the run-to-run stochastic range we saw earlier (v2 strict run scored 2.733 hotspot under the same substrate; v2 relaxed scored 3.910).
v2 vs plain unified (the headline question)
unified_v2 is 25% slower on E2E p90 and 49% slower on TTFT
p90 than plain unified. Of the 49 pp of TTFT p90 inflation, 45 pp
comes from the kv_both substrate, not the routing decision; nothing
PD-sep does can recover this in our current Mooncake implementation.
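The relative changes in the two comparisons above can be recomputed from the four-way table. This is plain arithmetic over the rounded table values; `FOUR_WAY` and `pct_change` are illustrative names.

```python
# Values copied from the four-way table (Mooncake kv_both control, v2 relaxed).
FOUR_WAY = {
    "TTFT p90 (s)": {"plain": 7.35, "kv_both": 10.67, "v2": 10.98},
    "TTFT p99 (s)": {"plain": 42.34, "kv_both": 45.19, "v2": 49.45},
    "TPOT p90 (ms)": {"plain": 17.1, "kv_both": 21.3, "v2": 18.4},
    "E2E p90 (s)": {"plain": 18.03, "kv_both": 22.89, "v2": 22.53},
    "hotspot index": {"plain": 3.667, "kv_both": 4.363, "v2": 3.910},
}


def pct_change(new: float, base: float) -> float:
    """Percent change of `new` relative to `base`."""
    return 100.0 * (new - base) / base


deltas = {
    metric: {
        "vs_kv_both": pct_change(row["v2"], row["kv_both"]),
        "vs_plain": pct_change(row["v2"], row["plain"]),
    }
    for metric, row in FOUR_WAY.items()
}
for metric, d in deltas.items():
    print(f"{metric:14s} vs kv_both {d['vs_kv_both']:+6.1f}%   "
          f"vs plain {d['vs_plain']:+6.1f}%")
```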
Why v2's PD-sep is fundamentally choked
There are three independent structural problems, each by itself enough to make v2 not win:
- The kv_both substrate is the wrong default. It imposes an always-on 45% TTFT p90 tax. For selective PD-sep to beat plain unified, the interference saved per triggered request times the trigger rate must exceed that tax (treating the p90 inflation as a rough stand-in for the average). With a 0.41% trigger rate, even saving 100% of TTFT on every triggered request would recover only ~0.4%, which can't offset 45%.
- Agentic intra-session reuse leaves no headroom for migration. Most turns hit cache on the worker that handled the previous turn. Migrating prefill to a different worker is the exact thing intra-session affinity tries to avoid: it forces the new worker to pay for the cached prefix transfer instead of just reusing what's already on the affinity worker. This is a structural mismatch between PD-sep semantics ("send big prefills to a less-busy worker") and agentic workloads ("keep sessions sticky to wherever the cache is").
- The Mooncake mechanism is 10–20× slower than the cost model predicts, primarily due to D-side pre-allocation of KV blocks and the absence of layerwise pipelining (E2 audit §6.1 / §6.3). The cost model can be re-calibrated, but doing so would push the gate even tighter, dropping the already-tiny trigger rate to nearly zero.
The three are stacked: even if any two were fixed, the remaining one would still make PD-sep a net loss on this trace.
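The break-even argument in the first point can be written out numerically. The sketch makes the same simplification as the text, treating the 45% TTFT p90 inflation as if it applied to the average request, so it is an order-of-magnitude check and not a precise bound; the function names are illustrative.

```python
N_REQUESTS = 1214     # requests in the trace
N_TRIGGERED = 5       # PD-sep triggers in the relaxed run
SUBSTRATE_TAX = 0.45  # TTFT p90 inflation of Mooncake kv_both vs plain

trigger_rate = N_TRIGGERED / N_REQUESTS


def max_recoverable_fraction(rate: float, saving_per_trigger: float = 1.0) -> float:
    """Upper bound on the fraction of aggregate TTFT that PD-sep can win back."""
    return rate * saving_per_trigger


def breakeven_trigger_rate(tax: float, saving_per_trigger: float = 1.0) -> float:
    """Trigger rate needed for the savings to equal the always-on tax."""
    return tax / saving_per_trigger


best_case = max_recoverable_fraction(trigger_rate)   # every trigger saves 100%
required = breakeven_trigger_rate(SUBSTRATE_TAX)
shortfall = required / trigger_rate

print(f"trigger rate        {trigger_rate:.2%}")
print(f"best-case recovery  {best_case:.2%}")
print(f"break-even rate     {required:.0%}")
print(f"shortfall factor    {shortfall:.0f}x")
```

Under these assumptions the trigger rate would have to rise by roughly two orders of magnitude before the substrate tax could be repaid, even with perfect savings on each triggered request.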
What this section claims for the paper
- Same-worker prefill–decode interference is a real mechanism (B2 microbench), but agentic workloads rarely expose it: the typical request has high cache hit and small uncached tail, so the interference cost is bounded.
- Routing-only solutions (unified) already capture 79% of the intra-session APC ceiling and recover the latency by avoiding the heavy-tail sessions through the affinity gate. The remaining 23 pp gap to the ceiling is from APC LRU eviction under capacity pressure, not from prefill–decode interference.
- Per-request PD-sep via Mooncake on agentic workloads is not a net win in our measurements, even with a carefully-gated cost model. The combined effect of kv_both substrate overhead, low trigger rate, and mechanism-vs-model gap is uniformly negative.
- A productive direction is mechanism-level: fix the Mooncake D-side block reservation (E2 §6.1), implement layerwise transfer pipelining (E2 §6.3), and re-measure. Only if these patches drop the substrate tax to <10% and the realized transfer to ≤2 s p90 does PD-sep become competitive with routing on agentic traces.
What v2 still validates
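- The cost model's qualitative shape is correct: when it says "migrate", the chosen worker holds none of the prompt (new_local equals input on all five v2.1 triggers, with a modeled local cost of 1.1–11.8 s) and src already holds 58–98% of it (reading new_src as the tokens src would still need to prefill). The model picks the right candidate requests.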
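- The gate logic catches the right exclusions: in the relaxed run, 76% by uncached tail size, 19% by no-decode-to-protect, and the rest by missing source cache or insufficient cost-benefit margin (strict run: 89% and 9.5% for the first two). Each is a structurally correct reason.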
- The proxy shadow-drift fix is necessary infrastructure for any long-running routing experiment. We observed 3 phantom corrections per ~50-minute run.
Files
- `data/b3_policy_comparison.json`: the four policies' headline metrics from the same B3 sweep root.
- `data/breakdown_<policy>.json`: per-request proxy breakdown including v2 gate fields and triggered-event metadata.
- `data/per_worker_<policy>.json`: per-worker TTFT/latency p90s used in the hotspot figure.
- `figures/*.png`: the four section figures referenced above.
- `render_figures.py`: regenerates the figures from data/.
Cross-references
- `analysis/characterization/window_1_results.md`: B2 microbench (same-worker interference causal proof) and B3 baseline 5-policy sweep
- `analysis/characterization/agentic_dispatch_coupling.md`: why the saturated-replay setup matches agentic production
- `analysis/characterization/b3_policies_pseudocode.md`: pickers for the five baseline policies; `unified_v2` extends `unified`
- E1 / E2 subagent reports (commit `4b833d3` message and the conversation log): full mechanism audit that informed v2's design




