Gahow Wang dc6d24d1ca Add NIXL substrate isolation control + attribution decomposition
Adds unified_nixl_both to elastic_migration_v2: same picker as
unified_kv_both (never triggers PD-sep), but launches vLLM with
NixlConnector instead of MooncakeConnector. Compared against plain
unified and unified_kv_both (Mooncake) we can now attribute the
substrate overhead between "v1 connector framework irreducible
cost" (proxied by the leaner NIXL) and "Mooncake implementation
extra" (Mooncake - NIXL).

Result (vs plain unified, both substrates never PD-sep):

   metric          plain    NIXL          Mooncake
   TTFT p90        7.35s    +37.9%        +45.3%      (NIXL: +7pp better)
   TPOT p90        17.1ms   +15.5%        +24.5%      (NIXL: +9pp better)
   E2E p90         18.03s   +17.4%        +27.0%      (NIXL: +10pp better)
   hotspot         3.667    +0.2%         +19.0%      (NIXL: keeps it flat)
   APC             79.4%    -0.3pp        -1.1pp
   interference    -        5.58          8.57         (NIXL: ~35% lower)

The cleanest signal is hotspot: NIXL preserves plain-unified's
distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step
O(|cache|) `set(self._block_pool.cache.keys())` diff against
_known_hash_keys (mooncake_connector.py:432-456) inflates routing
imbalance by 19%. The hash sync runs unconditionally even when no
direct_read consumer is present.

Attribution: NIXL-plain ~= v1 framework irreducible cost (kv_buffer
GPU memory, per-step SchedulerOutput.kv_connector_metadata
round-trip, altered kv_cache_manager block-lifecycle). Mooncake-NIXL
~= Mooncake-specific overhead (the hash-sync loop and stricter
delay_free semantics).

Practical implication: NIXL is meaningfully better than Mooncake on
this stack, but even NIXL imposes 16-38% across metrics — too
expensive for selective-PD-sep on agentic workloads where the
trigger rate is < 0.5%.

Launch fixes required for NIXL multi-instance:
- VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default
  5600; we use 5600+i). Without this, 7 of 8 instances silently hang
  in `zmq.error.ZMQError: Address already in use` and the launcher
  trap kills all of them at health-check timeout.
- Health-check timeout raised from 180s to 360s; NIXL initialization
  (UCX agent + memory registration) is ~100-150s per instance under
  8-way concurrent load, vs Mooncake's ~30-60s.

New figure: fig_connector_substrate_attribution.png stacks plain /
framework / Mooncake-extra / v2-branch overhead per metric.
Existing figures (fig_kv_both_overhead, fig_three_way_hotspot)
updated to include NIXL as a fourth bar.

README updated with 4-way table, Result 1 reframed as "the cost is
mostly framework, not Mooncake — but Mooncake adds the hotspot
penalty", and the substrate-vs-PD-sep tradeoff math.

Refs: nixl_connector.py:700 handshake listener bind, factory.py
register_connector for the NixlConnector entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 16:02:12 +08:00

Elastic Migration v2: Selective PD-Separation via Mooncake

Date: 2026-05-26
Trace: traces/w600_r0.0015_st30.jsonl (1214 reqs, 274 sessions, 53.3 M tokens)
Model: Qwen3-Coder-30B-A3B-Instruct, 8 × TP1 on H20

TL;DR

This section explores whether the B2-confirmed same-worker prefill-decode interference can be relieved by selectively migrating prefill to a different worker for the requests where the interference cost would dominate the transfer cost. We implement two flavors of the routing policy (strict gates, then relaxed gates) and two isolation controls that use the unified picker but launch vLLM with kv_role=kv_both, so the connector substrate is on but PD-sep never triggers:

  • unified_kv_both: with MooncakeConnector
  • unified_nixl_both: with NixlConnector (NVIDIA's official v1 connector; isolates connector implementation from policy)

Four findings:

  1. kv_role=kv_both imposes a substantial always-on tax even when no PD-sep ever fires: with Mooncake it's TTFT p90 +45%, TPOT p90 +25%, hotspot +19%; with NIXL it's TTFT p90 +38%, TPOT p90 +16%, hotspot +0.2%.
  2. Most of the substrate latency cost (roughly 60-85% of the Mooncake tax on the p90 metrics) is generic v1-connector framework overhead, proxied by NIXL since it's the leanest implementation: KV buffer GPU memory cut from the model's working budget, SchedulerOutput.kv_connector_metadata round-trip, and altered kv_cache_manager block-lifecycle semantics. NIXL is meaningfully better than Mooncake but still imposes a 16-38% tax vs no connector.
  3. PD-sep almost never triggers on a real agentic workload: 0.16% with strict gates, 0.41% with relaxed gates. Agentic workloads have 93% intra-session reuse, so most requests land on workers that already hold cache — the uncached tail is too small to be worth migrating.
  4. When PD-sep does fire, the cost model is wrong by ~10-20×: the calibrated 0.3 s + bytes / 2.7 GB/s predicts a 1-2 s migrate cost; observed TTFT on triggered requests is 12-45 s.

The net latency of unified_v2 is not better than plain unified under either the Mooncake or the NIXL substrate. Improving agentic PD-sep requires (a) using the leaner connector (NIXL beats Mooncake by 7-19 pp on the p90 metrics and hotspot), and (b) fixing the underlying transfer mechanism (E2 patches 6.1 lazy block reservation and 6.3 layerwise pipelining), not just the routing decision.

Substrate

We compare four policies on identical traces:

| policy | picker | vLLM launch mode | purpose |
|---|---|---|---|
| unified | hybrid affinity + LMetric | plain (no connector) | the headline baseline |
| unified_kv_both | same as unified | MooncakeConnector + kv_both | substrate control: Mooncake cost without PD-sep |
| unified_nixl_both | same as unified | NixlConnector + kv_both | substrate control: NIXL cost without PD-sep, attributes overhead to "framework vs Mooncake" |
| unified_v2 | unified + selective PD-sep | MooncakeConnector + kv_both + bootstrap | the actual experiment |

All four use the same trace, the same 8-instance topology, and the same shadow-drift-corrected proxy (scripts/cache_aware_proxy.py post-fix 95c8ef8). Plain unified was rerun on the patched proxy (b3_sweep_20260525_095043/unified) under the same conditions.

NIXL required two launch fixes beyond Mooncake:

  • VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default 5600 → 5600..5607); otherwise instances 2..8 silently hang in zmq.error.ZMQError: Address already in use.
  • Health-check timeout had to be raised from 180 s to 360 s because NIXL initialization (UCX agent + memory registration) takes ~100-150 s per instance under 8-way concurrent launch.
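
The two launch fixes can be sketched as follows. This is a minimal illustration, not the launcher used for these runs: only the VLLM_NIXL_SIDE_CHANNEL_PORT variable, the 5600 base port, and the 360 s timeout come from the text above. The helper names nixl_instance_envs and wait_until_healthy, and the injected probe callable, are hypothetical.

```python
import os
import time
from typing import Callable, Dict, List, Optional

NIXL_BASE_PORT = 5600      # default side-channel port quoted above
HEALTH_TIMEOUT_S = 360.0   # raised from 180 s for NIXL initialization


def nixl_instance_envs(n_instances: int,
                       base_port: int = NIXL_BASE_PORT,
                       base_env: Optional[Dict[str, str]] = None) -> List[Dict[str, str]]:
    """Build one environment per instance, each with its own side-channel port."""
    parent = dict(os.environ) if base_env is None else dict(base_env)
    envs = []
    for i in range(n_instances):
        env = dict(parent)
        # Instance i listens on base_port + i, so no two instances share a port.
        env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(base_port + i)
        envs.append(env)
    return envs


def wait_until_healthy(probe: Callable[[], bool],
                       timeout_s: float = HEALTH_TIMEOUT_S,
                       poll_s: float = 2.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll a caller-supplied health probe until it succeeds or the timeout expires."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(poll_s)
    return False
```

Each returned dict would be passed as the env argument when spawning the corresponding instance process; the probe is whatever health check the launcher already performs.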

Result 1 — kv_both is expensive by itself, and only partly Mooncake's fault

Switching the vLLM launch from plain to kv_role=kv_both without ever triggering PD-sep imposes a substrate tax. We compare the two connectors available in vendored vLLM:

| metric | plain unified | unified_nixl_both | unified_kv_both (Mooncake) |
|---|---|---|---|
| TTFT p50 | 0.50 s | 0.51 s (+1%) | 0.50 s (+0%) |
| TTFT p90 | 7.35 s | 10.13 s (+38%) | 10.67 s (+45%) |
| TTFT p99 | 42.34 s | 44.58 s (+5%) | 45.19 s (+7%) |
| TPOT p90 | 17.1 ms | 19.8 ms (+16%) | 21.3 ms (+25%) |
| E2E p90 | 18.03 s | 21.18 s (+17%) | 22.89 s (+27%) |
| APC | 79.4% | 79.1% (-0.3 pp) | 78.3% (-1.1 pp) |
| hotspot index | 3.667 | 3.674 (+0.2%) | 4.363 (+19%) |
| interference index | n/a | 5.58 | 8.57 |

Reading the table from left to right gives a clean attribution:

  • NIXL - plain = the v1-connector framework's irreducible cost (TTFT p90 +38%, TPOT p90 +16%, E2E p90 +17%). This is the cost any v1 KV connector imposes:
    • the 1 GB kv_buffer_size carved from gpu-memory-utilization, reducing the KV cache budget;
    • per-step SchedulerOutput.kv_connector_metadata serialization and round-trip through the connector worker;
    • altered block-lifecycle semantics in kv_cache_manager (delay_free_blocks=True is the default once any connector is loaded, slowing LRU eviction).
  • Mooncake - NIXL = the Mooncake-implementation-specific extra (TTFT p90 +7 pp, TPOT p90 +9 pp, E2E p90 +10 pp, hotspot +19 pp). This is the cost Mooncake's design choices add on top of the generic framework:
    • per-scheduler-step set(self._block_pool.cache.keys()) diff against _known_hash_keys (mooncake_connector.py:432-456) walks O(|cache|) on every step on every engine, costing ~4 M set operations per second on a 200 k-block cache;
    • the hash sync runs even when no direct_read consumer is present, so the cost is paid unconditionally;
    • block-lifecycle is further constrained because Mooncake requires delay_free until the explicit finished_sending arrives, vs NIXL which can release blocks earlier.
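
The decomposition above is plain arithmetic on the Result 1 table, and can be reproduced with a short sketch. The metric values are copied from that table; the function names and the output field names (framework_pct, mooncake_extra_pp, framework_share) are illustrative and not taken from the repository.

```python
from typing import Dict

# Values copied from the Result 1 table.
PLAIN = {"TTFT p90": 7.35, "TPOT p90": 17.1, "E2E p90": 18.03, "hotspot": 3.667}
NIXL = {"TTFT p90": 10.13, "TPOT p90": 19.8, "E2E p90": 21.18, "hotspot": 3.674}
MOONCAKE = {"TTFT p90": 10.67, "TPOT p90": 21.3, "E2E p90": 22.89, "hotspot": 4.363}


def pct_over(value: float, baseline: float) -> float:
    """Relative increase of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


def attribute(plain: Dict[str, float],
              nixl: Dict[str, float],
              mooncake: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Split the Mooncake overhead into a framework part and a Mooncake-only part."""
    out = {}
    for metric, base in plain.items():
        framework = pct_over(nixl[metric], base)   # NIXL - plain
        total = pct_over(mooncake[metric], base)   # Mooncake - plain
        out[metric] = {
            "framework_pct": framework,
            "mooncake_extra_pp": total - framework,  # Mooncake - NIXL, in points
            "total_pct": total,
            "framework_share": framework / total,
        }
    return out


if __name__ == "__main__":
    for metric, parts in attribute(PLAIN, NIXL, MOONCAKE).items():
        print(metric, {k: round(v, 2) for k, v in parts.items()})
```

The sketch treats NIXL as a proxy for the framework cost, which is the same assumption the attribution itself makes; it does not prove that NIXL adds no implementation-specific cost of its own.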

The most striking gap is hotspot: Mooncake's per-step hash sync runs on the scheduler's GIL and disrupts the timeliness of routing decisions, amplifying load imbalance by 19%. NIXL has no equivalent global-state maintenance and preserves the plain-unified hotspot to within 0.2%.
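
To make the cost difference concrete, here is a simplified sketch contrasting a per-step full diff with event-driven tracking. It is not the Mooncake code: only the expression set(self._block_pool.cache.keys()) and the name _known_hash_keys come from the source. The IncrementalSync class and its on_insert / on_evict hooks are hypothetical, and whether the block pool can emit such events is an assumption.

```python
from typing import Hashable, Iterable, Set, Tuple


def full_diff_sync(cache_keys: Iterable[Hashable],
                   known: Set[Hashable]) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Full scan: rebuild the key set and diff it against the last snapshot.

    Costs O(|cache|) on every call, even when nothing changed.
    """
    current = set(cache_keys)
    added = current - known
    removed = known - current
    known.clear()
    known.update(current)
    return added, removed


class IncrementalSync:
    """Hypothetical alternative: record changes as they happen, O(changes) per step."""

    def __init__(self) -> None:
        self._added: Set[Hashable] = set()
        self._removed: Set[Hashable] = set()

    def on_insert(self, key: Hashable) -> None:
        # Re-inserting a key evicted earlier in the same step cancels the removal.
        if key in self._removed:
            self._removed.discard(key)
        else:
            self._added.add(key)

    def on_evict(self, key: Hashable) -> None:
        # Evicting a key inserted earlier in the same step cancels the addition.
        if key in self._added:
            self._added.discard(key)
        else:
            self._removed.add(key)

    def drain(self) -> Tuple[Set[Hashable], Set[Hashable]]:
        """Return the net changes since the last drain and reset the buffers."""
        added, removed = self._added, self._removed
        self._added, self._removed = set(), set()
        return added, removed
```

Both variants report the same net change per step; the difference is that the second does no work on steps where the cache did not change, and could be skipped entirely when no direct_read consumer is registered.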

Practical implication: you don't enable any v1 KV connector for free, but if you have to enable one, NIXL is meaningfully cheaper than Mooncake. Even NIXL's 38% TTFT p90 tax is large enough that PD-sep needs to recover it on a non-trivial fraction of requests before being worth it.

Result 2 — PD-sep rarely fires on a real agentic trace

We log every routing decision's v2_reason (why we did or did not PD-sep). Two runs with different gate thresholds:

| fall-through bucket | v2.0 strict | v2.1 relaxed | what it means |
|---|---|---|---|
| new_local < threshold | 1077 (88.7%) | 924 (76.1%) | uncached tail too small to justify transfer |
| chosen_no_active_decode | 115 (9.5%) | 229 (18.9%) | no decode on chosen to protect |
| src_cache_below_threshold | 14 (1.2%) | 36 (3.0%) | no alt instance holds enough cache |
| src_not_meaningfully_more_cache | 6 (0.5%) | 16 (1.3%) | alt instance doesn't help vs chosen |
| cost_benefit not enough margin | 0 | 4 (0.3%) | model says transfer cost + interference on src ≥ local interference |
| PD-sep TRIGGERED | 2 (0.16%) | 5 (0.41%) | passed all gates and cost-benefit favored migrate |

The dominant filter is new_local < threshold. Even with the threshold dropped from 16 k to 8 k tokens, three out of four requests have less than 8 k uncached tokens at the chosen worker. This is structural: with intra-session reuse measured at 93% on the same trace (window_1_results.md), most turns hit prefix cache on the session's previous worker.

The second filter, chosen_no_active_decode, kills another fifth. This is a snapshot-time phenomenon: at the moment the picker runs, the chosen worker often has its previous request still in prefill, not yet decoding. The gate's intent ("don't migrate if no decode is being hurt by the prefill we're routing") is correct, but it ends up suppressing PD-sep for a real situation where decode is about to start.

Even after these two filters, the cost-benefit step itself rejects nearly half of remaining candidates (4 out of 9 in relaxed). So the final trigger rate of 0.41% is a structural property, not a parameter-tuning problem.
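
The fall-through structure can be sketched as a gate cascade that returns the first failing reason. This is an illustration under stated assumptions, not the picker's code: the reason strings and the 8 k / 16 k new_local thresholds come from this section, while the Candidate and Gates types, the gate order (taken from the table order), and the min_src_cached, min_src_advantage, and margin_s defaults are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    new_local: int              # uncached tokens on the chosen worker
    chosen_active_decodes: int  # decodes currently running on the chosen worker
    src_cached_tokens: int      # cached prefix tokens on the best alternative worker
    chosen_cached_tokens: int   # cached prefix tokens on the chosen worker
    cost_local_s: float         # modeled cost of prefilling locally
    cost_migrate_s: float       # modeled cost of the migrate path


@dataclass(frozen=True)
class Gates:
    min_new_local: int = 8_000      # relaxed run; the strict run used 16_000
    min_src_cached: int = 4_000     # hypothetical value
    min_src_advantage: int = 2_000  # hypothetical value
    margin_s: float = 0.0           # hypothetical value


def v2_reason(c: Candidate, gates: Optional[Gates] = None) -> str:
    """Return the first gate that blocks PD-sep, or the triggered marker."""
    g = gates if gates is not None else Gates()
    if c.new_local < g.min_new_local:
        return "new_local < threshold"
    if c.chosen_active_decodes == 0:
        return "chosen_no_active_decode"
    if c.src_cached_tokens < g.min_src_cached:
        return "src_cache_below_threshold"
    if c.src_cached_tokens - c.chosen_cached_tokens < g.min_src_advantage:
        return "src_not_meaningfully_more_cache"
    if c.cost_migrate_s + g.margin_s >= c.cost_local_s:
        return "cost_benefit not enough margin"
    return "PD-sep TRIGGERED"
```

Because each request is counted under exactly one reason, the bucket counts in the table sum to the 1214 requests in the trace. Reordering the gates would shift counts between buckets but not change the trigger count.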

Result 3 — when PD-sep fires, the cost model is wrong by 10-20×

The 5 PD-sep-triggered requests in v2.1 relaxed:

| input | new_local | new_src | src→dst | cost_local | cost_migrate (model) | actual TTFT | actual E2E |
|---|---|---|---|---|---|---|---|
| 21963 | 21963 | 9163 | 6→5 | 4.39 s | 4.17 s | 3.69 s | 8.48 s |
| 8706 | 8706 | 2050 | 5→7 | 1.09 s | 0.73 s | 12.48 s | 14.31 s |
| 13616 | 13616 | 2352 | 4→0 | 1.70 s | 1.03 s | 18.33 s | 19.50 s |
| 49483 | 49483 | 843 | 3→4 | 11.75 s | 2.16 s | 45.13 s | 53.55 s |
| 19806 | 19806 | 350 | 3→6 | 3.96 s | 1.06 s | 20.06 s | 31.98 s |

Excluding the first request, where prediction and observation roughly agree (4.17 s vs 3.69 s), the cost model predicts the migrate path will take 0.7-2.2 s; the actual TTFT on these requests is 12-45 s. The model's 0.3 s + bytes / 2.7 GB/s calibration captures pure RDMA bandwidth in isolation but misses everything else that happens on the decode_sent → first_token clock: D-side scheduler step latency, block reservation before KV arrives (so D's cache pressure increases for the entire wait), the per-layer scatter of batch_transfer_sync_write, and the next-step scheduler promotion after finished_recving. The E2 audit measured this end-to-end at p50 = 1.1 s and p90 = 6.7 s on production runs; the v2.1 triggered requests landed in the p99 tail of that distribution because their dst was already loaded.

The first-token clock for the 49 k request is 21× the model's prediction. This is not a small mis-tuning — it's a structurally different model.
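
The size of the gap can be checked directly from the table of triggered requests. In the sketch below, the rows and the 0.3 s / 2.7 GB/s constants are taken from this section; the function names are illustrative, and reading GB as 10^9 bytes is an assumption. The table's cost_migrate column is the model's full migrate estimate, which may include more than the transfer term.

```python
from typing import List, Tuple

FIXED_S = 0.3          # fixed per-transfer latency from the calibration
BANDWIDTH_BPS = 2.7e9  # 2.7 GB/s, assuming 1 GB = 1e9 bytes


def predicted_transfer_s(n_bytes: float) -> float:
    """Calibrated transfer term: fixed latency plus bytes over bandwidth."""
    return FIXED_S + n_bytes / BANDWIDTH_BPS


# (input tokens, cost_migrate from the model in s, observed TTFT in s)
TRIGGERED: List[Tuple[int, float, float]] = [
    (21963, 4.17, 3.69),
    (8706, 0.73, 12.48),
    (13616, 1.03, 18.33),
    (49483, 2.16, 45.13),
    (19806, 1.06, 20.06),
]


def underestimate_factors(rows: List[Tuple[int, float, float]]) -> List[Tuple[int, float]]:
    """Observed TTFT divided by modeled migrate cost, per triggered request."""
    return [(tokens, ttft / cost) for tokens, cost, ttft in rows]


if __name__ == "__main__":
    for tokens, factor in underestimate_factors(TRIGGERED):
        print(f"{tokens:>6} tokens: observed / predicted = {factor:.1f}x")
```

On these five rows the ratio is about 0.9× for the first request and between roughly 17× and 21× for the other four, so a single multiplicative recalibration would not fit all of them.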

Result 4 — four-way comparison

The full table:

| metric | unified (plain) | unified_nixl_both | unified_kv_both (Mooncake) | unified_v2 (relaxed) |
|---|---|---|---|---|
| n_ok | 1214 | 1214 | 1214 | 1214 |
| TTFT p50 | 0.50 s | 0.51 s | 0.50 s | 0.49 s |
| TTFT p90 | 7.35 s | 10.13 s | 10.67 s | 10.98 s |
| TTFT p99 | 42.34 s | 44.58 s | 45.19 s | 49.45 s |
| TPOT p90 | 17.1 ms | 19.8 ms | 21.3 ms | 18.4 ms |
| E2E p90 | 18.03 s | 21.18 s | 22.89 s | 22.53 s |
| APC | 79.4% | 79.1% | 78.3% | 77.6% |
| interference index | n/a | 5.58 | 8.57 | 8.46 |
| hotspot index | 3.667 | 3.674 | 4.363 | 3.910 |
| n_slow | 189 | 192 | 198 | 198 |

v2 vs the kv_both control (the right comparison)

Compared to the kv_both control — same substrate, no PD-sep — the 5 PD-sep triggers in v2:

  • slightly improve TPOT p90 (-14%) and hotspot (-10%)
  • slightly worsen TTFT p90 (+3%) and TTFT p99 (+9%), because the triggered requests themselves take ~20× the predicted transfer time

The net effect against the kv_both control is in the noise. The hotspot improvement is within the run-to-run stochastic range we saw earlier (v2 strict run scored 2.733 hotspot under the same substrate; v2 relaxed scored 3.910).

v2 vs plain unified (the headline question)

unified_v2 is 25% slower on E2E p90 (22.53 s vs 18.03 s) and 49% slower on TTFT p90 (10.98 s vs 7.35 s) than plain unified. The 45 pp of TTFT p90 inflation is from the kv_both substrate, not the routing decision; nothing PD-sep does can recover this in our current Mooncake implementation.

Why v2's PD-sep is fundamentally choked

There are three independent structural problems, each by itself enough to make v2 not win:

  1. The kv_both substrate is the wrong default. It pays a 45% TTFT p90 tax on every request. To make selective PD-sep beat plain unified, the saved interference per triggered request times the trigger rate must exceed 45% × average TTFT, on average. With 0.41% trigger rate, even saving 100% of TTFT per triggered request would only save ~0.4%, which can't recover 45%.

  2. Agentic intra-session reuse leaves no headroom for migration. Most turns hit cache on the worker that handled the previous turn. Migrating prefill to a different worker is the exact thing intra-session affinity tries to avoid: it forces the new worker to pay for the cached prefix transfer instead of just reusing what's already on the affinity worker. This is a structural mismatch between PD-sep semantics ("send big prefills to a less-busy worker") and agentic workloads ("keep sessions sticky to wherever the cache is").

  3. The Mooncake mechanism is 10-20× slower than the cost model predicts, primarily due to D-side pre-allocation of KV blocks and the absence of layerwise pipelining (E2 audit §6.1 / §6.3). The cost model can be re-calibrated, but doing so would push the gate even tighter, dropping the already-tiny trigger rate to nearly zero.

The three are stacked: even if any two were fixed, the remaining one would still make PD-sep a net loss on this trace.
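
The break-even argument in the first point reduces to two lines of arithmetic. The sketch below is a back-of-envelope bound, not a latency model: it follows the text in treating the 45% TTFT p90 tax as if it applied to average TTFT, and the function names are illustrative.

```python
def max_recoverable_fraction(trigger_rate: float,
                             saving_per_trigger: float = 1.0) -> float:
    """Upper bound on the fraction of total TTFT that PD-sep can win back.

    saving_per_trigger is the fraction of a triggered request's TTFT that is
    saved; 1.0 is the optimistic limit used in the text.
    """
    if not 0.0 <= trigger_rate <= 1.0:
        raise ValueError("trigger_rate must be in [0, 1]")
    return trigger_rate * saving_per_trigger


def required_trigger_rate(substrate_tax: float,
                          saving_per_trigger: float = 1.0) -> float:
    """Trigger rate needed for the savings to cancel the always-on substrate tax."""
    if saving_per_trigger <= 0.0:
        raise ValueError("saving_per_trigger must be positive")
    return substrate_tax / saving_per_trigger


if __name__ == "__main__":
    observed = 0.0041  # relaxed-gate trigger rate
    tax = 0.45         # Mooncake kv_both TTFT p90 tax
    print(f"recoverable at best: {max_recoverable_fraction(observed):.2%}")
    print(f"trigger rate needed: {required_trigger_rate(tax):.0%}")
    print(f"shortfall factor:    {required_trigger_rate(tax) / observed:.0f}x")
```

Under these assumptions the observed trigger rate is about two orders of magnitude below what would be needed, even in the limit where every triggered request saves its entire TTFT.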

What this section claims for the paper

  1. Same-worker prefill-decode interference is a real mechanism (B2 microbench), but agentic workloads rarely expose it: the typical request has high cache hit and small uncached tail, so the interference cost is bounded.
  2. Routing-only solutions (unified) already capture 79% of the intra-session APC ceiling and recover the latency by avoiding the heavy-tail sessions through the affinity gate. The remaining 23 pp gap to the ceiling is from APC LRU eviction under capacity pressure, not from prefill-decode interference.
  3. Per-request PD-sep via Mooncake on agentic workloads is not a net win in our measurements, even with a carefully-gated cost model. The combined effect of kv_both substrate overhead, low trigger rate, and mechanism-vs-model gap is uniformly negative.
  4. A productive direction is mechanism-level: fix the Mooncake D-side block reservation (E2 §6.1), implement layerwise transfer pipelining (E2 §6.3), and re-measure. Only if these patches drop the substrate tax to <10% and the realized transfer to ≤2 s p90 does PD-sep become competitive with routing on agentic traces.

What v2 still validates

  • The cost model's qualitative shape is correct: when it says "migrate", that's a request where local interference would have been ≥ 4 s and src has ≥ 80% prefix cache. The model picks the right candidate requests.
  • The gate logic catches the right exclusions: 88.7% (strict) / 76.1% (relaxed) by uncached tail size, 9.5% / 18.9% by no-decode-to-protect, and the rest by missing source cache or insufficient cost-benefit margin. Each is a structurally correct reason.
  • The proxy shadow-drift fix is necessary infrastructure for any long-running routing experiment. We observed 3 phantom corrections per ~50-minute run.

Files

  • data/b3_policy_comparison.json — the four policies' headline metrics from the same B3 sweep root.
  • data/breakdown_<policy>.json — per-request proxy breakdown including v2 gate fields and triggered-event metadata.
  • data/per_worker_<policy>.json — per-worker TTFT/latency p90s used in the hotspot figure.
  • figures/*.png — the four section figures referenced above.
  • render_figures.py — regenerates the figures from data/.

Cross-references

  • analysis/characterization/window_1_results.md — B2 microbench (same-worker interference causal proof) and B3 baseline 5-policy sweep
  • analysis/characterization/agentic_dispatch_coupling.md — why the saturated-replay setup matches agentic production
  • analysis/characterization/b3_policies_pseudocode.md — pickers for the five baseline policies; unified_v2 extends unified
  • E1 / E2 subagent reports (commit 4b833d3 message and the conversation log) — full mechanism audit that informed v2's design