Adds unified_nixl_both to elastic_migration_v2: same picker as
unified_kv_both (never triggers PD-sep), but launches vLLM with
NixlConnector instead of MooncakeConnector. Compared against plain
unified and unified_kv_both (Mooncake) we can now attribute the
substrate overhead between "v1 connector framework irreducible
cost" (proxied by the leaner NIXL) and "Mooncake implementation
extra" (Mooncake - NIXL).
Result (vs plain unified, both substrates never PD-sep):
metric plain NIXL Mooncake
TTFT p90 7.35s +37.9% +45.3% (NIXL: +7pp better)
TPOT p90 17.1ms +15.5% +24.5% (NIXL: +9pp better)
E2E p90 18.03s +17.4% +27.0% (NIXL: +10pp better)
hotspot 3.667 +0.2% +19.0% (NIXL: keeps it flat)
APC 79.4% -0.3pp -1.1pp
interference - 5.58 8.57 (NIXL: ~35% lower)
The cleanest signal is hotspot: NIXL preserves plain-unified's
distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step
O(|cache|) `set(self._block_pool.cache.keys())` diff against
_known_hash_keys (mooncake_connector.py:432-456) inflates routing
imbalance by 19%. The hash sync runs unconditionally even when no
direct_read consumer is present.
Attribution: NIXL-plain ~= v1 framework irreducible cost (kv_buffer
GPU memory, per-step SchedulerOutput.kv_connector_metadata
round-trip, altered kv_cache_manager block-lifecycle). Mooncake-NIXL
~= Mooncake-specific overhead (the hash-sync loop and stricter
delay_free semantics).
Practical implication: NIXL is meaningfully better than Mooncake on
this stack, but even NIXL imposes 16-38% across metrics — too
expensive for selective-PD-sep on agentic workloads where the
trigger rate is < 0.5%.
Launch fixes required for NIXL multi-instance:
- VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default
5600; we use 5600+i). Without this, 7 of 8 instances silently hang
in `zmq.error.ZMQError: Address already in use` and the launcher
trap kills all of them at health-check timeout.
- Health-check timeout raised from 180s to 360s; NIXL initialization
(UCX agent + memory registration) is ~100-150s per instance under
8-way concurrent load, vs Mooncake's ~30-60s.
New figure: fig_connector_substrate_attribution.png stacks plain /
framework / Mooncake-extra / v2-branch overhead per metric.
Existing figures (fig_kv_both_overhead, fig_three_way_hotspot)
updated to include NIXL as a fourth bar.
README updated with 4-way table, Result 1 reframed as "the cost is
mostly framework, not Mooncake — but Mooncake adds the hotspot
penalty", and the substrate-vs-PD-sep tradeoff math.
Refs: nixl_connector.py:700 handshake listener bind, factory.py
register_connector for the NixlConnector entry.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>