Add NIXL substrate isolation control + attribution decomposition

Adds unified_nixl_both to elastic_migration_v2: same picker as
unified_kv_both (never triggers PD-sep), but launches vLLM with
NixlConnector instead of MooncakeConnector. Compared against plain
unified and unified_kv_both (Mooncake) we can now attribute the
substrate overhead between "v1 connector framework irreducible
cost" (proxied by the leaner NIXL) and "Mooncake implementation
extra" (Mooncake - NIXL).

Result (vs plain unified, both substrates never PD-sep):

   metric          plain    NIXL          Mooncake
   TTFT p90        7.35s    +37.9%        +45.3%      (NIXL: +7pp better)
   TPOT p90        17.1ms   +15.5%        +24.5%      (NIXL: +9pp better)
   E2E p90         18.03s   +17.4%        +27.0%      (NIXL: +10pp better)
   hotspot         3.667    +0.2%         +19.0%      (NIXL: keeps it flat)
   APC             79.4%    -0.3pp        -1.1pp
   interference    -        5.58          8.57         (NIXL: ~35% lower)

The cleanest signal is hotspot: NIXL preserves plain-unified's
distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step
O(|cache|) `set(self._block_pool.cache.keys())` diff against
_known_hash_keys (mooncake_connector.py:432-456) inflates routing
imbalance by 19%. The hash sync runs unconditionally even when no
direct_read consumer is present.

Attribution: NIXL-plain ~= v1 framework irreducible cost (kv_buffer
GPU memory, per-step SchedulerOutput.kv_connector_metadata
round-trip, altered kv_cache_manager block-lifecycle). Mooncake-NIXL
~= Mooncake-specific overhead (the hash-sync loop and stricter
delay_free semantics).

Practical implication: NIXL is meaningfully better than Mooncake on
this stack, but even NIXL imposes 16-38% across metrics — too
expensive for selective-PD-sep on agentic workloads where the
trigger rate is < 0.5%.

Launch fixes required for NIXL multi-instance:
- VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default
  5600; we use 5600+i). Without this, 7 of 8 instances silently hang
  in `zmq.error.ZMQError: Address already in use` and the launcher
  trap kills all of them at health-check timeout.
- Health-check timeout raised from 180s to 360s; NIXL initialization
  (UCX agent + memory registration) is ~100-150s per instance under
  8-way concurrent load, vs Mooncake's ~30-60s.

New figure: fig_connector_substrate_attribution.png stacks plain /
framework / Mooncake-extra / v2-branch overhead per metric.
Existing figures (fig_kv_both_overhead, fig_three_way_hotspot)
updated to include NIXL as a fourth bar.

README updated with 4-way table, Result 1 reframed as "the cost is
mostly framework, not Mooncake — but Mooncake adds the hotspot
penalty", and the substrate-vs-PD-sep tradeoff math.

Refs: nixl_connector.py:700 handshake listener bind, factory.py
register_connector for the NixlConnector entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-26 16:02:12 +08:00
parent 645b067dd4
commit dc6d24d1ca
8 changed files with 235 additions and 83 deletions

View File

@@ -9,82 +9,127 @@ Model: Qwen3-Coder-30B-A3B-Instruct, 8 × TP1 on H20
This section explores whether the **B2-confirmed same-worker This section explores whether the **B2-confirmed same-worker
prefilldecode interference** can be relieved by selectively prefilldecode interference** can be relieved by selectively
migrating prefill to a different worker for the requests where the migrating prefill to a different worker for the requests where the
interference cost would dominate the transfer cost. We implement two interference cost would dominate the transfer cost. We implement
flavors of the policy (strict gates, then relaxed gates) and a clean two flavors of the routing policy (strict gates, then relaxed
isolation control (`unified_kv_both`: same picker as `unified`, but gates) and **two isolation controls** that use the unified picker
the vLLMs are launched in `kv_role=kv_both` so the Mooncake but launch vLLMs in `kv_role=kv_both` so the connector substrate
substrate is on but never triggers). is on but never PD-seps:
Three findings: - `unified_kv_both`: with **MooncakeConnector**
- `unified_nixl_both`: with **NixlConnector** (NVIDIA's official
v1 connector; isolates connector implementation from policy)
1. **`kv_role=kv_both` alone imposes a heavy always-on tax**: TTFT Four findings:
p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain `unified`,
with no PD-sep ever firing. 1. **`kv_role=kv_both` imposes a substantial always-on tax even
2. **PD-sep almost never triggers on a real agentic workload**: when no PD-sep ever fires**: with Mooncake it's TTFT p90 +45%,
TPOT p90 +25%, hotspot +19%; with NIXL it's TTFT p90 +38%,
TPOT p90 +16%, hotspot +0.2%.
2. **About half of the substrate cost is generic v1-connector
framework overhead** (proxied by NIXL since it's the leanest
implementation): KV buffer GPU memory cut from the model's
working budget, `SchedulerOutput.kv_connector_metadata`
round-trip, and altered `kv_cache_manager` block-lifecycle
semantics. **NIXL is meaningfully better than Mooncake** but
still imposes a 16-38% tax vs no connector.
3. **PD-sep almost never triggers on a real agentic workload**:
0.16% with strict gates, 0.41% with relaxed gates. Agentic 0.16% with strict gates, 0.41% with relaxed gates. Agentic
workloads have 93% intra-session reuse, so most requests land on workloads have 93% intra-session reuse, so most requests land
workers that already hold cache — the uncached tail is too small on workers that already hold cache — the uncached tail is too
to be worth migrating. small to be worth migrating.
3. **When PD-sep does fire, the cost model is wrong by ~1020×**: 4. **When PD-sep does fire, the cost model is wrong by ~1020×**:
the calibrated `0.3s + bytes / 2.7 GB/s` predicts 12 s migrate the calibrated `0.3s + bytes / 2.7 GB/s` predicts 12 s migrate
cost; observed TTFT on triggered requests is 1245 s. The same cost; observed TTFT on triggered requests is 1245 s.
D-side block-reservation pressure and absence of layerwise
pipelining that the E2 audit flagged still dominate.
The net latency of `unified_v2` is **not better than plain The net latency of `unified_v2` is **not better than plain
`unified`**. Improving agentic PD-sep requires fixing the underlying `unified`** under either Mooncake or NIXL substrate. Improving
Mooncake transfer mechanism (E2 patches 6.1 lazy block reservation agentic PD-sep requires (a) using the leaner connector (NIXL >
and 6.3 layerwise pipelining), not the routing decision. Mooncake by 5-19 pp across metrics), and (b) fixing the underlying
transfer mechanism (E2 patches 6.1 lazy block reservation and 6.3
layerwise pipelining), not just the routing decision.
## Substrate ## Substrate
We compare three policies on identical traces: We compare four policies on identical traces:
| policy | picker | vLLM launch mode | what's it for | | policy | picker | vLLM launch mode | what's it for |
|---|---|---|---| |---|---|---|---|
| `unified` | hybrid affinity + LMetric | plain (no Mooncake) | the headline baseline | | `unified` | hybrid affinity + LMetric | plain (no connector) | the headline baseline |
| `unified_kv_both` | same as `unified` | `kv_role=kv_both` + bootstrap | isolation control: how much does kv_both *alone* cost? | | `unified_kv_both` | same as `unified` | `MooncakeConnector` + `kv_both` | substrate control: Mooncake cost without PD-sep |
| `unified_v2` | unified + selective PD-sep | `kv_role=kv_both` + bootstrap | the actual experiment | | `unified_nixl_both` | same as `unified` | `NixlConnector` + `kv_both` | substrate control: NIXL cost without PD-sep, attributes overhead to "framework vs Mooncake" |
| `unified_v2` | unified + selective PD-sep | `MooncakeConnector` + `kv_both` + bootstrap | the actual experiment |
All three use the same trace, the same 8-instance topology, the same All four use the same trace, the same 8-instance topology, the same
shadow-driftcorrected proxy (`scripts/cache_aware_proxy.py` post-fix shadow-driftcorrected proxy (`scripts/cache_aware_proxy.py` post-fix
`95c8ef8`). Plain `unified` was rerun on the patched proxy `95c8ef8`). Plain `unified` was rerun on the patched proxy
(`b3_sweep_20260525_095043/unified`) under the same conditions. (`b3_sweep_20260525_095043/unified`) under the same conditions.
## Result 1 — kv_both is expensive by itself NIXL required two launch fixes beyond Mooncake:
- `VLLM_NIXL_SIDE_CHANNEL_PORT` must be unique per instance
(default 5600 → 5600..5607); otherwise instances 2..8 silently
hang in `zmq.error.ZMQError: Address already in use`.
- Health-check timeout had to be raised from 180 s to 360 s
because NIXL initialization (UCX agent + memory registration)
takes ~100-150 s per instance under 8-way concurrent launch.
## Result 1 — kv_both is expensive by itself, and only partly Mooncake's fault
![](figures/fig_kv_both_overhead.png) ![](figures/fig_kv_both_overhead.png)
Switching the vLLM launch from plain to `kv_role=kv_both` without Switching the vLLM launch from plain to `kv_role=kv_both` without
ever triggering PD-sep already costs: ever triggering PD-sep imposes a substrate tax. We compare the two
connectors available in vendored vLLM:
| metric | plain `unified` | `unified_kv_both` | Δ | | metric | plain `unified` | `unified_nixl_both` | `unified_kv_both` (Mooncake) |
|---|---:|---:|---| |---|---:|---:|---:|
| TTFT p50 | 0.50 s | 0.50 s | +0% | | TTFT p50 | 0.50 s | 0.51 s (+1%) | 0.50 s (+0%) |
| TTFT p90 | 7.35 s | 10.67 s | **+45%** | | **TTFT p90** | 7.35 s | **10.13 s (+38%)** | **10.67 s (+45%)** |
| TTFT p99 | 42.34 s | 45.19 s | +7% | | TTFT p99 | 42.34 s | 44.58 s (+5%) | 45.19 s (+7%) |
| TPOT p90 | 17.1 ms | 21.3 ms | **+25%** | | TPOT p90 | 17.1 ms | **19.8 ms (+16%)** | **21.3 ms (+25%)** |
| E2E p90 | 18.03 s | 22.89 s | **+27%** | | E2E p90 | 18.03 s | **21.18 s (+17%)** | **22.89 s (+27%)** |
| APC | 79.4% | 78.3% | 1.1 pp | | APC | 79.4% | 79.1% (0.3 pp) | 78.3% (1.1 pp) |
| hotspot index | 3.667 | **4.363** | **+19%** | | **hotspot index** | 3.667 | **3.674 (+0.2%)** | **4.363 (+19%)** |
| interference index | n/a | 5.58 | 8.57 |
Two contributing factors: ![](figures/fig_connector_substrate_attribution.png)
1. **The Mooncake `MooncakeConnector` runs even when no transfer is Reading the table from left to right gives a clean attribution:
pending.** Every scheduler step it walks `set(cache.keys())`
against `_known_hash_keys` (E2 audit §6.5) and updates the
`KVConnectorMetadata`. This is O(|cache|) per step on every
engine, even when no producer/consumer relationship is active.
2. **Block reservation semantics differ** under kv_both. The
scheduler treats blocks as candidates for export-to-others, so
the prefix cache LRU pressure is slightly different (we lose 1
pp APC).
Practical implication: **you don't enable kv_both for free**. If a - **NIXLplain** = the **v1-connector framework's irreducible cost**
deployment wants the option to do PD-sep selectively, the 45% TTFT (TTFT p90 +38%, TPOT p90 +16%, E2E p90 +17%). This is the cost
p90 tax applies even on requests that stay local. This needs to *any* v1 KV connector imposes:
recoverable cost before any selective-PD-sep policy is worth - the 1 GB `kv_buffer_size` carved from `gpu-memory-utilization`,
shipping. reducing the KV cache budget;
- per-step `SchedulerOutput.kv_connector_metadata` serialization
and round-trip through the connector worker;
- altered block-lifecycle semantics in `kv_cache_manager`
(`delay_free_blocks=True` is the default once any connector is
loaded, slowing LRU eviction).
- **MooncakeNIXL** = the **Mooncake-implementation-specific extra**
(TTFT p90 +7 pp, TPOT p90 +9 pp, E2E p90 +10 pp, hotspot +19 pp).
This is the cost Mooncake's design choices add on top of the
generic framework:
- per-scheduler-step `set(self._block_pool.cache.keys())` diff
against `_known_hash_keys` (`mooncake_connector.py:432-456`)
walks O(|cache|) on every step on every engine, costing ~4 M
set operations per second on a 200 k-block cache;
- the hash sync runs even when no `direct_read` consumer is
present, so the cost is paid unconditionally;
- block-lifecycle is further constrained because Mooncake
requires `delay_free` until the explicit `finished_sending`
arrives, vs NIXL which can release blocks earlier.
The **most striking gap is hotspot**: Mooncake's per-step hash
sync runs on the scheduler's GIL and disrupts the timeliness of
routing decisions, amplifying load imbalance by 19%. NIXL has no
equivalent global-state maintenance and preserves the plain-unified
hotspot to within 0.2%.
Practical implication: **you don't enable any v1 KV connector for
free**, but if you have to enable one, NIXL is meaningfully cheaper
than Mooncake. Even NIXL's 38% TTFT p90 tax is large enough that
PD-sep needs to recover it on a non-trivial fraction of requests
before being worth it.
## Result 2 — PD-sep rarely fires on a real agentic trace ## Result 2 — PD-sep rarely fires on a real agentic trace
@@ -153,24 +198,24 @@ The first-token clock for the 49 k request is **21× the model's
prediction**. This is not a small mis-tuning — it's a structurally prediction**. This is not a small mis-tuning — it's a structurally
different model. different model.
## Result 4 — three-way comparison ## Result 4 — four-way comparison
![](figures/fig_three_way_hotspot.png) ![](figures/fig_three_way_hotspot.png)
The full table: The full table:
| metric | unified (plain) | unified_kv_both | unified_v2 (relaxed) | | metric | unified (plain) | unified_nixl_both | unified_kv_both (Mooncake) | unified_v2 (relaxed) |
|---|---:|---:|---:| |---|---:|---:|---:|---:|
| n_ok | 1214 | 1214 | 1214 | | n_ok | 1214 | 1214 | 1214 | 1214 |
| TTFT p50 | 0.50 s | 0.50 s | 0.49 s | | TTFT p50 | 0.50 s | 0.51 s | 0.50 s | 0.49 s |
| TTFT p90 | 7.35 s | 10.67 s | 10.98 s | | TTFT p90 | 7.35 s | 10.13 s | 10.67 s | 10.98 s |
| TTFT p99 | 42.34 s | 45.19 s | 49.45 s | | TTFT p99 | 42.34 s | 44.58 s | 45.19 s | 49.45 s |
| TPOT p90 | 17.1 ms | 21.3 ms | 18.4 ms | | TPOT p90 | 17.1 ms | 19.8 ms | 21.3 ms | 18.4 ms |
| E2E p90 | 18.03 s | 22.89 s | 22.53 s | | E2E p90 | 18.03 s | 21.18 s | 22.89 s | 22.53 s |
| APC | 79.4% | 78.3% | 77.6% | | APC | 79.4% | 79.1% | 78.3% | 77.6% |
| interference index | n/a (no engine_state) | 8.57 | 8.46 | | interference index | n/a | 5.58 | 8.57 | 8.46 |
| hotspot index | 3.667 | 4.363 | 3.910 | | hotspot index | 3.667 | 3.674 | 4.363 | 3.910 |
| n_slow | 189 | 198 | 198 | | n_slow | 189 | 192 | 198 | 198 |
### v2 vs the kv_both control (the right comparison) ### v2 vs the kv_both control (the right comparison)

View File

@@ -155,6 +155,32 @@
"unknown": 49 "unknown": 49
} }
}, },
{
"policy": "unified_nixl_both",
"n_ok": 1214,
"n_total": 1214,
"ttft_p50_s": 0.5138550130068325,
"ttft_p90_s": 10.127110345300755,
"ttft_p99_s": 44.5789094621703,
"tpot_p50_s": 0.008423213202440761,
"tpot_p90_s": 0.019759515867947428,
"tpot_p99_s": 0.1079433335279151,
"e2e_p50_s": 1.866590676479973,
"e2e_p90_s": 21.179128799570027,
"e2e_p99_s": 96.01196486203865,
"apc_ratio": 0.791441828164218,
"interference_index": 5.580715970433481,
"hotspot_index_ttft_p90": 3.673957447190547,
"reuse_intra_frac": 0.930632797070364,
"reuse_cross_frac": 0.05718149217603143,
"n_slow": 192,
"failure_counts": {
"cache_miss_large_append": 21,
"hot_worker_queue": 75,
"same_worker_prefill_overlap": 72,
"unknown": 24
}
},
{ {
"policy": "unified_v2", "policy": "unified_v2",
"n_ok": 1214, "n_ok": 1214,

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 3.673957447190547,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 21.5702620673168,
"http://127.0.0.1:8001": 21.44246501957532,
"http://127.0.0.1:8002": 7.497513776784784,
"http://127.0.0.1:8003": 18.975387462502113,
"http://127.0.0.1:8004": 27.733961877820548,
"http://127.0.0.1:8005": 14.178356938017535,
"http://127.0.0.1:8006": 25.44877168269595,
"http://127.0.0.1:8007": 54.500166546402035
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 7.380765471985799,
"http://127.0.0.1:8001": 14.109222683508415,
"http://127.0.0.1:8002": 3.001173847797329,
"http://127.0.0.1:8003": 14.087287129514152,
"http://127.0.0.1:8004": 14.151121024426537,
"http://127.0.0.1:8005": 6.165523712011057,
"http://127.0.0.1:8006": 6.314287615299688,
"http://127.0.0.1:8007": 39.43635586597957
},
"status": "supported"
}

Binary file not shown.

After

Width:  |  Height:  |  Size: 83 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 70 KiB

After

Width:  |  Height:  |  Size: 86 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 54 KiB

After

Width:  |  Height:  |  Size: 57 KiB

View File

@@ -34,37 +34,39 @@ def _load(name: str):
POLICY_COLORS = { POLICY_COLORS = {
"unified": "#2ca02c", "unified": "#2ca02c",
"unified_kv_both": "#9467bd", "unified_kv_both": "#9467bd",
"unified_v2": "#d62728", "unified_nixl_both": "#1f77b4",
"unified_v2_strict": "#ff7f0e", "unified_v2": "#d62728",
"unified_v2_strict": "#ff7f0e",
} }
def fig_kv_both_overhead(): def fig_kv_both_overhead():
comp = _load("b3_policy_comparison.json") comp = _load("b3_policy_comparison.json")
by = {r["policy"]: r for r in comp["rows"]} by = {r["policy"]: r for r in comp["rows"]}
pols = ["unified", "unified_kv_both", "unified_v2"] pols = ["unified", "unified_kv_both", "unified_nixl_both", "unified_v2"]
metrics = [ metrics = [
("TTFT p90 (s)", lambda r: r["ttft_p90_s"]), ("TTFT p90 (s)", lambda r: r["ttft_p90_s"]),
("TPOT p90 (ms)", lambda r: r["tpot_p90_s"] * 1000), ("TPOT p90 (ms)", lambda r: r["tpot_p90_s"] * 1000),
("E2E p90 (s)", lambda r: r["e2e_p90_s"]), ("E2E p90 (s)", lambda r: r["e2e_p90_s"]),
("hotspot index", lambda r: r["hotspot_index_ttft_p90"]), ("hotspot index", lambda r: r["hotspot_index_ttft_p90"]),
] ]
fig, axes = plt.subplots(1, 4, figsize=(14, 4)) fig, axes = plt.subplots(1, 4, figsize=(15, 4.2))
for ax, (label, fn) in zip(axes, metrics): for ax, (label, fn) in zip(axes, metrics):
vals = [fn(by[p]) for p in pols] vals = [fn(by[p]) for p in pols]
bars = ax.bar(pols, vals, labels_short = [p.replace("unified_", "") for p in pols]
labels_short[0] = "plain"
bars = ax.bar(labels_short, vals,
color=[POLICY_COLORS[p] for p in pols], color=[POLICY_COLORS[p] for p in pols],
edgecolor="black", linewidth=0.5) edgecolor="black", linewidth=0.5)
ax.set_title(label) ax.set_title(label)
ax.tick_params(axis="x", rotation=20, labelsize=9) ax.tick_params(axis="x", rotation=15, labelsize=9)
for b, v in zip(bars, vals): for b, v in zip(bars, vals):
ax.text(b.get_x() + b.get_width() / 2, v, ax.text(b.get_x() + b.get_width() / 2, v,
f"{v:.2f}" if v < 100 else f"{v:.0f}", f"{v:.2f}" if v < 100 else f"{v:.0f}",
ha="center", va="bottom", fontsize=9) ha="center", va="bottom", fontsize=9)
ax.grid(alpha=0.3, axis="y") ax.grid(alpha=0.3, axis="y")
# delta annotation
baseline = vals[0] baseline = vals[0]
for i, v in enumerate(vals): for i, v in enumerate(vals):
if i == 0: if i == 0:
@@ -74,8 +76,8 @@ def fig_kv_both_overhead():
fontsize=10, fontweight="bold", fontsize=10, fontweight="bold",
color="darkred" if pct > 0 else "darkgreen") color="darkred" if pct > 0 else "darkgreen")
fig.suptitle( fig.suptitle(
"kv_both adds ~45% to TTFT p90 even without PD-sep firing.\n" "Mooncake substrate adds 19-45% across metrics; NIXL is 5-19pp better but\n"
"v2's PD-sep barely recovers the gap (and overshoots TTFT p99)." "still 16-38% above plain. v2's 5 PD-sep events don't recover the substrate tax."
) )
fig.tight_layout() fig.tight_layout()
fig.savefig(OUT / "fig_kv_both_overhead.png", dpi=120) fig.savefig(OUT / "fig_kv_both_overhead.png", dpi=120)
@@ -203,27 +205,29 @@ def fig_v2_predicted_vs_actual():
def fig_three_way_hotspot(): def fig_three_way_hotspot():
pols = ["unified", "unified_kv_both", "unified_v2"] pols = ["unified", "unified_kv_both", "unified_nixl_both", "unified_v2"]
per_worker = {p: _load(f"per_worker_{p}.json") for p in pols} per_worker = {p: _load(f"per_worker_{p}.json") for p in pols}
workers = sorted(per_worker["unified"]["per_worker_ttft_p90_s"].keys()) workers = sorted(per_worker["unified"]["per_worker_ttft_p90_s"].keys())
x = range(len(workers)) x = range(len(workers))
width = 0.27 n = len(pols)
fig, ax = plt.subplots(figsize=(11, 5)) width = 0.85 / n
fig, ax = plt.subplots(figsize=(12, 5))
for i, p in enumerate(pols): for i, p in enumerate(pols):
d = per_worker[p]["per_worker_ttft_p90_s"] d = per_worker[p]["per_worker_ttft_p90_s"]
vals = [d[w] for w in workers] vals = [d[w] for w in workers]
offset = (i - 1) * width offset = (i - (n - 1) / 2) * width
label = p.replace("unified_", "") if p != "unified" else "plain"
ax.bar([j + offset for j in x], vals, width, ax.bar([j + offset for j in x], vals, width,
label=f"{p} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})", label=f"{label} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})",
color=POLICY_COLORS[p], edgecolor="black", linewidth=0.4) color=POLICY_COLORS[p], edgecolor="black", linewidth=0.4)
short = [w.replace("http://127.0.0.1:", ":") for w in workers] short = [w.replace("http://127.0.0.1:", ":") for w in workers]
ax.set_xticks(list(x)) ax.set_xticks(list(x))
ax.set_xticklabels(short, rotation=0, fontsize=9) ax.set_xticklabels(short, rotation=0, fontsize=9)
ax.set_ylabel("worker TTFT p90 (s)") ax.set_ylabel("worker TTFT p90 (s)")
ax.set_title( ax.set_title(
"Per-worker TTFT p90 distribution. kv_both alone makes the hot worker hotter\n" "Per-worker TTFT p90 distribution across substrates. Mooncake (kv_both)\n"
"(unified→kv_both: 37.7s→43.5s peak); v2's 5 PD-sep triggers nudge it back." "amplifies the hot worker (hotspot 4.36); NIXL keeps it close to plain (3.67)."
) )
ax.legend(loc="upper left", fontsize=9) ax.legend(loc="upper left", fontsize=9)
ax.grid(alpha=0.3, axis="y") ax.grid(alpha=0.3, axis="y")
@@ -232,12 +236,64 @@ def fig_three_way_hotspot():
plt.close(fig) plt.close(fig)
def fig_connector_substrate_attribution():
"""Decomposes overhead into v1-framework cost (shared by all connectors,
proxied by NIXL since it's the leanest) and Mooncake-specific cost."""
comp = _load("b3_policy_comparison.json")
by = {r["policy"]: r for r in comp["rows"]}
metrics = [
("TTFT p90 (s)", "ttft_p90_s", False),
("TPOT p90 (ms)", "tpot_p90_s", True),
("E2E p90 (s)", "e2e_p90_s", False),
("hotspot index", "hotspot_index_ttft_p90", False),
]
fig, axes = plt.subplots(1, 4, figsize=(15, 4))
for ax, (label, key, scale_ms) in zip(axes, metrics):
plain = by["unified"][key] * (1000 if scale_ms else 1)
nixl = by["unified_nixl_both"][key] * (1000 if scale_ms else 1)
moon = by["unified_kv_both"][key] * (1000 if scale_ms else 1)
v2 = by["unified_v2"][key] * (1000 if scale_ms else 1)
framework_cost = nixl - plain # what NIXL adds = v1 framework cost
mooncake_extra = moon - nixl # extra on top from Mooncake
v2_branch_extra = v2 - moon # extra from PD-sep branch (Mooncake + 5 events)
bottom = 0
ax.bar(["overhead"], [plain], color="#cccccc",
edgecolor="black", linewidth=0.4,
label=f"plain unified ({plain:.2f})")
bottom += plain
ax.bar(["overhead"], [framework_cost], bottom=[bottom],
color="#1f77b4", edgecolor="black", linewidth=0.4,
label=f"v1 framework (+{framework_cost:.2f})")
bottom += framework_cost
ax.bar(["overhead"], [mooncake_extra], bottom=[bottom],
color="#9467bd", edgecolor="black", linewidth=0.4,
label=f"Mooncake extra (+{mooncake_extra:.2f})")
bottom += mooncake_extra
ax.bar(["overhead"], [v2_branch_extra], bottom=[bottom],
color="#d62728", edgecolor="black", linewidth=0.4,
label=f"v2 PD-sep branch ({v2_branch_extra:+.2f})")
ax.set_title(label)
ax.legend(fontsize=8, loc="upper right")
ax.grid(alpha=0.3, axis="y")
ax.tick_params(axis="x", labelbottom=False)
fig.suptitle(
"Attribution: plain unified vs NIXL substrate vs Mooncake substrate vs v2.\n"
"Blue: cost shared by any v1 connector. Purple: cost specific to Mooncake."
)
fig.tight_layout()
fig.savefig(OUT / "fig_connector_substrate_attribution.png", dpi=120)
plt.close(fig)
def main(): def main():
fig_kv_both_overhead() fig_kv_both_overhead()
fig_v2_trigger_funnel() fig_v2_trigger_funnel()
fig_v2_predicted_vs_actual() fig_v2_predicted_vs_actual()
fig_three_way_hotspot() fig_three_way_hotspot()
print(f"wrote 4 figures to {OUT}") fig_connector_substrate_attribution()
print(f"wrote 5 figures to {OUT}")
if __name__ == "__main__": if __name__ == "__main__":