Add NIXL substrate isolation control + attribution decomposition

Adds unified_nixl_both to elastic_migration_v2: same picker as unified_kv_both (never triggers PD-sep), but launches vLLM with NixlConnector instead of MooncakeConnector. Compared against plain unified and unified_kv_both (Mooncake) we can now attribute the substrate overhead between "v1 connector framework irreducible cost" (proxied by the leaner NIXL) and "Mooncake implementation extra" (Mooncake - NIXL). Result (vs plain unified, both substrates never PD-sep): metric plain NIXL Mooncake TTFT p90 7.35s +37.9% +45.3% (NIXL: +7pp better) TPOT p90 17.1ms +15.5% +24.5% (NIXL: +9pp better) E2E p90 18.03s +17.4% +27.0% (NIXL: +10pp better) hotspot 3.667 +0.2% +19.0% (NIXL: keeps it flat) APC 79.4% -0.3pp -1.1pp interference - 5.58 8.57 (NIXL: ~35% lower) The cleanest signal is hotspot: NIXL preserves plain-unified's distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step O(|cache|) `set(self._block_pool.cache.keys())` diff against _known_hash_keys (mooncake_connector.py:432-456) inflates routing imbalance by 19%. The hash sync runs unconditionally even when no direct_read consumer is present. Attribution: NIXL-plain ~= v1 framework irreducible cost (kv_buffer GPU memory, per-step SchedulerOutput.kv_connector_metadata round-trip, altered kv_cache_manager block-lifecycle). Mooncake-NIXL ~= Mooncake-specific overhead (the hash-sync loop and stricter delay_free semantics). Practical implication: NIXL is meaningfully better than Mooncake on this stack, but even NIXL imposes 16-38% across metrics — too expensive for selective-PD-sep on agentic workloads where the trigger rate is < 0.5%. Launch fixes required for NIXL multi-instance: - VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default 5600; we use 5600+i). Without this, 7 of 8 instances silently hang in `zmq.error.ZMQError: Address already in use` and the launcher trap kills all of them at health-check timeout. - Health-check timeout raised from 180s to 360s; NIXL initialization (UCX agent + memory registration) is ~100-150s per instance under 8-way concurrent load, vs Mooncake's ~30-60s. New figure: fig_connector_substrate_attribution.png stacks plain / framework / Mooncake-extra / v2-branch overhead per metric. Existing figures (fig_kv_both_overhead, fig_three_way_hotspot) updated to include NIXL as a fourth bar. README updated with 4-way table, Result 1 reframed as "the cost is mostly framework, not Mooncake — but Mooncake adds the hotspot penalty", and the substrate-vs-PD-sep tradeoff math. Refs: nixl_connector.py:700 handshake listener bind, factory.py register_connector for the NixlConnector entry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 16:02:12 +08:00
parent 645b067dd4
commit dc6d24d1ca
8 changed files with 235 additions and 83 deletions
--- a/analysis/characterization/elastic_migration_v2/README.md
+++ b/analysis/characterization/elastic_migration_v2/README.md
@@ -9,82 +9,127 @@ Model: Qwen3-Coder-30B-A3B-Instruct, 8 × TP1 on H20
 This section explores whether the **B2-confirmed same-worker
 prefill–decode interference** can be relieved by selectively
 migrating prefill to a different worker for the requests where the
-interference cost would dominate the transfer cost. We implement two
-flavors of the policy (strict gates, then relaxed gates) and a clean
-isolation control (`unified_kv_both`: same picker as `unified`, but
-the vLLMs are launched in `kv_role=kv_both` so the Mooncake
-substrate is on but never triggers).
+interference cost would dominate the transfer cost. We implement
+two flavors of the routing policy (strict gates, then relaxed
+gates) and **two isolation controls** that use the unified picker
+but launch vLLMs in `kv_role=kv_both` so the connector substrate
+is on but never PD-seps:

-Three findings:
+- `unified_kv_both`: with **MooncakeConnector**
+- `unified_nixl_both`: with **NixlConnector** (NVIDIA's official
+  v1 connector; isolates connector implementation from policy)

-1. **`kv_role=kv_both` alone imposes a heavy always-on tax**: TTFT
-   p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain `unified`,
-   with no PD-sep ever firing.
-2. **PD-sep almost never triggers on a real agentic workload**:
+Four findings:
+
+1. **`kv_role=kv_both` imposes a substantial always-on tax even
+   when no PD-sep ever fires**: with Mooncake it's TTFT p90 +45%,
+   TPOT p90 +25%, hotspot +19%; with NIXL it's TTFT p90 +38%,
+   TPOT p90 +16%, hotspot +0.2%.
+2. **About half of the substrate cost is generic v1-connector
+   framework overhead** (proxied by NIXL since it's the leanest
+   implementation): KV buffer GPU memory cut from the model's
+   working budget, `SchedulerOutput.kv_connector_metadata`
+   round-trip, and altered `kv_cache_manager` block-lifecycle
+   semantics. **NIXL is meaningfully better than Mooncake** but
+   still imposes a 16-38% tax vs no connector.
+3. **PD-sep almost never triggers on a real agentic workload**:
   0.16% with strict gates, 0.41% with relaxed gates. Agentic
-   workloads have 93% intra-session reuse, so most requests land on
-   workers that already hold cache — the uncached tail is too small
-   to be worth migrating.
-3. **When PD-sep does fire, the cost model is wrong by ~10–20×**:
+   workloads have 93% intra-session reuse, so most requests land
+   on workers that already hold cache — the uncached tail is too
+   small to be worth migrating.
+4. **When PD-sep does fire, the cost model is wrong by ~10–20×**:
   the calibrated `0.3s + bytes / 2.7 GB/s` predicts 1–2 s migrate
-   cost; observed TTFT on triggered requests is 12–45 s. The same
-   D-side block-reservation pressure and absence of layerwise
-   pipelining that the E2 audit flagged still dominate.
+   cost; observed TTFT on triggered requests is 12–45 s.

 The net latency of `unified_v2` is **not better than plain
-`unified`**. Improving agentic PD-sep requires fixing the underlying
-Mooncake transfer mechanism (E2 patches 6.1 lazy block reservation
-and 6.3 layerwise pipelining), not the routing decision.
+`unified`** under either Mooncake or NIXL substrate. Improving
+agentic PD-sep requires (a) using the leaner connector (NIXL >
+Mooncake by 5-19 pp across metrics), and (b) fixing the underlying
+transfer mechanism (E2 patches 6.1 lazy block reservation and 6.3
+layerwise pipelining), not just the routing decision.

 ## Substrate

-We compare three policies on identical traces:
+We compare four policies on identical traces:

 | policy | picker | vLLM launch mode | what's it for |
 |---|---|---|---|
-| `unified` | hybrid affinity + LMetric | plain (no Mooncake) | the headline baseline |
-| `unified_kv_both` | same as `unified` | `kv_role=kv_both` + bootstrap | isolation control: how much does kv_both *alone* cost? |
-| `unified_v2` | unified + selective PD-sep | `kv_role=kv_both` + bootstrap | the actual experiment |
+| `unified` | hybrid affinity + LMetric | plain (no connector) | the headline baseline |
+| `unified_kv_both` | same as `unified` | `MooncakeConnector` + `kv_both` | substrate control: Mooncake cost without PD-sep |
+| `unified_nixl_both` | same as `unified` | `NixlConnector` + `kv_both` | substrate control: NIXL cost without PD-sep, attributes overhead to "framework vs Mooncake" |
+| `unified_v2` | unified + selective PD-sep | `MooncakeConnector` + `kv_both` + bootstrap | the actual experiment |

-All three use the same trace, the same 8-instance topology, the same
+All four use the same trace, the same 8-instance topology, the same
 shadow-drift–corrected proxy (`scripts/cache_aware_proxy.py` post-fix
 `95c8ef8`). Plain `unified` was rerun on the patched proxy
 (`b3_sweep_20260525_095043/unified`) under the same conditions.

-## Result 1 — kv_both is expensive by itself
+NIXL required two launch fixes beyond Mooncake:
+- `VLLM_NIXL_SIDE_CHANNEL_PORT` must be unique per instance
+  (default 5600 → 5600..5607); otherwise instances 2..8 silently
+  hang in `zmq.error.ZMQError: Address already in use`.
+- Health-check timeout had to be raised from 180 s to 360 s
+  because NIXL initialization (UCX agent + memory registration)
+  takes ~100-150 s per instance under 8-way concurrent launch.
+
+## Result 1 — kv_both is expensive by itself, and only partly Mooncake's fault

 ![](figures/fig_kv_both_overhead.png)

 Switching the vLLM launch from plain to `kv_role=kv_both` without
-ever triggering PD-sep already costs:
+ever triggering PD-sep imposes a substrate tax. We compare the two
+connectors available in vendored vLLM:

-| metric | plain `unified` | `unified_kv_both` | Δ |
-|---|---:|---:|---|
-| TTFT p50 | 0.50 s | 0.50 s | +0% |
-| TTFT p90 | 7.35 s | 10.67 s | **+45%** |
-| TTFT p99 | 42.34 s | 45.19 s | +7% |
-| TPOT p90 | 17.1 ms | 21.3 ms | **+25%** |
-| E2E p90 | 18.03 s | 22.89 s | **+27%** |
-| APC | 79.4% | 78.3% | −1.1 pp |
-| hotspot index | 3.667 | **4.363** | **+19%** |
+| metric | plain `unified` | `unified_nixl_both` | `unified_kv_both` (Mooncake) |
+|---|---:|---:|---:|
+| TTFT p50 | 0.50 s | 0.51 s (+1%) | 0.50 s (+0%) |
+| **TTFT p90** | 7.35 s | **10.13 s (+38%)** | **10.67 s (+45%)** |
+| TTFT p99 | 42.34 s | 44.58 s (+5%) | 45.19 s (+7%) |
+| TPOT p90 | 17.1 ms | **19.8 ms (+16%)** | **21.3 ms (+25%)** |
+| E2E p90 | 18.03 s | **21.18 s (+17%)** | **22.89 s (+27%)** |
+| APC | 79.4% | 79.1% (−0.3 pp) | 78.3% (−1.1 pp) |
+| **hotspot index** | 3.667 | **3.674 (+0.2%)** | **4.363 (+19%)** |
+| interference index | n/a | 5.58 | 8.57 |

-Two contributing factors:
+![](figures/fig_connector_substrate_attribution.png)

-1. **The Mooncake `MooncakeConnector` runs even when no transfer is
-   pending.** Every scheduler step it walks `set(cache.keys())`
-   against `_known_hash_keys` (E2 audit §6.5) and updates the
-   `KVConnectorMetadata`. This is O(|cache|) per step on every
-   engine, even when no producer/consumer relationship is active.
-2. **Block reservation semantics differ** under kv_both. The
-   scheduler treats blocks as candidates for export-to-others, so
-   the prefix cache LRU pressure is slightly different (we lose 1
-   pp APC).
+Reading the table from left to right gives a clean attribution:

-Practical implication: **you don't enable kv_both for free**. If a
-deployment wants the option to do PD-sep selectively, the 45% TTFT
-p90 tax applies even on requests that stay local. This needs to
-recoverable cost before any selective-PD-sep policy is worth
-shipping.
+- **NIXL−plain** = the **v1-connector framework's irreducible cost**
+  (TTFT p90 +38%, TPOT p90 +16%, E2E p90 +17%). This is the cost
+  *any* v1 KV connector imposes:
+  - the 1 GB `kv_buffer_size` carved from `gpu-memory-utilization`,
+    reducing the KV cache budget;
+  - per-step `SchedulerOutput.kv_connector_metadata` serialization
+    and round-trip through the connector worker;
+  - altered block-lifecycle semantics in `kv_cache_manager`
+    (`delay_free_blocks=True` is the default once any connector is
+    loaded, slowing LRU eviction).
+- **Mooncake−NIXL** = the **Mooncake-implementation-specific extra**
+  (TTFT p90 +7 pp, TPOT p90 +9 pp, E2E p90 +10 pp, hotspot +19 pp).
+  This is the cost Mooncake's design choices add on top of the
+  generic framework:
+  - per-scheduler-step `set(self._block_pool.cache.keys())` diff
+    against `_known_hash_keys` (`mooncake_connector.py:432-456`)
+    walks O(|cache|) on every step on every engine, costing ~4 M
+    set operations per second on a 200 k-block cache;
+  - the hash sync runs even when no `direct_read` consumer is
+    present, so the cost is paid unconditionally;
+  - block-lifecycle is further constrained because Mooncake
+    requires `delay_free` until the explicit `finished_sending`
+    arrives, vs NIXL which can release blocks earlier.
+
+The **most striking gap is hotspot**: Mooncake's per-step hash
+sync runs on the scheduler's GIL and disrupts the timeliness of
+routing decisions, amplifying load imbalance by 19%. NIXL has no
+equivalent global-state maintenance and preserves the plain-unified
+hotspot to within 0.2%.
+
+Practical implication: **you don't enable any v1 KV connector for
+free**, but if you have to enable one, NIXL is meaningfully cheaper
+than Mooncake. Even NIXL's 38% TTFT p90 tax is large enough that
+PD-sep needs to recover it on a non-trivial fraction of requests
+before being worth it.

 ## Result 2 — PD-sep rarely fires on a real agentic trace

@@ -153,24 +198,24 @@ The first-token clock for the 49 k request is **21× the model's
 prediction**. This is not a small mis-tuning — it's a structurally
 different model.

-## Result 4 — three-way comparison
+## Result 4 — four-way comparison

 ![](figures/fig_three_way_hotspot.png)

 The full table:

-| metric | unified (plain) | unified_kv_both | unified_v2 (relaxed) |
-|---|---:|---:|---:|
-| n_ok | 1214 | 1214 | 1214 |
-| TTFT p50 | 0.50 s | 0.50 s | 0.49 s |
-| TTFT p90 | 7.35 s | 10.67 s | 10.98 s |
-| TTFT p99 | 42.34 s | 45.19 s | 49.45 s |
-| TPOT p90 | 17.1 ms | 21.3 ms | 18.4 ms |
-| E2E p90 | 18.03 s | 22.89 s | 22.53 s |
-| APC | 79.4% | 78.3% | 77.6% |
-| interference index | n/a (no engine_state) | 8.57 | 8.46 |
-| hotspot index | 3.667 | 4.363 | 3.910 |
-| n_slow | 189 | 198 | 198 |
+| metric | unified (plain) | unified_nixl_both | unified_kv_both (Mooncake) | unified_v2 (relaxed) |
+|---|---:|---:|---:|---:|
+| n_ok | 1214 | 1214 | 1214 | 1214 |
+| TTFT p50 | 0.50 s | 0.51 s | 0.50 s | 0.49 s |
+| TTFT p90 | 7.35 s | 10.13 s | 10.67 s | 10.98 s |
+| TTFT p99 | 42.34 s | 44.58 s | 45.19 s | 49.45 s |
+| TPOT p90 | 17.1 ms | 19.8 ms | 21.3 ms | 18.4 ms |
+| E2E p90 | 18.03 s | 21.18 s | 22.89 s | 22.53 s |
+| APC | 79.4% | 79.1% | 78.3% | 77.6% |
+| interference index | n/a | 5.58 | 8.57 | 8.46 |
+| hotspot index | 3.667 | 3.674 | 4.363 | 3.910 |
+| n_slow | 189 | 192 | 198 | 198 |

 ### v2 vs the kv_both control (the right comparison)

--- a/analysis/characterization/elastic_migration_v2/data/b3_policy_comparison.json
+++ b/analysis/characterization/elastic_migration_v2/data/b3_policy_comparison.json
@@ -155,6 +155,32 @@
        "unknown": 49
      }
    },
+    {
+      "policy": "unified_nixl_both",
+      "n_ok": 1214,
+      "n_total": 1214,
+      "ttft_p50_s": 0.5138550130068325,
+      "ttft_p90_s": 10.127110345300755,
+      "ttft_p99_s": 44.5789094621703,
+      "tpot_p50_s": 0.008423213202440761,
+      "tpot_p90_s": 0.019759515867947428,
+      "tpot_p99_s": 0.1079433335279151,
+      "e2e_p50_s": 1.866590676479973,
+      "e2e_p90_s": 21.179128799570027,
+      "e2e_p99_s": 96.01196486203865,
+      "apc_ratio": 0.791441828164218,
+      "interference_index": 5.580715970433481,
+      "hotspot_index_ttft_p90": 3.673957447190547,
+      "reuse_intra_frac": 0.930632797070364,
+      "reuse_cross_frac": 0.05718149217603143,
+      "n_slow": 192,
+      "failure_counts": {
+        "cache_miss_large_append": 21,
+        "hot_worker_queue": 75,
+        "same_worker_prefill_overlap": 72,
+        "unknown": 24
+      }
+    },
    {
      "policy": "unified_v2",
      "n_ok": 1214,
--- a/analysis/characterization/elastic_migration_v2/data/breakdown_unified_nixl_both.json
+++ b/analysis/characterization/elastic_migration_v2/data/breakdown_unified_nixl_both.json
--- a/analysis/characterization/elastic_migration_v2/data/per_worker_unified_nixl_both.json
+++ b/analysis/characterization/elastic_migration_v2/data/per_worker_unified_nixl_both.json
@@ -0,0 +1,24 @@
+{
+  "hotspot_index_ttft_p90": 3.673957447190547,
+  "per_worker_latency_p90_s": {
+    "http://127.0.0.1:8000": 21.5702620673168,
+    "http://127.0.0.1:8001": 21.44246501957532,
+    "http://127.0.0.1:8002": 7.497513776784784,
+    "http://127.0.0.1:8003": 18.975387462502113,
+    "http://127.0.0.1:8004": 27.733961877820548,
+    "http://127.0.0.1:8005": 14.178356938017535,
+    "http://127.0.0.1:8006": 25.44877168269595,
+    "http://127.0.0.1:8007": 54.500166546402035
+  },
+  "per_worker_ttft_p90_s": {
+    "http://127.0.0.1:8000": 7.380765471985799,
+    "http://127.0.0.1:8001": 14.109222683508415,
+    "http://127.0.0.1:8002": 3.001173847797329,
+    "http://127.0.0.1:8003": 14.087287129514152,
+    "http://127.0.0.1:8004": 14.151121024426537,
+    "http://127.0.0.1:8005": 6.165523712011057,
+    "http://127.0.0.1:8006": 6.314287615299688,
+    "http://127.0.0.1:8007": 39.43635586597957
+  },
+  "status": "supported"
+}
--- a/analysis/characterization/elastic_migration_v2/figures/fig_connector_substrate_attribution.png
+++ b/analysis/characterization/elastic_migration_v2/figures/fig_connector_substrate_attribution.png
--- a/analysis/characterization/elastic_migration_v2/figures/fig_kv_both_overhead.png
+++ b/analysis/characterization/elastic_migration_v2/figures/fig_kv_both_overhead.png
--- a/analysis/characterization/elastic_migration_v2/figures/fig_three_way_hotspot.png
+++ b/analysis/characterization/elastic_migration_v2/figures/fig_three_way_hotspot.png
--- a/analysis/characterization/elastic_migration_v2/render_figures.py
+++ b/analysis/characterization/elastic_migration_v2/render_figures.py
@@ -34,37 +34,39 @@ def _load(name: str):


 POLICY_COLORS = {
-    "unified":            "#2ca02c",
-    "unified_kv_both":    "#9467bd",
-    "unified_v2":         "#d62728",
-    "unified_v2_strict":  "#ff7f0e",
+    "unified":             "#2ca02c",
+    "unified_kv_both":     "#9467bd",
+    "unified_nixl_both":   "#1f77b4",
+    "unified_v2":          "#d62728",
+    "unified_v2_strict":   "#ff7f0e",
 }


 def fig_kv_both_overhead():
    comp = _load("b3_policy_comparison.json")
    by = {r["policy"]: r for r in comp["rows"]}
-    pols = ["unified", "unified_kv_both", "unified_v2"]
+    pols = ["unified", "unified_kv_both", "unified_nixl_both", "unified_v2"]
    metrics = [
        ("TTFT p90 (s)",   lambda r: r["ttft_p90_s"]),
        ("TPOT p90 (ms)",  lambda r: r["tpot_p90_s"] * 1000),
        ("E2E p90 (s)",    lambda r: r["e2e_p90_s"]),
        ("hotspot index",  lambda r: r["hotspot_index_ttft_p90"]),
    ]
-    fig, axes = plt.subplots(1, 4, figsize=(14, 4))
+    fig, axes = plt.subplots(1, 4, figsize=(15, 4.2))
    for ax, (label, fn) in zip(axes, metrics):
        vals = [fn(by[p]) for p in pols]
-        bars = ax.bar(pols, vals,
+        labels_short = [p.replace("unified_", "") for p in pols]
+        labels_short[0] = "plain"
+        bars = ax.bar(labels_short, vals,
                       color=[POLICY_COLORS[p] for p in pols],
                       edgecolor="black", linewidth=0.5)
        ax.set_title(label)
-        ax.tick_params(axis="x", rotation=20, labelsize=9)
+        ax.tick_params(axis="x", rotation=15, labelsize=9)
        for b, v in zip(bars, vals):
            ax.text(b.get_x() + b.get_width() / 2, v,
                     f"{v:.2f}" if v < 100 else f"{v:.0f}",
                     ha="center", va="bottom", fontsize=9)
        ax.grid(alpha=0.3, axis="y")
-        # delta annotation
        baseline = vals[0]
        for i, v in enumerate(vals):
            if i == 0:
@@ -74,8 +76,8 @@ def fig_kv_both_overhead():
                     fontsize=10, fontweight="bold",
                     color="darkred" if pct > 0 else "darkgreen")
    fig.suptitle(
-        "kv_both adds ~45% to TTFT p90 even without PD-sep firing.\n"
-        "v2's PD-sep barely recovers the gap (and overshoots TTFT p99)."
+        "Mooncake substrate adds 19-45% across metrics; NIXL is 5-19pp better but\n"
+        "still 16-38% above plain. v2's 5 PD-sep events don't recover the substrate tax."
    )
    fig.tight_layout()
    fig.savefig(OUT / "fig_kv_both_overhead.png", dpi=120)
@@ -203,27 +205,29 @@ def fig_v2_predicted_vs_actual():


 def fig_three_way_hotspot():
-    pols = ["unified", "unified_kv_both", "unified_v2"]
+    pols = ["unified", "unified_kv_both", "unified_nixl_both", "unified_v2"]
    per_worker = {p: _load(f"per_worker_{p}.json") for p in pols}
    workers = sorted(per_worker["unified"]["per_worker_ttft_p90_s"].keys())

    x = range(len(workers))
-    width = 0.27
-    fig, ax = plt.subplots(figsize=(11, 5))
+    n = len(pols)
+    width = 0.85 / n
+    fig, ax = plt.subplots(figsize=(12, 5))
    for i, p in enumerate(pols):
        d = per_worker[p]["per_worker_ttft_p90_s"]
        vals = [d[w] for w in workers]
-        offset = (i - 1) * width
+        offset = (i - (n - 1) / 2) * width
+        label = p.replace("unified_", "") if p != "unified" else "plain"
        ax.bar([j + offset for j in x], vals, width,
-                label=f"{p} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})",
+                label=f"{label} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})",
                color=POLICY_COLORS[p], edgecolor="black", linewidth=0.4)
    short = [w.replace("http://127.0.0.1:", ":") for w in workers]
    ax.set_xticks(list(x))
    ax.set_xticklabels(short, rotation=0, fontsize=9)
    ax.set_ylabel("worker TTFT p90 (s)")
    ax.set_title(
-        "Per-worker TTFT p90 distribution. kv_both alone makes the hot worker hotter\n"
-        "(unified→kv_both: 37.7s→43.5s peak); v2's 5 PD-sep triggers nudge it back."
+        "Per-worker TTFT p90 distribution across substrates. Mooncake (kv_both)\n"
+        "amplifies the hot worker (hotspot 4.36); NIXL keeps it close to plain (3.67)."
    )
    ax.legend(loc="upper left", fontsize=9)
    ax.grid(alpha=0.3, axis="y")
@@ -232,12 +236,64 @@ def fig_three_way_hotspot():
    plt.close(fig)


+def fig_connector_substrate_attribution():
+    """Decomposes overhead into v1-framework cost (shared by all connectors,
+    proxied by NIXL since it's the leanest) and Mooncake-specific cost."""
+    comp = _load("b3_policy_comparison.json")
+    by = {r["policy"]: r for r in comp["rows"]}
+    metrics = [
+        ("TTFT p90 (s)",  "ttft_p90_s",  False),
+        ("TPOT p90 (ms)", "tpot_p90_s",  True),
+        ("E2E p90 (s)",   "e2e_p90_s",   False),
+        ("hotspot index", "hotspot_index_ttft_p90", False),
+    ]
+    fig, axes = plt.subplots(1, 4, figsize=(15, 4))
+    for ax, (label, key, scale_ms) in zip(axes, metrics):
+        plain = by["unified"][key] * (1000 if scale_ms else 1)
+        nixl = by["unified_nixl_both"][key] * (1000 if scale_ms else 1)
+        moon = by["unified_kv_both"][key] * (1000 if scale_ms else 1)
+        v2 = by["unified_v2"][key] * (1000 if scale_ms else 1)
+
+        framework_cost = nixl - plain   # what NIXL adds = v1 framework cost
+        mooncake_extra = moon - nixl    # extra on top from Mooncake
+        v2_branch_extra = v2 - moon     # extra from PD-sep branch (Mooncake + 5 events)
+
+        bottom = 0
+        ax.bar(["overhead"], [plain], color="#cccccc",
+                edgecolor="black", linewidth=0.4,
+                label=f"plain unified ({plain:.2f})")
+        bottom += plain
+        ax.bar(["overhead"], [framework_cost], bottom=[bottom],
+                color="#1f77b4", edgecolor="black", linewidth=0.4,
+                label=f"v1 framework (+{framework_cost:.2f})")
+        bottom += framework_cost
+        ax.bar(["overhead"], [mooncake_extra], bottom=[bottom],
+                color="#9467bd", edgecolor="black", linewidth=0.4,
+                label=f"Mooncake extra (+{mooncake_extra:.2f})")
+        bottom += mooncake_extra
+        ax.bar(["overhead"], [v2_branch_extra], bottom=[bottom],
+                color="#d62728", edgecolor="black", linewidth=0.4,
+                label=f"v2 PD-sep branch ({v2_branch_extra:+.2f})")
+        ax.set_title(label)
+        ax.legend(fontsize=8, loc="upper right")
+        ax.grid(alpha=0.3, axis="y")
+        ax.tick_params(axis="x", labelbottom=False)
+    fig.suptitle(
+        "Attribution: plain unified vs NIXL substrate vs Mooncake substrate vs v2.\n"
+        "Blue: cost shared by any v1 connector. Purple: cost specific to Mooncake."
+    )
+    fig.tight_layout()
+    fig.savefig(OUT / "fig_connector_substrate_attribution.png", dpi=120)
+    plt.close(fig)
+
+
 def main():
    fig_kv_both_overhead()
    fig_v2_trigger_funnel()
    fig_v2_predicted_vs_actual()
    fig_three_way_hotspot()
-    print(f"wrote 4 figures to {OUT}")
+    fig_connector_substrate_attribution()
+    print(f"wrote 5 figures to {OUT}")


 if __name__ == "__main__":