Add NIXL substrate isolation control + attribution decomposition

Adds unified_nixl_both to elastic_migration_v2: same picker as
unified_kv_both (never triggers PD-sep), but launches vLLM with
NixlConnector instead of MooncakeConnector. Compared against plain
unified and unified_kv_both (Mooncake) we can now attribute the
substrate overhead between "v1 connector framework irreducible cost"
(proxied by the leaner NIXL) and "Mooncake implementation extra"
(Mooncake - NIXL).

Result (vs plain unified, both substrates never PD-sep):

  metric        plain    NIXL     Mooncake
  TTFT p90      7.35s    +37.9%   +45.3%   (NIXL: +7pp better)
  TPOT p90      17.1ms   +15.5%   +24.5%   (NIXL: +9pp better)
  E2E p90       18.03s   +17.4%   +27.0%   (NIXL: +10pp better)
  hotspot       3.667    +0.2%    +19.0%   (NIXL: keeps it flat)
  APC           79.4%    -0.3pp   -1.1pp
  interference  -        5.58     8.57     (NIXL: ~35% lower)

The cleanest signal is hotspot: NIXL preserves plain-unified's
distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step
O(|cache|) `set(self._block_pool.cache.keys())` diff against
_known_hash_keys (mooncake_connector.py:432-456) inflates routing
imbalance by 19%. The hash sync runs unconditionally even when no
direct_read consumer is present.

Attribution:
  NIXL-plain ~= v1 framework irreducible cost (kv_buffer GPU memory,
  per-step SchedulerOutput.kv_connector_metadata round-trip, altered
  kv_cache_manager block-lifecycle).
  Mooncake-NIXL ~= Mooncake-specific overhead (the hash-sync loop and
  stricter delay_free semantics).

Practical implication: NIXL is meaningfully better than Mooncake on
this stack, but even NIXL imposes 16-38% across metrics — too
expensive for selective-PD-sep on agentic workloads where the trigger
rate is < 0.5%.

Launch fixes required for NIXL multi-instance:
- VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default
  5600; we use 5600+i). Without this, 7 of 8 instances silently hang
  in `zmq.error.ZMQError: Address already in use` and the launcher
  trap kills all of them at health-check timeout.
- Health-check timeout raised from 180s to 360s; NIXL initialization
  (UCX agent + memory registration) is ~100-150s per instance under
  8-way concurrent load, vs Mooncake's ~30-60s.

New figure: fig_connector_substrate_attribution.png stacks plain /
framework / Mooncake-extra / v2-branch overhead per metric. Existing
figures (fig_kv_both_overhead, fig_three_way_hotspot) updated to
include NIXL as a fourth bar.

README updated with 4-way table, Result 1 reframed as "the cost is
mostly framework, not Mooncake — but Mooncake adds the hotspot
penalty", and the substrate-vs-PD-sep tradeoff math.

Refs: nixl_connector.py:700 handshake listener bind, factory.py
register_connector for the NixlConnector entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
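The per-instance port fix from the commit message can be sketched as follows. This is a minimal illustration, assuming instances are numbered 0 through 7 and each vLLM process is started with its own environment. Only `VLLM_NIXL_SIDE_CHANNEL_PORT`, the default 5600, and the 5600+i scheme come from the commit message; `nixl_env_for_instance`, `BASE_PORT`, and `NUM_INSTANCES` are hypothetical names, and the real launcher may set the variable differently.

```python
import os
from typing import Dict, Optional

BASE_PORT = 5600      # default side-channel port named in the commit message
NUM_INSTANCES = 8     # 8-instance topology used in the experiment


def nixl_env_for_instance(i: int, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with a unique NIXL side-channel port.

    Hypothetical helper: instance i gets port BASE_PORT + i so that no two
    instances try to bind the same side-channel address.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(BASE_PORT + i)
    return env


# Build one environment per instance and confirm the ports do not collide.
envs = [nixl_env_for_instance(i, base_env={}) for i in range(NUM_INSTANCES)]
ports = [int(e["VLLM_NIXL_SIDE_CHANNEL_PORT"]) for e in envs]
print(ports)
```

Each resulting dictionary could be passed as the `env` argument of `subprocess.Popen` when starting the corresponding instance.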
@@ -9,82 +9,127 @@ Model: Qwen3-Coder-30B-A3B-Instruct, 8 × TP1 on H20
 This section explores whether the **B2-confirmed same-worker
 prefill–decode interference** can be relieved by selectively
 migrating prefill to a different worker for the requests where the
-interference cost would dominate the transfer cost. We implement two
-flavors of the policy (strict gates, then relaxed gates) and a clean
-isolation control (`unified_kv_both`: same picker as `unified`, but
-the vLLMs are launched in `kv_role=kv_both` so the Mooncake
-substrate is on but never triggers).
+interference cost would dominate the transfer cost. We implement
+two flavors of the routing policy (strict gates, then relaxed
+gates) and **two isolation controls** that use the unified picker
+but launch vLLMs in `kv_role=kv_both` so the connector substrate
+is on but never PD-seps:

-Three findings:
+- `unified_kv_both`: with **MooncakeConnector**
+- `unified_nixl_both`: with **NixlConnector** (NVIDIA's official
+  v1 connector; isolates connector implementation from policy)

-1. **`kv_role=kv_both` alone imposes a heavy always-on tax**: TTFT
-   p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain `unified`,
-   with no PD-sep ever firing.
-2. **PD-sep almost never triggers on a real agentic workload**:
+Four findings:
+
+1. **`kv_role=kv_both` imposes a substantial always-on tax even
+   when no PD-sep ever fires**: with Mooncake it's TTFT p90 +45%,
+   TPOT p90 +25%, hotspot +19%; with NIXL it's TTFT p90 +38%,
+   TPOT p90 +16%, hotspot +0.2%.
+2. **About half of the substrate cost is generic v1-connector
+   framework overhead** (proxied by NIXL since it's the leanest
+   implementation): KV buffer GPU memory cut from the model's
+   working budget, `SchedulerOutput.kv_connector_metadata`
+   round-trip, and altered `kv_cache_manager` block-lifecycle
+   semantics. **NIXL is meaningfully better than Mooncake** but
+   still imposes a 16-38% tax vs no connector.
+3. **PD-sep almost never triggers on a real agentic workload**:
    0.16% with strict gates, 0.41% with relaxed gates. Agentic
-   workloads have 93% intra-session reuse, so most requests land on
-   workers that already hold cache — the uncached tail is too small
-   to be worth migrating.
-3. **When PD-sep does fire, the cost model is wrong by ~10–20×**:
+   workloads have 93% intra-session reuse, so most requests land
+   on workers that already hold cache — the uncached tail is too
+   small to be worth migrating.
+4. **When PD-sep does fire, the cost model is wrong by ~10–20×**:
    the calibrated `0.3s + bytes / 2.7 GB/s` predicts 1–2 s migrate
-   cost; observed TTFT on triggered requests is 12–45 s. The same
-   D-side block-reservation pressure and absence of layerwise
-   pipelining that the E2 audit flagged still dominate.
+   cost; observed TTFT on triggered requests is 12–45 s.

 The net latency of `unified_v2` is **not better than plain
-`unified`**. Improving agentic PD-sep requires fixing the underlying
-Mooncake transfer mechanism (E2 patches 6.1 lazy block reservation
-and 6.3 layerwise pipelining), not the routing decision.
+`unified`** under either Mooncake or NIXL substrate. Improving
+agentic PD-sep requires (a) using the leaner connector (NIXL >
+Mooncake by 5-19 pp across metrics), and (b) fixing the underlying
+transfer mechanism (E2 patches 6.1 lazy block reservation and 6.3
+layerwise pipelining), not just the routing decision.

 ## Substrate

-We compare three policies on identical traces:
+We compare four policies on identical traces:

 | policy | picker | vLLM launch mode | what's it for |
 |---|---|---|---|
-| `unified` | hybrid affinity + LMetric | plain (no Mooncake) | the headline baseline |
-| `unified_kv_both` | same as `unified` | `kv_role=kv_both` + bootstrap | isolation control: how much does kv_both *alone* cost? |
-| `unified_v2` | unified + selective PD-sep | `kv_role=kv_both` + bootstrap | the actual experiment |
+| `unified` | hybrid affinity + LMetric | plain (no connector) | the headline baseline |
+| `unified_kv_both` | same as `unified` | `MooncakeConnector` + `kv_both` | substrate control: Mooncake cost without PD-sep |
+| `unified_nixl_both` | same as `unified` | `NixlConnector` + `kv_both` | substrate control: NIXL cost without PD-sep, attributes overhead to "framework vs Mooncake" |
+| `unified_v2` | unified + selective PD-sep | `MooncakeConnector` + `kv_both` + bootstrap | the actual experiment |

-All three use the same trace, the same 8-instance topology, the same
+All four use the same trace, the same 8-instance topology, the same
 shadow-drift–corrected proxy (`scripts/cache_aware_proxy.py` post-fix
 `95c8ef8`). Plain `unified` was rerun on the patched proxy
 (`b3_sweep_20260525_095043/unified`) under the same conditions.

-## Result 1 — kv_both is expensive by itself
+NIXL required two launch fixes beyond Mooncake:
+- `VLLM_NIXL_SIDE_CHANNEL_PORT` must be unique per instance
+  (default 5600 → 5600..5607); otherwise instances 2..8 silently
+  hang in `zmq.error.ZMQError: Address already in use`.
+- Health-check timeout had to be raised from 180 s to 360 s
+  because NIXL initialization (UCX agent + memory registration)
+  takes ~100-150 s per instance under 8-way concurrent launch.
+
+## Result 1 — kv_both is expensive by itself, and only partly Mooncake's fault



 Switching the vLLM launch from plain to `kv_role=kv_both` without
-ever triggering PD-sep already costs:
+ever triggering PD-sep imposes a substrate tax. We compare the two
+connectors available in vendored vLLM:

-| metric | plain `unified` | `unified_kv_both` | Δ |
-|---|---:|---:|---|
-| TTFT p50 | 0.50 s | 0.50 s | +0% |
-| TTFT p90 | 7.35 s | 10.67 s | **+45%** |
-| TTFT p99 | 42.34 s | 45.19 s | +7% |
-| TPOT p90 | 17.1 ms | 21.3 ms | **+25%** |
-| E2E p90 | 18.03 s | 22.89 s | **+27%** |
-| APC | 79.4% | 78.3% | −1.1 pp |
-| hotspot index | 3.667 | **4.363** | **+19%** |
+| metric | plain `unified` | `unified_nixl_both` | `unified_kv_both` (Mooncake) |
+|---|---:|---:|---:|
+| TTFT p50 | 0.50 s | 0.51 s (+1%) | 0.50 s (+0%) |
+| **TTFT p90** | 7.35 s | **10.13 s (+38%)** | **10.67 s (+45%)** |
+| TTFT p99 | 42.34 s | 44.58 s (+5%) | 45.19 s (+7%) |
+| TPOT p90 | 17.1 ms | **19.8 ms (+16%)** | **21.3 ms (+25%)** |
+| E2E p90 | 18.03 s | **21.18 s (+17%)** | **22.89 s (+27%)** |
+| APC | 79.4% | 79.1% (−0.3 pp) | 78.3% (−1.1 pp) |
+| **hotspot index** | 3.667 | **3.674 (+0.2%)** | **4.363 (+19%)** |
+| interference index | n/a | 5.58 | 8.57 |

-Two contributing factors:
+

-1. **The Mooncake `MooncakeConnector` runs even when no transfer is
-   pending.** Every scheduler step it walks `set(cache.keys())`
-   against `_known_hash_keys` (E2 audit §6.5) and updates the
-   `KVConnectorMetadata`. This is O(|cache|) per step on every
-   engine, even when no producer/consumer relationship is active.
-2. **Block reservation semantics differ** under kv_both. The
-   scheduler treats blocks as candidates for export-to-others, so
-   the prefix cache LRU pressure is slightly different (we lose 1
-   pp APC).
+Reading the table from left to right gives a clean attribution:

-Practical implication: **you don't enable kv_both for free**. If a
-deployment wants the option to do PD-sep selectively, the 45% TTFT
-p90 tax applies even on requests that stay local. This needs to
-recoverable cost before any selective-PD-sep policy is worth
-shipping.
+- **NIXL−plain** = the **v1-connector framework's irreducible cost**
+  (TTFT p90 +38%, TPOT p90 +16%, E2E p90 +17%). This is the cost
+  *any* v1 KV connector imposes:
+  - the 1 GB `kv_buffer_size` carved from `gpu-memory-utilization`,
+    reducing the KV cache budget;
+  - per-step `SchedulerOutput.kv_connector_metadata` serialization
+    and round-trip through the connector worker;
+  - altered block-lifecycle semantics in `kv_cache_manager`
+    (`delay_free_blocks=True` is the default once any connector is
+    loaded, slowing LRU eviction).
+- **Mooncake−NIXL** = the **Mooncake-implementation-specific extra**
+  (TTFT p90 +7 pp, TPOT p90 +9 pp, E2E p90 +10 pp, hotspot +19 pp).
+  This is the cost Mooncake's design choices add on top of the
+  generic framework:
+  - per-scheduler-step `set(self._block_pool.cache.keys())` diff
+    against `_known_hash_keys` (`mooncake_connector.py:432-456`)
+    walks O(|cache|) on every step on every engine, costing ~4 M
+    set operations per second on a 200 k-block cache;
+  - the hash sync runs even when no `direct_read` consumer is
+    present, so the cost is paid unconditionally;
+  - block-lifecycle is further constrained because Mooncake
+    requires `delay_free` until the explicit `finished_sending`
+    arrives, vs NIXL which can release blocks earlier.
+
+The **most striking gap is hotspot**: Mooncake's per-step hash
+sync runs on the scheduler's GIL and disrupts the timeliness of
+routing decisions, amplifying load imbalance by 19%. NIXL has no
+equivalent global-state maintenance and preserves the plain-unified
+hotspot to within 0.2%.
+
+Practical implication: **you don't enable any v1 KV connector for
+free**, but if you have to enable one, NIXL is meaningfully cheaper
+than Mooncake. Even NIXL's 38% TTFT p90 tax is large enough that
+PD-sep needs to recover it on a non-trivial fraction of requests
+before being worth it.

 ## Result 2 — PD-sep rarely fires on a real agentic trace

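The attribution in the Result 1 hunk above is plain arithmetic on the table, and a short sketch makes the two deltas explicit. The p90 values below are copied from the README table as printed (rounded), so the percentages can differ by a few tenths from the commit message, which was presumably computed from unrounded data. `TABLE` and `attribute` are hypothetical names used only for this illustration.

```python
# p90 values copied from the Result 1 table (rounded as printed there).
TABLE = {
    "TTFT p90 (s)":  {"plain": 7.35,  "nixl": 10.13, "mooncake": 10.67},
    "TPOT p90 (ms)": {"plain": 17.1,  "nixl": 19.8,  "mooncake": 21.3},
    "E2E p90 (s)":   {"plain": 18.03, "nixl": 21.18, "mooncake": 22.89},
    "hotspot index": {"plain": 3.667, "nixl": 3.674, "mooncake": 4.363},
}


def attribute(row):
    """Split the Mooncake overhead into two parts, both relative to plain.

    framework_pct     : (NIXL - plain) / plain, the proxy for generic
                        v1-connector cost.
    mooncake_extra_pp : (Mooncake - NIXL) / plain, the additional
                        percentage points attributed to Mooncake.
    """
    plain = row["plain"]
    framework_pct = 100.0 * (row["nixl"] - plain) / plain
    mooncake_extra_pp = 100.0 * (row["mooncake"] - row["nixl"]) / plain
    return framework_pct, mooncake_extra_pp


results = {metric: attribute(row) for metric, row in TABLE.items()}
for metric, (fw, extra) in results.items():
    print(f"{metric:14s} framework {fw:+6.1f}%   Mooncake extra {extra:+6.1f} pp")
```

Both parts are expressed relative to the plain baseline, so they add up to the total Mooncake overhead for each metric.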
@@ -153,24 +198,24 @@ The first-token clock for the 49 k request is **21× the model's
 prediction**. This is not a small mis-tuning — it's a structurally
 different model.

-## Result 4 — three-way comparison
+## Result 4 — four-way comparison



 The full table:

-| metric | unified (plain) | unified_kv_both | unified_v2 (relaxed) |
-|---|---:|---:|---:|
-| n_ok | 1214 | 1214 | 1214 |
-| TTFT p50 | 0.50 s | 0.50 s | 0.49 s |
-| TTFT p90 | 7.35 s | 10.67 s | 10.98 s |
-| TTFT p99 | 42.34 s | 45.19 s | 49.45 s |
-| TPOT p90 | 17.1 ms | 21.3 ms | 18.4 ms |
-| E2E p90 | 18.03 s | 22.89 s | 22.53 s |
-| APC | 79.4% | 78.3% | 77.6% |
-| interference index | n/a (no engine_state) | 8.57 | 8.46 |
-| hotspot index | 3.667 | 4.363 | 3.910 |
-| n_slow | 189 | 198 | 198 |
+| metric | unified (plain) | unified_nixl_both | unified_kv_both (Mooncake) | unified_v2 (relaxed) |
+|---|---:|---:|---:|---:|
+| n_ok | 1214 | 1214 | 1214 | 1214 |
+| TTFT p50 | 0.50 s | 0.51 s | 0.50 s | 0.49 s |
+| TTFT p90 | 7.35 s | 10.13 s | 10.67 s | 10.98 s |
+| TTFT p99 | 42.34 s | 44.58 s | 45.19 s | 49.45 s |
+| TPOT p90 | 17.1 ms | 19.8 ms | 21.3 ms | 18.4 ms |
+| E2E p90 | 18.03 s | 21.18 s | 22.89 s | 22.53 s |
+| APC | 79.4% | 79.1% | 78.3% | 77.6% |
+| interference index | n/a | 5.58 | 8.57 | 8.46 |
+| hotspot index | 3.667 | 3.674 | 4.363 | 3.910 |
+| n_slow | 189 | 192 | 198 | 198 |

 ### v2 vs the kv_both control (the right comparison)

@@ -155,6 +155,32 @@
         "unknown": 49
       }
     },
+    {
+      "policy": "unified_nixl_both",
+      "n_ok": 1214,
+      "n_total": 1214,
+      "ttft_p50_s": 0.5138550130068325,
+      "ttft_p90_s": 10.127110345300755,
+      "ttft_p99_s": 44.5789094621703,
+      "tpot_p50_s": 0.008423213202440761,
+      "tpot_p90_s": 0.019759515867947428,
+      "tpot_p99_s": 0.1079433335279151,
+      "e2e_p50_s": 1.866590676479973,
+      "e2e_p90_s": 21.179128799570027,
+      "e2e_p99_s": 96.01196486203865,
+      "apc_ratio": 0.791441828164218,
+      "interference_index": 5.580715970433481,
+      "hotspot_index_ttft_p90": 3.673957447190547,
+      "reuse_intra_frac": 0.930632797070364,
+      "reuse_cross_frac": 0.05718149217603143,
+      "n_slow": 192,
+      "failure_counts": {
+        "cache_miss_large_append": 21,
+        "hot_worker_queue": 75,
+        "same_worker_prefill_overlap": 72,
+        "unknown": 24
+      }
+    },
     {
       "policy": "unified_v2",
       "n_ok": 1214,
File diff suppressed because one or more lines are too long
@@ -0,0 +1,24 @@
+{
+  "hotspot_index_ttft_p90": 3.673957447190547,
+  "per_worker_latency_p90_s": {
+    "http://127.0.0.1:8000": 21.5702620673168,
+    "http://127.0.0.1:8001": 21.44246501957532,
+    "http://127.0.0.1:8002": 7.497513776784784,
+    "http://127.0.0.1:8003": 18.975387462502113,
+    "http://127.0.0.1:8004": 27.733961877820548,
+    "http://127.0.0.1:8005": 14.178356938017535,
+    "http://127.0.0.1:8006": 25.44877168269595,
+    "http://127.0.0.1:8007": 54.500166546402035
+  },
+  "per_worker_ttft_p90_s": {
+    "http://127.0.0.1:8000": 7.380765471985799,
+    "http://127.0.0.1:8001": 14.109222683508415,
+    "http://127.0.0.1:8002": 3.001173847797329,
+    "http://127.0.0.1:8003": 14.087287129514152,
+    "http://127.0.0.1:8004": 14.151121024426537,
+    "http://127.0.0.1:8005": 6.165523712011057,
+    "http://127.0.0.1:8006": 6.314287615299688,
+    "http://127.0.0.1:8007": 39.43635586597957
+  },
+  "status": "supported"
+}
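The source does not define the hotspot index, but the value in the per-worker file above is reproduced by dividing the maximum per-worker TTFT p90 by the median. The sketch below checks that reading against the NIXL numbers; treat the formula as an inference from the data rather than the project's documented definition. `hotspot_index` is a hypothetical helper name.

```python
from statistics import median

# per_worker_ttft_p90_s for unified_nixl_both, copied from the JSON above.
per_worker_ttft_p90_s = {
    "http://127.0.0.1:8000": 7.380765471985799,
    "http://127.0.0.1:8001": 14.109222683508415,
    "http://127.0.0.1:8002": 3.001173847797329,
    "http://127.0.0.1:8003": 14.087287129514152,
    "http://127.0.0.1:8004": 14.151121024426537,
    "http://127.0.0.1:8005": 6.165523712011057,
    "http://127.0.0.1:8006": 6.314287615299688,
    "http://127.0.0.1:8007": 39.43635586597957,
}


def hotspot_index(per_worker):
    """Ratio of the hottest worker's TTFT p90 to the median worker's.

    Assumed definition: inferred because it matches the reported value,
    not taken from the project's code.
    """
    vals = list(per_worker.values())
    return max(vals) / median(vals)


reported = 3.673957447190547  # "hotspot_index_ttft_p90" in the same file
computed = hotspot_index(per_worker_ttft_p90_s)
print(f"computed={computed:.6f} reported={reported:.6f}")
```

Under this reading, a single slow worker (here `:8007`) drives the index, which is why the index is sensitive to anything that delays routing decisions for one engine.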
Binary file not shown. (After: 83 KiB)
Binary file not shown. (Before: 70 KiB, After: 86 KiB)
Binary file not shown. (Before: 54 KiB, After: 57 KiB)
@@ -34,37 +34,39 @@ def _load(name: str):


 POLICY_COLORS = {
-    "unified": "#2ca02c",
-    "unified_kv_both": "#9467bd",
-    "unified_v2": "#d62728",
-    "unified_v2_strict": "#ff7f0e",
+    "unified": "#2ca02c",
+    "unified_kv_both": "#9467bd",
+    "unified_nixl_both": "#1f77b4",
+    "unified_v2": "#d62728",
+    "unified_v2_strict": "#ff7f0e",
 }


 def fig_kv_both_overhead():
     comp = _load("b3_policy_comparison.json")
     by = {r["policy"]: r for r in comp["rows"]}
-    pols = ["unified", "unified_kv_both", "unified_v2"]
+    pols = ["unified", "unified_kv_both", "unified_nixl_both", "unified_v2"]
     metrics = [
         ("TTFT p90 (s)", lambda r: r["ttft_p90_s"]),
         ("TPOT p90 (ms)", lambda r: r["tpot_p90_s"] * 1000),
         ("E2E p90 (s)", lambda r: r["e2e_p90_s"]),
         ("hotspot index", lambda r: r["hotspot_index_ttft_p90"]),
     ]
-    fig, axes = plt.subplots(1, 4, figsize=(14, 4))
+    fig, axes = plt.subplots(1, 4, figsize=(15, 4.2))
     for ax, (label, fn) in zip(axes, metrics):
         vals = [fn(by[p]) for p in pols]
-        bars = ax.bar(pols, vals,
+        labels_short = [p.replace("unified_", "") for p in pols]
+        labels_short[0] = "plain"
+        bars = ax.bar(labels_short, vals,
                       color=[POLICY_COLORS[p] for p in pols],
                       edgecolor="black", linewidth=0.5)
         ax.set_title(label)
-        ax.tick_params(axis="x", rotation=20, labelsize=9)
+        ax.tick_params(axis="x", rotation=15, labelsize=9)
         for b, v in zip(bars, vals):
             ax.text(b.get_x() + b.get_width() / 2, v,
                     f"{v:.2f}" if v < 100 else f"{v:.0f}",
                     ha="center", va="bottom", fontsize=9)
         ax.grid(alpha=0.3, axis="y")
-        # delta annotation
         baseline = vals[0]
         for i, v in enumerate(vals):
             if i == 0:
@@ -74,8 +76,8 @@ def fig_kv_both_overhead():
                     fontsize=10, fontweight="bold",
                     color="darkred" if pct > 0 else "darkgreen")
     fig.suptitle(
-        "kv_both adds ~45% to TTFT p90 even without PD-sep firing.\n"
-        "v2's PD-sep barely recovers the gap (and overshoots TTFT p99)."
+        "Mooncake substrate adds 19-45% across metrics; NIXL is 5-19pp better but\n"
+        "still 16-38% above plain. v2's 5 PD-sep events don't recover the substrate tax."
     )
     fig.tight_layout()
     fig.savefig(OUT / "fig_kv_both_overhead.png", dpi=120)
@@ -203,27 +205,29 @@ def fig_v2_predicted_vs_actual():


 def fig_three_way_hotspot():
-    pols = ["unified", "unified_kv_both", "unified_v2"]
+    pols = ["unified", "unified_kv_both", "unified_nixl_both", "unified_v2"]
     per_worker = {p: _load(f"per_worker_{p}.json") for p in pols}
     workers = sorted(per_worker["unified"]["per_worker_ttft_p90_s"].keys())

     x = range(len(workers))
-    width = 0.27
-    fig, ax = plt.subplots(figsize=(11, 5))
+    n = len(pols)
+    width = 0.85 / n
+    fig, ax = plt.subplots(figsize=(12, 5))
     for i, p in enumerate(pols):
         d = per_worker[p]["per_worker_ttft_p90_s"]
         vals = [d[w] for w in workers]
-        offset = (i - 1) * width
+        offset = (i - (n - 1) / 2) * width
+        label = p.replace("unified_", "") if p != "unified" else "plain"
         ax.bar([j + offset for j in x], vals, width,
-               label=f"{p} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})",
+               label=f"{label} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})",
                color=POLICY_COLORS[p], edgecolor="black", linewidth=0.4)
     short = [w.replace("http://127.0.0.1:", ":") for w in workers]
     ax.set_xticks(list(x))
     ax.set_xticklabels(short, rotation=0, fontsize=9)
     ax.set_ylabel("worker TTFT p90 (s)")
     ax.set_title(
-        "Per-worker TTFT p90 distribution. kv_both alone makes the hot worker hotter\n"
-        "(unified→kv_both: 37.7s→43.5s peak); v2's 5 PD-sep triggers nudge it back."
+        "Per-worker TTFT p90 distribution across substrates. Mooncake (kv_both)\n"
+        "amplifies the hot worker (hotspot 4.36); NIXL keeps it close to plain (3.67)."
     )
     ax.legend(loc="upper left", fontsize=9)
     ax.grid(alpha=0.3, axis="y")
@@ -232,12 +236,64 @@ def fig_three_way_hotspot():
     plt.close(fig)


+def fig_connector_substrate_attribution():
+    """Decomposes overhead into v1-framework cost (shared by all connectors,
+    proxied by NIXL since it's the leanest) and Mooncake-specific cost."""
+    comp = _load("b3_policy_comparison.json")
+    by = {r["policy"]: r for r in comp["rows"]}
+    metrics = [
+        ("TTFT p90 (s)", "ttft_p90_s", False),
+        ("TPOT p90 (ms)", "tpot_p90_s", True),
+        ("E2E p90 (s)", "e2e_p90_s", False),
+        ("hotspot index", "hotspot_index_ttft_p90", False),
+    ]
+    fig, axes = plt.subplots(1, 4, figsize=(15, 4))
+    for ax, (label, key, scale_ms) in zip(axes, metrics):
+        plain = by["unified"][key] * (1000 if scale_ms else 1)
+        nixl = by["unified_nixl_both"][key] * (1000 if scale_ms else 1)
+        moon = by["unified_kv_both"][key] * (1000 if scale_ms else 1)
+        v2 = by["unified_v2"][key] * (1000 if scale_ms else 1)
+
+        framework_cost = nixl - plain  # what NIXL adds = v1 framework cost
+        mooncake_extra = moon - nixl  # extra on top from Mooncake
+        v2_branch_extra = v2 - moon  # extra from PD-sep branch (Mooncake + 5 events)
+
+        bottom = 0
+        ax.bar(["overhead"], [plain], color="#cccccc",
+               edgecolor="black", linewidth=0.4,
+               label=f"plain unified ({plain:.2f})")
+        bottom += plain
+        ax.bar(["overhead"], [framework_cost], bottom=[bottom],
+               color="#1f77b4", edgecolor="black", linewidth=0.4,
+               label=f"v1 framework (+{framework_cost:.2f})")
+        bottom += framework_cost
+        ax.bar(["overhead"], [mooncake_extra], bottom=[bottom],
+               color="#9467bd", edgecolor="black", linewidth=0.4,
+               label=f"Mooncake extra (+{mooncake_extra:.2f})")
+        bottom += mooncake_extra
+        ax.bar(["overhead"], [v2_branch_extra], bottom=[bottom],
+               color="#d62728", edgecolor="black", linewidth=0.4,
+               label=f"v2 PD-sep branch ({v2_branch_extra:+.2f})")
+        ax.set_title(label)
+        ax.legend(fontsize=8, loc="upper right")
+        ax.grid(alpha=0.3, axis="y")
+        ax.tick_params(axis="x", labelbottom=False)
+    fig.suptitle(
+        "Attribution: plain unified vs NIXL substrate vs Mooncake substrate vs v2.\n"
+        "Blue: cost shared by any v1 connector. Purple: cost specific to Mooncake."
+    )
+    fig.tight_layout()
+    fig.savefig(OUT / "fig_connector_substrate_attribution.png", dpi=120)
+    plt.close(fig)
+
+
 def main():
     fig_kv_both_overhead()
     fig_v2_trigger_funnel()
     fig_v2_predicted_vs_actual()
     fig_three_way_hotspot()
-    print(f"wrote 4 figures to {OUT}")
+    fig_connector_substrate_attribution()
+    print(f"wrote 5 figures to {OUT}")


 if __name__ == "__main__":
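The offset change in `fig_three_way_hotspot` can be checked in isolation. The old `(i - 1) * width` centers a group only when there are exactly three bars, while `(i - (n - 1) / 2) * width` centers any number of bars around the tick. The sketch below is a standalone illustration; `bar_offsets` and `old_offsets` are hypothetical helpers, not functions from the script.

```python
from typing import List


def bar_offsets(n: int, group_width: float = 0.85) -> List[float]:
    """Offsets that center n side-by-side bars on each x tick.

    Mirrors the new formula in the diff: width = group_width / n and
    offset_i = (i - (n - 1) / 2) * width.
    """
    width = group_width / n
    return [(i - (n - 1) / 2) * width for i in range(n)]


def old_offsets(n: int, width: float = 0.27) -> List[float]:
    """Offsets from the previous formula, (i - 1) * width."""
    return [(i - 1) * width for i in range(n)]


# Centered groups have offsets that sum to zero.
print("new, n=3:", bar_offsets(3))
print("new, n=4:", bar_offsets(4))
print("old, n=4:", old_offsets(4))
```

With four policies, the old offsets sum to `2 * width`, so every group would sit to the right of its tick, which is the misalignment the new formula avoids.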