Add NIXL substrate isolation control + attribution decomposition

Adds unified_nixl_both to elastic_migration_v2: same picker as
unified_kv_both (never triggers PD-sep), but launches vLLM with
NixlConnector instead of MooncakeConnector. Compared against plain
unified and unified_kv_both (Mooncake) we can now attribute the
substrate overhead between "v1 connector framework irreducible cost"
(proxied by the leaner NIXL) and "Mooncake implementation extra"
(Mooncake - NIXL).

Result (vs plain unified, both substrates never PD-sep):

  metric        plain    NIXL     Mooncake
  TTFT p90      7.35s    +37.9%   +45.3%    (NIXL: +7pp better)
  TPOT p90      17.1ms   +15.5%   +24.5%    (NIXL: +9pp better)
  E2E p90       18.03s   +17.4%   +27.0%    (NIXL: +10pp better)
  hotspot       3.667    +0.2%    +19.0%    (NIXL: keeps it flat)
  APC           79.4%    -0.3pp   -1.1pp
  interference  -        5.58     8.57      (NIXL: ~35% lower)

The cleanest signal is hotspot: NIXL preserves plain-unified's
distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step
O(|cache|) `set(self._block_pool.cache.keys())` diff against
_known_hash_keys (mooncake_connector.py:432-456) inflates routing
imbalance by 19%. The hash sync runs unconditionally even when no
direct_read consumer is present.

Attribution:
  NIXL-plain    ~= v1 framework irreducible cost (kv_buffer GPU memory,
                   per-step SchedulerOutput.kv_connector_metadata
                   round-trip, altered kv_cache_manager block-lifecycle).
  Mooncake-NIXL ~= Mooncake-specific overhead (the hash-sync loop and
                   stricter delay_free semantics).

Practical implication: NIXL is meaningfully better than Mooncake on
this stack, but even NIXL imposes 16-38% across metrics, which is too
expensive for selective-PD-sep on agentic workloads where the trigger
rate is < 0.5%.

Launch fixes required for NIXL multi-instance:
- VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default
  5600; we use 5600+i). Without this, 7 of 8 instances silently hang
  in `zmq.error.ZMQError: Address already in use` and the launcher
  trap kills all of them at health-check timeout.
- Health-check timeout raised from 180s to 360s; NIXL initialization
  (UCX agent + memory registration) is ~100-150s per instance under
  8-way concurrent load, vs Mooncake's ~30-60s.

New figure: fig_connector_substrate_attribution.png stacks plain /
framework / Mooncake-extra / v2-branch overhead per metric.

Existing figures (fig_kv_both_overhead, fig_three_way_hotspot) updated
to include NIXL as a fourth bar.

README updated with 4-way table, Result 1 reframed as "the cost is
mostly framework, not Mooncake, but Mooncake adds the hotspot
penalty", and the substrate-vs-PD-sep tradeoff math.

Refs: nixl_connector.py:700 handshake listener bind, factory.py
register_connector for the NixlConnector entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -9,82 +9,127 @@ Model: Qwen3-Coder-30B-A3B-Instruct, 8 × TP1 on H20
|
|||||||
This section explores whether the **B2-confirmed same-worker
|
This section explores whether the **B2-confirmed same-worker
|
||||||
prefill–decode interference** can be relieved by selectively
|
prefill–decode interference** can be relieved by selectively
|
||||||
migrating prefill to a different worker for the requests where the
|
migrating prefill to a different worker for the requests where the
|
||||||
interference cost would dominate the transfer cost. We implement two
|
interference cost would dominate the transfer cost. We implement
|
||||||
flavors of the policy (strict gates, then relaxed gates) and a clean
|
two flavors of the routing policy (strict gates, then relaxed
|
||||||
isolation control (`unified_kv_both`: same picker as `unified`, but
|
gates) and **two isolation controls** that use the unified picker
|
||||||
the vLLMs are launched in `kv_role=kv_both` so the Mooncake
|
but launch vLLMs in `kv_role=kv_both` so the connector substrate
|
||||||
substrate is on but never triggers).
|
is on but never PD-seps:
|
||||||
|
|
||||||
Three findings:
|
- `unified_kv_both`: with **MooncakeConnector**
|
||||||
|
- `unified_nixl_both`: with **NixlConnector** (NVIDIA's official
|
||||||
|
v1 connector; isolates connector implementation from policy)
|
||||||
|
|
||||||
1. **`kv_role=kv_both` alone imposes a heavy always-on tax**: TTFT
|
Four findings:
|
||||||
p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain `unified`,
|
|
||||||
with no PD-sep ever firing.
|
1. **`kv_role=kv_both` imposes a substantial always-on tax even
|
||||||
2. **PD-sep almost never triggers on a real agentic workload**:
|
when no PD-sep ever fires**: with Mooncake it's TTFT p90 +45%,
|
||||||
|
TPOT p90 +25%, hotspot +19%; with NIXL it's TTFT p90 +38%,
|
||||||
|
TPOT p90 +16%, hotspot +0.2%.
|
||||||
|
2. **Most of the substrate latency cost is generic v1-connector
|
||||||
|
framework overhead** (proxied by NIXL since it's the leanest
|
||||||
|
implementation): KV buffer GPU memory cut from the model's
|
||||||
|
working budget, `SchedulerOutput.kv_connector_metadata`
|
||||||
|
round-trip, and altered `kv_cache_manager` block-lifecycle
|
||||||
|
semantics. **NIXL is meaningfully better than Mooncake** but
|
||||||
|
still imposes a 16-38% tax vs no connector.
|
||||||
|
3. **PD-sep almost never triggers on a real agentic workload**:
|
||||||
0.16% with strict gates, 0.41% with relaxed gates. Agentic
|
0.16% with strict gates, 0.41% with relaxed gates. Agentic
|
||||||
workloads have 93% intra-session reuse, so most requests land on
|
workloads have 93% intra-session reuse, so most requests land
|
||||||
workers that already hold cache — the uncached tail is too small
|
on workers that already hold cache — the uncached tail is too
|
||||||
to be worth migrating.
|
small to be worth migrating.
|
||||||
3. **When PD-sep does fire, the cost model is wrong by ~10–20×**:
|
4. **When PD-sep does fire, the cost model is wrong by ~10–20×**:
|
||||||
the calibrated `0.3s + bytes / 2.7 GB/s` predicts 1–2 s migrate
|
the calibrated `0.3s + bytes / 2.7 GB/s` predicts 1–2 s migrate
|
||||||
cost; observed TTFT on triggered requests is 12–45 s. The same
|
cost; observed TTFT on triggered requests is 12–45 s.
|
||||||
D-side block-reservation pressure and absence of layerwise
|
|
||||||
pipelining that the E2 audit flagged still dominate.
|
|
||||||
|
|
||||||
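The gap between the calibrated cost model and the observed TTFT in finding 4 can be checked with a few lines of arithmetic. This is a minimal sketch, not the picker's actual code: the function names are hypothetical, `GB` is assumed to mean 10^9 bytes, and the 2.7 GB payload is an illustrative size chosen only because it makes the prediction easy to read. Like the text, it compares observed TTFT directly against the predicted migrate cost.

```python
GB = 1e9  # assumption: the model's "GB/s" is decimal gigabytes per second


def predicted_migrate_cost_s(kv_bytes: float,
                             fixed_s: float = 0.3,
                             bandwidth_gb_s: float = 2.7) -> float:
    """Calibrated model from the text: 0.3 s + bytes / 2.7 GB/s."""
    return fixed_s + kv_bytes / (bandwidth_gb_s * GB)


def misprediction_factor(observed_ttft_s: float, kv_bytes: float) -> float:
    """How many times larger the observed TTFT is than the predicted cost."""
    return observed_ttft_s / predicted_migrate_cost_s(kv_bytes)


# Illustrative payload (hypothetical size): 2.7 GB of KV cache.
example_bytes = 2.7 * GB
predicted = predicted_migrate_cost_s(example_bytes)
factor_low = misprediction_factor(12.0, example_bytes)   # low end of observed range
factor_high = misprediction_factor(45.0, example_bytes)  # high end of observed range
print(f"predicted {predicted:.2f} s, observed/predicted {factor_low:.1f}x to {factor_high:.1f}x")
```

Because the fixed term and the bandwidth term are both small next to the observed 12–45 s, no retuning of the two constants closes the gap; the model is missing a term.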
The net latency of `unified_v2` is **not better than plain
|
The net latency of `unified_v2` is **not better than plain
|
||||||
`unified`**. Improving agentic PD-sep requires fixing the underlying
|
`unified`** under either Mooncake or NIXL substrate. Improving
|
||||||
Mooncake transfer mechanism (E2 patches 6.1 lazy block reservation
|
agentic PD-sep requires (a) using the leaner connector (NIXL >
|
||||||
and 6.3 layerwise pipelining), not the routing decision.
|
Mooncake by 7-19 pp on the p90 metrics and hotspot), and (b) fixing the underlying
|
||||||
|
transfer mechanism (E2 patches 6.1 lazy block reservation and 6.3
|
||||||
|
layerwise pipelining), not just the routing decision.
|
||||||
|
|
||||||
## Substrate
|
## Substrate
|
||||||
|
|
||||||
We compare three policies on identical traces:
|
We compare four policies on identical traces:
|
||||||
|
|
||||||
| policy | picker | vLLM launch mode | what's it for |
|
| policy | picker | vLLM launch mode | what's it for |
|
||||||
|---|---|---|---|
|
|---|---|---|---|
|
||||||
| `unified` | hybrid affinity + LMetric | plain (no Mooncake) | the headline baseline |
|
| `unified` | hybrid affinity + LMetric | plain (no connector) | the headline baseline |
|
||||||
| `unified_kv_both` | same as `unified` | `kv_role=kv_both` + bootstrap | isolation control: how much does kv_both *alone* cost? |
|
| `unified_kv_both` | same as `unified` | `MooncakeConnector` + `kv_both` | substrate control: Mooncake cost without PD-sep |
|
||||||
| `unified_v2` | unified + selective PD-sep | `kv_role=kv_both` + bootstrap | the actual experiment |
|
| `unified_nixl_both` | same as `unified` | `NixlConnector` + `kv_both` | substrate control: NIXL cost without PD-sep, attributes overhead to "framework vs Mooncake" |
|
||||||
|
| `unified_v2` | unified + selective PD-sep | `MooncakeConnector` + `kv_both` + bootstrap | the actual experiment |
|
||||||
|
|
||||||
All three use the same trace, the same 8-instance topology, the same
|
All four use the same trace, the same 8-instance topology, the same
|
||||||
shadow-drift–corrected proxy (`scripts/cache_aware_proxy.py` post-fix
|
shadow-drift–corrected proxy (`scripts/cache_aware_proxy.py` post-fix
|
||||||
`95c8ef8`). Plain `unified` was rerun on the patched proxy
|
`95c8ef8`). Plain `unified` was rerun on the patched proxy
|
||||||
(`b3_sweep_20260525_095043/unified`) under the same conditions.
|
(`b3_sweep_20260525_095043/unified`) under the same conditions.
|
||||||
|
|
||||||
## Result 1 — kv_both is expensive by itself
|
NIXL required two launch fixes beyond Mooncake:
|
||||||
|
- `VLLM_NIXL_SIDE_CHANNEL_PORT` must be unique per instance
|
||||||
|
(default 5600 → 5600..5607); otherwise instances 2..8 silently
|
||||||
|
hang in `zmq.error.ZMQError: Address already in use`.
|
||||||
|
- Health-check timeout had to be raised from 180 s to 360 s
|
||||||
|
because NIXL initialization (UCX agent + memory registration)
|
||||||
|
takes ~100-150 s per instance under 8-way concurrent launch.
|
||||||
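The two launch fixes can be sketched as follows. This is an illustration under stated assumptions, not the launcher used in this commit: `nixl_instance_env` and `wait_until` are hypothetical helper names, and the health probe is passed in as a callable so that no particular HTTP endpoint is assumed. Only `VLLM_NIXL_SIDE_CHANNEL_PORT`, the base port 5600, and the 360 s timeout come from the text above.

```python
import os
import time
from typing import Callable, Dict, Optional

NIXL_BASE_PORT = 5600      # default side-channel port named in the text
HEALTH_TIMEOUT_S = 360.0   # raised from 180 s to cover NIXL initialization


def nixl_instance_env(index: int,
                      base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with a per-instance side-channel port."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(NIXL_BASE_PORT + index)
    return env


def wait_until(probe: Callable[[], bool],
               timeout_s: float = HEALTH_TIMEOUT_S,
               interval_s: float = 1.0) -> bool:
    """Poll `probe` until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False


# Eight instances, each with its own port (5600..5607).
envs = [nixl_instance_env(i, base_env={}) for i in range(8)]
ports = [e["VLLM_NIXL_SIDE_CHANNEL_PORT"] for e in envs]
```

Each `env` would be passed to whatever process-spawning call starts instance `i`; deriving the port from the instance index keeps the assignment deterministic across relaunches.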
|
|
||||||
|
## Result 1 — kv_both is expensive by itself, and only partly Mooncake's fault
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
Switching the vLLM launch from plain to `kv_role=kv_both` without
|
Switching the vLLM launch from plain to `kv_role=kv_both` without
|
||||||
ever triggering PD-sep already costs:
|
ever triggering PD-sep imposes a substrate tax. We compare the two
|
||||||
|
connectors available in vendored vLLM:
|
||||||
|
|
||||||
| metric | plain `unified` | `unified_kv_both` | Δ |
|
| metric | plain `unified` | `unified_nixl_both` | `unified_kv_both` (Mooncake) |
|
||||||
|---|---:|---:|---|
|
|---|---:|---:|---:|
|
||||||
| TTFT p50 | 0.50 s | 0.50 s | +0% |
|
| TTFT p50 | 0.50 s | 0.51 s (+1%) | 0.50 s (+0%) |
|
||||||
| TTFT p90 | 7.35 s | 10.67 s | **+45%** |
|
| **TTFT p90** | 7.35 s | **10.13 s (+38%)** | **10.67 s (+45%)** |
|
||||||
| TTFT p99 | 42.34 s | 45.19 s | +7% |
|
| TTFT p99 | 42.34 s | 44.58 s (+5%) | 45.19 s (+7%) |
|
||||||
| TPOT p90 | 17.1 ms | 21.3 ms | **+25%** |
|
| TPOT p90 | 17.1 ms | **19.8 ms (+16%)** | **21.3 ms (+25%)** |
|
||||||
| E2E p90 | 18.03 s | 22.89 s | **+27%** |
|
| E2E p90 | 18.03 s | **21.18 s (+17%)** | **22.89 s (+27%)** |
|
||||||
| APC | 79.4% | 78.3% | −1.1 pp |
|
| APC | 79.4% | 79.1% (−0.3 pp) | 78.3% (−1.1 pp) |
|
||||||
| hotspot index | 3.667 | **4.363** | **+19%** |
|
| **hotspot index** | 3.667 | **3.674 (+0.2%)** | **4.363 (+19%)** |
|
||||||
|
| interference index | n/a | 5.58 | 8.57 |
|
||||||
|
|
||||||
Two contributing factors:
|

|
||||||
|
|
||||||
1. **The Mooncake `MooncakeConnector` runs even when no transfer is
|
Reading the table from left to right gives a clean attribution:
|
||||||
pending.** Every scheduler step it walks `set(cache.keys())`
|
|
||||||
against `_known_hash_keys` (E2 audit §6.5) and updates the
|
|
||||||
`KVConnectorMetadata`. This is O(|cache|) per step on every
|
|
||||||
engine, even when no producer/consumer relationship is active.
|
|
||||||
2. **Block reservation semantics differ** under kv_both. The
|
|
||||||
scheduler treats blocks as candidates for export-to-others, so
|
|
||||||
the prefix cache LRU pressure is slightly different (we lose 1
|
|
||||||
pp APC).
|
|
||||||
|
|
||||||
Practical implication: **you don't enable kv_both for free**. If a
|
- **NIXL−plain** = the **v1-connector framework's irreducible cost**
|
||||||
deployment wants the option to do PD-sep selectively, the 45% TTFT
|
(TTFT p90 +38%, TPOT p90 +16%, E2E p90 +17%). This is the cost
|
||||||
p90 tax applies even on requests that stay local. This needs to
|
*any* v1 KV connector imposes:
|
||||||
recoverable cost before any selective-PD-sep policy is worth
|
- the 1 GB `kv_buffer_size` carved from `gpu-memory-utilization`,
|
||||||
shipping.
|
reducing the KV cache budget;
|
||||||
|
- per-step `SchedulerOutput.kv_connector_metadata` serialization
|
||||||
|
and round-trip through the connector worker;
|
||||||
|
- altered block-lifecycle semantics in `kv_cache_manager`
|
||||||
|
(`delay_free_blocks=True` is the default once any connector is
|
||||||
|
loaded, slowing LRU eviction).
|
||||||
|
- **Mooncake−NIXL** = the **Mooncake-implementation-specific extra**
|
||||||
|
(TTFT p90 +7 pp, TPOT p90 +9 pp, E2E p90 +10 pp, hotspot +19 pp).
|
||||||
|
This is the cost Mooncake's design choices add on top of the
|
||||||
|
generic framework:
|
||||||
|
- per-scheduler-step `set(self._block_pool.cache.keys())` diff
|
||||||
|
against `_known_hash_keys` (`mooncake_connector.py:432-456`)
|
||||||
|
walks O(|cache|) on every step on every engine, costing ~4 M
|
||||||
|
set operations per second on a 200 k-block cache;
|
||||||
|
- the hash sync runs even when no `direct_read` consumer is
|
||||||
|
present, so the cost is paid unconditionally;
|
||||||
|
- block-lifecycle is further constrained because Mooncake
|
||||||
|
requires `delay_free` until the explicit `finished_sending`
|
||||||
|
arrives, vs NIXL which can release blocks earlier.
|
||||||
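The attribution above is plain subtraction against the same baseline, and can be reproduced from the p90 rows of the table. The sketch below is a worked example rather than the analysis script: `attribute` is a hypothetical helper, and because it uses the rounded table values its percentages can differ from the reported ones in the last digit.

```python
from typing import Dict

# p90 values copied from the table above: (plain, NIXL, Mooncake)
P90 = {
    "ttft_s": (7.35, 10.13, 10.67),
    "tpot_ms": (17.1, 19.8, 21.3),
    "e2e_s": (18.03, 21.18, 22.89),
}


def attribute(plain: float, nixl: float, mooncake: float) -> Dict[str, float]:
    """Split the Mooncake tax into a framework part and a Mooncake-only part.

    Both parts are expressed relative to the plain baseline, so they add up
    to the total Mooncake overhead in percent.
    """
    framework_pct = 100.0 * (nixl - plain) / plain
    mooncake_extra_pp = 100.0 * (mooncake - nixl) / plain
    total_pct = framework_pct + mooncake_extra_pp
    return {
        "framework_pct": framework_pct,
        "mooncake_extra_pp": mooncake_extra_pp,
        "total_pct": total_pct,
        "framework_share": framework_pct / total_pct,
    }


for name, (plain, nixl, mooncake) in P90.items():
    parts = attribute(plain, nixl, mooncake)
    print(f"{name}: framework +{parts['framework_pct']:.1f}%, "
          f"Mooncake extra +{parts['mooncake_extra_pp']:.1f} pp, "
          f"framework share {parts['framework_share']:.0%}")
```

Expressing the Mooncake-only part in percentage points of the plain baseline, rather than as a percentage of the NIXL value, is what lets the two parts sum to the headline Mooncake overhead.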
|
|
||||||
|
The **most striking gap is hotspot**: Mooncake's per-step hash
|
||||||
|
sync runs on the scheduler's GIL and disrupts the timeliness of
|
||||||
|
routing decisions, amplifying load imbalance by 19%. NIXL has no
|
||||||
|
equivalent global-state maintenance and preserves the plain-unified
|
||||||
|
hotspot to within 0.2%.
|
||||||
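To make the cost difference concrete, the sketch below contrasts a per-step full-set diff with an event-driven tracker. It is not Mooncake's code: `full_diff_sync` only mirrors the shape of the `set(cache.keys())` diff described above, and `IncrementalSync` is a hypothetical alternative that assumes the cache can report each insert and evict as it happens.

```python
from typing import Dict, Set, Tuple


def full_diff_sync(cache: Dict[int, object],
                   known: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Per-step full diff: touches every cached key, so cost is O(|cache|)."""
    current = set(cache.keys())   # materializes the whole key set every step
    added = current - known
    removed = known - current
    known |= added
    known -= removed
    return added, removed


class IncrementalSync:
    """Hypothetical alternative: record changes as they occur, O(#changes) per step.

    Assumes on_insert is called only for keys not currently cached and
    on_evict only for keys that are cached.
    """

    def __init__(self) -> None:
        self._added: Set[int] = set()
        self._removed: Set[int] = set()

    def on_insert(self, key: int) -> None:
        if key in self._removed:
            self._removed.discard(key)   # evicted then re-inserted: net no change
        else:
            self._added.add(key)

    def on_evict(self, key: int) -> None:
        if key in self._added:
            self._added.discard(key)     # inserted then evicted: net no change
        else:
            self._removed.add(key)

    def drain(self) -> Tuple[Set[int], Set[int]]:
        """Return and reset the changes accumulated since the last step."""
        added, removed = self._added, self._removed
        self._added, self._removed = set(), set()
        return added, removed
```

With the incremental form, a scheduler step in which nothing changed costs nothing, which is the property the text notes Mooncake's unconditional sync lacks.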
|
|
||||||
|
Practical implication: **you don't enable any v1 KV connector for
|
||||||
|
free**, but if you have to enable one, NIXL is meaningfully cheaper
|
||||||
|
than Mooncake. Even NIXL's 38% TTFT p90 tax is large enough that
|
||||||
|
PD-sep needs to recover it on a non-trivial fraction of requests
|
||||||
|
before being worth it.
|
||||||
|
|
||||||
## Result 2 — PD-sep rarely fires on a real agentic trace
|
## Result 2 — PD-sep rarely fires on a real agentic trace
|
||||||
|
|
||||||
@@ -153,24 +198,24 @@ The first-token clock for the 49 k request is **21× the model's
|
|||||||
prediction**. This is not a small mis-tuning — it's a structurally
|
prediction**. This is not a small mis-tuning — it's a structurally
|
||||||
different model.
|
different model.
|
||||||
|
|
||||||
## Result 4 — three-way comparison
|
## Result 4 — four-way comparison
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
The full table:
|
The full table:
|
||||||
|
|
||||||
| metric | unified (plain) | unified_kv_both | unified_v2 (relaxed) |
|
| metric | unified (plain) | unified_nixl_both | unified_kv_both (Mooncake) | unified_v2 (relaxed) |
|
||||||
|---|---:|---:|---:|
|
|---|---:|---:|---:|---:|
|
||||||
| n_ok | 1214 | 1214 | 1214 |
|
| n_ok | 1214 | 1214 | 1214 | 1214 |
|
||||||
| TTFT p50 | 0.50 s | 0.50 s | 0.49 s |
|
| TTFT p50 | 0.50 s | 0.51 s | 0.50 s | 0.49 s |
|
||||||
| TTFT p90 | 7.35 s | 10.67 s | 10.98 s |
|
| TTFT p90 | 7.35 s | 10.13 s | 10.67 s | 10.98 s |
|
||||||
| TTFT p99 | 42.34 s | 45.19 s | 49.45 s |
|
| TTFT p99 | 42.34 s | 44.58 s | 45.19 s | 49.45 s |
|
||||||
| TPOT p90 | 17.1 ms | 21.3 ms | 18.4 ms |
|
| TPOT p90 | 17.1 ms | 19.8 ms | 21.3 ms | 18.4 ms |
|
||||||
| E2E p90 | 18.03 s | 22.89 s | 22.53 s |
|
| E2E p90 | 18.03 s | 21.18 s | 22.89 s | 22.53 s |
|
||||||
| APC | 79.4% | 78.3% | 77.6% |
|
| APC | 79.4% | 79.1% | 78.3% | 77.6% |
|
||||||
| interference index | n/a (no engine_state) | 8.57 | 8.46 |
|
| interference index | n/a | 5.58 | 8.57 | 8.46 |
|
||||||
| hotspot index | 3.667 | 4.363 | 3.910 |
|
| hotspot index | 3.667 | 3.674 | 4.363 | 3.910 |
|
||||||
| n_slow | 189 | 198 | 198 |
|
| n_slow | 189 | 192 | 198 | 198 |
|
||||||
|
|
||||||
### v2 vs the kv_both control (the right comparison)
|
### v2 vs the kv_both control (the right comparison)
|
||||||
|
|
||||||
|
|||||||
@@ -155,6 +155,32 @@
|
|||||||
"unknown": 49
|
"unknown": 49
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
    {
      "policy": "unified_nixl_both",
      "n_ok": 1214,
      "n_total": 1214,
      "ttft_p50_s": 0.5138550130068325,
      "ttft_p90_s": 10.127110345300755,
      "ttft_p99_s": 44.5789094621703,
      "tpot_p50_s": 0.008423213202440761,
      "tpot_p90_s": 0.019759515867947428,
      "tpot_p99_s": 0.1079433335279151,
      "e2e_p50_s": 1.866590676479973,
      "e2e_p90_s": 21.179128799570027,
      "e2e_p99_s": 96.01196486203865,
      "apc_ratio": 0.791441828164218,
      "interference_index": 5.580715970433481,
      "hotspot_index_ttft_p90": 3.673957447190547,
      "reuse_intra_frac": 0.930632797070364,
      "reuse_cross_frac": 0.05718149217603143,
      "n_slow": 192,
      "failure_counts": {
        "cache_miss_large_append": 21,
        "hot_worker_queue": 75,
        "same_worker_prefill_overlap": 72,
        "unknown": 24
      }
    },
{
|
{
|
||||||
"policy": "unified_v2",
|
"policy": "unified_v2",
|
||||||
"n_ok": 1214,
|
"n_ok": 1214,
|
||||||
|
|||||||
File diff suppressed because one or more lines are too long
@@ -0,0 +1,24 @@
|
|||||||
{
  "hotspot_index_ttft_p90": 3.673957447190547,
  "per_worker_latency_p90_s": {
    "http://127.0.0.1:8000": 21.5702620673168,
    "http://127.0.0.1:8001": 21.44246501957532,
    "http://127.0.0.1:8002": 7.497513776784784,
    "http://127.0.0.1:8003": 18.975387462502113,
    "http://127.0.0.1:8004": 27.733961877820548,
    "http://127.0.0.1:8005": 14.178356938017535,
    "http://127.0.0.1:8006": 25.44877168269595,
    "http://127.0.0.1:8007": 54.500166546402035
  },
  "per_worker_ttft_p90_s": {
    "http://127.0.0.1:8000": 7.380765471985799,
    "http://127.0.0.1:8001": 14.109222683508415,
    "http://127.0.0.1:8002": 3.001173847797329,
    "http://127.0.0.1:8003": 14.087287129514152,
    "http://127.0.0.1:8004": 14.151121024426537,
    "http://127.0.0.1:8005": 6.165523712011057,
    "http://127.0.0.1:8006": 6.314287615299688,
    "http://127.0.0.1:8007": 39.43635586597957
  },
  "status": "supported"
}
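The per-worker values in this file are numerically consistent with the hotspot index being the maximum per-worker TTFT p90 divided by the median. That definition is inferred from the numbers here, not taken from the scoring code, so the sketch below should be read as a consistency check with a hypothetical `hotspot_index` helper.

```python
from statistics import median
from typing import Dict

# per_worker_ttft_p90_s copied from the JSON file above
per_worker_ttft_p90_s: Dict[str, float] = {
    "http://127.0.0.1:8000": 7.380765471985799,
    "http://127.0.0.1:8001": 14.109222683508415,
    "http://127.0.0.1:8002": 3.001173847797329,
    "http://127.0.0.1:8003": 14.087287129514152,
    "http://127.0.0.1:8004": 14.151121024426537,
    "http://127.0.0.1:8005": 6.165523712011057,
    "http://127.0.0.1:8006": 6.314287615299688,
    "http://127.0.0.1:8007": 39.43635586597957,
}


def hotspot_index(per_worker: Dict[str, float]) -> float:
    """Assumed definition: worst worker's TTFT p90 over the median worker's."""
    values = list(per_worker.values())
    return max(values) / median(values)


index = hotspot_index(per_worker_ttft_p90_s)
reported = 3.673957447190547  # "hotspot_index_ttft_p90" from the same file
print(f"recomputed {index:.6f} vs reported {reported:.6f}")
```

Under this reading, the index is driven almost entirely by the `:8007` worker, so a connector that makes one hot worker hotter moves the index even when the other seven workers are unchanged.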
Binary file not shown.
|
After Width: | Height: | Size: 83 KiB |
Binary file not shown.
|
Before Width: | Height: | Size: 70 KiB After Width: | Height: | Size: 86 KiB |
Binary file not shown.
|
Before Width: | Height: | Size: 54 KiB After Width: | Height: | Size: 57 KiB |
@@ -34,37 +34,39 @@ def _load(name: str):
|
|||||||
|
|
||||||
|
|
||||||
POLICY_COLORS = {
|
POLICY_COLORS = {
|
||||||
"unified": "#2ca02c",
|
"unified": "#2ca02c",
|
||||||
"unified_kv_both": "#9467bd",
|
"unified_kv_both": "#9467bd",
|
||||||
"unified_v2": "#d62728",
|
"unified_nixl_both": "#1f77b4",
|
||||||
"unified_v2_strict": "#ff7f0e",
|
"unified_v2": "#d62728",
|
||||||
|
"unified_v2_strict": "#ff7f0e",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
def fig_kv_both_overhead():
|
def fig_kv_both_overhead():
|
||||||
comp = _load("b3_policy_comparison.json")
|
comp = _load("b3_policy_comparison.json")
|
||||||
by = {r["policy"]: r for r in comp["rows"]}
|
by = {r["policy"]: r for r in comp["rows"]}
|
||||||
pols = ["unified", "unified_kv_both", "unified_v2"]
|
pols = ["unified", "unified_kv_both", "unified_nixl_both", "unified_v2"]
|
||||||
metrics = [
|
metrics = [
|
||||||
("TTFT p90 (s)", lambda r: r["ttft_p90_s"]),
|
("TTFT p90 (s)", lambda r: r["ttft_p90_s"]),
|
||||||
("TPOT p90 (ms)", lambda r: r["tpot_p90_s"] * 1000),
|
("TPOT p90 (ms)", lambda r: r["tpot_p90_s"] * 1000),
|
||||||
("E2E p90 (s)", lambda r: r["e2e_p90_s"]),
|
("E2E p90 (s)", lambda r: r["e2e_p90_s"]),
|
||||||
("hotspot index", lambda r: r["hotspot_index_ttft_p90"]),
|
("hotspot index", lambda r: r["hotspot_index_ttft_p90"]),
|
||||||
]
|
]
|
||||||
fig, axes = plt.subplots(1, 4, figsize=(14, 4))
|
fig, axes = plt.subplots(1, 4, figsize=(15, 4.2))
|
||||||
for ax, (label, fn) in zip(axes, metrics):
|
for ax, (label, fn) in zip(axes, metrics):
|
||||||
vals = [fn(by[p]) for p in pols]
|
vals = [fn(by[p]) for p in pols]
|
||||||
bars = ax.bar(pols, vals,
|
labels_short = [p.replace("unified_", "") for p in pols]
|
||||||
|
labels_short[0] = "plain"
|
||||||
|
bars = ax.bar(labels_short, vals,
|
||||||
color=[POLICY_COLORS[p] for p in pols],
|
color=[POLICY_COLORS[p] for p in pols],
|
||||||
edgecolor="black", linewidth=0.5)
|
edgecolor="black", linewidth=0.5)
|
||||||
ax.set_title(label)
|
ax.set_title(label)
|
||||||
ax.tick_params(axis="x", rotation=20, labelsize=9)
|
ax.tick_params(axis="x", rotation=15, labelsize=9)
|
||||||
for b, v in zip(bars, vals):
|
for b, v in zip(bars, vals):
|
||||||
ax.text(b.get_x() + b.get_width() / 2, v,
|
ax.text(b.get_x() + b.get_width() / 2, v,
|
||||||
f"{v:.2f}" if v < 100 else f"{v:.0f}",
|
f"{v:.2f}" if v < 100 else f"{v:.0f}",
|
||||||
ha="center", va="bottom", fontsize=9)
|
ha="center", va="bottom", fontsize=9)
|
||||||
ax.grid(alpha=0.3, axis="y")
|
ax.grid(alpha=0.3, axis="y")
|
||||||
# delta annotation
|
|
||||||
baseline = vals[0]
|
baseline = vals[0]
|
||||||
for i, v in enumerate(vals):
|
for i, v in enumerate(vals):
|
||||||
if i == 0:
|
if i == 0:
|
||||||
@@ -74,8 +76,8 @@ def fig_kv_both_overhead():
|
|||||||
fontsize=10, fontweight="bold",
|
fontsize=10, fontweight="bold",
|
||||||
color="darkred" if pct > 0 else "darkgreen")
|
color="darkred" if pct > 0 else "darkgreen")
|
||||||
fig.suptitle(
|
fig.suptitle(
|
||||||
"kv_both adds ~45% to TTFT p90 even without PD-sep firing.\n"
|
"Mooncake substrate adds 19-45% across metrics; NIXL is 5-19pp better but\n"
|
||||||
"v2's PD-sep barely recovers the gap (and overshoots TTFT p99)."
|
"still 16-38% above plain. v2's 5 PD-sep events don't recover the substrate tax."
|
||||||
)
|
)
|
||||||
fig.tight_layout()
|
fig.tight_layout()
|
||||||
fig.savefig(OUT / "fig_kv_both_overhead.png", dpi=120)
|
fig.savefig(OUT / "fig_kv_both_overhead.png", dpi=120)
|
||||||
@@ -203,27 +205,29 @@ def fig_v2_predicted_vs_actual():
|
|||||||
|
|
||||||
|
|
||||||
def fig_three_way_hotspot():
|
def fig_three_way_hotspot():
|
||||||
pols = ["unified", "unified_kv_both", "unified_v2"]
|
pols = ["unified", "unified_kv_both", "unified_nixl_both", "unified_v2"]
|
||||||
per_worker = {p: _load(f"per_worker_{p}.json") for p in pols}
|
per_worker = {p: _load(f"per_worker_{p}.json") for p in pols}
|
||||||
workers = sorted(per_worker["unified"]["per_worker_ttft_p90_s"].keys())
|
workers = sorted(per_worker["unified"]["per_worker_ttft_p90_s"].keys())
|
||||||
|
|
||||||
x = range(len(workers))
|
x = range(len(workers))
|
||||||
width = 0.27
|
n = len(pols)
|
||||||
fig, ax = plt.subplots(figsize=(11, 5))
|
width = 0.85 / n
|
||||||
|
fig, ax = plt.subplots(figsize=(12, 5))
|
||||||
for i, p in enumerate(pols):
|
for i, p in enumerate(pols):
|
||||||
d = per_worker[p]["per_worker_ttft_p90_s"]
|
d = per_worker[p]["per_worker_ttft_p90_s"]
|
||||||
vals = [d[w] for w in workers]
|
vals = [d[w] for w in workers]
|
||||||
offset = (i - 1) * width
|
offset = (i - (n - 1) / 2) * width
|
||||||
|
label = p.replace("unified_", "") if p != "unified" else "plain"
|
||||||
ax.bar([j + offset for j in x], vals, width,
|
ax.bar([j + offset for j in x], vals, width,
|
||||||
label=f"{p} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})",
|
label=f"{label} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})",
|
||||||
color=POLICY_COLORS[p], edgecolor="black", linewidth=0.4)
|
color=POLICY_COLORS[p], edgecolor="black", linewidth=0.4)
|
||||||
short = [w.replace("http://127.0.0.1:", ":") for w in workers]
|
short = [w.replace("http://127.0.0.1:", ":") for w in workers]
|
||||||
ax.set_xticks(list(x))
|
ax.set_xticks(list(x))
|
||||||
ax.set_xticklabels(short, rotation=0, fontsize=9)
|
ax.set_xticklabels(short, rotation=0, fontsize=9)
|
||||||
ax.set_ylabel("worker TTFT p90 (s)")
|
ax.set_ylabel("worker TTFT p90 (s)")
|
||||||
ax.set_title(
|
ax.set_title(
|
||||||
"Per-worker TTFT p90 distribution. kv_both alone makes the hot worker hotter\n"
|
"Per-worker TTFT p90 distribution across substrates. Mooncake (kv_both)\n"
|
||||||
"(unified→kv_both: 37.7s→43.5s peak); v2's 5 PD-sep triggers nudge it back."
|
"amplifies the hot worker (hotspot 4.36); NIXL keeps it close to plain (3.67)."
|
||||||
)
|
)
|
||||||
ax.legend(loc="upper left", fontsize=9)
|
ax.legend(loc="upper left", fontsize=9)
|
||||||
ax.grid(alpha=0.3, axis="y")
|
ax.grid(alpha=0.3, axis="y")
|
||||||
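The change from a fixed `width = 0.27` with `offset = (i - 1) * width` to `width = 0.85 / n` with `offset = (i - (n - 1) / 2) * width` is what keeps the bar group centered on each tick for any number of policies. The sketch below isolates that arithmetic; `grouped_bar_offsets` is a hypothetical helper and needs no plotting library.

```python
from typing import List


def grouped_bar_offsets(n_series: int, group_width: float = 0.85) -> List[float]:
    """Offsets of each series' bar center relative to the tick position.

    The bars split `group_width` evenly and the group is symmetric around 0.
    """
    width = group_width / n_series
    return [(i - (n_series - 1) / 2) * width for i in range(n_series)]


def fixed_offsets(n_series: int, width: float = 0.27) -> List[float]:
    """The previous scheme: only centered when n_series == 3."""
    return [(i - 1) * width for i in range(n_series)]


offsets = grouped_bar_offsets(4)
old_offsets = fixed_offsets(4)
print("centered:", [round(o, 4) for o in offsets])
print("previous:", [round(o, 4) for o in old_offsets])
```

With four series the previous scheme shifts the whole group to the right of the tick and makes it wider than one tick spacing, so adjacent workers' bars would overlap.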
@@ -232,12 +236,64 @@ def fig_three_way_hotspot():
|
|||||||
plt.close(fig)
|
plt.close(fig)
|
||||||
|
|
||||||
|
|
||||||
def fig_connector_substrate_attribution():
    """Decomposes overhead into v1-framework cost (shared by all connectors,
    proxied by NIXL since it's the leanest) and Mooncake-specific cost."""
    comp = _load("b3_policy_comparison.json")
    by = {r["policy"]: r for r in comp["rows"]}
    metrics = [
        ("TTFT p90 (s)", "ttft_p90_s", False),
        ("TPOT p90 (ms)", "tpot_p90_s", True),
        ("E2E p90 (s)", "e2e_p90_s", False),
        ("hotspot index", "hotspot_index_ttft_p90", False),
    ]
    fig, axes = plt.subplots(1, 4, figsize=(15, 4))
    for ax, (label, key, scale_ms) in zip(axes, metrics):
        plain = by["unified"][key] * (1000 if scale_ms else 1)
        nixl = by["unified_nixl_both"][key] * (1000 if scale_ms else 1)
        moon = by["unified_kv_both"][key] * (1000 if scale_ms else 1)
        v2 = by["unified_v2"][key] * (1000 if scale_ms else 1)

        framework_cost = nixl - plain  # what NIXL adds = v1 framework cost
        mooncake_extra = moon - nixl  # extra on top from Mooncake
        v2_branch_extra = v2 - moon  # extra from PD-sep branch (Mooncake + 5 events)

        bottom = 0
        ax.bar(["overhead"], [plain], color="#cccccc",
               edgecolor="black", linewidth=0.4,
               label=f"plain unified ({plain:.2f})")
        bottom += plain
        ax.bar(["overhead"], [framework_cost], bottom=[bottom],
               color="#1f77b4", edgecolor="black", linewidth=0.4,
               label=f"v1 framework (+{framework_cost:.2f})")
        bottom += framework_cost
        ax.bar(["overhead"], [mooncake_extra], bottom=[bottom],
               color="#9467bd", edgecolor="black", linewidth=0.4,
               label=f"Mooncake extra (+{mooncake_extra:.2f})")
        bottom += mooncake_extra
        ax.bar(["overhead"], [v2_branch_extra], bottom=[bottom],
               color="#d62728", edgecolor="black", linewidth=0.4,
               label=f"v2 PD-sep branch ({v2_branch_extra:+.2f})")
        ax.set_title(label)
        ax.legend(fontsize=8, loc="upper right")
        ax.grid(alpha=0.3, axis="y")
        ax.tick_params(axis="x", labelbottom=False)
    fig.suptitle(
        "Attribution: plain unified vs NIXL substrate vs Mooncake substrate vs v2.\n"
        "Blue: cost shared by any v1 connector. Purple: cost specific to Mooncake."
    )
    fig.tight_layout()
    fig.savefig(OUT / "fig_connector_substrate_attribution.png", dpi=120)
    plt.close(fig)
|
|
||||||
|
|
||||||
def main():
|
def main():
|
||||||
fig_kv_both_overhead()
|
fig_kv_both_overhead()
|
||||||
fig_v2_trigger_funnel()
|
fig_v2_trigger_funnel()
|
||||||
fig_v2_predicted_vs_actual()
|
fig_v2_predicted_vs_actual()
|
||||||
fig_three_way_hotspot()
|
fig_three_way_hotspot()
|
||||||
print(f"wrote 4 figures to {OUT}")
|
fig_connector_substrate_attribution()
|
||||||
|
print(f"wrote 5 figures to {OUT}")
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
|
|||||||