Elastic migration v2 section: PD-sep on agentic workload is net negative

New analysis/characterization/elastic_migration_v2/ packages the
unified_v2 + unified_kv_both experiments into a self-contained
results section that the paper can cite as the "we tried selective
PD-sep migration" case study. The section finds three independent
reasons PD-sep doesn't help on agentic w600:

1. Mooncake kv_both substrate alone (no PD-sep ever firing) imposes
   TTFT p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain
   unified. Per-step KVConnectorMetadata maintenance and block
   reservation semantics dominate even when no transfer is pending.
2. PD-sep gate fires only 0.16-0.41% of requests across two
   gate-tightness configurations. 88-76% are killed by
   new_local < threshold because 93% intra-session reuse on agentic
   traces leaves a small uncached tail; 19% are killed by
   chosen_no_active_decode (snapshot-time gate). Even relaxed
   thresholds can't grow trigger rate past 0.5%.
3. When PD-sep fires, the calibrated cost model
   (0.3s + bytes / 2.7 GB/s) is wrong by 10-20x. 5 triggered
   requests in v2.1 saw realized TTFT 12-45s vs model-predicted
   migrate cost 0.7-2.2s, consistent with the E2 audit's finding
   that D-side block pre-reservation and missing layerwise
   pipelining dominate the decode_sent -> first_token clock.

Three-way comparison (unified vs unified_kv_both vs unified_v2):
v2 vs the kv_both control is roughly net-zero (-10% hotspot,
-14% TPOT p90, +3% TTFT p90, +9% TTFT p99). v2 vs plain unified is
worse by 25% on E2E p90 and 49% on TTFT p90 because the
kv_both substrate tax is unavoidable when the policy is enabled.

Contents:
- README.md: the four results sections, the three-way comparison
  table, an explicit "what this claims for the paper" list, and a
  cross-reference index to the earlier characterization documents.
- data/: b3_policy_comparison.json + per-policy breakdown.json
  + per-policy hotspot_index.json for the four policies in scope.
- figures/: 4 PNGs rendered by render_figures.py:
  * fig_kv_both_overhead.png   — 4-metric bar chart with delta
    annotations showing kv_both alone costs +45% TTFT p90.
  * fig_v2_trigger_funnel.png  — per-reason request count for the
    two gate configurations on log scale.
  * fig_v2_predicted_vs_actual.png  — scatter of model-predicted
    migrate cost vs realized TTFT for the 5 triggered requests,
    with y=x, 10x, and 20x reference lines.
  * fig_three_way_hotspot.png  — per-worker TTFT p90 grouped bars
    across the three policies.

The section is intentionally self-contained: it lists what the
experiment validates (cost model picks correct candidates;
shadow-drift fix is necessary; same-worker interference is real)
alongside what it disproves (per-request PD-sep on agentic via
Mooncake is not a net win in current implementation).

Refs: E1/E2 subagent audits, B2 microbench, unified_v2 commits
19f69a9 / 4b833d3 / 95c8ef8.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 13:28:37 +08:00
parent 95c8ef853c
commit d76eb02637
15 changed files with 839 additions and 0 deletions

@@ -0,0 +1,284 @@
# Elastic Migration v2: Selective PD-Separation via Mooncake
Date: 2026-05-26
Trace: `traces/w600_r0.0015_st30.jsonl` (1214 reqs, 274 sessions, 53.3 M tokens)
Model: Qwen3-Coder-30B-A3B-Instruct, 8 × TP1 on H20
## TL;DR
This section explores whether the **B2-confirmed same-worker
prefill-decode interference** can be relieved by selectively
migrating prefill to a different worker for the requests where the
interference cost would dominate the transfer cost. We implement two
flavors of the policy (strict gates, then relaxed gates) and a clean
isolation control (`unified_kv_both`: same picker as `unified`, but
the vLLMs are launched in `kv_role=kv_both` so the Mooncake
substrate is on but never triggers).
Three findings:
1. **`kv_role=kv_both` alone imposes a heavy always-on tax**: TTFT
p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain `unified`,
with no PD-sep ever firing.
2. **PD-sep almost never triggers on a real agentic workload**:
0.16% with strict gates, 0.41% with relaxed gates. Agentic
workloads have 93% intra-session reuse, so most requests land on
workers that already hold cache — the uncached tail is too small
to be worth migrating.
3. **When PD-sep does fire, the cost model is wrong by ~10-20×**:
the calibrated `0.3s + bytes / 2.7 GB/s` predicts 1-2 s migrate
cost; observed TTFT on four of the five triggered requests is
12-45 s. The same
D-side block-reservation pressure and absence of layerwise
pipelining that the E2 audit flagged still dominate.
The net latency of `unified_v2` is **not better than plain
`unified`**. Improving agentic PD-sep requires fixing the underlying
Mooncake transfer mechanism (E2 patches 6.1 lazy block reservation
and 6.3 layerwise pipelining), not the routing decision.
## Substrate
We compare three policies on identical traces:
| policy | picker | vLLM launch mode | what's it for |
|---|---|---|---|
| `unified` | hybrid affinity + LMetric | plain (no Mooncake) | the headline baseline |
| `unified_kv_both` | same as `unified` | `kv_role=kv_both` + bootstrap | isolation control: how much does kv_both *alone* cost? |
| `unified_v2` | unified + selective PD-sep | `kv_role=kv_both` + bootstrap | the actual experiment |
All three use the same trace, the same 8-instance topology, the same
shadow-drift-corrected proxy (`scripts/cache_aware_proxy.py` post-fix
`95c8ef8`). Plain `unified` was rerun on the patched proxy
(`b3_sweep_20260525_095043/unified`) under the same conditions.
## Result 1 — kv_both is expensive by itself
![](figures/fig_kv_both_overhead.png)
Switching the vLLM launch from plain to `kv_role=kv_both` without
ever triggering PD-sep already costs:
| metric | plain `unified` | `unified_kv_both` | Δ |
|---|---:|---:|---|
| TTFT p50 | 0.50 s | 0.50 s | +0% |
| TTFT p90 | 7.35 s | 10.67 s | **+45%** |
| TTFT p99 | 42.34 s | 45.19 s | +7% |
| TPOT p90 | 17.1 ms | 21.3 ms | **+25%** |
| E2E p90 | 18.03 s | 22.89 s | **+27%** |
| APC | 79.4% | 78.3% | -1.1 pp |
| hotspot index | 3.667 | **4.363** | **+19%** |
Two contributing factors:
1. **The Mooncake `MooncakeConnector` runs even when no transfer is
pending.** Every scheduler step it walks `set(cache.keys())`
against `_known_hash_keys` (E2 audit §6.5) and updates the
`KVConnectorMetadata`. This is O(|cache|) per step on every
engine, even when no producer/consumer relationship is active.
2. **Block reservation semantics differ** under kv_both. The
scheduler treats blocks as candidates for export-to-others, so
the prefix cache LRU pressure is slightly different (we lose 1
pp APC).
Practical implication: **you don't enable kv_both for free**. If a
deployment wants the option to do PD-sep selectively, the 45% TTFT
p90 tax applies even on requests that stay local. This needs to
be a recoverable cost before any selective-PD-sep policy is worth
shipping.
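
The Δ column above can be re-derived from the raw rows in `data/b3_policy_comparison.json`. The following is a minimal sketch; the dictionary names and the `pct_delta` helper are illustrative and not part of the repository, and the numbers are copied from the JSON rows for `unified` and `unified_kv_both`.

```python
# Headline values copied from data/b3_policy_comparison.json.
UNIFIED = {
    "ttft_p90_s": 7.345769894809922,
    "ttft_p99_s": 42.34170345296613,
    "tpot_p90_s": 0.017110194704198407,
    "e2e_p90_s": 18.033410895219994,
    "hotspot_index_ttft_p90": 3.667136528736114,
}
KV_BOTH = {
    "ttft_p90_s": 10.671844050800438,
    "ttft_p99_s": 45.19353310586651,
    "tpot_p90_s": 0.021303916384344358,
    "e2e_p90_s": 22.8941433175176,
    "hotspot_index_ttft_p90": 4.363145984888287,
}


def pct_delta(new: float, base: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


# Percent change of each metric when kv_both is switched on.
deltas = {key: pct_delta(KV_BOTH[key], UNIFIED[key]) for key in UNIFIED}

for key, value in deltas.items():
    print(f"{key:>24s}  {value:+.1f}%")
```

Rounded to whole percent, these are the +45% / +7% / +25% / +27% / +19% figures in the table.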
## Result 2 — PD-sep rarely fires on a real agentic trace
![](figures/fig_v2_trigger_funnel.png)
We log every routing decision's `v2_reason` (why we did or did not
PD-sep). Two runs with different gate thresholds:
| fall-through bucket | v2.0 strict | v2.1 relaxed | what it means |
|---|---:|---:|---|
| `new_local < threshold` | 1077 (88.7%) | 924 (76.1%) | uncached tail too small to justify transfer |
| `chosen_no_active_decode` | 115 (9.5%) | 229 (18.9%) | no decode on chosen to protect |
| `src_cache_below_threshold` | 14 (1.2%) | 36 (3.0%) | no alt instance holds enough cache |
| `src_not_meaningfully_more_cache` | 6 (0.5%) | 16 (1.3%) | alt instance doesn't help vs chosen |
| `cost_benefit not enough margin` | 0 | 4 (0.3%) | model says transfer cost + interference on src ≥ local interference |
| **PD-sep TRIGGERED** | **2 (0.16%)** | **5 (0.41%)** | passed all gates and cost-benefit favored migrate |
The dominant filter is `new_local < threshold`. Even with the
threshold dropped from 16 k to 8 k tokens, three out of four requests
have less than 8 k uncached tokens at the chosen worker. This is
structural: with intra-session reuse measured at 93% on the same
trace (window_1_results.md), most turns hit prefix cache on the
session's previous worker.
The second filter, `chosen_no_active_decode`, kills another fifth.
This is a snapshot-time phenomenon: at the moment the picker runs,
the chosen worker often has its previous request still in prefill,
not yet decoding. The gate's intent ("don't migrate if no decode is
being hurt by the prefill we're routing") is correct, but it ends up
suppressing PD-sep for a real situation where decode is *about to*
start.
Even after these two filters, the cost-benefit step itself rejects
nearly half of remaining candidates (4 out of 9 in relaxed). So the
final trigger rate of 0.41% is a structural property, not a
parameter-tuning problem.
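
To make the funnel concrete, here is a hedged sketch of a gate cascade that yields the bucket names in the table. It is not the proxy's actual code. The `Candidate` fields, the gate order, and every threshold except the relaxed 8 k `new_local` cut-off (taken here as 8000 tokens) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    new_local: int              # uncached tokens at the chosen worker
    chosen_active_decodes: int  # decodes running on the chosen worker
    src_cached: int             # prefix tokens cached on the best alternative
    chosen_cached: int          # prefix tokens cached on the chosen worker
    cost_local_s: float         # modeled cost of prefilling locally
    cost_migrate_s: float       # modeled cost of the PD-sep path


def v2_reason(c: Candidate, *, min_new_local: int = 8_000,
              min_src_cache: int = 1_024, min_cache_gain: float = 1.5,
              margin_s: float = 0.0) -> str:
    """Return the first gate that rejects the request, or the trigger label.

    All defaults other than min_new_local are hypothetical.
    """
    if c.new_local < min_new_local:
        return "new_local_below_threshold"
    if c.chosen_active_decodes == 0:
        return "chosen_no_active_decode"
    if c.src_cached < min_src_cache:
        return "src_cache_below_threshold"
    if c.src_cached < min_cache_gain * c.chosen_cached:
        return "src_not_meaningfully_more_cache"
    if c.cost_migrate_s + margin_s >= c.cost_local_s:
        return "cost_benefit not enough margin"
    return "PD-sep TRIGGERED"


print(v2_reason(Candidate(2_000, 3, 9_000, 0, 1.0, 0.5)))
print(v2_reason(Candidate(8_706, 2, 6_656, 0, 1.09, 0.73)))
```

Because the gates are evaluated in sequence, a request is counted only in the first bucket that rejects it, which is why the six rows of the table sum to 1214 in each run.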
## Result 3 — when PD-sep fires, the cost model is wrong by 10-20×
![](figures/fig_v2_predicted_vs_actual.png)
The 5 PD-sep-triggered requests in v2.1 relaxed:
| input | new_local | new_src | src→dst | cost_local | cost_migrate (model) | actual TTFT | actual E2E |
|---:|---:|---:|---|---:|---:|---:|---:|
| 21963 | 21963 | 9163 | 6→5 | 4.39 s | 4.17 s | 3.69 s | 8.48 s |
| 8706 | 8706 | 2050 | 5→7 | 1.09 s | 0.73 s | 12.48 s | 14.31 s |
| 13616 | 13616 | 2352 | 4→0 | 1.70 s | 1.03 s | 18.33 s | 19.50 s |
| 49483 | 49483 | 843 | 3→4 | 11.75 s | 2.16 s | **45.13 s** | **53.55 s** |
| 19806 | 19806 | 350 | 3→6 | 3.96 s | 1.06 s | 20.06 s | 31.98 s |
For four of the five requests the cost model predicts a migrate cost
of 0.7-2.2 s, while the actual TTFT is 12-45 s (the 21963-token
request is the exception: 4.17 s predicted, 3.69 s observed). The
model's `0.3 s +
bytes / 2.7 GB/s` calibration captures pure RDMA bandwidth in
isolation but misses everything else that happens on the
`decode_sent → first_token` clock: D-side scheduler step latency,
block reservation before KV arrives (so D's cache pressure
increases for the entire wait), the per-layer scatter of
`batch_transfer_sync_write`, and the next-step scheduler promotion
after `finished_recving`. The E2 audit measured this end-to-end at
p50 = 1.1 s and **p90 = 6.7 s** on production runs; the v2.1
triggered requests landed in the p99 tail of that distribution
because their dst was already loaded.
The first-token clock for the 49 k request is **21× the model's
prediction**. This is not a small mis-tuning — it's a structurally
different model.
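
The size of the gap can be checked from the table. The sketch below encodes the transfer term `0.3 s + bytes / 2.7 GB/s` (assuming decimal gigabytes) and computes the actual-to-predicted ratio per triggered request. The function and variable names are illustrative. The `cost_migrate` column is taken as given, since it may include terms beyond the pure transfer.

```python
GB = 1e9  # assumption: the 2.7 GB/s calibration uses decimal gigabytes


def transfer_cost_s(nbytes: float, setup_s: float = 0.3,
                    bandwidth_bytes_per_s: float = 2.7 * GB) -> float:
    """Calibrated transfer term of the cost model: fixed setup plus bytes / bandwidth."""
    return setup_s + nbytes / bandwidth_bytes_per_s


# (input tokens, model-predicted cost_migrate in s, actual TTFT in s),
# copied from the table of v2.1 triggered requests.
TRIGGERED = [
    (21963, 4.17, 3.69),
    (8706, 0.73, 12.48),
    (13616, 1.03, 18.33),
    (49483, 2.16, 45.13),
    (19806, 1.06, 20.06),
]

# How many times slower the realized TTFT was than the model's prediction.
ratios = {tokens: actual / predicted for tokens, predicted, actual in TRIGGERED}

for tokens, ratio in ratios.items():
    print(f"input={tokens:>6d}  actual/predicted = {ratio:5.1f}x")
```

Four requests land between roughly 17× and 21×; the 21963-token request is the only one at or below the prediction.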
## Result 4 — three-way comparison
![](figures/fig_three_way_hotspot.png)
The full table:
| metric | unified (plain) | unified_kv_both | unified_v2 (relaxed) |
|---|---:|---:|---:|
| n_ok | 1213 | 1214 | 1214 |
| TTFT p50 | 0.50 s | 0.50 s | 0.49 s |
| TTFT p90 | 7.35 s | 10.67 s | 10.98 s |
| TTFT p99 | 42.34 s | 45.19 s | 49.45 s |
| TPOT p90 | 17.1 ms | 21.3 ms | 18.4 ms |
| E2E p90 | 18.03 s | 22.89 s | 22.53 s |
| APC | 79.4% | 78.3% | 77.6% |
| interference index | n/a (no engine_state) | 8.57 | 8.46 |
| hotspot index | 3.667 | 4.363 | 3.910 |
| n_slow | 189 | 198 | 198 |
### v2 vs the kv_both control (the right comparison)
Compared to the kv_both control — same substrate, no PD-sep — the
5 PD-sep triggers in v2:
- **slightly improve TPOT p90 (-14%) and hotspot (-10%)**
- **slightly worsen TTFT p90 (+3%) and TTFT p99 (+9%)**, because the
triggered requests themselves take ~20× the predicted transfer
time
The net effect against the kv_both control is in the noise. The
hotspot improvement is within the run-to-run stochastic range we saw
earlier (v2 strict run scored 2.733 hotspot under the same
substrate; v2 relaxed scored 3.910).
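
The data files do not define the hotspot index, but the published values are consistent with the worst worker's TTFT p90 divided by the median worker's TTFT p90. The sketch below assumes that definition; `hotspot_index` is an illustrative name, and the inputs are copied from `per_worker_ttft_p90_s` in the `unified` and `unified_kv_both` per-worker files.

```python
from statistics import median

# Per-worker TTFT p90 in seconds, workers :8000 through :8007.
UNIFIED = [
    11.264844838529825, 3.6063860427122614, 16.175747957825664,
    9.314684258581842, 37.73397144810297, 18.328030522551852,
    3.6328767628350773, 7.772977900883419,
]
KV_BOTH = [
    2.725343641615472, 30.449911632167645, 16.297463109577073,
    6.766894554614579, 11.146178993489595, 4.552643961587455,
    6.90922680192164, 7.048551249800954,
]


def hotspot_index(per_worker_p90: list[float]) -> float:
    """Assumed definition: max over workers divided by the median over workers."""
    return max(per_worker_p90) / median(per_worker_p90)


for name, values in (("unified", UNIFIED), ("unified_kv_both", KV_BOTH)):
    print(f"{name:>16s}  peak={max(values):5.1f}s  "
          f"median={median(values):5.2f}s  hotspot={hotspot_index(values):.3f}")
```

Under this definition the index is sensitive to the median as well as the peak: the `unified_kv_both` peak (30.4 s) is lower than the `unified` peak (37.7 s), yet its index is higher because the median worker drops from about 10.3 s to about 7.0 s. That sensitivity is one reason to treat single-run hotspot differences with caution.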
### v2 vs plain unified (the headline question)
`unified_v2` is **25% slower on E2E p90** and **49% slower on TTFT
p90** than plain `unified`. The 45 pp of TTFT p90 inflation is from
kv_both substrate, not the routing decision; nothing PD-sep does can
recover this in our current Mooncake implementation.
## Why v2's PD-sep is fundamentally choked
There are three independent structural problems, each by itself
enough to make v2 not win:
1. **The kv_both substrate is the wrong default**. It pays a 45%
TTFT p90 tax on every request. To make selective PD-sep beat
plain `unified`, the saved interference per triggered request
times the trigger rate must exceed 45% × average TTFT, on
average. With 0.41% trigger rate, even saving 100% of TTFT per
triggered request would only save ~0.4%, which can't recover 45%.
2. **Agentic intra-session reuse leaves no headroom for migration**.
Most turns hit cache on the worker that handled the previous
turn. Migrating prefill to a *different* worker is the *exact*
thing intra-session affinity tries to avoid: it forces the new
worker to pay for the cached prefix transfer instead of just
reusing what's already on the affinity worker. This is a
structural mismatch between PD-sep semantics ("send big prefills
to a less-busy worker") and agentic workloads ("keep sessions
sticky to wherever the cache is").
3. **The Mooncake mechanism is 10-20× slower than the cost model
predicts**, primarily due to D-side pre-allocation of KV blocks
and the absence of layerwise pipelining (E2 audit §6.1 / §6.3).
The cost model can be re-calibrated, but doing so would push the
gate even tighter, dropping the already-tiny trigger rate to
nearly zero.
The three are stacked: even if any two were fixed, the remaining
one would still make PD-sep a net loss on this trace.
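
The break-even argument in item 1 is simple arithmetic. The sketch below works it through under the same simplification the text makes, treating the 45% p90 tax as if it applied to average TTFT. The function names are illustrative.

```python
def max_recoverable_fraction(trigger_rate: float,
                             saving_per_trigger: float = 1.0) -> float:
    """Upper bound on the aggregate TTFT fraction PD-sep can win back.

    saving_per_trigger is the fraction of a triggered request's TTFT saved;
    1.0 means the triggered request becomes free.
    """
    return trigger_rate * saving_per_trigger


def required_saving_per_trigger(substrate_tax: float, trigger_rate: float) -> float:
    """Saving per triggered request (in units of average TTFT) needed to
    offset a tax that is paid by every request."""
    return substrate_tax / trigger_rate


trigger_rate = 5 / 1214  # v2.1 relaxed: 5 triggers out of 1214 requests
substrate_tax = 0.45     # kv_both TTFT p90 inflation vs plain unified

print(f"max recoverable: {max_recoverable_fraction(trigger_rate):.2%}")
print(f"required saving per trigger: "
      f"{required_saving_per_trigger(substrate_tax, trigger_rate):.0f}x average TTFT")
```

Each triggered request would have to save on the order of a hundred times the average TTFT to pay for the substrate, which is why the trigger rate alone rules out a win at the current tax.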
## What this section claims for the paper
1. **Same-worker prefill-decode interference is a real mechanism**
(B2 microbench), but **agentic workloads rarely expose it**: the
typical request has high cache hit and small uncached tail, so
the interference cost is bounded.
2. **Routing-only solutions (unified) already capture 79% of the
intra-session APC ceiling and recover the latency** by avoiding
the heavy-tail sessions through the affinity gate. The remaining
23 pp gap to the ceiling is from APC LRU eviction under capacity
pressure, not from prefill-decode interference.
3. **Per-request PD-sep via Mooncake on agentic workloads is not a
net win** in our measurements, even with a carefully-gated cost
model. The combined effect of kv_both substrate overhead, low
trigger rate, and mechanism-vs-model gap is uniformly negative.
4. **A productive direction is mechanism-level**: fix the Mooncake
D-side block reservation (E2 §6.1), implement layerwise transfer
pipelining (E2 §6.3), and re-measure. Only if these patches drop
the substrate tax to <10% and the realized transfer to 2 s p90
does PD-sep become competitive with routing on agentic traces.
## What v2 still validates
- **The cost model's *qualitative* shape is correct**: when it says
"migrate", that's a request where local interference *would have*
been 4 s and src has 80% prefix cache. The model picks the
right candidate requests.
- **The gate logic catches the right exclusions**: 89% (strict) /
76% (relaxed) by uncached tail size, 10% / 19% by
no-decode-to-protect, the rest by missing source cache. Each is a
structurally correct reason.
- **The proxy shadow-drift fix is necessary infrastructure** for
any long-running routing experiment. We observed 3 phantom
corrections per ~50-minute run.
## Files
- `data/b3_policy_comparison.json`: headline metrics from the same
B3 sweep root (eight policy rows, including the four in scope here).
- `data/breakdown_<policy>.json`: per-request proxy breakdown
including v2 gate fields and triggered-event metadata.
- `data/per_worker_<policy>.json`: per-worker TTFT/latency p90s
used in the hotspot figure.
- `figures/*.png`: the four section figures referenced above.
- `render_figures.py`: regenerates the figures from data/.
## Cross-references
- `analysis/characterization/window_1_results.md`: B2 microbench
(same-worker interference causal proof) and B3 baseline 5-policy
sweep
- `analysis/characterization/agentic_dispatch_coupling.md`: why
the saturated-replay setup matches agentic production
- `analysis/characterization/b3_policies_pseudocode.md`: pickers
for the five baseline policies; `unified_v2` extends `unified`
- E1 / E2 subagent reports (commit `4b833d3` message and the
conversation log): full mechanism audit that informed v2's design

@@ -0,0 +1,211 @@
{
"rows": [
{
"policy": "capped",
"n_ok": 770,
"n_total": 770,
"ttft_p50_s": 1.1989156164927408,
"ttft_p90_s": 12.827629912580612,
"ttft_p99_s": 46.61752380923125,
"tpot_p50_s": 0.007231239004497606,
"tpot_p90_s": 0.015998617687440243,
"tpot_p99_s": 0.11515370831539476,
"e2e_p50_s": 2.598489043477457,
"e2e_p90_s": 21.245602010778384,
"e2e_p99_s": 74.60736650204846,
"apc_ratio": 0.3158312503528108,
"interference_index": 6.331064378362814,
"hotspot_index_ttft_p90": 2.0204268015410918,
"reuse_intra_frac": 0.9192657105586233,
"reuse_cross_frac": 0.0602232594931501,
"n_slow": 185,
"failure_counts": {
"cache_miss_large_append": 60,
"hot_worker_queue": 66,
"same_worker_prefill_overlap": 45,
"unknown": 14
}
},
{
"policy": "lmetric",
"n_ok": 1214,
"n_total": 1214,
"ttft_p50_s": 0.9387824369769078,
"ttft_p90_s": 15.671339168207492,
"ttft_p99_s": 53.56683189840049,
"tpot_p50_s": 0.008854518407308914,
"tpot_p90_s": 0.02122720699121469,
"tpot_p99_s": 0.18280341184277568,
"e2e_p50_s": 2.754255389008904,
"e2e_p90_s": 24.8209177934099,
"e2e_p99_s": 80.59924928059091,
"apc_ratio": 0.5694312382571595,
"interference_index": 6.530231061794441,
"hotspot_index_ttft_p90": 2.252837147833725,
"reuse_intra_frac": 0.9321238805590836,
"reuse_cross_frac": 0.05679481258506571,
"n_slow": 295,
"failure_counts": {
"cache_miss_large_append": 94,
"hot_worker_queue": 68,
"same_worker_prefill_overlap": 69,
"unknown": 64
}
},
{
"policy": "load_only",
"n_ok": 1214,
"n_total": 1214,
"ttft_p50_s": 1.2609447415161412,
"ttft_p90_s": 20.197147866390882,
"ttft_p99_s": 52.84285237012196,
"tpot_p50_s": 0.009231464695980247,
"tpot_p90_s": 0.026851662550158716,
"tpot_p99_s": 0.3211630676943426,
"e2e_p50_s": 3.58568156149704,
"e2e_p90_s": 33.459180271782685,
"e2e_p99_s": 93.95083751494239,
"apc_ratio": 0.5412093853102866,
"interference_index": 9.16424627504275,
"hotspot_index_ttft_p90": 1.2940319990630569,
"reuse_intra_frac": 0.9353191550754928,
"reuse_cross_frac": 0.053372184678592026,
"n_slow": 379,
"failure_counts": {
"cache_miss_large_append": 151,
"hot_worker_queue": 33,
"same_worker_prefill_overlap": 108,
"unknown": 87
}
},
{
"policy": "sticky",
"n_ok": 1214,
"n_total": 1214,
"ttft_p50_s": 0.5415176274836995,
"ttft_p90_s": 18.021296651283045,
"ttft_p99_s": 74.09429564891524,
"tpot_p50_s": 0.008952101894096181,
"tpot_p90_s": 0.03641285916619554,
"tpot_p99_s": 0.35152006935195085,
"e2e_p50_s": 2.081947358994512,
"e2e_p90_s": 34.62592205510591,
"e2e_p99_s": 139.68334607904353,
"apc_ratio": 0.7720092868396378,
"interference_index": 13.651718321568111,
"hotspot_index_ttft_p90": 2.727756623171119,
"reuse_intra_frac": 0.9327723488279339,
"reuse_cross_frac": 0.05495149683864246,
"n_slow": 234,
"failure_counts": {
"cache_miss_large_append": 20,
"hot_worker_queue": 51,
"same_worker_prefill_overlap": 134,
"unknown": 29
}
},
{
"policy": "unified",
"n_ok": 1213,
"n_total": 1214,
"ttft_p50_s": 0.4997710260213353,
"ttft_p90_s": 7.345769894809922,
"ttft_p99_s": 42.34170345296613,
"tpot_p50_s": 0.008079791456705824,
"tpot_p90_s": 0.017110194704198407,
"tpot_p99_s": 0.12655874612209597,
"e2e_p50_s": 1.7495028690318577,
"e2e_p90_s": 18.033410895219994,
"e2e_p99_s": 68.80023987947489,
"apc_ratio": 0.794261466256467,
"interference_index": null,
"hotspot_index_ttft_p90": 3.667136528736114,
"reuse_intra_frac": 0.9311187350942534,
"reuse_cross_frac": 0.056702150437367635,
"n_slow": 189,
"failure_counts": {
"cache_miss_large_append": 18,
"hot_worker_queue": 116,
"unknown": 55
}
},
{
"policy": "unified_kv_both",
"n_ok": 1214,
"n_total": 1214,
"ttft_p50_s": 0.4958424885116983,
"ttft_p90_s": 10.671844050800438,
"ttft_p99_s": 45.19353310586651,
"tpot_p50_s": 0.008573156389059812,
"tpot_p90_s": 0.021303916384344358,
"tpot_p99_s": 0.21501837408937963,
"e2e_p50_s": 1.9310281965008471,
"e2e_p90_s": 22.8941433175176,
"e2e_p99_s": 76.06128971517893,
"apc_ratio": 0.7828397082703908,
"interference_index": 8.571603637346875,
"hotspot_index_ttft_p90": 4.363145984888287,
"reuse_intra_frac": 0.9313000825240145,
"reuse_cross_frac": 0.056182260858791105,
"n_slow": 198,
"failure_counts": {
"cache_miss_large_append": 28,
"hot_worker_queue": 34,
"same_worker_prefill_overlap": 87,
"unknown": 49
}
},
{
"policy": "unified_v2",
"n_ok": 1214,
"n_total": 1214,
"ttft_p50_s": 0.4851180294645019,
"ttft_p90_s": 10.97665627548705,
"ttft_p99_s": 49.44861259821856,
"tpot_p50_s": 0.008261419251554481,
"tpot_p90_s": 0.018414033703249108,
"tpot_p99_s": 0.20999689490980364,
"e2e_p50_s": 1.8092182099935599,
"e2e_p90_s": 22.528888442111203,
"e2e_p99_s": 82.40234094743934,
"apc_ratio": 0.7758437361549086,
"interference_index": 8.45656745230457,
"hotspot_index_ttft_p90": 3.9096187869766164,
"reuse_intra_frac": 0.9324663389938368,
"reuse_cross_frac": 0.055154184817413764,
"n_slow": 198,
"failure_counts": {
"cache_miss_large_append": 36,
"hot_worker_queue": 26,
"same_worker_prefill_overlap": 82,
"unknown": 54
}
},
{
"policy": "unified_v2_strict",
"n_ok": 1214,
"n_total": 1214,
"ttft_p50_s": 0.4849805940175429,
"ttft_p90_s": 8.960840504511737,
"ttft_p99_s": 44.63598358390898,
"tpot_p50_s": 0.008222105788569446,
"tpot_p90_s": 0.018078321745916927,
"tpot_p99_s": 0.14616439095890604,
"e2e_p50_s": 1.8335122870048508,
"e2e_p90_s": 22.435233922180526,
"e2e_p99_s": 68.254801789901,
"apc_ratio": 0.789281361129855,
"interference_index": 6.231677388887276,
"hotspot_index_ttft_p90": 2.7334230011629197,
"reuse_intra_frac": 0.9309082618411778,
"reuse_cross_frac": 0.05689887985860397,
"n_slow": 186,
"failure_counts": {
"cache_miss_large_append": 26,
"hot_worker_queue": 44,
"same_worker_prefill_overlap": 73,
"unknown": 43
}
}
]
}

File diff suppressed because one or more lines are too long (four files).

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 3.667136528736114,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 41.42001512600109,
"http://127.0.0.1:8001": 12.4878579101933,
"http://127.0.0.1:8002": 22.462878945574648,
"http://127.0.0.1:8003": 15.501050900109117,
"http://127.0.0.1:8004": 39.956250199786155,
"http://127.0.0.1:8005": 36.69850301651168,
"http://127.0.0.1:8006": 10.116177947795954,
"http://127.0.0.1:8007": 20.35038618039107
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 11.264844838529825,
"http://127.0.0.1:8001": 3.6063860427122614,
"http://127.0.0.1:8002": 16.175747957825664,
"http://127.0.0.1:8003": 9.314684258581842,
"http://127.0.0.1:8004": 37.73397144810297,
"http://127.0.0.1:8005": 18.328030522551852,
"http://127.0.0.1:8006": 3.6328767628350773,
"http://127.0.0.1:8007": 7.772977900883419
},
"status": "supported"
}

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 4.363145984888287,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 7.273825440008658,
"http://127.0.0.1:8001": 40.48809068736155,
"http://127.0.0.1:8002": 24.491076068370596,
"http://127.0.0.1:8003": 18.828550089401002,
"http://127.0.0.1:8004": 20.06954986089262,
"http://127.0.0.1:8005": 9.634067087399307,
"http://127.0.0.1:8006": 35.7432237003348,
"http://127.0.0.1:8007": 24.362499430915342
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 2.725343641615472,
"http://127.0.0.1:8001": 30.449911632167645,
"http://127.0.0.1:8002": 16.297463109577073,
"http://127.0.0.1:8003": 6.766894554614579,
"http://127.0.0.1:8004": 11.146178993489595,
"http://127.0.0.1:8005": 4.552643961587455,
"http://127.0.0.1:8006": 6.90922680192164,
"http://127.0.0.1:8007": 7.048551249800954
},
"status": "supported"
}

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 3.9096187869766164,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 27.12522437740119,
"http://127.0.0.1:8001": 15.299228341400166,
"http://127.0.0.1:8002": 49.346961313998335,
"http://127.0.0.1:8003": 22.404519376007386,
"http://127.0.0.1:8004": 22.470557069155618,
"http://127.0.0.1:8005": 17.487964828591807,
"http://127.0.0.1:8006": 21.76291022058577,
"http://127.0.0.1:8007": 18.311422476416926
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 9.26557928660186,
"http://127.0.0.1:8001": 5.734943528624719,
"http://127.0.0.1:8002": 38.812515752378395,
"http://127.0.0.1:8003": 10.589305737824198,
"http://127.0.0.1:8004": 10.83847834250191,
"http://127.0.0.1:8005": 5.034968857781501,
"http://127.0.0.1:8006": 3.5207203380181493,
"http://127.0.0.1:8007": 12.236044214287555
},
"status": "supported"
}

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 2.7334230011629197,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 11.098119341616997,
"http://127.0.0.1:8001": 23.1559918191866,
"http://127.0.0.1:8002": 22.57899510498975,
"http://127.0.0.1:8003": 9.956129518186204,
"http://127.0.0.1:8004": 28.072633931197924,
"http://127.0.0.1:8005": 47.2373243979877,
"http://127.0.0.1:8006": 23.23235769500608,
"http://127.0.0.1:8007": 27.031178803613876
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 3.1871710045961663,
"http://127.0.0.1:8001": 8.824780725361773,
"http://127.0.0.1:8002": 16.364250262192222,
"http://127.0.0.1:8003": 4.1765614019881445,
"http://127.0.0.1:8004": 14.026077619416176,
"http://127.0.0.1:8005": 24.662665293016516,
"http://127.0.0.1:8006": 9.220479947811697,
"http://127.0.0.1:8007": 8.441550621995741
},
"status": "supported"
}

Binary files not shown: four PNG figures (70 KiB, 54 KiB, 73 KiB, 97 KiB).

@@ -0,0 +1,244 @@
"""Render PNG figures for the elastic_migration_v2 section.

Inputs in ./data/ :
  - b3_policy_comparison.json
  - breakdown_unified.json, breakdown_unified_kv_both.json,
    breakdown_unified_v2.json, breakdown_unified_v2_strict.json
  - per_worker_<policy>.json for each of the four

Outputs in ./figures/ :
  - fig_kv_both_overhead.png: three-way latency bars (plain vs kv_both vs v2)
  - fig_v2_trigger_funnel.png: request count per fall-through reason
  - fig_v2_predicted_vs_actual.png: cost-model migrate prediction vs realized TTFT
  - fig_three_way_hotspot.png: per-worker TTFT p90 grouped bars
"""
from __future__ import annotations

import json
from collections import Counter
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # headless backend; select it before importing pyplot
import matplotlib.pyplot as plt

ROOT = Path(__file__).parent
DATA = ROOT / "data"
OUT = ROOT / "figures"
OUT.mkdir(parents=True, exist_ok=True)


def _load(name: str):
    """Parse one JSON file from ./data/."""
    return json.loads((DATA / name).read_text())


POLICY_COLORS = {
    "unified": "#2ca02c",
    "unified_kv_both": "#9467bd",
    "unified_v2": "#d62728",
    "unified_v2_strict": "#ff7f0e",
}


def fig_kv_both_overhead():
    comp = _load("b3_policy_comparison.json")
    by = {r["policy"]: r for r in comp["rows"]}
    pols = ["unified", "unified_kv_both", "unified_v2"]
    metrics = [
        ("TTFT p90 (s)", lambda r: r["ttft_p90_s"]),
        ("TPOT p90 (ms)", lambda r: r["tpot_p90_s"] * 1000),
        ("E2E p90 (s)", lambda r: r["e2e_p90_s"]),
        ("hotspot index", lambda r: r["hotspot_index_ttft_p90"]),
    ]
    fig, axes = plt.subplots(1, 4, figsize=(14, 4))
    for ax, (label, fn) in zip(axes, metrics):
        vals = [fn(by[p]) for p in pols]
        bars = ax.bar(pols, vals,
                      color=[POLICY_COLORS[p] for p in pols],
                      edgecolor="black", linewidth=0.5)
        ax.set_title(label)
        ax.tick_params(axis="x", rotation=20, labelsize=9)
        # value label on top of each bar
        for b, v in zip(bars, vals):
            ax.text(b.get_x() + b.get_width() / 2, v,
                    f"{v:.2f}" if v < 100 else f"{v:.0f}",
                    ha="center", va="bottom", fontsize=9)
        ax.grid(alpha=0.3, axis="y")
        # delta annotation relative to plain unified (first bar)
        baseline = vals[0]
        for i, v in enumerate(vals):
            if i == 0:
                continue
            pct = (v - baseline) / baseline * 100
            ax.text(i, v * 0.5, f"{pct:+.0f}%", ha="center",
                    fontsize=10, fontweight="bold",
                    color="darkred" if pct > 0 else "darkgreen")
    fig.suptitle(
        "kv_both adds ~45% to TTFT p90 even without PD-sep firing.\n"
        "v2's PD-sep barely recovers the gap (and overshoots TTFT p99)."
    )
    fig.tight_layout()
    fig.savefig(OUT / "fig_kv_both_overhead.png", dpi=120)
    plt.close(fig)


def _bucket_reasons(data):
    """Collapse v2_reason strings into the funnel buckets."""
    buckets = Counter()
    for r in data:
        if r.get("v2_pd_sep") is True:
            buckets["PD-sep TRIGGERED"] += 1
            continue
        # drop the parenthesized detail, keeping only the reason name
        reason = (r.get("v2_reason") or "no_v2_reason").split(" (")[0]
        if reason.startswith("local_cost"):
            reason = "cost_benefit not enough margin"
        buckets[reason] += 1
    return buckets


def fig_v2_trigger_funnel():
    strict = _load("breakdown_unified_v2_strict.json")
    relaxed = _load("breakdown_unified_v2.json")
    bs = _bucket_reasons(strict)
    br = _bucket_reasons(relaxed)
    order = [
        "new_local_below_threshold",
        "chosen_no_active_decode",
        "chosen_few_decodes",
        "src_cache_below_threshold",
        "src_not_meaningfully_more_cache",
        "cost_benefit not enough margin",
        "PD-sep TRIGGERED",
    ]
    # keep only buckets that occur in at least one run
    labels = [k for k in order if k in bs or k in br]
    strict_vals = [bs.get(k, 0) for k in labels]
    relaxed_vals = [br.get(k, 0) for k in labels]
    x = range(len(labels))
    width = 0.4
    fig, ax = plt.subplots(figsize=(11, 5))
    ax.bar([i - width / 2 for i in x], strict_vals, width,
           label=f"v2.0 strict (PD-sep={bs['PD-sep TRIGGERED']}/{sum(bs.values())} "
                 f"= {bs['PD-sep TRIGGERED']*100/sum(bs.values()):.2f}%)",
           color="#ff7f0e", edgecolor="black", linewidth=0.5)
    ax.bar([i + width / 2 for i in x], relaxed_vals, width,
           label=f"v2.1 relaxed (PD-sep={br['PD-sep TRIGGERED']}/{sum(br.values())} "
                 f"= {br['PD-sep TRIGGERED']*100/sum(br.values()):.2f}%)",
           color="#d62728", edgecolor="black", linewidth=0.5)
    ax.set_xticks(list(x))
    ax.set_xticklabels(labels, rotation=20, ha="right", fontsize=9)
    ax.set_ylabel("request count")
    ax.set_yscale("log")
    ax.set_title(
        "Why v2 rarely PD-seps: 88-76% of requests have new_local < threshold\n"
        "(intra-session cache already hot). Relaxing thresholds barely helps."
    )
    ax.legend()
    ax.grid(alpha=0.3, axis="y", which="both")
    # count labels above each non-empty bar
    for i, (s, r) in enumerate(zip(strict_vals, relaxed_vals)):
        if s > 0:
            ax.text(i - width / 2, s * 1.05, str(s), ha="center", fontsize=8)
        if r > 0:
            ax.text(i + width / 2, r * 1.05, str(r), ha="center", fontsize=8)
    fig.tight_layout()
    fig.savefig(OUT / "fig_v2_trigger_funnel.png", dpi=120)
    plt.close(fig)


def fig_v2_predicted_vs_actual():
    """For each PD-sep'd request, plot model-predicted migrate cost
    vs realized TTFT. Should sit near y=x if model is calibrated; sits
    far above if mechanism is more expensive than modeled."""
    relaxed = _load("breakdown_unified_v2.json")
    triggered = [r for r in relaxed if r.get("v2_pd_sep") is True]
    if not triggered:
        return
    predicted = []
    actual = []
    sizes = []
    rids = []
    for r in triggered:
        cm = r.get("v2_cost_migrate_s")
        t0 = r.get("t_proxy_recv")
        t_first = r.get("t_first_token")
        # skip records that lack a prediction or either timestamp
        if cm is None or t0 is None or t_first is None:
            continue
        ttft = t_first - t0
        predicted.append(cm)
        actual.append(ttft)
        sizes.append(r.get("input_length", 0))
        rids.append(r.get("request_id", "?"))
    if not actual:
        # every triggered record was incomplete; max() below would fail
        return
    fig, ax = plt.subplots(figsize=(7, 5))
    ax.scatter(predicted, actual,
               s=[max(100, sz / 100) for sz in sizes],
               color="#d62728", edgecolors="black", alpha=0.75)
    for p, a, sz, rid in zip(predicted, actual, sizes, rids):
        ax.annotate(f"input={sz}",
                    (p, a), xytext=(8, 6), textcoords="offset points",
                    fontsize=9)
    # y=x reference + 10x line + 20x line
    lo = 0.5
    hi = max(50, max(actual) * 1.2)
    ax.plot([lo, hi], [lo, hi], "k--", alpha=0.5, label="y = x (calibrated)")
    ax.plot([lo, hi], [lo * 10, hi * 10], color="gray", linestyle=":",
            alpha=0.4, label="10x")
    ax.plot([lo, hi], [lo * 20, hi * 20], color="lightgray", linestyle=":",
            alpha=0.4, label="20x")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlim(lo, hi)
    ax.set_ylim(lo, hi)
    ax.set_xlabel("Cost model: predicted migrate cost (s)")
    ax.set_ylabel("Realized TTFT (s)")
    ax.set_title(
        "All 5 PD-sep triggered requests in v2.1 sit far above y=x.\n"
        "Real transfer cost ~10-20x what the calibrated model predicted."
    )
    ax.grid(alpha=0.3, which="both")
    ax.legend(loc="lower right")
    fig.tight_layout()
    fig.savefig(OUT / "fig_v2_predicted_vs_actual.png", dpi=120)
    plt.close(fig)


def fig_three_way_hotspot():
    pols = ["unified", "unified_kv_both", "unified_v2"]
    per_worker = {p: _load(f"per_worker_{p}.json") for p in pols}
    workers = sorted(per_worker["unified"]["per_worker_ttft_p90_s"].keys())
    x = range(len(workers))
    width = 0.27
    fig, ax = plt.subplots(figsize=(11, 5))
    for i, p in enumerate(pols):
        d = per_worker[p]["per_worker_ttft_p90_s"]
        vals = [d[w] for w in workers]
        offset = (i - 1) * width  # center the three bars on each worker tick
        ax.bar([j + offset for j in x], vals, width,
               label=f"{p} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})",
               color=POLICY_COLORS[p], edgecolor="black", linewidth=0.4)
    short = [w.replace("http://127.0.0.1:", ":") for w in workers]
    ax.set_xticks(list(x))
    ax.set_xticklabels(short, rotation=0, fontsize=9)
    ax.set_ylabel("worker TTFT p90 (s)")
    # Peak values are read from the data so the title always matches the bars.
    peaks = {p: max(per_worker[p]["per_worker_ttft_p90_s"].values()) for p in pols}
    ax.set_title(
        "Per-worker TTFT p90 distribution. Peak worker: "
        + ", ".join(f"{p} {peaks[p]:.1f}s" for p in pols)
        + ".\nHotspot index per policy is shown in the legend."
    )
    ax.legend(loc="upper left", fontsize=9)
    ax.grid(alpha=0.3, axis="y")
    fig.tight_layout()
    fig.savefig(OUT / "fig_three_way_hotspot.png", dpi=120)
    plt.close(fig)


def main():
    fig_kv_both_overhead()
    fig_v2_trigger_funnel()
    fig_v2_predicted_vs_actual()
    fig_three_way_hotspot()
    print(f"wrote 4 figures to {OUT}")


if __name__ == "__main__":
    main()