Elastic migration v2 section: PD-sep on agentic workload is net negative

New analysis/characterization/elastic_migration_v2/ packages the unified_v2 + unified_kv_both experiments into a self-contained results section that the paper can cite as the "we tried selective PD-sep migration" case study.

The section finds three independent reasons PD-sep doesn't help on agentic w600:

1. Mooncake kv_both substrate alone (no PD-sep ever firing) imposes TTFT p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain unified. Per-step KVConnectorMetadata maintenance and block reservation semantics dominate even when no transfer is pending.
2. PD-sep gate fires only 0.16-0.41% of requests across two gate-tightness configurations. 88-76% are killed by new_local < threshold because 93% intra-session reuse on agentic traces leaves a small uncached tail; 19% are killed by chosen_no_active_decode (snapshot-time gate). Even relaxed thresholds can't grow trigger rate past 0.5%.
3. When PD-sep fires, the calibrated cost model (0.3s + bytes / 2.7 GB/s) is wrong by 10-20x. 5 triggered requests in v2.1 saw realized TTFT 12-45s vs model-predicted migrate cost 0.7-2.2s, consistent with the E2 audit's finding that D-side block pre-reservation and missing layerwise pipelining dominate the decode_sent -> first_token clock.

Three-way comparison (unified vs unified_kv_both vs unified_v2): v2 vs the kv_both control is roughly net-zero (-10% hotspot, -14% TPOT p90, +3% TTFT p90, +9% TTFT p99). v2 vs plain unified is strictly worse by 27-49% across latency percentiles because the kv_both substrate tax is unavoidable when the policy is enabled.

Contents:
- README.md: the four results sections, the three-way comparison table, an explicit "what this claims for the paper" list, and a cross-reference index to the earlier characterization documents.
- data/: b3_policy_comparison.json + per-policy breakdown.json + per-policy hotspot_index.json for the four policies in scope.
- figures/: 4 PNGs rendered by render_figures.py:
  * fig_kv_both_overhead.png: 4-metric bar chart with delta annotations showing kv_both alone costs +45% TTFT p90.
  * fig_v2_trigger_funnel.png: per-reason request count for the two gate configurations on log scale.
  * fig_v2_predicted_vs_actual.png: scatter of model-predicted migrate cost vs realized TTFT for the 5 triggered requests, with y=x, 10x, and 20x reference lines.
  * fig_three_way_hotspot.png: per-worker TTFT p90 grouped bars across the three policies.

The section is intentionally self-contained: it lists what the experiment validates (cost model picks correct candidates; shadow-drift fix is necessary; same-worker interference is real) alongside what it disproves (per-request PD-sep on agentic via Mooncake is not a net win in current implementation).

Refs: E1/E2 subagent audits, B2 microbench, unified_v2 commits 19f69a9 / 4b833d3 / 95c8ef8.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 13:28:37 +08:00
parent 95c8ef853c
commit d76eb02637
15 changed files with 839 additions and 0 deletions
--- a/analysis/characterization/elastic_migration_v2/README.md
+++ b/analysis/characterization/elastic_migration_v2/README.md
@@ -0,0 +1,284 @@
+# Elastic Migration v2: Selective PD-Separation via Mooncake
+
+Date: 2026-05-26
+Trace: `traces/w600_r0.0015_st30.jsonl` (1214 reqs, 274 sessions, 53.3 M tokens)
+Model: Qwen3-Coder-30B-A3B-Instruct, 8 × TP1 on H20
+
+## TL;DR
+
+This section explores whether the **B2-confirmed same-worker
+prefill–decode interference** can be relieved by selectively
+migrating prefill to a different worker for the requests where the
+interference cost would dominate the transfer cost. We implement two
+flavors of the policy (strict gates, then relaxed gates) and a clean
+isolation control (`unified_kv_both`: same picker as `unified`, but
+the vLLMs are launched in `kv_role=kv_both` so the Mooncake
+substrate is on but never triggers).
+
+Three findings:
+
+1. **`kv_role=kv_both` alone imposes a heavy always-on tax**: TTFT
+   p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain `unified`,
+   with no PD-sep ever firing.
+2. **PD-sep almost never triggers on a real agentic workload**:
+   0.16% with strict gates, 0.41% with relaxed gates. Agentic
+   workloads have 93% intra-session reuse, so most requests land on
+   workers that already hold cache — the uncached tail is too small
+   to be worth migrating.
+3. **When PD-sep does fire, the cost model is wrong by ~10–20×**:
+   the calibrated `0.3s + bytes / 2.7 GB/s` predicts 0.7–2.2 s
+   migrate cost for four of the five triggered requests, whose
+   observed TTFT is 12–45 s. The same
+   D-side block-reservation pressure and absence of layerwise
+   pipelining that the E2 audit flagged still dominate.
+
+The net latency of `unified_v2` is **not better than plain
+`unified`**. Improving agentic PD-sep requires fixing the underlying
+Mooncake transfer mechanism (E2 patches 6.1 lazy block reservation
+and 6.3 layerwise pipelining), not the routing decision.
+
+## Substrate
+
+We compare three policies on identical traces:
+
+| policy | picker | vLLM launch mode | what's it for |
+|---|---|---|---|
+| `unified` | hybrid affinity + LMetric | plain (no Mooncake) | the headline baseline |
+| `unified_kv_both` | same as `unified` | `kv_role=kv_both` + bootstrap | isolation control: how much does kv_both *alone* cost? |
+| `unified_v2` | unified + selective PD-sep | `kv_role=kv_both` + bootstrap | the actual experiment |
+
+All three use the same trace, the same 8-instance topology, the same
+shadow-drift–corrected proxy (`scripts/cache_aware_proxy.py` post-fix
+`95c8ef8`). Plain `unified` was rerun on the patched proxy
+(`b3_sweep_20260525_095043/unified`) under the same conditions.
+
+## Result 1 — kv_both is expensive by itself
+
+![](figures/fig_kv_both_overhead.png)
+
+Switching the vLLM launch from plain to `kv_role=kv_both` without
+ever triggering PD-sep already costs:
+
+| metric | plain `unified` | `unified_kv_both` | Δ |
+|---|---:|---:|---|
+| TTFT p50 | 0.50 s | 0.50 s | +0% |
+| TTFT p90 | 7.35 s | 10.67 s | **+45%** |
+| TTFT p99 | 42.34 s | 45.19 s | +7% |
+| TPOT p90 | 17.1 ms | 21.3 ms | **+25%** |
+| E2E p90 | 18.03 s | 22.89 s | **+27%** |
+| APC | 79.4% | 78.3% | −1.1 pp |
+| hotspot index | 3.667 | **4.363** | **+19%** |
+
+Two contributing factors:
+
+1. **The Mooncake `MooncakeConnector` runs even when no transfer is
+   pending.** Every scheduler step it walks `set(cache.keys())`
+   against `_known_hash_keys` (E2 audit §6.5) and updates the
+   `KVConnectorMetadata`. This is O(|cache|) per step on every
+   engine, even when no producer/consumer relationship is active.
+2. **Block reservation semantics differ** under kv_both. The
+   scheduler treats blocks as candidates for export-to-others, so
+   the prefix cache LRU pressure is slightly different (we lose 1
+   pp APC).
+
+Practical implication: **you don't enable kv_both for free**. If a
+deployment wants the option to do PD-sep selectively, the 45% TTFT
+p90 tax applies even on requests that stay local. This cost needs
+to be recovered before any selective-PD-sep policy is worth
+shipping.
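As a cross-check on the Δ column, the percentages can be recomputed from the `unified` and `unified_kv_both` rows of `data/b3_policy_comparison.json`. The sketch below hard-codes those values (rounded) instead of reading the file, and `pct_delta` is a helper name introduced here for illustration only.

```python
# Values copied (rounded) from data/b3_policy_comparison.json.
plain = {
    "ttft_p90_s": 7.345770,
    "tpot_p90_s": 0.017110194,
    "e2e_p90_s": 18.033411,
    "hotspot_index_ttft_p90": 3.667137,
}
kv_both = {
    "ttft_p90_s": 10.671844,
    "tpot_p90_s": 0.021303916,
    "e2e_p90_s": 22.894143,
    "hotspot_index_ttft_p90": 4.363146,
}


def pct_delta(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


# One delta per metric, kv_both relative to plain unified.
deltas = {name: pct_delta(kv_both[name], plain[name]) for name in plain}
for name, delta in deltas.items():
    print(f"{name:>24}: {delta:+.1f}%")
```

The same helper applies unchanged to any other pair of rows in the comparison file.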
+
+## Result 2 — PD-sep rarely fires on a real agentic trace
+
+![](figures/fig_v2_trigger_funnel.png)
+
+We log every routing decision's `v2_reason` (why we did or did not
+PD-sep). Two runs with different gate thresholds:
+
+| fall-through bucket | v2.0 strict | v2.1 relaxed | what it means |
+|---|---:|---:|---|
+| `new_local < threshold` | 1077 (88.7%) | 924 (76.1%) | uncached tail too small to justify transfer |
+| `chosen_no_active_decode` | 115 (9.5%) | 229 (18.9%) | no decode on chosen to protect |
+| `src_cache_below_threshold` | 14 (1.2%) | 36 (3.0%) | no alt instance holds enough cache |
+| `src_not_meaningfully_more_cache` | 6 (0.5%) | 16 (1.3%) | alt instance doesn't help vs chosen |
+| `cost_benefit not enough margin` | 0 | 4 (0.3%) | model says transfer cost + interference on src ≥ local interference |
+| **PD-sep TRIGGERED** | **2 (0.16%)** | **5 (0.41%)** | passed all gates and cost-benefit favored migrate |
+
+The dominant filter is `new_local < threshold`. Even with the
+threshold dropped from 16 k to 8 k tokens, three out of four requests
+have less than 8 k uncached tokens at the chosen worker. This is
+structural: with intra-session reuse measured at 93% on the same
+trace (window_1_results.md), most turns hit prefix cache on the
+session's previous worker.
+
+The second filter, `chosen_no_active_decode`, kills another fifth.
+This is a snapshot-time phenomenon: at the moment the picker runs,
+the chosen worker often has its previous request still in prefill,
+not yet decoding. The gate's intent ("don't migrate if no decode is
+being hurt by the prefill we're routing") is correct, but it ends up
+suppressing PD-sep for a real situation where decode is *about to*
+start.
+
+Even among the requests that pass every gate, the cost-benefit step
+itself rejects nearly half (4 out of 9 in the relaxed run). So the
+final trigger rate of 0.41% is a structural property, not a
+parameter-tuning problem.
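The percentages in the funnel table follow directly from the per-bucket counts. A minimal sketch, with counts transcribed from the table above; the dictionary layout is illustrative and is not the schema of the `breakdown_<policy>.json` files.

```python
# (strict, relaxed) request counts per bucket, transcribed from the funnel table.
funnel = {
    "new_local < threshold": (1077, 924),
    "chosen_no_active_decode": (115, 229),
    "src_cache_below_threshold": (14, 36),
    "src_not_meaningfully_more_cache": (6, 16),
    "cost_benefit not enough margin": (0, 4),
    "PD-sep TRIGGERED": (2, 5),
}
RUNS = ("strict", "relaxed")

# Every request lands in exactly one bucket, so each column sums to the trace size.
totals = {run: sum(counts[i] for counts in funnel.values())
          for i, run in enumerate(RUNS)}

share_pct = {
    bucket: {run: 100.0 * counts[i] / totals[run] for i, run in enumerate(RUNS)}
    for bucket, counts in funnel.items()
}
trigger_pct = share_pct["PD-sep TRIGGERED"]

# Fraction of relaxed-run requests that reached the cost-benefit step and were rejected.
rejected = funnel["cost_benefit not enough margin"][1]
reached = rejected + funnel["PD-sep TRIGGERED"][1]
cost_benefit_reject_frac = rejected / reached

for run in RUNS:
    print(f"{run}: {totals[run]} requests, trigger rate {trigger_pct[run]:.2f}%")
print(f"cost-benefit rejects {cost_benefit_reject_frac:.0%} of gate survivors (relaxed)")
```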
+
+## Result 3 — when PD-sep fires, the cost model is wrong by 10–20×
+
+![](figures/fig_v2_predicted_vs_actual.png)
+
+The 5 PD-sep-triggered requests in v2.1 relaxed:
+
+| input | new_local | new_src | src→dst | cost_local | cost_migrate (model) | actual TTFT | actual E2E |
+|---:|---:|---:|---|---:|---:|---:|---:|
+| 21963 | 21963 |  9163 | 6→5 | 4.39 s | 4.17 s |   3.69 s |   8.48 s |
+|  8706 |  8706 |  2050 | 5→7 | 1.09 s | 0.73 s |  12.48 s |  14.31 s |
+| 13616 | 13616 |  2352 | 4→0 | 1.70 s | 1.03 s |  18.33 s |  19.50 s |
+| 49483 | 49483 |   843 | 3→4 | 11.75 s | 2.16 s | **45.13 s** | **53.55 s** |
+| 19806 | 19806 |   350 | 3→6 | 3.96 s | 1.06 s |  20.06 s |  31.98 s |
+
+For four of the five requests the cost model predicts a migrate
+cost of 0.7–2.2 s, while the actual TTFT is 12–45 s (the 22 k
+request on 6→5 is the exception: 4.17 s predicted, 3.69 s
+observed). The model's `0.3 s +
+bytes / 2.7 GB/s` calibration captures pure RDMA bandwidth in
+isolation but misses everything else that happens on the
+`decode_sent → first_token` clock: D-side scheduler step latency,
+block reservation before KV arrives (so D's cache pressure
+increases for the entire wait), the per-layer scatter of
+`batch_transfer_sync_write`, and the next-step scheduler promotion
+after `finished_recving`. The E2 audit measured this end-to-end at
+p50 = 1.1 s and **p90 = 6.7 s** on production runs; the v2.1
+triggered requests landed in the p99 tail of that distribution
+because their dst was already loaded.
+
+The first-token clock for the 49 k request is **21× the model's
+prediction**. This is not a small mis-tuning — it's a structurally
+different model.
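The size of the gap can be read off per request by dividing realized TTFT by the model's migrate prediction. The sketch below uses the five rows of the table above. `migrate_cost_s` only restates the calibrated formula quoted in this section, and it assumes 1 GB = 1e9 bytes, which the source does not specify.

```python
def migrate_cost_s(kv_bytes: float, setup_s: float = 0.3,
                   bandwidth_bytes_per_s: float = 2.7e9) -> float:
    """Calibrated model quoted above: fixed setup time plus bytes over bandwidth."""
    return setup_s + kv_bytes / bandwidth_bytes_per_s


# (input tokens, cost_migrate predicted by the model in s, realized TTFT in s)
triggered = [
    (21963, 4.17, 3.69),
    (8706, 0.73, 12.48),
    (13616, 1.03, 18.33),
    (49483, 2.16, 45.13),
    (19806, 1.06, 20.06),
]

# Realized TTFT as a multiple of the model's prediction, keyed by input size.
ratios = {tokens: actual / predicted for tokens, predicted, actual in triggered}
for tokens, ratio in ratios.items():
    print(f"{tokens:>6} tokens: realized TTFT is {ratio:.1f}x the predicted migrate cost")
```

Four of the five requests land at roughly 17–21× the prediction; the 21963-token request comes in slightly under it.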
+
+## Result 4 — three-way comparison
+
+![](figures/fig_three_way_hotspot.png)
+
+The full table:
+
+| metric | unified (plain) | unified_kv_both | unified_v2 (relaxed) |
+|---|---:|---:|---:|
+| n_ok | 1213 | 1214 | 1214 |
+| TTFT p50 | 0.50 s | 0.50 s | 0.49 s |
+| TTFT p90 | 7.35 s | 10.67 s | 10.98 s |
+| TTFT p99 | 42.34 s | 45.19 s | 49.45 s |
+| TPOT p90 | 17.1 ms | 21.3 ms | 18.4 ms |
+| E2E p90 | 18.03 s | 22.89 s | 22.53 s |
+| APC | 79.4% | 78.3% | 77.6% |
+| interference index | n/a (no engine_state) | 8.57 | 8.46 |
+| hotspot index | 3.667 | 4.363 | 3.910 |
+| n_slow | 189 | 198 | 198 |
+
+### v2 vs the kv_both control (the right comparison)
+
+Compared to the kv_both control — same substrate, no PD-sep — the
+5 PD-sep triggers in v2:
+
+- **slightly improve TPOT p90 (−14%) and hotspot (−10%)**
+- **slightly worsen TTFT p90 (+3%) and TTFT p99 (+9%)**, because the
+  triggered requests themselves take ~20× the predicted transfer
+  time
+
+The net effect against the kv_both control is in the noise. The
+hotspot improvement is within the run-to-run stochastic range we saw
+earlier (v2 strict run scored 2.733 hotspot under the same
+substrate; v2 relaxed scored 3.910).
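This section does not spell out how the hotspot index is defined. The per-worker files under `data/` are consistent with it being the maximum per-worker TTFT p90 divided by the median per-worker TTFT p90. The sketch below checks that reading against two of the files; treat the formula as inferred from the data, not as the documented definition. `hotspot_index` is a name introduced here.

```python
from __future__ import annotations

from statistics import median


def hotspot_index(per_worker_ttft_p90_s: dict[str, float]) -> float:
    """Max over median of per-worker TTFT p90 (definition inferred from the data)."""
    values = list(per_worker_ttft_p90_s.values())
    return max(values) / median(values)


# per_worker_ttft_p90_s from data/per_worker_unified.json, keyed by port.
unified = {
    "8000": 11.264844838529825, "8001": 3.6063860427122614,
    "8002": 16.175747957825664, "8003": 9.314684258581842,
    "8004": 37.73397144810297, "8005": 18.328030522551852,
    "8006": 3.6328767628350773, "8007": 7.772977900883419,
}
# per_worker_ttft_p90_s from data/per_worker_unified_kv_both.json.
kv_both = {
    "8000": 2.725343641615472, "8001": 30.449911632167645,
    "8002": 16.297463109577073, "8003": 6.766894554614579,
    "8004": 11.146178993489595, "8005": 4.552643961587455,
    "8006": 6.90922680192164, "8007": 7.048551249800954,
}

print(f"unified:         {hotspot_index(unified):.3f}")
print(f"unified_kv_both: {hotspot_index(kv_both):.3f}")
```

Because the index is a ratio against the median, a single slow worker can move it substantially, which fits the run-to-run spread noted above.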
+
+### v2 vs plain unified (the headline question)
+
+`unified_v2` is **25% slower on E2E p90** and **49% slower on TTFT
+p90** than plain `unified`. Roughly 45 of those 49 percentage
+points of TTFT p90 inflation come from the kv_both substrate, not
+the routing decision; nothing PD-sep does can recover this in our
+current Mooncake implementation.
+
+## Why v2's PD-sep is fundamentally choked
+
+There are three independent structural problems, each by itself
+enough to make v2 not win:
+
+1. **The kv_both substrate is the wrong default**. It pays a 45%
+   TTFT p90 tax whether or not any request migrates. To make
+   selective PD-sep beat plain `unified`, the interference saved
+   per triggered request times the trigger rate must exceed that
+   45% tax. With a 0.41% trigger rate, even saving 100% of TTFT on
+   every triggered request would recover only ~0.4%, far short of 45%.
+
+2. **Agentic intra-session reuse leaves no headroom for migration**.
+   Most turns hit cache on the worker that handled the previous
+   turn. Migrating prefill to a *different* worker is the *exact*
+   thing intra-session affinity tries to avoid: it forces the new
+   worker to pay for the cached prefix transfer instead of just
+   reusing what's already on the affinity worker. This is a
+   structural mismatch between PD-sep semantics ("send big prefills
+   to a less-busy worker") and agentic workloads ("keep sessions
+   sticky to wherever the cache is").
+
+3. **The Mooncake mechanism is 10–20× slower than the cost model
+   predicts**, primarily due to D-side pre-allocation of KV blocks
+   and the absence of layerwise pipelining (E2 audit §6.1 / §6.3).
+   The cost model can be re-calibrated, but doing so would push the
+   gate even tighter, dropping the already-tiny trigger rate to
+   nearly zero.
+
+The three are stacked: even if any two were fixed, the remaining
+one would still make PD-sep a net loss on this trace.
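The break-even argument in item 1 can be written out as arithmetic. The sketch below makes the same simplification as the text: it treats the 45% TTFT p90 tax as a uniform relative cost and takes the best case in which a triggered request saves 100% of its TTFT. Variable names are illustrative.

```python
n_requests = 1214        # requests in the trace
n_triggered = 5          # PD-sep triggers in the relaxed-gate run
substrate_tax = 0.45     # relative TTFT p90 cost of running under kv_both
best_case_saving = 1.0   # upper bound: a triggered request saves all of its TTFT

trigger_rate = n_triggered / n_requests

# Largest relative saving PD-sep could deliver at the observed trigger rate.
max_recovered = trigger_rate * best_case_saving

# Trigger rate that would be needed just to cancel the substrate tax.
required_trigger_rate = substrate_tax / best_case_saving

# How many times more often PD-sep would have to fire to break even.
shortfall = required_trigger_rate / trigger_rate

print(f"observed trigger rate:   {trigger_rate:.2%}")
print(f"best-case recovery:      {max_recovered:.2%}")
print(f"break-even trigger rate: {required_trigger_rate:.0%}")
print(f"shortfall factor:        {shortfall:.0f}x")
```

Under these assumptions the trigger rate would need to grow by about two orders of magnitude before the substrate tax is paid back.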
+
+## What this section claims for the paper
+
+1. **Same-worker prefill–decode interference is a real mechanism**
+   (B2 microbench), but **agentic workloads rarely expose it**: the
+   typical request has high cache hit and small uncached tail, so
+   the interference cost is bounded.
+2. **Routing-only solutions (unified) already reach 79% APC against
+   the 93% intra-session reuse ceiling and recover the latency** by
+   avoiding the heavy-tail sessions through the affinity gate. The
+   remaining ~14 pp gap to the ceiling is from APC LRU eviction
+   under capacity pressure, not from prefill–decode interference.
+3. **Per-request PD-sep via Mooncake on agentic workloads is not a
+   net win** in our measurements, even with a carefully-gated cost
+   model. The combined effect of kv_both substrate overhead, low
+   trigger rate, and mechanism-vs-model gap is uniformly negative.
+4. **A productive direction is mechanism-level**: fix the Mooncake
+   D-side block reservation (E2 §6.1), implement layerwise transfer
+   pipelining (E2 §6.3), and re-measure. Only if these patches drop
+   the substrate tax to <10% and the realized transfer to ≤2 s p90
+   does PD-sep become competitive with routing on agentic traces.
+
+## What v2 still validates
+
+- **The cost model's *qualitative* shape is correct**: when it says
+  "migrate", that's a request where local interference *would have*
+  been ≥ 4 s and src has ≥ 80% prefix cache. The model picks the
+  right candidate requests.
+- **The gate logic catches the right exclusions**: 76–89% by
+  uncached tail size, 10–19% by no-decode-to-protect (relaxed and
+  strict runs respectively bound each range), the rest by missing
+  source cache. Each is a structurally correct reason.
+- **The proxy shadow-drift fix is necessary infrastructure** for
+  any long-running routing experiment. We observed 3 phantom
+  corrections per ~50-minute run.
+
+## Files
+
+- `data/b3_policy_comparison.json` — the four policies' headline
+  metrics from the same B3 sweep root; the file also carries rows
+  for the `capped`, `lmetric`, `load_only`, and `sticky` baselines.
+- `data/breakdown_<policy>.json` — per-request proxy breakdown
+  including v2 gate fields and triggered-event metadata.
+- `data/per_worker_<policy>.json` — per-worker TTFT/latency p90s
+  used in the hotspot figure.
+- `figures/*.png` — the four section figures referenced above.
+- `render_figures.py` — regenerates the figures from data/.
+
+## Cross-references
+
+- `analysis/characterization/window_1_results.md` — B2 microbench
+  (same-worker interference causal proof) and B3 baseline 5-policy
+  sweep
+- `analysis/characterization/agentic_dispatch_coupling.md` — why
+  the saturated-replay setup matches agentic production
+- `analysis/characterization/b3_policies_pseudocode.md` — pickers
+  for the five baseline policies; `unified_v2` extends `unified`
+- E1 / E2 subagent reports (commit `4b833d3` message and the
+  conversation log) — full mechanism audit that informed v2's design
--- a/analysis/characterization/elastic_migration_v2/data/b3_policy_comparison.json
+++ b/analysis/characterization/elastic_migration_v2/data/b3_policy_comparison.json
@@ -0,0 +1,211 @@
+{
+  "rows": [
+    {
+      "policy": "capped",
+      "n_ok": 770,
+      "n_total": 770,
+      "ttft_p50_s": 1.1989156164927408,
+      "ttft_p90_s": 12.827629912580612,
+      "ttft_p99_s": 46.61752380923125,
+      "tpot_p50_s": 0.007231239004497606,
+      "tpot_p90_s": 0.015998617687440243,
+      "tpot_p99_s": 0.11515370831539476,
+      "e2e_p50_s": 2.598489043477457,
+      "e2e_p90_s": 21.245602010778384,
+      "e2e_p99_s": 74.60736650204846,
+      "apc_ratio": 0.3158312503528108,
+      "interference_index": 6.331064378362814,
+      "hotspot_index_ttft_p90": 2.0204268015410918,
+      "reuse_intra_frac": 0.9192657105586233,
+      "reuse_cross_frac": 0.0602232594931501,
+      "n_slow": 185,
+      "failure_counts": {
+        "cache_miss_large_append": 60,
+        "hot_worker_queue": 66,
+        "same_worker_prefill_overlap": 45,
+        "unknown": 14
+      }
+    },
+    {
+      "policy": "lmetric",
+      "n_ok": 1214,
+      "n_total": 1214,
+      "ttft_p50_s": 0.9387824369769078,
+      "ttft_p90_s": 15.671339168207492,
+      "ttft_p99_s": 53.56683189840049,
+      "tpot_p50_s": 0.008854518407308914,
+      "tpot_p90_s": 0.02122720699121469,
+      "tpot_p99_s": 0.18280341184277568,
+      "e2e_p50_s": 2.754255389008904,
+      "e2e_p90_s": 24.8209177934099,
+      "e2e_p99_s": 80.59924928059091,
+      "apc_ratio": 0.5694312382571595,
+      "interference_index": 6.530231061794441,
+      "hotspot_index_ttft_p90": 2.252837147833725,
+      "reuse_intra_frac": 0.9321238805590836,
+      "reuse_cross_frac": 0.05679481258506571,
+      "n_slow": 295,
+      "failure_counts": {
+        "cache_miss_large_append": 94,
+        "hot_worker_queue": 68,
+        "same_worker_prefill_overlap": 69,
+        "unknown": 64
+      }
+    },
+    {
+      "policy": "load_only",
+      "n_ok": 1214,
+      "n_total": 1214,
+      "ttft_p50_s": 1.2609447415161412,
+      "ttft_p90_s": 20.197147866390882,
+      "ttft_p99_s": 52.84285237012196,
+      "tpot_p50_s": 0.009231464695980247,
+      "tpot_p90_s": 0.026851662550158716,
+      "tpot_p99_s": 0.3211630676943426,
+      "e2e_p50_s": 3.58568156149704,
+      "e2e_p90_s": 33.459180271782685,
+      "e2e_p99_s": 93.95083751494239,
+      "apc_ratio": 0.5412093853102866,
+      "interference_index": 9.16424627504275,
+      "hotspot_index_ttft_p90": 1.2940319990630569,
+      "reuse_intra_frac": 0.9353191550754928,
+      "reuse_cross_frac": 0.053372184678592026,
+      "n_slow": 379,
+      "failure_counts": {
+        "cache_miss_large_append": 151,
+        "hot_worker_queue": 33,
+        "same_worker_prefill_overlap": 108,
+        "unknown": 87
+      }
+    },
+    {
+      "policy": "sticky",
+      "n_ok": 1214,
+      "n_total": 1214,
+      "ttft_p50_s": 0.5415176274836995,
+      "ttft_p90_s": 18.021296651283045,
+      "ttft_p99_s": 74.09429564891524,
+      "tpot_p50_s": 0.008952101894096181,
+      "tpot_p90_s": 0.03641285916619554,
+      "tpot_p99_s": 0.35152006935195085,
+      "e2e_p50_s": 2.081947358994512,
+      "e2e_p90_s": 34.62592205510591,
+      "e2e_p99_s": 139.68334607904353,
+      "apc_ratio": 0.7720092868396378,
+      "interference_index": 13.651718321568111,
+      "hotspot_index_ttft_p90": 2.727756623171119,
+      "reuse_intra_frac": 0.9327723488279339,
+      "reuse_cross_frac": 0.05495149683864246,
+      "n_slow": 234,
+      "failure_counts": {
+        "cache_miss_large_append": 20,
+        "hot_worker_queue": 51,
+        "same_worker_prefill_overlap": 134,
+        "unknown": 29
+      }
+    },
+    {
+      "policy": "unified",
+      "n_ok": 1213,
+      "n_total": 1214,
+      "ttft_p50_s": 0.4997710260213353,
+      "ttft_p90_s": 7.345769894809922,
+      "ttft_p99_s": 42.34170345296613,
+      "tpot_p50_s": 0.008079791456705824,
+      "tpot_p90_s": 0.017110194704198407,
+      "tpot_p99_s": 0.12655874612209597,
+      "e2e_p50_s": 1.7495028690318577,
+      "e2e_p90_s": 18.033410895219994,
+      "e2e_p99_s": 68.80023987947489,
+      "apc_ratio": 0.794261466256467,
+      "interference_index": null,
+      "hotspot_index_ttft_p90": 3.667136528736114,
+      "reuse_intra_frac": 0.9311187350942534,
+      "reuse_cross_frac": 0.056702150437367635,
+      "n_slow": 189,
+      "failure_counts": {
+        "cache_miss_large_append": 18,
+        "hot_worker_queue": 116,
+        "unknown": 55
+      }
+    },
+    {
+      "policy": "unified_kv_both",
+      "n_ok": 1214,
+      "n_total": 1214,
+      "ttft_p50_s": 0.4958424885116983,
+      "ttft_p90_s": 10.671844050800438,
+      "ttft_p99_s": 45.19353310586651,
+      "tpot_p50_s": 0.008573156389059812,
+      "tpot_p90_s": 0.021303916384344358,
+      "tpot_p99_s": 0.21501837408937963,
+      "e2e_p50_s": 1.9310281965008471,
+      "e2e_p90_s": 22.8941433175176,
+      "e2e_p99_s": 76.06128971517893,
+      "apc_ratio": 0.7828397082703908,
+      "interference_index": 8.571603637346875,
+      "hotspot_index_ttft_p90": 4.363145984888287,
+      "reuse_intra_frac": 0.9313000825240145,
+      "reuse_cross_frac": 0.056182260858791105,
+      "n_slow": 198,
+      "failure_counts": {
+        "cache_miss_large_append": 28,
+        "hot_worker_queue": 34,
+        "same_worker_prefill_overlap": 87,
+        "unknown": 49
+      }
+    },
+    {
+      "policy": "unified_v2",
+      "n_ok": 1214,
+      "n_total": 1214,
+      "ttft_p50_s": 0.4851180294645019,
+      "ttft_p90_s": 10.97665627548705,
+      "ttft_p99_s": 49.44861259821856,
+      "tpot_p50_s": 0.008261419251554481,
+      "tpot_p90_s": 0.018414033703249108,
+      "tpot_p99_s": 0.20999689490980364,
+      "e2e_p50_s": 1.8092182099935599,
+      "e2e_p90_s": 22.528888442111203,
+      "e2e_p99_s": 82.40234094743934,
+      "apc_ratio": 0.7758437361549086,
+      "interference_index": 8.45656745230457,
+      "hotspot_index_ttft_p90": 3.9096187869766164,
+      "reuse_intra_frac": 0.9324663389938368,
+      "reuse_cross_frac": 0.055154184817413764,
+      "n_slow": 198,
+      "failure_counts": {
+        "cache_miss_large_append": 36,
+        "hot_worker_queue": 26,
+        "same_worker_prefill_overlap": 82,
+        "unknown": 54
+      }
+    },
+    {
+      "policy": "unified_v2_strict",
+      "n_ok": 1214,
+      "n_total": 1214,
+      "ttft_p50_s": 0.4849805940175429,
+      "ttft_p90_s": 8.960840504511737,
+      "ttft_p99_s": 44.63598358390898,
+      "tpot_p50_s": 0.008222105788569446,
+      "tpot_p90_s": 0.018078321745916927,
+      "tpot_p99_s": 0.14616439095890604,
+      "e2e_p50_s": 1.8335122870048508,
+      "e2e_p90_s": 22.435233922180526,
+      "e2e_p99_s": 68.254801789901,
+      "apc_ratio": 0.789281361129855,
+      "interference_index": 6.231677388887276,
+      "hotspot_index_ttft_p90": 2.7334230011629197,
+      "reuse_intra_frac": 0.9309082618411778,
+      "reuse_cross_frac": 0.05689887985860397,
+      "n_slow": 186,
+      "failure_counts": {
+        "cache_miss_large_append": 26,
+        "hot_worker_queue": 44,
+        "same_worker_prefill_overlap": 73,
+        "unknown": 43
+      }
+    }
+  ]
+}
--- a/analysis/characterization/elastic_migration_v2/data/breakdown_unified.json
+++ b/analysis/characterization/elastic_migration_v2/data/breakdown_unified.json
--- a/analysis/characterization/elastic_migration_v2/data/breakdown_unified_kv_both.json
+++ b/analysis/characterization/elastic_migration_v2/data/breakdown_unified_kv_both.json
--- a/analysis/characterization/elastic_migration_v2/data/breakdown_unified_v2.json
+++ b/analysis/characterization/elastic_migration_v2/data/breakdown_unified_v2.json
--- a/analysis/characterization/elastic_migration_v2/data/breakdown_unified_v2_strict.json
+++ b/analysis/characterization/elastic_migration_v2/data/breakdown_unified_v2_strict.json
--- a/analysis/characterization/elastic_migration_v2/data/per_worker_unified.json
+++ b/analysis/characterization/elastic_migration_v2/data/per_worker_unified.json
@@ -0,0 +1,24 @@
+{
+  "hotspot_index_ttft_p90": 3.667136528736114,
+  "per_worker_latency_p90_s": {
+    "http://127.0.0.1:8000": 41.42001512600109,
+    "http://127.0.0.1:8001": 12.4878579101933,
+    "http://127.0.0.1:8002": 22.462878945574648,
+    "http://127.0.0.1:8003": 15.501050900109117,
+    "http://127.0.0.1:8004": 39.956250199786155,
+    "http://127.0.0.1:8005": 36.69850301651168,
+    "http://127.0.0.1:8006": 10.116177947795954,
+    "http://127.0.0.1:8007": 20.35038618039107
+  },
+  "per_worker_ttft_p90_s": {
+    "http://127.0.0.1:8000": 11.264844838529825,
+    "http://127.0.0.1:8001": 3.6063860427122614,
+    "http://127.0.0.1:8002": 16.175747957825664,
+    "http://127.0.0.1:8003": 9.314684258581842,
+    "http://127.0.0.1:8004": 37.73397144810297,
+    "http://127.0.0.1:8005": 18.328030522551852,
+    "http://127.0.0.1:8006": 3.6328767628350773,
+    "http://127.0.0.1:8007": 7.772977900883419
+  },
+  "status": "supported"
+}
--- a/analysis/characterization/elastic_migration_v2/data/per_worker_unified_kv_both.json
+++ b/analysis/characterization/elastic_migration_v2/data/per_worker_unified_kv_both.json
@@ -0,0 +1,24 @@
+{
+  "hotspot_index_ttft_p90": 4.363145984888287,
+  "per_worker_latency_p90_s": {
+    "http://127.0.0.1:8000": 7.273825440008658,
+    "http://127.0.0.1:8001": 40.48809068736155,
+    "http://127.0.0.1:8002": 24.491076068370596,
+    "http://127.0.0.1:8003": 18.828550089401002,
+    "http://127.0.0.1:8004": 20.06954986089262,
+    "http://127.0.0.1:8005": 9.634067087399307,
+    "http://127.0.0.1:8006": 35.7432237003348,
+    "http://127.0.0.1:8007": 24.362499430915342
+  },
+  "per_worker_ttft_p90_s": {
+    "http://127.0.0.1:8000": 2.725343641615472,
+    "http://127.0.0.1:8001": 30.449911632167645,
+    "http://127.0.0.1:8002": 16.297463109577073,
+    "http://127.0.0.1:8003": 6.766894554614579,
+    "http://127.0.0.1:8004": 11.146178993489595,
+    "http://127.0.0.1:8005": 4.552643961587455,
+    "http://127.0.0.1:8006": 6.90922680192164,
+    "http://127.0.0.1:8007": 7.048551249800954
+  },
+  "status": "supported"
+}
--- a/analysis/characterization/elastic_migration_v2/data/per_worker_unified_v2.json
+++ b/analysis/characterization/elastic_migration_v2/data/per_worker_unified_v2.json
@@ -0,0 +1,24 @@
+{
+  "hotspot_index_ttft_p90": 3.9096187869766164,
+  "per_worker_latency_p90_s": {
+    "http://127.0.0.1:8000": 27.12522437740119,
+    "http://127.0.0.1:8001": 15.299228341400166,
+    "http://127.0.0.1:8002": 49.346961313998335,
+    "http://127.0.0.1:8003": 22.404519376007386,
+    "http://127.0.0.1:8004": 22.470557069155618,
+    "http://127.0.0.1:8005": 17.487964828591807,
+    "http://127.0.0.1:8006": 21.76291022058577,
+    "http://127.0.0.1:8007": 18.311422476416926
+  },
+  "per_worker_ttft_p90_s": {
+    "http://127.0.0.1:8000": 9.26557928660186,
+    "http://127.0.0.1:8001": 5.734943528624719,
+    "http://127.0.0.1:8002": 38.812515752378395,
+    "http://127.0.0.1:8003": 10.589305737824198,
+    "http://127.0.0.1:8004": 10.83847834250191,
+    "http://127.0.0.1:8005": 5.034968857781501,
+    "http://127.0.0.1:8006": 3.5207203380181493,
+    "http://127.0.0.1:8007": 12.236044214287555
+  },
+  "status": "supported"
+}
--- a/analysis/characterization/elastic_migration_v2/data/per_worker_unified_v2_strict.json
+++ b/analysis/characterization/elastic_migration_v2/data/per_worker_unified_v2_strict.json
@@ -0,0 +1,24 @@
+{
+  "hotspot_index_ttft_p90": 2.7334230011629197,
+  "per_worker_latency_p90_s": {
+    "http://127.0.0.1:8000": 11.098119341616997,
+    "http://127.0.0.1:8001": 23.1559918191866,
+    "http://127.0.0.1:8002": 22.57899510498975,
+    "http://127.0.0.1:8003": 9.956129518186204,
+    "http://127.0.0.1:8004": 28.072633931197924,
+    "http://127.0.0.1:8005": 47.2373243979877,
+    "http://127.0.0.1:8006": 23.23235769500608,
+    "http://127.0.0.1:8007": 27.031178803613876
+  },
+  "per_worker_ttft_p90_s": {
+    "http://127.0.0.1:8000": 3.1871710045961663,
+    "http://127.0.0.1:8001": 8.824780725361773,
+    "http://127.0.0.1:8002": 16.364250262192222,
+    "http://127.0.0.1:8003": 4.1765614019881445,
+    "http://127.0.0.1:8004": 14.026077619416176,
+    "http://127.0.0.1:8005": 24.662665293016516,
+    "http://127.0.0.1:8006": 9.220479947811697,
+    "http://127.0.0.1:8007": 8.441550621995741
+  },
+  "status": "supported"
+}
--- a/analysis/characterization/elastic_migration_v2/figures/fig_kv_both_overhead.png
+++ b/analysis/characterization/elastic_migration_v2/figures/fig_kv_both_overhead.png
--- a/analysis/characterization/elastic_migration_v2/figures/fig_three_way_hotspot.png
+++ b/analysis/characterization/elastic_migration_v2/figures/fig_three_way_hotspot.png
--- a/analysis/characterization/elastic_migration_v2/figures/fig_v2_predicted_vs_actual.png
+++ b/analysis/characterization/elastic_migration_v2/figures/fig_v2_predicted_vs_actual.png
--- a/analysis/characterization/elastic_migration_v2/figures/fig_v2_trigger_funnel.png
+++ b/analysis/characterization/elastic_migration_v2/figures/fig_v2_trigger_funnel.png
--- a/analysis/characterization/elastic_migration_v2/render_figures.py
+++ b/analysis/characterization/elastic_migration_v2/render_figures.py
@@ -0,0 +1,244 @@
+"""Render PNG figures for the elastic_migration_v2 section.
+
+Inputs in ./data/ :
+- b3_policy_comparison.json
+- breakdown_unified.json, breakdown_unified_kv_both.json,
+  breakdown_unified_v2.json, breakdown_unified_v2_strict.json
+- per_worker_<policy>.json for each of the four
+
+Outputs in ./figures/ :
+- fig_kv_both_overhead.png      — three-way latency bars (plain vs kv_both vs v2)
+- fig_v2_trigger_funnel.png     — request count per fall-through reason
+- fig_v2_predicted_vs_actual.png — cost-model migrate prediction vs realized TTFT
+- fig_three_way_hotspot.png      — per-worker TTFT p90 grouped bars
+"""
+
+from __future__ import annotations
+
+import json
+from collections import Counter
+from pathlib import Path
+
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+
+ROOT = Path(__file__).parent
+DATA = ROOT / "data"
+OUT = ROOT / "figures"
+OUT.mkdir(parents=True, exist_ok=True)
+
+
+def _load(name: str):
+    return json.loads((DATA / name).read_text())
+
+
+POLICY_COLORS = {
+    "unified":            "#2ca02c",
+    "unified_kv_both":    "#9467bd",
+    "unified_v2":         "#d62728",
+    "unified_v2_strict":  "#ff7f0e",
+}
+
+
+def fig_kv_both_overhead():
+    comp = _load("b3_policy_comparison.json")
+    by = {r["policy"]: r for r in comp["rows"]}
+    pols = ["unified", "unified_kv_both", "unified_v2"]
+    metrics = [
+        ("TTFT p90 (s)",   lambda r: r["ttft_p90_s"]),
+        ("TPOT p90 (ms)",  lambda r: r["tpot_p90_s"] * 1000),
+        ("E2E p90 (s)",    lambda r: r["e2e_p90_s"]),
+        ("hotspot index",  lambda r: r["hotspot_index_ttft_p90"]),
+    ]
+    fig, axes = plt.subplots(1, 4, figsize=(14, 4))
+    for ax, (label, fn) in zip(axes, metrics):
+        vals = [fn(by[p]) for p in pols]
+        bars = ax.bar(pols, vals,
+                       color=[POLICY_COLORS[p] for p in pols],
+                       edgecolor="black", linewidth=0.5)
+        ax.set_title(label)
+        ax.tick_params(axis="x", rotation=20, labelsize=9)
+        for b, v in zip(bars, vals):
+            ax.text(b.get_x() + b.get_width() / 2, v,
+                     f"{v:.2f}" if v < 100 else f"{v:.0f}",
+                     ha="center", va="bottom", fontsize=9)
+        ax.grid(alpha=0.3, axis="y")
+        # Annotate every non-baseline bar with its percent change relative to
+        # the first policy (plain unified), drawn at half the bar height.
+        baseline = vals[0]
+        for i, v in enumerate(vals):
+            if i == 0:
+                continue
+            pct = (v - baseline) / baseline * 100
+            ax.text(i, v * 0.5, f"{pct:+.0f}%", ha="center",
+                     fontsize=10, fontweight="bold",
+                     color="darkred" if pct > 0 else "darkgreen")
+    fig.suptitle(
+        "kv_both adds ~45% to TTFT p90 even without PD-sep firing.\n"
+        "v2's PD-sep barely recovers the gap (and overshoots TTFT p99)."
+    )
+    fig.tight_layout()
+    fig.savefig(OUT / "fig_kv_both_overhead.png", dpi=120)
+    plt.close(fig)
+
+
+def _bucket_reasons(data):
+    """Collapse v2_reason strings into the funnel buckets."""
+    buckets = Counter()
+    for r in data:
+        if r.get("v2_pd_sep") is True:
+            buckets["PD-sep TRIGGERED"] += 1
+            continue
+        reason = (r.get("v2_reason") or "no_v2_reason").split(" (")[0]
+        if reason.startswith("local_cost"):
+            reason = "cost_benefit not enough margin"
+        buckets[reason] += 1
+    return buckets
+
+
+def fig_v2_trigger_funnel():
+    strict = _load("breakdown_unified_v2_strict.json")
+    relaxed = _load("breakdown_unified_v2.json")
+    bs = _bucket_reasons(strict)
+    br = _bucket_reasons(relaxed)
+    order = [
+        "new_local_below_threshold",
+        "chosen_no_active_decode",
+        "chosen_few_decodes",
+        "src_cache_below_threshold",
+        "src_not_meaningfully_more_cache",
+        "cost_benefit not enough margin",
+        "PD-sep TRIGGERED",
+    ]
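+    # Keep the known funnel order, then append any reason not listed above so
+    # the bars account for every request counted in the legend totals.
+    labels = [k for k in order if k in bs or k in br]
+    labels += sorted((set(bs) | set(br)) - set(order))
+    strict_vals = [bs.get(k, 0) for k in labels]
+    relaxed_vals = [br.get(k, 0) for k in labels]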
+
+    x = range(len(labels))
+    width = 0.4
+    fig, ax = plt.subplots(figsize=(11, 5))
+    ax.bar([i - width / 2 for i in x], strict_vals, width,
+            label=f"v2.0 strict (PD-sep={bs['PD-sep TRIGGERED']}/{sum(bs.values())} "
+                  f"= {bs['PD-sep TRIGGERED']*100/sum(bs.values()):.2f}%)",
+            color="#ff7f0e", edgecolor="black", linewidth=0.5)
+    ax.bar([i + width / 2 for i in x], relaxed_vals, width,
+            label=f"v2.1 relaxed (PD-sep={br['PD-sep TRIGGERED']}/{sum(br.values())} "
+                  f"= {br['PD-sep TRIGGERED']*100/sum(br.values()):.2f}%)",
+            color="#d62728", edgecolor="black", linewidth=0.5)
+    ax.set_xticks(list(x))
+    ax.set_xticklabels(labels, rotation=20, ha="right", fontsize=9)
+    ax.set_ylabel("request count")
+    ax.set_yscale("log")
+    ax.set_title(
+        "Why v2 rarely PD-seps: 76-88% of requests have new_local < threshold\n"
+        "(intra-session cache already hot). Relaxing thresholds barely helps."
+    )
+    ax.legend()
+    ax.grid(alpha=0.3, axis="y", which="both")
+    for i, (s, r) in enumerate(zip(strict_vals, relaxed_vals)):
+        if s > 0:
+            ax.text(i - width / 2, s * 1.05, str(s), ha="center", fontsize=8)
+        if r > 0:
+            ax.text(i + width / 2, r * 1.05, str(r), ha="center", fontsize=8)
+    fig.tight_layout()
+    fig.savefig(OUT / "fig_v2_trigger_funnel.png", dpi=120)
+    plt.close(fig)
+
+
+def fig_v2_predicted_vs_actual():
+    """For each PD-sep'd request, plot model-predicted migrate cost
+    vs realized TTFT. Should sit near y=x if model is calibrated; sits
+    far above if mechanism is more expensive than modeled."""
+    relaxed = _load("breakdown_unified_v2.json")
+    triggered = [r for r in relaxed if r.get("v2_pd_sep") is True]
+    predicted = []
+    actual = []
+    sizes = []
+    for r in triggered:
+        cm = r.get("v2_cost_migrate_s")
+        t0 = r.get("t_proxy_recv")
+        t_first = r.get("t_first_token")
+        # Skip records missing the prediction or either timestamp.
+        if cm is None or t0 is None or t_first is None:
+            continue
+        predicted.append(cm)
+        actual.append(t_first - t0)
+        sizes.append(r.get("input_length", 0))
+    # Nothing plottable (no triggers, or every record incomplete): skip the
+    # figure rather than crash on max() of an empty list below.
+    if not actual:
+        return
+
+    fig, ax = plt.subplots(figsize=(7, 5))
+    ax.scatter(predicted, actual,
+                s=[max(100, sz / 100) for sz in sizes],
+                color="#d62728", edgecolors="black", alpha=0.75)
+    for p, a, sz in zip(predicted, actual, sizes):
+        ax.annotate(f"input={sz}",
+                     (p, a), xytext=(8, 6), textcoords="offset points",
+                     fontsize=9)
+    # y=x reference + 10x line + 20x line
+    lo = 0.5
+    hi = max(50, max(actual) * 1.2)
+    ax.plot([lo, hi], [lo, hi], "k--", alpha=0.5, label="y = x (calibrated)")
+    ax.plot([lo, hi], [lo * 10, hi * 10], color="gray", linestyle=":",
+             alpha=0.4, label="10x")
+    ax.plot([lo, hi], [lo * 20, hi * 20], color="lightgray", linestyle=":",
+             alpha=0.4, label="20x")
+    ax.set_xscale("log")
+    ax.set_yscale("log")
+    ax.set_xlim(lo, hi)
+    ax.set_ylim(lo, hi)
+    ax.set_xlabel("Cost model: predicted migrate cost (s)")
+    ax.set_ylabel("Realized TTFT (s)")
+    ax.set_title(
+        "All 5 PD-sep triggered requests in v2.1 sit far above y=x.\n"
+        "Real transfer cost ~10-20x what the calibrated model predicted."
+    )
+    ax.grid(alpha=0.3, which="both")
+    ax.legend(loc="lower right")
+    fig.tight_layout()
+    fig.savefig(OUT / "fig_v2_predicted_vs_actual.png", dpi=120)
+    plt.close(fig)
+
+
+def fig_three_way_hotspot():
+    pols = ["unified", "unified_kv_both", "unified_v2"]
+    per_worker = {p: _load(f"per_worker_{p}.json") for p in pols}
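+    # Use the union of worker URLs so a worker missing from one policy's file
+    # shows as a zero-height bar instead of raising KeyError.
+    workers = sorted(
+        {w for p in pols for w in per_worker[p]["per_worker_ttft_p90_s"]}
+    )
+
+    x = range(len(workers))
+    width = 0.27
+    fig, ax = plt.subplots(figsize=(11, 5))
+    for i, p in enumerate(pols):
+        d = per_worker[p]["per_worker_ttft_p90_s"]
+        vals = [d.get(w, 0.0) for w in workers]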
+        offset = (i - 1) * width
+        ax.bar([j + offset for j in x], vals, width,
+                label=f"{p} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})",
+                color=POLICY_COLORS[p], edgecolor="black", linewidth=0.4)
+    short = [w.replace("http://127.0.0.1:", ":") for w in workers]
+    ax.set_xticks(list(x))
+    ax.set_xticklabels(short, rotation=0, fontsize=9)
+    ax.set_ylabel("worker TTFT p90 (s)")
+    ax.set_title(
+        "Per-worker TTFT p90 distribution. kv_both alone makes the hot worker hotter\n"
+        "(unified→kv_both: 37.7s→43.5s peak); v2's 5 PD-sep triggers nudge it back."
+    )
+    ax.legend(loc="upper left", fontsize=9)
+    ax.grid(alpha=0.3, axis="y")
+    fig.tight_layout()
+    fig.savefig(OUT / "fig_three_way_hotspot.png", dpi=120)
+    plt.close(fig)
+
+
+def main():
+    fig_kv_both_overhead()
+    fig_v2_trigger_funnel()
+    fig_v2_predicted_vs_actual()
+    fig_three_way_hotspot()
+    # fig_v2_predicted_vs_actual() may skip its output, so do not claim a count.
+    print(f"wrote figures to {OUT}")
+
+
+if __name__ == "__main__":
+    main()