Elastic migration v2 section: PD-sep on agentic workload is net negative
New analysis/characterization/elastic_migration_v2/ packages the
unified_v2 + unified_kv_both experiments into a self-contained
results section that the paper can cite as the "we tried selective
PD-sep migration" case study. The section finds three independent
reasons PD-sep doesn't help on agentic w600:
1. Mooncake kv_both substrate alone (no PD-sep ever firing) imposes
TTFT p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain
unified. Per-step KVConnectorMetadata maintenance and block
reservation semantics dominate even when no transfer is pending.
2. PD-sep gate fires on only 0.16-0.41% of requests across two
gate-tightness configurations. 76-89% are rejected by
new_local < threshold because 93% intra-session reuse on agentic
traces leaves a small uncached tail; 9.5-19% are rejected by
chosen_no_active_decode (snapshot-time gate). Even relaxed
thresholds can't grow trigger rate past 0.5%.
3. When PD-sep fires, the calibrated cost model
(0.3s + bytes / 2.7 GB/s) is off by 17-21x on 4 of the 5 triggered
requests in v2.1: realized TTFT 12-45s vs model-predicted migrate
cost 0.7-2.2s (the fifth landed at 3.7s against a 4.2s prediction).
This is consistent with the E2 audit's finding that D-side block
pre-reservation and missing layerwise pipelining dominate the
decode_sent -> first_token clock.
Three-way comparison (unified vs unified_kv_both vs unified_v2):
v2 vs the kv_both control is roughly net-zero (-10% hotspot,
-14% TPOT p90, +3% TTFT p90, +9% TTFT p99). v2 vs plain unified is
worse on every tail metric (TTFT p90 +49%, E2E p90 +25%,
TTFT p99 +17%, TPOT p90 +8%) because the kv_both substrate tax is
unavoidable when the policy is enabled.
Contents:
- README.md: the four results sections, the three-way comparison
table, an explicit "what this claims for the paper" list, and a
cross-reference index to the earlier characterization documents.
- data/: b3_policy_comparison.json + per-policy breakdown.json
+ per-policy hotspot_index.json for the four policies in scope.
- figures/: 4 PNGs rendered by render_figures.py:
* fig_kv_both_overhead.png — 4-metric bar chart with delta
annotations showing kv_both alone costs +45% TTFT p90.
* fig_v2_trigger_funnel.png — per-reason request count for the
two gate configurations on log scale.
* fig_v2_predicted_vs_actual.png — scatter of model-predicted
migrate cost vs realized TTFT for the 5 triggered requests,
with y=x, 10x, and 20x reference lines.
* fig_three_way_hotspot.png — per-worker TTFT p90 grouped bars
across the three policies.
The section is intentionally self-contained: it lists what the
experiment validates (cost model picks correct candidates;
shadow-drift fix is necessary; same-worker interference is real)
alongside what it disproves (per-request PD-sep on agentic via
Mooncake is not a net win in current implementation).
Refs: E1/E2 subagent audits, B2 microbench, unified_v2 commits
19f69a9 / 4b833d3 / 95c8ef8.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
284
analysis/characterization/elastic_migration_v2/README.md
Normal file
@@ -0,0 +1,284 @@
# Elastic Migration v2: Selective PD-Separation via Mooncake

Date: 2026-05-26
Trace: `traces/w600_r0.0015_st30.jsonl` (1214 reqs, 274 sessions, 53.3 M tokens)
Model: Qwen3-Coder-30B-A3B-Instruct, 8 × TP1 on H20

## TL;DR

This section explores whether the **B2-confirmed same-worker
prefill–decode interference** can be relieved by selectively
migrating prefill to a different worker for the requests where the
interference cost would dominate the transfer cost. We implement two
flavors of the policy (strict gates, then relaxed gates) and a clean
isolation control (`unified_kv_both`: same picker as `unified`, but
the vLLM instances are launched with `kv_role=kv_both` so the
Mooncake substrate is on but PD-sep never triggers).

Three findings:

1. **`kv_role=kv_both` alone imposes a heavy always-on tax**: TTFT
   p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain `unified`,
   with no PD-sep ever firing.
2. **PD-sep almost never triggers on a real agentic workload**:
   0.16% with strict gates, 0.41% with relaxed gates. Agentic
   workloads have 93% intra-session reuse, so most requests land on
   workers that already hold cache, and the uncached tail is too
   small to be worth migrating.
3. **When PD-sep does fire, the cost model is wrong by ~10–20×**:
   the calibrated `0.3s + bytes / 2.7 GB/s` predicts 0.7–2.2 s
   migrate cost on four of the five triggered requests; their
   observed TTFT is 12–45 s (the fifth came in at 3.7 s against a
   4.2 s prediction). The same D-side block-reservation pressure and
   absence of layerwise pipelining that the E2 audit flagged still
   dominate.

The net latency of `unified_v2` is **not better than plain
`unified`**. Improving agentic PD-sep requires fixing the underlying
Mooncake transfer mechanism (E2 patches 6.1 lazy block reservation
and 6.3 layerwise pipelining), not the routing decision.

## Substrate

We compare three policies on identical traces:

| policy | picker | vLLM launch mode | purpose |
|---|---|---|---|
| `unified` | hybrid affinity + LMetric | plain (no Mooncake) | the headline baseline |
| `unified_kv_both` | same as `unified` | `kv_role=kv_both` + bootstrap | isolation control: how much does kv_both *alone* cost? |
| `unified_v2` | unified + selective PD-sep | `kv_role=kv_both` + bootstrap | the actual experiment |

All three use the same trace, the same 8-instance topology, and the
same shadow-drift–corrected proxy (`scripts/cache_aware_proxy.py`
after fix `95c8ef8`). Plain `unified` was rerun on the patched proxy
(`b3_sweep_20260525_095043/unified`) under the same conditions.

## Result 1: kv_both is expensive by itself



Switching the vLLM launch from plain to `kv_role=kv_both` without
ever triggering PD-sep already costs:

| metric | plain `unified` | `unified_kv_both` | Δ |
|---|---:|---:|---|
| TTFT p50 | 0.50 s | 0.50 s | +0% |
| TTFT p90 | 7.35 s | 10.67 s | **+45%** |
| TTFT p99 | 42.34 s | 45.19 s | +7% |
| TPOT p90 | 17.1 ms | 21.3 ms | **+25%** |
| E2E p90 | 18.03 s | 22.89 s | **+27%** |
| APC | 79.4% | 78.3% | −1.1 pp |
| hotspot index | 3.667 | **4.363** | **+19%** |

Two contributing factors:

1. **`MooncakeConnector` runs even when no transfer is pending.**
   Every scheduler step it walks `set(cache.keys())` against
   `_known_hash_keys` (E2 audit §6.5) and updates the
   `KVConnectorMetadata`. This is O(|cache|) per step on every
   engine, even when no producer/consumer relationship is active.
2. **Block reservation semantics differ** under kv_both. The
   scheduler treats blocks as candidates for export to other
   workers, so the prefix-cache LRU pressure is slightly different
   (we lose 1.1 pp APC).

Practical implication: **you don't enable kv_both for free**. If a
deployment wants the option to do PD-sep selectively, the 45% TTFT
p90 tax applies even on requests that stay local. This needs to be a
recoverable cost before any selective-PD-sep policy is worth
shipping.

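The Δ column can be recomputed from the rows of
`data/b3_policy_comparison.json`. The sketch below is illustrative
only: the values are pasted in by hand from that file, and the helper
name `pct_delta` is ours, not part of `render_figures.py`.

```python
# Illustrative only: values copied from data/b3_policy_comparison.json.
UNIFIED = {
    "ttft_p90_s": 7.345769894809922,
    "e2e_p90_s": 18.033410895219994,
    "hotspot_index_ttft_p90": 3.667136528736114,
}
KV_BOTH = {
    "ttft_p90_s": 10.671844050800438,
    "e2e_p90_s": 22.8941433175176,
    "hotspot_index_ttft_p90": 4.363145984888287,
}


def pct_delta(value: float, baseline: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Print the kv_both tax for each metric, rounded as in the table.
for key in UNIFIED:
    print(f"{key}: {pct_delta(KV_BOTH[key], UNIFIED[key]):+.0f}%")
```

The same helper applies to any pair of rows, for example `unified_v2`
against `unified_kv_both`.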
## Result 2: PD-sep rarely fires on a real agentic trace



We log every routing decision's `v2_reason` (why we did or did not
PD-sep). Two runs with different gate thresholds:

| fall-through bucket | v2.0 strict | v2.1 relaxed | what it means |
|---|---:|---:|---|
| `new_local < threshold` | 1077 (88.7%) | 924 (76.1%) | uncached tail too small to justify transfer |
| `chosen_no_active_decode` | 115 (9.5%) | 229 (18.9%) | no decode on chosen to protect |
| `src_cache_below_threshold` | 14 (1.2%) | 36 (3.0%) | no alt instance holds enough cache |
| `src_not_meaningfully_more_cache` | 6 (0.5%) | 16 (1.3%) | alt instance doesn't help vs chosen |
| `cost_benefit not enough margin` | 0 | 4 (0.3%) | model says transfer cost + interference on src ≥ local interference |
| **PD-sep TRIGGERED** | **2 (0.16%)** | **5 (0.41%)** | passed all gates and cost-benefit favored migrate |

The dominant filter is `new_local < threshold`. Even with the
threshold dropped from 16 k to 8 k tokens, three out of four requests
have less than 8 k uncached tokens at the chosen worker. This is
structural: with intra-session reuse measured at 93% on the same
trace (`window_1_results.md`), most turns hit prefix cache on the
session's previous worker.

The second filter, `chosen_no_active_decode`, rejects another 9.5%
(strict) to 18.9% (relaxed). This is a snapshot-time phenomenon: at
the moment the picker runs, the chosen worker often has its previous
request still in prefill, not yet decoding. The gate's intent ("don't
migrate if no decode is being hurt by the prefill we're routing") is
correct, but it ends up suppressing PD-sep in a real situation where
decode is *about to* start.

Of the 9 requests that pass all four gates in the relaxed run, the
cost-benefit step itself rejects 4, nearly half. So the final trigger
rate of 0.41% is a structural property, not a parameter-tuning
problem.

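The percentages in the fall-through table, and the number of requests
still alive after each gate, follow directly from the counts. A small
sketch; the helper names are illustrative, and it assumes the gates
are evaluated in the order the table lists them:

```python
# Counts copied from the fall-through table (1214 requests per run).
STRICT = {
    "new_local < threshold": 1077,
    "chosen_no_active_decode": 115,
    "src_cache_below_threshold": 14,
    "src_not_meaningfully_more_cache": 6,
    "cost_benefit not enough margin": 0,
    "PD-sep TRIGGERED": 2,
}
RELAXED = {
    "new_local < threshold": 924,
    "chosen_no_active_decode": 229,
    "src_cache_below_threshold": 36,
    "src_not_meaningfully_more_cache": 16,
    "cost_benefit not enough margin": 4,
    "PD-sep TRIGGERED": 5,
}


def funnel_shares(counts: dict[str, int]) -> dict[str, float]:
    """Share of all requests in each bucket, in percent."""
    total = sum(counts.values())
    return {reason: 100.0 * n / total for reason, n in counts.items()}


def survivors(counts: dict[str, int]) -> list[tuple[str, int]]:
    """Requests remaining after each rejection bucket, in table order."""
    remaining = sum(counts.values())
    out = []
    for reason, n in counts.items():
        if reason == "PD-sep TRIGGERED":
            break  # the last bucket is the outcome, not a filter
        remaining -= n
        out.append((reason, remaining))
    return out


for reason, left in survivors(RELAXED):
    print(f"after {reason}: {left} requests left")
```

In the relaxed run only 9 requests reach the cost-benefit step, which
is why loosening the first two thresholds further has so little room
to raise the trigger rate.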
## Result 3: when PD-sep fires, the cost model is wrong by 10–20×



The 5 PD-sep-triggered requests in v2.1 relaxed:

| input | new_local | new_src | src→dst | cost_local | cost_migrate (model) | actual TTFT | actual E2E |
|---:|---:|---:|---|---:|---:|---:|---:|
| 21963 | 21963 | 9163 | 6→5 | 4.39 s | 4.17 s | 3.69 s | 8.48 s |
| 8706 | 8706 | 2050 | 5→7 | 1.09 s | 0.73 s | 12.48 s | 14.31 s |
| 13616 | 13616 | 2352 | 4→0 | 1.70 s | 1.03 s | 18.33 s | 19.50 s |
| 49483 | 49483 | 843 | 3→4 | 11.75 s | 2.16 s | **45.13 s** | **53.55 s** |
| 19806 | 19806 | 350 | 3→6 | 3.96 s | 1.06 s | 20.06 s | 31.98 s |

On four of the five requests the cost model predicts the migrate path
will take 0.7–2.2 s, while the actual TTFT is 12–45 s, a 17–21× miss.
The exception is the first row (input 21963, the only one with a
large `new_src`), which came in at 3.69 s against a 4.17 s
prediction. The model's `0.3 s + bytes / 2.7 GB/s` calibration
captures pure RDMA bandwidth in isolation but misses everything else
that happens on the `decode_sent → first_token` clock: D-side
scheduler step latency, block reservation before KV arrives (so D's
cache pressure increases for the entire wait), the per-layer scatter
of `batch_transfer_sync_write`, and the next-step scheduler promotion
after `finished_recving`. The E2 audit measured this end-to-end at
p50 = 1.1 s and **p90 = 6.7 s** on production runs; the v2.1
triggered requests landed in the p99 tail of that distribution
because their dst was already loaded.

The first-token clock for the 49 k request is **21× the model's
prediction**. This is not a small mis-tuning; the real mechanism
calls for a structurally different model.

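The gap can be checked row by row. The sketch below restates the
calibrated transfer term and computes the realized/predicted ratio
for each triggered request. Two assumptions are ours: `2.7 GB/s` is
read as 2.7e9 bytes/s, and `new_src` is read as the number of tokens
src would still have to prefill (so `1 - new_src / input` is the
fraction src already caches). The class and function names are
illustrative.

```python
from dataclasses import dataclass

FIXED_COST_S = 0.3       # constant term of the calibrated model
BANDWIDTH_BPS = 2.7e9    # 2.7 GB/s, assumed to mean 2.7e9 bytes/s


def predicted_transfer_s(kv_bytes: float) -> float:
    """Calibrated transfer term: 0.3 s + bytes / 2.7 GB/s."""
    return FIXED_COST_S + kv_bytes / BANDWIDTH_BPS


@dataclass(frozen=True)
class Triggered:
    input_tokens: int
    new_src: int
    cost_migrate_s: float   # model prediction, from the table
    actual_ttft_s: float    # realized TTFT, from the table


ROWS = [
    Triggered(21963, 9163, 4.17, 3.69),
    Triggered(8706, 2050, 0.73, 12.48),
    Triggered(13616, 2352, 1.03, 18.33),
    Triggered(49483, 843, 2.16, 45.13),
    Triggered(19806, 350, 1.06, 20.06),
]


def miss_ratio(row: Triggered) -> float:
    """How many times slower the realized TTFT is than the prediction."""
    return row.actual_ttft_s / row.cost_migrate_s


def src_cached_fraction(row: Triggered) -> float:
    """Fraction of the prompt assumed to be cached on src."""
    return 1.0 - row.new_src / row.input_tokens


for row in ROWS:
    print(f"input={row.input_tokens:>6}  "
          f"miss={miss_ratio(row):5.1f}x  "
          f"src_cached={src_cached_fraction(row):.0%}")
```

Only the first row lands near its prediction; the other four miss by
a similar factor regardless of input size, which points at a fixed
mechanism cost rather than a bandwidth term.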
## Result 4: three-way comparison



The full table:

| metric | unified (plain) | unified_kv_both | unified_v2 (relaxed) |
|---|---:|---:|---:|
| n_ok | 1213 | 1214 | 1214 |
| TTFT p50 | 0.50 s | 0.50 s | 0.49 s |
| TTFT p90 | 7.35 s | 10.67 s | 10.98 s |
| TTFT p99 | 42.34 s | 45.19 s | 49.45 s |
| TPOT p90 | 17.1 ms | 21.3 ms | 18.4 ms |
| E2E p90 | 18.03 s | 22.89 s | 22.53 s |
| APC | 79.4% | 78.3% | 77.6% |
| interference index | n/a (no engine_state) | 8.57 | 8.46 |
| hotspot index | 3.667 | 4.363 | 3.910 |
| n_slow | 189 | 198 | 198 |

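This section does not define the hotspot index. The reported values
are consistent with max / median of the per-worker TTFT p90s in
`data/per_worker_<policy>.json`; that definition is inferred from the
data and is not taken from the analysis code. A sketch under that
assumption, with the per-worker values pasted in by hand:

```python
from statistics import median

# per_worker_ttft_p90_s, workers on ports 8000..8007.
UNIFIED_TTFT_P90 = [
    11.264844838529825, 3.6063860427122614, 16.175747957825664,
    9.314684258581842, 37.73397144810297, 18.328030522551852,
    3.6328767628350773, 7.772977900883419,
]
KV_BOTH_TTFT_P90 = [
    2.725343641615472, 30.449911632167645, 16.297463109577073,
    6.766894554614579, 11.146178993489595, 4.552643961587455,
    6.90922680192164, 7.048551249800954,
]


def hotspot_index(per_worker_p90: list[float]) -> float:
    """Worst worker's p90 relative to the median worker (assumed definition)."""
    return max(per_worker_p90) / median(per_worker_p90)


print(f"unified:         {hotspot_index(UNIFIED_TTFT_P90):.3f}")
print(f"unified_kv_both: {hotspot_index(KV_BOTH_TTFT_P90):.3f}")
```

Because the index depends on a single worst worker, it moves a lot
between runs, which is worth keeping in mind when reading small
hotspot deltas.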
### v2 vs the kv_both control (the right comparison)

Compared to the kv_both control (same substrate, no PD-sep), the
5 PD-sep triggers in v2:

- **slightly improve TPOT p90 (−14%) and hotspot (−10%)**
- **slightly worsen TTFT p90 (+3%) and TTFT p99 (+9%)**, because the
  triggered requests themselves take ~20× the predicted transfer
  time

The net effect against the kv_both control is in the noise. The
hotspot improvement is within the run-to-run stochastic range we saw
earlier (the v2 strict run scored 2.733 hotspot under the same
substrate; v2 relaxed scored 3.910).

### v2 vs plain unified (the headline question)

`unified_v2` is **25% slower on E2E p90** and **49% slower on TTFT
p90** than plain `unified`. About 45 of those 49 percentage points of
TTFT p90 inflation come from the kv_both substrate, not the routing
decision; nothing PD-sep does can recover this in our current
Mooncake implementation.

## Why v2's PD-sep is fundamentally choked

There are three independent structural problems, each by itself
enough to make v2 not win:

1. **The kv_both substrate is the wrong default**. It costs 45% at
   TTFT p90 whether or not any request migrates. To make selective
   PD-sep beat plain `unified`, the interference saved per triggered
   request times the trigger rate must exceed that substrate tax.
   With a 0.41% trigger rate, even saving 100% of TTFT on every
   triggered request would recover only ~0.4% of aggregate TTFT,
   which cannot offset a 45% tax.

2. **Agentic intra-session reuse leaves no headroom for migration**.
   Most turns hit cache on the worker that handled the previous
   turn. Migrating prefill to a *different* worker is the *exact*
   thing intra-session affinity tries to avoid: it forces the new
   worker to pay for the cached prefix transfer instead of just
   reusing what's already on the affinity worker. This is a
   structural mismatch between PD-sep semantics ("send big prefills
   to a less-busy worker") and agentic workloads ("keep sessions
   sticky to wherever the cache is").

3. **The Mooncake mechanism is 10–20× slower than the cost model
   predicts**, primarily due to D-side pre-allocation of KV blocks
   and the absence of layerwise pipelining (E2 audit §6.1 / §6.3).
   The cost model can be re-calibrated, but doing so would push the
   gate even tighter, dropping the already-tiny trigger rate to
   nearly zero.

The three are stacked: even if any two were fixed, the remaining
one would still make PD-sep a net loss on this trace.

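The bound in point 1 is back-of-envelope arithmetic. It treats the
45% p90 tax as if it applied to aggregate TTFT, which is a
simplification we make for illustration; the function names are
hypothetical.

```python
def max_recoverable_fraction(trigger_rate: float,
                             saving_per_trigger: float = 1.0) -> float:
    """Upper bound on the share of aggregate TTFT that PD-sep can win back.

    trigger_rate       -- fraction of requests that migrate
    saving_per_trigger -- fraction of TTFT saved on each migrated request
    """
    return trigger_rate * saving_per_trigger


def break_even_trigger_rate(substrate_tax: float,
                            saving_per_trigger: float = 1.0) -> float:
    """Trigger rate needed for the savings to equal the substrate tax."""
    return substrate_tax / saving_per_trigger


relaxed_rate = 5 / 1214   # 5 triggers out of 1214 requests
tax = 0.45                # kv_both TTFT p90 tax, used as a proxy

print(f"best case recovered: {max_recoverable_fraction(relaxed_rate):.2%}")
print(f"break-even trigger rate: {break_even_trigger_rate(tax):.0%}")
```

Even under the most generous assumption (every triggered request
saves all of its TTFT), the trigger rate would have to rise by about
two orders of magnitude before the substrate tax is paid back.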
## What this section claims for the paper

1. **Same-worker prefill–decode interference is a real mechanism**
   (B2 microbench), but **agentic workloads rarely expose it**: the
   typical request has high cache hit and small uncached tail, so
   the interference cost is bounded.
2. **Routing-only solutions (unified) already reach 79.4% APC
   against the 93% intra-session reuse ceiling and recover the
   latency** by avoiding the heavy-tail sessions through the
   affinity gate. The remaining ~14 pp gap to the ceiling is from
   APC LRU eviction under capacity pressure, not from prefill–decode
   interference.
3. **Per-request PD-sep via Mooncake on agentic workloads is not a
   net win** in our measurements, even with a carefully-gated cost
   model. The combined effect of kv_both substrate overhead, low
   trigger rate, and mechanism-vs-model gap is uniformly negative.
4. **A productive direction is mechanism-level**: fix the Mooncake
   D-side block reservation (E2 §6.1), implement layerwise transfer
   pipelining (E2 §6.3), and re-measure. Only if these patches drop
   the substrate tax to <10% and the realized transfer to ≤2 s p90
   does PD-sep become competitive with routing on agentic traces.

## What v2 still validates

- **The cost model's *qualitative* shape is correct**: when it says
  "migrate", that's a request with a large uncached prefill on the
  chosen worker (8.7 k–49 k tokens, modeled local cost 1.1–11.8 s)
  and a src that already holds 58–98% of the prompt (reading
  `new_src` as the tokens src would still have to prefill). The
  model picks the right candidate requests.
- **The gate logic catches the right exclusions**: 76–89% by
  uncached tail size, 9–19% by no-decode-to-protect, the rest by
  missing source cache. Each is a structurally correct reason.
- **The proxy shadow-drift fix is necessary infrastructure** for
  any long-running routing experiment. We observed 3 phantom
  corrections per ~50-minute run.

## Files

- `data/b3_policy_comparison.json`: headline metrics for every
  policy in the same B3 sweep root (eight rows, including the four
  policies in scope here).
- `data/breakdown_<policy>.json`: per-request proxy breakdown
  including v2 gate fields and triggered-event metadata.
- `data/per_worker_<policy>.json`: per-worker TTFT/latency p90s
  used in the hotspot figure.
- `figures/*.png`: the four section figures referenced above.
- `render_figures.py`: regenerates the figures from `data/`.

## Cross-references

- `analysis/characterization/window_1_results.md`: B2 microbench
  (same-worker interference causal proof) and B3 baseline 5-policy
  sweep
- `analysis/characterization/agentic_dispatch_coupling.md`: why
  the saturated-replay setup matches agentic production
- `analysis/characterization/b3_policies_pseudocode.md`: pickers
  for the five baseline policies; `unified_v2` extends `unified`
- E1 / E2 subagent reports (commit `4b833d3` message and the
  conversation log): full mechanism audit that informed v2's design
@@ -0,0 +1,211 @@
|
||||
{
|
||||
"rows": [
|
||||
{
|
||||
"policy": "capped",
|
||||
"n_ok": 770,
|
||||
"n_total": 770,
|
||||
"ttft_p50_s": 1.1989156164927408,
|
||||
"ttft_p90_s": 12.827629912580612,
|
||||
"ttft_p99_s": 46.61752380923125,
|
||||
"tpot_p50_s": 0.007231239004497606,
|
||||
"tpot_p90_s": 0.015998617687440243,
|
||||
"tpot_p99_s": 0.11515370831539476,
|
||||
"e2e_p50_s": 2.598489043477457,
|
||||
"e2e_p90_s": 21.245602010778384,
|
||||
"e2e_p99_s": 74.60736650204846,
|
||||
"apc_ratio": 0.3158312503528108,
|
||||
"interference_index": 6.331064378362814,
|
||||
"hotspot_index_ttft_p90": 2.0204268015410918,
|
||||
"reuse_intra_frac": 0.9192657105586233,
|
||||
"reuse_cross_frac": 0.0602232594931501,
|
||||
"n_slow": 185,
|
||||
"failure_counts": {
|
||||
"cache_miss_large_append": 60,
|
||||
"hot_worker_queue": 66,
|
||||
"same_worker_prefill_overlap": 45,
|
||||
"unknown": 14
|
||||
}
|
||||
},
|
||||
{
|
||||
"policy": "lmetric",
|
||||
"n_ok": 1214,
|
||||
"n_total": 1214,
|
||||
"ttft_p50_s": 0.9387824369769078,
|
||||
"ttft_p90_s": 15.671339168207492,
|
||||
"ttft_p99_s": 53.56683189840049,
|
||||
"tpot_p50_s": 0.008854518407308914,
|
||||
"tpot_p90_s": 0.02122720699121469,
|
||||
"tpot_p99_s": 0.18280341184277568,
|
||||
"e2e_p50_s": 2.754255389008904,
|
||||
"e2e_p90_s": 24.8209177934099,
|
||||
"e2e_p99_s": 80.59924928059091,
|
||||
"apc_ratio": 0.5694312382571595,
|
||||
"interference_index": 6.530231061794441,
|
||||
"hotspot_index_ttft_p90": 2.252837147833725,
|
||||
"reuse_intra_frac": 0.9321238805590836,
|
||||
"reuse_cross_frac": 0.05679481258506571,
|
||||
"n_slow": 295,
|
||||
"failure_counts": {
|
||||
"cache_miss_large_append": 94,
|
||||
"hot_worker_queue": 68,
|
||||
"same_worker_prefill_overlap": 69,
|
||||
"unknown": 64
|
||||
}
|
||||
},
|
||||
{
|
||||
"policy": "load_only",
|
||||
"n_ok": 1214,
|
||||
"n_total": 1214,
|
||||
"ttft_p50_s": 1.2609447415161412,
|
||||
"ttft_p90_s": 20.197147866390882,
|
||||
"ttft_p99_s": 52.84285237012196,
|
||||
"tpot_p50_s": 0.009231464695980247,
|
||||
"tpot_p90_s": 0.026851662550158716,
|
||||
"tpot_p99_s": 0.3211630676943426,
|
||||
"e2e_p50_s": 3.58568156149704,
|
||||
"e2e_p90_s": 33.459180271782685,
|
||||
"e2e_p99_s": 93.95083751494239,
|
||||
"apc_ratio": 0.5412093853102866,
|
||||
"interference_index": 9.16424627504275,
|
||||
"hotspot_index_ttft_p90": 1.2940319990630569,
|
||||
"reuse_intra_frac": 0.9353191550754928,
|
||||
"reuse_cross_frac": 0.053372184678592026,
|
||||
"n_slow": 379,
|
||||
"failure_counts": {
|
||||
"cache_miss_large_append": 151,
|
||||
"hot_worker_queue": 33,
|
||||
"same_worker_prefill_overlap": 108,
|
||||
"unknown": 87
|
||||
}
|
||||
},
|
||||
{
|
||||
"policy": "sticky",
|
||||
"n_ok": 1214,
|
||||
"n_total": 1214,
|
||||
"ttft_p50_s": 0.5415176274836995,
|
||||
"ttft_p90_s": 18.021296651283045,
|
||||
"ttft_p99_s": 74.09429564891524,
|
||||
"tpot_p50_s": 0.008952101894096181,
|
||||
"tpot_p90_s": 0.03641285916619554,
|
||||
"tpot_p99_s": 0.35152006935195085,
|
||||
"e2e_p50_s": 2.081947358994512,
|
||||
"e2e_p90_s": 34.62592205510591,
|
||||
"e2e_p99_s": 139.68334607904353,
|
||||
"apc_ratio": 0.7720092868396378,
|
||||
"interference_index": 13.651718321568111,
|
||||
"hotspot_index_ttft_p90": 2.727756623171119,
|
||||
"reuse_intra_frac": 0.9327723488279339,
|
||||
"reuse_cross_frac": 0.05495149683864246,
|
||||
"n_slow": 234,
|
||||
"failure_counts": {
|
||||
"cache_miss_large_append": 20,
|
||||
"hot_worker_queue": 51,
|
||||
"same_worker_prefill_overlap": 134,
|
||||
"unknown": 29
|
||||
}
|
||||
},
|
||||
{
|
||||
"policy": "unified",
|
||||
"n_ok": 1213,
|
||||
"n_total": 1214,
|
||||
"ttft_p50_s": 0.4997710260213353,
|
||||
"ttft_p90_s": 7.345769894809922,
|
||||
"ttft_p99_s": 42.34170345296613,
|
||||
"tpot_p50_s": 0.008079791456705824,
|
||||
"tpot_p90_s": 0.017110194704198407,
|
||||
"tpot_p99_s": 0.12655874612209597,
|
||||
"e2e_p50_s": 1.7495028690318577,
|
||||
"e2e_p90_s": 18.033410895219994,
|
||||
"e2e_p99_s": 68.80023987947489,
|
||||
"apc_ratio": 0.794261466256467,
|
||||
"interference_index": null,
|
||||
"hotspot_index_ttft_p90": 3.667136528736114,
|
||||
"reuse_intra_frac": 0.9311187350942534,
|
||||
"reuse_cross_frac": 0.056702150437367635,
|
||||
"n_slow": 189,
|
||||
"failure_counts": {
|
||||
"cache_miss_large_append": 18,
|
||||
"hot_worker_queue": 116,
|
||||
"unknown": 55
|
||||
}
|
||||
},
|
||||
{
|
||||
"policy": "unified_kv_both",
|
||||
"n_ok": 1214,
|
||||
"n_total": 1214,
|
||||
"ttft_p50_s": 0.4958424885116983,
|
||||
"ttft_p90_s": 10.671844050800438,
|
||||
"ttft_p99_s": 45.19353310586651,
|
||||
"tpot_p50_s": 0.008573156389059812,
|
||||
"tpot_p90_s": 0.021303916384344358,
|
||||
"tpot_p99_s": 0.21501837408937963,
|
||||
"e2e_p50_s": 1.9310281965008471,
|
||||
"e2e_p90_s": 22.8941433175176,
|
||||
"e2e_p99_s": 76.06128971517893,
|
||||
"apc_ratio": 0.7828397082703908,
|
||||
"interference_index": 8.571603637346875,
|
||||
"hotspot_index_ttft_p90": 4.363145984888287,
|
||||
"reuse_intra_frac": 0.9313000825240145,
|
||||
"reuse_cross_frac": 0.056182260858791105,
|
||||
"n_slow": 198,
|
||||
"failure_counts": {
|
||||
"cache_miss_large_append": 28,
|
||||
"hot_worker_queue": 34,
|
||||
"same_worker_prefill_overlap": 87,
|
||||
"unknown": 49
|
||||
}
|
||||
},
|
||||
{
|
||||
"policy": "unified_v2",
|
||||
"n_ok": 1214,
|
||||
"n_total": 1214,
|
||||
"ttft_p50_s": 0.4851180294645019,
|
||||
"ttft_p90_s": 10.97665627548705,
|
||||
"ttft_p99_s": 49.44861259821856,
|
||||
"tpot_p50_s": 0.008261419251554481,
|
||||
"tpot_p90_s": 0.018414033703249108,
|
||||
"tpot_p99_s": 0.20999689490980364,
|
||||
"e2e_p50_s": 1.8092182099935599,
|
||||
"e2e_p90_s": 22.528888442111203,
|
||||
"e2e_p99_s": 82.40234094743934,
|
||||
"apc_ratio": 0.7758437361549086,
|
||||
"interference_index": 8.45656745230457,
|
||||
"hotspot_index_ttft_p90": 3.9096187869766164,
|
||||
"reuse_intra_frac": 0.9324663389938368,
|
||||
"reuse_cross_frac": 0.055154184817413764,
|
||||
"n_slow": 198,
|
||||
"failure_counts": {
|
||||
"cache_miss_large_append": 36,
|
||||
"hot_worker_queue": 26,
|
||||
"same_worker_prefill_overlap": 82,
|
||||
"unknown": 54
|
||||
}
|
||||
},
|
||||
{
|
||||
"policy": "unified_v2_strict",
|
||||
"n_ok": 1214,
|
||||
"n_total": 1214,
|
||||
"ttft_p50_s": 0.4849805940175429,
|
||||
"ttft_p90_s": 8.960840504511737,
|
||||
"ttft_p99_s": 44.63598358390898,
|
||||
"tpot_p50_s": 0.008222105788569446,
|
||||
"tpot_p90_s": 0.018078321745916927,
|
||||
"tpot_p99_s": 0.14616439095890604,
|
||||
"e2e_p50_s": 1.8335122870048508,
|
||||
"e2e_p90_s": 22.435233922180526,
|
||||
"e2e_p99_s": 68.254801789901,
|
||||
"apc_ratio": 0.789281361129855,
|
||||
"interference_index": 6.231677388887276,
|
||||
"hotspot_index_ttft_p90": 2.7334230011629197,
|
||||
"reuse_intra_frac": 0.9309082618411778,
|
||||
"reuse_cross_frac": 0.05689887985860397,
|
||||
"n_slow": 186,
|
||||
"failure_counts": {
|
||||
"cache_miss_large_append": 26,
|
||||
"hot_worker_queue": 44,
|
||||
"same_worker_prefill_overlap": 73,
|
||||
"unknown": 43
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@@ -0,0 +1,24 @@
|
||||
{
|
||||
"hotspot_index_ttft_p90": 3.667136528736114,
|
||||
"per_worker_latency_p90_s": {
|
||||
"http://127.0.0.1:8000": 41.42001512600109,
|
||||
"http://127.0.0.1:8001": 12.4878579101933,
|
||||
"http://127.0.0.1:8002": 22.462878945574648,
|
||||
"http://127.0.0.1:8003": 15.501050900109117,
|
||||
"http://127.0.0.1:8004": 39.956250199786155,
|
||||
"http://127.0.0.1:8005": 36.69850301651168,
|
||||
"http://127.0.0.1:8006": 10.116177947795954,
|
||||
"http://127.0.0.1:8007": 20.35038618039107
|
||||
},
|
||||
"per_worker_ttft_p90_s": {
|
||||
"http://127.0.0.1:8000": 11.264844838529825,
|
||||
"http://127.0.0.1:8001": 3.6063860427122614,
|
||||
"http://127.0.0.1:8002": 16.175747957825664,
|
||||
"http://127.0.0.1:8003": 9.314684258581842,
|
||||
"http://127.0.0.1:8004": 37.73397144810297,
|
||||
"http://127.0.0.1:8005": 18.328030522551852,
|
||||
"http://127.0.0.1:8006": 3.6328767628350773,
|
||||
"http://127.0.0.1:8007": 7.772977900883419
|
||||
},
|
||||
"status": "supported"
|
||||
}
|
||||
@@ -0,0 +1,24 @@
|
||||
{
|
||||
"hotspot_index_ttft_p90": 4.363145984888287,
|
||||
"per_worker_latency_p90_s": {
|
||||
"http://127.0.0.1:8000": 7.273825440008658,
|
||||
"http://127.0.0.1:8001": 40.48809068736155,
|
||||
"http://127.0.0.1:8002": 24.491076068370596,
|
||||
"http://127.0.0.1:8003": 18.828550089401002,
|
||||
"http://127.0.0.1:8004": 20.06954986089262,
|
||||
"http://127.0.0.1:8005": 9.634067087399307,
|
||||
"http://127.0.0.1:8006": 35.7432237003348,
|
||||
"http://127.0.0.1:8007": 24.362499430915342
|
||||
},
|
||||
"per_worker_ttft_p90_s": {
|
||||
"http://127.0.0.1:8000": 2.725343641615472,
|
||||
"http://127.0.0.1:8001": 30.449911632167645,
|
||||
"http://127.0.0.1:8002": 16.297463109577073,
|
||||
"http://127.0.0.1:8003": 6.766894554614579,
|
||||
"http://127.0.0.1:8004": 11.146178993489595,
|
||||
"http://127.0.0.1:8005": 4.552643961587455,
|
||||
"http://127.0.0.1:8006": 6.90922680192164,
|
||||
"http://127.0.0.1:8007": 7.048551249800954
|
||||
},
|
||||
"status": "supported"
|
||||
}
|
||||
@@ -0,0 +1,24 @@
|
||||
{
|
||||
"hotspot_index_ttft_p90": 3.9096187869766164,
|
||||
"per_worker_latency_p90_s": {
|
||||
"http://127.0.0.1:8000": 27.12522437740119,
|
||||
"http://127.0.0.1:8001": 15.299228341400166,
|
||||
"http://127.0.0.1:8002": 49.346961313998335,
|
||||
"http://127.0.0.1:8003": 22.404519376007386,
|
||||
"http://127.0.0.1:8004": 22.470557069155618,
|
||||
"http://127.0.0.1:8005": 17.487964828591807,
|
||||
"http://127.0.0.1:8006": 21.76291022058577,
|
||||
"http://127.0.0.1:8007": 18.311422476416926
|
||||
},
|
||||
"per_worker_ttft_p90_s": {
|
||||
"http://127.0.0.1:8000": 9.26557928660186,
|
||||
"http://127.0.0.1:8001": 5.734943528624719,
|
||||
"http://127.0.0.1:8002": 38.812515752378395,
|
||||
"http://127.0.0.1:8003": 10.589305737824198,
|
||||
"http://127.0.0.1:8004": 10.83847834250191,
|
||||
"http://127.0.0.1:8005": 5.034968857781501,
|
||||
"http://127.0.0.1:8006": 3.5207203380181493,
|
||||
"http://127.0.0.1:8007": 12.236044214287555
|
||||
},
|
||||
"status": "supported"
|
||||
}
|
||||
@@ -0,0 +1,24 @@
|
||||
{
|
||||
"hotspot_index_ttft_p90": 2.7334230011629197,
|
||||
"per_worker_latency_p90_s": {
|
||||
"http://127.0.0.1:8000": 11.098119341616997,
|
||||
"http://127.0.0.1:8001": 23.1559918191866,
|
||||
"http://127.0.0.1:8002": 22.57899510498975,
|
||||
"http://127.0.0.1:8003": 9.956129518186204,
|
||||
"http://127.0.0.1:8004": 28.072633931197924,
|
||||
"http://127.0.0.1:8005": 47.2373243979877,
|
||||
"http://127.0.0.1:8006": 23.23235769500608,
|
||||
"http://127.0.0.1:8007": 27.031178803613876
|
||||
},
|
||||
"per_worker_ttft_p90_s": {
|
||||
"http://127.0.0.1:8000": 3.1871710045961663,
|
||||
"http://127.0.0.1:8001": 8.824780725361773,
|
||||
"http://127.0.0.1:8002": 16.364250262192222,
|
||||
"http://127.0.0.1:8003": 4.1765614019881445,
|
||||
"http://127.0.0.1:8004": 14.026077619416176,
|
||||
"http://127.0.0.1:8005": 24.662665293016516,
|
||||
"http://127.0.0.1:8006": 9.220479947811697,
|
||||
"http://127.0.0.1:8007": 8.441550621995741
|
||||
},
|
||||
"status": "supported"
|
||||
}
|
||||
Binary file not shown.
|
After Width: | Height: | Size: 70 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 54 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 73 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 97 KiB |
244
analysis/characterization/elastic_migration_v2/render_figures.py
Normal file
@@ -0,0 +1,244 @@
"""Render PNG figures for the elastic_migration_v2 section.

Inputs in ./data/ :
  - b3_policy_comparison.json
  - breakdown_unified.json, breakdown_unified_kv_both.json,
    breakdown_unified_v2.json, breakdown_unified_v2_strict.json
  - per_worker_<policy>.json for each of the four

Outputs in ./figures/ :
  - fig_kv_both_overhead.png: three-way latency bars (plain vs kv_both vs v2)
  - fig_v2_trigger_funnel.png: request count per fall-through reason
  - fig_v2_predicted_vs_actual.png: cost-model migrate prediction vs realized TTFT
  - fig_three_way_hotspot.png: per-worker TTFT p90 grouped bars
"""

from __future__ import annotations

import json
from collections import Counter
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend; must be selected before pyplot is imported
import matplotlib.pyplot as plt

ROOT = Path(__file__).parent
DATA = ROOT / "data"
OUT = ROOT / "figures"
OUT.mkdir(parents=True, exist_ok=True)


def _load(name: str):
    """Parse a JSON file from ./data/."""
    return json.loads((DATA / name).read_text())


POLICY_COLORS = {
    "unified": "#2ca02c",
    "unified_kv_both": "#9467bd",
    "unified_v2": "#d62728",
    "unified_v2_strict": "#ff7f0e",
}


def fig_kv_both_overhead():
    """Bar charts of four headline metrics for the three policies."""
    comp = _load("b3_policy_comparison.json")
    by = {r["policy"]: r for r in comp["rows"]}
    pols = ["unified", "unified_kv_both", "unified_v2"]
    metrics = [
        ("TTFT p90 (s)", lambda r: r["ttft_p90_s"]),
        ("TPOT p90 (ms)", lambda r: r["tpot_p90_s"] * 1000),
        ("E2E p90 (s)", lambda r: r["e2e_p90_s"]),
        ("hotspot index", lambda r: r["hotspot_index_ttft_p90"]),
    ]
    fig, axes = plt.subplots(1, 4, figsize=(14, 4))
    for ax, (label, fn) in zip(axes, metrics):
        vals = [fn(by[p]) for p in pols]
        bars = ax.bar(pols, vals,
                      color=[POLICY_COLORS[p] for p in pols],
                      edgecolor="black", linewidth=0.5)
        ax.set_title(label)
        ax.tick_params(axis="x", rotation=20, labelsize=9)
        # value label on top of each bar
        for b, v in zip(bars, vals):
            ax.text(b.get_x() + b.get_width() / 2, v,
                    f"{v:.2f}" if v < 100 else f"{v:.0f}",
                    ha="center", va="bottom", fontsize=9)
        ax.grid(alpha=0.3, axis="y")
        # delta annotation: percent change vs plain unified (first bar)
        baseline = vals[0]
        for i, v in enumerate(vals):
            if i == 0:
                continue
            pct = (v - baseline) / baseline * 100
            ax.text(i, v * 0.5, f"{pct:+.0f}%", ha="center",
                    fontsize=10, fontweight="bold",
                    color="darkred" if pct > 0 else "darkgreen")
    fig.suptitle(
        "kv_both adds ~45% to TTFT p90 even without PD-sep firing.\n"
        "v2's PD-sep barely recovers the gap (and overshoots TTFT p99)."
    )
    fig.tight_layout()
    fig.savefig(OUT / "fig_kv_both_overhead.png", dpi=120)
    plt.close(fig)


def _bucket_reasons(data):
    """Collapse v2_reason strings into the funnel buckets."""
    buckets = Counter()
    for r in data:
        if r.get("v2_pd_sep") is True:
            buckets["PD-sep TRIGGERED"] += 1
            continue
        reason = (r.get("v2_reason") or "no_v2_reason").split(" (")[0]
        if reason.startswith("local_cost"):
            reason = "cost_benefit not enough margin"
        buckets[reason] += 1
    return buckets


def fig_v2_trigger_funnel():
    strict = _load("breakdown_unified_v2_strict.json")
    relaxed = _load("breakdown_unified_v2.json")
    bs = _bucket_reasons(strict)
    br = _bucket_reasons(relaxed)
    order = [
        "new_local_below_threshold",
        "chosen_no_active_decode",
        "chosen_few_decodes",
        "src_cache_below_threshold",
        "src_not_meaningfully_more_cache",
        "cost_benefit not enough margin",
        "PD-sep TRIGGERED",
    ]
    labels = [k for k in order if k in bs or k in br]
    strict_vals = [bs.get(k, 0) for k in labels]
    relaxed_vals = [br.get(k, 0) for k in labels]

    x = range(len(labels))
    width = 0.4
    fig, ax = plt.subplots(figsize=(11, 5))
    ax.bar([i - width / 2 for i in x], strict_vals, width,
           label=f"v2.0 strict (PD-sep={bs['PD-sep TRIGGERED']}/{sum(bs.values())} "
                 f"= {bs['PD-sep TRIGGERED']*100/sum(bs.values()):.2f}%)",
           color="#ff7f0e", edgecolor="black", linewidth=0.5)
    ax.bar([i + width / 2 for i in x], relaxed_vals, width,
           label=f"v2.1 relaxed (PD-sep={br['PD-sep TRIGGERED']}/{sum(br.values())} "
                 f"= {br['PD-sep TRIGGERED']*100/sum(br.values()):.2f}%)",
           color="#d62728", edgecolor="black", linewidth=0.5)
    ax.set_xticks(list(x))
    ax.set_xticklabels(labels, rotation=20, ha="right", fontsize=9)
    ax.set_ylabel("request count")
    ax.set_yscale("log")
    ax.set_title(
        "Why v2 rarely PD-seps: 88-76% of requests have new_local < threshold\n"
        "(intra-session cache already hot). Relaxing thresholds barely helps."
    )
    ax.legend()
    ax.grid(alpha=0.3, axis="y", which="both")
    for i, (s, r) in enumerate(zip(strict_vals, relaxed_vals)):
        if s > 0:
            ax.text(i - width / 2, s * 1.05, str(s), ha="center", fontsize=8)
        if r > 0:
            ax.text(i + width / 2, r * 1.05, str(r), ha="center", fontsize=8)
    fig.tight_layout()
    fig.savefig(OUT / "fig_v2_trigger_funnel.png", dpi=120)
    plt.close(fig)


def fig_v2_predicted_vs_actual():
    """For each PD-sep'd request, plot model-predicted migrate cost
    vs realized TTFT. Should sit near y=x if model is calibrated; sits
    far above if mechanism is more expensive than modeled."""
    relaxed = _load("breakdown_unified_v2.json")
    triggered = [r for r in relaxed if r.get("v2_pd_sep") is True]
    if not triggered:
        return
    predicted = []
    actual = []
    sizes = []
    rids = []
    for r in triggered:
        cm = r.get("v2_cost_migrate_s")
        t0 = r.get("t_proxy_recv")
        t_first = r.get("t_first_token")
        if cm is None or t0 is None or t_first is None:
            continue
        ttft = t_first - t0
        predicted.append(cm)
        actual.append(ttft)
        sizes.append(r.get("input_length", 0))
        rids.append(r.get("request_id", "?"))
    # Every triggered record may lack a timestamp or cost field; without
    # this guard max(actual) below raises ValueError on an empty list.
    if not actual:
        return

    fig, ax = plt.subplots(figsize=(7, 5))
    ax.scatter(predicted, actual,
               s=[max(100, sz / 100) for sz in sizes],
               color="#d62728", edgecolors="black", alpha=0.75)
    for p, a, sz, rid in zip(predicted, actual, sizes, rids):
        ax.annotate(f"input={sz}",
                    (p, a), xytext=(8, 6), textcoords="offset points",
                    fontsize=9)
    # y=x reference + 10x line + 20x line
    lo = 0.5
    hi = max(50, max(actual) * 1.2)
    ax.plot([lo, hi], [lo, hi], "k--", alpha=0.5, label="y = x (calibrated)")
    ax.plot([lo, hi], [lo * 10, hi * 10], color="gray", linestyle=":",
            alpha=0.4, label="10x")
    ax.plot([lo, hi], [lo * 20, hi * 20], color="lightgray", linestyle=":",
            alpha=0.4, label="20x")
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlim(lo, hi)
    ax.set_ylim(lo, hi)
    ax.set_xlabel("Cost model: predicted migrate cost (s)")
    ax.set_ylabel("Realized TTFT (s)")
    ax.set_title(
        "All 5 PD-sep triggered requests in v2.1 sit far above y=x.\n"
        "Real transfer cost ~10-20x what the calibrated model predicted."
    )
    ax.grid(alpha=0.3, which="both")
    ax.legend(loc="lower right")
    fig.tight_layout()
    fig.savefig(OUT / "fig_v2_predicted_vs_actual.png", dpi=120)
    plt.close(fig)


def fig_three_way_hotspot():
    pols = ["unified", "unified_kv_both", "unified_v2"]
    per_worker = {p: _load(f"per_worker_{p}.json") for p in pols}
    workers = sorted(per_worker["unified"]["per_worker_ttft_p90_s"].keys())

    x = range(len(workers))
    width = 0.27
    fig, ax = plt.subplots(figsize=(11, 5))
    for i, p in enumerate(pols):
        d = per_worker[p]["per_worker_ttft_p90_s"]
        vals = [d[w] for w in workers]
        offset = (i - 1) * width
        ax.bar([j + offset for j in x], vals, width,
               label=f"{p} (hotspot={per_worker[p]['hotspot_index_ttft_p90']:.2f})",
               color=POLICY_COLORS[p], edgecolor="black", linewidth=0.4)
    short = [w.replace("http://127.0.0.1:", ":") for w in workers]
    ax.set_xticks(list(x))
    ax.set_xticklabels(short, rotation=0, fontsize=9)
    ax.set_ylabel("worker TTFT p90 (s)")
    ax.set_title(
        "Per-worker TTFT p90 distribution. kv_both alone makes the hot worker hotter\n"
        "(unified→kv_both: 37.7s→43.5s peak); v2's 5 PD-sep triggers nudge it back."
    )
    ax.legend(loc="upper left", fontsize=9)
    ax.grid(alpha=0.3, axis="y")
    fig.tight_layout()
    fig.savefig(OUT / "fig_three_way_hotspot.png", dpi=120)
    plt.close(fig)


def main():
    fig_kv_both_overhead()
    fig_v2_trigger_funnel()
    fig_v2_predicted_vs_actual()
    fig_three_way_hotspot()
    print(f"wrote 4 figures to {OUT}")


if __name__ == "__main__":
    main()
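
The funnel figure depends on how `_bucket_reasons` collapses per-request records, so a small standalone sketch of that bucketing may help when checking the trigger-rate numbers by hand. It mirrors the logic of `_bucket_reasons` and the percentage computed in the legend of `fig_v2_trigger_funnel`. The helper name `trigger_rate_pct`, the sample records, and the exact text of the `v2_reason` strings are hypothetical and invented for illustration; they are not taken from the breakdown JSON files.

```python
from collections import Counter


def bucket_reasons(records):
    """Collapse per-request outcomes into funnel buckets (same logic as _bucket_reasons)."""
    buckets = Counter()
    for r in records:
        if r.get("v2_pd_sep") is True:
            buckets["PD-sep TRIGGERED"] += 1
            continue
        # Keep only the reason name; drop any parenthesised detail after it.
        reason = (r.get("v2_reason") or "no_v2_reason").split(" (")[0]
        if reason.startswith("local_cost"):
            reason = "cost_benefit not enough margin"
        buckets[reason] += 1
    return buckets


def trigger_rate_pct(buckets):
    """Hypothetical helper: share of requests that triggered PD-sep, in percent."""
    total = sum(buckets.values())
    return 100.0 * buckets["PD-sep TRIGGERED"] / total if total else 0.0


# Hypothetical records for illustration only; the reason strings are assumptions.
sample = [
    {"v2_pd_sep": True, "v2_reason": "migrate"},
    {"v2_pd_sep": False, "v2_reason": "new_local_below_threshold (120 < 512)"},
    {"v2_pd_sep": False, "v2_reason": "new_local_below_threshold (64 < 512)"},
    {"v2_pd_sep": False, "v2_reason": "local_cost 0.4s <= migrate 0.9s"},
    {"v2_reason": None},
]
buckets = bucket_reasons(sample)
print(dict(buckets))
print(f"trigger rate: {trigger_rate_pct(buckets):.1f}%")
```

Unlike the legend expression in the script, this helper returns 0.0 for an empty input instead of dividing by zero, which is a design choice of the sketch rather than the script's behavior.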