Full v3 trace re-profile with layer-wise: matched migrations improve

1213/1214 success; matched migrations (4 common) improved -2.6 to -7.2s,
scaling with the transfer hidden behind prefill. Trace-level TTFT p90 -6% / p99
-5% (modest: migrations are 2% of reqs and partly queue-bound). Confirms
layer-wise removes the transfer half of migration overhead but not the
control-plane/queue residual. DESIGN.md updated with results.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 19:16:37 +08:00
parent 21db2affb4
commit 63387f614d
5 changed files with 2467 additions and 1 deletions


@@ -130,7 +130,43 @@ not slow (load LW `t_A` == load `prefill_only`); the transfer (0.56/1.46/4.37 s,
producer logs) ran inside the prefill window even with 16 concurrent decodes.
Correctness PASS under load.
## Verdict
## FULL 1200-req v3 TRACE re-profile (chunk-safe + concurrent + write-mode)
Hardened connector (per-step incremental shipping, per-transfer state) +
write-mode proxy (concurrent prefill/decode dispatch). Two passes of
`w600_r0.0015_st30.jsonl` under `unified_v3`, differing only in transfer mode.
Correctness: layer-wise **1213/1214 success** (1 connection-error on the 128k
req, not KV corruption); byte-level KV correctness validated on mb7
(chunked + 3-way concurrent, `cached==prompt`); producer logs confirm
incremental shipping (e.g. `shipped 7872/7872 blocks`).
Migration sets differ between runs (write-mode timing shifts which requests
trigger migration; only 4 migrated in both), but are distributionally
comparable (median new_local/input 0.42 vs 0.46). **Matched migrations
all improved**, scaling with the transfer hidden behind prefill:
| request | input | new_local | base TTFT | LW TTFT | Δ |
|---|--:|--:|--:|--:|--:|
| 1268630 | 102k | 97k | 41.20 | 33.96 | **-7.23s** |
| 1334223 | 37k | 14k | 6.04 | 3.23 | -2.81s |
| 1279412 | 40k | 8k | 5.50 | 2.92 | -2.58s |
| 1271459 | 8.9k | 8.9k | 37.01 | 36.98 | -0.03s (queue-bound) |
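As a sanity check on the Δ column, the deltas can be recomputed from the rounded TTFT values in the table. This is a minimal sketch: the `matched` dict and `ttft_delta` helper are hypothetical names used only for illustration, and because the inputs are already rounded to two decimals, a recomputed delta may differ from the reported one by 0.01 s (request 1268630 comes out as 7.24 rather than 7.23).

```python
# TTFT values in seconds, copied from the matched-migration table.
# Tuple layout is (base TTFT, layer-wise TTFT); names here are illustrative.
matched = {
    "1268630": (41.20, 33.96),
    "1334223": (6.04, 3.23),
    "1279412": (5.50, 2.92),
    "1271459": (37.01, 36.98),
}


def ttft_delta(base: float, lw: float) -> float:
    """Return LW minus base TTFT in seconds (negative means LW is faster)."""
    return round(lw - base, 2)


deltas = {req: ttft_delta(base, lw) for req, (base, lw) in matched.items()}

for req, delta in deltas.items():
    print(f"{req}: {delta:+.2f}s")
```

Every matched request has a negative delta, which is what "all improved" means here, although the last one is within noise.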
Trace-level TTFT (different sets, directional): overall p90 9.79 -> 9.16 (-6%),
p99 44.89 -> 42.85 (-5%). **Modest** because (a) migrations are only 25/1214,
about **2%** of requests, and (b) several migrations are queue/contention-bound, not
transfer-bound: layer-wise removes the transfer component but not the
control-plane/queue residual (the ~45% from the b3_v3_fullbreak profile).
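The rounded percentages can be reproduced with a few lines of arithmetic. This sketch reads the p90 figures as 9.79 s before and 9.16 s after, and the p99 figures as 44.89 s before and 42.85 s after; `pct_change` is a hypothetical helper, not part of the profiling scripts.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the metric went down."""
    return (after - before) / before * 100.0


# Trace-level TTFT percentiles in seconds (baseline, layer-wise).
p90_change = pct_change(9.79, 9.16)
p99_change = pct_change(44.89, 42.85)

# Share of requests that migrated in the trace.
migration_share = 25 / 1214 * 100.0

print(f"p90: {p90_change:.1f}%")
print(f"p99: {p99_change:.1f}%")
print(f"migrated: {migration_share:.1f}% of requests")
```

With only about 2% of requests migrating, improvements on those requests can shift the tail percentiles but have little room to move the overall distribution.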
**Verdict on the trace re-profile:** layer-wise does exactly what the profile
predicted: it removes the transfer half of migration overhead (matched
migrations -2.6 to -7.2s, biggest where there's the most prefill to hide
behind), but the trace-level gain is small because migrations are rare and
partly queue-bound. It does NOT, on its own, flip migration to a clear win
over unified for this agentic workload.
## Verdict (microbench)
The mechanism **works and the benefit holds under load**: layer-wise push turns
migration's KV-transfer cost from O(KV size) on the critical path into a
