Full v3 trace re-profile with layer-wise: matched migrations improve
1213/1214 success. Of the 4 migrations common to both runs, three improved by 2.6 to 7.2 s, scaling with the transfer hidden behind prefill; the fourth was queue-bound (-0.03 s). Trace-level TTFT p90 -6% / p99 -5% (modest: migrations are 2% of reqs and partly queue-bound). Confirms layer-wise removes the transfer half of migration overhead but not the control-plane/queue residual. DESIGN.md updated with results.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -130,7 +130,43 @@ not slow (load LW `t_A` == load `prefill_only`); the transfer (0.56/1.46/4.37 s,
producer logs) ran inside the prefill window even with 16 concurrent decodes.
Correctness PASS under load.

## Verdict
## FULL 1200-req v3 TRACE re-profile (chunk-safe + concurrent + write-mode)

Hardened connector (per-step incremental shipping, per-transfer state) +
write-mode proxy (concurrent prefill/decode dispatch). Two passes of
`w600_r0.0015_st30.jsonl` under `unified_v3`, differing only in transfer mode.

Correctness: layer-wise **1213/1214 success** (1 connection-error on the 128k
req, not KV corruption); byte-level KV correctness validated on mb7
(chunked + 3-way concurrent, `cached==prompt`); producer logs confirm
incremental shipping (e.g. `shipped 7872/7872 blocks`).

Migration sets differ between runs (write-mode timing shifts which requests
trigger migration; only 4 migrated in both), but are distributionally
comparable (median new_local/input ≈ 0.42 vs 0.46). **Matched migrations
all improved**, scaling with the transfer hidden behind prefill (the fourth,
queue-bound, moved by only 0.03 s):

| request | input | new_local | base TTFT (s) | LW TTFT (s) | Δ |
|---|--:|--:|--:|--:|--:|
| 1268630 | 102k | 97k | 41.20 | 33.96 | **−7.23s** |
| 1334223 | 37k | 14k | 6.04 | 3.23 | −2.81s |
| 1279412 | 40k | 8k | 5.50 | 2.92 | −2.58s |
| 1271459 | 8.9k | 8.9k | 37.01 | 36.98 | −0.03s (queue-bound) |

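As a cross-check on the matched-migration table, the per-request delta and the new_local/input ratio can be recomputed from the rounded values shown. This is a minimal sketch, not the profiling tooling used for the run: the `MatchedMigration` class, the `summarize` helper and the 0.1 s `FLAT_THRESHOLD_S` cutoff are hypothetical names and values introduced here for illustration. Because it works from two-decimal TTFTs, the first row recomputes to −7.24 rather than the −7.23 in the table, presumably a rounding difference against the unrounded timings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchedMigration:
    request: int
    input_k: float      # "input" column, in thousands (as printed in the table)
    new_local_k: float  # "new_local" column, in thousands
    base_ttft: float    # baseline TTFT, seconds
    lw_ttft: float      # layer-wise TTFT, seconds

    @property
    def delta(self) -> float:
        """TTFT change in seconds; negative means layer-wise was faster."""
        return self.lw_ttft - self.base_ttft

    @property
    def new_local_ratio(self) -> float:
        """Share of the input that is new_local for this request."""
        return self.new_local_k / self.input_k


# Values copied from the matched-migration table.
ROWS = [
    MatchedMigration(1268630, 102, 97, 41.20, 33.96),
    MatchedMigration(1334223, 37, 14, 6.04, 3.23),
    MatchedMigration(1279412, 40, 8, 5.50, 2.92),
    MatchedMigration(1271459, 8.9, 8.9, 37.01, 36.98),
]

# Hypothetical cutoff (an assumption, not from the profile): a change smaller
# than this is treated as "no visible transfer benefit", e.g. queue-bound.
FLAT_THRESHOLD_S = 0.1


def summarize(rows):
    """Return (request, delta_s, new_local_ratio, is_flat) for each row."""
    return [
        (
            r.request,
            round(r.delta, 2),
            round(r.new_local_ratio, 2),
            abs(r.delta) < FLAT_THRESHOLD_S,
        )
        for r in rows
    ]


if __name__ == "__main__":
    for request, delta, ratio, is_flat in summarize(ROWS):
        note = " (flat)" if is_flat else ""
        print(f"{request}: delta={delta:+.2f}s new_local/input={ratio:.2f}{note}")
```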
Trace-level TTFT (migration sets differ, so directional only): overall p90
9.79→9.16 s (−6%), p99 44.89→42.85 s (−5%). **Modest** because (a) migrations
are only 25/1214 ≈ **2%** of requests, and (b) several migrations are
queue/contention-bound, not transfer-bound: layer-wise removes the transfer
component but not the control-plane/queue residual (the ~45% from the
b3_v3_fullbreak profile).

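The trace-level percentages follow directly from the quoted figures. A small sketch of the arithmetic, using only the p90/p99 values and the 25/1214 migration count stated above; the function and constant names are illustrative, not part of the profiling scripts:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means `after` is lower."""
    return (after - before) / before * 100.0


# Figures quoted in the trace-level TTFT paragraph (seconds).
P90 = (9.79, 9.16)
P99 = (44.89, 42.85)

# Migrated requests out of all requests in the trace.
MIGRATED, TOTAL = 25, 1214


def summary() -> dict:
    """Collect the three headline percentages in one place."""
    return {
        "p90_pct": pct_change(*P90),
        "p99_pct": pct_change(*P99),
        "migration_share_pct": MIGRATED / TOTAL * 100.0,
    }


if __name__ == "__main__":
    for name, value in summary().items():
        print(f"{name}: {value:.2f}")
```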
**Verdict on the trace re-profile:** layer-wise does exactly what the profile
predicted: it removes the transfer half of migration overhead (matched
migrations −2.6 to −7.2s, biggest where there's the most prefill to hide
behind), but the trace-level gain is small because migrations are rare and
partly queue-bound. It does NOT, on its own, flip migration to a clear win
over unified for this agentic workload.

## Verdict (microbench)

The mechanism **works and the benefit holds under load**: layer-wise push turns
migration's KV-transfer cost from O(KV size) on the critical path into a