agentic-kvc

gahow/agentic-kvc

Commit Graph

Author	SHA1	Message	Date
Gahow Wang	63387f614d	Full v3 trace re-profile with layer-wise: matched migrations improve. 1213/1214 success; matched migrations (4 common) improved -2.6 to -7.2s, scaling with prefill hidden behind transfer. Trace-level TTFT p90 -6% / p99 -5% (modest: migrations are 2% of reqs and partly queue-bound). Confirms layer-wise removes the transfer half of migration overhead but not the control-plane/queue residual. DESIGN.md updated with results. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:16:37 +08:00
Gahow Wang	e77bdcac5a	Layerwise under load: overlap benefit survives (bg=16). mb7 with background decode load (8/instance). Critical-path transfer overhead stays ~constant ~90ms for layerwise vs 158/239/749ms baseline (up to 7.9x at 32k), prefill not slowed, KV correct. Confirms the overlap holds on busy instances. DESIGN.md updated with idle-vs-load table + the two blockers (chunk-safety, concurrent-transfer safety) that the full 1200-req trace needs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:30:14 +08:00
Gahow Wang	fec50fa45d	Layerwise KV transfer on Mooncake: PoC + microbench (worktree exploration). Implements per-layer KV push during prefill (write mode) on vLLM's MooncakeConnector, env-gated by MOONCAKE_LAYERWISE=1. 2-instance microbench (mb7) shows correctness (KV lands, cached==prompt) and that the transfer is hidden behind prefill compute: critical-path overhead drops from O(KV size) (123/202/529ms for 8k/16k/32k) to a flat ~58ms (2-9x), with no prefill slowdown, on idle instances. Caveats: idle-only, chunked-prefill disabled, single concurrent transfer; see DESIGN.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 15:34:43 +08:00

3 Commits
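
The overlap described in commit fec50fa45d (per-layer KV push hidden behind prefill compute) can be illustrated with a small timing model. This is only a sketch under stated assumptions, not code from the repository: the function name, its parameters, and the per-layer timings are hypothetical, a single transfer link is assumed, and fixed control-plane costs (which the commits report as a residual) are left out.

```python
def critical_path_overhead_ms(n_layers, compute_ms_per_layer,
                              transfer_ms_per_layer, layerwise):
    """Time between the end of prefill and the last layer's KV arriving.

    Hypothetical model: one link, transfers are serialized, and a layer's
    KV can only be sent once that layer has been computed.
    """
    prefill_end = n_layers * compute_ms_per_layer
    if not layerwise:
        # Baseline: the whole KV cache is sent after prefill completes,
        # so the overhead grows with total KV size.
        return n_layers * transfer_ms_per_layer

    link_free_at = 0.0
    for layer in range(n_layers):
        computed_at = (layer + 1) * compute_ms_per_layer
        # A layer's push starts when its KV exists and the link is idle.
        start = max(computed_at, link_free_at)
        link_free_at = start + transfer_ms_per_layer
    return link_free_at - prefill_end


if __name__ == "__main__":
    # Illustrative numbers only; these are not measurements from the repo.
    for layerwise in (False, True):
        overhead = critical_path_overhead_ms(32, 10.0, 4.0, layerwise)
        print(f"layerwise={layerwise}: overhead={overhead} ms")
```

In this model, when a layer's transfer is faster than its compute, only the last layer's push remains on the critical path, so the overhead no longer depends on the number of layers. When transfer is slower than compute, the link becomes the bottleneck and part of the transfer is exposed again.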