Gahow Wang
3bc37cc6d5
PS experiments + H4 cache-gate + GPU profiling + Mooncake elif→if fix
Experiments run:
- Phase 0: kv_both has no measurable idle overhead (TPOT +1.3%, within noise)
- PS V1 (cold prefill): REJECTED; PS is always slower than cached C
- PS V1 + flexible D: 92.5% OK, but HEAVY TTFT 7.8s vs. 5.0s baseline; PS is the bottleneck
- V2 (C_s prefill + flexible D): E2E -9%, but 6 errors and bimodal RDMA transfer times
- H4 (cache-gate): 198/200 OK, GPU imbalance 4.0x→2.0x, but HEAVY_OFFLOAD
  TTFT=11.5s due to RDMA. HEAVY_COLO TTFT improved 10.5% from better balance.
- H5: Mooncake RDMA transfer time is poorly explained (R²=0.095) and bimodal
  (0.6s or 18-30s)
Key findings:
- Mooncake lacks layerwise KV transfer, so RDMA cannot overlap with prefill
  and is pure sequential overhead
- 92% of HEAVY requests are turn-1 cold; offloading cold requests always loses
- GPU balance improvement from routing IS real (-10.5% HEAVY_COLO TTFT)
- RDMA transfer negates the routing benefit for offloaded requests
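A toy latency model can make the layerwise-transfer finding concrete. This is a sketch under stated assumptions, not a model fitted to these runs: prefill compute and KV transfer are assumed to split evenly across layers, and the function names are hypothetical. Without layerwise transfer, an offloaded request pays prefill plus the full transfer. With layerwise transfer, layer i's KV is sent while layer i+1 computes, forming a two-stage pipeline.

```python
def ttft_sequential(prefill_s: float, transfer_s: float) -> float:
    """TTFT when the whole KV cache is sent only after prefill finishes."""
    return prefill_s + transfer_s


def ttft_layerwise(prefill_s: float, transfer_s: float, num_layers: int) -> float:
    """TTFT when layer i's KV is sent while layer i+1 is being computed.

    Assumption (hypothetical model): compute and transfer are split evenly
    across layers, and the link sends one layer's KV at a time, giving a
    two-stage pipeline (compute stage -> send stage).
    """
    compute_per_layer = prefill_s / num_layers
    send_per_layer = transfer_s / num_layers
    # In a two-stage pipeline the slower stage sets the steady-state rate:
    # first compute + (num_layers - 1) * slower stage + last send.
    bottleneck = max(compute_per_layer, send_per_layer)
    return compute_per_layer + (num_layers - 1) * bottleneck + send_per_layer


if __name__ == "__main__":
    # Illustrative inputs only, not measurements from these experiments.
    print(ttft_sequential(2.0, 1.0))
    print(ttft_layerwise(2.0, 1.0, 4))
```

With these illustrative inputs the sequential path pays the full transfer, while the pipelined path exposes only the last layer's send. With a single layer, the two models coincide.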
Code changes:
- bench.sh: add GPU timeline monitoring (gpu_monitor.sh during benchmark)
- cache_aware_proxy.py: H4 cache-gate, flexible D, PS routing
- mooncake_connector.py: elif→if fix (allow dual prefill+decode flags)
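The following is a minimal sketch of why the elif→if change matters. It assumes the connector checks two boolean role flags in sequence. The function and flag names below are hypothetical and are not taken from mooncake_connector.py. With `elif`, a worker that sets both flags only runs the first branch. With two independent `if` checks, both prefill and decode setup run.

```python
def setup_roles_elif(is_prefill: bool, is_decode: bool) -> list[str]:
    """Old pattern (hypothetical): elif skips decode setup when prefill is set."""
    roles = []
    if is_prefill:
        roles.append("prefill")
    elif is_decode:  # never reached when is_prefill is True
        roles.append("decode")
    return roles


def setup_roles_if(is_prefill: bool, is_decode: bool) -> list[str]:
    """Fixed pattern (hypothetical): each flag is checked independently."""
    roles = []
    if is_prefill:
        roles.append("prefill")
    if is_decode:
        roles.append("decode")
    return roles


if __name__ == "__main__":
    print(setup_roles_elif(True, True))
    print(setup_roles_if(True, True))
```

Single-role workers behave the same under both versions. Only a worker with both flags set sees a difference.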
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 02:14:37 +08:00