Gahow Wang
e4fa56cb1e
LMetric routing policy (OSDI'26) + A/B results vs linear baseline
Implement LMetric, the multiplicative P_tokens × BS score from "Simple is
Better" (Zhang et al., OSDI'26), as an alternative routing policy for
combined mode. Key changes:
- cache_aware_proxy.py: add a --policy {linear,lmetric} flag, track
  pending_prefill_tokens and num_requests per instance, and add a /stats
  endpoint
- run_lmetric_ab.sh: automated A/B script for a fair comparison
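
A minimal sketch of how the lmetric policy might pick an instance from the two tracked counters. It assumes that P_tokens is the instance's pending_prefill_tokens plus the new request's uncached prompt tokens, that BS is num_requests plus the incoming request, and that the lowest product wins. InstanceState, lmetric_score, and pick_instance are hypothetical names for illustration, not the actual code in cache_aware_proxy.py.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Per-instance counters the proxy is assumed to track."""
    name: str
    pending_prefill_tokens: int  # prompt tokens queued but not yet prefilled
    num_requests: int            # requests currently running on the instance


def lmetric_score(inst: InstanceState, new_prefill_tokens: int) -> int:
    """Multiplicative score: P_tokens x BS, counting the incoming request.

    Assumption: the new request's uncached tokens are added before scoring.
    """
    p_tokens = inst.pending_prefill_tokens + new_prefill_tokens
    batch_size = inst.num_requests + 1
    return p_tokens * batch_size


def pick_instance(instances: list[InstanceState],
                  new_prefill_tokens: int) -> InstanceState:
    """Route to the instance with the lowest score."""
    return min(instances, key=lambda i: lmetric_score(i, new_prefill_tokens))


instances = [
    InstanceState("a", pending_prefill_tokens=4000, num_requests=2),
    InstanceState("b", pending_prefill_tokens=1000, num_requests=6),
]
chosen = pick_instance(instances, new_prefill_tokens=500)
print(chosen.name)
```

Because the two terms multiply, an instance with a long prefill queue is penalized even when its batch is small, and the reverse also holds; a weighted sum would need tuned coefficients to trade the two off.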
Results (200 requests, fresh restart, same trace):
  Linear:  TTFT50=1.086  TPOT90=0.077  E2E50=5.423
  LMetric: TTFT50=1.099  TPOT90=0.073  E2E50=5.205
  Delta:   TTFT +1.2%    TPOT -5.9%    E2E -4.0%
LMetric modestly improves TPOT and E2E through better load balancing, but
the headroom from routing policy alone is limited compared with elastic
P2P offload (-44% E2E).
TODO: build a vLLM → Redis → router pipeline for an exact-state ablation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 16:57:32 +08:00