e4fa56cb1ed6309c13fe83e62326a4fd6e39e781
Implement LMetric (P_tokens × BS multiplication score) from "Simple is
Better" (Zhang et al., OSDI'26) as alternative routing policy for
combined mode. Key changes:
- cache_aware_proxy.py: add --policy {linear,lmetric} flag, track
pending_prefill_tokens and num_requests per instance, /stats endpoint
- run_lmetric_ab.sh: automated A/B script for fair comparison
Results (200 req, fresh restart, same trace):
Linear: TTFT50=1.086 TPOT90=0.077 E2E50=5.423
LMetric: TTFT50=1.099 TPOT90=0.073 E2E50=5.205
Delta: TTFT +1.2% TPOT -5.9% E2E -4.0%
LMetric improves TPOT/E2E modestly through better load balancing, but
routing policy headroom is limited vs elastic P2P offload (-44% E2E).
TODO: vLLM → Redis → router pipeline for exact state ablation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
No description provided
Languages
Python
82.9%
Shell
17.1%