LMetric routing policy (OSDI'26) + A/B results vs linear baseline

Implement LMetric (P_tokens × BS multiplication score) from "Simple is
Better" (Zhang et al., OSDI'26) as alternative routing policy for
combined mode. Key changes:

- cache_aware_proxy.py: add --policy {linear,lmetric} flag, track
  pending_prefill_tokens and num_requests per instance, /stats endpoint
- run_lmetric_ab.sh: automated A/B script for fair comparison

Results (200 req, fresh restart, same trace):
  Linear:  TTFT50=1.086  TPOT90=0.077  E2E50=5.423
  LMetric: TTFT50=1.099  TPOT90=0.073  E2E50=5.205
  Delta:   TTFT +1.2%    TPOT -5.9%    E2E -4.0%

LMetric improves TPOT/E2E modestly through better load balancing, but
routing policy headroom is limited vs elastic P2P offload (-44% E2E).

TODO: vLLM → Redis → router pipeline for exact state ablation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-22 16:57:32 +08:00
parent 2b0ac70ee7
commit e4fa56cb1e
4 changed files with 286 additions and 16 deletions

View File

@@ -165,9 +165,12 @@ done
| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p50 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|----------|---------|
| Baseline (combined) | 198/200 | 2.383s | 27.622s | 0.069s | 0.117s | 10.232s |
| Baseline linear | 198/200 | 2.383s | 27.622s | 0.069s | 0.117s | 10.232s |
| Baseline LMetric | 198/200 | 1.099s | 9.392s | 0.063s | 0.073s | 5.205s |
| Elastic P2P (cap=4) | 185/196 | **1.315s** | **13.179s** | **0.066s** | **0.075s** | **5.708s** |
| **Delta** | | **-45%** | **-52%** | **-4%** | **-36%** | **-44%** |
> Note: "Baseline linear" was run on dash0 during the initial A/B (different machine load conditions).
> "Baseline LMetric" was run on fresh-restart dash0, same conditions as "Baseline linear (fresh)" below in §3.6.
### 3.2 KV Cache Hit Ratio
@@ -218,6 +221,40 @@ Key finding: elastic has **much more uniform** prefix APC across instances (std
HEAVY requests (51% of traffic) dominate tail latency. Elastic offloads precisely these.
### 3.6 Routing Policy Comparison: Linear vs LMetric (OSDI'26)
LMetric (Zhang et al., OSDI'26) replaces linear combination `score = load - α·cache_hit` with hyperparameter-free multiplication `score = P_tokens × BS`:
- **P_tokens** = pending prefill tokens on instance + new request's uncached tokens
- **BS** = batch size (waiting + running request count) + 1
Both experiments: 8× TP=1 fresh-restart instances on dash0, same trace (200 req, time_scale=20).
| Policy | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|---------|
| Linear | 198/200 | 1.086s | 9.432s | 0.0773s | 5.423s |
| LMetric | 198/200 | 1.099s | 9.392s | 0.0727s | 5.205s |
| **Delta** | | **+1.2%** | **-0.4%** | **-5.9%** | **-4.0%** |
Per-class breakdown:
| Class | Linear TTFT p50 | LMetric TTFT p50 | Linear TPOT p90 | LMetric TPOT p90 |
|-------|----------------|-----------------|----------------|-----------------|
| WARM (<5k, n=46) | 0.143s | 0.134s | 0.058s | 0.061s |
| MEDIUM (5-20k, n=50) | 0.921s | 0.809s | 0.078s | 0.073s |
| HEAVY (>20k, n=102) | 4.875s | 4.943s | 0.078s | 0.074s |
APC comparison (prefix cache hit rate per instance):
| | Linear | LMetric |
|--|--------|---------|
| Mean | 32.5% | 30.8% |
| Std | ~22pp | ~19pp |
| Range | 3.3%63.3% | 4.9%67.2% |
**Analysis**: LMetric provides modest improvements in TPOT (-5.9%) and E2E (-4.0%) through better load balancing (the multiplication naturally penalizes overloaded instances). TTFT is unchanged because HEAVY requests dominate and session affinity constrains routing freedom. APC skew is slightly reduced. The improvement is far smaller than elastic P2P offload (-44% E2E), confirming that for agentic workloads, **the bottleneck is prefill-decode interference, not routing policy**.
Data: `outputs/ab_linear/` and `outputs/ab_lmetric/` on dash0. Logs: `/tmp/lmetric_ab_inst_*.log` (linear) and `/tmp/lmetric_inst_*.log` (LMetric).
## 4. System-Level Analysis
### 4.1 Why Elastic Wins Despite Lower GPU Utilization
@@ -256,6 +293,8 @@ Root causes:
| `outputs/ab_elastic/` | dash1 | Elastic P2P cap=4 | Fair A/B elastic (§3) |
| `outputs/gpu_ab_combined/` | local | Combined 8× TP=1 | Earlier run, has gpu_util.csv |
| `outputs/gpu_ab_pdsep/` | local | PD-Sep 4P+4D | Earlier run, has gpu_util.csv |
| `outputs/ab_linear/` | dash0 | Linear policy, 200 req | §3.6 routing policy comparison |
| `outputs/ab_lmetric/` | dash0 | LMetric policy, 200 req | §3.6 routing policy comparison |
| `outputs/exp2_combined_tp1_dp8/` | local | Combined 8× TP=1 | 1000 req, cache-aware |
| `outputs/exp3_pd_sep_tp1_mooncake/` | local | PD-Sep 4P+4D Mooncake | 1000 req |
@@ -265,6 +304,8 @@ Root causes:
|-------------|---------|--------|
| `/tmp/ab_base_$i.log` | dash0 | Baseline instances 0-7 |
| `/tmp/ab_elastic_$i.log` | dash1 | Elastic instances 0-7 |
| `/tmp/lmetric_ab_inst_$i.log` | dash0 | Linear policy instances 0-7 (§3.6) |
| `/tmp/lmetric_inst_$i.log` | dash0 | LMetric policy instances 0-7 (§3.6) |
Logs contain `Prefix cache hit rate` and `External prefix cache hit rate` lines for APC extraction.
@@ -324,7 +365,8 @@ agentic-kv/
| Script | What it does | Key flags |
|--------|-------------|-----------|
| `scripts/cache_aware_proxy.py` | Global scheduler + elastic offload proxy | `--combined`, `--offload`, `--heavy-threshold`, `--bootstrap-ports` |
| `scripts/cache_aware_proxy.py` | Global scheduler + elastic offload proxy | `--combined`, `--offload`, `--policy {linear,lmetric}`, `--heavy-threshold`, `--bootstrap-ports` |
| `scripts/run_lmetric_ab.sh` | A/B: linear vs lmetric routing policy | Runs both experiments with fresh restart |
| `scripts/sample_trace.py` | Sample complete sessions from cluster trace | `--target-requests`, `--seed` |
| `python -m replayer` | Replay trace against vLLM endpoint | `--time-scale`, `--max-inflight-sessions`, `--request-limit` |
| `scripts/gpu_monitor.sh` | Sample nvidia-smi to CSV | Pipe to `outputs/<tag>/gpu_util.csv` |
@@ -337,13 +379,15 @@ agentic-kv/
2. Cache-aware session-sticky routing is the **dominant optimization** (+24pp APC, -60% TTFT)
3. Elastic P2P offload achieves **-45% TTFT, -36% TPOT, -44% E2E** by selectively isolating heavy prefills while preserving decode cache locality
4. The GPU utilization paradox (lower util but better performance) is explained by higher per-request efficiency
5. LMetric (OSDI'26) multiplication-based routing provides modest improvement over linear (**E2E -4%, TPOT -6%**), confirming that routing policy alone has limited headroom — the bottleneck is prefill-decode interference
### Open problems:
1. GPU load imbalance (3.0× in elastic) — round-robin P fix implemented, needs validation
2. Elastic success rate (94.4%) — Mooncake transfer timeouts on >60k requests
3. Scaling to multi-machine (cross-node Mooncake transfers not yet tested)
4. Adaptive offload threshold (fixed 20k may not be optimal for all load levels)
5. Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router pipeline for ablation)
---
*Generated from experiments run on 2026-05-22. Git commit: `1e86285` (A/B results) + subsequent proxy improvements.*
*Generated from experiments run on 2026-05-22. Git commits: `1e86285` (elastic A/B), `2b0ac70` (phase 1 milestone), subsequent LMetric implementation.*