LMetric routing policy (OSDI'26) + A/B results vs linear baseline

Implement LMetric (P_tokens × BS multiplication score) from "Simple is
Better" (Zhang et al., OSDI'26) as alternative routing policy for
combined mode. Key changes:

- cache_aware_proxy.py: add --policy {linear,lmetric} flag, track
  pending_prefill_tokens and num_requests per instance, /stats endpoint
- run_lmetric_ab.sh: automated A/B script for fair comparison

Results (200 req, fresh restart, same trace):
  Linear:  TTFT50=1.086  TPOT90=0.077  E2E50=5.423
  LMetric: TTFT50=1.099  TPOT90=0.073  E2E50=5.205
  Delta:   TTFT +1.2%    TPOT -5.9%    E2E -4.0%

LMetric improves TPOT/E2E modestly through better load balancing, but
routing policy headroom is limited vs elastic P2P offload (-44% E2E).

TODO: vLLM → Redis → router pipeline for exact state ablation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-22 16:57:32 +08:00
parent 2b0ac70ee7
commit e4fa56cb1e
4 changed files with 286 additions and 16 deletions

View File

@@ -165,9 +165,12 @@ done
| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p50 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|----------|---------|
| Baseline (combined) | 198/200 | 2.383s | 27.622s | 0.069s | 0.117s | 10.232s |
| Baseline linear | 198/200 | 2.383s | 27.622s | 0.069s | 0.117s | 10.232s |
| Baseline LMetric | 198/200 | 1.099s | 9.392s | 0.063s | 0.073s | 5.205s |
| Elastic P2P (cap=4) | 185/196 | **1.315s** | **13.179s** | **0.066s** | **0.075s** | **5.708s** |
| **Delta** | | **-45%** | **-52%** | **-4%** | **-36%** | **-44%** |
> Note: "Baseline linear" was run on dash0 during the initial A/B (different machine load conditions).
> "Baseline LMetric" was run on fresh-restart dash0, same conditions as "Baseline linear (fresh)" below in §3.6.
### 3.2 KV Cache Hit Ratio
@@ -218,6 +221,40 @@ Key finding: elastic has **much more uniform** prefix APC across instances (std
HEAVY requests (51% of traffic) dominate tail latency. Elastic offloads precisely these.
### 3.6 Routing Policy Comparison: Linear vs LMetric (OSDI'26)
LMetric (Zhang et al., OSDI'26) replaces linear combination `score = load - α·cache_hit` with hyperparameter-free multiplication `score = P_tokens × BS`:
- **P_tokens** = pending prefill tokens on instance + new request's uncached tokens
- **BS** = batch size (waiting + running request count) + 1
Both experiments: 8× TP=1 fresh-restart instances on dash0, same trace (200 req, time_scale=20).
| Policy | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|---------|
| Linear | 198/200 | 1.086s | 9.432s | 0.0773s | 5.423s |
| LMetric | 198/200 | 1.099s | 9.392s | 0.0727s | 5.205s |
| **Delta** | | **+1.2%** | **-0.4%** | **-5.9%** | **-4.0%** |
Per-class breakdown:
| Class | Linear TTFT p50 | LMetric TTFT p50 | Linear TPOT p90 | LMetric TPOT p90 |
|-------|----------------|-----------------|----------------|-----------------|
| WARM (<5k, n=46) | 0.143s | 0.134s | 0.058s | 0.061s |
| MEDIUM (5-20k, n=50) | 0.921s | 0.809s | 0.078s | 0.073s |
| HEAVY (>20k, n=102) | 4.875s | 4.943s | 0.078s | 0.074s |
APC comparison (prefix cache hit rate per instance):
| | Linear | LMetric |
|--|--------|---------|
| Mean | 32.5% | 30.8% |
| Std | ~22pp | ~19pp |
| Range | 3.3%63.3% | 4.9%67.2% |
**Analysis**: LMetric provides modest improvements in TPOT (-5.9%) and E2E (-4.0%) through better load balancing (the multiplication naturally penalizes overloaded instances). TTFT is unchanged because HEAVY requests dominate and session affinity constrains routing freedom. APC skew is slightly reduced. The improvement is far smaller than elastic P2P offload (-44% E2E), confirming that for agentic workloads, **the bottleneck is prefill-decode interference, not routing policy**.
Data: `outputs/ab_linear/` and `outputs/ab_lmetric/` on dash0. Logs: `/tmp/lmetric_ab_inst_*.log` (linear) and `/tmp/lmetric_inst_*.log` (LMetric).
## 4. System-Level Analysis
### 4.1 Why Elastic Wins Despite Lower GPU Utilization
@@ -256,6 +293,8 @@ Root causes:
| `outputs/ab_elastic/` | dash1 | Elastic P2P cap=4 | Fair A/B elastic (§3) |
| `outputs/gpu_ab_combined/` | local | Combined 8× TP=1 | Earlier run, has gpu_util.csv |
| `outputs/gpu_ab_pdsep/` | local | PD-Sep 4P+4D | Earlier run, has gpu_util.csv |
| `outputs/ab_linear/` | dash0 | Linear policy, 200 req | §3.6 routing policy comparison |
| `outputs/ab_lmetric/` | dash0 | LMetric policy, 200 req | §3.6 routing policy comparison |
| `outputs/exp2_combined_tp1_dp8/` | local | Combined 8× TP=1 | 1000 req, cache-aware |
| `outputs/exp3_pd_sep_tp1_mooncake/` | local | PD-Sep 4P+4D Mooncake | 1000 req |
@@ -265,6 +304,8 @@ Root causes:
|-------------|---------|--------|
| `/tmp/ab_base_$i.log` | dash0 | Baseline instances 0-7 |
| `/tmp/ab_elastic_$i.log` | dash1 | Elastic instances 0-7 |
| `/tmp/lmetric_ab_inst_$i.log` | dash0 | Linear policy instances 0-7 (§3.6) |
| `/tmp/lmetric_inst_$i.log` | dash0 | LMetric policy instances 0-7 (§3.6) |
Logs contain `Prefix cache hit rate` and `External prefix cache hit rate` lines for APC extraction.
@@ -324,7 +365,8 @@ agentic-kv/
| Script | What it does | Key flags |
|--------|-------------|-----------|
| `scripts/cache_aware_proxy.py` | Global scheduler + elastic offload proxy | `--combined`, `--offload`, `--heavy-threshold`, `--bootstrap-ports` |
| `scripts/cache_aware_proxy.py` | Global scheduler + elastic offload proxy | `--combined`, `--offload`, `--policy {linear,lmetric}`, `--heavy-threshold`, `--bootstrap-ports` |
| `scripts/run_lmetric_ab.sh` | A/B: linear vs lmetric routing policy | Runs both experiments with fresh restart |
| `scripts/sample_trace.py` | Sample complete sessions from cluster trace | `--target-requests`, `--seed` |
| `python -m replayer` | Replay trace against vLLM endpoint | `--time-scale`, `--max-inflight-sessions`, `--request-limit` |
| `scripts/gpu_monitor.sh` | Sample nvidia-smi to CSV | Pipe to `outputs/<tag>/gpu_util.csv` |
@@ -337,13 +379,15 @@ agentic-kv/
2. Cache-aware session-sticky routing is the **dominant optimization** (+24pp APC, -60% TTFT)
3. Elastic P2P offload achieves **-45% TTFT, -36% TPOT, -44% E2E** by selectively isolating heavy prefills while preserving decode cache locality
4. The GPU utilization paradox (lower util but better performance) is explained by higher per-request efficiency
5. LMetric (OSDI'26) multiplication-based routing provides modest improvement over linear (**E2E -4%, TPOT -6%**), confirming that routing policy alone has limited headroom — the bottleneck is prefill-decode interference
### Open problems:
1. GPU load imbalance (3.0× in elastic) — round-robin P fix implemented, needs validation
2. Elastic success rate (94.4%) — Mooncake transfer timeouts on >60k requests
3. Scaling to multi-machine (cross-node Mooncake transfers not yet tested)
4. Adaptive offload threshold (fixed 20k may not be optimal for all load levels)
5. Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router pipeline for ablation)
---
*Generated from experiments run on 2026-05-22. Git commit: `1e86285` (A/B results) + subsequent proxy improvements.*
*Generated from experiments run on 2026-05-22. Git commits: `1e86285` (elastic A/B), `2b0ac70` (phase 1 milestone), subsequent LMetric implementation.*

View File

@@ -21,5 +21,10 @@ GPU 机器dash0是 8*H20 的机器,可以直接 `ssh dash0` 进行连接
3. 跑通 PD 分离,确认 PD 分离能够比普通的 PD 混合一起跑的性能要好,给出两者详细的性能对比以及原因分析
4. 判断 trace 的 pattern是否有必要 PD 完全混合或者 PD 分离
5. 参考本地的 `~/phd/agentic-pd-hybrid`,判断是否能够实现一套 prefill-as-a-service 的架构,把重的 prefill 交给 prefill serviceprefill service 能够从本地 GPU/DRAM/别的 GPU 机器上 pull KVCache提高本地的 prefix KVCache hit ratio不影响 decoding 的 prefill就可以交给过去 PD 分离定义中 D-node 来做,提高 KVCache 命中率
6. [TODO] Router 侧状态精确性 ablation当前 router 自己维护 shadow stateongoing_tokens, pending_prefill_tokens, cached_blocks 等),与 vLLM 引擎内部真实状态可能存在 gap尤其是 cache eviction 导致的 APC 偏差)。需要:
- 修改 vLLM让每个 instance 定期把内部状态waiting queue depth, running batch size, KV cache usage, actual prefix cache hit blocks写入 Redis
- Router 从 Redis 读取精确状态,替代当前的 proxy-side 近似
- 对比 proxy shadow state vs Redis exact state 的 routing 决策差异和最终性能差异
- 量化 gap 大小:哪些指标差距最大?是否影响实际 routing 质量?

View File

@@ -4,10 +4,12 @@ Supports two modes:
--combined URL [URL ...]: PD co-located instances (normal vLLM, no KV transfer)
--prefill URL BP --decode URL: PD disaggregated instances (Mooncake KV transfer)
Routing policy (same for both modes):
score = ongoing_tokens / avg_ongoing - ALPHA * cache_hit_ratio
Normalized load prevents "rich get richer"; cache bonus gives affinity.
Session affinity: multi-turn sessions stick to same instance.
Routing policies (--policy):
linear (default): score = ongoing_tokens - ALPHA * cache_hit_tokens
lmetric: score = P_tokens * BS (LMetric, OSDI'26)
P_tokens = pending_prefill_tokens + new_uncached_tokens
BS = num_requests (waiting + running)
Session affinity: multi-turn sessions stick to same instance (all policies).
"""
import argparse
@@ -39,6 +41,8 @@ class InstanceState:
)
self.ongoing_tokens = 0
self.ongoing_decode_tokens = 0 # subset: tokens in decode phase
self.pending_prefill_tokens = 0 # tokens for requests still in prefill
self.num_requests = 0 # total in-flight requests (waiting + running)
self.engine_id: dict[int, str] = {}
self.dp_size = 1
self.cached_blocks: set[int] = set()
@@ -109,6 +113,39 @@ def pick_instance(instances: list[InstanceState], token_ids: list[int] | None,
return instances[best_idx], best_idx
def pick_instance_lmetric(instances: list[InstanceState], token_ids: list[int] | None,
session_id: str | None, input_length: int,
affinity: dict[str, int]) -> tuple[InstanceState, int]:
"""LMetric routing: score = P_tokens × BS (OSDI'26).
P_tokens = pending_prefill_tokens on instance + new request's uncached tokens.
BS = num_requests on instance + 1 (counting the new request).
"""
avg_load = max(sum(i.ongoing_tokens for i in instances) / len(instances), 1.0)
if session_id and session_id in affinity:
idx = affinity[session_id]
if idx < len(instances):
inst = instances[idx]
if inst.ongoing_tokens <= avg_load * OVERLOAD_FACTOR:
return inst, idx
best_idx, best_score = 0, float("inf")
for i, inst in enumerate(instances):
cache_hit = inst.estimate_cache_hit(token_ids)
new_prefill = max(0, input_length - cache_hit)
p_tokens = inst.pending_prefill_tokens + new_prefill
bs = inst.num_requests + 1
score = p_tokens * bs
if score < best_score:
best_score = score
best_idx = i
if session_id:
affinity[session_id] = best_idx
return instances[best_idx], best_idx
global_args = None
combined_instances: list[InstanceState] = []
prefill_instances: list[InstanceState] = []
@@ -159,7 +196,8 @@ async def lifespan(app: FastAPI):
await init_prefill_bootstrap(combined_instances, app.state.ready)
else:
app.state.ready.set()
print(f"Combined mode: {len(combined_instances)} instances, offload={'ON' if global_args.offload else 'OFF'}")
policy = getattr(global_args, 'policy', 'linear')
print(f"Combined mode: {len(combined_instances)} instances, policy={policy}, offload={'ON' if global_args.offload else 'OFF'}")
else:
is_pd_sep = True
for url, bp in global_args.prefill:
@@ -219,7 +257,9 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
session-sticky instance. Only works if instances have kv_role=kv_both.
Falls back to co-located if --no-offload or instances lack Mooncake.
"""
best_inst, best_idx = pick_instance(combined_instances, token_ids, session_id,
policy = getattr(global_args, 'policy', 'linear') if global_args else 'linear'
picker = pick_instance_lmetric if policy == 'lmetric' else pick_instance
best_inst, best_idx = picker(combined_instances, token_ids, session_id,
input_length, session_affinity)
cache_hit = best_inst.estimate_cache_hit(token_ids)
estimated_new = max(0, input_length - cache_hit)
@@ -267,6 +307,8 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
if use_offload:
d_idx = best_idx
p_inst.ongoing_tokens += input_length # reserve immediately
p_inst.pending_prefill_tokens += estimated_new
p_inst.num_requests += 1
breakdown["route_class"] = "HEAVY_P2P"
breakdown["offload_reason"] = offload_reason
@@ -288,23 +330,31 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
inst = best_inst
breakdown["routed_to"] = inst.url
breakdown["policy"] = policy
inst.ongoing_tokens += input_length
inst.pending_prefill_tokens += estimated_new
inst.num_requests += 1
async def generate():
first_token = True
prefill_done = False
try:
async with inst.client.stream("POST", api, json=req_data, headers=headers) as resp:
resp.raise_for_status()
inst.ongoing_decode_tokens += input_length
async for chunk in resp.aiter_bytes():
if first_token:
if not prefill_done:
inst.pending_prefill_tokens -= estimated_new
inst.ongoing_decode_tokens += input_length
breakdown["t_first_token"] = _time.monotonic()
first_token = False
prefill_done = True
yield chunk
inst.record_prefix(token_ids)
finally:
inst.ongoing_tokens -= input_length
if not prefill_done:
inst.pending_prefill_tokens -= estimated_new
else:
inst.ongoing_decode_tokens -= input_length
inst.ongoing_tokens -= input_length
inst.num_requests -= 1
breakdown["t_done"] = _time.monotonic()
_breakdown_log.append(breakdown)
@@ -342,14 +392,19 @@ async def _handle_heavy_offload(api, req_data, headers, token_ids, input_length,
_breakdown_log.append(breakdown)
global _offload_inflight
_offload_inflight = max(0, _offload_inflight - 1)
p_inst.num_requests -= 1
raise HTTPException(status_code=502, detail="Prefill failed: %s" % e)
finally:
p_inst.ongoing_tokens -= input_length
p_inst.pending_prefill_tokens -= breakdown.get("estimated_new_tokens", 0)
_offload_inflight = max(0, _offload_inflight - 1)
p_inst.num_requests -= 1
# Step 2: Stream decode on d_inst (pulls KV from Mooncake)
d_inst.ongoing_tokens += input_length
d_inst.ongoing_decode_tokens += input_length
d_inst.num_requests += 1
breakdown["t_decode_sent"] = _time.monotonic()
parsed = urllib.parse.urlparse(str(p_inst.client.base_url))
@@ -377,6 +432,7 @@ async def _handle_heavy_offload(api, req_data, headers, token_ids, input_length,
finally:
d_inst.ongoing_tokens -= input_length
d_inst.ongoing_decode_tokens -= input_length
d_inst.num_requests -= 1
breakdown["t_done"] = _time.monotonic()
_breakdown_log.append(breakdown)
@@ -485,6 +541,20 @@ async def get_breakdown():
return _breakdown_log
@app.get("/stats")
async def get_stats():
"""Return per-instance live state for debugging."""
instances = combined_instances or prefill_instances + decode_instances
return [{
"url": inst.url,
"ongoing_tokens": inst.ongoing_tokens,
"pending_prefill_tokens": inst.pending_prefill_tokens,
"ongoing_decode_tokens": inst.ongoing_decode_tokens,
"num_requests": inst.num_requests,
"cached_blocks": len(inst.cached_blocks),
} for inst in instances]
def parse_args():
p = argparse.ArgumentParser(description="Unified cache-aware global scheduler")
p.add_argument("--port", type=int, default=8000)
@@ -502,6 +572,8 @@ def parse_args():
help="Enable Mooncake KV offload for HEAVY requests (requires kv_both instances)")
p.add_argument("--bootstrap-ports", type=str, default="",
help="Comma-separated bootstrap ports for combined instances (for offload mode)")
p.add_argument("--policy", type=str, default="linear", choices=["linear", "lmetric"],
help="Routing policy: linear (default) or lmetric (P_tokens × BS, OSDI'26)")
args = p.parse_args()
args.prefill = []

149
scripts/run_lmetric_ab.sh Executable file
View File

@@ -0,0 +1,149 @@
#!/bin/bash
# A/B comparison: linear (current baseline) vs lmetric (OSDI'26) routing policy.
# Both use same 8× TP=1 combined instances, fresh restart between experiments.
set -euo pipefail
PROJECT_DIR="/home/admin/cpfs/wjh/agentic-kv"
VENV="$PROJECT_DIR/.venv/bin"
VLLM="$VENV/vllm"
PYTHON="$VENV/python"
MODEL="/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct"
TRACE="$PROJECT_DIR/traces/sampled_1000req_seed42.jsonl"
N_INSTANCES=8
BASE_PORT=8000
PROXY_PORT=9090
REQUEST_LIMIT=200
TIME_SCALE=20
MAX_SESSIONS=8
cleanup() {
for p in $(ps aux | grep 'vllm serve' | grep -v grep | awk '{print $2}'); do kill -9 $p 2>/dev/null; done
for p in $(ps aux | grep 'cache_aware_proxy' | grep -v grep | awk '{print $2}'); do kill -9 $p 2>/dev/null; done
sleep 5
for p in $(fuser /dev/nvidia* 2>/dev/null | tr ' ' '\n' | sort -u); do kill -9 $p 2>/dev/null; done
sleep 10
}
start_instances() {
echo " Starting $N_INSTANCES vLLM instances..."
for i in $(seq 0 $((N_INSTANCES - 1))); do
port=$((BASE_PORT + i))
MASTER_PORT=$((29500 + i)) CUDA_VISIBLE_DEVICES=$i \
$VLLM serve "$MODEL" \
--host 0.0.0.0 --port $port \
--tensor-parallel-size 1 \
--trust-remote-code --enable-prefix-caching --enforce-eager \
--dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
> /tmp/lmetric_ab_inst_$i.log 2>&1 &
done
echo " Waiting for instances..."
for i in $(seq 0 $((N_INSTANCES - 1))); do
port=$((BASE_PORT + i))
timeout 600 bash -c "until curl -s localhost:$port/v1/models > /dev/null 2>&1; do sleep 5; done"
echo " Instance $i (port $port) ready"
done
}
run_experiment() {
local policy=$1
local tag=$2
local outdir="$PROJECT_DIR/outputs/$tag"
mkdir -p "$outdir"
echo " Starting proxy (policy=$policy)..."
$PYTHON "$PROJECT_DIR/scripts/cache_aware_proxy.py" \
--combined $(for i in $(seq 0 $((N_INSTANCES - 1))); do echo -n "http://127.0.0.1:$((BASE_PORT + i)) "; done) \
--policy "$policy" \
--port $PROXY_PORT > /tmp/lmetric_ab_proxy_${policy}.log 2>&1 &
PROXY_PID=$!
sleep 3
# Smoke test
result=$(curl -s -m 30 http://localhost:$PROXY_PORT/v1/completions \
-X POST -H "Content-Type: application/json" \
-d "{\"model\":\"$MODEL\",\"prompt\":[100,200,300],\"max_tokens\":3,\"temperature\":0}" 2>&1)
if ! echo "$result" | grep -q "choices"; then
echo " ERROR: Smoke test failed: $result"
kill $PROXY_PID 2>/dev/null
return 1
fi
echo " Smoke test passed"
# Start GPU monitor
bash "$PROJECT_DIR/scripts/gpu_monitor.sh" > "$outdir/gpu_util.csv" &
GPU_MON_PID=$!
# Run benchmark
echo " Running benchmark (policy=$policy, $REQUEST_LIMIT requests)..."
$PYTHON -m replayer \
--trace "$TRACE" \
--output "$outdir/metrics.jsonl" \
--endpoint "http://localhost:$PROXY_PORT" \
--model "$MODEL" \
--time-scale $TIME_SCALE \
--max-inflight-sessions $MAX_SESSIONS \
--request-limit $REQUEST_LIMIT \
-v
# Save breakdown
curl -s http://localhost:$PROXY_PORT/breakdown > "$outdir/breakdown.json" 2>/dev/null
curl -s http://localhost:$PROXY_PORT/stats > "$outdir/stats.json" 2>/dev/null
# Collect APC from vLLM logs
echo " Collecting APC..."
for i in $(seq 0 $((N_INSTANCES - 1))); do
pch=$(grep "Prefix cache hit rate" /tmp/lmetric_ab_inst_$i.log 2>/dev/null | tail -1 | grep -oP "Prefix cache hit rate: \K[0-9.]+" || echo "0")
echo " inst_$i: prefix=$pch%"
done | tee "$outdir/apc.txt"
kill $GPU_MON_PID 2>/dev/null
kill $PROXY_PID 2>/dev/null
wait $PROXY_PID 2>/dev/null
echo " Done: $(wc -l < "$outdir/metrics.jsonl") requests -> $outdir"
}
echo "================================================================"
echo " A/B: Linear vs LMetric routing policy"
echo " $(date)"
echo "================================================================"
# Experiment 1: Linear (current baseline)
echo ""
echo "=== Experiment 1: Linear policy ==="
cleanup
start_instances
run_experiment "linear" "ab_linear"
# Experiment 2: LMetric (OSDI'26)
echo ""
echo "=== Experiment 2: LMetric policy ==="
cleanup
start_instances
run_experiment "lmetric" "ab_lmetric"
# Compare
echo ""
echo "================================================================"
echo " Results comparison"
echo "================================================================"
$PYTHON -c "
import json, statistics
def summarize(path, label):
rows = [json.loads(l) for l in open(path)]
ok = [r for r in rows if not r.get('error')]
p = lambda v,q: v[min(int(q*len(v)),len(v)-1)] if v else 0
ttfts = sorted([r['ttft_s'] for r in ok if r.get('ttft_s')])
tpots = sorted([r['tpot_s'] for r in ok if r.get('tpot_s') and r['tpot_s']>0])
e2es = sorted([r['latency_s'] for r in ok])
print('%-20s OK=%3d/%3d TTFT50=%.3f TTFT90=%.3f TPOT90=%.3f E2E50=%.3f' % (
label, len(ok), len(rows), p(ttfts,.5), p(ttfts,.9), p(tpots,.9), p(e2es,.5)))
summarize('$PROJECT_DIR/outputs/ab_linear/metrics.jsonl', 'Linear')
summarize('$PROJECT_DIR/outputs/ab_lmetric/metrics.jsonl', 'LMetric')
"
echo ""
echo "Done at $(date)"