Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)

Design: offload HEAVY prefill only when P instance is less loaded than D
AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D
for future KV reuse. External KV correctly registered in prefix cache.

Result (67/200 processed, 75% success):
  TTFT p50: 0.551s (-49% vs baseline 1.080s)
  TTFT p90: 4.135s (vs baseline 9.410s, -56%)
  TPOT p90: 0.074s (same as baseline)
  E2E  p50: 2.938s (-45% vs baseline 5.306s)

25% error rate from ReadTimeout on very large HEAVY requests queuing on P.
Needs stricter elastic gate or higher timeout. But successful requests
show significant improvement over both baseline and previous P2P.

Also: added external_prefix_cache metrics tracking to replayer summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-22 13:50:25 +08:00
parent e9e313f9c5
commit 1d2eeb4925
3 changed files with 156 additions and 14 deletions

View File

@@ -248,7 +248,8 @@ async def _run_session(
async def _snapshot_prefix_cache_metrics(url_csv: str) -> dict[str, float]:
"""Scrape vLLM /metrics for prefix cache counters (aggregated across endpoints)."""
total = {"queries": 0.0, "hits": 0.0}
total = {"queries": 0.0, "hits": 0.0,
"external_queries": 0.0, "external_hits": 0.0}
endpoints = [e.strip() for e in url_csv.split(",")]
async with httpx.AsyncClient(timeout=10) as c:
for url in endpoints:
@@ -259,6 +260,10 @@ async def _snapshot_prefix_cache_metrics(url_csv: str) -> dict[str, float]:
total["queries"] += float(line.split()[-1])
elif line.startswith("vllm:prefix_cache_hits_total"):
total["hits"] += float(line.split()[-1])
elif line.startswith("vllm:external_prefix_cache_queries_total"):
total["external_queries"] += float(line.split()[-1])
elif line.startswith("vllm:external_prefix_cache_hits_total"):
total["external_hits"] += float(line.split()[-1])
except Exception:
pass
return total
@@ -328,10 +333,13 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
delta_queries = post_metrics.get("queries", 0) - pre_metrics.get("queries", 0)
delta_hits = post_metrics.get("hits", 0) - pre_metrics.get("hits", 0)
hit_ratio = delta_hits / delta_queries if delta_queries > 0 else 0.0
delta_ext_queries = post_metrics.get("external_queries", 0) - pre_metrics.get("external_queries", 0)
delta_ext_hits = post_metrics.get("external_hits", 0) - pre_metrics.get("external_hits", 0)
ext_hit_ratio = delta_ext_hits / delta_ext_queries if delta_ext_queries > 0 else 0.0
logger.info("Done: %d/%d succeeded in %.1fs", sum(1 for m in flat if m.error is None), len(flat), sweep_elapsed)
logger.info("Prefix cache: %.1f%% hit ratio (%d/%d tokens)",
hit_ratio * 100, int(delta_hits), int(delta_queries))
logger.info("Prefix cache: local=%.1f%% external=%.1f%%",
hit_ratio * 100, ext_hit_ratio * 100)
# Append cache stats to summary
import json as _json
@@ -339,6 +347,9 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
summary["prefix_cache_queries_tokens"] = int(delta_queries)
summary["prefix_cache_hits_tokens"] = int(delta_hits)
summary["prefix_cache_hit_ratio"] = hit_ratio
summary["external_cache_queries_tokens"] = int(delta_ext_queries)
summary["external_cache_hits_tokens"] = int(delta_ext_hits)
summary["external_cache_hit_ratio"] = ext_hit_ratio
summary["wall_clock_s"] = sweep_elapsed
summary_path.write_text(_json.dumps(summary, indent=2, sort_keys=True))