Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)

Design: offload HEAVY prefill only when the P instance is less loaded than
D AND P is not overloaded (at most 1.5x the average load). D stays on the
session-sticky instance for future KV reuse. External KV is correctly
registered in the prefix cache.

Result (67/200 processed, 75% success):
  TTFT p50: 0.551s (-49% vs baseline 1.080s)
  TTFT p90: 4.135s (vs baseline 9.410s, -56%)
  TPOT p90: 0.074s (same as baseline)
  E2E  p50: 2.938s (-45% vs baseline 5.306s)

25% of requests failed with ReadTimeout: very large HEAVY requests queued
on P. This needs a stricter elastic gate or a higher timeout. Requests that
succeeded improve on both the baseline and the previous P2P run.

Also: added external_prefix_cache metrics tracking to replayer summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 13:50:25 +08:00
parent e9e313f9c5
commit 1d2eeb4925
3 changed files with 156 additions and 14 deletions


@@ -0,0 +1,115 @@
# Elastic P2P Offload Design

**Date**: 2026-05-22

**Context**: P2P offload TTFT p50 improved 13% but p90 worsened 59%. Root cause: P instance overloaded (serving its own requests + heavy offload). KV transfer itself is only 0.5s, not the bottleneck. External KV correctly registered in prefix cache (no bug).

---
## 1. Problem
The current P2P offload ALWAYS sends HEAVY requests to a different instance:
- When the P instance is idle → offload is beneficial (it isolates heavy prefill from D's decode)
- When the P instance is busy → the offloaded prefill queues behind P's own work → TTFT p90 degrades (59% worse in the previous run)
## 2. Design: Elastic Offload with Load-Aware Decision
**Core idea**: Offload is a PREFERENCE, not a mandate. The scheduler makes a runtime decision per-request:
```
For each HEAVY request:
  1. Compute offload_benefit = estimated decode disruption saved on D instance
  2. Compute offload_cost = P instance queue delay + KV transfer time
  3. if offload_benefit > offload_cost → OFFLOAD
     else → COLOCATE (do P+D on session-sticky instance)
```
### 2.1 Offload Decision Function
```python
def should_offload(estimated_new_tokens, d_inst, p_inst):
    """Decide whether to offload this HEAVY request."""
    # Cost: how long will P take? (queue + compute)
    p_queue_time = p_inst.ongoing_tokens / PREFILL_THROUGHPUT  # seconds
    p_compute_time = estimated_new_tokens / PREFILL_THROUGHPUT
    kv_transfer_time = 0.5  # empirical constant from our measurements
    offload_cost = p_queue_time + kv_transfer_time  # p_compute_time same either way

    # Benefit: how much would colocated prefill disrupt D's decode?
    # If D is currently decoding (ongoing_decode_tokens > 0), disruption is real.
    # If D is idle, there's no disruption to avoid.
    d_is_decoding = d_inst.ongoing_decode_tokens > 0
    disruption_time = (estimated_new_tokens / PREFILL_THROUGHPUT) if d_is_decoding else 0
    offload_benefit = disruption_time * 0.5  # chunked prefill doesn't fully block decode

    return offload_benefit > offload_cost
```
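
The cost model above can be rearranged into a single threshold. While D is decoding, `offload_benefit > offload_cost` reads `0.5 * N / T > Q / T + 0.5`, which simplifies to `N > 2 * Q + T`, where N is `estimated_new_tokens`, Q is `p_inst.ongoing_tokens` and T is `PREFILL_THROUGHPUT`. The sketch below checks this numerically. The `Instance` class is a hypothetical stand-in for the proxy's instance state, and the throughput of 10,000 tokens/s is an assumed value for illustration, not a measurement.

```python
from dataclasses import dataclass

PREFILL_THROUGHPUT = 10_000.0  # tokens/s; ASSUMED value, for illustration only
KV_TRANSFER_TIME = 0.5         # seconds; the empirical constant quoted above
DISRUPTION_FACTOR = 0.5        # chunked prefill only partly blocks decode


@dataclass
class Instance:
    """Hypothetical stand-in for the proxy's per-instance load counters."""
    ongoing_tokens: int = 0
    ongoing_decode_tokens: int = 0


def offload_margin(estimated_new_tokens, d_inst, p_inst):
    """Return offload_benefit - offload_cost in seconds (positive means offload)."""
    cost = p_inst.ongoing_tokens / PREFILL_THROUGHPUT + KV_TRANSFER_TIME
    d_is_decoding = d_inst.ongoing_decode_tokens > 0
    disruption = estimated_new_tokens / PREFILL_THROUGHPUT if d_is_decoding else 0.0
    return disruption * DISRUPTION_FACTOR - cost


def break_even_tokens(p_inst):
    """Prompt size above which offload wins while D is decoding: N > 2 * Q + T."""
    return 2 * p_inst.ongoing_tokens + PREFILL_THROUGHPUT


d = Instance(ongoing_tokens=4_000, ongoing_decode_tokens=4_000)
p = Instance(ongoing_tokens=5_000)
print(break_even_tokens(p))          # threshold for this P load
print(offload_margin(30_000, d, p))  # above the threshold: positive margin
print(offload_margin(15_000, d, p))  # below the threshold: negative margin
```

Under these assumptions every 1,000 tokens queued on P raises the break-even prompt size by 2,000 tokens, so the model stops offloading quickly once P has a backlog.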
### 2.2 Simplified Heuristic (for implementation)
The above is complex. Simpler version:
```python
from statistics import mean

def should_offload(estimated_new_tokens, d_inst, p_inst, all_instances):
    """Offload only if P is significantly less loaded than D."""
    # estimated_new_tokens is unused in this version.
    # Don't offload if P is at least as loaded as D (would make things worse)
    if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
        return False
    # Don't offload if P is already heavily loaded (queue too long)
    avg_load = mean(inst.ongoing_tokens for inst in all_instances)
    if p_inst.ongoing_tokens > avg_load * 1.5:
        return False
    # Offload only if D is currently busy with decode;
    # if D is idle there is no benefit from offloading
    return d_inst.ongoing_decode_tokens > 0
```
### 2.3 Key Properties
1. **HEAVY + P idle + D busy** → OFFLOAD (best case: P has capacity, D benefits from isolation)
2. **HEAVY + P busy** → COLOCATE (P would queue, no benefit)
3. **HEAVY + D idle** → COLOCATE (no decode to disrupt)
4. **WARM/MEDIUM** → always COLOCATE (small prefill, not worth transfer overhead)
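
The first three properties can be checked directly against the simplified heuristic. This is a minimal sketch: `Instance` is a hypothetical stand-in for the proxy's instance state, the load figures are invented, and `all_instances` is passed explicitly as an assumption of the sketch.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Instance:
    """Hypothetical stand-in for the proxy's per-instance load counters."""
    ongoing_tokens: int = 0
    ongoing_decode_tokens: int = 0


def should_offload(estimated_new_tokens, d_inst, p_inst, all_instances):
    """Simplified heuristic from 2.2; all_instances is passed explicitly here."""
    if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
        return False  # P is at least as loaded as D
    avg_load = mean(inst.ongoing_tokens for inst in all_instances)
    if p_inst.ongoing_tokens > avg_load * 1.5:
        return False  # P is overloaded relative to the average
    return d_inst.ongoing_decode_tokens > 0  # only worthwhile if D is decoding


busy_d = Instance(ongoing_tokens=8_000, ongoing_decode_tokens=2_000)
idle_p = Instance()
busy_p = Instance(ongoing_tokens=9_000, ongoing_decode_tokens=1_000)
prefill_only_d = Instance(ongoing_tokens=6_000)  # no decode in flight

cases = {
    "P idle, D busy": should_offload(20_000, busy_d, idle_p, [busy_d, idle_p]),
    "P busy": should_offload(20_000, busy_d, busy_p, [busy_d, busy_p]),
    "D not decoding": should_offload(20_000, prefill_only_d, idle_p, [prefill_only_d, idle_p]),
}
for name, decision in cases.items():
    print(f"{name}: {'OFFLOAD' if decision else 'COLOCATE'}")
```

One consequence worth checking: if P is always chosen as the least-loaded instance other than D, then passing the first check means every instance carries at least P's load. P then cannot exceed 1.5x the average, so the second check can only fire when P is selected some other way.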
### 2.4 Expected Behavior Under Load
```
Low load (few concurrent requests):
  Most instances idle → P always available → most HEAVY offloaded

Medium load (8 concurrent sessions):
  Some instances busy → offload only when P is free
  ~50% of HEAVY offloaded, ~50% colocated

High load (all instances busy):
  No instance has spare capacity → almost nothing offloaded
  Falls back to pure combined mode (which is optimal under high load)
```
This naturally adapts: offload when there's spare capacity, colocate when system is saturated.
## 3. Metrics to Track
Per-request breakdown (proxy-level):
- `route_class`: WARM / MEDIUM / HEAVY_P2P / HEAVY_COLO
- `offload_decision_reason`: "p_idle_d_busy" / "p_overloaded" / "d_idle" / "below_threshold"
- `t_proxy_recv`, `t_prefill_sent`, `t_prefill_done`, `t_first_token`, `t_done`
Per-instance (from vLLM /metrics + logs):
- `prefix_cache_hit_rate` (local)
- `external_prefix_cache_hit_rate` (Mooncake KV)
- Combined: local + external = total effective APC (automatic prefix caching) hit rate
GPU utilization (5s sampling):
- Per-GPU util%, memory usage
- Detect load imbalance early
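
The per-request timestamps are enough to split a request's wall time into stages. The sketch below shows one way to do that. It assumes all five timestamps are in seconds on the same clock and are present on the record; the stage names (`dispatch`, `handoff`, and so on) are hypothetical labels, and the sample values are made up.

```python
def stage_latencies(breakdown):
    """Split one request's wall time into stages using the proxy timestamps."""
    recv = breakdown["t_proxy_recv"]
    return {
        # Time spent in the proxy before the prefill request goes out.
        "dispatch": breakdown["t_prefill_sent"] - recv,
        # Queueing plus compute on the instance that runs prefill.
        "prefill": breakdown["t_prefill_done"] - breakdown["t_prefill_sent"],
        # Gap between prefill completion and the first decoded token.
        "handoff": breakdown["t_first_token"] - breakdown["t_prefill_done"],
        # Proxy-level time to first token.
        "ttft": breakdown["t_first_token"] - recv,
        # Remaining decode time after the first token.
        "decode": breakdown["t_done"] - breakdown["t_first_token"],
    }


# Illustrative record; the timestamp values are made up, not measured.
example = {
    "route_class": "HEAVY_P2P",
    "offload_decision_reason": "p_idle_d_busy",
    "t_proxy_recv": 0.0,
    "t_prefill_sent": 0.25,
    "t_prefill_done": 1.25,
    "t_first_token": 1.75,
    "t_done": 4.0,
}
stages = stage_latencies(example)
print(stages)
```

Grouping these stages by `route_class` would show whether a slow HEAVY_P2P request spent its time in the prefill stage (consistent with queueing on P) or in the handoff.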
## 4. Implementation
Changes to `cache_aware_proxy.py`:
- Replace fixed `if estimated_new >= HEAVY_THRESHOLD` with `should_offload()` function
- Track `ongoing_decode_tokens` per instance (already have this)
- Add `offload_decision_reason` to breakdown log
- Add `--prefill-throughput` parameter (tokens/s, for cost estimation)
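
A minimal sketch of how the new parameter could be declared with `argparse`. The default of 10,000 tokens/s and the help strings are hypothetical, and the `--offload` flag form is assumed from the proxy reading an `offload` attribute from its parsed arguments.

```python
import argparse


def build_parser():
    """Build a parser for the two offload-related options (sketch only)."""
    parser = argparse.ArgumentParser(description="cache-aware proxy (sketch)")
    parser.add_argument(
        "--offload",
        action="store_true",
        help="enable elastic P2P offload for HEAVY requests",
    )
    parser.add_argument(
        "--prefill-throughput",
        type=float,
        default=10_000.0,  # ASSUMED default, not taken from the source
        help="prefill throughput in tokens/s, used for offload cost estimation",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--offload", "--prefill-throughput", "12000"])
print(args.offload, args.prefill_throughput)
```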


@@ -248,7 +248,8 @@ async def _run_session(
 async def _snapshot_prefix_cache_metrics(url_csv: str) -> dict[str, float]:
     """Scrape vLLM /metrics for prefix cache counters (aggregated across endpoints)."""
-    total = {"queries": 0.0, "hits": 0.0}
+    total = {"queries": 0.0, "hits": 0.0,
+             "external_queries": 0.0, "external_hits": 0.0}
     endpoints = [e.strip() for e in url_csv.split(",")]
     async with httpx.AsyncClient(timeout=10) as c:
         for url in endpoints:
@@ -259,6 +260,10 @@ async def _snapshot_prefix_cache_metrics(url_csv: str) -> dict[str, float]:
                         total["queries"] += float(line.split()[-1])
                     elif line.startswith("vllm:prefix_cache_hits_total"):
                         total["hits"] += float(line.split()[-1])
+                    elif line.startswith("vllm:external_prefix_cache_queries_total"):
+                        total["external_queries"] += float(line.split()[-1])
+                    elif line.startswith("vllm:external_prefix_cache_hits_total"):
+                        total["external_hits"] += float(line.split()[-1])
                 except Exception:
                     pass
     return total
@@ -328,10 +333,13 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
     delta_queries = post_metrics.get("queries", 0) - pre_metrics.get("queries", 0)
     delta_hits = post_metrics.get("hits", 0) - pre_metrics.get("hits", 0)
     hit_ratio = delta_hits / delta_queries if delta_queries > 0 else 0.0
+    delta_ext_queries = post_metrics.get("external_queries", 0) - pre_metrics.get("external_queries", 0)
+    delta_ext_hits = post_metrics.get("external_hits", 0) - pre_metrics.get("external_hits", 0)
+    ext_hit_ratio = delta_ext_hits / delta_ext_queries if delta_ext_queries > 0 else 0.0
     logger.info("Done: %d/%d succeeded in %.1fs", sum(1 for m in flat if m.error is None), len(flat), sweep_elapsed)
-    logger.info("Prefix cache: %.1f%% hit ratio (%d/%d tokens)",
-                hit_ratio * 100, int(delta_hits), int(delta_queries))
+    logger.info("Prefix cache: local=%.1f%% external=%.1f%%",
+                hit_ratio * 100, ext_hit_ratio * 100)
     # Append cache stats to summary
     import json as _json
@@ -339,6 +347,9 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
     summary["prefix_cache_queries_tokens"] = int(delta_queries)
     summary["prefix_cache_hits_tokens"] = int(delta_hits)
     summary["prefix_cache_hit_ratio"] = hit_ratio
+    summary["external_cache_queries_tokens"] = int(delta_ext_queries)
+    summary["external_cache_hits_tokens"] = int(delta_ext_hits)
+    summary["external_cache_hit_ratio"] = ext_hit_ratio
     summary["wall_clock_s"] = sweep_elapsed
     summary_path.write_text(_json.dumps(summary, indent=2, sort_keys=True))


@@ -230,26 +230,41 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
     }
     offload_enabled = getattr(global_args, 'offload', False) if global_args else False
-    use_offload = (estimated_new >= HEAVY_THRESHOLD and offload_enabled
-                   and len(combined_instances) >= 2
-                   and any(inst.bootstrap_port for inst in combined_instances))
+    has_bootstrap = any(inst.bootstrap_port for inst in combined_instances)
-    if use_offload:
-        # HEAVY P2P OFFLOAD: D on session-sticky instance, P on a DIFFERENT
-        # least-loaded instance (any instance can serve as P for others).
+    # Elastic offload decision: offload only when it helps
+    use_offload = False
+    offload_reason = "disabled"
+    if estimated_new >= HEAVY_THRESHOLD and offload_enabled and has_bootstrap and len(combined_instances) >= 2:
         d_inst = best_inst
-        d_idx = best_idx
         # P instance: least ongoing_tokens EXCLUDING D.
-        # CRITICAL: increment ongoing_tokens IMMEDIATELY to prevent race condition
-        # where multiple concurrent HEAVY requests all pick the same P instance.
         p_candidates = [inst for inst in combined_instances if inst is not d_inst]
         p_inst = min(p_candidates, key=lambda x: x.ongoing_tokens)
+        avg_load = max(sum(i.ongoing_tokens for i in combined_instances) / len(combined_instances), 1.0)
+        # Decision logic:
+        # 1. P must be less loaded than D (otherwise offload makes things worse)
+        # 2. P must not be overloaded (ongoing > 1.5x average = would queue too long)
+        # 3. D should be currently decoding (otherwise no disruption to avoid)
+        if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
+            offload_reason = "p_busier_than_d"
+        elif p_inst.ongoing_tokens > avg_load * 1.5:
+            offload_reason = "p_overloaded"
+        elif d_inst.ongoing_decode_tokens == 0 and d_inst.ongoing_tokens < avg_load * 0.5:
+            offload_reason = "d_idle_no_benefit"
+        else:
+            use_offload = True
+            offload_reason = "p_available_d_busy"
+    if use_offload:
+        d_idx = best_idx
         p_inst.ongoing_tokens += input_length  # reserve immediately
         breakdown["route_class"] = "HEAVY_P2P"
+        breakdown["offload_reason"] = offload_reason
         breakdown["p_inst"] = p_inst.url
         breakdown["d_inst"] = d_inst.url
+        breakdown["p_load"] = p_inst.ongoing_tokens
+        breakdown["d_load"] = d_inst.ongoing_tokens
         if session_id:
             session_affinity[session_id] = d_idx
@@ -258,6 +273,7 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
     else:
         if estimated_new >= HEAVY_THRESHOLD:
             breakdown["route_class"] = "HEAVY_COLO"
+            breakdown["offload_reason"] = offload_reason
         else:
             breakdown["route_class"] = "WARM" if estimated_new < 5000 else "MEDIUM"