Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)

Design: offload HEAVY prefill only when P instance is less loaded than D AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D for future KV reuse. External KV correctly registered in prefix cache.

Result (67/200 processed, 75% success):
- TTFT p50: 0.551s (-49% vs baseline 1.080s)
- TTFT p90: 4.135s (-56% vs baseline 9.410s)
- TPOT p90: 0.074s (same as baseline)
- E2E p50: 2.938s (-45% vs baseline 5.306s)

25% error rate from ReadTimeout on very large HEAVY requests queuing on P. Needs stricter elastic gate or higher timeout. But successful requests show significant improvement over both baseline and previous P2P.

Also: added external_prefix_cache metrics tracking to replayer summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 13:50:25 +08:00
parent e9e313f9c5
commit 1d2eeb4925
3 changed files with 156 additions and 14 deletions
--- a/analysis/elastic_offload_design.md
+++ b/analysis/elastic_offload_design.md
@@ -0,0 +1,115 @@
+# Elastic P2P Offload Design
+
+**Date**: 2026-05-22
+**Context**: P2P offload TTFT p50 improved 13% but p90 worsened 59%. Root cause: P instance overloaded (serving its own requests + heavy offload). KV transfer itself is only 0.5s, not the bottleneck. External KV correctly registered in prefix cache (no bug).
+
+---
+
+## 1. Problem
+
+Current P2P offload: HEAVY requests ALWAYS offloaded to a different instance.
+- When P instance is idle → offload is beneficial (isolates heavy prefill from D's decode)
+- When P instance is busy → offload queues behind P's own work → TTFT p90 explodes
+
+## 2. Design: Elastic Offload with Load-Aware Decision
+
+**Core idea**: Offload is a PREFERENCE, not a mandate. The scheduler makes a runtime decision per-request:
+
+```
+For each HEAVY request:
+  1. Compute offload_benefit = estimated decode disruption saved on D instance
+  2. Compute offload_cost = P instance queue delay + KV transfer time
+  3. if offload_benefit > offload_cost → OFFLOAD
+     else → COLOCATE (do P+D on session-sticky instance)
+```
+
+### 2.1 Offload Decision Function
+
+```python
+def should_offload(estimated_new_tokens, d_inst, p_inst):
+    """Decide whether to offload this HEAVY request."""
+    
+    # Cost: how long will P take? (queue + compute)
+    p_queue_time = p_inst.ongoing_tokens / PREFILL_THROUGHPUT  # seconds
+    p_compute_time = estimated_new_tokens / PREFILL_THROUGHPUT
+    kv_transfer_time = 0.5  # empirical constant from our measurements
+    offload_cost = p_queue_time + kv_transfer_time  # p_compute_time same either way
+    
+    # Benefit: how much would colocated prefill disrupt D's decode?
+    # If D is currently decoding (ongoing_decode_tokens > 0), disruption is real.
+    # If D is idle, there's no disruption to avoid.
+    d_is_decoding = d_inst.ongoing_decode_tokens > 0
+    disruption_time = (estimated_new_tokens / PREFILL_THROUGHPUT) if d_is_decoding else 0
+    offload_benefit = disruption_time * 0.5  # chunked prefill doesn't fully block decode
+    
+    return offload_benefit > offload_cost
+```
+
+### 2.2 Simplified Heuristic (for implementation)
+
+The above is complex. Simpler version:
+
+```python
+def should_offload(estimated_new_tokens, d_inst, p_inst):
+    """Offload only if P is significantly less loaded than D."""
+    # Don't offload if P is more loaded than D (would make things worse)
+    if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
+        return False
+    # Don't offload if P is already heavily loaded (queue too long)
+    avg_load = sum(inst.ongoing_tokens for inst in all_instances) / len(all_instances)
+    if p_inst.ongoing_tokens > avg_load * 1.5:
+        return False
+    # Offload if D is currently busy with decode
+    if d_inst.ongoing_decode_tokens > 0:
+        return True
+    # D is idle — no benefit from offloading
+    return False
+```
+
+### 2.3 Key Properties
+
+1. **HEAVY + P idle + D busy** → OFFLOAD (best case: P has capacity, D benefits from isolation)
+2. **HEAVY + P busy** → COLOCATE (P would queue, no benefit)
+3. **HEAVY + D idle** → COLOCATE (no decode to disrupt)
+4. **WARM/MEDIUM** → always COLOCATE (small prefill, not worth transfer overhead)
+
+### 2.4 Expected Behavior Under Load
+
+```
+Low load (few concurrent requests):
+  Most instances idle → P always available → most HEAVY offloaded
+  
+Medium load (8 concurrent sessions):
+  Some instances busy → offload only when P is free
+  ~50% of HEAVY offloaded, ~50% colocated
+  
+High load (all instances busy):
+  No instance has spare capacity → almost nothing offloaded
+  Falls back to pure combined mode (which is optimal under high load)
+```
+
+This naturally adapts: offload when there's spare capacity, colocate when system is saturated.
+
+## 3. Metrics to Track
+
+Per-request breakdown (proxy-level):
+- `route_class`: WARM / MEDIUM / HEAVY_P2P / HEAVY_COLO
+- `offload_decision_reason`: "p_idle_d_busy" / "p_overloaded" / "d_idle" / "below_threshold"
+- `t_proxy_recv`, `t_prefill_sent`, `t_prefill_done`, `t_first_token`, `t_done`
+
+Per-instance (from vLLM /metrics + logs):
+- `prefix_cache_hit_rate` (local)
+- `external_prefix_cache_hit_rate` (Mooncake KV)
+- Combined: local + external = total effective APC
+
+GPU utilization (5s sampling):
+- Per-GPU util%, memory usage
+- Detect load imbalance early
+
+## 4. Implementation
+
+Changes to `cache_aware_proxy.py`:
+- Replace fixed `if estimated_new >= HEAVY_THRESHOLD` with `should_offload()` function
+- Track `ongoing_decode_tokens` per instance (already have this)
+- Add `offload_decision_reason` to breakdown log
+- Add `--prefill-throughput` parameter (tokens/s, for cost estimation)
--- a/replayer/replay.py
+++ b/replayer/replay.py
@@ -248,7 +248,8 @@ async def _run_session(

 async def _snapshot_prefix_cache_metrics(url_csv: str) -> dict[str, float]:
    """Scrape vLLM /metrics for prefix cache counters (aggregated across endpoints)."""
-    total = {"queries": 0.0, "hits": 0.0}
+    total = {"queries": 0.0, "hits": 0.0,
+             "external_queries": 0.0, "external_hits": 0.0}
    endpoints = [e.strip() for e in url_csv.split(",")]
    async with httpx.AsyncClient(timeout=10) as c:
        for url in endpoints:
@@ -259,6 +260,10 @@ async def _snapshot_prefix_cache_metrics(url_csv: str) -> dict[str, float]:
                        total["queries"] += float(line.split()[-1])
                    elif line.startswith("vllm:prefix_cache_hits_total"):
                        total["hits"] += float(line.split()[-1])
+                    elif line.startswith("vllm:external_prefix_cache_queries_total"):
+                        total["external_queries"] += float(line.split()[-1])
+                    elif line.startswith("vllm:external_prefix_cache_hits_total"):
+                        total["external_hits"] += float(line.split()[-1])
            except Exception:
                pass
    return total
@@ -328,10 +333,13 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
    delta_queries = post_metrics.get("queries", 0) - pre_metrics.get("queries", 0)
    delta_hits = post_metrics.get("hits", 0) - pre_metrics.get("hits", 0)
    hit_ratio = delta_hits / delta_queries if delta_queries > 0 else 0.0
+    delta_ext_queries = post_metrics.get("external_queries", 0) - pre_metrics.get("external_queries", 0)
+    delta_ext_hits = post_metrics.get("external_hits", 0) - pre_metrics.get("external_hits", 0)
+    ext_hit_ratio = delta_ext_hits / delta_ext_queries if delta_ext_queries > 0 else 0.0

    logger.info("Done: %d/%d succeeded in %.1fs", sum(1 for m in flat if m.error is None), len(flat), sweep_elapsed)
-    logger.info("Prefix cache: %.1f%% hit ratio (%d/%d tokens)",
-                hit_ratio * 100, int(delta_hits), int(delta_queries))
+    logger.info("Prefix cache: local=%.1f%% external=%.1f%%",
+                hit_ratio * 100, ext_hit_ratio * 100)

    # Append cache stats to summary
    import json as _json
@@ -339,6 +347,9 @@ async def replay_trace(config: ReplayConfig) -> list[RequestMetrics]:
    summary["prefix_cache_queries_tokens"] = int(delta_queries)
    summary["prefix_cache_hits_tokens"] = int(delta_hits)
    summary["prefix_cache_hit_ratio"] = hit_ratio
+    summary["external_cache_queries_tokens"] = int(delta_ext_queries)
+    summary["external_cache_hits_tokens"] = int(delta_ext_hits)
+    summary["external_cache_hit_ratio"] = ext_hit_ratio
    summary["wall_clock_s"] = sweep_elapsed
    summary_path.write_text(_json.dumps(summary, indent=2, sort_keys=True))

--- a/scripts/cache_aware_proxy.py
+++ b/scripts/cache_aware_proxy.py
@@ -230,26 +230,41 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
    }

    offload_enabled = getattr(global_args, 'offload', False) if global_args else False
-    use_offload = (estimated_new >= HEAVY_THRESHOLD and offload_enabled
-                   and len(combined_instances) >= 2
-                   and any(inst.bootstrap_port for inst in combined_instances))
+    has_bootstrap = any(inst.bootstrap_port for inst in combined_instances)

-    if use_offload:
-        # HEAVY P2P OFFLOAD: D on session-sticky instance, P on a DIFFERENT
-        # least-loaded instance (any instance can serve as P for others).
+    # Elastic offload decision: offload only when it helps
+    use_offload = False
+    offload_reason = "disabled"
+    if estimated_new >= HEAVY_THRESHOLD and offload_enabled and has_bootstrap and len(combined_instances) >= 2:
        d_inst = best_inst
-        d_idx = best_idx
-
-        # P instance: least ongoing_tokens EXCLUDING D.
-        # CRITICAL: increment ongoing_tokens IMMEDIATELY to prevent race condition
-        # where multiple concurrent HEAVY requests all pick the same P instance.
        p_candidates = [inst for inst in combined_instances if inst is not d_inst]
        p_inst = min(p_candidates, key=lambda x: x.ongoing_tokens)
+        avg_load = max(sum(i.ongoing_tokens for i in combined_instances) / len(combined_instances), 1.0)
+
+        # Decision logic:
+        # 1. P must be less loaded than D (otherwise offload makes things worse)
+        # 2. P must not be overloaded (ongoing > 1.5x average = would queue too long)
+        # 3. D should be currently decoding (otherwise no disruption to avoid)
+        if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
+            offload_reason = "p_busier_than_d"
+        elif p_inst.ongoing_tokens > avg_load * 1.5:
+            offload_reason = "p_overloaded"
+        elif d_inst.ongoing_decode_tokens == 0 and d_inst.ongoing_tokens < avg_load * 0.5:
+            offload_reason = "d_idle_no_benefit"
+        else:
+            use_offload = True
+            offload_reason = "p_available_d_busy"
+
+    if use_offload:
+        d_idx = best_idx
        p_inst.ongoing_tokens += input_length  # reserve immediately

        breakdown["route_class"] = "HEAVY_P2P"
+        breakdown["offload_reason"] = offload_reason
        breakdown["p_inst"] = p_inst.url
        breakdown["d_inst"] = d_inst.url
+        breakdown["p_load"] = p_inst.ongoing_tokens
+        breakdown["d_load"] = d_inst.ongoing_tokens
        if session_id:
            session_affinity[session_id] = d_idx

@@ -258,6 +273,7 @@ async def _handle_combined(api, req_data, token_ids, input_length, session_id, h
    else:
        if estimated_new >= HEAVY_THRESHOLD:
            breakdown["route_class"] = "HEAVY_COLO"
+            breakdown["offload_reason"] = offload_reason
        else:
            breakdown["route_class"] = "WARM" if estimated_new < 5000 else "MEDIUM"
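
Review note: the elastic gate in `_handle_combined` can be pulled out as a pure function so it is testable without a running proxy. This is a minimal sketch, not the committed code. `Instance`, `elastic_offload_decision`, `overload_factor` and `idle_factor` are hypothetical names introduced here; the field names, thresholds and reason strings are copied from the diff above, and `ongoing_tokens` is assumed to be non-negative.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for the proxy's per-instance state."""
    url: str
    ongoing_tokens: int = 0
    ongoing_decode_tokens: int = 0


def elastic_offload_decision(d_inst, instances, overload_factor=1.5, idle_factor=0.5):
    """Return (use_offload, reason, p_inst) for one HEAVY request.

    d_inst is the session-sticky instance; P is the least-loaded other instance.
    """
    p_candidates = [i for i in instances if i is not d_inst]
    if not p_candidates:
        return False, "disabled", None
    p_inst = min(p_candidates, key=lambda i: i.ongoing_tokens)
    # Floor the average at 1.0, as the diff does, so an idle cluster is handled.
    avg_load = max(sum(i.ongoing_tokens for i in instances) / len(instances), 1.0)

    if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
        return False, "p_busier_than_d", p_inst
    if p_inst.ongoing_tokens > avg_load * overload_factor:
        return False, "p_overloaded", p_inst
    if d_inst.ongoing_decode_tokens == 0 and d_inst.ongoing_tokens < avg_load * idle_factor:
        return False, "d_idle_no_benefit", p_inst
    return True, "p_available_d_busy", p_inst


if __name__ == "__main__":
    d = Instance("http://d", ongoing_tokens=1000, ongoing_decode_tokens=200)
    others = [Instance("http://a", ongoing_tokens=100), Instance("http://b", ongoing_tokens=500)]
    print(elastic_offload_decision(d, [d, *others]))
```

One observation from working through the sketch: with non-negative loads the `p_overloaded` branch appears unreachable. Once P passes the first check it is below D, so it is the least-loaded instance overall and cannot exceed the average, let alone 1.5x of it. If that reading is right, the gate currently reduces to "P below D, and D not idle", which may be relevant to the "needs stricter elastic gate" remark in the commit message.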