agentic-kvc/analysis/elastic_offload_design.md
Gahow Wang 1d2eeb4925 Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)
Design: offload HEAVY prefill only when P instance is less loaded than D
AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D
for future KV reuse. External KV correctly registered in prefix cache.

Result (67/200 processed, 75% success):
  TTFT p50: 0.551s (-49% vs baseline 1.080s)
  TTFT p90: 4.135s (vs baseline 9.410s, -56%)
  TPOT p90: 0.074s (same as baseline)
  E2E  p50: 2.938s (-45% vs baseline 5.306s)

25% error rate from ReadTimeout on very large HEAVY requests queuing on P.
Needs stricter elastic gate or higher timeout. But successful requests
show significant improvement over both baseline and previous P2P.

Also: added external_prefix_cache metrics tracking to replayer summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 13:50:25 +08:00


Elastic P2P Offload Design

Date: 2026-05-22

Context: P2P offload improved TTFT p50 by 13% but worsened p90 by 59%. Root cause: the P instance was overloaded (serving its own requests plus offloaded heavy prefill). KV transfer itself takes only about 0.5s and is not the bottleneck. External KV is correctly registered in the prefix cache (no bug).


1. Problem

Current P2P offload: HEAVY requests are ALWAYS offloaded to a different instance, regardless of that instance's load.

  • When P instance is idle → offload is beneficial (isolates heavy prefill from D's decode)
  • When P instance is busy → offload queues behind P's own work → TTFT p90 explodes

2. Design: Elastic Offload with Load-Aware Decision

Core idea: Offload is a PREFERENCE, not a mandate. The scheduler makes a runtime decision per-request:

For each HEAVY request:
  1. Compute offload_benefit = estimated decode disruption saved on D instance
  2. Compute offload_cost = P instance queue delay + KV transfer time
  3. if offload_benefit > offload_cost → OFFLOAD
     else → COLOCATE (do P+D on session-sticky instance)

2.1 Offload Decision Function

# PREFILL_THROUGHPUT: prefill throughput in tokens/s, set from --prefill-throughput (see section 4).

def should_offload(estimated_new_tokens, d_inst, p_inst):
    """Decide whether to offload this HEAVY request."""
    # Prefill compute time is the same on P or D, so it is excluded from the cost.
    prefill_time = estimated_new_tokens / PREFILL_THROUGHPUT  # seconds

    # Cost: extra delay from offloading (queueing behind P's ongoing work + KV transfer).
    p_queue_time = p_inst.ongoing_tokens / PREFILL_THROUGHPUT  # seconds
    kv_transfer_time = 0.5  # seconds; empirical constant from our measurements
    offload_cost = p_queue_time + kv_transfer_time

    # Benefit: how much would colocated prefill disrupt D's decode?
    # If D is currently decoding (ongoing_decode_tokens > 0), disruption is real.
    # If D is idle, there's no disruption to avoid.
    d_is_decoding = d_inst.ongoing_decode_tokens > 0
    disruption_time = prefill_time if d_is_decoding else 0.0
    offload_benefit = disruption_time * 0.5  # chunked prefill doesn't fully block decode

    return offload_benefit > offload_cost
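
The inequality in this function has a closed-form break-even point, which helps when sanity-checking the model. When D is decoding, offload_benefit > offload_cost rearranges to estimated_new_tokens > (p_inst.ongoing_tokens + kv_transfer_time * PREFILL_THROUGHPUT) / 0.5. The sketch below computes that threshold. The helper name and the 8000 tokens/s throughput are hypothetical values chosen for illustration, not measurements.

```python
def break_even_tokens(p_ongoing_tokens, prefill_throughput,
                      kv_transfer_time=0.5, disruption_factor=0.5):
    """Prefill size (tokens) above which offload wins, assuming D is decoding.

    Rearranged from:
        disruption_factor * N / T > Q / T + kv_transfer_time
    where N = estimated_new_tokens, Q = P's ongoing tokens, T = throughput.
    """
    return (p_ongoing_tokens + kv_transfer_time * prefill_throughput) / disruption_factor


# Hypothetical throughput, for illustration only.
THROUGHPUT = 8000.0  # tokens/s

idle_p = break_even_tokens(0, THROUGHPUT)      # P has no queued work
busy_p = break_even_tokens(16000, THROUGHPUT)  # P has 16k tokens in flight

print(idle_p, busy_p)
```

Under these assumed numbers, even an idle P only pays off for prefills larger than about one second of prefill work at the 0.5 disruption factor, and the threshold grows by two tokens for every token already queued on P.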

2.2 Simplified Heuristic (for implementation)

The cost model above is complex. A simpler version for implementation:

from statistics import mean

def should_offload(estimated_new_tokens, d_inst, p_inst):
    """Offload only if P is less loaded than D, P is not overloaded, and D is decoding."""
    # all_instances: the proxy's list of per-instance state (defined elsewhere).
    # estimated_new_tokens is unused here; the caller has already classified the request as HEAVY.

    # Don't offload if P is at least as loaded as D (would make things worse)
    if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
        return False
    # Don't offload if P is already heavily loaded (queue too long)
    avg_load = mean(inst.ongoing_tokens for inst in all_instances)
    if p_inst.ongoing_tokens > avg_load * 1.5:
        return False
    # Offload only if D is currently busy with decode; if D is idle, there is no benefit
    return d_inst.ongoing_decode_tokens > 0

2.3 Key Properties

  1. HEAVY + P idle + D busy → OFFLOAD (best case: P has capacity, D benefits from isolation)
  2. HEAVY + P busy → COLOCATE (P would queue, no benefit)
  3. HEAVY + D idle → COLOCATE (no decode to disrupt)
  4. WARM/MEDIUM → always COLOCATE (small prefill, not worth transfer overhead)
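
The properties above can be checked against the simplified heuristic with a small self-contained sketch. The Instance dataclass is a hypothetical stand-in for the proxy's per-instance state, and the instance list is passed in explicitly. The function also returns a reason label. Mapping both P-side rejections to "p_overloaded" is an assumption, since the design does not say which label the first check should use.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Instance:
    # Hypothetical stand-in for the proxy's per-instance counters.
    ongoing_tokens: int
    ongoing_decode_tokens: int = 0


def decide(d_inst, p_inst, all_instances):
    """Return (offload, reason) following the simplified heuristic."""
    if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
        return False, "p_overloaded"  # assumed label for "P not less loaded than D"
    avg_load = mean(inst.ongoing_tokens for inst in all_instances)
    if p_inst.ongoing_tokens > 1.5 * avg_load:
        return False, "p_overloaded"
    if d_inst.ongoing_decode_tokens > 0:
        return True, "p_idle_d_busy"
    return False, "d_idle"


others = [Instance(2000, 100), Instance(2000, 100)]

# Property 1: P idle, D busy with decode -> offload.
d, p = Instance(4000, 512), Instance(0)
case_offload = decide(d, p, [d, p] + others)

# Property 2: P busier than D -> colocate.
d, p = Instance(4000, 512), Instance(6000, 300)
case_p_busy = decide(d, p, [d, p] + others)

# Property 3: D has no decode in flight -> colocate.
d, p = Instance(3000, 0), Instance(0)
case_d_idle = decide(d, p, [d, p] + others)

# P is below D but above 1.5x the cluster average -> colocate.
d, p = Instance(4000, 512), Instance(3000)
case_above_avg = decide(d, p, [d, p] + [Instance(0) for _ in range(6)])
```

One edge case is visible here: a D instance with ongoing_tokens == 0 is rejected by the first check rather than the "d_idle" branch, because P can never be strictly less loaded than an empty D.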

2.4 Expected Behavior Under Load

Low load (few concurrent requests):
  Most instances idle → P always available → most HEAVY offloaded
  
Medium load (8 concurrent sessions):
  Some instances busy → offload only when P is free
  ~50% of HEAVY offloaded, ~50% colocated
  
High load (all instances busy):
  No instance has spare capacity → almost nothing offloaded
  Falls back to pure combined mode (which is optimal under high load)

This naturally adapts: offload when there's spare capacity, colocate when system is saturated.

3. Metrics to Track

Per-request breakdown (proxy-level):

  • route_class: WARM / MEDIUM / HEAVY_P2P / HEAVY_COLO
  • offload_decision_reason: "p_idle_d_busy" / "p_overloaded" / "d_idle" / "below_threshold"
  • t_proxy_recv, t_prefill_sent, t_prefill_done, t_first_token, t_done
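
One possible shape for the per-request breakdown record is sketched below. The field names come from the list above. The class name and the derived quantities (TTFT taken as t_first_token - t_proxy_recv, and so on) are assumptions about how the timestamps would be combined, not the replayer's actual definitions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RequestBreakdown:
    route_class: str               # WARM / MEDIUM / HEAVY_P2P / HEAVY_COLO
    offload_decision_reason: str   # e.g. "p_idle_d_busy"
    t_proxy_recv: float            # all timestamps in seconds
    t_prefill_sent: float
    t_prefill_done: float
    t_first_token: float
    t_done: float

    def derived(self):
        """Durations derived from the raw timestamps (assumed definitions)."""
        return {
            "proxy_overhead_s": self.t_prefill_sent - self.t_proxy_recv,
            "prefill_s": self.t_prefill_done - self.t_prefill_sent,
            "ttft_s": self.t_first_token - self.t_proxy_recv,
            "e2e_s": self.t_done - self.t_proxy_recv,
        }

    def to_json(self):
        """One JSON object per request, suitable for a line-oriented log."""
        return json.dumps({**asdict(self), **self.derived()})


record = RequestBreakdown(
    route_class="HEAVY_P2P",
    offload_decision_reason="p_idle_d_busy",
    t_proxy_recv=10.0,
    t_prefill_sent=10.25,
    t_prefill_done=10.75,
    t_first_token=11.0,
    t_done=13.0,
)
line = record.to_json()
```

Logging the raw timestamps alongside the derived durations keeps the record usable if a definition changes later.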

Per-instance (from vLLM /metrics + logs):

  • prefix_cache_hit_rate (local)
  • external_prefix_cache_hit_rate (Mooncake KV)
  • Combined: local + external = total effective APC (automatic prefix caching) hit rate
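
A minimal sketch of the combined rate, assuming both hit counts are measured in tokens against the same number of queried tokens. If the two rates are reported with different denominators, they cannot simply be added. The function and argument names are hypothetical and are not vLLM metric names.

```python
def effective_hit_rate(local_hit_tokens, external_hit_tokens, queried_tokens):
    """Combined prefix-cache hit rate over a shared denominator.

    Assumes a token is counted as a local hit or an external hit, never both.
    """
    if queried_tokens == 0:
        return 0.0
    return (local_hit_tokens + external_hit_tokens) / queried_tokens


# Illustrative numbers only.
combined = effective_hit_rate(local_hit_tokens=3000,
                              external_hit_tokens=1000,
                              queried_tokens=8000)
```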

GPU utilization (5s sampling):

  • Per-GPU util%, memory usage
  • Detect load imbalance early
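
GPU sampling could be done by polling nvidia-smi in CSV mode and comparing utilization across GPUs. The sketch below separates parsing from the subprocess call so the parser can be exercised without a GPU. The function names and the max-minus-min spread as an imbalance signal are assumptions, not part of the existing tooling.

```python
import subprocess

# nvidia-smi CSV query: one line per GPU, e.g. "0, 95, 40321" (memory in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output into a list of per-GPU samples."""
    samples = []
    for line in text.strip().splitlines():
        idx, util, mem = (field.strip() for field in line.split(","))
        samples.append({"gpu": int(idx), "util_pct": float(util), "mem_mib": float(mem)})
    return samples


def util_spread(samples):
    """Max minus min utilization across GPUs; large values suggest imbalance."""
    utils = [s["util_pct"] for s in samples]
    return max(utils) - min(utils)


def sample_once():
    """Run nvidia-smi once and return parsed samples (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Parsing a captured sample, so no GPU is needed here.
example = parse_gpu_csv("0, 95, 40321\n1, 12, 1024\n")
spread = util_spread(example)
```

For the 5s sampling interval, sample_once() would be called in a loop with time.sleep(5) between iterations, logging each sample with a timestamp.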

4. Implementation

Changes to cache_aware_proxy.py:

  • Replace the fixed offload rule (if estimated_new >= HEAVY_THRESHOLD) with a should_offload() decision for HEAVY requests
  • Track ongoing_decode_tokens per instance (already have this)
  • Add offload_decision_reason to breakdown log
  • Add --prefill-throughput parameter (tokens/s, for cost estimation)
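
The new flag could be wired in with argparse roughly as follows. This is a sketch: the parser layout of cache_aware_proxy.py is not shown in this document, so build_parser() and queue_delay_seconds() are hypothetical names, and making the flag required is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that exposes only the new flag (sketch)."""
    parser = argparse.ArgumentParser(description="cache-aware proxy (sketch)")
    parser.add_argument(
        "--prefill-throughput",
        type=float,
        required=True,
        help="Prefill throughput in tokens/s, used for offload cost estimation",
    )
    return parser


def queue_delay_seconds(ongoing_tokens, prefill_throughput):
    """Estimated time for an instance to drain its ongoing tokens."""
    return ongoing_tokens / prefill_throughput


# argparse maps --prefill-throughput to args.prefill_throughput.
args = build_parser().parse_args(["--prefill-throughput", "8000"])
delay = queue_delay_seconds(4000, args.prefill_throughput)
```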