# Elastic P2P Offload Design

**Date**: 2026-05-22
**Context**: P2P offload TTFT p50 improved 13% but p90 worsened 59%. Root cause: P instance overloaded (serving its own requests + heavy offload). KV transfer itself is only 0.5s, not the bottleneck. External KV correctly registered in prefix cache (no bug).

---

## 1. Problem

Current P2P offload: HEAVY requests are ALWAYS offloaded to a different instance (P) than the session-sticky one (D).
- When the P instance is idle → offload is beneficial (it isolates heavy prefill from D's decode)
- When the P instance is busy → the offloaded prefill queues behind P's own work → TTFT p90 explodes

## 2. Design: Elastic Offload with Load-Aware Decision

**Core idea**: Offload is a PREFERENCE, not a mandate. The scheduler makes a runtime decision per-request:

```
For each HEAVY request:
  1. Compute offload_benefit = estimated decode disruption saved on D instance
  2. Compute offload_cost = P instance queue delay + KV transfer time
  3. if offload_benefit > offload_cost → OFFLOAD
     else → COLOCATE (do P+D on session-sticky instance)
```

### 2.1 Offload Decision Function

```python
def should_offload(estimated_new_tokens, d_inst, p_inst):
    """Decide whether to offload this HEAVY request.

    PREFILL_THROUGHPUT (tokens/s) is a module-level setting (see --prefill-throughput).
    """
    # Cost: the extra delay offloading adds (P's queue + KV transfer).
    # Prefill compute time is the same on either instance, so it is left out.
    p_queue_time = p_inst.ongoing_tokens / PREFILL_THROUGHPUT  # seconds
    kv_transfer_time = 0.5  # seconds; empirical constant from our measurements
    offload_cost = p_queue_time + kv_transfer_time

    # Benefit: how much would colocated prefill disrupt D's decode?
    # If D is currently decoding (ongoing_decode_tokens > 0), disruption is real.
    # If D is idle, there's no disruption to avoid.
    d_is_decoding = d_inst.ongoing_decode_tokens > 0
    disruption_time = (estimated_new_tokens / PREFILL_THROUGHPUT) if d_is_decoding else 0.0
    offload_benefit = disruption_time * 0.5  # chunked prefill doesn't fully block decode

    return offload_benefit > offload_cost
```
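
Rearranging `offload_benefit > offload_cost` gives a break-even prefill size, which makes the decision easier to reason about. The sketch below assumes D is decoding and reuses the 0.5 s transfer time and 0.5 disruption factor from the function above; `min_tokens_to_offload` is a hypothetical helper, and the 10,000 tokens/s throughput is an illustrative figure, not a measurement.

```python
def min_tokens_to_offload(p_ongoing_tokens, prefill_throughput,
                          kv_transfer_time=0.5, disruption_factor=0.5):
    """Prefill size (tokens) above which offload wins, assuming D is decoding.

    Derived from: disruption_factor * N / T > Q / T + kv_transfer_time
    where N = new tokens, Q = P's ongoing tokens, T = prefill throughput.
    """
    return (p_ongoing_tokens + kv_transfer_time * prefill_throughput) / disruption_factor


# Illustrative throughput only (assumed, not measured).
idle_p = min_tokens_to_offload(0, 10_000)        # P has nothing queued
busy_p = min_tokens_to_offload(20_000, 10_000)   # P has 20k tokens queued
print(idle_p, busy_p)
```

With the default constants this reduces to N > 2Q + T: every token already queued on P raises the break-even prefill size by two tokens, which is why a busy P should rarely receive offloaded work.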

### 2.2 Simplified Heuristic (for implementation)

The cost model above depends on a calibrated throughput estimate. A simpler version compares instance load directly:

```python
from statistics import mean

def should_offload(estimated_new_tokens, d_inst, p_inst, all_instances):
    """Offload only if P is less loaded than D, P is not overloaded, and D is decoding.

    estimated_new_tokens is not used by this heuristic.
    """
    # Don't offload if P is at least as loaded as D (would make things worse)
    if p_inst.ongoing_tokens >= d_inst.ongoing_tokens:
        return False
    # Don't offload if P is already heavily loaded (queue too long)
    avg_load = mean(inst.ongoing_tokens for inst in all_instances)
    if p_inst.ongoing_tokens > avg_load * 1.5:
        return False
    # Offload only if D is busy with decode; an idle D has nothing to disrupt
    return d_inst.ongoing_decode_tokens > 0

### 2.3 Key Properties

1. **HEAVY + P idle + D busy** → OFFLOAD (best case: P has capacity, D benefits from isolation)
2. **HEAVY + P busy** → COLOCATE (P would queue, no benefit)
3. **HEAVY + D idle** → COLOCATE (no decode to disrupt)
4. **WARM/MEDIUM** → always COLOCATE (small prefill, not worth transfer overhead)
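
Properties 1 to 3 can be checked against the Section 2.2 heuristic with a small table of scenarios. The sketch below is an assumption-laden illustration: `Inst` is a hypothetical stand-in for the proxy's per-instance state, `classify_heavy` is a hypothetical variant that also returns one of the `offload_decision_reason` strings from Section 3, and the token counts are made up. Property 4 (WARM/MEDIUM) is decided before this function is reached, so it is not modeled here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Inst:
    """Hypothetical per-instance load snapshot."""
    ongoing_tokens: int
    ongoing_decode_tokens: int = 0


def classify_heavy(d_inst, p_inst, all_instances):
    """Return (offload, reason) for a HEAVY request, following Section 2.2."""
    avg_load = mean(i.ongoing_tokens for i in all_instances)
    if (p_inst.ongoing_tokens >= d_inst.ongoing_tokens
            or p_inst.ongoing_tokens > 1.5 * avg_load):
        return False, "p_overloaded"
    if d_inst.ongoing_decode_tokens == 0:
        return False, "d_idle"
    return True, "p_idle_d_busy"


# Property 1: P idle, D busy decoding.
d1, p1 = Inst(6000, 2000), Inst(0)
# Property 2: P busier than D.
d2, p2 = Inst(3000, 1000), Inst(9000)
# Property 3: D has queued work but no active decode.
d3, p3 = Inst(4000, 0), Inst(0)

for d, p in [(d1, p1), (d2, p2), (d3, p3)]:
    print(classify_heavy(d, p, [d, p]))
```

Note that the P-load checks run first, so a request where P is busy and D is idle is logged as `p_overloaded` rather than `d_idle`.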

### 2.4 Expected Behavior Under Load

```
Low load (few concurrent requests):
  Most instances idle → P always available → most HEAVY offloaded
  
Medium load (8 concurrent sessions):
  Some instances busy → offload only when P is free
  ~50% of HEAVY offloaded, ~50% colocated
  
High load (all instances busy):
  No instance has spare capacity → almost nothing offloaded
  Falls back to pure combined mode (which is optimal under high load)
```

This naturally adapts: offload when there's spare capacity, colocate when system is saturated.

## 3. Metrics to Track

Per-request breakdown (proxy-level):
- `route_class`: WARM / MEDIUM / HEAVY_P2P / HEAVY_COLO
- `offload_decision_reason`: "p_idle_d_busy" / "p_overloaded" / "d_idle" / "below_threshold"
- `t_proxy_recv`, `t_prefill_sent`, `t_prefill_done`, `t_first_token`, `t_done`
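
One possible shape for the per-request record is sketched below. The field names come from the list above; the derived intervals (`ttft_s`, `proxy_queue_s`, `prefill_s`, `handoff_s`, `decode_s`) and the JSON log format are assumptions about how the timestamps would be consumed, not the existing log schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RequestBreakdown:
    route_class: str               # WARM / MEDIUM / HEAVY_P2P / HEAVY_COLO
    offload_decision_reason: str   # e.g. "p_idle_d_busy"
    t_proxy_recv: float            # all timestamps in seconds
    t_prefill_sent: float
    t_prefill_done: float
    t_first_token: float
    t_done: float

    def derived(self):
        """Intervals computed from the raw timestamps (assumed interpretation)."""
        return {
            "ttft_s": self.t_first_token - self.t_proxy_recv,
            "proxy_queue_s": self.t_prefill_sent - self.t_proxy_recv,
            "prefill_s": self.t_prefill_done - self.t_prefill_sent,
            "handoff_s": self.t_first_token - self.t_prefill_done,
            "decode_s": self.t_done - self.t_first_token,
        }

    def to_log_line(self):
        """Serialize raw fields plus derived intervals as one JSON line."""
        return json.dumps({**asdict(self), **self.derived()})


# Made-up timestamps for illustration.
rec = RequestBreakdown("HEAVY_P2P", "p_idle_d_busy", 0.0, 0.25, 2.25, 2.75, 10.0)
print(rec.to_log_line())
```

Splitting TTFT into queue, prefill, and handoff intervals is what lets the p90 regression be attributed to P-side queueing rather than to KV transfer.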

Per-instance (from vLLM /metrics + logs):
- `prefix_cache_hit_rate` (local)
- `external_prefix_cache_hit_rate` (Mooncake KV)
- Combined: local + external = total effective APC (automatic prefix caching) hit rate
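
Adding the two rates is only valid when both share the same denominator (tokens looked up). A minimal sketch under that assumption follows; the counter names are hypothetical and do not correspond to specific vLLM or Mooncake metric names.

```python
def effective_apc_hit_rate(queried_tokens, local_hit_tokens, external_hit_tokens):
    """Combined hit rate over a shared denominator of queried tokens.

    Assumes local and external hits are disjoint, i.e. a token served from the
    local prefix cache is not also counted as an external hit.
    """
    if queried_tokens == 0:
        return 0.0
    return (local_hit_tokens + external_hit_tokens) / queried_tokens


# Made-up counters for illustration.
rate = effective_apc_hit_rate(queried_tokens=1000,
                              local_hit_tokens=600,
                              external_hit_tokens=250)
print(rate)
```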

GPU utilization (5s sampling):
- Per-GPU util%, memory usage
- Detect load imbalance early
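
A simple way to flag imbalance from one 5s sample is to compare the busiest GPU against the mean. The sketch below is one possible check; `imbalance_ratio`, `is_imbalanced`, and the 1.5 threshold are assumptions for illustration, and how the utilization samples are collected is left out.

```python
def imbalance_ratio(gpu_utils):
    """Ratio of the busiest GPU's utilization to the mean; 1.0 means balanced."""
    avg = sum(gpu_utils) / len(gpu_utils)
    if avg == 0:
        return 1.0  # all GPUs idle: treat as balanced
    return max(gpu_utils) / avg


def is_imbalanced(gpu_utils, threshold=1.5):
    """Flag a sample whose busiest GPU exceeds the mean by the given factor."""
    return imbalance_ratio(gpu_utils) > threshold


# Made-up utilization samples (percent) for illustration.
balanced = [50, 50, 50, 50]
skewed = [90, 10, 10, 10]
print(imbalance_ratio(balanced), imbalance_ratio(skewed))
```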

## 4. Implementation

Changes to `cache_aware_proxy.py`:
- Replace the fixed `if estimated_new >= HEAVY_THRESHOLD` check with a call to `should_offload()`
- Track `ongoing_decode_tokens` per instance (already have this)
- Add `offload_decision_reason` to breakdown log
- Add `--prefill-throughput` parameter (tokens/s, for cost estimation)
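
The new flag could be wired up roughly as follows. This is a standalone sketch using `argparse`; the existing parser in `cache_aware_proxy.py` and its other flags are not shown here, and making the flag required (rather than choosing a default) is an assumption.

```python
import argparse


def positive_float(text):
    """argparse type: a float strictly greater than zero."""
    value = float(text)
    if value <= 0:
        raise argparse.ArgumentTypeError("must be > 0")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="cache-aware proxy (sketch)")
    parser.add_argument(
        "--prefill-throughput",
        type=positive_float,
        required=True,  # assumption: no safe default exists across hardware
        help="estimated prefill throughput in tokens/s, used for offload cost estimation",
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--prefill-throughput", "8000"])
print(args.prefill_throughput)
```

Rejecting non-positive values at parse time avoids a division by zero when the value is later used as the denominator in the queue-time estimate.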