Design doc: Adaptive Prefill Offload

All 8 GPUs stay PD-combined. Global scheduler classifies requests as WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache. Only HEAVY requests (20%, cold start >20k new tokens) get offloaded; 80% of requests are co-located with zero KV transfer. This avoids the KV cache memory wall (no decode concentration) while isolating heavy prefills from decode when needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:44:22 +08:00
parent 445e491123
commit d6e47d3742
1 changed files with 196 additions and 0 deletions
--- a/analysis/adaptive_prefill_offload_design.md
+++ b/analysis/adaptive_prefill_offload_design.md
@@ -0,0 +1,196 @@
+# Adaptive Prefill Offload: Design Document
+
+**Date**: 2026-05-22
+**Status**: Design, pending implementation + experiment
+**Context**: Our PD-Sep experiments showed that static P/D partitioning hurts agentic workloads because of the KV cache memory wall on decode instances (97.1% KV cache usage, with 87.7% of TTFT spent waiting for KV cache). Meanwhile, cache-aware routing on PD-combined instances was the dominant optimization. This design combines the best of both.
+
+---
+
+## 1. Problem Statement
+
+Static PD separation has two fundamental issues for agentic workloads:
+
+1. **KV cache memory concentration**: Decode instances receive the KV of every request, which fills their KV cache (97.1% usage). Small requests wait 100+ s for large requests to finish and release memory.
+
+2. **All-or-nothing KV transfer**: Every request must transfer KV from P→D, even though 22% of requests have a >90% cache hit and need only 1.3k new tokens of prefill (nearly free).
+
+But PD co-location also has a problem: heavy cold-start prefills (55% of requests, avg 17.7k new tokens) can temporarily disrupt decode on the same GPU.
+
+**Goal**: Get the decode isolation benefit of PD-Sep for heavy prefills, without the KV cache memory wall, and without KV transfer overhead for lightweight prefills.
+
+## 2. Design
+
+### 2.1 Architecture
+
+All 8 GPUs run PD-combined vLLM instances (no Mooncake, no KV transfer by default). A global scheduler classifies each request and routes accordingly:
+
+```
+                     ┌──────────────────┐
+                     │  Global Scheduler │
+                     │  (cache_aware +   │
+                     │   adaptive offload)│
+                     └────────┬─────────┘
+                              │ classify
+                 ┌────────────┼────────────┐
+                 │            │            │
+              WARM         MEDIUM        HEAVY
+           (cache >50%,   (cold, new    (cold, new
+            new <5k tok)   <T tokens)    ≥T tokens)
+                 │            │            │
+           Same instance  Same instance  Offload:
+           P+D co-located P+D co-located P on inst_A (least loaded)
+           Zero overhead  Zero overhead  D on inst_B (different)
+                                         KV transfer A→B
+```
+
+### 2.2 Request Classification
+
+Based on our trace analysis:
+
+| Class | Criteria | % of requests | Avg new tokens | Action |
+|-------|----------|---------------|----------------|--------|
+| WARM | Cache hit >50% OR new_tokens < 5k | ~36% | 1.3k-5k | Same-instance P+D |
+| MEDIUM | Cold, new_tokens < T | ~44% | 5k-T | Same-instance P+D |
+| HEAVY | Cold, new_tokens ≥ T | ~20% | 17k+ | Offload: P on A, D on B |
+
+Threshold T is a tunable parameter (default: 20k tokens based on trace p50).
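+
+The table above can be read as a small decision function. The following is a minimal sketch, not the proxy's actual code: the function name `classify`, the constant names, and the use of a cache-hit ratio computed from `cached_tokens / input_length` are assumptions made for illustration.
+
+```python
+from enum import Enum
+
+# Assumed constant names; the values come from the table above.
+WARM_CACHE_HIT = 0.5      # cache-hit ratio above which a request counts as WARM
+WARM_NEW_TOKENS = 5_000   # new-token count below which a request counts as WARM
+HEAVY_THRESHOLD = 20_000  # T, the tunable offload threshold
+
+
+class RequestClass(Enum):
+    WARM = "WARM"
+    MEDIUM = "MEDIUM"
+    HEAVY = "HEAVY"
+
+
+def classify(input_length: int, cached_tokens: int,
+             heavy_threshold: int = HEAVY_THRESHOLD) -> RequestClass:
+    """Classify a request from its prompt length and estimated prefix-cache hit."""
+    new_tokens = max(input_length - cached_tokens, 0)
+    hit_ratio = cached_tokens / input_length if input_length else 0.0
+    # WARM: cache hit >50% OR fewer than 5k new tokens.
+    if hit_ratio > WARM_CACHE_HIT or new_tokens < WARM_NEW_TOKENS:
+        return RequestClass.WARM
+    # MEDIUM: cold, but below the offload threshold T.
+    if new_tokens < heavy_threshold:
+        return RequestClass.MEDIUM
+    # HEAVY: cold and at or above T.
+    return RequestClass.HEAVY
+```
+
+Because the WARM rule is checked first, a request with a >50% cache hit is WARM here even when its new-token count exceeds T, whereas the routing logic in 2.3 compares only against the threshold. Which of the two behaviours is intended should be settled before implementation.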
+
+### 2.3 Routing Logic
+
+```python
+def route(request, instances):
+    """Return (mode, prefill_inst, decode_inst) for one request.
+
+    affinity_table, HEAVY_THRESHOLD, pick_by_score and pick_least_loaded are
+    assumed to be proxy-level state and helpers (not shown here).
+    """
+    # 1. Session affinity (multi-turn reuse)
+    if request.session_id in affinity_table:
+        inst = affinity_table[request.session_id]
+        return "COLOCATED", inst, inst
+
+    # 2. Estimate cache hit on the instance with the best score
+    #    (the score combines ongoing_tokens and cache hit)
+    best_inst = pick_by_score(instances, request)
+    estimated_new_tokens = request.input_length - best_inst.estimate_cache_hit(request)
+
+    # 3. Classify
+    if estimated_new_tokens < HEAVY_THRESHOLD:
+        # WARM or MEDIUM: co-located P+D
+        affinity_table[request.session_id] = best_inst
+        return "COLOCATED", best_inst, best_inst
+
+    # HEAVY: offload
+    # Pick P instance: least ongoing_tokens (will do compute-heavy prefill)
+    p_inst = pick_least_loaded(instances, exclude=best_inst)
+    # Pick D instance: best cache hit (will hold KV for decode)
+    d_inst = best_inst  # or pick_by_score for decode
+    affinity_table[request.session_id] = d_inst
+    return "OFFLOAD", p_inst, d_inst
+```
+
+### 2.4 Offload Flow (HEAVY requests only)
+
+```
+t=0:  Scheduler sends prefill request to inst_A
+      inst_A computes prefill (heavy, e.g. 30k new tokens)
+      inst_A pushes KV to Mooncake DRAM pool via RDMA
+
+t=Xms: Scheduler (await) receives prefill completion
+        Sends decode request to inst_B
+        inst_B pulls KV from Mooncake (or directly from inst_A)
+        inst_B starts decode
+
+      Key: inst_B's KV cache only holds this ONE offloaded request's KV
+      plus its own co-located requests' KV. No concentration problem.
+```
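+
+The ordering constraint in this flow (decode on inst_B starts only after prefill on inst_A has completed) can be sketched with `asyncio`. This is an illustrative sketch under stated assumptions: `FakeInstance`, `handle_offload`, and the `kv:` handle string are hypothetical stand-ins, and no real vLLM or Mooncake API is called.
+
+```python
+import asyncio
+
+
+class FakeInstance:
+    """Hypothetical stand-in for a vLLM instance; records the calls it receives."""
+
+    def __init__(self, name, log):
+        self.name, self.log = name, log
+
+    async def prefill(self, request_id):
+        await asyncio.sleep(0)            # placeholder for the heavy prefill compute
+        self.log.append((self.name, "prefill", request_id))
+        return f"kv:{request_id}"         # hypothetical handle to the published KV
+
+    async def decode(self, request_id, kv_handle):
+        self.log.append((self.name, "pull_kv", kv_handle))
+        for tok in ("a", "b", "c"):       # placeholder token stream
+            await asyncio.sleep(0)
+            yield tok
+
+
+async def handle_offload(request_id, p_inst, d_inst):
+    # Await prefill completion on inst_A before inst_B is contacted.
+    kv_handle = await p_inst.prefill(request_id)
+    # Only then start decode on inst_B, which pulls the KV and streams tokens.
+    return [tok async for tok in d_inst.decode(request_id, kv_handle)]
+
+
+log = []
+tokens = asyncio.run(
+    handle_offload("req-1", FakeInstance("inst_A", log), FakeInstance("inst_B", log))
+)
+```
+
+In a real proxy the decode tokens would be streamed to the client as they arrive rather than collected into a list; the list is used here only to keep the sketch self-contained.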
+
+### 2.5 Why This Avoids the KV Cache Memory Wall
+
+In pure PD-Sep with 6P+2D:
+- 2 decode GPUs each hold KV for ~50% of ALL requests → 97% KV cache usage
+
+In adaptive offload:
+- Each of 8 GPUs holds KV for ~12.5% of requests (their own co-located + some offloaded)
+- Only 20% of requests are offloaded (80% have zero transfer)
+- Expected KV cache pressure: ~30-40% per instance (well below saturation), to be confirmed by the experiments in Section 4
+
+## 3. Implementation Plan
+
+### 3.1 Changes to `cache_aware_proxy.py`
+
+1. Add `--heavy-threshold T` parameter
+2. In `_handle()`, classify request before routing
+3. For COLOCATED: same as current `_handle_combined()` (stream directly)
+4. For OFFLOAD: pick P and D instances separately, await-prefill, then stream decode (reuse current `_handle_pd_sep()` logic but only for heavy requests)
+
+### 3.2 Instance Setup
+
+- All 8 instances: standard vLLM with `--enable-prefix-caching`
+- Additionally, all instances have Mooncake kv_connector with `kv_role=kv_both` (can produce AND consume KV)
+- Or simpler: 8 combined instances + only HEAVY requests go through Mooncake proxy path
+
+**Simplified v1**: No Mooncake. HEAVY requests stay co-located P+D (same as MEDIUM), but the scheduler routes them to the instance with the least decode load, so they avoid instances that are already busy decoding for other sessions.
+
+**v2 with Mooncake**: HEAVY requests do P on one instance, KV transfer, D on another.
+
+### 3.3 Simplified v1 (No Mooncake, Pure Routing)
+
+The simplest version: all requests are co-located, but the scheduler is **aware of request weight** and avoids overloading any single instance with heavy prefills while it's decoding.
+
+```python
+def route_v1(request, instances):
+    # Estimate new tokens on the instance with the best cache hit for this request
+    best = pick_best_cache_hit(instances, request)
+    new_tokens = request.input_length - best.estimate_cache_hit(request)
+
+    if new_tokens >= HEAVY_THRESHOLD:
+        # HEAVY: pick instance with least DECODE load (not least total load)
+        # This avoids sending heavy prefill to an instance busy decoding
+        return pick_least_decode_load(instances)
+    else:
+        # WARM/MEDIUM: pick best cache hit
+        return best
+```
+
+This doesn't eliminate P-D interference but **minimizes it by routing heavy prefills away from busy decode instances**. No KV transfer needed.
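+
+The difference between "least decode load" and "least total load" can be made concrete with a small sketch. The `Instance` fields and the tie-breaking rule below are assumptions for illustration; the design does not yet specify how decode load is measured.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Instance:
+    name: str
+    decode_tokens: int    # assumed metric: tokens of requests currently in decode
+    prefill_tokens: int   # assumed metric: tokens of prefills queued or running
+
+
+def pick_least_decode_load(instances):
+    # Primary key: decode load. Ties are broken by total load (an assumption).
+    return min(
+        instances,
+        key=lambda i: (i.decode_tokens, i.decode_tokens + i.prefill_tokens),
+    )
+
+
+def pick_least_total_load(instances):
+    return min(instances, key=lambda i: i.decode_tokens + i.prefill_tokens)
+
+
+# Made-up load figures, chosen so that the two policies disagree.
+instances = [
+    Instance("inst_0", decode_tokens=40_000, prefill_tokens=0),
+    Instance("inst_1", decode_tokens=5_000, prefill_tokens=50_000),
+    Instance("inst_2", decode_tokens=20_000, prefill_tokens=10_000),
+]
+```
+
+With these numbers the decode-load policy picks `inst_1` while the total-load policy picks `inst_2`. The example also shows a possible weakness: a pure decode-load criterion can stack several heavy prefills on one instance, so a combined score may be worth testing in Exp B.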
+
+## 4. Experiment Plan
+
+### Exp A: Baseline (Combined cache-aware, current)
+- 8 combined instances, cache-aware + token-level LB
+- Same as `gpu_ab_combined`
+
+### Exp B: Adaptive v1 (routing-only, no Mooncake)
+- 8 combined instances, adaptive scheduler with HEAVY_THRESHOLD=20k
+- HEAVY requests routed to least-decode-load instance
+- WARM/MEDIUM requests routed by cache-hit + token-level LB
+
+### Exp C: Threshold Ablation
+- Same as Exp B but with HEAVY_THRESHOLD = 10k, 20k, 40k
+- Find optimal threshold
+
+### Exp D: Adaptive v2 (with Mooncake offload) — if v1 shows promise
+- 8 combined+Mooncake instances (kv_role=kv_both)
+- HEAVY requests: P on least-loaded, KV transfer, D on best-cache-hit
+- WARM/MEDIUM: co-located, no transfer
+
+### Metrics per experiment
+- TTFT p50/p90 (breakdown by WARM/MEDIUM/HEAVY)
+- TPOT p50/p90
+- E2E p50/p90
+- GPU utilization (5s sampling)
+- KV cache usage per instance
+- Error rate
+- Per-request breakdown (proxy timestamps)
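+
+The per-class TTFT breakdown could be computed from proxy records along the following lines. This is a sketch only: the `(request_class, ttft_seconds)` record shape and the nearest-rank percentile definition are assumptions, and the sample values are made up rather than measured.
+
+```python
+import math
+from collections import defaultdict
+
+
+def percentile(values, p):
+    """Nearest-rank percentile of a non-empty list; p in (0, 100]."""
+    ordered = sorted(values)
+    rank = math.ceil(p / 100 * len(ordered))
+    return ordered[max(rank, 1) - 1]
+
+
+def ttft_by_class(records):
+    """records: iterable of (request_class, ttft_seconds) pairs."""
+    grouped = defaultdict(list)
+    for cls, ttft in records:
+        grouped[cls].append(ttft)
+    return {
+        cls: {"p50": percentile(v, 50), "p90": percentile(v, 90)}
+        for cls, v in grouped.items()
+    }
+
+
+# Made-up sample values, for illustration only.
+records = [("WARM", 0.2), ("WARM", 0.4), ("WARM", 0.3), ("HEAVY", 5.0), ("HEAVY", 9.0)]
+summary = ttft_by_class(records)
+```
+
+The same grouping applies to TPOT and E2E latency if those values are logged per request alongside the class label.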
+
+## 5. Expected Outcomes
+
+| Metric | Combined baseline | Adaptive v1 | Adaptive v2 |
+|--------|------------------|-------------|-------------|
+| TTFT (warm) | Same | Same | Same |
+| TTFT (heavy) | Sometimes slow (P blocks D) | Better (routed away) | Best (P on separate GPU) |
+| TPOT | 0.073s | ≤0.073s | ≤0.073s |
+| KV cache pressure | Low | Low | Low |
+| KV transfer overhead | None | None | Only for heavy (20%) |
+| Complexity | Low | Low | Medium (Mooncake) |
+
+## 6. Relationship to Prior Work
+
+- **DistServe/Splitwise**: Static P/D partition → bad for agentic (KV cache wall)
+- **PPD (Li et al. 2026)**: "Not All Prefills Are Equal" → same insight, but PPD uses dedicated P nodes. We use all-combined with dynamic offload.
+- **agentic-pd-hybrid KVC v3**: 1P+7D with session-aware routing → found that the overlap scheduler makes the TPOT impact of prefill negligible on SGLang. Our approach is similar but without a dedicated P instance.
+- **This work**: All-combined + adaptive offload = no dedicated nodes, no KV cache wall, selective KV transfer only for heavy requests.