# Adaptive Prefill Offload: Design Document

**Date**: 2026-05-22
**Status**: Design, pending implementation + experiment
**Context**: Our PD-Sep experiments showed that static P/D partitioning hurts agentic workloads because of the KV cache memory wall on decode instances (97.1% usage, 87.7% of TTFT spent waiting for KV cache). Meanwhile, cache-aware routing on PD-combined instances is the dominant optimization. This design combines the best of both.

---

## 1. Problem Statement

Static PD separation has two fundamental issues for agentic workloads:

1. **KV cache memory concentration**: Decode instances receive the KV of every request, which fills their KV cache (97.1%). Small requests wait 100+ s for large requests to finish and release memory.
2. **All-or-nothing KV transfer**: Every request must transfer KV from P→D, even though 22% of requests have a >90% cache hit and need only 1.3k new tokens of prefill (nearly free).

But PD co-location also has a problem: heavy cold-start prefills (55% of requests, avg 17.7k new tokens) can temporarily disrupt decode on the same GPU.

**Goal**: Get the decode isolation benefit of PD-Sep for heavy prefills, without the KV cache memory wall, and without KV transfer overhead for lightweight prefills.

## 2. Design

### 2.1 Architecture

All 8 GPUs run PD-combined vLLM instances (no Mooncake, no KV transfer by default). A global scheduler classifies each request and routes accordingly:

```
          ┌──────────────────┐
          │ Global Scheduler │
          │ (cache_aware +   │
          │ adaptive offload)│
          └────────┬─────────┘
                   │ classify
      ┌────────────┼────────────┐
      │            │            │
    WARM        MEDIUM        HEAVY
(cache >50%,  (cold, new    (cold, new
 new <5k tok)  < T)          ≥ T)
```

### 2.2 Request Classification

| Class | Condition | Share of requests | New tokens | Handling |
|-------|-----------|-------------------|------------|----------|
| WARM | Cache hit >50% OR new_tokens < 5k | ~36% | 1.3k-5k | Same-instance P+D |
| MEDIUM | Cold, new_tokens < T | ~44% | 5k-T | Same-instance P+D |
| HEAVY | Cold, new_tokens ≥ T | ~20% | 17k+ | Offload: P on A, D on B |

Threshold T is a tunable parameter (default: 20k tokens based on trace p50).

### 2.3 Routing Logic

```python
HEAVY_THRESHOLD = 20_000  # T: new-token cutoff above which a request is offloaded
affinity_table = {}       # session_id -> instance holding that session's KV

def route(request):
    """Return (mode, prefill_instance, decode_instance).

    pick_by_score() and pick_least_loaded() are scheduler helpers defined elsewhere.
    """
    # 1. Session affinity (multi-turn reuse)
    if request.session_id in affinity_table:
        inst = affinity_table[request.session_id]
        return "COLOCATED", inst, inst

    # 2. Estimate cache hit (score combines ongoing_tokens and cache_hit)
    best_inst = pick_by_score(request)
    estimated_new_tokens = request.input_length - best_inst.estimate_cache_hit(request)

    # 3. Classify
    if estimated_new_tokens < HEAVY_THRESHOLD:
        # WARM or MEDIUM: co-located P+D
        affinity_table[request.session_id] = best_inst
        return "COLOCATED", best_inst, best_inst

    # HEAVY: offload
    # Pick P instance: least ongoing_tokens (will do compute-heavy prefill)
    p_inst = pick_least_loaded(exclude=best_inst)
    # Pick D instance: best cache hit (will hold KV for decode)
    d_inst = best_inst  # or pick_by_score for decode
    affinity_table[request.session_id] = d_inst
    return "OFFLOAD", p_inst, d_inst
```

### 2.4 Offload Flow (HEAVY requests only)

```
t=0:    Scheduler sends prefill request to inst_A
        inst_A computes prefill (heavy, e.g. 30k new tokens)
        inst_A pushes KV to Mooncake DRAM pool via RDMA

t=Xms:  Scheduler (await) receives prefill completion
        Sends decode request to inst_B
        inst_B pulls KV from Mooncake (or directly from inst_A)
        inst_B starts decode

Key: inst_B's KV cache only holds this ONE offloaded request's KV
     plus its own co-located requests' KV. No concentration problem.
```

### 2.5 Why This Avoids the KV Cache Memory Wall

In pure PD-Sep with 6P+2D:
- 2 decode GPUs each hold KV for ~50% of ALL requests → 97% KV cache usage

In adaptive offload:
- Each of 8 GPUs holds KV for ~12.5% of requests (their own co-located + some offloaded)
- Only 20% of requests are offloaded (80% have zero transfer)
- KV cache pressure: ~30-40% per instance (well below saturation)

## 3. Implementation Plan

### 3.1 Changes to `cache_aware_proxy.py`

1. Add `--heavy-threshold T` parameter
2. In `_handle()`, classify request before routing
3. For COLOCATED: same as current `_handle_combined()` (stream directly)
4. For OFFLOAD: pick P and D instances separately, await-prefill, then stream decode (reuse current `_handle_pd_sep()` logic but only for heavy requests)

### 3.2 Instance Setup

- All 8 instances: standard vLLM with `--enable-prefix-caching`
- Additionally, all instances have Mooncake kv_connector with `kv_role=kv_both` (can produce AND consume KV)
- Or simpler: 8 combined instances + only HEAVY requests go through Mooncake proxy path

**Simplified v1**: No Mooncake. HEAVY requests just go to the least-loaded instance for co-located P+D (same as MEDIUM), but the scheduler avoids sending them to instances that are already doing decode for other sessions.

**v2 with Mooncake**: HEAVY requests do P on one instance, KV transfer, D on another.

### 3.3 Simplified v1 (No Mooncake, Pure Routing)

The simplest version: all requests are co-located, but the scheduler is **aware of request weight** and avoids overloading any single instance with heavy prefills while it is decoding.

```python
def route_v1(request, instances):
    # Estimate new tokens against the instance with the best cache hit
    best = pick_best_cache_hit(instances, request)
    new_tokens = request.input_length - best.estimate_cache_hit(request)

    if new_tokens >= HEAVY_THRESHOLD:
        # HEAVY: pick instance with least DECODE load (not least total load)
        # This avoids sending heavy prefill to an instance busy decoding
        return pick_least_decode_load(instances)

    # WARM/MEDIUM: pick best cache hit
    return best
```

This doesn't eliminate P-D interference but **minimizes it by routing heavy prefills away from busy decode instances**. No KV transfer needed.

## 4. Experiment Plan

### Exp A: Baseline (Combined cache-aware, current)

- 8 combined instances, cache-aware + token-level LB
- Same as `gpu_ab_combined`

### Exp B: Adaptive v1 (routing-only, no Mooncake)

- 8 combined instances, adaptive scheduler with HEAVY_THRESHOLD=20k
- HEAVY requests routed to least-decode-load instance
- WARM/MEDIUM requests routed by cache-hit + token-level LB

### Exp C: Threshold Ablation

- Same as Exp B but with HEAVY_THRESHOLD = 10k, 20k, 40k
- Find optimal threshold

### Exp D: Adaptive v2 (with Mooncake offload), if v1 shows promise

- 8 combined+Mooncake instances (kv_role=kv_both)
- HEAVY requests: P on least-loaded, KV transfer, D on best-cache-hit
- WARM/MEDIUM: co-located, no transfer

### Metrics per experiment

- TTFT p50/p90 (breakdown by WARM/MEDIUM/HEAVY)
- TPOT p50/p90
- E2E p50/p90
- GPU utilization (5s sampling)
- KV cache usage per instance
- Error rate
- Per-request breakdown (proxy timestamps)

## 5. Expected Outcomes

| Metric | Combined baseline | Adaptive v1 | Adaptive v2 |
|--------|------------------|-------------|-------------|
| TTFT (warm) | Same | Same | Same |
| TTFT (heavy) | Sometimes slow (P blocks D) | Better (routed away) | Best (P on separate GPU) |
| TPOT | 0.073 s | ≤0.073 s | ≤0.073 s |
| KV cache pressure | Low | Low | Low |
| KV transfer overhead | None | None | Only for heavy (20%) |
| Complexity | Low | Low | Medium (Mooncake) |

## 6. Relationship to Prior Work

- **DistServe/Splitwise**: Static P/D partition → bad for agentic (KV cache wall)
- **PPD (Li et al. 2026)**: "Not All Prefills Are Equal" → same insight, but PPD uses dedicated P nodes. We use all-combined with dynamic offload.
- **agentic-pd-hybrid KVC v3**: 1P+7D with session-aware routing → found that the overlap scheduler makes the TPOT impact of prefill negligible on SGLang. Our approach is similar but without dedicated P.
- **This work**: All-combined + adaptive offload = no dedicated nodes, no KV cache wall, selective KV transfer only for heavy requests.
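
As a concrete illustration of the WARM/MEDIUM/HEAVY classes from Section 2 and the routing-only v1 rule from Section 3.3, the following is a minimal, self-contained sketch rather than the proxy's actual code. The `Instance` dataclass, its `decode_tokens` and `cached_tokens` fields, and the function signatures are hypothetical stand-ins for whatever state `cache_aware_proxy.py` tracks. The 5k and 50% WARM cutoffs and the 20k default for T are taken from the design above. `classify()` is intended only for labeling (for example, the per-class TTFT breakdown), while `route_v1()` follows the new-token threshold check of the v1 pseudocode.

```python
from dataclasses import dataclass
from typing import List

WARM_NEW_TOKENS = 5_000   # WARM cutoff on new tokens (from the design)
WARM_CACHE_HIT = 0.50     # WARM cutoff on cache-hit ratio (from the design)
HEAVY_THRESHOLD = 20_000  # T, the design's default


@dataclass
class Instance:
    """Hypothetical per-instance view held by the scheduler."""
    name: str
    decode_tokens: int   # assumed metric: tokens of requests currently decoding
    cached_tokens: int   # assumed: prefix tokens of THIS request cached here


def classify(input_length: int, cached_tokens: int,
             heavy_threshold: int = HEAVY_THRESHOLD) -> str:
    """Label a request as WARM, MEDIUM, or HEAVY."""
    new_tokens = input_length - cached_tokens
    hit_ratio = cached_tokens / input_length if input_length else 0.0
    if hit_ratio > WARM_CACHE_HIT or new_tokens < WARM_NEW_TOKENS:
        return "WARM"
    if new_tokens < heavy_threshold:
        return "MEDIUM"
    return "HEAVY"


def route_v1(input_length: int, instances: List[Instance],
             heavy_threshold: int = HEAVY_THRESHOLD) -> Instance:
    """Routing-only v1: heavy prefills go to the least decode-loaded instance."""
    best = max(instances, key=lambda i: i.cached_tokens)
    new_tokens = input_length - best.cached_tokens
    if new_tokens >= heavy_threshold:
        # HEAVY: avoid instances that are busy decoding
        return min(instances, key=lambda i: i.decode_tokens)
    # WARM/MEDIUM: stay with the best cache hit
    return best


if __name__ == "__main__":
    fleet = [
        Instance("inst_a", decode_tokens=9_000, cached_tokens=0),
        Instance("inst_b", decode_tokens=100, cached_tokens=0),
        Instance("inst_c", decode_tokens=5_000, cached_tokens=28_000),
    ]
    print(classify(30_000, 28_000), route_v1(30_000, fleet).name)
    print(classify(60_000, 28_000), route_v1(60_000, fleet).name)
```

In this sketch the threshold check in `route_v1()` looks only at the new-token count, so a request with a cache hit just under 50% and a very long input can be routed as heavy, which matches the v1 pseudocode. Exp C's threshold ablation could be driven by passing a different `heavy_threshold`.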