Gahow Wang d6e47d3742 Design doc: Adaptive Prefill Offload

All 8 GPUs stay PD-combined. Global scheduler classifies requests as
WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache.
Only HEAVY requests (20%, cold start >20k new tokens) get offloaded;
80% of requests are co-located with zero KV transfer.

This avoids the KV cache memory wall (no decode concentration) while
isolating heavy prefills from decode when needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 00:44:22 +08:00

Adaptive Prefill Offload: Design Document

Date: 2026-05-22
Status: Design, pending implementation + experiment
Context: Our PD-Sep experiments showed that static P/D partitioning hurts agentic workloads due to KV cache memory wall on decode instances (97.1% usage, 87.7% of TTFT spent waiting for KV cache). Meanwhile, cache-aware routing on PD-combined instances is the dominant optimization. This design combines the best of both.

1. Problem Statement

Static PD separation has two fundamental issues for agentic workloads:

KV cache memory concentration: Decode instances receive all KV from all requests, filling their KV cache (97.1%). Small requests wait 100+s for large requests to finish and release memory.
All-or-nothing KV transfer: Every request must transfer KV from P→D, even when 22% of requests have >90% cache hit and only need 1.3k new tokens of prefill (nearly free).

But PD co-location also has a problem: heavy cold-start prefills (55% of requests, avg 17.7k new tokens) can temporarily disrupt decode on the same GPU.

Goal: Get the decode isolation benefit of PD-Sep for heavy prefills, without the KV cache memory wall, and without KV transfer overhead for lightweight prefills.

2. Design

2.1 Architecture

All 8 GPUs run PD-combined vLLM instances (no Mooncake, no KV transfer by default). A global scheduler classifies each request and routes accordingly:

                     ┌──────────────────┐
                     │  Global Scheduler │
                     │  (cache_aware +   │
                     │   adaptive offload)│
                     └────────┬─────────┘
                              │ classify
                 ┌────────────┼────────────┐
                 │            │            │
              WARM         MEDIUM        HEAVY
           (cache >50%,   (cold, new    (cold, new
            new <5k tok)   <T tokens)    ≥T tokens)
                 │            │            │
           Same instance  Same instance  Offload:
           P+D co-located P+D co-located P on inst_A (least loaded)
           Zero overhead  Zero overhead  D on inst_B (different)
                                         KV transfer A→B

2.2 Request Classification

Based on our trace analysis:

Class	Criteria	% of requests	Avg new tokens	Action
WARM	Cache hit >50% OR new_tokens < 5k	~36%	1.3k-5k	Same-instance P+D
MEDIUM	Cold, new_tokens < T	~44%	5k-T	Same-instance P+D
HEAVY	Cold, new_tokens ≥ T	~20%	17k+	Offload: P on A, D on B

Threshold T is a tunable parameter (default: 20k tokens based on trace p50).
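The classification rules in the table can be sketched as a small function. This is an illustrative sketch, not the proxy's actual code: the names `classify`, `RequestClass`, `WARM_HIT_RATIO`, and `WARM_NEW_TOKENS` are hypothetical, and the WARM rule follows the table's "OR" wording (the diagram in 2.1 could also be read as "AND").

```python
from enum import Enum

WARM_HIT_RATIO = 0.5       # "Cache hit >50%" from the table
WARM_NEW_TOKENS = 5_000    # "new_tokens < 5k" from the table
HEAVY_THRESHOLD = 20_000   # threshold T, default from the text


class RequestClass(Enum):
    WARM = "WARM"
    MEDIUM = "MEDIUM"
    HEAVY = "HEAVY"


def classify(input_length: int, cached_tokens: int,
             heavy_threshold: int = HEAVY_THRESHOLD) -> RequestClass:
    """Classify a request from its prompt length and estimated prefix-cache hit."""
    new_tokens = max(input_length - cached_tokens, 0)
    hit_ratio = cached_tokens / input_length if input_length > 0 else 0.0

    # Assumption: WARM uses OR, as written in the table.
    if hit_ratio > WARM_HIT_RATIO or new_tokens < WARM_NEW_TOKENS:
        return RequestClass.WARM
    # Everything else is "cold"; split on the tunable threshold T.
    if new_tokens < heavy_threshold:
        return RequestClass.MEDIUM
    return RequestClass.HEAVY
```

For example, `classify(40_000, 5_000)` has a 12.5% hit ratio and 35k new tokens, so it falls into HEAVY under the default threshold.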

2.3 Routing Logic

def route(request):
    """Return (target, mode). target is one instance, or a (p_inst, d_inst) pair for OFFLOAD."""
    # affinity_table, pick_by_score, and pick_least_loaded are proxy-level
    # state/helpers that are not shown here.

    # 1. Session affinity (multi-turn reuse)
    if request.session_id in affinity_table:
        return affinity_table[request.session_id], "COLOCATED"

    # 2. Estimate cache hit
    best_inst = pick_by_score(ongoing_tokens, cache_hit)
    estimated_new_tokens = request.input_length - best_inst.estimate_cache_hit(request)

    # 3. Classify
    if estimated_new_tokens < HEAVY_THRESHOLD:
        # WARM or MEDIUM: co-located P+D
        affinity_table[request.session_id] = best_inst
        return best_inst, "COLOCATED"
    else:
        # HEAVY: offload
        # Pick D instance: best cache hit (will hold KV for decode)
        d_inst = best_inst  # or pick_by_score for decode
        # Pick P instance: least ongoing_tokens (will do compute-heavy prefill)
        p_inst = pick_least_loaded(exclude=d_inst)
        # Later turns of this session go to the instance that holds the KV
        affinity_table[request.session_id] = d_inst
        return (p_inst, d_inst), "OFFLOAD"

2.4 Offload Flow (HEAVY requests only)

t=0:  Scheduler sends prefill request to inst_A
      inst_A computes prefill (heavy, e.g. 30k new tokens)
      inst_A pushes KV to Mooncake DRAM pool via RDMA

t=Xms: Scheduler (await) receives prefill completion
        Sends decode request to inst_B
        inst_B pulls KV from Mooncake (or directly from inst_A)
        inst_B starts decode

      Key: inst_B's KV cache only holds this ONE offloaded request's KV
      plus its own co-located requests' KV. No concentration problem.
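The two-step flow above (await prefill on inst_A, then stream decode from inst_B) can be sketched with `asyncio`. This is a minimal sketch under stated assumptions: `StubInstance`, `handle_offload`, and the `kv_handle` string are hypothetical stand-ins, and the real proxy would issue HTTP requests to vLLM and rely on Mooncake for the KV transfer.

```python
import asyncio


class StubInstance:
    """Hypothetical stand-in for a vLLM instance; records calls instead of doing work."""

    def __init__(self, name: str):
        self.name = name
        self.log: list[str] = []

    async def prefill(self, request_id: str) -> str:
        await asyncio.sleep(0)  # placeholder for the heavy prefill + KV push
        self.log.append(f"prefill:{request_id}")
        return f"kv-{request_id}"  # hypothetical handle to the transferred KV

    async def decode(self, request_id: str, kv_handle: str):
        self.log.append(f"decode:{request_id}:{kv_handle}")
        for token in ("tok0", "tok1", "tok2"):
            await asyncio.sleep(0)  # placeholder for one decode step
            yield token


async def handle_offload(request_id: str, p_inst: StubInstance, d_inst: StubInstance):
    # Step 1: wait for prefill to finish on inst_A before contacting inst_B.
    kv_handle = await p_inst.prefill(request_id)
    # Step 2: stream decode tokens from inst_B, which consumes the KV.
    async for token in d_inst.decode(request_id, kv_handle):
        yield token


async def main() -> list[str]:
    inst_a, inst_b = StubInstance("inst_A"), StubInstance("inst_B")
    return [tok async for tok in handle_offload("req-1", inst_a, inst_b)]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The point of the sketch is the ordering: decode on inst_B never starts until the prefill on inst_A has completed, so inst_B spends no compute on the heavy prefill.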

2.5 Why This Avoids the KV Cache Memory Wall

In pure PD-Sep with 6P+2D:

2 decode GPUs each hold KV for ~50% of ALL requests → 97% KV cache usage

In adaptive offload:

Each of 8 GPUs holds KV for ~12.5% of requests (their own co-located + some offloaded)
Only 20% of requests are offloaded (80% have zero transfer)
Expected KV cache pressure: ~30-40% per instance (well below saturation)
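The per-GPU shares quoted above follow from simple division, assuming requests spread evenly over the instances that hold decode KV. The helper name `kv_share_per_instance` is hypothetical, and the sketch does not derive the 30-40% pressure estimate, which also depends on request sizes and lifetimes.

```python
def kv_share_per_instance(num_kv_holders: int) -> float:
    """Fraction of all requests whose KV lands on each instance, assuming an even spread."""
    if num_kv_holders <= 0:
        raise ValueError("need at least one instance holding KV")
    return 1.0 / num_kv_holders


pd_sep = kv_share_per_instance(2)    # 6P+2D: only the 2 decode GPUs hold decode KV
adaptive = kv_share_per_instance(8)  # all 8 combined instances hold KV
print(f"PD-Sep: {pd_sep:.1%}, adaptive: {adaptive:.1%}, ratio: {pd_sep / adaptive:.0f}x")
# prints "PD-Sep: 50.0%, adaptive: 12.5%, ratio: 4x"
```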

3. Implementation Plan

3.1 Changes to `cache_aware_proxy.py`

Add --heavy-threshold T parameter
In _handle(), classify request before routing
For COLOCATED: same as current _handle_combined() (stream directly)
For OFFLOAD: pick P and D instances separately, await-prefill, then stream decode (reuse current _handle_pd_sep() logic but only for heavy requests)
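A minimal sketch of the new flag and the dispatch decision, using the standard `argparse` module. Only the `--heavy-threshold` flag name comes from this document; `build_parser` and `pick_mode` are hypothetical names, and the real `_handle()` would call the existing `_handle_combined()` or `_handle_pd_sep()` paths instead of returning a string.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="cache-aware proxy (sketch)")
    parser.add_argument(
        "--heavy-threshold",
        type=int,
        default=20_000,
        help="estimated new tokens at or above which a request is HEAVY",
    )
    return parser


def pick_mode(estimated_new_tokens: int, heavy_threshold: int) -> str:
    """Return the handling path the proxy would take for one request."""
    return "OFFLOAD" if estimated_new_tokens >= heavy_threshold else "COLOCATED"


# Usage: parse an explicit argument list rather than sys.argv for the demo.
args = build_parser().parse_args(["--heavy-threshold", "10000"])
print(pick_mode(12_000, args.heavy_threshold))  # prints "OFFLOAD"
```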

3.2 Instance Setup

All 8 instances: standard vLLM with --enable-prefix-caching
Additionally, all instances have Mooncake kv_connector with kv_role=kv_both (can produce AND consume KV)
Or simpler: 8 combined instances + only HEAVY requests go through Mooncake proxy path

Simplified v1: No Mooncake. HEAVY requests just go to the least-loaded instance for co-located P+D (same as MEDIUM), but the scheduler avoids sending them to instances that are already doing decode for other sessions.

v2 with Mooncake: HEAVY requests do P on one instance, KV transfer, D on another.

3.3 Simplified v1 (No Mooncake, Pure Routing)

The simplest version: all requests are co-located, but the scheduler is aware of request weight and avoids overloading any single instance with heavy prefills while it's decoding.

def route_v1(request, instances):
    # Estimate new tokens against the instance with the best prefix-cache hit.
    # pick_best_cache_hit and pick_least_decode_load are proxy helpers not shown here.
    best = pick_best_cache_hit(request, instances)
    new_tokens = request.input_length - best.estimate_cache_hit(request)

    if new_tokens >= HEAVY_THRESHOLD:
        # HEAVY: pick instance with least DECODE load (not least total load)
        # This avoids sending heavy prefill to an instance busy decoding
        return pick_least_decode_load(instances)
    else:
        # WARM/MEDIUM: pick best cache hit
        return best

This doesn't eliminate P-D interference but minimizes it by routing heavy prefills away from busy decode instances. No KV transfer needed.
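A self-contained version of the v1 routing rule, for experimenting outside the proxy. The `Instance` fields below are hypothetical simplifications: the real proxy estimates cache hits per request and tracks decode load itself.

```python
from dataclasses import dataclass

HEAVY_THRESHOLD = 20_000


@dataclass
class Instance:
    name: str
    cached_prefix_tokens: int     # hypothetical: tokens of this request already cached here
    decode_tokens_in_flight: int  # hypothetical proxy-side estimate of decode load


def route_v1(input_length: int, instances: list[Instance],
             heavy_threshold: int = HEAVY_THRESHOLD) -> Instance:
    # Best cache hit = instance that already holds the most of this prompt.
    best = max(instances, key=lambda inst: inst.cached_prefix_tokens)
    new_tokens = input_length - best.cached_prefix_tokens
    if new_tokens >= heavy_threshold:
        # HEAVY: go where the least decode work is in flight.
        return min(instances, key=lambda inst: inst.decode_tokens_in_flight)
    # WARM/MEDIUM: stay with the best cache hit.
    return best


instances = [Instance("gpu0", 1_000, 50_000), Instance("gpu1", 0, 2_000)]
print(route_v1(30_000, instances).name)  # prints "gpu1" (29k new tokens, HEAVY)
print(route_v1(6_000, instances).name)   # prints "gpu0" (5k new tokens, not HEAVY)
```

Note the trade-off this makes visible: a HEAVY request gives up whatever small cache hit it had on `gpu0` in exchange for not disturbing that instance's decode.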

4. Experiment Plan

Exp A: Baseline (Combined cache-aware, current)

8 combined instances, cache-aware + token-level LB
Same as gpu_ab_combined

Exp B: Adaptive v1 (routing-only, no Mooncake)

8 combined instances, adaptive scheduler with HEAVY_THRESHOLD=20k
HEAVY requests routed to least-decode-load instance
WARM/MEDIUM requests routed by cache-hit + token-level LB

Exp C: Threshold Ablation

Same as Exp B but with HEAVY_THRESHOLD = 10k, 20k, 40k
Find optimal threshold
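Before running the ablation, it may help to check offline how many requests each threshold would mark as HEAVY. The sketch below uses synthetic new-token counts that are not taken from the trace; `heavy_fraction` is a hypothetical helper name.

```python
def heavy_fraction(new_token_counts: list[int], threshold: int) -> float:
    """Share of requests that would be classified HEAVY at this threshold."""
    if not new_token_counts:
        return 0.0
    heavy = sum(1 for n in new_token_counts if n >= threshold)
    return heavy / len(new_token_counts)


# Synthetic per-request new-token estimates, NOT taken from the trace.
sample = [1_300, 4_000, 8_000, 12_000, 15_000, 18_000, 22_000, 30_000, 45_000, 60_000]

for threshold in (10_000, 20_000, 40_000):
    print(f"T={threshold}: {heavy_fraction(sample, threshold):.0%} HEAVY")
```

Feeding the real per-request estimates from the trace into the same function would show how far each candidate threshold moves the share of rerouted requests away from the ~20% assumed in 2.2.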

Exp D: Adaptive v2 (with Mooncake offload) — if v1 shows promise

8 combined+Mooncake instances (kv_role=kv_both)
HEAVY requests: P on least-loaded, KV transfer, D on best-cache-hit
WARM/MEDIUM: co-located, no transfer

Metrics per experiment

TTFT p50/p90 (breakdown by WARM/MEDIUM/HEAVY)
TPOT p50/p90
E2E p50/p90
GPU utilization (5s sampling)
KV cache usage per instance
Error rate
Per-request breakdown (proxy timestamps)
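The per-class TTFT breakdown could be computed from proxy timestamps roughly as follows. This is a sketch under assumptions: the `(request_class, ttft_seconds)` record shape is hypothetical, and the nearest-rank percentile definition is one common choice, not necessarily the one used in earlier experiments.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_by_class(records: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group TTFT samples by request class and report p50/p90 for each."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for request_class, ttft_seconds in records:
        grouped[request_class].append(ttft_seconds)
    return {
        cls: {"p50": percentile(vals, 50), "p90": percentile(vals, 90)}
        for cls, vals in grouped.items()
    }
```

The same `percentile` helper applies unchanged to TPOT and E2E samples.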

5. Expected Outcomes

Metric	Combined baseline	Adaptive v1	Adaptive v2
TTFT (warm)	Same	Same	Same
TTFT (heavy)	Sometimes slow (P blocks D)	Better (routed away)	Best (P on separate GPU)
TPOT	0.073s	≤0.073s	≤0.073s
KV cache pressure	Low	Low	Low
KV transfer overhead	None	None	Only for heavy (20%)
Complexity	Low	Low	Medium (Mooncake)

6. Relationship to Prior Work

DistServe/Splitwise: Static P/D partition → bad for agentic (KV cache wall)
PPD (Li et al. 2026): "Not All Prefills Are Equal" → same insight, but PPD uses dedicated P nodes. We use all-combined with dynamic offload.
agentic-pd-hybrid KVC v3: 1P+7D with session-aware routing → found that the overlap scheduler makes the TPOT impact of prefill negligible on SGLang. Our approach is similar but without a dedicated P instance.
This work: All-combined + adaptive offload = no dedicated nodes, no KV cache wall, selective KV transfer only for heavy requests.
