agentic-kvc/microbench/transfer_lifecycle_design.md
Gahow Wang f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle
Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
2026-05-26 00:57:06 +08:00


PD Transfer Lifecycle Breakdown Microbenchmark

Goal

Profile the complete request lifecycle under PD disaggregation, with emphasis on the P→D KV transfer stage. Produce a per-phase latency breakdown as a function of three independent variables:

breakdown(prior_context, current_new_tokens, output_length) → {
    routing_ms, p_queue_ms, p_prefill_ms, 
    zmq_handshake_ms, rdma_transfer_ms, transfer_completion_signal_ms,
    d_block_alloc_ms, d_cache_promotion_ms, d_schedule_ms, d_first_decode_ms,
    d_decode_total_ms
}

Background: vLLM PD Transfer Semantics

Transfer is incremental:

  • D uses its local prefix cache for prior turns (blocks with matching hashes)
  • P only transfers the delta: ext_tokens = remote_total - D_local_cache_hits
  • D combines: local prefix cache + remote-transferred blocks + locally-computed remainder

Therefore, prior_context (already cached on D) determines how much P actually transfers.
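
As a rough illustration of the delta rule above, the sketch below computes how many tokens and blocks P would send for a given cache state on D. The function name, the 16-token block size (borrowed from the block-size formula later in this document), and the rounding of D's hits down to whole blocks are assumptions for illustration, not vLLM's actual code path.

```python
def remote_transfer_plan(remote_total_tokens: int,
                         d_local_hit_tokens: int,
                         block_size: int = 16) -> tuple[int, int]:
    """Return (ext_tokens, ext_blocks) that P must transfer to D.

    Hypothetical helper: assumes prefix-cache hits only count in whole blocks.
    """
    if not 0 <= d_local_hit_tokens <= remote_total_tokens:
        raise ValueError("cache hits must lie in [0, remote_total_tokens]")
    # A partially filled block cannot be reused, so round hits down.
    aligned_hits = (d_local_hit_tokens // block_size) * block_size
    ext_tokens = remote_total_tokens - aligned_hits
    ext_blocks = -(-ext_tokens // block_size)  # ceiling division
    return ext_tokens, ext_blocks
```

With C = 32000 already cached on D and N = 8192 new tokens, the plan is 8192 tokens in 512 blocks, the same as for C = 0.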


Hardware & Model

| Parameter | Value |
|---|---|
| GPUs | 2× H20 96GB (1 for P, 1 for D), NVLink/RDMA connected |
| Model | Qwen3-Coder-30B-A3B-Instruct |
| TP | 1 per instance |
| Transfer | Mooncake (kv_producer / kv_consumer) |
| enable_prefix_caching | true |
| enable_chunked_prefill | true |
| max_num_batched_tokens | 8192 |
| gpu_memory_utilization | 0.9 |

Independent Variables

| Variable | Symbol | Values | Meaning |
|---|---|---|---|
| Prior context (D-side cached) | C | 0, 4k, 16k, 32k, 64k, 100k | Tokens from prior turns, already in D's prefix cache |
| Current new tokens | N | 512, 2k, 4k, 8k, 16k, 32k | Tokens P must prefill and transfer (the delta) |
| Output length | O | 1, 32, 128, 512 | Decode tokens D generates after receiving KV |

Sweep: 6 × 6 × 4 = 144 configurations.

Total input_length per request = C + N.
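
The sweep can be enumerated mechanically. This is a minimal sketch; the exact token counts behind "4k", "16k", and so on are assumptions (decimal thousands for C and powers of two for N, matching the 32000 and 8192 used in the example record later in this document).

```python
from itertools import product

# Assumed expansions of the "k" shorthand; adjust to the real harness.
PRIOR_CONTEXT = [0, 4_000, 16_000, 32_000, 64_000, 100_000]   # C
NEW_TOKENS = [512, 2_048, 4_096, 8_192, 16_384, 32_768]       # N
OUTPUT_LENGTH = [1, 32, 128, 512]                             # O


def sweep_configs() -> list[dict]:
    """Enumerate every (C, N, O) configuration, C outermost as in the protocol."""
    return [
        {
            "prior_context": c,
            "current_new_tokens": n,
            "output_length": o,
            "total_input_length": c + n,
        }
        for c, n, o in product(PRIOR_CONTEXT, NEW_TOKENS, OUTPUT_LENGTH)
    ]
```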


Lifecycle Phases & Instrumentation

Phase Diagram

Time ─────────────────────────────────────────────────────────────────────►

[Routing] [P Queue] [P Prefill (chunked)] [Transfer]       [D Startup] [D Decode]
t0        t1        t2                    t3  t4  t5  t6   t7   t8     t9      t10

t0: Request arrives at proxy/router
t1: Request dispatched to P instance (P receives HTTP request)
t2: P scheduler picks up request (first prefill chunk starts)
t3: P prefill completes (last chunk done, all KV blocks ready)
t4: P sends ZMQ metadata to D (or D sends block alloc to P)
t5: First RDMA write issued
t6: Last RDMA write completes (all blocks landed on D GPU)
t7: D receives completion signal (ZMQ response parsed)
t8: D scheduler promotes request from WAITING_FOR_REMOTE_KVS → schedulable
t9: D first decode token emitted
t10: D final output token emitted

Instrumentation Points

| Timestamp | Where to instrument | Method |
|---|---|---|
| t0 | Proxy pick_instance() entry | Proxy log with time.perf_counter_ns() |
| t1 | Proxy forwards to P (HTTP send complete) | Proxy log |
| t2 | P scheduler schedule(): request leaves WAITING | vLLM patch: log in scheduler |
| t3 | P request_finished() or save_kv_layer last layer | vLLM patch: log in connector record_send_reqs |
| t4 | P send_kv_to_decode: ZMQ metadata received by handler | Connector log: before _build_transfer_params |
| t5 | P batch_transfer_sync_write entry | Connector log |
| t6 | P batch_transfer_sync_write return | Connector log (ret_value == 0) |
| t7 | D process_pulling_result: finished_recving_reqs.add() | Connector log |
| t8 | D scheduler _try_promote_blocked_waiting_request success | Scheduler log |
| t9 | D first token streamed to client | Client-side SSE timestamp |
| t10 | D last token streamed to client | Client-side SSE timestamp |

Derived Metrics

| Metric | Formula | What it tells us |
|---|---|---|
| routing_latency | t1 - t0 | Proxy overhead |
| p_queue_time | t2 - t1 | P scheduling delay |
| p_prefill_time | t3 - t2 | Actual prefill compute (chunked) |
| zmq_handshake | t5 - t3 | ZMQ coordination overhead (P ready → RDMA starts) |
| rdma_transfer_time | t6 - t5 | Pure RDMA data movement |
| transfer_signal_latency | t7 - t6 | Completion detection (ZMQ response + asyncio poll) |
| d_promotion_latency | t8 - t7 | Scheduler step delay until promotion |
| d_first_token_latency | t9 - t8 | D compute startup (1 token forward + sampling) |
| d_decode_time | t10 - t9 | Decode generation (O-1 tokens) |
| transfer_total | t7 - t3 | End-to-end transfer overhead (the key number) |
| ttft_overhead_vs_colo | t9 - t0 - p_prefill_time | Extra TTFT relative to running the same request on a combined instance, approximating the co-located TTFT by p_prefill_time |
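
The formulas above translate directly into code. The sketch below assumes the timestamps arrive as a dict keyed "t0" through "t10" in nanoseconds (a hypothetical layout, since the per-request record uses longer key names) and also reports the residual between the phase sum and the end-to-end time.

```python
# Phase name -> (start timestamp, end timestamp), per the Derived Metrics table.
PHASES = {
    "routing": ("t0", "t1"),
    "p_queue": ("t1", "t2"),
    "p_prefill": ("t2", "t3"),
    "zmq_handshake": ("t3", "t5"),
    "rdma_transfer": ("t5", "t6"),
    "transfer_signal": ("t6", "t7"),
    "d_promotion": ("t7", "t8"),
    "d_first_token": ("t8", "t9"),
    "d_decode": ("t9", "t10"),
}


def derive_breakdown_ms(ts_ns: dict) -> dict:
    """Convert raw nanosecond timestamps into the per-phase breakdown in ms."""
    def span_ms(start: str, end: str) -> float:
        return (ts_ns[end] - ts_ns[start]) / 1e6

    out = {name: span_ms(a, b) for name, (a, b) in PHASES.items()}
    phase_sum = sum(out.values())
    out["transfer_total"] = span_ms("t3", "t7")
    out["e2e"] = span_ms("t0", "t10")
    out["ttft_overhead_vs_colo"] = span_ms("t0", "t9") - out["p_prefill"]
    # Fraction of end-to-end time not explained by the nine phases.
    out["residual_frac"] = abs(out["e2e"] - phase_sum) / out["e2e"] if out["e2e"] else 0.0
    return out
```

Because the nine phases tile the interval from t0 to t10, the residual is zero up to rounding whenever all timestamps come from one clock; a large residual points at a missing or misordered timestamp.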

Transfer Internal Breakdown

Within the transfer stage (t3 to t7), instrument further:

| Sub-phase | How to measure |
|---|---|
| build_transfer_params | Duration of the _build_transfer_params() call (runs before t5) |
| rdma_write_submit | Time from batch_transfer_sync_write entry to first RDMA CQ completion (if available) |
| rdma_write_total | Full batch_transfer_sync_write duration |
| bytes_transferred | sum(lengths) from transfer params |
| num_rdma_ops | len(src_ptrs) (number of RDMA write operations) |
| effective_bandwidth | bytes_transferred / rdma_write_total |
| num_layers_transferred | Count of unique layers in transfer |
| num_blocks_transferred | Count of blocks |

Expected relationships:

  • bytes_transferred = num_blocks × block_size_bytes × num_layers
  • block_size_bytes (per block, per layer) = 16 tokens × 2 (K and V) × num_kv_heads × head_dim × dtype_size
  • rdma_transfer_time ≈ bytes_transferred / RDMA_bandwidth + per_op_latency × num_rdma_ops
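
The relationships above can be checked numerically before any run. In this sketch the function names are hypothetical, and the shape used in the note below (32 layers, 2 KV heads, head_dim 128, 2-byte dtype) is chosen only because it reproduces the byte count in the example record later in this document; it is not claimed to be the real model configuration.

```python
def kv_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
             head_dim: int, dtype_size: int = 2, block_tokens: int = 16) -> int:
    """Bytes of KV cache moved for num_tokens, counted in whole blocks."""
    num_blocks = -(-num_tokens // block_tokens)  # ceiling division
    # Per block, per layer: K and V for every token in the block.
    block_size_bytes = block_tokens * 2 * num_kv_heads * head_dim * dtype_size
    return num_blocks * block_size_bytes * num_layers


def predicted_transfer_ms(bytes_transferred: int, num_ops: int,
                          bandwidth_gbps: float, per_op_latency_us: float) -> float:
    """Wire time plus a fixed latency per RDMA operation, in milliseconds."""
    wire_ms = bytes_transferred * 8 / (bandwidth_gbps * 1e9) * 1e3
    return wire_ms + per_op_latency_us * num_ops / 1e3
```

For example, `kv_bytes(8192, 32, 2, 128)` gives 268435456 bytes, which at 200 Gbps and zero per-op latency is about 10.7 ms of pure wire time.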

Protocol

Setup: Warm D's Prefix Cache

To control prior_context (C), we need D to have prior-turn KV in its local prefix cache:

Phase 0: Seed D's cache
  1. For each config with C > 0:
     - Send a request with C-token prompt directly to D (combined mode, no PD-sep)
     - Let it generate 1 token → D now has C tokens in prefix cache
     - Verify via /metrics that prefix cache utilization increased
  2. Switch D to kv_consumer mode (or keep combined + use kv_transfer_params override)

Alternative: Use D in kv_both mode (combined + Mooncake enabled), then send PD-sep requests with kv_transfer_params that explicitly request remote prefill.

Main Experiment

For C in [0, 4k, 16k, 32k, 64k, 100k]:
    Seed D's prefix cache with C tokens (Phase 0)
    
    For N in [512, 2k, 4k, 8k, 16k, 32k]:
        For O in [1, 32, 128, 512]:
            Construct request:
                input = C_token_prefix + N_random_new_tokens  (total = C+N)
                max_tokens = O
                kv_transfer_params = {do_remote_prefill: true, ...}
            
            Send request through proxy → P → D
            Collect all timestamps (t0..t10)
            Repeat 5 times
            
            Record breakdown
    
    Evict D's cache (restart or send cache-clearing requests)
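
One way to construct `C_token_prefix + N_random_new_tokens` is at the token-ID level, so that the prefix is byte-identical across requests while the suffix never hits a cache. This is a sketch under assumptions: it presumes the serving endpoint accepts a list of token IDs as the prompt, and the vocabulary range and seed scheme are hypothetical.

```python
import random


def build_prompt_token_ids(c: int, n: int, prefix_seed: int, rep_seed: int,
                           vocab_size: int = 32_000, first_id: int = 1_000) -> list[int]:
    """Return c shared prefix tokens followed by n request-unique tokens.

    vocab_size and first_id are placeholder bounds; pick a range of ordinary
    (non-special) token IDs for the tokenizer actually in use.
    """
    prefix_rng = random.Random(prefix_seed)   # same seed -> same prefix -> cache hit
    prefix = [prefix_rng.randrange(first_id, vocab_size) for _ in range(c)]
    suffix_rng = random.Random(rep_seed)      # fresh seed per repetition -> cache miss
    suffix = [suffix_rng.randrange(first_id, vocab_size) for _ in range(n)]
    return prefix + suffix
```

Deriving `rep_seed` from something unique per request (for example a counter combined with `time.time_ns()`) keeps repeated runs from silently hitting the prefix cache on the N-token suffix.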

D-Side Cache Verification

Before each measurement, verify D's cache state:

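import httpx

# Scrape D's Prometheus metrics to check that the C-token prefix is cached.
resp = httpx.get(f"http://{d_host}:{d_port}/metrics", timeout=10.0)
resp.raise_for_status()
# Parse vllm:prefix_cache_hit_rate or gpu_prefix_cache_hit_rate (exact metric
# names depend on the vLLM version), or use an internal API to query the
# cached block count. A hit rate alone does not prove that exactly C tokens hit.
cache_lines = [ln for ln in resp.text.splitlines() if "prefix_cache" in ln]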
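
To turn the scraped text into numbers, a small parser for the Prometheus text format is enough. This is a minimal sketch: it assumes samples carry no trailing timestamp field, and the metric names in the test data are placeholders rather than names guaranteed by any vLLM version.

```python
def sum_metric(metrics_text: str, name: str) -> float:
    """Sum every sample of metric `name` across label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        sample, _, value = line.rpartition(" ")
        metric_name = sample.split("{", 1)[0]  # drop the {label="..."} part
        if metric_name == name:
            total += float(value)
    return total
```

Scraping before and after a probe request and differencing a hit counter gives the number of tokens (or blocks, depending on the counter) served from D's cache for that request, which is a more direct check than a cumulative hit rate.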

vLLM Instrumentation Patch

Minimal patch to mooncake_connector.py for timestamp collection:

# Add at top
import time
_PROFILE_LOG = []  # or write to file

# In send_kv_to_decode(), around line 800-990:
async def send_kv_to_decode(self, ...):
    t_ready = time.perf_counter_ns()  # P prefill done, ready to send

    # ... ZMQ receive metadata from D ...
    t_zmq_recv = time.perf_counter_ns()  # t4: D's metadata received

    # ... build transfer params ...
    t_params_built = time.perf_counter_ns()  # t5: just before the RDMA write

    ret_value = self.engine.batch_transfer_sync_write(...)
    t_rdma_done = time.perf_counter_ns()  # t6: meaningful only if ret_value == 0

    # ... send ZMQ response ...
    t_zmq_sent = time.perf_counter_ns()

    _PROFILE_LOG.append({
        "req_id": req_id,
        "ret_value": ret_value,  # expected to be 0 on success; drop other samples
        "bytes": sum(lengths),
        "num_ops": len(src_ptrs),
        "t_ready": t_ready,
        "t_zmq_recv": t_zmq_recv,
        "t_params_built": t_params_built,
        "t_rdma_done": t_rdma_done,
        "t_zmq_sent": t_zmq_sent,
    })

Similar patches needed in:

  • scheduler.py: Log t_schedule_start, t_promote
  • process_pulling_result(): Log t_recv_complete
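
If the in-memory `_PROFILE_LOG` list is replaced by a file, an append-only JSONL sink keeps the patch small and lets the scheduler and connector patches share one format. The class and function names below are hypothetical, and the lock assumes that several connector threads may log at once.

```python
import io
import json
import threading


class ProfileLog:
    """Append-only JSONL sink for per-request timestamp records."""

    def __init__(self, stream):
        self._stream = stream
        self._lock = threading.Lock()

    def record(self, **fields):
        line = json.dumps(fields, sort_keys=True)
        with self._lock:  # serialize writes from concurrent threads
            self._stream.write(line + "\n")
            self._stream.flush()


def read_records(text: str) -> list[dict]:
    """Parse JSONL text back into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Usage with an in-memory stream; a real run would pass open(path, "a").
buf = io.StringIO()
log = ProfileLog(buf)
log.record(req_id="req-0", t_params_built=100, t_rdma_done=1_100_100)
records = read_records(buf.getvalue())
rdma_ms = (records[0]["t_rdma_done"] - records[0]["t_params_built"]) / 1e6
```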

Output Format

Per-Request Record (results/lifecycle/C{c}_N{n}_O{o}_rep{r}.json)

Schema example with placeholder values, not measurements:

{
    "config": {
        "prior_context": 32000,
        "current_new_tokens": 8192,
        "output_length": 128,
        "total_input_length": 40192
    },
    "timestamps_ns": {
        "t0_proxy_recv": 1000000000,
        "t1_proxy_dispatch": 1000050000,
        "t2_p_schedule": 1000200000,
        "t3_p_prefill_done": 1001100000,
        "t4_zmq_metadata": 1001150000,
        "t5_rdma_start": 1001200000,
        "t6_rdma_complete": 1002300000,
        "t7_d_recv_signal": 1002350000,
        "t8_d_promoted": 1002500000,
        "t9_d_first_token": 1002600000,
        "t10_d_last_token": 1003800000
    },
    "breakdown_ms": {
        "routing": 0.05,
        "p_queue": 0.15,
        "p_prefill": 0.90,
        "zmq_handshake": 0.10,
        "rdma_transfer": 1.10,
        "transfer_signal": 0.05,
        "d_promotion": 0.15,
        "d_first_token": 0.10,
        "d_decode": 1.20,
        "transfer_total": 1.25,
        "e2e": 3.80
    },
    "transfer_detail": {
        "bytes_transferred": 268435456,
        "num_rdma_ops": 512,
        "num_blocks": 512,
        "num_layers": 32,
        "build_params_ms": 0.8,
        "rdma_write_ms": 1100.0,
        "effective_bw_gbps": 195.2
    }
}

Aggregated Summary (results/lifecycle/summary.csv)

Format example with placeholder rows, not measurements:

prior_context,new_tokens,output_length,p_prefill_ms,rdma_transfer_ms,transfer_total_ms,d_decode_ms,e2e_ms,bytes_GB,bw_gbps,ttft_overhead_ms
0,8192,128,890,1100,1200,480,2620,0.268,195,1200
32000,8192,128,890,1100,1200,480,2620,0.268,195,1200
64000,8192,128,890,1100,1200,480,2620,0.268,195,1200
0,32768,128,3200,4400,4500,480,8230,1.073,195,4500
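
Per-request records can be folded into summary rows by taking the median over the retained repetitions. This sketch assumes each record carries a 0-indexed `rep` field (hypothetical; the source only encodes the repetition in the file name) and drops the first two repetitions as warm-up, following the mitigation listed under Risks.

```python
from statistics import median

# Summary column -> key inside the record's breakdown_ms section.
COLUMNS = {
    "p_prefill_ms": "p_prefill",
    "rdma_transfer_ms": "rdma_transfer",
    "transfer_total_ms": "transfer_total",
    "d_decode_ms": "d_decode",
    "e2e_ms": "e2e",
}


def summarize(records: list[dict], warmup_reps: int = 2) -> list[dict]:
    """Group records by (C, N, O) and report the median of each column."""
    groups: dict[tuple, list[dict]] = {}
    for rec in records:
        if rec["rep"] < warmup_reps:      # discard warm-up repetitions
            continue
        cfg = rec["config"]
        key = (cfg["prior_context"], cfg["current_new_tokens"], cfg["output_length"])
        groups.setdefault(key, []).append(rec["breakdown_ms"])
    rows = []
    for key in sorted(groups):
        row = {"prior_context": key[0], "new_tokens": key[1], "output_length": key[2]}
        for column, field in COLUMNS.items():
            row[column] = median(run[field] for run in groups[key])
        rows.append(row)
    return rows
```

The rows can be written with `csv.DictWriter`; the median is used instead of the mean so that a single slow repetition does not skew a three-sample cell.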

Analysis Deliverables

1. Stacked Bar Chart: Lifecycle Breakdown vs N (new tokens)

  • X-axis: current_new_tokens
  • Y-axis: Time (ms)
  • Stacked bars: routing | p_queue | p_prefill | zmq | rdma_transfer | signal | d_promotion | d_first_token | d_decode

Separate subplot rows for each prior_context value.

2. Transfer Bandwidth Characterization

Plot effective_bandwidth vs bytes_transferred:

  • Expected: bandwidth increases with transfer size (amortizes per-op latency)
  • Identify the "bandwidth knee" — minimum transfer size for near-peak bandwidth
  • Compare against theoretical 200 Gbps RDMA limit
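
The knee can be located programmatically once (bytes, duration) pairs are collected. In this sketch the "knee" is defined as the smallest transfer that reaches a chosen fraction of the best observed bandwidth; that definition and the 0.9 default are assumptions, not something the design fixes.

```python
def effective_bandwidth_gbps(bytes_transferred: int, rdma_write_ms: float) -> float:
    """bytes / duration, expressed in gigabits per second."""
    return bytes_transferred * 8 / (rdma_write_ms * 1e-3) / 1e9


def bandwidth_knee(samples: list[tuple[int, float]], fraction: float = 0.9):
    """samples: (bytes_transferred, rdma_write_ms) pairs.

    Returns (size_bytes, bandwidth_gbps, peak_gbps) for the smallest transfer
    whose bandwidth is at least `fraction` of the observed peak.
    """
    points = sorted((size, effective_bandwidth_gbps(size, ms)) for size, ms in samples)
    peak = max(bw for _, bw in points)
    for size, bw in points:
        if bw >= fraction * peak:
            return size, bw, peak
    raise ValueError("no samples supplied")
```

Comparing the returned peak against the 200 Gbps figure shows how much headroom remains, and the knee size indicates which values of N are latency-bound rather than bandwidth-bound.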

3. Transfer Cost Model

Fit: rdma_transfer_ms = α + β × bytes_transferred

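  • α = fixed per-transfer cost that does not scale with size (ZMQ coordination and scheduling enter α only when the fit is run on transfer_total_ms instead of rdma_transfer_ms)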
  • β = 1/bandwidth (bytes → time)
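
The fit is ordinary least squares on one predictor, so it can be written without any dependency. The sketch below returns α, β, and R², plus a helper that converts β (ms per byte) into an implied bandwidth; the function names are hypothetical.

```python
def fit_transfer_cost(bytes_list: list[float], ms_list: list[float]):
    """Least-squares fit of ms = alpha + beta * bytes. Returns (alpha, beta, r2)."""
    n = len(bytes_list)
    if n < 2 or n != len(ms_list):
        raise ValueError("need at least two paired samples")
    mean_x = sum(bytes_list) / n
    mean_y = sum(ms_list) / n
    sxx = sum((x - mean_x) ** 2 for x in bytes_list)
    if sxx == 0:
        raise ValueError("all transfer sizes are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bytes_list, ms_list))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ss_res = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(bytes_list, ms_list))
    ss_tot = sum((y - mean_y) ** 2 for y in ms_list)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return alpha, beta, r2


def implied_bandwidth_gbps(beta_ms_per_byte: float) -> float:
    """1/beta is bytes per ms; convert to gigabits per second."""
    return 8e-6 / beta_ms_per_byte
```

A β of 4e-8 ms per byte corresponds to 200 Gbps, so the fitted slope can be read directly against the link's nominal rate.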

4. Overhead vs Co-Located Baseline

For each config, also measure the same request on a combined (no PD-sep) instance:

  • colo_ttft = time from request to first token on combined instance
  • pdsep_overhead = pdsep_ttft - colo_ttft

Plot: overhead as function of (C, N) — when does PD-sep become net-negative?

5. Impact of Prior Context on Transfer Volume

Since transfer is incremental:

  • When C increases (D has more cached), actual bytes_transferred should stay constant (≈ N × per_token_kv_size)
  • Verify this — if NOT constant, there's a bug in incremental transfer logic
  • Plot actual bytes_transferred vs C for fixed N
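
The constancy check can be automated per value of N. This sketch measures the relative spread of bytes_transferred across C; the 1% tolerance is an assumed allowance for block-boundary rounding when C is not a multiple of the block size.

```python
def transfer_volume_spread(bytes_by_c: dict[int, int]) -> float:
    """Relative spread (max - min) / max of bytes_transferred across C values."""
    values = list(bytes_by_c.values())
    if not values:
        raise ValueError("no measurements supplied")
    lo, hi = min(values), max(values)
    return (hi - lo) / hi if hi else 0.0


def is_incremental(bytes_by_c: dict[int, int], tolerance: float = 0.01) -> bool:
    """True when transfer volume is effectively independent of prior context."""
    return transfer_volume_spread(bytes_by_c) <= tolerance
```

A spread that grows with C suggests that P is resending blocks D already holds, which is the failure mode this deliverable is meant to catch.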

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Clock skew between P and D processes | Wrong cross-instance durations | Use single-machine 2-GPU setup and run proxy and client on the same host, so every time.perf_counter_ns() reading shares one clock (on Linux it reads the system-wide CLOCK_MONOTONIC) |
| P and D scheduler step async | Promotion delayed by step interval | Record D's scheduler step frequency, subtract 0.5×step from d_promotion |
| Prefix cache eviction during experiment | C not actually cached | Monitor cache metrics, use small enough working set |
| Mooncake connection pool warmup | First transfer slower | Discard first 2 repetitions, use reps 3-5 |
| vLLM internal queuing at high C+N | OOM or scheduling delays | Monitor gpu_cache_usage_perc, keep C+N ≤ 132k |

Execution Estimate

| Phase | Time |
|---|---|
| vLLM patch development & validation | 2 hours |
| Per configuration (5 reps × ~10s each) | ~50s |
| Full sweep (144 configs × 50s) | ~2 hours |
| Cache seeding overhead (6 prior_context levels) | ~30 min |
| Analysis & plotting | 2 hours |
| Total | ~7 hours |

Success Criteria

  1. Breakdown is complete: All phases sum to E2E (residual < 5%)
  2. Transfer dominates: rdma_transfer_ms > p_prefill_ms for N ≥ 4k (confirms current bottleneck hypothesis)
  3. Bandwidth model fits: Linear model R² > 0.95
  4. Incremental verified: bytes_transferred independent of prior_context for fixed N
  5. Overhead quantified: Clear threshold (N, C) where PD-sep overhead exceeds co-located execution