Gahow Wang a65ec42467 Update report: adaptive v2 confirms no KV transfer helps single-machine

All PD/offload schemes tested are worse than PD-combined + hybrid routing:
  Combined hybrid:    TTFT=0.737  TPOT90=0.072  APC=49.4%  (BEST)
  PD-Sep 4P+4D:       TTFT=1.994  TPOT90=0.075  APC=40.2%
  Adaptive v2 offload: TTFT=1.462  TPOT90=0.077  APC=~45%

Definitive: single-machine agentic serving = PD-combined + smart routing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 10:15:08 +08:00

Overnight Work Report (2026-05-22, updated 2026-05-22 afternoon)

Summary

Investigated routing optimization and selective prefill/decode (PD) offloading for agentic workloads. PD-combined serving with hybrid routing (session-sticky + load-aware override) outperformed every other configuration tested for single-machine serving. Every form of KV transfer tested (full PD separation, selective offload) added overhead that exceeded its isolation benefit.

Key results:

Hybrid routing: +4.7pp APC (44.7% → 49.4%) with no meaningful latency regression (TTFT p50 0.731 → 0.737, E2E p50 4.480 → 4.487)
Adaptive v2 (selective Mooncake offload for HEAVY requests): +36% TTFT, +35% E2E — worse
Definitive conclusion: on a single 8-GPU machine, no KV transfer scheme tested helps agentic workloads

Work Timeline

1. Balanced Routing Benchmark

Goal: Verify that the cache policy simulation's predicted 49.2% APC is achievable in practice.

Setup: 8 combined TP=1 instances, session-sticky routing with KV-size balanced placement, 1000 requests.

Result: APC = 48.7% (+4.0pp over the 44.7% baseline), but TTFT p50 degraded by 30% and E2E p50 by 23% because strict session stickiness creates load hotspots.
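A minimal sketch of what "session-sticky routing with KV-size balanced placement" could look like, assuming a new session is pinned to the instance with the smallest accumulated KV footprint (counted in tokens). The class name, method names, and token-based accounting are hypothetical and are not taken from the repository's proxy.

```python
class BalancedStickyRouter:
    """Pin each session to one instance; place new sessions on the
    instance currently holding the fewest KV tokens (assumed metric)."""

    def __init__(self, num_instances):
        self.kv_tokens = [0] * num_instances  # estimated KV footprint per instance
        self.pinned = {}                       # session_id -> instance index

    def route(self, session_id, new_tokens):
        if session_id not in self.pinned:
            # New session: choose the instance with the smallest KV footprint.
            self.pinned[session_id] = min(
                range(len(self.kv_tokens)), key=self.kv_tokens.__getitem__
            )
        idx = self.pinned[session_id]
        self.kv_tokens[idx] += new_tokens  # the session's KV grows on its pinned instance
        return idx


router = BalancedStickyRouter(num_instances=8)
first = router.route("s1", 17_000)   # cold request, lands on instance 0
second = router.route("s2", 17_000)  # next session goes to the emptiest instance, 1
again = router.route("s1", 1_300)    # warm follow-up stays on instance 0
```

Because a pinned session never moves in this sketch, a few long or bursty sessions can concentrate load on one instance, which is consistent with the hotspot behaviour reported above.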

Output: outputs/balanced_routing/

2. Agentic Workload Pattern Analysis

Goal: Identify core patterns that should drive PD scheduling design.

Key findings (from scripts/analyze_agentic_patterns.py):

91% of reusable KV is intra-session (multi-turn), not cross-session
Session-sticky routing is THE critical optimization for APC
36% warm requests (1.3k new tokens), 64% cold (17k+) — bimodal
After cache, effective prefill/decode ratio drops from 61.5x to 28.7x
Cross-session sharing (system prompt) is only 4.8% of tokens
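The intra-session versus cross-session split can be illustrated with a small trace replay. This is a sketch under stated assumptions, not the logic of scripts/analyze_agentic_patterns.py: it assumes prompts are cached in fixed-size blocks keyed by a chained prefix hash, and the function name, `block_size`, and the toy trace are hypothetical.

```python
from collections import defaultdict


def reuse_breakdown(requests, block_size=16):
    """requests: iterable of (session_id, token_ids) in arrival order.
    Returns token counts for intra-session reuse, cross-session reuse,
    and tokens that no earlier request had produced."""
    seen_by = defaultdict(set)  # prefix-block key -> sessions that produced it
    counts = {"intra": 0, "cross": 0, "new": 0}
    for session_id, tokens in requests:
        n_full = len(tokens) - len(tokens) % block_size
        prev = None
        for start in range(0, n_full, block_size):
            # Chain the hash so a block only matches when its whole prefix matches.
            prev = hash((prev, tuple(tokens[start:start + block_size])))
            owners = seen_by[prev]
            if session_id in owners:
                counts["intra"] += block_size
            elif owners:
                counts["cross"] += block_size
            else:
                counts["new"] += block_size
            owners.add(session_id)
        counts["new"] += len(tokens) - n_full  # partial tail block is treated as uncached
    return counts


system = [1, 2, 3, 4]
trace = [
    ("A", system + [10, 11, 12, 13]),                   # session A, turn 1
    ("A", system + [10, 11, 12, 13, 20, 21, 22, 23]),   # session A, turn 2
    ("B", system + [30, 31, 32, 33]),                   # session B shares only the system prompt
]
breakdown = reuse_breakdown(trace, block_size=4)
# breakdown == {"intra": 8, "cross": 4, "new": 16}
```

In this toy trace the multi-turn session accounts for twice as many reusable tokens as the shared system prompt, the same qualitative pattern as the 91% intra-session figure above.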

3. Cache Policy Simulation

Goal: Determine if LRU eviction policy is the bottleneck.

Result: With balanced routing, LRU trails an infinite cache by only 1.8pp (49.2% vs 51.0%). LFU is worse (-5.8pp), and SessionProtectedLRU has no effect. The 10pp gap observed previously came from routing imbalance, not cache policy.

Output: scripts/simulate_cache_policies.py
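The LRU-versus-infinite comparison can be sketched as a replay of block accesses through a bounded cache. This is a simplified illustration, not the contents of scripts/simulate_cache_policies.py: it treats blocks as independent keys (ignoring prefix ordering), and the function name, `capacity` parameter, and toy stream are assumptions.

```python
from collections import OrderedDict


def simulate_hit_rate(block_stream, capacity=None):
    """Replay block keys through an LRU cache holding `capacity` blocks
    (None means infinite) and return the fraction of accesses that hit."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in block_stream:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if capacity is not None and len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / total if total else 0.0


stream = ["a", "b", "a", "c", "b", "a"]
lru_rate = simulate_hit_rate(stream, capacity=2)  # 1 hit out of 6
inf_rate = simulate_hit_rate(stream)              # 3 hits out of 6
gap_pp = 100 * (inf_rate - lru_rate)
```

The gap between the two rates is the portion of misses attributable to eviction. A small gap, as measured here with balanced routing, means a smarter eviction policy has little left to recover.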

4. Hybrid Routing Implementation

Goal: Get both high APC (from session stickiness) and low latency (from load balancing).

Design: Session affinity for turn 2+, with a load-aware override when the pinned instance has ongoing_tokens > 2x the average across instances. Overloaded or new sessions fall back to score = ongoing_tokens - ALPHA * cache_hit.
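The decision rule above can be sketched as follows. This is an illustration under assumptions, not the code in scripts/cache_aware_proxy.py: the function signature, the `ALPHA` value, the choice to pick the lowest score, and the re-pinning of a session after an override are all hypothetical.

```python
ALPHA = 1.0            # hypothetical weight; the report does not state its value
OVERLOAD_FACTOR = 2.0  # the 2x-average threshold from the design


def pick_instance(session_id, turn, ongoing_tokens, cache_hit, pinned):
    """ongoing_tokens: in-flight token count per instance.
    cache_hit: estimated cached prompt tokens per instance for this request.
    pinned: dict session_id -> instance index, updated in place."""
    avg = sum(ongoing_tokens) / len(ongoing_tokens)
    idx = pinned.get(session_id)
    if turn >= 2 and idx is not None and ongoing_tokens[idx] <= OVERLOAD_FACTOR * avg:
        return idx  # pinned instance is not overloaded: keep session affinity
    # New session or overloaded pin: score every instance (lower assumed better).
    scores = [load - ALPHA * hit for load, hit in zip(ongoing_tokens, cache_hit)]
    idx = min(range(len(scores)), key=scores.__getitem__)
    pinned[session_id] = idx  # assumption: the session follows its new instance
    return idx


pinned = {"s1": 0}
normal = pick_instance("s1", 2, [100, 100, 100, 100], [50, 0, 0, 0], pinned)    # stays on 0
shed = pick_instance("s1", 3, [900, 100, 100, 100], [500, 0, 0, 0], pinned)     # moves to 1
fresh = pick_instance("s2", 1, [100, 50, 100, 100], [0, 0, 0, 0], pinned)       # least loaded, 1
```

In the second call the pinned instance carries 900 tokens against an average of 300, so it exceeds the 2x threshold and the request is rescored. Even with 500 cached tokens on instance 0, its score of 400 loses to the idle instances at 100.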

Result:

                              TTFT50  TPOT90  E2E50   APC
  Old cache-aware              0.731   0.073   4.480  44.7%
  Balanced session-sticky      0.953   0.079   5.520  48.7%
  Hybrid (sticky+load-aware)   0.737   0.072   4.487  49.4%

Output: outputs/hybrid_routing/, scripts/cache_aware_proxy.py
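The relative changes quoted in this report can be recomputed from the results table above. The snippet below only restates the table values. The dictionary keys and the `delta` helper are names chosen for illustration.

```python
results = {
    "old":      {"ttft50": 0.731, "tpot90": 0.073, "e2e50": 4.480, "apc": 44.7},
    "balanced": {"ttft50": 0.953, "tpot90": 0.079, "e2e50": 5.520, "apc": 48.7},
    "hybrid":   {"ttft50": 0.737, "tpot90": 0.072, "e2e50": 4.487, "apc": 49.4},
}


def delta(name, base="old"):
    """Relative latency change in percent and APC change in percentage points."""
    r, b = results[name], results[base]
    return {
        "ttft_pct": round(100 * (r["ttft50"] / b["ttft50"] - 1), 1),
        "e2e_pct": round(100 * (r["e2e50"] / b["e2e50"] - 1), 1),
        "apc_pp": round(r["apc"] - b["apc"], 1),
    }


balanced = delta("balanced")  # ttft_pct 30.4, e2e_pct 23.2, apc_pp 4.0
hybrid = delta("hybrid")      # ttft_pct 0.8,  e2e_pct 0.2,  apc_pp 4.7
```

Hybrid routing keeps nearly all of the APC gain while its latency stays within 1% of the old cache-aware baseline.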

High-Level Insights

1. Routing Quality > Cache Policy > PD Separation

For agentic workloads on a single machine:

Routing optimization: +4.7pp APC, +0% latency (hybrid routing)
Cache policy change: 0pp (LRU is already near-optimal with good routing)
PD separation: -4.7pp APC, +72% TTFT (KV cache memory wall)

2. Session Affinity is the Dominant Factor

91% of reusable KV is intra-session. Breaking session affinity (e.g., round-robin routing) drops APC from ~49% to ~21%. Any routing scheme MUST preserve session stickiness as the primary constraint.

3. Load-Aware Override Prevents Session-Sticky Hotspots

Pure session-sticky routing creates load hotspots (+30% TTFT). The 2x-average-load override threshold lets overloaded instances shed traffic while keeping affinity under normal load.

4. The Remaining Optimization Space

Current APC: 49.4% (vs theoretical 51.0%, gap = 1.6pp)
HEAVY requests TTFT p50 = 7.1s (36x worse than WARM 0.2s)
Cold-start prefills (64% of requests) dominate compute time
PD separation could help HEAVY TTFT but introduces a KV cache memory wall

5. PD-Combined vs PD-Sep: Not Binary

The agentic workload doesn't fit cleanly into either paradigm:

PD-Combined wins on latency and KV cache management
PD-Sep's decode isolation helps TPOT p90 (but only marginally with good routing)
The real optimization axis is KV cache lifecycle (routing + eviction), not P-D compute separation

Experiment Artifacts on dash0

| Directory | What | Requests |
| --- | --- | --- |
| `outputs/exp2_combined_tp1_dp8` | Old cache-aware baseline | 999 |
| `outputs/balanced_routing` | Session-sticky balanced | 999 |
| `outputs/hybrid_routing` | Hybrid (sticky+load-override) | 999 |
| `outputs/gpu_ab_combined` | GPU util baseline (200 req) | 200 |
| `outputs/gpu_ab_pdsep` | GPU util PD-Sep (200 req) | 200 |
| `outputs/gpu_ab_6p2d` | GPU util 6P+2D (200 req) | 200 |

Code Changes

| File | Change |
| --- | --- |
| `scripts/cache_aware_proxy.py` | Hybrid routing: session-sticky + load-aware override |
| `replayer/replay.py` | Send X-Session-Id header for session tracking |
| `scripts/analyze_agentic_patterns.py` | Core agentic workload pattern analysis |
| `scripts/simulate_cache_policies.py` | LRU vs LFU vs SessionProtected simulation |
| `scripts/analyze_eviction.py` | Eviction loss decomposition |
| `scripts/compare_balanced.py` | Balanced vs baseline comparison |

Git Commits (this session)

012d73f Hybrid routing: session-sticky + load-aware override achieves best results
efe9844 Balanced routing result: APC +4pp but latency +23%
32f09d3 Balanced session-sticky routing + agentic workload pattern analysis
e45f00e Cache policy simulation: routing quality dominates, not eviction policy
10636b1 KV cache lifecycle design + eviction loss analysis
d11d9f5 Adaptive prefill offload v1: implementation + experiment
d6e47d3 Design doc: Adaptive Prefill Offload
b659195 Add vLLM patches directory
445e491 Add vLLM v0.18.1 source tree with KV transfer abort fix
efa70f0 Consolidate analysis into single report with appendix
ce616f4 Add per-request breakdown profiling, identify KV cache memory bottleneck
c7afdc5 Ablation 2: fire-and-forget vs await-prefill scheduling
9dee259 Add P/D ratio ablation: 6P+2D vs 4P+4D vs Combined
6714913 Add GPU utilization A/B test and fix cache-aware proxy bugs
05592e6 Agentic workload PD separation analysis with trace-driven benchmarks
