Overnight work report: routing optimization achieves +4.7pp APC
Summary of overnight autonomous session:
- Analyzed agentic workload patterns (91% KV reuse is intra-session)
- Simulated cache policies (LRU near-optimal, routing is the bottleneck)
- Implemented hybrid routing (session-sticky + load-aware override)
- Result: APC 44.7% -> 49.4% with zero latency regression

Key insight: routing quality > cache policy > PD separation for single-machine agentic workloads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
120
analysis/overnight_work_report.md
Normal file
@@ -0,0 +1,120 @@
# Overnight Work Report (2026-05-22)

## Summary

Investigated routing optimization for agentic workloads on PD-combined serving (prefill and decode on the same instance). **Session-sticky routing with load-aware override** gave the best balance of KV cache reuse (APC) and request latency among the configurations tested.

**Key result**: +4.7pp APC improvement (44.7% → 49.4%) with no meaningful latency regression (TTFT p50 +0.8%, E2E p50 +0.2%).

---

## Work Timeline

### 1. Balanced Routing Benchmark

**Goal**: Verify that the 49.2% APC predicted by the cache policy simulation is achievable in practice.

**Setup**: 8 combined TP=1 instances, session-sticky routing with KV-size balanced placement, 1000 requests.

**Result**: APC = 48.7% (+4.0pp over the 44.7% baseline), but TTFT p50 degraded by 30% and E2E p50 by 23% because strict session stickiness created load hotspots.

**Output**: `outputs/balanced_routing/`

### 2. Agentic Workload Pattern Analysis

**Goal**: Identify the core patterns that should drive PD scheduling design.

**Key findings** (from `scripts/analyze_agentic_patterns.py`):

- **91% of reusable KV is intra-session** (multi-turn), not cross-session
- Session-sticky routing is therefore the single most important optimization for APC
- Request sizes are bimodal: 36% warm (1.3k new tokens) and 64% cold (17k+)
- After caching, the effective prefill/decode ratio drops from 61.5x to 28.7x
- Cross-session sharing (system prompt) accounts for only 4.8% of tokens

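The intra-session versus cross-session split above can be illustrated with a small prefix-matching pass over a trace. This is a minimal sketch, not the contents of `scripts/analyze_agentic_patterns.py`: the trace format (a list of `(session_id, token_ids)` pairs in arrival order) and the names `common_prefix_len` and `classify_reuse` are assumptions made for illustration.

```python
from collections import defaultdict


def common_prefix_len(a, b):
    """Length of the shared leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def classify_reuse(trace):
    """Split reusable prefix tokens into intra-session and cross-session.

    trace: iterable of (session_id, token_ids) in arrival order
    (hypothetical format). Assumes an infinite cache, so every earlier
    prompt is still available for matching.
    """
    seen = defaultdict(list)  # session_id -> earlier prompts of that session
    intra = cross = total = 0
    for sid, tokens in trace:
        best_intra = max(
            (common_prefix_len(tokens, t) for t in seen[sid]), default=0
        )
        best_cross = max(
            (
                common_prefix_len(tokens, t)
                for other, prompts in seen.items()
                if other != sid
                for t in prompts
            ),
            default=0,
        )
        intra += best_intra
        # Count only the extra tokens that another session alone could supply.
        cross += max(0, best_cross - best_intra)
        total += len(tokens)
        seen[sid].append(tokens)
    return {"intra": intra, "cross": cross, "total": total}
```

Dividing `intra` by `intra + cross` gives the share of reusable KV that is intra-session under these assumptions. The quadratic scan is only suitable for small traces; a prefix tree or block hashing would be the usual choice at scale.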
### 3. Cache Policy Simulation

**Goal**: Determine whether the LRU eviction policy is the bottleneck.

**Result**: With balanced routing, LRU is only 1.8pp below an infinite cache (49.2% vs 51.0%). LFU is worse (-5.8pp), and SessionProtectedLRU has no effect. The 10pp gap observed previously came from routing imbalance, not cache policy.

**Output**: `scripts/simulate_cache_policies.py`

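The comparison between LRU and an infinite cache can be sketched at the level of KV block accesses. This is an illustrative toy, not the logic of `scripts/simulate_cache_policies.py`: the function name `lru_hit_rate` and the representation of the workload as a flat list of block IDs are assumptions.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity=None):
    """Fraction of block accesses served from cache under LRU eviction.

    accesses: list of hashable block IDs in access order (hypothetical format).
    capacity: maximum number of cached blocks; None models an infinite cache.
    """
    cache = OrderedDict()  # ordered from least to most recently used
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            cache[block] = True
            if capacity is not None and len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(accesses) if accesses else 0.0
```

Running the same access list with a finite `capacity` and with `capacity=None` gives the gap to the infinite-cache ceiling, which is the quantity the 1.8pp figure refers to.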
### 4. Hybrid Routing Implementation

**Goal**: Get both high APC (from session stickiness) and low latency (from load balancing).

**Design**: Requests from turn 2 onward keep session affinity, with a load-aware override when the pinned instance has `ongoing_tokens > 2x average`. Overloaded or new sessions fall back to `score = ongoing_tokens - ALPHA * cache_hit`.

**Result**:

```
                            TTFT50  TPOT90  E2E50   APC
Old cache-aware             0.731   0.073   4.480   44.7%
Balanced session-sticky     0.953   0.079   5.520   48.7%
Hybrid (sticky+load-aware)  0.737   0.072   4.487   49.4%
```

**Output**: `outputs/hybrid_routing/`, `scripts/cache_aware_proxy.py`

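The hybrid policy described in the design can be sketched as a single selection function. This is a hedged illustration rather than the code in `scripts/cache_aware_proxy.py`: the `Instance` class, the `pick_instance` signature, the value of `ALPHA`, the choice of the lowest score as the winner, and re-pinning the session after an override are all assumptions.

```python
from dataclasses import dataclass

ALPHA = 1.0            # assumed weight on cached tokens; the report gives no value
OVERLOAD_FACTOR = 2.0  # the "2x average" override threshold from the report


@dataclass
class Instance:
    name: str
    ongoing_tokens: int = 0


def pick_instance(instances, session_pins, session_id, cache_hit):
    """Choose an instance for one request.

    session_pins: dict session_id -> Instance the session is pinned to.
    cache_hit: dict instance name -> estimated cached prefix tokens
               for this request (hypothetical input).
    """
    avg = sum(i.ongoing_tokens for i in instances) / len(instances)
    pinned = session_pins.get(session_id)

    # Sticky path: keep affinity unless the pinned instance is overloaded.
    if pinned is not None and pinned.ongoing_tokens <= OVERLOAD_FACTOR * avg:
        return pinned

    # Fallback for new or overloaded sessions: lowest score wins (assumed).
    best = min(
        instances,
        key=lambda i: i.ongoing_tokens - ALPHA * cache_hit.get(i.name, 0),
    )
    session_pins[session_id] = best  # assumption: the session is re-pinned
    return best
```

With instances at 100, 100, and 1000 ongoing tokens, the average is 400, so a session pinned to the third instance exceeds the 800-token threshold and is re-routed by score, while sessions pinned to the other two stay put.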
## High-Level Insights

### 1. Routing Quality > Cache Policy > PD Separation

For agentic workloads on a single machine:

- **Routing optimization**: +4.7pp APC at essentially unchanged latency (hybrid routing; TTFT p50 +0.8%, E2E p50 +0.2%)
- **Cache policy change**: ~0pp (with good routing, LRU is already within 1.8pp of an infinite cache)
- **PD separation**: -4.7pp APC, +72% TTFT (KV cache memory wall)

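The routing deltas quoted above follow directly from the result table in the hybrid routing section. The short script below recomputes them; the numbers are copied from that table, while the dictionary layout and the `deltas` helper are illustrative names only.

```python
# Values copied from the hybrid routing result table in this report.
results = {
    "old":      {"ttft50": 0.731, "tpot90": 0.073, "e2e50": 4.480, "apc": 44.7},
    "balanced": {"ttft50": 0.953, "tpot90": 0.079, "e2e50": 5.520, "apc": 48.7},
    "hybrid":   {"ttft50": 0.737, "tpot90": 0.072, "e2e50": 4.487, "apc": 49.4},
}


def deltas(base, new):
    """APC change in percentage points, latency changes in percent."""

    def pct(key):
        return round(100 * (new[key] / base[key] - 1), 1)

    return {
        "apc_pp": round(new["apc"] - base["apc"], 1),
        "ttft_pct": pct("ttft50"),
        "tpot_pct": pct("tpot90"),
        "e2e_pct": pct("e2e50"),
    }


for name in ("balanced", "hybrid"):
    print(name, deltas(results["old"], results[name]))
```

For hybrid routing this gives +4.7pp APC with TTFT p50 +0.8% and E2E p50 +0.2%; for balanced session-sticky routing it gives +4.0pp APC with TTFT p50 about +30% and E2E p50 about +23%.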
### 2. Session Affinity is the Dominant Factor

91% of reusable KV is intra-session. Breaking session affinity (e.g., with round-robin routing) cuts APC from ~49% to ~21%. Any routing scheme must preserve session stickiness as its primary constraint.

### 3. Load-Aware Override Prevents Session-Sticky Hotspots

Pure session-sticky routing creates load hotspots (+30% TTFT). The 2x-average-load override threshold lets overloaded instances shed traffic while preserving affinity under normal load.

### 4. The Remaining Optimization Space

- Current APC: 49.4% (vs. the simulated infinite-cache ceiling of 51.0%, gap = 1.6pp)
- HEAVY requests have TTFT p50 = 7.1s (about 36x worse than WARM at 0.2s)
- Cold-start prefills (64% of requests) dominate compute time
- PD separation could help HEAVY TTFT but introduces the KV cache memory wall

### 5. PD-Combined vs PD-Sep: Not Binary

The agentic workload doesn't fit cleanly into either paradigm:

- PD-Combined wins on latency and KV cache management
- PD-Sep's decode isolation helps TPOT p90 (but only marginally with good routing)
- The real optimization axis is **KV cache lifecycle** (routing + eviction), not P-D compute separation

## Experiment Artifacts on dash0

| Directory | What | Requests |
|-----------|------|----------|
| `outputs/exp2_combined_tp1_dp8` | Old cache-aware baseline | 999 |
| `outputs/balanced_routing` | Session-sticky balanced | 999 |
| `outputs/hybrid_routing` | Hybrid (sticky+load-override) | 999 |
| `outputs/gpu_ab_combined` | GPU util baseline (200 req) | 200 |
| `outputs/gpu_ab_pdsep` | GPU util PD-Sep (200 req) | 200 |
| `outputs/gpu_ab_6p2d` | GPU util 6P+2D (200 req) | 200 |

## Code Changes

| File | Change |
|------|--------|
| `scripts/cache_aware_proxy.py` | Hybrid routing: session-sticky + load-aware override |
| `replayer/replay.py` | Send X-Session-Id header for session tracking |
| `scripts/analyze_agentic_patterns.py` | Core agentic workload pattern analysis |
| `scripts/simulate_cache_policies.py` | LRU vs LFU vs SessionProtected simulation |
| `scripts/analyze_eviction.py` | Eviction loss decomposition |
| `scripts/compare_balanced.py` | Balanced vs baseline comparison |

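The `X-Session-Id` header added in `replayer/replay.py` is what lets the proxy recognise later turns of the same session. A minimal sketch of attaching that header with the standard library follows; the endpoint URL, the payload shape, and the `build_request` helper are assumptions, and the actual replayer may use a different HTTP client.

```python
import json
import urllib.request


def build_request(url, session_id, payload):
    """Build a POST request that carries the session ID as a header.

    url and payload are placeholders; only the X-Session-Id header name
    comes from the report.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Session-Id": session_id,
        },
        method="POST",
    )


# Hypothetical endpoint and payload, for illustration only.
req = build_request(
    "http://localhost:8000/v1/completions",
    "session-42",
    {"prompt": "hello", "max_tokens": 8},
)
# urllib stores header names capitalised as "X-session-id";
# HTTP header names are case-insensitive, so the proxy sees the same header.
print(req.get_header("X-session-id"))
```

The request is only constructed here, not sent, so the sketch runs without a server.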
## Git Commits (this session)

```
012d73f Hybrid routing: session-sticky + load-aware override achieves best results
efe9844 Balanced routing result: APC +4pp but latency +23%
32f09d3 Balanced session-sticky routing + agentic workload pattern analysis
e45f00e Cache policy simulation: routing quality dominates, not eviction policy
10636b1 KV cache lifecycle design + eviction loss analysis
d11d9f5 Adaptive prefill offload v1: implementation + experiment
d6e47d3 Design doc: Adaptive Prefill Offload
b659195 Add vLLM patches directory
445e491 Add vLLM v0.18.1 source tree with KV transfer abort fix
efa70f0 Consolidate analysis into single report with appendix
ce616f4 Add per-request breakdown profiling, identify KV cache memory bottleneck
c7afdc5 Ablation 2: fire-and-forget vs await-prefill scheduling
9dee259 Add P/D ratio ablation: 6P+2D vs 4P+4D vs Combined
6714913 Add GPU utilization A/B test and fix cache-aware proxy bugs
05592e6 Agentic workload PD separation analysis with trace-driven benchmarks
```