From 8e0c6e78b03bf3b5d8e5797ab8c025b4459896ca Mon Sep 17 00:00:00 2001
From: Gahow Wang
Date: Sat, 23 May 2026 07:16:31 +0800
Subject: [PATCH] Add comprehensive research findings document
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Synthesizes all experiments into a paper-ready analysis:
- Agentic workload characteristics vs chatbot/API
- Why PD-Sep, LMetric, elastic RDMA, chunk-size tuning don't work
- Why cache-aware session-sticky routing IS the key optimization (-60% TTFT, +24pp APC vs round-robin)
- System-level insights: prefill-decode interference threshold, Mooncake limitations, effective request weight after cache
- GPU balance → HEAVY TTFT -10.5% (demonstrated)

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 analysis/research_findings.md | 165 ++++++++++++++++++++++++++++++++++
 1 file changed, 165 insertions(+)
 create mode 100644 analysis/research_findings.md

diff --git a/analysis/research_findings.md b/analysis/research_findings.md
new file mode 100644
index 0000000..a5379cd
--- /dev/null
+++ b/analysis/research_findings.md
@@ -0,0 +1,165 @@
+# Research Findings: KV Cache Optimization for Agentic LLM Workloads
+
+**Date**: 2026-05-23
+**Author**: Gahow Wang
+
+---
+
+## 1. Agentic Workload Characteristics (vs Chatbot/API)
+
+| Property | Chatbot/API | **Agentic (this work)** |
+|----------|-------------|----------------------|
+| Input length | 1-5k tokens | **avg 33.6k, p90 88k** |
+| Output length | 100-500 tokens | **avg 445** (similar) |
+| I/O ratio | 1-10x | **75.6x** |
+| Prefill token share | 50-70% | **98%** |
+| KV reuse | Low (independent requests) | **71% theoretical, 91% intra-session** |
+| Session structure | Mostly single-turn | **Multi-turn chains (parent_chat_id)** |
+| Request weight distribution | Uniform | **Bimodal: 49% WARM/MEDIUM, 51% HEAVY** |
+
+These characteristics fundamentally change what optimizations matter.
+
+## 2. What Doesn't Work (and Why)
+
+### 2.1 PD Disaggregation (DistServe/Splitwise approach)
+
+**Setup**: 4P + 4D instances (Mooncake RDMA KV transfer)
+**Result**: TTFT +72%, TPOT +1%, APC -5pp vs combined
+
+**Root cause**: KV cache memory wall on decode instances. With avg 33.6k input and dedicated decode instances:
+- Decode KV cache fills to 97.1%
+- GPU idle (Running: 0), but new requests queue for KV cache memory
+- 87.7% of TTFT is spent waiting for KV cache space
+
+**Why it's different from chatbot**: Chatbot has short context (1-5k), so decode KV cache rarely fills. Agentic has 33k+ context, requiring 4-8GB KV per request → 2-3 concurrent requests saturate a single GPU's KV cache.
+
+### 2.2 LMetric (OSDI'26, P_tokens × BS multiplication routing)
+
+**Setup**: 8 instances, LMetric vs linear routing
+**Result**: TTFT +2.2%, TPOT -4.4%, E2E +2.6%; all within noise (±7% run-to-run)
+
+**Root cause**: Session affinity constrains routing freedom. LMetric's benefit (hyperparameter-free load balancing) is neutralized because turn 2+ requests MUST go to their session-sticky instance regardless of the scoring function. With 90% of multi-turn requests locked by affinity, only turn-1 placement is influenced by the score, which is too few decisions to make a difference.
+
+### 2.3 Elastic P2P RDMA Offload (Heavy prefill on different instance)
+
+**Setup**: 8 instances (kv_both), HEAVY requests prefilled on different instance, KV transferred via Mooncake RDMA
+**Result**: E2E +37%, TPOT +11.6%; significantly worse
+
+**Root causes**:
+1. **Mooncake lacks layerwise KV transfer**: All blocks transferred after prefill completes (sequential, not pipelined). Transfer p50=1.1s for 40k tokens, highly variable (R²=0.095 vs input length).
+2. **92% of HEAVY are turn-1 cold**: No cache to exploit on the P instance → full cold prefill is always slower than co-located cached prefill.
+3. **kv_both has non-zero overhead at high concurrency**: Zero overhead at idle (Phase 0), but +16% TPOT at 16 sessions (background RDMA threads compete for resources).
+
+### 2.4 Dedicated Prefill Service
+
+**Setup**: 1 PS (no sessions) + 7 C (session-sticky)
+**Result**: PS either gets 0 offloads (cost model correctly identifies it's more expensive) or gets too many (cascading timeouts)
+
+**Root cause**: Without KV pull from C, PS does cold prefill (full input) which is always slower than C's cached prefill. With KV pull, double RDMA transfer overhead negates the benefit.
+
+### 2.5 Chunk Size Tuning (max_num_batched_tokens)
+
+**Setup**: 2048/4096/8192/16384 at 16 sessions
+**Result**: Default 8192 is optimal; smaller chunks add scheduler overhead, larger chunks help HEAVY but hurt overall
+
+### 2.6 OVERLOAD_FACTOR Tuning
+
+**Setup**: 2.0/1.5/1.3/1.0 session affinity breaking threshold
+**Result**: No effect; the imbalance comes from workload skew, not routing
+
+## 3. What DOES Work
+
+### 3.1 Cache-Aware Session-Sticky Routing (the dominant optimization)
+
+**Setup**: `score = ongoing_tokens - α × cache_hit_tokens`, session affinity for turn 2+
+**Result vs round-robin**:
+
+| Metric | Round-Robin | Cache-Aware | Delta |
+|--------|------------|-------------|-------|
+| TTFT p50 | 1.836s | **0.731s** | **-60%** |
+| TPOT p90 | 0.086s | **0.073s** | **-15%** |
+| APC | 20.8% | **44.7%** | **+24pp** |
+
+**Why it works for agentic**: 91% of KV reuse is intra-session. Session-sticky routing ensures subsequent turns find their KV cache on the same instance. Cache-aware scoring steers turn-1 requests to instances with matching system prompt blocks (47% of blocks are shared across sessions).
+
+### 3.2 GPU Balance → Latency Improvement (H4 evidence)
+
+When GPU imbalance was reduced from 4.0x to 2.0x (via H4 cache-gate routing):
+- HEAVY_COLO TTFT: 7.02s → **6.28s (-10.5%)**
+- No TPOT regression
+
+**Mechanism**: Hot GPU (63.3% util) causes queuing delays for co-located requests. Spreading load more evenly eliminates the queuing bottleneck.
+
+**Limitation**: Only demonstrated for the 52/60 HEAVY requests that stayed co-located. The 8 offloaded requests had RDMA overhead. A routing-only approach to achieve balance (without RDMA) would be ideal.
+
+## 4. System-Level Insights
+
+### 4.1 Prefill-Decode Interference Threshold
+
+| Concurrency | TPOT p90 (s) | MEDIUM TPOT p90 (s) | GPU Util |
+|------------|----------|-----------------|----------|
+| 8 sessions (1/GPU) | 0.073 | 0.075 | 30% |
+| **16 sessions (2/GPU)** | **0.106 (+45%)** | **0.197 (+163%)** | 25% |
+
+At >1 session per GPU, chunked prefill interference becomes significant. MEDIUM requests' TPOT p90 degrades about 2.6x (0.075s → 0.197s) because HEAVY prefill chunks block their decode steps.
+
+### 4.2 Mooncake Transfer Engine Limitations
+
+- **No layerwise transfer**: All KV blocks transferred after full prefill → pure sequential overhead
+- **High variance**: R²=0.095 (transfer time is only weakly correlated with data size), bimodal distribution (0.6s or 18-30s)
+- **Zero idle overhead**: kv_both mode has no cost when not transferring (Phase 0 validated)
+- **Non-zero overhead at high concurrency**: +16% TPOT at 16 sessions (background threads)
+
+### 4.3 KV Cache Reuse Structure
+
+```
+Total trace: 2.1M requests, 71% theoretical APC
+Reuse breakdown:
+  91% intra-session (same session, subsequent turns)
+  4.8% cross-session (shared system prompts)
+  4.2% unique (no reuse)
+
+Effective APC achieved:
+  Round-robin: 20.8% (destroys session locality)
+  Cache-aware: 44.7% (preserves session locality)
+  Theoretical max: 71% (infinite cache, single instance)
+  Gap: 26pp from eviction + routing imperfection
+```
+
+### 4.4 Request Weight After Cache
+
+A critical insight: the "weight" of a request for scheduling should be `new_tokens = input - cached`, not `total_input`.
+
+| Request | Total Input | After 70% Cache | Effective Weight |
+|---------|------------|-----------------|-----------------|
+| 80k HEAVY | 80k tokens | **24k tokens** | **MEDIUM** |
+| 30k MEDIUM | 30k tokens | 9k tokens | WARM |
+| 5k WARM | 5k tokens | 5k tokens | WARM |
+
+This changes the scheduling picture: most "HEAVY" requests in agentic workloads are effectively MEDIUM after cache, so PD separation's premise (heavy prefill needs dedicated resources) doesn't apply.
+
+## 5. Paper-Ready Summary
+
+### Agentic Workload Characteristics (vs prior LLM serving work):
+1. **Extreme I/O ratio** (75x) → 98% of compute is prefill
+2. **High intra-session KV reuse** (91%) → session affinity is critical
+3. **Bimodal request weight** (51% HEAVY by input, but only ~15% HEAVY by new_tokens after cache)
+4. **Multi-turn session structure** → routing decisions have long-term consequences (session migration destroys cache)
+
+### Why existing approaches don't work:
+1. **PD-Sep** assumes decode needs dedicated resources → agentic has memory wall on decode
+2. **LMetric** assumes routing freedom → agentic has session affinity constraints
+3. **Elastic RDMA** assumes KV transfer is cheap → Mooncake lacks layerwise pipelining
+4. **Size-based classification** assumes HEAVY = needs special handling → after cache, most HEAVY is MEDIUM
+
+### Our insights:
+1. **Cache-aware session-sticky routing** is the dominant optimization: -60% TTFT, +24pp APC
+2. **Routing quality > PD separation > eviction policy** (simulation verified: routing gives 24pp, eviction gives 1.8pp)
+3. **Effective request weight** (new_tokens, not total_input) should drive scheduling
+4. **Prefill-decode interference** only matters at >1 session/GPU (rarely reached in production clusters)
+5. **GPU balance improvement** directly improves HEAVY TTFT (-10.5% demonstrated)
+
+### Quantitative improvements (this work vs vLLM default):
+- TTFT p50: **-60%** (0.731s vs 1.836s) from cache-aware session-sticky routing
+- APC: **+24pp** (44.7% vs 20.8%) from cache-aware session-sticky routing
+- TPOT p90: **-15%** (0.073s vs 0.086s) from reduced prefill-decode interference via better cache utilization
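
Note on the patch (not part of the committed file): the routing rule in section 3.1 (`score = ongoing_tokens - α × cache_hit_tokens`, with session affinity for turn 2+) and the effective weight in section 4.4 (`new_tokens = input - cached`) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the router used in the experiments. The names `Instance`, `CacheAwareRouter`, `cached`, and `prompt_id`, and the default `alpha=1.0`, are hypothetical. Cache eviction and request completion are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical router-side view of one serving instance."""
    name: str
    ongoing_tokens: int = 0                      # tokens queued or running
    cached: dict = field(default_factory=dict)   # cache key -> cached tokens


class CacheAwareRouter:
    """Sketch of cache-aware, session-sticky routing (assumed structure)."""

    def __init__(self, instances, alpha=1.0):
        self.instances = instances
        self.alpha = alpha        # weight of cache hits; the value is an assumption
        self.sticky = {}          # session_id -> Instance

    def score(self, inst, prompt_id):
        # Lower is better: load minus the credit for reusable cached tokens.
        cache_hit_tokens = inst.cached.get(prompt_id, 0)
        return inst.ongoing_tokens - self.alpha * cache_hit_tokens

    def route(self, session_id, prompt_id, input_tokens):
        inst = self.sticky.get(session_id)
        if inst is None:
            # Turn 1: choose by score, then pin the session to that instance.
            inst = min(self.instances, key=lambda i: self.score(i, prompt_id))
            self.sticky[session_id] = inst
            cached = inst.cached.get(prompt_id, 0)
        else:
            # Turn 2+: stay on the pinned instance and reuse the session's KV.
            cached = inst.cached.get(session_id, 0)
        cached = min(cached, input_tokens)
        new_tokens = input_tokens - cached       # effective request weight
        inst.ongoing_tokens += new_tokens
        inst.cached[session_id] = input_tokens   # context is now resident here
        return inst.name, new_tokens


if __name__ == "__main__":
    a = Instance("a", ongoing_tokens=5000, cached={"sys": 8000})
    b = Instance("b")
    router = CacheAwareRouter([a, b])
    print(router.route("s1", "sys", 20000))  # turn 1 of session s1
    print(router.route("s1", "sys", 30000))  # turn 2 stays on the same instance
```

In this sketch the second turn is charged only for the tokens beyond the first turn's context, which is the sense in which a request that is HEAVY by total input can be MEDIUM by effective weight.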