Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after cc6e562 / 4c583f2; point readers to docs/migration-policy-design.md.
- docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison".
- analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity.
- analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction.
- analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Research Findings: KV Cache Optimization for Agentic LLM Workloads

**Date**: 2026-05-23
**Author**: Gahow Wang

---

## 1. Agentic Workload Characteristics (vs Chatbot/API)

| Property | Chatbot/API | **Agentic (this work)** |
|----------|-------------|----------------------|
| Input length | 1-5k tokens | **avg 33.6k, p90 88k tokens** |
| Output length | 100-500 tokens | **avg 445 tokens** (similar) |
| I/O ratio (input/output tokens) | 1-10x | **75.6x** |
| Prefill token share | 50-70% | **98%** |
| KV reuse | Low (independent requests) | **71% theoretical APC; 91% of reuse is intra-session** |
| Session structure | Mostly single-turn | **Multi-turn chains (parent_chat_id)** |
| Request weight distribution | Uniform | **Bimodal: 49% WARM/MEDIUM, 51% HEAVY** |

These characteristics fundamentally change which optimizations matter.

## 2. What Doesn't Work (and Why)

### 2.1 PD Disaggregation (DistServe/Splitwise approach)

**Setup**: 4P + 4D instances (Mooncake RDMA KV transfer)
**Result**: TTFT +72%, TPOT +1%, APC -5pp vs combined (non-disaggregated) instances

**Root cause**: KV cache memory wall on decode instances. With avg 33.6k input and dedicated decode instances:
- Decode KV cache fills to 97.1%
- GPUs sit idle (`Running: 0`) while new requests queue for KV cache memory
- 87.7% of TTFT is spent waiting for KV cache space

**Why it's different from chatbot**: Chatbot workloads have short contexts (1-5k tokens), so the decode KV cache rarely fills. Agentic requests carry 33k+ tokens of context, requiring 4-8 GB of KV per request, so 2-3 concurrent requests saturate a single GPU's KV cache.

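The 4-8 GB figure can be sanity-checked with the standard per-token KV size formula (one K and one V vector per layer). The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 12 GiB KV budget are hypothetical values chosen for the example, not the configuration used in these experiments.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_requests(kv_budget_bytes: int, context_tokens: int,
                            bytes_per_token: int) -> int:
    """How many requests of a given context length fit in the KV budget."""
    return kv_budget_bytes // (context_tokens * bytes_per_token)


# Hypothetical model shape and KV budget (assumptions, not measured values).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)
kv_budget = 12 * GIB

agentic_ctx = 33_600  # average agentic input length (Section 1)
chatbot_ctx = 3_000   # mid-range chatbot input length (1-5k tokens)

agentic_gib = agentic_ctx * per_token / GIB
agentic_slots = max_concurrent_requests(kv_budget, agentic_ctx, per_token)
chatbot_slots = max_concurrent_requests(kv_budget, chatbot_ctx, per_token)

print(f"KV per token:        {per_token // 1024} KiB")
print(f"KV per agentic req:  {agentic_gib:.1f} GiB")
print(f"Concurrent requests: agentic={agentic_slots}, chatbot={chatbot_slots}")
```

Under these assumed numbers an average agentic request needs roughly 4 GiB of KV and only two fit at once, while dozens of chatbot-sized requests fit in the same budget. That is the memory wall in miniature.
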
### 2.2 LMetric (OSDI'26, P_tokens × BS multiplication routing)

**Setup**: 8 instances, LMetric vs linear routing
**Result**: TTFT +2.2%, TPOT -4.4%, E2E +2.6%, all within noise (±7% run-to-run)

**Root cause (updated)**: LMetric is not "neutralized by affinity
constraints": pure `--policy lmetric` runs without session affinity at all.
The actual reason the LMetric vs linear comparison sits within noise is that
`P_tokens` already includes `new_uncached_tokens = input_length - cache_hit`,
so later turns of a session naturally score lowest on the instance that
cached their prefix. This gives LMetric an **implicit soft affinity** that
rivals linear's explicit sticky affinity. The two policies arrive at similar
placements through different mechanisms.

This is also why explicit migration buys little on top of LMetric: the
first-order signal driving placement is already cache-derived. See
`docs/migration-policy-design.md` for how the current hybrid policy uses
this insight (LMetric base + explicit affinity only when `cache_ratio > 0.5`).

**Previous framing (incorrect)**: an earlier draft of this section attributed
the result to session affinity constraining LMetric's routing freedom. That
framing assumed `--policy lmetric` inherited the linear-mode session-sticky
behavior, which it does not (verified in `tests/test_proxy_pick.py`).

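The implicit soft affinity can be illustrated with a small sketch. The exact LMetric formula is not reproduced in this document, so the code below assumes a score of `P_tokens × BS`, where `P_tokens` is the instance's queued prefill tokens plus the request's `new_uncached_tokens`, and `BS` is the batch size after admitting the request. `Instance`, `lmetric_score`, `pick`, and all numbers are hypothetical and are not taken from the proxy implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    queued_prefill_tokens: int  # prefill tokens already waiting here
    batch_size: int             # requests currently running here
    # session_id -> number of prefix tokens cached on this instance
    cached_prefix: dict[str, int] = field(default_factory=dict)


def lmetric_score(inst: Instance, session_id: str, input_length: int) -> int:
    """Assumed LMetric-style score: lower is better."""
    cache_hit = min(inst.cached_prefix.get(session_id, 0), input_length)
    new_uncached_tokens = input_length - cache_hit
    p_tokens = inst.queued_prefill_tokens + new_uncached_tokens
    return p_tokens * (inst.batch_size + 1)


def pick(instances: list[Instance], session_id: str, input_length: int) -> Instance:
    """Route to the instance with the lowest score; no explicit affinity."""
    return min(instances, key=lambda i: lmetric_score(i, session_id, input_length))


instances = [
    # Busier instance, but it holds session s1's 30k-token prefix.
    Instance("a", queued_prefill_tokens=4_000, batch_size=2,
             cached_prefix={"s1": 30_000}),
    # Idle-ish instance with nothing cached.
    Instance("b", queued_prefill_tokens=0, batch_size=1),
]

later_turn = pick(instances, "s1", 34_000)   # later turn of a cached session
first_turn = pick(instances, "s2", 34_000)   # turn 1 of a new session
print(later_turn.name, first_turn.name)
```

In this example the later turn returns to the busier instance `a` because its uncached remainder is small, while the cold turn-1 request goes to the less loaded instance `b`. No session table is consulted in either case, which is the sense in which the affinity is implicit.
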
### 2.3 Elastic P2P RDMA Offload (Heavy prefill on different instance)

**Setup**: 8 instances (kv_both), HEAVY requests prefilled on a different instance, KV transferred via Mooncake RDMA
**Result**: E2E +37%, TPOT +11.6%, both significantly worse

**Root causes**:
1. **Mooncake lacks layerwise KV transfer**: All blocks are transferred after prefill completes (sequential, not pipelined). Transfer p50=1.1s for 40k tokens, highly variable (R²=0.095 vs input length).
2. **92% of HEAVY requests are turn-1 cold**: There is no cache to exploit on the P instance → full cold prefill is always slower than co-located cached prefill.
3. **kv_both has non-zero overhead at high concurrency**: Zero overhead at idle (Phase 0), but +16% TPOT at 16 sessions (background RDMA threads compete for resources).

### 2.4 Dedicated Prefill Service

**Setup**: 1 PS (no sessions) + 7 C (session-sticky)
**Result**: The PS either receives 0 offloads (the cost model correctly identifies offloading as more expensive) or receives too many (cascading timeouts)

**Root cause**: Without KV pull from C, the PS does a cold prefill of the full input, which is always slower than C's cached prefill. With KV pull, the double RDMA transfer overhead negates the benefit.

### 2.5 Chunk Size Tuning (max_num_batched_tokens)

**Setup**: 2048/4096/8192/16384 at 16 sessions
**Result**: The default of 8192 is optimal; smaller chunks add scheduler overhead, while larger chunks help HEAVY requests but hurt overall latency

### 2.6 OVERLOAD_FACTOR Tuning

**Setup**: 2.0/1.5/1.3/1.0 session affinity breaking threshold
**Result**: No effect; the imbalance comes from workload skew, not routing

## 3. What DOES Work

### 3.1 Cache-Aware Session-Sticky Routing (the dominant optimization)

**Setup**: `score = ongoing_tokens - α × cache_hit_tokens`, session affinity for turn 2+
**Result vs round-robin**:

| Metric | Round-Robin | Cache-Aware | Delta |
|--------|------------|-------------|-------|
| TTFT p50 | 1.836s | **0.731s** | **-60%** |
| TPOT p90 | 0.086s | **0.073s** | **-15%** |
| APC | 20.8% | **44.7%** | **+24pp** |

**Why it works for agentic**: 91% of KV reuse is intra-session. Session-sticky routing ensures subsequent turns find their KV cache on the same instance. Cache-aware scoring steers turn-1 requests to instances with matching system prompt blocks (47% of blocks are shared across sessions).

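A minimal sketch of this policy, assuming the lowest score wins and that affinity is a plain session-to-instance map. The class and field names, the value of `α`, and the example loads are hypothetical; the overload-based affinity breaking from Section 2.6 is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    name: str
    ongoing_tokens: int    # tokens of in-flight work on this instance
    cache_hit_tokens: int  # tokens of this request's prefix already cached here


class CacheAwareRouter:
    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # weight of cache hits vs load (assumed value)
        self.session_to_instance: dict[str, str] = {}

    def score(self, inst: InstanceState) -> float:
        """score = ongoing_tokens - alpha * cache_hit_tokens (lower is better)."""
        return inst.ongoing_tokens - self.alpha * inst.cache_hit_tokens

    def route(self, session_id: str, instances: list[InstanceState]) -> str:
        # Turn 2+: stay on the instance that already holds the session's KV.
        sticky = self.session_to_instance.get(session_id)
        if sticky is not None:
            return sticky
        # Turn 1: pick by cache-aware score, then remember the placement.
        best = min(instances, key=self.score)
        self.session_to_instance[session_id] = best.name
        return best.name


router = CacheAwareRouter(alpha=1.0)

# Turn 1: gpu1 is busier but already caches 15k tokens of shared system prompt.
turn1 = router.route("s1", [
    InstanceState("gpu0", ongoing_tokens=20_000, cache_hit_tokens=0),
    InstanceState("gpu1", ongoing_tokens=30_000, cache_hit_tokens=15_000),
])

# Turn 2: loads have shifted, but the session stays where its KV cache lives.
turn2 = router.route("s1", [
    InstanceState("gpu0", ongoing_tokens=5_000, cache_hit_tokens=0),
    InstanceState("gpu1", ongoing_tokens=90_000, cache_hit_tokens=30_000),
])
print(turn1, turn2)
```

The first call shows cache-aware scoring outweighing a modest load difference for a turn-1 request; the second shows stickiness overriding the score entirely for later turns.
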
### 3.2 GPU Balance → Latency Improvement (H4 evidence)

When GPU imbalance was reduced from 4.0x to 2.0x (via H4 cache-gate routing):
- HEAVY_COLO TTFT: 7.02s → **6.28s (-10.5%)**
- No TPOT regression

**Mechanism**: A hot GPU (63.3% util) causes queuing delays for co-located requests. Spreading load more evenly eliminates the queuing bottleneck.

**Limitation**: This was only demonstrated for the 52 of 60 HEAVY requests that stayed co-located. The 8 offloaded requests incurred RDMA overhead. A routing-only approach that achieves balance without RDMA would be ideal.

## 4. System-Level Insights

### 4.1 Prefill-Decode Interference Threshold

| Concurrency | TPOT p90 (s) | MEDIUM TPOT p90 (s) | GPU Util |
|------------|----------|-----------------|----------|
| 8 sessions (1/GPU) | 0.073 | 0.075 | 30% |
| **16 sessions (2/GPU)** | **0.106 (+45%)** | **0.197 (+163%)** | 25% |

At >1 session per GPU, chunked prefill interference becomes significant. MEDIUM requests' TPOT degrades by about 2.6x (0.075s → 0.197s) because HEAVY prefill chunks block their decode steps.

### 4.2 Mooncake Transfer Engine Limitations

- **No layerwise transfer**: All KV blocks are transferred after the full prefill → pure sequential overhead
- **High variance**: R²=0.095 (transfer time is nearly uncorrelated with input length), bimodal distribution (0.6s or 18-30s)
- **Zero idle overhead**: kv_both mode has no cost when not transferring (Phase 0 validated)
- **Non-zero overhead at high concurrency**: +16% TPOT at 16 sessions (background threads)

### 4.3 KV Cache Reuse Structure

```
Total trace: 2.1M requests, 71% theoretical APC
Reuse breakdown:
  91% intra-session (same session, subsequent turns)
  4.8% cross-session (shared system prompts)
  4.2% unique (no reuse)

Effective APC achieved:
  Round-robin: 20.8% (destroys session locality)
  Cache-aware: 44.7% (preserves session locality)
  Theoretical max: 71% (infinite cache, single instance)
  Gap: 26pp from eviction + routing imperfection
```

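The gap figures above follow directly from the reported APC values. The short calculation below only rearranges numbers already listed in this section; the function name `apc_summary` is made up for the example.

```python
THEORETICAL_APC = 71.0  # percent, infinite cache on a single instance

achieved_apc = {"round-robin": 20.8, "cache-aware": 44.7}  # percent


def apc_summary(apc: float, ceiling: float = THEORETICAL_APC) -> tuple[float, float]:
    """Return (gap to the ceiling in pp, share of the ceiling realized)."""
    return round(ceiling - apc, 1), round(apc / ceiling, 2)


summary = {policy: apc_summary(apc) for policy, apc in achieved_apc.items()}
routing_gain_pp = round(achieved_apc["cache-aware"] - achieved_apc["round-robin"], 1)

for policy, (gap_pp, realized) in summary.items():
    print(f"{policy:12s} gap={gap_pp}pp realized={realized:.0%}")
print(f"routing gain: {routing_gain_pp}pp")
```

Cache-aware routing realizes roughly 63% of the theoretical ceiling, against roughly 29% for round-robin; the remaining 26pp is the eviction and routing-imperfection gap noted above.
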
### 4.4 Request Weight After Cache

A critical insight: the "weight" of a request for scheduling should be `new_tokens = input - cached`, not `total_input`.

| Request | Total Input | After 70% Cache | Effective Weight |
|---------|------------|-----------------|-----------------|
| 80k HEAVY | 80k tokens | **24k tokens** | **MEDIUM** |
| 30k MEDIUM | 30k tokens | 9k tokens | WARM |
| 5k WARM | 5k tokens | 5k tokens | WARM |

This changes the scheduling picture: most "HEAVY" requests in agentic workloads are effectively MEDIUM after cache, so PD separation's premise (heavy prefill needs dedicated resources) doesn't apply.

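The reclassification can be sketched as follows. The class boundaries (`WARM_MAX`, `MEDIUM_MAX`) are hypothetical thresholds chosen only so that the labels match the table; the document does not state the real cut-offs. The 5k request is modeled with no cached prefix, which is what the table's last row implies.

```python
# Hypothetical class boundaries in tokens (assumptions, not from the report).
WARM_MAX = 10_000
MEDIUM_MAX = 40_000


def classify(tokens: int) -> str:
    """Map a token count to a weight class using the assumed thresholds."""
    if tokens < WARM_MAX:
        return "WARM"
    if tokens < MEDIUM_MAX:
        return "MEDIUM"
    return "HEAVY"


def effective_weight(total_input: int, cached_tokens: int) -> int:
    """new_tokens = input - cached: the prefill work that actually remains."""
    return total_input - cached_tokens


# (total input tokens, cached tokens); 70% cached for the first two rows.
requests = [
    (80_000, 80_000 * 70 // 100),
    (30_000, 30_000 * 70 // 100),
    (5_000, 0),
]

rows = []
for total, cached in requests:
    new_tokens = effective_weight(total, cached)
    rows.append((classify(total), new_tokens, classify(new_tokens)))
    print(f"{total:>6} input -> {new_tokens:>6} new tokens: "
          f"{classify(total)} by input, {classify(new_tokens)} by weight")
```

Scheduling on `new_tokens` rather than `total_input` moves the 80k request from HEAVY to MEDIUM and the 30k request from MEDIUM to WARM, matching the table.
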
## 5. Paper-Ready Summary

### Agentic Workload Characteristics (vs prior LLM serving work):
1. **Extreme I/O ratio** (75x) → 98% of compute is prefill
2. **High intra-session KV reuse** (91%) → session affinity is critical
3. **Bimodal request weight** (51% HEAVY by input, but only ~15% HEAVY by new_tokens after cache)
4. **Multi-turn session structure** → routing decisions have long-term consequences (session migration destroys cache)

### Why existing approaches don't work:
1. **PD-Sep** assumes decode needs dedicated resources → agentic has memory wall on decode
2. **LMetric** matches linear within noise because cache-hit appears in
   `P_tokens` itself, so it already routes later turns back to the cached
   instance via implicit soft affinity; explicit affinity buys little
3. **Elastic RDMA** assumes KV transfer is cheap → Mooncake lacks layerwise pipelining
4. **Size-based classification** assumes HEAVY = needs special handling → after cache, most HEAVY is MEDIUM

### Our insights:
1. **Cache-aware session-sticky routing** is the dominant optimization: -60% TTFT, +24pp APC
2. **Routing quality > PD separation > eviction policy** (verified in simulation: routing gives 24pp, eviction gives 1.8pp)
3. **Effective request weight** (new_tokens, not total_input) should drive scheduling
4. **Prefill-decode interference** only matters at >1 session/GPU (rarely reached in production clusters)
5. **GPU balance improvement** directly improves HEAVY TTFT (-10.5% demonstrated)

### Quantitative improvements (this work vs vLLM default):
- TTFT p50: **-60%** (0.731s vs 1.836s) from cache-aware session-sticky routing
- APC: **+24pp** (44.7% vs 20.8%)
- TPOT p90: **-15%** (0.073s vs 0.086s) from reduced prefill-decode interference via better cache utilization