agentic-kvc/analysis/research_findings.md
Gahow Wang 6a27f75337 Docs: reconcile routing docs with current hybrid direction
Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting
  the "Final Design" framing was retired after cc6e562 / 4c583f2;
  point readers to docs/migration-policy-design.md.

- docs/migration-policy-design.md: rewrite. Opens with the current
  hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
  tie-breaker), then a "What Was Retired" commit table, then the old
  Approach A numbers preserved as "Historical Baseline-Mode Comparison".

- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy
  lmetric has no affinity at all); it converges to similar placements
  because P_tokens includes new_uncached_tokens, giving it implicit
  soft affinity.

- analysis/elastic_hypotheses.md: same LMetric correction in the
  "DOESN'T work" summary, plus a footer cross-referencing the current
  routing direction.

- analysis/unified_routing_fix_review.md: track this file (was
  untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00


Research Findings: KV Cache Optimization for Agentic LLM Workloads

Date: 2026-05-23

Author: Gahow Wang


1. Agentic Workload Characteristics (vs Chatbot/API)

| Property | Chatbot/API | Agentic (this work) |
|---|---|---|
| Input length | 1-5k tokens | avg 33.6k, p90 88k |
| Output length | 100-500 tokens | avg 445 (similar) |
| I/O ratio | 1-10x | 75.6x |
| Prefill token share | 50-70% | 98% |
| KV reuse | Low (independent requests) | 71% theoretical, 91% intra-session |
| Session structure | Mostly single-turn | Multi-turn chains (parent_chat_id) |
| Request weight distribution | Uniform | Bimodal: 49% WARM/MEDIUM, 51% HEAVY |

These characteristics fundamentally change what optimizations matter.

2. What Doesn't Work (and Why)

2.1 PD Disaggregation (DistServe/Splitwise approach)

Setup: 4P + 4D instances (Mooncake RDMA KV transfer)

Result: TTFT +72%, TPOT +1%, APC -5pp vs combined

Root cause: KV cache memory wall on decode instances. With avg 33.6k input and dedicated decode instances:

  • Decode KV cache fills to 97.1%
  • GPU idle (Running: 0), but new requests queue for KV cache memory
  • 87.7% of TTFT is spent waiting for KV cache space

Why it's different from chatbot: Chatbot has short context (1-5k), so decode KV cache rarely fills. Agentic has 33k+ context, requiring 4-8GB KV per request → 2-3 concurrent requests saturate a single GPU's KV cache.
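The per-request footprint can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes a hypothetical GQA model shape (32 layers, 8 KV heads, head dimension 128, 2-byte KV entries) and a hypothetical 12 GiB KV budget per GPU; neither value comes from this report, and the actual model and budget may differ.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_gib(tokens: int, per_token_bytes: int) -> float:
    """KV cache footprint of a request, in GiB."""
    return tokens * per_token_bytes / 2**30


# Hypothetical model shape (an assumption, not taken from this report).
PER_TOKEN = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

agentic_gib = kv_gib(33_600, PER_TOKEN)  # average agentic input length
chatbot_gib = kv_gib(5_000, PER_TOKEN)   # upper end of chatbot input length

# Hypothetical KV budget left on one GPU after weights and activations.
KV_BUDGET_GIB = 12
max_concurrent = int(KV_BUDGET_GIB // agentic_gib)

print(f"agentic: {agentic_gib:.2f} GiB, chatbot: {chatbot_gib:.2f} GiB, "
      f"concurrent agentic requests: {max_concurrent}")
```

Under these assumptions an average agentic request needs about 4.1 GiB of KV cache, in line with the 4-8GB range above, while a chatbot request stays well under 1 GiB.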

2.2 LMetric (OSDI'26, P_tokens × BS multiplication routing)

Setup: 8 instances, LMetric vs linear routing

Result: TTFT +2.2%, TPOT -4.4%, E2E +2.6%, all within noise (±7% run-to-run)

Root cause (updated): LMetric is not "neutralized by affinity constraints" — pure --policy lmetric runs without session affinity at all. The actual reason the LMetric vs linear comparison sits within noise is that P_tokens already includes new_uncached_tokens = input_length - cache_hit, which means later turns of a session naturally score lowest on the instance that cached their prefix. This gives LMetric an implicit soft affinity that competes with linear's explicit sticky affinity. The two arrive at similar placements through different mechanisms.

This is also why explicit migration buys little on top of LMetric: the first-order signal driving placement is already cache-derived. See docs/migration-policy-design.md for how the current hybrid policy uses this insight (LMetric base + explicit affinity only when cache_ratio > 0.5).
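The implicit soft affinity can be illustrated with a minimal sketch of a P_tokens × BS score. The `Instance` fields, the `+ 1` on batch size, and the example numbers are assumptions made for illustration; this is not the proxy's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    queued_prefill_tokens: int  # uncached tokens already waiting here
    batch_size: int             # requests currently running here
    cache_hit_tokens: int       # prefix tokens of the incoming request cached here


def lmetric_score(inst: Instance, input_length: int) -> int:
    """P_tokens x BS, where P_tokens includes the request's uncached tokens."""
    new_uncached_tokens = input_length - inst.cache_hit_tokens
    p_tokens = inst.queued_prefill_tokens + new_uncached_tokens
    # Assumption: count the incoming request so an idle instance does not score 0.
    return p_tokens * (inst.batch_size + 1)


def pick(instances: list[Instance], input_length: int) -> Instance:
    """Route to the instance with the lowest score."""
    return min(instances, key=lambda inst: lmetric_score(inst, input_length))


# Turn 2 of a session: instance A holds 36k of the 40k-token prefix and is busier.
a = Instance("A", queued_prefill_tokens=2_000, batch_size=2, cache_hit_tokens=36_000)
b = Instance("B", queued_prefill_tokens=0, batch_size=1, cache_hit_tokens=0)
chosen = pick([a, b], input_length=40_000)
print(chosen.name, lmetric_score(a, 40_000), lmetric_score(b, 40_000))
```

Even though A is the busier instance, its cached prefix shrinks `new_uncached_tokens` enough that it still scores lowest, with no explicit session table involved.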

Previous framing (incorrect): an earlier draft of this section attributed the result to session affinity constraining LMetric's routing freedom. That framing assumed --policy lmetric inherited the linear-mode session-sticky behavior, which it does not (verified in tests/test_proxy_pick.py).

2.3 Elastic P2P RDMA Offload (Heavy prefill on different instance)

Setup: 8 instances (kv_both), HEAVY requests prefilled on a different instance, KV transferred via Mooncake RDMA

Result: E2E +37%, TPOT +11.6% (significantly worse)

Root causes:

  1. Mooncake lacks layerwise KV transfer: All blocks transferred after prefill completes (sequential, not pipelined). Transfer p50=1.1s for 40k tokens, highly variable (R²=0.095 vs input length).
  2. 92% of HEAVY are turn-1 cold: No cache to exploit on the P instance → full cold prefill is always slower than co-located cached prefill.
  3. kv_both has non-zero overhead at high concurrency: Zero overhead at idle (Phase 0), but +16% TPOT at 16 sessions (background RDMA threads compete for resources).
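To make the cost of the missing layerwise transfer concrete, the sketch below compares a sequential prefill-then-transfer schedule with an idealized two-stage pipeline in which each layer's KV is sent as soon as it is computed. The uniform per-layer timing, the 4.0s prefill time, and the 32-layer count are assumptions; only the 1.1s transfer p50 comes from the measurements above.

```python
def sequential_finish(prefill_s: float, transfer_s: float) -> float:
    """All KV blocks are sent only after the whole prefill completes."""
    return prefill_s + transfer_s


def layerwise_finish(prefill_s: float, transfer_s: float, num_layers: int) -> float:
    """Idealized pipeline: layer i's KV is sent while layer i+1 is computed."""
    compute = prefill_s / num_layers   # per-layer prefill time (assumed uniform)
    send = transfer_s / num_layers     # per-layer transfer time (assumed uniform)
    # Two-stage pipeline makespan: first compute, last send, and the slower
    # stage paced across the remaining layers.
    return compute + (num_layers - 1) * max(compute, send) + send


PREFILL_S = 4.0    # hypothetical prefill time for a ~40k-token request
TRANSFER_S = 1.1   # transfer p50 reported above
LAYERS = 32        # hypothetical layer count

seq = sequential_finish(PREFILL_S, TRANSFER_S)
pipe = layerwise_finish(PREFILL_S, TRANSFER_S, LAYERS)
print(f"sequential: {seq:.3f}s, layerwise: {pipe:.3f}s")
```

In this idealized model almost all of the transfer time hides behind compute, leaving only the last layer's send exposed. The sequential schedule pays the full transfer time on top of prefill.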

2.4 Dedicated Prefill Service

Setup: 1 PS (no sessions) + 7 C (session-sticky)

Result: PS either gets 0 offloads (the cost model correctly identifies that offloading is more expensive) or gets too many (cascading timeouts)

Root cause: Without KV pull from C, PS does cold prefill (full input) which is always slower than C's cached prefill. With KV pull, double RDMA transfer overhead negates the benefit.

2.5 Chunk Size Tuning (max_num_batched_tokens)

Setup: 2048/4096/8192/16384 at 16 sessions

Result: Default 8192 is optimal; smaller chunks add scheduler overhead, larger chunks help HEAVY but hurt overall

2.6 OVERLOAD_FACTOR Tuning

Setup: 2.0/1.5/1.3/1.0 session affinity breaking threshold

Result: No effect; the imbalance comes from workload skew, not routing

3. What DOES Work

3.1 Cache-Aware Session-Sticky Routing (the dominant optimization)

Setup: score = ongoing_tokens - α × cache_hit_tokens, session affinity for turn 2+

Result vs round-robin:

| Metric | Round-Robin | Cache-Aware | Delta |
|---|---|---|---|
| TTFT p50 | 1.836s | 0.731s | -60% |
| TPOT p90 | 0.086s | 0.073s | -15% |
| APC | 20.8% | 44.7% | +24pp |

Why it works for agentic: 91% of KV reuse is intra-session. Session-sticky routing ensures subsequent turns find their KV cache on the same instance. Cache-aware scoring steers turn-1 requests to instances with matching system prompt blocks (47% of blocks are shared across sessions).
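A minimal sketch of this policy follows, assuming the lowest score wins and that a session's first placement becomes its home instance. The `Router` class, its field names, and the default α = 1.0 are illustrative assumptions rather than the proxy's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    alpha: float = 1.0                                  # hypothetical cache weight
    ongoing_tokens: dict = field(default_factory=dict)  # instance -> tokens in flight
    session_home: dict = field(default_factory=dict)    # session id -> instance

    def route(self, session_id: str, cache_hit_tokens: dict) -> str:
        """cache_hit_tokens maps instance -> prefix tokens of this request cached there."""
        # Turn 2+: stay on the instance that already holds the session's KV cache.
        if session_id in self.session_home:
            return self.session_home[session_id]
        # Turn 1: score = ongoing_tokens - alpha * cache_hit_tokens, lowest wins.
        best = min(
            self.ongoing_tokens,
            key=lambda inst: self.ongoing_tokens[inst]
            - self.alpha * cache_hit_tokens.get(inst, 0),
        )
        self.session_home[session_id] = best
        return best


router = Router(ongoing_tokens={"gpu0": 50_000, "gpu1": 30_000})
# Turn 1: gpu0 is busier but holds 30k tokens of the shared system prompt.
first = router.route("s1", {"gpu0": 30_000})
# Turn 2: gpu0 is even busier, yet the session stays put.
router.ongoing_tokens["gpu0"] = 90_000
second = router.route("s1", {})
print(first, second)
```

The turn-1 decision trades load against shared-prefix hits, while later turns skip scoring entirely, which is what preserves intra-session reuse.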

3.2 GPU Balance → Latency Improvement (H4 evidence)

When GPU imbalance was reduced from 4.0x to 2.0x (via H4 cache-gate routing):

  • HEAVY_COLO TTFT: 7.02s → 6.28s (-10.5%)
  • No TPOT regression

Mechanism: Hot GPU (63.3% util) causes queuing delays for co-located requests. Spreading load more evenly eliminates the queuing bottleneck.

Limitation: Only demonstrated for the 52/60 HEAVY requests that stayed co-located. The 8 offloaded requests had RDMA overhead. A routing-only approach to achieve balance (without RDMA) would be ideal.

4. System-Level Insights

4.1 Prefill-Decode Interference Threshold

| Concurrency | TPOT p90 (s) | MEDIUM TPOT p90 (s) | GPU Util |
|---|---|---|---|
| 8 sessions (1/GPU) | 0.073 | 0.075 | 30% |
| 16 sessions (2/GPU) | 0.106 (+45%) | 0.197 (+163%) | 25% |

At >1 session per GPU, chunked prefill interference becomes significant. MEDIUM requests' TPOT p90 degrades about 2.6x (0.075s to 0.197s) because HEAVY prefill chunks block their decode steps.

4.2 Mooncake Transfer Engine Limitations

  • No layerwise transfer: All KV blocks transferred after full prefill → pure sequential overhead
  • High variance: R²=0.095 (transfer time uncorrelated with data size), bimodal distribution (0.6s or 18-30s)
  • Zero idle overhead: kv_both mode has no cost when not transferring (Phase 0 validated)
  • Non-zero overhead at high concurrency: +16% TPOT at 16 sessions (background threads)

4.3 KV Cache Reuse Structure

Total trace: 2.1M requests, 71% theoretical APC
Reuse breakdown:
  91% intra-session (same session, subsequent turns)
   4.8% cross-session (shared system prompts)
   4.2% unique (no reuse)

Effective APC achieved:
  Round-robin: 20.8% (destroys session locality)
  Cache-aware: 44.7% (preserves session locality)
  Theoretical max: 71% (infinite cache, single instance)
  Gap: 26pp from eviction + routing imperfection

4.4 Request Weight After Cache

A critical insight: the "weight" of a request for scheduling should be new_tokens = input - cached, not total_input.

| Request | Total Input | After 70% Cache | Effective Weight |
|---|---|---|---|
| 80k HEAVY | 80k tokens | 24k tokens | MEDIUM |
| 30k MEDIUM | 30k tokens | 9k tokens | WARM |
| 5k WARM | 5k tokens | 5k tokens | WARM |

This changes the scheduling picture: most "HEAVY" requests in agentic workloads are effectively MEDIUM after cache — PD separation's premise (heavy prefill needs dedicated resources) doesn't apply.
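The reclassification can be sketched as follows. The class boundaries (`WARM_MAX`, `MEDIUM_MAX`) are hypothetical values chosen only to be consistent with the examples in this section; the thresholds used in the actual experiments are not stated here.

```python
# Hypothetical class boundaries in tokens (assumptions, not from this report).
WARM_MAX = 10_000
MEDIUM_MAX = 40_000


def effective_weight(total_input: int, cached_tokens: int) -> int:
    """new_tokens = input - cached: the tokens that actually need prefill."""
    return max(total_input - cached_tokens, 0)


def weight_class(tokens: int) -> str:
    """Bucket a token count into WARM / MEDIUM / HEAVY."""
    if tokens < WARM_MAX:
        return "WARM"
    if tokens < MEDIUM_MAX:
        return "MEDIUM"
    return "HEAVY"


# An 80k-token request with 70% of its prefix already cached.
total = 80_000
cached = total * 70 // 100
new_tokens = effective_weight(total, cached)
print(weight_class(total), "->", weight_class(new_tokens), f"({new_tokens} new tokens)")
```

Scheduling on `new_tokens` instead of `total` moves this request from the HEAVY bucket to MEDIUM, so it no longer needs the special handling that size-based classification would give it.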

5. Paper-Ready Summary

Agentic Workload Characteristics (vs prior LLM serving work):

  1. Extreme I/O ratio (75x) → 98% of compute is prefill
  2. High intra-session KV reuse (91%) → session affinity is critical
  3. Bimodal request weight (51% HEAVY by input, but only ~15% HEAVY by new_tokens after cache)
  4. Multi-turn session structure → routing decisions have long-term consequences (session migration destroys cache)

Why existing approaches don't work:

  1. PD-Sep assumes decode needs dedicated resources → agentic has memory wall on decode
  2. LMetric matches linear within noise because cache-hit appears in P_tokens itself, so it already routes later turns back to the cached instance via implicit soft affinity — explicit affinity buys little
  3. Elastic RDMA assumes KV transfer is cheap → Mooncake lacks layerwise pipelining
  4. Size-based classification assumes HEAVY = needs special handling → after cache, most HEAVY is MEDIUM

Our insights:

  1. Cache-aware session-sticky routing is the dominant optimization: -60% TTFT, +24pp APC
  2. Routing quality > PD separation > eviction policy (simulation verified: routing gives 24pp, eviction gives 1.8pp)
  3. Effective request weight (new_tokens, not total_input) should drive scheduling
  4. Prefill-decode interference only matters at >1 session/GPU (rarely reached in production clusters)
  5. GPU balance improvement directly improves HEAVY TTFT (-10.5% demonstrated)

Quantitative improvements (this work vs vLLM default):

  • TTFT p50: -60% (0.731s vs 1.836s) from cache-aware session-sticky routing
  • APC: +24pp (44.7% vs 20.8%)
  • TPOT p90: -15% (0.073s vs 0.086s) from reduced prefill-decode interference via better cache utilization