GPU imbalance analysis + elastic PS verdict + corrected LMetric results

Key findings:
- Session-sticky imbalance is 8.6x at 200 req (small-sample artifact) but only 1.24x at 1000 req (moderate, TPOT unaffected)
- Elastic PS not justified: interference reduction 0% at 1/GPU, migration reduces imbalance 1.24x→1.18x at 1.5s/event cost
- Corrected LMetric (no affinity) matches Linear (sticky) on all metrics (<2%), proving soft affinity from cache-hit scoring works
- Updated §3.4 errata, added §8 GPU imbalance + elastic PS analysis

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:11:23 +08:00
parent 3594f7dce0
commit fefbd71ca9
1 changed files with 91 additions and 16 deletions
--- a/REPORT.md
+++ b/REPORT.md
@@ -201,14 +201,18 @@ Elastic's 3 extra errors are D-side KV pull failures: prefill succeeded on P, KV

 ### 3.4 Routing Policy: Linear vs LMetric (OSDI'26)

-LMetric (`score = P_tokens × BS`) vs linear (`score = ongoing_tokens - α·cache_hit`). Both fresh-restart, same trace.
+LMetric (`score = P_tokens × BS`, pure per-request, no session affinity) vs Linear (`score = ongoing_tokens - α·cache_hit`, session-sticky). Both fresh-restart, same trace.
+
+> **Errata (2026-05-23)**: Prior LMetric implementation incorrectly shared session-sticky logic with Linear. Fixed to pure per-request routing per OSDI'26 spec: `score = (pending_prefill + new_tokens) × num_requests`, no affinity, no overload override. Results below use corrected implementation.

 | Policy | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 | Delta E2E |
 |--------|----------|----------|----------|---------|-----------|
-| Linear | 1.086s | 9.432s | 0.077s | 5.423s | — |
-| LMetric | 1.099s | 9.392s | 0.073s | 5.205s | **-4.0%** |
+| Linear (session-sticky) | 1.073s | 9.347s | 0.073s | 5.119s | — |
+| LMetric (no affinity) | 1.081s | 9.408s | 0.072s | 5.102s | **-0.3%** |

-LMetric provides modest improvement through better load balancing. Routing policy headroom is limited for this workload.
+**Key finding**: LMetric without explicit session affinity matches Linear with session affinity on every metric in the table (< 2% difference). The cache-hit term in LMetric's scoring (`new_tokens = input - cache_hit`) creates **implicit soft affinity**: an instance that has already cached a session's prefix gets a lower P_tokens for that session's next turn, and therefore a lower score, so it naturally attracts subsequent turns. Explicit session-sticky routing is not required for this workload; cache-aware load balancing captures the same effect.
+
+APC distribution (LMetric, no affinity): inst_0=60.6%, inst_1=58.3%, inst_2=43.2%, inst_3=28.9%, inst_4=16.6%, inst_5=24.0%, inst_6=13.9%, inst_7=0.0%. The distribution is non-uniform, but the aggregate APC is comparable to that of Linear with explicit affinity.
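
The soft-affinity effect described in §3.4 can be illustrated with a minimal sketch of the two scoring rules. This is not the repository's router: the `Instance` fields, the `alpha` default, and the token counts are hypothetical, and the source does not say whether `num_requests` counts the incoming request, so the LMetric formula is applied literally.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # All fields are hypothetical router-side state, not names from the repo.
    name: str
    pending_prefill: int  # prefill tokens already queued on this instance
    num_requests: int     # requests in flight (the "BS" term)
    ongoing_tokens: int   # in-flight tokens, used by the Linear score
    cached_prefix: int    # tokens of the incoming request's prefix cached here


def lmetric_score(inst: Instance, input_tokens: int) -> int:
    # score = (pending_prefill + new_tokens) * num_requests, applied literally.
    new_tokens = max(input_tokens - inst.cached_prefix, 0)
    return (inst.pending_prefill + new_tokens) * inst.num_requests


def linear_score(inst: Instance, alpha: float = 1.0) -> float:
    # score = ongoing_tokens - alpha * cache_hit; alpha=1.0 is a placeholder.
    return inst.ongoing_tokens - alpha * inst.cached_prefix


def route_lmetric(instances: list[Instance], input_tokens: int) -> Instance:
    # Pure per-request routing: lowest score wins, no session table consulted.
    return min(instances, key=lambda i: lmetric_score(i, input_tokens))


# A later turn of a session with a 32k-token input; "a" holds 30k of its prefix.
warm = Instance("a", pending_prefill=4_000, num_requests=2,
                ongoing_tokens=20_000, cached_prefix=30_000)
cold = Instance("b", pending_prefill=4_000, num_requests=2,
                ongoing_tokens=20_000, cached_prefix=0)
chosen_equal_load = route_lmetric([warm, cold], 32_000).name

# Same request, but "a" is now heavily loaded: the affinity is soft, so it yields.
busy = Instance("a", pending_prefill=40_000, num_requests=4,
                ongoing_tokens=90_000, cached_prefix=30_000)
chosen_when_busy = route_lmetric([busy, cold], 32_000).name
```

With equal load, the instance holding the cached prefix wins because its `new_tokens` is small. Once that instance is loaded heavily enough, the request moves elsewhere, which is the difference between soft affinity and hard session stickiness.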

 ### 3.5 Errata: Why Prior Cross-Machine A/B Was Invalid

@@ -355,27 +359,98 @@ agentic-kv/
 | `scripts/gpu_monitor.sh` | Sample nvidia-smi to CSV | Pipe to `outputs/<tag>/gpu_util.csv` |
 | `scripts/launch_elastic_p2p.sh` | Launch all 8 kv_both instances + offload proxy | `HEAVY_THRESHOLD`, `MAX_OFFLOAD` env vars |

-## 8. Conclusions & Next Steps
+## 8. GPU Load Imbalance & Elastic Prefill Service Analysis
+
+### 8.1 Load Imbalance Characterization
+
+Session-sticky routing creates token load imbalance across instances. The severity depends on scale:
+
+| Scale | Imbalance | Top 5 sessions | Cause |
+|-------|-----------|----------------|-------|
+| 200 req (143 sessions) | **8.6×** tokens | 49% of all tokens | Small sample, few dominant sessions |
+| 1000 req (668 sessions) | **1.24×** tokens | 29% of all tokens | More sessions → natural averaging |
+
+At 1000 requests, the heaviest instance handles 4.5M tokens vs 3.6M on the lightest (1.24×). Despite this, TPOT is uniform across all instances (0.070–0.077s), which indicates that prefill-decode interference is minimal at ≤1 session/GPU. The imbalance shows up in **TTFT only**: the 2 heaviest instances have TTFT p50 = 1.42s vs 0.83s on the 2 lightest (1.7× gap).
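
As a quick check of the ratios quoted in §8.1, the sketch below computes imbalance as heaviest over lightest. The helper name is hypothetical, and only the two rounded endpoints from the text are used, so the token ratio lands near the reported 1.24× without matching it exactly.

```python
def imbalance_ratio(values):
    # Imbalance as used in the text: heaviest instance over lightest instance.
    return max(values) / min(values)


# Rounded per-instance token totals quoted for the 1000-request run.
token_imbalance = imbalance_ratio([4.5e6, 3.6e6])

# TTFT p50 of the 2 heaviest vs the 2 lightest instances, in seconds.
ttft_gap = imbalance_ratio([1.42, 0.83])

print(f"token imbalance ~{token_imbalance:.2f}x, TTFT gap ~{ttft_gap:.1f}x")
```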
+
+### 8.2 Session Accumulation Pattern
+
+Agentic workloads produce long-lived sessions with growing context:
+
+| Session | Turns | Total Tokens | Context Growth |
+|---------|-------|-------------|----------------|
+| 1569319 | 36 | 2.32M | 27k → 98k (+2.0k/turn) |
+| 1206593 | 36 | 2.31M | 15k → 106k (+2.6k/turn) |
+| 178176 | 25 | 1.93M | 36k → 95k (+2.5k/turn) |
+
+The top 5 sessions account for 29% of all tokens in the 1000-request trace. With session-sticky routing, these sessions stay pinned to their instances and create persistent load hotspots.
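
The per-turn growth figures in the table can be reproduced from the first and last context sizes. The sketch assumes growth is averaged over the `turns - 1` intervals between turns, which is an inference from the numbers and not something the report states.

```python
def growth_per_turn(first_ctx_k, last_ctx_k, turns):
    # Average context growth (in k tokens) across the intervals between turns.
    return (last_ctx_k - first_ctx_k) / (turns - 1)


# (first context in k tokens, last context in k tokens, turns), from the table.
sessions = {
    "1569319": (27, 98, 36),
    "1206593": (15, 106, 36),
    "178176": (36, 95, 25),
}

growth = {sid: round(growth_per_turn(*row), 1) for sid, row in sessions.items()}
```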
+
+### 8.3 Can Elastic Prefill Service Help?
+
+**Capability 1: Reduce prefill-decode interference (lower TPOT)**
+
+| Concurrency | TPOT p90 | Interference |
+|-------------|----------|--------------|
+| 8 sessions (1/GPU) | 0.073s | Minimal, TPOT uniform across instances |
+| 16 sessions (2/GPU) | 0.106s (+45%) | Significant, MEDIUM TPOT +163% |
+
+Verdict: **No benefit at ≤1 session/GPU** (current setup). Potential benefit at higher concurrency where chunked prefill interrupts decode steps.
+
+**Capability 2: Session migration for load balancing**
+
+Elastic PS enables mid-session migration: prefill on original instance (cache hit), KV transfer to new instance for decode + future turns.
+
+Simulation of migration strategies (1000 req):
+
+| Strategy | Imbalance | Migrations | KV Transfer Overhead |
+|----------|-----------|------------|---------------------|
+| No migration | 1.24× | 0 | 0s |
+| Every 10 turns | 1.21× | 10 | 15s |
+| Every 5 turns | 1.20× | 20 | 30s |
+| Every 1 turn | 1.18× | 38 | 57s |
+
+Even aggressive every-turn migration reduces imbalance only from 1.24× to 1.18×: diminishing returns, at ~1.5s of KV transfer overhead per migration event.
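
The diminishing returns can be made explicit by dividing the imbalance reduction by the transfer overhead for each strategy. This is a back-of-the-envelope sketch over the table's numbers; the "gain per second" metric is an illustrative choice, not one used in the report.

```python
SECONDS_PER_MIGRATION = 1.5  # approximate KV transfer cost per migration event
BASELINE_IMBALANCE = 1.24    # no-migration imbalance at 1000 requests

# (strategy, resulting imbalance, number of migrations), from the table.
strategies = [
    ("No migration", 1.24, 0),
    ("Every 10 turns", 1.21, 10),
    ("Every 5 turns", 1.20, 20),
    ("Every 1 turn", 1.18, 38),
]

overheads = [n * SECONDS_PER_MIGRATION for _, _, n in strategies]

# Imbalance reduction bought per second of transfer overhead.
gain_per_second = [
    (BASELINE_IMBALANCE - imbalance) / (n * SECONDS_PER_MIGRATION)
    for _, imbalance, n in strategies
    if n > 0
]

for (name, imbalance, _), overhead in zip(strategies, overheads):
    print(f"{name:>15}: imbalance {imbalance:.2f}x, overhead {overhead:.0f}s")
```

Each step toward more frequent migration buys less imbalance reduction per second of transfer, which is the pattern behind the "Marginal" verdict.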
+
+**Capability 3 (unexpected): Soft affinity replaces hard stickiness**
+
+The corrected LMetric experiment (§3.4) reveals a key insight: **explicit session affinity is unnecessary**. LMetric's cache-hit estimation (`new_tokens = input − cached`) creates implicit soft affinity: instances with cached prefixes score lower and naturally attract subsequent turns. This yields aggregate APC comparable to explicit session-sticky routing and latency within 2% of it, while load balancing comes from the same score rather than from a separate mechanism.
+
+### 8.4 Elastic PS Verdict
+
+| Aspect | Benefit | Cost | Net |
+|--------|---------|------|-----|
+| TPOT reduction | 0% at 1/GPU | Mooncake overhead | **Negative** |
+| Session migration | 1.24× → 1.18× imbalance | 1.5s/migration KV transfer | **Marginal** |
+| Load balance | Already achieved by soft affinity (LMetric) | N/A | **Not needed** |
+
+Elastic PS is not justified for single-machine agentic workloads at moderate concurrency. The dominant optimization (cache-aware routing, -60% TTFT) works via soft affinity without any KV transfer. The remaining load imbalance (1.24×) is too small for migration to pay back its KV transfer cost (~1.5s per migration).
+
+**When elastic PS could become justified:**
+- Multi-machine deployment (no shared GPU memory competition)
+- Higher concurrency (>1 session/GPU sustained) where prefill-decode interference is measurable
+- Cheaper KV transfer (layerwise pipelining, not available in Mooncake 0.3.10)
+
+## 9. Conclusions & Next Steps

 ### Established findings:
 1. Full PD separation is **net negative** for single-machine agentic workloads (KV cache memory wall)
-2. Cache-aware session-sticky routing is the **dominant optimization** (+24pp APC, -60% TTFT vs round-robin)
-3. **Elastic P2P offload does NOT improve single-machine performance** — Mooncake kv_both memory overhead (+11% TPOT, +37% E2E) outweighs prefill isolation benefit under moderate load (200 req)
-4. LMetric (OSDI'26) provides modest **E2E -4%** over linear routing; routing policy headroom is limited
-5. **Experimental methodology matters**: warm vs fresh instances cause 2× TTFT difference; all comparisons must use verified fresh restart
+2. Cache-aware routing is the **dominant optimization** (+24pp APC, -60% TTFT vs round-robin)
+3. **Explicit session affinity is unnecessary** — cache-hit scoring in LMetric creates implicit soft affinity that matches hard session-sticky on all metrics (< 2% difference)
+4. **Elastic P2P offload does NOT improve single-machine performance** — Mooncake overhead outweighs both interference reduction and load-balancing benefits
+5. **GPU load imbalance** from session accumulation is moderate at scale (1.24× at 1000 req) and does not affect TPOT; it affects only TTFT on heavy instances (1.7× gap)
+6. **Experimental methodology matters**: warm vs fresh instances cause 2× TTFT difference

 ### Lessons learned:
- Prior cross-machine A/B (commit `1e86285`) was invalid — warm baseline inflated by 2× due to residual KV cache state
+- Prior cross-machine A/B (commit `1e86285`) was invalid — warm baseline inflated by 2×
+- Prior LMetric implementation was invalid — incorrectly shared session-sticky logic with Linear
 - `kv_role=kv_both` has non-trivial always-on overhead even when P2P transfer is not used
 - Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility
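
The "verify GPU free" step of the isolation procedure could be automated along these lines. This is a sketch, not the repo's harness: the function names and the 500 MiB threshold are assumptions, and only the `nvidia-smi` query flags are real.

```python
import subprocess


def read_gpu_memory() -> str:
    # Real nvidia-smi query flags; requires an NVIDIA driver on the host.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def busy_gpus(csv_text: str, max_used_mib: int = 500) -> list[int]:
    # Return indices of GPUs whose used memory exceeds the (assumed) threshold.
    busy = []
    for line in csv_text.strip().splitlines():
        index, used_mib = (field.strip() for field in line.split(","))
        if int(used_mib) > max_used_mib:
            busy.append(int(index))
    return busy


def assert_fresh(csv_text: str) -> None:
    # Fail loudly if any GPU still holds memory from a previous run.
    leftover = busy_gpus(csv_text)
    if leftover:
        raise RuntimeError(f"GPUs still hold memory: {leftover}")
```

A benchmark script would call `assert_fresh(read_gpu_memory())` after killing all instances and before launching new ones, so a warm instance aborts the run instead of skewing the baseline.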

 ### Open problems:
-1. Elastic P2P may help under **sustained high load** (KV cache pressure makes co-located interference worse) — needs 1000-req experiment
-2. Mooncake kv_both memory overhead quantification and potential lazy initialization
-3. Multi-machine elastic (P on different node, no memory competition)
-4. Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)
-5. `scripts/bench.sh` standardized harness to prevent future warm-instance mistakes
+1. **Higher concurrency regime**: At 2+ sessions/GPU, prefill-decode interference becomes significant (+45% TPOT). Does elastic PS help there?
+2. **Multi-machine elastic**: P on a different node eliminates GPU memory competition — the main cost that makes single-machine elastic net negative
+3. **Router state accuracy**: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)
+4. **Layerwise KV transfer**: Mooncake's block-level transfer after full prefill is the bottleneck. Layerwise pipelining could reduce transfer latency by overlapping with computation

 ---

-*Updated 2026-05-22. Prior elastic A/B results (commit `1e86285`) invalidated — see §3.5 errata.*
+*Updated 2026-05-23. LMetric corrected (§3.4 errata). GPU imbalance + elastic PS analysis added (§8).*