GPU imbalance analysis + elastic PS verdict + corrected LMetric results
Key findings:
- Session-sticky imbalance is 8.6x at 200 req (small-sample artifact) but only 1.24x at 1000 req (moderate, TPOT unaffected)
- Elastic PS not justified: interference reduction 0% at 1/GPU, migration reduces imbalance 1.24x→1.18x at 1.5s/event cost
- Corrected LMetric (no affinity) matches Linear (sticky) on all metrics (<2%), proving soft affinity from cache-hit scoring works
- Updated §3.4 errata, added §8 GPU imbalance + elastic PS analysis

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
107
REPORT.md
@@ -201,14 +201,18 @@ Elastic's 3 extra errors are D-side KV pull failures: prefill succeeded on P, KV
### 3.4 Routing Policy: Linear vs LMetric (OSDI'26)

LMetric (`score = P_tokens × BS`) vs linear (`score = ongoing_tokens - α·cache_hit`). Both fresh-restart, same trace.
LMetric (`score = P_tokens × BS`, pure per-request, no session affinity) vs Linear (`score = ongoing_tokens - α·cache_hit`, session-sticky). Both fresh-restart, same trace.

> **Errata (2026-05-23)**: The prior LMetric implementation incorrectly shared session-sticky logic with Linear. Fixed to pure per-request routing per the OSDI'26 spec: `score = (pending_prefill + new_tokens) × num_requests`, no affinity, no overload override. Results below use the corrected implementation.

| Policy | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 | Delta E2E |
|--------|----------|----------|----------|---------|-----------|
| Linear | 1.086s | 9.432s | 0.077s | 5.423s | baseline |
| LMetric | 1.099s | 9.392s | 0.073s | 5.205s | **-4.0%** |
| Linear (session-sticky) | 1.073s | 9.347s | 0.073s | 5.119s | baseline |
| LMetric (no affinity) | 1.081s | 9.408s | 0.072s | 5.102s | **-0.3%** |

LMetric provides modest improvement through better load balancing. Routing policy headroom is limited for this workload.
**Key finding**: LMetric without explicit session affinity matches Linear with session affinity on all metrics (< 2% difference). The cache-hit term in LMetric's scoring (`new_tokens = input - cache_hit`) creates **implicit soft affinity**: instances that have already cached a session's prefix get lower P_tokens, naturally attracting subsequent turns. Explicit session-sticky routing is not required; cache-aware load balancing captures it automatically.

APC distribution (LMetric, no affinity): inst_0=60.6%, inst_1=58.3%, inst_2=43.2%, inst_3=28.9%, inst_4=16.6%, inst_5=24.0%, inst_6=13.9%, inst_7=0.0%. The distribution is non-uniform, but the aggregate is comparable to that of Linear's explicit affinity.

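The "< 2% difference" claim can be checked directly against the corrected rows of the table above. The sketch below recomputes the relative difference of LMetric (no affinity) against Linear (session-sticky) for each metric. The dictionary layout and the `rel_diff` helper are illustrative names, not part of the benchmark harness.

```python
# Values copied from the corrected rows of the table above (seconds).
linear_sticky = {"ttft_p50": 1.073, "ttft_p90": 9.347, "tpot_p90": 0.073, "e2e_p50": 5.119}
lmetric_no_affinity = {"ttft_p50": 1.081, "ttft_p90": 9.408, "tpot_p90": 0.072, "e2e_p50": 5.102}


def rel_diff(new: float, base: float) -> float:
    """Signed relative difference of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


diffs = {m: rel_diff(lmetric_no_affinity[m], linear_sticky[m]) for m in linear_sticky}
for metric, pct in diffs.items():
    print(f"{metric}: {pct:+.2f}%")

# Largest absolute gap across the four metrics.
max_abs_diff = max(abs(p) for p in diffs.values())
print(f"max |diff| = {max_abs_diff:.2f}%")
```

The largest gap is TPOT p90 at about 1.4%, and the E2E p50 difference rounds to the -0.3% shown in the table.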
### 3.5 Errata: Why Prior Cross-Machine A/B Was Invalid

@@ -355,27 +359,98 @@ agentic-kv/
| `scripts/gpu_monitor.sh` | Sample nvidia-smi to CSV | Pipe to `outputs/<tag>/gpu_util.csv` |
| `scripts/launch_elastic_p2p.sh` | Launch all 8 kv_both instances + offload proxy | `HEAVY_THRESHOLD`, `MAX_OFFLOAD` env vars |

## 8. Conclusions & Next Steps
## 8. GPU Load Imbalance & Elastic Prefill Service Analysis

### 8.1 Load Imbalance Characterization

Session-sticky routing creates token load imbalance across instances. The severity depends on scale:

| Scale | Imbalance | Top 5 sessions | Cause |
|-------|-----------|----------------|-------|
| 200 req (143 sessions) | **8.6×** tokens | 49% of all tokens | Small sample, few dominant sessions |
| 1000 req (668 sessions) | **1.24×** tokens | 29% of all tokens | More sessions → natural averaging |

At 1000 requests, the heaviest instance handles 4.5M tokens versus 3.6M for the lightest (1.24×). Despite this, TPOT is uniform across all instances (0.070–0.077s), indicating that prefill-decode interference is minimal at ≤1 session/GPU. The imbalance manifests in **TTFT only**: the two heaviest instances have TTFT p50 = 1.42s versus 0.83s for the two lightest (1.7× gap).

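As a sketch of how these figures can be read, assume the imbalance is the ratio of the heaviest to the lightest per-instance value, which is how the text pairs 4.5M vs 3.6M tokens with 1.24×. The `imbalance` helper is a hypothetical name, and only the rounded extremes quoted in the text are used as inputs.

```python
def imbalance(values):
    """Ratio of the largest to the smallest per-instance value."""
    return max(values) / min(values)


# Rounded figures quoted in the text: heaviest vs lightest instance.
token_ratio = imbalance([4.5e6, 3.6e6])
# TTFT p50 of the two heaviest vs the two lightest instances (seconds).
ttft_ratio = imbalance([1.42, 0.83])

print(f"token imbalance ~{token_ratio:.2f}x, TTFT gap ~{ttft_ratio:.2f}x")
```

With the rounded token counts the ratio comes out at 1.25×, so the reported 1.24× presumably reflects the unrounded counts. The TTFT ratio is about 1.71×, matching the 1.7× gap.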
### 8.2 Session Accumulation Pattern

Agentic workloads produce long-lived sessions with growing context:

| Session | Turns | Total Tokens | Context Growth |
|---------|-------|-------------|----------------|
| 1569319 | 36 | 2.32M | 27k → 98k (+2.0k/turn) |
| 1206593 | 36 | 2.31M | 15k → 106k (+2.6k/turn) |
| 178176 | 25 | 1.93M | 36k → 95k (+2.5k/turn) |

The top 5 sessions account for 29% of all tokens. With session-sticky routing, these sessions stay pinned to their instances, creating persistent load hotspots.

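The per-turn growth figures in the table can be reproduced from the first and last context sizes. The sketch below assumes growth is averaged over `turns - 1` steps, which is the reading that matches all three rows. The `growth_per_turn` helper is an illustrative name.

```python
# (session id, turns, first-turn context, last-turn context), sizes in thousands of tokens,
# copied from the table above.
sessions = [
    ("1569319", 36, 27, 98),
    ("1206593", 36, 15, 106),
    ("178176", 25, 36, 95),
]


def growth_per_turn(turns: int, start_k: float, end_k: float) -> float:
    """Average context growth per turn, assuming growth is spread over turns - 1 steps."""
    return (end_k - start_k) / (turns - 1)


rates = {sid: round(growth_per_turn(t, s, e), 1) for sid, t, s, e in sessions}
for sid, rate in rates.items():
    print(f"session {sid}: +{rate}k tokens/turn")
```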
### 8.3 Can Elastic Prefill Service Help?

**Capability 1: Reduce prefill-decode interference (lower TPOT)**

| Concurrency | TPOT p90 | Interference |
|-------------|----------|--------------|
| 8 sessions (1/GPU) | 0.073s | Minimal, TPOT uniform across instances |
| 16 sessions (2/GPU) | 0.106s (+45%) | Significant, MEDIUM TPOT +163% |

Verdict: **No benefit at ≤1 session/GPU** (current setup). Potential benefit at higher concurrency where chunked prefill interrupts decode steps.

**Capability 2: Session migration for load balancing**

Elastic PS enables mid-session migration: prefill runs on the original instance (cache hit), then the KV cache is transferred to a new instance for decode and future turns.

Simulation of migration strategies (1000 req):

| Strategy | Imbalance | Migrations | KV Transfer Overhead |
|----------|-----------|------------|---------------------|
| No migration | 1.24× | 0 | 0s |
| Every 10 turns | 1.21× | 10 | 15s |
| Every 5 turns | 1.20× | 20 | 30s |
| Every 1 turn | 1.18× | 38 | 57s |

Even aggressive every-turn migration only reduces imbalance from 1.24× to 1.18×, a diminishing return at ~1.5s of overhead per migration event.

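The overhead column of the migration table follows from the ~1.5s per-event cost. The sketch below recomputes it and sets it against the imbalance reduction of each strategy. The tuple layout and the `KV_TRANSFER_S` constant name are illustrative, and the imbalance ratios are copied from the table rather than re-simulated.

```python
KV_TRANSFER_S = 1.5  # approximate KV transfer cost per migration event, from the text

# (strategy, imbalance ratio, migrations) copied from the simulation table.
strategies = [
    ("No migration", 1.24, 0),
    ("Every 10 turns", 1.21, 10),
    ("Every 5 turns", 1.20, 20),
    ("Every 1 turn", 1.18, 38),
]

baseline = strategies[0][1]
rows = []
for name, ratio, migrations in strategies:
    overhead_s = migrations * KV_TRANSFER_S
    gain = round(baseline - ratio, 2)  # reduction in imbalance ratio vs no migration
    rows.append((name, overhead_s, gain))
    print(f"{name:15s} overhead={overhead_s:5.1f}s  imbalance reduction={gain:.2f}")
```

Going from 10 to 38 migrations raises the transfer time from 15s to 57s, while the imbalance ratio improves by only a further 0.03.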
**Capability 3 (unexpected): Soft affinity replaces hard stickiness**

The corrected LMetric experiment (§3.4) reveals a key insight: **explicit session affinity is unnecessary**. LMetric's cache-hit estimation (`new_tokens = input - cache_hit`) creates implicit soft affinity: instances with cached prefixes score lower, naturally attracting subsequent turns. This achieves comparable APC and latency (within 2%) to explicit session-sticky routing, while balancing load automatically.

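A minimal sketch of the soft-affinity effect, using the corrected formula from the §3.4 errata, `score = (pending_prefill + new_tokens) × num_requests` with `new_tokens = input - cache_hit`. The `Instance` class, the per-session `cached_prefix` map, and counting the incoming request in the batch size are assumptions made for illustration; the router's actual state tracking is not shown in this report.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    pending_prefill: int = 0   # tokens still waiting for prefill
    num_requests: int = 0      # requests currently running
    cached_prefix: dict = field(default_factory=dict)  # session id -> cached tokens


def lmetric_score(inst: Instance, session: str, input_tokens: int) -> int:
    cache_hit = min(inst.cached_prefix.get(session, 0), input_tokens)
    new_tokens = input_tokens - cache_hit
    # Assumption: the incoming request is counted in the batch size.
    return (inst.pending_prefill + new_tokens) * (inst.num_requests + 1)


def route(instances, session: str, input_tokens: int) -> Instance:
    """Pick the lowest-scoring instance; no session table is consulted."""
    return min(instances, key=lambda i: lmetric_score(i, session, input_tokens))


instances = [
    Instance("inst_0", pending_prefill=2_000, num_requests=1),
    Instance("inst_1", pending_prefill=2_000, num_requests=1,
             cached_prefix={"s1": 30_000}),
]

# Follow-up turn of session s1: the cached prefix lowers inst_1's score.
chosen = route(instances, "s1", 32_000)

# The affinity is soft: once inst_1 is heavily loaded, the same turn spills over.
instances[1].pending_prefill = 40_000
instances[1].num_requests = 3
spilled = route(instances, "s1", 32_000)

print(chosen.name, spilled.name)
```

In the first call both instances carry the same load, so the 30k-token cache hit alone decides the route. In the second call the load term dominates, which is the behaviour a hard session-sticky table would not give.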
### 8.4 Elastic PS Verdict

| Aspect | Benefit | Cost | Net |
|--------|---------|------|-----|
| TPOT reduction | 0% at 1/GPU | Mooncake overhead | **Negative** |
| Session migration | 1.24× → 1.18× imbalance | 1.5s/migration KV transfer | **Marginal** |
| Load balance | Already achieved by soft affinity (LMetric) | N/A | **Not needed** |

Elastic PS is not justified for single-machine agentic workloads at moderate concurrency. The dominant optimization (cache-aware routing, -60% TTFT) works via soft affinity without any KV transfer. The remaining load imbalance (1.24×) is too small for migration gains to outweigh KV transfer costs.

**When elastic PS could become justified:**
- Multi-machine deployment (no shared GPU memory competition)
- Higher concurrency (>1 session/GPU sustained) where prefill-decode interference is measurable
- Cheaper KV transfer (layerwise pipelining, not available in Mooncake 0.3.10)

## 9. Conclusions & Next Steps

### Established findings:
1. Full PD separation is **net negative** for single-machine agentic workloads (KV cache memory wall)
2. Cache-aware session-sticky routing is the **dominant optimization** (+24pp APC, -60% TTFT vs round-robin)
3. **Elastic P2P offload does NOT improve single-machine performance**: Mooncake kv_both memory overhead (+11% TPOT, +37% E2E) outweighs the prefill isolation benefit under moderate load (200 req)
4. LMetric (OSDI'26) provides modest **E2E -4%** over linear routing; routing policy headroom is limited
5. **Experimental methodology matters**: warm vs fresh instances cause 2× TTFT difference; all comparisons must use verified fresh restart
2. Cache-aware routing is the **dominant optimization** (+24pp APC, -60% TTFT vs round-robin)
3. **Explicit session affinity is unnecessary**: cache-hit scoring in LMetric creates implicit soft affinity that matches hard session-sticky on all metrics (< 2% difference)
4. **Elastic P2P offload does NOT improve single-machine performance**: Mooncake overhead outweighs both interference reduction and load-balancing benefits
5. **GPU load imbalance** from session accumulation is moderate at scale (1.24× at 1000 req) and does not affect TPOT; it affects only TTFT on heavy instances (1.7× gap)
6. **Experimental methodology matters**: warm vs fresh instances cause 2× TTFT difference

### Lessons learned:
- Prior cross-machine A/B (commit `1e86285`) was invalid: the warm baseline was inflated by 2× due to residual KV cache state
- Prior cross-machine A/B (commit `1e86285`) was invalid: the warm baseline was inflated by 2×
- Prior LMetric implementation was invalid: it incorrectly shared session-sticky logic with Linear
- `kv_role=kv_both` has non-trivial always-on overhead even when P2P transfer is not used
- Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility

### Open problems:
1. Elastic P2P may help under **sustained high load** (KV cache pressure makes co-located interference worse); needs a 1000-req experiment
2. Mooncake kv_both memory overhead quantification and potential lazy initialization
3. Multi-machine elastic (P on different node, no memory competition)
4. Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)
5. `scripts/bench.sh` standardized harness to prevent future warm-instance mistakes
1. **Higher concurrency regime**: At 2+ sessions/GPU, prefill-decode interference becomes significant (+45% TPOT). Does elastic PS help there?
2. **Multi-machine elastic**: P on a different node eliminates GPU memory competition, the main cost that makes single-machine elastic net negative
3. **Router state accuracy**: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)
4. **Layerwise KV transfer**: Mooncake's block-level transfer after full prefill is the bottleneck. Layerwise pipelining could reduce transfer latency by overlapping with computation

---

*Updated 2026-05-22. Prior elastic A/B results (commit `1e86285`) invalidated; see §3.5 errata.*
*Updated 2026-05-23. LMetric corrected (§3.4 errata). GPU imbalance + elastic PS analysis added (§8).*
