Milestone Report: Elastic P2P vs PD-Combined Baseline

Date: 2026-05-22
Author: Gahow Wang
Status: Phase 1 complete (baseline + elastic validated, system-level analysis done)

1. Research Question

For agentic LLM workloads (long input, short output, high KV cache reuse), is prefill-decode disaggregation beneficial? If full PD separation hurts (proven in §3), can selective disaggregation of only heavy requests improve serving latency while preserving KV cache locality?

2. Experimental Setup

2.1 Hardware

Resource	Spec
Machine	dash0 / dash1 (identical config)
GPU	8× NVIDIA H20 96GB HBM, NVLink
Network	4× ConnectX-7 200Gbps RDMA
Storage	cpfs shared storage across machines

2.2 Software

Component	Version	Notes
vLLM	0.18.1 (source in `third_party/vllm/`)	Patched scheduler assert (see `patches/`)
Mooncake	0.3.10	RDMA-based KV transfer between instances
Python	3.x managed by `uv`	`.venv/` at project root
Model	`Qwen3-Coder-30B-A3B-Instruct`	MoE 128 experts top-8, 3B active params
Model path	`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`	Same on dash0 and dash1

2.3 Workload Trace

Property	Value
Source	GLM-5.1 Agentic Coder, production cluster, 2h window
Raw trace	`~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` on dash0
Total requests	2,114,220
Avg input tokens	33,600 (p50=20k, p90=88k)
Avg output tokens	445 (p50=80)
I/O ratio	75.6× aggregate
Prefill token share	98%
KV reuse (intra-session)	91% of reusable blocks
Theoretical max APC	71% (infinite cache, single instance)

Sampled trace for benchmarks: traces/sampled_1000req_seed42.jsonl (1000 requests, seed=42, preserving session structure). For 200-request ablations: replayer --request-limit 200.
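
To illustrate what "preserving session structure" can mean in practice, here is a minimal sketch of session-level sampling. It is not the logic of `scripts/sample_trace.py`; the `session_id` field name and the stop rule are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_sessions(records, target_requests, seed=42):
    """Pick whole sessions at random until about target_requests are collected.

    `records` is a list of dicts; "session_id" is an assumed field name.
    """
    by_session = defaultdict(list)
    for rec in records:
        by_session[rec["session_id"]].append(rec)

    session_ids = sorted(by_session)            # fixed order before shuffling
    random.Random(seed).shuffle(session_ids)    # seeded, so runs are repeatable

    sampled = []
    for sid in session_ids:
        if len(sampled) >= target_requests:
            break
        sampled.extend(by_session[sid])         # keep every turn of the session
    return sampled
```

Because sessions are never split, the result can overshoot the target by up to one session's worth of requests.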

2.4 Two Configurations Compared

Baseline: PD-Combined (8× TP=1 DP=8)

8 independent vLLM instances, 1 GPU each, no Mooncake.
All instances do both prefill and decode.
Global scheduler (cache_aware_proxy.py --combined) handles:
  - Session-sticky routing (multi-turn → same instance)
  - Load-aware override (if pinned instance > 2× avg load, redirect)
  - Cache-hit scoring (prefer instance with matching prefix blocks)

Launch:

# On dash0:
for i in $(seq 0 7); do
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
    vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --port $((8000+i)) --tp 1 \
        --enable-prefix-caching --enforce-eager \
        --gpu-memory-utilization 0.9 --max-model-len 200000 \
        > /tmp/ab_base_$i.log 2>&1 &
done

python scripts/cache_aware_proxy.py \
    --combined http://127.0.0.1:800{0..7} --port 9090

Elastic P2P Offload (8× TP=1 kv_both + selective offload)

8 independent vLLM instances, 1 GPU each, all kv_role=kv_both (Mooncake).
Same global scheduler, plus elastic offload logic:
  - Proxy classifies each request: WARM (<5k new), MEDIUM (5-20k), HEAVY (>20k)
  - WARM/MEDIUM: co-located on session-sticky instance (no KV transfer)
  - HEAVY: prefill on a different instance (P), KV via Mooncake RDMA,
    decode on session-sticky instance (D)
  - Cap: max 4 concurrent offloads (MAX_OFFLOAD_INFLIGHT)
  - P instance selection: round-robin with overload skip
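
The classification and offload rules above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in `cache_aware_proxy.py`: the class name `OffloadPlanner`, its methods, and the way overloaded instances are passed in are all hypothetical.

```python
from dataclasses import dataclass

MEDIUM_THRESHOLD = 5_000      # WARM/MEDIUM boundary (new tokens)
HEAVY_THRESHOLD = 20_000      # matches --heavy-threshold 20000
MAX_OFFLOAD_INFLIGHT = 4      # cap on concurrent offloads


def classify(new_tokens: int) -> str:
    """Bucket a request by the number of uncached (new) prompt tokens."""
    if new_tokens > HEAVY_THRESHOLD:
        return "HEAVY"
    if new_tokens >= MEDIUM_THRESHOLD:
        return "MEDIUM"
    return "WARM"


@dataclass
class OffloadPlanner:
    num_instances: int
    inflight_offloads: int = 0
    _next: int = 0            # round-robin cursor

    def pick_prefill_instance(self, decode_idx, overloaded):
        """Round-robin over instances, skipping D itself and overloaded ones."""
        for _ in range(self.num_instances):
            cand = self._next
            self._next = (self._next + 1) % self.num_instances
            if cand != decode_idx and cand not in overloaded:
                return cand
        return None

    def plan(self, new_tokens, decode_idx, overloaded=frozenset()):
        """Return (prefill_idx, decode_idx); equal indices mean co-located."""
        if (classify(new_tokens) != "HEAVY"
                or self.inflight_offloads >= MAX_OFFLOAD_INFLIGHT):
            return decode_idx, decode_idx
        p = self.pick_prefill_instance(decode_idx, overloaded)
        if p is None:
            return decode_idx, decode_idx
        self.inflight_offloads += 1   # caller decrements when the offload ends
        return p, decode_idx
```

In this sketch a request that hits the offload cap simply runs co-located on its session-sticky instance, which mirrors the "cap" bullet above.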

Launch:

# On dash1 (or use scripts/launch_elastic_p2p.sh):
for i in $(seq 0 7); do
    VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998+i)) \
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
    vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --port $((8000+i)) --tp 1 \
        --enable-prefix-caching --enforce-eager \
        --gpu-memory-utilization 0.9 --max-model-len 200000 \
        --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
        > /tmp/ab_elastic_$i.log 2>&1 &
    sleep 2  # stagger to avoid NCCL port collision
done

# Wait for bootstrap servers
for bp in $(seq 8998 9005); do
    until curl -s localhost:$bp/query > /dev/null 2>&1; do sleep 2; done
done

python scripts/cache_aware_proxy.py \
    --combined http://127.0.0.1:800{0..7} \
    --bootstrap-ports 8998,8999,9000,9001,9002,9003,9004,9005 \
    --offload --heavy-threshold 20000 --port 9090

2.5 Benchmark Parameters

Parameter	Value
Requests	200 (from sampled 1000-req trace, `--request-limit 200`)
Time scale	20× (compress 2h trace into ~6min)
Max inflight sessions	8
Request timeout	600s
vLLM flags	`--enforce-eager --enable-prefix-caching --max-model-len 200000`
GPU memory util	0.9
Fresh restart	Both configs started from cold (no warm cache)
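
How the time scale and the inflight cap interact can be sketched as below. This is a simplified stand-in for the replayer, assuming one send per session and a hypothetical async `send` callable; the real replayer's scheduling may differ.

```python
import asyncio


def scaled_offsets(arrival_times_s, time_scale):
    """Compress arrival times: a 7200 s trace at 20x replays in 360 s."""
    if not arrival_times_s:
        return []
    t0 = min(arrival_times_s)
    return [(t - t0) / time_scale for t in arrival_times_s]


async def replay(sessions, send, time_scale=20.0, max_inflight_sessions=8):
    """Replay (arrival_time_s, session) pairs with a cap on concurrent sessions.

    `send` is a hypothetical async callable that runs one session.
    """
    gate = asyncio.Semaphore(max_inflight_sessions)
    offsets = scaled_offsets([t for t, _ in sessions], time_scale)

    async def run_one(delay, session):
        await asyncio.sleep(delay)      # wait for the compressed arrival time
        async with gate:                # block while the cap is reached
            return await send(session)

    tasks = [run_one(d, s) for d, (_, s) in zip(offsets, sessions)]
    return await asyncio.gather(*tasks)
```

With 8 instances behind the proxy, a cap of 8 sessions works out to roughly one active session per GPU.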

2.6 Reproducing the Benchmark

# Activate environment
cd ~/agentic-kv && source .venv/bin/activate

# Name this run (placeholder values; pick your own tag)
TAG=my_run
LOG_PREFIX=ab_base   # or ab_elastic, matching the vLLM log names above
mkdir -p "outputs/$TAG"

# Ensure sampled trace exists
python scripts/sample_trace.py \
    --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/sampled_1000req_seed42.jsonl \
    --target-requests 1000 --seed 42

# Start GPU monitoring in the background
bash scripts/gpu_monitor.sh > "outputs/$TAG/gpu_util.csv" &
MONITOR_PID=$!

# Run replayer against proxy
python -m replayer \
    --trace traces/sampled_1000req_seed42.jsonl \
    --output "outputs/$TAG/metrics.jsonl" \
    --endpoint http://localhost:9090 \
    --time-scale 20 --max-inflight-sessions 8 \
    --request-limit 200 -v

# Stop GPU monitoring once the replay has finished
kill "$MONITOR_PID"

# Collect proxy breakdown (elastic only)
curl -s http://localhost:9090/breakdown > "outputs/$TAG/breakdown.json"

# Collect APC from vLLM logs
for i in $(seq 0 7); do
    grep "Prefix cache hit rate\|External prefix cache hit rate" "/tmp/${LOG_PREFIX}_$i.log" | tail -2
done
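
For post-processing, the same APC extraction can be done in Python. The sketch below assumes the log lines contain fragments of the form `Prefix cache hit rate: 42.1%` and `External prefix cache hit rate: 3.0%`; that exact format is an assumption and should be checked against the logs of the vLLM build in use.

```python
import re

# Assumed log fragment format; verify against the actual vLLM log output.
_RATE = re.compile(r"(External prefix|Prefix) cache hit rate: ([0-9.]+)%")


def last_hit_rates(log_text):
    """Return the most recent local and external hit rates (in percent)."""
    rates = {}
    for kind, value in _RATE.findall(log_text):
        key = "external" if kind.startswith("External") else "local"
        rates[key] = float(value)   # later lines overwrite earlier ones
    return rates
```

A typical use would be to read each per-instance log file and call `last_hit_rates` on its contents.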

3. Results

Errata (2026-05-22): The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported a 44% E2E latency reduction for elastic. Post-hoc analysis showed that the dash0 baseline instances had not been freshly restarted: residual KV cache from prior experiments inflated baseline TTFT by about 2×. All results below come from verified fresh-restart experiments on the same machine.

3.1 Fair Comparison (all fresh-restart, same machine dash0, 200 req)

Config	OK/N	TTFT p50	TTFT p90	TPOT p90	E2E p50
Baseline (no Mooncake)	198/200	1.075s	9.384s	0.076s	5.075s
LMetric routing	198/200	1.099s	9.392s	0.073s	5.205s
Elastic P2P (kv_both)	195/200	1.018s	11.312s	0.085s	6.977s

3.2 Per-Class Breakdown

Baseline (fresh):

Class	Count	%	TTFT p50	TTFT p90	TPOT p90
WARM (<5k)	46	23%	0.137s	0.262s	0.061s
MEDIUM (5-20k)	50	25%	0.921s	1.846s	0.079s
HEAVY (20-50k)	64	32%	2.660s	6.278s	0.076s
HEAVY (>50k)	38	19%	9.587s	30.415s	0.102s

Elastic P2P (fresh):

Class	Count	%	TTFT p50	TTFT p90	TPOT p90
WARM (<5k)	46	23%	0.142s	0.279s	0.072s
MEDIUM (5-20k)	50	25%	0.766s	1.814s	0.197s
HEAVY (>20k)	99	51%	6.390s	22.668s	0.085s

3.3 Success Rate

Config	OK	Total	Rate	Failure mode
Baseline	198	200	99.0%	RemoteProtocolError (replayer-side)
Elastic P2P	195	200	97.5%	2× RemoteProtocolError + 3× ReadTimeout on >60k

Elastic's 3 extra errors are D-side KV pull failures: prefill succeeded on P and the KV was pushed to Mooncake, but D never produced a first token because the decode scheduler could not allocate KV cache space. The prefill timeout fallback (120s → co-located) was never triggered, since the stall occurred after prefill had completed.
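
A timeout-with-fallback of the kind mentioned above can be sketched with `asyncio.wait_for`. The callables `offload` and `colocated` are hypothetical, and this sketch times the whole offload path up to the response rather than prefill alone, which is an assumption of the example and not a description of the proxy.

```python
import asyncio

PREFILL_TIMEOUT_S = 120.0   # fallback threshold quoted in the text


async def serve_heavy(request, offload, colocated,
                      timeout_s=PREFILL_TIMEOUT_S):
    """Try the P -> D offload path; fall back to co-located on timeout.

    `offload` and `colocated` are hypothetical async callables that take
    the request and return a response.
    """
    try:
        return await asyncio.wait_for(offload(request), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The offload attempt is cancelled by wait_for before we retry.
        return await colocated(request)
```

A watchdog placed like this would also fire when the stall happens on the D side, which is the failure mode observed in the three ReadTimeout cases.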

3.4 Routing Policy: Linear vs LMetric (OSDI'26)

LMetric (score = P_tokens × BS, where P_tokens = pending_prefill + new_tokens and BS = num_requests; pure per-request, no session affinity) vs Linear (score = ongoing_tokens - α·cache_hit, session-sticky). Both fresh-restart, same trace.

Errata (2026-05-23): Prior LMetric implementation incorrectly shared session-sticky logic with Linear. Fixed to pure per-request routing per OSDI'26 spec: score = (pending_prefill + new_tokens) × num_requests, no affinity, no overload override. Results below use corrected implementation.

Policy	TTFT p50	TTFT p90	TPOT p90	E2E p50	Delta E2E
Linear (session-sticky)	1.073s	9.347s	0.073s	5.119s	—
LMetric (no affinity)	1.081s	9.408s	0.072s	5.102s	-0.3%

Key finding: LMetric without explicit session affinity matches Linear with session affinity on all metrics (< 2% difference). The cache-hit term in LMetric's scoring (new_tokens = input - cache_hit) creates implicit soft affinity — instances that already cached a session's prefix get lower P_tokens, naturally attracting subsequent turns. Explicit session-sticky routing is not required; cache-aware load balancing captures it automatically.
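
The two scoring rules, and the soft affinity that falls out of the cache-hit term, can be illustrated with a small sketch. The `InstanceState` fields, the `pick` helper, and `alpha=1.0` are placeholders chosen for the example; only the two score formulas come from the text.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    ongoing_tokens: int    # tokens of in-flight requests (Linear load term)
    pending_prefill: int   # prefill tokens queued but not yet computed
    num_requests: int      # requests currently on the instance


def linear_score(inst, cache_hit, alpha=1.0):
    """Linear: load minus a cache bonus (alpha is a placeholder value)."""
    return inst.ongoing_tokens - alpha * cache_hit


def lmetric_score(inst, input_tokens, cache_hit):
    """LMetric: (pending_prefill + new_tokens) * num_requests."""
    new_tokens = input_tokens - cache_hit
    return (inst.pending_prefill + new_tokens) * inst.num_requests


def pick(instances, input_tokens, cache_hits, policy="lmetric"):
    """Return the index of the lowest-scoring instance (ties: lowest index)."""
    def score(i):
        if policy == "linear":
            return linear_score(instances[i], cache_hits[i])
        return lmetric_score(instances[i], input_tokens, cache_hits[i])
    return min(range(len(instances)), key=score)


if __name__ == "__main__":
    # Two equally loaded instances; instance 1 holds 30k of a 40k prompt.
    insts = [InstanceState(50_000, 10_000, 2), InstanceState(50_000, 10_000, 2)]
    print(pick(insts, 40_000, [0, 30_000], "lmetric"))
    print(pick(insts, 40_000, [0, 30_000], "linear"))
```

With equal load, both policies choose the instance that already holds the prefix, even though LMetric has no explicit session table. Note that with the formula exactly as written, an instance with `num_requests = 0` always scores 0 regardless of its cache contents.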

APC distribution (LMetric, no affinity): inst_0=60.6%, inst_1=58.3%, inst_2=43.2%, inst_3=28.9%, inst_4=16.6%, inst_5=24.0%, inst_6=13.9%, inst_7=0.0%. Non-uniform but comparable aggregate to Linear's explicit affinity.

3.5 Errata: Why Prior Cross-Machine A/B Was Invalid

The initial comparison (commit 1e86285) reported:

Baseline (dash0): TTFT50=2.383  E2E50=10.232  ← WRONG (warm instances)
Elastic  (dash1): TTFT50=1.315  E2E50=5.708
Delta:                   -45%          -44%    ← INVALID

Evidence that prior baseline was not fresh:

inst_7 APC = 68.3% — impossible from 25 cold-start requests (max ~25%)
WARM TTFT p90 = 3.327s (fresh = 0.262s, 12.7× gap) — indicates KV cache memory pressure from prior experiments
HEAVY TPOT p90 = 0.154s (fresh = 0.076s, 2.0× gap) — heavy prefill-decode interference from full KV cache

The elastic numbers on dash1 were genuinely fresh. The apparent "improvement" came from comparing a fresh elastic deployment against a degraded baseline.

4. System-Level Analysis

4.1 Elastic P2P Does Not Improve Single-Machine Performance

Under fair comparison (same machine, both fresh):

Metric	Baseline	Elastic	Delta
TTFT p50	1.075s	1.018s	-5.3%
TTFT p90	9.384s	11.312s	+20.5%
TPOT p90	0.076s	0.085s	+11.6%
E2E p50	5.075s	6.977s	+37.5%

Elastic is worse on all metrics except TTFT p50. Root causes:

1. Mooncake kv_both memory overhead

Each instance with kv_role=kv_both maintains RDMA buffers + Mooncake bootstrap server, reducing GPU memory available for KV cache. This affects ALL requests (including WARM/MEDIUM that don't use P2P transfer), causing more cache eviction and higher TPOT.

Evidence: MEDIUM TPOT p90 = 0.197s (elastic) vs 0.079s (baseline) — 2.5× worse despite MEDIUM requests not using P2P at all.

2. D-side KV pull failures

Three HEAVY requests completed prefill on the P instance, but the D side never produced a first token. The KV cache on D was too full to allocate space for the transferred blocks, so these requests ran into the 600s request timeout.

3. P2P overhead without proportional benefit

The P2P path adds: prefill queue on P (p50=6.3s) + KV transfer + decode start on D (p50=0.8s). For requests where the D instance isn't under heavy prefill load (which is the case on fresh instances), co-located execution is faster.

4.2 When Elastic P2P Could Help

Elastic P2P is designed for the scenario where D-instance decode is disrupted by co-located heavy prefill. On fresh instances with 200 requests, this contention is moderate. The benefit may emerge under:

Higher sustained load (1000+ concurrent requests)
Longer experiment duration (KV cache fills up, eviction pressure increases)
Multi-machine deployment (P on a different node, no memory competition)

5. Data & Log Locations

5.1 Experiment Outputs (on respective machines)

Directory	Machine	Config	Notes
`outputs/ab_baseline/`	dash0	Combined 8× TP=1	~~Initial A/B~~ (INVALIDATED: warm instances)
`outputs/ab_elastic/`	dash1	Elastic P2P cap=4	~~Initial A/B~~ (INVALIDATED)
`outputs/baseline_stability_fresh/`	dash0	Combined 8× fresh	Canonical baseline (§3.1)
`outputs/elastic_stability_*/`	dash0	Elastic P2P kv_both fresh	Canonical elastic (§3.1)
`outputs/ab_linear/`	dash0	Linear policy, 200 req	§3.4 routing policy comparison
`outputs/ab_lmetric/`	dash0	LMetric policy, 200 req	§3.4 routing policy comparison
`outputs/gpu_ab_combined/`	local	Combined 8× TP=1	Earlier run, has gpu_util.csv
`outputs/gpu_ab_pdsep/`	local	PD-Sep 4P+4D	Earlier run, has gpu_util.csv
`outputs/exp2_combined_tp1_dp8/`	local	Combined 8× TP=1	1000 req, cache-aware
`outputs/exp3_pd_sep_tp1_mooncake/`	local	PD-Sep 4P+4D Mooncake	1000 req

5.2 vLLM Instance Logs

Path pattern	Machine	Config
`/tmp/ab_base_$i.log`	dash0	Baseline instances 0-7
`/tmp/ab_elastic_$i.log`	dash1	Elastic instances 0-7
`/tmp/lmetric_ab_inst_$i.log`	dash0	Linear policy instances 0-7 (§3.4)
`/tmp/lmetric_inst_$i.log`	dash0	LMetric policy instances 0-7 (§3.4)

Logs contain Prefix cache hit rate and External prefix cache hit rate lines for APC extraction.

5.3 Trace Data

Path	Machine	Description
`~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl`	dash0	Full 2h production trace (2.1M requests)
`traces/sampled_1000req_seed42.jsonl`	all	Sampled 1000 requests (gitignored, regenerate with `sample_trace.py`)

5.4 Analysis Documents

File	Content
`analysis/pd_separation_analysis.md`	Main report: PD-Sep vs Combined + Elastic P2P (§5)
`analysis/elastic_offload_design.md`	Elastic P2P design rationale
`analysis/kv_lifecycle_design.md`	KV cache eviction policy analysis
`analysis/adaptive_prefill_offload_design.md`	Initial adaptive offload design (superseded by elastic)

6. Repository Structure

agentic-kv/
├── analysis/                    # Research reports and design docs
│   ├── pd_separation_analysis.md    # Main comprehensive report
│   ├── elastic_offload_design.md    # Elastic P2P design
│   ├── kv_lifecycle_design.md       # Cache eviction analysis
│   └── ...
├── replayer/                    # Trace replay framework
│   ├── __main__.py              # CLI entry: python -m replayer
│   ├── replay.py                # Async replayer (session-aware, SSE streaming)
│   ├── trace.py                 # TraceRequest dataclass, session/hash_id handling
│   └── metrics.py               # RequestMetrics, crash-safe JSONL sink
├── scripts/
│   ├── cache_aware_proxy.py     # Global scheduler (combined + PD-sep + elastic offload)
│   ├── sample_trace.py          # Cluster-to-machine trace sampler
│   ├── launch_vllm.sh           # Launch combined TP=8
│   ├── launch_pd_mooncake.sh    # Launch PD-Sep with Mooncake
│   ├── launch_elastic_p2p.sh    # Launch elastic P2P (8× kv_both + offload proxy)
│   ├── run_experiments.sh       # Full experiment matrix (combined/PD-sep)
│   ├── run_benchmark.sh         # Single benchmark run
│   ├── gpu_monitor.sh           # GPU utilization sampler (5s CSV)
│   ├── compute_roofline.py      # Prefill/decode roofline analysis
│   ├── analyze_*.py             # Various analysis scripts
│   └── compare_*.py             # Experiment comparison scripts
├── patches/
│   ├── 0001-fix-kv-transfer-abort-race.patch
│   └── README.md
├── third_party/vllm/            # vLLM 0.18.1 source (with patch applied)
├── outputs/                     # Experiment results (gitignored)
├── traces/                      # Sampled traces (gitignored)
├── TODO.md                      # Original research goals
└── REPORT.md                    # This milestone report

7. Key Scripts Reference

Script	What it does	Key flags
`scripts/cache_aware_proxy.py`	Global scheduler + elastic offload proxy	`--combined`, `--offload`, `--policy {linear,lmetric}`, `--heavy-threshold`, `--bootstrap-ports`
`scripts/run_lmetric_ab.sh`	A/B: linear vs lmetric routing policy	Runs both experiments with fresh restart
`scripts/run_elastic_stability_test.sh`	Elastic vs baseline with full isolation	Fresh start/stop per experiment
`scripts/bench.sh`	Standard single-experiment harness	`--tag`, `--mode {baseline,elastic}`
`scripts/sample_trace.py`	Sample complete sessions from cluster trace	`--target-requests`, `--seed`
`python -m replayer`	Replay trace against vLLM endpoint	`--time-scale`, `--max-inflight-sessions`, `--request-limit`
`scripts/gpu_monitor.sh`	Sample nvidia-smi to CSV	Pipe to `outputs/<tag>/gpu_util.csv`
`scripts/launch_elastic_p2p.sh`	Launch all 8 kv_both instances + offload proxy	`HEAVY_THRESHOLD`, `MAX_OFFLOAD` env vars

8. GPU Load Imbalance & Elastic Prefill Service Analysis

8.1 Load Imbalance Characterization

Session-sticky routing creates token load imbalance across instances. The severity depends on scale:

Scale	Imbalance	Top 5 sessions	Cause
200 req (143 sessions)	8.6× tokens	49% of all tokens	Small sample, few dominant sessions
1000 req (668 sessions)	1.24× tokens	29% of all tokens	More sessions → natural averaging

At 1000 requests, the heaviest instance has 4.5M tokens vs lightest 3.6M (1.24×). Despite this, TPOT is uniform across all instances (0.070–0.077), confirming that prefill-decode interference is minimal at ≤1 session/GPU. The imbalance manifests in TTFT only: heaviest 2 instances TTFT p50 = 1.42s vs lightest 2 at 0.83s (1.7× gap).
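
The two quantities used in this subsection can be computed as in the sketch below. It assumes imbalance means the max/min ratio of per-instance token totals, which matches the 4.5M vs 3.6M example; the function names are illustrative.

```python
def imbalance(tokens_per_instance):
    """Ratio of the heaviest to the lightest instance's token load."""
    lightest = min(tokens_per_instance)
    if lightest == 0:
        return float("inf")
    return max(tokens_per_instance) / lightest


def top_k_share(session_tokens, k=5):
    """Fraction of all tokens contributed by the k largest sessions.

    `session_tokens` maps a session id to its total token count.
    """
    total = sum(session_tokens.values())
    top = sorted(session_tokens.values(), reverse=True)[:k]
    return sum(top) / total
```

With the rounded figures quoted above, `imbalance([4.5e6, 3.6e6])` gives 1.25, in line with the reported 1.24× computed from unrounded totals.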

8.2 Session Accumulation Pattern

Agentic workloads produce long-lived sessions with growing context:

Session	Turns	Total Tokens	Context Growth
1569319	36	2.32M	27k → 98k (+2.0k/turn)
1206593	36	2.31M	15k → 106k (+2.6k/turn)
178176	25	1.93M	36k → 95k (+2.5k/turn)

Top 5 sessions = 29% of all tokens. With session-sticky routing, these sessions pin their instances, creating persistent load hotspots.

8.3 Benchmark Concurrency vs Production Reality

Critical caveat: All prior experiments used --max-inflight-sessions 8 (1 session/GPU). This is 10–15× below production concurrency and masks the interference that elastic PS is designed to solve.

	Our Benchmark	Production Estimate
Concurrent requests/GPU	1–2	8–15
KV cache usage/GPU	26–28% (single req)	80–100%
Prefill-decode interference	Minimal	Significant

KV cache capacity: 281,888 tokens/GPU (25.8 GiB). A single 75k-token request consumes 27% of KV cache. At production concurrency (~15 req/GPU), KV cache is near-full, triggering eviction, cache misses, and prefill queuing — none of which appear in our 1-req/GPU benchmark.
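
The capacity arithmetic can be checked with a few lines. The constants come from this report; the calculation ignores prefix sharing between requests and output tokens, so it is a rough upper bound on demand and not a measurement.

```python
KV_CAPACITY_TOKENS = 281_888   # per GPU, from the text above
KV_CAPACITY_GIB = 25.8
AVG_INPUT_TOKENS = 33_600      # trace average from section 2.3


def kv_fraction(tokens, capacity=KV_CAPACITY_TOKENS):
    """Fraction of one GPU's KV cache occupied by `tokens` tokens."""
    return tokens / capacity


def kv_bytes_per_token():
    """KV bytes per token implied by the two capacity figures."""
    return KV_CAPACITY_GIB * 2**30 / KV_CAPACITY_TOKENS


if __name__ == "__main__":
    print(f"75k-token request: {kv_fraction(75_000):.1%} of KV cache")
    print(f"15 average requests: {kv_fraction(15 * AVG_INPUT_TOKENS):.2f}x capacity")
    print(f"implied KV size: {kv_bytes_per_token():,.0f} bytes/token")
```

Fifteen average-sized requests would need about 1.8× the available KV cache before any prefix sharing, which is why eviction and queuing are expected at production concurrency.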

Measured interference scaling:

Concurrency	TPOT p90	vs 8-session
8 sessions (1/GPU)	0.075s	baseline
16 sessions (2/GPU)	0.103s	+38%
Production (~15/GPU)	not tested	expected >>+45%

8.4 Elastic PS: Three Capabilities Re-Evaluated

Capability 1: Reduce prefill-decode interference (lower TPOT)

At 1 req/GPU (our benchmark): no interference, no benefit. But this is an artifact of unrealistically low concurrency. At ≥2 req/GPU, chunked prefill interrupts decode steps, causing TPOT +38–45%. At production concurrency (~15/GPU), multiple HEAVY prefills sharing a GPU with decode requests would cause severe interference. Elastic PS's ability to isolate heavy prefills on separate GPUs directly addresses this.

Capability 2: Session migration for load balancing

Elastic PS enables mid-session migration: prefill on original instance (cache hit), KV transfer to a different instance for decode + future turns. This provides two benefits:

Break session lock-in: Agentic sessions grow +2k tokens/turn over 30+ turns. With session-sticky, a 36-turn session (2.3M tokens total) locks its GPU, creating a hotspot. Elastic PS lets the session migrate to a less-loaded GPU while preserving cache on the original (PS does fast cached prefill, new GPU decodes).
Rebalance without cache loss: Unlike breaking affinity (which destroys cache), elastic PS migration preserves the prefix cache on the original instance — the PS re-uses it for fast prefill, then transfers only the new KV to the destination.

Simulation of migration strategies (1000 req, at current low concurrency):

Strategy	Imbalance	Migrations	KV Transfer Overhead
No migration	1.24×	0	0s
Every 10 turns	1.21×	10	15s
Every 5 turns	1.20×	20	30s

At 1 req/GPU, migration benefit is marginal (imbalance is only 1.24×). At production concurrency where imbalance combines with KV cache pressure and interference, the benefit would be substantially larger.
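
A periodic-migration strategy of this kind can be simulated in a few lines. The sketch below is not the simulation that produced the table: it processes sessions one after another instead of interleaving them in time, uses a modulo assignment as a stand-in for sticky routing, and takes 1.5 s per migration from the table's 15 s / 10 migrations.

```python
def simulate(sessions, num_instances=8, migrate_every=None, transfer_s=1.5):
    """Estimate load imbalance under periodic session migration.

    `sessions` is a list of per-turn token counts, one list per session.
    Returns (imbalance, migrations, transfer_overhead_s).
    """
    load = [0] * num_instances
    migrations = 0
    for sid, turns in enumerate(sessions):
        home = sid % num_instances          # stand-in for the sticky assignment
        for turn_idx, tokens in enumerate(turns, start=1):
            load[home] += tokens
            due = migrate_every and turn_idx % migrate_every == 0
            if due and turn_idx < len(turns):
                # Move the session to the currently least-loaded instance.
                target = min(range(num_instances), key=load.__getitem__)
                if target != home:
                    home = target
                    migrations += 1
    lightest = min(load)
    ratio = max(load) / lightest if lightest else float("inf")
    return ratio, migrations, migrations * transfer_s
```

Calling `simulate(sessions, migrate_every=10)` and `simulate(sessions, migrate_every=5)` on the same input gives a rough feel for the trade-off between imbalance and transfer overhead.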

Capability 3: Soft affinity from cache-hit scoring

The corrected LMetric experiment (§3.4) reveals that explicit session affinity is unnecessary. Cache-hit scoring (new_tokens = input − cached) creates implicit soft affinity — instances with cached prefixes score lower, naturally attracting subsequent turns. This matches hard session-sticky on all metrics (< 2% difference) while providing more flexible load balancing.

8.5 Elastic PS Verdict

Aspect	At 1 req/GPU (tested)	At production load (expected)
TPOT reduction	0% (no interference)	Significant (interference scales with concurrency)
Session migration	Marginal (1.24× → 1.20×)	Larger benefit (KV pressure + interference amplify imbalance)
Cache preservation	N/A	Key advantage vs breaking affinity

At our benchmark concurrency (1 req/GPU), elastic PS is not justified: it pays the Mooncake overhead but has no interference to remove. However, our benchmark runs 10–15× below production load. The open question is whether elastic PS helps at production-realistic concurrency (64–128 concurrent sessions, 8–15 req/GPU), where:

Prefill-decode interference is measurable (already +38% TPOT at just 2/GPU)
KV cache pressure causes eviction and queue delays
Session accumulation creates compounding hotspots
Heavy prefills (50–100k tokens) block decode for seconds

Next step: run --max-inflight-sessions 64 benchmark to test elastic PS at production-realistic concurrency.

9. Conclusions & Next Steps

Established findings:

Full PD separation is net negative for single-machine agentic workloads (KV cache memory wall)
Cache-aware routing is the dominant optimization (+24pp APC, -60% TTFT vs round-robin)
Explicit session affinity is unnecessary — cache-hit scoring creates implicit soft affinity that matches hard session-sticky (< 2% difference)
At low concurrency (1 req/GPU), elastic P2P offload adds overhead without benefit
Our benchmark concurrency is 10–15× below production: --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode interference that appears at ≥2 req/GPU (+38% TPOT) and would dominate at production load (~15 req/GPU)
Experimental methodology matters: warm vs fresh instances cause 2× TTFT difference

Lessons learned:

Prior cross-machine A/B (commit 1e86285) was invalid: the warm baseline's TTFT was inflated by about 2×
Prior LMetric implementation was invalid — incorrectly shared session-sticky logic with Linear
kv_role=kv_both has non-trivial always-on overhead even when P2P transfer is not used
Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility
Benchmark concurrency must match production — sub-production concurrency hides interference effects that drive system design decisions

Open problems (priority ordered):

Production-concurrency benchmark (--max-inflight-sessions 64+): Validate whether prefill-decode interference at 8–15 req/GPU makes elastic PS net-positive
Multi-machine elastic: P on a different node eliminates GPU memory competition — the main cost that makes single-machine elastic net negative
Layerwise KV transfer: Mooncake's block-level transfer after full prefill is the bottleneck. Layerwise pipelining could reduce transfer latency by overlapping with computation
Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)

Updated 2026-05-23. LMetric corrected (§3.4 errata). GPU imbalance analysis added (§8). Benchmark concurrency gap identified — production-load experiments needed.
