agentic-kvc/REPORT.md
Gahow Wang 1cd0a18e2c Report §3.8: Document direct KV cache migration architecture + bugs fixed
Complete documentation of bootstrap-triggered PUSH implementation:
hash table sync, token-based lookup, RDMA WRITE path, cost model,
PYTHONHASHSEED requirement, and all 6 bugs fixed during development.

Verified: 640/640 blocks pushed, External APC 80%, TTFT 0.367s
(vs local cache 0.338s, +0.03s overhead).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:52:38 +08:00


Milestone Report: Elastic P2P vs PD-Combined Baseline

Date: 2026-05-22
Author: Gahow Wang
Status: Phase 1 complete (baseline + elastic validated, system-level analysis done)


1. Research Question

For agentic LLM workloads (long input, short output, high KV cache reuse), is prefill-decode disaggregation beneficial? If full PD separation hurts (proven in §3), can selective disaggregation of only heavy requests improve serving latency while preserving KV cache locality?

1.1 Errata / Superseded sections

This report has been revised several times as the methodology matured. The sections below are kept for historical context but their numerical conclusions have been superseded — do not cite them in isolation.

  • §3.1 (initial PD-sep vs PD-combined): ran with the old random sampler + --time-scale compression + --max-inflight-sessions 8. Cross-session KV reuse dropped from 52% → 16%, and per-GPU concurrency was capped at 1 req/GPU. Superseded by §3.6.
  • Earlier "elastic v3" warm-vs-fresh runs: baselines were not restarted between trials, leaving residual KV cache that inflated baseline TTFT ~2×. Superseded by the cold-start results in §3.6/§3.7.
  • Any reference to running --max-inflight-sessions 64+: that flag was removed when replay moved to trace-driven dispatch. The next-step experiment requires restoring the flag first (see FIXES.md §B2 route A) before any production-concurrency numbers can be produced.

The authoritative results are in §3.6 and §3.7.

2. Experimental Setup

2.1 Hardware

| Resource | Spec |
| --- | --- |
| Machine | dash0 / dash1 (identical config) |
| GPU | 8× NVIDIA H20 96GB HBM, NVLink |
| Network | 4× ConnectX-7 200Gbps RDMA |
| Storage | cpfs shared storage across machines |

2.2 Software

| Component | Version | Notes |
| --- | --- | --- |
| vLLM | 0.18.1 (source in third_party/vllm/) | Patched scheduler assert (see patches/) |
| Mooncake | 0.3.10 | RDMA-based KV transfer between instances |
| Python | 3.x (managed by uv) | .venv/ at project root |
| Model | Qwen3-Coder-30B-A3B-Instruct | MoE 128 experts top-8, 3B active params |
| Model path | ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct | Same on dash0 and dash1 |

2.3 Workload Trace

| Property | Value |
| --- | --- |
| Source | GLM-5.1 Agentic Coder, production cluster, 2h window |
| Raw trace | ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl on dash0 |
| Total requests | 2,114,220 |
| Avg input tokens | 33,600 (p50=20k, p90=88k) |
| Avg output tokens | 445 (p50=80) |
| I/O ratio | 75.6× aggregate |
| Prefill token share | 98% |
| KV reuse breakdown | 62% intra-session, 38% cross-session (token-level) |
| Theoretical max APC | 67% (infinite cache, single instance, prefix-only) |

Sampled trace for benchmarks: traces/w600_r0.0015_st30.jsonl (1214 requests, 688 sessions, 70% multi-turn). Generated with window+thin sampling:

python scripts/sample_trace.py \
    --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/w600_r0.0015_st30.jsonl \
    --sample-ratio 0.0015 --max-single-turn-ratio 0.3 \
    --window-seconds 600 --seed 42

| Trace property | Value |
| --- | --- |
| Sessions | 688 (70% multi-turn, avg 4.9 turns) |
| Requests | 1214 (use --request-limit 850 for daily, full for validation) |
| Avg input tokens | 48,776 |
| Trace span | 2912s (48.5 min); dense segment 0-990s (850 req) |
| Peak QPS | 1.6 req/s (in dense segment) |
| Hash block sharing | 48.3% (vs 52% full trace) |
| Theoretical APC | 80% (full), 76% (first 850 req) |

Sampling methodology (2026-05-23): Prior traces used random session sampling + --time-scale compression + --max-inflight-sessions semaphore, which (a) destroyed cross-session hash block sharing (52% → 16%), (b) artificially limited concurrency to 1 req/GPU, and (c) masked prefill-decode interference. The new approach uses contiguous time-window sampling with session thinning (--max-single-turn-ratio 0.3) to preserve KV reuse patterns, and trace-driven replay with no artificial concurrency limits.
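The window+thin idea can be sketched as follows. This is a minimal illustration, not the logic of scripts/sample_trace.py: the record keys `session_id` and `timestamp`, the rule that a session qualifies when its first request falls inside the window, and the way the single-turn cap is applied are all assumptions made for the example.

```python
import random
from collections import defaultdict


def window_thin_sample(requests, window_start, window_seconds,
                       sample_ratio, max_single_turn_ratio, seed=42):
    """Keep whole sessions that start inside one contiguous time window.

    `requests` is a list of dicts with hypothetical keys
    "session_id" and "timestamp" (seconds from trace start).
    """
    rng = random.Random(seed)
    sessions = defaultdict(list)
    for req in requests:
        sessions[req["session_id"]].append(req)

    window_end = window_start + window_seconds
    # Assumption: a session qualifies if its FIRST request is in the window;
    # later turns are kept even if they fall after the window closes.
    in_window = [
        reqs for reqs in sessions.values()
        if window_start <= min(r["timestamp"] for r in reqs) < window_end
    ]

    # Thin by session (never by request) so multi-turn prefixes stay intact.
    kept = [reqs for reqs in in_window if rng.random() < sample_ratio]

    multi = [s for s in kept if len(s) > 1]
    single = [s for s in kept if len(s) == 1]
    # Cap single-turn sessions: single / (single + multi) <= ratio.
    if max_single_turn_ratio < 1.0:
        cap = int(len(multi) * max_single_turn_ratio
                  / (1.0 - max_single_turn_ratio))
        rng.shuffle(single)
        single = single[:cap]

    out = [r for s in multi + single for r in s]
    out.sort(key=lambda r: r["timestamp"])
    return out
```

Sampling whole sessions inside one contiguous window is what keeps neighbouring sessions (and their shared hash blocks) together; uniform random session sampling across the full 2h trace spreads them apart.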

2.4 Two Configurations Compared

Baseline: PD-Combined (8× TP=1 DP=8)

8 independent vLLM instances, 1 GPU each, no Mooncake.
All instances do both prefill and decode.
Global scheduler (cache_aware_proxy.py --combined) handles:
  - Session-sticky routing (multi-turn → same instance)
  - Load-aware override (if pinned instance > 2× avg load, redirect)
  - Cache-hit scoring (prefer instance with matching prefix blocks)

Launch:

# On dash0:
for i in $(seq 0 7); do
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
    vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --port $((8000+i)) --tp 1 \
        --enable-prefix-caching --enforce-eager \
        --gpu-memory-utilization 0.9 --max-model-len 200000 \
        > /tmp/ab_base_$i.log 2>&1 &
done

python scripts/cache_aware_proxy.py \
    --combined http://127.0.0.1:800{0..7} --port 9090

Elastic P2P Offload (8× TP=1 kv_both + selective offload)

8 independent vLLM instances, 1 GPU each, all kv_role=kv_both (Mooncake).
Same global scheduler, plus elastic offload logic:
  - Proxy classifies each request: WARM (<5k new), MEDIUM (5-20k), HEAVY (>20k)
  - WARM/MEDIUM: co-located on session-sticky instance (no KV transfer)
  - HEAVY: prefill on a different instance (P), KV via Mooncake RDMA,
    decode on session-sticky instance (D)
  - Cap: max 4 concurrent offloads (MAX_OFFLOAD_INFLIGHT)
  - P instance selection: round-robin with overload skip
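
A minimal sketch of the classification and cap described above, using the thresholds from this section. The function names are hypothetical, and the treatment of exactly 5k and 20k new tokens is an assumption based on the class labels.

```python
WARM_MAX = 5_000            # WARM: fewer than 5k new tokens
HEAVY_THRESHOLD = 20_000    # HEAVY: more than 20k new tokens
MAX_OFFLOAD_INFLIGHT = 4    # cap on concurrent offloads


def classify(input_tokens, cached_tokens):
    """Classify by NEW (uncached) prompt tokens, not total input length."""
    new_tokens = max(input_tokens - cached_tokens, 0)
    if new_tokens < WARM_MAX:
        return "WARM"
    if new_tokens <= HEAVY_THRESHOLD:
        return "MEDIUM"
    return "HEAVY"


def may_offload(request_class, offloads_inflight):
    """Only HEAVY requests are offload candidates, and only under the cap."""
    return request_class == "HEAVY" and offloads_inflight < MAX_OFFLOAD_INFLIGHT
```

Because the class depends on new tokens, a 90k-token turn with an 88k-token cached prefix is WARM and stays on its session-sticky instance.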

Launch:

# On dash1 (or use scripts/launch_elastic_p2p.sh):
for i in $(seq 0 7); do
    VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998+i)) \
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
    vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --port $((8000+i)) --tp 1 \
        --enable-prefix-caching --enforce-eager \
        --gpu-memory-utilization 0.9 --max-model-len 200000 \
        --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
        > /tmp/ab_elastic_$i.log 2>&1 &
    sleep 2  # stagger to avoid NCCL port collision
done

# Wait for bootstrap servers
for bp in $(seq 8998 9005); do
    until curl -s localhost:$bp/query > /dev/null 2>&1; do sleep 2; done
done

python scripts/cache_aware_proxy.py \
    --combined http://127.0.0.1:800{0..7} \
    --bootstrap-ports 8998,8999,9000,9001,9002,9003,9004,9005 \
    --offload --heavy-threshold 20000 --port 9090

2.5 Benchmark Parameters

| Parameter | Value |
| --- | --- |
| Trace | traces/w600_r0.0015_st30.jsonl (window+thin, 70% multi-turn) |
| Daily iteration | --request-limit 850 (~13 min, APC≈76%) |
| Full validation | All 1214 requests (~48 min, APC≈80%) |
| Replay mode | Trace-driven (no session limit, no time compression) |
| Request timeout | 600s |
| vLLM flags | --enforce-eager --enable-prefix-caching --max-model-len 200000 |
| GPU memory util | 0.9 |
| Fresh restart | Both configs started from cold (no warm cache) |

2.6 Reproducing the Benchmark

# Activate environment
cd ~/agentic-kv && source .venv/bin/activate

# Ensure sampled trace exists (same command as §2.3)
python scripts/sample_trace.py \
    --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/w600_r0.0015_st30.jsonl \
    --sample-ratio 0.0015 --max-single-turn-ratio 0.3 \
    --window-seconds 600 --seed 42

# Run benchmark (daily iteration)
bash scripts/bench.sh --tag my_experiment --mode baseline --policy linear \
    --trace traces/w600_r0.0015_st30.jsonl --requests 850

# Run benchmark (full validation)
bash scripts/bench.sh --tag my_experiment_full --mode baseline --policy linear \
    --trace traces/w600_r0.0015_st30.jsonl

3. Results

Errata (2026-05-22): The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported -44% E2E improvement. Post-hoc analysis revealed the dash0 baseline instances were not freshly restarted — residual KV cache from prior experiments caused 2× TTFT inflation.

Errata (2026-05-23): §3.1 results used artificial concurrency limits (--max-inflight-sessions 8, 1 req/GPU) and random session sampling that destroyed cross-session KV sharing (52% → 16%). See §3.6 for production-realistic results with corrected methodology.

3.1 Legacy Comparison (artificial 1 req/GPU, 200 req)

| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
| --- | --- | --- | --- | --- | --- |
| Baseline (no Mooncake) | 198/200 | 1.075s | 9.384s | 0.076s | 5.075s |
| LMetric routing | 198/200 | 1.099s | 9.392s | 0.073s | 5.205s |
| Elastic P2P (kv_both) | 195/200 | 1.018s | 11.312s | 0.085s | 6.977s |

3.2 Per-Class Breakdown

Baseline (fresh):

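| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
| --- | --- | --- | --- | --- | --- |
| WARM (<5k) | 46 | 23% | 0.137s | 0.262s | 0.061s |
| MEDIUM (5-20k) | 50 | 25% | 0.921s | 1.846s | 0.079s |
| HEAVY (20-50k) | 64 | 32% | 2.660s | 6.278s | 0.076s |
| HEAVY (>50k) | 38 | 19% | 9.587s | 30.415s | 0.102s |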

Elastic P2P (fresh):

| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
| --- | --- | --- | --- | --- | --- |
| WARM (<5k) | 46 | 23% | 0.142s | 0.279s | 0.072s |
| MEDIUM (5-20k) | 50 | 25% | 0.766s | 1.814s | 0.197s |
| HEAVY (>20k) | 99 | 51% | 6.390s | 22.668s | 0.085s |

3.3 Success Rate

| Config | OK | Total | Rate | Failure mode |
| --- | --- | --- | --- | --- |
| Baseline | 198 | 200 | 99.0% | RemoteProtocolError (replayer-side) |
| Elastic P2P | 195 | 200 | 97.5% | 2× RemoteProtocolError + 3× ReadTimeout on >60k |

Elastic's 3 extra errors are D-side KV pull failures: prefill succeeded on P, KV pushed to Mooncake, but D never produced first token (decode scheduler couldn't allocate KV cache space). Prefill timeout fallback (120s → co-located) was never triggered.

3.4 Routing Policy: Linear vs LMetric (OSDI'26)

LMetric (score = P_tokens × BS, pure per-request, no session affinity) vs Linear (score = ongoing_tokens - α·cache_hit, session-sticky). Both fresh-restart, same trace.

Errata (2026-05-23): Prior LMetric implementation incorrectly shared session-sticky logic with Linear. Fixed to pure per-request routing per OSDI'26 spec: score = (pending_prefill + new_tokens) × num_requests, no affinity, no overload override. Results below use corrected implementation.

| Policy | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 | Delta E2E |
| --- | --- | --- | --- | --- | --- |
| Linear (session-sticky) | 1.073s | 9.347s | 0.073s | 5.119s | |
| LMetric (no affinity) | 1.081s | 9.408s | 0.072s | 5.102s | -0.3% |

Key finding: LMetric without explicit session affinity matches Linear with session affinity on all metrics (< 2% difference). The cache-hit term in LMetric's scoring (new_tokens = input - cache_hit) creates implicit soft affinity — instances that already cached a session's prefix get lower P_tokens, naturally attracting subsequent turns. Explicit session-sticky routing is not required; cache-aware load balancing captures it automatically.

APC distribution (LMetric, no affinity): inst_0=60.6%, inst_1=58.3%, inst_2=43.2%, inst_3=28.9%, inst_4=16.6%, inst_5=24.0%, inst_6=13.9%, inst_7=0.0%. Non-uniform but comparable aggregate to Linear's explicit affinity.
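
The two scoring rules compared in this section can be written side by side. The sketch below follows the formulas quoted above; the `InstanceState` fields, the default `alpha`, and counting the incoming request in the batch size (so that idle instances still differ by cache hit) are assumptions for illustration. Lower score wins.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    ongoing_tokens: int    # tokens of requests currently running (Linear)
    pending_prefill: int   # prefill tokens not yet computed (LMetric)
    num_requests: int      # requests currently on the instance
    cache_hit: int         # prefix tokens of the new request cached here


def linear_score(s, alpha=1.0):
    # score = ongoing_tokens - alpha * cache_hit
    return s.ongoing_tokens - alpha * s.cache_hit


def lmetric_score(s, input_tokens):
    # score = (pending_prefill + new_tokens) * num_requests,
    # with new_tokens = input - cache_hit.
    new_tokens = input_tokens - s.cache_hit
    # Assumption: the batch size includes the incoming request.
    return (s.pending_prefill + new_tokens) * (s.num_requests + 1)


def route(states, score):
    """Return the index of the instance with the lowest score."""
    return min(range(len(states)), key=lambda i: score(states[i]))


# Two equally loaded instances; only the second holds the session's prefix.
states = [InstanceState(0, 0, 1, 0), InstanceState(0, 0, 1, 30_000)]
chosen = route(states, lambda s: lmetric_score(s, 40_000))
```

With equal load, the cached instance scores (0 + 10,000) × 2 against (0 + 40,000) × 2 and wins without any session table, which is the implicit soft affinity described above. A cached but busy instance can still lose, which is where LMetric differs from hard session-sticky routing.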

3.5 Errata: Why Prior Cross-Machine A/B Was Invalid

The initial comparison (commit 1e86285) reported:

Baseline (dash0): TTFT50=2.383  E2E50=10.232  ← WRONG (warm instances)
Elastic  (dash1): TTFT50=1.315  E2E50=5.708
Delta:                   -45%          -44%    ← INVALID

Evidence that prior baseline was not fresh:

  1. inst_7 APC = 68.3% — impossible from 25 cold-start requests (max ~25%)
  2. WARM TTFT p90 = 3.327s (fresh = 0.262s, 12.7× gap) — indicates KV cache memory pressure from prior experiments
  3. HEAVY TPOT p90 = 0.154s (fresh = 0.076s, 2.0× gap) — heavy prefill-decode interference from full KV cache

The elastic numbers on dash1 were genuinely fresh. The "improvement" was actually comparing fresh elastic against degraded baseline.

3.6 Production-Realistic Baseline (trace-driven, corrected methodology)

Corrected sampling (window+thin, 70% multi-turn, block sharing 48%) and trace-driven replay (no session limit, no time compression). See §2.3 for trace details.

Linear policy, 912 requests (dense segment), peak QPS ≈ 1.6:

| Metric | Legacy (§3.1, 1 req/GPU) | New (trace-driven) | Delta |
| --- | --- | --- | --- |
| TTFT mean | 1.07s | 4.54s | 4.2× |
| TTFT p50 | 1.08s | 0.94s | -13% |
| TTFT p90 | 9.38s | 14.12s | +51% |
| TPOT p50 | 0.038s | 0.070s | +84% |
| TPOT p90 | 0.073s | 0.175s | +139% |
| APC (mean) | ~44% | 67.5% | +23pp |
| Errors | 2/200 (1.0%) | 0/912 (0%) | better |
| E2E p50 | 5.08s | 6.98s | +37% |

Key differences from legacy methodology:

  1. APC 67.5% vs 44%: Window+thin sampling preserves cross-session block sharing (48% vs 16% in legacy random sampling), yielding production-realistic cache hit rates. Per-instance APC ranges 46–84%.

  2. TPOT +139% at p90: With trace-driven replay, multiple concurrent requests per GPU create real prefill-decode interference. The legacy 1 req/GPU setup showed TPOT p90=0.073s (no interference), but production-realistic load shows TPOT p90=0.175s. This validates that prefill-decode interference is a real problem at production concurrency.

  3. TTFT p50 improved (-13%) but mean degraded (4.2×): Higher APC means cached requests get very fast TTFT (p50=0.94s). But concurrent heavy prefills cause queuing for non-cached requests, inflating the mean and p90.

  4. Per-instance APC imbalance (46–84%): Routing quality directly determines per-instance APC. The 38pp gap between worst and best instance suggests routing optimization is still the highest-leverage improvement.

Output: outputs/baseline_r0015_st30/ on dash0.

3.7 Elastic PS vs Baseline (production-realistic trace)

850 requests, w600_r0.0015_st30.jsonl, peak QPS≈1.6. Baseline on dash0, elastic on dash1.

| Metric | Baseline | Elastic PS | Delta |
| --- | --- | --- | --- |
| TTFT mean | 4.35s | 4.01s | -7.8% |
| TTFT p50 | 0.94s | 0.93s | -1% |
| TPOT p50 | 0.070s | 0.071s | +2% |
| TPOT p90 | 0.162s | 0.157s | -3.1% |
| E2E p50 | 6.38s | 6.44s | +0.9% |
| APC mean | 60.7% | 59.9% | -0.8pp |
| Errors | 0/850 | 4/832 | 4 ReadTimeout |

Elastic PS is near-neutral. Root cause analysis:

Problem 1: Offload gate too restrictive — only 17/118 HEAVY requests (14%) were offloaded. 75% of HEAVY requests had cache_ratio=0% (cold Turn 1), failing the cache_ratio >= 0.3 gate. The gate was designed to avoid offloading cold requests (full prefill on P is slower than co-located), but this means 86% of HEAVY prefills still interfere with decode.
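
To see why the gate rejects most HEAVY requests, consider a sketch of the check. It assumes cache_ratio is cached prefix tokens divided by total input tokens; the function name and the example token counts are hypothetical.

```python
CACHE_GATE_RATIO = 0.3  # gate from this section: cache_ratio >= 0.3


def passes_offload_gate(input_tokens, cached_tokens, gate=CACHE_GATE_RATIO):
    """True if enough of the prompt is already cached to allow offload."""
    if input_tokens <= 0:
        return False
    return cached_tokens / input_tokens >= gate


# A cold first turn has nothing cached, so cache_ratio is 0 and it never
# passes, regardless of how heavy it is.
cold_turn1 = passes_offload_gate(60_000, 0)
# A later turn with half its prompt cached passes.
warm_turn = passes_offload_gate(60_000, 30_000)

# Offload share observed in this run: 17 of 118 HEAVY requests.
offload_share = 17 / 118
```

Since 75% of HEAVY requests are cold first turns, the gate caps the offload share well below what would be needed to relieve decode on the session-sticky instances.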

Problem 2: Offloaded requests are slower (+50.6%) — HEAVY_OFFLOAD TTFT=19.94s vs HEAVY_COLO=13.25s. Breakdown:

  • Prefill on P: 14.72s (P also queued, no faster than co-located)
  • KV transfer + decode start on D: 5.71s (pure overhead)

Interference is real but unaddressed: 89% of WARM/MEDIUM requests ran concurrently with 1+ HEAVY prefills (up to 60 concurrent). Elastic PS only offloaded 17/118 HEAVY requests — insufficient to reduce interference.

Conclusion: The offload gate (cache_ratio >= 0.3) is correct in principle (cold offload IS slower), but leaves the core problem unsolved. Reducing prefill-decode interference requires either:

  1. Offloading ALL heavy prefills (accepting higher TTFT for offloaded requests in exchange for lower TPOT for all)
  2. Chunked prefill scheduling that yields to decode (vLLM-side optimization)
  3. Dedicated prefill GPUs (full PD separation) if KV memory wall can be solved

Output: outputs/eval_baseline_linear/ on dash0, outputs/eval_elastic_linear/ on dash1.

3.8 Direct KV Cache Migration (Bootstrap-Triggered PUSH)

Architecture: D asks C's bootstrap server to PUSH cached KV blocks directly into D's GPU memory via Mooncake RDMA WRITE. C's vLLM scheduler is NOT involved (0 GPU compute on C). D then does local prefill for new tokens + decode.

Implementation details (vLLM + Mooncake patches):

  1. Hash table sync (scheduler → worker → bootstrap): Each step, scheduler computes delta of BlockPool.cached_block_hash_to_block and syncs to worker's bootstrap server via MooncakeConnectorMetadata.hash_table_updates.

  2. Token-based block lookup: D sends POST /push_blocks with prompt token_ids + D's GPU addresses. C's bootstrap computes block hashes using sha256 + NONE_HASH (same hash function as scheduler), matches against synced hash table.

  3. RDMA PUSH: C's bootstrap calls TransferEngine.batch_transfer_sync_write to push matched KV blocks from C's GPU into D's GPU. This uses the existing RDMA WRITE path (proven reliable), not RDMA READ (which fails on batch_register_memory'd GPU memory due to missing IBV_ACCESS_REMOTE_READ flags).

  4. Cost model: offload when colocated_cost + interference > offload_cost, where interference = prefill_time × min(num_requests, 3) × 0.3. Offload triggers when C has 1+ concurrent request.

  5. Requirements: PYTHONHASHSEED must be set (bench.sh sets PYTHONHASHSEED=42 for elastic mode) to ensure deterministic NONE_HASH across scheduler/worker code paths.
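
The cost model in step 4 reduces to a single comparison. The sketch below uses the formula as stated; the function names and the example timings (2.0s prefill, 2.3s offload cost) are illustrative assumptions, not measured values.

```python
def interference_penalty(prefill_time_s, num_requests):
    """interference = prefill_time * min(num_requests, 3) * 0.3"""
    return prefill_time_s * min(num_requests, 3) * 0.3


def should_migrate(colocated_cost_s, offload_cost_s,
                   prefill_time_s, num_requests):
    """Offload when colocated_cost + interference > offload_cost."""
    penalty = interference_penalty(prefill_time_s, num_requests)
    return colocated_cost_s + penalty > offload_cost_s


# Illustrative numbers: co-located prefill takes 2.0s, the offload path
# costs 2.3s end to end (transfer overhead included).
idle = should_migrate(2.0, 2.3, prefill_time_s=2.0, num_requests=0)
one_busy = should_migrate(2.0, 2.3, prefill_time_s=2.0, num_requests=1)
```

With no concurrent request the penalty is zero and the cheaper co-located path wins; one concurrent request adds 0.6s and tips the decision, which matches the "1+ concurrent request" trigger noted above. The `min(num_requests, 3)` term bounds the penalty at 0.9 × prefill_time.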

Minimal test verification (scripts/test_direct_read.py):

| Metric | inst_0 (local cache) | inst_1 (RDMA push from inst_0) |
| --- | --- | --- |
| Turn 2 TTFT | 0.338s | 0.367s |
| Blocks transferred | | 640/640 matched, push ret=0 |
| External APC | 0% | 80% |
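
The token-based lookup in step 2 depends on both sides deriving identical chained block hashes from the same token ids and the same root. The sketch below shows the chaining idea only: the payload encoding, the block size of 16, and the stand-in root value are assumptions and are not byte-compatible with vLLM's hashing.

```python
import hashlib
import struct

BLOCK_SIZE = 16  # assumed block size for illustration
NONE_HASH = hashlib.sha256(b"demo-root").digest()  # stand-in root hash


def block_hashes(token_ids, block_size=BLOCK_SIZE, root=NONE_HASH):
    """Chain a hash over each FULL block: h_i = sha256(h_{i-1} || tokens_i)."""
    hashes = []
    parent = root
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        chunk = token_ids[start:start + block_size]
        payload = parent + struct.pack(f"<{len(chunk)}I", *chunk)
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


def match_prefix(token_ids, hash_table):
    """Return block ids for the longest prefix found in the hash table."""
    matched = []
    for h in block_hashes(token_ids):
        block_id = hash_table.get(h)
        if block_id is None:
            break  # chained hashes: a miss invalidates everything after it
        matched.append(block_id)
    return matched


prompt = list(range(40))  # 2 full blocks + 8 leftover tokens
table = {h: i for i, h in enumerate(block_hashes(prompt))}
hits = match_prefix(prompt, table)
```

Because each hash folds in its parent, a different root on one side changes every hash in the chain. That is the failure shape of the 0/640 mismatches listed below, and the reason the root must be deterministic across processes.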

Key bugs fixed during development:

  • NameError: field not imported — missing dataclass import
  • Scheduler assertion crash (assert RequestStatus.is_finished) — partial remote prefill state mismatch
  • Hash mismatch 0/640 — sha256 vs sha256_cbor (default hash algo is sha256, not sha256_cbor)
  • Hash mismatch 0/640 — from X import NONE_HASH creates stale value binding after init_none_hash reassigns the global; fixed with import X; X.NONE_HASH
  • RDMA READ ret=-1 — batch_register_memory only sets IBV_ACCESS_REMOTE_WRITE; switched to bootstrap-triggered PUSH
  • Cost model 0% trigger — removed stale cache_gate_ratio check; added interference penalty
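
The stale-binding bug is plain Python import semantics and is easy to reproduce in isolation. The module below (`hashmod_demo`) is a stand-in built at runtime; only the names NONE_HASH and init_none_hash mirror the report, and the stored value is a placeholder.

```python
import sys
import types

# Build a tiny stand-in module with a global that is reassigned at init time.
hashmod_demo = types.ModuleType("hashmod_demo")
exec(
    "NONE_HASH = None\n"
    "def init_none_hash(seed):\n"
    "    global NONE_HASH\n"
    "    NONE_HASH = ('root', seed)\n",
    hashmod_demo.__dict__,
)
sys.modules["hashmod_demo"] = hashmod_demo

# Buggy pattern: copies the CURRENT value (None) into a local name.
from hashmod_demo import NONE_HASH as stale_none_hash

# Initialisation rebinds the module global; the earlier copy is untouched.
hashmod_demo.init_none_hash(42)

# Fixed pattern: attribute lookup on the module sees the new value.
fresh_none_hash = hashmod_demo.NONE_HASH

print(stale_none_hash)   # None
print(fresh_none_hash)   # ('root', 42)
```

`from X import NAME` binds a name to the object that existed at import time; a later `global NAME = ...` inside X rebinds X's name only. Any code path that imported the name before initialisation keeps hashing with the old root.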

Output: outputs/eval_direct_rdma_v*/ on dash0.

4. System-Level Analysis

4.1 Elastic P2P Does Not Improve Single-Machine Performance

Under fair comparison (same machine, both fresh):

| Metric | Baseline | Elastic | Delta |
| --- | --- | --- | --- |
| TTFT p50 | 1.075s | 1.018s | -5.3% |
| TTFT p90 | 9.384s | 11.312s | +20.5% |
| TPOT p90 | 0.076s | 0.085s | +11.8% |
| E2E p50 | 5.075s | 6.977s | +37.5% |

Elastic is worse on all metrics except TTFT p50. Root causes:

1. Mooncake kv_both memory overhead

Each instance with kv_role=kv_both maintains RDMA buffers + Mooncake bootstrap server, reducing GPU memory available for KV cache. This affects ALL requests (including WARM/MEDIUM that don't use P2P transfer), causing more cache eviction and higher TPOT.

Evidence: MEDIUM TPOT p90 = 0.197s (elastic) vs 0.079s (baseline) — 2.5× worse despite MEDIUM requests not using P2P at all.

2. D-side KV pull failures

3 HEAVY requests completed prefill on P instance successfully but D-side never produced first token. The KV cache on D was too full to allocate space for the transferred blocks. These became 600s timeouts.

3. P2P overhead without proportional benefit

The P2P path adds: prefill queue on P (p50=6.3s) + KV transfer + decode start on D (p50=0.8s). For requests where the D instance isn't under heavy prefill load (which is the case on fresh instances), co-located execution is faster.

4.2 When Elastic P2P Could Help

Elastic P2P is designed for the scenario where D-instance decode is disrupted by co-located heavy prefill. On fresh instances with 200 requests, this contention is moderate. The benefit may emerge under:

  • Higher sustained load (1000+ concurrent requests)
  • Longer experiment duration (KV cache fills up, eviction pressure increases)
  • Multi-machine deployment (P on a different node, no memory competition)

5. Data & Log Locations

5.1 Experiment Outputs (on respective machines)

| Directory | Machine | Config | Notes |
| --- | --- | --- | --- |
| outputs/ab_baseline/ | dash0 | Combined 8× TP=1 | Initial A/B (INVALIDATED: warm instances) |
| outputs/ab_elastic/ | dash0 | Elastic P2P cap=4 | Initial A/B (INVALIDATED) |
| outputs/baseline_stability_fresh/ | dash0 | Combined 8× fresh | Canonical baseline (§3.1) |
| outputs/elastic_stability_*/ | dash0 | Elastic P2P kv_both fresh | Canonical elastic (§3.1) |
| outputs/ab_linear/ | dash0 | Linear policy, 200 req | §3.4 routing policy comparison |
| outputs/ab_lmetric/ | dash0 | LMetric policy, 200 req | §3.4 routing policy comparison |
| outputs/gpu_ab_combined/ | local | Combined 8× TP=1 | Earlier run, has gpu_util.csv |
| outputs/gpu_ab_pdsep/ | local | PD-Sep 4P+4D | Earlier run, has gpu_util.csv |
| outputs/exp2_combined_tp1_dp8/ | local | Combined 8× TP=1 | 1000 req, cache-aware |
| outputs/exp3_pd_sep_tp1_mooncake/ | local | PD-Sep 4P+4D Mooncake | 1000 req |

5.2 vLLM Instance Logs

| Path pattern | Machine | Config |
| --- | --- | --- |
| /tmp/ab_base_$i.log | dash0 | Baseline instances 0-7 |
| /tmp/ab_elastic_$i.log | dash1 | Elastic instances 0-7 |
| /tmp/lmetric_ab_inst_$i.log | dash0 | Linear policy instances 0-7 (§3.4) |
| /tmp/lmetric_inst_$i.log | dash0 | LMetric policy instances 0-7 (§3.4) |

Logs contain Prefix cache hit rate and External prefix cache hit rate lines for APC extraction.
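
A small extraction helper might look like the sketch below. The sample log lines are an assumed format built around the two metric names quoted above; the real vLLM log layout may differ, so the regular expression should be checked against an actual /tmp/*.log file before use.

```python
import re

# Matches either metric name followed by a percentage, e.g. "...: 67.5%".
HIT_RATE = re.compile(
    r"(External prefix cache hit rate|Prefix cache hit rate):\s*([0-9.]+)%"
)


def last_hit_rates(log_text):
    """Return the last reported value of each metric, in percent."""
    rates = {}
    for name, value in HIT_RATE.findall(log_text):
        key = "external" if name.startswith("External") else "local"
        rates[key] = float(value)  # later lines overwrite earlier ones
    return rates


# Assumed log shape, for illustration only.
sample = (
    "INFO ... Prefix cache hit rate: 12.5%, "
    "External prefix cache hit rate: 0.0%\n"
    "INFO ... Prefix cache hit rate: 67.5%, "
    "External prefix cache hit rate: 80.0%\n"
)
rates = last_hit_rates(sample)
```

Taking the last reported value is a design choice for end-of-run APC; averaging over the run would need every match rather than only the final one.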

5.3 Trace Data

| Path | Machine | Description |
| --- | --- | --- |
| ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl | dash0 | Full 2h production trace (2.1M requests) |
| traces/sampled_1000req_seed42.jsonl | all | Sampled 1000 requests (gitignored, regenerate with sample_trace.py) |

5.4 Analysis Documents

| File | Content |
| --- | --- |
| analysis/pd_separation_analysis.md | Main report: PD-Sep vs Combined + Elastic P2P (§5) |
| analysis/elastic_offload_design.md | Elastic P2P design rationale |
| analysis/kv_lifecycle_design.md | KV cache eviction policy analysis |
| analysis/adaptive_prefill_offload_design.md | Initial adaptive offload design (superseded by elastic) |

6. Repository Structure

agentic-kv/
├── analysis/                    # Research reports and design docs
│   ├── pd_separation_analysis.md    # Main comprehensive report
│   ├── elastic_offload_design.md    # Elastic P2P design
│   ├── kv_lifecycle_design.md       # Cache eviction analysis
│   └── ...
├── replayer/                    # Trace replay framework
│   ├── __main__.py              # CLI entry: python -m replayer
│   ├── replay.py                # Async replayer (session-aware, SSE streaming)
│   ├── trace.py                 # TraceRequest dataclass, session/hash_id handling
│   └── metrics.py               # RequestMetrics, crash-safe JSONL sink
├── scripts/
│   ├── cache_aware_proxy.py     # Global scheduler (combined + PD-sep + elastic offload)
│   ├── sample_trace.py          # Cluster-to-machine trace sampler
│   ├── launch_vllm.sh           # Launch combined TP=8
│   ├── launch_pd_mooncake.sh    # Launch PD-Sep with Mooncake
│   ├── launch_elastic_p2p.sh    # Launch elastic P2P (8× kv_both + offload proxy)
│   ├── run_experiments.sh       # Full experiment matrix (combined/PD-sep)
│   ├── run_benchmark.sh         # Single benchmark run
│   ├── gpu_monitor.sh           # GPU utilization sampler (5s CSV)
│   ├── compute_roofline.py      # Prefill/decode roofline analysis
│   ├── analyze_*.py             # Various analysis scripts
│   └── compare_*.py             # Experiment comparison scripts
├── patches/
│   ├── 0001-fix-kv-transfer-abort-race.patch
│   └── README.md
├── third_party/vllm/            # vLLM 0.18.1 source (with patch applied)
├── outputs/                     # Experiment results (gitignored)
├── traces/                      # Sampled traces (gitignored)
├── TODO.md                      # Original research goals
└── REPORT.md                    # This milestone report

7. Key Scripts Reference

| Script | What it does | Key flags |
| --- | --- | --- |
| scripts/cache_aware_proxy.py | Global scheduler + elastic offload proxy | --combined, --offload, --policy {linear,lmetric}, --heavy-threshold, --bootstrap-ports |
| scripts/run_lmetric_ab.sh | A/B: linear vs lmetric routing policy | Runs both experiments with fresh restart |
| scripts/run_elastic_stability_test.sh | Elastic vs baseline with full isolation | Fresh start/stop per experiment |
| scripts/bench.sh | Standard single-experiment harness | --tag, --mode {baseline,elastic} |
| scripts/sample_trace.py | Sample complete sessions from cluster trace | --target-requests, --seed |
| python -m replayer | Replay trace against vLLM endpoint | --time-scale, --max-inflight-sessions, --request-limit |
| scripts/gpu_monitor.sh | Sample nvidia-smi to CSV | Pipe to outputs/<tag>/gpu_util.csv |
| scripts/launch_elastic_p2p.sh | Launch all 8 kv_both instances + offload proxy | HEAVY_THRESHOLD, MAX_OFFLOAD env vars |

8. GPU Load Imbalance & Elastic Prefill Service Analysis

8.1 Load Imbalance Characterization

Session-sticky routing creates token load imbalance across instances. The severity depends on scale:

| Scale | Imbalance | Top 5 sessions | Cause |
| --- | --- | --- | --- |
| 200 req (143 sessions) | 8.6× tokens | 49% of all tokens | Small sample, few dominant sessions |
| 1000 req (668 sessions) | 1.24× tokens | 29% of all tokens | More sessions → natural averaging |

At 1000 requests, the heaviest instance has 4.5M tokens vs lightest 3.6M (1.24×). Despite this, TPOT is uniform across all instances (0.070–0.077), confirming that prefill-decode interference is minimal at ≤1 session/GPU. The imbalance manifests in TTFT only: heaviest 2 instances TTFT p50 = 1.42s vs lightest 2 at 0.83s (1.7× gap).
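
The imbalance figure is the ratio of the heaviest to the lightest instance. A minimal check using only the rounded numbers quoted above (per-instance totals for the other six instances are not listed here):

```python
def imbalance(values):
    """Ratio of the largest to the smallest per-instance value."""
    return max(values) / min(values)


# Rounded token totals for the heaviest and lightest instance.
token_imbalance = imbalance([4.5e6, 3.6e6])

# TTFT p50 of the heaviest two vs lightest two instances.
ttft_gap = imbalance([1.42, 0.83])
```

The rounded totals give 1.25×; the reported 1.24× presumably comes from the unrounded token counts. The TTFT gap is larger than the token gap, which is consistent with queueing amplifying a modest load difference.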

8.2 Session Accumulation Pattern

Agentic workloads produce long-lived sessions with growing context:

| Session | Turns | Total Tokens | Context Growth |
| --- | --- | --- | --- |
| 1569319 | 36 | 2.32M | 27k → 98k (+2.0k/turn) |
| 1206593 | 36 | 2.31M | 15k → 106k (+2.6k/turn) |
| 178176 | 25 | 1.93M | 36k → 95k (+2.5k/turn) |

Top 5 sessions = 29% of all tokens. With session-sticky, these lock their instances, creating persistent load hotspots.

8.3 Benchmark Concurrency vs Production Reality

Critical caveat: All prior experiments used --max-inflight-sessions 8 (1 session/GPU). This is 10–15× below production concurrency and masks the interference that elastic PS is designed to solve.

| | Our Benchmark | Production Estimate |
| --- | --- | --- |
| Concurrent requests/GPU | 1–2 | 8–15 |
| KV cache usage/GPU | 26–28% (single req) | 80–100% |
| Prefill-decode interference | Minimal | Significant |

KV cache capacity: 281,888 tokens/GPU (25.8 GiB). A single 75k-token request consumes 27% of KV cache. At production concurrency (~15 req/GPU), KV cache is near-full, triggering eviction, cache misses, and prefill queuing — none of which appear in our 1-req/GPU benchmark.
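
The capacity arithmetic can be reproduced directly from the two figures above. The 75k-token request and the 15 req/GPU concurrency come from this section; the derived quantities are simple division.

```python
KV_CAPACITY_TOKENS = 281_888   # tokens per GPU
KV_CAPACITY_GIB = 25.8         # GiB per GPU


def kv_share(request_tokens):
    """Fraction of one GPU's KV cache used by a single request."""
    return request_tokens / KV_CAPACITY_TOKENS


share_75k = kv_share(75_000)                      # about 0.266
fit_75k = KV_CAPACITY_TOKENS // 75_000            # whole 75k requests that fit
budget_at_15 = KV_CAPACITY_TOKENS // 15           # tokens per request at 15/GPU
kib_per_token = KV_CAPACITY_GIB * 2**30 / KV_CAPACITY_TOKENS / 1024

print(f"{share_75k:.1%} per 75k request, {fit_75k} fit at once")
print(f"{budget_at_15} tokens per request at 15 req/GPU")
print(f"{kib_per_token:.0f} KiB of KV cache per token")
```

Only three 75k-token requests fit at once, and at 15 req/GPU the average budget is under 19k tokens per request, far below the 48,776-token average input of the sampled trace (§2.3). That gap is what forces eviction at production concurrency.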

Measured interference scaling:

| Concurrency | TPOT p90 | vs 8-session |
| --- | --- | --- |
| 8 sessions (1/GPU) | 0.075s | baseline |
| 16 sessions (2/GPU) | 0.103s | +38% |
| Production (~15/GPU) | not tested | expected >>+45% |

8.4 Elastic PS: Three Capabilities Re-Evaluated

Capability 1: Reduce prefill-decode interference (lower TPOT)

At 1 req/GPU (our benchmark): no interference, no benefit. But this is an artifact of unrealistically low concurrency. At ≥2 req/GPU, chunked prefill interrupts decode steps, raising TPOT by 38–45%. At production concurrency (~15/GPU), multiple HEAVY prefills sharing a GPU with decode requests would cause severe interference. Elastic PS's ability to isolate heavy prefills on separate GPUs directly addresses this.

Capability 2: Session migration for load balancing

Elastic PS enables mid-session migration: prefill on original instance (cache hit), KV transfer to a different instance for decode + future turns. This provides two benefits:

  1. Break session lock-in: Agentic sessions grow +2k tokens/turn over 30+ turns. With session-sticky, a 36-turn session (2.3M tokens total) locks its GPU, creating a hotspot. Elastic PS lets the session migrate to a less-loaded GPU while preserving cache on the original (PS does fast cached prefill, new GPU decodes).

  2. Rebalance without cache loss: Unlike breaking affinity (which destroys cache), elastic PS migration preserves the prefix cache on the original instance — the PS re-uses it for fast prefill, then transfers only the new KV to the destination.

Simulation of migration strategies (1000 req, at current low concurrency):

| Strategy | Imbalance | Migrations | KV Transfer Overhead |
| --- | --- | --- | --- |
| No migration | 1.24× | 0 | 0s |
| Every 10 turns | 1.21× | 10 | 15s |
| Every 5 turns | 1.20× | 20 | 30s |

At 1 req/GPU, migration benefit is marginal (imbalance is only 1.24×). At production concurrency where imbalance combines with KV cache pressure and interference, the benefit would be substantially larger.

Capability 3: Soft affinity from cache-hit scoring

The corrected LMetric experiment (§3.4) reveals that explicit session affinity is unnecessary. Cache-hit scoring (new_tokens = input - cache_hit) creates implicit soft affinity: instances with cached prefixes score lower, naturally attracting subsequent turns. This matches hard session-sticky on all metrics (< 2% difference) while providing more flexible load balancing.

8.5 Elastic PS Verdict

| Aspect | At 1 req/GPU (tested) | At production load (expected) |
| --- | --- | --- |
| TPOT reduction | 0% (no interference) | Significant (interference scales with concurrency) |
| Session migration | Marginal (1.24× → 1.20×) | Larger benefit (KV pressure + interference amplify imbalance) |
| Cache preservation | N/A | Key advantage vs breaking affinity |

At our benchmark concurrency (1 req/GPU), elastic PS is not justified: Mooncake overhead exceeds the non-existent interference benefit. But our benchmark is 10–15× below production load. The real question is whether elastic PS helps at production-realistic concurrency (64–128 concurrent sessions, 8–15 req/GPU), where:

  • Prefill-decode interference is measurable (already +38% TPOT at just 2/GPU)
  • KV cache pressure causes eviction and queue delays
  • Session accumulation creates compounding hotspots
  • Heavy prefills (50–100k tokens) block decode for seconds

Next step: restore --max-inflight-sessions (removed when replay moved to trace-driven dispatch; see §1.1), then run a 64-session benchmark to test elastic PS at production-realistic concurrency.

9. Conclusions & Next Steps

Established findings:

  1. Full PD separation is net negative for single-machine agentic workloads (KV cache memory wall)
  2. Cache-aware routing is the dominant optimization (+24pp APC, -60% TTFT vs round-robin)
  3. Explicit session affinity is unnecessary — cache-hit scoring creates implicit soft affinity that matches hard session-sticky (< 2% difference)
  4. At low concurrency (1 req/GPU), elastic P2P offload adds overhead without benefit
  5. Our benchmark concurrency is 10–15× below production: --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode interference that appears at ≥2 req/GPU (+38% TPOT) and would dominate at production load (~15 req/GPU)
  6. Experimental methodology matters: warm vs fresh instances cause 2× TTFT difference

Lessons learned:

  • Prior cross-machine A/B (commit 1e86285) was invalid — warm baseline inflated by 2×
  • Prior LMetric implementation was invalid — incorrectly shared session-sticky logic with Linear
  • kv_role=kv_both has non-trivial always-on overhead even when P2P transfer is not used
  • Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility
  • Benchmark concurrency must match production — sub-production concurrency hides interference effects that drive system design decisions

Open problems (priority ordered):

  1. Production-concurrency benchmark (--max-inflight-sessions 64+, once the flag is restored per §1.1): Validate whether prefill-decode interference at 8–15 req/GPU makes elastic PS net-positive
  2. Multi-machine elastic: P on a different node eliminates GPU memory competition — the main cost that makes single-machine elastic net negative
  3. Layerwise KV transfer: Mooncake's block-level transfer after full prefill is the bottleneck. Layerwise pipelining could reduce transfer latency by overlapping with computation
  4. Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)

Updated 2026-05-23. LMetric corrected (§3.4 errata). GPU imbalance analysis added (§8). Benchmark concurrency gap identified — production-load experiments needed.