


Milestone Report: Elastic P2P vs PD-Combined Baseline

Date: 2026-05-22
Author: Gahow Wang
Status: Phase 1 complete (baseline + elastic validated, system-level analysis done)

1. Research Question

For agentic LLM workloads (long input, short output, high KV cache reuse), is prefill-decode disaggregation beneficial? If full PD separation hurts (proven in §3), can selective disaggregation of only heavy requests improve serving latency while preserving KV cache locality?

1.1 Errata / Superseded sections

This report has been revised several times as the methodology matured. The sections below are kept for historical context but their numerical conclusions have been superseded — do not cite them in isolation.

§3.1 (initial PD-sep vs PD-combined): ran with the old random sampler + --time-scale compression + --max-inflight-sessions 8. Cross-session KV reuse dropped from 52% → 16%, and per-GPU concurrency was capped at 1 req/GPU. Superseded by §3.6.

Earlier "elastic v3" warm-vs-fresh runs: baselines were not restarted between trials, leaving residual KV cache that inflated baseline TTFT ~2×. Superseded by the cold-start results in §3.6/§3.7.

Any reference to running --max-inflight-sessions 64+: that flag was removed when replay moved to trace-driven dispatch. The next-step experiment requires restoring the flag first (see FIXES.md §B2 route A) before any production-concurrency numbers can be produced.

The authoritative results are in §3.6 and §3.7.

2. Experimental Setup

2.1 Hardware

Resource	Spec
Machine	dash0 / dash1 (identical config)
GPU	8× NVIDIA H20 96GB HBM, NVLink
Network	4× ConnectX-7 200Gbps RDMA
Storage	cpfs shared storage across machines

2.2 Software

Component	Version	Notes
vLLM	0.18.1 (source in `third_party/vllm/`)	Patched scheduler assert (see `patches/`)
Mooncake	0.3.10	RDMA-based KV transfer between instances
Python	3.x managed by `uv`	`.venv/` at project root
Model	`Qwen3-Coder-30B-A3B-Instruct`	MoE 128 experts top-8, 3B active params
Model path	`~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct`	Same on dash0 and dash1

2.3 Workload Trace

Property	Value
Source	GLM-5.1 Agentic Coder, production cluster, 2h window
Raw trace	`~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` on dash0
Total requests	2,114,220
Avg input tokens	33,600 (p50=20k, p90=88k)
Avg output tokens	445 (p50=80)
I/O ratio	75.6× aggregate
Prefill token share	98%
KV reuse breakdown	62% intra-session, 38% cross-session (token-level)
Theoretical max APC	67% (infinite cache, single instance, prefix-only)

Sampled trace for benchmarks: traces/w600_r0.0015_st30.jsonl (1214 requests, 688 sessions, 70% multi-turn). Generated with window+thin sampling:

python scripts/sample_trace.py \
    --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/w600_r0.0015_st30.jsonl \
    --sample-ratio 0.0015 --max-single-turn-ratio 0.3 \
    --window-seconds 600 --seed 42

Trace property	Value
Sessions	688 (70% multi-turn, avg 4.9 turns)
Requests	1214 (use `--request-limit 850` for daily, full for validation)
Avg input tokens	48,776
Trace span	2912s (48.5 min); dense segment 0-990s (850 req)
Peak QPS	1.6 req/s (in dense segment)
Hash block sharing	48.3% (vs 52% full trace)
Theoretical APC	80% (full), 76% (first 850 req)

Sampling methodology (2026-05-23): Prior traces used random session sampling + --time-scale compression + --max-inflight-sessions semaphore, which (a) destroyed cross-session hash block sharing (52% → 16%), (b) artificially limited concurrency to 1 req/GPU, and (c) masked prefill-decode interference. The new approach uses contiguous time-window sampling with session thinning (--max-single-turn-ratio 0.3) to preserve KV reuse patterns, and trace-driven replay with no artificial concurrency limits.
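The window+thin idea described above can be sketched as follows. This is a minimal illustration, not the actual `sample_trace.py` logic: the record fields `timestamp` and `session_id`, the function name, and the interpretation of `--sample-ratio` as a per-session keep probability are all assumptions.

```python
import random
from collections import defaultdict

def window_thin_sample(records, window_start, window_seconds,
                       sample_ratio, max_single_turn_ratio, seed=42):
    """Keep whole sessions from one contiguous time window, then thin them."""
    rng = random.Random(seed)
    window_end = window_start + window_seconds

    # 1. Contiguous window: group in-window requests by session.
    sessions = defaultdict(list)
    for rec in records:
        if window_start <= rec["timestamp"] < window_end:
            sessions[rec["session_id"]].append(rec)

    # 2. Thin at session granularity so multi-turn prefixes stay intact.
    kept = [sessions[sid] for sid in sorted(sessions)
            if rng.random() < sample_ratio]

    # 3. Cap single-turn sessions at max_single_turn_ratio of kept sessions.
    multi = [s for s in kept if len(s) > 1]
    single = [s for s in kept if len(s) == 1]
    if max_single_turn_ratio < 1.0:
        r = max_single_turn_ratio
        cap = int(len(multi) * r / (1.0 - r) + 1e-9)  # epsilon guards float error
        single = single[:cap]

    out = [rec for sess in multi + single for rec in sess]
    out.sort(key=lambda rec: rec["timestamp"])
    return out
```

Sampling whole sessions inside one window, rather than random sessions across the full trace, is what keeps requests that share prefix blocks close together in time.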

2.4 Two Configurations Compared

Baseline: PD-Combined (8× TP=1 DP=8)

8 independent vLLM instances, 1 GPU each, no Mooncake.
All instances do both prefill and decode.
Global scheduler (cache_aware_proxy.py --combined) handles:
  - Session-sticky routing (multi-turn → same instance)
  - Load-aware override (if pinned instance > 2× avg load, redirect)
  - Cache-hit scoring (prefer instance with matching prefix blocks)
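The three routing rules above can be read as one decision function. The sketch below is an assumption-laden illustration, not the code in `cache_aware_proxy.py`: the function name, the `alpha` default, and the use of ongoing tokens as the load measure are hypothetical, and the score follows the Linear form given in §3.4 (ongoing_tokens - α·cache_hit).

```python
def pick_instance(loads, cache_hits, pinned=None, alpha=1.0, overload_factor=2.0):
    """Choose an instance index for one request.

    loads[i]      -- ongoing tokens on instance i (assumed load measure)
    cache_hits[i] -- prompt tokens already cached on instance i
    pinned        -- session-sticky instance for multi-turn sessions, or None
    """
    avg_load = sum(loads) / len(loads)

    # Session-sticky routing with load-aware override:
    # stay on the pinned instance unless it exceeds 2x the average load.
    if pinned is not None and loads[pinned] <= overload_factor * avg_load:
        return pinned

    # Cache-hit scoring: lower score wins, cached prefix tokens reduce the score.
    scores = [load - alpha * hit for load, hit in zip(loads, cache_hits)]
    return min(range(len(loads)), key=scores.__getitem__)
```

For example, `pick_instance([1000, 10, 10], [500, 0, 0], pinned=0)` redirects away from instance 0 because its load exceeds twice the average, even though it holds cached blocks.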

Launch:

# On dash0:
for i in $(seq 0 7); do
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
    vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --port $((8000+i)) --tensor-parallel-size 1 \
        --enable-prefix-caching --enforce-eager \
        --gpu-memory-utilization 0.9 --max-model-len 200000 \
        > /tmp/ab_base_$i.log 2>&1 &
done

python scripts/cache_aware_proxy.py \
    --combined http://127.0.0.1:800{0..7} --port 9090

Elastic P2P Offload (8× TP=1 kv_both + selective offload)

8 independent vLLM instances, 1 GPU each, all kv_role=kv_both (Mooncake).
Same global scheduler, plus elastic offload logic:
  - Proxy classifies each request: WARM (<5k new), MEDIUM (5-20k), HEAVY (>20k)
  - WARM/MEDIUM: co-located on session-sticky instance (no KV transfer)
  - HEAVY: prefill on a different instance (P), KV via Mooncake RDMA,
    decode on session-sticky instance (D)
  - Cap: max 4 concurrent offloads (MAX_OFFLOAD_INFLIGHT)
  - P instance selection: round-robin with overload skip
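A minimal sketch of the classification and offload decision listed above. The thresholds and `MAX_OFFLOAD_INFLIGHT` come from the text; the class name, method names, the treatment of exactly 5k and 20k tokens as MEDIUM, and the way overloaded instances are passed in are assumptions for illustration.

```python
WARM_MAX = 5_000            # WARM: fewer than 5k new tokens
HEAVY_THRESHOLD = 20_000    # HEAVY: more than 20k new tokens (--heavy-threshold)
MAX_OFFLOAD_INFLIGHT = 4    # cap on concurrent offloads

def classify(new_tokens):
    if new_tokens < WARM_MAX:
        return "WARM"
    if new_tokens <= HEAVY_THRESHOLD:
        return "MEDIUM"
    return "HEAVY"

class OffloadPlanner:
    def __init__(self, num_instances):
        self.n = num_instances
        self.rr = 0          # round-robin cursor for P selection
        self.inflight = 0    # caller decrements when an offload finishes

    def plan(self, new_tokens, sticky, overloaded=frozenset()):
        """Return (prefill_instance, decode_instance)."""
        if classify(new_tokens) != "HEAVY" or self.inflight >= MAX_OFFLOAD_INFLIGHT:
            return sticky, sticky                 # co-located, no KV transfer
        for _ in range(self.n):
            cand = self.rr
            self.rr = (self.rr + 1) % self.n
            if cand != sticky and cand not in overloaded:
                self.inflight += 1
                return cand, sticky               # prefill on P, decode on D
        return sticky, sticky                     # no eligible P instance
```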

Launch:

# On dash1 (or use scripts/launch_elastic_p2p.sh):
for i in $(seq 0 7); do
    VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998+i)) \
    MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
    vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --port $((8000+i)) --tensor-parallel-size 1 \
        --enable-prefix-caching --enforce-eager \
        --gpu-memory-utilization 0.9 --max-model-len 200000 \
        --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
        > /tmp/ab_elastic_$i.log 2>&1 &
    sleep 2  # stagger to avoid NCCL port collision
done

# Wait for bootstrap servers
for bp in $(seq 8998 9005); do
    until curl -s localhost:$bp/query > /dev/null 2>&1; do sleep 2; done
done

python scripts/cache_aware_proxy.py \
    --combined http://127.0.0.1:800{0..7} \
    --bootstrap-ports 8998,8999,9000,9001,9002,9003,9004,9005 \
    --offload --heavy-threshold 20000 --port 9090

2.5 Benchmark Parameters

Parameter	Value
Trace	`traces/w600_r0.0015_st30.jsonl` (window+thin, 70% multi-turn)
Daily iteration	`--request-limit 850` (~13 min, APC≈76%)
Full validation	All 1214 requests (~48 min, APC≈80%)
Replay mode	Trace-driven (no session limit, no time compression)
Request timeout	600s
vLLM flags	`--enforce-eager --enable-prefix-caching --max-model-len 200000`
GPU memory util	0.9
Fresh restart	Both configs started from cold (no warm cache)

2.6 Reproducing the Benchmark

# Activate environment
cd ~/agentic-kv && source .venv/bin/activate

# Ensure sampled trace exists (same command as §2.3)
python scripts/sample_trace.py \
    --input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
    --output traces/w600_r0.0015_st30.jsonl \
    --sample-ratio 0.0015 --max-single-turn-ratio 0.3 \
    --window-seconds 600 --seed 42

# Run benchmark (daily iteration)
bash scripts/bench.sh --tag my_experiment --mode baseline --policy linear \
    --trace traces/w600_r0.0015_st30.jsonl --requests 850

# Run benchmark (full validation)
bash scripts/bench.sh --tag my_experiment_full --mode baseline --policy linear \
    --trace traces/w600_r0.0015_st30.jsonl

3. Results

Errata (2026-05-22): The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported -44% E2E improvement. Post-hoc analysis revealed the dash0 baseline instances were not freshly restarted — residual KV cache from prior experiments caused 2× TTFT inflation.

Errata (2026-05-23): §3.1 results used artificial concurrency limits (--max-inflight-sessions 8, 1 req/GPU) and random session sampling that destroyed cross-session KV sharing (52% → 16%). See §3.6 for production-realistic results with corrected methodology.

3.1 Legacy Comparison (artificial 1 req/GPU, 200 req)

Config	OK/N	TTFT p50	TTFT p90	TPOT p90	E2E p50
Baseline (no Mooncake)	198/200	1.075s	9.384s	0.076s	5.075s
LMetric routing	198/200	1.099s	9.392s	0.073s	5.205s
Elastic P2P (kv_both)	195/200	1.018s	11.312s	0.085s	6.977s

3.2 Per-Class Breakdown

Baseline (fresh):

Class	Count	%	TTFT p50	TTFT p90	TPOT p90
WARM (<5k)	46	23%	0.137s	0.262s	0.061s
MEDIUM (5-20k)	50	25%	0.921s	1.846s	0.079s
HEAVY (20-50k)	64	32%	2.660s	6.278s	0.076s
HEAVY (>50k)	38	19%	9.587s	30.415s	0.102s

Elastic P2P (fresh):

Class	Count	%	TTFT p50	TTFT p90	TPOT p90
WARM (<5k)	46	23%	0.142s	0.279s	0.072s
MEDIUM (5-20k)	50	25%	0.766s	1.814s	0.197s
HEAVY (>20k)	99	51%	6.390s	22.668s	0.085s

3.3 Success Rate

Config	OK	Total	Rate	Failure mode
Baseline	198	200	99.0%	RemoteProtocolError (replayer-side)
Elastic P2P	195	200	97.5%	2× RemoteProtocolError + 3× ReadTimeout on >60k

Elastic's 3 extra errors are D-side KV pull failures: prefill succeeded on P, KV pushed to Mooncake, but D never produced first token (decode scheduler couldn't allocate KV cache space). Prefill timeout fallback (120s → co-located) was never triggered.

3.4 Routing Policy: Linear vs LMetric (OSDI'26)

LMetric (score = P_tokens × BS, pure per-request, no session affinity) vs Linear (score = ongoing_tokens - α·cache_hit, session-sticky). Both fresh-restart, same trace.

Errata (2026-05-23): Prior LMetric implementation incorrectly shared session-sticky logic with Linear. Fixed to pure per-request routing per OSDI'26 spec: score = (pending_prefill + new_tokens) × num_requests, no affinity, no overload override. Results below use corrected implementation.

Policy	TTFT p50	TTFT p90	TPOT p90	E2E p50	Delta E2E
Linear (session-sticky)	1.073s	9.347s	0.073s	5.119s	—
LMetric (no affinity)	1.081s	9.408s	0.072s	5.102s	-0.3%

Key finding: LMetric without explicit session affinity matches Linear with session affinity on all metrics (< 2% difference). The cache-hit term in LMetric's scoring (new_tokens = input - cache_hit) creates implicit soft affinity — instances that already cached a session's prefix get lower P_tokens, naturally attracting subsequent turns. Explicit session-sticky routing is not required; cache-aware load balancing captures it automatically.

APC distribution (LMetric, no affinity): inst_0=60.6%, inst_1=58.3%, inst_2=43.2%, inst_3=28.9%, inst_4=16.6%, inst_5=24.0%, inst_6=13.9%, inst_7=0.0%. Non-uniform but comparable aggregate to Linear's explicit affinity.
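The implicit soft affinity can be illustrated with the corrected LMetric score from the errata above. This is a sketch under assumptions: the dictionary keys and function names are hypothetical, and the example numbers are made up to show the effect, not taken from the experiments.

```python
def lmetric_score(pending_prefill, input_tokens, cache_hit, num_requests):
    """score = (pending_prefill + new_tokens) * num_requests; lower is better."""
    new_tokens = input_tokens - cache_hit
    return (pending_prefill + new_tokens) * num_requests

def route_lmetric(instances, input_tokens):
    """Pick the lowest-scoring instance; no session affinity, no overload override."""
    scores = [
        lmetric_score(inst["pending_prefill"], input_tokens,
                      inst["cache_hit"], inst["num_requests"])
        for inst in instances
    ]
    return scores.index(min(scores)), scores

# Two equally loaded instances; only instance 0 has the session's prefix cached.
instances = [
    {"pending_prefill": 10_000, "cache_hit": 40_000, "num_requests": 2},
    {"pending_prefill": 10_000, "cache_hit": 0, "num_requests": 2},
]
choice, scores = route_lmetric(instances, input_tokens=50_000)
```

With equal load, the instance holding the cached prefix gets the lower score, so later turns of a session tend to return to it without any explicit pinning.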

3.5 Errata: Why Prior Cross-Machine A/B Was Invalid

The initial comparison (commit 1e86285) reported:

Baseline (dash0): TTFT50=2.383  E2E50=10.232  ← WRONG (warm instances)
Elastic  (dash1): TTFT50=1.315  E2E50=5.708
Delta:                   -45%          -44%    ← INVALID

Evidence that prior baseline was not fresh:

inst_7 APC = 68.3% — impossible from 25 cold-start requests (max ~25%)
WARM TTFT p90 = 3.327s (fresh = 0.262s, 12.7× gap) — indicates KV cache memory pressure from prior experiments
HEAVY TPOT p90 = 0.154s (fresh = 0.076s, 2.0× gap) — heavy prefill-decode interference from full KV cache

The elastic numbers on dash1 were genuinely fresh. The "improvement" was actually comparing fresh elastic against degraded baseline.

3.6 Production-Realistic Baseline (trace-driven, corrected methodology)

Corrected sampling (window+thin, 70% multi-turn, block sharing 48%) and trace-driven replay (no session limit, no time compression). See §2.3 for trace details.

Linear policy, 912 requests (dense segment), peak QPS ≈ 1.6:

Metric	Legacy (§3.1, 1 req/GPU)	New (trace-driven)	Delta
TTFT mean	1.07s	4.54s	4.2×
TTFT p50	1.08s	0.94s	-13%
TTFT p90	9.38s	14.12s	+51%
TPOT p50	0.038s	0.070s	+84%
TPOT p90	0.073s	0.175s	+139%
APC (mean)	~44%	67.5%	+23pp
Errors	2/200 (1.0%)	0/912 (0%)	better
E2E p50	5.08s	6.98s	+37%

Key differences from legacy methodology:

APC 67.5% vs 44%: Window+thin sampling preserves cross-session block sharing (48% vs 16% in legacy random sampling), yielding production-realistic cache hit rates. Per-instance APC ranges 46–84%.
TPOT +139% at p90: With trace-driven replay, multiple concurrent requests per GPU create real prefill-decode interference. The legacy 1 req/GPU setup showed TPOT p90=0.073s (no interference), but production-realistic load shows TPOT p90=0.175s. This validates that prefill-decode interference is a real problem at production concurrency.
TTFT p50 improved (-13%) but mean degraded (4.2×): Higher APC means cached requests get very fast TTFT (p50=0.94s). But concurrent heavy prefills cause queuing for non-cached requests, inflating the mean and p90.
Per-instance APC imbalance (46–84%): Routing quality directly determines per-instance APC. The 38pp gap between worst and best instance suggests routing optimization is still the highest-leverage improvement.

Output: outputs/baseline_r0015_st30/ on dash0.

3.7 Elastic PS vs Baseline (production-realistic trace)

850 requests, w600_r0.0015_st30.jsonl, peak QPS≈1.6. Baseline on dash0, elastic on dash1.

Metric	Baseline	Elastic PS	Delta
TTFT mean	4.35s	4.01s	-7.8%
TTFT p50	0.94s	0.93s	-1%
TPOT p50	0.070s	0.071s	+1.4%
TPOT p90	0.162s	0.157s	-3.1%
E2E p50	6.38s	6.44s	+0.9%
APC mean	60.7%	59.9%	-0.8pp
Errors	0/850	4/832	4 ReadTimeout

Elastic PS is near-neutral. Root cause analysis:

Problem 1: Offload gate too restrictive — only 17/118 HEAVY requests (14%) were offloaded. 75% of HEAVY requests had cache_ratio=0% (cold Turn 1), failing the cache_ratio >= 0.3 gate. The gate was designed to avoid offloading cold requests (full prefill on P is slower than co-located), but this means 86% of HEAVY prefills still interfere with decode.

Problem 2: Offloaded requests are slower (+50.6%) — HEAVY_OFFLOAD TTFT=19.94s vs HEAVY_COLO=13.25s. Breakdown:

Prefill on P: 14.72s (P also queued, no faster than co-located)
KV transfer + decode start on D: 5.71s (pure overhead)

Interference is real but unaddressed: 89% of WARM/MEDIUM requests ran concurrently with 1+ HEAVY prefills (up to 60 concurrent). Elastic PS only offloaded 17/118 HEAVY requests — insufficient to reduce interference.

Conclusion: The offload gate (cache_ratio >= 0.3) is correct in principle (cold offload IS slower), but leaves the core problem unsolved. Reducing prefill-decode interference requires either:

Offloading ALL heavy prefills (accepting higher TTFT for offloaded requests in exchange for lower TPOT for all)
Chunked prefill scheduling that yields to decode (vLLM-side optimization)
Dedicated prefill GPUs (full PD separation) if KV memory wall can be solved

Output: outputs/eval_baseline_linear/ on dash0, outputs/eval_elastic_linear/ on dash1.

3.8 Direct KV Cache Migration (Bootstrap-Triggered PUSH)

Architecture: D (the instance that will decode the request) asks the bootstrap server of C (the instance that holds the cached KV blocks) to PUSH those blocks directly into D's GPU memory via Mooncake RDMA WRITE. C's vLLM scheduler is not involved (0 GPU compute on C). D then runs local prefill for the new tokens and decodes.

Implementation details (vLLM + Mooncake patches):

Hash table sync (scheduler → worker → bootstrap): Each step, scheduler computes delta of BlockPool.cached_block_hash_to_block and syncs to worker's bootstrap server via MooncakeConnectorMetadata.hash_table_updates.
Token-based block lookup: D sends POST /push_blocks with prompt token_ids + D's GPU addresses. C's bootstrap computes block hashes using sha256 + NONE_HASH (same hash function as scheduler), matches against synced hash table.
RDMA PUSH: C's bootstrap calls TransferEngine.batch_transfer_sync_write to push matched KV blocks from C's GPU into D's GPU. This uses the existing RDMA WRITE path (proven reliable), not RDMA READ (which fails on batch_register_memory'd GPU memory due to missing IBV_ACCESS_REMOTE_READ flags).
Cost model: offload when colocated_cost + interference > offload_cost, where interference = prefill_time × min(num_requests, 3) × 0.3. Offload triggers when C has 1+ concurrent request.
Requirements: PYTHONHASHSEED must be set (bench.sh sets PYTHONHASHSEED=42 for elastic mode) to ensure deterministic NONE_HASH across scheduler/worker code paths.
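The cost model in the list above can be written out directly. The formula and the constants 3 and 0.3 come from the text; the function names and the choice of seconds as the unit are assumptions, and how `colocated_cost` and `offload_cost` are estimated is not specified here.

```python
def interference_penalty(prefill_time_s, num_requests):
    """interference = prefill_time * min(num_requests, 3) * 0.3"""
    return prefill_time_s * min(num_requests, 3) * 0.3

def should_offload(colocated_cost_s, offload_cost_s, prefill_time_s, num_requests):
    """Offload when colocated_cost + interference > offload_cost."""
    penalty = interference_penalty(prefill_time_s, num_requests)
    return colocated_cost_s + penalty > offload_cost_s

# Illustrative numbers only: offload is slightly more expensive in isolation,
# so it triggers only once C has at least one concurrent request.
idle = should_offload(2.0, 2.5, prefill_time_s=2.0, num_requests=0)
busy = should_offload(2.0, 2.5, prefill_time_s=2.0, num_requests=1)
```

The `min(num_requests, 3)` term saturates the penalty, so a heavily loaded C does not push the estimate arbitrarily high.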

Minimal test verification (scripts/test_direct_read.py):

Metric	inst_0 (local cache)	inst_1 (RDMA push from inst_0)
Turn 2 TTFT	0.338s	0.367s
Blocks transferred	—	640/640 matched, push ret=0
External APC	0%	80%
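The matched-block count in the table comes from token-based lookup: both sides derive chained block hashes from the prompt tokens and compare them. The sketch below shows the chaining idea only. It is not byte-compatible with vLLM's hashing; the block size of 16, the payload encoding, and the function names are assumptions.

```python
import hashlib
import struct

BLOCK_SIZE = 16  # assumed tokens per KV block

def block_hashes(token_ids, none_hash, block_size=BLOCK_SIZE):
    """Chained hash for each full block; the first block chains from none_hash."""
    hashes, parent = [], none_hash
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + struct.pack(f"<{block_size}I", *block)
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

def match_prefix(token_ids, none_hash, cached_hashes):
    """Number of leading prompt blocks present in the cache holder's hash table."""
    matched = 0
    for h in block_hashes(token_ids, none_hash):
        if h not in cached_hashes:
            break
        matched += 1
    return matched

seed_a = hashlib.sha256(b"42").digest()      # stand-in for a shared NONE_HASH
seed_b = hashlib.sha256(b"other").digest()   # stand-in for a mismatched NONE_HASH
prompt = list(range(64))
table = set(block_hashes(prompt, seed_a))

same_seed = match_prefix(prompt + [999] * 16, seed_a, table)
diff_seed = match_prefix(prompt + [999] * 16, seed_b, table)
```

Because every hash chains from the initial value, a different starting hash on either side makes the very first block miss and the match count drops to zero, which is consistent with the deterministic NONE_HASH requirement described above.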

Key bugs fixed during development:

NameError: field not imported — missing dataclass import
Scheduler assertion crash (assert RequestStatus.is_finished) — partial remote prefill state mismatch
Hash mismatch 0/640 — sha256 vs sha256_cbor (default hash algo is sha256, not sha256_cbor)
Hash mismatch 0/640 — from X import NONE_HASH creates stale value binding after init_none_hash reassigns the global; fixed with import X; X.NONE_HASH
RDMA READ ret=-1 — batch_register_memory only sets IBV_ACCESS_REMOTE_WRITE; switched to bootstrap-triggered PUSH
Cost model 0% trigger — removed stale cache_gate_ratio check; added interference penalty
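The stale-binding bug in the list above is a general Python behaviour and can be reproduced without vLLM. The module name `kvhash` below is a hypothetical stand-in built at runtime; only the import semantics are the point.

```python
import sys
import types

# Build a tiny stand-in module with a global that is reassigned after import.
kvhash = types.ModuleType("kvhash")
exec(
    "NONE_HASH = None\n"
    "def init_none_hash(value):\n"
    "    global NONE_HASH\n"
    "    NONE_HASH = value\n",
    kvhash.__dict__,
)
sys.modules["kvhash"] = kvhash

from kvhash import NONE_HASH   # copies the current value (None) into this namespace
import kvhash as kv            # keeps a reference to the module object itself

kv.init_none_hash(b"seeded")   # rebinds the module-level global

stale = NONE_HASH              # still the value captured at import time
fresh = kv.NONE_HASH           # attribute lookup sees the reassigned value
```

`from X import NAME` binds the value that exists at import time, so a later reassignment inside the module is invisible to the importer, while `X.NAME` is resolved at each access.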

Output: outputs/eval_direct_rdma_v*/ on dash0.

4. System-Level Analysis

4.1 Elastic P2P Does Not Improve Single-Machine Performance

Under fair comparison (same machine, both fresh), using the legacy §3.1 setup (200 requests, 1 req/GPU):

Metric	Baseline	Elastic	Delta
TTFT p50	1.075s	1.018s	-5.3%
TTFT p90	9.384s	11.312s	+20.5%
TPOT p90	0.076s	0.085s	+11.8%
E2E p50	5.075s	6.977s	+37.5%

Elastic is worse on all metrics except TTFT p50. Root causes:

1. Mooncake kv_both memory overhead

Each instance with kv_role=kv_both maintains RDMA buffers + Mooncake bootstrap server, reducing GPU memory available for KV cache. This affects ALL requests (including WARM/MEDIUM that don't use P2P transfer), causing more cache eviction and higher TPOT.

Evidence: MEDIUM TPOT p90 = 0.197s (elastic) vs 0.079s (baseline) — 2.5× worse despite MEDIUM requests not using P2P at all.

2. D-side KV pull failures

3 HEAVY requests completed prefill on P instance successfully but D-side never produced first token. The KV cache on D was too full to allocate space for the transferred blocks. These became 600s timeouts.

3. P2P overhead without proportional benefit

The P2P path adds: prefill queue on P (p50=6.3s) + KV transfer + decode start on D (p50=0.8s). For requests where the D instance isn't under heavy prefill load (which is the case on fresh instances), co-located execution is faster.

4.2 When Elastic P2P Could Help

Elastic P2P is designed for the scenario where D-instance decode is disrupted by co-located heavy prefill. On fresh instances with 200 requests, this contention is moderate. The benefit may emerge under:

Higher sustained load (1000+ concurrent requests)
Longer experiment duration (KV cache fills up, eviction pressure increases)
Multi-machine deployment (P on a different node, no memory competition)

5. Data & Log Locations

5.1 Experiment Outputs (on respective machines)

Directory	Machine	Config	Notes
`outputs/ab_baseline/`	dash0	Combined 8× TP=1	~~Initial A/B~~ (INVALIDATED: warm instances)
`outputs/ab_elastic/`	dash0	Elastic P2P cap=4	~~Initial A/B~~ (INVALIDATED)
`outputs/baseline_stability_fresh/`	dash0	Combined 8× fresh	Canonical baseline (§3.1)
`outputs/elastic_stability_*/`	dash0	Elastic P2P kv_both fresh	Canonical elastic (§3.1)
`outputs/ab_linear/`	dash0	Linear policy, 200 req	§3.4 routing policy comparison
`outputs/ab_lmetric/`	dash0	LMetric policy, 200 req	§3.4 routing policy comparison
`outputs/gpu_ab_combined/`	local	Combined 8× TP=1	Earlier run, has gpu_util.csv
`outputs/gpu_ab_pdsep/`	local	PD-Sep 4P+4D	Earlier run, has gpu_util.csv
`outputs/exp2_combined_tp1_dp8/`	local	Combined 8× TP=1	1000 req, cache-aware
`outputs/exp3_pd_sep_tp1_mooncake/`	local	PD-Sep 4P+4D Mooncake	1000 req

5.2 vLLM Instance Logs

Path pattern	Machine	Config
`/tmp/ab_base_$i.log`	dash0	Baseline instances 0-7
`/tmp/ab_elastic_$i.log`	dash1	Elastic instances 0-7
`/tmp/lmetric_ab_inst_$i.log`	dash0	Linear policy instances 0-7 (§3.4)
`/tmp/lmetric_inst_$i.log`	dash0	LMetric policy instances 0-7 (§3.4)

Logs contain Prefix cache hit rate and External prefix cache hit rate lines for APC extraction.
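A minimal sketch for pulling the final APC values out of an instance log. The exact log line layout is an assumption (a label followed by a percentage, for example `Prefix cache hit rate: 67.5%`), and the function name is hypothetical; adjust the patterns if the real lines differ.

```python
import re

# The lookbehind keeps the local pattern from matching inside the external label.
LOCAL_RE = re.compile(r"(?<![Ee]xternal )[Pp]refix cache hit rate:\s*([\d.]+)%")
EXTERNAL_RE = re.compile(r"[Ee]xternal prefix cache hit rate:\s*([\d.]+)%")

def last_hit_rates(log_text):
    """Return (local_apc, external_apc) from the last matching lines, or None."""
    local = LOCAL_RE.findall(log_text)
    external = EXTERNAL_RE.findall(log_text)
    return (
        float(local[-1]) if local else None,
        float(external[-1]) if external else None,
    )

sample_log = (
    "INFO step=1 Prefix cache hit rate: 10.0%, External prefix cache hit rate: 0.0%\n"
    "INFO step=2 Prefix cache hit rate: 67.5%, External prefix cache hit rate: 80.0%\n"
)
rates = last_hit_rates(sample_log)
```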

5.3 Trace Data

Path	Machine	Description
`~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl`	dash0	Full 2h production trace (2.1M requests)
`traces/sampled_1000req_seed42.jsonl`	all	Sampled 1000 requests (gitignored, regenerate with `sample_trace.py`)

5.4 Analysis Documents

File	Content
`analysis/pd_separation_analysis.md`	Main report: PD-Sep vs Combined + Elastic P2P (§5)
`analysis/elastic_offload_design.md`	Elastic P2P design rationale
`analysis/kv_lifecycle_design.md`	KV cache eviction policy analysis
`analysis/adaptive_prefill_offload_design.md`	Initial adaptive offload design (superseded by elastic)

6. Repository Structure

agentic-kv/
├── analysis/                    # Research reports and design docs
│   ├── pd_separation_analysis.md    # Main comprehensive report
│   ├── elastic_offload_design.md    # Elastic P2P design
│   ├── kv_lifecycle_design.md       # Cache eviction analysis
│   └── ...
├── replayer/                    # Trace replay framework
│   ├── __main__.py              # CLI entry: python -m replayer
│   ├── replay.py                # Async replayer (session-aware, SSE streaming)
│   ├── trace.py                 # TraceRequest dataclass, session/hash_id handling
│   └── metrics.py               # RequestMetrics, crash-safe JSONL sink
├── scripts/
│   ├── cache_aware_proxy.py     # Global scheduler (combined + PD-sep + elastic offload)
│   ├── sample_trace.py          # Cluster-to-machine trace sampler
│   ├── launch_vllm.sh           # Launch combined TP=8
│   ├── launch_pd_mooncake.sh    # Launch PD-Sep with Mooncake
│   ├── launch_elastic_p2p.sh    # Launch elastic P2P (8× kv_both + offload proxy)
│   ├── run_experiments.sh       # Full experiment matrix (combined/PD-sep)
│   ├── run_benchmark.sh         # Single benchmark run
│   ├── gpu_monitor.sh           # GPU utilization sampler (5s CSV)
│   ├── compute_roofline.py      # Prefill/decode roofline analysis
│   ├── analyze_*.py             # Various analysis scripts
│   └── compare_*.py             # Experiment comparison scripts
├── patches/
│   ├── 0001-fix-kv-transfer-abort-race.patch
│   └── README.md
├── third_party/vllm/            # vLLM 0.18.1 source (with patch applied)
├── outputs/                     # Experiment results (gitignored)
├── traces/                      # Sampled traces (gitignored)
├── TODO.md                      # Original research goals
└── REPORT.md                    # This milestone report

7. Key Scripts Reference

Script	What it does	Key flags
`scripts/cache_aware_proxy.py`	Global scheduler + elastic offload proxy	`--combined`, `--offload`, `--policy {linear,lmetric}`, `--heavy-threshold`, `--bootstrap-ports`
`scripts/run_lmetric_ab.sh`	A/B: linear vs lmetric routing policy	Runs both experiments with fresh restart
`scripts/run_elastic_stability_test.sh`	Elastic vs baseline with full isolation	Fresh start/stop per experiment
`scripts/bench.sh`	Standard single-experiment harness	`--tag`, `--mode {baseline,elastic}`
`scripts/sample_trace.py`	Sample complete sessions from cluster trace	`--target-requests`, `--seed`
`python -m replayer`	Replay trace against vLLM endpoint	`--request-limit`; legacy replay mode: `--time-scale`, `--max-inflight-sessions` (the latter has since been removed, see §1.1)
`scripts/gpu_monitor.sh`	Sample nvidia-smi to CSV	Pipe to `outputs/<tag>/gpu_util.csv`
`scripts/launch_elastic_p2p.sh`	Launch all 8 kv_both instances + offload proxy	`HEAVY_THRESHOLD`, `MAX_OFFLOAD` env vars

8. GPU Load Imbalance & Elastic Prefill Service Analysis

8.1 Load Imbalance Characterization

Session-sticky routing creates token load imbalance across instances. The severity depends on scale:

Scale	Imbalance	Top 5 sessions	Cause
200 req (143 sessions)	8.6× tokens	49% of all tokens	Small sample, few dominant sessions
1000 req (668 sessions)	1.24× tokens	29% of all tokens	More sessions → natural averaging

At 1000 requests, the heaviest instance has 4.5M tokens vs lightest 3.6M (1.24×). Despite this, TPOT is uniform across all instances (0.070–0.077), confirming that prefill-decode interference is minimal at ≤1 session/GPU. The imbalance manifests in TTFT only: heaviest 2 instances TTFT p50 = 1.42s vs lightest 2 at 0.83s (1.7× gap).

8.2 Session Accumulation Pattern

Agentic workloads produce long-lived sessions with growing context:

Session	Turns	Total Tokens	Context Growth
1569319	36	2.32M	27k → 98k (+2.0k/turn)
1206593	36	2.31M	15k → 106k (+2.6k/turn)
178176	25	1.93M	36k → 95k (+2.5k/turn)

Top 5 sessions = 29% of all tokens. With session-sticky, these lock their instances, creating persistent load hotspots.

8.3 Benchmark Concurrency vs Production Reality

Critical caveat: The legacy experiments analyzed in this section used --max-inflight-sessions 8 (1 session/GPU); the trace-driven runs in §3.6 and §3.7 do not. This legacy setting is 10–15× below production concurrency and masks the interference that elastic PS is designed to solve.

	Our Benchmark	Production Estimate
Concurrent requests/GPU	1–2	8–15
KV cache usage/GPU	26–28% (single req)	80–100%
Prefill-decode interference	Minimal	Significant

KV cache capacity: 281,888 tokens/GPU (25.8 GiB). A single 75k-token request consumes 27% of KV cache. At production concurrency (~15 req/GPU), KV cache is near-full, triggering eviction, cache misses, and prefill queuing — none of which appear in our 1-req/GPU benchmark.
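The capacity figures above can be cross-checked with a short calculation. The model shape used here (48 layers, 4 KV heads, head dimension 128, 2-byte bf16 elements) is an assumption about Qwen3-Coder-30B-A3B-Instruct and is not stated in this report; it is chosen because it reproduces the 25.8 GiB figure.

```python
# Assumed model shape (not stated in the report).
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # bf16

def kv_bytes_per_token():
    """Bytes of KV cache per token: K and V for every layer and KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

capacity_tokens = 281_888                     # from the report
capacity_gib = capacity_tokens * kv_bytes_per_token() / 2**30
share_75k = 75_000 / capacity_tokens          # fraction used by one 75k-token request
requests_to_fill = capacity_tokens / 75_000   # such requests needed to fill the cache
```

Under these assumptions each token costs 96 KiB of KV cache, and fewer than four 75k-token requests fill a GPU's cache, which is why eviction pressure appears well below the estimated production concurrency.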

Measured interference scaling:

Concurrency	TPOT p90	vs 8-session
8 sessions (1/GPU)	0.075s	baseline
16 sessions (2/GPU)	0.103s	+38%
Production (~15/GPU)	not tested	expected >>+45%

8.4 Elastic PS: Three Capabilities Re-Evaluated

Capability 1: Reduce prefill-decode interference (lower TPOT)

At 1 req/GPU (our benchmark): no interference, no benefit. But this is an artifact of unrealistically low concurrency. At ≥2 req/GPU, chunked prefill interrupts decode steps, causing TPOT +38–45%. At production concurrency (~15/GPU), multiple HEAVY prefills sharing a GPU with decode requests would cause severe interference. Elastic PS's ability to isolate heavy prefills on separate GPUs directly addresses this.

Capability 2: Session migration for load balancing

Elastic PS enables mid-session migration: prefill on original instance (cache hit), KV transfer to a different instance for decode + future turns. This provides two benefits:

Break session lock-in: Agentic sessions grow +2k tokens/turn over 30+ turns. With session-sticky, a 36-turn session (2.3M tokens total) locks its GPU, creating a hotspot. Elastic PS lets the session migrate to a less-loaded GPU while preserving cache on the original (PS does fast cached prefill, new GPU decodes).
Rebalance without cache loss: Unlike breaking affinity (which destroys cache), elastic PS migration preserves the prefix cache on the original instance — the PS re-uses it for fast prefill, then transfers only the new KV to the destination.

Simulation of migration strategies (1000 req, at current low concurrency):

Strategy	Imbalance	Migrations	KV Transfer Overhead
No migration	1.24×	0	0s
Every 10 turns	1.21×	10	15s
Every 5 turns	1.20×	20	30s

At 1 req/GPU, migration benefit is marginal (imbalance is only 1.24×). At production concurrency where imbalance combines with KV cache pressure and interference, the benefit would be substantially larger.

Capability 3: Soft affinity from cache-hit scoring

The corrected LMetric experiment (§3.4) reveals that explicit session affinity is unnecessary. Cache-hit scoring (new_tokens = input − cached) creates implicit soft affinity — instances with cached prefixes score lower, naturally attracting subsequent turns. This matches hard session-sticky on all metrics (< 2% difference) while providing more flexible load balancing.

8.5 Elastic PS Verdict

Aspect	At 1 req/GPU (tested)	At production load (expected)
TPOT reduction	0% (no interference)	Significant (interference scales with concurrency)
Session migration	Marginal (1.24× → 1.20×)	Larger benefit (KV pressure + interference amplify imbalance)
Cache preservation	N/A	Key advantage vs breaking affinity

At our benchmark concurrency (1 req/GPU), elastic PS is not justified — Mooncake overhead exceeds the non-existent interference benefit. But our benchmark is 10–15× below production load. The real question is whether elastic PS helps at production-realistic concurrency (64–128 concurrent sessions, 8–15 req/GPU), where:

Prefill-decode interference is measurable (already +38% TPOT at just 2/GPU)
KV cache pressure causes eviction and queue delays
Session accumulation creates compounding hotspots
Heavy prefills (50–100k tokens) block decode for seconds

Next step: restore the removed --max-inflight-sessions flag (see §1.1), then run a --max-inflight-sessions 64 benchmark to test elastic PS at production-realistic concurrency.

9. Conclusions & Next Steps

Established findings:

Full PD separation is net negative for single-machine agentic workloads (KV cache memory wall)
Cache-aware routing is the dominant optimization (+24pp APC, -60% TTFT vs round-robin)
Explicit session affinity is unnecessary — cache-hit scoring creates implicit soft affinity that matches hard session-sticky (< 2% difference)
At low concurrency (1 req/GPU), elastic P2P offload adds overhead without benefit
Our benchmark concurrency is 10–15× below production: --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode interference that appears at ≥2 req/GPU (+38% TPOT) and would dominate at production load (~15 req/GPU)
Experimental methodology matters: warm vs fresh instances cause 2× TTFT difference

Lessons learned:

Prior cross-machine A/B (commit 1e86285) was invalid — warm baseline inflated by 2×
Prior LMetric implementation was invalid — incorrectly shared session-sticky logic with Linear
kv_role=kv_both has non-trivial always-on overhead even when P2P transfer is not used
Experiment isolation (kill all → verify GPU free → fresh start) is critical for reproducibility
Benchmark concurrency must match production — sub-production concurrency hides interference effects that drive system design decisions

Open problems (priority ordered):

Production-concurrency benchmark (--max-inflight-sessions 64+): Validate whether prefill-decode interference at 8–15 req/GPU makes elastic PS net-positive
Multi-machine elastic: P on a different node eliminates GPU memory competition — the main cost that makes single-machine elastic net negative
Layerwise KV transfer: Mooncake's block-level transfer after full prefill is the bottleneck. Layerwise pipelining could reduce transfer latency by overlapping with computation
Router state accuracy: proxy shadow state vs vLLM-internal exact state (TODO: vLLM → Redis → router)

Updated 2026-05-23. LMetric corrected (§3.4 errata). GPU imbalance analysis added (§8). Benchmark concurrency gap identified — production-load experiments needed.
