Files
agentic-kvc/REPORT.md
Gahow Wang 6a27f75337 Docs: reconcile routing docs with current hybrid direction
Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting
  the "Final Design" framing was retired after cc6e562 / 4c583f2;
  point readers to docs/migration-policy-design.md.

- docs/migration-policy-design.md: rewrite. Opens with the current
  hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
  tie-breaker), then a "What Was Retired" commit table, then the old
  Approach A numbers preserved as "Historical Baseline-Mode Comparison".

- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy
  lmetric has no affinity at all); it converges to similar placements
  because P_tokens includes new_uncached_tokens, giving it implicit
  soft affinity.

- analysis/elastic_hypotheses.md: same LMetric correction in the
  "DOESN'T work" summary, plus a footer cross-referencing the current
  routing direction.

- analysis/unified_routing_fix_review.md: track this file (was
  untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00

665 lines
37 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Milestone Report: Elastic P2P vs PD-Combined Baseline
**Date**: 2026-05-22
**Author**: Gahow Wang
**Status**: Phase 1 complete — baseline + elastic validated, system-level analysis done
---
## 1. Research Question
For agentic LLM workloads (long input, short output, high KV cache reuse), is prefill-decode disaggregation beneficial? If full PD separation hurts (proven in §3), can **selective** disaggregation of only heavy requests improve serving latency while preserving KV cache locality?
## 1.1 Errata / Superseded sections
> This report has been revised several times as the methodology matured.
> The sections below are kept for historical context but their numerical
> conclusions have been **superseded** — do not cite them in isolation.
>
> - **§3.1 (initial PD-sep vs PD-combined)**: ran with the old random
> sampler + `--time-scale` compression + `--max-inflight-sessions 8`.
> Cross-session KV reuse dropped from 52% → 16%, and per-GPU concurrency
> was capped at 1 req/GPU. Superseded by §3.6.
> - **Earlier "elastic v3" warm-vs-fresh runs**: baselines were not
> restarted between trials, leaving residual KV cache that inflated
> baseline TTFT ~2×. Superseded by the cold-start results in §3.6/§3.7.
> - **Any reference to running `--max-inflight-sessions 64+`**: that flag
> was removed when replay moved to trace-driven dispatch. The next-step
> experiment requires restoring the flag first (see `FIXES.md` §B2
> route A) before any production-concurrency numbers can be produced.
> - **§3.9 "Final Design" framing**: the single-argmin + PUSH-migration
> design was retired after `cc6e562` / `4c583f2` showed forced and
> relaxed-gate migration variants both regressed E2E tail. Current
> policy is the hybrid LMetric + high-cache affinity landed in
> `255c8e6`. See the per-section note in §3.9 and the active algorithm
> in `docs/migration-policy-design.md`.
>
> The authoritative results are in **§3.6 and §3.7**.
## 2. Experimental Setup
### 2.1 Hardware
| Resource | Spec |
|----------|------|
| Machine | dash0 / dash1 (identical config) |
| GPU | 8× NVIDIA H20 96GB HBM, NVLink |
| Network | 4× ConnectX-7 200Gbps RDMA |
| Storage | cpfs shared storage across machines |
### 2.2 Software
| Component | Version | Notes |
|-----------|---------|-------|
| vLLM | 0.18.1 (source in `third_party/vllm/`) | Patched scheduler assert (see `patches/`) |
| Mooncake | 0.3.10 | RDMA-based KV transfer between instances |
| Python | 3.x managed by `uv` | `.venv/` at project root |
| Model | `Qwen3-Coder-30B-A3B-Instruct` | MoE 128 experts top-8, 3B active params |
| Model path | `~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct` | Same on dash0 and dash1 |
### 2.3 Workload Trace
| Property | Value |
|----------|-------|
| Source | GLM-5.1 Agentic Coder, production cluster, 2h window |
| Raw trace | `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` on dash0 |
| Total requests | 2,114,220 |
| Avg input tokens | 33,600 (p50=20k, p90=88k) |
| Avg output tokens | 445 (p50=80) |
| I/O ratio | 75.6× aggregate |
| Prefill token share | 98% |
| KV reuse breakdown | 62% intra-session, 38% cross-session (token-level) |
| Theoretical max APC | 67% (infinite cache, single instance, prefix-only) |
**Sampled trace for benchmarks**: `traces/w600_r0.0015_st30.jsonl` (1214 requests, 688 sessions, 70% multi-turn). Generated with window+thin sampling:
```bash
python scripts/sample_trace.py \
--input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
--output traces/w600_r0.0015_st30.jsonl \
--sample-ratio 0.0015 --max-single-turn-ratio 0.3 \
--window-seconds 600 --seed 42
```
| Trace property | Value |
|---------------|-------|
| Sessions | 688 (70% multi-turn, avg 4.9 turns) |
| Requests | 1214 (use `--request-limit 850` for daily, full for validation) |
| Avg input tokens | 48,776 |
| Trace span | 2912s (48.5 min); dense segment 0-990s (850 req) |
| Peak QPS | 1.6 req/s (in dense segment) |
| Hash block sharing | 48.3% (vs 52% full trace) |
| Theoretical APC | 80% (full), 76% (first 850 req) |
> **Sampling methodology (2026-05-23)**: Prior traces used random session sampling + `--time-scale` compression + `--max-inflight-sessions` semaphore, which (a) destroyed cross-session hash block sharing (52% → 16%), (b) artificially limited concurrency to 1 req/GPU, and (c) masked prefill-decode interference. The new approach uses contiguous time-window sampling with session thinning (`--max-single-turn-ratio 0.3`) to preserve KV reuse patterns, and trace-driven replay with no artificial concurrency limits.
### 2.4 Two Configurations Compared
#### Baseline: PD-Combined (8× TP=1 DP=8)
```
8 independent vLLM instances, 1 GPU each, no Mooncake.
All instances do both prefill and decode.
Global scheduler (cache_aware_proxy.py --combined) handles:
- Session-sticky routing (multi-turn → same instance)
- Load-aware override (if pinned instance > 2× avg load, redirect)
- Cache-hit scoring (prefer instance with matching prefix blocks)
```
Launch:
```bash
# On dash0:
for i in $(seq 0 7); do
MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
--port $((8000+i)) --tp 1 \
--enable-prefix-caching --enforce-eager \
--gpu-memory-utilization 0.9 --max-model-len 200000 \
> /tmp/ab_base_$i.log 2>&1 &
done
python scripts/cache_aware_proxy.py \
--combined http://127.0.0.1:800{0..7} --port 9090
```
#### Elastic P2P Offload (8× TP=1 kv_both + selective offload)
```
8 independent vLLM instances, 1 GPU each, all kv_role=kv_both (Mooncake).
Same global scheduler, plus elastic offload logic:
- Proxy classifies each request: WARM (<5k new), MEDIUM (5-20k), HEAVY (>20k)
- WARM/MEDIUM: co-located on session-sticky instance (no KV transfer)
- HEAVY: prefill on a different instance (P), KV via Mooncake RDMA,
decode on session-sticky instance (D)
- Cap: max 4 concurrent offloads (MAX_OFFLOAD_INFLIGHT)
- P instance selection: round-robin with overload skip
```
Launch:
```bash
# On dash1 (or use scripts/launch_elastic_p2p.sh):
for i in $(seq 0 7); do
VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998+i)) \
MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
vllm serve ~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
--port $((8000+i)) --tp 1 \
--enable-prefix-caching --enforce-eager \
--gpu-memory-utilization 0.9 --max-model-len 200000 \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
> /tmp/ab_elastic_$i.log 2>&1 &
sleep 2 # stagger to avoid NCCL port collision
done
# Wait for bootstrap servers
for bp in $(seq 8998 9005); do
until curl -s localhost:$bp/query > /dev/null 2>&1; do sleep 2; done
done
python scripts/cache_aware_proxy.py \
--combined http://127.0.0.1:800{0..7} \
--bootstrap-ports 8998,8999,9000,9001,9002,9003,9004,9005 \
--offload --heavy-threshold 20000 --port 9090
```
### 2.5 Benchmark Parameters
| Parameter | Value |
|-----------|-------|
| Trace | `traces/w600_r0.0015_st30.jsonl` (window+thin, 70% multi-turn) |
| Daily iteration | `--request-limit 850` (~13 min, APC≈76%) |
| Full validation | All 1214 requests (~48 min, APC≈80%) |
| Replay mode | Trace-driven (no session limit, no time compression) |
| Request timeout | 600s |
| vLLM flags | `--enforce-eager --enable-prefix-caching --max-model-len 200000` |
| GPU memory util | 0.9 |
| Fresh restart | Both configs started from cold (no warm cache) |
### 2.6 Reproducing the Benchmark
```bash
# Activate environment
cd ~/agentic-kv && source .venv/bin/activate
# Ensure sampled trace exists
python scripts/sample_trace.py \
--input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
--output traces/sampled_1000req_seed42.jsonl \
--target-requests 1000 --seed 42
# Run benchmark (daily iteration)
bash scripts/bench.sh --tag my_experiment --mode baseline --policy linear \
--trace traces/w600_r0.0015_st30.jsonl --requests 850
# Run benchmark (full validation)
bash scripts/bench.sh --tag my_experiment_full --mode baseline --policy linear \
--trace traces/w600_r0.0015_st30.jsonl
```
## 3. Results
> **Errata (2026-05-22)**: The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported -44% E2E improvement. Post-hoc analysis revealed the dash0 baseline instances were **not freshly restarted** — residual KV cache from prior experiments caused 2× TTFT inflation.
> **Errata (2026-05-23)**: §3.1 results used artificial concurrency limits (`--max-inflight-sessions 8`, 1 req/GPU) and random session sampling that destroyed cross-session KV sharing (52% → 16%). See §3.6 for production-realistic results with corrected methodology.
### 3.1 Legacy Comparison (artificial 1 req/GPU, 200 req)
| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|---------|
| **Baseline (no Mooncake)** | **198/200** | **1.075s** | **9.384s** | **0.076s** | **5.075s** |
| LMetric routing | 198/200 | 1.099s | 9.392s | 0.073s | 5.205s |
| Elastic P2P (kv_both) | 195/200 | 1.018s | 11.312s | 0.085s | 6.977s |
### 3.2 Per-Class Breakdown
**Baseline (fresh):**
| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|-------|-------|---|----------|----------|----------|
| WARM (<5k) | 46 | 23% | 0.137s | 0.262s | 0.061s |
| MEDIUM (5-20k) | 50 | 25% | 0.921s | 1.846s | 0.079s |
| HEAVY (20-50k) | 64 | 32% | 2.660s | 6.278s | 0.076s |
| HEAVY (>50k) | 38 | 19% | 9.587s | 30.415s | 0.102s |
**Elastic P2P (fresh):**
| Class | Count | % | TTFT p50 | TTFT p90 | TPOT p90 |
|-------|-------|---|----------|----------|----------|
| WARM (<5k) | 46 | 23% | 0.142s | 0.279s | 0.072s |
| MEDIUM (5-20k) | 50 | 25% | 0.766s | 1.814s | 0.197s |
| HEAVY (>20k) | 99 | 51% | 6.390s | 22.668s | 0.085s |
### 3.3 Success Rate
| Config | OK | Total | Rate | Failure mode |
|--------|-----|-------|------|-------------|
| Baseline | 198 | 200 | 99.0% | RemoteProtocolError (replayer-side) |
| Elastic P2P | 195 | 200 | 97.5% | 2× RemoteProtocolError + 3× ReadTimeout on >60k |
Elastic's 3 extra errors are D-side KV pull failures: prefill succeeded on P, KV pushed to Mooncake, but D never produced first token (decode scheduler couldn't allocate KV cache space). Prefill timeout fallback (120s → co-located) was never triggered.
### 3.4 Routing Policy: Linear vs LMetric (OSDI'26)
LMetric (`score = P_tokens × BS`, pure per-request, no session affinity) vs Linear (`score = ongoing_tokens - α·cache_hit`, session-sticky). Both fresh-restart, same trace.
> **Errata (2026-05-23)**: Prior LMetric implementation incorrectly shared session-sticky logic with Linear. Fixed to pure per-request routing per OSDI'26 spec: `score = (pending_prefill + new_tokens) × num_requests`, no affinity, no overload override. Results below use corrected implementation.
| Policy | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 | Delta E2E |
|--------|----------|----------|----------|---------|-----------|
| Linear (session-sticky) | 1.073s | 9.347s | 0.073s | 5.119s | — |
| LMetric (no affinity) | 1.081s | 9.408s | 0.072s | 5.102s | **-0.3%** |
**Key finding**: LMetric without explicit session affinity matches Linear with session affinity on all metrics (< 2% difference). The cache-hit term in LMetric's scoring (`new_tokens = input - cache_hit`) creates **implicit soft affinity** instances that already cached a session's prefix get lower P_tokens, naturally attracting subsequent turns. Explicit session-sticky routing is not required; cache-aware load balancing captures it automatically.
APC distribution (LMetric, no affinity): inst_0=60.6%, inst_1=58.3%, inst_2=43.2%, inst_3=28.9%, inst_4=16.6%, inst_5=24.0%, inst_6=13.9%, inst_7=0.0%. Non-uniform but comparable aggregate to Linear's explicit affinity.
### 3.5 Errata: Why Prior Cross-Machine A/B Was Invalid
The initial comparison (commit `1e86285`) reported:
```
Baseline (dash0): TTFT50=2.383 E2E50=10.232 ← WRONG (warm instances)
Elastic (dash1): TTFT50=1.315 E2E50=5.708
Delta: -45% -44% ← INVALID
```
**Evidence that prior baseline was not fresh:**
1. `inst_7` APC = 68.3% impossible from 25 cold-start requests (max ~25%)
2. WARM TTFT p90 = 3.327s (fresh = 0.262s, 12.7× gap) indicates KV cache memory pressure from prior experiments
3. HEAVY TPOT p90 = 0.154s (fresh = 0.076s, 2.0× gap) heavy prefill-decode interference from full KV cache
The elastic numbers on dash1 were genuinely fresh. The "improvement" was actually comparing fresh elastic against degraded baseline.
### 3.6 Production-Realistic Baseline (trace-driven, corrected methodology)
> Corrected sampling (window+thin, 70% multi-turn, block sharing 48%) and trace-driven replay (no session limit, no time compression). See §2.3 for trace details.
**Linear policy, 912 requests (dense segment), peak QPS ≈ 1.6:**
| Metric | Legacy 3.1, 1 req/GPU) | **New (trace-driven)** | Delta |
|--------|-------------------------|----------------------|-------|
| TTFT mean | 1.07s | **4.54s** | +4.2× |
| TTFT p50 | 1.08s | **0.94s** | -13% |
| TTFT p90 | 9.38s | **14.12s** | +51% |
| TPOT p50 | 0.038s | **0.070s** | **+84%** |
| TPOT p90 | 0.073s | **0.175s** | **+139%** |
| APC (mean) | ~44% | **67.5%** | **+23pp** |
| Errors | 2/200 (1.0%) | 0/912 (0%) | better |
| E2E p50 | 5.08s | 6.98s | +37% |
**Key differences from legacy methodology:**
1. **APC 67.5% vs 44%**: Window+thin sampling preserves cross-session block sharing (48% vs 16% in legacy random sampling), yielding production-realistic cache hit rates. Per-instance APC ranges 4684%.
2. **TPOT +139% at p90**: With trace-driven replay, multiple concurrent requests per GPU create **real prefill-decode interference**. The legacy 1 req/GPU setup showed TPOT p90=0.073s (no interference), but production-realistic load shows TPOT p90=0.175s. This validates that prefill-decode interference is a real problem at production concurrency.
3. **TTFT p50 improved (-13%) but mean degraded (+4.2×)**: Higher APC means cached requests get very fast TTFT (p50=0.94s). But concurrent heavy prefills cause queuing for non-cached requests, inflating the mean and p90.
4. **Per-instance APC imbalance (4684%)**: Routing quality directly determines per-instance APC. The 38pp gap between worst and best instance suggests routing optimization is still the highest-leverage improvement.
**Output**: `outputs/baseline_r0015_st30/` on dash0.
### 3.7 Elastic PS vs Baseline (production-realistic trace)
850 requests, `w600_r0.0015_st30.jsonl`, peak QPS1.6. Baseline on dash0, elastic on dash1.
| Metric | Baseline | Elastic PS | Delta |
|--------|----------|-----------|-------|
| TTFT mean | 4.35s | 4.01s | -7.8% |
| TTFT p50 | 0.94s | 0.93s | -1% |
| TPOT p50 | 0.070 | 0.071 | +2% |
| TPOT p90 | 0.162 | 0.157 | -3.1% |
| E2E p50 | 6.38s | 6.44s | +0.9% |
| APC mean | 60.7% | 59.9% | -0.8pp |
| Errors | 0/850 | 4/832 | 4 ReadTimeout |
**Elastic PS is near-neutral.** Root cause analysis:
**Problem 1: Offload gate too restrictive** only 17/118 HEAVY requests (14%) were offloaded. 75% of HEAVY requests had `cache_ratio=0%` (cold Turn 1), failing the `cache_ratio >= 0.3` gate. The gate was designed to avoid offloading cold requests (full prefill on P is slower than co-located), but this means 86% of HEAVY prefills still interfere with decode.
**Problem 2: Offloaded requests are slower (+50.6%)** HEAVY_OFFLOAD TTFT=19.94s vs HEAVY_COLO=13.25s. Breakdown:
- Prefill on P: 14.72s (P also queued, no faster than co-located)
- KV transfer + decode start on D: 5.71s (pure overhead)
**Interference is real but unaddressed**: 89% of WARM/MEDIUM requests ran concurrently with 1+ HEAVY prefills (up to 60 concurrent). Elastic PS only offloaded 17/118 HEAVY requests insufficient to reduce interference.
**Conclusion**: The offload gate (`cache_ratio >= 0.3`) is correct in principle (cold offload IS slower), but leaves the core problem unsolved. Reducing prefill-decode interference requires either:
1. Offloading ALL heavy prefills (accepting higher TTFT for offloaded requests in exchange for lower TPOT for all)
2. Chunked prefill scheduling that yields to decode (vLLM-side optimization)
3. Dedicated prefill GPUs (full PD separation) if KV memory wall can be solved
**Output**: `outputs/eval_baseline_linear/` on dash0, `outputs/eval_elastic_linear/` on dash1.
### 3.8 Direct KV Cache Migration (Bootstrap-Triggered PUSH)
**Architecture**: D asks C's bootstrap server to PUSH cached KV blocks directly into D's GPU memory via Mooncake RDMA WRITE. C's vLLM scheduler is NOT involved (0 GPU compute on C). D then does local prefill for new tokens + decode.
**Implementation details** (vLLM + Mooncake patches):
1. **Hash table sync** (scheduler worker bootstrap): Each step, scheduler computes delta of `BlockPool.cached_block_hash_to_block` and syncs to worker's bootstrap server via `MooncakeConnectorMetadata.hash_table_updates`.
2. **Token-based block lookup**: D sends `POST /push_blocks` with prompt `token_ids` + D's GPU addresses. C's bootstrap computes block hashes using `sha256` + `NONE_HASH` (same hash function as scheduler), matches against synced hash table.
3. **RDMA PUSH**: C's bootstrap calls `TransferEngine.batch_transfer_sync_write` to push matched KV blocks from C's GPU into D's GPU. This uses the existing RDMA WRITE path (proven reliable), not RDMA READ (which fails on `batch_register_memory`'d GPU memory due to missing `IBV_ACCESS_REMOTE_READ` flags).
4. **Cost model**: `offload when colocated_cost + interference > offload_cost`, where `interference = prefill_time × min(num_requests, 3) × 0.3`. Offload triggers when C has 1+ concurrent request.
5. **Requirements**: `PYTHONHASHSEED` must be set (bench.sh sets `PYTHONHASHSEED=42` for elastic mode) to ensure deterministic `NONE_HASH` across scheduler/worker code paths.
**Minimal test verification** (`scripts/test_direct_read.py`):
| Metric | inst_0 (local cache) | inst_1 (RDMA push from inst_0) |
|--------|---------------------|-------------------------------|
| Turn 2 TTFT | 0.338s | **0.367s** |
| Blocks transferred | | **640/640 matched, push ret=0** |
| External APC | 0% | **80%** |
**Key bugs fixed during development**:
- `NameError: field not imported` missing dataclass import
- Scheduler assertion crash (`assert RequestStatus.is_finished`) partial remote prefill state mismatch
- Hash mismatch 0/640 `sha256` vs `sha256_cbor` (default hash algo is `sha256`, not `sha256_cbor`)
- Hash mismatch 0/640 `from X import NONE_HASH` creates stale value binding after `init_none_hash` reassigns the global; fixed with `import X; X.NONE_HASH`
- RDMA READ ret=-1 `batch_register_memory` only sets `IBV_ACCESS_REMOTE_WRITE`; switched to bootstrap-triggered PUSH
- Cost model 0% trigger removed stale `cache_gate_ratio` check; added interference penalty
**Output**: `outputs/eval_direct_rdma_v*/` on dash0.
### 3.9 Unified Routing (Historical — superseded)
> **Superseded by git history.** The "single argmin + PUSH migration" design
> described here was implemented in `6b255fa`, refined through
> `5892739` (soft affinity), `2b9eae0` (numbers below), and `4b50c5a`
> (queue/overload-gate fixes). Follow-on attempts to scale migration —
> `e991960`/`5772149` (forced session migration) and `bf4469a` (relaxed
> push gate) — were both reverted (`cc6e562`, `4c583f2`) after they
> regressed E2E tail (57 migrations → HEAVY TTFT p90 15.9s → 59.1s;
> 134 offloads → E2E p90 37s → 82s).
>
> Current implementation is the **hybrid LMetric + high-cache affinity**
> direction landed in `255c8e6`. See `docs/migration-policy-design.md`
> for the active algorithm and `analysis/unified_routing_fix_review.md`
> for the reasoning. The numbers below remain valid for the
> `eval_unified_v3` artifact; do not treat them as the current
> production policy.
Replaced two-phase routing (pick_instance offload gate) with single `argmin(expected_latency)` per instance:
```
latency(D) = queue(D) + prefill(D) + transfer(D)
- Local cache: prefill = (input - local_hit) / throughput, transfer = 0
- PUSH from C: prefill = (input - push_hit) / throughput, transfer = 0.1s
- Cold: prefill = input / throughput, transfer = 0
```
Session affinity as soft preference: use last instance if its cost 2× global best.
Only 2 measured parameters: `prefill_throughput=7000 tok/s`, `rdma_overhead=0.1s`.
**Results (eval_unified_v3, 850/850, 0 errors):**
Baseline = `eval_baseline_linear` (plain vLLM, no Mooncake, linear policy, 850 req, same trace).
| Metric | Baseline (plain) | **Unified v3 (kv_both)** | Delta |
|--------|-----------------|-------------------------|-------|
| TTFT mean | 4.348s | **3.277s** | **-24.6%** |
| TTFT p50 | 0.945s | **0.793s** | **-16.1%** |
| TTFT p90 | 12.468s | **8.472s** | **-32.1%** |
| TTFT p99 | 48.149s | **41.587s** | **-13.6%** |
| TPOT mean | 0.116s | 0.112s | -3.1% |
| TPOT p50 | 0.071s | 0.077s | +8.9% |
| TPOT p90 | 0.177s | 0.198s | +11.7% |
| TPOT p99 | 1.018s | **0.816s** | **-19.9%** |
| E2E mean | 19.10s | 19.81s | +3.7% |
| E2E p50 | 6.443s | **5.599s** | **-13.1%** |
| E2E p90 | 42.27s | 47.48s | +12.3% |
| E2E p99 | 192.2s | 238.0s | +23.8% |
**Routing**: 723 LOCAL + 116 PUSH_MIGRATE (13.8%). All 116 pushes had cache (avg 25k tokens) no cold offloads. The unified cost model naturally avoids cold migration because `cold + RDMA > cold` (RDMA adds overhead without reducing prefill).
**Tradeoff**: TTFT uniformly improves (p50 -16%, p90 -32%). TPOT mixed: p50/p90 worse (+9%/+12%), but p99 improves (-20%) PUSH migration relieves the heaviest tail prefills. **E2E tail degrades significantly** (p90 +12%, p99 +24%): kv_both always-on overhead + PUSH transfer latency on migrated requests inflates E2E for long requests, offsetting the TTFT gain. The p50 benefit (-13%) comes from the majority of LOCAL requests getting faster prefill due to reduced queue contention.
**Output**: `outputs/eval_unified_v3/` on dash0, baseline from `outputs/eval_baseline_linear/`.
## 4. System-Level Analysis
### 4.1 Elastic P2P Does Not Improve Single-Machine Performance
Under fair comparison (same machine, both fresh):
| Metric | Baseline | Elastic | Delta |
|--------|----------|---------|-------|
| TTFT p50 | 1.075s | 1.018s | -5.3% |
| TTFT p90 | 9.384s | 11.312s | +20.5% |
| TPOT p90 | 0.076s | 0.085s | +11.6% |
| E2E p50 | 5.075s | 6.977s | +37.5% |
Elastic is **worse** on all metrics except TTFT p50. Root causes:
**1. Mooncake kv_both memory overhead**
Each instance with `kv_role=kv_both` maintains RDMA buffers + Mooncake bootstrap server, reducing GPU memory available for KV cache. This affects ALL requests (including WARM/MEDIUM that don't use P2P transfer), causing more cache eviction and higher TPOT.
Evidence: MEDIUM TPOT p90 = 0.197s (elastic) vs 0.079s (baseline) **2.5× worse** despite MEDIUM requests not using P2P at all.
**2. D-side KV pull failures**
3 HEAVY requests completed prefill on P instance successfully but D-side never produced first token. The KV cache on D was too full to allocate space for the transferred blocks. These became 600s timeouts.
**3. P2P overhead without proportional benefit**
The P2P path adds: prefill queue on P (p50=6.3s) + KV transfer + decode start on D (p50=0.8s). For requests where the D instance isn't under heavy prefill load (which is the case on fresh instances), co-located execution is faster.
### 4.2 When Elastic P2P Could Help
Elastic P2P is designed for the scenario where D-instance decode is disrupted by co-located heavy prefill. On fresh instances with 200 requests, this contention is moderate. The benefit may emerge under:
- Higher sustained load (1000+ concurrent requests)
- Longer experiment duration (KV cache fills up, eviction pressure increases)
- Multi-machine deployment (P on a different node, no memory competition)
## 5. Data & Log Locations
### 5.1 Experiment Outputs (on respective machines)
| Directory | Machine | Config | Notes |
|-----------|---------|--------|-------|
| `outputs/ab_baseline/` | dash0 | Combined 8× TP=1 | ~~Initial A/B~~ (INVALIDATED: warm instances) |
| `outputs/ab_elastic/` | dash0 | Elastic P2P cap=4 | ~~Initial A/B~~ (INVALIDATED) |
| `outputs/baseline_stability_fresh/` | dash0 | Combined 8× fresh | **Canonical baseline** 3.1) |
| `outputs/elastic_stability_*/` | dash0 | Elastic P2P kv_both fresh | **Canonical elastic** 3.1) |
| `outputs/ab_linear/` | dash0 | Linear policy, 200 req | §3.4 routing policy comparison |
| `outputs/ab_lmetric/` | dash0 | LMetric policy, 200 req | §3.4 routing policy comparison |
| `outputs/gpu_ab_combined/` | local | Combined 8× TP=1 | Earlier run, has gpu_util.csv |
| `outputs/gpu_ab_pdsep/` | local | PD-Sep 4P+4D | Earlier run, has gpu_util.csv |
| `outputs/exp2_combined_tp1_dp8/` | local | Combined 8× TP=1 | 1000 req, cache-aware |
| `outputs/exp3_pd_sep_tp1_mooncake/` | local | PD-Sep 4P+4D Mooncake | 1000 req |
### 5.2 vLLM Instance Logs
| Path pattern | Machine | Config |
|-------------|---------|--------|
| `/tmp/ab_base_$i.log` | dash0 | Baseline instances 0-7 |
| `/tmp/ab_elastic_$i.log` | dash1 | Elastic instances 0-7 |
| `/tmp/lmetric_ab_inst_$i.log` | dash0 | Linear policy instances 0-7 3.6) |
| `/tmp/lmetric_inst_$i.log` | dash0 | LMetric policy instances 0-7 3.6) |
Logs contain `Prefix cache hit rate` and `External prefix cache hit rate` lines for APC extraction.
### 5.3 Trace Data
| Path | Machine | Description |
|------|---------|-------------|
| `~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl` | dash0 | Full 2h production trace (2.1M requests) |
| `traces/sampled_1000req_seed42.jsonl` | all | Sampled 1000 requests (gitignored, regenerate with `sample_trace.py`) |
### 5.4 Analysis Documents
| File | Content |
|------|---------|
| `analysis/pd_separation_analysis.md` | Main report: PD-Sep vs Combined + Elastic P2P 5) |
| `analysis/elastic_offload_design.md` | Elastic P2P design rationale |
| `analysis/kv_lifecycle_design.md` | KV cache eviction policy analysis |
| `analysis/adaptive_prefill_offload_design.md` | Initial adaptive offload design (superseded by elastic) |
## 6. Repository Structure
```
agentic-kv/
├── analysis/ # Research reports and design docs
│ ├── pd_separation_analysis.md # Main comprehensive report
│ ├── elastic_offload_design.md # Elastic P2P design
│ ├── kv_lifecycle_design.md # Cache eviction analysis
│ └── ...
├── replayer/ # Trace replay framework
│ ├── __main__.py # CLI entry: python -m replayer
│ ├── replay.py # Async replayer (session-aware, SSE streaming)
│ ├── trace.py # TraceRequest dataclass, session/hash_id handling
│ └── metrics.py # RequestMetrics, crash-safe JSONL sink
├── scripts/
│ ├── cache_aware_proxy.py # Global scheduler (combined + PD-sep + elastic offload)
│ ├── sample_trace.py # Cluster-to-machine trace sampler
│ ├── launch_vllm.sh # Launch combined TP=8
│ ├── launch_pd_mooncake.sh # Launch PD-Sep with Mooncake
│ ├── launch_elastic_p2p.sh # Launch elastic P2P (8× kv_both + offload proxy)
│ ├── run_experiments.sh # Full experiment matrix (combined/PD-sep)
│ ├── run_benchmark.sh # Single benchmark run
│ ├── gpu_monitor.sh # GPU utilization sampler (5s CSV)
│ ├── compute_roofline.py # Prefill/decode roofline analysis
│ ├── analyze_*.py # Various analysis scripts
│ └── compare_*.py # Experiment comparison scripts
├── patches/
│ ├── 0001-fix-kv-transfer-abort-race.patch
│ └── README.md
├── third_party/vllm/ # vLLM 0.18.1 source (with patch applied)
├── outputs/ # Experiment results (gitignored)
├── traces/ # Sampled traces (gitignored)
├── TODO.md # Original research goals
└── REPORT.md # This milestone report
```
## 7. Key Scripts Reference
| Script | What it does | Key flags |
|--------|-------------|-----------|
| `scripts/cache_aware_proxy.py` | Global scheduler + elastic offload proxy | `--combined`, `--offload`, `--policy {linear,lmetric}`, `--heavy-threshold`, `--bootstrap-ports` |
| `scripts/run_lmetric_ab.sh` | A/B: linear vs lmetric routing policy | Runs both experiments with fresh restart |
| `scripts/run_elastic_stability_test.sh` | Elastic vs baseline with full isolation | Fresh start/stop per experiment |
| `scripts/bench.sh` | Standard single-experiment harness | `--tag`, `--mode {baseline,elastic}` |
| `scripts/sample_trace.py` | Sample complete sessions from cluster trace | `--target-requests`, `--seed` |
| `python -m replayer` | Replay trace against vLLM endpoint | `--time-scale`, `--max-inflight-sessions`, `--request-limit` |
| `scripts/gpu_monitor.sh` | Sample nvidia-smi to CSV | Pipe to `outputs/<tag>/gpu_util.csv` |
| `scripts/launch_elastic_p2p.sh` | Launch all 8 kv_both instances + offload proxy | `HEAVY_THRESHOLD`, `MAX_OFFLOAD` env vars |
## 8. GPU Load Imbalance & Elastic Prefill Service Analysis
### 8.1 Load Imbalance Characterization
Session-sticky routing creates token load imbalance across instances. The severity depends on scale:
| Scale | Imbalance | Top 5 sessions | Cause |
|-------|-----------|----------------|-------|
| 200 req (143 sessions) | **8.6×** tokens | 49% of all tokens | Small sample, few dominant sessions |
| 1000 req (668 sessions) | **1.24×** tokens | 29% of all tokens | More sessions natural averaging |
At 1000 requests, the heaviest instance has 4.5M tokens vs lightest 3.6M (1.24×). Despite this, TPOT is uniform across all instances (0.0700.077), confirming that prefill-decode interference is minimal at 1 session/GPU. The imbalance manifests in **TTFT only**: heaviest 2 instances TTFT p50 = 1.42s vs lightest 2 at 0.83s (1.7× gap).
### 8.2 Session Accumulation Pattern
Agentic workloads produce long-lived sessions with growing context:
| Session | Turns | Total Tokens | Context Growth |
|---------|-------|-------------|----------------|
| 1569319 | 36 | 2.32M | 27k 98k (+2.0k/turn) |
| 1206593 | 36 | 2.31M | 15k 106k (+2.6k/turn) |
| 178176 | 25 | 1.93M | 36k 95k (+2.5k/turn) |
Top 5 sessions = 29% of all tokens. With session-sticky, these lock their instances, creating persistent load hotspots.
### 8.3 Benchmark Concurrency vs Production Reality
> **Critical caveat**: All prior experiments used `--max-inflight-sessions 8` (1 session/GPU). This is **1015× below production concurrency** and masks the interference that elastic PS is designed to solve.
| | Our Benchmark | Production Estimate |
|--|---------------|---------------------|
| Concurrent requests/GPU | 12 | **815** |
| KV cache usage/GPU | 2628% (single req) | **80100%** |
| Prefill-decode interference | Minimal | **Significant** |
**KV cache capacity**: 281,888 tokens/GPU (25.8 GiB). A single 75k-token request consumes 27% of KV cache. At production concurrency (~15 req/GPU), KV cache is near-full, triggering eviction, cache misses, and prefill queuing none of which appear in our 1-req/GPU benchmark.
**Measured interference scaling**:
| Concurrency | TPOT p90 | vs 8-session |
|-------------|----------|-------------|
| 8 sessions (1/GPU) | 0.075s | baseline |
| 16 sessions (2/GPU) | 0.103s | **+38%** |
| Production (~15/GPU) | *not tested* | *expected >>+45%* |
### 8.4 Elastic PS: Two Capabilities Re-Evaluated
**Capability 1: Reduce prefill-decode interference (lower TPOT)**
At 1 req/GPU (our benchmark): no interference, no benefit. But this is an artifact of unrealistically low concurrency. At 2 req/GPU, chunked prefill interrupts decode steps, causing TPOT +3845%. At production concurrency (~15/GPU), multiple HEAVY prefills sharing a GPU with decode requests would cause severe interference. Elastic PS's ability to isolate heavy prefills on separate GPUs directly addresses this.
**Capability 2: Session migration for load balancing**
Elastic PS enables mid-session migration: prefill on original instance (cache hit), KV transfer to a different instance for decode + future turns. This provides two benefits:
1. **Break session lock-in**: Agentic sessions grow +2k tokens/turn over 30+ turns. With session-sticky, a 36-turn session (2.3M tokens total) locks its GPU, creating a hotspot. Elastic PS lets the session migrate to a less-loaded GPU while preserving cache on the original (PS does fast cached prefill, new GPU decodes).
2. **Rebalance without cache loss**: Unlike breaking affinity (which destroys cache), elastic PS migration preserves the prefix cache on the original instance the PS re-uses it for fast prefill, then transfers only the new KV to the destination.
Simulation of migration strategies (1000 req, at current low concurrency):
| Strategy | Imbalance | Migrations | KV Transfer Overhead |
|----------|-----------|------------|---------------------|
| No migration | 1.24× | 0 | 0s |
| Every 10 turns | 1.21× | 10 | 15s |
| Every 5 turns | 1.20× | 20 | 30s |
At 1 req/GPU, migration benefit is marginal (imbalance is only 1.24×). At production concurrency where imbalance combines with KV cache pressure and interference, the benefit would be substantially larger.
**Capability 3: Soft affinity from cache-hit scoring**
The corrected LMetric experiment 3.4) reveals that **explicit session affinity is unnecessary**. Cache-hit scoring (`new_tokens = input cached`) creates implicit soft affinity instances with cached prefixes score lower, naturally attracting subsequent turns. This matches hard session-sticky on all metrics (< 2% difference) while providing more flexible load balancing.
### 8.5 Elastic PS Verdict
| Aspect | At 1 req/GPU (tested) | At production load (expected) |
|--------|----------------------|-------------------------------|
| TPOT reduction | 0% (no interference) | **Significant** (interference scales with concurrency) |
| Session migration | Marginal (1.24× 1.20×) | **Larger benefit** (KV pressure + interference amplify imbalance) |
| Cache preservation | N/A | **Key advantage** vs breaking affinity |
**At our benchmark concurrency (1 req/GPU), elastic PS is not justified** Mooncake overhead exceeds the non-existent interference benefit. **But our benchmark is 1015× below production load.** The real question is whether elastic PS helps at production-realistic concurrency (64128 concurrent sessions, 815 req/GPU), where:
- Prefill-decode interference is measurable (already +38% TPOT at just 2/GPU)
- KV cache pressure causes eviction and queue delays
- Session accumulation creates compounding hotspots
- Heavy prefills (50100k tokens) block decode for seconds
**Next step: run `--max-inflight-sessions 64` benchmark** to test elastic PS at production-realistic concurrency.
## 9. Conclusions & Next Steps
### Established findings:
1. Full PD separation is **net negative** for single-machine agentic workloads (KV cache memory wall)
2. Cache-aware routing is the **dominant optimization** (+24pp APC, -60% TTFT vs round-robin)
3. **Explicit session affinity is unnecessary** cache-hit scoring creates implicit soft affinity that matches hard session-sticky (< 2% difference)
4. At low concurrency (1 req/GPU), elastic P2P offload adds overhead without benefit
5. **Our benchmark concurrency is 1015× below production**: `--max-inflight-sessions 8` yields 1 req/GPU, masking prefill-decode interference that appears at 2 req/GPU (+38% TPOT) and would dominate at production load (~15 req/GPU)
6. **Experimental methodology matters**: warm vs fresh instances cause 2× TTFT difference
### Lessons learned:
- Prior cross-machine A/B (commit `1e86285`) was invalid warm baseline inflated by 2×
- Prior LMetric implementation was invalid incorrectly shared session-sticky logic with Linear
- `kv_role=kv_both` has non-trivial always-on overhead even when P2P transfer is not used
- Experiment isolation (kill all verify GPU free fresh start) is critical for reproducibility
- **Benchmark concurrency must match production** sub-production concurrency hides interference effects that drive system design decisions
### Open problems (priority ordered):
1. **Production-concurrency benchmark** (`--max-inflight-sessions 64+`): Validate whether prefill-decode interference at 815 req/GPU makes elastic PS net-positive
2. **Multi-machine elastic**: P on a different node eliminates GPU memory competition the main cost that makes single-machine elastic net negative
3. **Layerwise KV transfer**: Mooncake's block-level transfer after full prefill is the bottleneck. Layerwise pipelining could reduce transfer latency by overlapping with computation
4. **Router state accuracy**: proxy shadow state vs vLLM-internal exact state (TODO: vLLM Redis router)
---
*Updated 2026-05-23. LMetric corrected (§3.4 errata). GPU imbalance analysis added (§8). Benchmark concurrency gap identified — production-load experiments needed.*