agentic-kvc/experiments/elastic_ps_eval.md
Gahow Wang 03e88b30bd Add elastic PS evaluation plan for production-realistic trace
4 experiments: baseline vs elastic × linear vs lmetric
Using corrected trace (w600_r0.0015_st30, 70% multi-turn, APC~76%)
and fixed elastic PS (D accounting, offload cap, cache sync).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 15:56:05 +08:00


Elastic PS Evaluation Plan

Goal

Compare the baseline (PD-combined) against elastic PS (selective prefill offload) under a production-realistic trace on 8×H20.

Context

The baseline (baseline_r0015_st30, 912 req) shows:

  • TPOT p90=0.175s (vs 0.073s at 1 req/GPU) — prefill-decode interference is real
  • APC=67.5%, with per-instance values ranging from 46% to 84%
  • 58% of requests are HEAVY (≥20k input tokens) and account for 89% of input tokens

Elastic PS offloads HEAVY prefills to a different GPU via Mooncake RDMA, isolating decode from prefill interference. Recent bug fixes:

  • D instance now accounted during prefill phase (prevents D overload)
  • MAX_OFFLOAD_INFLIGHT=4 cap prevents runaway offloads
  • D's proxy cache updated after decode (preserves session cache locality)
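
The in-flight cap can be illustrated with a minimal sketch. Everything here except the constant name `MAX_OFFLOAD_INFLIGHT`, the 20k HEAVY threshold, and the class labels is hypothetical: the `OffloadRouter` class, its methods, the `LOCAL` label, and the assumption that a capped request is recorded as `HEAVY_COLO_FALLBACK` are illustrative only and do not describe the actual proxy code.

```python
from dataclasses import dataclass

HEAVY_TOKENS = 20_000        # HEAVY threshold from this plan (>= 20k input tokens)
MAX_OFFLOAD_INFLIGHT = 4     # cap named in this plan


@dataclass
class OffloadRouter:
    """Hypothetical router that tracks in-flight offloads and enforces the cap."""
    inflight: int = 0

    def route(self, input_tokens: int) -> str:
        if input_tokens < HEAVY_TOKENS:
            # The WARM/MEDIUM split and HEAVY_COLO are not modeled in this sketch.
            return "LOCAL"
        if self.inflight >= MAX_OFFLOAD_INFLIGHT:
            # Assumption: a capped request is served colocated on its D instance.
            return "HEAVY_COLO_FALLBACK"
        self.inflight += 1
        return "HEAVY_OFFLOAD"

    def finish_offload(self) -> None:
        # Call once the offloaded prefill has completed and its KV has been transferred.
        self.inflight = max(0, self.inflight - 1)
```

With this shape, a burst of five HEAVY requests yields four offloads and one fallback until `finish_offload()` releases a slot.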

Machine

dash0: 8×H20 96GB, NVLink, 4×CX7 200Gbps RDMA. SSH: ssh dash0.

Trace

traces/w600_r0.0015_st30.jsonl on dash0 (1214 requests, 688 sessions, 70% multi-turn). Use --requests 850 for ~13 min wall clock.

Experiments

Experiment 1: Baseline (Linear, PD-combined)

cd ~/agentic-kv && source .venv/bin/activate
bash scripts/bench.sh \
    --tag eval_baseline_linear \
    --mode baseline --policy linear \
    --trace traces/w600_r0.0015_st30.jsonl \
    --requests 850

Experiment 2: Elastic PS (Linear, kv_both + offload)

bash scripts/bench.sh \
    --tag eval_elastic_linear \
    --mode elastic --policy linear \
    --trace traces/w600_r0.0015_st30.jsonl \
    --requests 850

Experiment 3: Baseline (LMetric, PD-combined)

bash scripts/bench.sh \
    --tag eval_baseline_lmetric \
    --mode baseline --policy lmetric \
    --trace traces/w600_r0.0015_st30.jsonl \
    --requests 850

Experiment 4: Elastic PS (LMetric, kv_both + offload)

bash scripts/bench.sh \
    --tag eval_elastic_lmetric \
    --mode elastic --policy lmetric \
    --trace traces/w600_r0.0015_st30.jsonl \
    --requests 850
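
The four runs differ only in `--mode`, `--policy`, and `--tag`, so they can be generated from the 2×2 grid. The driver below is a sketch, not part of the repository: it assumes it is launched from the repo root with the virtualenv already active, and that `scripts/bench.sh` can be invoked back to back without manual cleanup between runs (not verified here). It defaults to a dry run that only prints the commands.

```python
import itertools
import subprocess

TRACE = "traces/w600_r0.0015_st30.jsonl"
REQUESTS = 850


def build_commands():
    """Return the four bench.sh invocations in the order Experiment 1..4."""
    cmds = []
    for policy, mode in itertools.product(["linear", "lmetric"],
                                          ["baseline", "elastic"]):
        cmds.append([
            "bash", "scripts/bench.sh",
            "--tag", f"eval_{mode}_{policy}",
            "--mode", mode, "--policy", policy,
            "--trace", TRACE,
            "--requests", str(REQUESTS),
        ])
    return cmds


def run_all(dry_run=True):
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sweep at the first failing experiment.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```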

What to Measure

For each experiment, collect from outputs/<tag>/:

  1. metrics.summary.json: TTFT (mean/p50/p90), TPOT (mean/p50/p90), E2E, success rate
  2. apc.txt: per-instance prefix cache hit rate
  3. breakdown.json: per-request routing class (WARM/MEDIUM/HEAVY_COLO/HEAVY_OFFLOAD/HEAVY_COLO_FALLBACK)
  4. stats.json: per-instance load at end
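
Counting routing classes from `breakdown.json` might look like the sketch below. The file schema is an assumption: it is treated as a JSON list of per-request objects with a `routing_class` field, which is a hypothetical key name, so adjust `key` to whatever the file actually contains.

```python
import json
from collections import Counter


def count_classes(records, key="routing_class"):
    """Tally requests per routing class; `key` is an assumed field name."""
    return Counter(r.get(key, "UNKNOWN") for r in records)


def offload_share(counts):
    """Fraction of HEAVY_* requests that were actually offloaded."""
    heavy = sum(n for cls, n in counts.items() if cls.startswith("HEAVY"))
    return counts["HEAVY_OFFLOAD"] / heavy if heavy else 0.0


def load_breakdown(tag):
    # Assumes outputs/<tag>/breakdown.json holds a JSON list of records.
    with open(f"outputs/{tag}/breakdown.json") as f:
        return json.load(f)
```

For example, `offload_share(count_classes(load_breakdown("eval_elastic_linear")))` would give the share of HEAVY requests that were offloaded in Experiment 2, if the schema assumption holds.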

Analysis

After all 4 experiments, compare:

import json

def summarize(path):
    # Load one run's metrics.summary.json and format its headline latency stats.
    with open(path) as f:
        s = json.load(f)
    return {
        "ok": "%d/%d" % (s["success_count"], s["request_count"]),
        "ttft_mean": "%.2f" % s["ttft_stats_s"]["mean"],
        "ttft_p50": "%.2f" % s["ttft_stats_s"]["p50"],
        "ttft_p90": "%.2f" % s["ttft_stats_s"]["p90"],
        "tpot_mean": "%.4f" % s["tpot_stats_s"]["mean"],
        "tpot_p50": "%.4f" % s["tpot_stats_s"]["p50"],
        "tpot_p90": "%.4f" % s["tpot_stats_s"]["p90"],
        "e2e_p50": "%.2f" % s["latency_stats_s"]["p50"],
    }

for tag in ["eval_baseline_linear", "eval_elastic_linear",
            "eval_baseline_lmetric", "eval_elastic_lmetric"]:
    path = "outputs/%s/metrics.summary.json" % tag
    print("%-30s %s" % (tag, summarize(path)))

Key questions:

  1. Does elastic PS reduce TPOT? (expect: yes, by isolating heavy prefills from decode)
  2. Does elastic PS hurt TTFT? (expect: some increase from RDMA overhead on offloaded requests)
  3. What's the net E2E impact? (TPOT improvement vs TTFT overhead)
  4. How many requests actually get offloaded? (check breakdown.json HEAVY_OFFLOAD count)
  5. Does the offload cap (MAX_OFFLOAD_INFLIGHT=4) get hit? (check breakdown.json for "cap_reached")
  6. Per-instance APC: does D maintain cache after migration? (compare APC spread)
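
Questions 1 to 3 reduce to relative changes between a baseline run and its elastic counterpart. A minimal sketch, reusing the `metrics.summary.json` keys that the summary script reads; the helper names `load_summary` and `relative_change` are illustrative, not existing functions in the repository:

```python
import json

# (group, stat) pairs taken from the keys used by the summary script.
METRICS = [("ttft_stats_s", "p90"), ("tpot_stats_s", "p90"), ("latency_stats_s", "p50")]


def load_summary(tag):
    with open(f"outputs/{tag}/metrics.summary.json") as f:
        return json.load(f)


def relative_change(baseline, elastic):
    """Percent change of elastic vs baseline; negative means elastic is faster."""
    out = {}
    for group, stat in METRICS:
        b, e = baseline[group][stat], elastic[group][stat]
        out[f"{group}.{stat}"] = 100.0 * (e - b) / b
    return out


if __name__ == "__main__":
    for policy in ["linear", "lmetric"]:
        try:
            base = load_summary(f"eval_baseline_{policy}")
            elas = load_summary(f"eval_elastic_{policy}")
        except FileNotFoundError:
            print(policy, "outputs not found; run the experiments first")
            continue
        print(policy, relative_change(base, elas))
```

Comparing within one policy at a time keeps the routing policy fixed, so the difference reflects only the baseline vs elastic mode.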

Expected Results

Based on analysis:

  • HEAVY requests: 58% of total requests, 89% of input tokens
  • TPOT reduction potential: ~66% for WARM/MEDIUM requests (from 0.11 s to 0.038 s)
  • RDMA overhead: ~1-15s per offloaded request (bimodal)
  • Net: TPOT should improve if offload successfully isolates prefill
  • Risk: Mooncake kv_both memory overhead may negate the gains (it added 11% to TPOT in a prior experiment at low concurrency)
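
The trade-off between TTFT overhead and TPOT savings can be made concrete with a back-of-the-envelope model, E2E ≈ TTFT + TPOT × (output_tokens − 1). The sketch below is an assumption-laden estimate, not a measurement: it applies the WARM/MEDIUM TPOT figures above to a single request that also pays the RDMA overhead, whereas in practice the overhead falls on offloaded HEAVY requests and the savings accrue mostly to other requests on the same instance.

```python
def tpot_reduction_pct(before_s, after_s):
    """Relative TPOT reduction in percent."""
    return 100.0 * (before_s - after_s) / before_s


def break_even_output_tokens(rdma_overhead_s, before_s, after_s):
    """Output length at which per-token savings repay the added TTFT.

    Uses E2E ~= TTFT + TPOT * (n_out - 1), so the break-even point is
    n_out = 1 + overhead / (before - after).
    """
    return 1 + rdma_overhead_s / (before_s - after_s)


if __name__ == "__main__":
    before, after = 0.11, 0.038  # seconds per token, from the estimate above
    print("TPOT reduction: %.1f%%" % tpot_reduction_pct(before, after))
    for overhead in (1.0, 15.0):  # the two ends of the bimodal RDMA overhead
        n = break_even_output_tokens(overhead, before, after)
        print("overhead %.0fs -> break-even at ~%.0f output tokens" % (overhead, n))
```

Under these assumptions a 1 s overhead is repaid after roughly 15 output tokens, while a 15 s overhead needs roughly 209, which is why the bimodal overhead distribution matters for the net E2E result.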