diff --git a/experiments/elastic_ps_eval.md b/experiments/elastic_ps_eval.md
new file mode 100644
index 0000000..f8bb313
--- /dev/null
+++ b/experiments/elastic_ps_eval.md
@@ -0,0 +1,120 @@
+# Elastic PS Evaluation Plan
+
+## Goal
+
+Compare **baseline (PD-combined)** vs **elastic PS (selective prefill offload)** under a production-realistic trace on 8×H20.
+
+## Context
+
+The baseline (`baseline_r0015_st30`, 912 requests) shows:
+- TPOT p90=0.175s (vs 0.073s at 1 req/GPU): **prefill-decode interference is real**
+- APC=67.5% with per-instance range 46–84%
+- 58% of requests are HEAVY (≥20k), consuming 89% of input tokens
+
+Elastic PS offloads HEAVY prefills to a different GPU via Mooncake RDMA, isolating decode from prefill interference. Recent bug fixes:
+- D instance is now accounted for during the prefill phase (prevents D overload)
+- MAX_OFFLOAD_INFLIGHT=4 cap prevents runaway offloads
+- D's proxy cache is updated after decode (preserves session cache locality)
+
+## Machine
+
+dash0: 8×H20 96GB, NVLink, 4×CX7 200Gbps RDMA. SSH: `ssh dash0`.
+
+## Trace
+
+`traces/w600_r0.0015_st30.jsonl` on dash0 (1214 requests, 688 sessions, 70% multi-turn).
+Use `--requests 850` for ~13 min wall clock.
+
+## Experiments
+
+### Experiment 1: Baseline (Linear, PD-combined)
+
+```bash
+cd ~/agentic-kv && source .venv/bin/activate
+bash scripts/bench.sh \
+  --tag eval_baseline_linear \
+  --mode baseline --policy linear \
+  --trace traces/w600_r0.0015_st30.jsonl \
+  --requests 850
+```
+
+### Experiment 2: Elastic PS (Linear, kv_both + offload)
+
+```bash
+bash scripts/bench.sh \
+  --tag eval_elastic_linear \
+  --mode elastic --policy linear \
+  --trace traces/w600_r0.0015_st30.jsonl \
+  --requests 850
+```
+
+### Experiment 3: Baseline (LMetric, PD-combined)
+
+```bash
+bash scripts/bench.sh \
+  --tag eval_baseline_lmetric \
+  --mode baseline --policy lmetric \
+  --trace traces/w600_r0.0015_st30.jsonl \
+  --requests 850
+```
+
+### Experiment 4: Elastic PS (LMetric, kv_both + offload)
+
+```bash
+bash scripts/bench.sh \
+  --tag eval_elastic_lmetric \
+  --mode elastic --policy lmetric \
+  --trace traces/w600_r0.0015_st30.jsonl \
+  --requests 850
+```
+
+## What to Measure
+
+For each experiment, collect from `outputs//`:
+1. `metrics.summary.json`: TTFT (mean/p50/p90), TPOT (mean/p50/p90), E2E, success rate
+2. `apc.txt`: per-instance prefix cache hit rate
+3. `breakdown.json`: per-request routing class (WARM/MEDIUM/HEAVY_COLO/HEAVY_OFFLOAD/HEAVY_COLO_FALLBACK)
+4. `stats.json`: per-instance load at end of run
+
+## Analysis
+
+After all 4 experiments, compare:
+
+```python
+import json
+
+def summarize(path):
+    s = json.load(open(path))  # metrics.summary.json for one run
+    return {
+        "ok": "%d/%d" % (s["success_count"], s["request_count"]),
+        "ttft_mean": "%.2f" % s["ttft_stats_s"]["mean"],
+        "ttft_p50": "%.2f" % s["ttft_stats_s"]["p50"],
+        "ttft_p90": "%.2f" % s["ttft_stats_s"]["p90"],
+        "tpot_mean": "%.4f" % s["tpot_stats_s"]["mean"],
+        "tpot_p50": "%.4f" % s["tpot_stats_s"]["p50"],
+        "tpot_p90": "%.4f" % s["tpot_stats_s"]["p90"],
+        "e2e_p50": "%.2f" % s["latency_stats_s"]["p50"],
+    }
+
+for tag in ["eval_baseline_linear", "eval_elastic_linear",
+            "eval_baseline_lmetric", "eval_elastic_lmetric"]:
+    path = "outputs/%s/metrics.summary.json" % tag
+    print("%-30s %s" % (tag, summarize(path)))
+```
+
+Key questions:
+1. Does elastic PS reduce TPOT? (expect: yes, by isolating heavy prefills from decode)
+2. Does elastic PS hurt TTFT? (expect: some increase from RDMA overhead on offloaded requests)
+3. What's the net E2E impact? (TPOT improvement vs TTFT overhead)
+4. How many requests actually get offloaded? (check the HEAVY_OFFLOAD count in `breakdown.json`)
+5. Does the offload cap (MAX_OFFLOAD_INFLIGHT=4) get hit? (check `breakdown.json` for "cap_reached")
+6. Per-instance APC: does D maintain cache after migration? (compare APC spread)
+
+## Expected Results
+
+Based on prior analysis:
+- HEAVY requests: 58% of total, 89% of tokens
+- TPOT reduction potential: ~66% for WARM/MEDIUM (from 0.11s to 0.038s)
+- RDMA overhead: ~1-15s per offloaded request (bimodal)
+- Net: TPOT should improve if offload successfully isolates prefill
+- Risk: Mooncake kv_both memory overhead may negate gains (+11% TPOT in a prior low-concurrency experiment)
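Key questions 1 to 3 come down to a signed relative change between each baseline run and its elastic counterpart. The sketch below is one possible way to compute that, not part of the plan itself. It reuses the `metrics.summary.json` keys and the `outputs/%s/` layout from the analysis script above. The helper names (`relative_change`, `compare`, `report`) and the choice of which percentiles to report are assumptions made for illustration.

```python
import json

# (group, stat) pairs to compare; the key names come from the plan's analysis
# script, while the selection of percentiles here is an assumption.
METRICS = [
    ("ttft_stats_s", "p90"),
    ("tpot_stats_s", "p90"),
    ("latency_stats_s", "p50"),
]


def relative_change(baseline, candidate):
    """Signed fractional change of candidate vs baseline (negative = lower)."""
    return (candidate - baseline) / baseline


def compare(base_summary, elastic_summary):
    """Map "group.stat" to the relative change between two loaded summaries."""
    return {
        "%s.%s" % (group, stat): relative_change(
            base_summary[group][stat], elastic_summary[group][stat]
        )
        for group, stat in METRICS
    }


def report(policies=("linear", "lmetric")):
    """Print elastic-vs-baseline changes per policy (hypothetical helper)."""
    for policy in policies:
        summaries = []
        for mode in ("baseline", "elastic"):
            path = "outputs/eval_%s_%s/metrics.summary.json" % (mode, policy)
            with open(path) as f:
                summaries.append(json.load(f))
        for name, change in compare(*summaries).items():
            print("%-8s %-20s %+.1f%%" % (policy, name, 100 * change))


# Worked check of the expected WARM/MEDIUM TPOT figure (0.11s -> 0.038s).
print("%+.1f%%" % (100 * relative_change(0.11, 0.038)))  # prints -65.5%
```

Calling `report()` from the repository root after all four runs finish would print one line per policy and metric. A negative TPOT change alongside a positive TTFT change would match the expectations listed under Key questions.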