agentic-kvc/experiments/elastic_ps_eval.md
Gahow Wang 03e88b30bd Add elastic PS evaluation plan for production-realistic trace
4 experiments: baseline vs elastic × linear vs lmetric
Using corrected trace (w600_r0.0015_st30, 70% multi-turn, APC~76%)
and fixed elastic PS (D accounting, offload cap, cache sync).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 15:56:05 +08:00

# Elastic PS Evaluation Plan
## Goal
Compare **baseline (PD-combined)** vs **elastic PS (selective prefill offload)** under production-realistic trace on 8×H20.
## Context
The baseline (`baseline_r0015_st30`, 912 req) shows:
- TPOT p90=0.175s (vs 0.073s at 1 req/GPU) — **prefill-decode interference is real**
- APC=67.5% with per-instance range 46-84%
- 58% of requests are HEAVY (≥20k), consuming 89% of input tokens
Elastic PS offloads HEAVY prefills to a different GPU via Mooncake RDMA, isolating decode from prefill interference. Recent bug fixes:
- D instance now accounted during prefill phase (prevents D overload)
- MAX_OFFLOAD_INFLIGHT=4 cap prevents runaway offloads
- D's proxy cache updated after decode (preserves session cache locality)
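
The offload cap can be pictured as a small admission check. This is a minimal sketch, not the project's router. The 20k-token HEAVY threshold and the cap of 4 come from this plan. The function name `route`, the `NOT_HEAVY` label, and the rule that a HEAVY request falls back to colocated prefill once the cap is reached are assumptions made for illustration. The `HEAVY_COLO` class listed later in this plan is left out because its trigger is not described here.

```python
HEAVY_TOKENS = 20_000        # HEAVY threshold quoted in this plan (>= 20k input tokens)
MAX_OFFLOAD_INFLIGHT = 4     # offload cap quoted in this plan


def route(input_tokens: int, offloads_inflight: int) -> str:
    """Hypothetical routing decision for a single request."""
    if input_tokens < HEAVY_TOKENS:
        # The WARM/MEDIUM split is not specified in this plan, so it is not modelled.
        return "NOT_HEAVY"
    if offloads_inflight < MAX_OFFLOAD_INFLIGHT:
        return "HEAVY_OFFLOAD"
    # Assumption: once the cap is reached, the prefill stays on the decode GPU.
    return "HEAVY_COLO_FALLBACK"
```

A real router would also need to increment and decrement the in-flight counter around each offload; that bookkeeping is omitted here.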
## Machine
dash0: 8×H20 96GB, NVLink, 4×CX7 200Gbps RDMA. SSH: `ssh dash0`.
## Trace
`traces/w600_r0.0015_st30.jsonl` on dash0 (1214 requests, 688 sessions, 70% multi-turn).
Use `--requests 850` for ~13 min wall clock.
## Experiments
### Experiment 1: Baseline (Linear, PD-combined)
```bash
cd ~/agentic-kv && source .venv/bin/activate
bash scripts/bench.sh \
--tag eval_baseline_linear \
--mode baseline --policy linear \
--trace traces/w600_r0.0015_st30.jsonl \
--requests 850
```
### Experiment 2: Elastic PS (Linear, kv_both + offload)
```bash
bash scripts/bench.sh \
--tag eval_elastic_linear \
--mode elastic --policy linear \
--trace traces/w600_r0.0015_st30.jsonl \
--requests 850
```
### Experiment 3: Baseline (LMetric, PD-combined)
```bash
bash scripts/bench.sh \
--tag eval_baseline_lmetric \
--mode baseline --policy lmetric \
--trace traces/w600_r0.0015_st30.jsonl \
--requests 850
```
### Experiment 4: Elastic PS (LMetric, kv_both + offload)
```bash
bash scripts/bench.sh \
--tag eval_elastic_lmetric \
--mode elastic --policy lmetric \
--trace traces/w600_r0.0015_st30.jsonl \
--requests 850
```
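
The four runs form a 2×2 matrix (mode × policy) with identical trace and request count. The sketch below generates the same command lines from that matrix, which may help avoid copy-paste drift between runs. It only prints the commands; the helper name `bench_commands` is hypothetical, and the flags are exactly those used in the blocks above.

```python
import itertools
import shlex

TRACE = "traces/w600_r0.0015_st30.jsonl"
REQUESTS = 850


def bench_commands():
    """Yield (tag, argv) for every policy x mode combination, in experiment order."""
    for policy, mode in itertools.product(["linear", "lmetric"], ["baseline", "elastic"]):
        tag = f"eval_{mode}_{policy}"
        argv = ["bash", "scripts/bench.sh",
                "--tag", tag,
                "--mode", mode, "--policy", policy,
                "--trace", TRACE,
                "--requests", str(REQUESTS)]
        yield tag, argv


if __name__ == "__main__":
    for tag, argv in bench_commands():
        # Dry run: print each command instead of executing it.
        print(shlex.join(argv))
```

To execute instead of print, each `argv` could be passed to `subprocess.run(argv, check=True)` from the repository root with the virtualenv active.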
## What to Measure
For each experiment, collect from `outputs/<tag>/`:
1. `metrics.summary.json`: TTFT (mean/p50/p90), TPOT (mean/p50/p90), E2E, success rate
2. `apc.txt`: per-instance prefix cache hit rate
3. `breakdown.json`: per-request routing class (WARM/MEDIUM/HEAVY_COLO/HEAVY_OFFLOAD/HEAVY_COLO_FALLBACK)
4. `stats.json`: per-instance load at end
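
To turn `breakdown.json` into a per-class tally, something like the following may be enough. The file's schema is not given in this plan, so the sketch assumes a JSON list of per-request objects with a `routing_class` field; both the field name and the helper names are hypothetical and should be adjusted to the real output.

```python
import json
from collections import Counter


def load_breakdown(path):
    """Load breakdown.json; assumed to be a JSON list of per-request dicts."""
    with open(path) as f:
        return json.load(f)


def class_counts(records, key="routing_class"):
    """Count requests per routing class; `key` is an assumed field name."""
    return Counter(r.get(key, "UNKNOWN") for r in records)


def offload_share(records, key="routing_class"):
    """Fraction of all requests that were classified HEAVY_OFFLOAD."""
    counts = class_counts(records, key)
    total = sum(counts.values())
    return counts["HEAVY_OFFLOAD"] / total if total else 0.0
```

For example, `class_counts(load_breakdown("outputs/eval_elastic_linear/breakdown.json"))` would give the counts for Experiment 2 under these assumptions.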
## Analysis
After all 4 experiments, compare:
```python
import json


def summarize(path):
    """Load one metrics.summary.json and format the headline latency stats."""
    with open(path) as f:
        s = json.load(f)
    return {
        "ok": "%d/%d" % (s["success_count"], s["request_count"]),
        "ttft_mean": "%.2f" % s["ttft_stats_s"]["mean"],
        "ttft_p50": "%.2f" % s["ttft_stats_s"]["p50"],
        "ttft_p90": "%.2f" % s["ttft_stats_s"]["p90"],
        "tpot_mean": "%.4f" % s["tpot_stats_s"]["mean"],
        "tpot_p50": "%.4f" % s["tpot_stats_s"]["p50"],
        "tpot_p90": "%.4f" % s["tpot_stats_s"]["p90"],
        "e2e_p50": "%.2f" % s["latency_stats_s"]["p50"],
    }


# Print one summary row per experiment, in the order the runs are defined above.
for tag in ["eval_baseline_linear", "eval_elastic_linear",
            "eval_baseline_lmetric", "eval_elastic_lmetric"]:
    path = "outputs/%s/metrics.summary.json" % tag
    print("%-30s %s" % (tag, summarize(path)))
```
Key questions:
1. Does elastic PS reduce TPOT? (expect: yes, by isolating heavy prefills from decode)
2. Does elastic PS hurt TTFT? (expect: some increase from RDMA overhead on offloaded requests)
3. What's the net E2E impact? (TPOT improvement vs TTFT overhead)
4. How many requests actually get offloaded? (check breakdown.json HEAVY_OFFLOAD count)
5. Does the offload cap (MAX_OFFLOAD_INFLIGHT=4) get hit? (check `breakdown.json` for "cap_reached")
6. Per-instance APC: does D maintain cache after migration? (compare APC spread)
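
Questions 1 to 3 come down to signed relative changes between a baseline run and its elastic counterpart. A minimal sketch, assuming the `metrics.summary.json` layout used by the summary script in this section (`ttft_stats_s`, `tpot_stats_s`, `latency_stats_s`); the function names are hypothetical:

```python
def pct_change(base, new):
    """Signed percent change from base to new (negative means a reduction)."""
    return 100.0 * (new - base) / base


def compare(base_summary, elastic_summary):
    """Percent change of a few headline stats, elastic relative to baseline."""
    pairs = [("ttft_stats_s", "p90"), ("tpot_stats_s", "p90"), ("latency_stats_s", "p50")]
    return {
        "%s.%s" % (group, stat): pct_change(base_summary[group][stat],
                                            elastic_summary[group][stat])
        for group, stat in pairs
    }
```

Each pair should be compared within the same policy (linear against linear, lmetric against lmetric) so that the routing policy is not confounded with the mode.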
## Expected Results
Based on analysis:
- HEAVY requests: 58% of total, 89% of tokens
- TPOT reduction potential: ~66% for WARM/MEDIUM (from 0.11 s to 0.038 s)
- RDMA overhead: ~1-15s per offloaded request (bimodal)
- Net: TPOT should improve if offload successfully isolates prefill
- Risk: Mooncake kv_both memory overhead may negate gains (was +11% TPOT in prior experiment at low concurrency)
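
The trade-off between a TTFT penalty and a TPOT saving can be made concrete with a back-of-envelope calculation. The sketch assumes the common decomposition E2E ≈ TTFT + TPOT × (output tokens − 1), which this plan does not state explicitly. Note also that the quoted TPOT figures apply to WARM/MEDIUM requests, whereas the RDMA overhead applies to offloaded HEAVY requests, so plugging both into one formula is illustrative only.

```python
def relative_reduction(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


def break_even_output_tokens(ttft_penalty_s, tpot_before_s, tpot_after_s):
    """Output length at which the per-token saving repays the added TTFT.

    Assumes E2E = TTFT + TPOT * (n_out - 1); returns inf if there is no saving.
    """
    saving = tpot_before_s - tpot_after_s
    if saving <= 0:
        return float("inf")
    return 1 + ttft_penalty_s / saving


if __name__ == "__main__":
    print("TPOT reduction: %.1f%%" % (100 * relative_reduction(0.11, 0.038)))
    for penalty in (1.0, 15.0):   # the two ends of the quoted RDMA overhead range
        n = break_even_output_tokens(penalty, 0.11, 0.038)
        print("penalty %.0fs -> break-even at about %.0f output tokens" % (penalty, n))
```

Under these assumptions, a request that pays the low end of the overhead recovers it within a few dozen output tokens, while one at the high end needs a couple of hundred, which is why the bimodal overhead matters for the net E2E result.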