Add elastic PS evaluation plan for production-realistic trace
4 experiments: baseline vs elastic × linear vs lmetric.
Uses the corrected trace (w600_r0.0015_st30, 70% multi-turn, APC~76%) and the fixed elastic PS (D accounting, offload cap, cache sync).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
experiments/elastic_ps_eval.md
@@ -0,0 +1,120 @@
# Elastic PS Evaluation Plan

## Goal

Compare **baseline (PD-combined)** vs **elastic PS (selective prefill offload)** under a production-realistic trace on 8×H20.

## Context

The baseline (`baseline_r0015_st30`, 912 req) shows:
- TPOT p90=0.175s (vs 0.073s at 1 req/GPU): **prefill-decode interference is real**
- APC=67.5% with per-instance range 46–84%
- 58% of requests are HEAVY (≥20k tokens), consuming 89% of input tokens

Elastic PS offloads HEAVY prefills to a different GPU via Mooncake RDMA, isolating decode from prefill interference. Recent bug fixes:
- D instance now accounted during prefill phase (prevents D overload)
- MAX_OFFLOAD_INFLIGHT=4 cap prevents runaway offloads
- D's proxy cache updated after decode (preserves session cache locality)
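
To make the interaction between the D accounting and the offload cap concrete, here is a minimal sketch of a routing decision. Only the 20k HEAVY threshold, `MAX_OFFLOAD_INFLIGHT=4`, and the class names `HEAVY_OFFLOAD` / `HEAVY_COLO_FALLBACK` come from this plan. The `Router` class, its method names, the `WARM_OR_MEDIUM` placeholder, and the reading of `HEAVY_COLO_FALLBACK` as "cap reached" are assumptions for illustration, not the project's actual proxy code.

```python
from dataclasses import dataclass, field

HEAVY_MIN_TOKENS = 20_000   # HEAVY threshold stated in this plan
MAX_OFFLOAD_INFLIGHT = 4    # offload cap stated in this plan


@dataclass
class Router:
    """Hypothetical router state (illustration only)."""
    inflight_offloads: int = 0
    load: dict = field(default_factory=dict)  # instance id -> pending requests

    def route(self, tokens: int, d_instance: str, p_instance: str) -> str:
        # Charge the decode (D) instance right away, even if prefill runs
        # elsewhere, so D is not handed extra work during the prefill phase.
        self.load[d_instance] = self.load.get(d_instance, 0) + 1
        if tokens < HEAVY_MIN_TOKENS:
            # Placeholder: the WARM/MEDIUM boundary is not given in this plan.
            return "WARM_OR_MEDIUM"
        if self.inflight_offloads >= MAX_OFFLOAD_INFLIGHT:
            # Assumed meaning of the fallback class: cap reached, stay colocated.
            return "HEAVY_COLO_FALLBACK"
        self.inflight_offloads += 1
        self.load[p_instance] = self.load.get(p_instance, 0) + 1
        return "HEAVY_OFFLOAD"

    def prefill_done(self, p_instance: str) -> None:
        # Called once an offloaded prefill has handed its KV cache to D.
        self.inflight_offloads -= 1
        self.load[p_instance] -= 1
```

With this sketch, a fifth concurrent HEAVY request falls back to colocated prefill until `prefill_done` frees a slot, which is the behaviour the cap is meant to enforce.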

## Machine

dash0: 8×H20 96GB, NVLink, 4×CX7 200Gbps RDMA. SSH: `ssh dash0`.

## Trace

`traces/w600_r0.0015_st30.jsonl` on dash0 (1214 requests, 688 sessions, 70% multi-turn).
Use `--requests 850` for ~13 min wall clock.
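
Before running, it can be worth sanity-checking the trace against the numbers quoted above. The sketch below assumes each JSONL line is an object carrying a session identifier under a key named `session_id`; that key name is a guess and may differ in the real trace. Because "70% multi-turn" could be counted per session or per request, it reports both shares.

```python
import json
from collections import Counter


def trace_stats(lines, session_key="session_id"):
    """Count requests, sessions, and multi-turn shares in a JSONL trace.

    `session_key` is an assumed field name; adjust it to the real schema.
    """
    turns = Counter()
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            turns[json.loads(line)[session_key]] += 1
    requests = sum(turns.values())
    multi = {s: n for s, n in turns.items() if n > 1}
    return {
        "requests": requests,
        "sessions": len(turns),
        # Share of sessions that have more than one request.
        "multi_turn_session_share": len(multi) / len(turns) if turns else 0.0,
        # Share of requests that belong to such sessions.
        "multi_turn_request_share": sum(multi.values()) / requests if requests else 0.0,
    }
```

Usage on dash0 would be along the lines of `with open("traces/w600_r0.0015_st30.jsonl") as f: print(trace_stats(f))`.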

## Experiments

### Experiment 1: Baseline (Linear, PD-combined)

```bash
cd ~/agentic-kv && source .venv/bin/activate
bash scripts/bench.sh \
  --tag eval_baseline_linear \
  --mode baseline --policy linear \
  --trace traces/w600_r0.0015_st30.jsonl \
  --requests 850
```

### Experiment 2: Elastic PS (Linear, kv_both + offload)

```bash
bash scripts/bench.sh \
  --tag eval_elastic_linear \
  --mode elastic --policy linear \
  --trace traces/w600_r0.0015_st30.jsonl \
  --requests 850
```

### Experiment 3: Baseline (LMetric, PD-combined)

```bash
bash scripts/bench.sh \
  --tag eval_baseline_lmetric \
  --mode baseline --policy lmetric \
  --trace traces/w600_r0.0015_st30.jsonl \
  --requests 850
```

### Experiment 4: Elastic PS (LMetric, kv_both + offload)

```bash
bash scripts/bench.sh \
  --tag eval_elastic_lmetric \
  --mode elastic --policy lmetric \
  --trace traces/w600_r0.0015_st30.jsonl \
  --requests 850
```
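
The four invocations differ only in `--mode`, `--policy`, and the derived `--tag`, so they can be generated rather than copied. The following is a small sketch of a driver, assuming it is run from `~/agentic-kv` with the virtualenv already activated, as in Experiment 1. The function names are illustrative; the flags are exactly those used above.

```python
import itertools
import shlex
import subprocess

TRACE = "traces/w600_r0.0015_st30.jsonl"


def bench_commands(requests=850):
    """Build the four bench.sh command lines in the order of Experiments 1-4."""
    cmds = []
    for policy, mode in itertools.product(["linear", "lmetric"],
                                          ["baseline", "elastic"]):
        cmds.append([
            "bash", "scripts/bench.sh",
            "--tag", f"eval_{mode}_{policy}",
            "--mode", mode, "--policy", policy,
            "--trace", TRACE,
            "--requests", str(requests),
        ])
    return cmds


def run_all(dry_run=True):
    """Print each command; execute it sequentially unless dry_run is set."""
    for cmd in bench_commands():
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the sweep on the first failing experiment.
            subprocess.run(cmd, check=True)
```

Calling `run_all()` only prints the commands; `run_all(dry_run=False)` runs them one after another, so the experiments never compete for the same GPUs.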

## What to Measure

For each experiment, collect from `outputs/<tag>/`:
1. `metrics.summary.json`: TTFT (mean/p50/p90), TPOT (mean/p50/p90), E2E, success rate
2. `apc.txt`: per-instance prefix cache hit rate
3. `breakdown.json`: per-request routing class (WARM/MEDIUM/HEAVY_COLO/HEAVY_OFFLOAD/HEAVY_COLO_FALLBACK)
4. `stats.json`: per-instance load at end
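
A quick completeness check avoids analysing a run that silently failed to write one of these files. This sketch only tests that the four files listed above exist for each tag; it does not assume anything about their contents. The helper name `missing_outputs` is illustrative.

```python
from pathlib import Path

EXPECTED_FILES = ["metrics.summary.json", "apc.txt", "breakdown.json", "stats.json"]
TAGS = ["eval_baseline_linear", "eval_elastic_linear",
        "eval_baseline_lmetric", "eval_elastic_lmetric"]


def missing_outputs(root="outputs", tags=TAGS):
    """Return {tag: [missing file names]} for every tag lacking an expected file."""
    missing = {}
    for tag in tags:
        absent = [name for name in EXPECTED_FILES
                  if not (Path(root) / tag / name).is_file()]
        if absent:
            missing[tag] = absent
    return missing
```

An empty dictionary from `missing_outputs()` means all four experiments produced the full set of artifacts.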

## Analysis

After all 4 experiments, compare:

```python
import json


def summarize(path):
    """Load one metrics.summary.json and format the headline latency stats."""
    with open(path) as f:
        s = json.load(f)
    return {
        "ok": "%d/%d" % (s["success_count"], s["request_count"]),
        "ttft_mean": "%.2f" % s["ttft_stats_s"]["mean"],
        "ttft_p50": "%.2f" % s["ttft_stats_s"]["p50"],
        "ttft_p90": "%.2f" % s["ttft_stats_s"]["p90"],
        "tpot_mean": "%.4f" % s["tpot_stats_s"]["mean"],
        "tpot_p50": "%.4f" % s["tpot_stats_s"]["p50"],
        "tpot_p90": "%.4f" % s["tpot_stats_s"]["p90"],
        "e2e_p50": "%.2f" % s["latency_stats_s"]["p50"],
    }


# One row per experiment, in the order they were run.
for tag in ["eval_baseline_linear", "eval_elastic_linear",
            "eval_baseline_lmetric", "eval_elastic_lmetric"]:
    path = "outputs/%s/metrics.summary.json" % tag
    print("%-30s %s" % (tag, summarize(path)))
```

Key questions:
1. Does elastic PS reduce TPOT? (expect: yes, by isolating heavy prefills from decode)
2. Does elastic PS hurt TTFT? (expect: some increase from RDMA overhead on offloaded requests)
3. What's the net E2E impact? (TPOT improvement vs TTFT overhead)
4. How many requests actually get offloaded? (check breakdown.json HEAVY_OFFLOAD count)
5. Does the offload cap (MAX_OFFLOAD_INFLIGHT=4) get hit? (check breakdown for "cap_reached")
6. Per-instance APC: does D maintain cache after migration? (compare APC spread)
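
Questions 1 to 3 come down to signed relative changes between a baseline run and its elastic counterpart under the same policy. A minimal sketch follows, reusing the `ttft_stats_s`, `tpot_stats_s`, and `latency_stats_s` keys from the summary script above; the function names are illustrative.

```python
def relative_change(baseline, candidate):
    """Signed change of candidate vs baseline in percent (negative = lower)."""
    return (candidate - baseline) / baseline * 100.0


def compare(baseline_summary, elastic_summary):
    """Per-metric percent change between two loaded metrics.summary.json dicts."""
    fields = [
        ("ttft_stats_s", "p50"), ("ttft_stats_s", "p90"),
        ("tpot_stats_s", "p50"), ("tpot_stats_s", "p90"),
        ("latency_stats_s", "p50"),
    ]
    return {
        f"{group}.{stat}": relative_change(baseline_summary[group][stat],
                                           elastic_summary[group][stat])
        for group, stat in fields
    }
```

A negative `tpot_stats_s.p90` together with a positive `ttft_stats_s.p90` would match the expectations in questions 1 and 2; the sign of `latency_stats_s.p50` then answers question 3 for the median request.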

## Expected Results

Based on analysis:
- HEAVY requests: 58% of total, 89% of tokens
- TPOT reduction potential: ~66% for WARM/MEDIUM (from 0.11s to 0.038s)
- RDMA overhead: ~1–15s per offloaded request (bimodal)
- Net: TPOT should improve if offload successfully isolates prefill
- Risk: Mooncake kv_both memory overhead may negate gains (was +11% TPOT in prior experiment at low concurrency)