Production-realistic baseline: APC 67.5%, TPOT +139% from interference

Updated methodology:
- Window+thin sampling preserves cross-session sharing (48% vs 16%)
- --max-single-turn-ratio 0.3 boosts multi-turn to 70%
- --window-seconds 600 for 10-min contiguous window
- Trace-driven replay (no session limit, no time compression)
- Daily config: --requests 850 (~13 min, APC~76%)

Key result: TPOT p90=0.175s (vs 0.073s in legacy 1-req/GPU setup),
confirming prefill-decode interference is real at production concurrency.
APC 67.5% (vs 44%) from better KV reuse preservation.

Also fixed KV reuse breakdown: 62% intra-session / 38% cross-session
(was incorrectly reported as 91% / 9%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 15:44:34 +08:00
parent d8dc9dc0ce
commit bf037594c4
3 changed files with 111 additions and 54 deletions


@@ -42,10 +42,30 @@ For agentic LLM workloads (long input, short output, high KV cache reuse), is pr
| Avg output tokens | 445 (p50=80) | | Avg output tokens | 445 (p50=80) |
| I/O ratio | 75.6× aggregate | | I/O ratio | 75.6× aggregate |
| Prefill token share | 98% | | Prefill token share | 98% |
| KV reuse (intra-session) | 91% of reusable blocks | | KV reuse breakdown | 62% intra-session, 38% cross-session (token-level) |
| Theoretical max APC | 71% (infinite cache, single instance) | | Theoretical max APC | 67% (infinite cache, single instance, prefix-only) |
**Sampled trace for benchmarks**: `traces/sampled_1000req_seed42.jsonl` (1000 requests, seed=42, preserving session structure). For 200-request ablations: replayer `--request-limit 200`. **Sampled trace for benchmarks**: `traces/w600_r0.0015_st30.jsonl` (1214 requests, 688 sessions, 70% multi-turn). Generated with window+thin sampling:
```bash
python scripts/sample_trace.py \
--input ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
--output traces/w600_r0.0015_st30.jsonl \
--sample-ratio 0.0015 --max-single-turn-ratio 0.3 \
--window-seconds 600 --seed 42
```
| Trace property | Value |
|---------------|-------|
| Sessions | 688 (70% multi-turn, avg 4.9 turns) |
| Requests | 1214 (use `--request-limit 850` for daily, full for validation) |
| Avg input tokens | 48,776 |
| Trace span | 2912s (48.5 min); dense segment 0-990s (850 req) |
| Peak QPS | 1.6 req/s (in dense segment) |
| Hash block sharing | 48.3% (vs 52% full trace) |
| Theoretical APC | 80% (full), 76% (first 850 req) |
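
The effect of `--max-single-turn-ratio 0.3` can be illustrated with a small sketch. The body of the repository's `_cap_single_turn` is not shown here, so the function below is a hypothetical stand-in: the name `cap_single_turn`, its `turns_by_session` argument, and the choice to drop surplus single-turn sessions at random are all assumptions.

```python
import random


def cap_single_turn(turns_by_session, max_ratio, seed=42):
    """Drop single-turn sessions until they are at most max_ratio of the total.

    turns_by_session: dict mapping session id -> number of turns (hypothetical input).
    """
    if not 0.0 <= max_ratio < 1.0:
        raise ValueError("max_ratio must be in [0, 1)")
    rng = random.Random(seed)
    multi = [s for s, n in turns_by_session.items() if n > 1]
    single = [s for s, n in turns_by_session.items() if n == 1]

    # single / (single + multi) <= r  <=>  single <= multi * r / (1 - r)
    max_single = int(len(multi) * max_ratio / (1.0 - max_ratio))
    if len(single) > max_single:
        single = rng.sample(single, max_single)
    return multi + single
```

With a cap of 0.3, every multi-turn session is kept and single-turn sessions make up at most 30% of the result, which is consistent with the 70% multi-turn share reported for this trace.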
> **Sampling methodology (2026-05-23)**: Prior traces used random session sampling + `--time-scale` compression + `--max-inflight-sessions` semaphore, which (a) destroyed cross-session hash block sharing (52% → 16%), (b) artificially limited concurrency to 1 req/GPU, and (c) masked prefill-decode interference. The new approach uses contiguous time-window sampling with session thinning (`--max-single-turn-ratio 0.3`) to preserve KV reuse patterns, and trace-driven replay with no artificial concurrency limits.
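
A minimal standalone sketch of the fixed-window variant of window+thin sampling follows. It mirrors the logic added to `_sample_window_then_thin` in this commit, but it is simplified for illustration: it takes a plain list of `(first_request_timestamp, session_id)` tuples rather than the script's `rows_by_session` structure, and the function name and demo data are assumptions.

```python
import random


def sample_window_then_thin(session_starts, ratio, window_seconds, seed=42):
    """Pick a contiguous time window, then thin the sessions inside it.

    session_starts: list of (first_request_timestamp, session_id) tuples.
    ratio: target fraction of all sessions to keep.
    """
    rng = random.Random(seed)
    starts = sorted(session_starts)
    target_n = max(1, int(len(starts) * ratio))
    trace_start, trace_end = starts[0][0], starts[-1][0]

    # Random window start, clamped so the window stays inside the trace.
    max_start_t = trace_end - window_seconds
    if max_start_t <= trace_start:
        win_start = trace_start
    else:
        win_start = trace_start + rng.random() * (max_start_t - trace_start)
    win_end = win_start + window_seconds

    in_window = [sid for t, sid in starts if win_start <= t <= win_end]

    # Thin only if the window holds more sessions than the target.
    if len(in_window) > target_n:
        keep_p = target_n / len(in_window)
        in_window = [sid for sid in in_window if rng.random() < keep_p]
    return in_window


if __name__ == "__main__":
    # Synthetic trace: one session starting every second for an hour.
    demo = [(float(t), f"s{t}") for t in range(3600)]
    picked = sample_window_then_thin(demo, ratio=0.05, window_seconds=600)
    print(len(picked), "sessions kept from a 600 s window")
```

Because thinning is a per-session coin flip, the number of sessions kept only approximates the target. Sessions that start close together in time are kept together, which is what lets cross-session prefix sharing survive the sampling.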
### 2.4 Two Configurations Compared ### 2.4 Two Configurations Compared
@@ -119,9 +139,10 @@ python scripts/cache_aware_proxy.py \
| Parameter | Value | | Parameter | Value |
|-----------|-------| |-----------|-------|
| Requests | 200 (from sampled 1000-req trace, `--request-limit 200`) | | Trace | `traces/w600_r0.0015_st30.jsonl` (window+thin, 70% multi-turn) |
| Time scale | 20× (compress 2h trace into ~6min) | | Daily iteration | `--request-limit 850` (~13 min, APC≈76%) |
| Max inflight sessions | 8 | | Full validation | All 1214 requests (~48 min, APC≈80%) |
| Replay mode | Trace-driven (no session limit, no time compression) |
| Request timeout | 600s | | Request timeout | 600s |
| vLLM flags | `--enforce-eager --enable-prefix-caching --max-model-len 200000` | | vLLM flags | `--enforce-eager --enable-prefix-caching --max-model-len 200000` |
| GPU memory util | 0.9 | | GPU memory util | 0.9 |
@@ -139,31 +160,22 @@ python scripts/sample_trace.py \
--output traces/sampled_1000req_seed42.jsonl \ --output traces/sampled_1000req_seed42.jsonl \
--target-requests 1000 --seed 42 --target-requests 1000 --seed 42
# Start GPU monitoring (in a separate terminal) # Run benchmark (daily iteration)
bash scripts/gpu_monitor.sh > outputs/<tag>/gpu_util.csv & bash scripts/bench.sh --tag my_experiment --mode baseline --policy linear \
--trace traces/w600_r0.0015_st30.jsonl --requests 850
# Run replayer against proxy # Run benchmark (full validation)
python -m replayer \ bash scripts/bench.sh --tag my_experiment_full --mode baseline --policy linear \
--trace traces/sampled_1000req_seed42.jsonl \ --trace traces/w600_r0.0015_st30.jsonl
--output outputs/<tag>/metrics.jsonl \
--endpoint http://localhost:9090 \
--time-scale 20 --max-inflight-sessions 8 \
--request-limit 200 -v
# Collect proxy breakdown (elastic only)
curl -s http://localhost:9090/breakdown > outputs/<tag>/breakdown.json
# Collect APC from vLLM logs
for i in $(seq 0 7); do
grep "Prefix cache hit rate\|External prefix cache hit rate" /tmp/<prefix>_$i.log | tail -2
done
``` ```
## 3. Results ## 3. Results
> **Errata (2026-05-22)**: The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported -44% E2E improvement. Post-hoc analysis revealed the dash0 baseline instances were **not freshly restarted** — residual KV cache from prior experiments caused 2× TTFT inflation. All results below use verified fresh-restart experiments on the same machine. > **Errata (2026-05-22)**: The initial cross-machine A/B (dash0 baseline vs dash1 elastic) reported -44% E2E improvement. Post-hoc analysis revealed the dash0 baseline instances were **not freshly restarted** — residual KV cache from prior experiments caused 2× TTFT inflation.
### 3.1 Fair Comparison (all fresh-restart, same machine dash0, 200 req) > **Errata (2026-05-23)**: §3.1 results used artificial concurrency limits (`--max-inflight-sessions 8`, 1 req/GPU) and random session sampling that destroyed cross-session KV sharing (52% → 16%). See §3.6 for production-realistic results with corrected methodology.
### 3.1 Legacy Comparison (artificial 1 req/GPU, 200 req)
| Config | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 | | Config | OK/N | TTFT p50 | TTFT p90 | TPOT p90 | E2E p50 |
|--------|------|----------|----------|----------|---------| |--------|------|----------|----------|----------|---------|
@@ -230,6 +242,35 @@ Delta: -45% -44% ← INVALID
The elastic numbers on dash1 were genuinely fresh. The "improvement" was actually comparing fresh elastic against degraded baseline. The elastic numbers on dash1 were genuinely fresh. The "improvement" was actually comparing fresh elastic against degraded baseline.
### 3.6 Production-Realistic Baseline (trace-driven, corrected methodology)
> Corrected sampling (window+thin, 70% multi-turn, block sharing 48%) and trace-driven replay (no session limit, no time compression). See §2.3 for trace details.
**Linear policy, 912 requests (dense segment), peak QPS ≈ 1.6:**
| Metric | Legacy (§3.1, 1 req/GPU) | **New (trace-driven)** | Delta |
|--------|-------------------------|----------------------|-------|
| TTFT mean | 1.07s | **4.54s** | 4.2× |
| TTFT p50 | 1.08s | **0.94s** | -13% |
| TTFT p90 | 9.38s | **14.12s** | +51% |
| TPOT p50 | 0.038s | **0.070s** | **+84%** |
| TPOT p90 | 0.073s | **0.175s** | **+139%** |
| APC (mean) | ~44% | **67.5%** | **+23pp** |
| Errors | 2/200 (1.0%) | 0/912 (0%) | better |
| E2E p50 | 5.08s | 6.98s | +37% |
**Key differences from legacy methodology:**
1. **APC 67.5% vs 44%**: Window+thin sampling preserves cross-session block sharing (48% vs 16% in legacy random sampling), yielding production-realistic cache hit rates. Per-instance APC ranges from 46% to 84%.
2. **TPOT +139% at p90**: With trace-driven replay, multiple concurrent requests per GPU create **real prefill-decode interference**. The legacy 1 req/GPU setup showed TPOT p90=0.073s (no interference), but production-realistic load shows TPOT p90=0.175s. This confirms that prefill-decode interference is a real problem at production concurrency.
3. **TTFT p50 improved (-13%) but mean degraded (4.2×)**: Higher APC gives cached requests a very fast TTFT (p50=0.94s), but concurrent heavy prefills make non-cached requests queue, inflating the mean and p90.
4. **Per-instance APC imbalance (46–84%)**: Routing quality directly determines per-instance APC. The 38pp gap between the worst and best instance suggests routing optimization is still the highest-leverage improvement.
**Output**: `outputs/baseline_r0015_st30/` on dash0.
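
As a sanity check, the Delta column of the table above can be recomputed from the reported legacy and new values. This is a small illustrative sketch: the helper name `pct_change` is an assumption, and the numbers are copied from the table, not read from the benchmark output files.

```python
def pct_change(old, new):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


# (legacy, new) pairs copied from the table above.
metrics = {
    "TTFT p50 (s)": (1.08, 0.94),
    "TTFT p90 (s)": (9.38, 14.12),
    "TPOT p50 (s)": (0.038, 0.070),
    "TPOT p90 (s)": (0.073, 0.175),
    "E2E p50 (s)": (5.08, 6.98),
}

for name, (old, new) in metrics.items():
    print(f"{name}: {pct_change(old, new):+.1f}%")

# TTFT mean is reported as a ratio rather than a percentage.
print(f"TTFT mean ratio: {4.54 / 1.07:.1f}x")
# APC is a difference in percentage points, not a relative change.
print(f"APC delta: {67.5 - 44.0:+.1f} pp")
```

The recomputed values agree with the table to within about one percentage point of rounding.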
## 4. System-Level Analysis ## 4. System-Level Analysis
### 4.1 Elastic P2P Does Not Improve Single-Machine Performance ### 4.1 Elastic P2P Does Not Improve Single-Machine Performance


@@ -21,7 +21,7 @@ VENV="${VENV_PATH:-$PROJECT_DIR/.venv/bin}"
PYTHON="$VENV/python" PYTHON="$VENV/python"
VLLM="$VENV/vllm" VLLM="$VENV/vllm"
MODEL="${MODEL_PATH:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}" MODEL="${MODEL_PATH:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
TRACE="$PROJECT_DIR/traces/sampled_1000req_seed42.jsonl" TRACE="${TRACE:-$PROJECT_DIR/traces/sampled_1000req_seed42.jsonl}"
# Defaults # Defaults
TAG="" TAG=""
@@ -44,6 +44,7 @@ while [[ $# -gt 0 ]]; do
--policy) POLICY="$2"; shift 2 ;; --policy) POLICY="$2"; shift 2 ;;
--instances) N_INSTANCES="$2"; shift 2 ;; --instances) N_INSTANCES="$2"; shift 2 ;;
--requests) REQUESTS="$2"; shift 2 ;; --requests) REQUESTS="$2"; shift 2 ;;
--trace) TRACE="$2"; shift 2 ;;
--heavy-threshold) HEAVY_THRESHOLD="$2"; shift 2 ;; --heavy-threshold) HEAVY_THRESHOLD="$2"; shift 2 ;;
--no-offload) NO_OFFLOAD=true; shift ;; --no-offload) NO_OFFLOAD=true; shift ;;
--overload-factor) OVERLOAD_FACTOR_ARG="$2"; shift 2 ;; --overload-factor) OVERLOAD_FACTOR_ARG="$2"; shift 2 ;;


@@ -70,13 +70,15 @@ def sample_sessions(
sample_ratio: float | None = None, sample_ratio: float | None = None,
target_requests: int | None = None, target_requests: int | None = None,
max_single_turn_ratio: float | None = None, max_single_turn_ratio: float | None = None,
window_seconds: float | None = None,
seed: int, seed: int,
) -> list[str]: ) -> list[str]:
"""Sample sessions preserving KV cache reuse.""" """Sample sessions preserving KV cache reuse."""
rng = random.Random(seed) rng = random.Random(seed)
if sample_ratio is not None: if sample_ratio is not None:
selected = _sample_window_then_thin(rows_by_session, sample_ratio, rng) selected = _sample_window_then_thin(rows_by_session, sample_ratio,
window_seconds, rng)
elif target_requests is not None: elif target_requests is not None:
all_sids = list(rows_by_session.keys()) all_sids = list(rows_by_session.keys())
rng.shuffle(all_sids) rng.shuffle(all_sids)
@@ -121,17 +123,18 @@ def _cap_single_turn(
def _sample_window_then_thin( def _sample_window_then_thin(
rows_by_session: dict[str, list[dict]], rows_by_session: dict[str, list[dict]],
ratio: float, ratio: float,
window_seconds: float | None,
rng: random.Random, rng: random.Random,
) -> list[str]: ) -> list[str]:
"""Window + thin sampling that preserves cross-session sharing. """Window + thin sampling that preserves cross-session sharing.
1. Compute first-request timestamp for each session. 1. Compute first-request timestamp for each session.
2. Pick a contiguous time window sized so that 2. Pick a contiguous time window:
window_sessions * thin_ratio ≈ total_sessions * ratio. - If --window-seconds given: use that duration, thin by ratio within it.
thin_ratio is kept >= 0.5 to preserve cross-session sharing. - Otherwise: auto-size so window_sessions * thin_ratio ≈ target.
3. Randomly drop (1 - thin_ratio) of sessions within the window. 3. Keep all sessions whose first request falls within the window.
4. Randomly thin sessions within the window to hit target count.
""" """
# Session start times (timestamp of first request)
session_starts: list[tuple[float, str]] = [] session_starts: list[tuple[float, str]] = []
for sid, rows in rows_by_session.items(): for sid, rows in rows_by_session.items():
t0 = min(float(r["timestamp"]) for r in rows) t0 = min(float(r["timestamp"]) for r in rows)
@@ -140,35 +143,44 @@ def _sample_window_then_thin(
total_sessions = len(session_starts) total_sessions = len(session_starts)
target_n = max(1, int(total_sessions * ratio)) target_n = max(1, int(total_sessions * ratio))
trace_start = session_starts[0][0]
trace_end = session_starts[-1][0]
trace_duration = trace_end - trace_start
# Determine thin_ratio and window size if window_seconds is not None:
# thin_ratio >= 0.5 to preserve sharing; prefer 1.0 if window fits # Fixed window: pick a random start, thin to hit target ratio
# window_sessions = target_n / thin_ratio max_start_t = trace_end - window_seconds
if max_start_t <= trace_start:
win_start_t = trace_start
else:
win_start_t = trace_start + rng.random() * (max_start_t - trace_start)
win_end_t = win_start_t + window_seconds
window_sids = [sid for t, sid in session_starts
if win_start_t <= t <= win_end_t]
# Thin to target
if len(window_sids) > target_n:
thin_ratio = target_n / len(window_sids)
window_sids = [s for s in window_sids if rng.random() < thin_ratio]
return window_sids
# Auto-size window
thin_ratio = min(1.0, max(0.5, ratio * 10)) thin_ratio = min(1.0, max(0.5, ratio * 10))
window_sessions = int(target_n / thin_ratio) window_sessions = min(int(target_n / thin_ratio), total_sessions)
window_sessions = min(window_sessions, total_sessions)
# Pick window start: random position in the trace
max_start = total_sessions - window_sessions max_start = total_sessions - window_sessions
if max_start <= 0: window_start = rng.randint(0, max_start) if max_start > 0 else 0
window_start = 0 window_sids = [sid for _, sid in
else: session_starts[window_start:window_start + window_sessions]]
window_start = rng.randint(0, max_start)
window_sids = [sid for _, sid in session_starts[window_start:window_start + window_sessions]] if thin_ratio < 1.0:
window_sids = [s for s in window_sids if rng.random() < thin_ratio]
# Thin within window if len(window_sids) > target_n * 1.2:
if thin_ratio >= 1.0: rng.shuffle(window_sids)
selected = window_sids window_sids = window_sids[:int(target_n * 1.1)]
else:
selected = [sid for sid in window_sids if rng.random() < thin_ratio]
# Ensure we don't overshoot target by too much return window_sids
if len(selected) > target_n * 1.2:
rng.shuffle(selected)
selected = selected[:int(target_n * 1.1)]
return selected
def build_output( def build_output(
@@ -245,6 +257,8 @@ def main() -> None:
help="Target number of requests (legacy, no sharing preservation)") help="Target number of requests (legacy, no sharing preservation)")
p.add_argument("--max-single-turn-ratio", type=float, default=None, p.add_argument("--max-single-turn-ratio", type=float, default=None,
help="Cap single-turn sessions to this fraction of total (e.g. 0.3)") help="Cap single-turn sessions to this fraction of total (e.g. 0.3)")
p.add_argument("--window-seconds", type=float, default=None,
help="Time window duration in seconds (e.g. 600 for 10 min)")
p.add_argument("--seed", type=int, default=42) p.add_argument("--seed", type=int, default=42)
args = p.parse_args() args = p.parse_args()
@@ -262,6 +276,7 @@ def main() -> None:
sample_ratio=args.sample_ratio, sample_ratio=args.sample_ratio,
target_requests=args.target_requests, target_requests=args.target_requests,
max_single_turn_ratio=args.max_single_turn_ratio, max_single_turn_ratio=args.max_single_turn_ratio,
window_seconds=args.window_seconds,
seed=args.seed, seed=args.seed,
) )