RS4 vLLM GPU Smoke
RS4 starts a real serving baseline for ReplayServe. This is separate from the Frontier dummy/patched simulator smoke: it checks whether the Qwen block-hash trace can drive a real vLLM engine with the intended arrival, prompt length, decode length, and prefix reuse patterns.
Setup
- Host: dash2
- GPU: NVIDIA H20
- Model: /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B
- Runtime: Python 3.12.3, vLLM 0.11.1
- Fixture: traces/fixtures/coder_100
- Runner: tools/vllm_synthetic_replay.py
- Replay mode: online, trace-relative timestamps preserved
- Prompt mode: prompt_token_ids, generated synthetically from trace block hashes
- Common vLLM knobs: max_model_len=32768, block_size=16, max_num_batched_tokens=32768, gpu_memory_utilization=0.85, prefix caching on, chunked prefill on
The Qwen trace does not expose original token IDs or text. The runner maps each block hash deterministically to one stable synthetic token block. Equal block hashes therefore produce equal token blocks, preserving arrival, length, and block-prefix sharing patterns, but not original text semantics.
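The mapping described above can be illustrated with a minimal sketch. This is not the code in tools/vllm_synthetic_replay.py; the function names, the SHA-256 derivation, the `VOCAB_LIMIT` value, and the truncation of the final block are all assumptions made for illustration. The only properties it is meant to show are that the same block hash always yields the same 16-token block and that shared hash prefixes yield shared token prefixes.

```python
import hashlib
import struct

BLOCK_SIZE = 16        # matches block_size=16 in the setup above
VOCAB_LIMIT = 100_000  # hypothetical upper bound on token IDs, not the real vocab size


def block_hash_to_tokens(block_hash, block_size=BLOCK_SIZE, vocab_limit=VOCAB_LIMIT):
    """Derive a stable pseudo-random token block from one trace block hash."""
    tokens = []
    counter = 0
    while len(tokens) < block_size:
        digest = hashlib.sha256(f"{block_hash}:{counter}".encode()).digest()
        # A 32-byte digest unpacks into eight unsigned 32-bit integers.
        for (value,) in struct.iter_unpack("<I", digest):
            tokens.append(value % vocab_limit)
            if len(tokens) == block_size:
                break
        counter += 1
    return tokens


def prompt_from_hashes(block_hashes, prompt_len=None):
    """Concatenate token blocks; optionally truncate to the traced prompt length."""
    token_ids = [tok for h in block_hashes for tok in block_hash_to_tokens(h)]
    return token_ids if prompt_len is None else token_ids[:prompt_len]


if __name__ == "__main__":
    a = prompt_from_hashes([101, 102, 103])
    b = prompt_from_hashes([101, 102, 999])
    # The first two blocks (32 tokens) are shared, the third block differs.
    print(a[:32] == b[:32], a[32:] == b[32:])
```

Deriving tokens from a cryptographic digest rather than from Python's built-in `hash` keeps the mapping stable across processes and machines, which matters if prompts are regenerated on a different host than the one that produced the trace.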
Runs
The first smoke used single-request runs for engine bring-up, 32-request capped
runs for prefix-cache validation, 32-request uncapped runs for a first
real-output baseline, and full coder_100 uncapped runs for the first useful
TP=1/2 comparison.
| run | TP | rows | prompt toks | gen toks | wall s | RPS | prompt tok/s | gen tok/s | TTFT p50/p95 s | TPOT p50/p95 s | E2E p50/p95 s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tp1_limit1 | 1 | 1 | 1008 | 4 | 1.861 | 0.537 | 541.5 | 2.1 | 1.255/1.255 | 0.007/0.007 | 1.274/1.274 |
| tp2_limit1 | 2 | 1 | 1008 | 4 | 2.269 | 0.441 | 444.3 | 1.8 | 1.317/1.317 | 0.008/0.008 | 1.340/1.340 |
| tp1_limit32_o8 | 1 | 32 | 120813 | 253 | 11.244 | 2.846 | 10744.4 | 22.5 | 3.974/5.051 | 0.387/1.081 | 7.157/9.817 |
| tp2_limit32_o8 | 2 | 32 | 120813 | 253 | 9.071 | 3.528 | 13318.2 | 27.9 | 1.881/3.324 | 0.285/0.727 | 4.368/7.043 |
| tp1_limit32_uncapped | 1 | 32 | 120813 | 22209 | 41.874 | 0.764 | 2885.1 | 530.4 | 1.276/1.842 | 0.024/0.102 | 14.366/29.523 |
| tp2_limit32_uncapped | 2 | 32 | 120813 | 22209 | 33.588 | 0.953 | 3596.9 | 661.2 | 0.961/1.700 | 0.017/0.071 | 10.786/21.570 |
| tp1_coder100_uncapped | 1 | 100 | 474554 | 82479 | 145.351 | 0.688 | 3264.9 | 567.4 | 4.503/29.060 | 0.066/0.621 | 41.841/97.366 |
| tp2_coder100_uncapped | 2 | 100 | 474554 | 82479 | 102.001 | 0.980 | 4652.5 | 808.6 | 1.951/10.355 | 0.049/0.262 | 25.678/61.971 |
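
The rate columns in the table appear to be plain totals divided by wall time; the values are consistent with that reading to within rounding. The sketch below recomputes them for the two full coder_100 rows as a sanity check. The function and key names are hypothetical and do not mirror the summary.json schema.

```python
def derived_rates(rows, prompt_toks, gen_toks, wall_s):
    """Recompute throughput columns from totals and wall-clock seconds."""
    return {
        "rps": rows / wall_s,
        "prompt_tok_s": prompt_toks / wall_s,
        "gen_tok_s": gen_toks / wall_s,
    }


# Totals copied from the tp1_coder100_uncapped and tp2_coder100_uncapped rows.
tp1 = derived_rates(rows=100, prompt_toks=474554, gen_toks=82479, wall_s=145.351)
tp2 = derived_rates(rows=100, prompt_toks=474554, gen_toks=82479, wall_s=102.001)

# Same workload on both runs, so the speedup is just the wall-time ratio.
tp2_speedup = tp2["gen_tok_s"] / tp1["gen_tok_s"]

if __name__ == "__main__":
    print({k: round(v, 3) for k, v in tp1.items()})
    print({k: round(v, 3) for k, v in tp2.items()})
    print(round(tp2_speedup, 3))
```

Because both runs replay the same requests, the TP=2 throughput gain on coder_100 is roughly 1.4x, well short of 2x, which is worth keeping in mind when these rows seed simulator calibration.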
Artifacts were copied back from dash2 to:
runs/vllm_gpu_smoke_20260624/
That directory is ignored by git. Each run contains summary.json and
request_metrics.csv; the 32-request runs also keep stdout.log.
KV Capacity
vLLM estimated KV capacity from actual H20 memory profiling:
| TP | weights memory | available KV memory | GPU KV cache size | max concurrency at 32768 tokens/request |
|---|---|---|---|---|
| 1 | 56.93 GiB | 22.39 GiB | 244,512 tokens | 7.46x |
| 2 | 28.50 GiB/rank | 50.58 GiB/rank | 1,104,880 tokens | 33.72x |
This satisfies the RS4 requirement that KV capacity comes from the real GPU memory planner rather than a manually fixed block count.
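The last column of the table follows directly from the reported KV cache size and max_model_len=32768. A small sketch of that arithmetic is below; it also derives the implied KV memory per cached token per rank from the table values. The function names are hypothetical, and the per-token figure is an inference from the table, not a number reported by vLLM.

```python
MAX_MODEL_LEN = 32768  # max_model_len from the setup above
KIB_PER_GIB = 1024 * 1024


def max_concurrency(kv_cache_tokens, max_model_len=MAX_MODEL_LEN):
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


def kib_per_token(available_kv_gib, kv_cache_tokens):
    """Implied KV memory per cached token on one rank, in KiB."""
    return available_kv_gib * KIB_PER_GIB / kv_cache_tokens


if __name__ == "__main__":
    # Values copied from the KV capacity table.
    print(round(max_concurrency(244_512), 2))       # TP=1
    print(round(max_concurrency(1_104_880), 2))     # TP=2
    print(round(kib_per_token(22.39, 244_512), 1))    # TP=1, per rank
    print(round(kib_per_token(50.58, 1_104_880), 1))  # TP=2, per rank
```

The per-rank cost per token comes out at roughly half under TP=2, which is consistent with the KV cache being split across the two ranks. Combined with the smaller per-rank weight footprint, that explains why the TP=2 token capacity is about 4.5x the TP=1 capacity rather than 2x.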
Prefix-Cache Check
For the first 32 coder requests, ReplayServe estimated:
- query blocks: 7,564
- hit blocks: 1,786
- block hit ratio: 0.236118456
- query tokens: 120,813
- hit tokens: 28,576
- token hit ratio: 0.236530837
The vLLM scheduler logs for both TP=1 and TP=2 reported exactly 32 request
starts, and their computed: token counts summed to 28,576 in both capped and
uncapped runs, matching the estimated hit tokens. The largest single hit was
11,552 tokens. Examples include:
Request 16 started running, prompt: 12296, computed: 11552
Request 26 started running, prompt: 5836, computed: 4336
Request 30 started running, prompt: 11017, computed: 10768
So this smoke validates the core ReplayServe invariant: identical Qwen block hash prefixes become identical synthetic token prefixes, and vLLM's prefix cache actually reuses them.
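A trace-side estimate of this kind can be sketched with a prefix trie over block hashes. This is an illustration only, not the ReplayServe estimator: it assumes requests are processed in arrival order, the cache is unbounded with no eviction, and a block counts as a hit only when every earlier block in the same request also matched. The function name and return keys are hypothetical.

```python
def estimate_prefix_hits(requests, block_size=16):
    """Count prefix-cache block hits for requests given as lists of block hashes."""
    trie = {}
    query_blocks = 0
    hit_blocks = 0
    for block_hashes in requests:
        node = trie
        for block_hash in block_hashes:
            query_blocks += 1
            # After the first miss, `node` is a freshly created empty dict,
            # so later blocks in this request cannot register as hits.
            if block_hash in node:
                hit_blocks += 1
            node = node.setdefault(block_hash, {})
    ratio = hit_blocks / query_blocks if query_blocks else 0.0
    return {
        "query_blocks": query_blocks,
        "hit_blocks": hit_blocks,
        "block_hit_ratio": ratio,
        "hit_tokens": hit_blocks * block_size,  # hits are whole blocks
    }


if __name__ == "__main__":
    toy_trace = [[1, 2, 3], [1, 2, 4], [5], [1, 2, 3, 6]]
    print(estimate_prefix_hits(toy_trace))
```

Hit tokens are a whole number of blocks (1,786 x 16 = 28,576 above), while query tokens use real prompt lengths and may end in a partial block. That is why the token hit ratio differs slightly from the block hit ratio.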
For full coder_100, ReplayServe estimated:
- query blocks: 29,705
- hit blocks: 7,447
- block hit ratio: 0.250698536
- query tokens: 474,554
- hit tokens: 119,152
- token hit ratio: 0.251082069
The TP=2 full coder_100 run had no preemptions. Its vLLM computed: sum was
119,152, matching the trace-side estimate exactly. The TP=1 run had 8
preemptions across repeated starts for requests 70, 71, 72, 77, and 94. In that
case, raw computed: sums are not a simple prefix-hit ratio:
| run | starts | unique requests | preemptions | all-start computed | first-start computed | last-start computed | max/request computed | estimated hit tokens |
|---|---|---|---|---|---|---|---|---|
| tp1_coder100_uncapped | 108 | 100 | 8 | 180896 | 108560 | 141744 | 141984 | 119152 |
| tp2_coder100_uncapped | 100 | 100 | 0 | 119152 | 119152 | 119152 | 119152 | 119152 |
Use tools/analyze_vllm_prefix_log.py to reproduce this parsing.
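For readers without the repository at hand, the aggregation behind the table can be approximated as follows. This is a hypothetical re-implementation, not the contents of tools/analyze_vllm_prefix_log.py. It assumes the log line shape shown in the examples earlier in this section, treats every repeated start of a request as following one preemption, and takes "max/request computed" to mean the sum of each request's largest computed: value.

```python
import re
from collections import defaultdict

# Line shape taken from the example log lines in this document; real log
# lines may carry timestamps or logger prefixes, so search() is used.
START_RE = re.compile(r"Request (\S+) started running, prompt: (\d+), computed: (\d+)")


def summarize_prefix_log(lines):
    """Aggregate per-request computed: values across all scheduler starts."""
    computed_by_request = defaultdict(list)
    for line in lines:
        match = START_RE.search(line)
        if match:
            request_id, _prompt, computed = match.groups()
            computed_by_request[request_id].append(int(computed))
    per_request = list(computed_by_request.values())
    starts = sum(len(values) for values in per_request)
    return {
        "starts": starts,
        "unique_requests": len(per_request),
        "repeated_starts": starts - len(per_request),  # assumed == preemptions
        "all_start_computed": sum(sum(values) for values in per_request),
        "first_start_computed": sum(values[0] for values in per_request),
        "last_start_computed": sum(values[-1] for values in per_request),
        "max_per_request_computed": sum(max(values) for values in per_request),
    }


if __name__ == "__main__":
    sample = [
        "Request 16 started running, prompt: 12296, computed: 11552",
        "Request 26 started running, prompt: 5836, computed: 4336",
        # Invented restart line, for illustration only (not from a real log).
        "Request 16 started running, prompt: 12296, computed: 12000",
    ]
    print(summarize_prefix_log(sample))
```

With no preemptions, all four computed sums coincide, as in the TP=2 row. Once requests restart, only the first-start sum is comparable in spirit to the trace-side estimate, and even that can diverge because preemption changes what is cached when later requests arrive.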
Reliability Boundary
These numbers are useful for mechanism validation and for seeding simulator calibration. They are not final serving throughput claims because:
- Some bring-up runs capped decode length to 4 or 8 tokens.
- The largest real-output baseline so far is coder_100, not coder_2000 or the full coder trace.
- Synthetic token IDs preserve block identity and length but not original text distribution.
- Prefix reuse in request_metrics.csv is a trace-side estimate. For real scheduler hit/miss behavior, use vLLM stdout.log computed: fields and account for preemption/re-admission.
- This run uses H20 and Qwen3-30B-A3B, while the earlier Frontier smoke used dummy A800/Qwen3-32B plumbing. They should be compared as calibration inputs, not as one-to-one simulator accuracy evidence yet.
Next
- Move to coder_2000 once runtime and queueing cost are acceptable.
- Add the vLLM log parser output into the run aggregation summary.
- Compare vLLM real-backend TTFT/TPOT/E2E against Frontier outputs only after selecting a matched model/hardware/profile policy.
See docs/rs4_frontier_h20_tp1_alignment.md for the first Frontier H20 TP1
alignment run against real vLLM TP1.