RS4 vLLM GPU Smoke

RS4 starts a real serving baseline for ReplayServe. This is separate from the Frontier dummy/patched simulator smoke: it checks whether the Qwen block-hash trace can drive a real vLLM engine with the intended arrival, prompt length, decode length, and prefix reuse patterns.

Setup

Host: dash2
GPU: NVIDIA H20
Model: /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B
Runtime: Python 3.12.3, vLLM 0.11.1
Fixture: traces/fixtures/coder_100
Runner: tools/vllm_synthetic_replay.py
Replay mode: online, trace-relative timestamps preserved
Prompt mode: prompt_token_ids, generated synthetically from trace block hashes
Common vLLM knobs: max_model_len=32768, block_size=16, max_num_batched_tokens=32768, gpu_memory_utilization=0.85, prefix caching on, chunked prefill on
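
As a rough sketch, the common knobs above correspond to the following vLLM engine arguments. The argument names are vLLM engine arguments; the dictionary name `COMMON_ENGINE_KWARGS` and the helper `engine_kwargs_for_tp` are illustrative assumptions and are not taken from tools/vllm_synthetic_replay.py, whose actual engine construction is not shown in this note.

```python
# Hypothetical mapping of the "Common vLLM knobs" line onto vLLM engine arguments.
# The dict and helper names are made up for illustration.
COMMON_ENGINE_KWARGS = {
    "model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B",
    "max_model_len": 32768,
    "block_size": 16,
    "max_num_batched_tokens": 32768,
    "gpu_memory_utilization": 0.85,
    "enable_prefix_caching": True,
    "enable_chunked_prefill": True,
}


def engine_kwargs_for_tp(tp: int) -> dict:
    """Return a copy of the common knobs with the tensor-parallel degree set."""
    kwargs = dict(COMMON_ENGINE_KWARGS)
    kwargs["tensor_parallel_size"] = tp
    return kwargs


# Example use (needs vLLM and a GPU, so it is left commented out):
# from vllm.engine.arg_utils import AsyncEngineArgs
# engine_args = AsyncEngineArgs(**engine_kwargs_for_tp(2))
```

Keeping everything except `tensor_parallel_size` in one shared dictionary makes it harder for the TP=1 and TP=2 runs to drift apart in any other knob.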

The Qwen trace does not expose original token IDs or text. The runner maps each block hash deterministically to one stable synthetic token block. Equal block hashes therefore produce equal token blocks, preserving arrival, length, and block-prefix sharing patterns, but not original text semantics.
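
One way such a deterministic mapping could look is sketched below. This is not the runner's actual code: the function names, the use of SHA-256, and the token ID range `[1000, 100000)` are all assumptions chosen for illustration (the range is meant to stay clear of special tokens and inside the vocabulary). Truncating the final block to the prompt length is also an assumption, made because some prompt lengths in the trace are not multiples of 16.

```python
import hashlib

BLOCK_SIZE = 16  # matches block_size in the setup above


def synthetic_block(block_hash, block_size=BLOCK_SIZE, lo=1000, hi=100_000):
    """Derive block_size token IDs deterministically from one block hash.

    lo/hi bound the generated IDs; both values are illustrative assumptions.
    """
    tokens = []
    for i in range(block_size):
        digest = hashlib.sha256(f"{block_hash}:{i}".encode()).digest()
        tokens.append(lo + int.from_bytes(digest[:8], "big") % (hi - lo))
    return tokens


def synthetic_prompt(block_hashes, prompt_len):
    """Concatenate per-block tokens and truncate to the traced prompt length."""
    ids = [tok for h in block_hashes for tok in synthetic_block(h)]
    return ids[:prompt_len]


# Two prompts sharing their first two block hashes share their first 32 tokens.
a = synthetic_prompt([11, 22, 33], 40)
b = synthetic_prompt([11, 22, 44], 48)
shared_prefix_ok = a[:32] == b[:32]
```

Because each token depends only on the block hash and the position inside the block, the mapping needs no state and gives the same prompt on every replay.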

Runs

The first smoke used single-request runs (decode capped at 4 tokens) for engine bring-up, 32-request runs with decode capped at 8 tokens (the _o8 runs) for prefix-cache validation, 32-request uncapped runs for a first real-output baseline, and full coder_100 uncapped runs for the first useful TP=1/2 comparison.

run	TP	rows	prompt toks	gen toks	wall (s)	RPS	prompt tok/s	gen tok/s	TTFT p50/p95 (s)	TPOT p50/p95 (s)	E2E p50/p95 (s)
`tp1_limit1`	1	1	1008	4	1.861	0.537	541.5	2.1	1.255/1.255	0.007/0.007	1.274/1.274
`tp2_limit1`	2	1	1008	4	2.269	0.441	444.3	1.8	1.317/1.317	0.008/0.008	1.340/1.340
`tp1_limit32_o8`	1	32	120813	253	11.244	2.846	10744.4	22.5	3.974/5.051	0.387/1.081	7.157/9.817
`tp2_limit32_o8`	2	32	120813	253	9.071	3.528	13318.2	27.9	1.881/3.324	0.285/0.727	4.368/7.043
`tp1_limit32_uncapped`	1	32	120813	22209	41.874	0.764	2885.1	530.4	1.276/1.842	0.024/0.102	14.366/29.523
`tp2_limit32_uncapped`	2	32	120813	22209	33.588	0.953	3596.9	661.2	0.961/1.700	0.017/0.071	10.786/21.570
`tp1_coder100_uncapped`	1	100	474554	82479	145.351	0.688	3264.9	567.4	4.503/29.060	0.066/0.621	41.841/97.366
`tp2_coder100_uncapped`	2	100	474554	82479	102.001	0.980	4652.5	808.6	1.951/10.355	0.049/0.262	25.678/61.971
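
The rate columns appear to be simple ratios over wall time, which the following sketch reproduces for the two coder_100 rows. The function name `derived_rates` is an assumption; because the wall time in the table is rounded, the last printed digit can differ slightly from the table.

```python
def derived_rates(rows, prompt_toks, gen_toks, wall_s):
    """Recompute the per-second columns of the runs table from its raw columns."""
    return {
        "rps": rows / wall_s,
        "prompt_tok_s": prompt_toks / wall_s,
        "gen_tok_s": gen_toks / wall_s,
    }


# Values copied from the tp1_coder100_uncapped and tp2_coder100_uncapped rows.
tp1 = derived_rates(100, 474554, 82479, 145.351)
tp2 = derived_rates(100, 474554, 82479, 102.001)

# Same workload on both runs, so the throughput ratio equals the wall-time ratio.
tp2_over_tp1 = tp2["rps"] / tp1["rps"]
print(f"TP=2 / TP=1 throughput ratio: {tp2_over_tp1:.2f}")
```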

Artifacts were copied back from dash2 to:

runs/vllm_gpu_smoke_20260624/

That directory is ignored by git. Each run contains summary.json and request_metrics.csv; the 32-request runs also keep stdout.log.

KV Capacity

vLLM estimated KV capacity from actual H20 memory profiling:

TP	weights memory	available KV memory	GPU KV cache size	max concurrency at 32768 tokens/request
1	56.93 GiB	22.39 GiB	244,512 tokens	7.46x
2	28.50 GiB/rank	50.58 GiB/rank	1,104,880 tokens	33.72x

This satisfies the RS4 requirement that KV capacity comes from the real GPU memory planner rather than a manually fixed block count.
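
The last column of the table is the KV cache size divided by max_model_len. The short sketch below reproduces that arithmetic and also converts the token capacity into 16-token blocks; the function names are illustrative assumptions.

```python
MAX_MODEL_LEN = 32768  # max_model_len from the setup
BLOCK_SIZE = 16        # block_size from the setup


def max_concurrency(kv_cache_tokens, max_model_len=MAX_MODEL_LEN):
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


def kv_blocks(kv_cache_tokens, block_size=BLOCK_SIZE):
    """Number of whole KV blocks for a given token capacity."""
    return kv_cache_tokens // block_size


for tp, tokens in [(1, 244_512), (2, 1_104_880)]:
    print(tp, f"{max_concurrency(tokens):.2f}x", kv_blocks(tokens), "blocks")
```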

Prefix-Cache Check

For the first 32 coder requests, ReplayServe estimated:

query blocks: 7,564
hit blocks: 1,786
block hit ratio: 0.236118456
query tokens: 120,813
hit tokens: 28,576
token hit ratio: 0.236530837

The vLLM scheduler logs for both TP=1 and TP=2 reported exactly 32 request starts, and their computed: token counts summed to 28,576 in both the capped and uncapped runs, matching the trace-side hit-token estimate. The largest single hit was 11,552 tokens. Examples include:

Request 16 started running, prompt: 12296, computed: 11552
Request 26 started running, prompt: 5836, computed: 4336
Request 30 started running, prompt: 11017, computed: 10768

So this smoke validates the core ReplayServe invariant: identical Qwen block hash prefixes become identical synthetic token prefixes, and vLLM's prefix cache actually reuses them.
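
A trace-side estimate of this kind can be sketched as follows. This is a simplified model, not ReplayServe's actual estimator: it assumes an unbounded cache, processes requests strictly in arrival order, and counts a block as a hit only while every earlier block of the same request was also seen before. The function name and the result keys are hypothetical.

```python
def estimate_prefix_hits(requests, block_size=16):
    """Estimate prefix-cache hits for a list of per-request block-hash lists."""
    seen = set()
    query_blocks = 0
    hit_blocks = 0
    for block_hashes in requests:
        hits = 0
        for h in block_hashes:
            if h not in seen:
                break  # prefix reuse stops at the first unseen block
            hits += 1
        query_blocks += len(block_hashes)
        hit_blocks += hits
        seen.update(block_hashes)  # this request's blocks are now cached
    return {
        "query_blocks": query_blocks,
        "hit_blocks": hit_blocks,
        "hit_tokens": hit_blocks * block_size,
        "block_hit_ratio": hit_blocks / query_blocks if query_blocks else 0.0,
    }


# Toy trace: the second request reuses two leading blocks of the first;
# the third starts with an unseen block, so its later match does not count.
toy = estimate_prefix_hits([[1, 2, 3], [1, 2, 4], [5, 1]])
```

The hit-token figure here is hit blocks times the block size, which is consistent with the numbers above (1,786 blocks x 16 = 28,576 tokens).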

For full coder_100, ReplayServe estimated:

query blocks: 29,705
hit blocks: 7,447
block hit ratio: 0.250698536
query tokens: 474,554
hit tokens: 119,152
token hit ratio: 0.251082069

The TP=2 full coder_100 run had no preemptions, and its vLLM computed: sum was 119,152, matching the trace-side estimate exactly. The TP=1 run had 8 preemptions, visible as repeated starts for requests 70, 71, 72, 77, and 94. Each re-admission logs another start with its own computed: value, so the raw computed: sums for that run mix first admissions with restarts and cannot be read as a simple prefix-hit ratio:

run	starts	unique requests	preemptions	all-start computed	first-start computed	last-start computed	max/request computed	estimated hit tokens
`tp1_coder100_uncapped`	108	100	8	180896	108560	141744	141984	119152
`tp2_coder100_uncapped`	100	100	0	119152	119152	119152	119152	119152

Use tools/analyze_vllm_prefix_log.py to reproduce this parsing.
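
For readers without access to that tool, the aggregation in the table can be approximated with the sketch below. It is not the tool's implementation: the regular expression assumes log lines shaped like the three examples quoted earlier, and the function and key names are hypothetical. Repeat starts are counted as starts minus unique requests, which matches the 108 - 100 = 8 relationship in the table.

```python
import re
from collections import defaultdict

# Pattern assumed from the example log lines quoted in this document.
START_RE = re.compile(
    r"Request (\S+) started running, prompt: (\d+), computed: (\d+)"
)


def summarize_starts(lines):
    """Aggregate computed: token counts per request across all of its starts."""
    per_request = defaultdict(list)
    for line in lines:
        match = START_RE.search(line)
        if match:
            per_request[match.group(1)].append(int(match.group(3)))
    starts = sum(len(v) for v in per_request.values())
    return {
        "starts": starts,
        "unique_requests": len(per_request),
        "repeat_starts": starts - len(per_request),
        "all_start_computed": sum(sum(v) for v in per_request.values()),
        "first_start_computed": sum(v[0] for v in per_request.values()),
        "last_start_computed": sum(v[-1] for v in per_request.values()),
        "max_per_request_computed": sum(max(v) for v in per_request.values()),
    }


sample = [
    "Request 16 started running, prompt: 12296, computed: 11552",
    "Request 26 started running, prompt: 5836, computed: 4336",
    "Request 30 started running, prompt: 11017, computed: 10768",
]
summary = summarize_starts(sample)
```

With no repeat starts, all four computed sums coincide, as in the TP=2 row; they diverge only when a request appears more than once.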

Reliability Boundary

These numbers are useful for mechanism validation and for seeding simulator calibration. They are not final serving throughput claims because:

Some bring-up runs capped decode length to 4 or 8 tokens.
The largest real-output baseline so far is coder_100, not coder_2000 or the full coder trace.
Synthetic token IDs preserve block identity and length but not original text distribution.
Prefix reuse in request_metrics.csv is a trace-side estimate. For real scheduler hit/miss behavior, use vLLM stdout.log computed: fields and account for preemption/re-admission.
This run uses H20 and Qwen3-30B-A3B, while the earlier Frontier smoke used dummy A800/Qwen3-32B plumbing. They should be compared as calibration inputs, not as one-to-one simulator accuracy evidence yet.

Next Steps

Move to coder_2000 once runtime and queueing cost are acceptable.
Add the vLLM log parser output into the run aggregation summary.
Compare vLLM real-backend TTFT/TPOT/E2E against Frontier outputs only after selecting a matched model/hardware/profile policy.

See docs/rs4_frontier_h20_tp1_alignment.md for the first Frontier H20 TP1 alignment run against real vLLM TP1.
