# RS4 vLLM GPU Smoke

RS4 starts a real serving baseline for ReplayServe. This is separate from the Frontier dummy/patched simulator smoke: it checks whether the Qwen block-hash trace can drive a real vLLM engine with the intended arrival, prompt length, decode length, and prefix reuse patterns.

## Setup

- Host: `dash2`
- GPU: NVIDIA H20
- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
- Runtime: Python 3.12.3, vLLM 0.11.1
- Fixture: `traces/fixtures/coder_100`
- Runner: `tools/vllm_synthetic_replay.py`
- Replay mode: online, trace-relative timestamps preserved
- Prompt mode: `prompt_token_ids`, generated synthetically from trace block hashes
- Common vLLM knobs: `max_model_len=32768`, `block_size=16`, `max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`, prefix caching on, chunked prefill on

The Qwen trace does not expose original token IDs or text. The runner maps each block hash deterministically to one stable synthetic token block. Equal block hashes therefore produce equal token blocks, preserving arrival, length, and block-prefix sharing patterns, but not original text semantics.

## Runs

The first smoke used single-request runs for engine bring-up, 32-request capped runs for prefix-cache validation, 32-request uncapped runs for a first real-output baseline, and full `coder_100` uncapped runs for the first useful TP=1/2 comparison.

| run | TP | rows | prompt toks | gen toks | wall s | RPS | prompt tok/s | gen tok/s | TTFT p50/p95 (s) | TPOT p50/p95 (s) | E2E p50/p95 (s) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| `tp1_limit1` | 1 | 1 | 1008 | 4 | 1.861 | 0.537 | 541.5 | 2.1 | 1.255/1.255 | 0.007/0.007 | 1.274/1.274 |
| `tp2_limit1` | 2 | 1 | 1008 | 4 | 2.269 | 0.441 | 444.3 | 1.8 | 1.317/1.317 | 0.008/0.008 | 1.340/1.340 |
| `tp1_limit32_o8` | 1 | 32 | 120813 | 253 | 11.244 | 2.846 | 10744.4 | 22.5 | 3.974/5.051 | 0.387/1.081 | 7.157/9.817 |
| `tp2_limit32_o8` | 2 | 32 | 120813 | 253 | 9.071 | 3.528 | 13318.2 | 27.9 | 1.881/3.324 | 0.285/0.727 | 4.368/7.043 |
| `tp1_limit32_uncapped` | 1 | 32 | 120813 | 22209 | 41.874 | 0.764 | 2885.1 | 530.4 | 1.276/1.842 | 0.024/0.102 | 14.366/29.523 |
| `tp2_limit32_uncapped` | 2 | 32 | 120813 | 22209 | 33.588 | 0.953 | 3596.9 | 661.2 | 0.961/1.700 | 0.017/0.071 | 10.786/21.570 |
| `tp1_coder100_uncapped` | 1 | 100 | 474554 | 82479 | 145.351 | 0.688 | 3264.9 | 567.4 | 4.503/29.060 | 0.066/0.621 | 41.841/97.366 |
| `tp2_coder100_uncapped` | 2 | 100 | 474554 | 82479 | 102.001 | 0.980 | 4652.5 | 808.6 | 1.951/10.355 | 0.049/0.262 | 25.678/61.971 |

Artifacts were copied back from dash2 to:

```text
runs/vllm_gpu_smoke_20260624/
```

That directory is ignored by git. Each run contains `summary.json` and `request_metrics.csv`; the 32-request runs also keep `stdout.log`.

## KV Capacity

vLLM estimated KV capacity from actual H20 memory profiling:

| TP | weights memory | available KV memory | GPU KV cache size | max concurrency at 32768 tokens/request |
|---:|---:|---:|---:|---:|
| 1 | 56.93 GiB | 22.39 GiB | 244,512 tokens | 7.46x |
| 2 | 28.50 GiB/rank | 50.58 GiB/rank | 1,104,880 tokens | 33.72x |

This satisfies the RS4 requirement that KV capacity comes from the real GPU memory planner rather than a manually fixed block count.

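The max-concurrency column in the KV capacity table can be sanity-checked from the reported cache sizes. The sketch below is illustrative only: the helper name `max_concurrency` and the constants are assumptions made for this example, not part of the runner, and the numbers are copied from the KV capacity table and the Setup knobs.

```python
# Sanity check for the KV capacity table (illustrative, not part of the runner).
# GPU KV cache sizes in tokens, copied from the table, keyed by TP degree.
KV_CACHE_TOKENS = {1: 244_512, 2: 1_104_880}
MAX_MODEL_LEN = 32_768  # max_model_len from Setup
BLOCK_SIZE = 16         # block_size from Setup


def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


for tp, tokens in KV_CACHE_TOKENS.items():
    blocks = tokens // BLOCK_SIZE
    ratio = max_concurrency(tokens, MAX_MODEL_LEN)
    print(f"TP={tp}: {blocks} blocks, {ratio:.2f}x")
```

This prints 7.46x for TP=1 and 33.72x for TP=2, matching the table, and both cache sizes divide evenly into 16-token blocks.
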
## Prefix-Cache Check

For the first 32 coder requests, ReplayServe estimated:

- query blocks: 7,564
- hit blocks: 1,786
- block hit ratio: 0.236118456
- query tokens: 120,813
- hit tokens: 28,576
- token hit ratio: 0.236530837

The vLLM scheduler logs for both TP=1 and TP=2 reported exactly 32 request starts and `computed:` token sums of 28,576 in both capped and uncapped runs. The largest single hit was 11,552 tokens. Examples include:

```text
Request 16 started running, prompt: 12296, computed: 11552
Request 26 started running, prompt: 5836, computed: 4336
Request 30 started running, prompt: 11017, computed: 10768
```

So this smoke validates the core ReplayServe invariant: identical Qwen block hash prefixes become identical synthetic token prefixes, and vLLM's prefix cache actually reuses them.

For full `coder_100`, ReplayServe estimated:

- query blocks: 29,705
- hit blocks: 7,447
- block hit ratio: 0.250698536
- query tokens: 474,554
- hit tokens: 119,152
- token hit ratio: 0.251082069

The TP=2 full `coder_100` run had no preemptions. Its vLLM `computed:` sum was 119,152, matching the trace-side estimate exactly. The TP=1 run had 8 preemptions across repeated starts for requests 70, 71, 72, 77, and 94. In that case, raw `computed:` sums are not a simple prefix-hit ratio:

| run | starts | unique requests | preemptions | all-start computed | first-start computed | last-start computed | max/request computed | estimated hit tokens |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `tp1_coder100_uncapped` | 108 | 100 | 8 | 180896 | 108560 | 141744 | 141984 | 119152 |
| `tp2_coder100_uncapped` | 100 | 100 | 0 | 119152 | 119152 | 119152 | 119152 | 119152 |

Use `tools/analyze_vllm_prefix_log.py` to reproduce this parsing.

## Reliability Boundary

These numbers are useful for mechanism validation and for seeding simulator calibration. They are not final serving throughput claims because:

- Some bring-up runs capped decode length to 4 or 8 tokens.
- The largest real-output baseline so far is `coder_100`, not `coder_2000` or the full coder trace.
- Synthetic token IDs preserve block identity and length but not original text distribution.
- Prefix reuse in `request_metrics.csv` is a trace-side estimate. For real scheduler hit/miss behavior, use vLLM `stdout.log` `computed:` fields and account for preemption/re-admission.
- This run uses H20 and `Qwen3-30B-A3B`, while the earlier Frontier smoke used dummy A800/Qwen3-32B plumbing. They should be compared as calibration inputs, not as one-to-one simulator accuracy evidence yet.

## Next

- Move to `coder_2000` once runtime and queueing cost are acceptable.
- Add the vLLM log parser output into the run aggregation summary.
- Compare vLLM real-backend TTFT/TPOT/E2E against Frontier outputs only after selecting a matched model/hardware/profile policy.

See `docs/rs4_frontier_h20_tp1_alignment.md` for the first Frontier H20 TP1 alignment run against real vLLM TP1.
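
For the log-parser item under Next, the aggregation could look roughly like the sketch below. It is not the code in `tools/analyze_vllm_prefix_log.py`: the function name `summarize_starts`, the regular expression, and the reading of the preemption table's columns (sums over all starts, first starts, last starts, and per-request maxima) are assumptions made for illustration. The sample lines are the three scheduler log examples quoted in the Prefix-Cache Check section.

```python
import re
from collections import defaultdict

# Assumed line shape, based on the quoted examples; real log lines may carry
# timestamps or logger prefixes, so search() is used instead of match().
START_RE = re.compile(r"Request (\S+) started running, prompt: (\d+), computed: (\d+)")


def summarize_starts(lines):
    """Aggregate `computed:` tokens per request across (possibly repeated) starts."""
    computed = defaultdict(list)  # request id -> computed tokens, in start order
    for line in lines:
        m = START_RE.search(line)
        if m:
            computed[m.group(1)].append(int(m.group(3)))
    starts = sum(len(v) for v in computed.values())
    return {
        "starts": starts,
        "unique_requests": len(computed),
        # Assumption: every repeated start follows a preemption.
        "repeated_starts": starts - len(computed),
        "all_start_computed": sum(sum(v) for v in computed.values()),
        "first_start_computed": sum(v[0] for v in computed.values()),
        "last_start_computed": sum(v[-1] for v in computed.values()),
        "max_per_request_computed": sum(max(v) for v in computed.values()),
    }


SAMPLE = [
    "Request 16 started running, prompt: 12296, computed: 11552",
    "Request 26 started running, prompt: 5836, computed: 4336",
    "Request 30 started running, prompt: 11017, computed: 10768",
]
summary = summarize_starts(SAMPLE)
print(summary)
```

With no repeated starts, as in these three sample lines, all four computed sums coincide. They diverge only when a request is started more than once, which is the TP=1 `coder_100` case described above.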