RS4 vLLM GPU Smoke

RS4 starts a real serving baseline for ReplayServe. This is separate from the Frontier dummy/patched simulator smoke: it checks whether the Qwen block-hash trace can drive a real vLLM engine with the intended arrival, prompt length, decode length, and prefix reuse patterns.

Setup

  • Host: dash2
  • GPU: NVIDIA H20
  • Model: /home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B
  • Runtime: Python 3.12.3, vLLM 0.11.1
  • Fixture: traces/fixtures/coder_100
  • Runner: tools/vllm_synthetic_replay.py
  • Replay mode: online, trace-relative timestamps preserved
  • Prompt mode: prompt_token_ids, generated synthetically from trace block hashes
  • Common vLLM knobs: max_model_len=32768, block_size=16, max_num_batched_tokens=32768, gpu_memory_utilization=0.85, prefix caching on, chunked prefill on
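As a rough sketch, the common knobs above can be collected as vLLM engine keyword arguments. The keyword names are the usual vLLM engine argument names; the helper `rs4_engine_kwargs` is hypothetical and is not taken from tools/vllm_synthetic_replay.py, which may configure the engine differently.

```python
def rs4_engine_kwargs(tensor_parallel_size: int) -> dict:
    """Collect the RS4 smoke settings as keyword arguments for a vLLM engine.

    Hypothetical helper: only the values come from the Setup list above.
    """
    return {
        "model": "/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B",
        "tensor_parallel_size": tensor_parallel_size,  # 1 or 2 in this smoke
        "max_model_len": 32768,
        "block_size": 16,
        "max_num_batched_tokens": 32768,
        "gpu_memory_utilization": 0.85,
        "enable_prefix_caching": True,   # "prefix caching on"
        "enable_chunked_prefill": True,  # "chunked prefill on"
    }


# On a GPU host with vLLM installed, these could be passed along the lines of:
#   from vllm import LLM
#   llm = LLM(**rs4_engine_kwargs(tensor_parallel_size=2))
```

Keeping the TP degree as the only parameter mirrors how the TP=1 and TP=2 runs share every other setting.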

The Qwen trace does not expose original token IDs or text. The runner maps each block hash deterministically to one stable synthetic token block. Equal block hashes therefore produce equal token blocks, preserving arrival, length, and block-prefix sharing patterns, but not original text semantics.
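The mapping described above can be sketched as follows. This is a minimal illustration, not the runner's actual implementation: the function names, the SHA-256 derivation, and the `VOCAB_SIZE` value are all assumptions. Only the block size of 16 and the "equal hash, equal tokens" property come from this document.

```python
import hashlib

BLOCK_SIZE = 16       # matches block_size=16 in the setup
VOCAB_SIZE = 100_000  # assumption: any value below the model's real vocabulary size


def block_tokens(block_hash: int, block_size: int = BLOCK_SIZE,
                 vocab_size: int = VOCAB_SIZE) -> list[int]:
    """Derive one stable synthetic token block from a trace block hash."""
    tokens = []
    for i in range(block_size):
        # Hash (block_hash, position) so the result depends only on the inputs.
        digest = hashlib.sha256(f"{block_hash}:{i}".encode()).digest()
        tokens.append(int.from_bytes(digest[:8], "big") % vocab_size)
    return tokens


def prompt_token_ids(block_hashes: list[int], prompt_len: int) -> list[int]:
    """Concatenate per-block tokens and trim to the traced prompt length."""
    ids: list[int] = []
    for h in block_hashes:
        ids.extend(block_tokens(h))
    return ids[:prompt_len]  # the last block may be only partially filled


# Two prompts sharing their first two block hashes share their first 32 tokens.
a = prompt_token_ids([101, 102, 103], prompt_len=48)
b = prompt_token_ids([101, 102, 999], prompt_len=40)
shared_prefix_matches = a[:32] == b[:32]
```

Because each block's tokens depend only on its hash, two requests whose hash lists share a prefix produce token prompts that share the corresponding prefix, which is what the prefix cache needs in order to reuse blocks.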

Runs

The first smoke used single-request runs for engine bring-up, 32-request capped runs for prefix-cache validation, 32-request uncapped runs for a first real-output baseline, and full coder_100 uncapped runs for the first useful TP=1/2 comparison.

| run | TP | rows | prompt toks | gen toks | wall (s) | RPS | prompt tok/s | gen tok/s | TTFT p50/p95 (s) | TPOT p50/p95 (s) | E2E p50/p95 (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tp1_limit1 | 1 | 1 | 1008 | 4 | 1.861 | 0.537 | 541.5 | 2.1 | 1.255/1.255 | 0.007/0.007 | 1.274/1.274 |
| tp2_limit1 | 2 | 1 | 1008 | 4 | 2.269 | 0.441 | 444.3 | 1.8 | 1.317/1.317 | 0.008/0.008 | 1.340/1.340 |
| tp1_limit32_o8 | 1 | 32 | 120813 | 253 | 11.244 | 2.846 | 10744.4 | 22.5 | 3.974/5.051 | 0.387/1.081 | 7.157/9.817 |
| tp2_limit32_o8 | 2 | 32 | 120813 | 253 | 9.071 | 3.528 | 13318.2 | 27.9 | 1.881/3.324 | 0.285/0.727 | 4.368/7.043 |
| tp1_limit32_uncapped | 1 | 32 | 120813 | 22209 | 41.874 | 0.764 | 2885.1 | 530.4 | 1.276/1.842 | 0.024/0.102 | 14.366/29.523 |
| tp2_limit32_uncapped | 2 | 32 | 120813 | 22209 | 33.588 | 0.953 | 3596.9 | 661.2 | 0.961/1.700 | 0.017/0.071 | 10.786/21.570 |
| tp1_coder100_uncapped | 1 | 100 | 474554 | 82479 | 145.351 | 0.688 | 3264.9 | 567.4 | 4.503/29.060 | 0.066/0.621 | 41.841/97.366 |
| tp2_coder100_uncapped | 2 | 100 | 474554 | 82479 | 102.001 | 0.980 | 4652.5 | 808.6 | 1.951/10.355 | 0.049/0.262 | 25.678/61.971 |
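The throughput columns appear to be the run totals divided by wall time. A minimal sketch under that assumption, using the two full coder_100 runs (the function and key names are illustrative):

```python
def throughput(rows: int, prompt_tokens: int, gen_tokens: int, wall_s: float) -> dict:
    """Derive the rate columns from run totals and wall-clock time."""
    return {
        "rps": rows / wall_s,
        "prompt_tok_s": prompt_tokens / wall_s,
        "gen_tok_s": gen_tokens / wall_s,
    }


tp1 = throughput(rows=100, prompt_tokens=474_554, gen_tokens=82_479, wall_s=145.351)
tp2 = throughput(rows=100, prompt_tokens=474_554, gen_tokens=82_479, wall_s=102.001)

# Same workload on both runs, so the ratio of rates equals the ratio of wall times.
tp2_over_tp1 = tp2["rps"] / tp1["rps"]  # about 1.425
```

Note that with online replay the wall time is bounded below by the trace's arrival span, so these rates describe the replayed workload rather than the engine's saturation throughput.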

Artifacts were copied back from dash2 to:

runs/vllm_gpu_smoke_20260624/

That directory is ignored by git. Each run contains summary.json and request_metrics.csv; the 32-request runs also keep stdout.log.

KV Capacity

vLLM estimated KV capacity from actual H20 memory profiling:

| TP | weights memory | available KV memory | GPU KV cache size | max concurrency at 32768 tokens/request |
|---|---|---|---|---|
| 1 | 56.93 GiB | 22.39 GiB | 244,512 tokens | 7.46x |
| 2 | 28.50 GiB/rank | 50.58 GiB/rank | 1,104,880 tokens | 33.72x |

This satisfies the RS4 requirement that KV capacity comes from the real GPU memory planner rather than a manually fixed block count.
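The concurrency column can be reproduced from the KV cache size and max_model_len, and the table also implies a per-rank KV footprint per token. The sketch below only does arithmetic on the reported figures; the function names are illustrative.

```python
MAX_MODEL_LEN = 32768  # max_model_len from the setup
GIB = 1024 ** 3


def max_concurrency(kv_cache_tokens: int, max_model_len: int = MAX_MODEL_LEN) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


def kv_kib_per_token(kv_memory_gib: float, kv_cache_tokens: int) -> float:
    """Per-rank KV memory consumed by one cached token, in KiB."""
    return kv_memory_gib * GIB / kv_cache_tokens / 1024


tp1_conc = max_concurrency(244_512)          # about 7.46
tp2_conc = max_concurrency(1_104_880)        # about 33.72
tp1_kib = kv_kib_per_token(22.39, 244_512)   # about 96 KiB per token
tp2_kib = kv_kib_per_token(50.58, 1_104_880) # about 48 KiB per token per rank
```

Read this way, TP=2 gains KV capacity twice: each rank holds half the weights, which frees more memory for KV, and each rank stores roughly half the KV bytes per token. Together that accounts for the roughly 4.5x larger token capacity.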

Prefix-Cache Check

For the first 32 coder requests, ReplayServe estimated:

  • query blocks: 7,564
  • hit blocks: 1,786
  • block hit ratio: 0.236118456
  • query tokens: 120,813
  • hit tokens: 28,576
  • token hit ratio: 0.236530837

The vLLM scheduler logs for both TP=1 and TP=2 reported exactly 32 request starts, and their computed: fields summed to 28,576 tokens in both the capped and uncapped runs, matching the estimated hit tokens. The largest single hit was 11,552 tokens. Examples include:

Request 16 started running, prompt: 12296, computed: 11552
Request 26 started running, prompt: 5836, computed: 4336
Request 30 started running, prompt: 11017, computed: 10768

So this smoke validates the core ReplayServe invariant: identical Qwen block hash prefixes become identical synthetic token prefixes, and vLLM's prefix cache actually reuses them.
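A trace-side estimate of this kind can be sketched as a longest-cached-prefix count over block hashes. This is an assumption about how such an estimator could work, not ReplayServe's actual code: it assumes requests are processed in arrival order, that the cache is unbounded (no eviction), and that a block counts as a hit only if every block before it in the same request also hit.

```python
BLOCK_SIZE = 16  # tokens per block, as in the setup


def estimate_prefix_hits(requests: list[list[int]]) -> tuple[int, int]:
    """Return (query_blocks, hit_blocks) for block-hash lists in arrival order."""
    root: dict = {}  # trie of previously seen block-hash prefixes
    query_blocks = 0
    hit_blocks = 0
    for block_hashes in requests:
        node = root
        still_matching = True
        for h in block_hashes:
            query_blocks += 1
            if still_matching and h in node:
                hit_blocks += 1          # this prefix was seen before
            else:
                still_matching = False   # first miss ends the reusable prefix
            node = node.setdefault(h, {})  # record the prefix for later requests
    return query_blocks, hit_blocks


# Toy trace: the second and third requests reuse leading blocks of earlier ones.
query, hits = estimate_prefix_hits([[1, 2, 3], [1, 2, 4], [1, 2, 3, 5]])
block_hit_ratio = hits / query  # 5 / 10
hit_tokens = hits * BLOCK_SIZE  # hits are whole blocks, so tokens scale by 16
```

The reported figures are consistent with whole-block hits: 1,786 hit blocks times 16 tokens is 28,576 hit tokens.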

For full coder_100, ReplayServe estimated:

  • query blocks: 29,705
  • hit blocks: 7,447
  • block hit ratio: 0.250698536
  • query tokens: 474,554
  • hit tokens: 119,152
  • token hit ratio: 0.251082069

The TP=2 full coder_100 run had no preemptions. Its vLLM computed: sum was 119,152, matching the trace-side estimate exactly. The TP=1 run had 8 preemptions across repeated starts for requests 70, 71, 72, 77, and 94. In that case, raw computed: sums are not a simple prefix-hit ratio:

| run | starts | unique requests | preemptions | all-start computed | first-start computed | last-start computed | max/request computed | estimated hit tokens |
|---|---|---|---|---|---|---|---|---|
| tp1_coder100_uncapped | 108 | 100 | 8 | 180896 | 108560 | 141744 | 141984 | 119152 |
| tp2_coder100_uncapped | 100 | 100 | 0 | 119152 | 119152 | 119152 | 119152 | 119152 |

Use tools/analyze_vllm_prefix_log.py to reproduce this parsing.
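For orientation, the aggregation can be approximated as below. This is a sketch, not the contents of tools/analyze_vllm_prefix_log.py: the regular expression is inferred from the example log lines in this document, and the column definitions (including preemptions as starts minus unique requests) are assumptions that happen to match the table.

```python
import re
from collections import defaultdict

# Pattern inferred from lines such as:
#   Request 16 started running, prompt: 12296, computed: 11552
START_RE = re.compile(r"Request (\S+) started running, prompt: (\d+), computed: (\d+)")


def summarize_starts(lines) -> dict:
    """Aggregate computed-token counts per request across repeated starts."""
    computed = defaultdict(list)  # request id -> computed tokens, in log order
    for line in lines:
        m = START_RE.search(line)
        if m:
            computed[m.group(1)].append(int(m.group(3)))
    starts = sum(len(v) for v in computed.values())
    return {
        "starts": starts,
        "unique_requests": len(computed),
        "preemptions": starts - len(computed),  # assumption: one restart each
        "all_start_computed": sum(sum(v) for v in computed.values()),
        "first_start_computed": sum(v[0] for v in computed.values()),
        "last_start_computed": sum(v[-1] for v in computed.values()),
        "max_per_request_computed": sum(max(v) for v in computed.values()),
    }


sample = [
    "Request 16 started running, prompt: 12296, computed: 11552",
    "Request 26 started running, prompt: 5836, computed: 4336",
    "Request 30 started running, prompt: 11017, computed: 10768",
]
summary = summarize_starts(sample)  # 3 starts, 3 unique requests, 0 preemptions
```

When no request restarts, all four computed sums coincide, as in the TP=2 row; they diverge only when a request is preempted and logged again.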

Reliability Boundary

These numbers are useful for mechanism validation and for seeding simulator calibration. They are not final serving throughput claims because:

  • Some bring-up runs capped decode length to 4 or 8 tokens.
  • The largest real-output baseline so far is coder_100, not coder_2000 or the full coder trace.
  • Synthetic token IDs preserve block identity and length but not original text distribution.
  • Prefix reuse in request_metrics.csv is a trace-side estimate. For real scheduler hit/miss behavior, use vLLM stdout.log computed: fields and account for preemption/re-admission.
  • This run uses H20 and Qwen3-30B-A3B, while the earlier Frontier smoke used dummy A800/Qwen3-32B plumbing. They should be compared as calibration inputs, not as one-to-one simulator accuracy evidence yet.

Next

  • Move to coder_2000 once runtime and queueing cost are acceptable.
  • Add the vLLM log parser output into the run aggregation summary.
  • Compare vLLM real-backend TTFT/TPOT/E2E against Frontier outputs only after selecting a matched model/hardware/profile policy.

See docs/rs4_frontier_h20_tp1_alignment.md for the first Frontier H20 TP1 alignment run against real vLLM TP1.