replaysim/docs/rs4_vllm_gpu_smoke.md

# RS4 vLLM GPU Smoke

RS4 establishes the first real serving baseline for ReplayServe. It is separate
from the Frontier dummy/patched simulator smoke: it checks whether the Qwen
block-hash trace can drive a real vLLM engine with the intended arrival,
prompt-length, decode-length, and prefix-reuse patterns.

## Setup

- Host: `dash2`
- GPU: NVIDIA H20
- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
- Runtime: Python 3.12.3, vLLM 0.11.1
- Fixture: `traces/fixtures/coder_100`
- Runner: `tools/vllm_synthetic_replay.py`
- Replay mode: online, trace-relative timestamps preserved
- Prompt mode: `prompt_token_ids`, generated synthetically from trace block
  hashes
- Common vLLM knobs: `max_model_len=32768`, `block_size=16`,
  `max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
  prefix caching on, chunked prefill on
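
The online replay mode can be pictured with a minimal sketch like the one below. It is not the code in `tools/vllm_synthetic_replay.py`; `replay`, `submit`, and `time_scale` are hypothetical names, and `submit` stands in for whatever coroutine hands a request to the engine. The only property it illustrates is that each request is released at its trace-relative offset from the first arrival, regardless of how long earlier requests take.

```python
import asyncio


async def replay(arrivals, submit, time_scale=1.0):
    """Release request i at (arrivals[i] - arrivals[0]) * time_scale seconds.

    arrivals: non-decreasing trace timestamps in seconds.
    submit:   hypothetical coroutine function taking the request index.
    """
    if not arrivals:
        return []
    loop = asyncio.get_running_loop()
    start = loop.time()
    t0 = arrivals[0]
    tasks = []
    for i, ts in enumerate(arrivals):
        # Sleep until the trace-relative deadline; never block on earlier
        # requests, so queueing inside the engine stays visible in TTFT.
        delay = start + (ts - t0) * time_scale - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        tasks.append(asyncio.create_task(submit(i)))
    return await asyncio.gather(*tasks)
```

Measuring each deadline from the replay start, rather than sleeping for inter-arrival gaps, keeps small scheduling delays from accumulating over a long trace.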

The Qwen trace does not expose original token IDs or text. The runner maps each
block hash deterministically to one stable synthetic token block. Equal block
hashes therefore produce equal token blocks, preserving arrival, length, and
block-prefix sharing patterns, but not original text semantics.

## Runs

The first smoke used single-request runs for engine bring-up, 32-request capped
runs for prefix-cache validation, 32-request uncapped runs for a first
real-output baseline, and full `coder_100` uncapped runs for the first useful
TP=1/2 comparison.

| run | TP | rows | prompt toks | gen toks | wall s | RPS | prompt tok/s | gen tok/s | TTFT p50/p95 (s) | TPOT p50/p95 (s) | E2E p50/p95 (s) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| `tp1_limit1` | 1 | 1 | 1008 | 4 | 1.861 | 0.537 | 541.5 | 2.1 | 1.255/1.255 | 0.007/0.007 | 1.274/1.274 |
| `tp2_limit1` | 2 | 1 | 1008 | 4 | 2.269 | 0.441 | 444.3 | 1.8 | 1.317/1.317 | 0.008/0.008 | 1.340/1.340 |
| `tp1_limit32_o8` | 1 | 32 | 120813 | 253 | 11.244 | 2.846 | 10744.4 | 22.5 | 3.974/5.051 | 0.387/1.081 | 7.157/9.817 |
| `tp2_limit32_o8` | 2 | 32 | 120813 | 253 | 9.071 | 3.528 | 13318.2 | 27.9 | 1.881/3.324 | 0.285/0.727 | 4.368/7.043 |
| `tp1_limit32_uncapped` | 1 | 32 | 120813 | 22209 | 41.874 | 0.764 | 2885.1 | 530.4 | 1.276/1.842 | 0.024/0.102 | 14.366/29.523 |
| `tp2_limit32_uncapped` | 2 | 32 | 120813 | 22209 | 33.588 | 0.953 | 3596.9 | 661.2 | 0.961/1.700 | 0.017/0.071 | 10.786/21.570 |
| `tp1_coder100_uncapped` | 1 | 100 | 474554 | 82479 | 145.351 | 0.688 | 3264.9 | 567.4 | 4.503/29.060 | 0.066/0.621 | 41.841/97.366 |
| `tp2_coder100_uncapped` | 2 | 100 | 474554 | 82479 | 102.001 | 0.980 | 4652.5 | 808.6 | 1.951/10.355 | 0.049/0.262 | 25.678/61.971 |
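
The rate columns in the table are consistent with dividing the row count and token totals by wall time. The sketch below recomputes them for one row as a sanity check; `throughput` is a hypothetical helper, not part of the runner, and small differences from the table come from the wall time being rounded to milliseconds.

```python
def throughput(rows, prompt_toks, gen_toks, wall_s):
    """Recompute the rate columns from the raw counts and wall time."""
    return {
        "rps": rows / wall_s,
        "prompt_tok_s": prompt_toks / wall_s,
        "gen_tok_s": gen_toks / wall_s,
    }


# Values copied from the tp2_coder100_uncapped row.
tp2 = throughput(rows=100, prompt_toks=474554, gen_toks=82479, wall_s=102.001)
print({name: round(value, 1) for name, value in tp2.items()})
```

Applied to the two full `coder_100` rows, the wall times give a TP=1 to TP=2 ratio of about 1.42x.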

Artifacts were copied back from dash2 to:

```text
runs/vllm_gpu_smoke_20260624/
```

That directory is ignored by git. Each run contains `summary.json` and
`request_metrics.csv`; the 32-request runs also keep `stdout.log`.

## KV Capacity

vLLM estimated KV capacity from actual H20 memory profiling:

| TP | weights memory | available KV memory | GPU KV cache size | max concurrency at 32768 tokens/request |
|---:|---:|---:|---:|---:|
| 1 | 56.93 GiB | 22.39 GiB | 244,512 tokens | 7.46x |
| 2 | 28.50 GiB/rank | 50.58 GiB/rank | 1,104,880 tokens | 33.72x |

This satisfies the RS4 requirement that KV capacity comes from the real GPU
memory planner rather than a manually fixed block count.
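
The concurrency column is consistent with dividing the KV cache size in tokens by `max_model_len`. A short check, assuming that relationship and the `block_size=16` setting from the setup (the helper names are illustrative):

```python
MAX_MODEL_LEN = 32768
BLOCK_SIZE = 16


def max_concurrency(kv_tokens, max_model_len=MAX_MODEL_LEN):
    """How many full-length requests fit in the KV cache at once."""
    return kv_tokens / max_model_len


def kv_blocks(kv_tokens, block_size=BLOCK_SIZE):
    """KV cache size expressed in blocks instead of tokens."""
    return kv_tokens // block_size


for tp, kv_tokens in [(1, 244_512), (2, 1_104_880)]:
    print(tp, kv_blocks(kv_tokens), round(max_concurrency(kv_tokens), 2))
```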

## Prefix-Cache Check

For the first 32 coder requests, ReplayServe estimated:

- query blocks: 7,564
- hit blocks: 1,786
- block hit ratio: 0.236118456
- query tokens: 120,813
- hit tokens: 28,576
- token hit ratio: 0.236530837
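
One way to arrive at a trace-side estimate of this kind is sketched below. It is not ReplayServe's implementation: it assumes an unbounded cache with no eviction, requests processed in arrival order, and hits counted only along a contiguous prefix from the first block. `estimate_prefix_hits` and its input layout are hypothetical. If the trace's block hashes already encode their prefix, the chaining step is redundant but harmless.

```python
import hashlib


def estimate_prefix_hits(requests, block_size=16):
    """requests: iterable of (prompt_len, [block_hash, ...]) in arrival order."""
    cached = set()
    totals = {"query_blocks": 0, "hit_blocks": 0,
              "query_tokens": 0, "hit_tokens": 0}
    for prompt_len, hashes in requests:
        key = b""
        matching = True
        hits = 0
        for h in hashes:
            # Chain each block onto its prefix so a block only matches when
            # every block before it matched as well.
            key = hashlib.sha256(key + str(h).encode()).digest()
            if matching and key in cached:
                hits += 1
            else:
                matching = False
            cached.add(key)
        totals["query_blocks"] += len(hashes)
        totals["hit_blocks"] += hits
        totals["query_tokens"] += prompt_len
        totals["hit_tokens"] += min(hits * block_size, prompt_len)
    return totals
```

With full 16-token blocks, hit tokens equal hit blocks times 16, which matches the figures above (1,786 × 16 = 28,576).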

The vLLM scheduler logs for both TP=1 and TP=2 reported exactly 32 request
starts and `computed:` token sums of 28,576 in both capped and uncapped runs.
The largest single hit was 11,552 tokens. Examples include:

```text
Request 16 started running, prompt: 12296, computed: 11552
Request 26 started running, prompt: 5836, computed: 4336
Request 30 started running, prompt: 11017, computed: 10768
```

So this smoke validates the core ReplayServe invariant: identical Qwen block
hash prefixes become identical synthetic token prefixes, and vLLM's prefix cache
actually reuses them.

For full `coder_100`, ReplayServe estimated:

- query blocks: 29,705
- hit blocks: 7,447
- block hit ratio: 0.250698536
- query tokens: 474,554
- hit tokens: 119,152
- token hit ratio: 0.251082069

The TP=2 full `coder_100` run had no preemptions. Its vLLM `computed:` sum was
119,152, matching the trace-side estimate exactly. The TP=1 run had 8
preemptions across repeated starts for requests 70, 71, 72, 77, and 94. In that
case the raw `computed:` sum is no longer a clean prefix-hit count, because
every restart of a preempted request logs its own `computed:` value:

| run | starts | unique requests | preemptions | all-start computed | first-start computed | last-start computed | max/request computed | estimated hit tokens |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `tp1_coder100_uncapped` | 108 | 100 | 8 | 180896 | 108560 | 141744 | 141984 | 119152 |
| `tp2_coder100_uncapped` | 100 | 100 | 0 | 119152 | 119152 | 119152 | 119152 | 119152 |

Use `tools/analyze_vllm_prefix_log.py` to reproduce this parsing.
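
For orientation, the aggregation behind the table can be sketched as follows. This is not the contents of `tools/analyze_vllm_prefix_log.py`; the regular expression is inferred from the log excerpts shown earlier, `summarize_starts` is a hypothetical name, and counting restarts as starts minus unique requests is an assumption that happens to match the table (108 - 100 = 8).

```python
import re
from collections import defaultdict

START_RE = re.compile(
    r"Request (\S+) started running, prompt: (\d+), computed: (\d+)"
)


def summarize_starts(lines):
    """Aggregate `computed:` values per request across repeated starts."""
    per_request = defaultdict(list)
    for line in lines:
        match = START_RE.search(line)
        if match:
            per_request[match.group(1)].append(int(match.group(3)))
    starts = sum(len(v) for v in per_request.values())
    return {
        "starts": starts,
        "unique_requests": len(per_request),
        "restarts": starts - len(per_request),
        "all_start_computed": sum(sum(v) for v in per_request.values()),
        "first_start_computed": sum(v[0] for v in per_request.values()),
        "last_start_computed": sum(v[-1] for v in per_request.values()),
        "max_per_request_computed": sum(max(v) for v in per_request.values()),
    }
```

When no request restarts, all four `computed` sums coincide, as in the TP=2 row.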

## Reliability Boundary

These numbers are useful for mechanism validation and for seeding simulator
calibration. They are not final serving throughput claims because:

- Some bring-up runs capped decode length to 4 or 8 tokens.
- The largest real-output baseline so far is `coder_100`, not `coder_2000` or
  the full coder trace.
- Synthetic token IDs preserve block identity and length but not original text
  distribution.
- Prefix reuse in `request_metrics.csv` is a trace-side estimate. For real
  scheduler hit/miss behavior, use vLLM `stdout.log` `computed:` fields and
  account for preemption/re-admission.
- This run uses H20 and `Qwen3-30B-A3B`, while the earlier Frontier smoke used
  dummy A800/Qwen3-32B plumbing. Treat the two as calibration inputs, not yet
  as one-to-one evidence of simulator accuracy.

## Next

- Move to `coder_2000` once runtime and queueing cost are acceptable.
- Add the vLLM log parser output into the run aggregation summary.
- Compare vLLM real-backend TTFT/TPOT/E2E against Frontier outputs only after
  selecting a matched model/hardware/profile policy.

See `docs/rs4_frontier_h20_tp1_alignment.md` for the first Frontier H20 TP1
alignment run against real vLLM TP1.