# RS4 vLLM GPU Smoke

RS4 starts a real serving baseline for ReplayServe. This is separate from the
Frontier dummy/patched simulator smoke: it checks whether the Qwen block-hash
trace can drive a real vLLM engine with the intended arrival, prompt length,
decode length, and prefix reuse patterns.

## Setup

- Host: `dash2`
- GPU: NVIDIA H20
- Model: `/home/admin/cpfs/wjh/models/Qwen/Qwen3-30B-A3B`
- Runtime: Python 3.12.3, vLLM 0.11.1
- Fixture: `traces/fixtures/coder_100`
- Runner: `tools/vllm_synthetic_replay.py`
- Replay mode: online, trace-relative timestamps preserved
- Prompt mode: `prompt_token_ids`, generated synthetically from trace block
  hashes
- Common vLLM knobs: `max_model_len=32768`, `block_size=16`,
  `max_num_batched_tokens=32768`, `gpu_memory_utilization=0.85`,
  prefix caching on, chunked prefill on

The Qwen trace does not expose original token IDs or text. The runner maps each
block hash deterministically to one stable synthetic token block. Equal block
hashes therefore produce equal token blocks, preserving arrival, length, and
block-prefix sharing patterns, but not original text semantics.

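As a rough illustration of the mapping described above, the sketch below
derives one fixed token block per block hash. It is not the code in
`tools/vllm_synthetic_replay.py`: the SHA-256 derivation, the `VOCAB_SIZE`
bound, the helper names, and the truncation of the final block to the prompt
length are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 16      # matches the `block_size=16` knob listed in Setup
VOCAB_SIZE = 32000   # hypothetical bound, chosen only for this sketch


def block_tokens(block_hash: int) -> list[int]:
    """Expand one trace block hash into BLOCK_SIZE stable token IDs."""
    tokens = []
    for i in range(BLOCK_SIZE):
        # Hash (block_hash, position) so the result depends only on the input.
        digest = hashlib.sha256(f"{block_hash}:{i}".encode()).digest()
        tokens.append(int.from_bytes(digest[:4], "big") % VOCAB_SIZE)
    return tokens


def prompt_token_ids(block_hashes: list[int], prompt_len: int) -> list[int]:
    """Concatenate per-block tokens and cut to the traced prompt length."""
    ids = [tok for h in block_hashes for tok in block_tokens(h)]
    return ids[:prompt_len]
```

Two requests whose hash lists share a prefix get identical token prefixes
under this scheme, which is the property a block-level prefix cache needs.
Nothing about the original text survives.
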
## Runs

The first smoke used single-request runs for engine bring-up, 32-request capped
runs for prefix-cache validation, 32-request uncapped runs for a first
real-output baseline, and full `coder_100` uncapped runs for the first useful
TP=1/2 comparison.

| run | TP | rows | prompt toks | gen toks | wall s | RPS | prompt tok/s | gen tok/s | TTFT p50/p95 | TPOT p50/p95 | E2E p50/p95 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| `tp1_limit1` | 1 | 1 | 1008 | 4 | 1.861 | 0.537 | 541.5 | 2.1 | 1.255/1.255 | 0.007/0.007 | 1.274/1.274 |
| `tp2_limit1` | 2 | 1 | 1008 | 4 | 2.269 | 0.441 | 444.3 | 1.8 | 1.317/1.317 | 0.008/0.008 | 1.340/1.340 |
| `tp1_limit32_o8` | 1 | 32 | 120813 | 253 | 11.244 | 2.846 | 10744.4 | 22.5 | 3.974/5.051 | 0.387/1.081 | 7.157/9.817 |
| `tp2_limit32_o8` | 2 | 32 | 120813 | 253 | 9.071 | 3.528 | 13318.2 | 27.9 | 1.881/3.324 | 0.285/0.727 | 4.368/7.043 |
| `tp1_limit32_uncapped` | 1 | 32 | 120813 | 22209 | 41.874 | 0.764 | 2885.1 | 530.4 | 1.276/1.842 | 0.024/0.102 | 14.366/29.523 |
| `tp2_limit32_uncapped` | 2 | 32 | 120813 | 22209 | 33.588 | 0.953 | 3596.9 | 661.2 | 0.961/1.700 | 0.017/0.071 | 10.786/21.570 |
| `tp1_coder100_uncapped` | 1 | 100 | 474554 | 82479 | 145.351 | 0.688 | 3264.9 | 567.4 | 4.503/29.060 | 0.066/0.621 | 41.841/97.366 |
| `tp2_coder100_uncapped` | 2 | 100 | 474554 | 82479 | 102.001 | 0.980 | 4652.5 | 808.6 | 1.951/10.355 | 0.049/0.262 | 25.678/61.971 |

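The rate columns in the table above appear to be plain totals divided by wall
time. The sketch below recomputes them for the two `coder_100` rows under that
assumption; the `throughput` helper is hypothetical and is not part of the
runner.

```python
def throughput(rows: int, prompt_toks: int, gen_toks: int, wall_s: float) -> dict:
    """Derive request and token rates from run totals and wall time."""
    return {
        "rps": rows / wall_s,
        "prompt_tok_s": prompt_toks / wall_s,
        "gen_tok_s": gen_toks / wall_s,
    }


# Totals copied from the `coder_100` rows of the table.
tp1 = throughput(100, 474554, 82479, 145.351)
tp2 = throughput(100, 474554, 82479, 102.001)

# Ratio of wall times for the same 100-request workload.
speedup = 145.351 / 102.001

for name, stats in (("tp1", tp1), ("tp2", tp2)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
print("tp2 over tp1 wall-time speedup:", round(speedup, 3))
```

With the rounded wall times from the table, the recomputed rates land within
about 0.1 tok/s of the reported values, and TP=2 finishes the same workload
roughly 1.4x faster than TP=1.
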
Artifacts were copied back from dash2 to:

```text
runs/vllm_gpu_smoke_20260624/
```

That directory is ignored by git. Each run contains `summary.json` and
`request_metrics.csv`; the 32-request runs also keep `stdout.log`.

## KV Capacity

vLLM estimated KV capacity from actual H20 memory profiling:

| TP | weights memory | available KV memory | GPU KV cache size | max concurrency at 32768 tokens/request |
|---:|---:|---:|---:|---:|
| 1 | 56.93 GiB | 22.39 GiB | 244,512 tokens | 7.46x |
| 2 | 28.50 GiB/rank | 50.58 GiB/rank | 1,104,880 tokens | 33.72x |

This satisfies the RS4 requirement that KV capacity comes from the real GPU
memory planner rather than a manually fixed block count.

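The concurrency column can be cross-checked from the other columns. The sketch
below assumes it is the KV cache size in tokens divided by `max_model_len`,
and also derives the implied KV bytes per token from the available KV memory.
Both helper names are made up for the example.

```python
MAX_MODEL_LEN = 32768   # `max_model_len` from Setup
GIB = 1024 ** 3


def max_concurrency(kv_tokens: int, max_model_len: int = MAX_MODEL_LEN) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return kv_tokens / max_model_len


def kv_bytes_per_token(kv_gib: float, kv_tokens: int) -> float:
    """Implied KV memory footprint of one cached token, in bytes."""
    return kv_gib * GIB / kv_tokens


# (available KV memory in GiB, KV cache size in tokens) from the table.
for tp, kv_gib, kv_tokens in ((1, 22.39, 244_512), (2, 50.58, 1_104_880)):
    conc = max_concurrency(kv_tokens)
    kib = kv_bytes_per_token(kv_gib, kv_tokens) / 1024
    print(f"TP={tp}: {conc:.2f}x concurrency, {kib:.1f} KiB per token")
```

Under these assumptions the ratios come out at about 7.46x and 33.72x, in
line with the table. The implied footprint is roughly 96 KiB per token at TP=1
and roughly 48 KiB per token per rank at TP=2, which is what one would expect
if the KV cache is split evenly across the two ranks.
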
## Prefix-Cache Check

For the first 32 coder requests, ReplayServe estimated:

- query blocks: 7,564
- hit blocks: 1,786
- block hit ratio: 0.236118456
- query tokens: 120,813
- hit tokens: 28,576
- token hit ratio: 0.236530837

The vLLM scheduler logs for both TP=1 and TP=2 reported exactly 32 request
starts and `computed:` token sums of 28,576 in both capped and uncapped runs.
The largest single hit was 11,552 tokens. Examples include:

```text
Request 16 started running, prompt: 12296, computed: 11552
Request 26 started running, prompt: 5836, computed: 4336
Request 30 started running, prompt: 11017, computed: 10768
```

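A minimal way to pull these numbers out of `stdout.log` is a regular
expression over the start lines. The sketch below assumes every start line has
exactly the shape shown in the excerpt; the pattern and the `parse_starts`
helper are illustrative and are not taken from the repository's own tooling.

```python
import re

# Pattern modelled on the three example lines in the excerpt.
START_RE = re.compile(
    r"Request (?P<req>\S+) started running, "
    r"prompt: (?P<prompt>\d+), computed: (?P<computed>\d+)"
)


def parse_starts(lines):
    """Return (request_id, prompt_tokens, computed_tokens) per start line."""
    starts = []
    for line in lines:
        match = START_RE.search(line)
        if match:
            starts.append(
                (match["req"], int(match["prompt"]), int(match["computed"]))
            )
    return starts


sample = [
    "Request 16 started running, prompt: 12296, computed: 11552",
    "Request 26 started running, prompt: 5836, computed: 4336",
    "Request 30 started running, prompt: 11017, computed: 10768",
]
starts = parse_starts(sample)
total_computed = sum(computed for _, _, computed in starts)
# Prefix hits are reused whole blocks, so each value should divide by 16.
all_block_aligned = all(computed % 16 == 0 for _, _, computed in starts)
print(len(starts), total_computed, all_block_aligned)
```

On the three example lines alone this sums to 26,656 computed tokens, and each
value is a multiple of the 16-token block size, as expected when hits are
whole cached blocks.
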
So this smoke validates the core ReplayServe invariant: identical Qwen block
hash prefixes become identical synthetic token prefixes, and vLLM's prefix cache
actually reuses them.

For full `coder_100`, ReplayServe estimated:

- query blocks: 29,705
- hit blocks: 7,447
- block hit ratio: 0.250698536
- query tokens: 474,554
- hit tokens: 119,152
- token hit ratio: 0.251082069

The TP=2 full `coder_100` run had no preemptions. Its vLLM `computed:` sum was
119,152, matching the trace-side estimate exactly. The TP=1 run had 8
preemptions across repeated starts for requests 70, 71, 72, 77, and 94. In that
case, raw `computed:` sums are not a simple prefix-hit ratio:

| run | starts | unique requests | preemptions | all-start computed | first-start computed | last-start computed | max/request computed | estimated hit tokens |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `tp1_coder100_uncapped` | 108 | 100 | 8 | 180896 | 108560 | 141744 | 141984 | 119152 |
| `tp2_coder100_uncapped` | 100 | 100 | 0 | 119152 | 119152 | 119152 | 119152 | 119152 |

Use `tools/analyze_vllm_prefix_log.py` to reproduce this parsing.

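The script itself is not reproduced here. As a sketch of the kind of
aggregation the table columns suggest, the function below groups start events
by request and sums `computed` over every start, the first start, the last
start, and the per-request maximum. The column definitions, the rule that
preemptions equal starts minus unique requests (which matches 108 - 100 = 8
in the table), and the toy input are all assumptions.

```python
from collections import defaultdict


def summarize(starts):
    """Aggregate (request_id, computed_tokens) pairs given in log order."""
    per_request = defaultdict(list)
    for request_id, computed in starts:
        per_request[request_id].append(computed)

    n_starts = sum(len(values) for values in per_request.values())
    return {
        "starts": n_starts,
        "unique_requests": len(per_request),
        # Each start beyond a request's first is treated as a re-admission.
        "preemptions": n_starts - len(per_request),
        "all_start_computed": sum(sum(v) for v in per_request.values()),
        "first_start_computed": sum(v[0] for v in per_request.values()),
        "last_start_computed": sum(v[-1] for v in per_request.values()),
        "max_per_request_computed": sum(max(v) for v in per_request.values()),
    }


# Toy data, not taken from the real logs: request "70" starts three times.
toy = [("70", 1600), ("71", 320), ("70", 4800), ("72", 0), ("70", 4000)]
summary = summarize(toy)
print(summary)
```

One plausible reason the sums diverge under preemption is that a re-admitted
request can hit blocks it cached itself before being preempted, so its later
starts report more `computed` tokens than the cross-request prefix sharing
that the trace-side estimate counts.
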
## Reliability Boundary

These numbers are useful for mechanism validation and for seeding simulator
calibration. They are not final serving throughput claims because:

- Some bring-up runs capped decode length to 4 or 8 tokens.
- The largest real-output baseline so far is `coder_100`, not `coder_2000` or
  the full coder trace.
- Synthetic token IDs preserve block identity and length but not original text
  distribution.
- Prefix reuse in `request_metrics.csv` is a trace-side estimate. For real
  scheduler hit/miss behavior, use vLLM `stdout.log` `computed:` fields and
  account for preemption/re-admission.
- This run uses H20 and `Qwen3-30B-A3B`, while the earlier Frontier smoke used
  dummy A800/Qwen3-32B plumbing. They should be compared as calibration inputs,
  not as one-to-one simulator accuracy evidence yet.

## Next

- Move to `coder_2000` once runtime and queueing cost are acceptable.
- Add the vLLM log parser output into the run aggregation summary.
- Compare vLLM real-backend TTFT/TPOT/E2E against Frontier outputs only after
  selecting a matched model/hardware/profile policy.

See `docs/rs4_frontier_h20_tp1_alignment.md` for the first Frontier H20 TP1
alignment run against real vLLM TP1.