# RS1 Frontier Smoke

RS1 runs Frontier trace replay as a plumbing smoke for the Qwen coder fixtures
generated in RS0. It checks that Frontier can consume ReplayServe's Frontier CSV,
preserve online arrivals, run vLLM v1 prefix caching, and emit request/system
metrics. It does not make latency or throughput claims.

## Fixed Configuration

- `simulation_mode=online`
- `sys_arch=co-location`
- `cluster_scheduler=sticky_round_robin`
- `replica_scheduler=vllm_v1`
- `device=a800`
- `network_device=a800_dgx`
- `model_name=Qwen/Qwen3-32B`
- `attn_tensor_parallel_size=2`
- dummy execution predictor, 1 ms per model execution
- analytical communication backend
- `trace_request_generator_config_max_tokens=32768`
- prefix caching enabled
- block size 16
- chunked prefill enabled
- batch cap 128
- max batch tokens 32768
- `num_blocks_mode=memory_planner`
- `gpu_memory_utilization=0.9`
- `non_kv_cache_overhead_bytes=0`

The memory planner point uses Frontier's A800 device config
(`total_memory_gb=80`) and analytical parameter memory. The non-KV overhead is
set to 0 for this smoke, so the derived KV block count is a permissive plumbing
budget, not a calibrated serving budget.

Frontier also ships an `a800_pairwise_nvlink` network profile, but
`replica_config_network_device` is used to construct a node SKU in the current
co-location path. This checkout has `A800_DGX` as a node SKU and does not have an
`A800_PAIRWISE_NVLINK` node SKU, so RS1 uses `a800_dgx`.

## Reproduce

From `/home/gahow/phd/replayserve`:

```bash
PIP_CACHE_DIR=/home/gahow/phd/replayserve/.cache/pip python3 -m pip install \
  --target /home/gahow/phd/replayserve/.deps/python \
  'ddsketch>=3.0,<4' 'fasteners>=0.19,<1' 'numpy>=1.23' 'pandas>=1.5' \
  'plotly>=5.0' 'pyyaml>=6.0' 'scikit-learn>=1.1' 'scipy>=1.9' 'tqdm>=4.64'
scripts/run_frontier_smoke.sh coder_100
scripts/run_frontier_smoke.sh coder_2000
```

Each run writes:

- `runs/rs1/<fixture>/command.txt`
- `runs/rs1/<fixture>/stdout.log`
- `runs/rs1/<fixture>/stderr.log`
- `runs/rs1/<fixture>/exit_code.txt`
- `runs/rs1/<fixture>/runtime_seconds.txt`
- `runs/rs1/<fixture>/frontier_metrics/.../config.json`
- `runs/rs1/<fixture>/frontier_metrics/.../system_metrics.json`
- `runs/rs1/<fixture>/frontier_metrics/.../request_metrics.csv`
- `runs/rs1/<fixture>/postprocess_summary.json`
- `runs/rs1/<fixture>/postprocess_summary.md`

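
A quick way to confirm that a run produced the artifacts listed above is to
check the run directory for them. The sketch below is not part of ReplayServe;
`missing_artifacts` is a hypothetical helper, and because the directory levels
under `frontier_metrics/` are elided in the list, it locates those three files
by name with a recursive search rather than assuming a fixed path.

```python
from pathlib import Path

# Fixed-name artifacts from the list above, relative to the run directory.
FIXED_ARTIFACTS = [
    "command.txt",
    "stdout.log",
    "stderr.log",
    "exit_code.txt",
    "runtime_seconds.txt",
    "postprocess_summary.json",
    "postprocess_summary.md",
]

# Files under frontier_metrics/ live in nested directories that the list
# elides, so they are matched by file name only (an assumption of this sketch).
NESTED_ARTIFACTS = ["config.json", "system_metrics.json", "request_metrics.csv"]


def missing_artifacts(run_dir):
    """Return the expected artifacts that are absent from run_dir."""
    run_dir = Path(run_dir)
    missing = [name for name in FIXED_ARTIFACTS if not (run_dir / name).is_file()]
    metrics_root = run_dir / "frontier_metrics"
    for name in NESTED_ARTIFACTS:
        found = metrics_root.is_dir() and any(metrics_root.rglob(name))
        if not found:
            missing.append(f"frontier_metrics/.../{name}")
    return missing
```

For example, `missing_artifacts("runs/rs1/coder_100")` returns an empty list
when every listed file is present. A run that fails partway may legitimately
lack some of these files, so a non-empty result is a prompt to read
`stderr.log`, not a verdict on its own.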
## Current Results

The initial local attempt with `network_device=a800_pairwise_nvlink` failed
during config reconstruction:

```text
ValueError: [BaseNodeSKUConfig] Invalid type string: a800_pairwise_nvlink
```

The context of that failed run is preserved under
`runs/rs1/coder_100_failed_a800_pairwise_nvlink/`.

The first `a800_dgx` attempt failed because the base Python environment lacked
Frontier runtime dependencies:

```text
ModuleNotFoundError: No module named 'plotly'
```

Dependencies were installed into ReplayServe-local `.deps/python` with pip
`--target`; Frontier source was not installed or modified.

### coder_100

Status: passed.

- Run dir: `runs/rs1/coder_100/`
- Runtime: 7 seconds
- Metrics dir:
  `runs/rs1/coder_100/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_100/`
- Frontier block-level prefix hit ratio: `0.04948661841440835`
- ReplayServe token-weighted prefix hit ratio: `0.04956232588915065`
- Frontier total query blocks: `29705`
- Frontier total hit blocks: `1470`
- ReplayServe total query tokens: `474554`
- ReplayServe total hit tokens: `23520`
- Memory planner mode: `memory_planner`
- GPU memory utilization: `0.9`
- A800 memory budget: `80 GiB * 0.9 = 77309411328 bytes`
- Qwen3-32B TP2 analytical weight shard estimate:
  `28940697600 bytes` (`26.953125 GiB`)
- Non-KV overhead assumption: `0 bytes`
- Available KV budget under this smoke assumption: `48368713728 bytes`
- Derived KV blocks: `36902`
- Preemption events: `0`
- Allocation/preemption/OOM log lines: `0`

The derived KV block count is recomputed by ReplayServe postprocess with the
same formula as Frontier `MemoryPlanner.get_num_blocks` because this run did
not emit Frontier's `[MEMORY_STATE]` line in stdout/stderr.

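
The budget arithmetic in the list above can be checked with a few lines of
Python. This is a minimal sketch of the subtraction only, not Frontier's
`MemoryPlanner.get_num_blocks`: the function names are hypothetical, and the
per-block KV size is left as a parameter because this report does not state
it.

```python
from fractions import Fraction

GIB = 2**30


def kv_budget_bytes(total_memory_gib, gpu_memory_utilization, weight_bytes,
                    non_kv_overhead_bytes=0):
    """Bytes left for KV cache after weights and non-KV overhead."""
    # Parsing the utilization as an exact fraction avoids float rounding.
    usable = int(total_memory_gib * GIB * Fraction(str(gpu_memory_utilization)))
    return usable - weight_bytes - non_kv_overhead_bytes


def num_kv_blocks(budget_bytes, kv_bytes_per_block):
    """Whole KV blocks that fit in the budget (floor division)."""
    return budget_bytes // kv_bytes_per_block


# Values reported for coder_100: 80 GiB device, 0.9 utilization,
# 28940697600-byte weight shard, 0-byte non-KV overhead.
usable = int(80 * GIB * Fraction("0.9"))
budget = kv_budget_bytes(80, "0.9", weight_bytes=28_940_697_600)
print(usable)  # 77309411328
print(budget)  # 48368713728
```

Raising `non_kv_overhead_bytes` above 0 shrinks the budget one-for-one, which
is why the 0-byte assumption makes the derived block count an upper bound for
this configuration.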
### coder_2000

Status: blocked by Frontier runtime error under the fixed RS1 configuration.

- Run dir: `runs/rs1/coder_2000/`
- Runtime: 4 seconds
- Config:
  `runs/rs1/coder_2000/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_2000/config.json`
- Failure summary: `runs/rs1/coder_2000/failure_summary.md`

Frontier failed during vLLM v1 prefix-cache scheduling:

```text
ValueError: Request 194 already scheduled.
```

The traceback reaches
`vllm_v1_engine_replica_scheduler.py:3185`, where the scheduler calls
`request.on_cache_hit(prefix_cached_tokens)`, and then
`request.py:505`, where `Request.on_cache_hit()` rejects cache-hit updates after
the request has already been scheduled.

No Frontier source changes were made. RS1 stops here rather than changing
scheduler knobs, because disabling prefix caching or chunked prefill would no
longer match the fixed smoke point.

## Metric Semantics

Frontier reports prefix-cache hits at block granularity. ReplayServe postprocess
uses `sidecar.jsonl` to weight each request's first `hit_blocks` by
`block_token_counts`, so a hit on a partial final block contributes the true
partial token count rather than 16 tokens.

If Frontier omits `request_cached_prefill_tokens`,
`request_prefix_cache_query_blocks`, or `request_prefix_cache_hit_blocks` from
`request_metrics.csv`, ReplayServe cannot compute token-weighted hit ratio from
that run without additional simulator instrumentation.

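
The token weighting described in this section can be sketched as follows.
This is an illustration under assumptions, not ReplayServe's postprocess code:
the record layout (one dict per request with `block_token_counts` and
`hit_blocks`) and the function name are hypothetical, and the sample data is
made up to show the partial-block case.

```python
def token_weighted_hits(records):
    """Return (hit_tokens, query_tokens) summed over all requests."""
    hit_tokens = 0
    query_tokens = 0
    for rec in records:
        counts = rec["block_token_counts"]
        query_tokens += sum(counts)
        # A prefix hit covers the first hit_blocks blocks of the request,
        # so each hit block contributes its own token count.
        hit_tokens += sum(counts[: rec["hit_blocks"]])
    return hit_tokens, query_tokens


# Made-up sample: block size 16, the final block of each request is partial.
records = [
    {"block_token_counts": [16, 16, 5], "hit_blocks": 3},
    {"block_token_counts": [16, 16, 16, 9], "hit_blocks": 1},
]
hit, query = token_weighted_hits(records)
print(hit, query)  # 53 94
```

In this sample the first request's partial final block adds 5 tokens, so the
four hit blocks total 53 tokens, where a flat 16 tokens per hit block would
give 64. The token-weighted ratio is then `hit / query`.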
## Limitations

- Frontier's public A800 compute profiles in the checked source do not include a
  dense `Qwen/Qwen3-32B` profile.
- Dummy execution predictor is enabled, so TTFT, TPOT, E2E latency, and
  throughput are only pipeline smoke outputs.
- Memory planner uses analytical parameter memory and a 0-byte non-KV overhead
  assumption. The derived KV capacity must be replaced by calibrated overhead or
  runtime profiling before interpreting capacity pressure.