# RS1 Frontier Smoke

RS1 runs Frontier trace replay as a plumbing smoke for the Qwen coder fixtures generated in RS0. It checks that Frontier can consume ReplayServe's Frontier CSV, preserve online arrivals, run vLLM v1 prefix caching, and emit request/system metrics. It does not make latency or throughput claims.

## Fixed Configuration

- `simulation_mode=online`
- `sys_arch=co-location`
- `cluster_scheduler=sticky_round_robin`
- `replica_scheduler=vllm_v1`
- `device=a800`
- `network_device=a800_dgx`
- `model_name=Qwen/Qwen3-32B`
- `attn_tensor_parallel_size=2`
- dummy execution predictor, 1 ms per model execution
- analytical communication backend
- `trace_request_generator_config_max_tokens=32768`
- prefix caching enabled
- block size 16
- chunked prefill enabled
- batch cap 128
- max batch tokens 32768
- `num_blocks_mode=memory_planner`
- `gpu_memory_utilization=0.9`
- `non_kv_cache_overhead_bytes=0`

The memory planner point uses Frontier's A800 device config (`total_memory_gb=80`) and analytical parameter memory. The non-KV overhead is set to 0 for this smoke, so the derived KV block count is a permissive plumbing budget, not a calibrated serving budget.

Frontier also ships an `a800_pairwise_nvlink` network profile, but `replica_config_network_device` is used to construct a node SKU in the current co-location path. This checkout has `A800_DGX` as a node SKU and does not have an `A800_PAIRWISE_NVLINK` node SKU, so RS1 uses `a800_dgx`.

## Reproduce

From `/home/gahow/phd/replayserve`:

```bash
PIP_CACHE_DIR=/home/gahow/phd/replayserve/.cache/pip python3 -m pip install \
  --target /home/gahow/phd/replayserve/.deps/python \
  'ddsketch>=3.0,<4' 'fasteners>=0.19,<1' 'numpy>=1.23' 'pandas>=1.5' \
  'plotly>=5.0' 'pyyaml>=6.0' 'scikit-learn>=1.1' 'scipy>=1.9' 'tqdm>=4.64'
scripts/run_frontier_smoke.sh coder_100
scripts/run_frontier_smoke.sh coder_2000
```

Each run writes:

- `runs/rs1//command.txt`
- `runs/rs1//stdout.log`
- `runs/rs1//stderr.log`
- `runs/rs1//exit_code.txt`
- `runs/rs1//runtime_seconds.txt`
- `runs/rs1//frontier_metrics/.../config.json`
- `runs/rs1//frontier_metrics/.../system_metrics.json`
- `runs/rs1//frontier_metrics/.../request_metrics.csv`
- `runs/rs1//postprocess_summary.json`
- `runs/rs1//postprocess_summary.md`

## Current Results

Initial local attempt with `network_device=a800_pairwise_nvlink` failed during config reconstruction:

```text
ValueError: [BaseNodeSKUConfig] Invalid type string: a800_pairwise_nvlink
```

The preserved failed run context is under `runs/rs1/coder_100_failed_a800_pairwise_nvlink/`.

The first `a800_dgx` attempt failed because the base Python environment lacked Frontier runtime dependencies:

```text
ModuleNotFoundError: No module named 'plotly'
```

Dependencies were installed into ReplayServe-local `.deps/python` with pip `--target`; Frontier source was not installed or modified.

### coder_100

Status: passed.

- Run dir: `runs/rs1/coder_100/`
- Runtime: 7 seconds
- Metrics dir: `runs/rs1/coder_100/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_100/`
- Frontier block-level prefix hit ratio: `0.04948661841440835`
- ReplayServe token-weighted prefix hit ratio: `0.04956232588915065`
- Frontier total query blocks: `29705`
- Frontier total hit blocks: `1470`
- ReplayServe total query tokens: `474554`
- ReplayServe total hit tokens: `23520`
- Memory planner mode: `memory_planner`
- GPU memory utilization: `0.9`
- A800 memory budget: `80 GiB * 0.9 = 77309411328 bytes`
- Qwen3-32B TP2 analytical weight shard estimate: `28940697600 bytes` (`26.953125 GiB`)
- Non-KV overhead assumption: `0 bytes`
- Available KV budget under this smoke assumption: `48368713728 bytes`
- Derived KV blocks: `36902`
- Preemption events: `0`
- Allocation/preemption/OOM log lines: `0`

The derived KV block count is recomputed by ReplayServe postprocess with the same formula as Frontier `MemoryPlanner.get_num_blocks` because this run did not emit Frontier's `[MEMORY_STATE]` line in stdout/stderr.

### coder_2000

Status: blocked by Frontier runtime error under the fixed RS1 configuration.

- Run dir: `runs/rs1/coder_2000/`
- Runtime: 4 seconds
- Config: `runs/rs1/coder_2000/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_2000/config.json`
- Failure summary: `runs/rs1/coder_2000/failure_summary.md`

Frontier failed during vLLM v1 prefix-cache scheduling:

```text
ValueError: Request 194 already scheduled.
```

The traceback reaches `vllm_v1_engine_replica_scheduler.py:3185`, where the scheduler calls `request.on_cache_hit(prefix_cached_tokens)`, and then `request.py:505`, where `Request.on_cache_hit()` rejects cache-hit updates after the request has already been scheduled.

No Frontier source changes were made. RS1 stops here rather than changing scheduler knobs, because disabling prefix caching or chunked prefill would no longer match the fixed smoke point.

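The coder_100 memory budget and hit ratio figures listed above can be rechecked with plain arithmetic. The following is a minimal sketch, not ReplayServe's postprocess code: the function name `kv_budget_bytes` and the constant names are illustrative assumptions, and the sketch stops at the available KV budget because the per-block byte size used by Frontier's `MemoryPlanner.get_num_blocks` is not stated in this document.

```python
import math

GIB = 1024 ** 3  # bytes per GiB


def kv_budget_bytes(total_memory_gib, gpu_memory_utilization,
                    weight_bytes, non_kv_overhead_bytes=0):
    """Return (usable_bytes, kv_bytes) for one device.

    Hypothetical helper: usable memory is total * utilization, and the KV
    budget is what remains after weights and non-KV overhead.
    """
    usable = round(total_memory_gib * GIB * gpu_memory_utilization)
    return usable, usable - weight_bytes - non_kv_overhead_bytes


# Values copied from the coder_100 results.
WEIGHT_SHARD_BYTES = 28940697600
usable_bytes, kv_bytes = kv_budget_bytes(80, 0.9, WEIGHT_SHARD_BYTES, 0)
weight_gib = WEIGHT_SHARD_BYTES / GIB

# Block-level ratio (Frontier) and token-weighted ratio (ReplayServe).
block_hit_ratio = 1470 / 29705
token_hit_ratio = 23520 / 474554

print(usable_bytes, kv_bytes, weight_gib)
print(block_hit_ratio, token_hit_ratio)
```

In this run the hit token total equals `1470 * 16`, so every hit block was a full block; the two ratios differ only because the token-weighted denominator counts partial final blocks at their true size.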
## Metric Semantics

Frontier reports prefix-cache hits at block granularity. ReplayServe postprocess uses `sidecar.jsonl` to weight each request's first `hit_blocks` by `block_token_counts`, so a hit on a partial final block contributes the true partial token count rather than 16 tokens.

If Frontier omits `request_cached_prefill_tokens`, `request_prefix_cache_query_blocks`, or `request_prefix_cache_hit_blocks` from `request_metrics.csv`, ReplayServe cannot compute token-weighted hit ratio from that run without additional simulator instrumentation.

## Limitations

- Frontier's public A800 compute profiles in the checked source do not include a dense `Qwen/Qwen3-32B` profile.
- Dummy execution predictor is enabled, so TTFT, TPOT, E2E latency, and throughput are only pipeline smoke outputs.
- Memory planner uses analytical parameter memory and a 0-byte non-KV overhead assumption. The derived KV capacity must be replaced by calibrated overhead or runtime profiling before interpreting capacity pressure.
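
The token weighting described under Metric Semantics can be sketched as follows. This is an illustration under stated assumptions, not the ReplayServe postprocess implementation: the record shape (a dict with `block_token_counts` as a list of per-block token counts and `hit_blocks` as an integer) and the function name `token_weighted_hit_ratio` are hypothetical, and the sketch assumes the query token total is the sum of all block token counts.

```python
def token_weighted_hit_ratio(requests):
    """Return (hit_tokens, query_tokens, ratio) over a list of requests.

    Each request is assumed to carry:
      - "block_token_counts": tokens in each prefix block, in order
      - "hit_blocks": number of leading blocks served from the prefix cache
    """
    hit_tokens = 0
    query_tokens = 0
    for req in requests:
        counts = req["block_token_counts"]
        # Only the first hit_blocks blocks count as hits; a partial final
        # block contributes its true token count, not the full block size.
        hit_tokens += sum(counts[: req["hit_blocks"]])
        query_tokens += sum(counts)
    ratio = hit_tokens / query_tokens if query_tokens else 0.0
    return hit_tokens, query_tokens, ratio


# Toy data with block size 16 (not taken from any RS1 run).
toy_requests = [
    {"block_token_counts": [16, 16, 5], "hit_blocks": 3},      # partial last block hit
    {"block_token_counts": [16, 16, 16, 3], "hit_blocks": 1},  # one full block hit
]
hit_tokens, query_tokens, ratio = token_weighted_hit_ratio(toy_requests)

# Block-level ratio over the same toy data, for comparison.
block_ratio = sum(r["hit_blocks"] for r in toy_requests) / sum(
    len(r["block_token_counts"]) for r in toy_requests
)
print(hit_tokens, query_tokens, ratio, block_ratio)
```

On the toy data the block-level ratio is 4/7 while the token-weighted ratio is 53/88, which shows how partial final blocks pull the two metrics apart.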