RS1 Frontier Smoke

RS1 replays traces through Frontier as a plumbing smoke test for the Qwen coder fixtures generated in RS0. It checks that Frontier can consume ReplayServe's Frontier CSV, preserve online arrivals, run vLLM v1 prefix caching, and emit request and system metrics. It makes no latency or throughput claims.

Fixed Configuration

simulation_mode=online
sys_arch=co-location
cluster_scheduler=sticky_round_robin
replica_scheduler=vllm_v1
device=a800
network_device=a800_dgx
model_name=Qwen/Qwen3-32B
attn_tensor_parallel_size=2
dummy execution predictor, 1 ms per model execution
analytical communication backend
trace_request_generator_config_max_tokens=32768
prefix caching enabled
block size 16
chunked prefill enabled
batch cap 128
max batch tokens 32768
num_blocks_mode=memory_planner
gpu_memory_utilization=0.9
non_kv_cache_overhead_bytes=0

This memory planner configuration uses Frontier's A800 device config (total_memory_gb=80) and analytical parameter memory. The non-KV overhead is set to 0 for this smoke test, so the derived KV block count is a permissive plumbing budget, not a calibrated serving budget.

Frontier also ships an a800_pairwise_nvlink network profile, but replica_config_network_device is used to construct a node SKU in the current co-location path. This checkout has A800_DGX as a node SKU and does not have an A800_PAIRWISE_NVLINK node SKU, so RS1 uses a800_dgx.

Reproduce

From /home/gahow/phd/replayserve:

PIP_CACHE_DIR=/home/gahow/phd/replayserve/.cache/pip python3 -m pip install \
  --target /home/gahow/phd/replayserve/.deps/python \
  'ddsketch>=3.0,<4' 'fasteners>=0.19,<1' 'numpy>=1.23' 'pandas>=1.5' \
  'plotly>=5.0' 'pyyaml>=6.0' 'scikit-learn>=1.1' 'scipy>=1.9' 'tqdm>=4.64'
scripts/run_frontier_smoke.sh coder_100
scripts/run_frontier_smoke.sh coder_2000

Each run writes:

runs/rs1/<fixture>/command.txt
runs/rs1/<fixture>/stdout.log
runs/rs1/<fixture>/stderr.log
runs/rs1/<fixture>/exit_code.txt
runs/rs1/<fixture>/runtime_seconds.txt
runs/rs1/<fixture>/frontier_metrics/.../config.json
runs/rs1/<fixture>/frontier_metrics/.../system_metrics.json
runs/rs1/<fixture>/frontier_metrics/.../request_metrics.csv
runs/rs1/<fixture>/postprocess_summary.json
runs/rs1/<fixture>/postprocess_summary.md
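A quick way to confirm that a run directory contains the artifacts listed above is a small checker like the following. This is a minimal sketch, not part of ReplayServe: the function name missing_artifacts is hypothetical, and because the path segments between frontier_metrics/ and the metric files are elided above, the sketch searches for those files recursively rather than assuming a layout.

```python
from pathlib import Path

# Files expected directly under runs/rs1/<fixture>/ (from the list above).
TOP_LEVEL = [
    "command.txt",
    "stdout.log",
    "stderr.log",
    "exit_code.txt",
    "runtime_seconds.txt",
    "postprocess_summary.json",
    "postprocess_summary.md",
]

# Files expected somewhere below frontier_metrics/ (exact depth not assumed).
METRICS = ["config.json", "system_metrics.json", "request_metrics.csv"]


def missing_artifacts(run_dir):
    """Return the expected artifacts that are absent from run_dir."""
    run_dir = Path(run_dir)
    missing = [name for name in TOP_LEVEL if not (run_dir / name).is_file()]
    metrics_root = run_dir / "frontier_metrics"
    for name in METRICS:
        found = metrics_root.is_dir() and any(metrics_root.rglob(name))
        if not found:
            missing.append(f"frontier_metrics/.../{name}")
    return missing


if __name__ == "__main__":
    for fixture in ("coder_100", "coder_2000"):
        print(fixture, missing_artifacts(Path("runs/rs1") / fixture))
```

A failed run is expected to report missing entries, so an empty list is a signal of a complete run rather than a requirement for every directory.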

Current Results

The initial local attempt with network_device=a800_pairwise_nvlink failed during config reconstruction:

ValueError: [BaseNodeSKUConfig] Invalid type string: a800_pairwise_nvlink

The preserved failed run context is under runs/rs1/coder_100_failed_a800_pairwise_nvlink/.

The first a800_dgx attempt failed because the base Python environment lacked Frontier runtime dependencies:

ModuleNotFoundError: No module named 'plotly'

Dependencies were installed into ReplayServe-local .deps/python with pip --target; Frontier source was not installed or modified.

coder_100

Status: passed.

Run dir: runs/rs1/coder_100/
Runtime: 7 seconds
Metrics dir: runs/rs1/coder_100/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_100/
Frontier block-level prefix hit ratio: 0.04948661841440835
ReplayServe token-weighted prefix hit ratio: 0.04956232588915065
Frontier total query blocks: 29705
Frontier total hit blocks: 1470
ReplayServe total query tokens: 474554
ReplayServe total hit tokens: 23520
Memory planner mode: memory_planner
GPU memory utilization: 0.9
A800 memory budget: 80 GiB * 0.9 = 77309411328 bytes
Qwen3-32B TP2 analytical weight shard estimate: 28940697600 bytes (26.953125 GiB)
Non-KV overhead assumption: 0 bytes
Available KV budget under this smoke assumption: 48368713728 bytes
Derived KV blocks: 36902
Preemption events: 0
Allocation/preemption/OOM log lines: 0

The derived KV block count is recomputed by ReplayServe postprocess with the same formula as Frontier MemoryPlanner.get_num_blocks because this run did not emit Frontier's [MEMORY_STATE] line in stdout/stderr.
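The byte-level budget figures reported for coder_100 can be checked with plain arithmetic. The sketch below is not Frontier's MemoryPlanner code; the function name kv_budget_bytes is hypothetical, and it only reproduces the subtraction down to the available KV budget. Converting that budget into a block count needs Frontier's per-block byte size, which is not reproduced here.

```python
GIB = 1024 ** 3


def usable_memory_bytes(total_memory_gib, gpu_memory_utilization):
    """Device memory the planner may use, in bytes."""
    return int(round(total_memory_gib * GIB * gpu_memory_utilization))


def kv_budget_bytes(total_memory_gib, gpu_memory_utilization,
                    weight_bytes, non_kv_overhead_bytes=0):
    """Bytes left for KV cache after weights and non-KV overhead."""
    usable = usable_memory_bytes(total_memory_gib, gpu_memory_utilization)
    return usable - weight_bytes - non_kv_overhead_bytes


if __name__ == "__main__":
    weights = 28940697600  # Qwen3-32B TP2 analytical shard estimate from this report
    print(usable_memory_bytes(80, 0.9))       # A800 memory budget
    print(weights / GIB)                      # weight shard in GiB
    print(kv_budget_bytes(80, 0.9, weights))  # available KV budget
```

Passing a nonzero non_kv_overhead_bytes shows directly how much the permissive smoke budget would shrink under a calibrated overhead.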

coder_2000

Status: blocked by a Frontier runtime error under the fixed RS1 configuration.

Run dir: runs/rs1/coder_2000/
Runtime: 4 seconds
Config: runs/rs1/coder_2000/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_2000/config.json
Failure summary: runs/rs1/coder_2000/failure_summary.md

Frontier failed during vLLM v1 prefix-cache scheduling:

ValueError: Request 194 already scheduled.

The traceback reaches vllm_v1_engine_replica_scheduler.py:3185, where the scheduler calls request.on_cache_hit(prefix_cached_tokens), and then request.py:505, where Request.on_cache_hit() rejects cache-hit updates after the request has already been scheduled.
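The shape of that guard can be illustrated with a toy model. This is not Frontier's code and does not diagnose why request 194 reached the call twice; ToyRequest and on_schedule are hypothetical names, and only on_cache_hit and the error text are taken from the traceback above.

```python
class ToyRequest:
    """Toy stand-in for a request that rejects late cache-hit updates."""

    def __init__(self, request_id):
        self.id = request_id
        self.scheduled = False
        self.cached_tokens = 0

    def on_cache_hit(self, prefix_cached_tokens):
        # Cache-hit accounting is only accepted before the first scheduling.
        if self.scheduled:
            raise ValueError(f"Request {self.id} already scheduled.")
        self.cached_tokens = prefix_cached_tokens

    def on_schedule(self):
        # Hypothetical hook: marks the request as scheduled.
        self.scheduled = True


def demo():
    """Apply a cache hit, schedule, then apply a second cache hit."""
    req = ToyRequest(194)
    req.on_cache_hit(32)      # accepted: request not yet scheduled
    req.on_schedule()
    try:
        req.on_cache_hit(48)  # rejected: request already scheduled
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(demo())
```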

No Frontier source changes were made. RS1 stops here rather than changing scheduler knobs, because disabling prefix caching or chunked prefill would no longer match the fixed smoke point.

Metric Semantics

Frontier reports prefix-cache hits at block granularity. ReplayServe postprocess uses sidecar.jsonl to convert them to tokens: for each request, it sums block_token_counts over the first hit_blocks blocks, so a hit on a partial final block contributes its true token count rather than a full 16 tokens.
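The token weighting can be sketched as follows. This is an illustration under assumptions, not ReplayServe's postprocess code: the record shape (a dict with block_token_counts and hit_blocks) and the function names are hypothetical, and the sketch assumes a request's query tokens are the sum of its block_token_counts.

```python
def token_weighted_hits(block_token_counts, hit_blocks):
    """Tokens covered by the first hit_blocks blocks of one request."""
    return sum(block_token_counts[:hit_blocks])


def hit_ratios(requests):
    """Return (block-level ratio, token-weighted ratio) over all requests."""
    hit_blocks = sum(r["hit_blocks"] for r in requests)
    query_blocks = sum(len(r["block_token_counts"]) for r in requests)
    hit_tokens = sum(
        token_weighted_hits(r["block_token_counts"], r["hit_blocks"])
        for r in requests
    )
    query_tokens = sum(sum(r["block_token_counts"]) for r in requests)
    block_ratio = hit_blocks / query_blocks if query_blocks else 0.0
    token_ratio = hit_tokens / query_tokens if query_tokens else 0.0
    return block_ratio, token_ratio


if __name__ == "__main__":
    # Made-up example: the first request ends in a 5-token partial block.
    requests = [
        {"block_token_counts": [16, 16, 5], "hit_blocks": 3},
        {"block_token_counts": [16, 16, 16, 16], "hit_blocks": 1},
    ]
    print(hit_ratios(requests))
```

In the made-up example the two ratios differ because the partial block counts as one full block at block level but as only 5 tokens when token-weighted, which is the same effect behind the small gap between the two coder_100 ratios.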

If Frontier omits request_cached_prefill_tokens, request_prefix_cache_query_blocks, or request_prefix_cache_hit_blocks from request_metrics.csv, ReplayServe cannot compute token-weighted hit ratio from that run without additional simulator instrumentation.

Limitations

Frontier's public A800 compute profiles in the checked source do not include a dense Qwen/Qwen3-32B profile.
The dummy execution predictor is enabled, so TTFT, TPOT, E2E latency, and throughput are pipeline smoke outputs only, not performance measurements.
The memory planner uses analytical parameter memory and a 0-byte non-KV overhead assumption. The derived KV capacity must be replaced by a value based on calibrated overhead or runtime profiling before it is used to interpret capacity pressure.
