RS1 Frontier Smoke

RS1 runs Frontier trace replay as a plumbing smoke test for the Qwen coder fixtures generated in RS0. It checks that Frontier can consume ReplayServe's Frontier CSV, preserve online arrivals, run vLLM v1 prefix caching, and emit request and system metrics. It makes no latency or throughput claims.

Fixed Configuration

  • simulation_mode=online
  • sys_arch=co-location
  • cluster_scheduler=sticky_round_robin
  • replica_scheduler=vllm_v1
  • device=a800
  • network_device=a800_dgx
  • model_name=Qwen/Qwen3-32B
  • attn_tensor_parallel_size=2
  • dummy execution predictor, 1 ms per model execution
  • analytical communication backend
  • trace_request_generator_config_max_tokens=32768
  • prefix caching enabled
  • block size 16
  • chunked prefill enabled
  • batch cap 128
  • max batch tokens 32768
  • num_blocks_mode=memory_planner
  • gpu_memory_utilization=0.9
  • non_kv_cache_overhead_bytes=0

The memory_planner setting uses Frontier's A800 device config (total_memory_gb=80) and an analytical estimate of parameter memory. Because the non-KV overhead is set to 0 for this smoke, the derived KV block count is a permissive plumbing budget, not a calibrated serving budget.

Frontier also ships an a800_pairwise_nvlink network profile, but the current co-location path uses replica_config_network_device to construct a node SKU. This checkout has an A800_DGX node SKU but no A800_PAIRWISE_NVLINK node SKU, so RS1 uses a800_dgx.

Reproduce

From /home/gahow/phd/replayserve:

PIP_CACHE_DIR=/home/gahow/phd/replayserve/.cache/pip python3 -m pip install \
  --target /home/gahow/phd/replayserve/.deps/python \
  'ddsketch>=3.0,<4' 'fasteners>=0.19,<1' 'numpy>=1.23' 'pandas>=1.5' \
  'plotly>=5.0' 'pyyaml>=6.0' 'scikit-learn>=1.1' 'scipy>=1.9' 'tqdm>=4.64'
scripts/run_frontier_smoke.sh coder_100
scripts/run_frontier_smoke.sh coder_2000

Each run writes:

  • runs/rs1/<fixture>/command.txt
  • runs/rs1/<fixture>/stdout.log
  • runs/rs1/<fixture>/stderr.log
  • runs/rs1/<fixture>/exit_code.txt
  • runs/rs1/<fixture>/runtime_seconds.txt
  • runs/rs1/<fixture>/frontier_metrics/.../config.json
  • runs/rs1/<fixture>/frontier_metrics/.../system_metrics.json
  • runs/rs1/<fixture>/frontier_metrics/.../request_metrics.csv
  • runs/rs1/<fixture>/postprocess_summary.json
  • runs/rs1/<fixture>/postprocess_summary.md
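The artifact list above can be checked mechanically after a run. The following is a minimal sketch, not part of ReplayServe: `missing_artifacts` is a hypothetical helper, and because the path between frontier_metrics/ and the metrics files is elided above, it searches that directory recursively instead of assuming a layout.

```python
from pathlib import Path

# File names are taken from the list above.
TOP_LEVEL = [
    "command.txt",
    "stdout.log",
    "stderr.log",
    "exit_code.txt",
    "runtime_seconds.txt",
    "postprocess_summary.json",
    "postprocess_summary.md",
]
# These live at an unspecified depth under frontier_metrics/.
METRICS = ["config.json", "system_metrics.json", "request_metrics.csv"]


def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir."""
    run_dir = Path(run_dir)
    missing = [name for name in TOP_LEVEL if not (run_dir / name).is_file()]
    metrics_root = run_dir / "frontier_metrics"
    for name in METRICS:
        found = metrics_root.is_dir() and any(metrics_root.rglob(name))
        if not found:
            missing.append(f"frontier_metrics/.../{name}")
    return missing
```

For example, `missing_artifacts("runs/rs1/coder_100")` returns an empty list when every listed file is present. A failed run may legitimately lack some of these files, so a non-empty result is a prompt to inspect the run, not a verdict on it.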

Current Results

The initial local attempt with network_device=a800_pairwise_nvlink failed during config reconstruction:

ValueError: [BaseNodeSKUConfig] Invalid type string: a800_pairwise_nvlink

The failed run's context is preserved under runs/rs1/coder_100_failed_a800_pairwise_nvlink/.

The first a800_dgx attempt failed because the base Python environment lacked Frontier runtime dependencies:

ModuleNotFoundError: No module named 'plotly'

Dependencies were installed into ReplayServe-local .deps/python with pip --target; Frontier source was not installed or modified.

coder_100

Status: passed.

  • Run dir: runs/rs1/coder_100/
  • Runtime: 7 seconds
  • Metrics dir: runs/rs1/coder_100/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_100/
  • Frontier block-level prefix hit ratio: 0.04948661841440835
  • ReplayServe token-weighted prefix hit ratio: 0.04956232588915065
  • Frontier total query blocks: 29705
  • Frontier total hit blocks: 1470
  • ReplayServe total query tokens: 474554
  • ReplayServe total hit tokens: 23520
  • Memory planner mode: memory_planner
  • GPU memory utilization: 0.9
  • A800 memory budget: 80 GiB * 0.9 = 77309411328 bytes
  • Qwen3-32B TP2 analytical weight shard estimate: 28940697600 bytes (26.953125 GiB)
  • Non-KV overhead assumption: 0 bytes
  • Available KV budget under this smoke assumption: 48368713728 bytes
  • Derived KV blocks: 36902
  • Preemption events: 0
  • Allocation/preemption/OOM log lines: 0

This run did not emit Frontier's [MEMORY_STATE] line in stdout or stderr, so ReplayServe postprocess recomputes the derived KV block count with the same formula as Frontier's MemoryPlanner.get_num_blocks.
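The byte budgets reported for coder_100 follow from plain arithmetic on the listed inputs. The sketch below reproduces them; it is not Frontier's MemoryPlanner.get_num_blocks, and the function names are hypothetical. The per-block KV size is not stated in this document, so `bytes_per_block` is left as a caller-supplied assumption.

```python
GIB = 2**30


def kv_budget_bytes(total_memory_gib, gpu_memory_utilization,
                    weight_bytes, non_kv_overhead_bytes=0):
    """Return (usable_bytes, kv_bytes) for one device."""
    usable = int(total_memory_gib * GIB * gpu_memory_utilization)
    kv_bytes = usable - weight_bytes - non_kv_overhead_bytes
    return usable, kv_bytes


def kv_blocks(kv_bytes, bytes_per_block):
    """Whole KV blocks that fit in the budget.

    bytes_per_block is an assumption supplied by the caller; it depends on
    the model's KV layout, dtype, and tensor-parallel sharding.
    """
    return kv_bytes // bytes_per_block


# Inputs from the coder_100 results: 80 GiB device, utilization 0.9,
# 28940697600-byte weight shard estimate, 0-byte non-KV overhead.
usable, kv_bytes = kv_budget_bytes(80, 0.9, 28_940_697_600)
print(usable)    # 77309411328
print(kv_bytes)  # 48368713728
```

Any non-zero `non_kv_overhead_bytes` shrinks the KV budget byte for byte, which is why the 0-byte assumption makes the derived block count permissive.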

coder_2000

Status: blocked by a Frontier runtime error under the fixed RS1 configuration.

  • Run dir: runs/rs1/coder_2000/
  • Runtime: 4 seconds
  • Config: runs/rs1/coder_2000/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_2000/config.json
  • Failure summary: runs/rs1/coder_2000/failure_summary.md

Frontier failed during vLLM v1 prefix-cache scheduling:

ValueError: Request 194 already scheduled.

The traceback reaches vllm_v1_engine_replica_scheduler.py:3185, where the scheduler calls request.on_cache_hit(prefix_cached_tokens), and then request.py:505, where Request.on_cache_hit() rejects cache-hit updates after the request has already been scheduled.

No Frontier source changes were made. RS1 stops here rather than changing scheduler knobs, because disabling prefix caching or chunked prefill would no longer match the fixed smoke point.

Metric Semantics

Frontier reports prefix-cache hits at block granularity. ReplayServe postprocess uses sidecar.jsonl to weight the first hit_blocks blocks of each request by their block_token_counts entries, so a hit on a partial final block contributes its actual token count rather than a full 16 tokens.
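The difference between the two ratios can be illustrated with a small sketch. This is not the ReplayServe postprocess code: the record layout (a dict with `block_token_counts` and `hit_blocks`) is a hypothetical simplification, and it assumes each request queries exactly as many blocks as it has entries in `block_token_counts`.

```python
def hit_ratios(requests):
    """Return (block_level_ratio, token_weighted_ratio) over all requests."""
    query_blocks = hit_blocks = query_tokens = hit_tokens = 0
    for req in requests:
        counts = req["block_token_counts"]  # tokens per block, last may be partial
        hits = req["hit_blocks"]            # number of leading blocks that hit
        query_blocks += len(counts)
        hit_blocks += hits
        query_tokens += sum(counts)
        # Only the first `hits` blocks count, each at its real token count.
        hit_tokens += sum(counts[:hits])
    return hit_blocks / query_blocks, hit_tokens / query_tokens


example = [
    {"block_token_counts": [16, 16, 5], "hit_blocks": 1},
    {"block_token_counts": [16, 9], "hit_blocks": 2},
]
block_ratio, token_ratio = hit_ratios(example)
# block-level: 3 / 5; token-weighted: (16 + 25) / (37 + 25) = 41 / 62
```

The ratios diverge whenever partial final blocks are present, because a 5-token block counts as one full block in the block-level ratio but as only 5 tokens in the token-weighted one. The sketch does not guard against an empty input.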

If Frontier omits request_cached_prefill_tokens, request_prefix_cache_query_blocks, or request_prefix_cache_hit_blocks from request_metrics.csv, ReplayServe cannot compute the token-weighted hit ratio for that run without additional simulator instrumentation.
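A precheck for those columns could look like the following minimal sketch. The column names come from the paragraph above; `missing_prefix_columns` is a hypothetical helper and assumes the first CSV row is the header.

```python
import csv

REQUIRED_COLUMNS = (
    "request_cached_prefill_tokens",
    "request_prefix_cache_query_blocks",
    "request_prefix_cache_hit_blocks",
)


def missing_prefix_columns(csv_file):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    return [name for name in REQUIRED_COLUMNS if name not in header]
```

Called on an open request_metrics.csv (for example via `open(path, newline="")`), an empty result means the token-weighted ratio can be attempted for that run.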

Limitations

  • Frontier's public A800 compute profiles in the checked-out source do not include a dense Qwen/Qwen3-32B profile.
  • The dummy execution predictor is enabled, so TTFT, TPOT, E2E latency, and throughput are pipeline smoke outputs only, not performance measurements.
  • The memory planner uses analytical parameter memory and a 0-byte non-KV overhead assumption. The derived KV capacity must be replaced by calibrated overhead or runtime profiling before interpreting capacity pressure.