# RS1 Frontier Smoke

RS1 runs Frontier trace replay as a plumbing smoke test for the Qwen coder
fixtures generated in RS0. It checks that Frontier can consume ReplayServe's
Frontier CSV, preserve online arrivals, run vLLM v1 prefix caching, and emit
request and system metrics. It makes no latency or throughput claims.

## Fixed Configuration

- `simulation_mode=online`
- `sys_arch=co-location`
- `cluster_scheduler=sticky_round_robin`
- `replica_scheduler=vllm_v1`
- `device=a800`
- `network_device=a800_dgx`
- `model_name=Qwen/Qwen3-32B`
- `attn_tensor_parallel_size=2`
- dummy execution predictor, 1 ms per model execution
- analytical communication backend
- `trace_request_generator_config_max_tokens=32768`
- prefix caching enabled
- block size 16
- chunked prefill enabled
- batch cap 128
- max batch tokens 32768
- `num_blocks_mode=memory_planner`
- `gpu_memory_utilization=0.9`
- `non_kv_cache_overhead_bytes=0`

With `num_blocks_mode=memory_planner`, the KV block count is derived from
Frontier's A800 device config (`total_memory_gb=80`) and an analytical estimate
of parameter memory. Because the non-KV overhead is set to 0 for this smoke, the
derived KV block count is a permissive plumbing budget, not a calibrated serving
budget.

Frontier also ships an `a800_pairwise_nvlink` network profile, but
`replica_config_network_device` is used to construct a node SKU in the current
co-location path. This checkout has `A800_DGX` as a node SKU and does not have an
`A800_PAIRWISE_NVLINK` node SKU, so RS1 uses `a800_dgx`.

## Reproduce

From `/home/gahow/phd/replayserve`:

```bash
PIP_CACHE_DIR=/home/gahow/phd/replayserve/.cache/pip python3 -m pip install \
  --target /home/gahow/phd/replayserve/.deps/python \
  'ddsketch>=3.0,<4' 'fasteners>=0.19,<1' 'numpy>=1.23' 'pandas>=1.5' \
  'plotly>=5.0' 'pyyaml>=6.0' 'scikit-learn>=1.1' 'scipy>=1.9' 'tqdm>=4.64'
scripts/run_frontier_smoke.sh coder_100
scripts/run_frontier_smoke.sh coder_2000
```

Each run writes:

- `runs/rs1/<fixture>/command.txt`
- `runs/rs1/<fixture>/stdout.log`
- `runs/rs1/<fixture>/stderr.log`
- `runs/rs1/<fixture>/exit_code.txt`
- `runs/rs1/<fixture>/runtime_seconds.txt`
- `runs/rs1/<fixture>/frontier_metrics/.../config.json`
- `runs/rs1/<fixture>/frontier_metrics/.../system_metrics.json`
- `runs/rs1/<fixture>/frontier_metrics/.../request_metrics.csv`
- `runs/rs1/<fixture>/postprocess_summary.json`
- `runs/rs1/<fixture>/postprocess_summary.md`
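
A quick way to confirm that a run directory contains the files listed above is
a small checker like the following. This is a minimal sketch, not part of
ReplayServe: the helper name `missing_artifacts` is hypothetical, and because
the path under `frontier_metrics/` is not spelled out here, the Frontier
outputs are located by file name with a recursive search.

```python
from pathlib import Path

# Fixed-name artifacts listed above, relative to runs/rs1/<fixture>/.
TOP_LEVEL = [
    "command.txt",
    "stdout.log",
    "stderr.log",
    "exit_code.txt",
    "runtime_seconds.txt",
    "postprocess_summary.json",
    "postprocess_summary.md",
]
# Frontier outputs sit at an unspecified depth under frontier_metrics/,
# so they are matched by file name only.
FRONTIER_OUTPUTS = ["config.json", "system_metrics.json", "request_metrics.csv"]


def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir."""
    run_dir = Path(run_dir)
    missing = [name for name in TOP_LEVEL if not (run_dir / name).is_file()]
    metrics_root = run_dir / "frontier_metrics"
    for name in FRONTIER_OUTPUTS:
        found = metrics_root.is_dir() and any(metrics_root.rglob(name))
        if not found:
            missing.append(f"frontier_metrics/.../{name}")
    return missing
```

For example, `missing_artifacts("runs/rs1/coder_100")` returns an empty list
when every listed file is present.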

## Current Results

The initial local attempt with `network_device=a800_pairwise_nvlink` failed
during config reconstruction:

```text
ValueError: [BaseNodeSKUConfig] Invalid type string: a800_pairwise_nvlink
```

The context of this failed run is preserved under
`runs/rs1/coder_100_failed_a800_pairwise_nvlink/`.

The first `a800_dgx` attempt failed because the base Python environment lacked
Frontier runtime dependencies:

```text
ModuleNotFoundError: No module named 'plotly'
```

Dependencies were installed into ReplayServe-local `.deps/python` with pip
`--target`; Frontier source was not installed or modified.

### coder_100

Status: passed.

- Run dir: `runs/rs1/coder_100/`
- Runtime: 7 seconds
- Metrics dir:
  `runs/rs1/coder_100/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_100/`
- Frontier block-level prefix hit ratio: `0.04948661841440835`
- ReplayServe token-weighted prefix hit ratio: `0.04956232588915065`
- Frontier total query blocks: `29705`
- Frontier total hit blocks: `1470`
- ReplayServe total query tokens: `474554`
- ReplayServe total hit tokens: `23520`
- Memory planner mode: `memory_planner`
- GPU memory utilization: `0.9`
- A800 memory budget: `80 GiB * 0.9 = 77309411328 bytes`
- Qwen3-32B TP2 analytical weight shard estimate:
  `28940697600 bytes` (`26.953125 GiB`)
- Non-KV overhead assumption: `0 bytes`
- Available KV budget under this smoke assumption: `48368713728 bytes`
- Derived KV blocks: `36902`
- Preemption events: `0`
- Allocation/preemption/OOM log lines: `0`

Because this run did not emit Frontier's `[MEMORY_STATE]` line in stdout or
stderr, the derived KV block count is recomputed by ReplayServe postprocess
using the same formula as Frontier's `MemoryPlanner.get_num_blocks`.
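
The byte-level figures in the list above can be reproduced with plain
arithmetic. The sketch below only re-derives the budget numbers reported here;
it is not Frontier's `MemoryPlanner` code, the variable names are illustrative,
and it reads `total_memory_gb=80` as 80 GiB, as the reported budget does.

```python
GIB = 1024 ** 3

total_memory_bytes = 80 * GIB          # total_memory_gb=80, read as GiB
weight_shard_bytes = 28_940_697_600    # reported analytical estimate per TP2 shard
non_kv_overhead_bytes = 0              # smoke assumption, not a calibrated value

# gpu_memory_utilization=0.9 applied as the exact ratio 9/10 so the
# result stays in integer arithmetic.
memory_budget_bytes = total_memory_bytes * 9 // 10
kv_budget_bytes = memory_budget_bytes - weight_shard_bytes - non_kv_overhead_bytes

print(memory_budget_bytes)             # 77309411328
print(weight_shard_bytes / GIB)        # 26.953125
print(kv_budget_bytes)                 # 48368713728
```

Turning the KV budget into a block count additionally requires the bytes per KV
block, which depends on the model and tensor-parallel layout and is not stated
here.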

### coder_2000

Status: blocked by a Frontier runtime error under the fixed RS1 configuration.

- Run dir: `runs/rs1/coder_2000/`
- Runtime: 4 seconds
- Config:
  `runs/rs1/coder_2000/frontier_metrics/qwen_qwen3_32b/online_serving/rs1_coder_2000/config.json`
- Failure summary: `runs/rs1/coder_2000/failure_summary.md`

Frontier failed during vLLM v1 prefix-cache scheduling:

```text
ValueError: Request 194 already scheduled.
```

The traceback reaches
`vllm_v1_engine_replica_scheduler.py:3185`, where the scheduler calls
`request.on_cache_hit(prefix_cached_tokens)`, and then
`request.py:505`, where `Request.on_cache_hit()` rejects cache-hit updates after
the request has already been scheduled.
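
The guard described above can be illustrated with a toy model. This is not
Frontier's code: `ToyRequest` and its attributes are hypothetical, and the
sketch only shows the shape of the failure (a cache-hit update arriving after
scheduling), not why Frontier reaches that state for request 194.

```python
class ToyRequest:
    """Hypothetical stand-in for a simulator request; not Frontier's class."""

    def __init__(self, request_id):
        self.id = request_id
        self.scheduled = False
        self.cached_tokens = 0

    def on_cache_hit(self, prefix_cached_tokens):
        # Mirrors the described guard: no cache-hit updates once scheduled.
        if self.scheduled:
            raise ValueError(f"Request {self.id} already scheduled.")
        self.cached_tokens = prefix_cached_tokens

    def on_schedule(self):
        self.scheduled = True


req = ToyRequest(194)
req.on_cache_hit(32)      # accepted: the request is not yet scheduled
req.on_schedule()
try:
    req.on_cache_hit(48)  # rejected: the update arrives after scheduling
except ValueError as err:
    message = str(err)
    print(message)
```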

No Frontier source changes were made. RS1 stops here rather than changing
scheduler knobs, because a run with prefix caching or chunked prefill disabled
would no longer match the fixed smoke point.

## Metric Semantics

Frontier reports prefix-cache hits at block granularity. ReplayServe postprocess
uses `sidecar.jsonl` to weight each request's first `hit_blocks` by
`block_token_counts`, so a hit on a partial final block contributes the true
partial token count rather than 16 tokens.
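
The weighting can be sketched as follows. This is a minimal illustration, not
ReplayServe's postprocess code: the function names and the
`(block_token_counts, hit_blocks)` pair layout are assumptions, as is treating
the sum of a request's `block_token_counts` as its query tokens.

```python
def hit_tokens(block_token_counts, hit_blocks):
    """Tokens covered by the first `hit_blocks` blocks of one request."""
    return sum(block_token_counts[:hit_blocks])


def token_weighted_hit_ratio(requests):
    """requests: iterable of (block_token_counts, hit_blocks) pairs."""
    total_hit = 0
    total_query = 0
    for block_token_counts, hit_blocks in requests:
        total_hit += hit_tokens(block_token_counts, hit_blocks)
        # Assumption: query tokens are the sum of all per-block token counts.
        total_query += sum(block_token_counts)
    return total_hit / total_query if total_query else 0.0


# A 40-token request fills two 16-token blocks plus a partial block of 8.
# A hit on all three blocks counts 40 tokens, not 3 * 16 = 48.
print(hit_tokens([16, 16, 8], 3))  # 40
```

In the coder_100 totals, `23520` hit tokens equals `1470 * 16`, which is
consistent with every hit block in that run being a full block.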

If Frontier omits `request_cached_prefill_tokens`,
`request_prefix_cache_query_blocks`, or `request_prefix_cache_hit_blocks` from
`request_metrics.csv`, ReplayServe cannot compute the token-weighted hit ratio
for that run without additional simulator instrumentation.
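
A precheck for these columns might look like the sketch below. The helper name
`missing_prefix_columns` is hypothetical, and the sketch assumes
`request_metrics.csv` carries its column names in the first row.

```python
import csv

# Columns the token-weighted hit ratio depends on, as named above.
REQUIRED_COLUMNS = {
    "request_cached_prefill_tokens",
    "request_prefix_cache_query_blocks",
    "request_prefix_cache_hit_blocks",
}


def missing_prefix_columns(csv_path):
    """Return the required column names absent from the CSV header row."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f), [])  # empty file -> empty header
    return sorted(REQUIRED_COLUMNS - set(header))
```

An empty result means all three columns are present; any returned names
identify which columns the run did not emit.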

## Limitations

- Frontier's public A800 compute profiles in the checked source do not include a
  dense `Qwen/Qwen3-32B` profile.
- The dummy execution predictor is enabled, so TTFT, TPOT, E2E latency, and
  throughput are pipeline smoke outputs only, not performance measurements.
- The memory planner uses analytical parameter memory and a 0-byte non-KV
  overhead assumption. The derived KV capacity must be replaced by calibrated
  overhead or runtime profiling before interpreting capacity pressure.