RS1B Frontier Patch

This document records the scratch Frontier patch used to unblock RS1 fixed-config replay. The patch is not applied to the canonical Frontier checkout.

Patch

  • Patch file: patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
  • Canonical Frontier checkout: /tmp/toc-llm-sim-research/Frontier
  • Scratch Frontier checkout: /tmp/replayserve-frontier-rs1b
  • Frontier base HEAD: d9cfeb6d8791fbf2f295dd9744c56a666171776e

Apply from a Frontier checkout at the same base commit:

cd /path/to/Frontier
git apply /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch

Check applicability without modifying a checkout:

cd /path/to/Frontier
git apply --check /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
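The two preconditions above (same base commit, patch applies cleanly) can also be checked from a script. The following is a minimal sketch, not part of the patch or of ReplayServe: the helper names `run_git` and `patch_applies` are hypothetical, and it only wraps `git rev-parse HEAD` and `git apply --check`, so it never modifies the checkout.

```python
import subprocess
from pathlib import Path

# Base commit recorded in this document.
BASE_HEAD = "d9cfeb6d8791fbf2f295dd9744c56a666171776e"


def run_git(checkout: Path, *args: str) -> tuple[bool, str]:
    """Run a git command inside `checkout`; return (succeeded, stdout)."""
    try:
        proc = subprocess.run(
            ["git", "-C", str(checkout), *args],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # git is not installed or cannot be executed.
        return False, ""
    return proc.returncode == 0, proc.stdout.strip()


def patch_applies(checkout: Path, patch: Path, base_head: str = BASE_HEAD) -> bool:
    """True only if HEAD equals the base commit and the patch applies cleanly."""
    ok, head = run_git(checkout, "rev-parse", "HEAD")
    if not ok or head != base_head:
        return False
    # Resolve the patch path because `git -C` changes the working directory.
    ok, _ = run_git(checkout, "apply", "--check", str(patch.resolve()))
    return ok
```

Calling `patch_applies(Path("/path/to/Frontier"), Path("patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch"))` returns `False` for a checkout on a different commit, which is usually the first thing to rule out when `git apply` rejects the patch.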

Root Cause

Instrumentation in the scratch checkout showed the minimal N=193 failure has two admissions for request 192:

req=192 source=request_queue scheduled=False preempted=False prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=112 num_new_tokens=64
req=192 source=preempted_requests scheduled=True preempted=True prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=1232 num_new_tokens=64

The second admission comes from _preempted_requests. Frontier preemption resets victim._num_processed_tokens and removes the explicit scheduler frontier, but it leaves victim._scheduled=True. The request then re-enters waiting admission, prefix-cache admission finds cached blocks, and request.on_cache_hit(prefix_cached_tokens) raises because on_cache_hit requires _scheduled=False.

The failure is therefore a Frontier runtime-state reset issue for preempted chunked-prefill requests with prefix caching enabled, not bad ReplayServe trace data.

Patch Rationale

The first version of the patch reset two runtime fields on the victim request in VLLMv1EngineReplicaScheduler._preempt_request:

victim._num_prefill_tokens_cached = 0
victim._scheduled = False

This matches the existing preemption intent in the same block: computed tokens are reset and the request is re-entered into a waiting queue for recomputation. After that reset, waiting admission can apply prefix-cache hit state through the existing Request.on_cache_hit path before the request is scheduled again.
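The interaction between the stale flag and the reset can be illustrated with a toy model. This is a sketch under stated assumptions, not Frontier code: `ToyRequest`, `preempt`, `admit`, and `replay` are hypothetical stand-ins, and the only behavior taken from this document is that `on_cache_hit` requires `_scheduled=False` and that preemption clears the processed-token count. The cached-token values 112 and 1232 are the ones from the instrumentation output under Root Cause.

```python
class ToyRequest:
    """Hypothetical stand-in for a Frontier request; not the real class."""

    def __init__(self, req_id: int) -> None:
        self.id = req_id
        self._scheduled = False
        self._num_processed_tokens = 0
        self._num_prefill_tokens_cached = 0

    def on_cache_hit(self, prefix_cached_tokens: int) -> None:
        # Precondition described in Root Cause: the request must not
        # already be marked as scheduled.
        if self._scheduled:
            raise RuntimeError(f"req={self.id}: on_cache_hit while scheduled")
        # Assumption: a cache hit advances processed-token state.
        self._num_prefill_tokens_cached = prefix_cached_tokens
        self._num_processed_tokens = prefix_cached_tokens


def preempt(victim: ToyRequest, reset_runtime_state: bool) -> None:
    # Preemption discards computed tokens so the request is recomputed later.
    victim._num_processed_tokens = 0
    if reset_runtime_state:
        # The two resets added by the first version of the patch.
        victim._num_prefill_tokens_cached = 0
        victim._scheduled = False


def admit(request: ToyRequest, prefix_cached_tokens: int) -> None:
    # Waiting admission: apply the prefix-cache hit, then mark as scheduled.
    if prefix_cached_tokens > 0:
        request.on_cache_hit(prefix_cached_tokens)
    request._scheduled = True


def replay(reset_runtime_state: bool) -> str:
    """Admit, preempt, and re-admit request 192; report what happened."""
    req = ToyRequest(192)
    admit(req, prefix_cached_tokens=112)
    preempt(req, reset_runtime_state)
    try:
        admit(req, prefix_cached_tokens=1232)
    except RuntimeError:
        return "raised"
    return "ok"
```

In this model `replay(False)` returns `"raised"` and `replay(True)` returns `"ok"`, which mirrors the observed failure and the effect of the reset without claiming anything about Frontier's internal control flow.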

An earlier conservative experiment skipped on_cache_hit for already scheduled requests and advanced only the scheduler frontier. That avoided the immediate exception but left request 192 incomplete at simulation shutdown, because the request object's processed-token state never reflected the cached prefix.

The 2026-06-25 RS10 debug runs exposed a second lifecycle bug. Missing request metrics for coder_200_ts2 and coder_200_ts3 were not postprocess artifacts: Frontier drained with completed_requests < total_requests. Missing requests had this state pattern:

preempted=True
is_prefill_complete=True
num_processed_tokens=0
scheduled=False
completed=False

They had been preempted after entering decode. Frontier cleared processed tokens but kept the request in prefill-complete state. The next waiting admission therefore computed num_new_tokens=0 and dropped the request from the waiting queue.
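A toy model makes the silent drop easier to see. The sketch below is illustrative only: `ToyRequest`, `num_new_tokens`, and `admit_waiting` are hypothetical names, the chunk size of 64 and the prompt length of 128 are arbitrary, and the rule that a prefill-complete request yields zero new tokens is an assumption chosen to reproduce the state pattern listed above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a Frontier request; not the real class."""
    req_id: int
    num_prefill_tokens: int
    num_processed_tokens: int = 0
    is_prefill_complete: bool = False
    completed: bool = False


def num_new_tokens(req: ToyRequest, chunk_size: int = 64) -> int:
    # Assumption: a prefill-complete request has no prefill chunk left to admit.
    if req.is_prefill_complete:
        return 0
    return min(chunk_size, req.num_prefill_tokens - req.num_processed_tokens)


def admit_waiting(waiting: deque) -> list:
    """Pop every waiting request and return the ones that received work."""
    admitted = []
    while waiting:
        req = waiting.popleft()
        if num_new_tokens(req) == 0:
            # Dropped: the request leaves the queue without being
            # scheduled or completed.
            continue
        admitted.append(req)
    return admitted


# A request preempted after entering decode: processed tokens were cleared,
# but the prefill-complete flag was left set.
lost = ToyRequest(req_id=1, num_prefill_tokens=128, is_prefill_complete=True)
waiting = deque([lost])
admitted = admit_waiting(waiting)
```

After this runs, `admitted` is empty, `waiting` is empty, and `lost.completed` is still `False`, so the request can never finish. That matches a drain with completed_requests < total_requests.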

In addition to the two resets above, the current patch:

  • replays decode-phase preemption by turning already-produced tokens into the next prefill segment and leaving the remaining tokens as decode work;
  • preserves user-facing prompt/output lengths for metrics after runtime token splitting;
  • preserves unfinished zero-token waiting requests instead of silently dropping them;
  • makes sequential simulation fail fast if the event queue drains before all generated requests complete, with per-request debug snapshots.
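The token split and the fail-fast check in the list above can be sketched as follows. This is a simplified model under assumptions, not the patch itself: the field names, `replay_decode_preemption`, and `check_all_completed` are hypothetical, and the sketch assumes processed tokens count prefill tokens first and produced decode tokens after them.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a Frontier request; not the real class."""
    req_id: int
    prompt_tokens: int        # user-facing prompt length, kept for metrics
    output_tokens: int        # user-facing output length, kept for metrics
    num_prefill_tokens: int   # runtime prefill segment
    num_decode_tokens: int    # runtime decode segment
    num_processed_tokens: int = 0
    is_prefill_complete: bool = False
    completed: bool = False


def replay_decode_preemption(req: ToyRequest) -> None:
    """Fold already-produced decode tokens into the next prefill segment."""
    produced = max(0, req.num_processed_tokens - req.num_prefill_tokens)
    req.num_prefill_tokens += produced   # recompute prompt plus produced output
    req.num_decode_tokens -= produced    # only the remaining tokens are decode work
    req.num_processed_tokens = 0
    req.is_prefill_complete = False      # the request must prefill again
    # prompt_tokens and output_tokens are deliberately left untouched.


def check_all_completed(requests: list) -> None:
    """Fail fast if the simulation drained with unfinished requests."""
    unfinished = [r for r in requests if not r.completed]
    if unfinished:
        raise RuntimeError(
            f"event queue drained with {len(unfinished)} of {len(requests)} "
            f"requests incomplete: {unfinished!r}"
        )


# Example: 100 prompt tokens, 50 output tokens, preempted after 20 decode steps.
req = ToyRequest(7, prompt_tokens=100, output_tokens=50,
                 num_prefill_tokens=100, num_decode_tokens=50,
                 num_processed_tokens=120, is_prefill_complete=True)
replay_decode_preemption(req)
```

In the example the runtime split becomes 120 prefill tokens and 30 decode tokens, the total of 150 tokens is conserved, and the metrics-facing lengths stay at 100 and 50. Keeping the user-facing lengths in separate fields is one way to prevent the runtime split from leaking into request metrics.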

Verification Matrix

All patched runs used RS1 fixed config unless explicitly stated otherwise: online, co-location, vLLM v1, A800, Qwen/Qwen3-32B, TP2, dummy predictor, analytical communication backend, max_tokens=32768, prefix cache on, block size 16, chunked prefill on, batch cap 128, max batch tokens 32768, memory planner KV capacity.

| run | Frontier root | result | runtime | notes |
| --- | --- | --- | --- | --- |
| runs/rs1b/instrumentation/n193_instrumented_print | scratch instrumentation | fail | 4s | Proved request 192 re-entered from _preempted_requests with _scheduled=True. |
| runs/rs1b/patched/n193_fixed_v2 | patched scratch | pass | 11s | N=193 fixed config passed. |
| runs/rs1b/patched/coder_100 | patched scratch | pass | 8s | Prefix hit ratios matched original RS1 coder_100. |
| runs/rs1b/patched/coder_2000 | patched scratch | pass | 87s | Full fixed config run completed. |
| runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2 | patched scratch | pass | 462s | RS10 H20 TP1 full32K profile; completion 200/200; 33 preemption events. |
| runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3 | patched scratch | pass | 465s | RS10 H20 TP1 full32K profile; completion 200/200; 20 preemption events. |

Prefix cache summaries:

| run | Frontier block hit ratio | ReplayServe token-weighted hit ratio | preemption events |
| --- | --- | --- | --- |
| original runs/rs1/coder_100 | 0.0494866184 | 0.0495623259 | 0 |
| patched runs/rs1b/patched/coder_100 | 0.0494866184 | 0.0495623259 | 0 |
| patched runs/rs1b/patched/n193_fixed_v2 | 0.1245897179 | 0.1247698141 | 5 |
| patched runs/rs1b/patched/coder_2000 | 0.1231893025 | 0.1233297822 | 35940 |
| patched runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2 | 0.2310157359 | 0.2313416900 | 33 |
| patched runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3 | 0.2173684294 | 0.2176751278 | 20 |

For coder_2000, ReplayServe postprocess skipped 745 request rows whose Frontier request metrics had blank prefix-cache fields. The run still completed and produced system_metrics.json and request_metrics.csv.

Risks

  • The patch touches Frontier private Request fields from scheduler code, matching existing local style but still relying on internal state layout.
  • Resetting _scheduled during preemption may affect request scheduling accounting outside this RS1 path. The patch does not clear _scheduled_at, so schedule history remains present, but downstream assumptions about the boolean should be reviewed upstream.
  • Resetting _num_prefill_tokens_cached means request-level cached-prefill metrics reflect the current post-preemption admission rather than stale pre-preemption state. This is necessary for the existing on_cache_hit path to model cached-prefix progress, but metrics semantics should be confirmed with Frontier maintainers.
  • The decode-phase preemption replay mutates Frontier private request token fields. Metrics are explicitly anchored to user-facing prompt/output lengths, but upstream should review whether this should become a public Request method.
  • The patched coder_2000 run has many preemptions (35940 events). RS1 remains a plumbing smoke test; its latency and throughput numbers should not be treated as performance evidence.