RS1B Frontier Patch

This document records the scratch Frontier patch used to unblock RS1 fixed config replay. It is not applied to the canonical Frontier checkout.

Patch

Patch file: patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
Canonical Frontier checkout: /tmp/toc-llm-sim-research/Frontier
Scratch Frontier checkout: /tmp/replayserve-frontier-rs1b
Frontier base HEAD: d9cfeb6d8791fbf2f295dd9744c56a666171776e

Apply from a Frontier checkout at the same base commit:

cd /path/to/Frontier
git apply /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch

Check applicability without modifying a checkout:

cd /path/to/Frontier
git apply --check /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch

Root Cause

Instrumentation in the scratch checkout showed that the minimal N=193 failure involves two admissions for request 192:

req=192 source=request_queue scheduled=False preempted=False prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=112 num_new_tokens=64
req=192 source=preempted_requests scheduled=True preempted=True prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=1232 num_new_tokens=64

The second admission comes from _preempted_requests. Frontier preemption resets victim._num_processed_tokens and removes the explicit scheduler frontier, but it leaves victim._scheduled=True. The request then re-enters waiting admission, prefix-cache admission finds cached blocks, and request.on_cache_hit(prefix_cached_tokens) raises because on_cache_hit requires _scheduled=False.

The failure is therefore a Frontier runtime-state reset issue for preempted chunked-prefill requests with prefix caching enabled, not bad ReplayServe trace data.

Patch Rationale

The first patch reset two request runtime fields in VLLMv1EngineReplicaScheduler._preempt_request:

victim._num_prefill_tokens_cached = 0
victim._scheduled = False

This matches the existing preemption intent in the same block: computed tokens are reset and the request is re-entered into a waiting queue for recomputation. After that reset, waiting admission can apply prefix-cache hit state through the existing Request.on_cache_hit path before the request is scheduled again.
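The failure and the reset can be illustrated with a toy model. This is a minimal sketch, not Frontier code: ToyRequest, preempt, and readmit_after_preemption are hypothetical names, the 2048-token prompt is made up, and the assumption that on_cache_hit advances the processed-token count is inferred from the description in this document. Only the field names and the 112 / 1232 cached-token values come from the notes above.

```python
class ToyRequest:
    """Hypothetical stand-in for a Frontier request; not the real class."""

    def __init__(self, num_prefill_tokens: int) -> None:
        self.num_prefill_tokens = num_prefill_tokens
        self._scheduled = False
        self._num_processed_tokens = 0
        self._num_prefill_tokens_cached = 0

    def on_cache_hit(self, cached_tokens: int) -> None:
        # Precondition described in Root Cause: only valid before scheduling.
        if self._scheduled:
            raise RuntimeError("on_cache_hit requires _scheduled=False")
        self._num_prefill_tokens_cached = cached_tokens
        # Assumption: a cache hit counts the cached prefix as processed.
        self._num_processed_tokens = cached_tokens

    def on_schedule(self) -> None:
        self._scheduled = True


def preempt(victim: ToyRequest, reset_runtime_state: bool) -> None:
    """Toy preemption: always clears processed tokens, optionally applies the reset."""
    victim._num_processed_tokens = 0
    if reset_runtime_state:
        # The two fields reset by the first patch.
        victim._num_prefill_tokens_cached = 0
        victim._scheduled = False


def readmit_after_preemption(reset_runtime_state: bool) -> str:
    """Admit, schedule, preempt, then re-admit with a larger cached prefix."""
    req = ToyRequest(num_prefill_tokens=2048)  # hypothetical prompt length
    req.on_cache_hit(112)       # first admission, from the request queue
    req.on_schedule()
    preempt(req, reset_runtime_state)
    try:
        req.on_cache_hit(1232)  # second admission, from preempted requests
    except RuntimeError:
        return "raised"
    return f"ok processed={req._num_processed_tokens}"


if __name__ == "__main__":
    print("without reset:", readmit_after_preemption(False))
    print("with reset:   ", readmit_after_preemption(True))
```

Without the reset, the second on_cache_hit call raises. With it, the re-admitted request picks up the larger cached prefix through the same path as a fresh request.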

An earlier conservative experiment skipped on_cache_hit for already scheduled requests and advanced only the scheduler frontier. That avoided the immediate exception but left request 192 incomplete at simulation shutdown, because the request object's processed-token state never reflected the cached prefix.

The 2026-06-25 RS10 debug runs exposed a second lifecycle bug. Missing request metrics for coder_200_ts2 and coder_200_ts3 were not postprocess artifacts: Frontier drained with completed_requests < total_requests. Missing requests had this state pattern:

preempted=True
is_prefill_complete=True
num_processed_tokens=0
scheduled=False
completed=False

They had been preempted after entering decode. Frontier cleared processed tokens but kept the request in prefill-complete state. The next waiting admission therefore computed num_new_tokens=0 and dropped the request from the waiting queue.
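A toy admission loop shows how this state pattern can make a request vanish. The sketch assumes that waiting admission plans only prefill work and drops any request with zero new tokens. Frontier's actual num_new_tokens computation is not reproduced in this document, so toy_num_new_tokens, admit_waiting, the dictionary layout, the 512-token prompts, and the chunk size of 64 are all hypothetical.

```python
def toy_num_new_tokens(req: dict, chunk_size: int) -> int:
    """Assumed rule: waiting admission only schedules remaining prefill tokens."""
    if req["is_prefill_complete"]:
        return 0  # nothing left to prefill, according to the stale flag
    remaining = req["num_prefill_tokens"] - req["num_processed_tokens"]
    return max(0, min(chunk_size, remaining))


def admit_waiting(queue: list, chunk_size: int = 64):
    """Split a waiting queue into admitted and silently dropped request ids."""
    admitted, dropped = [], []
    for req in queue:
        n = toy_num_new_tokens(req, chunk_size)
        (admitted if n > 0 else dropped).append(req["id"])
    return admitted, dropped


queue = [
    # Preempted during prefill: flag and counter agree, so work is rescheduled.
    {"id": "prefill_victim", "is_prefill_complete": False,
     "num_processed_tokens": 0, "num_prefill_tokens": 512},
    # Preempted during decode: counter cleared, flag left set (the bug pattern).
    {"id": "decode_victim", "is_prefill_complete": True,
     "num_processed_tokens": 0, "num_prefill_tokens": 512},
]
admitted, dropped = admit_waiting(queue)
print("admitted:", admitted)
print("dropped: ", dropped)
```

In this model the decode-phase victim never re-enters the running set, which is consistent with a drain where completed_requests < total_requests.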

The current patch now also:

replays decode-phase preemption by turning already-produced tokens into the next prefill segment and leaving the remaining tokens as decode work;
preserves user-facing prompt/output lengths for metrics after runtime token splitting;
preserves unfinished zero-token waiting requests instead of silently dropping them;
makes sequential simulation fail fast if the event queue drains before all generated requests complete, with per-request debug snapshots.
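The token split and the fail-fast check can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: ToyTokens, replay_decode_preemption, and check_drained are hypothetical names, and the sketch assumes the recomputed prefill segment covers the original prompt plus the tokens already produced.

```python
from dataclasses import dataclass


@dataclass
class ToyTokens:
    """Hypothetical view of one request's token accounting after a replay."""
    user_prompt_len: int         # what metrics should report as prompt length
    user_output_len: int         # what metrics should report as output length
    runtime_prefill_tokens: int  # what the scheduler must recompute as prefill
    runtime_decode_tokens: int   # what is still left to decode


def replay_decode_preemption(prompt_len: int, output_len: int, produced: int) -> ToyTokens:
    """Fold already-produced tokens into the next prefill segment."""
    if not 0 <= produced <= output_len:
        raise ValueError("produced must be between 0 and output_len")
    return ToyTokens(
        user_prompt_len=prompt_len,
        user_output_len=output_len,
        # Assumption: prompt plus produced tokens are recomputed together.
        runtime_prefill_tokens=prompt_len + produced,
        runtime_decode_tokens=output_len - produced,
    )


def check_drained(completed: int, total: int, snapshots: list) -> None:
    """Fail fast if the event queue drained with unfinished requests."""
    if completed < total:
        raise RuntimeError(
            f"event queue drained with {completed}/{total} requests completed; "
            f"unfinished snapshots: {snapshots}"
        )


if __name__ == "__main__":
    t = replay_decode_preemption(prompt_len=1000, output_len=200, produced=50)
    print(t)
    check_drained(completed=200, total=200, snapshots=[])  # passes silently
```

The total token count is unchanged by the split, which is why metrics can stay anchored to the user-facing lengths while the scheduler works from the runtime fields.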

Verification Matrix

All patched runs used RS1 fixed config unless explicitly stated otherwise: online, co-location, vLLM v1, A800, Qwen/Qwen3-32B, TP2, dummy predictor, analytical communication backend, max_tokens=32768, prefix cache on, block size 16, chunked prefill on, batch cap 128, max batch tokens 32768, memory planner KV capacity.

run	Frontier root	result	runtime	notes
`runs/rs1b/instrumentation/n193_instrumented_print`	scratch instrumentation	fail	4s	Proved request 192 re-entered from `_preempted_requests` with `_scheduled=True`.
`runs/rs1b/patched/n193_fixed_v2`	patched scratch	pass	11s	`N=193` fixed config passed.
`runs/rs1b/patched/coder_100`	patched scratch	pass	8s	Prefix hit ratios matched original RS1 `coder_100`.
`runs/rs1b/patched/coder_2000`	patched scratch	pass	87s	Full fixed config run completed.
`runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2`	patched scratch	pass	462s	RS10 H20 TP1 full32K profile; completion `200/200`; 33 preemption events.
`runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3`	patched scratch	pass	465s	RS10 H20 TP1 full32K profile; completion `200/200`; 20 preemption events.

Prefix cache summaries:

run	Frontier block hit ratio	ReplayServe token-weighted hit ratio	preemption events
original `runs/rs1/coder_100`	0.0494866184	0.0495623259	0
patched `runs/rs1b/patched/coder_100`	0.0494866184	0.0495623259	0
patched `runs/rs1b/patched/n193_fixed_v2`	0.1245897179	0.1247698141	5
patched `runs/rs1b/patched/coder_2000`	0.1231893025	0.1233297822	35940
patched `runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2`	0.2310157359	0.2313416900	33
patched `runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3`	0.2173684294	0.2176751278	20

For coder_2000, ReplayServe postprocess skipped 745 request rows whose Frontier request metrics had blank prefix-cache fields. The run still completed and produced system_metrics.json and request_metrics.csv.

Risks

The patch touches Frontier private Request fields from scheduler code, matching existing local style but still relying on internal state layout.
Resetting _scheduled during preemption may affect request scheduling accounting outside this RS1 path. It does not clear _scheduled_at, so schedule history remains present, but downstream assumptions about the boolean should be reviewed upstream.
Resetting _num_prefill_tokens_cached means request-level cached-prefill metrics reflect the current post-preemption admission rather than stale pre-preemption state. This is necessary for the existing on_cache_hit path to model cached-prefix progress, but metrics semantics should be confirmed with Frontier maintainers.
The decode-phase preemption replay mutates Frontier private request token fields. Metrics are explicitly anchored to user-facing prompt/output lengths, but upstream should review whether this should become a public Request method.
The patched coder_2000 run has many preemptions (35940 events). RS1 remains a plumbing smoke test; its latency and throughput should not be treated as performance evidence.
