RS1B Frontier Patch
This document records the scratch Frontier patch used to unblock RS1 fixed config replay. It is not applied to the canonical Frontier checkout.
Patch
- Patch file: `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`
- Canonical Frontier checkout: `/tmp/toc-llm-sim-research/Frontier`
- Scratch Frontier checkout: `/tmp/replayserve-frontier-rs1b`
- Frontier base HEAD: `d9cfeb6d8791fbf2f295dd9744c56a666171776e`
Apply from a Frontier checkout at the same base commit:
cd /path/to/Frontier
git apply /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
Check applicability without modifying a checkout:
cd /path/to/Frontier
git apply --check /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
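The applicability check can also be scripted so that it refuses to proceed when the checkout is not at the recorded base commit. The following is a minimal sketch, not part of the patch: the helper names `head_matches` and `patch_applies` are hypothetical, and it assumes `git` is on `PATH`.

```python
import subprocess
from pathlib import Path

# Base commit recorded in this document.
BASE_HEAD = "d9cfeb6d8791fbf2f295dd9744c56a666171776e"


def head_matches(head: str, base: str = BASE_HEAD) -> bool:
    """Compare the output of `git rev-parse HEAD` with the recorded base commit."""
    return head.strip().lower() == base.lower()


def patch_applies(checkout: Path, patch: Path) -> bool:
    """Return True only if HEAD is the base commit and `git apply --check` succeeds.

    Nothing in the checkout is modified: `--check` only tests applicability.
    """
    head = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=checkout, capture_output=True, text=True, check=True,
    ).stdout
    if not head_matches(head):
        return False
    check = subprocess.run(["git", "apply", "--check", str(patch)], cwd=checkout)
    return check.returncode == 0
```

Checking the commit first gives a clearer failure reason than a patch-context mismatch from `git apply` alone.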
Root Cause
Instrumentation in the scratch checkout showed the minimal N=193 failure
has two admissions for request 192:
req=192 source=request_queue scheduled=False preempted=False prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=112 num_new_tokens=64
req=192 source=preempted_requests scheduled=True preempted=True prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=1232 num_new_tokens=64
The second admission comes from _preempted_requests. Frontier preemption
resets victim._num_processed_tokens and removes the explicit scheduler
frontier, but it leaves victim._scheduled=True. The request then re-enters
waiting admission, prefix-cache admission finds cached blocks, and
request.on_cache_hit(prefix_cached_tokens) raises because on_cache_hit
requires _scheduled=False.
The failure is therefore a Frontier runtime-state reset issue for preempted chunked-prefill requests with prefix caching enabled, not bad ReplayServe trace data.
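The sequence above can be reproduced with a toy model. This is an illustrative sketch only: `ToyRequest` and `preempt_unpatched` are hypothetical stand-ins, not Frontier classes, and the assumption that a cache hit advances the processed-token count is a simplification.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for the request runtime state discussed above."""
    num_processed_tokens: int = 0
    num_prefill_tokens_cached: int = 0
    scheduled: bool = False
    preempted: bool = False

    def on_cache_hit(self, prefix_cached_tokens: int) -> None:
        # Documented precondition: cache hits apply only before scheduling.
        if self.scheduled:
            raise RuntimeError("on_cache_hit requires scheduled=False")
        self.num_prefill_tokens_cached = prefix_cached_tokens
        # Assumption: cached prefix tokens count as already processed.
        self.num_processed_tokens = prefix_cached_tokens


def preempt_unpatched(victim: ToyRequest) -> None:
    """Model of the unpatched behavior: progress is reset, `scheduled` is not."""
    victim.preempted = True
    victim.num_processed_tokens = 0
    # victim.scheduled is left True, which is the bug described above.


def replay_unpatched() -> str:
    req = ToyRequest()
    req.on_cache_hit(112)      # first admission, from the request queue
    req.scheduled = True       # request gets scheduled
    preempt_unpatched(req)     # later preempted
    try:
        req.on_cache_hit(1232)  # second admission, from the preempted queue
    except RuntimeError as exc:
        return str(exc)
    return "ok"
```

Calling `replay_unpatched()` returns the error message, mirroring the second admission in the instrumentation log.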
Patch Rationale
The first patch reset two request runtime fields in
VLLMv1EngineReplicaScheduler._preempt_request:
victim._num_prefill_tokens_cached = 0
victim._scheduled = False
This matches the existing preemption intent in the same block: computed tokens
are reset and the request is re-entered into a waiting queue for recomputation.
After that reset, waiting admission can apply prefix-cache hit state through
the existing Request.on_cache_hit path before the request is scheduled again.
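A minimal sketch of the effect of the two resets, using a toy object rather than Frontier's real `Request` class. The field names follow the text; `ToyVictim`, `preempt_patched`, and `readmit_after_preemption` are hypothetical, and the starting values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ToyVictim:
    """Hypothetical stand-in carrying the private fields named above."""
    _num_processed_tokens: int = 0
    _num_prefill_tokens_cached: int = 0
    _scheduled: bool = False


def on_cache_hit(req: ToyVictim, prefix_cached_tokens: int) -> None:
    # Same precondition as described for Request.on_cache_hit.
    if req._scheduled:
        raise RuntimeError("on_cache_hit requires _scheduled=False")
    req._num_prefill_tokens_cached = prefix_cached_tokens
    # Assumption: cached prefix tokens count as already processed.
    req._num_processed_tokens = prefix_cached_tokens


def preempt_patched(victim: ToyVictim) -> None:
    victim._num_processed_tokens = 0       # existing reset
    victim._num_prefill_tokens_cached = 0  # added by the first patch
    victim._scheduled = False              # added by the first patch


def readmit_after_preemption() -> ToyVictim:
    # Illustrative state of a scheduled request partway through prefill.
    victim = ToyVictim(_num_processed_tokens=176,
                       _num_prefill_tokens_cached=112,
                       _scheduled=True)
    preempt_patched(victim)
    on_cache_hit(victim, 1232)  # waiting admission now succeeds
    return victim
```

With both fields reset, the re-admission path needs no special case: the request looks like a fresh waiting request to the cache-hit logic.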
An earlier conservative experiment skipped on_cache_hit for already scheduled
requests and advanced only the scheduler frontier. That avoided the immediate
exception but left request 192 incomplete at simulation shutdown, because the
request object's processed-token state never reflected the cached prefix.
The 2026-06-25 RS10 debug runs exposed a second lifecycle bug. Missing request
metrics for coder_200_ts2 and coder_200_ts3 were not postprocess artifacts:
Frontier drained with completed_requests < total_requests. Missing requests
had this state pattern:
preempted=True
is_prefill_complete=True
num_processed_tokens=0
scheduled=False
completed=False
They had been preempted after entering decode. Frontier cleared processed
tokens but kept the request in prefill-complete state. The next waiting
admission therefore computed num_new_tokens=0 and dropped the request from
the waiting queue.
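The state pattern above is easy to search for in per-request debug snapshots. The sketch below assumes each snapshot is a plain dict keyed by the field names listed above; the snapshot format and the helper names are hypothetical.

```python
from typing import Any, Iterable, Mapping

# Field values that identified the stranded requests in the RS10 debug runs.
STRANDED_PATTERN: Mapping[str, Any] = {
    "preempted": True,
    "is_prefill_complete": True,
    "num_processed_tokens": 0,
    "scheduled": False,
    "completed": False,
}


def is_stranded(snapshot: Mapping[str, Any]) -> bool:
    """True if the snapshot matches every field of the stranded pattern."""
    return all(snapshot.get(key) == value for key, value in STRANDED_PATTERN.items())


def stranded_ids(snapshots: Mapping[int, Mapping[str, Any]]) -> list[int]:
    """Return the sorted request ids whose snapshot matches the pattern."""
    return sorted(rid for rid, snap in snapshots.items() if is_stranded(snap))
```

A request that was preempted during prefill has `is_prefill_complete=False` and is not flagged, which separates this bug from the first one.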
The current patch now also:
- replays decode-phase preemption by turning already-produced tokens into the next prefill segment and leaving the remaining tokens as decode work;
- preserves user-facing prompt/output lengths for metrics after runtime token splitting;
- preserves unfinished zero-token waiting requests instead of silently dropping them;
- makes sequential simulation fail fast if the event queue drains before all generated requests complete, with per-request debug snapshots.
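The token-splitting and fail-fast items can be sketched as follows. This is a toy model under stated assumptions, not the patch itself: `ToyRequest`, its `runtime_*` fields, `replay_decode_preemption`, and `assert_all_completed` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    prompt_tokens: int   # user-facing length, never mutated (used for metrics)
    output_tokens: int   # user-facing length, never mutated (used for metrics)
    runtime_prefill_tokens: int = field(init=False)
    runtime_decode_tokens: int = field(init=False)
    num_processed_tokens: int = 0
    is_prefill_complete: bool = False
    completed: bool = False

    def __post_init__(self) -> None:
        # Runtime split starts out equal to the user-facing lengths.
        self.runtime_prefill_tokens = self.prompt_tokens
        self.runtime_decode_tokens = self.output_tokens


def replay_decode_preemption(req: ToyRequest, produced_tokens: int) -> None:
    """Fold already-produced output tokens into the next prefill segment."""
    if not 0 <= produced_tokens <= req.runtime_decode_tokens:
        raise ValueError("produced_tokens out of range")
    req.runtime_prefill_tokens += produced_tokens  # recomputed as prefill
    req.runtime_decode_tokens -= produced_tokens   # remaining decode work
    req.num_processed_tokens = 0
    req.is_prefill_complete = False                # request must prefill again


def assert_all_completed(requests: list[ToyRequest]) -> None:
    """Fail fast if the event queue drained with unfinished requests."""
    unfinished = [r for r in requests if not r.completed]
    if unfinished:
        raise RuntimeError(
            f"{len(unfinished)} of {len(requests)} requests incomplete: {unfinished!r}"
        )
```

For example, a request with a 1000-token prompt and 200 output tokens that is preempted after producing 50 tokens re-enters with a 1050-token prefill and 150 decode tokens, while `prompt_tokens` and `output_tokens` stay at 1000 and 200 for metrics.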
Verification Matrix
All patched runs used RS1 fixed config unless explicitly stated otherwise:
online, co-location, vLLM v1, A800, Qwen/Qwen3-32B, TP2, dummy predictor,
analytical communication backend, max_tokens=32768, prefix cache on, block
size 16, chunked prefill on, batch cap 128, max batch tokens 32768, memory
planner KV capacity.
| run | Frontier root | result | runtime | notes |
|---|---|---|---|---|
| `runs/rs1b/instrumentation/n193_instrumented_print` | scratch instrumentation | fail | 4s | Proved request 192 re-entered from `_preempted_requests` with `_scheduled=True`. |
| `runs/rs1b/patched/n193_fixed_v2` | patched scratch | pass | 11s | N=193 fixed config passed. |
| `runs/rs1b/patched/coder_100` | patched scratch | pass | 8s | Prefix hit ratios matched original RS1 `coder_100`. |
| `runs/rs1b/patched/coder_2000` | patched scratch | pass | 87s | Full fixed config run completed. |
| `runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2` | patched scratch | pass | 462s | RS10 H20 TP1 full32K profile; completion 200/200; 33 preemption events. |
| `runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3` | patched scratch | pass | 465s | RS10 H20 TP1 full32K profile; completion 200/200; 20 preemption events. |
Prefix cache summaries:
| run | Frontier block hit ratio | ReplayServe token-weighted hit ratio | preemption events |
|---|---|---|---|
| original `runs/rs1/coder_100` | 0.0494866184 | 0.0495623259 | 0 |
| patched `runs/rs1b/patched/coder_100` | 0.0494866184 | 0.0495623259 | 0 |
| patched `runs/rs1b/patched/n193_fixed_v2` | 0.1245897179 | 0.1247698141 | 5 |
| patched `runs/rs1b/patched/coder_2000` | 0.1231893025 | 0.1233297822 | 35940 |
| patched `runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2` | 0.2310157359 | 0.2313416900 | 33 |
| patched `runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3` | 0.2173684294 | 0.2176751278 | 20 |
For coder_2000, ReplayServe postprocess skipped 745 request rows whose
Frontier request metrics had blank prefix-cache fields. The run still completed
and produced system_metrics.json and request_metrics.csv.
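As an illustration of how a token-weighted hit ratio can be computed while skipping rows with blank prefix-cache fields, here is a minimal sketch. It assumes the ratio is total cached prefix tokens divided by total prompt tokens, which may differ from ReplayServe's actual definition. The column names `prompt_tokens` and `prefix_cached_tokens` are hypothetical, not the confirmed schema of `request_metrics.csv`.

```python
import csv
import io
from typing import Iterable, Mapping, Optional


def token_weighted_hit_ratio(
    rows: Iterable[Mapping[str, Optional[str]]],
    cached_col: str = "prefix_cached_tokens",  # hypothetical column name
    prompt_col: str = "prompt_tokens",         # hypothetical column name
) -> tuple[float, int]:
    """Return (cached tokens / prompt tokens, number of skipped rows)."""
    cached = total = skipped = 0
    for row in rows:
        cached_raw = (row.get(cached_col) or "").strip()
        prompt_raw = (row.get(prompt_col) or "").strip()
        if not cached_raw or not prompt_raw:
            skipped += 1  # blank field: leave the row out of both sums
            continue
        cached += int(cached_raw)
        total += int(prompt_raw)
    ratio = cached / total if total else 0.0
    return ratio, skipped


# Made-up sample data, not taken from any run in this document.
SAMPLE = """request_id,prompt_tokens,prefix_cached_tokens
0,100,0
1,200,32
2,300,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
ratio, skipped = token_weighted_hit_ratio(rows)
```

Skipped rows drop out of both the numerator and the denominator, so a large number of them can shift the ratio; reporting the skipped count alongside the ratio keeps that visible.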
Risks
- The patch touches Frontier private `Request` fields from scheduler code, matching existing local style but still relying on internal state layout.
- Resetting `_scheduled` during preemption may affect request scheduling accounting outside this RS1 path. It does not clear `_scheduled_at`, so schedule history remains present, but downstream assumptions about the boolean should be reviewed upstream.
- Resetting `_num_prefill_tokens_cached` means request-level cached-prefill metrics reflect the current post-preemption admission rather than stale pre-preemption state. This is necessary for the existing `on_cache_hit` path to model cached-prefix progress, but metrics semantics should be confirmed with Frontier maintainers.
- The decode-phase preemption replay mutates Frontier private request token fields. Metrics are explicitly anchored to user-facing prompt/output lengths, but upstream should review whether this should become a public `Request` method.
- The patched `coder_2000` run has many preemptions. RS1 remains a plumbing smoke; latency and throughput should not be treated as performance evidence.