# RS1B Frontier Patch

This document records the scratch Frontier patch used to unblock RS1 fixed config replay. It is not applied to the canonical Frontier checkout.

## Patch

- Patch file: `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`
- Canonical Frontier checkout: `/tmp/toc-llm-sim-research/Frontier`
- Scratch Frontier checkout: `/tmp/replayserve-frontier-rs1b`
- Frontier base HEAD: `d9cfeb6d8791fbf2f295dd9744c56a666171776e`

Apply from a Frontier checkout at the same base commit:

```bash
cd /path/to/Frontier
git apply /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
```

Check applicability without modifying a checkout:

```bash
cd /path/to/Frontier
git apply --check /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
```

## Root Cause

Instrumentation in the scratch checkout showed the minimal `N=193` failure has two admissions for request 192:

```text
req=192 source=request_queue scheduled=False preempted=False prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=112 num_new_tokens=64
req=192 source=preempted_requests scheduled=True preempted=True prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=1232 num_new_tokens=64
```

The second admission comes from `_preempted_requests`. Frontier preemption resets `victim._num_processed_tokens` and removes the explicit scheduler frontier, but it leaves `victim._scheduled=True`. The request then re-enters waiting admission, prefix-cache admission finds cached blocks, and `request.on_cache_hit(prefix_cached_tokens)` raises because `on_cache_hit` requires `_scheduled=False`.

The failure is therefore a Frontier runtime-state reset issue for preempted chunked-prefill requests with prefix caching enabled, not bad ReplayServe trace data.

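
The failure sequence can be reproduced in miniature. The sketch below is not Frontier code: `ToyRequest`, `on_schedule`, `preempt_unpatched`, and the choice of `RuntimeError` are hypothetical stand-ins, and only the `_scheduled` precondition on `on_cache_hit` and the token counts from the log are taken from the description above.

```python
class ToyRequest:
    """Hypothetical stand-in for a Frontier request; not the real class."""

    def __init__(self) -> None:
        self._scheduled = False
        self._num_processed_tokens = 0

    def on_cache_hit(self, prefix_cached_tokens: int) -> None:
        # Documented precondition: only valid while the request is unscheduled.
        # The exception type is an assumption made for this sketch.
        if self._scheduled:
            raise RuntimeError("on_cache_hit requires _scheduled=False")
        self._num_processed_tokens = prefix_cached_tokens

    def on_schedule(self) -> None:
        self._scheduled = True


def preempt_unpatched(victim: ToyRequest) -> None:
    # Models the observed behavior: progress is reset, the flag is not.
    victim._num_processed_tokens = 0


req = ToyRequest()
req.on_cache_hit(112)       # first admission, source=request_queue
req.on_schedule()
preempt_unpatched(req)      # preempted while prefill is still incomplete

try:
    req.on_cache_hit(1232)  # second admission, source=preempted_requests
    second_admission_failed = False
except RuntimeError:
    second_admission_failed = True
```

In this toy model the second `on_cache_hit` call fails only because `_scheduled` survived preemption. The cached-token counts themselves are valid, which is consistent with the conclusion that the trace data is not at fault.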
## Patch Rationale

The first patch reset two request runtime fields in `VLLMv1EngineReplicaScheduler._preempt_request`:

```python
victim._num_prefill_tokens_cached = 0
victim._scheduled = False
```

This matches the existing preemption intent in the same block: computed tokens are reset and the request is re-entered into a waiting queue for recomputation. After that reset, waiting admission can apply prefix-cache hit state through the existing `Request.on_cache_hit` path before the request is scheduled again.

An earlier conservative experiment skipped `on_cache_hit` for already scheduled requests and advanced only the scheduler frontier. That avoided the immediate exception but left request 192 incomplete at simulation shutdown, because the request object's processed-token state never reflected the cached prefix.

The 2026-06-25 RS10 debug runs exposed a second lifecycle bug. Missing request metrics for `coder_200_ts2` and `coder_200_ts3` were not postprocess artifacts: Frontier drained with `completed_requests < total_requests`. Missing requests had this state pattern:

```text
preempted=True is_prefill_complete=True num_processed_tokens=0 scheduled=False completed=False
```

They had been preempted after entering decode. Frontier cleared processed tokens but kept the request in prefill-complete state. The next waiting admission therefore computed `num_new_tokens=0` and dropped the request from the waiting queue.

The current patch now also:

- replays decode-phase preemption by turning already-produced tokens into the next prefill segment and leaving the remaining tokens as decode work;
- preserves user-facing prompt/output lengths for metrics after runtime token splitting;
- preserves unfinished zero-token waiting requests instead of silently dropping them;
- makes sequential simulation fail fast if the event queue drains before all generated requests complete, with per-request debug snapshots.

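
The token re-split for decode-phase preemption can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch itself: `ToyTokenState`, its field names, and both helper functions are hypothetical, the example sizes are arbitrary, and the sketch assumes that the replayed prefill segment covers the prompt plus the tokens already produced.

```python
from dataclasses import dataclass


@dataclass
class ToyTokenState:
    """Hypothetical per-request token bookkeeping; not Frontier's real fields."""

    prompt_tokens: int           # user-facing prompt length, kept for metrics
    output_tokens: int           # user-facing output length, kept for metrics
    runtime_prefill_tokens: int  # tokens the scheduler must prefill
    runtime_decode_tokens: int   # tokens the scheduler must still decode
    num_processed_tokens: int = 0
    prefill_complete: bool = False


def replay_after_decode_preemption(state: ToyTokenState, produced: int) -> None:
    """Re-split runtime work after a preemption that happened during decode."""
    if not 0 <= produced <= state.output_tokens:
        raise ValueError("produced must be within the output length")
    # Assumed reading: prompt plus already-produced tokens are recomputed as prefill.
    state.runtime_prefill_tokens = state.prompt_tokens + produced
    state.runtime_decode_tokens = state.output_tokens - produced
    state.num_processed_tokens = 0
    state.prefill_complete = False  # otherwise admission would see no new tokens


def num_new_prefill_tokens(state: ToyTokenState) -> int:
    """Prefill work that waiting admission would see for this request."""
    if state.prefill_complete:
        return 0
    return state.runtime_prefill_tokens - state.num_processed_tokens


# Arbitrary example: 1000-token prompt, 200-token output, preempted after 40 outputs.
state = ToyTokenState(1000, 200, 1000, 200, num_processed_tokens=1040, prefill_complete=True)
# The unpatched state pattern: progress cleared, but still marked prefill-complete.
stuck = ToyTokenState(1000, 200, 1000, 200, num_processed_tokens=0, prefill_complete=True)

replay_after_decode_preemption(state, produced=40)
```

Here `stuck` reports zero new prefill tokens, which is the condition under which the request was dropped. After the re-split, `state` exposes 1040 tokens of prefill work and 160 tokens of decode work, while `prompt_tokens` and `output_tokens` remain untouched for metrics.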
## Verification Matrix

All patched runs used RS1 fixed config unless explicitly stated otherwise: online, co-location, vLLM v1, A800, Qwen/Qwen3-32B, TP2, dummy predictor, analytical communication backend, `max_tokens=32768`, prefix cache on, block size 16, chunked prefill on, batch cap 128, max batch tokens 32768, memory planner KV capacity.

| run | Frontier root | result | runtime | notes |
|---|---|---:|---:|---|
| `runs/rs1b/instrumentation/n193_instrumented_print` | scratch instrumentation | fail | 4s | Proved request 192 re-entered from `_preempted_requests` with `_scheduled=True`. |
| `runs/rs1b/patched/n193_fixed_v2` | patched scratch | pass | 11s | `N=193` fixed config passed. |
| `runs/rs1b/patched/coder_100` | patched scratch | pass | 8s | Prefix hit ratios matched original RS1 `coder_100`. |
| `runs/rs1b/patched/coder_2000` | patched scratch | pass | 87s | Full fixed config run completed. |
| `runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2` | patched scratch | pass | 462s | RS10 H20 TP1 full32K profile; completion `200/200`; 33 preemption events. |
| `runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3` | patched scratch | pass | 465s | RS10 H20 TP1 full32K profile; completion `200/200`; 20 preemption events. |

Prefix cache summaries:

| run | Frontier block hit ratio | ReplayServe token-weighted hit ratio | preemption events |
|---|---:|---:|---:|
| original `runs/rs1/coder_100` | 0.0494866184 | 0.0495623259 | 0 |
| patched `runs/rs1b/patched/coder_100` | 0.0494866184 | 0.0495623259 | 0 |
| patched `runs/rs1b/patched/n193_fixed_v2` | 0.1245897179 | 0.1247698141 | 5 |
| patched `runs/rs1b/patched/coder_2000` | 0.1231893025 | 0.1233297822 | 35940 |
| patched `runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2` | 0.2310157359 | 0.2313416900 | 33 |
| patched `runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3` | 0.2173684294 | 0.2176751278 | 20 |

For `coder_2000`, ReplayServe postprocess skipped 745 request rows whose Frontier request metrics had blank prefix-cache fields. The run still completed and produced `system_metrics.json` and `request_metrics.csv`.

## Risks

- The patch touches Frontier private `Request` fields from scheduler code, matching existing local style but still relying on internal state layout.
- Resetting `_scheduled` during preemption may affect request scheduling accounting outside this RS1 path. It does not clear `_scheduled_at`, so schedule history remains present, but downstream assumptions about the boolean should be reviewed upstream.
- Resetting `_num_prefill_tokens_cached` means request-level cached-prefill metrics reflect the current post-preemption admission rather than stale pre-preemption state. This is necessary for the existing `on_cache_hit` path to model cached-prefix progress, but metrics semantics should be confirmed with Frontier maintainers.
- The decode-phase preemption replay mutates Frontier private request token fields. Metrics are explicitly anchored to user-facing prompt/output lengths, but upstream should review whether this should become a public Request method.
- The patched `coder_2000` run has many preemptions. RS1 remains a plumbing smoke; latency and throughput should not be treated as performance evidence.