# RS1B Frontier Patch

This document records the scratch Frontier patch used to unblock RS1
fixed-config replay. The patch is not applied to the canonical Frontier
checkout.

## Patch

- Patch file:
  `patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch`
- Canonical Frontier checkout:
  `/tmp/toc-llm-sim-research/Frontier`
- Scratch Frontier checkout:
  `/tmp/replayserve-frontier-rs1b`
- Frontier base HEAD:
  `d9cfeb6d8791fbf2f295dd9744c56a666171776e`

Apply from a Frontier checkout at the same base commit:

```bash
cd /path/to/Frontier
git apply /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
```

Check applicability without modifying a checkout:

```bash
cd /path/to/Frontier
git apply --check /home/gahow/phd/replayserve/patches/frontier-vllm-v1-prefix-cache-chunked-prefill.patch
```

## Root Cause

Instrumentation in the scratch checkout showed that the minimal `N=193`
failure involves two admissions of request 192:

```text
req=192 source=request_queue scheduled=False preempted=False prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=112 num_new_tokens=64
req=192 source=preempted_requests scheduled=True preempted=True prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=1232 num_new_tokens=64
```
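The two records are plain `key=value` lines, so they can be compared mechanically. The sketch below is a small helper written for this note, not part of the instrumentation; `parse_admission_line` and `changed_fields` are hypothetical names.

```python
def parse_admission_line(line: str) -> dict:
    """Parse one `key=value` instrumentation line into typed fields."""
    record = {}
    for token in line.split():
        key, _, value = token.partition("=")
        if value in ("True", "False"):
            record[key] = value == "True"
        elif value.isdigit():
            record[key] = int(value)
        else:
            record[key] = value
    return record


def changed_fields(first: dict, second: dict) -> dict:
    """Return {field: (first_value, second_value)} for fields that differ."""
    return {k: (first[k], second[k]) for k in first if first[k] != second.get(k)}


lines = [
    "req=192 source=request_queue scheduled=False preempted=False "
    "prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=112 "
    "num_new_tokens=64",
    "req=192 source=preempted_requests scheduled=True preempted=True "
    "prefill_complete=False num_processed_tokens=0 prefix_cached_tokens=1232 "
    "num_new_tokens=64",
]
first, second = (parse_admission_line(line) for line in lines)
diff = changed_fields(first, second)
```

For these two lines, `diff` contains `source`, `scheduled`, `preempted`, and `prefix_cached_tokens`; `num_processed_tokens` is 0 in both admissions.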

The second admission comes from `_preempted_requests`. Frontier preemption
resets `victim._num_processed_tokens` and removes the explicit scheduler
frontier, but it leaves `victim._scheduled=True`. The request then re-enters
waiting admission, prefix-cache admission finds cached blocks, and
`request.on_cache_hit(prefix_cached_tokens)` raises because `on_cache_hit`
requires `_scheduled=False`.

The failure is therefore a Frontier runtime-state reset issue for preempted
chunked-prefill requests with prefix caching enabled, not bad ReplayServe
trace data.
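The sequence can be reproduced with a toy model. This is a sketch under stated assumptions, not Frontier code: `ToyRequest`, `preempt_unpatched`, and the `RuntimeError` are stand-ins chosen for illustration, and only the fields named above are modelled.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Toy stand-in for a Frontier request (hypothetical, for illustration)."""
    num_processed_tokens: int = 0
    scheduled: bool = False
    preempted: bool = False

    def on_cache_hit(self, prefix_cached_tokens: int) -> None:
        # Models the documented precondition: a cache hit may only be
        # applied to a request that is not currently marked scheduled.
        if self.scheduled:
            raise RuntimeError("on_cache_hit called on a scheduled request")
        self.num_processed_tokens = prefix_cached_tokens


def preempt_unpatched(victim: ToyRequest) -> None:
    """Model of the pre-patch reset: progress is cleared, the flag is not."""
    victim.num_processed_tokens = 0
    victim.preempted = True
    # victim.scheduled stays True, which is the state the logs show.


req = ToyRequest()
req.on_cache_hit(112)      # first admission, from the request queue
req.scheduled = True       # the request is scheduled
preempt_unpatched(req)     # preempted, then re-enters waiting admission

try:
    req.on_cache_hit(1232)  # second admission, from the preempted queue
    failure = None
except RuntimeError as exc:
    failure = str(exc)
```

In this model the second `on_cache_hit` call raises, and the request is left with `num_processed_tokens=0` and `scheduled=True`, matching the second log line.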

## Patch Rationale

The first version of the patch reset two request runtime fields in
`VLLMv1EngineReplicaScheduler._preempt_request`:

```python
victim._num_prefill_tokens_cached = 0
victim._scheduled = False
```

This matches the existing preemption intent in the same block: computed tokens
are reset and the request is re-entered into a waiting queue for recomputation.
After that reset, waiting admission can apply prefix-cache hit state through
the existing `Request.on_cache_hit` path before the request is scheduled again.
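A minimal sketch of the effect of that reset, using a toy request rather than the Frontier class. The field names mirror those in the patch; `ToyRequest`, `preempt_patched`, and the body of `on_cache_hit` are assumptions made for illustration.

```python
class ToyRequest:
    """Toy request carrying only the runtime fields discussed here."""

    def __init__(self) -> None:
        self._num_processed_tokens = 0
        self._num_prefill_tokens_cached = 0
        self._scheduled = False

    def on_cache_hit(self, prefix_cached_tokens: int) -> None:
        # Assumed behaviour: requires an unscheduled request, then records
        # the cached prefix and advances processed-token state to match.
        if self._scheduled:
            raise RuntimeError("on_cache_hit called on a scheduled request")
        self._num_prefill_tokens_cached = prefix_cached_tokens
        self._num_processed_tokens = prefix_cached_tokens


def preempt_patched(victim: ToyRequest) -> None:
    """Model of preemption with the two additional resets applied."""
    victim._num_processed_tokens = 0       # existing reset
    victim._num_prefill_tokens_cached = 0  # added by the patch
    victim._scheduled = False              # added by the patch


req = ToyRequest()
req.on_cache_hit(112)    # first admission
req._scheduled = True    # scheduled for a prefill chunk
preempt_patched(req)     # preempted and returned to waiting admission
req.on_cache_hit(1232)   # second admission no longer raises
```

After the second admission the toy request reports 1232 processed tokens, so the cached prefix is reflected in request state before the request is scheduled again.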

An earlier conservative experiment skipped `on_cache_hit` for already scheduled
requests and advanced only the scheduler frontier. That avoided the immediate
exception but left request 192 incomplete at simulation shutdown, because the
request object's processed-token state never reflected the cached prefix.

The 2026-06-25 RS10 debug runs exposed a second lifecycle bug. The missing
request metrics for `coder_200_ts2` and `coder_200_ts3` were not postprocessing
artifacts: Frontier's event queue drained with
`completed_requests < total_requests`. The missing requests shared this state
pattern:

```text
preempted=True
is_prefill_complete=True
num_processed_tokens=0
scheduled=False
completed=False
```

These requests had been preempted after entering decode. Frontier cleared
their processed tokens but left them in the prefill-complete state, so the next
waiting admission computed `num_new_tokens=0` and dropped them from the waiting
queue.
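The drop can be illustrated with a toy admission loop. Frontier's actual chunk-sizing arithmetic is not reproduced here: `num_new_tokens`, `admit_waiting`, and the `keep_zero_token_requests` flag are hypothetical, and the model only assumes that a prefill-complete request asks for no prefill work.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ToyRequest:
    req_id: int
    num_prefill_tokens: int
    num_processed_tokens: int = 0
    is_prefill_complete: bool = False
    completed: bool = False


def num_new_tokens(request: ToyRequest, chunk_size: int = 64) -> int:
    """Assumed sizing rule: prefill-complete requests need zero new tokens."""
    if request.is_prefill_complete:
        return 0
    remaining = request.num_prefill_tokens - request.num_processed_tokens
    return min(chunk_size, remaining)


def admit_waiting(waiting: deque, keep_zero_token_requests: bool) -> list:
    """Pop every waiting request; schedule those that have work to do."""
    scheduled, still_waiting = [], deque()
    while waiting:
        request = waiting.popleft()
        if num_new_tokens(request) > 0:
            scheduled.append(request)
        elif keep_zero_token_requests and not request.completed:
            still_waiting.append(request)
        # Otherwise the request silently leaves the queue.
    waiting.extend(still_waiting)
    return scheduled


def stuck_request() -> ToyRequest:
    # Preempted during decode: progress cleared, prefill flag still set.
    return ToyRequest(req_id=0, num_prefill_tokens=1000,
                      num_processed_tokens=0, is_prefill_complete=True)


dropped_queue = deque([stuck_request()])
admit_waiting(dropped_queue, keep_zero_token_requests=False)

kept_queue = deque([stuck_request()])
admit_waiting(kept_queue, keep_zero_token_requests=True)
```

With the flag off, the unfinished request disappears from `dropped_queue` without being scheduled; with it on, the request stays in `kept_queue`.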

The current patch additionally:

- replays decode-phase preemption by turning already-produced tokens into the
  next prefill segment and leaving the remaining tokens as decode work;
- preserves user-facing prompt/output lengths for metrics after runtime token
  splitting;
- preserves unfinished zero-token waiting requests instead of silently dropping
  them;
- makes sequential simulation fail fast if the event queue drains before all
  generated requests complete, with per-request debug snapshots.
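The token split and the drain check can be sketched as follows. This is an illustration under assumptions, not the patch itself: all names are hypothetical, and the sketch assumes the replayed prefill segment covers the original prompt plus the tokens already produced.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    req_id: int
    user_prompt_tokens: int      # user-facing length, never mutated
    user_output_tokens: int      # user-facing length, never mutated
    runtime_prefill_tokens: int  # what the scheduler must (re)compute
    runtime_decode_tokens: int   # decode work still outstanding
    completed: bool = False


def replay_after_decode_preemption(request: ToyRequest, produced: int) -> None:
    """Fold already-produced tokens into the next prefill segment."""
    if not 0 <= produced <= request.runtime_decode_tokens:
        raise ValueError("produced tokens out of range")
    request.runtime_prefill_tokens += produced
    request.runtime_decode_tokens -= produced


def check_all_completed(requests: list) -> None:
    """Fail fast if the event queue drained with unfinished requests."""
    unfinished = [r for r in requests if not r.completed]
    if unfinished:
        snapshots = "; ".join(repr(r) for r in unfinished)
        raise RuntimeError(
            f"event queue drained with {len(unfinished)} of "
            f"{len(requests)} requests incomplete: {snapshots}"
        )


req = ToyRequest(req_id=7, user_prompt_tokens=1000, user_output_tokens=300,
                 runtime_prefill_tokens=1000, runtime_decode_tokens=300)
replay_after_decode_preemption(req, produced=120)

try:
    check_all_completed([req])
    drain_error = None
except RuntimeError as exc:
    drain_error = str(exc)
```

In this example the runtime split becomes 1120 prefill tokens and 180 decode tokens, while the user-facing lengths stay at 1000 and 300 for metrics. Because the request is still incomplete, `check_all_completed` raises with its snapshot instead of returning silently.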

## Verification Matrix

All patched runs used RS1 fixed config unless explicitly stated otherwise:
online, co-location, vLLM v1, A800, Qwen/Qwen3-32B, TP2, dummy predictor,
analytical communication backend, `max_tokens=32768`, prefix cache on, block
size 16, chunked prefill on, batch cap 128, max batch tokens 32768, memory
planner KV capacity.

| run | Frontier root | result | runtime | notes |
|---|---|---:|---:|---|
| `runs/rs1b/instrumentation/n193_instrumented_print` | scratch instrumentation | fail | 4s | Proved request 192 re-entered from `_preempted_requests` with `_scheduled=True`. |
| `runs/rs1b/patched/n193_fixed_v2` | patched scratch | pass | 11s | `N=193` fixed config passed. |
| `runs/rs1b/patched/coder_100` | patched scratch | pass | 8s | Prefix hit ratios matched original RS1 `coder_100`. |
| `runs/rs1b/patched/coder_2000` | patched scratch | pass | 87s | Full fixed config run completed. |
| `runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2` | patched scratch | pass | 462s | RS10 H20 TP1 full32K profile; completion `200/200`; 33 preemption events. |
| `runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3` | patched scratch | pass | 465s | RS10 H20 TP1 full32K profile; completion `200/200`; 20 preemption events. |

Prefix cache summaries:

| run | Frontier block hit ratio | ReplayServe token-weighted hit ratio | preemption events |
|---|---:|---:|---:|
| original `runs/rs1/coder_100` | 0.0494866184 | 0.0495623259 | 0 |
| patched `runs/rs1b/patched/coder_100` | 0.0494866184 | 0.0495623259 | 0 |
| patched `runs/rs1b/patched/n193_fixed_v2` | 0.1245897179 | 0.1247698141 | 5 |
| patched `runs/rs1b/patched/coder_2000` | 0.1231893025 | 0.1233297822 | 35940 |
| patched `runs/rs10_preemption_replay_fix_ts2/.../coder_200_ts2` | 0.2310157359 | 0.2313416900 | 33 |
| patched `runs/rs10_preemption_replay_fix_ts3/.../coder_200_ts3` | 0.2173684294 | 0.2176751278 | 20 |

For `coder_2000`, ReplayServe postprocess skipped 745 request rows whose
Frontier request metrics had blank prefix-cache fields. The run still completed
and produced `system_metrics.json` and `request_metrics.csv`.
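A hedged sketch of the row filter described above. The ReplayServe postprocess code and the real `request_metrics.csv` schema are not shown in this document, so the column names `request_id` and `prefix_cached_tokens` and the function `split_rows` are placeholders.

```python
import csv
import io

# Hypothetical column name; substitute the real prefix-cache fields.
PREFIX_FIELDS = ("prefix_cached_tokens",)


def split_rows(csv_text: str, required_fields=PREFIX_FIELDS):
    """Separate rows with all required fields filled from rows with blanks."""
    kept, skipped = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if any((row.get(field) or "").strip() == "" for field in required_fields):
            skipped.append(row)
        else:
            kept.append(row)
    return kept, skipped


# Synthetic sample for illustration only, not data from any run.
sample = "request_id,prefix_cached_tokens\n0,112\n1,\n2,1232\n"
kept, skipped = split_rows(sample)
```

Reporting `len(skipped)` next to the hit ratios makes it clear how many requests the token-weighted figure excludes.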

## Risks

- The patch touches Frontier private `Request` fields from scheduler code,
  matching existing local style but still relying on internal state layout.
- Resetting `_scheduled` during preemption may affect request scheduling
  accounting outside this RS1 path. It does not clear `_scheduled_at`, so
  schedule history remains present, but downstream assumptions about the
  boolean should be reviewed upstream.
- Resetting `_num_prefill_tokens_cached` means request-level cached-prefill
  metrics reflect the current post-preemption admission rather than stale
  pre-preemption state. This is necessary for the existing `on_cache_hit` path
  to model cached-prefix progress, but metrics semantics should be confirmed
  with Frontier maintainers.
- The decode-phase preemption replay mutates Frontier private request token
  fields. Metrics are explicitly anchored to user-facing prompt/output lengths,
  but upstream should review whether this should become a public Request method.
- The patched `coder_2000` run has many preemptions. RS1 remains a plumbing
  smoke test; its latency and throughput should not be treated as performance
  evidence.