# RS1 Frontier Blocker: Prefix Cache + Chunked Prefill

This note narrows the RS1 `coder_2000` failure into a small Frontier repro. It does not change the RS1 fixed config or make performance claims.

## Status

- Frontier repo: `/tmp/toc-llm-sim-research/Frontier`
- Frontier HEAD: `d9cfeb6d8791fbf2f295dd9744c56a666171776e`
- ReplayServe canonical fixtures were not changed.
- Frontier source was not modified.
- Diagnostic artifacts live under `runs/rs1/blocker_request_194/`.

The original `coder_2000` run failed with:

```text
ValueError: Request 194 already scheduled.
```

First-N probing shows the smaller blocker is not a single malformed row 194. The smallest observed first-N failure is `N=193`, which fails as:

```text
ValueError: Request 192 already scheduled.
```

`N=192` passes under the same fixed config.

## Repro Commands

Generate diagnostic slices:

```bash
cd /home/gahow/phd/replayserve
for n in 190 191 192 193 194 195 200; do
  out="runs/rs1/blocker_request_194/fixtures/coder_${n}"
  mkdir -p "$out"
  python3 tools/qwen_to_frontier.py \
    --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
    --frontier-csv "$out/frontier.csv" \
    --sidecar-jsonl "$out/sidecar.jsonl" \
    --source-jsonl "$out/source.jsonl" \
    --manifest-json "$out/manifest.json" \
    --fixture-name "blocker_coder_${n}" \
    --limit "$n" \
    --max-tokens 32768 \
    --block-size 16 \
    --fail-on-overflow
done
```

Minimal failing command:

```bash
cd /home/gahow/phd/replayserve
scripts/run_frontier_blocker_probe.sh \
  n193_default \
  runs/rs1/blocker_request_194/fixtures/coder_193
```

The exact Frontier CLI for every probe is preserved in each `runs/rs1/blocker_request_194/probes//command.txt`.

## First-N Matrix

All default rows use the RS1 fixed config: prefix caching on, chunked prefill on, `long_prefill_token_threshold=64`, batch cap 128, max batch tokens 32768.

| probe | rows | prefix cache | chunked prefill | threshold | exit | result |
|---|---:|---|---|---:|---:|---|
| `n190_default` | 190 | on | on | 64 | 0 | pass |
| `n191_default` | 191 | on | on | 64 | 0 | pass |
| `n192_default` | 192 | on | on | 64 | 0 | pass |
| `n193_default` | 193 | on | on | 64 | 1 | `Request 192 already scheduled` |
| `n194_default` | 194 | on | on | 64 | 1 | `Request 192 already scheduled` |
| `n195_default` | 195 | on | on | 64 | 1 | `Request 194 already scheduled` |
| `n200_default` | 200 | on | on | 64 | 1 | `Request 194 already scheduled` |

## Diagnostic Variants

These are diagnosis only. They are not replacements for the RS1 fixed config.

| probe | rows | prefix cache | chunked prefill | threshold | exit | result |
|---|---:|---|---|---:|---:|---|
| `n193_prefix_off` | 193 | off | on | 64 | 0 | pass |
| `n193_chunked_off` | 193 | on | off | 64 | 1 | Frontier config rejects this combination |
| `n193_chunked_off_threshold_0` | 193 | on | off | 0 | 0 | pass |
| `n193_threshold_32768` | 193 | on | on | 32768 | 0 | pass |
| `n195_prefix_off` | 195 | off | on | 64 | 0 | pass |
| `n195_chunked_off` | 195 | on | off | 64 | 1 | Frontier config rejects this combination |
| `n195_chunked_off_threshold_0` | 195 | on | off | 0 | 0 | pass |
| `n195_threshold_32768` | 195 | on | on | 32768 | 0 | pass |
| `n200_prefix_off` | 200 | off | on | 64 | 0 | pass |
| `n200_chunked_off_threshold_0` | 200 | on | off | 0 | 0 | pass |
| `n200_threshold_32768` | 200 | on | on | 32768 | 0 | pass |

Frontier enforces:

```text
VllmV1SchedulerConfig.long_prefill_token_threshold > 0 requires enable_chunked_prefill=True
```

So a valid chunked-off diagnostic also sets `LONG_PREFILL_TOKEN_THRESHOLD=0`.

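The configuration constraint quoted above can be sketched as a small validation check. This is only an illustration of the rule, not Frontier's code: the class `SchedulerConfigSketch`, its `validate` method, and the helper `is_valid` are hypothetical names. Only the two field names and the rule itself come from the quoted error message.

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfigSketch:
    """Hypothetical stand-in holding the two fields named in the error message."""

    enable_chunked_prefill: bool
    long_prefill_token_threshold: int

    def validate(self) -> None:
        # Rule from the quoted message: a positive threshold needs chunked prefill.
        if self.long_prefill_token_threshold > 0 and not self.enable_chunked_prefill:
            raise ValueError(
                "long_prefill_token_threshold > 0 requires enable_chunked_prefill=True"
            )


def is_valid(chunked: bool, threshold: int) -> bool:
    """Return True when the combination passes the sketch's validation."""
    try:
        SchedulerConfigSketch(chunked, threshold).validate()
    except ValueError:
        return False
    return True
```

Under this sketch, the `chunked_off` probes with threshold 64 are rejected, while the `chunked_off_threshold_0` probes and every chunked-on probe are accepted, which matches the exit codes in the table.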
## Local Trace Analysis

Generated files:

- `runs/rs1/blocker_request_194/analysis/request_192_analysis.json`
- `runs/rs1/blocker_request_194/analysis/request_192_analysis.md`
- `runs/rs1/blocker_request_194/analysis/request_194_analysis.json`
- `runs/rs1/blocker_request_194/analysis/request_194_analysis.md`

Request 192, the minimal first-N failure target:

- `timestamp=43.406`
- `chat_id=192`, `parent_chat_id=-1`, `turn=1`, `type=coder`
- `input_length=13436`, `output_length=1425`, total `14861`
- `hash_count=840`
- partial final block: yes, final block token count `12`
- top prior prefix overlap: 7 blocks, 112 tokens
- no parent candidate in the sidecar

Request 194, the original `coder_2000` failing request:

- `timestamp=43.931`
- `chat_id=194`, `parent_chat_id=-1`, `turn=1`, `type=coder`
- `input_length=2064`, `output_length=2278`, total `4342`
- `hash_count=129`
- partial final block: no, final block token count `16`
- top prior prefix overlap: 1 block, 16 tokens
- no parent candidate in the sidecar

Interpretation:

- The failing requests are independent first turns, not child turns in a chat.
- Request 192 has a partial final block, but its observed prior prefix overlap is only the first 7 full blocks.
- Request 194 has no partial final block and only a 1-block prefix overlap.
- The failure is therefore not explained by a malformed partial final block, deep shared-prefix trace structure, or a parent/child chat mismatch.
- Fixture validation confirms monotonic timestamps, max-token compliance, sidecar hash lengths, and block token counts.

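The block figures reported for requests 192 and 194 can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes `hash_count` is the number of 16-token blocks needed to cover `input_length` (consistent with `--block-size 16` and with the reported values, but not confirmed against the sidecar format). The function name `block_stats` is hypothetical.

```python
def block_stats(input_length: int, block_size: int = 16) -> tuple[int, int, bool]:
    """Return (block_count, final_block_tokens, has_partial_final_block)."""
    if input_length <= 0 or block_size <= 0:
        raise ValueError("input_length and block_size must be positive")
    # Ceiling division: number of blocks needed to cover every input token.
    block_count = -(-input_length // block_size)
    # Tokens left for the last block after all preceding full blocks.
    final_block_tokens = input_length - (block_count - 1) * block_size
    return block_count, final_block_tokens, final_block_tokens < block_size


print(block_stats(13436))  # request 192 -> (840, 12, True)
print(block_stats(2064))   # request 194 -> (129, 16, False)
```

Both results agree with the analysis files. The overlap figures follow the same block size: 7 blocks × 16 = 112 tokens for request 192 and 1 block × 16 = 16 tokens for request 194.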
## Frontier Source Localization

Relevant Frontier files:

- `/tmp/toc-llm-sim-research/Frontier/frontier/scheduler/replica_scheduler/vllm_v1_engine_replica_scheduler.py`
- `/tmp/toc-llm-sim-research/Frontier/frontier/entities/request.py`
- `/tmp/toc-llm-sim-research/Frontier/frontier/config/config.py`

Key path:

- `VllmV1EngineReplicaScheduler._prepare_prefix_cache_admission` at `vllm_v1_engine_replica_scheduler.py:1178` calls `kv_cache_manager.get_computed_blocks(request)` and returns `prefix_cached_tokens`.
- `_schedule_waiting_requests` at `vllm_v1_engine_replica_scheduler.py:3075` runs prefix-cache admission for any waiting request with prefix caching enabled and incomplete prefill.
- The same waiting path allocates KV and then calls `request.on_cache_hit(prefix_cached_tokens)` at `vllm_v1_engine_replica_scheduler.py:3179`.
- `Request.on_cache_hit` at `request.py:503` raises if `_scheduled` is already true.
- `Request.on_batch_schedule` at `request.py:1058` sets `_scheduled=True`.
- Chunked-prefill continuations run through `_schedule_running_requests` around `vllm_v1_engine_replica_scheduler.py:2696`, with long-prefill capping applied around `:2826`.
- Valid chunked-off CLI requires `long_prefill_token_threshold=0`; otherwise `config.py:714` rejects the configuration.

The evidence points to a Frontier scheduler state issue: with prefix caching enabled and chunked prefill active, a request that has already been scheduled can later reach waiting-admission prefix-cache handling and receive `on_cache_hit` again. That violates `Request.on_cache_hit`'s current invariant. This is more consistent with a repeated cache-hit application or scheduled request re-admission path than with bad ReplayServe trace/hash data.

## Suggested Next Steps

1. Add temporary Frontier instrumentation around `_schedule_waiting_requests` before `request.on_cache_hit` to log `request.id`, `_scheduled`, `_preempted`, `is_prefill_complete`, `num_processed_tokens`, `prefix_cached_tokens`, and whether the request came from `_preempted_requests` or `_request_queue`.
2. Decide Frontier semantics for prefix-cache hits after a request has already been scheduled once. A likely fix is to apply `on_cache_hit` only for a first admission with `_scheduled=False` and `num_processed_tokens=0`, or to reset/request-restart state before re-admission if that is the intended vLLM parity behavior.
3. Keep RS1 fixed config blocked for `coder_2000` until Frontier behavior is patched or a documented upstream-compatible workaround is selected.
4. Do not use the passing diagnosis variants as RS1 performance evidence: prefix-off, chunked-off, and threshold-32768 change the fixed config.
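
The invariant described under Frontier Source Localization and the first-admission guard proposed in step 2 can be illustrated with a toy model. This is a sketch under stated assumptions, not Frontier's implementation: `RequestSketch` and `apply_cache_hit_on_first_admission` are hypothetical names, and having `on_cache_hit` add the cached tokens to `num_processed_tokens` is an assumption made only so the guard has state to inspect. Only the `_scheduled` check and the error text mirror what this note reports.

```python
class RequestSketch:
    """Toy model of the two state transitions described in this note."""

    def __init__(self, request_id: int) -> None:
        self.id = request_id
        self._scheduled = False
        self.num_processed_tokens = 0

    def on_batch_schedule(self) -> None:
        # Mirrors the reported behavior: scheduling sets the flag.
        self._scheduled = True

    def on_cache_hit(self, prefix_cached_tokens: int) -> None:
        # Mirrors the reported invariant: a cache hit after scheduling raises.
        if self._scheduled:
            raise ValueError(f"Request {self.id} already scheduled.")
        # Assumption for illustration: cached tokens count as processed.
        self.num_processed_tokens += prefix_cached_tokens


def apply_cache_hit_on_first_admission(
    request: RequestSketch, prefix_cached_tokens: int
) -> bool:
    """Hypothetical guard: apply the hit only on a true first admission."""
    if request._scheduled or request.num_processed_tokens != 0:
        return False  # skip instead of tripping the invariant
    request.on_cache_hit(prefix_cached_tokens)
    return True
```

In this model, a first admission of request 192 with a 112-token prefix hit succeeds, and a second admission attempt after `on_batch_schedule()` is skipped by the guard rather than raising. Whether skipping or resetting state is the right behavior for vLLM parity is exactly the open question in step 2.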