RS1 Frontier Blocker: Prefix Cache + Chunked Prefill
This note narrows the RS1 coder_2000 failure into a small Frontier repro.
It does not change the RS1 fixed config or make performance claims.
Status
- Frontier repo: /tmp/toc-llm-sim-research/Frontier
- Frontier HEAD: d9cfeb6d8791fbf2f295dd9744c56a666171776e
- ReplayServe canonical fixtures were not changed.
- Frontier source was not modified.
- Diagnostic artifacts live under runs/rs1/blocker_request_194/.
The original coder_2000 run failed with:
ValueError: Request 194 already scheduled.
First-N probing shows that the blocker reproduces on a smaller slice and is not a single malformed row 194.
The smallest observed first-N failure is N=193, which fails as:
ValueError: Request 192 already scheduled.
N=192 passes under the same fixed config.
Repro Commands
Generate diagnostic slices:
cd /home/gahow/phd/replayserve
for n in 190 191 192 193 194 195 200; do
  out="runs/rs1/blocker_request_194/fixtures/coder_${n}"
  mkdir -p "$out"
  python3 tools/qwen_to_frontier.py \
    --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
    --frontier-csv "$out/frontier.csv" \
    --sidecar-jsonl "$out/sidecar.jsonl" \
    --source-jsonl "$out/source.jsonl" \
    --manifest-json "$out/manifest.json" \
    --fixture-name "blocker_coder_${n}" \
    --limit "$n" \
    --max-tokens 32768 \
    --block-size 16 \
    --fail-on-overflow
done
Minimal failing command:
cd /home/gahow/phd/replayserve
scripts/run_frontier_blocker_probe.sh \
  n193_default \
  runs/rs1/blocker_request_194/fixtures/coder_193
The exact Frontier CLI for every probe is preserved in runs/rs1/blocker_request_194/probes/<name>/command.txt.
First-N Matrix
All default rows use the RS1 fixed config: prefix caching on, chunked prefill
on, long_prefill_token_threshold=64, batch cap 128, max batch tokens 32768.
| probe | rows | prefix cache | chunked prefill | threshold | exit | result |
|---|---|---|---|---|---|---|
| n190_default | 190 | on | on | 64 | 0 | pass |
| n191_default | 191 | on | on | 64 | 0 | pass |
| n192_default | 192 | on | on | 64 | 0 | pass |
| n193_default | 193 | on | on | 64 | 1 | Request 192 already scheduled |
| n194_default | 194 | on | on | 64 | 1 | Request 192 already scheduled |
| n195_default | 195 | on | on | 64 | 1 | Request 194 already scheduled |
| n200_default | 200 | on | on | 64 | 1 | Request 194 already scheduled |
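The first-N search can be sketched as a bisection over the row limit. This is a minimal illustration, not the script that produced the matrix: `smallest_failing_n` and the stand-in predicate are hypothetical names, and the search assumes that failure is monotonic in N (once a prefix of the trace fails, every longer prefix fails too), which holds for the probed values but has not been shown in general.

```python
from typing import Callable


def smallest_failing_n(fails: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that failure is
    monotonic in n. In a real run, fails(n) would generate the first-n
    fixture and invoke the probe script; here it is a stand-in.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid  # failure already present at mid: look lower
        else:
            lo = mid  # mid still passes: look higher
    return hi


def observed_failure(n: int) -> bool:
    """Stand-in predicate that mirrors the matrix: N <= 192 passes, N >= 193 fails."""
    return n >= 193


print(smallest_failing_n(observed_failure, lo=190, hi=200))
```

With the observed outcomes as the predicate, the search lands on N=193, matching the smallest failure in the matrix.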
Diagnostic Variants
These are diagnosis only. They are not replacements for the RS1 fixed config.
| probe | rows | prefix cache | chunked prefill | threshold | exit | result |
|---|---|---|---|---|---|---|
| n193_prefix_off | 193 | off | on | 64 | 0 | pass |
| n193_chunked_off | 193 | on | off | 64 | 1 | Frontier config rejects this combination |
| n193_chunked_off_threshold_0 | 193 | on | off | 0 | 0 | pass |
| n193_threshold_32768 | 193 | on | on | 32768 | 0 | pass |
| n195_prefix_off | 195 | off | on | 64 | 0 | pass |
| n195_chunked_off | 195 | on | off | 64 | 1 | Frontier config rejects this combination |
| n195_chunked_off_threshold_0 | 195 | on | off | 0 | 0 | pass |
| n195_threshold_32768 | 195 | on | on | 32768 | 0 | pass |
| n200_prefix_off | 200 | off | on | 64 | 0 | pass |
| n200_chunked_off_threshold_0 | 200 | on | off | 0 | 0 | pass |
| n200_threshold_32768 | 200 | on | on | 32768 | 0 | pass |
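Frontier enforces the following constraint:
VllmV1SchedulerConfig.long_prefill_token_threshold > 0 requires enable_chunked_prefill=True
A valid chunked-off diagnostic must therefore also set LONG_PREFILL_TOKEN_THRESHOLD=0. The chunked_off probes that keep the threshold at 64 exit 1 because the configuration is rejected, not because the scheduler fails.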
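The constraint can be illustrated with a small stand-alone check. This is a sketch under stated assumptions: `SchedulerConfigSketch` is a hypothetical class written for this note, not Frontier's VllmV1SchedulerConfig, and the error text is illustrative rather than Frontier's actual message.

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfigSketch:
    """Hypothetical stand-in for the two fields involved in the constraint."""

    enable_chunked_prefill: bool = True
    long_prefill_token_threshold: int = 64

    def __post_init__(self) -> None:
        # A positive long-prefill threshold only makes sense when chunked
        # prefill is enabled, so the combination (off, > 0) is rejected.
        if self.long_prefill_token_threshold > 0 and not self.enable_chunked_prefill:
            raise ValueError(
                "long_prefill_token_threshold > 0 requires enable_chunked_prefill=True"
            )


# The three chunked-prefill/threshold combinations used by the probes.
for chunked, threshold in [(True, 64), (False, 64), (False, 0)]:
    try:
        SchedulerConfigSketch(chunked, threshold)
        print(f"chunked={chunked} threshold={threshold}: accepted")
    except ValueError as exc:
        print(f"chunked={chunked} threshold={threshold}: rejected ({exc})")
```

Only the middle combination is rejected, which corresponds to the chunked_off rows that exit 1 in the table.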
Local Trace Analysis
Generated files:
- runs/rs1/blocker_request_194/analysis/request_192_analysis.json
- runs/rs1/blocker_request_194/analysis/request_192_analysis.md
- runs/rs1/blocker_request_194/analysis/request_194_analysis.json
- runs/rs1/blocker_request_194/analysis/request_194_analysis.md
Request 192, the minimal first-N failure target:
- timestamp=43.406
- chat_id=192, parent_chat_id=-1, turn=1, type=coder
- input_length=13436, output_length=1425, total 14861
- hash_count=840
- partial final block: yes, final block token count 12
- top prior prefix overlap: 7 blocks, 112 tokens
- no parent candidate in the sidecar
Request 194, the original coder_2000 failing request:
- timestamp=43.931
- chat_id=194, parent_chat_id=-1, turn=1, type=coder
- input_length=2064, output_length=2278, total 4342
- hash_count=129
- partial final block: no, final block token count 16
- top prior prefix overlap: 1 block, 16 tokens
- no parent candidate in the sidecar
Interpretation:
- The failing requests are independent first turns, not child turns in a chat.
- Request 192 has a partial final block, but its observed prior prefix overlap is only the first 7 full blocks.
- Request 194 has no partial final block and only a 1-block prefix overlap.
- The failure is therefore not explained by a malformed partial final block, deep shared-prefix trace structure, or a parent/child chat mismatch.
- Fixture validation confirms monotonic timestamps, max-token compliance, sidecar hash lengths, and block token counts.
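The block figures in the two request summaries can be cross-checked by hand. The sketch below assumes that hash_count is the number of 16-token blocks needed to cover input_length (rounding up) and that the final block holds the remainder; that relationship is inferred from the reported numbers rather than taken from the converter source, and `block_layout` is a hypothetical helper.

```python
def block_layout(input_length: int, block_size: int = 16) -> tuple[int, int]:
    """Return (hash_count, final_block_token_count) for a prompt.

    Assumes one hash per block and a final block that holds whatever
    tokens remain after the full blocks.
    """
    if input_length <= 0 or block_size <= 0:
        raise ValueError("input_length and block_size must be positive")
    full_blocks, remainder = divmod(input_length, block_size)
    if remainder == 0:
        return full_blocks, block_size  # last block is completely filled
    return full_blocks + 1, remainder  # last block is partial


print(block_layout(13436))  # request 192
print(block_layout(2064))   # request 194
print(7 * 16, 1 * 16)       # prefix overlap in tokens for 7 blocks and 1 block
```

Under these assumptions request 192 yields 840 blocks with a 12-token final block and request 194 yields 129 full blocks, in agreement with the analysis files.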
Frontier Source Localization
Relevant Frontier files:
- /tmp/toc-llm-sim-research/Frontier/frontier/scheduler/replica_scheduler/vllm_v1_engine_replica_scheduler.py
- /tmp/toc-llm-sim-research/Frontier/frontier/entities/request.py
- /tmp/toc-llm-sim-research/Frontier/frontier/config/config.py
Key path:
- VllmV1EngineReplicaScheduler._prepare_prefix_cache_admission at vllm_v1_engine_replica_scheduler.py:1178 calls kv_cache_manager.get_computed_blocks(request) and returns prefix_cached_tokens.
- _schedule_waiting_requests at vllm_v1_engine_replica_scheduler.py:3075 runs prefix-cache admission for any waiting request with prefix caching enabled and incomplete prefill.
- The same waiting path allocates KV and then calls request.on_cache_hit(prefix_cached_tokens) at vllm_v1_engine_replica_scheduler.py:3179.
- Request.on_cache_hit at request.py:503 raises if _scheduled is already true.
- Request.on_batch_schedule at request.py:1058 sets _scheduled=True.
- Chunked-prefill continuations run through _schedule_running_requests around vllm_v1_engine_replica_scheduler.py:2696, with long-prefill capping applied around :2826.
- Valid chunked-off CLI requires long_prefill_token_threshold=0; otherwise config.py:714 rejects the configuration.
The evidence points to a Frontier scheduler state issue: with prefix caching
enabled and chunked prefill active, a request that has already been scheduled
can later reach waiting-admission prefix-cache handling and receive
on_cache_hit again. That violates Request.on_cache_hit's current invariant.
This is more consistent with a repeated cache-hit application or scheduled request re-admission path than with bad ReplayServe trace/hash data.
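The suspected sequence can be reproduced in a toy model. This is not Frontier code: `ToyRequest` and `admit_from_waiting` are hypothetical, only the two state transitions named in the key path are modeled, and the token accounting inside on_cache_hit is an assumption made so the example has something to update.

```python
class ToyRequest:
    """Hypothetical stand-in that models only the _scheduled invariant."""

    def __init__(self, request_id: int) -> None:
        self.id = request_id
        self._scheduled = False
        self.num_processed_tokens = 0

    def on_cache_hit(self, prefix_cached_tokens: int) -> None:
        # Mirrors the invariant described above: a cache hit may only be
        # applied before the request has ever been scheduled.
        if self._scheduled:
            raise ValueError(f"Request {self.id} already scheduled.")
        self.num_processed_tokens += prefix_cached_tokens  # assumed accounting

    def on_batch_schedule(self) -> None:
        self._scheduled = True


def admit_from_waiting(request: ToyRequest, prefix_cached_tokens: int) -> None:
    """Waiting-path admission: apply the cache hit, then schedule."""
    request.on_cache_hit(prefix_cached_tokens)
    request.on_batch_schedule()


request = ToyRequest(192)
admit_from_waiting(request, prefix_cached_tokens=112)  # first admission succeeds
try:
    # A partially prefilled request re-entering the waiting path.
    admit_from_waiting(request, prefix_cached_tokens=112)
except ValueError as exc:
    print(exc)
```

The second admission raises the same message seen in the probes, which is why instrumenting the re-admission path is the natural next step. Whether Frontier reaches this state through preemption or another route is not established by the toy model.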
Suggested Next Steps
- Add temporary Frontier instrumentation around _schedule_waiting_requests before request.on_cache_hit to log request.id, _scheduled, _preempted, is_prefill_complete, num_processed_tokens, prefix_cached_tokens, and whether the request came from _preempted_requests or _request_queue.
- Decide Frontier semantics for prefix-cache hits after a request has already been scheduled once. A likely fix is to apply on_cache_hit only for a first admission with _scheduled=False and num_processed_tokens=0, or to reset/request-restart state before re-admission if that is the intended vLLM parity behavior.
- Keep RS1 fixed config blocked for coder_2000 until Frontier behavior is patched or a documented upstream-compatible workaround is selected.
- Do not use the passing diagnosis variants as RS1 performance evidence: prefix-off, chunked-off, and threshold-32768 change the fixed config.
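The first two steps can be sketched together: log the state before the cache-hit call, and apply the hit only on a first admission. This is one possible shape under stated assumptions, not a proposed Frontier patch. `ToyRequest`, `guarded_cache_hit`, and the `source` argument are hypothetical; the field names follow the list above, but their real types and semantics in Frontier are not verified here.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("frontier_blocker_sketch")


@dataclass
class ToyRequest:
    """Hypothetical stand-in carrying the fields named in the list above."""

    id: int
    _scheduled: bool = False
    _preempted: bool = False
    is_prefill_complete: bool = False
    num_processed_tokens: int = 0
    cache_hits_applied: int = 0  # bookkeeping for this sketch only

    def on_cache_hit(self, prefix_cached_tokens: int) -> None:
        if self._scheduled:
            raise ValueError(f"Request {self.id} already scheduled.")
        self.num_processed_tokens += prefix_cached_tokens  # assumed accounting
        self.cache_hits_applied += 1


def guarded_cache_hit(request: ToyRequest, prefix_cached_tokens: int, source: str) -> bool:
    """Log admission state, then apply the cache hit only on first admission.

    `source` names the queue the request came from, for example
    "_request_queue" or "_preempted_requests". Returns True if the hit
    was applied.
    """
    logger.debug(
        "id=%s scheduled=%s preempted=%s prefill_complete=%s processed=%s cached=%s source=%s",
        request.id, request._scheduled, request._preempted,
        request.is_prefill_complete, request.num_processed_tokens,
        prefix_cached_tokens, source,
    )
    first_admission = not request._scheduled and request.num_processed_tokens == 0
    if first_admission:
        request.on_cache_hit(prefix_cached_tokens)
    return first_admission
```

Skipping the hit on re-admission avoids the exception but may not match vLLM behavior; the alternative named above, resetting request state before re-admission, would need a different guard.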