RS1 Frontier Blocker: Prefix Cache + Chunked Prefill

This note narrows the RS1 coder_2000 failure into a small Frontier repro. It does not change the RS1 fixed config or make performance claims.

Status

  • Frontier repo: /tmp/toc-llm-sim-research/Frontier
  • Frontier HEAD: d9cfeb6d8791fbf2f295dd9744c56a666171776e
  • ReplayServe canonical fixtures were not changed.
  • Frontier source was not modified.
  • Diagnostic artifacts live under runs/rs1/blocker_request_194/.

The original coder_2000 run failed with:

ValueError: Request 194 already scheduled.

First-N probing shows that the blocker is not a single malformed row 194. The smallest observed failing slice is N=193, which fails with:

ValueError: Request 192 already scheduled.

N=192 passes under the same fixed config.

Repro Commands

Generate diagnostic slices:

cd /home/gahow/phd/replayserve
for n in 190 191 192 193 194 195 200; do
  out="runs/rs1/blocker_request_194/fixtures/coder_${n}"
  mkdir -p "$out"
  python3 tools/qwen_to_frontier.py \
    --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
    --frontier-csv "$out/frontier.csv" \
    --sidecar-jsonl "$out/sidecar.jsonl" \
    --source-jsonl "$out/source.jsonl" \
    --manifest-json "$out/manifest.json" \
    --fixture-name "blocker_coder_${n}" \
    --limit "$n" \
    --max-tokens 32768 \
    --block-size 16 \
    --fail-on-overflow
done

Minimal failing command:

cd /home/gahow/phd/replayserve
scripts/run_frontier_blocker_probe.sh \
  n193_default \
  runs/rs1/blocker_request_194/fixtures/coder_193

The exact Frontier CLI for every probe is preserved in each runs/rs1/blocker_request_194/probes/<name>/command.txt.

First-N Matrix

All default rows use the RS1 fixed config: prefix caching on, chunked prefill on, long_prefill_token_threshold=64, batch cap 128, max batch tokens 32768.

| probe | rows | prefix cache | chunked prefill | threshold | exit | result |
|---|---|---|---|---|---|---|
| n190_default | 190 | on | on | 64 | 0 | pass |
| n191_default | 191 | on | on | 64 | 0 | pass |
| n192_default | 192 | on | on | 64 | 0 | pass |
| n193_default | 193 | on | on | 64 | 1 | Request 192 already scheduled |
| n194_default | 194 | on | on | 64 | 1 | Request 192 already scheduled |
| n195_default | 195 | on | on | 64 | 1 | Request 194 already scheduled |
| n200_default | 200 | on | on | 64 | 1 | Request 194 already scheduled |
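
The boundary in the matrix can be read off mechanically. The following is a minimal sketch, not part of the ReplayServe tooling: the dictionary is copied by hand from the exit column above, and the helper names are hypothetical.

```python
# Exit codes copied from the First-N matrix (0 = pass, nonzero = fail).
FIRST_N_EXIT = {190: 0, 191: 0, 192: 0, 193: 1, 194: 1, 195: 1, 200: 1}


def smallest_failing_n(exit_codes):
    """Return the smallest probed N with a nonzero exit, or None if all pass."""
    failing = sorted(n for n, code in exit_codes.items() if code != 0)
    return failing[0] if failing else None


def largest_passing_n(exit_codes):
    """Return the largest probed N with exit 0, or None if none pass."""
    passing = sorted(n for n, code in exit_codes.items() if code == 0)
    return passing[-1] if passing else None


if __name__ == "__main__":
    print("smallest failing N:", smallest_failing_n(FIRST_N_EXIT))
    print("largest passing N:", largest_passing_n(FIRST_N_EXIT))
```

Because the largest passing N and the smallest failing N are adjacent (192 and 193), the probed range leaves no gap to bisect further.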

Diagnostic Variants

These are diagnosis only. They are not replacements for the RS1 fixed config.

| probe | rows | prefix cache | chunked prefill | threshold | exit | result |
|---|---|---|---|---|---|---|
| n193_prefix_off | 193 | off | on | 64 | 0 | pass |
| n193_chunked_off | 193 | on | off | 64 | 1 | Frontier config rejects this combination |
| n193_chunked_off_threshold_0 | 193 | on | off | 0 | 0 | pass |
| n193_threshold_32768 | 193 | on | on | 32768 | 0 | pass |
| n195_prefix_off | 195 | off | on | 64 | 0 | pass |
| n195_chunked_off | 195 | on | off | 64 | 1 | Frontier config rejects this combination |
| n195_chunked_off_threshold_0 | 195 | on | off | 0 | 0 | pass |
| n195_threshold_32768 | 195 | on | on | 32768 | 0 | pass |
| n200_prefix_off | 200 | off | on | 64 | 0 | pass |
| n200_chunked_off_threshold_0 | 200 | on | off | 0 | 0 | pass |
| n200_threshold_32768 | 200 | on | on | 32768 | 0 | pass |

Frontier enforces:

VllmV1SchedulerConfig.long_prefill_token_threshold > 0 requires enable_chunked_prefill=True

So a valid chunked-off diagnostic also sets LONG_PREFILL_TOKEN_THRESHOLD=0.
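
The rule can be restated as a small standalone check. This is an illustrative sketch only: the class `SchedulerConfigSketch` and its `validate` method are hypothetical stand-ins, not Frontier's actual VllmV1SchedulerConfig code, and the error text is paraphrased from the rule above.

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfigSketch:
    """Hypothetical stand-in holding only the two fields the rule mentions."""

    enable_chunked_prefill: bool
    long_prefill_token_threshold: int

    def validate(self):
        # A positive long-prefill threshold only makes sense with chunked prefill on.
        if self.long_prefill_token_threshold > 0 and not self.enable_chunked_prefill:
            raise ValueError(
                "long_prefill_token_threshold > 0 requires enable_chunked_prefill=True"
            )


def is_valid(enable_chunked_prefill, long_prefill_token_threshold):
    """Return True if the combination passes the sketched rule."""
    try:
        SchedulerConfigSketch(enable_chunked_prefill, long_prefill_token_threshold).validate()
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid(False, 64))  # chunked off, threshold 64
    print(is_valid(False, 0))   # chunked off, threshold 0
    print(is_valid(True, 64))   # RS1 fixed config combination
```

Under this sketch the chunked-off probes with threshold 64 are rejected before any scheduling happens, which is why their exit code of 1 says nothing about the scheduler failure itself.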

Local Trace Analysis

Generated files:

  • runs/rs1/blocker_request_194/analysis/request_192_analysis.json
  • runs/rs1/blocker_request_194/analysis/request_192_analysis.md
  • runs/rs1/blocker_request_194/analysis/request_194_analysis.json
  • runs/rs1/blocker_request_194/analysis/request_194_analysis.md

Request 192, the minimal first-N failure target:

  • timestamp=43.406
  • chat_id=192, parent_chat_id=-1, turn=1, type=coder
  • input_length=13436, output_length=1425, total 14861
  • hash_count=840
  • partial final block: yes, final block token count 12
  • top prior prefix overlap: 7 blocks, 112 tokens
  • no parent candidate in the sidecar

Request 194, the original coder_2000 failing request:

  • timestamp=43.931
  • chat_id=194, parent_chat_id=-1, turn=1, type=coder
  • input_length=2064, output_length=2278, total 4342
  • hash_count=129
  • partial final block: no, final block token count 16
  • top prior prefix overlap: 1 block, 16 tokens
  • no parent candidate in the sidecar

Interpretation:

  • The failing requests are independent first turns, not child turns in a chat.
  • Request 192 has a partial final block, but its observed prior prefix overlap is only the first 7 full blocks.
  • Request 194 has no partial final block and only a 1-block prefix overlap.
  • The failure is therefore not explained by a malformed partial final block, deep shared-prefix trace structure, or a parent/child chat mismatch.
  • Fixture validation confirms monotonic timestamps, max-token compliance, sidecar hash lengths, and block token counts.
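
The block figures above can be cross-checked with simple arithmetic. The sketch below assumes one hash per 16-token block of input, with the last block allowed to be partial; that assumption matches the reported numbers but is not taken from the fixture tooling, and the helper name `block_layout` is hypothetical.

```python
BLOCK_SIZE = 16  # matches --block-size 16 in the fixture generation command


def block_layout(input_length, block_size=BLOCK_SIZE):
    """Return (hash_count, final_block_tokens) for a prompt of input_length tokens."""
    hash_count = (input_length + block_size - 1) // block_size  # ceiling division
    final_block_tokens = input_length - (hash_count - 1) * block_size
    return hash_count, final_block_tokens


if __name__ == "__main__":
    # Request 192: 13436 input tokens.
    print(block_layout(13436))  # (840, 12): partial final block
    # Request 194: 2064 input tokens.
    print(block_layout(2064))   # (129, 16): full final block
    # Prefix overlap expressed in tokens.
    print(7 * BLOCK_SIZE, 1 * BLOCK_SIZE)  # 112 16
```

Both requests reproduce the reported hash_count and final block token count, which supports the conclusion that the sidecar block structure is internally consistent.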

Frontier Source Localization

Relevant Frontier files:

  • /tmp/toc-llm-sim-research/Frontier/frontier/scheduler/replica_scheduler/vllm_v1_engine_replica_scheduler.py
  • /tmp/toc-llm-sim-research/Frontier/frontier/entities/request.py
  • /tmp/toc-llm-sim-research/Frontier/frontier/config/config.py

Key path:

  • VllmV1EngineReplicaScheduler._prepare_prefix_cache_admission at vllm_v1_engine_replica_scheduler.py:1178 calls kv_cache_manager.get_computed_blocks(request) and returns prefix_cached_tokens.
  • _schedule_waiting_requests at vllm_v1_engine_replica_scheduler.py:3075 runs prefix-cache admission for any waiting request with prefix caching enabled and incomplete prefill.
  • The same waiting path allocates KV and then calls request.on_cache_hit(prefix_cached_tokens) at vllm_v1_engine_replica_scheduler.py:3179.
  • Request.on_cache_hit at request.py:503 raises if _scheduled is already true.
  • Request.on_batch_schedule at request.py:1058 sets _scheduled=True.
  • Chunked-prefill continuations run through _schedule_running_requests around vllm_v1_engine_replica_scheduler.py:2696, with long-prefill capping applied around :2826.
  • Valid chunked-off CLI requires long_prefill_token_threshold=0; otherwise config.py:714 rejects the configuration.

The evidence points to a Frontier scheduler state issue: with prefix caching enabled and chunked prefill active, a request that has already been scheduled can later reach waiting-admission prefix-cache handling and receive on_cache_hit again. That violates Request.on_cache_hit's current invariant.

This is more consistent with a repeated cache-hit application or scheduled request re-admission path than with bad ReplayServe trace/hash data.
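
The suspected sequence can be reproduced in a toy model. This is a deliberately simplified sketch: `RequestSketch` is a hypothetical class that keeps only the `_scheduled` flag and the two methods named above, and it does not model Frontier's scheduler, KV allocation, or preemption.

```python
class RequestSketch:
    """Toy request that mirrors only the invariant described above."""

    def __init__(self, request_id):
        self.id = request_id
        self._scheduled = False
        self.prefix_cached_tokens = 0

    def on_cache_hit(self, prefix_cached_tokens):
        # A cache hit may only be applied before the first batch schedule.
        if self._scheduled:
            raise ValueError(f"Request {self.id} already scheduled.")
        self.prefix_cached_tokens = prefix_cached_tokens

    def on_batch_schedule(self):
        self._scheduled = True


def replay_double_admission(request_id, prefix_cached_tokens):
    """Apply a cache hit, schedule, then apply a second cache hit; return the error text."""
    request = RequestSketch(request_id)
    request.on_cache_hit(prefix_cached_tokens)  # first waiting-path admission
    request.on_batch_schedule()                 # first prefill chunk is scheduled
    try:
        request.on_cache_hit(prefix_cached_tokens)  # second waiting-path admission
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(replay_double_admission(192, 112))
```

The toy model raises the same message as the n193_default probe as soon as a scheduled request passes through cache-hit handling a second time. It does not show how Frontier reaches that state, only that the reported error is what the invariant produces once it does.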

Suggested Next Steps

  1. Add temporary Frontier instrumentation around _schedule_waiting_requests before request.on_cache_hit to log request.id, _scheduled, _preempted, is_prefill_complete, num_processed_tokens, prefix_cached_tokens, and whether the request came from _preempted_requests or _request_queue.
  2. Decide Frontier semantics for prefix-cache hits after a request has already been scheduled once. A likely fix is to apply on_cache_hit only for a first admission with _scheduled=False and num_processed_tokens=0, or to reset/request-restart state before re-admission if that is the intended vLLM parity behavior.
  3. Keep RS1 fixed config blocked for coder_2000 until Frontier behavior is patched or a documented upstream-compatible workaround is selected.
  4. Do not use the passing diagnosis variants as RS1 performance evidence: prefix-off, chunked-off, and threshold-32768 change the fixed config.
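
Steps 1 and 2 could be prototyped along the following lines. This is a sketch under stated assumptions: both helper names are hypothetical, the guard implements only the first option from step 2 (first admission with `_scheduled=False` and `num_processed_tokens=0`), and whether that matches vLLM parity is still an open question.

```python
def admission_log_record(request_id, scheduled, preempted, is_prefill_complete,
                         num_processed_tokens, prefix_cached_tokens, source_queue):
    """Build one log record with the fields listed in step 1."""
    return {
        "request_id": request_id,
        "scheduled": scheduled,
        "preempted": preempted,
        "is_prefill_complete": is_prefill_complete,
        "num_processed_tokens": num_processed_tokens,
        "prefix_cached_tokens": prefix_cached_tokens,
        "source_queue": source_queue,  # "_preempted_requests" or "_request_queue"
    }


def should_apply_cache_hit(scheduled, num_processed_tokens):
    """Candidate guard: apply a cache hit only on a true first admission."""
    return (not scheduled) and num_processed_tokens == 0


if __name__ == "__main__":
    record = admission_log_record(
        request_id=192, scheduled=True, preempted=False, is_prefill_complete=False,
        num_processed_tokens=64, prefix_cached_tokens=112, source_queue="_request_queue",
    )
    print(record)
    print(should_apply_cache_hit(record["scheduled"], record["num_processed_tokens"]))
```

The field values in the example record are placeholders chosen for illustration, not observed Frontier state; the real values are what the instrumentation in step 1 is meant to capture.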