Add ReplayServe Frontier vLLM alignment report
docs/rs1_frontier_blocker.md
# RS1 Frontier Blocker: Prefix Cache + Chunked Prefill

This note narrows the RS1 `coder_2000` failure into a small Frontier repro.
It does not change the RS1 fixed config or make performance claims.

## Status

- Frontier repo: `/tmp/toc-llm-sim-research/Frontier`
- Frontier HEAD: `d9cfeb6d8791fbf2f295dd9744c56a666171776e`
- ReplayServe canonical fixtures were not changed.
- Frontier source was not modified.
- Diagnostic artifacts live under `runs/rs1/blocker_request_194/`.

The original `coder_2000` run failed with:

```text
ValueError: Request 194 already scheduled.
```

First-N probing shows that the blocker is not a single malformed row 194.
The smallest observed first-N failure is `N=193`, which fails on a different
request:

```text
ValueError: Request 192 already scheduled.
```

`N=192` passes under the same fixed config.

## Repro Commands

Generate diagnostic slices:

```bash
cd /home/gahow/phd/replayserve
for n in 190 191 192 193 194 195 200; do
  out="runs/rs1/blocker_request_194/fixtures/coder_${n}"
  mkdir -p "$out"
  python3 tools/qwen_to_frontier.py \
    --input /home/gahow/phd/qwen-bailian-usagetraces-anon/qwen_coder_blksz_16.jsonl \
    --frontier-csv "$out/frontier.csv" \
    --sidecar-jsonl "$out/sidecar.jsonl" \
    --source-jsonl "$out/source.jsonl" \
    --manifest-json "$out/manifest.json" \
    --fixture-name "blocker_coder_${n}" \
    --limit "$n" \
    --max-tokens 32768 \
    --block-size 16 \
    --fail-on-overflow
done
```

Minimal failing command:

```bash
cd /home/gahow/phd/replayserve
scripts/run_frontier_blocker_probe.sh \
  n193_default \
  runs/rs1/blocker_request_194/fixtures/coder_193
```

The exact Frontier CLI for every probe is preserved in each
`runs/rs1/blocker_request_194/probes/<name>/command.txt`.

## First-N Matrix

All default rows use the RS1 fixed config: prefix caching on, chunked prefill
on, `long_prefill_token_threshold=64`, batch cap 128, max batch tokens 32768.

| probe | rows | prefix cache | chunked prefill | threshold | exit | result |
|---|---:|---|---|---:|---:|---|
| `n190_default` | 190 | on | on | 64 | 0 | pass |
| `n191_default` | 191 | on | on | 64 | 0 | pass |
| `n192_default` | 192 | on | on | 64 | 0 | pass |
| `n193_default` | 193 | on | on | 64 | 1 | `Request 192 already scheduled` |
| `n194_default` | 194 | on | on | 64 | 1 | `Request 192 already scheduled` |
| `n195_default` | 195 | on | on | 64 | 1 | `Request 194 already scheduled` |
| `n200_default` | 200 | on | on | 64 | 1 | `Request 194 already scheduled` |

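The scan over first-N slices can be sketched as below. This is only an illustration: `smallest_failing_n` and `observed_exit` are hypothetical names, and the lambda replays the exit codes recorded in the matrix instead of invoking `scripts/run_frontier_blocker_probe.sh`.

```python
from typing import Callable, Iterable, Optional


def smallest_failing_n(
    candidates: Iterable[int],
    probe_passes: Callable[[int], bool],
) -> Optional[int]:
    """Return the smallest candidate N whose probe fails, or None if all pass."""
    for n in sorted(candidates):
        if not probe_passes(n):
            return n
    return None


# Exit codes copied from the default rows of the matrix (0 = pass, 1 = fail).
observed_exit = {190: 0, 191: 0, 192: 0, 193: 1, 194: 1, 195: 1, 200: 1}

# Stand-in probe: looks up the recorded exit code rather than running Frontier.
print(smallest_failing_n(observed_exit, lambda n: observed_exit[n] == 0))  # 193
```

A linear scan is used rather than bisection because it does not assume that every N above the first failure also fails, even though that happens to hold for the observed rows.
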
## Diagnostic Variants

These are diagnosis only. They are not replacements for the RS1 fixed config.

| probe | rows | prefix cache | chunked prefill | threshold | exit | result |
|---|---:|---|---|---:|---:|---|
| `n193_prefix_off` | 193 | off | on | 64 | 0 | pass |
| `n193_chunked_off` | 193 | on | off | 64 | 1 | Frontier config rejects this combination |
| `n193_chunked_off_threshold_0` | 193 | on | off | 0 | 0 | pass |
| `n193_threshold_32768` | 193 | on | on | 32768 | 0 | pass |
| `n195_prefix_off` | 195 | off | on | 64 | 0 | pass |
| `n195_chunked_off` | 195 | on | off | 64 | 1 | Frontier config rejects this combination |
| `n195_chunked_off_threshold_0` | 195 | on | off | 0 | 0 | pass |
| `n195_threshold_32768` | 195 | on | on | 32768 | 0 | pass |
| `n200_prefix_off` | 200 | off | on | 64 | 0 | pass |
| `n200_chunked_off_threshold_0` | 200 | on | off | 0 | 0 | pass |
| `n200_threshold_32768` | 200 | on | on | 32768 | 0 | pass |

Frontier enforces:

```text
VllmV1SchedulerConfig.long_prefill_token_threshold > 0 requires enable_chunked_prefill=True
```

So a valid chunked-off diagnostic also sets
`LONG_PREFILL_TOKEN_THRESHOLD=0`.

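The constraint can be illustrated with a small stand-in config. `SchedulerConfigSketch` is a hypothetical class written for this note, not Frontier's `VllmV1SchedulerConfig`; only the rule it checks comes from the error message above.

```python
from dataclasses import dataclass


@dataclass
class SchedulerConfigSketch:
    """Hypothetical stand-in; field names follow the error message, not Frontier source."""

    enable_prefix_caching: bool = True
    enable_chunked_prefill: bool = True
    long_prefill_token_threshold: int = 64

    def __post_init__(self) -> None:
        # A positive long-prefill threshold only makes sense with chunked prefill on.
        if self.long_prefill_token_threshold > 0 and not self.enable_chunked_prefill:
            raise ValueError(
                "long_prefill_token_threshold > 0 requires enable_chunked_prefill=True"
            )


# Chunked-off diagnostic: valid only once the threshold is also set to 0.
valid = SchedulerConfigSketch(enable_chunked_prefill=False, long_prefill_token_threshold=0)
print(valid)
```

Constructing `SchedulerConfigSketch(enable_chunked_prefill=False)` with the default threshold of 64 raises `ValueError`, which mirrors the rejected `n193_chunked_off` and `n195_chunked_off` rows.
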
## Local Trace Analysis

Generated files:

- `runs/rs1/blocker_request_194/analysis/request_192_analysis.json`
- `runs/rs1/blocker_request_194/analysis/request_192_analysis.md`
- `runs/rs1/blocker_request_194/analysis/request_194_analysis.json`
- `runs/rs1/blocker_request_194/analysis/request_194_analysis.md`

Request 192, the minimal first-N failure target:

- `timestamp=43.406`
- `chat_id=192`, `parent_chat_id=-1`, `turn=1`, `type=coder`
- `input_length=13436`, `output_length=1425`, total `14861`
- `hash_count=840`
- partial final block: yes, final block token count `12`
- top prior prefix overlap: 7 blocks, 112 tokens
- no parent candidate in the sidecar

Request 194, the original `coder_2000` failing request:

- `timestamp=43.931`
- `chat_id=194`, `parent_chat_id=-1`, `turn=1`, `type=coder`
- `input_length=2064`, `output_length=2278`, total `4342`
- `hash_count=129`
- partial final block: no, final block token count `16`
- top prior prefix overlap: 1 block, 16 tokens
- no parent candidate in the sidecar

Interpretation:

- The failing requests are independent first turns, not child turns in a chat.
- Request 192 has a partial final block, but its observed prior prefix overlap
  is only the first 7 full blocks.
- Request 194 has no partial final block and only a 1-block prefix overlap.
- The failure is therefore not explained by a malformed partial final block,
  deep shared-prefix trace structure, or a parent/child chat mismatch.
- Fixture validation confirms monotonic timestamps, max-token compliance,
  sidecar hash lengths, and block token counts.

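The block counts reported for both requests can be cross-checked with a few lines of arithmetic. This sketch assumes one hash per 16-token block, with a partial final block still counted as one hash; `block_stats` is a hypothetical helper, not part of `tools/qwen_to_frontier.py`.

```python
import math


def block_stats(input_length: int, block_size: int = 16) -> tuple[int, int]:
    """Return (hash_count, final_block_tokens) for a prompt of input_length tokens."""
    if input_length <= 0 or block_size <= 0:
        raise ValueError("input_length and block_size must be positive")
    # Assumption: a partial final block still gets its own hash.
    hash_count = math.ceil(input_length / block_size)
    final_block_tokens = input_length - (hash_count - 1) * block_size
    return hash_count, final_block_tokens


print(block_stats(13436))  # request 192 -> (840, 12), partial final block
print(block_stats(2064))   # request 194 -> (129, 16), full final block
```

Both results agree with the `hash_count` and final block token counts listed above, which is consistent with the fixtures being well formed.
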
## Frontier Source Localization

Relevant Frontier files:

- `/tmp/toc-llm-sim-research/Frontier/frontier/scheduler/replica_scheduler/vllm_v1_engine_replica_scheduler.py`
- `/tmp/toc-llm-sim-research/Frontier/frontier/entities/request.py`
- `/tmp/toc-llm-sim-research/Frontier/frontier/config/config.py`

Key path:

- `VllmV1EngineReplicaScheduler._prepare_prefix_cache_admission`
  at `vllm_v1_engine_replica_scheduler.py:1178` calls
  `kv_cache_manager.get_computed_blocks(request)` and returns
  `prefix_cached_tokens`.
- `_schedule_waiting_requests` at
  `vllm_v1_engine_replica_scheduler.py:3075` runs prefix-cache admission for
  any waiting request with prefix caching enabled and incomplete prefill.
- The same waiting path allocates KV and then calls
  `request.on_cache_hit(prefix_cached_tokens)` at
  `vllm_v1_engine_replica_scheduler.py:3179`.
- `Request.on_cache_hit` at `request.py:503` raises if `_scheduled` is already
  true.
- `Request.on_batch_schedule` at `request.py:1058` sets `_scheduled=True`.
- Chunked-prefill continuations run through `_schedule_running_requests`
  around `vllm_v1_engine_replica_scheduler.py:2696`, with long-prefill
  capping applied around `:2826`.
- Valid chunked-off CLI requires `long_prefill_token_threshold=0`; otherwise
  `config.py:714` rejects the configuration.

The evidence points to a Frontier scheduler state issue: with prefix caching
enabled and chunked prefill active, a request that has already been scheduled
can later reach waiting-admission prefix-cache handling and receive
`on_cache_hit` again. That violates `Request.on_cache_hit`'s current invariant.

This is more consistent with a repeated cache-hit application, or with a
re-admission path for already-scheduled requests, than with bad ReplayServe
trace/hash data.

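The suspected failure mode can be reproduced in a toy model. `ToyRequest` and `admit_from_waiting` are hypothetical and are not Frontier source; they only mirror the two facts stated above (`on_cache_hit` raises when `_scheduled` is true, and `on_batch_schedule` sets it). The ordering of the two calls within one admission is an assumption.

```python
from typing import Optional


class ToyRequest:
    """Toy stand-in for a request; not Frontier's Request class."""

    def __init__(self, request_id: int) -> None:
        self.id = request_id
        self._scheduled = False
        self.prefix_cached_tokens = 0

    def on_cache_hit(self, prefix_cached_tokens: int) -> None:
        # Invariant described in the note: cache hits apply only before scheduling.
        if self._scheduled:
            raise ValueError(f"Request {self.id} already scheduled.")
        self.prefix_cached_tokens = prefix_cached_tokens

    def on_batch_schedule(self) -> None:
        self._scheduled = True


def admit_from_waiting(request: ToyRequest, prefix_cached_tokens: int) -> None:
    """Assumed admission order: apply the cache hit, then schedule."""
    request.on_cache_hit(prefix_cached_tokens)
    request.on_batch_schedule()


def replay_double_admission(request_id: int) -> Optional[str]:
    """Admit the same request twice and return the error message, if any."""
    req = ToyRequest(request_id)
    admit_from_waiting(req, 112)  # first admission succeeds
    try:
        admit_from_waiting(req, 112)  # re-admission trips the invariant
    except ValueError as exc:
        return str(exc)
    return None


print(replay_double_admission(192))  # Request 192 already scheduled.
```

The toy model shows only that a second pass through waiting admission is sufficient to produce the observed message; it does not establish which Frontier code path causes the re-admission.
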
## Suggested Next Steps

1. Add temporary Frontier instrumentation in `_schedule_waiting_requests`,
   just before `request.on_cache_hit`, to log `request.id`, `_scheduled`,
   `_preempted`, `is_prefill_complete`, `num_processed_tokens`,
   `prefix_cached_tokens`, and whether the request came from
   `_preempted_requests` or `_request_queue`.
2. Decide Frontier semantics for prefix-cache hits after a request has already
   been scheduled once. A likely fix is to apply `on_cache_hit` only for a
   first admission with `_scheduled=False` and `num_processed_tokens=0`, or to
   reset or restart the request's state before re-admission if that is the
   intended vLLM parity behavior.
3. Keep RS1 fixed config blocked for `coder_2000` until Frontier behavior is
   patched or a documented upstream-compatible workaround is selected.
4. Do not use the passing diagnosis variants as RS1 performance evidence:
   prefix-off, chunked-off, and threshold-32768 change the fixed config.

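Steps 1 and 2 could be prototyped along these lines. Both helpers are hypothetical sketches, not Frontier APIs: `admission_log_record` collects the fields named in step 1, and `should_apply_cache_hit` encodes only the first of the two candidate fixes in step 2.

```python
import json


def should_apply_cache_hit(scheduled: bool, num_processed_tokens: int) -> bool:
    """Candidate guard: apply a prefix-cache hit only on a true first admission."""
    return (not scheduled) and num_processed_tokens == 0


def admission_log_record(
    request_id: int,
    scheduled: bool,
    preempted: bool,
    is_prefill_complete: bool,
    num_processed_tokens: int,
    prefix_cached_tokens: int,
    source_queue: str,
) -> str:
    """Serialize the diagnostic fields as one JSON line for later grepping."""
    record = {
        "request_id": request_id,
        "scheduled": scheduled,
        "preempted": preempted,
        "is_prefill_complete": is_prefill_complete,
        "num_processed_tokens": num_processed_tokens,
        "prefix_cached_tokens": prefix_cached_tokens,
        "source_queue": source_queue,  # "_preempted_requests" or "_request_queue"
        "would_apply_cache_hit": should_apply_cache_hit(scheduled, num_processed_tokens),
    }
    return json.dumps(record, sort_keys=True)


# Illustrative values only; these are not measurements from a Frontier run.
print(admission_log_record(192, True, False, False, 64, 112, "_request_queue"))
```

Emitting one JSON line per admission keeps the instrumentation easy to filter by `request_id`, and the `would_apply_cache_hit` field shows what the guard would have decided without changing scheduler behavior.
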