agentic-kvc

gahow/agentic-kvc

Commit Graph

Author

SHA1

Message

Date

Gahow Wang

b9f324f2e6

B2 interference driver: request return_token_ids + text fallback

The first B2 run produced metrics with ttft_s=null/tpot_s=null for
every decode request because the OpenAI-style payload did not set
return_token_ids: true, and the parser only inspected
choices[0].token_ids. With token_ids missing, the loop skipped every
chunk, so no per-token timestamps were captured and the aggregator
returned interference_index=null on all 10 cells.

Fix:
- send return_token_ids: true in the payload (matches replayer.replay)
- also accept text-delta chunks as token signals (fallback for
  servers that drop token_ids despite the flag)

vLLM engine_state was fine; only the load-gen metric capture was
broken.
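
A minimal sketch of the chunk handling described above, not the actual
driver code. The helper names, the SSE "data:" framing, and the "text" /
"delta.content" fallback fields are assumptions for illustration; it also
records one timestamp per token-bearing chunk, which only approximates
per-token timing when a chunk carries several tokens.

```python
import json
import time
from typing import Callable, Iterable, List, Optional, Tuple


def chunk_has_token(chunk: dict) -> bool:
    """Return True if a streamed chunk carries generated output."""
    choices = chunk.get("choices") or []
    if not choices:
        return False
    choice = choices[0]
    # Preferred signal: explicit token ids (needs return_token_ids: true).
    if choice.get("token_ids"):
        return True
    # Fallback: a non-empty text delta. Field names here are assumptions.
    text = choice.get("text") or (choice.get("delta") or {}).get("content")
    return bool(text)


def token_timestamps(
    sse_lines: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> List[float]:
    """Timestamp every streamed chunk that carries output."""
    stamps: List[float] = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        if chunk_has_token(json.loads(data)):
            stamps.append(clock())
    return stamps


def ttft_and_tpot(
    t_send: float, stamps: List[float]
) -> Tuple[Optional[float], Optional[float]]:
    """TTFT is first stamp minus send time; TPOT is the mean gap after it."""
    if not stamps:
        return None, None  # the failure mode seen in the first B2 run
    ttft = stamps[0] - t_send
    tpot = None
    if len(stamps) > 1:
        tpot = (stamps[-1] - stamps[0]) / (len(stamps) - 1)
    return ttft, tpot
```

With this shape, a server that drops token_ids still yields timestamps as
long as it streams non-empty text, so ttft_s and tpot_s are null only when
no output arrives at all.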

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 22:39:54 +08:00

Gahow Wang

e23128ad65

B2: PD-colo interference microbench harness + sweep aggregator

scripts/b2_interference.py is the controlled microbench. It runs two
coroutines against the open proxy bypass (direct vLLM endpoints):

- decode_load: continuous short-prompt requests at fixed QPS into a
  designated decode instance, to keep it decode-saturated.
- prefill_injections: N large one-token requests at fixed interval,
  pointed at either the same instance (same-worker variant) or a
  paired one (different-worker control).
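
The two coroutines above could be structured roughly as follows. This is a
sketch under stated assumptions, not the contents of
scripts/b2_interference.py: the send callable, the run_cell wrapper, the
payload fields, and the crude "x " * prompt_tokens stand-in for a prompt of
a given size are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical transport: (endpoint, payload) -> awaits the full response.
SendFn = Callable[[str, dict], Awaitable[None]]


async def decode_load(send: SendFn, endpoint: str, qps: float,
                      duration_s: float) -> int:
    """Issue short-prompt requests on a fixed schedule; return the count."""
    loop = asyncio.get_running_loop()
    interval = 1.0 / qps
    t0 = loop.time()
    tasks: List[asyncio.Task] = []
    sent = 0
    while loop.time() - t0 < duration_s:
        payload = {"prompt": "hi", "max_tokens": 64, "stream": True}
        # Fire and continue, so slow responses do not lower the offered QPS.
        tasks.append(asyncio.create_task(send(endpoint, payload)))
        sent += 1
        # Sleep to the next slot of the schedule rather than a flat interval.
        await asyncio.sleep(max(0.0, t0 + sent * interval - loop.time()))
    await asyncio.gather(*tasks)
    return sent


async def prefill_injections(send: SendFn, endpoint: str, n: int,
                             interval_s: float, prompt_tokens: int) -> None:
    """Send n large prompts that each generate a single token."""
    for i in range(n):
        payload = {"prompt": "x " * prompt_tokens, "max_tokens": 1}
        await send(endpoint, payload)
        if i < n - 1:
            await asyncio.sleep(interval_s)


async def run_cell(send: SendFn, decode_ep: str, prefill_ep: str, *,
                   qps: float, duration_s: float, n_prefill: int,
                   interval_s: float, prompt_tokens: int) -> None:
    """same-worker: prefill_ep == decode_ep; control: a paired endpoint."""
    await asyncio.gather(
        decode_load(send, decode_ep, qps, duration_s),
        prefill_injections(send, prefill_ep, n_prefill, interval_s,
                           prompt_tokens),
    )
```

The only difference between the same-worker variant and the
different-worker control in this sketch is which endpoint is passed as
prefill_ep.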

Each cell (variant × prefill_size) gets its own metrics.jsonl plus a
run_window.json containing t_start_unix/t_end_unix. The shared
engine_*.jsonl from the scheduler patch is sliced by that window in
the aggregator.
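
As an illustration of the window-based slicing, a minimal sketch follows.
The t_start_unix/t_end_unix keys come from the description above; the
helper names and the ts_unix timestamp field in the engine log are
assumptions, since the log schema is not shown here.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_run_window(cell_dir: Path, t_start_unix: float,
                     t_end_unix: float) -> Path:
    """Record the wall-clock window of one cell next to its metrics."""
    cell_dir.mkdir(parents=True, exist_ok=True)
    path = cell_dir / "run_window.json"
    path.write_text(json.dumps(
        {"t_start_unix": t_start_unix, "t_end_unix": t_end_unix}))
    return path


def slice_engine_log(log_path: Path, window: Dict[str, float],
                     ts_key: str = "ts_unix") -> List[dict]:
    """Keep the JSONL records whose timestamp lies inside the window.

    ts_key is a hypothetical field name; adjust it to the real log schema.
    """
    rows: List[dict] = []
    with log_path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            ts = rec.get(ts_key)
            if ts is None:
                continue
            if window["t_start_unix"] <= ts <= window["t_end_unix"]:
                rows.append(rec)
    return rows
```

Because every cell stores its own window, one long-running engine log can
serve the whole sweep without restarting the server between cells.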

analysis/characterization/b2_sweep_analysis.py walks the cell tree,
slices the per-worker step log by each cell's window, runs the A5
interference_index() against the slice, and emits a single
b2_sweep_summary.json with one row per cell. This is what feeds the
"interference vs uncached prefill size" figure.
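
The walk-and-summarize step might look like the sketch below. It is not
the actual b2_sweep_analysis.py: summarize_sweep is a hypothetical name,
the directory layout is assumed, and the metric is passed in as a callable
so that the real A5 interference_index() is not misrepresented here.

```python
import json
from pathlib import Path
from typing import Callable, List, Optional

# Hypothetical metric hook: (cell_dir, window) -> index, or None if no data.
MetricFn = Callable[[Path, dict], Optional[float]]


def summarize_sweep(root: Path, metric: MetricFn) -> List[dict]:
    """Emit one summary row per cell directory found under root."""
    root = Path(root)
    rows: List[dict] = []
    # A cell is assumed to be any directory containing run_window.json.
    for window_path in sorted(root.rglob("run_window.json")):
        cell_dir = window_path.parent
        window = json.loads(window_path.read_text())
        rows.append({
            "cell": cell_dir.relative_to(root).as_posix(),
            "t_start_unix": window["t_start_unix"],
            "t_end_unix": window["t_end_unix"],
            "interference_index": metric(cell_dir, window),
        })
    (root / "b2_sweep_summary.json").write_text(json.dumps(rows, indent=2))
    return rows
```

A row whose metric returns None is still written, which keeps empty cells
visible in the summary instead of silently dropping them.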

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 17:54:51 +08:00

2 Commits