B2 interference driver: request return_token_ids + text fallback

The first B2 run produced metrics with ttft_s=null/tpot_s=null for
every decode request. The OpenAI-style payload did not set
return_token_ids: true, and the parser only inspected
choices[0].token_ids. With token_ids missing, the loop skipped every
chunk, so no per-token timestamps were captured and the aggregator
returned interference_index=null on all 10 cells.

Fix:
- send return_token_ids: true in the payload (matches replayer.replay)
- also accept text-delta chunks as token signals (fallback for
  servers that drop token_ids despite the flag)
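
The two-branch rule above can be sketched as a small standalone helper. This is only an illustration, not the code in the driver: the name token_signal_count is hypothetical, and the choice dict mimics the shape the diff reads from choices[0]. Note that the text fallback records one signal per chunk, so it undercounts if a server packs several tokens into one text delta.

```python
from typing import Any, Dict


def token_signal_count(choice: Dict[str, Any]) -> int:
    """Return how many token timestamps one streamed choice should add.

    Hypothetical helper mirroring the branch order in the diff:
    prefer token_ids, fall back to a non-empty text delta.
    """
    token_ids = choice.get("token_ids")
    if isinstance(token_ids, list) and token_ids:
        # Exact count: one timestamp per returned token id.
        return len(token_ids)
    if choice.get("text", ""):
        # Fallback: treat the whole text delta as a single token signal.
        return 1
    # Neither field carries output (e.g. a usage-only chunk).
    return 0


if __name__ == "__main__":
    print(token_signal_count({"token_ids": [11, 12, 13], "text": "abc"}))
    print(token_signal_count({"text": "hi"}))
    print(token_signal_count({"token_ids": None, "text": ""}))
```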

vLLM engine_state was fine; only the load-gen metric capture was
broken.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 22:39:54 +08:00
parent df3249925b
commit b9f324f2e6

@@ -51,6 +51,7 @@ async def _send(
     "max_tokens": max_tokens,
     "min_tokens": max_tokens,
     "temperature": 0,
+    "return_token_ids": True,
     "stream": True,
     "stream_options": {"include_usage": True},
 }
@@ -82,10 +83,15 @@ async def _send(
 if choices:
     now = time.time()
     token_ids = choices[0].get("token_ids")
+    delta = choices[0].get("text", "")
     if isinstance(token_ids, list) and token_ids:
         if ttft is None:
             ttft = now - t_dispatch
         token_times.extend([now] * len(token_ids))
+    elif delta:
+        if ttft is None:
+            ttft = now - t_dispatch
+        token_times.append(now)
 usage = chunk.get("usage")
 if usage:
     n_output = usage.get("completion_tokens", n_output)
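
For context on why an empty token_times list surfaces as null metrics, here is a minimal sketch of how per-request TTFT and TPOT might be derived from the timestamps collected in the hunk above. The aggregator is not part of this diff, so the function name summarize_request and the TPOT definition (mean gap between the first and last token) are assumptions, not the project's actual code.

```python
from typing import List, Optional, Tuple


def summarize_request(
    t_dispatch: float, token_times: List[float]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (ttft_s, tpot_s) for one request, or None where undefined.

    Hypothetical summary step; assumes token_times is in arrival order.
    """
    if not token_times:
        # No token signal was captured: both metrics stay null.
        return None, None
    ttft = token_times[0] - t_dispatch
    if len(token_times) < 2:
        # A single token gives a TTFT but no inter-token gap.
        return ttft, None
    # Assumed TPOT: decode span divided by the number of gaps.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


if __name__ == "__main__":
    print(summarize_request(10.0, []))
    print(summarize_request(10.0, [10.5, 11.0, 11.5]))
```

Under this sketch, a request whose chunks were all skipped yields (None, None), which is consistent with the null ttft_s/tpot_s values described in the commit message.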