B2 interference driver: request return_token_ids + text fallback
The first B2 run produced metrics with ttft_s=null/tpot_s=null for every
decode request because the OpenAI-style payload did not set
return_token_ids: true, and the parser only inspected
choices[0].token_ids. With token_ids missing the loop skipped every
chunk, so no per-token timestamps were captured and the aggregator
returned interference_index=null on all 10 cells.

Fix:
- send return_token_ids: true in the payload (matches replayer.replay)
- also accept text-delta chunks as token signals (fallback for servers
  that drop token_ids despite the flag)

vLLM engine_state was fine; only the load-gen metric capture was broken.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
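To illustrate why skipped chunks surface as null metrics, here is a minimal sketch of a per-request summary. `summarize_request` is a hypothetical helper, not the driver's actual aggregator, and the TPOT definition used here (mean gap between consecutive token timestamps) is an assumption; only the field names ttft_s and tpot_s come from the message above.

```python
from typing import Optional


def summarize_request(
    t_dispatch: float, token_times: list[float]
) -> dict[str, Optional[float]]:
    """Hypothetical summary of one decode request (assumed metric definitions)."""
    if not token_times:
        # No chunk was ever counted as a token, so neither metric can be computed.
        return {"ttft_s": None, "tpot_s": None}

    # Time to first token: first recorded timestamp minus dispatch time.
    ttft = token_times[0] - t_dispatch

    if len(token_times) < 2:
        # A single token gives no inter-token gap to average.
        return {"ttft_s": ttft, "tpot_s": None}

    # Assumed TPOT: mean gap between consecutive token timestamps.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return {"ttft_s": ttft, "tpot_s": tpot}
```

With an empty `token_times`, which is what the first B2 run recorded for every request, both fields come back as `None`, so any downstream ratio built on them has nothing to work with.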
@@ -51,6 +51,7 @@ async def _send(
         "max_tokens": max_tokens,
         "min_tokens": max_tokens,
         "temperature": 0,
+        "return_token_ids": True,
         "stream": True,
         "stream_options": {"include_usage": True},
     }
@@ -82,10 +83,15 @@ async def _send(
                 if choices:
                     now = time.time()
                     token_ids = choices[0].get("token_ids")
+                    delta = choices[0].get("text", "")
                     if isinstance(token_ids, list) and token_ids:
                         if ttft is None:
                             ttft = now - t_dispatch
                         token_times.extend([now] * len(token_ids))
+                    elif delta:
+                        if ttft is None:
+                            ttft = now - t_dispatch
+                        token_times.append(now)
                 usage = chunk.get("usage")
                 if usage:
                     n_output = usage.get("completion_tokens", n_output)
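The fallback logic in the second hunk can be exercised in isolation. The sketch below pulls it into a standalone function; `record_chunk` and the sample chunks are hypothetical, written only to mirror the branch order in the diff (token_ids first, then a non-empty text delta), and are not part of the driver.

```python
def record_chunk(chunk: dict, now: float, token_times: list[float]) -> int:
    """Append timestamps for the tokens carried by one streamed chunk.

    Hypothetical helper: returns how many timestamps were added.
    """
    choices = chunk.get("choices") or []
    if not choices:
        # Usage-only chunks (e.g. the final include_usage chunk) carry no tokens.
        return 0

    token_ids = choices[0].get("token_ids")
    delta = choices[0].get("text", "")

    if isinstance(token_ids, list) and token_ids:
        # Preferred signal: one timestamp per returned token id.
        token_times.extend([now] * len(token_ids))
        return len(token_ids)
    if delta:
        # Fallback: treat a non-empty text delta as a single token.
        token_times.append(now)
        return 1
    return 0


# Example chunks (made up for illustration).
times: list[float] = []
record_chunk({"choices": [{"text": "Hi", "token_ids": [17, 42]}]}, 1.0, times)
record_chunk({"choices": [{"text": " there"}]}, 1.1, times)
record_chunk({"choices": [], "usage": {"completion_tokens": 3}}, 1.2, times)
```

Note that the text fallback adds exactly one timestamp per chunk, so it may undercount if a server packs several tokens into one delta. In the diff, n_output is still taken from usage.completion_tokens when the server reports it.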