B2 interference driver: request return_token_ids + text fallback

The first B2 run produced metrics with ttft_s=null/tpot_s=null for every decode request because the OpenAI-style payload did not set return_token_ids: true, and the parser only inspected choices[0].token_ids. With token_ids missing, the loop skipped every chunk, so no per-token timestamps were captured and the aggregator returned interference_index=null on all 10 cells.

Fix:
- send return_token_ids: true in the payload (matches replayer.replay)
- also accept text-delta chunks as token signals (fallback for servers that drop token_ids despite the flag)

vLLM engine_state was fine; only the load-gen metric capture was broken.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 22:39:54 +08:00
parent df3249925b
commit b9f324f2e6
1 changed file with 6 additions and 0 deletions
--- a/scripts/b2_interference.py
+++ b/scripts/b2_interference.py
@@ -51,6 +51,7 @@ async def _send(
        "max_tokens": max_tokens,
        "min_tokens": max_tokens,
        "temperature": 0,
+        "return_token_ids": True,
        "stream": True,
        "stream_options": {"include_usage": True},
    }
@@ -82,10 +83,15 @@ async def _send(
                if choices:
                    now = time.time()
                    token_ids = choices[0].get("token_ids")
+                    delta = choices[0].get("text", "")
                    if isinstance(token_ids, list) and token_ids:
                        if ttft is None:
                            ttft = now - t_dispatch
                        token_times.extend([now] * len(token_ids))
+                    elif delta:
+                        if ttft is None:
+                            ttft = now - t_dispatch
+                        token_times.append(now)
                usage = chunk.get("usage")
                if usage:
                    n_output = usage.get("completion_tokens", n_output)
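
The per-chunk logic in the second hunk can be exercised in isolation. The following is a minimal sketch, not the code in scripts/b2_interference.py: the helper name record_chunk and its signature are hypothetical, and the chunk shapes are assumed to match what the diff reads (choices[0].token_ids and choices[0].text). Timestamps are passed in explicitly so the behavior can be checked without a live server.

```python
from typing import Any, Optional


def record_chunk(
    chunk: dict[str, Any],
    now: float,
    t_dispatch: float,
    token_times: list[float],
    ttft: Optional[float],
) -> Optional[float]:
    """Hypothetical helper mirroring the patched branch.

    Appends arrival timestamps to token_times in place and returns the
    (possibly newly set) time-to-first-token.
    """
    choices = chunk.get("choices") or []
    if not choices:
        # Usage-only chunks carry no token signal.
        return ttft

    token_ids = choices[0].get("token_ids")
    delta = choices[0].get("text", "")

    if isinstance(token_ids, list) and token_ids:
        # Preferred path: one timestamp per reported token id.
        if ttft is None:
            ttft = now - t_dispatch
        token_times.extend([now] * len(token_ids))
    elif delta:
        # Fallback path: a non-empty text delta counts as one token.
        if ttft is None:
            ttft = now - t_dispatch
        token_times.append(now)
    return ttft


# Usage: one chunk with token_ids, one text-only chunk, one usage-only chunk.
times: list[float] = []
ttft = None
ttft = record_chunk({"choices": [{"text": "ab", "token_ids": [1, 2]}]}, 10.5, 10.0, times, ttft)
ttft = record_chunk({"choices": [{"text": "c"}]}, 11.0, 10.0, times, ttft)
ttft = record_chunk({"choices": [], "usage": {"completion_tokens": 3}}, 12.0, 10.0, times, ttft)
```

Note that the fallback records exactly one timestamp per text chunk, so if a server packs several tokens into a single delta, len(token_times) may come out lower than usage.completion_tokens.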