The old filter `if row.latency_s is not None` accepted SGLang's fast
input-length-aborts (latency_s ~ 0.08s, finish_reason='abort/BadRequest')
as if they were successful zero-cost requests. This deflated the latency
mean/p50 of any run in which the model rejected oversized inputs.
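A toy illustration of the deflation, assuming made-up latencies and a
hypothetical Row stand-in for one metrics.jsonl record (only latency_s
and finish_reason come from the real rows):

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class Row:  # hypothetical stand-in for one metrics.jsonl record
    latency_s: Optional[float]
    finish_reason: Optional[str] = None


# Illustrative numbers only: three real completions, two fast aborts.
rows = [
    Row(12.0, "stop"),
    Row(15.0, "stop"),
    Row(18.0, "stop"),
    Row(0.08, "abort/BadRequest"),
    Row(0.08, "abort/BadRequest"),
]

# Old filter: any row with a latency is treated as a success.
old = [r.latency_s for r in rows if r.latency_s is not None]

# Completed requests only, for comparison.
ok = [
    r.latency_s
    for r in rows
    if r.latency_s is not None
    and not (r.finish_reason or "").startswith("abort")
]

old_mean, old_p50 = mean(old), median(old)
ok_mean, ok_p50 = mean(ok), median(ok)
print(f"old filter: mean={old_mean:.3f}s p50={old_p50:.2f}s")
print(f"completed:  mean={ok_mean:.3f}s p50={ok_p50:.2f}s")
```

Each ~0.08s abort pulls both statistics toward zero, so the run with
more aborts looks faster than it really is.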
Impact on existing comparisons (ts=1 4-run validation + v2):
- KVC v2 has 40 aborts + 5 ReadTimeouts (was reported as just 5).
- DP 4w has 67 aborts (was reported as 5).
Both runs see aborts; the asymmetry (40 vs 67) comes entirely from
SGLang's mem-fraction-derived max-input-len: the KVC decode-only worker
gets ~10 GB of free GPU memory -> max-input=92098, while the DP fused
worker gets ~9 GB -> max-input=87811, because DP also needs
chunked-prefill workspace.
The KVC-vs-DP latency-win direction holds and widens slightly under the
fixed filter (latency mean delta: -0.8% -> -1.4%); see V2_DEEP_ANALYSIS_ZH
§4.3 for the recomputed table.
Changes:
- metrics.py: new _is_failed_request(row) helper; latency/ttft/tpot
stats now exclude both errors and aborts. New summary fields
abort_count and failure_count expose the counts directly.
- scripts/analysis/recompute_summary.py: re-derives summary.json from
existing metrics.jsonl using the fixed code, with optional --diff
against the old buggy summary for inspection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>