R1 default run.sh, 4 GPUs, 1.0x
QPS 0.110594 Goodput 0.108720 Goodput/GPU 0.027180
TTFT 1171.53 / 2566.92 ms TPOT 7.56 / 11.30 ms Pass 98.31%
Diagnosis: underutilized, but this round had a client-side stream-line bug.
Action: fix harness and continue with a clean confirmation later.

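The three throughput figures in each round appear to be linked by simple arithmetic: offered QPS scales with the load multiplier, goodput equals QPS times the fraction of requests that met the SLO, and Goodput/GPU divides that by the 4 GPUs. The Python sketch below reproduces the R2 numbers under two assumptions the log does not state outright: the pass rate is a request count out of 59 (inferred from 72.88% ≈ 43/59), and the 1.0x rate is the 0.110594 QPS reported for R1 and R8. The function names are illustrative only.

```python
NUM_GPUS = 4          # from the R1 header
BASE_QPS = 0.110594   # QPS at 1.0x, as reported for R1 and R8
TOTAL_REQUESTS = 59   # assumption: inferred from the pass rates, not stated


def offered_qps(multiplier: float) -> float:
    """Offered load for a given rate multiplier."""
    return BASE_QPS * multiplier


def goodput(qps: float, passed: int, total: int = TOTAL_REQUESTS) -> float:
    """Requests per second that met the SLO."""
    return qps * passed / total


def goodput_per_gpu(qps: float, passed: int, num_gpus: int = NUM_GPUS) -> float:
    """Goodput normalized by the number of GPUs in the deployment."""
    return goodput(qps, passed) / num_gpus


# R2: reported QPS 0.442377, pass rate 72.88% (assumed 43 of 59 requests)
r2_qps = 0.442377
print(f"{offered_qps(4.0):.4f}")             # 0.4424
print(f"{goodput(r2_qps, 43):.6f}")          # 0.322410
print(f"{goodput_per_gpu(r2_qps, 43):.6f}")  # 0.080603
```

The same check works for the other rounds by substituting their QPS and the request count implied by their pass rate.
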
R2 GPU_MEMORY_UTILIZATION=0.8, 4.0x
QPS 0.442377 Goodput 0.322410 Goodput/GPU 0.080603
TTFT 2306.44 / 5880.85 ms TPOT 15.51 / 41.96 ms Pass 72.88%
Diagnosis: prefill/queueing-limited.
Action: reduce offered load to find the knee.

R3 GPU_MEMORY_UTILIZATION=0.8, 3.0x
QPS 0.331783 Goodput 0.269925 Goodput/GPU 0.067481
TTFT 1835.28 / 5026.43 ms TPOT 12.29 / 23.83 ms Pass 81.36%
Diagnosis: still prefill/queueing-limited.
Action: try larger prefill batch and remove speculative overhead.

R4 GPU_MEMORY_UTILIZATION=0.8, MAX_NUM_BATCHED_TOKENS=32768, intended no-spec, 3.0x
QPS 0.331783 Goodput 0.264301 Goodput/GPU 0.066075
TTFT 1882.44 / 5071.41 ms TPOT 12.16 / 24.34 ms Pass 79.66%
Diagnosis: still prefill-limited; change did not help.
Action: patch run.sh so empty SPECULATIVE_CONFIG really disables speculation.

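The R4 action hinges on telling an unset variable apart from one that is explicitly set to the empty string. In a shell script, one common cause of this kind of bug is `${VAR:-default}`, which substitutes the default for both cases, whereas `${VAR-default}` substitutes it only when the variable is unset. Whether that is what happened in run.sh is not shown in this log. The Python sketch below illustrates the intended behavior only; the default value and the `--speculative-config` flag name are hypothetical placeholders, not taken from run.sh.

```python
# Hypothetical placeholder: the real default used by run.sh is not shown in the log.
DEFAULT_SPEC_CONFIG = '{"placeholder": "default speculative config"}'


def speculative_args(env: dict) -> list:
    """Build the speculation-related server arguments from an environment mapping.

    Unset variable  -> fall back to the default config.
    Empty string    -> speculation disabled, no flag emitted.
    Any other value -> passed through unchanged.
    """
    if "SPECULATIVE_CONFIG" not in env:
        value = DEFAULT_SPEC_CONFIG
    else:
        value = env["SPECULATIVE_CONFIG"]

    if value == "":
        return []
    # The flag name is an assumption made for illustration.
    return ["--speculative-config", value]


print(speculative_args({}))                          # default config is used
print(speculative_args({"SPECULATIVE_CONFIG": ""}))  # []
```

Passing `os.environ` as `env` applies the same rule to the real process environment, so `SPECULATIVE_CONFIG=''` produces no speculation flag at all instead of silently falling back to the default.
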
|
R5 GPU_MEMORY_UTILIZATION=0.8, baseline batching/spec, 2.0x
QPS 0.221188 Goodput 0.202444 Goodput/GPU 0.050611
TTFT 1464.60 / 3545.68 ms TPOT 10.00 / 25.96 ms Pass 91.53%
Diagnosis: improved, but still TTFT/pass-rate limited.
Action: retry 2.0x with real no-spec + larger prefill batch.

R6 GPU_MEMORY_UTILIZATION=0.8, MAX_NUM_BATCHED_TOKENS=32768, SPECULATIVE_CONFIG='', 2.0x
QPS 0.221188 Goodput 0.198695 Goodput/GPU 0.049674
TTFT 1485.97 / 4219.81 ms TPOT 17.64 / 29.77 ms Pass 89.83%
Diagnosis: no-spec reduced decode step time but did not improve SLO pass rate.
Action: stop chasing config knobs; search lower rate frontier.

R7 GPU_MEMORY_UTILIZATION=0.8, baseline batching/spec, 1.5x
QPS 0.165891 Goodput 0.157456 Goodput/GPU 0.039364
TTFT 1338.11 / 3048.92 ms TPOT 8.60 / 14.51 ms Pass 94.92%
Diagnosis: still TTFT-limited; frontier is below 1.5x.
Action: run a clean 1.0x confirmation.

R8 GPU_MEMORY_UTILIZATION=0.8, baseline batching/spec, 1.0x
QPS 0.110594 Goodput 0.110594 Goodput/GPU 0.027649
TTFT 1202.72 / 2596.63 ms TPOT 7.53 / 11.25 ms Pass 100.00%
Diagnosis: compliant and underutilized.
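
Taken together, the baseline-configuration rounds bracket the compliant rate between 1.0x (R8) and 1.5x (R7). The Python sketch below shows one way to compute that bracket and pick the next rate to probe by bisection. It assumes the pass rate does not increase as load increases. The required pass rate is not stated in the log, so `MIN_PASS_RATE` is a hypothetical placeholder; any value above 94.92% gives the same bracket for these rounds.

```python
# (load multiplier, pass rate) for the rounds run with baseline batching/spec
# and GPU_MEMORY_UTILIZATION=0.8, copied from R2, R3, R5, R7 and R8.
ROUNDS = {
    "R8": (1.0, 1.0000),
    "R7": (1.5, 0.9492),
    "R5": (2.0, 0.9153),
    "R3": (3.0, 0.8136),
    "R2": (4.0, 0.7288),
}

MIN_PASS_RATE = 0.99  # hypothetical threshold, not stated in the log


def bracket_frontier(rounds, min_pass_rate):
    """Return (highest compliant multiplier, lowest non-compliant multiplier)."""
    ok = [m for m, p in rounds.values() if p >= min_pass_rate]
    bad = [m for m, p in rounds.values() if p < min_pass_rate]
    lo = max(ok) if ok else None
    hi = min(bad) if bad else None
    return lo, hi


def next_probe(lo, hi):
    """Midpoint of the bracket: the next load multiplier to test."""
    return (lo + hi) / 2


lo, hi = bracket_frontier(ROUNDS, MIN_PASS_RATE)
print(lo, hi, next_probe(lo, hi))  # 1.0 1.5 1.25
```

Each additional run halves the bracket, so a few more rounds would locate the frontier to within a small fraction of the 1.0x rate.
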