obsidian/projects/auto-tuner/Untitled 3.md

R1: default run.sh, 4 GPUs, 1.0x. QPS 0.110594; Goodput 0.108720; Goodput/GPU 0.027180; TTFT 1171.53 / 2566.92 ms; TPOT 7.56 / 11.30 ms; Pass 98.31%. Diagnosis: underutilized, but this round had a client-side stream-line bug. Action: fix harness and continue with a clean confirmation later.

R2: GPU_MEMORY_UTILIZATION=0.8, 4.0x. QPS 0.442377; Goodput 0.322410; Goodput/GPU 0.080603; TTFT 2306.44 / 5880.85 ms; TPOT 15.51 / 41.96 ms; Pass 72.88%. Diagnosis: prefill/queueing-limited. Action: reduce offered load to find the knee.
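The Goodput and Goodput/GPU figures in these rounds are consistent with goodput being the offered QPS multiplied by the fraction of requests that pass the SLO, and the per-GPU value being that divided by the 4 GPUs noted in R1. The log states neither the formula nor the request count. Both are inferred here: 59 requests per round is the smallest count that reproduces the logged pass rates (72.88% is 43 of 59). A minimal sketch under those assumptions, checked against R2:

```python
# Assumption (not stated in the log): goodput = QPS * passed / total,
# and Goodput/GPU = goodput / number of GPUs.
NUM_GPUS = 4          # from R1 ("4 GPUs")
TOTAL_REQUESTS = 59   # inferred from the pass rates, not stated in the log


def goodput(qps: float, passed: int, total: int = TOTAL_REQUESTS) -> float:
    """Requests per second that met the SLO."""
    return qps * passed / total


def goodput_per_gpu(qps: float, passed: int, total: int = TOTAL_REQUESTS,
                    num_gpus: int = NUM_GPUS) -> float:
    """Goodput normalized by GPU count."""
    return goodput(qps, passed, total) / num_gpus


if __name__ == "__main__":
    # R2: QPS 0.442377, 43 of 59 requests passing (72.88%)
    print(f"pass rate    {43 / 59:.2%}")
    print(f"goodput      {goodput(0.442377, 43):.6f}")
    print(f"goodput/GPU  {goodput_per_gpu(0.442377, 43):.6f}")
```

If the real harness computes goodput differently (for example over a fixed time window), the numbers may still agree to this precision, so treat the formula as a plausible reading of the log and not as the harness definition.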

R3: GPU_MEMORY_UTILIZATION=0.8, 3.0x. QPS 0.331783; Goodput 0.269925; Goodput/GPU 0.067481; TTFT 1835.28 / 5026.43 ms; TPOT 12.29 / 23.83 ms; Pass 81.36%. Diagnosis: still prefill/queueing-limited. Action: try larger prefill batch and remove speculative overhead.

R4: GPU_MEMORY_UTILIZATION=0.8, MAX_NUM_BATCHED_TOKENS=32768, intended no-spec, 3.0x. QPS 0.331783; Goodput 0.264301; Goodput/GPU 0.066075; TTFT 1882.44 / 5071.41 ms; TPOT 12.16 / 24.34 ms; Pass 79.66%. Diagnosis: still prefill-limited; change did not help. Action: patch run.sh so empty SPECULATIVE_CONFIG really disables speculation.
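The log does not show the run.sh patch itself. One common cause of this kind of bug is a default that is applied whenever the variable is empty, so that an empty SPECULATIVE_CONFIG silently falls back to the speculative default. The sketch below shows the unset-versus-empty distinction in Python. The default value and the `--speculative-config` flag name are hypothetical placeholders, not taken from run.sh.

```python
from typing import List, Mapping, Optional

# Hypothetical default; the real default in run.sh is not shown in the log.
DEFAULT_SPEC_CONFIG = '{"method": "placeholder"}'


def resolve_spec_config(env: Mapping[str, str]) -> Optional[str]:
    """Return the speculative config to use, or None to disable speculation.

    unset variable      -> fall back to the default
    set but empty value -> speculation explicitly disabled
    """
    if "SPECULATIVE_CONFIG" not in env:
        return DEFAULT_SPEC_CONFIG
    value = env["SPECULATIVE_CONFIG"].strip()
    return value or None


def build_server_args(env: Mapping[str, str]) -> List[str]:
    """Build the (hypothetical) server flags related to speculation."""
    args: List[str] = []
    spec = resolve_spec_config(env)
    if spec is not None:
        # Flag name is an assumption for illustration only.
        args += ["--speculative-config", spec]
    return args


if __name__ == "__main__":
    print(build_server_args({}))                          # default applied
    print(build_server_args({"SPECULATIVE_CONFIG": ""}))  # no spec flag
```

In a POSIX shell script the same distinction is the difference between `${SPECULATIVE_CONFIG:-default}`, which substitutes the default for both unset and empty values, and `${SPECULATIVE_CONFIG-default}`, which substitutes it only when the variable is unset.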

R5: GPU_MEMORY_UTILIZATION=0.8, baseline batching/spec, 2.0x. QPS 0.221188; Goodput 0.202444; Goodput/GPU 0.050611; TTFT 1464.60 / 3545.68 ms; TPOT 10.00 / 25.96 ms; Pass 91.53%. Diagnosis: improved, but still TTFT/pass-rate limited. Action: retry 2.0x with real no-spec and a larger prefill batch.

R6: GPU_MEMORY_UTILIZATION=0.8, MAX_NUM_BATCHED_TOKENS=32768, SPECULATIVE_CONFIG='', 2.0x. QPS 0.221188; Goodput 0.198695; Goodput/GPU 0.049674; TTFT 1485.97 / 4219.81 ms; TPOT 17.64 / 29.77 ms; Pass 89.83%. Diagnosis: no-spec reduced decode step time, but TPOT rose relative to R5 (10.00 / 25.96 ms to 17.64 / 29.77 ms) and the SLO pass rate did not improve (89.83% vs 91.53%). Action: stop chasing config knobs; search for the frontier at lower rates.

R7: GPU_MEMORY_UTILIZATION=0.8, baseline batching/spec, 1.5x. QPS 0.165891; Goodput 0.157456; Goodput/GPU 0.039364; TTFT 1338.11 / 3048.92 ms; TPOT 8.60 / 14.51 ms; Pass 94.92%. Diagnosis: still TTFT-limited; frontier is below 1.5x. Action: run a clean 1.0x confirmation.

R8: GPU_MEMORY_UTILIZATION=0.8, baseline batching/spec, 1.0x. QPS 0.110594; Goodput 0.110594; Goodput/GPU 0.027649; TTFT 1202.72 / 2596.63 ms; TPOT 7.53 / 11.25 ms; Pass 100.00%. Diagnosis: compliant and underutilized.
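The search across R1 to R8 amounts to bracketing the highest load multiplier that still meets the SLO pass-rate target. The sketch below replays the logged rounds and reports that bracket. The pass-rate threshold is a hypothetical value: the log only implies that 94.92% (R7) is not compliant and 100.00% (R8) is. R1 is excluded because of its client-side bug.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Run:
    name: str
    load: float        # load multiplier (1.0x, 1.5x, ...)
    goodput: float     # as logged
    pass_rate: float   # as logged, expressed as a fraction
    clean: bool = True


RUNS: List[Run] = [
    Run("R1", 1.0, 0.108720, 0.9831, clean=False),  # client-side bug
    Run("R2", 4.0, 0.322410, 0.7288),
    Run("R3", 3.0, 0.269925, 0.8136),
    Run("R4", 3.0, 0.264301, 0.7966),
    Run("R5", 2.0, 0.202444, 0.9153),
    Run("R6", 2.0, 0.198695, 0.8983),
    Run("R7", 1.5, 0.157456, 0.9492),
    Run("R8", 1.0, 0.110594, 1.0000),
]

MIN_PASS_RATE = 0.99  # hypothetical threshold; the actual target is not in the log


def best_compliant(runs: List[Run], min_pass_rate: float = MIN_PASS_RATE) -> Optional[Run]:
    """Clean run with the highest goodput that meets the pass-rate target."""
    ok = [r for r in runs if r.clean and r.pass_rate >= min_pass_rate]
    return max(ok, key=lambda r: r.goodput, default=None)


def frontier_bracket(runs: List[Run], min_pass_rate: float = MIN_PASS_RATE
                     ) -> Tuple[Optional[float], Optional[float]]:
    """(highest compliant load, lowest non-compliant load above it)."""
    clean = [r for r in runs if r.clean]
    passing = [r.load for r in clean if r.pass_rate >= min_pass_rate]
    failing = [r.load for r in clean if r.pass_rate < min_pass_rate]
    low = max(passing, default=None)
    high = min((x for x in failing if low is None or x > low), default=None)
    return low, high


if __name__ == "__main__":
    best = best_compliant(RUNS)
    print(best.name if best else None, frontier_bracket(RUNS))
```

With these inputs the bracket is 1.0x to 1.5x, which matches the R7 diagnosis that the frontier lies below 1.5x; a further round between those two multipliers would narrow it.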