The prior write-up presented one specific reading of the data as
the headline without flagging methodology gaps. Three corrections:
1. The "0% low-concurrency tax" comes from a single back-to-back
mooncake_both_v2/plain_v2 rerun. The original Phase A pair
showed TTFT p90 +29%, TPOT p90 +54%, E2E p90 +55% at rate=2
— a 40 percentage-point swing between two consecutive runs
that the original write-up did not call out. The run-to-run
noise floor is too high to claim "0%" at low concurrency.
2. get_finished() was never instrumented. The patch only times
step_duration_us and build_meta_us. "100% of per-step cost is
build_meta" is an upper bound on what was timed, not a true
decomposition.
3. H5 (cache-size dependence) was the central hypothesis but
was never tested in the prior run; random content kept APC
near empty.
The +7-9% high-concurrency (single instance, 512x64, rate=8-16)
and +17% 8-instance-saturated numbers are kept; they were
measured with adequate sample sizes and are reproducible.
The follow-up sweep in cache_sweep/ tests H5 directly and
revises the decomposition.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>