After the B3 audit bug fixes (joined_analysis hotspot median +
b3_analyze percentile interp), regenerate b3_policy_comparison.json
and the per-policy hotspot_index.json from the same raw run on
dash0, and re-render the three affected figures (apc-vs-hotspot,
latency-bars, per-worker TTFT).

Key number changes in window_1_results.md:
- hotspot_index magnitudes corrected (all five policies; lmetric
  smallest delta at +0.7%, sticky largest at +16.1%)
- "capped reduces hotspot 13%" -> "~10% (2.253 -> 2.020)"
- TTFT/E2E/TPOT percentiles shift by <1% from floor->interp
  (unified TTFT p90 7.24 -> 7.35 s)

Restructured "Caveats" into "Limitations (read this before quoting
B3 numbers)":
1. Agentic dispatch coupling is by design: promoted from caveat
   to top-level methodology framing, tied to
   agentic_dispatch_coupling.md
2. B3 interference_index is binary (not size-graded): added
3. Hot-sweep cache contamination (<1%): kept
4. Unified interference unrecoverable: kept with explicit warning
   not to read unified's failure attribution as causal
5. w600 is a sample, not full trace: kept
6. Reuse decomposition is per-token in expectation: added

current_results/characterization_claim_matrix.md updates:
- The "heavy-tail not sole cause" claim now cites the corrected
  ~10% drop with the median bug noted
- New supported claim: "B3 saturated-replay latency gaps include an
  agentic dispatch-coupling feedback term, which is intentional and
  matches production"; cited against agentic_dispatch_coupling.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>