Characterization Protocols For Remaining Batches
Status: implementation protocol and audit checklist
Date: 2026-05-25
This file completes the analysis/characterization scaffold for the TODO
list. It separates what is already implemented from what requires fresh GPU
runs or new engine/proxy instrumentation.
Implemented Now
Batch 0/1 Analyzer
Use:
python3 analysis/characterization/analyze.py \
--trace traces/w600_r0.0015_st30.jsonl \
--kv-bytes-per-token 98304 \
--task-name w600_local_full_trace \
--overwrite
The analyzer writes:
- manifest.json
- summary.json
- summary.md
- audit.md
- session_concurrency.json
- session_arrival_stats.json
- turn_interval_stats.json
- trace_profile.json
- workload_summary.json
- kv_footprint_summary.json
- reuse_decomposition.json
- session_skew.json
- append_delta_stats.json
Limitations:
- Actual online sequentiality requires dispatch and finish/error timestamps. Existing metrics.jsonl artifacts generally do not contain these fields.
- Actual reuse decomposition requires cached_tokens/cache_hit, hash_ids, and session_id in the same joinable request record.
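The two limitations above can be checked mechanically before an analysis run. The following is a minimal sketch, not part of the existing analyzer: the reuse field names come from the text above, while `dispatch_ts` and `finish_ts` are assumed placeholder names for whatever timestamp fields the engine or proxy eventually logs.

```python
import json

# Reuse-related field names are taken from the limitations listed above.
REUSE_FIELDS = ("hash_ids", "session_id")
CACHE_FIELDS = ("cached_tokens", "cache_hit")  # either one is sufficient
# Assumed placeholder names; the real timestamp field names are not specified.
TIMESTAMP_FIELDS = ("dispatch_ts", "finish_ts")


def audit_records(lines):
    """Count JSONL records usable for sequentiality and reuse analysis."""
    counts = {"total": 0, "sequentiality_ok": 0, "reuse_ok": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        counts["total"] += 1
        if all(rec.get(f) is not None for f in TIMESTAMP_FIELDS):
            counts["sequentiality_ok"] += 1
        has_cache = any(rec.get(f) is not None for f in CACHE_FIELDS)
        if has_cache and all(rec.get(f) is not None for f in REUSE_FIELDS):
            counts["reuse_ok"] += 1
    return counts
```

Passing an open metrics.jsonl file handle as `lines` gives a quick count of how many records could support each claim.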
Existing-Run Audit
Use:
python3 analysis/characterization/summarize_runs.py
The script writes an audit package under:
analysis/characterization/current_results/
It summarizes already completed runs and explicitly marks which claims are supported, partially supported, or not yet supported.
Batch 2 Protocol: PD-Colo Prefill/Decode Interference
Purpose:
Test whether same-worker prefill overlap increases decode TPOT/queue delay.
Required new instrumentation:
- per-request dispatch timestamp
- per-request finish/error timestamp
- per decode step timestamp
- decode step worker id
- prefill chunk start/end timestamp
- prefill worker id
- request/session id associated with each prefill chunk
Required arms:
- decode-only steady load
- decode + same-worker heavy prefill injection
- decode + different-worker heavy prefill injection
- trace replay with overlap labels
Required sweep:
uncached_prefill_tokens in {2k, 8k, 16k, 32k, 64k}
chunked_prefill_size in available engine values
Required outputs:
- interference_microbench_summary.json
- decode_step_timeseries.csv
- prefill_overlap_events.jsonl
- interference_index.json
- TPOT timeline figure with prefill overlays
- same-worker vs different-worker TPOT boxplot
Pass condition:
TPOT_p90(overlap_same_worker) / TPOT_p90(no_overlap) > 1
and the effect must be materially weaker in the different-worker control.
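As an illustration, the pass condition could be evaluated roughly as follows. This is a sketch under stated assumptions: the percentile uses the nearest-rank method, and `min_gap` is a hypothetical threshold standing in for "materially weaker", which the protocol does not quantify.

```python
def p90(values):
    """Nearest-rank 90th percentile of a non-empty sample."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n)
    return ordered[rank - 1]


def interference_ratios(no_overlap, same_worker, different_worker):
    """TPOT_p90 of each overlap arm divided by the no-overlap baseline."""
    base = p90(no_overlap)
    return {
        "same_worker": p90(same_worker) / base,
        "different_worker": p90(different_worker) / base,
    }


def passes(ratios, min_gap=0.0):
    """Same-worker ratio must exceed 1 and exceed the control by min_gap.

    min_gap is an assumed knob, not a value defined by the protocol.
    """
    same = ratios["same_worker"]
    control = ratios["different_worker"]
    return same > 1.0 and (same - control) > min_gap
```

Each argument is a list of per-step TPOT samples from one arm; the returned ratios are the kind of values that interference_index.json might record.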
Batch 3 Protocol: Session Hot-Spot Residual Imbalance
Purpose:
Test whether cache-aware/LMetric still leaves hot workers under session-heavy skew.
Required new instrumentation:
- route decision per request
- chosen worker
- candidate worker scores
- cache hit / estimated uncached tokens per candidate
- per-worker request queue length/delay
- per-worker decode queue length/delay
- per-worker KV occupancy
- per-worker APC/cache-hit snapshot
Required arms:
- corrected LMetric/cache-aware
- load-only routing
- hard sticky routing
- current Unified hybrid
- session-mass capped/equalized replay
Required outputs:
- worker_balance_summary.json
- session_to_worker_map.json
- session_mass_summary.json
- routing_policy_comparison.json
- hotspot_index.json
- per-worker queue delay bar
- APC vs queue delay scatter
- top-session contribution bar
- policy tradeoff plot: APC vs hot-spot index
Pass condition:
LMetric/cache-aware must show measurable residual worker skew, and that skew must correlate with session token mass or locality.
GPU utilization alone is not enough for this claim.
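One way to make the two parts of this pass condition concrete is sketched below. The max-over-mean ratio is only one possible skew measure and is not necessarily the definition behind hotspot_index.json; the function names and input shapes are assumptions for illustration.

```python
def worker_token_mass(session_to_worker, session_tokens):
    """Sum session token mass per worker from a session-to-worker map."""
    mass = {}
    for session, worker in session_to_worker.items():
        mass[worker] = mass.get(worker, 0) + session_tokens.get(session, 0)
    return mass


def imbalance(values):
    """Max-over-mean ratio; 1.0 means perfectly balanced (assumed measure)."""
    values = list(values)
    mean = sum(values) / len(values)
    return max(values) / mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```

Applying `imbalance` to per-worker queue delays gives the residual skew, and `pearson` between per-worker queue delay and per-worker token mass indicates whether that skew tracks session mass.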
Batch 4 Protocol: Sustainable Request Rate
Purpose:
Measure:
SRR(SLO) = max arrival rate satisfying SLO in steady state
Required load generator behavior:
- open-loop session arrivals, preferably Poisson
- session-internal sequentiality
- warmup window
- steady-state measurement window
- explicit attempted/completed/error counters
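The open-loop requirement above can be sketched as follows: session start times are drawn from a Poisson process (exponential inter-arrival gaps), independent of how fast the system responds, while turns inside a session wait for the previous turn to finish. The function names and the `think_time` parameter are hypothetical.

```python
import random


def session_arrivals(rate, duration, seed=0):
    """Poisson session start times in [0, duration) at `rate` sessions/sec."""
    rng = random.Random(seed)
    t, starts = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential gap => Poisson process
        if t >= duration:
            return starts
        starts.append(t)


def next_turn_dispatch(prev_finish, think_time):
    """Session-internal sequentiality: turn k+1 waits for turn k to finish."""
    return prev_finish + think_time


def in_measurement_window(t, warmup, duration):
    """True if time t falls after warmup and before the end of the run."""
    return warmup <= t < duration
```

Only requests dispatched inside the measurement window would then count toward the attempted/completed/error counters.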
Provisional SLO:
TTFT_p90 <= T_ttft
E2E_p90 <= T_e2e
TPOT_p90 <= T_tpot
error_rate <= epsilon
queue length stable
KV occupancy stable
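A minimal sketch of how the provisional SLO might be evaluated for one run follows. The thresholds are left as parameters because the protocol does not fix them, and the stability test (second-half mean within a tolerance of the first-half mean) is an assumed heuristic, not a definition from this file.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    t_ttft: float   # TTFT_p90 threshold
    t_e2e: float    # E2E_p90 threshold
    t_tpot: float   # TPOT_p90 threshold
    epsilon: float  # maximum error rate


def is_stable(series, growth_tolerance=0.10):
    """Assumed heuristic: second-half mean may not exceed the first-half
    mean by more than growth_tolerance."""
    half = len(series) // 2
    if half == 0:
        return True
    first = sum(series[:half]) / half
    second = sum(series[half:]) / (len(series) - half)
    return second <= first * (1 + growth_tolerance)


def slo_violations(m, slo):
    """Return the names of violated clauses; an empty list means SLO met."""
    checks = [
        ("TTFT_p90", m["ttft_p90"] <= slo.t_ttft),
        ("E2E_p90", m["e2e_p90"] <= slo.t_e2e),
        ("TPOT_p90", m["tpot_p90"] <= slo.t_tpot),
        ("error_rate", m["error_rate"] <= slo.epsilon),
        ("queue length stable", is_stable(m["queue_len"])),
        ("KV occupancy stable", is_stable(m["kv_occupancy"])),
    ]
    return [name for name, ok in checks if not ok]
```

Returning the violated clause names, rather than a single boolean, is a design choice that would feed directly into a per-run violation reason.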
Required arms:
- PD-colo corrected LMetric/cache-aware
- static PD-disagg
- current Unified hybrid
- optional hard sticky
- optional load-only
Required outputs:
- srr_curve.json
- lambda_runs/<lambda>/summary.json
- slo_violation_reason.json
- goodput_vs_arrival_rate.json
- SRR bar chart
- latency vs arrival rate curves
- goodput vs arrival rate
- queue/KV stability plot near failure point
Pass condition:
Each policy has a measured max sustainable lambda under the same SLO and same session-causal arrival process.
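Given one SLO verdict per swept arrival rate, the measured SRR for a policy could be extracted as in this sketch. It makes a conservative assumption that is not stated in the protocol: the SRR is the highest rate such that it and every lower tested rate met the SLO, so an isolated pass above a failing rate is ignored.

```python
def sustainable_request_rate(runs):
    """runs: iterable of (arrival_rate, slo_met) pairs for one policy.

    Returns the highest rate below the first SLO failure, or None if even
    the lowest tested rate fails (assumed conservative rule).
    """
    srr = None
    for rate, slo_met in sorted(runs):
        if not slo_met:
            break
        srr = rate
    return srr
```

For example, `sustainable_request_rate([(1, True), (2, True), (3, False), (4, True)])` returns `2`, since the pass at rate 4 sits above a failure.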
Batch 5 Protocol: Failure Attribution Near SRR Boundary
Purpose:
Explain why each policy fails near SRR.
Required rates:
lambda = 0.9 * SRR
lambda = 1.0 * SRR
lambda = 1.1 * SRR
Labels for each slow/SLO-violating request:
- same-worker prefill overlap
- hot worker queue
- high KV occupancy
- cache miss / large uncached append
- transfer wait
- P queue wait
- D admission wait
- unknown
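Once each slow request carries one of the labels above, the per-rate breakdown reduces to a normalized count. The sketch below assumes a hypothetical record shape with a `label` key and snake_case label strings; neither is prescribed by this protocol.

```python
from collections import Counter

# Hypothetical snake_case spellings of the labels listed above.
LABELS = (
    "same_worker_prefill_overlap",
    "hot_worker_queue",
    "high_kv_occupancy",
    "cache_miss_large_uncached_append",
    "transfer_wait",
    "p_queue_wait",
    "d_admission_wait",
    "unknown",
)


def failure_breakdown(records):
    """Fraction of slow requests per label; unrecognized labels count as
    'unknown'. Returns an empty dict when there are no records."""
    counts = Counter()
    for rec in records:
        label = rec.get("label")
        counts[label if label in LABELS else "unknown"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in LABELS if counts[label]}
```

Running this once per rate (0.9, 1.0, and 1.1 times SRR) yields the series needed for a stacked violation-cause bar.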
Required outputs:
- slow_request_attribution.jsonl
- failure_breakdown.json
- case_studies.md
- worker_failure_windows.json
- violation cause stacked bar
- slow request waterfall
- worker timeline near failure
Pass condition:
The analysis must explain whether PD-colo is limited by interference, hot-spot, KV pressure, or a mixture, and whether Unified/PUSH underperforms because of trigger quality, transfer cost, target admission, or load regime.
Batch 6 Protocol: Audit Package
Implemented by summarize_runs.py for existing runs; to be extended later with
fresh Batch 2-5 outputs.
Required files:
- characterization_claim_matrix.md
- all_figures_index.md
- reviewer_risk_register.md
- reproduction_commands.sh
- main_claim_allowed_runs.md
The current package intentionally marks Batch 2/4/5 claims as not yet supported until fresh instrumented experiments exist.