Characterization Protocols For Remaining Batches

Status: implementation protocol and audit checklist
Date: 2026-05-25

This file completes the analysis/characterization scaffold for the TODO list. It separates what is already implemented from what requires fresh GPU runs or new engine/proxy instrumentation.

Implemented Now

Batch 0/1 Analyzer

Use:

python3 analysis/characterization/analyze.py \
  --trace traces/w600_r0.0015_st30.jsonl \
  --kv-bytes-per-token 98304 \
  --task-name w600_local_full_trace \
  --overwrite

The analyzer writes:

  • manifest.json
  • summary.json
  • summary.md
  • audit.md
  • session_concurrency.json
  • session_arrival_stats.json
  • turn_interval_stats.json
  • trace_profile.json
  • workload_summary.json
  • kv_footprint_summary.json
  • reuse_decomposition.json
  • session_skew.json
  • append_delta_stats.json

Limitations:

  • Actual online sequentiality requires dispatch and finish/error timestamps. Existing metrics.jsonl artifacts generally do not contain these fields.
  • Actual reuse decomposition requires cached_tokens/cache_hit, hash_ids, and session_id in the same joinable request record.
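
A minimal sketch of how an existing metrics.jsonl could be screened against the two limitations above before running the analyzer. The key names `dispatch_ts`, `finish_ts`, and `error_ts` are hypothetical stand-ins for the dispatch and finish/error timestamps; `cached_tokens`, `cache_hit`, `hash_ids`, and `session_id` are taken from the list above, but their exact spelling in a given artifact is an assumption.

```python
import json


def record_capabilities(record):
    """Report which analyses a single request record could support."""
    # Hypothetical timestamp keys; adjust to the real metrics schema.
    has_dispatch = "dispatch_ts" in record
    has_end = "finish_ts" in record or "error_ts" in record
    has_cache = "cached_tokens" in record or "cache_hit" in record
    return {
        "sequentiality": has_dispatch and has_end,
        "reuse_decomposition": (
            has_cache and "hash_ids" in record and "session_id" in record
        ),
    }


def audit_lines(lines):
    """Count records that carry the fields each analysis needs."""
    counts = {"records": 0, "sequentiality": 0, "reuse_decomposition": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        counts["records"] += 1
        for name, ok in record_capabilities(json.loads(line)).items():
            counts[name] += int(ok)
    return counts
```

Usage would be along the lines of `audit_lines(open("metrics.jsonl"))`; a run where either count is below `records` can only support the corresponding claim partially.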

Existing-Run Audit

Use:

python3 analysis/characterization/summarize_runs.py

The script writes an audit package under:

analysis/characterization/current_results/

It summarizes already completed runs and explicitly marks which claims are supported, partially supported, or not yet supported.

Batch 2 Protocol: PD-Colo Prefill/Decode Interference

Purpose:

Determine whether same-worker prefill overlap increases decode TPOT or decode queue delay.

Required new instrumentation:

  • per-request dispatch timestamp
  • per-request finish/error timestamp
  • per decode step timestamp
  • decode step worker id
  • prefill chunk start/end timestamp
  • prefill worker id
  • request/session id associated with each prefill chunk

Required arms:

  1. decode-only steady load
  2. decode + same-worker heavy prefill injection
  3. decode + different-worker heavy prefill injection
  4. trace replay with overlap labels

Required sweep:

uncached_prefill_tokens in {2k, 8k, 16k, 32k, 64k}
chunked_prefill_size in available engine values

Required outputs:

  • interference_microbench_summary.json
  • decode_step_timeseries.csv
  • prefill_overlap_events.jsonl
  • interference_index.json
  • TPOT timeline figure with prefill overlays
  • same-worker vs different-worker TPOT boxplot

Pass condition:

TPOT_p90(overlap_same_worker) / TPOT_p90(no_overlap) > 1

and the effect must be materially weaker in the different-worker control.
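
The pass condition can be sketched as follows, assuming per-step TPOT samples have already been split into the three groups. The nearest-rank percentile and the function names are illustrative choices, not the analyzer's actual implementation, and the sketch only reports whether the different-worker ratio is smaller; deciding what counts as "materially weaker" is left to the protocol.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def interference_ratios(no_overlap, same_worker, different_worker, q=90):
    """Compare TPOT percentiles of both overlap arms to the baseline."""
    base = percentile(no_overlap, q)
    same = percentile(same_worker, q) / base
    diff = percentile(different_worker, q) / base
    return {
        "same_worker_ratio": same,
        "different_worker_ratio": diff,
        # Necessary but not sufficient for the pass condition.
        "same_worker_slower": same > 1,
        "different_worker_weaker": diff < same,
    }
```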

Batch 3 Protocol: Session Hot-Spot Residual Imbalance

Purpose:

Determine whether cache-aware/LMetric routing still leaves hot workers under session-heavy skew.

Required new instrumentation:

  • route decision per request
  • chosen worker
  • candidate worker scores
  • cache hit / estimated uncached tokens per candidate
  • per-worker request queue length/delay
  • per-worker decode queue length/delay
  • per-worker KV occupancy
  • per-worker APC/cache-hit snapshot

Required arms:

  1. corrected LMetric/cache-aware
  2. load-only routing
  3. hard sticky routing
  4. current Unified hybrid
  5. session-mass capped/equalized replay

Required outputs:

  • worker_balance_summary.json
  • session_to_worker_map.json
  • session_mass_summary.json
  • routing_policy_comparison.json
  • hotspot_index.json
  • per-worker queue delay bar
  • APC vs queue delay scatter
  • top-session contribution bar
  • policy tradeoff plot: APC vs hot-spot index

Pass condition:

LMetric/cache-aware must show measurable residual worker skew, and that skew must correlate with session token mass or locality.

GPU utilization alone is not enough for this claim.
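
The protocol names hotspot_index.json but does not define the index. One possible definition, offered purely as an assumption, is the ratio of the worst per-worker mean queue delay to the mean across workers, paired with a Pearson correlation between per-worker queue delay and per-worker session token mass.

```python
import math


def hotspot_index(per_worker_delay):
    """Max-over-mean of per-worker queue delay (assumed definition)."""
    values = list(per_worker_delay.values())
    mean = sum(values) / len(values)
    # An idle cluster is treated as perfectly balanced.
    return max(values) / mean if mean > 0 else 1.0


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def skew_vs_mass(per_worker_delay, per_worker_token_mass):
    """Correlate queue delay with session token mass over shared workers."""
    workers = sorted(set(per_worker_delay) & set(per_worker_token_mass))
    return pearson(
        [per_worker_delay[w] for w in workers],
        [per_worker_token_mass[w] for w in workers],
    )
```

An index near 1 would indicate balanced workers; a high index with a strong positive correlation would be consistent with, though not proof of, session-driven hot-spotting.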

Batch 4 Protocol: Sustainable Request Rate

Purpose:

Measure:

SRR(SLO) = max arrival rate satisfying SLO in steady state

Required load generator behavior:

  • open-loop session arrivals, preferably Poisson
  • session-internal sequentiality
  • warmup window
  • steady-state measurement window
  • explicit attempted/completed/error counters
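
A minimal sketch of the load generator behavior listed above, assuming session starts follow a Poisson process (exponential inter-arrival times) and that a follow-up turn is dispatched only after the previous turn finishes. All names here, including `think_time_s`, are hypothetical and not part of any existing load generator.

```python
import random
from dataclasses import dataclass


def poisson_session_arrivals(rate_per_s, duration_s, seed=0):
    """Open-loop session start times in [0, duration_s)."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential inter-arrival gap
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


def next_turn_dispatch(prev_finish_ts, think_time_s):
    """Session-internal sequentiality: wait for the previous turn."""
    return prev_finish_ts + think_time_s


def in_measurement_window(ts, warmup_s, end_s):
    """True if ts falls after warmup and before the end of the run."""
    return warmup_s <= ts < end_s


@dataclass
class Counters:
    attempted: int = 0
    completed: int = 0
    errors: int = 0

    def error_rate(self):
        return self.errors / self.attempted if self.attempted else 0.0
```

Session arrivals stay open-loop because their times are drawn up front and never depend on server responses; only turns within a session wait on completions.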

Provisional SLO:

TTFT_p90 <= T_ttft
E2E_p90  <= T_e2e
TPOT_p90 <= T_tpot
error_rate <= epsilon
queue length stable
KV occupancy stable
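
The provisional SLO and the SRR definition can be combined in a short sketch. The summary keys (`ttft_p90`, `queue_stable`, and so on) are hypothetical, the thresholds are passed in because the protocol leaves them symbolic, and how stability is decided is outside this sketch. SRR is taken here as the largest tested rate for which that rate and every lower tested rate pass, which is a conservative reading of "max arrival rate satisfying SLO".

```python
def meets_slo(summary, slo):
    """Check one steady-state run summary against the provisional SLO."""
    return bool(
        summary["ttft_p90"] <= slo["T_ttft"]
        and summary["e2e_p90"] <= slo["T_e2e"]
        and summary["tpot_p90"] <= slo["T_tpot"]
        and summary["error_rate"] <= slo["epsilon"]
        and summary["queue_stable"]
        and summary["kv_stable"]
    )


def sustainable_request_rate(runs, slo):
    """Largest tested lambda such that it and all lower lambdas pass.

    `runs` maps arrival rate (lambda) to that run's summary dict.
    Returns None if even the lowest tested rate violates the SLO.
    """
    srr = None
    for lam in sorted(runs):
        if not meets_slo(runs[lam], slo):
            break  # stop at the first violation
        srr = lam
    return srr
```

Stopping at the first violation avoids reporting a higher rate that passed only by chance after a lower rate already failed.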

Required arms:

  1. PD-colo corrected LMetric/cache-aware
  2. static PD-disagg
  3. current Unified hybrid
  4. optional hard sticky
  5. optional load-only

Required outputs:

  • srr_curve.json
  • lambda_runs/<lambda>/summary.json
  • slo_violation_reason.json
  • goodput_vs_arrival_rate.json
  • SRR bar chart
  • latency vs arrival rate curves
  • goodput vs arrival rate
  • queue/KV stability plot near failure point

Pass condition:

Each policy has a measured max sustainable lambda under the same SLO and same session-causal arrival process.

Batch 5 Protocol: Failure Attribution Near SRR Boundary

Purpose:

Explain why each policy fails near SRR.

Required rates:

lambda = 0.9 * SRR
lambda = 1.0 * SRR
lambda = 1.1 * SRR

Labels for each slow/SLO-violating request:

  • same-worker prefill overlap
  • hot worker queue
  • high KV occupancy
  • cache miss / large uncached append
  • transfer wait
  • P queue wait
  • D admission wait
  • unknown
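
Once each slow request carries a label, the aggregate breakdown is a simple count. The sketch below assumes one label per request stored under a hypothetical `label` key; unlabeled or unrecognized entries are folded into `unknown`.

```python
from collections import Counter

LABELS = (
    "same-worker prefill overlap",
    "hot worker queue",
    "high KV occupancy",
    "cache miss / large uncached append",
    "transfer wait",
    "P queue wait",
    "D admission wait",
    "unknown",
)


def failure_breakdown(records):
    """Count and normalize violation labels over slow-request records."""
    counts = Counter()
    for rec in records:
        label = rec.get("label", "unknown")
        if label not in LABELS:
            label = "unknown"  # never invent a new category silently
        counts[label] += 1
    total = sum(counts.values())
    return {
        label: {
            "count": counts[label],
            "fraction": counts[label] / total if total else 0.0,
        }
        for label in LABELS
    }
```

A large `unknown` share would suggest the instrumentation is still insufficient to attribute failures at that arrival rate.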

Required outputs:

  • slow_request_attribution.jsonl
  • failure_breakdown.json
  • case_studies.md
  • worker_failure_windows.json
  • violation cause stacked bar
  • slow request waterfall
  • worker timeline near failure

Pass condition:

The analysis must explain whether PD-colo is limited by interference, hot-spot, KV pressure, or a mixture, and whether Unified/PUSH underperforms because of trigger quality, transfer cost, target admission, or load regime.

Batch 6 Protocol: Audit Package

Implemented by summarize_runs.py for existing runs; to be extended with fresh Batch 2-5 outputs once they exist.

Required files:

  • characterization_claim_matrix.md
  • all_figures_index.md
  • reviewer_risk_register.md
  • reproduction_commands.sh
  • main_claim_allowed_runs.md

The current package intentionally marks Batch 2/4/5 claims as not yet supported; they stay that way until fresh instrumented experiments exist.