# Characterization Protocols For Remaining Batches
Status: implementation protocol and audit checklist
Date: 2026-05-25
This file completes the `analysis/characterization` scaffold for the TODO
list. It separates what is already implemented from what requires fresh GPU
runs or new engine/proxy instrumentation.
## Implemented Now
### Batch 0/1 Analyzer
Use:
```bash
python3 analysis/characterization/analyze.py \
--trace traces/w600_r0.0015_st30.jsonl \
--kv-bytes-per-token 98304 \
--task-name w600_local_full_trace \
--overwrite
```
The analyzer writes:
- `manifest.json`
- `summary.json`
- `summary.md`
- `audit.md`
- `session_concurrency.json`
- `session_arrival_stats.json`
- `turn_interval_stats.json`
- `trace_profile.json`
- `workload_summary.json`
- `kv_footprint_summary.json`
- `reuse_decomposition.json`
- `session_skew.json`
- `append_delta_stats.json`
Limitations:
- Verifying actual online sequentiality requires per-request dispatch and
finish/error timestamps. Existing `metrics.jsonl` artifacts generally do not
contain these fields.
- Computing the actual reuse decomposition requires `cached_tokens`/`cache_hit`,
`hash_ids`, and `session_id` together in the same joinable request record.
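Whether an existing artifact can support the reuse decomposition can be checked before running the analyzer. The following is a minimal sketch, assuming each line of the artifact is one JSON request record and that the field names match those listed in the limitations above. The helper names `supports_reuse_decomposition` and `coverage` are hypothetical and are not part of `analyze.py`.

```python
import json
from typing import Iterable

# Field names taken from the limitations above. Either cache field is
# treated as sufficient; that reading of "cached_tokens/cache_hit" is an assumption.
REUSE_FIELDS_ALL = ("hash_ids", "session_id")
REUSE_FIELDS_ANY = ("cached_tokens", "cache_hit")


def supports_reuse_decomposition(record: dict) -> bool:
    """True if one record carries every field needed to join reuse data."""
    has_all = all(record.get(key) is not None for key in REUSE_FIELDS_ALL)
    has_cache = any(record.get(key) is not None for key in REUSE_FIELDS_ANY)
    return has_all and has_cache


def coverage(lines: Iterable[str]) -> float:
    """Fraction of non-empty JSONL lines usable for reuse decomposition."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return 0.0
    usable = sum(supports_reuse_decomposition(r) for r in records)
    return usable / len(records)
```

A coverage well below 1.0 suggests the run belongs in the "partially supported" or "not yet supported" column of the audit rather than in a main claim.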
### Existing-Run Audit
Use:
```bash
python3 analysis/characterization/summarize_runs.py
```
The script writes an audit package under:
```text
analysis/characterization/current_results/
```
It summarizes already completed runs and explicitly marks which claims are
supported, partially supported, or not yet supported.
## Batch 2 Protocol: PD-Colo Prefill/Decode Interference
Purpose:
Determine whether same-worker prefill overlap increases decode TPOT and queue delay.
Required new instrumentation:
- per-request dispatch timestamp
- per-request finish/error timestamp
- per-decode-step timestamp
- decode step worker id
- prefill chunk start/end timestamp
- prefill worker id
- request/session id associated with each prefill chunk
Required arms:
1. decode-only steady load
2. decode + same-worker heavy prefill injection
3. decode + different-worker heavy prefill injection
4. trace replay with overlap labels
Required sweep:
```text
uncached_prefill_tokens in {2k, 8k, 16k, 32k, 64k}
chunked_prefill_size in available engine values
```
Required outputs:
- `interference_microbench_summary.json`
- `decode_step_timeseries.csv`
- `prefill_overlap_events.jsonl`
- `interference_index.json`
- TPOT timeline figure with prefill overlays
- same-worker vs different-worker TPOT boxplot
Pass condition:
```text
TPOT_p90(overlap_same_worker) / TPOT_p90(no_overlap) > 1
```
and the same ratio computed for the different-worker control must be
materially closer to 1.
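The pass condition can be evaluated with a short script once per-step TPOT samples exist for each arm. This is a sketch under stated assumptions: it uses a nearest-rank p90, and the `margin` parameter is a hypothetical stand-in for "materially weaker", which this protocol does not quantify.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def interference_ratio(overlap_tpot: Sequence[float],
                       no_overlap_tpot: Sequence[float],
                       q: float = 90.0) -> float:
    """TPOT_p90(overlap) / TPOT_p90(no_overlap)."""
    return percentile(overlap_tpot, q) / percentile(no_overlap_tpot, q)


def batch2_passes(same_worker: Sequence[float],
                  different_worker: Sequence[float],
                  no_overlap: Sequence[float],
                  margin: float = 0.0) -> bool:
    """Same-worker ratio must exceed 1 and beat the control by `margin`.

    `margin` is a hypothetical threshold; choose it from run-to-run noise.
    """
    same = interference_ratio(same_worker, no_overlap)
    control = interference_ratio(different_worker, no_overlap)
    return same > 1.0 and (same - control) > margin
```

Because a ratio barely above 1 can arise from noise alone, reporting both ratios next to the repeat-run variance of the decode-only arm is probably more informative than the boolean alone.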
## Batch 3 Protocol: Session Hot-Spot Residual Imbalance
Purpose:
Determine whether cache-aware/LMetric routing still leaves hot workers under
session-heavy skew.
Required new instrumentation:
- route decision per request
- chosen worker
- candidate worker scores
- cache hit / estimated uncached tokens per candidate
- per-worker request queue length/delay
- per-worker decode queue length/delay
- per-worker KV occupancy
- per-worker APC/cache-hit snapshot
Required arms:
1. corrected LMetric/cache-aware
2. load-only routing
3. hard sticky routing
4. current Unified hybrid
5. session-mass capped/equalized replay
Required outputs:
- `worker_balance_summary.json`
- `session_to_worker_map.json`
- `session_mass_summary.json`
- `routing_policy_comparison.json`
- `hotspot_index.json`
- per-worker queue delay bar
- APC vs queue delay scatter
- top-session contribution bar
- policy tradeoff plot: APC vs hot-spot index
Pass condition:
LMetric/cache-aware must show measurable residual worker skew, and that skew
must correlate with session token mass or locality.
GPU utilization alone is not enough for this claim.
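This protocol does not define the metric stored in `hotspot_index.json`. One possible definition, shown here only as an illustration, is the max-over-mean of per-worker queue delay, paired with a Pearson correlation between per-worker session token mass and queue delay. Both function names and the choice of metric are assumptions.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def hotspot_index(per_worker_delay: Dict[str, float]) -> float:
    """Max-over-mean of per-worker queue delay; 1.0 means balanced."""
    values = list(per_worker_delay.values())
    avg = mean(values)
    return max(values) / avg if avg > 0 else 1.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with >= 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

For example, `hotspot_index({"w0": 1.0, "w1": 1.0, "w2": 4.0})` gives 2.0, meaning the hottest worker waits twice the fleet average. With only a handful of workers the correlation is weak evidence on its own, so the per-session contribution plots remain necessary.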
## Batch 4 Protocol: Sustainable Request Rate
Purpose:
Measure:
```text
SRR(SLO) = max arrival rate satisfying SLO in steady state
```
Required load generator behavior:
- open-loop session arrivals, preferably Poisson
- session-internal sequentiality
- warmup window
- steady-state measurement window
- explicit attempted/completed/error counters
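The arrival behavior above can be sketched as follows. Session start times are drawn from a Poisson process, so they do not depend on system response (open loop), while turns inside a session are dispatched only after the previous turn finishes. All function names, and the idea of a per-turn think time, are hypothetical and not taken from an existing load generator.

```python
import random
from typing import List, Sequence


def poisson_session_arrivals(rate_per_s: float, duration_s: float,
                             seed: int = 0) -> List[float]:
    """Session start times from exponential inter-arrival gaps."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


def turn_dispatch_times(session_start: float,
                        think_times_s: Sequence[float],
                        latencies_s: Sequence[float]) -> List[float]:
    """Turn k+1 is dispatched only after turn k finishes, plus think time."""
    dispatch, t = [], session_start
    for think, latency in zip(think_times_s, latencies_s):
        t += think
        dispatch.append(t)
        t += latency  # the next turn cannot start before this one finishes
    return dispatch


def in_measurement_window(ts: float, warmup_s: float, end_s: float) -> bool:
    """True if a timestamp falls in the steady-state window after warmup."""
    return warmup_s <= ts < end_s
```

In a real run the latencies are observed rather than known in advance, so `turn_dispatch_times` is best read as the scheduling rule that the replay loop applies turn by turn.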
Provisional SLO:
```text
TTFT_p90 <= T_ttft
E2E_p90 <= T_e2e
TPOT_p90 <= T_tpot
error_rate <= epsilon
queue length stable
KV occupancy stable
```
Required arms:
1. PD-colo corrected LMetric/cache-aware
2. static PD-disagg
3. current Unified hybrid
4. optional hard sticky
5. optional load-only
Required outputs:
- `srr_curve.json`
- `lambda_runs/<lambda>/summary.json`
- `slo_violation_reason.json`
- `goodput_vs_arrival_rate.json`
- SRR bar chart
- latency vs arrival rate curves
- goodput vs arrival rate
- queue/KV stability plot near failure point
Pass condition:
Each policy has a measured max sustainable lambda under the same SLO and
same session-causal arrival process.
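Given one summary per arrival rate, the SRR for a policy can be derived as below. This is a sketch, assuming hypothetical summary keys (`ttft_p90`, `e2e_p90`, `tpot_p90`, `error_rate`, `queue_stable`, `kv_stable`) that mirror the provisional SLO. It also assumes a conservative reading of "max": the scan stops at the first violating rate, so an isolated pass above a failure does not count.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Slo:
    t_ttft: float
    t_e2e: float
    t_tpot: float
    epsilon: float


def meets_slo(summary: dict, slo: Slo) -> bool:
    """Check one lambda run against every clause of the provisional SLO."""
    return (
        summary["ttft_p90"] <= slo.t_ttft
        and summary["e2e_p90"] <= slo.t_e2e
        and summary["tpot_p90"] <= slo.t_tpot
        and summary["error_rate"] <= slo.epsilon
        and bool(summary["queue_stable"])
        and bool(summary["kv_stable"])
    )


def sustainable_request_rate(runs: Dict[float, dict],
                             slo: Slo) -> Optional[float]:
    """Largest measured lambda such that it and all lower lambdas pass."""
    srr = None
    for lam in sorted(runs):
        if not meets_slo(runs[lam], slo):
            break
        srr = lam
    return srr
```

The result is only as fine-grained as the measured lambda grid, so the grid spacing near the boundary bounds the resolution of the reported SRR.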
## Batch 5 Protocol: Failure Attribution Near SRR Boundary
Purpose:
Explain why each policy fails near SRR.
Required rates:
```text
lambda = 0.9 * SRR
lambda = 1.0 * SRR
lambda = 1.1 * SRR
```
Labels for each slow/SLO-violating request:
- same-worker prefill overlap
- hot worker queue
- high KV occupancy
- cache miss / large uncached append
- transfer wait
- P queue wait
- D admission wait
- unknown
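A rule-based labeler is one simple way to assign the labels above. In the sketch below the label strings come from the list, but every field name and threshold is a placeholder that would need calibration against instrumented runs. A request may receive several labels, and falls back to `unknown` when no rule fires.

```python
from typing import Dict, Iterable, List, Tuple

# (label, hypothetical field name, hypothetical threshold).
# Labels follow the list above; fields and thresholds are placeholders.
RULES: List[Tuple[str, str, float]] = [
    ("same-worker prefill overlap", "prefill_overlap_ms", 50.0),
    ("hot worker queue", "worker_queue_delay_ms", 100.0),
    ("high KV occupancy", "kv_occupancy", 0.9),
    ("cache miss / large uncached append", "uncached_tokens", 8192.0),
    ("transfer wait", "transfer_wait_ms", 50.0),
    ("P queue wait", "p_queue_wait_ms", 100.0),
    ("D admission wait", "d_admission_wait_ms", 100.0),
]


def label_request(record: Dict[str, float]) -> List[str]:
    """Return every label whose rule fires, or ["unknown"] if none do."""
    labels = [
        label
        for label, field_name, threshold in RULES
        if record.get(field_name, 0.0) >= threshold
    ]
    return labels or ["unknown"]


def failure_breakdown(records: Iterable[Dict[str, float]]) -> Dict[str, int]:
    """Count label occurrences across slow or SLO-violating requests."""
    counts: Dict[str, int] = {}
    for record in records:
        for label in label_request(record):
            counts[label] = counts.get(label, 0) + 1
    return counts
```

Allowing multiple labels per request keeps mixed causes visible; a large `unknown` share is itself a signal that instrumentation is missing.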
Required outputs:
- `slow_request_attribution.jsonl`
- `failure_breakdown.json`
- `case_studies.md`
- `worker_failure_windows.json`
- violation cause stacked bar
- slow request waterfall
- worker timeline near failure
Pass condition:
The analysis must explain whether PD-colo is limited by interference,
hot-spot, KV pressure, or a mixture, and whether Unified/PUSH underperforms
because of trigger quality, transfer cost, target admission, or load regime.
## Batch 6 Protocol: Audit Package
`summarize_runs.py` implements this package for existing runs; fresh
Batch 2-5 outputs will extend it later.
Required files:
- `characterization_claim_matrix.md`
- `all_figures_index.md`
- `reviewer_risk_register.md`
- `reproduction_commands.sh`
- `main_claim_allowed_runs.md`
The current package intentionally marks Batch 2/4/5 claims as not yet
supported until fresh instrumented experiments exist.