# Characterization Protocols For Remaining Batches

Status: implementation protocol and audit checklist
Date: 2026-05-25

This file completes the `analysis/characterization` scaffold for the TODO
list. It separates what is already implemented from what requires fresh GPU
runs or new engine/proxy instrumentation.

## Implemented Now

### Batch 0/1 Analyzer

Use:

```bash
python3 analysis/characterization/analyze.py \
  --trace traces/w600_r0.0015_st30.jsonl \
  --kv-bytes-per-token 98304 \
  --task-name w600_local_full_trace \
  --overwrite
```

The analyzer writes:

- `manifest.json`
- `summary.json`
- `summary.md`
- `audit.md`
- `session_concurrency.json`
- `session_arrival_stats.json`
- `turn_interval_stats.json`
- `trace_profile.json`
- `workload_summary.json`
- `kv_footprint_summary.json`
- `reuse_decomposition.json`
- `session_skew.json`
- `append_delta_stats.json`

Limitations:

- Actual online sequentiality requires dispatch and finish/error timestamps.
Existing `metrics.jsonl` artifacts generally do not contain these fields.
- Actual reuse decomposition requires `cached_tokens`/`cache_hit`, `hash_ids`,
and `session_id` in the same joinable request record.
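
One way to check whether an existing `metrics.jsonl` artifact can support these analyses is to test each record for the required fields. The sketch below is illustrative only: it assumes one JSON object per line, and the timestamp field names `dispatch_ts`, `finish_ts`, and `error_ts` are hypothetical, since the actual schema is not specified here.

```python
import json

# Hypothetical timestamp field names; adjust to the real metrics schema.
DISPATCH_FIELD = "dispatch_ts"
END_FIELDS = ("finish_ts", "error_ts")          # either one is enough
CACHE_FIELDS = ("cached_tokens", "cache_hit")   # either one is enough
REUSE_FIELDS = ("hash_ids", "session_id")       # both are required


def present(record, field):
    """A field counts as present if it exists and is not null."""
    return record.get(field) is not None


def audit_record(record):
    """Report which analyses a single request record can support."""
    has_seq = present(record, DISPATCH_FIELD) and any(
        present(record, f) for f in END_FIELDS
    )
    has_reuse = any(present(record, f) for f in CACHE_FIELDS) and all(
        present(record, f) for f in REUSE_FIELDS
    )
    return {"sequentiality": has_seq, "reuse_decomposition": has_reuse}


def audit_lines(lines):
    """Count supporting records over an iterable of JSONL lines."""
    counts = {"records": 0, "sequentiality": 0, "reuse_decomposition": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        result = audit_record(json.loads(line))
        counts["records"] += 1
        counts["sequentiality"] += int(result["sequentiality"])
        counts["reuse_decomposition"] += int(result["reuse_decomposition"])
    return counts
```

Passing an open file handle to `audit_lines` gives a quick count of how many records are joinable for each analysis.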

### Existing-Run Audit

Use:

```bash
python3 analysis/characterization/summarize_runs.py
```

The script writes an audit package under:

```text
analysis/characterization/current_results/
```

It summarizes already completed runs and explicitly marks which claims are
supported, partially supported, or not yet supported.

## Batch 2 Protocol: PD-Colo Prefill/Decode Interference

Purpose:

Prove whether same-worker prefill overlap increases decode TPOT/queue delay.

Required new instrumentation:

- per-request dispatch timestamp
- per-request finish/error timestamp
- per decode step timestamp
- decode step worker id
- prefill chunk start/end timestamp
- prefill worker id
- request/session id associated with each prefill chunk
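
With these fields, each decode step can be labeled by whether a prefill chunk was active at that moment, and on which worker. The following is a minimal sketch, not the analyzer's implementation: the tuple layouts are hypothetical, and a decode step is treated as a single point in time rather than an interval.

```python
def label_decode_steps(decode_steps, prefill_chunks):
    """Label each decode step by the prefill activity that overlaps it.

    decode_steps:   list of (timestamp, worker_id)           # assumed layout
    prefill_chunks: list of (start, end, worker_id)          # assumed layout
    Returns one label per decode step: "same_worker",
    "different_worker", or "none". A same-worker overlap takes precedence.
    """
    labels = []
    for ts, worker in decode_steps:
        label = "none"
        for start, end, prefill_worker in prefill_chunks:
            if start <= ts <= end:
                if prefill_worker == worker:
                    label = "same_worker"
                    break  # strongest label found, stop scanning
                label = "different_worker"
        labels.append(label)
    return labels


steps = [(1.0, "w0"), (2.5, "w0"), (2.5, "w1"), (5.0, "w1")]
chunks = [(2.0, 3.0, "w0")]
labels = label_decode_steps(steps, chunks)
```

The scan is quadratic in the worst case, which is acceptable for a microbenchmark; a full trace replay would likely want the chunks sorted and indexed per worker.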

Required arms:

1. decode-only steady load
2. decode + same-worker heavy prefill injection
3. decode + different-worker heavy prefill injection
4. trace replay with overlap labels

Required sweep:

```text
uncached_prefill_tokens in {2k, 8k, 16k, 32k, 64k}
chunked_prefill_size in available engine values
```

Required outputs:

- `interference_microbench_summary.json`
- `decode_step_timeseries.csv`
- `prefill_overlap_events.jsonl`
- `interference_index.json`
- TPOT timeline figure with prefill overlays
- same-worker vs different-worker TPOT boxplot

Pass condition:

```text
TPOT_p90(overlap_same_worker) / TPOT_p90(no_overlap) > 1
```

and the effect must be materially weaker in the different-worker control.
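
The pass condition can be evaluated along these lines. This is a sketch under stated assumptions: it uses a nearest-rank percentile, and `min_gap` is a hypothetical margin standing in for "materially weaker", which this protocol does not quantify.

```python
def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of q * n / 100, clamped to at least rank 1.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def interference_ratios(tpot_none, tpot_same, tpot_diff):
    """TPOT p90 of each overlap class relative to the no-overlap baseline."""
    base = percentile(tpot_none, 90)
    return {
        "same_worker": percentile(tpot_same, 90) / base,
        "different_worker": percentile(tpot_diff, 90) / base,
    }


def passes(ratios, min_gap=0.0):
    """Same-worker ratio must exceed 1 and beat the control by min_gap.

    min_gap is an assumed threshold, not a value from the protocol.
    """
    same = ratios["same_worker"]
    diff = ratios["different_worker"]
    return same > 1 and (same - diff) > min_gap
```

Reporting both ratios, rather than only the boolean, keeps the different-worker control visible in `interference_index.json`.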

## Batch 3 Protocol: Session Hot-Spot Residual Imbalance

Purpose:

Prove whether cache-aware/LMetric still leaves hot workers under
session-heavy skew.

Required new instrumentation:

- route decision per request
- chosen worker
- candidate worker scores
- cache hit / estimated uncached tokens per candidate
- per-worker request queue length/delay
- per-worker decode queue length/delay
- per-worker KV occupancy
- per-worker APC/cache-hit snapshot

Required arms:

1. corrected LMetric/cache-aware
2. load-only routing
3. hard sticky routing
4. current Unified hybrid
5. session-mass capped/equalized replay

Required outputs:

- `worker_balance_summary.json`
- `session_to_worker_map.json`
- `session_mass_summary.json`
- `routing_policy_comparison.json`
- `hotspot_index.json`
- per-worker queue delay bar
- APC vs queue delay scatter
- top-session contribution bar
- policy tradeoff plot: APC vs hot-spot index

Pass condition:

LMetric/cache-aware must show measurable residual worker skew, and that skew
must correlate with session token mass or locality.

GPU utilization alone is not enough for this claim.
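
A rough illustration of how the skew and its correlation with session mass could be quantified follows. The hot-spot index used here (max over mean of per-worker queue delay) is an assumption, since this protocol does not define the index, and the input dictionaries are hypothetical shapes for `session_to_worker_map.json` and `session_mass_summary.json`.

```python
import math


def hotspot_index(per_worker_delay):
    """Assumed definition: max per-worker queue delay over the mean."""
    values = list(per_worker_delay.values())
    mean = sum(values) / len(values)
    return max(values) / mean if mean > 0 else 1.0


def worker_token_mass(session_to_worker, session_tokens):
    """Sum session token mass onto the worker each session was routed to."""
    mass = {}
    for session, worker in session_to_worker.items():
        mass[worker] = mass.get(worker, 0) + session_tokens.get(session, 0)
    return mass


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def skew_vs_mass(per_worker_delay, mass):
    """Correlate queue delay with token mass over a shared worker order."""
    workers = sorted(per_worker_delay)
    delays = [per_worker_delay[w] for w in workers]
    masses = [mass.get(w, 0) for w in workers]
    return pearson(delays, masses)
```

With only a handful of workers the correlation is a weak statistic, so it is probably best read alongside the per-worker bar and scatter plots rather than on its own.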

## Batch 4 Protocol: Sustainable Request Rate

Purpose:

Measure:

```text
SRR(SLO) = max arrival rate satisfying SLO in steady state
```

Required load generator behavior:

- open-loop session arrivals, preferably Poisson
- session-internal sequentiality
- warmup window
- steady-state measurement window
- explicit attempted/completed/error counters
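
The load generator behavior above might be sketched as follows. All names are hypothetical and this is not the project's load generator: it only shows Poisson session arrivals (exponential inter-arrival gaps), the rule that a turn is dispatched only after the previous turn of the same session has finished, and counters restricted to the steady-state window.

```python
import random
from dataclasses import dataclass


def poisson_session_arrivals(rate_per_s, duration_s, seed=0):
    """Open-loop session start times with exponential inter-arrival gaps."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


def turn_dispatch_time(session_arrival, prev_turn_finish, think_time_s=0.0):
    """First turn goes out at session arrival; later turns wait for the
    previous turn of the same session to finish (session-internal order)."""
    if prev_turn_finish is None:
        return session_arrival
    return prev_turn_finish + think_time_s


@dataclass
class Counters:
    attempted: int = 0
    completed: int = 0
    errors: int = 0

    def record(self, dispatch_ts, ok, warmup_end_s, measure_end_s):
        """Count a request only if dispatched inside the measurement window."""
        if not (warmup_end_s <= dispatch_ts < measure_end_s):
            return
        self.attempted += 1
        if ok:
            self.completed += 1
        else:
            self.errors += 1
```

Session arrivals stay open-loop because they never wait on the system, while turns inside a session remain closed-loop by construction.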

Provisional SLO:

```text
TTFT_p90 <= T_ttft
E2E_p90 <= T_e2e
TPOT_p90 <= T_tpot
error_rate <= epsilon
queue length stable
KV occupancy stable
```

Required arms:

1. PD-colo corrected LMetric/cache-aware
2. static PD-disagg
3. current Unified hybrid
4. optional hard sticky
5. optional load-only

Required outputs:

- `srr_curve.json`
- `lambda_runs/<lambda>/summary.json`
- `slo_violation_reason.json`
- `goodput_vs_arrival_rate.json`
- SRR bar chart
- latency vs arrival rate curves
- goodput vs arrival rate
- queue/KV stability plot near failure point

Pass condition:

Each policy has a measured max sustainable lambda under the same SLO and
same session-causal arrival process.
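
Given one summary per swept arrival rate, the SRR for a policy could be derived roughly as below. The summary and SLO key names are hypothetical, the thresholds `T_ttft`, `T_e2e`, `T_tpot`, and `epsilon` are left as inputs because they are still provisional, and the sketch assumes the stability checks have already been reduced to booleans.

```python
def meets_slo(summary, slo):
    """Check one lambda run against the provisional SLO."""
    return (
        summary["ttft_p90"] <= slo["ttft"]
        and summary["e2e_p90"] <= slo["e2e"]
        and summary["tpot_p90"] <= slo["tpot"]
        and summary["error_rate"] <= slo["epsilon"]
        and summary["queue_stable"]
        and summary["kv_stable"]
    )


def sustainable_request_rate(runs, slo):
    """Largest swept lambda that passes before the first violation.

    runs maps arrival rate (lambda) to that run's summary dict.
    Returns None if even the lowest rate violates the SLO.
    """
    srr = None
    for lam in sorted(runs):
        if not meets_slo(runs[lam], slo):
            break  # stop at the first failing rate
        srr = lam
    return srr
```

Stopping at the first violation, rather than taking the maximum passing rate overall, avoids reporting a rate that only passed by chance above a failing one. The resolution of the result is limited by the spacing of the sweep.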

## Batch 5 Protocol: Failure Attribution Near SRR Boundary

Purpose:

Explain why each policy fails near SRR.

Required rates:

```text
lambda = 0.9 * SRR
lambda = 1.0 * SRR
lambda = 1.1 * SRR
```

Labels for each slow/SLO-violating request:

- same-worker prefill overlap
- hot worker queue
- high KV occupancy
- cache miss / large uncached append
- transfer wait
- P queue wait
- D admission wait
- unknown
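
As a rough sketch, the labels above can be assigned with simple threshold rules and then tallied for `failure_breakdown.json`. The request field names and every threshold value below are placeholders chosen for illustration, not values from this protocol, and the sketch assumes a request may carry more than one label.

```python
from collections import Counter

# Hypothetical per-request fields mapped to the labels listed above.
FIELD_TO_LABEL = {
    "worker_queue_delay_s": "hot worker queue",
    "kv_occupancy": "high KV occupancy",
    "uncached_tokens": "cache miss / large uncached append",
    "transfer_wait_s": "transfer wait",
    "p_queue_wait_s": "P queue wait",
    "d_admission_wait_s": "D admission wait",
}

# Placeholder thresholds; real values need calibration per deployment.
THRESHOLDS = {
    "worker_queue_delay_s": 1.0,
    "kv_occupancy": 0.9,
    "uncached_tokens": 8192,
    "transfer_wait_s": 0.5,
    "p_queue_wait_s": 0.5,
    "d_admission_wait_s": 0.5,
}


def attribute(request):
    """Return every label whose rule fires, or ["unknown"] if none do."""
    labels = []
    if request.get("same_worker_prefill_overlap"):
        labels.append("same-worker prefill overlap")
    for field, label in FIELD_TO_LABEL.items():
        if request.get(field, 0) > THRESHOLDS[field]:
            labels.append(label)
    return labels or ["unknown"]


def failure_breakdown(slow_requests):
    """Tally labels across all slow or SLO-violating requests."""
    counts = Counter()
    for request in slow_requests:
        counts.update(attribute(request))
    return dict(counts)
```

Because labels are not mutually exclusive here, the tallies can sum to more than the number of slow requests; a stacked bar built from them should say so.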

Required outputs:

- `slow_request_attribution.jsonl`
- `failure_breakdown.json`
- `case_studies.md`
- `worker_failure_windows.json`
- violation cause stacked bar
- slow request waterfall
- worker timeline near failure

Pass condition:

The analysis must explain whether PD-colo is limited by interference,
hot-spot, KV pressure, or a mixture, and whether Unified/PUSH underperforms
because of trigger quality, transfer cost, target admission, or load regime.

## Batch 6 Protocol: Audit Package

Implemented by `summarize_runs.py` for existing runs and extended by fresh
Batch 2-5 outputs later.

Required files:

- `characterization_claim_matrix.md`
- `all_figures_index.md`
- `reviewer_risk_register.md`
- `reproduction_commands.sh`
- `main_claim_allowed_runs.md`

Current package intentionally marks Batch 2/4/5 claims as not yet supported
until fresh instrumented experiments exist.