Gahow Wang 0f64fb3261 Add agentic workload characterization audit scaffold

2026-05-25 15:01:18 +08:00

Characterization Protocols For Remaining Batches

Status: implementation protocol and audit checklist

Date: 2026-05-25

This file completes the analysis/characterization scaffold for the TODO list. It separates what is already implemented from what requires fresh GPU runs or new engine/proxy instrumentation.

Implemented Now

Batch 0/1 Analyzer

Use:

python3 analysis/characterization/analyze.py \
  --trace traces/w600_r0.0015_st30.jsonl \
  --kv-bytes-per-token 98304 \
  --task-name w600_local_full_trace \
  --overwrite
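
As a rough sanity check on the `--kv-bytes-per-token 98304` value, the sketch below converts a token count into a KV footprint. It assumes the flag is a plain per-token multiplier; how analyze.py actually applies it is not stated here, and `kv_footprint_bytes` is a hypothetical helper, not part of the analyzer.

```python
KV_BYTES_PER_TOKEN = 98304  # value passed on the command line above (96 KiB per token)


def kv_footprint_bytes(num_tokens, bytes_per_token=KV_BYTES_PER_TOKEN):
    """Linear KV footprint estimate: tokens times bytes per token (assumed model)."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return num_tokens * bytes_per_token


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # A 32768-token context at 98304 bytes per token is exactly 3 GiB.
    print(to_gib(kv_footprint_bytes(32768)))
```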

The analyzer writes:

manifest.json
summary.json
summary.md
audit.md
session_concurrency.json
session_arrival_stats.json
turn_interval_stats.json
trace_profile.json
workload_summary.json
kv_footprint_summary.json
reuse_decomposition.json
session_skew.json
append_delta_stats.json

Limitations:

Actual online sequentiality requires dispatch and finish/error timestamps. Existing metrics.jsonl artifacts generally do not contain these fields.
Actual reuse decomposition requires cached_tokens/cache_hit, hash_ids, and session_id in the same joinable request record.
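
Before relying on an existing metrics.jsonl, it may help to count how many records carry the fields these limitations name. The sketch below is one way to do that. The reuse field names come from the text above; the timestamp keys (`dispatch_ts`, `finish_ts`, `error_ts`) are hypothetical placeholders, since the source does not name them.

```python
import json

# Field names quoted in the limitations above.
REUSE_ALL = ("hash_ids", "session_id")
REUSE_ANY = ("cached_tokens", "cache_hit")
# Hypothetical key names: adjust to whatever the engine/proxy actually logs.
DISPATCH_KEY = "dispatch_ts"
END_KEYS = ("finish_ts", "error_ts")


def has_value(record, key):
    """True when the key is present and not null."""
    return record.get(key) is not None


def supports_reuse(record):
    """Record has hash_ids, session_id, and at least one cache-hit field."""
    return all(has_value(record, k) for k in REUSE_ALL) and any(
        has_value(record, k) for k in REUSE_ANY
    )


def supports_sequentiality(record):
    """Record has a dispatch timestamp and a finish or error timestamp."""
    return has_value(record, DISPATCH_KEY) and any(
        has_value(record, k) for k in END_KEYS
    )


def audit_lines(lines):
    """Count usable records in an iterable of JSONL lines (blank lines skipped)."""
    counts = {"records": 0, "reuse_ok": 0, "sequentiality_ok": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts["records"] += 1
        counts["reuse_ok"] += int(supports_reuse(record))
        counts["sequentiality_ok"] += int(supports_sequentiality(record))
    return counts
```

Passing an open file handle for a metrics.jsonl artifact to `audit_lines` gives the three counts; a `reuse_ok` far below `records` suggests the reuse decomposition cannot be computed from that run.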

Existing-Run Audit

Use:

python3 analysis/characterization/summarize_runs.py

The script writes an audit package under:

analysis/characterization/current_results/

It summarizes already completed runs and explicitly marks which claims are supported, partially supported, or not yet supported.

Batch 2 Protocol: PD-Colo Prefill/Decode Interference

Purpose:

Test whether same-worker prefill overlap increases decode TPOT or queue delay.

Required new instrumentation:

per-request dispatch timestamp
per-request finish/error timestamp
per decode step timestamp
decode step worker id
prefill chunk start/end timestamp
prefill worker id
request/session id associated with each prefill chunk

Required arms:

decode-only steady load
decode + same-worker heavy prefill injection
decode + different-worker heavy prefill injection
trace replay with overlap labels
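
The overlap labels used by the arms above could be derived from the new instrumentation roughly as follows. This is a minimal sketch under assumptions: the record types `PrefillChunk` and `DecodeStep` and their field names are hypothetical, and a decode step is treated as overlapping when its timestamp falls inside a prefill chunk's start/end interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefillChunk:
    worker_id: int
    start: float  # chunk start timestamp, seconds
    end: float    # chunk end timestamp, seconds


@dataclass(frozen=True)
class DecodeStep:
    worker_id: int
    timestamp: float  # decode step timestamp, seconds


def label_step(step, chunks):
    """Return 'same_worker', 'different_worker', or 'no_overlap' for one decode step.

    Same-worker overlap takes priority when both kinds are active at once.
    """
    label = "no_overlap"
    for chunk in chunks:
        if chunk.start <= step.timestamp <= chunk.end:
            if chunk.worker_id == step.worker_id:
                return "same_worker"
            label = "different_worker"
    return label


if __name__ == "__main__":
    chunks = [PrefillChunk(worker_id=0, start=1.0, end=2.0)]
    for step in (DecodeStep(0, 1.5), DecodeStep(1, 1.5), DecodeStep(0, 3.0)):
        print(step, label_step(step, chunks))
```

The linear scan is fine for a microbenchmark; a long trace replay would likely want the chunks sorted or indexed by interval.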

Required sweep:

uncached_prefill_tokens in {2k, 8k, 16k, 32k, 64k}
chunked_prefill_size in available engine values

Required outputs:

interference_microbench_summary.json
decode_step_timeseries.csv
prefill_overlap_events.jsonl
interference_index.json
TPOT timeline figure with prefill overlays
same-worker vs different-worker TPOT boxplot

Pass condition:

TPOT_p90(overlap_same_worker) / TPOT_p90(no_overlap) > 1

and the effect must be materially weaker in the different-worker control.
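
The pass condition can be evaluated from per-step TPOT samples grouped by overlap label. The sketch below uses a nearest-rank percentile, which is an assumption; the protocol does not say which percentile definition to use, and it does not quantify "materially weaker", so the code only reports both ratios and leaves the margin to the reader.

```python
import math


def percentile_nearest_rank(values, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def interference_ratio(overlap_tpot, baseline_tpot, q=90):
    """TPOT_p90(overlap) / TPOT_p90(no_overlap) for the given sample lists."""
    return percentile_nearest_rank(overlap_tpot, q) / percentile_nearest_rank(
        baseline_tpot, q
    )


def interference_report(same_worker, different_worker, no_overlap, q=90):
    """Both ratios plus the two boolean checks implied by the pass condition."""
    same = interference_ratio(same_worker, no_overlap, q)
    diff = interference_ratio(different_worker, no_overlap, q)
    return {
        "same_worker_ratio": same,
        "different_worker_ratio": diff,
        "same_worker_gt_1": same > 1,
        "control_weaker": diff < same,
    }
```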

Batch 3 Protocol: Session Hot-Spot Residual Imbalance

Purpose:

Test whether cache-aware/LMetric routing still leaves hot workers under session-heavy skew.

Required new instrumentation:

route decision per request
chosen worker
candidate worker scores
cache hit / estimated uncached tokens per candidate
per-worker request queue length/delay
per-worker decode queue length/delay
per-worker KV occupancy
per-worker APC/cache-hit snapshot

Required arms:

corrected LMetric/cache-aware
load-only routing
hard sticky routing
current Unified hybrid
session-mass capped/equalized replay

Required outputs:

worker_balance_summary.json
session_to_worker_map.json
session_mass_summary.json
routing_policy_comparison.json
hotspot_index.json
per-worker queue delay bar
APC vs queue delay scatter
top-session contribution bar
policy tradeoff plot: APC vs hot-spot index

Pass condition:

LMetric/cache-aware must show measurable residual worker skew, and that skew must correlate with session token mass or locality.

GPU utilization alone is not enough for this claim.
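
One possible way to quantify the pass condition is sketched below: a max-over-mean ratio of per-worker queue delay as a skew measure, and a Pearson correlation between per-worker session token mass and queue delay. Both choices are assumptions made for illustration; the protocol does not define how hotspot_index.json is computed.

```python
import math


def max_over_mean(values):
    """Skew measure: 1.0 means perfectly balanced, larger means a hotter worst worker."""
    if not values:
        raise ValueError("no workers")
    mean = sum(values) / len(values)
    if mean == 0:
        raise ValueError("mean is zero; ratio undefined")
    return max(values) / mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("zero variance; correlation undefined")
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    queue_delay_s = [1.0, 1.0, 1.0, 5.0]        # per-worker queue delay (made-up values)
    session_tokens = [10e3, 12e3, 11e3, 60e3]   # per-worker session token mass (made-up)
    print(max_over_mean(queue_delay_s), pearson(session_tokens, queue_delay_s))
```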

Batch 4 Protocol: Sustainable Request Rate

Purpose:

Measure:

SRR(SLO) = max arrival rate satisfying SLO in steady state

Required load generator behavior:

open-loop session arrivals, preferably Poisson
session-internal sequentiality
warmup window
steady-state measurement window
explicit attempted/completed/error counters

Provisional SLO:

TTFT_p90 <= T_ttft
E2E_p90  <= T_e2e
TPOT_p90 <= T_tpot
error_rate <= epsilon
queue length stable
KV occupancy stable
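
The provisional SLO can be expressed as a predicate over one run's summary. In this sketch the thresholds mirror T_ttft, T_e2e, T_tpot, and epsilon from the list above, while the summary keys (`ttft_p90`, `attempted`, `queue_stable`, and so on) are hypothetical. The stability flags are taken as inputs because the protocol does not yet define how stability is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    t_ttft: float   # seconds
    t_e2e: float    # seconds
    t_tpot: float   # seconds per output token
    epsilon: float  # maximum tolerated error rate


def slo_violations(summary, slo):
    """Return the names of violated conditions; an empty list means the SLO holds."""
    if summary["attempted"] <= 0:
        raise ValueError("no attempted requests in the measurement window")
    error_rate = summary["errors"] / summary["attempted"]
    checks = [
        ("ttft_p90", summary["ttft_p90"] <= slo.t_ttft),
        ("e2e_p90", summary["e2e_p90"] <= slo.t_e2e),
        ("tpot_p90", summary["tpot_p90"] <= slo.t_tpot),
        ("error_rate", error_rate <= slo.epsilon),
        ("queue_stable", bool(summary["queue_stable"])),
        ("kv_stable", bool(summary["kv_stable"])),
    ]
    return [name for name, ok in checks if not ok]
```

Returning the violated condition names rather than a single boolean keeps the reason for each failure available for later reporting.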

Required arms:

PD-colo corrected LMetric/cache-aware
static PD-disagg
current Unified hybrid
optional hard sticky
optional load-only

Required outputs:

srr_curve.json
lambda_runs/<lambda>/summary.json
slo_violation_reason.json
goodput_vs_arrival_rate.json
SRR bar chart
latency vs arrival rate curves
goodput vs arrival rate
queue/KV stability plot near failure point

Pass condition:

Each policy has a measured max sustainable lambda under the same SLO and same session-causal arrival process.
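
Given one pass/fail verdict per tested arrival rate, the measured SRR for a policy could be picked as sketched below. The code walks the rates in increasing order and stops at the first failure, so a pass above a failed rate is not counted. That conservative rule is an assumption, not something the protocol specifies.

```python
def sustainable_request_rate(verdicts):
    """Largest tested lambda that passes, with every lower tested lambda also passing.

    verdicts maps arrival rate (lambda) to True when the run met the SLO.
    Returns None when the lowest tested rate already fails or nothing was tested.
    """
    srr = None
    for rate in sorted(verdicts):
        if not verdicts[rate]:
            break
        srr = rate
    return srr


if __name__ == "__main__":
    # Made-up verdicts for two policies under the same SLO.
    runs = {
        "policy_a": {1.0: True, 2.0: True, 3.0: False},
        "policy_b": {1.0: True, 2.0: False, 3.0: False},
    }
    for policy, verdicts in runs.items():
        print(policy, sustainable_request_rate(verdicts))
```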

Batch 5 Protocol: Failure Attribution Near SRR Boundary

Purpose:

Explain why each policy fails near SRR.

Required rates:

lambda = 0.9 * SRR
lambda = 1.0 * SRR
lambda = 1.1 * SRR

Labels for each slow/SLO-violating request:

same-worker prefill overlap
hot worker queue
high KV occupancy
cache miss / large uncached append
transfer wait
P queue wait
D admission wait
unknown
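
Once each slow request carries one or more of the labels above, a per-label tally is straightforward. The sketch below assumes each record stores its labels under a hypothetical `labels` key and treats a missing or empty list as `unknown`. How labels are assigned in the first place is out of scope here.

```python
from collections import Counter

# Label vocabulary copied from the list above.
LABELS = (
    "same-worker prefill overlap",
    "hot worker queue",
    "high KV occupancy",
    "cache miss / large uncached append",
    "transfer wait",
    "P queue wait",
    "D admission wait",
    "unknown",
)


def failure_breakdown(records):
    """Count label occurrences across slow requests; a request may carry several labels."""
    counts = Counter({label: 0 for label in LABELS})
    for record in records:
        labels = record.get("labels") or ["unknown"]
        for label in labels:
            if label not in LABELS:
                raise ValueError(f"unexpected label: {label!r}")
            counts[label] += 1
    return dict(counts)
```

Because a request can have several labels, the counts may sum to more than the number of slow requests; a stacked bar built from them should say so.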

Required outputs:

slow_request_attribution.jsonl
failure_breakdown.json
case_studies.md
worker_failure_windows.json
violation cause stacked bar
slow request waterfall
worker timeline near failure

Pass condition:

The analysis must explain whether PD-colo is limited by interference, hot-spot, KV pressure, or a mixture, and whether Unified/PUSH underperforms because of trigger quality, transfer cost, target admission, or load regime.

Batch 6 Protocol: Audit Package

summarize_runs.py implements this for existing runs; fresh Batch 2-5 outputs will extend it later.

Required files:

characterization_claim_matrix.md
all_figures_index.md
reviewer_risk_register.md
reproduction_commands.sh
main_claim_allowed_runs.md

The current package intentionally marks Batch 2/4/5 claims as not yet supported until fresh instrumented experiments exist.
