- Add Progress Snapshot table to the intern TODO so per-batch status (DONE / partial / blocked-on-instrumentation) is visible at a glance.
- New analysis/claude_characterization_work_plan.md scopes the Phase A instrumentation tasks (A1-A5) plus Window 1 (B1'+B2+B3) and Window 2 (B4+B5) on dash0, with locked decisions for model, topology, trace, SLO style, and GPU phasing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Claude Characterization Work Plan
Status: planning, awaiting dash0 idle
Date: 2026-05-25
Owner: Claude (not interns)
Source of requirements: analysis/characterization_todo_for_interns.md
Scope
This plan covers the four hard gates and the B2–B5 GPU experiments that the
intern TODO marks as NOT DONE / protocol DONE. The B0 analyzer, the
B1 trace-shape statistics, and the B6 audit scaffold are already done; this
plan does not re-do them, only refreshes their inputs.
The work is split into:
- Phase A (CPU-only) — instrumentation + analyzer extensions. Can run on the local dev box; does not need dash0. Must finish before any GPU run.
- Phase B (dash0 GPU) — controlled microbench + routing sweep + SRR sweep + failure attribution.
- Phase C (CPU-only) — final audit package refresh.
Phase A: Instrumentation + Analyzer (CPU-only, before dash0)
A1. Replayer instrumentation — close Gate 1 + Gate 2
File: replayer/metrics.py, replayer/replay.py
Add these fields to RequestMetrics:
t_dispatch_unix float # absolute wall-clock when POST starts
t_first_token_unix float # absolute wall-clock at first stream chunk
t_finish_unix float # absolute wall-clock at stream done or error
proxy_request_id str # value sent in X-Request-Id (matches breakdown)
endpoint_url str # which proxy/instance the request hit
trace_hash_ids list[int] # carried from trace for reuse joins
Change _dispatch_request to:
- send a deterministic X-Request-Id: <session_id>:<turn_id> header (so proxy breakdown can be joined to metrics by exact key);
- record time.time() (unix) at dispatch, first token, finish; keep perf_counter for the latency arithmetic.
Acceptance: a 30-request smoke run produces metrics.jsonl where every
row has those fields; breakdown.json rows from the proxy have the same
request_id keys.
Effort: 1 small PR. Pure CPU.
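The A1 changes can be sketched as follows. This is a minimal illustration, not the replayer's actual code: the dataclass shows only the new fields, and make_request_id and start_request are hypothetical helper names.

```python
import time
from dataclasses import dataclass, field, asdict


@dataclass
class RequestMetrics:
    # Hypothetical subset: only the fields that A1 adds are shown.
    t_dispatch_unix: float = 0.0
    t_first_token_unix: float = 0.0
    t_finish_unix: float = 0.0
    proxy_request_id: str = ""
    endpoint_url: str = ""
    trace_hash_ids: list = field(default_factory=list)


def make_request_id(session_id, turn_id):
    """Deterministic join key, sent as the X-Request-Id header."""
    return f"{session_id}:{turn_id}"


def start_request(session_id, turn_id, endpoint_url, hash_ids):
    """Build the headers and the metrics row at dispatch time."""
    rid = make_request_id(session_id, turn_id)
    headers = {"X-Request-Id": rid}
    metrics = RequestMetrics(
        t_dispatch_unix=time.time(),  # wall clock, for cross-process joins
        proxy_request_id=rid,
        endpoint_url=endpoint_url,
        trace_hash_ids=list(hash_ids),
    )
    t0 = time.perf_counter()  # monotonic clock, for latency arithmetic
    return headers, metrics, t0
```

Keeping both clocks matters: time.time() values are comparable across the replayer, proxy, and engine logs, while perf_counter differences are immune to wall-clock adjustments during a run.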
A2. Proxy instrumentation — close Gate 1 + Gate 3 + Gate 4
File: scripts/cache_aware_proxy.py
Changes:
- Honor incoming X-Request-Id: if header present, use it instead of generating a new uuid. Falls back to uuid otherwise.
- Record on every breakdown row:
  - session_id (already on header, not currently stored)
  - input_length
  - estimated_new_tokens (already produced by router)
  - candidate_scores (list of {url, p_tokens_score, cache_score, bs, occupancy})
  - chosen_score
- At route decision time, snapshot per-worker state:
  - pending_prefill_tokens per worker
  - running_decode_requests per worker
  - kv_blocks_used / kv_blocks_total per worker
  - apc_hits / apc_queries cumulative per worker
  Write to a separate worker_state.jsonl (one line per route decision) with (t_decision_unix, request_id, per_worker_state).
- New endpoint GET /worker_state returns the latest snapshot per worker (for sanity / live debugging).
Acceptance: smoke run produces breakdown.json with new fields and a
non-empty worker_state.jsonl that joins to breakdown by request_id.
Effort: 1 medium PR. Pure CPU + light proxy work.
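A minimal sketch of the two core proxy changes, assuming headers arrive as a plain dict. resolve_request_id and worker_state_line are hypothetical names and do not reflect the current structure of scripts/cache_aware_proxy.py.

```python
import json
import time
import uuid


def resolve_request_id(headers):
    """Use the caller's X-Request-Id if present, else generate a uuid."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    incoming = lowered.get("x-request-id", "").strip()
    return incoming or uuid.uuid4().hex


def worker_state_line(request_id, per_worker_state, now=None):
    """Serialize one route decision as a single JSONL line."""
    record = {
        "t_decision_unix": time.time() if now is None else now,
        "request_id": request_id,
        "per_worker_state": per_worker_state,
    }
    return json.dumps(record, sort_keys=True)
```

Because the same request_id appears in both breakdown.json and worker_state.jsonl, the A5 analyzer can join them without any timestamp matching.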
A3. Engine-side step timestamps — close Gate 3 for B2
vLLM 0.18.1 already exposes:
- vllm:request_prefill_time_seconds (histogram, per-request)
- vllm:request_decode_time_seconds
- vllm:time_per_output_token_seconds
- step-level scheduler stats via engine.async_step logging
For B2 we need decode-step and prefill-chunk timestamps with worker id. Plan:
- Inspect whether the vLLM proxy can be polled at high rate (e.g. 100 Hz) for per-engine scheduler counters (num_running, num_waiting, gpu_cache_usage, prefix_cache_queries, prefix_cache_hits). If yes, sample into engine_state.jsonl during runs.
- If finer step-level data is needed, patch one vLLM file (vllm/engine/async_llm_engine.py step loop or vllm/v1/core/sched/scheduler.py) to emit a JSONL line per scheduler step with (t_unix, worker_id, num_prefill_tokens_scheduled, num_decode_steps, running_request_ids). Patch goes under patches/ so it can be applied/reverted cleanly.
- Worker id mapping: when running TP1xDP8 or similar, each engine listens on a distinct port; worker_id == endpoint_url.
Acceptance: a single 10-minute run produces engine_state.jsonl from
which a decode step at time T on worker W can be classified as
"overlapping a same-worker prefill chunk" or not.
Effort: 1 medium investigation (decide poll vs patch) + 1 medium PR.
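The acceptance check can be sketched as a lookup over the per-step records. This assumes the patched scheduler emits one record per step with the field names listed above, and that a step record stays in effect until the next record on the same worker; both are assumptions until the patch exists.

```python
import bisect
from collections import defaultdict


def build_step_index(step_records):
    """Group scheduler-step records by worker, sorted by timestamp."""
    by_worker = defaultdict(list)
    for rec in step_records:
        by_worker[rec["worker_id"]].append(rec)
    index = {}
    for worker_id, recs in by_worker.items():
        recs.sort(key=lambda r: r["t_unix"])
        index[worker_id] = ([r["t_unix"] for r in recs], recs)
    return index


def overlaps_prefill(index, worker_id, t_unix):
    """True if the step covering t_unix on worker_id also scheduled prefill.

    Returns None when the worker is unknown or t_unix precedes its first step.
    """
    if worker_id not in index:
        return None
    times, recs = index[worker_id]
    pos = bisect.bisect_right(times, t_unix) - 1  # last step at or before t_unix
    if pos < 0:
        return None
    return recs[pos]["num_prefill_tokens_scheduled"] > 0
```

Returning None rather than False for missing coverage keeps "no data" separate from "no overlap", so gaps in engine_state.jsonl do not silently deflate the interference index.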
A4. Open-loop session-causal loadgen for B4
File: replayer/replay.py (new mode) or new replayer/srr_loadgen.py
Current replayer dispatches by trace timestamps. SRR sweep needs:
- pool of session templates (each = ordered list of turns from the trace);
- Poisson arrivals of new sessions at rate lambda;
- within a session: strict sequentiality (turn N+1 waits for turn N finish);
- per-run warmup window (e.g. 60s) + steady-state window (e.g. 300s);
- attempted / completed / error counters per window.
Add a new mode --mode srr --arrival-rate <lambda> --warmup-s 60 --steady-s 300 --session-pool-size N. The trace
file becomes the pool; sessions are drawn with replacement.
Acceptance: at lambda = 0.5 sess/s, the run shows exponential inter-
arrival times and per-session sequentiality in metrics.jsonl. A
window_summary.json lists warmup vs steady-state attempted/completed.
Effort: 1 medium PR.
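The arrival process and the per-session ordering can be sketched as below. This is an illustration under assumptions, not the planned loadgen: poisson_arrivals, window_of, and run_session are hypothetical names, and send_turn stands in for whatever coroutine actually issues a request.

```python
import asyncio
import random


def poisson_arrivals(rate_per_s, duration_s, pool_size, seed=0):
    """Return [(arrival_time_s, template_index)] for a Poisson process."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential inter-arrival gap
        if t >= duration_s:
            return arrivals
        # Sessions are drawn from the pool with replacement.
        arrivals.append((t, rng.randrange(pool_size)))


def window_of(t, warmup_s, steady_s):
    """Classify an arrival time into the warmup or steady-state window."""
    if t < warmup_s:
        return "warmup"
    if t < warmup_s + steady_s:
        return "steady"
    return "after"


async def run_session(turns, send_turn):
    """Send turns strictly in order: turn N+1 starts after turn N finishes."""
    results = []
    for turn in turns:
        results.append(await send_turn(turn))
    return results
```

Sessions run concurrently with each other (one task per arrival), while awaiting inside run_session is what enforces sequentiality within a session.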
A5. Analyzer extensions
File: analysis/characterization/analyze.py (extend, do not rewrite)
Add:
- Joined-record builder. Given --metrics metrics.jsonl --breakdown breakdown.json --worker-state worker_state.jsonl --engine-state engine_state.jsonl, produce joined.jsonl keyed on request_id with all fields merged.
- Reuse decomposition (real). Using joined records that carry session_id + hash_ids + cached_tokens, compute intra_session / cross_session / shared_prefix / unclassified cached-token mass. Replaces the current status: unavailable placeholder when fields are present.
- Interference index. Per decode step, label "overlap same-worker prefill" using engine_state.jsonl. Compute TPOT_p90(overlap) / TPOT_p90(no_overlap).
- Hotspot index. Per worker queue delay p90, output max_worker_q_p90 / median_worker_q_p90.
- Failure label. For each slow / SLO-violating request, assign one of: same_worker_prefill_overlap, hot_worker_queue, high_kv_occupancy, cache_miss_large_append, transfer_wait, p_queue_wait, d_admission_wait, unknown.
- Window summary. For SRR runs, compute attempted/completed/error/goodput plus latency percentiles on the steady-state window only.
Acceptance: re-run analyzer on smoke output and confirm reuse_decomposition
no longer says unavailable; interference_index.json produced when
engine state present; failure_breakdown.json populated when
labels assigned.
Effort: 1 large PR. CPU-only.
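Three of the extensions above (the join, the hotspot index, and the interference index) reduce to a few lines each. The sketch below assumes a nearest-rank p90 and a per-request queue_delay_s field; both are illustrative choices, and the function names are hypothetical rather than taken from analyze.py.

```python
import statistics
from collections import defaultdict


def join_by_request_id(metrics_rows, breakdown_rows):
    """Merge replayer metrics with proxy breakdown rows on the request id."""
    breakdown = {row["request_id"]: row for row in breakdown_rows}
    joined = []
    for m in metrics_rows:
        merged = dict(breakdown.get(m["proxy_request_id"], {}))
        merged.update(m)  # replayer fields win on key collisions
        merged["request_id"] = m["proxy_request_id"]
        joined.append(merged)
    return joined


def p90(values):
    """Nearest-rank 90th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, (9 * len(ordered) + 9) // 10)  # integer ceil(0.9 * n)
    return ordered[rank - 1]


def hotspot_index(rows, worker_key="endpoint_url", delay_key="queue_delay_s"):
    """max_worker_q_p90 / median_worker_q_p90, or None if the median is 0."""
    per_worker = defaultdict(list)
    for row in rows:
        per_worker[row[worker_key]].append(row[delay_key])
    p90s = [p90(v) for v in per_worker.values()]
    median = statistics.median(p90s)
    return max(p90s) / median if median > 0 else None


def interference_index(tpot_overlap, tpot_no_overlap):
    """TPOT_p90(overlap) / TPOT_p90(no_overlap)."""
    return p90(tpot_overlap) / p90(tpot_no_overlap)
```

A hotspot index near 1 means queue delay is spread evenly across workers; a value well above 1 points at one worker carrying a disproportionate queue.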
Phase B: GPU experiments (needs dash0)
B1' Workload characterization closure
Inputs: instrumented replayer + small smoke trace (≤500 req).
Steps:
- Pick kv_bytes_per_token for the production model. For Qwen3-Coder TP1 the value depends on layer/head config; compute from vllm.config once at run start and record in manifest.
- Re-run analyzer on full GLM-5.1 trace with --kv-bytes-per-token. Output: KV footprint p50/p90/p99 in kv_footprint_summary.json.
- Run a 1k-request session-causal smoke replay with instrumented proxy. Use the joined records to populate real reuse decomposition for the small sample. (Full-trace replay is too expensive; sample is acceptable for the decomposition claim.)
Wall-clock: ~30 min GPU. Produces 2 figures: KV footprint CDF, reuse decomposition stacked bar.
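For a standard attention stack where every layer caches one key and one value vector per KV head, kv_bytes_per_token follows directly from the layer/head config. The sketch below assumes that layout at TP1; the numeric values in the usage example are illustrative placeholders, not a verified Qwen3-Coder-30B-A3B config, and should be replaced by whatever the model config reports at run start.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token on one TP1 worker.

    The leading factor of 2 covers the key and the value tensor.
    dtype_bytes=2 corresponds to fp16/bf16 cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_footprint_bytes(input_tokens, output_tokens, bytes_per_token):
    """Peak KV footprint of one request, ignoring prefix-cache sharing."""
    return (input_tokens + output_tokens) * bytes_per_token


# Placeholder config values, for illustration only.
example = kv_bytes_per_token(num_layers=48, num_kv_heads=4, head_dim=128)
```

With these placeholder values the result is 98304 bytes (96 KiB) per token, so a 32k-token context would occupy about 3 GiB of KV cache before any prefix sharing.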
B2 PD-colo interference microbench
Setup: 1 combined instance on TP1. Two synthetic load generators:
- Decode-only steady load — short-prompt sessions at fixed per-second arrival, designed to saturate decode without prefill contention.
- Prefill injector — single-shot long-prompt requests at controlled cadence; same worker (target the decode worker) vs different worker (route to a paired idle instance).
Sweep uncached_prefill_tokens ∈ {2k, 8k, 16k, 32k, 64k} × {same, different} worker.
Outputs: interference_microbench_summary.json,
decode_step_timeseries.csv (from engine_state.jsonl),
prefill_overlap_events.jsonl, interference_index.json,
TPOT-with-overlay figure, interference-index-vs-prefill-size figure.
Wall-clock: ~2–3 h GPU including warm-up between sweeps.
B3 Routing sweep on session-causal trace
Setup: 8 combined instances (TP1 × DP8) with the cache-aware proxy.
Run the same session-causal trace (e.g. r=0.0015 st=30 850-req config
from auto-mem feedback-bench-config.md) under five policies:
- corrected LMetric / cache-aware (--policy lmetric)
- load-only (new policy --policy load_only; picks min running)
- hard sticky (new policy --policy sticky; once a session lands on a worker, never moves)
- current Unified hybrid (--policy unified)
- session-mass capped replay (filter the trace so no session exceeds cap_turns or cap_input_tokens; rerun policy 1, i.e. lmetric)
Per run, collect: replayer metrics, proxy breakdown, worker_state, engine_state. Compute per-worker queue delay, GPU util, KV occupancy, APC, session-to-worker map.
Outputs: worker_balance_summary.json, session_to_worker_map.json,
session_mass_summary.json, routing_policy_comparison.json,
hotspot_index.json, capped_session_replay_summary.json,
8 figures from the TODO list (§5.figures).
Wall-clock: 5 runs × ~13 min ≈ 65 min of replay; budget ~1.5 h GPU.
Implementation note: load_only and sticky are small additions to
scripts/cache_aware_proxy.py — they reuse existing affinity / score
machinery.
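The two new policies can be sketched independently of the proxy's score machinery. This assumes the router sees a mapping from worker URL to running-request count, and that sticky uses load-only for a session's first placement; the plan does not specify the first-placement rule, so that part is an assumption.

```python
def pick_load_only(workers):
    """Pick the worker with the fewest running requests.

    workers maps url -> running request count; ties break on url so the
    choice is deterministic across runs.
    """
    return min(workers, key=lambda url: (workers[url], url))


class StickyRouter:
    """Hard sticky: once a session lands on a worker, it never moves."""

    def __init__(self):
        self._assignment = {}

    def pick(self, session_id, workers):
        if session_id not in self._assignment:
            # Assumed first-placement rule: least-loaded worker.
            self._assignment[session_id] = pick_load_only(workers)
        return self._assignment[session_id]
```

Hard sticky deliberately ignores later load changes, which is what makes it a useful upper bound on cache affinity and a lower bound on balance in the routing comparison.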
B4 Sustainable Request Rate sweep
Setup: same 8 instances. Use Phase-A --mode srr loadgen.
SLO (locked per-class):
TTFT_p90 <= 2.0 s
TPOT_p90 <= 0.15 s
error_rate <= 0.5%
queue length stable (no monotone growth over steady window)
KV occupancy stable
E2E_p90 <= T_class[c] for each output-length decile c
T_class[c] is derived from a low-load reference run as
E2E_p90_low_load(c) * 2 (factor configurable). The reference run
is done once and cached as analysis/characterization/srr/slo_classes.json.
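Deriving T_class from the low-load reference run might look like the sketch below. It assumes each reference row carries requested_output_tokens and an e2e_s latency, and it uses a nearest-rank p90; the field name e2e_s and the helper names are hypothetical.

```python
import bisect


def p90(values):
    """Nearest-rank 90th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, (9 * len(ordered) + 9) // 10)  # integer ceil(0.9 * n)
    return ordered[rank - 1]


def decile_edges(values):
    """Nine cut points splitting values into ten roughly equal-count classes."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (k * n) // 10)] for k in range(1, 10)]


def decile_of(value, edges):
    """Class index 0..9 for a requested output length."""
    return bisect.bisect_right(edges, value)


def slo_classes(reference_rows, factor=2.0):
    """Return (edges, {class: T_class}) with T_class = factor * E2E_p90_low_load."""
    edges = decile_edges([r["requested_output_tokens"] for r in reference_rows])
    per_class = {}
    for r in reference_rows:
        c = decile_of(r["requested_output_tokens"], edges)
        per_class.setdefault(c, []).append(r["e2e_s"])
    return edges, {c: factor * p90(v) for c, v in per_class.items()}
```

Storing the edges alongside the thresholds lets later SRR runs assign requests to the same classes as the reference run instead of recomputing deciles on a different load mix.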
For each policy, sweep lambda from low (clearly safe) to high (clearly
broken) using a doubling phase followed by bisection:
λ_low = 0.1 sess/s
λ_high = doubling until first SLO violation
binary-search λ_low .. λ_high for max sustainable λ
Policies covered: LMetric, static PD-disagg, Unified, hard sticky, load-only.
Outputs: srr_curve.json, lambda_runs/<lambda>/summary.json,
slo_violation_reason.json, goodput_vs_arrival_rate.json,
stability_summary.json, all 8 figures from §6.figures.
Wall-clock: this is the most expensive batch. With binary search, ~6 lambda points × 5 policies × ~8 min (warmup + steady) ≈ 4 h GPU.
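The doubling-then-bisection search can be sketched as below. It assumes SLO compliance is monotone in lambda (every rate below a passing rate also passes), and it treats passes_slo as a callable that runs one warmup + steady-state window and returns a bool. The tol and max_doublings parameters are illustrative, not locked decisions.

```python
def find_srr(passes_slo, lam_low=0.1, max_doublings=10, tol=0.05):
    """Return the largest arrival rate found to pass, or None if lam_low fails."""
    if not passes_slo(lam_low):
        return None
    lo, hi = lam_low, lam_low * 2
    for _ in range(max_doublings):
        if not passes_slo(hi):
            break  # first SLO violation brackets the boundary
        lo, hi = hi, hi * 2
    else:
        return lo  # no violation seen within the doubling budget
    # Bisect until the bracket is within tol of the passing rate.
    while (hi - lo) / lo > tol:
        mid = (lo + hi) / 2
        if passes_slo(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each call to passes_slo costs one full run, so tol directly trades GPU time for resolution on the SRR estimate.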
B5 Failure attribution near SRR boundary
For each policy: pick λ ∈ {0.9, 1.0, 1.1} × SRR, run with full
instrumentation, then run the analyzer's failure-label step.
Outputs: slow_request_attribution.jsonl, failure_breakdown.json,
case_studies.md, worker_failure_windows.json, 5 figures from §7.
Wall-clock: 3 lambdas × 5 policies × 8 min ≈ 2 h GPU.
Phase C: Audit package refresh (CPU)
Re-run summarize_runs.py and plot_current_results.py after each
GPU batch. Final pass after B5: refresh claim_matrix, risk_register,
allowed_runs, regenerate all figures, update
reproduction_commands.sh.
Effort: ~1 h CPU.
Sequencing & rough timeline
Phase A (CPU, before dash0):
A1 + A2 (parallel) ~half day CPU
A3 patch (scheduler.py) ~half day CPU
A4 SRR loadgen ~half day CPU
A5 analyzer extensions ~1 day CPU
Window 1 on dash0 (B1' + B2 + B3, ~5.5 h GPU):
smoke validation of A1–A4 ~30 min GPU
B1' KV footprint + reuse decomp ~30 min GPU
B2 interference microbench ~3 h GPU
B3 routing sweep (5 policies) ~1.5 h GPU
Phase C partial refresh ~30 min CPU
── HARD STOP, hand results back ──
Window 2 on dash0 (B4 + B5, ~6 h GPU, only after review):
B4 SRR sweep (5 policies × bisect) ~4 h GPU
B5 failure attribution ~2 h GPU
Phase C final refresh ~1 h CPU
Decisions (locked 2026-05-25)
- Target model: Qwen3-Coder-30B-A3B. Compute kv_bytes_per_token from this model's config at manifest time.
- GPU topology: TP1 × 8 vLLM instances (DP8). All proxies and sweeps assume 8 worker endpoints.
- Trace for B3/B4: traces/w600_r0.0015_st30.jsonl (~850 requests). No resampling.
- E2E SLO: per-class. Split requests by requested_output_tokens decile, set separate E2E thresholds per class. No normalized-E2E headline.
- vLLM scheduler patch: accepted. Step-level JSONL log goes through a patch under patches/. Polling falls back to per-engine /metrics for sanity only.
- GPU phasing: hard stop after B2 and B3. Hand results back for review before committing to B4 SRR sweep or B5 attribution.
What stays with the interns
- Re-running summarize_runs.py after each GPU batch (mechanical).
- Reviewing the auto-generated current_results.md for typos.
- Maintaining main_claim_allowed_runs.md if new traces are added.
- Anything reading the audit package, not extending it.
Out of scope for this plan
- New routing policy design (Unified-v2 / PUSH variants).
- Production-grade KV transfer engineering.
- Any change to the production paper figures in analysis/pd_sep_paper_section/.
- vLLM upstream contributions.
These are downstream of characterization; once B2/B3/B5 attribution is in, we decide separately.