agentic-kvc/analysis/claude_characterization_work_plan.md
Gahow Wang e5761fa6f3 Characterization plan: progress snapshot + Claude work plan
- Add Progress Snapshot table to the intern TODO so per-batch status
  (DONE / partial / blocked-on-instrumentation) is visible at a glance.
- New analysis/claude_characterization_work_plan.md scopes the Phase A
  instrumentation tasks (A1-A5) plus Window 1 (B1'+B2+B3) and Window 2
  (B4+B5) on dash0, with locked decisions for model, topology, trace,
  SLO style, and GPU phasing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:18:41 +08:00

Claude Characterization Work Plan

Status: planning, awaiting dash0 idle
Date: 2026-05-25
Owner: Claude (not interns)
Source of requirements: analysis/characterization_todo_for_interns.md

Scope

This plan covers the four hard gates and the B2-B5 GPU experiments that the intern TODO marks as NOT DONE / protocol DONE. The B0 analyzer, the B1 trace-shape statistics, and the B6 audit scaffold are already done; this plan does not redo them, it only refreshes their inputs.

The work is split into:

  • Phase A (CPU-only) — instrumentation + analyzer extensions. Can run on the local dev box; does not need dash0. Must finish before any GPU run.
  • Phase B (dash0 GPU) — controlled microbench + routing sweep + SRR sweep + failure attribution.
  • Phase C (CPU-only) — final audit package refresh.

Phase A: Instrumentation + Analyzer (CPU-only, before dash0)

A1. Replayer instrumentation — close Gate 1 + Gate 2

File: replayer/metrics.py, replayer/replay.py

Add these fields to RequestMetrics:

t_dispatch_unix     float   # absolute wall-clock when POST starts
t_first_token_unix  float   # absolute wall-clock at first stream chunk
t_finish_unix       float   # absolute wall-clock at stream done or error
proxy_request_id    str     # value sent in X-Request-Id (matches breakdown)
endpoint_url        str     # which proxy/instance the request hit
trace_hash_ids      list[int]  # carried from trace for reuse joins

Change _dispatch_request to:

  • send a deterministic X-Request-Id: <session_id>:<turn_id> header (so proxy breakdown can be joined to metrics by exact key);
  • record time.time() (unix) at dispatch, first token, finish; keep perf_counter for the latency arithmetic.
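
The field list and dispatch changes above can be sketched as follows. This is a minimal sketch, not the actual `replayer/metrics.py`: the real `RequestMetrics` has more fields, and `make_request_id` and `stamp_dispatch` are hypothetical helper names introduced only for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RequestMetrics:
    # Only the new fields from this plan; the real class carries more.
    t_dispatch_unix: float = 0.0
    t_first_token_unix: float = 0.0
    t_finish_unix: float = 0.0
    proxy_request_id: str = ""
    endpoint_url: str = ""
    trace_hash_ids: list[int] = field(default_factory=list)


def make_request_id(session_id: str, turn_id: int) -> str:
    """Deterministic join key sent in X-Request-Id (hypothetical helper)."""
    return f"{session_id}:{turn_id}"


def stamp_dispatch(m: RequestMetrics, session_id: str, turn_id: int, url: str) -> dict[str, str]:
    """Fill the dispatch-time fields and return the headers to send."""
    m.proxy_request_id = make_request_id(session_id, turn_id)
    m.endpoint_url = url
    m.t_dispatch_unix = time.time()  # wall clock, usable for cross-process joins
    return {"X-Request-Id": m.proxy_request_id}
```

Latency arithmetic would still use `time.perf_counter()` deltas; the unix timestamps exist only so replayer rows can be aligned with proxy and engine logs.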

Acceptance: a 30-request smoke run produces metrics.jsonl where every row has those fields; breakdown.json rows from the proxy have the same request_id keys.

Effort: 1 small PR. Pure CPU.

A2. Proxy instrumentation — close Gate 1 + Gate 3 + Gate 4

File: scripts/cache_aware_proxy.py

Changes:

  1. Honor incoming X-Request-Id: if header present, use it instead of generating a new uuid. Falls back to uuid otherwise.
  2. Record on every breakdown row:
    • session_id (already on header, not currently stored)
    • input_length
    • estimated_new_tokens (already produced by router)
    • candidate_scores (list of {url, p_tokens_score, cache_score, bs, occupancy})
    • chosen_score
  3. At route decision time, snapshot per-worker state:
    • pending_prefill_tokens per worker
    • running_decode_requests per worker
    • kv_blocks_used / kv_blocks_total per worker
    • apc_hits / apc_queries cumulative per worker
     Write these to a separate worker_state.jsonl (one line per route decision) with (t_decision_unix, request_id, per_worker_state).
  4. New endpoint GET /worker_state returns the latest snapshot per worker (for sanity / live debugging).
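
Changes 1 and 3 could look roughly like the sketch below. It is framework-agnostic and assumes headers arrive as a plain mapping; `resolve_request_id` and `worker_state_line` are hypothetical names, not functions that exist in `scripts/cache_aware_proxy.py` today.

```python
import json
import time
import uuid
from typing import Mapping


def resolve_request_id(headers: Mapping[str, str]) -> str:
    """Use the caller's X-Request-Id if present, else generate a uuid."""
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("x-request-id") or uuid.uuid4().hex


def worker_state_line(request_id: str, per_worker_state: dict) -> str:
    """Serialize one route decision as a single worker_state.jsonl line."""
    record = {
        "t_decision_unix": time.time(),
        "request_id": request_id,
        "per_worker_state": per_worker_state,
    }
    return json.dumps(record, sort_keys=True)
```

Keeping the replayer's id unchanged is what makes the later join exact; a proxy-generated uuid would only appear for traffic that did not come from the instrumented replayer.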

Acceptance: smoke run produces breakdown.json with new fields and a non-empty worker_state.jsonl that joins to breakdown by request_id.

Effort: 1 medium PR. Pure CPU + light proxy work.

A3. Engine-side step timestamps — close Gate 3 for B2

vLLM 0.18.1 already exposes:

  • vllm:request_prefill_time_seconds (histogram, per-request)
  • vllm:request_decode_time_seconds
  • vllm:time_per_output_token_seconds
  • step-level scheduler stats via engine.async_step logging

For B2 we need decode-step and prefill-chunk timestamps with worker id. Plan:

  1. Inspect whether the vLLM proxy can be polled at high rate (e.g. 100 Hz) for per-engine scheduler counters (num_running, num_waiting, gpu_cache_usage, prefix_cache_queries, prefix_cache_hits). If yes, sample into engine_state.jsonl during runs.
  2. If finer step-level data is needed, patch one vLLM file (vllm/engine/async_llm_engine.py step loop or vllm/v1/core/sched/scheduler.py) to emit a JSONL line per scheduler step with (t_unix, worker_id, num_prefill_tokens_scheduled, num_decode_steps, running_request_ids). Patch goes under patches/ so it can be applied/reverted cleanly.
  3. Worker id mapping: when running TP1 × DP8 or similar, each engine listens on a distinct port; worker_id == endpoint_url.

Acceptance: a single 10-minute run produces engine_state.jsonl from which a decode step at time T on worker W can be classified as "overlapping a same-worker prefill chunk" or not.
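
One way to make the acceptance check concrete is sketched below. It assumes the step-level schema from item 2 (one row per scheduler step with `t_unix`, `worker_id`, `num_prefill_tokens_scheduled`) and further assumes that each row marks the start of a step that lasts until the next row on the same worker. Both function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def index_steps(rows):
    """Group scheduler-step rows by worker, sorted by t_unix."""
    by_worker = defaultdict(list)
    for row in rows:
        by_worker[row["worker_id"]].append(row)
    for steps in by_worker.values():
        steps.sort(key=lambda r: r["t_unix"])
    return by_worker


def overlaps_prefill(by_worker, worker_id, t_unix):
    """True if the step covering t_unix on worker_id also scheduled prefill tokens."""
    steps = by_worker.get(worker_id, [])
    times = [s["t_unix"] for s in steps]
    i = bisect_right(times, t_unix) - 1  # last step that started at or before t_unix
    if i < 0:
        return False  # no step logged yet on this worker
    return steps[i]["num_prefill_tokens_scheduled"] > 0
```

If the polling route is chosen instead of the patch, the same lookup applies, but the classification is only as fine as the polling interval.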

Effort: 1 medium investigation (decide poll vs patch) + 1 medium PR.

A4. Open-loop session-causal loadgen for B4

File: replayer/replay.py (new mode) or new replayer/srr_loadgen.py

Current replayer dispatches by trace timestamps. SRR sweep needs:

  • pool of session templates (each = ordered list of turns from the trace);
  • Poisson arrivals of new sessions at rate lambda;
  • within a session: strict sequentiality (turn N+1 waits for turn N finish);
  • per-run warmup window (e.g. 60s) + steady-state window (e.g. 300s);
  • attempted / completed / error counters per window.

Add a new mode --mode srr --arrival-rate <lambda> --warmup-s 60 --steady-s 300 --session-pool-size N. The trace file becomes the pool; sessions are drawn with replacement.

Acceptance: at lambda = 0.5 sess/s, the run shows exponential inter-arrival times and per-session sequentiality in metrics.jsonl. A window_summary.json lists warmup vs steady-state attempted/completed.
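
The arrival process described above can be sketched with the standard library alone. This is an illustration of the scheduling logic only (no HTTP dispatch, no per-session sequencing); the function names are hypothetical and the real `--mode srr` implementation may differ.

```python
import random


def poisson_arrival_times(rate, duration_s, rng):
    """Arrival offsets in [0, duration_s) for a Poisson process with the given rate."""
    times, t = [], rng.expovariate(rate)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate)  # exponential gaps give Poisson arrivals
    return times


def draw_sessions(pool, n, rng):
    """Draw n session templates from the pool with replacement."""
    return [rng.choice(pool) for _ in range(n)]


def window_of(t, warmup_s, steady_s):
    """Label an arrival offset as 'warmup', 'steady', or 'after'."""
    if t < warmup_s:
        return "warmup"
    return "steady" if t < warmup_s + steady_s else "after"
```

Passing a seeded `random.Random` instance keeps a given lambda point reproducible across policies, so every policy sees the same arrival sequence.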

Effort: 1 medium PR.

A5. Analyzer extensions

File: analysis/characterization/analyze.py (extend, do not rewrite)

Add:

  1. Joined-record builder. Given --metrics metrics.jsonl --breakdown breakdown.json --worker-state worker_state.jsonl --engine-state engine_state.jsonl, produce joined.jsonl keyed on request_id with all fields merged.
  2. Reuse decomposition (real). Using joined records that carry session_id + hash_ids + cached_tokens, compute intra_session / cross_session / shared_prefix / unclassified cached-token mass. Replaces the current status: unavailable placeholder when fields are present.
  3. Interference index. Per decode step, label "overlap same-worker prefill" using engine_state.jsonl. Compute TPOT_p90(overlap) / TPOT_p90(no_overlap).
  4. Hotspot index. Per worker queue delay p90, output max_worker_q_p90 / median_worker_q_p90.
  5. Failure label. For each slow / SLO-violating request, assign one of: same_worker_prefill_overlap, hot_worker_queue, high_kv_occupancy, cache_miss_large_append, transfer_wait, p_queue_wait, d_admission_wait, unknown.
  6. Window summary. For SRR runs, compute attempted/completed/error/goodput plus latency percentiles on the steady-state window only.
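
The joined-record builder from item 1 might be sketched as below. It assumes each source has already been parsed into a list of dicts, that replayer rows carry the key as `proxy_request_id`, and that proxy rows carry it as `request_id`. The actual on-disk layout of breakdown.json is not specified here, so loading is left out; `join_records` is a hypothetical name.

```python
def join_records(metrics_rows, breakdown_rows, worker_state_rows):
    """Merge rows from the three sources into one dict per request_id."""
    joined = {}
    for row in metrics_rows:
        rid = row["proxy_request_id"]
        joined[rid] = {"request_id": rid, **row}
    for source in (breakdown_rows, worker_state_rows):
        for row in source:
            rid = row["request_id"]
            # Proxy-only rows are kept so unmatched requests stay visible.
            joined.setdefault(rid, {"request_id": rid}).update(row)
    return joined
```

Later sources overwrite earlier ones on field-name collisions, so if the replayer and proxy both emit a field with the same name, prefixing one side would be safer than relying on merge order.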

Acceptance: re-run analyzer on smoke output and confirm reuse_decomposition no longer says unavailable; interference_index.json produced when engine state present; failure_breakdown.json populated when labels assigned.
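
For reference, the two ratio metrics from items 3 and 4 reduce to a few lines once the per-step and per-worker samples are grouped. The sketch below uses the inclusive percentile method from the standard library; the choice of percentile method is an assumption, and the analyzer may settle on a different one.

```python
import statistics


def p90(values):
    """90th percentile (inclusive method); needs at least two values."""
    return statistics.quantiles(values, n=10, method="inclusive")[-1]


def interference_index(tpot_overlap, tpot_no_overlap):
    """TPOT_p90(overlap) / TPOT_p90(no_overlap)."""
    return p90(tpot_overlap) / p90(tpot_no_overlap)


def hotspot_index(queue_delay_by_worker):
    """Max worker queue-delay p90 divided by the median worker queue-delay p90."""
    per_worker = [p90(v) for v in queue_delay_by_worker.values()]
    return max(per_worker) / statistics.median(per_worker)
```

A value near 1.0 means no measurable effect in either index; both are undefined when the denominator group is empty or has a zero p90, which the analyzer would need to report rather than divide through.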

Effort: 1 large PR. CPU-only.

Phase B: GPU experiments (needs dash0)

B1' Workload characterization closure

Inputs: instrumented replayer + small smoke trace (≤500 req).

Steps:

  1. Pick kv_bytes_per_token for the production model. For Qwen3-Coder TP1 the value depends on layer/head config; compute from vllm.config once at run start and record in manifest.
  2. Re-run analyzer on full GLM-5.1 trace with --kv-bytes-per-token. Output: KV footprint p50/p90/p99 in kv_footprint_summary.json.
  3. Run a 1k-request session-causal smoke replay with instrumented proxy. Use the joined records to populate real reuse decomposition for the small sample. (Full-trace replay is too expensive; sample is acceptable for the decomposition claim.)
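
As a cross-check on step 1, the usual estimate for a standard multi-head or grouped-query attention model is two tensors (K and V) per layer, each of size num_kv_heads × head_dim per token. The sketch below encodes that estimate; it is not read from vLLM, and it does not cover architectures that compress or share the KV cache. The parameter values in the usage note are illustrative only and are not the target model's configuration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV-cache bytes for one token: K and V, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_footprint_bytes(input_tokens, output_tokens, bytes_per_token):
    """Peak KV footprint of a request, ignoring block rounding and prefix sharing."""
    return (input_tokens + output_tokens) * bytes_per_token
```

For example, a hypothetical 32-layer model with 8 KV heads of dimension 128 in a 2-byte dtype gives `kv_bytes_per_token(32, 8, 128)` = 131072 bytes (128 KiB) per token. The value recorded in the manifest should come from the model's actual config.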

Wall-clock: ~30 min GPU. Produces 2 figures: KV footprint CDF, reuse decomposition stacked bar.

B2 PD-colo interference microbench

Setup: 1 combined instance on TP1. Two synthetic load generators:

  1. Decode-only steady load — short-prompt sessions at fixed per-second arrival, designed to saturate decode without prefill contention.
  2. Prefill injector — single-shot long-prompt requests at controlled cadence; same worker (target the decode worker) vs different worker (route to a paired idle instance).

Sweep uncached_prefill_tokens ∈ {2k, 8k, 16k, 32k, 64k} × {same, different} worker.

Outputs: interference_microbench_summary.json, decode_step_timeseries.csv (from engine_state.jsonl), prefill_overlap_events.jsonl, interference_index.json, TPOT-with-overlay figure, interference-index-vs-prefill-size figure.

Wall-clock: ~2-3 h GPU including warm-up between sweeps.

B3 Routing sweep on session-causal trace

Setup: 8 combined instances (TP1 × DP8) with the cache-aware proxy.

Run the same session-causal trace (e.g. r=0.0015 st=30 850-req config from auto-mem feedback-bench-config.md) under five policies:

  1. corrected LMetric / cache-aware (--policy lmetric)
  2. load-only (new policy --policy load_only — picks min running)
  3. hard sticky (new policy --policy sticky — once a session lands on a worker, never moves)
  4. current Unified hybrid (--policy unified)
  5. session-mass capped replay (filter the trace so no session exceeds cap_turns or cap_input_tokens; rerun policy 1)

Per run, collect: replayer metrics, proxy breakdown, worker_state, engine_state. Compute per-worker queue delay, GPU util, KV occupancy, APC, session-to-worker map.

Outputs: worker_balance_summary.json, session_to_worker_map.json, session_mass_summary.json, routing_policy_comparison.json, hotspot_index.json, capped_session_replay_summary.json, 8 figures from the TODO list (§5.figures).

Wall-clock: 5 runs × ~13 min ≈ 1.5 h GPU.

Implementation note: load_only and sticky are small additions to scripts/cache_aware_proxy.py — they reuse existing affinity / score machinery.
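
The two new policies could be sketched as below. The sketch assumes each worker is described by a dict with `url` and `running` keys, and that a sticky session's first placement is made by load; the plan does not specify the first-placement rule, so that part is an assumption, and the names are hypothetical.

```python
def pick_load_only(workers):
    """Choose the worker with the fewest running requests (ties: first listed)."""
    return min(workers, key=lambda w: w["running"])["url"]


class StickyRouter:
    """Pin each session to the worker chosen for its first request."""

    def __init__(self):
        self._assignment = {}

    def pick(self, session_id, workers):
        if session_id not in self._assignment:
            # First turn of the session: place by load, then never move.
            self._assignment[session_id] = pick_load_only(workers)
        return self._assignment[session_id]
```

Hard sticky never evicts an assignment, so the map grows with the number of sessions seen. For an 850-request trace that is negligible, but a longer run would want expiry.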

B4 Sustainable Request Rate sweep

Setup: same 8 instances. Use Phase-A --mode srr loadgen.

SLO (locked per-class):

TTFT_p90 <= 2.0 s
TPOT_p90 <= 0.15 s
error_rate <= 0.5%
queue length stable (no monotone growth over steady window)
KV occupancy stable
E2E_p90 <= T_class[c]  for each output-length decile c

T_class[c] is derived from a low-load reference run as E2E_p90_low_load(c) * 2 (factor configurable). The reference run is done once and cached as analysis/characterization/srr/slo_classes.json.
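
The derivation of T_class can be sketched as follows, assuming the low-load reference rows expose `requested_output_tokens` and an end-to-end latency field (called `e2e_s` here, a hypothetical name). Decile edges are taken from the reference run itself.

```python
import statistics
from bisect import bisect_right


def decile_edges(output_tokens):
    """Nine cut points splitting requested_output_tokens into deciles."""
    return statistics.quantiles(output_tokens, n=10, method="inclusive")


def class_of(tokens, edges):
    """Decile class 0..9 for one request."""
    return bisect_right(edges, tokens)


def slo_classes(rows, factor=2.0):
    """T_class[c] = factor * E2E_p90 of the low-load reference run in class c."""
    edges = decile_edges([r["requested_output_tokens"] for r in rows])
    by_class = {}
    for r in rows:
        c = class_of(r["requested_output_tokens"], edges)
        by_class.setdefault(c, []).append(r["e2e_s"])
    return {
        c: factor * statistics.quantiles(v, n=10, method="inclusive")[-1]
        for c, v in sorted(by_class.items())
        if len(v) >= 2  # quantiles needs at least two samples
    }
```

The edges should be stored alongside the thresholds in slo_classes.json so that sweep runs classify requests with the same boundaries as the reference run.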

Per policy, sweep lambda from low (clearly safe) to high (clearly broken) using a doubling phase followed by a binary search:

λ_low  = 0.1 sess/s
λ_high = doubling until first SLO violation
binary-search λ_low .. λ_high for max sustainable λ
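
The search above can be sketched as a function of a single predicate, `meets_slo(lam)`, which stands for one full load-test run at rate `lam`. The safety cap and the relative stopping tolerance are assumptions added for the sketch, and the search is only valid if SLO compliance is monotone in the arrival rate.

```python
def max_sustainable_rate(meets_slo, lam_low=0.1, lam_cap=64.0, rel_tol=0.1):
    """Largest arrival rate found to meet the SLO, or None if lam_low already fails."""
    if not meets_slo(lam_low):
        return None
    lam_high = lam_low * 2
    # Doubling phase: grow until the first violation (or the safety cap).
    while meets_slo(lam_high):
        lam_low, lam_high = lam_high, lam_high * 2
        if lam_high > lam_cap:
            return lam_low  # never violated below the cap
    # Bisection phase: lam_low passes, lam_high fails.
    while (lam_high - lam_low) / lam_low > rel_tol:
        mid = (lam_low + lam_high) / 2
        if meets_slo(mid):
            lam_low = mid
        else:
            lam_high = mid
    return lam_low
```

Every call to `meets_slo` costs one warmup plus steady-state window, so the tolerance directly trades precision of the SRR estimate against GPU time.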

Policies covered: LMetric, static PD-disagg, Unified, hard sticky, load-only.

Outputs: srr_curve.json, lambda_runs/<lambda>/summary.json, slo_violation_reason.json, goodput_vs_arrival_rate.json, stability_summary.json, all 8 figures from §6.figures.

Wall-clock: this is the most expensive batch. With binary search, ~6 lambda points × 5 policies × ~8 min (warmup + steady) ≈ 4 h GPU.

B5 Failure attribution near SRR boundary

For each policy: pick λ ∈ {0.9, 1.0, 1.1} × SRR, run with full instrumentation, then run the analyzer's failure-label step.

Outputs: slow_request_attribution.jsonl, failure_breakdown.json, case_studies.md, worker_failure_windows.json, 5 figures from §7.

Wall-clock: 3 lambdas × 5 policies × 8 min ≈ 2 h GPU.

Phase C: Audit package refresh (CPU)

Re-run summarize_runs.py and plot_current_results.py after each GPU batch. Final pass after B5: refresh claim_matrix, risk_register, allowed_runs, regenerate all figures, update reproduction_commands.sh.

Effort: ~1 h CPU.

Sequencing & rough timeline

Phase A (CPU, before dash0):
  A1 + A2     (parallel)               ~half day CPU
  A3 patch    (scheduler.py)           ~half day CPU
  A4 SRR loadgen                       ~half day CPU
  A5 analyzer extensions               ~1 day   CPU

Window 1 on dash0 (B1' + B2 + B3 only, ~5.5 h GPU):
  smoke validation of A1-A4            ~30 min GPU
  B1' KV footprint + reuse decomp      ~30 min GPU
  B2 interference microbench           ~3 h    GPU
  B3 routing sweep (5 policies)        ~1.5 h  GPU
  Phase C partial refresh              ~30 min CPU
  ── HARD STOP, hand results back ──

Window 2 on dash0 (B4 + B5, ~6 h GPU, only after review):
  B4 SRR sweep (5 policies × bisect)   ~4 h    GPU
  B5 failure attribution               ~2 h    GPU
  Phase C final refresh                ~1 h    CPU

Decisions (locked 2026-05-25)

  1. Target model: Qwen3-Coder-30B-A3B. Compute kv_bytes_per_token from this model's config at manifest time.
  2. GPU topology: TP1 × 8 vLLM instances (DP8). All proxies and sweeps assume 8 worker endpoints.
  3. Trace for B3/B4: traces/w600_r0.0015_st30.jsonl (~850 requests). No resampling.
  4. E2E SLO: per-class. Split requests by requested_output_tokens decile, set separate E2E thresholds per class. No normalized-E2E headline.
  5. vLLM scheduler patch: accepted. Step-level JSONL log goes through a patch under patches/. Polling falls back to per-engine /metrics for sanity only.
  6. GPU phasing: hard stop after B2 and B3. Hand results back for review before committing to B4 SRR sweep or B5 attribution.

What stays with the interns

  • Re-running summarize_runs.py after each GPU batch (mechanical).
  • Reviewing the auto-generated current_results.md for typos.
  • Maintaining main_claim_allowed_runs.md if new traces are added.
  • Any task that only reads the audit package rather than extending it.

Out of scope for this plan

  • New routing policy design (Unified-v2 / PUSH variants).
  • Production-grade KV transfer engineering.
  • Any change to the production paper figures in analysis/pd_sep_paper_section/.
  • vLLM upstream contributions.

These are downstream of characterization; once B2/B3/B5 attribution is in, we decide separately.