Characterization Analyzer Runbook
CPU-only scaffold for Batch 0 and Batch 1 in
analysis/characterization_todo_for_interns.md.
This directory has three components:
- analyze.py: Batch 0/1 analyzer for trace and per-request metrics.
- summarize_runs.py: CPU-only audit of already completed benchmark directories.
- protocols.md: exact protocol for Batch 2-6 experiments that require fresh GPU runs or additional instrumentation.
The analyzer reads existing trace and metrics artifacts and writes:
outputs/characterization/<date>/<task_name>/
├── manifest.json
├── raw/
├── summary.json
├── summary.md
├── audit.md
├── session_concurrency.json
├── session_arrival_stats.json
├── turn_interval_stats.json
├── trace_profile.json
├── invalid_runs.md
├── workload_summary.json
├── kv_footprint_summary.json
├── reuse_decomposition.json
├── session_skew.json
├── append_delta_stats.json
└── figures/
If matplotlib is installed, simple PNG/PDF figures are emitted under
figures/. If it is not installed, all JSON/Markdown data artifacts are still
written.
Canonical Data Sources
Canonical full traces live on dash0:
- formatted trace: ~/ali-trace/trace-glm5.1-formatted/
- raw unformatted trace: ~/ali-trace/trace-glm5.1/
For the current GLM-5.1 characterization, prefer the compact formatted file:
~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl
Do not pass 051315-051317-raw.jsonl or the files under
~/ali-trace/trace-glm5.1/ directly to this analyzer unless you first convert
them to the formatted schema. Those raw files are tens to hundreds of GiB and
contain full prompt payloads rather than the compact characterization schema.
The analyzer is CPU-only. For full trace characterization, either:
- run it on dash0 against the formatted JSONL files without starting any GPU service; or
- copy/rsync the needed trace files from dash0 to this repository or another local path, then run the analyzer locally.
Only light directory/field inspection is needed on dash0 before choosing which trace file to analyze.
The raw unformatted directory is listed above for provenance only; --trace
expects formatted JSONL records.
Inputs
Trace JSONL:
- Expected formatted fields: chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids, optional session_id.
- If session_id is absent, sessions are reconstructed from parent_chat_id chains.
- timestamp is treated as scheduled trace time, not proof of actual dispatch time.
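The parent-chain reconstruction can be sketched as follows. This is a minimal illustration, not the code in analyze.py: the function name reconstruct_sessions is hypothetical, and it assumes that a root turn is one whose parent_chat_id does not match any chat_id in the file.

```python
from collections import defaultdict


def reconstruct_sessions(records):
    """Group trace records into sessions by following parent_chat_id chains.

    Assumption: a turn is a session root when its parent_chat_id is missing
    or does not refer to any chat_id present in `records`.
    """
    parent = {r["chat_id"]: r.get("parent_chat_id") for r in records}

    def root_of(chat_id):
        seen = set()
        while True:
            p = parent.get(chat_id)
            # Stop at a root, at a dangling parent, or if a cycle is detected.
            if p is None or p not in parent or p in seen:
                return chat_id
            seen.add(chat_id)
            chat_id = p

    sessions = defaultdict(list)
    for r in records:
        sessions[root_of(r["chat_id"])].append(r)
    for turns in sessions.values():
        turns.sort(key=lambda r: r["timestamp"])  # scheduled order only
    return dict(sessions)
```

Each session is keyed by the chat_id of its root turn. The ordering inside a session reflects scheduled trace time, so it says nothing about actual dispatch order.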
Metrics JSONL:
- Expected replayer fields: request_id, session_id, turn_id, trace_timestamp_s, input_length, output_length, cached_tokens, latency_s, ttft_s, tpot_s, actual_output_tokens, error.
- If the metrics file is from the current replayer, it does not include actual dispatch/finish wall-clock timestamps. Batch 0 will therefore mark actual session sequentiality as unavailable and separately report a scheduled estimate from trace_timestamp_s + latency_s.
Proxy breakdown:
- Optional JSON/JSONL with fields such as request_id, t_proxy_recv, t_first_token, t_done, cache_hit, estimated_new_tokens, route_class, routed_to, policy.
- Batch 0 can prove actual per-session in-flight concurrency only when these timing rows can be joined to analyzed requests by request_id.
- Existing proxy breakdown artifacts may not contain session_id; without a request-id join to trace/metrics, they can still support append/cache-hit statistics but not per-session concurrency.
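To illustrate why the request-id join matters, here is a minimal sketch of an exact join between metrics rows (which carry session_id) and proxy breakdown rows (which carry timing). The function name join_by_request_id is hypothetical, and the analyzer's own join may differ in detail.

```python
def join_by_request_id(metrics_rows, breakdown_rows):
    """Exact join on request_id; unmatched rows are returned, not dropped."""
    breakdown_by_id = {b["request_id"]: b for b in breakdown_rows}
    joined, unmatched_metrics = [], []
    matched_ids = set()
    for m in metrics_rows:
        b = breakdown_by_id.get(m["request_id"])
        if b is None:
            unmatched_metrics.append(m)
            continue
        matched_ids.add(m["request_id"])
        # Metrics fields win on key collisions, so session_id comes from metrics.
        joined.append({**b, **m})
    unmatched_breakdown = [
        b for b in breakdown_rows if b["request_id"] not in matched_ids
    ]
    return joined, unmatched_metrics, unmatched_breakdown
```

Only the joined rows carry both a session_id and proxy timestamps, so only they can feed a per-session concurrency check. The two unmatched lists still support aggregate statistics.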
Run config:
- Optional JSON, usually outputs/<run>/config.json.
- Used for manifest fields such as policy, time_scale, and request count when available.
Commands
Trace-only dry run:
python3 analysis/characterization/analyze.py \
--trace traces/w600_r0.0015_st30.jsonl \
--task-name w600_trace_only \
--overwrite
Trace plus replayer metrics:
python3 analysis/characterization/analyze.py \
--trace traces/w600_r0.0015_st30.jsonl \
--metrics outputs/smoke_test/metrics.jsonl \
--task-name smoke_trace_metrics \
--overwrite
Proxy breakdown append/cache analysis:
python3 analysis/characterization/analyze.py \
--breakdown outputs/contention_16s_elastic/breakdown.json \
--config outputs/contention_16s_elastic/config.json \
--task-name contention_breakdown \
--overwrite
Full trace on dash0, CPU-only:
python3 analysis/characterization/analyze.py \
--trace ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
--task-name full_trace_characterization \
--overwrite
Local run after copying from dash0:
rsync -av dash0:~/ali-trace/trace-glm5.1-formatted/<trace-file>.jsonl traces/
python3 analysis/characterization/analyze.py \
--trace traces/<trace-file>.jsonl \
--task-name full_trace_characterization \
--overwrite
By default the analyzer records file size and mtime but skips full SHA256
hashing, because canonical raw trace files can be hundreds of GiB. Add
--hash-inputs only when you intentionally want a full file hash.
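The size-and-mtime default with opt-in hashing can be sketched like this. The function name describe_input and the returned key names are hypothetical and are not taken from the analyzer's manifest format.

```python
import hashlib
import os


def describe_input(path, hash_inputs=False, chunk_size=1 << 20):
    """Record cheap file identity; compute SHA256 only when asked."""
    st = os.stat(path)
    info = {
        "path": path,
        "size_bytes": st.st_size,
        "mtime": st.st_mtime,
        "sha256": None,
    }
    if hash_inputs:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Stream in fixed-size chunks so very large files never sit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        info["sha256"] = digest.hexdigest()
    return info
```

Streaming keeps memory use flat, but hashing still reads every byte of the file. That read is why skipping the hash by default suits inputs of this size.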
KV footprint requires a model-specific value:
python3 analysis/characterization/analyze.py \
--trace traces/w600_r0.0015_st30.jsonl \
--kv-bytes-per-token 98304 \
--task-name w600_with_kv_estimate \
--overwrite
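For a model with a conventional multi-head or grouped-query attention KV cache, the per-token value can be derived as below. This is a hedged sketch: the shape numbers are purely illustrative, chosen because they reproduce the 98304 used in the command above, and they are not claimed to describe GLM-5.1 or any other specific model. Architectures that compress or share the KV cache need a different formula.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache per token for a standard MHA/GQA layout.

    The factor 2 accounts for one key vector and one value vector
    per layer and per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Illustrative shapes only (assumption): 24 layers, 8 KV heads,
# head dimension 128, 2-byte dtype such as fp16/bf16.
example = kv_bytes_per_token(num_layers=24, num_kv_heads=8, head_dim=128, dtype_bytes=2)
print(example)  # 98304
```

Take the real values from the served model's configuration and the KV cache dtype actually used at serving time.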
Summarize existing completed runs:
python3 analysis/characterization/summarize_runs.py
This writes:
analysis/characterization/current_results/
├── run_summaries.json
├── comparisons.json
├── claim_matrix.json
├── reviewer_risk_register.json
├── current_results.md
├── characterization_claim_matrix.md
├── all_figures_index.md
├── reviewer_risk_register.md
└── reproduction_commands.sh
Batch 0 Semantics
The online-serving invariant is:
Each session has at most one in-flight turn.
The analyzer reports:
- actual interval status from dispatch and finish/error timestamps;
- scheduled estimate from trace timestamps plus latency when available;
- per-session max in-flight;
- session start-time distribution;
- turn inter-arrival distribution;
- attempted/completed/error counts and goodput when metrics exist;
- run classification.
Important limitation: trace timestamps alone cannot prove actual replay
sequentiality. A run is only classified as online_realistic when actual
per-request dispatch and finish/error timestamps prove
max_inflight_per_session <= 1.
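The per-session invariant check can be sketched as a sweep over start and end events. This is a minimal illustration rather than the analyzer's implementation. Two assumptions are made here: the rows have already been joined so that each carries session_id, and t_proxy_recv and t_done are treated as the bounds of the in-flight interval.

```python
from collections import defaultdict


def max_inflight_per_session(rows, start_key="t_proxy_recv", end_key="t_done"):
    """Return {session_id: peak number of simultaneously in-flight turns}."""
    events = defaultdict(list)
    for r in rows:
        events[r["session_id"]].append((r[start_key], 1))
        events[r["session_id"]].append((r[end_key], -1))
    peaks = {}
    for sid, evs in events.items():
        # At equal timestamps, process ends (-1) before starts (+1) so that
        # back-to-back turns are not counted as overlapping.
        evs.sort(key=lambda e: (e[0], e[1]))
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[sid] = peak
    return peaks


def satisfies_invariant(peaks):
    """True when every session has at most one in-flight turn."""
    return all(v <= 1 for v in peaks.values())
```

Running the same function on intervals built from trace_timestamp_s and trace_timestamp_s + latency_s gives only the scheduled estimate, which cannot support an online_realistic classification.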
Batch 1 Semantics
The analyzer reports:
- input/output CDF stats;
- input/output ratio;
- KV footprint CDF stats when --kv-bytes-per-token is supplied;
- session skew and top-session contribution;
- append/uncached token stats when cached_tokens or cache_hit exists;
- reuse decomposition when both cached-token fields and hash_ids exist.
Reuse decomposition is conservative:
- intra_session: cached hash block was seen earlier in the same session;
- cross_session: cached hash block was seen earlier in another session;
- shared/system-prefix: early-position block appears in many sessions;
- unclassified: cached tokens could not be mapped to a previously seen hash block.
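One way to picture the four categories is the sketch below. It is not the analyzer's algorithm and rests on several assumptions: each hash id covers a fixed number of tokens (the hypothetical block_tokens parameter), cached tokens correspond to the leading hash blocks of a request, "many sessions" is counted over sessions seen so far, and the labels are tested in the order intra-session, shared prefix, cross-session. The two threshold parameters mirror the --system-prefix-blocks and --shared-prefix-min-sessions flags by name only, and their defaults here are arbitrary.

```python
from collections import Counter, defaultdict


def decompose_reuse(requests, block_tokens=16, system_prefix_blocks=4,
                    shared_prefix_min_sessions=3):
    """Attribute each request's cached tokens to a reuse category."""
    seen_by = defaultdict(set)  # hash id -> sessions that already sent it
    tokens = Counter()
    for req in sorted(requests, key=lambda r: r["timestamp"]):
        sid = req["session_id"]
        hash_ids = req["hash_ids"]
        n_cached = min(req["cached_tokens"] // block_tokens, len(hash_ids))
        for pos, h in enumerate(hash_ids[:n_cached]):
            owners = seen_by.get(h, set())
            if sid in owners:
                label = "intra_session"
            elif pos < system_prefix_blocks and len(owners) >= shared_prefix_min_sessions:
                label = "shared_system_prefix"
            elif owners:
                label = "cross_session"
            else:
                label = "unclassified"
            tokens[label] += block_tokens
        # Cached tokens that do not fill a whole mapped block stay unclassified.
        tokens["unclassified"] += req["cached_tokens"] - n_cached * block_tokens
        for h in hash_ids:
            seen_by[h].add(sid)
    return dict(tokens)
```

The per-category totals always sum to the total cached tokens. Anything that cannot be tied to an earlier block lands in unclassified and is never assigned to a reuse category by guesswork, which is the conservative behaviour described above.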
If cached-token/cache-hit fields are absent, reuse and append artifacts are
written with status: "unavailable" and list the required fields.
Limitations
- The script does not run a benchmark, query a live service, touch GPU state, or start any daemon.
- Request-id joins are exact. If trace, metrics, and proxy artifacts use different request IDs, the unmatched rows are preserved under raw/.
- Actual Batch 0 sequentiality needs actual dispatch and finish/error timestamps. Current replayer/metrics.py metrics are not enough by themselves.
- kv_bytes_per_token depends on model architecture, layer count, KV heads, head dimension, and dtype. The analyzer will not guess it.
- Shared/system-prefix reuse classification is a heuristic based on trace hash_ids positions and cross-session frequency. Adjust --shared-prefix-min-sessions and --system-prefix-blocks if the formatted trace provides a stronger system-prefix marker.