agentic-kvc/analysis/characterization
Gahow Wang 4722883903 Audit package refresh: Window 1 supported claims + risk register
Refresh the standing audit package now that B1' / B2 / B3 are complete.

current_results/characterization_claim_matrix.md
  Flips seven entries from "not_yet_supported" / "partially_supported"
  to "supported" with pointers into window_1_results/. New entries
  cover per-session sequentiality, KV per request, real reuse
  decomposition, theoretical APC ceiling, the LMetric locality gap,
  Unified breaking the locality-vs-latency tradeoff, B2 causal
  interference proof, sticky's interference inflation, and the
  partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay
  "not_yet_supported" (Window 2 work).

current_results/main_claim_allowed_runs.md
  New "Allowed For Routing-Policy Comparison" section pins the five
  B3 policy directories. New "Allowed For PD-colo Interference"
  section pins the B2 sweep. Legacy section retained for the
  pre-instrumentation 200/500/1000-req runs.

current_results/reviewer_risk_register.md
  Marks the two old "high"-severity risks (sequentiality / reuse
  decomposition) as resolved; adds new entries for the APC
  contamination empirics, the b3_analyze.sh truncate-write bug that
  cost unified's interference index, the GPU-0 EngineCore ghost
  cleanup, the saturated-replay caveat for trace-timestamp dispatch,
  and the synthetic B2 decode workload.

current_results/all_figures_index.md
  Adds the 8 new Window 1 figures alongside the existing 6 from the
  legacy summarize_runs run.

current_results/reproduction_commands.sh
  Records the full B3 + B2 + figure pipeline.

analysis/characterization_todo_for_interns.md
  Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE;
  only B4 and B5 remain (Window 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 23:25:27 +08:00

Characterization Analyzer Runbook

CPU-only scaffold for Batch 0 and Batch 1 in analysis/characterization_todo_for_interns.md.

This directory has three components:

  • analyze.py: Batch 0/1 analyzer for trace and per-request metrics.
  • summarize_runs.py: CPU-only audit of already completed benchmark directories.
  • protocols.md: exact protocol for Batch 2-6 experiments that require fresh GPU runs or additional instrumentation.

The analyzer reads existing trace and metrics artifacts and writes:

outputs/characterization/<date>/<task_name>/
├── manifest.json
├── raw/
├── summary.json
├── summary.md
├── audit.md
├── session_concurrency.json
├── session_arrival_stats.json
├── turn_interval_stats.json
├── trace_profile.json
├── invalid_runs.md
├── workload_summary.json
├── kv_footprint_summary.json
├── reuse_decomposition.json
├── session_skew.json
├── append_delta_stats.json
└── figures/

If matplotlib is installed, simple PNG/PDF figures are emitted under figures/. If it is not installed, all JSON/Markdown data artifacts are still written.
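
The optional-figure behaviour described above can be sketched as follows. This is a minimal illustration, not the code in analyze.py: the function name write_outputs, the input_lengths key, and the figure file name are hypothetical. The data artifact is written first, and plotting is attempted only if matplotlib imports cleanly.

```python
import json
from pathlib import Path


def write_outputs(out_dir, summary):
    """Write summary.json always; write a CDF figure only if matplotlib exists.

    Returns True when a figure was written, False otherwise.
    `summary["input_lengths"]` is a hypothetical key used for illustration.
    """
    out = Path(out_dir)
    (out / "figures").mkdir(parents=True, exist_ok=True)
    # Data artifacts never depend on the plotting backend.
    (out / "summary.json").write_text(json.dumps(summary, indent=2))

    values = sorted(summary.get("input_lengths", []))
    if not values:
        return False
    try:
        import matplotlib
        matplotlib.use("Agg")  # headless backend, no display needed
        import matplotlib.pyplot as plt
    except ImportError:
        return False

    # Empirical CDF: the i-th smallest value has cumulative fraction (i+1)/n.
    fractions = [(i + 1) / len(values) for i in range(len(values))]
    fig, ax = plt.subplots()
    ax.plot(values, fractions)
    ax.set_xlabel("input_length (tokens)")
    ax.set_ylabel("CDF")
    fig.savefig(out / "figures" / "input_length_cdf.png")
    plt.close(fig)
    return True
```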

Canonical Data Sources

Canonical full traces live on dash0:

  • formatted trace: ~/ali-trace/trace-glm5.1-formatted/
  • raw unformatted trace: ~/ali-trace/trace-glm5.1/

For the current GLM-5.1 characterization, prefer the compact formatted file:

~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl

Do not pass 051315-051317-raw.jsonl or the files under ~/ali-trace/trace-glm5.1/ directly to this analyzer unless you first convert them to the formatted schema. Those raw files are tens to hundreds of GiB and contain full prompt payloads rather than the compact characterization schema.

The analyzer is CPU-only. For full trace characterization, either:

  • run it on dash0 against the formatted JSONL files without starting any GPU service; or
  • copy/rsync the needed trace files from dash0 to this repository or another local path, then run the analyzer locally.

Only light directory/field inspection is needed on dash0 before choosing which trace file to analyze.

The raw unformatted directory is listed for provenance only. The analyzer expects formatted JSONL records, so convert raw files to the formatted schema before passing them to --trace.

Inputs

Trace JSONL:

  • Expected formatted fields: chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids, optional session_id.
  • If session_id is absent, sessions are reconstructed from parent_chat_id chains.
  • timestamp is treated as scheduled trace time, not proof of actual dispatch time.
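
Session reconstruction from parent_chat_id chains can be sketched as below. This is an illustrative approach, not necessarily what analyze.py does: it assumes a root turn has a parent_chat_id that is null or does not appear as a chat_id in the trace, and it labels each session by its root chat_id. The function name is hypothetical.

```python
def reconstruct_sessions(records):
    """Map every chat_id to the chat_id of its chain root.

    Assumption: a turn is a root when its parent_chat_id is None or is
    not present as a chat_id anywhere in the trace.
    """
    parent = {r["chat_id"]: r.get("parent_chat_id") for r in records}

    def root_of(chat_id):
        seen = {chat_id}
        cur = chat_id
        while True:
            p = parent.get(cur)
            # Stop at a root, a dangling parent, or a cycle in bad data.
            if p is None or p not in parent or p in seen:
                return cur
            seen.add(p)
            cur = p

    return {cid: root_of(cid) for cid in parent}


records = [
    {"chat_id": 1, "parent_chat_id": None},
    {"chat_id": 2, "parent_chat_id": 1},
    {"chat_id": 3, "parent_chat_id": 2},
    {"chat_id": 10, "parent_chat_id": -1},  # -1 never appears as a chat_id
    {"chat_id": 11, "parent_chat_id": 10},
]
sessions = reconstruct_sessions(records)
```

Here turns 1, 2, and 3 fall into one session and turns 10 and 11 into another.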

Metrics JSONL:

  • Expected replayer fields: request_id, session_id, turn_id, trace_timestamp_s, input_length, output_length, cached_tokens, latency_s, ttft_s, tpot_s, actual_output_tokens, error.
  • If the metrics file is from the current replayer, it does not include actual dispatch/finish wall-clock timestamps. Batch 0 will therefore mark actual session sequentiality as unavailable and separately report a scheduled estimate from trace_timestamp_s + latency_s.

Proxy breakdown:

  • Optional JSON/JSONL with fields such as request_id, t_proxy_recv, t_first_token, t_done, cache_hit, estimated_new_tokens, route_class, routed_to, policy.
  • Batch 0 can prove actual per-session in-flight concurrency only when these timing rows can be joined to analyzed requests by request_id.
  • Existing proxy breakdown artifacts may not contain session_id; without a request-id join to trace/metrics, they can still support append/cache-hit statistics but not per-session concurrency.
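
The request-id join that makes per-session concurrency provable might look like the following sketch. Field names come from the lists above; the function name and the choice of t_proxy_recv and t_done as interval endpoints are assumptions made for illustration.

```python
def join_breakdown_to_sessions(metrics_rows, breakdown_rows):
    """Attach session_id to proxy timing rows via an exact request_id match.

    Returns (joined, unmatched). Rows without a matching request_id or
    without both timestamps cannot support per-session concurrency.
    """
    session_of = {m["request_id"]: m["session_id"] for m in metrics_rows}
    joined, unmatched = [], []
    for row in breakdown_rows:
        sid = session_of.get(row.get("request_id"))
        start, end = row.get("t_proxy_recv"), row.get("t_done")
        if sid is None or start is None or end is None:
            unmatched.append(row)
            continue
        joined.append(
            {"request_id": row["request_id"], "session_id": sid,
             "start": start, "end": end}
        )
    return joined, unmatched


metrics = [{"request_id": "r1", "session_id": "s1"},
           {"request_id": "r2", "session_id": "s1"}]
breakdown = [{"request_id": "r1", "t_proxy_recv": 0.0, "t_done": 1.5},
             {"request_id": "rX", "t_proxy_recv": 2.0, "t_done": 3.0}]
joined, unmatched = join_breakdown_to_sessions(metrics, breakdown)
```

In this example r1 joins to session s1, while rX has no counterpart in the metrics and stays in the unmatched list.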

Run config:

  • Optional JSON, usually outputs/<run>/config.json.
  • Used for manifest fields such as policy, time_scale, and request count when available.

Commands

Trace-only dry run:

python3 analysis/characterization/analyze.py \
  --trace traces/w600_r0.0015_st30.jsonl \
  --task-name w600_trace_only \
  --overwrite

Trace plus replayer metrics:

python3 analysis/characterization/analyze.py \
  --trace traces/w600_r0.0015_st30.jsonl \
  --metrics outputs/smoke_test/metrics.jsonl \
  --task-name smoke_trace_metrics \
  --overwrite

Proxy breakdown append/cache analysis:

python3 analysis/characterization/analyze.py \
  --breakdown outputs/contention_16s_elastic/breakdown.json \
  --config outputs/contention_16s_elastic/config.json \
  --task-name contention_breakdown \
  --overwrite

Full trace on dash0, CPU-only:

python3 analysis/characterization/analyze.py \
  --trace ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
  --task-name full_trace_characterization \
  --overwrite

Local run after copying from dash0:

rsync -av dash0:~/ali-trace/trace-glm5.1-formatted/<trace-file>.jsonl traces/
python3 analysis/characterization/analyze.py \
  --trace traces/<trace-file>.jsonl \
  --task-name full_trace_characterization \
  --overwrite

By default the analyzer records file size and mtime but skips full SHA256 hashing, because canonical raw trace files can be hundreds of GiB. Add --hash-inputs only when you intentionally want a full file hash.
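
A manifest entry with cheap metadata by default and an opt-in hash could be built roughly like this. It is a sketch under assumptions: the function name, the hash_inputs parameter, and the entry keys are hypothetical and only mirror the behaviour described above.

```python
import hashlib
import os


def describe_input(path, hash_inputs=False, chunk_size=1 << 20):
    """Record size and mtime; compute SHA256 only when explicitly requested."""
    st = os.stat(path)
    entry = {
        "path": str(path),
        "size_bytes": st.st_size,
        "mtime": st.st_mtime,
        "sha256": None,  # stays None unless hashing is requested
    }
    if hash_inputs:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Stream in fixed-size chunks so large files never sit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        entry["sha256"] = digest.hexdigest()
    return entry
```

Streaming keeps memory use flat, but hashing still reads every byte of the file, which is why it is worth keeping off by default for very large traces.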

KV footprint requires a model-specific value:

python3 analysis/characterization/analyze.py \
  --trace traces/w600_r0.0015_st30.jsonl \
  --kv-bytes-per-token 98304 \
  --task-name w600_with_kv_estimate \
  --overwrite
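
For a conventional multi-head or grouped-query attention model, a value for --kv-bytes-per-token can be derived as sketched below. The configuration shown is purely illustrative and is not a statement about the GLM-5.1 model; architectures with compressed or latent KV caches need a different formula. The per-request footprint helper assumes every input and output token holds a KV entry, which is also an assumption.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache per token for standard MHA/GQA attention.

    The factor 2 accounts for storing both a key and a value vector.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_footprint_bytes(input_length, output_length, bytes_per_token):
    """Peak KV bytes for one request, assuming all tokens stay resident."""
    return (input_length + output_length) * bytes_per_token


# Hypothetical config: 24 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(24, 8, 128, 2)  # 98304
footprint = kv_footprint_bytes(1000, 24, per_token)
```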

Summarize existing completed runs:

python3 analysis/characterization/summarize_runs.py

This writes:

analysis/characterization/current_results/
├── run_summaries.json
├── comparisons.json
├── claim_matrix.json
├── reviewer_risk_register.json
├── current_results.md
├── characterization_claim_matrix.md
├── all_figures_index.md
├── reviewer_risk_register.md
└── reproduction_commands.sh

Batch 0 Semantics

The online-serving invariant is:

Each session has at most one in-flight turn.

The analyzer reports:

  • actual interval status from dispatch and finish/error timestamps;
  • scheduled estimate from trace timestamps plus latency when available;
  • per-session max in-flight;
  • session start-time distribution;
  • turn inter-arrival distribution;
  • attempted/completed/error counts and goodput when metrics exist;
  • run classification.

Important limitation: trace timestamps alone cannot prove actual replay sequentiality. A run is classified as online_realistic only when actual per-request dispatch and finish/error timestamps prove max_inflight_per_session <= 1.
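
The max_inflight_per_session check can be illustrated with a sweep over interval endpoints. This sketch assumes each request is an interval (session_id, dispatch, finish) with finish later than dispatch, and that a turn dispatched at the exact instant the previous one finishes does not overlap it. Function names are hypothetical.

```python
from collections import defaultdict


def max_inflight_per_session(intervals):
    """Return {session_id: peak number of simultaneously in-flight turns}."""
    events = defaultdict(list)
    for sid, start, end in intervals:
        events[sid].append((start, 1))   # dispatch
        events[sid].append((end, -1))    # finish or error
    peaks = {}
    for sid, evs in events.items():
        # At equal timestamps, process finishes (-1) before dispatches (+1).
        evs.sort(key=lambda e: (e[0], e[1]))
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[sid] = peak
    return peaks


def proves_sequential(intervals):
    """True only when every session has at most one in-flight turn."""
    peaks = max_inflight_per_session(intervals)
    return bool(peaks) and max(peaks.values()) <= 1


intervals = [("a", 0, 5), ("a", 5, 8), ("b", 0, 10), ("b", 3, 4)]
peaks = max_inflight_per_session(intervals)
```

Session a stays sequential because its second turn starts exactly when the first ends, while session b has two overlapping turns and would fail the invariant.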

Batch 1 Semantics

The analyzer reports:

  • input/output CDF stats;
  • input/output ratio;
  • KV footprint CDF stats when --kv-bytes-per-token is supplied;
  • session skew and top-session contribution;
  • append/uncached token stats when cached_tokens or cache_hit exists;
  • reuse decomposition when both cached-token fields and hash_ids exist.

Reuse decomposition is conservative:

  • intra_session: cached hash block was seen earlier in the same session;
  • cross_session: cached hash block was seen earlier in another session;
  • shared/system-prefix: early-position block appears in many sessions;
  • unclassified: cached tokens could not be mapped to a previously seen hash block.

If cached-token/cache-hit fields are absent, reuse and append artifacts are written with status: "unavailable" and list the required fields.
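
The four reuse categories can be sketched as a single pass over requests in timestamp order. This is an illustration under stated assumptions rather than the analyzer's implementation: cached tokens are taken to cover a prefix of hash_ids, each hash block holds block_tokens tokens (a hypothetical parameter), and the shared/system-prefix test is applied before the intra/cross test. The other two parameters mirror --system-prefix-blocks and --shared-prefix-min-sessions.

```python
from collections import Counter, defaultdict


def decompose_reuse(requests, block_tokens=16, system_prefix_blocks=4,
                    shared_prefix_min_sessions=3):
    """Attribute cached tokens to reuse categories; returns token counts."""
    # Sessions in which each block appears anywhere in the trace.
    sessions_with_block = defaultdict(set)
    for r in requests:
        for h in r["hash_ids"]:
            sessions_with_block[h].add(r["session_id"])

    seen_by = defaultdict(set)  # block -> sessions that used it earlier
    counts = Counter()
    for r in sorted(requests, key=lambda r: r["timestamp"]):
        sid = r["session_id"]
        cached = r.get("cached_tokens") or 0
        blocks = r["hash_ids"][: cached // block_tokens]
        for pos, h in enumerate(blocks):
            if (pos < system_prefix_blocks
                    and len(sessions_with_block[h]) >= shared_prefix_min_sessions):
                label = "shared_system_prefix"
            elif sid in seen_by[h]:
                label = "intra_session"
            elif seen_by[h]:
                label = "cross_session"
            else:
                label = "unclassified"
            counts[label] += block_tokens
        # Cached tokens that do not map onto a whole hash block.
        leftover = cached - len(blocks) * block_tokens
        if leftover:
            counts["unclassified"] += leftover
        for h in r["hash_ids"]:
            seen_by[h].add(sid)
    return dict(counts)


requests = [
    {"session_id": "s1", "timestamp": 0, "hash_ids": [100, 1, 2], "cached_tokens": 0},
    {"session_id": "s2", "timestamp": 1, "hash_ids": [100, 5], "cached_tokens": 16},
    {"session_id": "s3", "timestamp": 2, "hash_ids": [100, 1], "cached_tokens": 32},
    {"session_id": "s1", "timestamp": 3, "hash_ids": [100, 1, 2, 3], "cached_tokens": 56},
]
result = decompose_reuse(requests, block_tokens=16, system_prefix_blocks=1,
                         shared_prefix_min_sessions=3)
```

The per-category counts always sum to the total cached tokens, so any mass that cannot be mapped to a previously seen block remains visible as unclassified instead of being silently dropped.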

Limitations

  • The script does not run a benchmark, query a live service, touch GPU state, or start any daemon.
  • Request-id joins are exact. If trace, metrics, and proxy artifacts use different request IDs, the unmatched rows are preserved under raw/.
  • Proving actual Batch 0 sequentiality requires actual dispatch and finish/error timestamps. Metrics from the current replayer/metrics.py are not sufficient by themselves.
  • kv_bytes_per_token depends on model architecture, layer count, KV heads, head dimension, and dtype. The analyzer will not guess it.
  • Shared/system-prefix reuse classification is a heuristic based on trace hash_ids positions and cross-session frequency. Adjust --shared-prefix-min-sessions and --system-prefix-blocks if the formatted trace provides a stronger system-prefix marker.