agentic-kvc/analysis/characterization
Gahow Wang 876d09db83 Add chatbot T_external CDF; overlay on f3a vs agentic
User-requested comparison of inter-turn external gap distribution between
the production agentic trace (Qwen3-Coder) and a production chatbot trace
(qwen3-max chat). Both computed as
  T_external = next_turn.start_ms - prev_turn.end_ms
on the same kind of pipeline (raw input + raw output join on request_id,
session structure from the formatted trace's parent_chat_id chains).

The chatbot trace lives as two files on dash0:
  input  : bailian-trace/qwen-trace-260321-260327/qwen3-max-input-032309-032311.jsonl
  output : bailian-trace/qwen-trace-260321-260327/qwen3-max-output-032109-032711.jsonl
The raw input has no session_id (uuid is per-record, user_id has only 4
distinct tenant values for 346 k requests). We recover session structure
from the formatted file (qwen_chat_blksz_64_032309-032311.jsonl, which
groups requests by parent_chat_id), matching each formatted record to a
raw record by (timestamp, output_length). prompt_token_num is anonymized
to 0 in this trace, so the raw generate_token_num (matched against
output_length) is the only usable length field for the join key.
End time is derived from time_to_finish_token (a duration in ms), not from
the "time" string field, which is the log-write time rather than request
completion.
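
A minimal sketch of the gap computation described above, not the code in
scripts/compute_inter_turn_gap_chatbot.py. It assumes each joined record
already carries a session_id and a start_ms (both names are hypothetical),
and that a turn's end is start_ms + time_to_finish_token.

```python
from collections import defaultdict


def inter_turn_gaps_s(records):
    """Return T_external in seconds for every consecutive turn pair.

    Each record is a dict with (assumed) keys:
      session_id, start_ms, time_to_finish_token (duration in ms).
    """
    by_session = defaultdict(list)
    for rec in records:
        by_session[rec["session_id"]].append(rec)

    gaps = []
    for turns in by_session.values():
        turns.sort(key=lambda r: r["start_ms"])
        for prev, nxt in zip(turns, turns[1:]):
            # End time comes from the duration field, not the log-write time.
            prev_end_ms = prev["start_ms"] + prev["time_to_finish_token"]
            gaps.append((nxt["start_ms"] - prev_end_ms) / 1000.0)
    return gaps
```

Single-turn sessions contribute no gaps, so only multi-turn sessions
appear in the resulting distribution.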

Numbers (chatbot, 42 228 inter-turn gaps over 32 262 multi-turn sessions):
  p25  4.85 s   p50  7.18 s   p75  8.22 s   p90 15.0 s   p99  43 s
  4%  gaps < 1 s   29% < 5 s   78% < 10 s   98% < 30 s

Compare to agentic (same metric, scripts/compute_inter_turn_gap_remote.py):
  p25  0.69 s   p50  1.6  s   p75  8.6  s   p90  44  s   p99 738 s
  39% gaps < 1 s   67% < 5 s   77% < 10 s   87% < 30 s

Distributions differ in shape, not just location:
- Chatbot is tight, unimodal around 5–10 s (human interaction).
- Agentic is bimodal: a sub-second autonomous tool-call mode (39 % < 1 s)
  plus a long-pause tail (13 % > 30 s, p99 = 738 s) for sessions where
  the operator steps away.
- The sub-second tool-call mass is where dispatch coupling lives —
  those turns have W_turn ≫ T_external for any current scheduler.

The earlier "chatbot has T_human ≈ 30 s" hand-wave was wrong empirically.
The right framing for §2.3 is "agentic has a sub-second tool-call mode
that chatbot doesn't", not "chatbot has think-time and agentic doesn't".

Adds:
- scripts/compute_inter_turn_gap_chatbot.py: dash0-side aggregator
  (raw input/output join + formatted alignment by ts + output_length)
- analysis/characterization/data/chatbot_inter_turn_gap.json: CDF cache
- scripts/plot_inter_turn_gap.py: overlays both curves on log-x

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 14:49:44 +08:00

Characterization Analyzer Runbook

CPU-only scaffold for Batch 0 and Batch 1 in analysis/characterization_todo_for_interns.md.

This directory has three components:

  • analyze.py: Batch 0/1 analyzer for trace and per-request metrics.
  • summarize_runs.py: CPU-only audit of already completed benchmark directories.
  • protocols.md: exact protocol for Batch 2-6 experiments that require fresh GPU runs or additional instrumentation.

The analyzer reads existing trace and metrics artifacts and writes:

outputs/characterization/<date>/<task_name>/
├── manifest.json
├── raw/
├── summary.json
├── summary.md
├── audit.md
├── session_concurrency.json
├── session_arrival_stats.json
├── turn_interval_stats.json
├── trace_profile.json
├── invalid_runs.md
├── workload_summary.json
├── kv_footprint_summary.json
├── reuse_decomposition.json
├── session_skew.json
├── append_delta_stats.json
└── figures/

If matplotlib is installed, simple PNG/PDF figures are emitted under figures/. If it is not installed, all JSON/Markdown data artifacts are still written.

Canonical Data Sources

Canonical full traces live on dash0:

  • formatted trace: ~/ali-trace/trace-glm5.1-formatted/
  • raw unformatted trace: ~/ali-trace/trace-glm5.1/

For the current GLM-5.1 characterization, prefer the compact formatted file:

~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl

Do not pass 051315-051317-raw.jsonl or the files under ~/ali-trace/trace-glm5.1/ directly to this analyzer unless you first convert them to the formatted schema. Those raw files are tens to hundreds of GiB and contain full prompt payloads rather than the compact characterization schema.

The analyzer is CPU-only. For full trace characterization, either:

  • run it on dash0 against the formatted JSONL files without starting any GPU service; or
  • copy/rsync the needed trace files from dash0 to this repository or another local path, then run the analyzer locally.

Only light directory/field inspection is needed on dash0 before choosing which trace file to analyze.

The raw unformatted directory is listed above for provenance only. The analyzer expects formatted JSONL records, so convert raw files to the formatted schema before passing them to --trace.

Inputs

Trace JSONL:

  • Expected formatted fields: chat_id, parent_chat_id, timestamp, input_length, output_length, type, turn, hash_ids, optional session_id.
  • If session_id is absent, sessions are reconstructed from parent_chat_id chains.
  • timestamp is treated as scheduled trace time, not proof of actual dispatch time.
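
The parent_chat_id reconstruction can be sketched as follows. This is an illustration under assumptions, not the analyzer's actual code: the function name is hypothetical, and a record is treated as a session root whenever its parent_chat_id does not match any chat_id in the trace (which covers a missing value or a sentinel).

```python
from collections import defaultdict


def reconstruct_sessions(records):
    """Group chat_ids into sessions by following parent_chat_id to a root."""
    parent = {r["chat_id"]: r.get("parent_chat_id") for r in records}

    def root_of(chat_id):
        seen = {chat_id}
        while True:
            p = parent[chat_id]
            # Stop at a root (parent absent from the trace) or on a cycle.
            if p not in parent or p in seen:
                return chat_id
            seen.add(p)
            chat_id = p

    sessions = defaultdict(list)
    for r in records:
        sessions[root_of(r["chat_id"])].append(r["chat_id"])
    return dict(sessions)
```

Each session is keyed by the chat_id of its root turn, and members keep the order in which they appear in the input.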

Metrics JSONL:

  • Expected replayer fields: request_id, session_id, turn_id, trace_timestamp_s, input_length, output_length, cached_tokens, latency_s, ttft_s, tpot_s, actual_output_tokens, error.
  • If the metrics file is from the current replayer, it does not include actual dispatch/finish wall-clock timestamps. Batch 0 will therefore mark actual session sequentiality as unavailable and separately report a scheduled estimate from trace_timestamp_s + latency_s.

Proxy breakdown:

  • Optional JSON/JSONL with fields such as request_id, t_proxy_recv, t_first_token, t_done, cache_hit, estimated_new_tokens, route_class, routed_to, policy.
  • Batch 0 can prove actual per-session in-flight concurrency only when these timing rows can be joined to analyzed requests by request_id.
  • Existing proxy breakdown artifacts may not contain session_id; without a request-id join to trace/metrics, they can still support append/cache-hit statistics but not per-session concurrency.
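
A sketch of an exact request_id join that keeps unmatched rows on both sides, which is what a per-session concurrency analysis would need before trusting the joined timings. The function name and return shape are hypothetical; the analyzer's own join logic is not shown here.

```python
def exact_join(left_rows, right_rows, key="request_id"):
    """Join two lists of dicts on an exact key match.

    Returns (matched, unmatched_left, unmatched_right).
    """
    right = {row[key]: row for row in right_rows}
    matched, unmatched_left = [], []
    for row in left_rows:
        other = right.pop(row[key], None)
        if other is None:
            unmatched_left.append(row)
        else:
            # Right-hand fields win on name collisions.
            matched.append({**row, **other})
    # Whatever is left on the right never found a partner.
    return matched, unmatched_left, list(right.values())
```

No normalization is applied to the key, so IDs that differ only in case or prefix stay unmatched.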

Run config:

  • Optional JSON, usually outputs/<run>/config.json.
  • Used for manifest fields such as policy, time_scale, and request count when available.

Commands

Trace-only dry run:

python3 analysis/characterization/analyze.py \
  --trace traces/w600_r0.0015_st30.jsonl \
  --task-name w600_trace_only \
  --overwrite

Trace plus replayer metrics:

python3 analysis/characterization/analyze.py \
  --trace traces/w600_r0.0015_st30.jsonl \
  --metrics outputs/smoke_test/metrics.jsonl \
  --task-name smoke_trace_metrics \
  --overwrite

Proxy breakdown append/cache analysis:

python3 analysis/characterization/analyze.py \
  --breakdown outputs/contention_16s_elastic/breakdown.json \
  --config outputs/contention_16s_elastic/config.json \
  --task-name contention_breakdown \
  --overwrite

Full trace on dash0, CPU-only:

python3 analysis/characterization/analyze.py \
  --trace ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
  --task-name full_trace_characterization \
  --overwrite

Local run after copying from dash0:

rsync -av dash0:~/ali-trace/trace-glm5.1-formatted/<trace-file>.jsonl traces/
python3 analysis/characterization/analyze.py \
  --trace traces/<trace-file>.jsonl \
  --task-name full_trace_characterization \
  --overwrite

By default the analyzer records file size and mtime but skips full SHA256 hashing, because canonical raw trace files can be hundreds of GiB. Add --hash-inputs only when you intentionally want a full file hash.
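
The trade-off can be sketched like this: a stat call is constant-time, while a full hash has to read every byte. The function and field names below are hypothetical and only illustrate the idea behind --hash-inputs.

```python
import hashlib
import os


def describe_input(path, hash_inputs=False, chunk_bytes=1 << 20):
    """Record cheap file metadata, and a SHA256 digest only on request."""
    st = os.stat(path)
    info = {"path": str(path), "size_bytes": st.st_size, "mtime": st.st_mtime}
    if hash_inputs:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Stream in fixed-size chunks so memory use stays flat.
            for chunk in iter(lambda: f.read(chunk_bytes), b""):
                digest.update(chunk)
        info["sha256"] = digest.hexdigest()
    return info
```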

KV footprint requires a model-specific value:

python3 analysis/characterization/analyze.py \
  --trace traces/w600_r0.0015_st30.jsonl \
  --kv-bytes-per-token 98304 \
  --task-name w600_with_kv_estimate \
  --overwrite
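
For a model with a standard per-layer key/value cache (multi-head or grouped-query attention), the value can be derived from the architecture as sketched below. The architecture numbers are an assumption chosen only because they reproduce the 98304 used in the example command; they are not confirmed for any particular model here, and architectures with compressed or latent KV caches need a different formula.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV cache bytes per token for a plain per-layer K/V cache."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed example: 48 layers, 4 KV heads, head_dim 128, 2-byte dtype.
print(kv_bytes_per_token(48, 4, 128, 2))  # 98304
```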

Summarize existing completed runs:

python3 analysis/characterization/summarize_runs.py

This writes:

analysis/characterization/current_results/
├── run_summaries.json
├── comparisons.json
├── claim_matrix.json
├── reviewer_risk_register.json
├── current_results.md
├── characterization_claim_matrix.md
├── all_figures_index.md
├── reviewer_risk_register.md
└── reproduction_commands.sh

Batch 0 Semantics

The online-serving invariant is:

Each session has at most one in-flight turn.

The analyzer reports:

  • actual interval status, computed from dispatch and finish/error timestamps when they are available;
  • scheduled estimate from trace timestamps plus latency when available;
  • per-session max in-flight;
  • session start-time distribution;
  • turn inter-arrival distribution;
  • attempted/completed/error counts and goodput when metrics exist;
  • run classification.

Important limitation: trace timestamps alone cannot prove actual replay sequentiality. A run is only classified as online_realistic when actual per-request dispatch and finish/error timestamps prove max_inflight_per_session <= 1.
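
The invariant check reduces to a sweep over per-session intervals, sketched below. The row keys start_s and end_s are hypothetical: for the scheduled estimate they would be trace_timestamp_s and trace_timestamp_s + latency_s, and for the actual check they would have to come from real dispatch and finish/error timestamps. The sketch assumes end_s > start_s for every row.

```python
from collections import defaultdict


def max_inflight_per_session(rows):
    """Return {session_id: peak number of overlapping turns}."""
    events = defaultdict(list)
    for r in rows:
        events[r["session_id"]].append((r["start_s"], 1))
        events[r["session_id"]].append((r["end_s"], -1))

    peaks = {}
    for sid, evs in events.items():
        # At equal times -1 sorts before +1, so a turn that starts exactly
        # when the previous one ends is not counted as overlapping.
        evs.sort()
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[sid] = peak
    return peaks
```

A run satisfies the invariant when every value in the returned mapping is at most 1.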

Batch 1 Semantics

The analyzer reports:

  • input/output CDF stats;
  • input/output ratio;
  • KV footprint CDF stats when --kv-bytes-per-token is supplied;
  • session skew and top-session contribution;
  • append/uncached token stats when cached_tokens or cache_hit exists;
  • reuse decomposition when both cached-token fields and hash_ids exist.
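
As one illustration of the session-skew item, the sketch below computes the share of input tokens contributed by the heaviest sessions. The metric definition, the function name, and the top_fraction parameter are assumptions; the analyzer's exact skew statistics are not specified in this runbook.

```python
from collections import Counter


def top_session_share(records, top_fraction=0.01):
    """Fraction of all input tokens owned by the top sessions."""
    tokens = Counter()
    for r in records:
        tokens[r["session_id"]] += r["input_length"]

    ranked = sorted(tokens.values(), reverse=True)
    # Always include at least one session, even for tiny traces.
    k = max(1, int(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0
```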

Reuse decomposition is conservative:

  • intra_session: cached hash block was seen earlier in the same session;
  • cross_session: cached hash block was seen earlier in another session;
  • shared/system-prefix: early-position block appears in many sessions;
  • unclassified: cached tokens could not be mapped to a previously seen hash block.
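
A simplified sketch of the classification, covering only the intra_session, cross_session, and unclassified categories (the shared/system-prefix heuristic is left out). It assumes trace and metrics rows are already joined, that the cached prefix corresponds to the first cached_tokens // block_tokens entries of hash_ids, and that block_tokens is supplied by the caller; all of these are assumptions rather than documented analyzer behavior.

```python
from collections import Counter, defaultdict


def decompose_reuse(records, block_tokens):
    """Count cached hash blocks by where they were first seen."""
    seen_by = defaultdict(set)  # hash_id -> sessions that already used it
    counts = Counter()
    for r in sorted(records, key=lambda rec: rec["timestamp"]):
        sid = r["session_id"]
        n_cached = r["cached_tokens"] // block_tokens
        for h in r["hash_ids"][:n_cached]:
            owners = seen_by[h]
            if sid in owners:
                counts["intra_session"] += 1
            elif owners:
                counts["cross_session"] += 1
            else:
                # Reported as cached, but no earlier request carried it.
                counts["unclassified"] += 1
        # Register this request's blocks only after classifying it.
        for h in r["hash_ids"]:
            seen_by[h].add(sid)
    return dict(counts)
```

Checking the same session first is what keeps the split conservative: a block seen in both the current and another session is credited to intra_session.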

If cached-token/cache-hit fields are absent, reuse and append artifacts are written with status: "unavailable" and list the required fields.

Limitations

  • The script does not run a benchmark, query a live service, touch GPU state, or start any daemon.
  • Request-id joins are exact-match only; no fuzzy or normalized matching is attempted. If trace, metrics, and proxy artifacts use different request IDs, the unmatched rows are preserved under raw/.
  • Actual Batch 0 sequentiality needs actual dispatch and finish/error timestamps. Current replayer/metrics.py metrics are not enough by themselves.
  • kv_bytes_per_token depends on model architecture, layer count, KV heads, head dimension, and dtype. The analyzer will not guess it.
  • Shared/system-prefix reuse classification is a heuristic based on trace hash_ids positions and cross-session frequency. Adjust --shared-prefix-min-sessions and --system-prefix-blocks if the formatted trace provides a stronger system-prefix marker.