Current Characterization Results

Generated: 2026-05-25T06:52:18.096448+00:00
Git commit: 21ffb3d4f77956d008b1815a3c0d46e0188ac390

Canonical Full-Trace CPU Summary

Source: dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl. The summary comes from CPU-only parsing of the compact formatted trace; session IDs are reconstructed from parent_chat_id chains.
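Session reconstruction of this kind can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: only parent_chat_id is named in this report, so the chat_id key, the function name, and the handling of parents that fall outside the trace window are all assumptions. Each request is assigned the ID of the root of its parent chain.

```python
def reconstruct_sessions(records):
    """Map each chat_id to a session ID, taken as the root of its parent chain.

    `records` is an iterable of dicts with a hypothetical `chat_id` key and an
    optional `parent_chat_id` key. A request whose parent is not present in the
    trace is treated as the root of its own session (an assumed policy).
    """
    parent = {r["chat_id"]: r.get("parent_chat_id") for r in records}
    session = {}

    for chat_id in parent:
        path, seen = [], set()
        node = chat_id
        # Walk up until we hit an already-resolved node or a root.
        while node not in session:
            path.append(node)
            seen.add(node)
            up = parent.get(node)
            if up is None or up not in parent or up in seen:
                session[node] = node  # root, orphan, or cycle guard
                break
            node = up
        root = session[node]
        # Path compression: every node visited shares the same root.
        for visited in path:
            session[visited] = root
    return session


records = [
    {"chat_id": "a", "parent_chat_id": None},
    {"chat_id": "b", "parent_chat_id": "a"},
    {"chat_id": "c", "parent_chat_id": "b"},
    {"chat_id": "x"},
]
sessions = reconstruct_sessions(records)
```

Here requests a, b, and c fall into one three-turn session rooted at a, and x forms a single-turn session.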

Metric	Value
Requests	2,114,220
Sessions	1,307,276
Trace span	7,199.975 s
Input tokens p50/p90/p99	20,030 / 87,855 / 125,527
Output tokens p50/p90/p99	80 / 811 / 6,615
Input/output ratio p50/p90/p99	217.8 / 1,204.4 / 4,251.6
Turns/session p50/p90/p99/max	1 / 1 / 18 / 3,091
Session input tokens p50/p90/p99/max	12,486 / 72,676 / 974,934 / 156,756,974
Top 1% / 5% / 10% sessions by input-token mass	46.5% / 66.5% / 74.6%

Immediate reading: the full trace strongly supports a long-input/short-output workload (median input/output ratio of about 218) with heavy-tailed session token mass (the top 1% of sessions hold 46.5% of input tokens). It does not by itself prove online sequentiality or actual cache-hit reuse; those require runtime timestamps and cache-hit fields.
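The concentration row (top 1% / 5% / 10% of sessions by input-token mass) can be computed along the following lines. This is a sketch under assumptions: the function name is hypothetical, and rounding the session count up with ceil is one possible convention, not necessarily the one used to produce full_trace_summary.json.

```python
import math


def top_share(session_tokens, fraction):
    """Share of total input tokens held by the top `fraction` of sessions."""
    if not session_tokens:
        raise ValueError("need at least one session")
    ranked = sorted(session_tokens, reverse=True)
    # Assumed convention: always include at least one session, round up.
    k = max(1, math.ceil(fraction * len(ranked)))
    return sum(ranked[:k]) / sum(ranked)


# Toy data: one heavy session among ten.
toy = [100, 1, 1, 1, 1, 1, 1, 1, 1, 1]
share = top_share(toy, 0.10)  # 100 / 109
```

On the toy list, the single heaviest session holds 100 of 109 tokens, which is the same kind of heavy-tail signal the table reports at trace scale.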

Existing Run Summaries

Run	OK/Req	TTFT p50/p90	E2E p50/p90	TPOT p90	GPU mean util	GPU imbalance
outputs/gpu_ab_combined	198/200	1.01/9.36	5.05/30.2	0.0732	30.5	3.24
outputs/gpu_ab_pdsep	187/200	1.99/13.5	7.11/34.8	0.0742	12.4	11.1
outputs/contention_16s_ts10	498/500	0.826/9.71	5.8/51	0.103	23	2.31
outputs/contention_16s_elastic	498/500	0.929/11	6.47/48.4	0.117	26.3	2.6
outputs/combined_1000req	998/1000	0.393/2.57	3.22/28	0.113	n/a	n/a
outputs/exp3_pd_sep_tp1_mooncake	796/1000	3.47/29	9.75/63.9	0.0739	n/a	n/a

Pairwise Comparisons

Comparison	TTFT p50 Δ	TTFT p90 Δ	E2E p50 Δ	E2E p90 Δ	TPOT p90 Δ	Wall-clock Δ
combined_vs_pdsep_200	+98.1%	+44.8%	+40.9%	+15.2%	+1.3%	+142.3%
contention_baseline_vs_elastic_500	+12.4%	+13.4%	+11.5%	-5.1%	+13.6%	-0.6%
combined_1000_vs_pdsep_mooncake	+782.0%	+1030.7%	+202.9%	+128.3%	-34.8%	+119.2%
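
The deltas above appear to be relative changes of the second-named run against the first, that is, (second - first) / first. That direction is inferred from the numbers, not stated in the artifacts. A minimal check using the rounded values from the run summaries table (pct_delta is a hypothetical helper name):

```python
def pct_delta(baseline, candidate):
    """Relative change of `candidate` versus `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


# combined_vs_pdsep_200, E2E p90: gpu_ab_combined = 30.2, gpu_ab_pdsep = 34.8
e2e_p90 = pct_delta(30.2, 34.8)
print(f"{e2e_p90:+.1f}%")  # +15.2%

# Same comparison, TTFT p50: 1.01 vs 1.99
ttft_p50 = pct_delta(1.01, 1.99)
print(f"{ttft_p50:+.1f}%")  # +97.0%
```

The E2E p90 value reproduces the table entry. The TTFT p50 value (+97.0%) differs slightly from the reported +98.1%, presumably because the table was computed from unrounded metrics; that explanation is an assumption.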

What We Can Say Now

partially_supported: Batch 0 substrate audit is only partially complete for existing runs. Supporting data: metrics.jsonl lacks actual dispatch/finish timestamps in current artifacts. Next: Add request dispatch and finish/error timestamps to future replayer/proxy metrics.
supported_for_trace_shape: Batch 1 workload shape can be characterized from formatted traces and metrics. Supporting data: full compact-trace CPU summary in full_trace_summary.json (input tokens p50/p90/p99 = 20.0k/87.9k/125.5k; output tokens p50/p90/p99 = 80/811/6.6k; the top 1% of sessions hold 46.5% of input-token mass). Next: Add cache-hit-joined records for actual reuse decomposition.
supported_by_existing_artifact: Static PD separation is worse than combined in the existing 200-request GPU A/B run. Supporting data: outputs/gpu_ab_combined vs outputs/gpu_ab_pdsep metrics.summary.json. Next: Refresh with a PD matrix, multiple seeds, and cudagraph-enabled methodology.
supported_by_existing_artifact: Elastic transfer-based migration does not improve the high-contention 500-request run. Supporting data: outputs/contention_16s_ts10 vs outputs/contention_16s_elastic metrics.summary.json and gpu_util.csv. Next: Attribute the failure to trigger quality, transfer overhead, or the wrong load regime.
not_yet_supported: PD-colo prefill/decode interference is not yet directly proven by step-level data in this package. Supporting data: No decode-step or prefill-overlap timestamp artifact was found in the summarized runs. Next: Run the Batch 2 controlled same-worker/different-worker injection with step timestamps.
partially_supported: Session hot-spot residual imbalance is suggested but not fully attributed. Supporting data: gpu_util.csv shows per-GPU mean-util imbalance in existing runs. Next: Collect per-worker queue delay, session-to-worker map, and per-session token mass per worker.
not_yet_supported: SRR is not measured by existing fixed-request runs. Supporting data: No arrival-rate sweep artifacts found. Next: Implement Batch 4 Poisson session-arrival SRR sweep.
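
The arrival side of a Poisson session-arrival sweep could be sketched as below. This only illustrates generating session start times at several arrival rates; it does not define or compute SRR. The function name, the rate values, and the 60-second window are placeholders, not settings from this package.

```python
import random


def poisson_arrival_times(rate_per_s, duration_s, seed=0):
    """Session start times (seconds) from a homogeneous Poisson process.

    Inter-arrival gaps are exponential with mean 1 / rate_per_s.
    """
    if rate_per_s <= 0 or duration_s <= 0:
        raise ValueError("rate_per_s and duration_s must be positive")
    rng = random.Random(seed)  # seeded so a sweep point is reproducible
    times = []
    t = rng.expovariate(rate_per_s)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


# Hypothetical sweep points (sessions per second) over a 60 s window.
sweep = {rate: poisson_arrival_times(rate, 60.0, seed=1) for rate in (0.5, 1.0, 2.0)}
```

Each sweep point yields a sorted list of start times that a replayer could use to launch sessions; using a different seed per repetition would give independent arrival patterns at the same rate.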

Main Reviewer Risks

high: Session sequentiality not proven - Add dispatch/finish timestamps and run Batch 0 before SRR claims.
medium: Legacy PD-sep data may not match final methodology - Use fresh PD matrix for paper-grade claims.
medium: GPU util is not a sufficient hot-spot proof - Add route-decision and per-worker queue logs for Batch 3.
medium: Cache reuse decomposition is incomplete without joined hash/cache-hit data - Emit hash_ids/session_id/cached_tokens in the same per-request record.
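
A joined per-request record and a simple sequentiality check might look like the sketch below. The fields session_id, hash_ids, and cached_tokens are named in this report; dispatch_ts, finish_ts, the class name, and the function name are hypothetical. The check treats a session as sequential when every turn is dispatched no earlier than the previous turn finished.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RequestRecord:
    session_id: str
    dispatch_ts: float  # hypothetical: when the replayer sent the request
    finish_ts: float    # hypothetical: when the response completed or errored
    cached_tokens: int = 0
    hash_ids: list = field(default_factory=list)


def sequential_fraction(records):
    """Fraction of multi-turn sessions whose turns never overlap in time.

    Returns None when there are no multi-turn sessions.
    """
    by_session = defaultdict(list)
    for r in records:
        by_session[r.session_id].append(r)
    multi = [
        sorted(rs, key=lambda r: r.dispatch_ts)
        for rs in by_session.values()
        if len(rs) > 1
    ]
    if not multi:
        return None
    ok = sum(
        all(prev.finish_ts <= nxt.dispatch_ts for prev, nxt in zip(rs, rs[1:]))
        for rs in multi
    )
    return ok / len(multi)


demo = [
    RequestRecord("s1", 0.0, 1.0),
    RequestRecord("s1", 1.5, 2.0),   # starts after the previous turn finished
    RequestRecord("s2", 0.0, 3.0),
    RequestRecord("s2", 1.0, 2.0),   # overlaps the previous turn
    RequestRecord("s3", 0.0, 1.0),   # single-turn, ignored
]
frac = sequential_fraction(demo)
```

In the demo, s1 is sequential and s2 is not, so the fraction is 0.5. Keeping cached_tokens and hash_ids in the same record means the reuse decomposition needs no separate join.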
