Current Characterization Results

Generated: 2026-05-25T06:52:18.096448+00:00
Git commit: 21ffb3d4f77956d008b1815a3c0d46e0188ac390

Canonical Full-Trace CPU Summary

Source: dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl. This is CPU-only parsing of the compact formatted trace with session IDs reconstructed from parent_chat_id chains.
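The session reconstruction mentioned above can be sketched as a walk up each parent chain to its root. This is only an illustration, not the parser that produced the summary: `parent_chat_id` comes from the trace description, while the `chat_id` field name, the tuple input format, and the handling of orphans (a parent that falls outside the trace window becomes its own root) are assumptions.

```python
from typing import Dict, Iterable, Optional, Tuple


def reconstruct_sessions(
    records: Iterable[Tuple[str, Optional[str]]],
) -> Dict[str, str]:
    """Map every chat_id to the chat_id at the root of its parent chain.

    `records` yields (chat_id, parent_chat_id) pairs; parent_chat_id is None
    for the first turn of a session. `chat_id` is an assumed field name.
    """
    parent: Dict[str, Optional[str]] = dict(records)
    root: Dict[str, str] = {}

    for start in parent:
        path = []
        seen = set()
        node = start
        # Walk up until we reach an already-resolved node or a chain end.
        while node not in root:
            path.append(node)
            seen.add(node)
            up = parent.get(node)
            if up is None or up not in parent or up in seen:
                # First turn, orphan (parent not in trace), or cycle guard:
                # treat this node as the root of its own session.
                root[node] = node
                break
            node = up
        session_id = root[node]
        for visited in path:
            root[visited] = session_id  # cache the result for later lookups
    return root


if __name__ == "__main__":
    demo = [("a", None), ("b", "a"), ("c", "b"), ("x", None)]
    print(reconstruct_sessions(demo))
```

Using the root chat ID as the session ID keeps the mapping deterministic; any other stable label per chain would work equally well.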

| Metric | Value |
| --- | --- |
| Requests | 2,114,220 |
| Sessions | 1,307,276 |
| Trace span | 7,199.975 s |
| Input tokens p50/p90/p99 | 20,030 / 87,855 / 125,527 |
| Output tokens p50/p90/p99 | 80 / 811 / 6,615 |
| Input/output ratio p50/p90/p99 | 217.8 / 1,204.4 / 4,251.6 |
| Turns/session p50/p90/p99/max | 1 / 1 / 18 / 3,091 |
| Session input tokens p50/p90/p99/max | 12,486 / 72,676 / 974,934 / 156,756,974 |
| Top 1% / 5% / 10% sessions by input-token mass | 46.5% / 66.5% / 74.6% |

Immediate reading: the full trace strongly supports long-input/short-output and heavy-tailed session token mass. It does not by itself prove online sequentiality or actual cache-hit reuse; those require runtime timestamps and cache-hit fields.
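The heavy-tail figures (top 1% / 5% / 10% of sessions by input-token mass) can be reproduced with a short calculation like the one below. It is a minimal sketch on toy data: the function name and the rounding rule for how many sessions count as the "top" fraction are assumptions, and the report's own script may round differently.

```python
from typing import Sequence


def top_share(session_tokens: Sequence[int], fraction: float) -> float:
    """Share of total token mass held by the top `fraction` of sessions."""
    if not session_tokens:
        raise ValueError("need at least one session")
    total = sum(session_tokens)
    if total == 0:
        raise ValueError("total token mass is zero")
    ordered = sorted(session_tokens, reverse=True)
    # Assumed rounding rule: floor, but always count at least one session.
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / total


if __name__ == "__main__":
    # Toy data: one very heavy session among 99 light ones.
    toy = [1000] + [10] * 99
    for frac in (0.01, 0.05, 0.10):
        print(f"top {frac:.0%}: {top_share(toy, frac):.1%}")
```

On the real trace the input would be the per-session sum of input tokens after session reconstruction.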

Existing Run Summaries

| Run | OK/Req | TTFT p50/p90 | E2E p50/p90 | TPOT p90 | GPU mean util | GPU imbalance |
| --- | --- | --- | --- | --- | --- | --- |
| outputs/gpu_ab_combined | 198/200 | 1.01/9.36 | 5.05/30.2 | 0.0732 | 30.5 | 3.24 |
| outputs/gpu_ab_pdsep | 187/200 | 1.99/13.5 | 7.11/34.8 | 0.0742 | 12.4 | 11.1 |
| outputs/contention_16s_ts10 | 498/500 | 0.826/9.71 | 5.8/51 | 0.103 | 23 | 2.31 |
| outputs/contention_16s_elastic | 498/500 | 0.929/11 | 6.47/48.4 | 0.117 | 26.3 | 2.6 |
| outputs/combined_1000req | 998/1000 | 0.393/2.57 | 3.22/28 | 0.113 | n/a | n/a |
| outputs/exp3_pd_sep_tp1_mooncake | 796/1000 | 3.47/29 | 9.75/63.9 | 0.0739 | n/a | n/a |

Pairwise Comparisons

| Comparison | TTFT p50 Δ | TTFT p90 Δ | E2E p50 Δ | E2E p90 Δ | TPOT p90 Δ | Wall-clock Δ |
| --- | --- | --- | --- | --- | --- | --- |
| combined_vs_pdsep_200 | +98.1% | +44.8% | +40.9% | +15.2% | +1.3% | +142.3% |
| contention_baseline_vs_elastic_500 | +12.4% | +13.4% | +11.5% | -5.1% | +13.6% | -0.6% |
| combined_1000_vs_pdsep_mooncake | +782.0% | +1030.7% | +202.9% | +128.3% | -34.8% | +119.2% |
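The Δ columns are consistent with the relative change of the second-named run against the first, which can be checked against the run summaries. The helper below is a sketch with a hypothetical function name; because the run summaries are rounded, recomputed values match the reported deltas only approximately (for example, TTFT p50 of 1.01 and 1.99 gives about +97.0% against the reported +98.1%).

```python
def relative_delta(baseline: float, candidate: float) -> float:
    """Percent change of `candidate` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    # E2E p90, outputs/gpu_ab_combined vs outputs/gpu_ab_pdsep
    print(f"{relative_delta(30.2, 34.8):+.1f}%")
    # E2E p90, outputs/contention_16s_ts10 vs outputs/contention_16s_elastic
    print(f"{relative_delta(51.0, 48.4):+.1f}%")
```

A positive delta on a latency metric therefore means the second run is slower.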

What We Can Say Now

  • partially_supported: Batch 0 substrate audit is only partially complete for existing runs. Supporting data: metrics.jsonl lacks actual dispatch/finish timestamps in current artifacts. Next: Add request dispatch and finish/error timestamps to future replayer/proxy metrics.
  • supported_for_trace_shape: Batch 1 workload shape can be characterized from formatted traces and metrics. Supporting data: full compact trace CPU summary in full_trace_summary.json: input p50/p90/p99 = 20k/87.9k/125.5k, output p50/p90/p99 = 80/811/6.6k, top 1% sessions hold 46.5% of input-token mass. Next: Add cache-hit joined records for actual reuse decomposition.
  • supported_by_existing_artifact: Static PD separation is worse than combined in the existing 200-request GPU A/B run, with higher TTFT and E2E at both p50 and p90. Supporting data: outputs/gpu_ab_combined vs outputs/gpu_ab_pdsep metrics.summary.json. Next: Refresh with the PD matrix, multiple seeds, and a cudagraph-enabled methodology.
  • supported_by_existing_artifact: Elastic transfer-based migration does not improve the high-contention 500-request run: TTFT p50/p90, E2E p50, and TPOT p90 are 11-14% higher, and only E2E p90 improves (-5.1%). Supporting data: outputs/contention_16s_ts10 vs outputs/contention_16s_elastic metrics.summary.json and gpu_util.csv. Next: Attribute whether the failure comes from trigger quality, transfer overhead, or the wrong load regime.
  • not_yet_supported: PD-colo prefill/decode interference is not yet directly proven by step-level data in this package. Supporting data: No decode-step and prefill-overlap timestamp artifact found in summarized runs. Next: Run Batch 2 controlled same-worker/different-worker injection with step timestamps.
  • partially_supported: Session hot-spot residual imbalance is suggested but not fully attributed. Supporting data: gpu_util.csv shows per-GPU mean-util imbalance in existing runs. Next: Collect per-worker queue delay, session-to-worker map, and per-session token mass per worker.
  • not_yet_supported: SRR is not measured by existing fixed-request runs. Supporting data: No arrival-rate sweep artifacts found. Next: Implement Batch 4 Poisson session-arrival SRR sweep.
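For the Poisson session-arrival sweep named in the last item, the arrival schedule itself can be generated as below. This sketch covers only arrival-time generation for a homogeneous Poisson process; the function name, the rate grid, and the duration are hypothetical, and how SRR is computed from the resulting runs is not specified in this section.

```python
import random
from typing import List


def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int = 0) -> List[float]:
    """Arrival times of a homogeneous Poisson process on [0, duration_s)."""
    if rate_per_s <= 0 or duration_s <= 0:
        raise ValueError("rate and duration must be positive")
    rng = random.Random(seed)  # seeded so a sweep point is reproducible
    times: List[float] = []
    # Inter-arrival gaps of a Poisson process are exponentially distributed.
    t = rng.expovariate(rate_per_s)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


if __name__ == "__main__":
    # Hypothetical sweep grid: session arrivals per second.
    for rate in (0.5, 1.0, 2.0, 4.0):
        arrivals = poisson_arrivals(rate, duration_s=600.0, seed=42)
        print(f"rate={rate}/s -> {len(arrivals)} sessions")
```

Fixing the seed per sweep point and varying it across repetitions keeps the schedule reproducible while still allowing multiple seeds per rate.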

Main Reviewer Risks

  • high: Session sequentiality not proven - Add dispatch/finish timestamps and run Batch 0 before SRR claims.
  • medium: Legacy PD-sep data may not match final methodology - Use fresh PD matrix for paper-grade claims.
  • medium: GPU util is not a sufficient hot-spot proof - Add route-decision and per-worker queue logs for Batch 3.
  • medium: Cache reuse decomposition is incomplete without joined hash/cache-hit data - Emit hash_ids/session_id/cached_tokens in the same per-request record.
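One possible shape for the joined per-request record is sketched below. `hash_ids`, `session_id`, and `cached_tokens` are the fields named above; `request_id`, `dispatch_ts`, `finish_ts`, and `error`, along with all types, are assumptions added to cover the dispatch/finish timestamp gap and are not taken from the existing metrics.jsonl schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class RequestRecord:
    request_id: str              # hypothetical field
    session_id: str              # named in the risk list
    hash_ids: List[int]          # named in the risk list; int type is assumed
    cached_tokens: int           # named in the risk list
    dispatch_ts: float           # hypothetical: time the request was sent
    finish_ts: Optional[float]   # hypothetical: None if the request failed
    error: Optional[str] = None  # hypothetical: error message, if any


def to_jsonl_line(record: RequestRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    rec = RequestRecord(
        request_id="req-0001",
        session_id="sess-a",
        hash_ids=[101, 102, 103],
        cached_tokens=2048,
        dispatch_ts=12.5,
        finish_ts=14.0,
    )
    print(to_jsonl_line(rec))
```

Keeping all of these fields in one record avoids a later join across separate logs when attributing cache reuse to sessions.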