Current Characterization Results
Generated: 2026-05-25T06:52:18.096448+00:00
Git commit: 21ffb3d4f77956d008b1815a3c0d46e0188ac390
Canonical Full-Trace CPU Summary
Source: dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl.
This is CPU-only parsing of the compact formatted trace with session IDs
reconstructed from parent_chat_id chains.
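
One way to reconstruct session IDs from parent_chat_id chains is to walk each request up to the root of its chain and use that root as the session ID. The sketch below is illustrative only: the `chat_id` field name, the function name, and the rule that a parent missing from the trace window starts a new session are assumptions, not the report's actual parser.

```python
def assign_sessions(records):
    """Map each chat_id to the root of its parent_chat_id chain.

    Assumes every record is a dict with a unique "chat_id" (hypothetical
    field name) and an optional "parent_chat_id".
    """
    parent = {r["chat_id"]: r.get("parent_chat_id") for r in records}
    root_of = {}
    for cid in parent:
        path, seen, cur = [], set(), cid
        while cur not in root_of:
            seen.add(cur)
            p = parent.get(cur)
            # A request is a root if it has no parent, its parent is outside
            # the trace window, or following the parent would form a cycle.
            if p is None or p not in parent or p in seen:
                root_of[cur] = cur
            else:
                path.append(cur)
                cur = p
        root = root_of[cur]
        for node in path:  # memoize the whole chain
            root_of[node] = root
    return root_of


records = [
    {"chat_id": "a", "parent_chat_id": None},
    {"chat_id": "b", "parent_chat_id": "a"},
    {"chat_id": "c", "parent_chat_id": "b"},
    {"chat_id": "d", "parent_chat_id": "not-in-window"},
]
sessions = assign_sessions(records)
```

Memoizing resolved chains keeps the pass close to linear in the number of requests, which matters at the scale of this trace.
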
| Metric | Value |
|---|---|
| Requests | 2,114,220 |
| Sessions | 1,307,276 |
| Trace span | 7,199.975 s |
| Input tokens p50/p90/p99 | 20,030 / 87,855 / 125,527 |
| Output tokens p50/p90/p99 | 80 / 811 / 6,615 |
| Input/output ratio p50/p90/p99 | 217.8 / 1,204.4 / 4,251.6 |
| Turns/session p50/p90/p99/max | 1 / 1 / 18 / 3,091 |
| Session input tokens p50/p90/p99/max | 12,486 / 72,676 / 974,934 / 156,756,974 |
| Top 1% / 5% / 10% sessions by input-token mass | 46.5% / 66.5% / 74.6% |
Immediate reading: the full trace strongly supports long-input/short-output and heavy-tailed session token mass. It does not by itself prove online sequentiality or actual cache-hit reuse; those require runtime timestamps and cache-hit fields.
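
The concentration row in the table (top 1% / 5% / 10% of sessions by input-token mass) can be reproduced with a computation along these lines. This is a sketch: the function name and the ceil-based cutoff are assumptions, and the script that generated the report may round the cutoff differently.

```python
import math


def top_share(session_tokens, fraction):
    """Share of total input tokens held by the heaviest `fraction` of sessions."""
    ordered = sorted(session_tokens, reverse=True)
    # Assumption: round the cutoff up so at least one session is always included.
    k = max(1, math.ceil(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)


# Toy data, not taken from the trace: one heavy session dominates.
toy_tokens = [70, 20, 5, 5]
toy_top_half = top_share(toy_tokens, 0.5)
```

On the real trace, `session_tokens` would be the per-session sum of request input tokens, and `fraction` would be 0.01, 0.05, and 0.10.
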
Existing Run Summaries
| Run | OK/Req | TTFT p50/p90 | E2E p50/p90 | TPOT p90 | GPU mean util | GPU imbalance |
|---|---|---|---|---|---|---|
| outputs/gpu_ab_combined | 198/200 | 1.01/9.36 | 5.05/30.2 | 0.0732 | 30.5 | 3.24 |
| outputs/gpu_ab_pdsep | 187/200 | 1.99/13.5 | 7.11/34.8 | 0.0742 | 12.4 | 11.1 |
| outputs/contention_16s_ts10 | 498/500 | 0.826/9.71 | 5.8/51 | 0.103 | 23 | 2.31 |
| outputs/contention_16s_elastic | 498/500 | 0.929/11 | 6.47/48.4 | 0.117 | 26.3 | 2.6 |
| outputs/combined_1000req | 998/1000 | 0.393/2.57 | 3.22/28 | 0.113 | n/a | n/a |
| outputs/exp3_pd_sep_tp1_mooncake | 796/1000 | 3.47/29 | 9.75/63.9 | 0.0739 | n/a | n/a |
Pairwise Comparisons
| Comparison | TTFT p50 Δ | TTFT p90 Δ | E2E p50 Δ | E2E p90 Δ | TPOT p90 Δ | Wall-clock Δ |
|---|---|---|---|---|---|---|
| combined_vs_pdsep_200 | +98.1% | +44.8% | +40.9% | +15.2% | +1.3% | +142.3% |
| contention_baseline_vs_elastic_500 | +12.4% | +13.4% | +11.5% | -5.1% | +13.6% | -0.6% |
| combined_1000_vs_pdsep_mooncake | +782.0% | +1030.7% | +202.9% | +128.3% | -34.8% | +119.2% |
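
The Δ columns are consistent with the relative change of the second-named run over the first-named run, so a positive value means the second run's number is higher. The sketch below recomputes two cells from the rounded values in the run-summary table. The function name is hypothetical, and small mismatches against the reported Δ (for example +97.0% here versus +98.1% reported for TTFT p50) may come from the report using unrounded inputs, which is an assumption.

```python
def rel_change_pct(first, second):
    """Relative change of `second` over `first`, in percent."""
    return (second - first) / first * 100.0


# combined_vs_pdsep_200, TTFT p50: gpu_ab_combined = 1.01, gpu_ab_pdsep = 1.99
ttft_p50_delta = rel_change_pct(1.01, 1.99)

# contention_baseline_vs_elastic_500, E2E p90: ts10 = 51, elastic = 48.4
e2e_p90_delta = rel_change_pct(51, 48.4)
```
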
What We Can Say Now
- partially_supported: Batch 0 substrate audit is only partially complete for existing runs. Supporting data: metrics.jsonl lacks actual dispatch/finish timestamps in current artifacts. Next: Add request dispatch and finish/error timestamps to future replayer/proxy metrics.
- supported_for_trace_shape: Batch 1 workload shape can be characterized from formatted traces and metrics. Supporting data: full compact trace CPU summary in full_trace_summary.json: input p50/p90/p99 = 20k/87.9k/125.5k, output p50/p90/p99 = 80/811/6.6k, top 1% sessions hold 46.5% of input-token mass. Next: Add cache-hit joined records for actual reuse decomposition.
- supported_by_existing_artifact: Static PD separation is worse than combined in existing 200-request GPU A/B. Supporting data: outputs/gpu_ab_combined vs outputs/gpu_ab_pdsep metrics.summary.json. Next: Refresh with PD matrix, multiple seeds, cudagraph-enabled methodology.
- supported_by_existing_artifact: Elastic transfer-based migration does not improve the high-contention 500-request run: TTFT p50/p90, E2E p50, and TPOT p90 all rise by roughly 11% to 14%, and only E2E p90 improves (-5.1%). Supporting data: outputs/contention_16s_ts10 vs outputs/contention_16s_elastic metrics.summary.json and gpu_util.csv. Next: Attribute whether failure is trigger quality, transfer overhead, or wrong load regime.
- not_yet_supported: PD-colo prefill/decode interference is not yet directly proven by step-level data in this package. Supporting data: No decode-step and prefill-overlap timestamp artifact found in summarized runs. Next: Run Batch 2 controlled same-worker/different-worker injection with step timestamps.
- partially_supported: Session hot-spot residual imbalance is suggested but not fully attributed. Supporting data: gpu_util.csv shows per-GPU mean-util imbalance in existing runs. Next: Collect per-worker queue delay, session-to-worker map, and per-session token mass per worker.
- not_yet_supported: SRR is not measured by existing fixed-request runs. Supporting data: No arrival-rate sweep artifacts found. Next: Implement Batch 4 Poisson session-arrival SRR sweep.
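
The Batch 4 sweep named in the last item needs session start times drawn from a Poisson process at each swept arrival rate. A minimal sketch of that generator follows, using exponential inter-arrival gaps. The function name, the rate values, and the seeding scheme are assumptions for illustration, not part of the existing replayer.

```python
import random


def poisson_arrivals(rate_per_s, duration_s, seed=0):
    """Session start times (seconds) for a Poisson process over [0, duration_s)."""
    rng = random.Random(seed)  # fixed seed so a sweep point can be replayed
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential gap with mean 1/rate
        if t >= duration_s:
            return times
        times.append(t)


# Hypothetical sweep: one arrival schedule per session-arrival rate.
sweep = {rate: poisson_arrivals(rate, duration_s=60.0, seed=1) for rate in (0.5, 1.0, 2.0)}
```

Running several seeds per rate would also give a variance estimate for each sweep point.
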
Main Reviewer Risks
- high: Session sequentiality not proven - Add dispatch/finish timestamps and run Batch 0 before SRR claims.
- medium: Legacy PD-sep data may not match final methodology - Use fresh PD matrix for paper-grade claims.
- medium: GPU util is not a sufficient hot-spot proof - Add route-decision and per-worker queue logs for Batch 3.
- medium: Cache reuse decomposition is incomplete without joined hash/cache-hit data - Emit hash_ids/session_id/cached_tokens in the same per-request record.
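
The first and last risks both come down to what each per-request record carries. The sketch below shows one possible record shape and a sequentiality check built on it. The names `hash_ids`, `session_id`, and `cached_tokens` come from the list above. `request_id`, `dispatch_ts`, `finish_ts`, and both type and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RequestRecord:
    request_id: str          # hypothetical field name
    session_id: str
    dispatch_ts: float       # hypothetical: when the replayer sent the request
    finish_ts: float         # hypothetical: when the response or error completed
    cached_tokens: int = 0
    hash_ids: list = field(default_factory=list)


def overlapping_sessions(records):
    """Return session_ids where a request was dispatched before the previous one finished."""
    by_session = defaultdict(list)
    for r in records:
        by_session[r.session_id].append(r)
    violators = set()
    for sid, reqs in by_session.items():
        reqs.sort(key=lambda r: r.dispatch_ts)
        # Checking neighbours in dispatch order is enough to detect any overlap.
        if any(cur.dispatch_ts < prev.finish_ts for prev, cur in zip(reqs, reqs[1:])):
            violators.add(sid)
    return violators


demo = [
    RequestRecord("r1", "s1", 0.0, 1.0),
    RequestRecord("r2", "s1", 1.5, 2.0),
    RequestRecord("r3", "s2", 0.0, 5.0),
    RequestRecord("r4", "s2", 1.0, 2.0),
]
violators = overlapping_sessions(demo)
```

An empty result over a full run would support the sequentiality assumption. A non-empty one identifies the sessions to inspect first.
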