Joint/temporal characterizations of the full 051315 cluster trace (2.11M
req / 1.31M sessions / 2h), beyond the existing single-variable marginals:
- C1 mixture: 90.3% sessions single-turn, but multi-turn (9.7%) = 44% reqs /
67% prefill mass; continuation hazard rises 10%->94% (Lindy); heaviness
unpredictable at turn 1 (corr 0.04-0.15) => reactive routing justified.
- C2 resident/delta: resident context 11k->56k while new-prefill 2.7k->~200;
per-turn reuse ->99.6%; resident/delta ("PD tax") ->~250-450x.
- C3 prefill/decode: token mass 98.7% input / 1.3% output, BUT decode ~70% of
TIME (robust 68-71%); "decode negligible" is wrong (tokens != time). Correct
colo argument = roofline complementarity, not "no decode".
Maps each to (1) PD-colocation and (2) routing. Artifacts: compute_chars.py,
chars.json, figs/workload_chars/. Exact validation against the raw file
(cached_tokens, real timings) is pending.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Agentic workload characterization C1–C3 (full 051315 production trace)
Date 2026-05-29. Source: trace-glm5.1-formatted/051315-051317.jsonl on dash1
(release file, 2,114,220 requests / 1,307,276 sessions / 2h, type=100% coder).
This release file is the full cluster-level production trace — session skew
reproduces 46.5/66.5/74.6/87.5/96.0 exactly. Compute: compute_chars.py
(2-pass, ~65s, ~/ali-trace/.venv python). Numbers: chars.json.
⚠️ Cluster-level, not per-instance. This is one cluster's aggregate stream. Concurrent-session counts are cluster-wide totals and cannot be divided by "8 instances" or any other instance count, so do not compare them to a single deployment's instance count.
These three are NOT in the existing 13 analyzer figures (which are single-variable marginals on the older 041x traces). C1–C3 are joint/temporal and argument-bearing.
C1 — the workload is a MIXTURE, not "multi-turn agentic" (c1_session_mixture.png)
- 90.3% of sessions are single-turn; mean 1.62 turns, p99=18, max=3091.
- But multi-turn sessions (9.7%) = 44.2% of requests and 66.9% of input (prefill) mass. Single-turn = 60.2% of output (decode) mass.
- Continuation hazard P(reach k+1 | reached k): turn1→2 only 10.2%, but turn2→3 50.6%, turn5→6 87%, turn12→13 94.3% (Lindy / Pareto).
- Predictability of heaviness at cold-start is near-zero: corr(turn1_input, session_mass)=0.15, corr(turn1_input, n_turns)=0.04.
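The continuation hazard in the bullets above can be recomputed from nothing more than a list of per-session turn counts. The following is a minimal sketch, not the code in compute_chars.py; the function name `continuation_hazard` and the toy input are assumptions made for illustration.

```python
from collections import Counter

def continuation_hazard(turn_counts):
    """Return {k: P(session reaches turn k+1 | it reached turn k)}.

    turn_counts is an iterable with one integer per session (its number
    of turns). Hypothetical helper, not taken from compute_chars.py.
    """
    hist = Counter(turn_counts)
    longest = max(hist)
    # reached[k] = number of sessions with at least k turns (suffix sum).
    reached = {}
    running = 0
    for k in range(longest, 0, -1):
        running += hist.get(k, 0)
        reached[k] = running
    return {k: reached[k + 1] / reached[k] for k in range(1, longest)}

# Toy data (not from the trace): nine one-shot sessions and one 3-turn session.
toy_turn_counts = [1] * 9 + [3]
hazard = continuation_hazard(toy_turn_counts)
print(hazard)  # {1: 0.1, 2: 1.0}
```

On the toy input only 1 of 10 sessions gets past turn 1, but the one that does always continues, which is the same qualitative shape as the 10.2% and 94.3% figures above.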
Routing: heaviness is unpredictable at session start → proactive placement
cannot pre-empt hot-pin → a REACTIVE mechanism (observable-load routing /
migration) is required. But once a session has shown depth, it almost surely
continues → "observed accumulated load" is the signal that works (not turn-1
features, not cost-model prediction). The single/multi optimal strategies are
opposite (load-balance the 90% one-shot sea vs affinity-pin the deep tail) and
you can't tell them apart at turn 1 → the only viable policy starts everyone
load-balanced and becomes sticky as turns accrue. This is exactly LPWL's
emergent behavior (new_uncached≈input→by-load; new_uncached≈0→sticks), so
C1 explains why a cache-aware-load score is the right shape — it auto-segments
the mixture with no classifier.
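To make the auto-segmentation argument concrete, here is a minimal sketch of a cache-aware load score. It is an assumption about the general shape only: the source does not give LPWL's actual score function, and `Instance`, `pending_tokens`, `cached`, `new_uncached`, and `pick_instance` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str
    pending_tokens: int = 0                      # uncached work already queued here
    cached: dict = field(default_factory=dict)   # session_id -> resident prefix tokens

def new_uncached(inst, session_id, input_tokens):
    """Tokens this instance would actually have to prefill for the request."""
    return max(input_tokens - inst.cached.get(session_id, 0), 0)

def pick_instance(instances, session_id, input_tokens):
    """Choose the instance with the lowest (queued load + new uncached work)."""
    return min(
        instances,
        key=lambda i: i.pending_tokens + new_uncached(i, session_id, input_tokens),
    )

a = Instance("a", pending_tokens=5000)
b = Instance("b", pending_tokens=1000)

# Cold start: nothing is cached anywhere, so new_uncached == input on every
# instance and the choice reduces to plain load balancing.
first = pick_instance([a, b], "s1", 2700)

# Deep session: instance a holds 50k resident tokens for s2, so the tiny
# delta outweighs a's higher queue and the session sticks to a.
a.cached["s2"] = 50_000
later = pick_instance([a, b], "s2", 50_200)
print(first.name, later.name)  # b a
```

No classifier appears anywhere in the sketch: the same score behaves as by-load placement for one-shot traffic and as affinity for sessions that have accumulated resident context.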
C2 — marginal work collapses while resident state explodes (c2_work_amortization.png)
Per turn: resident context grows 11k→56k+ tokens while new prefill collapses 2.7k→~200 tokens; per-turn reuse climbs 83%→99.6%; resident/new ratio ("the PD tax") grows to ~250× by turn 12, ~450× by turn 30.
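The per-turn quantities above can be derived from two numbers per request. This is a sketch under assumptions: it takes reuse to be cached/input and the resident/new ratio to be cached/(input − cached), and the helper `amortization` and the toy session are hypothetical rather than taken from compute_chars.py or the trace.

```python
def amortization(turns):
    """turns: list of (input_tokens, cached_tokens), one pair per turn, in order.

    Assumed definitions (not confirmed by the source):
      delta = input - cached   (new prefill)
      reuse = cached / input
      ratio = cached / delta   (resident/new)
    """
    rows = []
    for k, (inp, cached) in enumerate(turns, start=1):
        delta = inp - cached
        rows.append({
            "turn": k,
            "resident": cached,
            "delta": delta,
            "reuse": cached / inp if inp else 0.0,
            "ratio": cached / delta if delta else float("inf"),
        })
    return rows

# Toy session (illustrative values only): a cold first turn, then growing reuse.
toy_session = [(2000, 0), (5000, 4000), (50_200, 50_000)]
rows = amortization(toy_session)
for r in rows:
    print(r["turn"], r["resident"], r["delta"], round(r["reuse"], 3), r["ratio"])
```

In the toy session the last turn carries 50,000 resident tokens for a 200-token delta, a ratio of 250, which is the regime the paragraph above describes for deep turns.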
PD-colocation: the dominant cost is keeping ~50k+ resident KV available for
the next turn's tiny delta. Disaggregation physically splits a turn's prefill-KV
(P) and decode-KV (D), and the next turn's prefix = [prevPrompt + prevAnswer]
spans both → must be gathered/transferred; colocation keeps it local for free.
Routing: route on delta (input − cache_hit), never total input — C2 is the
trace-level justification for LPWL's score function.
C3 — prefill/decode BALANCE (honest reframe) (c3_prefill_decode_balance.png)
- Token mass: 98.7% input / 1.3% output; of input, 60% reused-prefix, 40% new-prefill (28.6B new-prefill tokens vs 0.94B decode tokens).
- But tokens ≠ time. Under a per-request latency model (prefill@7k tok/s, TPOT 10ms), aggregate decode-time share ≈ 70% (robust 68–71% across constants) — each decode token costs ~70–140× a prefill token. So this is NOT a "decode is negligible" workload.
- Per-request the bottleneck FLIPS within a session: turn-1 (and the 90% single-turn) is prefill-bound; turns ≥3 are strongly decode-bound.
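The token-versus-time gap can be checked with back-of-envelope arithmetic on the aggregates quoted above. The sketch below assumes that only new-prefill tokens cost prefill time (reused prefix is treated as free) and that requests are served unbatched; it is not the per-request computation in compute_chars.py, and the function names are hypothetical.

```python
# Aggregates quoted in the token-mass bullet above.
NEW_PREFILL_TOKENS = 28.6e9
DECODE_TOKENS = 0.94e9

def decode_time_share(prefill_tok_per_s=7000.0, tpot_s=0.010):
    """Share of total modeled time spent decoding.

    Assumptions: reused-prefix tokens cost nothing, prefill runs at a
    constant rate, and every decode token costs one TPOT (no batching).
    """
    prefill_s = NEW_PREFILL_TOKENS / prefill_tok_per_s
    decode_s = DECODE_TOKENS * tpot_s
    return decode_s / (prefill_s + decode_s)

def decode_to_prefill_cost_ratio(prefill_tok_per_s=7000.0, tpot_s=0.010):
    """Time cost of one decode token relative to one prefill token."""
    return tpot_s * prefill_tok_per_s

share = decode_time_share()
ratio = decode_to_prefill_cost_ratio()
print(round(share, 3), round(ratio))  # 0.697 70
```

Because the model is linear in tokens, summing per-request times gives the same total as applying the constants to the aggregates, which is why the simplified arithmetic lands on the same ~70% share.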
PD-colocation (correct argument): the workload has substantial work on both sides of the roofline — compute-bound prefill (~30% of time) and memory-bound decode (~70%). Colocation interleaves them on one GPU (chunked prefill + continuous batching) so compute and HBM bandwidth are both used; static disaggregation strands P-instances bandwidth-idle and D-instances compute-idle. The earlier "decode is 1.3% so nothing to isolate" instinct was WRONG (token vs time confusion) — C3b is the correction.
Caveat: the 70% decode-time share (C3b) is a per-request-latency-weighted estimate; batched decode
throughput will shift it. Ground-truth needs -raw.jsonl (usage.cached_tokens
for exact reuse; backend_first_response_time_ms / total_cost_time_ms for real
prefill vs decode wall time). Sampling that 522GB file is the next step.
Goal mapping
| | argue PD-colocation | guide routing |
|---|---|---|
| C1 mixture + hazard | both segments favor colo (diff reasons) | reactive + auto-segment ⇒ LPWL shape |
| C2 resident/delta | the PD tax (transfer/split resident KV) | route on delta, not total |
| C3 prefill/decode | roofline complementarity (interleave) | per-req bottleneck flips within session |