agentic-kvc/analysis/workload_chars
Gahow Wang cf812b6264 Workload characterization C1-C3 on full production trace
Joint/temporal characterizations of the full 051315 cluster trace (2.11M
req / 1.31M sessions / 2h), beyond the existing single-variable marginals:

- C1 mixture: 90.3% sessions single-turn, but multi-turn (9.7%) = 44% reqs /
  67% prefill mass; continuation hazard rises 10%->94% (Lindy); heaviness
  unpredictable at turn 1 (corr 0.04-0.15) => reactive routing justified.
- C2 resident/delta: resident context 11k->56k while new-prefill 2.7k->~200;
  per-turn reuse ->99.6%; resident/delta ("PD tax") ->~250-450x.
- C3 prefill/decode: token mass 98.7% input / 1.3% output, BUT decode ~70% of
  TIME (robust 68-71%); "decode negligible" is wrong (tokens != time). Correct
  colo argument = roofline complementarity, not "no decode".

Maps each to (1) PD-colocation and (2) routing. compute_chars.py + chars.json
+ figs/workload_chars/. Raw-file exact validation (cached_tokens, real
timings) pending.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:19:39 +08:00

Agentic workload characterization C1–C3 (full 051315 production trace)

Date 2026-05-29. Source: trace-glm5.1-formatted/051315-051317.jsonl on dash1 (release file, 2,114,220 requests / 1,307,276 sessions / 2h, type=100% coder). This release file is the full cluster-level production trace — session skew reproduces 46.5/66.5/74.6/87.5/96.0 exactly. Compute: compute_chars.py (2-pass, ~65s, ~/ali-trace/.venv python). Numbers: chars.json.

⚠️ Cluster-level, not per-instance. This is one cluster's aggregate stream. Concurrent-session counts have NO denominator of "8 instances" — do not compare them to a single deployment's instance count.

These three are NOT in the existing 13 analyzer figures (which are single-variable marginals on the older 041x traces). C1–C3 are joint/temporal and argument-bearing.

C1 — the workload is a MIXTURE, not "multi-turn agentic" (c1_session_mixture.png)

  • 90.3% of sessions are single-turn; mean 1.62 turns, p99=18, max=3091.
  • But multi-turn sessions (9.7%) = 44.2% of requests and 66.9% of input (prefill) mass. Single-turn = 60.2% of output (decode) mass.
  • Continuation hazard P(reach k+1 | reached k): turn1→2 only 10.2%, but turn2→3 50.6%, turn5→6 87%, turn12→13 94.3% (Lindy / Pareto).
  • Predictability of heaviness at cold-start is near-zero: corr(turn1_input, session_mass)=0.15, corr(turn1_input, n_turns)=0.04.
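The continuation hazard in the third bullet can be recomputed from per-session turn counts alone. The following is a minimal sketch, not the code in compute_chars.py; the function name `continuation_hazard` and the toy input are assumptions made for illustration.

```python
from collections import Counter

def continuation_hazard(turn_counts, max_k=12):
    """Return {k: P(session reaches turn k+1 | it reached turn k)}.

    turn_counts: iterable with one integer per session (its number of turns).
    """
    hist = Counter(turn_counts)
    if not hist:
        return {}
    longest = max(hist)
    # reached[k] = number of sessions with at least k turns (suffix sum).
    reached = {}
    running = 0
    for k in range(longest, 0, -1):
        running += hist.get(k, 0)
        reached[k] = running
    hazard = {}
    for k in range(1, max_k + 1):
        if reached.get(k, 0) == 0:
            break  # no session got this deep, so the hazard is undefined
        hazard[k] = reached.get(k + 1, 0) / reached[k]
    return hazard

# Toy data (not the trace): 90 one-turn, 5 two-turn, 5 three-turn sessions.
toy = [1] * 90 + [2] * 5 + [3] * 5
print(continuation_hazard(toy))
```

On the toy data the hazard rises from turn 1→2 to turn 2→3, which is the same qualitative shape the trace shows at much larger scale.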

Routing: heaviness is unpredictable at session start → proactive placement cannot pre-empt hot-pin → a REACTIVE mechanism (observable-load routing / migration) is required. But once a session has shown depth, it almost surely continues → "observed accumulated load" is the signal that works (not turn-1 features, not cost-model prediction). The single/multi optimal strategies are opposite (load-balance the 90% one-shot sea vs affinity-pin the deep tail) and you can't tell them apart at turn 1 → the only viable policy starts everyone load-balanced and becomes sticky as turns accrue. This is exactly LPWL's emergent behavior (new_uncached≈input→by-load; new_uncached≈0→sticks), so C1 explains why a cache-aware-load score is the right shape — it auto-segments the mixture with no classifier.
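A minimal sketch of what a cache-aware-load score of this shape could look like. LPWL's actual score function is not given in this document; the `Instance` fields, `new_uncached`, and `pick_instance` below are hypothetical names, and "load" is assumed to be a count of pending uncached tokens.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    name: str
    pending_tokens: int  # assumed observable load: uncached tokens already queued
    cached_prefix: dict = field(default_factory=dict)  # session_id -> cached tokens

def new_uncached(inst, session_id, input_tokens):
    """Tokens this instance would have to prefill for the request."""
    hit = min(inst.cached_prefix.get(session_id, 0), input_tokens)
    return input_tokens - hit

def pick_instance(instances, session_id, input_tokens):
    """Choose the instance minimizing existing load plus new uncached work."""
    return min(
        instances,
        key=lambda i: i.pending_tokens + new_uncached(i, session_id, input_tokens),
    )

a = Instance("a", pending_tokens=5000, cached_prefix={"s1": 50000})
b = Instance("b", pending_tokens=1000)

# Fresh session: nothing is cached anywhere, so the choice is purely by load.
print(pick_instance([a, b], "s2", 2700).name)
# Deep session: the delta on "a" is ~200 tokens, so it sticks despite higher load.
print(pick_instance([a, b], "s1", 50200).name)
```

No classifier appears anywhere in the sketch: the same score load-balances new sessions and pins deep ones, because the uncached term dominates only when there is a large cached prefix to lose.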

C2 — marginal work collapses while resident state explodes (c2_work_amortization.png)

Per turn: resident context grows 11k→56k+ tokens while new prefill collapses 2.7k→~200 tokens; per-turn reuse climbs 83%→99.6%; resident/new ratio ("the PD tax") grows to ~250× by turn 12, ~450× by turn 30.
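The reuse percentage and the resident/new ratio are two views of the same quantity. A small sketch of the conversion, assuming a turn's input is exactly reused prefix plus new prefill and that the resident context is approximately the reused prefix (both are simplifying assumptions, and the function names are illustrative):

```python
def ratio_from_reuse(reuse):
    """Resident/new ratio implied by a per-turn reuse fraction in [0, 1)."""
    return reuse / (1.0 - reuse)

def reuse_from_ratio(ratio):
    """Per-turn reuse fraction implied by a resident/new ratio."""
    return ratio / (1.0 + ratio)

print(round(ratio_from_reuse(0.996)))      # reuse of 99.6% -> about 249x
print(round(reuse_from_ratio(450.0), 4))   # ratio of 450x  -> reuse of about 0.9978
```

Under these assumptions, 99.6% reuse corresponds to the ~250× figure, and a ~450× ratio corresponds to reuse of roughly 99.8%.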

PD-colocation: the dominant cost is keeping ~50k+ tokens of resident KV available for the next turn's tiny delta. Disaggregation physically splits a turn's prefill-KV (P) and decode-KV (D), and the next turn's prefix = [prevPrompt + prevAnswer] spans both → it must be gathered/transferred; colocation keeps it local for free.

Routing: route on delta (input − cache_hit), never on total input. C2 is the trace-level justification for LPWL's score function.

C3 — prefill/decode BALANCE (honest reframe) (c3_prefill_decode_balance.png)

  • Token mass: 98.7% input / 1.3% output; of input, 60% reused-prefix, 40% new-prefill (28.6B new-prefill tokens vs 0.94B decode tokens).
  • But tokens ≠ time. Under a per-request latency model (prefill at 7k tok/s, TPOT 10 ms), the aggregate decode-time share is ≈ 70% (robust at 68–71% across constants): each decode token costs ~70–140× a prefill token. So this is NOT a "decode is negligible" workload.
  • Per-request the bottleneck FLIPS within a session: turn-1 (and the 90% single-turn) is prefill-bound; turns ≥3 are strongly decode-bound.
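The ≈ 70% figure can be sanity-checked from the aggregate token counts above. This sketch assumes the reused prefix costs no prefill time and applies the stated constants (7k tok/s prefill, 10 ms TPOT); `decode_time_share` is an illustrative name, not a function from compute_chars.py.

```python
def decode_time_share(new_prefill_tokens, decode_tokens,
                      prefill_tok_per_s=7000.0, tpot_s=0.010):
    """Fraction of modeled time spent in decode under a linear latency model."""
    prefill_s = new_prefill_tokens / prefill_tok_per_s  # seconds of prefill
    decode_s = decode_tokens * tpot_s                   # seconds of decode
    return decode_s / (prefill_s + decode_s)

# Aggregate counts from the token-mass bullet: 28.6B new-prefill, 0.94B decode.
share = decode_time_share(28.6e9, 0.94e9)
print(f"decode time share: {share:.3f}")

# Cost of one decode token relative to one prefill token at these constants.
print(f"decode/prefill per-token cost: {0.010 * 7000.0:.0f}x")
```

Because the model is linear in token counts, summing per-request times gives the same result as applying it to the totals, which is why the aggregate counts suffice here.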

PD-colocation (correct argument): the workload has substantial work on both sides of the roofline — compute-bound prefill (~30% of time) and memory-bound decode (~70%). Colocation interleaves them on one GPU (chunked prefill + continuous batching) so compute and HBM bandwidth are both used; static disaggregation strands P-instances bandwidth-idle and D-instances compute-idle. The earlier "decode is 1.3% so nothing to isolate" instinct was WRONG (token vs time confusion) — C3b is the correction.

Caveat: C3b's 70% is a per-request-latency-weighted estimate; batched decode throughput will shift it. Ground-truth needs -raw.jsonl (usage.cached_tokens for exact reuse; backend_first_response_time_ms / total_cost_time_ms for real prefill vs decode wall time). Sampling that 522GB file is the next step.
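One way to sample a file too large to load is single-pass reservoir sampling (Algorithm R), which keeps a uniform random sample of k lines in O(k) memory. This is a sketch of that general technique, not the planned tooling; `reservoir_sample` and the `raw_path` placeholder in the usage note are assumptions.

```python
import random

def reservoir_sample(lines, k, seed=0):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                sample[j] = line       # replace with probability k / (i + 1)
    return sample

# Toy demonstration on an in-memory iterable.
print(len(reservoir_sample(range(1000), 10)))
```

Usage would look like `with open(raw_path) as f: rows = reservoir_sample(f, 100_000)`, where `raw_path` stands in for the raw file's location. The file is still read once end to end, so this bounds memory, not I/O.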

Goal mapping

| | argue PD-colocation | guide routing |
|---|---|---|
| C1 mixture + hazard | both segments favor colo (diff reasons) | reactive + auto-segment ⇒ LPWL shape |
| C2 resident/delta | the PD tax (transfer/split resident KV) | route on delta, not total |
| C3 prefill/decode | roofline complementarity (interleave) | per-req bottleneck flips within session |