# Agentic Workload Characterization TODO

Status: execution checklist for interns

Date: 2026-05-25

Last progress audit: 2026-05-25

## Progress Snapshot (2026-05-25, post-Window-1)

| Batch | State | Evidence |
|---|---|---|
| B0 Substrate audit | **DONE for new runs**, legacy still partial | A1+A2 instrumentation lands per-request unix timestamps and X-Request-Id passthrough; B3 sweep 2026-05-25 achieves 100% join coverage on all 5 policy runs |
| B1 Workload characterization | **DONE** | `window_1_results/kv_footprint_summary.json` (98304 B/token, p99 = 11.49 GiB); real reuse decomposition (`lmetric_reuse.json`: 93.2% intra-session, 5.7% cross, 1.1% shared); theoretical APC ceilings (`apc_upper_w600.json`: 79.6% intra / 80.3% any) |
| B2 PD interference | **DONE** | `outputs/b2_microbench/sweep/` 5 × 2 cells. Different-worker control idx 0.92-1.02 across 32× prefill size variation; same-worker TTFT idx scales 2.15× → 218×. Causal proof complete. |
| B3 5-policy routing sweep | **DONE** | `outputs/b3_sweep_20260525_095043/` lmetric/load_only/sticky (warm-cache) + unified/capped (isolated cold-start). Aggregated in `b3_policy_comparison.json`. Unified hits APC 79.4% (97% of ceiling) AND TTFT p90 7.24 s. |
| B4 SRR sweep | NOT DONE | Window 2 task. A4 loadgen + per-class SLO + λ binary search per policy. |
| B5 Failure attribution | NOT DONE | Window 2 task. Depends on B4 SRR boundaries. |
| B6 Audit package | **DONE for Window 1** | `current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` refreshed; Window 1 results aggregated in `window_1_results.md` + 8 PNG figures |

Reusable assets already in repo:

- `analysis/characterization/analyze.py`: B0+B1 CPU-only analyzer
- `analysis/characterization/summarize_runs.py`: existing-run audit producing the B6 scaffold
- `analysis/characterization/plot_current_results.py`: figure regeneration script
- `analysis/characterization/protocols.md`: B2–B6 protocol with required instrumentation, sweep, pass condition
- `analysis/characterization/current_results/`: current audit package (claim matrix + risk register + allowed-runs gate + 6 PNG figures)

Hard gates still blocking main claims:

1. Replayer/proxy must emit per-request dispatch + finish/error wall-clock timestamps (blocks B0 actual sequentiality, B4 SRR validity).
2. Per-request record must carry `session_id` + `hash_ids` + `cached_tokens` jointly (blocks B1 reuse decomposition).
3. Engine/proxy must log decode-step and prefill-chunk timestamps with worker id (blocks B2 interference index).
4. Proxy must log route decision, chosen worker, candidate scores, per-worker queue/KV/APC snapshot (blocks B3 hot-spot proof).

## 0. Purpose

We are not starting from the assumption that Unified routing or PUSH migration is already the answer. The first goal is to build a rigorous characterization package that proves:

1. which dimensions make agentic serving different;
2. where static PD-disaggregation works poorly;
3. where PD-colocation/cache-aware routing still has residual failure modes;
4. how these failure modes reduce sustainable request rate under SLO.

Only after these facts are established should we refine the positive system design.

Primary system goal:

```text
maximize sustainable request rate under SLO
```

Prefill-decode interference and session hot-spot imbalance are mechanisms that may reduce SRR. They are not the final metric by themselves.

## 1. Global Delivery Rules

Every task must produce data, figures, and an audit trail. A task is not complete if it only produces a written conclusion.

Use this output layout:

```text
outputs/characterization///
├── manifest.json
├── raw/
├── summary.json
├── summary.md
├── figures/
└── audit.md
```

Required fields in `manifest.json`:

```json
{
  "git_commit": "",
  "host": "",
  "gpu_type": "",
  "gpu_count": 0,
  "trace_path": "",
  "trace_sha256": "",
  "policy": "",
  "launch_command": "",
  "request_limit": null,
  "time_scale": null,
  "session_sampling_method": "",
  "session_sequential": true,
  "start_time": "",
  "end_time": ""
}
```

Every comparison must report:

- attempted requests
- completed requests
- errors / timeouts
- goodput
- TTFT p50/p90/p99
- E2E p50/p90/p99
- TPOT p50/p90/p99
- per-worker queue metrics
- per-worker GPU utilization
- per-worker KV occupancy if available
- per-worker APC / cache-hit metrics

Every figure must be reproducible from raw data by a script committed or saved alongside the artifact.

## 2. Batch 0: Benchmark Substrate Audit

Status: analyzer DONE (`analyze.py`); legacy-run sequentiality claim BLOCKED by missing dispatch/finish timestamps in `metrics.jsonl`. New replayer must add those fields before any `online_realistic` classification is allowed.

### Goal

Prove the load generator and trace replay are valid before trusting any performance result.

The most important invariant:

```text
For online agentic serving, each session must have at most one in-flight turn.
Turn N+1 must not be sent before turn N completes.
```

### TODO

1. Implement or run an analyzer that reconstructs per-session request intervals:
   - dispatch timestamp
   - first-token timestamp
   - finish timestamp
   - error / timeout timestamp
2. Compute max concurrent in-flight turns per session.
3. Compute session start-time distribution.
4. Compute turn inter-arrival distribution.
5. Classify each existing run as one of:
   - `online_realistic`
   - `burst_stress`
   - `synthetic_microbench`
   - `invalid_for_online_claim`
6. For any run where session sequentiality is violated, write down exactly which claim it can still support.

### Data Artifacts

- `session_concurrency.json`
- `session_arrival_stats.json`
- `turn_interval_stats.json`
- `trace_profile.json`
- `invalid_runs.md`

### Figures

- session start-time CDF
- per-session max in-flight histogram
- turns per session CDF
- turn inter-arrival CDF

### Audit Checks

The `audit.md` must answer:

1. Does the main trace satisfy `max_inflight_per_session == 1`?
2. If not, is the run explicitly labeled as stress or invalid?
3. Are attempted/completed/error counts included?
4. Are latency percentiles computed only over successes, and if so, is goodput also reported?

### Pass Criteria

- Main online-serving experiments must have `max_inflight_per_session == 1`.
- Any violation must be clearly labeled and excluded from SRR claims.

## 3. Batch 1: Workload Characterization

Status: trace-shape items (1, 2, 3, 6, 8) DONE on full 7200 s GLM-5.1 trace; recorded in `current_results/full_trace_summary.json`. Items 4 (KV footprint), 5 (reuse decomposition), 7 (uncached append delta) are PENDING because they need `--kv-bytes-per-token` for the production model and joinable `cached_tokens`+`hash_ids` per request.

### Goal

Establish agentic workload facts independent of any proposed system.

Required facts:

1. long input, short output;
2. large per-request KV footprint;
3. reuse is mostly intra-session;
4. session token mass is heavy-tailed;
5. total prompt length and effective uncached prefill work are different.

### TODO

1. Compute input token CDF.
2. Compute output token CDF.
3. Compute input/output ratio.
4. Estimate KV footprint per request:

   ```text
   kv_bytes_per_request = input_tokens * kv_bytes_per_token
   ```

5. Decompose reusable KV into:
   - intra-session reuse
   - cross-session reuse
   - shared/system-prefix reuse
6. Compute session-level skew:
   - turns per session
   - cumulative input tokens per session
   - cumulative output tokens per session
   - cumulative uncached tokens per session
   - top-k session contribution
7. Compute append / effective-prefill distribution:

   ```text
   uncached_tokens = input_tokens - cached_tokens
   ```

8. Compare total input length vs uncached tokens.

### Data Artifacts

- `workload_summary.json`
- `kv_footprint_summary.json`
- `reuse_decomposition.json`
- `session_skew.json`
- `append_delta_stats.json`

### Figures

- input/output token CDF
- input/output ratio CDF
- KV footprint CDF
- reuse decomposition stacked bar
- turns per session CDF
- per-session token mass Lorenz curve
- top-k sessions token contribution bar
- total input vs uncached tokens scatter

### Audit Checks

The `audit.md` must answer:

1. What are input p50/p90/p99?
2. What are output p50/p90/p99?
3. What is the estimated KV footprint p50/p90/p99?
4. What fraction of reuse is intra-session?
5. What fraction of total token mass comes from top 1% / 5% sessions?
6. Are long prompts often small appends after cache reuse?

### Pass Criteria

The batch passes only if these facts can be stated numerically with raw data links and plotted figures.

## 4. Batch 2: PD-Colo Prefill-Decode Interference Proof

Status: protocol DONE (`analysis/characterization/protocols.md` §"Batch 2 Protocol"); execution NOT STARTED; needs new engine instrumentation for decode-step and prefill-chunk timestamps.

### Goal

Prove that PD-colocation can suffer from prefill-decode interference under high load, and quantify how much this affects TPOT, decode queueing, and SLO.

Hypothesis:

```text
When heavy uncached prefill overlaps with active decode on the same worker,
decode TPOT and/or decode queue delay increases.
```

### TODO

1. Run controlled microbenchmarks:
   - decode-only steady load;
   - decode load plus same-worker heavy prefill injection;
   - decode load plus different-worker heavy prefill injection.
2. Sweep uncached prefill sizes:
   - 2k
   - 8k
   - 16k
   - 32k
   - 64k
3. If supported, sweep chunked prefill size.
4. Log timestamps for:
   - decode steps;
   - prefill start/end;
   - prefill chunks;
   - queue admission;
   - request completion.
5. In trace replay, label decode steps by whether they overlap with same-worker prefill.
6. Compute:

   ```text
   interference_index =
     TPOT_p90(decode steps overlapping same-worker prefill)
     / TPOT_p90(decode steps without same-worker prefill)
   ```

7. Compare same-worker vs different-worker controls.

### Data Artifacts

- `interference_microbench_summary.json`
- `decode_step_timeseries.csv`
- `prefill_overlap_events.jsonl`
- `interference_index.json`
- `trace_overlap_summary.json`

### Figures

- TPOT time series with prefill overlap annotation
- interference index vs uncached prefill size
- same-worker vs different-worker TPOT boxplot
- chunk size vs TTFT/TPOT tradeoff
- trace replay overlap vs non-overlap TPOT comparison

### Audit Checks

The `audit.md` must answer:

1. Is the interference observed on the same worker?
2. Is the different-worker control significantly weaker?
3. Does interference grow with uncached prefill size?
4. Does the phenomenon appear in real trace replay, not only microbench?
5. Could the result be explained by global load instead of local colocation?

### Pass Criteria

- Same-worker overlap must measurably increase TPOT or decode queue delay.
- The effect must be weaker or absent in the different-worker control.
- The effect must be visible in at least one trace replay setting.

## 5. Batch 3: Session Hot-Spot Residual Imbalance Proof

Status: protocol DONE; partial signal from legacy `gpu_util.csv` (GPU-util imbalance visible) but causal proof NOT STARTED; needs per-worker queue/KV/APC and session→worker map from instrumented proxy.

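
Once the instrumented proxy provides per-worker queue delays, the hot-spot index defined in this batch's TODO (maximum over median of per-worker p90 queue delay) could be computed along these lines. This is a minimal sketch, not the project's analyzer: the `(worker_id, queue_delay_seconds)` record shape and the names `p90` and `hotspot_index` are assumptions made for illustration, and the percentile method (inclusive interpolation) is one possible choice.

```python
from collections import defaultdict
from statistics import median, quantiles


def p90(values):
    """90th percentile with inclusive interpolation (needs >= 2 samples)."""
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile.
    return quantiles(values, n=10, method="inclusive")[-1]


def hotspot_index(records):
    """Compute max / median of per-worker p90 queue delay.

    records: iterable of (worker_id, queue_delay_seconds) pairs.
    This record shape is hypothetical; adapt it to the real proxy log.
    Returns (index, per_worker_p90).
    """
    by_worker = defaultdict(list)
    for worker_id, delay in records:
        by_worker[worker_id].append(delay)

    per_worker_p90 = {w: p90(d) for w, d in by_worker.items()}
    med = median(per_worker_p90.values())
    if med == 0:
        raise ValueError("median per-worker p90 is zero; index is undefined")
    return max(per_worker_p90.values()) / med, per_worker_p90


if __name__ == "__main__":
    # Synthetic delays: workers 0 and 1 are calm, worker 2 is hot.
    sample = [(0, 1.0)] * 4 + [(1, 1.0)] * 4 + [(2, 4.0)] * 4
    index, detail = hotspot_index(sample)
    print(index, detail)
```

Returning the per-worker p90 values alongside the ratio keeps the raw per-worker evidence available for the bar charts listed under Figures, since a single ratio hides which worker is hot.
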
### Goal

Prove that cache-aware/LMetric is a strong baseline but still leaves residual hot-worker imbalance due to session skew and locality.

Hypothesis:

```text
Cache-aware routing preserves locality by attracting future turns to cached workers.
This is usually good, but heavy-tailed sessions can create hot workers whose queue
delay/SLO violations are much worse than the median worker even when other workers
still have headroom.
```

### TODO

1. Run the same session-causal trace with:
   - corrected LMetric/cache-aware;
   - load-only routing;
   - hard sticky routing;
   - current Unified hybrid, if available.
2. For each worker, record:
   - assigned session count;
   - cumulative input tokens;
   - cumulative uncached tokens;
   - cumulative output tokens;
   - request queue delay;
   - decode queue delay;
   - GPU utilization;
   - KV occupancy;
   - APC / cache-hit rate;
   - SLO violations.
3. For each session, record:
   - worker set used;
   - primary worker;
   - cumulative token mass;
   - number of turns;
   - latency contribution;
   - whether it appears in slow-request set.
4. Create a session-mass capped or equalized replay:
   - cap max session turns or token mass;
   - rerun LMetric/cache-aware;
   - compare hot-spot index.
5. Compute:

   ```text
   hotspot_index =
     max_worker_queue_delay_p90 / median_worker_queue_delay_p90
   ```

6. Compute locality/load tradeoff:

   ```text
   locality_gain = APC(policy) - APC(load_only)
   imbalance_cost = max_worker_latency_p90(policy) - median_worker_latency_p90(policy)
   ```

### Data Artifacts

- `worker_balance_summary.json`
- `session_to_worker_map.json`
- `session_mass_summary.json`
- `routing_policy_comparison.json`
- `hotspot_index.json`
- `capped_session_replay_summary.json`

### Figures

- per-worker queue delay bar
- per-worker token mass bar
- GPU utilization timeline by worker
- KV occupancy timeline by worker
- APC vs queue delay scatter
- top sessions contribution bar
- policy tradeoff plot: APC vs hotspot_index
- original vs session-capped hot-spot comparison

### Audit Checks

The `audit.md` must answer:

1. Does LMetric/cache-aware still show worker-level skew?
2. Are SLO violations concentrated on hot workers or hot sessions?
3. Does load-only routing improve balance but reduce APC/locality?
4. Does hard sticky improve locality but worsen hot-spot/HOL?
5. Does session-mass capping reduce hot spots?

### Pass Criteria

- LMetric/cache-aware must be shown as strong but imperfect.
- There must be measurable residual hot-worker imbalance.
- The imbalance must correlate with session token mass or locality.

## 6. Batch 4: Sustainable Request Rate Sweep

Status: protocol DONE; execution NOT STARTED; requires open-loop session-causal loadgen and policy-comparable arrival process.

### Goal

Connect interference and hot-spot mechanisms to the final metric:

```text
SRR(SLO) = max arrival rate satisfying SLO in steady state
```

### TODO

1. Define provisional SLO thresholds. Use configurable values, for example:

   ```text
   TTFT_p90 <= T_ttft
   E2E_p90 <= T_e2e
   TPOT_p90 <= T_tpot
   error_rate <= epsilon
   queue length stable
   KV occupancy stable
   ```

2. Implement arrival-rate sweep:
   - Poisson session arrivals;
   - session-internal sequentiality;
   - warmup window;
   - steady-state measurement window.
3. For each arrival rate `lambda`, run:
   - PD-colo cache-aware/LMetric;
   - static PD-disagg;
   - current Unified hybrid;
   - optional hard sticky;
   - optional load-only.
4. Find maximum sustainable lambda for each policy.
5. Report instability reasons:
   - SLO violation;
   - queue growth;
   - KV occupancy growth;
   - error/timeout growth.

### Data Artifacts

- `srr_curve.json`
- `lambda_runs//summary.json`
- `slo_violation_reason.json`
- `goodput_vs_arrival_rate.json`
- `stability_summary.json`

### Figures

- SRR bar chart
- TTFT p90 vs arrival rate
- E2E p90 vs arrival rate
- TPOT p90 vs arrival rate
- goodput vs arrival rate
- error rate vs arrival rate
- queue length over time near failure point
- KV occupancy over time near failure point

### Audit Checks

The `audit.md` must answer:

1. Are session arrivals open-loop and Poisson?
2. Is session-internal sequentiality enforced?
3. How long are warmup and steady-state windows?
4. Is SRR failure persistent rather than transient?
5. Are completed/requested counts reported at every lambda?
6. Are policies compared on the same trace and same arrival process?

### Pass Criteria

- Each policy must have a measured SRR under the same SLO.
- Failure must be attributed to persistent SLO violation, queue growth, KV growth, or error growth.
- Data must be session-causal.

## 7. Batch 5: Failure Attribution Near SRR Boundary

Status: protocol DONE; execution NOT STARTED; depends on B2 instrumentation and B4 SRR boundary.

### Goal

At and around the PD-colo/LMetric failure point, determine whether SLO violations are caused by prefill-decode interference, session hot spots, KV pressure, cache misses, or other mechanisms.

### TODO

1. Select three arrival rates:

   ```text
   lambda = 0.9 * SRR
   lambda = 1.0 * SRR
   lambda = 1.1 * SRR
   ```

2. For every slow or SLO-violating request, assign labels:
   - same-worker prefill overlap;
   - hot worker queue;
   - high KV occupancy;
   - cache miss / large uncached append;
   - transfer wait;
   - P queue wait;
   - D admission wait;
   - unknown.
3. Produce per-request waterfall for representative slow requests.
4. Produce per-worker timeline around failure windows.
5. Summarize cause distribution.

### Data Artifacts

- `slow_request_attribution.jsonl`
- `failure_breakdown.json`
- `case_studies.md`
- `worker_failure_windows.json`

### Figures

- SLO violation cause stacked bar
- slow request waterfall
- worker timeline near failure
- prefill/decode/KV/queue stacked breakdown
- failure cause vs arrival rate

### Audit Checks

The `audit.md` must answer:

1. What fraction of slow requests overlap same-worker prefill?
2. What fraction are on hot workers?
3. What fraction happen under high KV occupancy?
4. What fraction are large uncached append requests?
5. For PD-disagg/Unified migration, how much time is transfer/P queue/D wait?
6. What remains unexplained?

### Pass Criteria

The batch must answer:

1. Why PD-colo/LMetric hits its SRR limit.
2. Why static PD-disagg hits its SRR limit.
3. If Unified/PUSH underperforms, whether the cause is trigger quality, cost model, transfer overhead, wrong load regime, or something else.

## 8. Batch 6: Audit Package

Status: scaffold DONE: all five final artifacts exist under `analysis/characterization/current_results/` and are regenerated by `summarize_runs.py` + `plot_current_results.py`. Future B2–B5 outputs must be merged into the same package by re-running `summarize_runs.py` after new runs.

### Goal

Make the whole characterization package reviewable by a strict systems reviewer.

### TODO

1. Write a claim matrix:

   ```text
   claim -> data artifact -> figure -> script -> caveat -> reviewer risk
   ```

2. Write a figure index:
   - figure filename;
   - source data;
   - generation command;
   - intended claim.
3. Write a reviewer risk register:
   - loadgen validity risks;
   - trace representativeness risks;
   - metric bias risks;
   - implementation-specific risks;
   - generalization risks.
4. Write a reproduction script or command list.
5. Mark experiments that cannot support main claims.

### Final Artifacts

- `characterization_claim_matrix.md`
- `all_figures_index.md`
- `reviewer_risk_register.md`
- `reproduction_commands.sh`
- `main_claim_allowed_runs.md`

### Audit Checks

The final package must satisfy:

1. Every claim links to raw data.
2. Every figure can be regenerated.
3. Every experiment has a manifest.
4. Every caveat is explicit.
5. Invalid or stress-only runs are not used for online-serving claims.

## 9. Priority Order

### Priority 1

Do these first:

1. Batch 0: Benchmark Substrate Audit
2. Batch 1: Workload Characterization
3. Batch 3: Session Hot-Spot Residual Imbalance Proof

Reason: These define whether the trace and routing problem are real. Without them, SRR sweeps and system experiments are not trustworthy.

### Priority 2

Do these after the substrate and workload facts are stable:

1. Batch 2: PD-Colo Prefill-Decode Interference Proof
2. Batch 5: Failure Attribution Near SRR Boundary

Reason: These explain the mechanisms behind SLO/SRR failure and determine what the positive system should actually fix.

### Priority 3

Do these after instrumentation and attribution are ready:

1. Batch 4: Sustainable Request Rate Sweep
2. Batch 6: Audit Package

Reason: SRR sweeps are expensive. They should run only after trace validity, logging, and attribution labels are ready.

## 10. Non-Negotiable Reviewer Rules

1. Do not use session-nonsequential loadgen for online-serving claims.
2. Do not compare latency percentiles without attempted/completed/error counts.
3. Do not use APC alone as a success metric.
4. Do not use average GPU utilization as proof of load balance.
5. Do not compare policies on different traces unless explicitly labeled.
6. Do not hide failed requests or timeouts.
7. Do not claim Unified/PUSH is the answer before failure attribution proves the relevant bottleneck and cost budget.
8. Treat corrected LMetric/cache-aware PD-colo as the main baseline.
9. Treat static PD-disagg as an important baseline, not a strawman.
10. Every result must be reproducible from raw artifacts and commands.
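
Rule 1 rests on the Batch 0 invariant that each session has at most one in-flight turn, and that invariant can be checked mechanically from per-request timestamps. The sketch below is one way to do it under stated assumptions: the field names `session_id`, `dispatch_ts` and `finish_ts` are hypothetical (the real `metrics.jsonl` schema may differ), and `finish_ts` is assumed to hold the finish, error, or timeout timestamp, whichever ended the request.

```python
from collections import defaultdict


def max_inflight_per_session(records):
    """Return {session_id: peak number of concurrently in-flight turns}.

    records: iterable of dicts with hypothetical keys
    'session_id', 'dispatch_ts', 'finish_ts' (timestamps as numbers).
    """
    events = defaultdict(list)
    for rec in records:
        sid = rec["session_id"]
        events[sid].append((rec["dispatch_ts"], +1))  # turn becomes in-flight
        events[sid].append((rec["finish_ts"], -1))    # turn leaves in-flight

    peaks = {}
    for sid, evs in events.items():
        # Tuple sort puts -1 before +1 at equal timestamps, so a turn that
        # is dispatched exactly when the previous one finishes is sequential.
        evs.sort()
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[sid] = peak
    return peaks


def sequentiality_violations(records):
    """Sessions whose peak in-flight count exceeds 1."""
    peaks = max_inflight_per_session(records)
    return sorted(sid for sid, peak in peaks.items() if peak > 1)


if __name__ == "__main__":
    demo = [
        {"session_id": "a", "dispatch_ts": 0.0, "finish_ts": 1.0},
        {"session_id": "a", "dispatch_ts": 1.0, "finish_ts": 2.0},
        {"session_id": "b", "dispatch_ts": 0.0, "finish_ts": 2.0},
        {"session_id": "b", "dispatch_ts": 1.0, "finish_ts": 3.0},
    ]
    print(max_inflight_per_session(demo))
    print(sequentiality_violations(demo))
```

A run with any entry in the violations list would fail the `max_inflight_per_session == 1` pass criterion and should be labeled as stress or invalid for online-serving claims.
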