
Gahow Wang 4722883903 Audit package refresh: Window 1 supported claims + risk register

Refresh the standing audit package now that B1' / B2 / B3 are complete.

current_results/characterization_claim_matrix.md
  Flips seven entries from "not_yet_supported" / "partially_supported"
  to "supported" with pointers into window_1_results/. New entries
  cover per-session sequentiality, KV per request, real reuse
  decomposition, theoretical APC ceiling, the LMetric locality gap,
  Unified breaking the locality-vs-latency tradeoff, B2 causal
  interference proof, sticky's interference inflation, and the
  partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay
  "not_yet_supported" (Window 2 work).

current_results/main_claim_allowed_runs.md
  New "Allowed For Routing-Policy Comparison" section pins the five
  B3 policy directories. New "Allowed For PD-colo Interference"
  section pins the B2 sweep. Legacy section retained for the
  pre-instrumentation 200/500/1000-req runs.

current_results/reviewer_risk_register.md
  Marks the two old "high"-severity risks (sequentiality / reuse
  decomposition) as resolved; adds new entries for the APC
  contamination empirics, the b3_analyze.sh truncate-write bug that
  cost unified's interference index, the GPU-0 EngineCore ghost
  cleanup, the saturated-replay caveat for trace-timestamp dispatch,
  and the synthetic B2 decode workload.

current_results/all_figures_index.md
  Adds the 8 new Window 1 figures alongside the existing 6 from the
  legacy summarize_runs run.

current_results/reproduction_commands.sh
  Records the full B3 + B2 + figure pipeline.

analysis/characterization_todo_for_interns.md
  Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE;
  only B4 and B5 remain (Window 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 23:25:27 +08:00


Agentic Workload Characterization TODO

Status: execution checklist for interns
Date: 2026-05-25
Last progress audit: 2026-05-25

Progress Snapshot (2026-05-25, post-Window-1)

Batch	State	Evidence
B0 Substrate audit	DONE for new runs, legacy still partial	A1+A2 instrumentation lands per-request unix timestamps and X-Request-Id passthrough; B3 sweep 2026-05-25 achieves 100% join coverage on all 5 policy runs
B1 Workload characterization	DONE	`window_1_results/kv_footprint_summary.json` (98304 B/token, p99 = 11.49 GiB); real reuse decomposition (`lmetric_reuse.json`: 93.2% intra-session, 5.7% cross, 1.1% shared); theoretical APC ceilings (`apc_upper_w600.json`: 79.6% intra / 80.3% any)
B2 PD interference	DONE	`outputs/b2_microbench/sweep/` 5 × 2 cells. Different-worker control idx 0.92-1.02 across 32× prefill size variation; same-worker TTFT idx scales 2.15× → 218×. Causal proof complete.
B3 5-policy routing sweep	DONE	`outputs/b3_sweep_20260525_095043/` lmetric/load_only/sticky (warm-cache) + unified/capped (isolated cold-start). Aggregated in `b3_policy_comparison.json`. Unified hits APC 79.4% (97% of ceiling) AND TTFT p90 7.24 s.
B4 SRR sweep	NOT DONE	Window 2 task. A4 loadgen + per-class SLO + λ binary search per policy.
B5 Failure attribution	NOT DONE	Window 2 task. Depends on B4 SRR boundaries.
B6 Audit package	DONE for Window 1	`current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` refreshed; Window 1 results aggregated in `window_1_results.md` + 8 PNG figures

Reusable assets already in repo:

analysis/characterization/analyze.py — B0+B1 CPU-only analyzer
analysis/characterization/summarize_runs.py — existing-run audit producing the B6 scaffold
analysis/characterization/plot_current_results.py — figure regeneration script
analysis/characterization/protocols.md — B2–B6 protocol with required instrumentation, sweep, pass condition
analysis/characterization/current_results/ — current audit package (claim matrix + risk register + allowed-runs gate + 6 PNG figures)

Hard gates still blocking main claims:

Replayer/proxy must emit per-request dispatch + finish/error wall-clock timestamps (blocks B0 actual sequentiality, B4 SRR validity).
Per-request record must carry session_id + hash_ids + cached_tokens jointly (blocks B1 reuse decomposition).
Engine/proxy must log decode-step and prefill-chunk timestamps with worker id (blocks B2 interference index).
Proxy must log route decision, chosen worker, candidate scores, per-worker queue/KV/APC snapshot (blocks B3 hot-spot proof).

0. Purpose

We are not starting from the assumption that Unified routing or PUSH migration is already the answer.

The first goal is to build a rigorous characterization package that proves:

which dimensions make agentic serving different;
where static PD-disaggregation works poorly;
where PD-colocation/cache-aware routing still has residual failure modes;
how these failure modes reduce sustainable request rate under SLO.

Only after these facts are established should we refine the positive system design.

Primary system goal:

maximize sustainable request rate (SRR) under SLO

Prefill-decode interference and session hot-spot imbalance are mechanisms that may reduce SRR. They are not the final metric by themselves.

1. Global Delivery Rules

Every task must produce data, figures, and an audit trail. A task is not complete if it only produces a written conclusion.

Use this output layout:

outputs/characterization/<date>/<task_name>/
├── manifest.json
├── raw/
├── summary.json
├── summary.md
├── figures/
└── audit.md

Required fields in manifest.json:

{
  "git_commit": "",
  "host": "",
  "gpu_type": "",
  "gpu_count": 0,
  "trace_path": "",
  "trace_sha256": "",
  "policy": "",
  "launch_command": "",
  "request_limit": null,
  "time_scale": null,
  "session_sampling_method": "",
  "session_sequential": true,
  "start_time": "",
  "end_time": ""
}
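
A manifest can be checked mechanically before a run is accepted into the package. The sketch below is a minimal validator, assuming the field types implied by the placeholder values above (strings, an integer GPU count, nullable request_limit and time_scale, a boolean session_sequential). The name validate_manifest is hypothetical and is not an existing script in the repo.

```python
import json

# Expected types are an assumption inferred from the template's placeholder values.
REQUIRED_FIELDS = {
    "git_commit": str, "host": str, "gpu_type": str, "gpu_count": int,
    "trace_path": str, "trace_sha256": str, "policy": str,
    "launch_command": str,
    "request_limit": (int, type(None)),
    "time_scale": (int, float, type(None)),
    "session_sampling_method": str, "session_sequential": bool,
    "start_time": str, "end_time": str,
}

def validate_manifest(manifest):
    """Return a list of problems; an empty list means the manifest is well-formed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in manifest:
            problems.append(f"missing field: {name}")
            continue
        value = manifest[name]
        # bool is a subclass of int in Python, so reject it unless bool is expected.
        stray_bool = isinstance(value, bool) and expected is not bool
        if stray_bool or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems

def validate_manifest_file(path):
    """Load manifest.json from disk and validate it."""
    with open(path, encoding="utf-8") as fh:
        return validate_manifest(json.load(fh))
```

This checks presence and type only. Whether empty strings should also be rejected is a policy decision left to the audit.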

Every comparison must report:

attempted requests
completed requests
errors / timeouts
goodput
TTFT p50/p90/p99
E2E p50/p90/p99
TPOT p50/p90/p99
per-worker queue metrics
per-worker GPU utilization
per-worker KV occupancy if available
per-worker APC / cache-hit metrics

Every figure must be reproducible from raw data by a script committed or saved alongside the artifact.

2. Batch 0: Benchmark Substrate Audit

Status: analyzer DONE (analyze.py); legacy-run sequentiality claim BLOCKED by missing dispatch/finish timestamps in metrics.jsonl. New replayer must add those fields before any online_realistic classification is allowed.

Goal

Prove the load generator and trace replay are valid before trusting any performance result.

The most important invariant:

For online agentic serving, each session must have at most one in-flight turn.
Turn N+1 must not be sent before turn N completes.

TODO

Implement or run an analyzer that reconstructs per-session request intervals:
- dispatch timestamp
- first-token timestamp
- finish timestamp
- error / timeout timestamp
Compute max concurrent in-flight turns per session.
Compute session start-time distribution.
Compute turn inter-arrival distribution.
Classify each existing run as one of:
- online_realistic
- burst_stress
- synthetic_microbench
- invalid_for_online_claim
For any run where session sequentiality is violated, write down exactly which claim it can still support.
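
The per-session in-flight computation can be sketched as a sweep over dispatch and end events. This assumes each record carries session_id, dispatch_ts, and end_ts (the finish or error/timeout time); these field names are illustrative and are not the actual metrics.jsonl schema.

```python
from collections import defaultdict

def max_inflight_per_session(records):
    """Return {session_id: peak number of simultaneously in-flight turns}."""
    events = defaultdict(list)
    for r in records:
        events[r["session_id"]].append((r["dispatch_ts"], +1))
        events[r["session_id"]].append((r["end_ts"], -1))

    peaks = {}
    for sid, evs in events.items():
        # At equal timestamps, process ends (-1) before starts (+1) so that a
        # turn dispatched exactly when the previous one finishes is sequential.
        evs.sort(key=lambda e: (e[0], e[1]))
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[sid] = peak
    return peaks

records = [
    {"session_id": "a", "dispatch_ts": 0.0, "end_ts": 5.0},
    {"session_id": "a", "dispatch_ts": 5.0, "end_ts": 9.0},  # back-to-back
    {"session_id": "b", "dispatch_ts": 0.0, "end_ts": 5.0},
    {"session_id": "b", "dispatch_ts": 3.0, "end_ts": 7.0},  # overlaps turn 1
]
peaks = max_inflight_per_session(records)
violations = sorted(sid for sid, peak in peaks.items() if peak > 1)
print(peaks, violations)  # {'a': 1, 'b': 2} ['b']
```

Any session listed in violations breaks the one-in-flight-turn invariant, so the run containing it cannot be classified as online_realistic.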

Data Artifacts

session_concurrency.json
session_arrival_stats.json
turn_interval_stats.json
trace_profile.json
invalid_runs.md

Figures

session start-time CDF
per-session max in-flight histogram
turns per session CDF
turn inter-arrival CDF

Audit Checks

The audit.md must answer:

Does the main trace satisfy max_inflight_per_session == 1?
If not, is the run explicitly labeled as stress or invalid?
Are attempted/completed/error counts included?
Are latency percentiles computed only over successes, and if so, is goodput also reported?

Pass Criteria

Main online-serving experiments must have max_inflight_per_session == 1.
Any violation must be clearly labeled and excluded from SRR claims.

3. Batch 1: Workload Characterization

Status: trace-shape items (1, 2, 3, 6, 8) DONE on full 7200 s GLM-5.1 trace; recorded in current_results/full_trace_summary.json. Items 4 (KV footprint), 5 (reuse decomposition), 7 (uncached append delta) are PENDING because they need --kv-bytes-per-token for the production model and joinable cached_tokens+hash_ids per request.

Goal

Establish agentic workload facts independent of any proposed system.

Required facts:

long input, short output;
large per-request KV footprint;
reuse is mostly intra-session;
session token mass is heavy-tailed;
total prompt length and effective uncached prefill work are different.

TODO

1. Compute input token CDF.
2. Compute output token CDF.
3. Compute input/output ratio.
4. Estimate KV footprint per request:

kv_bytes_per_request = input_tokens * kv_bytes_per_token

5. Decompose reusable KV into:
- intra-session reuse
- cross-session reuse
- shared/system-prefix reuse
6. Compute session-level skew:
- turns per session
- cumulative input tokens per session
- cumulative output tokens per session
- cumulative uncached tokens per session
- top-k session contribution
7. Compute append / effective-prefill distribution:

uncached_tokens = input_tokens - cached_tokens

8. Compare total input length vs uncached tokens.
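
The two formulas above (KV footprint and uncached tokens) can be applied per request and summarized as percentiles. This is a minimal sketch, assuming each request record exposes input_tokens and cached_tokens. It uses a nearest-rank percentile, which may differ slightly from whatever interpolation analyze.py uses.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def kv_and_uncached_summary(requests, kv_bytes_per_token):
    """Summarize per-request KV footprint and uncached prefill work."""
    kv_bytes = [r["input_tokens"] * kv_bytes_per_token for r in requests]
    uncached = [r["input_tokens"] - r["cached_tokens"] for r in requests]
    return {
        "kv_bytes": {f"p{q}": percentile(kv_bytes, q) for q in (50, 90, 99)},
        "uncached_tokens": {f"p{q}": percentile(uncached, q) for q in (50, 90, 99)},
    }

# Toy records for illustration only; they are not taken from the trace.
requests = [
    {"input_tokens": 1000, "cached_tokens": 0},
    {"input_tokens": 2000, "cached_tokens": 1500},
    {"input_tokens": 4000, "cached_tokens": 3900},
]
# 98304 B/token is the value the Progress Snapshot reports for this model.
summary = kv_and_uncached_summary(requests, kv_bytes_per_token=98304)
print(summary["uncached_tokens"])  # {'p50': 500, 'p90': 1000, 'p99': 1000}
```

In the toy data the longest prompt (4000 tokens) has the smallest uncached append (100 tokens), which is exactly the gap between total prompt length and effective prefill work that this batch is meant to quantify.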

Data Artifacts

workload_summary.json
kv_footprint_summary.json
reuse_decomposition.json
session_skew.json
append_delta_stats.json

Figures

input/output token CDF
input/output ratio CDF
KV footprint CDF
reuse decomposition stacked bar
turns per session CDF
per-session token mass Lorenz curve
top-k sessions token contribution bar
total input vs uncached tokens scatter

Audit Checks

The audit.md must answer:

What are input p50/p90/p99?
What are output p50/p90/p99?
What is the estimated KV footprint p50/p90/p99?
What fraction of reuse is intra-session?
What fraction of total token mass comes from top 1% / 5% sessions?
Are long prompts often small appends after cache reuse?

Pass Criteria

The batch passes only if these facts can be stated numerically with raw data links and plotted figures.

4. Batch 2: PD-Colo Prefill-Decode Interference Proof

Status: protocol DONE (analysis/characterization/protocols.md §"Batch 2 Protocol"); execution NOT STARTED — needs new engine instrumentation for decode-step and prefill-chunk timestamps.

Goal

Prove that PD-colocation can suffer from prefill-decode interference under high load, and quantify how much this affects TPOT, decode queueing, and SLO.

Hypothesis:

When heavy uncached prefill overlaps with active decode on the same worker,
decode TPOT and/or decode queue delay increases.

TODO

Run controlled microbenchmarks:
- decode-only steady load;
- decode load plus same-worker heavy prefill injection;
- decode load plus different-worker heavy prefill injection.
Sweep uncached prefill sizes:
- 2k
- 8k
- 16k
- 32k
- 64k
If supported, sweep chunked prefill size.
Log timestamps for:
- decode steps;
- prefill start/end;
- prefill chunks;
- queue admission;
- request completion.
In trace replay, label decode steps by whether they overlap with same-worker prefill.

Compute:

interference_index =
  TPOT_p90(decode steps overlapping same-worker prefill)
  / TPOT_p90(decode steps without same-worker prefill)

Compare same-worker vs different-worker controls.
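
The interference index defined above can be computed directly from labeled decode steps. The sketch below assumes each decode-step record has a tpot_ms value and a boolean overlaps_same_worker_prefill label; both names are hypothetical placeholders for whatever the engine instrumentation emits.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def interference_index(decode_steps, q=90):
    """TPOT p_q of overlapping steps divided by TPOT p_q of non-overlapping steps.

    Returns None when either group is empty, since the ratio is undefined.
    """
    overlap = [s["tpot_ms"] for s in decode_steps
               if s["overlaps_same_worker_prefill"]]
    clean = [s["tpot_ms"] for s in decode_steps
             if not s["overlaps_same_worker_prefill"]]
    if not overlap or not clean:
        return None
    return percentile(overlap, q) / percentile(clean, q)

# Toy data for illustration only.
steps = (
    [{"tpot_ms": t, "overlaps_same_worker_prefill": True} for t in (40, 50, 60)]
    + [{"tpot_ms": t, "overlaps_same_worker_prefill": False} for t in (20, 20, 25)]
)
idx = interference_index(steps)
```

Running the same function on the different-worker control, with the label redefined as overlap with prefill on another worker, gives the comparison the audit asks for. An index near 1.0 there and well above 1.0 for same-worker overlap would support local colocation rather than global load as the cause.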

Data Artifacts

interference_microbench_summary.json
decode_step_timeseries.csv
prefill_overlap_events.jsonl
interference_index.json
trace_overlap_summary.json

Figures

TPOT time series with prefill overlap annotation
interference index vs uncached prefill size
same-worker vs different-worker TPOT boxplot
chunk size vs TTFT/TPOT tradeoff
trace replay overlap vs non-overlap TPOT comparison

Audit Checks

The audit.md must answer:

Is the interference observed on the same worker?
Is the different-worker control significantly weaker?
Does interference grow with uncached prefill size?
Does the phenomenon appear in real trace replay, not only microbench?
Could the result be explained by global load instead of local colocation?

Pass Criteria

Same-worker overlap must measurably increase TPOT or decode queue delay.
The effect must be weaker or absent in the different-worker control.
The effect must be visible in at least one trace replay setting.

5. Batch 3: Session Hot-Spot Residual Imbalance Proof

Status: protocol DONE; partial signal from legacy gpu_util.csv (GPU-util imbalance visible) but causal proof NOT STARTED — needs per-worker queue/KV/APC and session→worker map from instrumented proxy.

Goal

Prove that cache-aware/LMetric is a strong baseline but still leaves residual hot-worker imbalance due to session skew and locality.

Hypothesis:

Cache-aware routing preserves locality by attracting future turns to cached
workers. This is usually good, but heavy-tailed sessions can create hot
workers whose queue delay/SLO violations are much worse than the median
worker even when other workers still have headroom.

TODO

Run the same session-causal trace with:
- corrected LMetric/cache-aware;
- load-only routing;
- hard sticky routing;
- current Unified hybrid, if available.
For each worker, record:
- assigned session count;
- cumulative input tokens;
- cumulative uncached tokens;
- cumulative output tokens;
- request queue delay;
- decode queue delay;
- GPU utilization;
- KV occupancy;
- APC / cache-hit rate;
- SLO violations.
For each session, record:
- worker set used;
- primary worker;
- cumulative token mass;
- number of turns;
- latency contribution;
- whether it appears in slow-request set.
Create a session-mass capped or equalized replay:
- cap max session turns or token mass;
- rerun LMetric/cache-aware;
- compare hot-spot index.

Compute:

hotspot_index =
  max_worker_queue_delay_p90 / median_worker_queue_delay_p90

Compute locality/load tradeoff:

locality_gain = APC(policy) - APC(load_only)
imbalance_cost =
  max_worker_latency_p90(policy) - median_worker_latency_p90(policy)
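
The three metrics above reduce to a few lines once per-worker p90 values are available. This is a minimal sketch, assuming the per-worker values arrive as a dict keyed by worker id; the function names mirror the formulas and are not existing repo functions.

```python
import statistics

def hotspot_index(worker_queue_delay_p90):
    """max over workers / median over workers of queue-delay p90."""
    values = list(worker_queue_delay_p90.values())
    median = statistics.median(values)
    if median == 0:
        return None  # ratio undefined when the median worker has no queueing
    return max(values) / median

def locality_gain(apc_policy, apc_load_only):
    """APC advantage of a policy over load-only routing."""
    return apc_policy - apc_load_only

def imbalance_cost(worker_latency_p90):
    """Latency p90 gap between the worst worker and the median worker."""
    values = list(worker_latency_p90.values())
    return max(values) - statistics.median(values)

# Toy per-worker values in seconds, for illustration only.
queue_p90 = {"w0": 1.0, "w1": 2.0, "w2": 8.0}
print(hotspot_index(queue_p90))   # 4.0
print(imbalance_cost(queue_p90))  # 6.0
```

Reporting locality_gain next to imbalance_cost for each policy gives one point per policy on the APC vs hotspot tradeoff plot listed under Figures.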

Data Artifacts

worker_balance_summary.json
session_to_worker_map.json
session_mass_summary.json
routing_policy_comparison.json
hotspot_index.json
capped_session_replay_summary.json

Figures

per-worker queue delay bar
per-worker token mass bar
GPU utilization timeline by worker
KV occupancy timeline by worker
APC vs queue delay scatter
top sessions contribution bar
policy tradeoff plot: APC vs hotspot_index
original vs session-capped hot-spot comparison

Audit Checks

The audit.md must answer:

Does LMetric/cache-aware still show worker-level skew?
Are SLO violations concentrated on hot workers or hot sessions?
Does load-only routing improve balance but reduce APC/locality?
Does hard sticky improve locality but worsen hot-spot/HOL?
Does session-mass capping reduce hot spots?

Pass Criteria

LMetric/cache-aware must be shown as strong but imperfect.
There must be measurable residual hot-worker imbalance.
The imbalance must correlate with session token mass or locality.

6. Batch 4: Sustainable Request Rate Sweep

Status: protocol DONE; execution NOT STARTED — requires open-loop session-causal loadgen and policy-comparable arrival process.

Goal

Connect interference and hot-spot mechanisms to the final metric:

SRR(SLO) = max arrival rate satisfying SLO in steady state

TODO

Define provisional SLO thresholds. Use configurable values, for example:

TTFT_p90 <= T_ttft
E2E_p90  <= T_e2e
TPOT_p90 <= T_tpot
error_rate <= epsilon
queue length stable
KV occupancy stable

Implement arrival-rate sweep:
- Poisson session arrivals;
- session-internal sequentiality;
- warmup window;
- steady-state measurement window.
For each arrival rate lambda, run:
- PD-colo cache-aware/LMetric;
- static PD-disagg;
- current Unified hybrid;
- optional hard sticky;
- optional load-only.
Find maximum sustainable lambda for each policy.
Report instability reasons:
- SLO violation;
- queue growth;
- KV occupancy growth;
- error/timeout growth.
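
Finding the maximum sustainable lambda per policy can be sketched as a binary search over arrival rates. The sketch assumes sustainability is monotone in lambda (every rate below a sustainable rate is also sustainable) and that is_sustainable(lam) launches a full run and evaluates it. Both the callback and the summary/slo key names are hypothetical. The queue and KV stability criteria are not covered because the checklist does not quantify them.

```python
def meets_slo(summary, slo):
    """Check the numeric SLO thresholds against one run summary."""
    return (
        summary["ttft_p90"] <= slo["ttft_p90"]
        and summary["e2e_p90"] <= slo["e2e_p90"]
        and summary["tpot_p90"] <= slo["tpot_p90"]
        and summary["error_rate"] <= slo["error_rate"]
    )

def find_srr(is_sustainable, lo, hi, tol):
    """Binary-search the largest sustainable arrival rate in [lo, hi].

    Returns None if even lo is unsustainable. Returns hi if hi is still
    sustainable, in which case the caller should widen the range.
    """
    if not is_sustainable(lo):
        return None
    if is_sustainable(hi):
        return hi
    # Invariant: lo is sustainable, hi is not.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_sustainable(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Stand-in for a real run: pretend the true boundary is 3.2 sessions/s.
srr = find_srr(lambda lam: lam <= 3.2, lo=0.5, hi=8.0, tol=0.05)
```

Each probe is a full warmup plus steady-state run, so tol sets the cost: the search needs roughly log2((hi - lo) / tol) runs per policy.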

Data Artifacts

srr_curve.json
lambda_runs/<lambda>/summary.json
slo_violation_reason.json
goodput_vs_arrival_rate.json
stability_summary.json

Figures

SRR bar chart
TTFT p90 vs arrival rate
E2E p90 vs arrival rate
TPOT p90 vs arrival rate
goodput vs arrival rate
error rate vs arrival rate
queue length over time near failure point
KV occupancy over time near failure point

Audit Checks

The audit.md must answer:

Are session arrivals open-loop and Poisson?
Is session-internal sequentiality enforced?
How long are warmup and steady-state windows?
Is SRR failure persistent rather than transient?
Are attempted/completed/error counts reported at every lambda?
Are policies compared on the same trace and same arrival process?

Pass Criteria

Each policy must have a measured SRR under the same SLO.
Failure must be attributed to persistent SLO violation, queue growth, KV growth, or error growth.
Data must be session-causal.

7. Batch 5: Failure Attribution Near SRR Boundary

Status: protocol DONE; execution NOT STARTED — depends on B2 instrumentation and B4 SRR boundary.

Goal

At and around the PD-colo/LMetric failure point, determine whether SLO violations are caused by prefill-decode interference, session hot spots, KV pressure, cache misses, or other mechanisms.

TODO

Select three arrival rates:

lambda = 0.9 * SRR
lambda = 1.0 * SRR
lambda = 1.1 * SRR

For every slow or SLO-violating request, assign labels:
- same-worker prefill overlap;
- hot worker queue;
- high KV occupancy;
- cache miss / large uncached append;
- transfer wait;
- P queue wait;
- D admission wait;
- unknown.
Produce per-request waterfall for representative slow requests.
Produce per-worker timeline around failure windows.
Summarize cause distribution.
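
The labeling and cause-distribution steps above can be sketched as a threshold rule table. Everything specific here is an assumption: the per-request field names, the rule that a label applies when a field exceeds its threshold, and the threshold values themselves would all need to be fixed by the B5 protocol.

```python
from collections import Counter

# (label, request field) pairs; field names are hypothetical.
RULES = [
    ("same_worker_prefill_overlap", "prefill_overlap_ms"),
    ("hot_worker_queue", "worker_queue_delay_ms"),
    ("high_kv_occupancy", "kv_occupancy_frac"),
    ("large_uncached_append", "uncached_tokens"),
    ("transfer_wait", "transfer_wait_ms"),
    ("p_queue_wait", "p_queue_wait_ms"),
    ("d_admission_wait", "d_admission_wait_ms"),
]

def label_request(req, thresholds):
    """Return every label whose field exceeds its threshold, else ['unknown']."""
    labels = [
        label for label, field in RULES
        if field in thresholds and req.get(field, 0) > thresholds[field]
    ]
    return labels or ["unknown"]

def cause_distribution(slow_requests, thresholds):
    """Count labels over all slow requests; a request may carry several labels."""
    counts = Counter()
    for req in slow_requests:
        counts.update(label_request(req, thresholds))
    return dict(counts)

# Illustrative thresholds and requests only.
thresholds = {"prefill_overlap_ms": 100, "kv_occupancy_frac": 0.9}
slow = [
    {"prefill_overlap_ms": 250, "kv_occupancy_frac": 0.95},
    {"prefill_overlap_ms": 10, "kv_occupancy_frac": 0.50},
]
dist = cause_distribution(slow, thresholds)
```

Because labels are not mutually exclusive, the counts can sum to more than the number of slow requests. The "unknown" bucket answers the audit question about what remains unexplained.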

Data Artifacts

slow_request_attribution.jsonl
failure_breakdown.json
case_studies.md
worker_failure_windows.json

Figures

SLO violation cause stacked bar
slow request waterfall
worker timeline near failure
prefill/decode/KV/queue stacked breakdown
failure cause vs arrival rate

Audit Checks

The audit.md must answer:

What fraction of slow requests overlap same-worker prefill?
What fraction are on hot workers?
What fraction happen under high KV occupancy?
What fraction are large uncached append requests?
For PD-disagg/Unified migration, how much time is transfer/P queue/D wait?
What remains unexplained?

Pass Criteria

The batch must answer:

Why PD-colo/LMetric hits its SRR limit.
Why static PD-disagg hits its SRR limit.
If Unified/PUSH underperforms, whether the cause is trigger quality, cost model, transfer overhead, wrong load regime, or something else.

8. Batch 6: Audit Package

Status: scaffold DONE — all five final artifacts exist under analysis/characterization/current_results/ and are regenerated by summarize_runs.py + plot_current_results.py. Future B2–B5 outputs must be merged into the same package by re-running summarize_runs.py after new runs.

Goal

Make the whole characterization package reviewable by a strict systems reviewer.

TODO

Write a claim matrix:

claim -> data artifact -> figure -> script -> caveat -> reviewer risk

Write a figure index:
- figure filename;
- source data;
- generation command;
- intended claim.
Write a reviewer risk register:
- loadgen validity risks;
- trace representativeness risks;
- metric bias risks;
- implementation-specific risks;
- generalization risks.
Write a reproduction script or command list.
Mark experiments that cannot support main claims.

Final Artifacts

characterization_claim_matrix.md
all_figures_index.md
reviewer_risk_register.md
reproduction_commands.sh
main_claim_allowed_runs.md

Audit Checks

The final package must satisfy:

Every claim links to raw data.
Every figure can be regenerated.
Every experiment has a manifest.
Every caveat is explicit.
Invalid or stress-only runs are not used for online-serving claims.

9. Priority Order

Priority 1

Do these first:

Batch 0: Benchmark Substrate Audit
Batch 1: Workload Characterization
Batch 3: Session Hot-Spot Residual Imbalance Proof

Reason:

These define whether the trace and routing problem are real. Without them, SRR sweeps and system experiments are not trustworthy.

Priority 2

Do these after the substrate and workload facts are stable:

Batch 2: PD-Colo Prefill-Decode Interference Proof
Batch 5: Failure Attribution Near SRR Boundary

Reason:

These explain the mechanisms behind SLO/SRR failure and determine what the positive system should actually fix.

Priority 3

Do these after instrumentation and attribution are ready:

Batch 4: Sustainable Request Rate Sweep
Batch 6: Audit Package

Reason:

SRR sweeps are expensive. They should run only after trace validity, logging, and attribution labels are ready.

10. Non-Negotiable Reviewer Rules

Do not use session-nonsequential loadgen for online-serving claims.
Do not compare latency percentiles without attempted/completed/error counts.
Do not use APC alone as a success metric.
Do not use average GPU utilization as proof of load balance.
Do not compare policies on different traces unless explicitly labeled.
Do not hide failed requests or timeouts.
Do not claim Unified/PUSH is the answer before failure attribution proves the relevant bottleneck and cost budget.
Treat corrected LMetric/cache-aware PD-colo as the main baseline.
Treat static PD-disagg as an important baseline, not a strawman.
Every result must be reproducible from raw artifacts and commands.
