agentic-kvc/analysis/characterization_todo_for_interns.md
Gahow Wang 4722883903 Audit package refresh: Window 1 supported claims + risk register
Refresh the standing audit package now that B1' / B2 / B3 are complete.

current_results/characterization_claim_matrix.md
  Flips seven entries from "not_yet_supported" / "partially_supported"
  to "supported" with pointers into window_1_results/. New entries
  cover per-session sequentiality, KV per request, real reuse
  decomposition, theoretical APC ceiling, the LMetric locality gap,
  Unified breaking the locality-vs-latency tradeoff, B2 causal
  interference proof, sticky's interference inflation, and the
  partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay
  "not_yet_supported" (Window 2 work).

current_results/main_claim_allowed_runs.md
  New "Allowed For Routing-Policy Comparison" section pins the five
  B3 policy directories. New "Allowed For PD-colo Interference"
  section pins the B2 sweep. Legacy section retained for the
  pre-instrumentation 200/500/1000-req runs.

current_results/reviewer_risk_register.md
  Marks the two old "high"-severity risks (sequentiality / reuse
  decomposition) as resolved; adds new entries for the APC
  contamination empirics, the b3_analyze.sh truncate-write bug that
  cost unified's interference index, the GPU-0 EngineCore ghost
  cleanup, the saturated-replay caveat for trace-timestamp dispatch,
  and the synthetic B2 decode workload.

current_results/all_figures_index.md
  Adds the 8 new Window 1 figures alongside the existing 6 from the
  legacy summarize_runs run.

current_results/reproduction_commands.sh
  Records the full B3 + B2 + figure pipeline.

analysis/characterization_todo_for_interns.md
  Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE;
  only B4 and B5 remain (Window 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 23:25:27 +08:00


Agentic Workload Characterization TODO

Status: execution checklist for interns
Date: 2026-05-25
Last progress audit: 2026-05-25

Progress Snapshot (2026-05-25, post-Window-1)

| Batch | State | Evidence |
|---|---|---|
| B0 Substrate audit | DONE for new runs, legacy still partial | A1+A2 instrumentation lands per-request unix timestamps and X-Request-Id passthrough; B3 sweep 2026-05-25 achieves 100% join coverage on all 5 policy runs |
| B1 Workload characterization | DONE | window_1_results/kv_footprint_summary.json (98304 B/token, p99 = 11.49 GiB); real reuse decomposition (lmetric_reuse.json: 93.2% intra-session, 5.7% cross, 1.1% shared); theoretical APC ceilings (apc_upper_w600.json: 79.6% intra / 80.3% any) |
| B2 PD interference | DONE | outputs/b2_microbench/sweep/ 5 × 2 cells. Different-worker control idx 0.92-1.02 across 32× prefill size variation; same-worker TTFT idx scales 2.15× → 218×. Causal proof complete. |
| B3 5-policy routing sweep | DONE | outputs/b3_sweep_20260525_095043/ lmetric/load_only/sticky (warm-cache) + unified/capped (isolated cold-start). Aggregated in b3_policy_comparison.json. Unified hits APC 79.4% (97% of ceiling) AND TTFT p90 7.24 s. |
| B4 SRR sweep | NOT DONE | Window 2 task. A4 loadgen + per-class SLO + λ binary search per policy. |
| B5 Failure attribution | NOT DONE | Window 2 task. Depends on B4 SRR boundaries. |
| B6 Audit package | DONE for Window 1 | current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh} refreshed; Window 1 results aggregated in window_1_results.md + 8 PNG figures |

Reusable assets already in repo:

  • analysis/characterization/analyze.py — B0+B1 CPU-only analyzer
  • analysis/characterization/summarize_runs.py — existing-run audit producing the B6 scaffold
  • analysis/characterization/plot_current_results.py — figure regeneration script
  • analysis/characterization/protocols.md — B2-B6 protocol with required instrumentation, sweep, pass condition
  • analysis/characterization/current_results/ — current audit package (claim matrix + risk register + allowed-runs gate + 6 PNG figures)

Hard gates still blocking main claims:

  1. Replayer/proxy must emit per-request dispatch + finish/error wall-clock timestamps (blocks B0 actual sequentiality, B4 SRR validity).
  2. Per-request record must carry session_id + hash_ids + cached_tokens jointly (blocks B1 reuse decomposition).
  3. Engine/proxy must log decode-step and prefill-chunk timestamps with worker id (blocks B2 interference index).
  4. Proxy must log route decision, chosen worker, candidate scores, per-worker queue/KV/APC snapshot (blocks B3 hot-spot proof).

0. Purpose

We are not starting from the assumption that Unified routing or PUSH migration is already the answer.

The first goal is to build a rigorous characterization package that proves:

  1. which dimensions make agentic serving different;
  2. where static PD-disaggregation works poorly;
  3. where PD-colocation/cache-aware routing still has residual failure modes;
  4. how these failure modes reduce sustainable request rate under SLO.

Only after these facts are established should we refine the positive system design.

Primary system goal:

maximize sustainable request rate under SLO

Prefill-decode interference and session hot-spot imbalance are mechanisms that may reduce the sustainable request rate (SRR). They are not the final metric by themselves.

1. Global Delivery Rules

Every task must produce data, figures, and an audit trail. A task is not complete if it only produces a written conclusion.

Use this output layout:

outputs/characterization/<date>/<task_name>/
├── manifest.json
├── raw/
├── summary.json
├── summary.md
├── figures/
└── audit.md

Required fields in manifest.json:

{
  "git_commit": "",
  "host": "",
  "gpu_type": "",
  "gpu_count": 0,
  "trace_path": "",
  "trace_sha256": "",
  "policy": "",
  "launch_command": "",
  "request_limit": null,
  "time_scale": null,
  "session_sampling_method": "",
  "session_sequential": true,
  "start_time": "",
  "end_time": ""
}
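
A manifest check along these lines could run before a task is accepted. This is a minimal sketch, assuming every key in the template above is required and that null values are allowed; `REQUIRED_FIELDS` and `missing_manifest_fields` are illustrative names, not existing repo helpers.

```python
import json

# Field names copied from the manifest.json template above.
REQUIRED_FIELDS = [
    "git_commit", "host", "gpu_type", "gpu_count", "trace_path",
    "trace_sha256", "policy", "launch_command", "request_limit",
    "time_scale", "session_sampling_method", "session_sequential",
    "start_time", "end_time",
]

def missing_manifest_fields(manifest: dict) -> list:
    """Return required keys that are absent (a present key with a null value passes)."""
    return [key for key in REQUIRED_FIELDS if key not in manifest]

# Hypothetical, deliberately incomplete manifest.
example = json.loads('{"git_commit": "abc123", "host": "node0", "gpu_count": 8}')
print(missing_manifest_fields(example))
```

An empty list means the manifest carries every required key; it says nothing about whether the values are correct.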

Every comparison must report:

  • attempted requests
  • completed requests
  • errors / timeouts
  • goodput
  • TTFT p50/p90/p99
  • E2E p50/p90/p99
  • TPOT p50/p90/p99
  • per-worker queue metrics
  • per-worker GPU utilization
  • per-worker KV occupancy if available
  • per-worker APC / cache-hit metrics

Every figure must be reproducible from raw data by a script committed or saved alongside the artifact.

2. Batch 0: Benchmark Substrate Audit

Status: analyzer DONE (analyze.py); legacy-run sequentiality claim BLOCKED by missing dispatch/finish timestamps in metrics.jsonl. New replayer must add those fields before any online_realistic classification is allowed.

Goal

Prove the load generator and trace replay are valid before trusting any performance result.

The most important invariant:

For online agentic serving, each session must have at most one in-flight turn.
Turn N+1 must not be sent before turn N completes.

TODO

  1. Implement or run an analyzer that reconstructs per-session request intervals:
    • dispatch timestamp
    • first-token timestamp
    • finish timestamp
    • error / timeout timestamp
  2. Compute max concurrent in-flight turns per session.
  3. Compute session start-time distribution.
  4. Compute turn inter-arrival distribution.
  5. Classify each existing run as one of:
    • online_realistic
    • burst_stress
    • synthetic_microbench
    • invalid_for_online_claim
  6. For any run where session sequentiality is violated, write down exactly which claim it can still support.
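
The sequentiality invariant in items 1 and 2 can be checked with a sweep over dispatch and finish events. The sketch below assumes each record carries `session_id`, `dispatch_ts`, and `finish_ts` in seconds (for failed requests, the error or timeout timestamp would stand in for `finish_ts`); these field names are assumptions, not the actual metrics.jsonl schema.

```python
from collections import defaultdict

def max_inflight_per_session(records):
    """Return {session_id: peak number of turns in flight at the same time}."""
    events = defaultdict(list)
    for r in records:
        events[r["session_id"]].append((r["dispatch_ts"], 1))   # turn starts
        events[r["session_id"]].append((r["finish_ts"], -1))    # turn ends
    peaks = {}
    for sid, evs in events.items():
        # At equal timestamps, process finishes (-1) before dispatches (+1),
        # so a turn dispatched exactly when the previous one ends is not a violation.
        evs.sort(key=lambda e: (e[0], e[1]))
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[sid] = peak
    return peaks

# Hypothetical records: session "a" is sequential, session "b" overlaps.
records = [
    {"session_id": "a", "dispatch_ts": 0.0, "finish_ts": 5.0},
    {"session_id": "a", "dispatch_ts": 5.0, "finish_ts": 9.0},
    {"session_id": "b", "dispatch_ts": 0.0, "finish_ts": 5.0},
    {"session_id": "b", "dispatch_ts": 3.0, "finish_ts": 8.0},
]
violations = {s: n for s, n in max_inflight_per_session(records).items() if n > 1}
print(violations)
```

Any session in `violations` would disqualify the run from the online_realistic class under the pass criteria in this batch.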

Data Artifacts

  • session_concurrency.json
  • session_arrival_stats.json
  • turn_interval_stats.json
  • trace_profile.json
  • invalid_runs.md

Figures

  • session start-time CDF
  • per-session max in-flight histogram
  • turns per session CDF
  • turn inter-arrival CDF

Audit Checks

The audit.md must answer:

  1. Does the main trace satisfy max_inflight_per_session == 1?
  2. If not, is the run explicitly labeled as stress or invalid?
  3. Are attempted/completed/error counts included?
  4. Are latency percentiles computed only over successes, and if so, is goodput also reported?

Pass Criteria

  • Main online-serving experiments must have max_inflight_per_session == 1.
  • Any violation must be clearly labeled and excluded from SRR claims.

3. Batch 1: Workload Characterization

Status: trace-shape items (1, 2, 3, 6, 8) DONE on full 7200 s GLM-5.1 trace; recorded in current_results/full_trace_summary.json. Items 4 (KV footprint), 5 (reuse decomposition), 7 (uncached append delta) are PENDING because they need --kv-bytes-per-token for the production model and joinable cached_tokens+hash_ids per request.

Goal

Establish agentic workload facts independent of any proposed system.

Required facts:

  1. long input, short output;
  2. large per-request KV footprint;
  3. reuse is mostly intra-session;
  4. session token mass is heavy-tailed;
  5. total prompt length and effective uncached prefill work are different.

TODO

  1. Compute input token CDF.

  2. Compute output token CDF.

  3. Compute input/output ratio.

  4. Estimate KV footprint per request:

    kv_bytes_per_request = input_tokens * kv_bytes_per_token
    
  5. Decompose reusable KV into:

    • intra-session reuse
    • cross-session reuse
    • shared/system-prefix reuse
  6. Compute session-level skew:

    • turns per session
    • cumulative input tokens per session
    • cumulative output tokens per session
    • cumulative uncached tokens per session
    • top-k session contribution
  7. Compute append / effective-prefill distribution:

    uncached_tokens = input_tokens - cached_tokens
    
  8. Compare total input length vs uncached tokens.
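
The two formulas in items 4 and 7 can be sketched as follows. The 98304 B/token constant is the value the Progress Snapshot reports for kv_footprint_summary.json; the example request and the clamp at zero are assumptions added for illustration.

```python
KV_BYTES_PER_TOKEN = 98304  # B/token, as reported in the Progress Snapshot

def kv_bytes_per_request(input_tokens: int,
                         kv_bytes_per_token: int = KV_BYTES_PER_TOKEN) -> int:
    """kv_bytes_per_request = input_tokens * kv_bytes_per_token"""
    return input_tokens * kv_bytes_per_token

def uncached_tokens(input_tokens: int, cached_tokens: int) -> int:
    """uncached_tokens = input_tokens - cached_tokens.

    Clamped at zero (an assumption) in case a record reports more cached
    tokens than input tokens.
    """
    return max(input_tokens - cached_tokens, 0)

# Hypothetical request: long prompt, mostly served from cache.
req = {"input_tokens": 100_000, "cached_tokens": 96_000}
gib = kv_bytes_per_request(req["input_tokens"]) / 2**30
print(f"KV footprint: {gib:.2f} GiB, uncached: {uncached_tokens(**req)} tokens")
```

A request like this one is the case item 8 is meant to expose: a large total input length whose effective prefill work is only a small append.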

Data Artifacts

  • workload_summary.json
  • kv_footprint_summary.json
  • reuse_decomposition.json
  • session_skew.json
  • append_delta_stats.json

Figures

  • input/output token CDF
  • input/output ratio CDF
  • KV footprint CDF
  • reuse decomposition stacked bar
  • turns per session CDF
  • per-session token mass Lorenz curve
  • top-k sessions token contribution bar
  • total input vs uncached tokens scatter

Audit Checks

The audit.md must answer:

  1. What are input p50/p90/p99?
  2. What are output p50/p90/p99?
  3. What is the estimated KV footprint p50/p90/p99?
  4. What fraction of reuse is intra-session?
  5. What fraction of total token mass comes from top 1% / 5% sessions?
  6. Are long prompts often small appends after cache reuse?
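
Audit question 5 reduces to a top-fraction share over per-session token mass. A minimal sketch, assuming `session_tokens` maps a session id to its cumulative token count; the function name and the rounding-up of the session count are illustrative choices.

```python
import math

def top_share(session_tokens: dict, fraction: float) -> float:
    """Share of total token mass held by the top `fraction` of sessions."""
    masses = sorted(session_tokens.values(), reverse=True)
    total = sum(masses)
    if not masses or total == 0:
        return 0.0
    # Round before ceil so float noise (e.g. 7.000000000000001) does not add a session.
    k = max(1, math.ceil(round(fraction * len(masses), 9)))
    return sum(masses[:k]) / total

# Hypothetical trace: 99 small sessions and one very heavy session.
tokens = {f"s{i}": 1 for i in range(99)}
tokens["big"] = 901
print(top_share(tokens, 0.01), top_share(tokens, 0.05))
```

The same sorted masses, accumulated and normalized, give the points of the Lorenz curve listed under Figures.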

Pass Criteria

The batch passes only if these facts can be stated numerically with raw data links and plotted figures.

4. Batch 2: PD-Colo Prefill-Decode Interference Proof

Status: protocol DONE (analysis/characterization/protocols.md §"Batch 2 Protocol"); execution NOT STARTED — needs new engine instrumentation for decode-step and prefill-chunk timestamps.

Goal

Prove that PD-colocation can suffer from prefill-decode interference under high load, and quantify how much this affects TPOT, decode queueing, and SLO.

Hypothesis:

When heavy uncached prefill overlaps with active decode on the same worker,
decode TPOT and/or decode queue delay increases.

TODO

  1. Run controlled microbenchmarks:

    • decode-only steady load;
    • decode load plus same-worker heavy prefill injection;
    • decode load plus different-worker heavy prefill injection.
  2. Sweep uncached prefill sizes:

    • 2k
    • 8k
    • 16k
    • 32k
    • 64k
  3. If supported, sweep chunked prefill size.

  4. Log timestamps for:

    • decode steps;
    • prefill start/end;
    • prefill chunks;
    • queue admission;
    • request completion.
  5. In trace replay, label decode steps by whether they overlap with same-worker prefill.

  6. Compute:

    interference_index =
      TPOT_p90(decode steps overlapping same-worker prefill)
      / TPOT_p90(decode steps without same-worker prefill)
    
  7. Compare same-worker vs different-worker controls.
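
The interference index in item 6 can be computed from labeled decode steps as sketched below. The input shape, a sequence of `(tpot_seconds, overlaps_same_worker_prefill)` pairs, and the nearest-rank percentile are assumptions; the protocol does not fix a percentile method.

```python
def p90(values):
    """Nearest-rank 90th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n)
    return ordered[rank - 1]

def interference_index(steps):
    """TPOT_p90 of overlapping decode steps divided by TPOT_p90 of clean steps."""
    overlap = [t for t, overlapping in steps if overlapping]
    clean = [t for t, overlapping in steps if not overlapping]
    if not overlap or not clean:
        raise ValueError("need both overlapping and non-overlapping decode steps")
    return p90(overlap) / p90(clean)

# Hypothetical decode steps: 20 ms without overlap, 50 ms with overlap.
steps = [(0.02, False)] * 10 + [(0.05, True)] * 10
print(interference_index(steps))
```

Running the same function on steps labeled against different-worker prefill gives the control value for item 7; an index near 1.0 there, with a larger same-worker index, is the pattern the pass criteria ask for.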

Data Artifacts

  • interference_microbench_summary.json
  • decode_step_timeseries.csv
  • prefill_overlap_events.jsonl
  • interference_index.json
  • trace_overlap_summary.json

Figures

  • TPOT time series with prefill overlap annotation
  • interference index vs uncached prefill size
  • same-worker vs different-worker TPOT boxplot
  • chunk size vs TTFT/TPOT tradeoff
  • trace replay overlap vs non-overlap TPOT comparison

Audit Checks

The audit.md must answer:

  1. Is the interference observed on the same worker?
  2. Is the different-worker control significantly weaker?
  3. Does interference grow with uncached prefill size?
  4. Does the phenomenon appear in real trace replay, not only microbench?
  5. Could the result be explained by global load instead of local colocation?

Pass Criteria

  • Same-worker overlap must measurably increase TPOT or decode queue delay.
  • The effect must be weaker or absent in the different-worker control.
  • The effect must be visible in at least one trace replay setting.

5. Batch 3: Session Hot-Spot Residual Imbalance Proof

Status: protocol DONE; partial signal from legacy gpu_util.csv (GPU-util imbalance visible) but causal proof NOT STARTED — needs per-worker queue/KV/APC and session→worker map from instrumented proxy.

Goal

Prove that cache-aware/LMetric is a strong baseline but still leaves residual hot-worker imbalance due to session skew and locality.

Hypothesis:

Cache-aware routing preserves locality by attracting future turns to cached
workers. This is usually good, but heavy-tailed sessions can create hot
workers whose queue delay/SLO violations are much worse than the median
worker even when other workers still have headroom.

TODO

  1. Run the same session-causal trace with:

    • corrected LMetric/cache-aware;
    • load-only routing;
    • hard sticky routing;
    • current Unified hybrid, if available.
  2. For each worker, record:

    • assigned session count;
    • cumulative input tokens;
    • cumulative uncached tokens;
    • cumulative output tokens;
    • request queue delay;
    • decode queue delay;
    • GPU utilization;
    • KV occupancy;
    • APC / cache-hit rate;
    • SLO violations.
  3. For each session, record:

    • worker set used;
    • primary worker;
    • cumulative token mass;
    • number of turns;
    • latency contribution;
    • whether it appears in slow-request set.
  4. Create a session-mass capped or equalized replay:

    • cap max session turns or token mass;
    • rerun LMetric/cache-aware;
    • compare hot-spot index.
  5. Compute:

    hotspot_index =
      max_worker_queue_delay_p90 / median_worker_queue_delay_p90
    
  6. Compute locality/load tradeoff:

    locality_gain = APC(policy) - APC(load_only)
    imbalance_cost =
      max_worker_latency_p90(policy) - median_worker_latency_p90(policy)
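
The three quantities in items 5 and 6 can be sketched directly from per-worker p90 values. The dict-of-workers input shape and the function names are assumptions; the formulas follow the definitions above.

```python
import statistics

def hotspot_index(worker_queue_delay_p90: dict) -> float:
    """max_worker_queue_delay_p90 / median_worker_queue_delay_p90"""
    values = list(worker_queue_delay_p90.values())
    median = statistics.median(values)
    if median == 0:
        return float("inf")  # assumption: an idle median worker counts as unbounded skew
    return max(values) / median

def locality_gain(apc_policy: float, apc_load_only: float) -> float:
    """APC(policy) - APC(load_only)"""
    return apc_policy - apc_load_only

def imbalance_cost(worker_latency_p90: dict) -> float:
    """max_worker_latency_p90(policy) - median_worker_latency_p90(policy)"""
    values = list(worker_latency_p90.values())
    return max(values) - statistics.median(values)

# Hypothetical per-worker p90 queue delays in seconds.
delays = {"w0": 1.0, "w1": 2.0, "w2": 6.0}
print(hotspot_index(delays), imbalance_cost(delays))
```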
    

Data Artifacts

  • worker_balance_summary.json
  • session_to_worker_map.json
  • session_mass_summary.json
  • routing_policy_comparison.json
  • hotspot_index.json
  • capped_session_replay_summary.json

Figures

  • per-worker queue delay bar
  • per-worker token mass bar
  • GPU utilization timeline by worker
  • KV occupancy timeline by worker
  • APC vs queue delay scatter
  • top sessions contribution bar
  • policy tradeoff plot: APC vs hotspot_index
  • original vs session-capped hot-spot comparison

Audit Checks

The audit.md must answer:

  1. Does LMetric/cache-aware still show worker-level skew?
  2. Are SLO violations concentrated on hot workers or hot sessions?
  3. Does load-only routing improve balance but reduce APC/locality?
  4. Does hard sticky improve locality but worsen hot-spot/HOL?
  5. Does session-mass capping reduce hot spots?

Pass Criteria

  • LMetric/cache-aware must be shown as strong but imperfect.
  • There must be measurable residual hot-worker imbalance.
  • The imbalance must correlate with session token mass or locality.

6. Batch 4: Sustainable Request Rate Sweep

Status: protocol DONE; execution NOT STARTED — requires open-loop session-causal loadgen and policy-comparable arrival process.

Goal

Connect interference and hot-spot mechanisms to the final metric:

SRR(SLO) = max arrival rate satisfying SLO in steady state

TODO

  1. Define provisional SLO thresholds. Use configurable values, for example:

    TTFT_p90 <= T_ttft
    E2E_p90  <= T_e2e
    TPOT_p90 <= T_tpot
    error_rate <= epsilon
    queue length stable
    KV occupancy stable
    
  2. Implement arrival-rate sweep:

    • Poisson session arrivals;
    • session-internal sequentiality;
    • warmup window;
    • steady-state measurement window.
  3. For each arrival rate lambda, run:

    • PD-colo cache-aware/LMetric;
    • static PD-disagg;
    • current Unified hybrid;
    • optional hard sticky;
    • optional load-only.
  4. Find maximum sustainable lambda for each policy.

  5. Report instability reasons:

    • SLO violation;
    • queue growth;
    • KV occupancy growth;
    • error/timeout growth.
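
Items 1 and 4 can be combined into an SLO predicate plus a search over lambda. This is a sketch under two assumptions: pass/fail is monotone in the arrival rate, and `run_at(lam)` is a hypothetical callable that runs one warmup plus steady-state window and returns whether the SLO held. The queue-length and KV-occupancy stability checks are omitted here and would need their own definition.

```python
def meets_slo(summary: dict, slo: dict) -> bool:
    """Check the percentile and error-rate thresholds from item 1."""
    return (summary["ttft_p90"] <= slo["T_ttft"]
            and summary["e2e_p90"] <= slo["T_e2e"]
            and summary["tpot_p90"] <= slo["T_tpot"]
            and summary["error_rate"] <= slo["epsilon"])

def find_srr(run_at, lo: float, hi: float, tol: float):
    """Binary search for the largest passing lambda in [lo, hi].

    Returns None if even `lo` fails, and `hi` if the whole range passes.
    """
    if not run_at(lo):
        return None
    if run_at(hi):
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if run_at(mid):
            lo = mid   # mid is sustainable, search higher
        else:
            hi = mid   # mid violates the SLO, search lower
    return lo

# Stand-in for a real run: pretend the system sustains up to 3.3 sessions/s.
srr = find_srr(lambda lam: lam <= 3.3, lo=0.5, hi=8.0, tol=0.01)
print(srr)
```

Because each probe is a full run, the tolerance directly sets the number of expensive runs per policy, which is one reason the priority order schedules this batch late.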

Data Artifacts

  • srr_curve.json
  • lambda_runs/<lambda>/summary.json
  • slo_violation_reason.json
  • goodput_vs_arrival_rate.json
  • stability_summary.json

Figures

  • SRR bar chart
  • TTFT p90 vs arrival rate
  • E2E p90 vs arrival rate
  • TPOT p90 vs arrival rate
  • goodput vs arrival rate
  • error rate vs arrival rate
  • queue length over time near failure point
  • KV occupancy over time near failure point

Audit Checks

The audit.md must answer:

  1. Are session arrivals open-loop and Poisson?
  2. Is session-internal sequentiality enforced?
  3. How long are warmup and steady-state windows?
  4. Is SRR failure persistent rather than transient?
  5. Are attempted/completed/error counts reported at every lambda?
  6. Are policies compared on the same trace and same arrival process?

Pass Criteria

  • Each policy must have a measured SRR under the same SLO.
  • Failure must be attributed to persistent SLO violation, queue growth, KV growth, or error growth.
  • Data must be session-causal.

7. Batch 5: Failure Attribution Near SRR Boundary

Status: protocol DONE; execution NOT STARTED — depends on B2 instrumentation and B4 SRR boundary.

Goal

At and around the PD-colo/LMetric failure point, determine whether SLO violations are caused by prefill-decode interference, session hot spots, KV pressure, cache misses, or other mechanisms.

TODO

  1. Select three arrival rates:

    lambda = 0.9 * SRR
    lambda = 1.0 * SRR
    lambda = 1.1 * SRR
    
  2. For every slow or SLO-violating request, assign labels:

    • same-worker prefill overlap;
    • hot worker queue;
    • high KV occupancy;
    • cache miss / large uncached append;
    • transfer wait;
    • P queue wait;
    • D admission wait;
    • unknown.
  3. Produce per-request waterfall for representative slow requests.

  4. Produce per-worker timeline around failure windows.

  5. Summarize cause distribution.
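
Steps 2 and 5 can be sketched as a multi-label classifier followed by a count. The record fields (`overlaps_same_worker_prefill`, `on_hot_worker`, `kv_occupancy`, `uncached_tokens`) and both thresholds are hypothetical placeholders; the transfer, P queue, and D admission labels are left out because they need PD-disagg timing fields.

```python
from collections import Counter

# Placeholder thresholds for illustration, not values from the protocol.
KV_HIGH = 0.90        # KV occupancy fraction treated as "high"
LARGE_APPEND = 8192   # uncached tokens treated as a "large" append

def label_slow_request(req: dict) -> list:
    """Return every cause label that applies to one slow request."""
    labels = []
    if req.get("overlaps_same_worker_prefill"):
        labels.append("same-worker prefill overlap")
    if req.get("on_hot_worker"):
        labels.append("hot worker queue")
    if req.get("kv_occupancy", 0.0) >= KV_HIGH:
        labels.append("high KV occupancy")
    if req.get("uncached_tokens", 0) >= LARGE_APPEND:
        labels.append("cache miss / large uncached append")
    return labels or ["unknown"]

def cause_distribution(slow_requests) -> Counter:
    """Count labels over all slow requests; one request may carry several."""
    counts = Counter()
    for req in slow_requests:
        counts.update(label_slow_request(req))
    return counts

# Hypothetical slow requests.
slow = [{"on_hot_worker": True, "kv_occupancy": 0.95}, {}]
print(cause_distribution(slow))
```

Since labels are not mutually exclusive, the counts can sum to more than the number of slow requests; the stacked-bar figure should state whether it shows label counts or a single primary cause per request.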

Data Artifacts

  • slow_request_attribution.jsonl
  • failure_breakdown.json
  • case_studies.md
  • worker_failure_windows.json

Figures

  • SLO violation cause stacked bar
  • slow request waterfall
  • worker timeline near failure
  • prefill/decode/KV/queue stacked breakdown
  • failure cause vs arrival rate

Audit Checks

The audit.md must answer:

  1. What fraction of slow requests overlap same-worker prefill?
  2. What fraction are on hot workers?
  3. What fraction happen under high KV occupancy?
  4. What fraction are large uncached append requests?
  5. For PD-disagg/Unified migration, how much time is transfer/P queue/D wait?
  6. What remains unexplained?

Pass Criteria

The batch must answer:

  1. Why PD-colo/LMetric hits its SRR limit.
  2. Why static PD-disagg hits its SRR limit.
  3. If Unified/PUSH underperforms, whether the cause is trigger quality, cost model, transfer overhead, wrong load regime, or something else.

8. Batch 6: Audit Package

Status: scaffold DONE — all five final artifacts exist under analysis/characterization/current_results/ and are regenerated by summarize_runs.py + plot_current_results.py. Future B2-B5 outputs must be merged into the same package by re-running summarize_runs.py after new runs.

Goal

Make the whole characterization package reviewable by a strict systems reviewer.

TODO

  1. Write a claim matrix:

    claim -> data artifact -> figure -> script -> caveat -> reviewer risk
    
  2. Write a figure index:

    • figure filename;
    • source data;
    • generation command;
    • intended claim.
  3. Write a reviewer risk register:

    • loadgen validity risks;
    • trace representativeness risks;
    • metric bias risks;
    • implementation-specific risks;
    • generalization risks.
  4. Write a reproduction script or command list.

  5. Mark experiments that cannot support main claims.

Final Artifacts

  • characterization_claim_matrix.md
  • all_figures_index.md
  • reviewer_risk_register.md
  • reproduction_commands.sh
  • main_claim_allowed_runs.md

Audit Checks

The final package must satisfy:

  1. Every claim links to raw data.
  2. Every figure can be regenerated.
  3. Every experiment has a manifest.
  4. Every caveat is explicit.
  5. Invalid or stress-only runs are not used for online-serving claims.

9. Priority Order

Priority 1

Do these first:

  1. Batch 0: Benchmark Substrate Audit
  2. Batch 1: Workload Characterization
  3. Batch 3: Session Hot-Spot Residual Imbalance Proof

Reason:

These define whether the trace and routing problem are real. Without them, SRR sweeps and system experiments are not trustworthy.

Priority 2

Do these after the substrate and workload facts are stable:

  1. Batch 2: PD-Colo Prefill-Decode Interference Proof
  2. Batch 5: Failure Attribution Near SRR Boundary

Reason:

These explain the mechanisms behind SLO/SRR failure and determine what the positive system should actually fix.

Priority 3

Do these after instrumentation and attribution are ready:

  1. Batch 4: Sustainable Request Rate Sweep
  2. Batch 6: Audit Package

Reason:

SRR sweeps are expensive. They should run only after trace validity, logging, and attribution labels are ready.

10. Non-Negotiable Reviewer Rules

  1. Do not use session-nonsequential loadgen for online-serving claims.
  2. Do not compare latency percentiles without attempted/completed/error counts.
  3. Do not use APC alone as a success metric.
  4. Do not use average GPU utilization as proof of load balance.
  5. Do not compare policies on different traces unless explicitly labeled.
  6. Do not hide failed requests or timeouts.
  7. Do not claim Unified/PUSH is the answer before failure attribution proves the relevant bottleneck and cost budget.
  8. Treat corrected LMetric/cache-aware PD-colo as the main baseline.
  9. Treat static PD-disagg as an important baseline, not a strawman.
  10. Every result must be reproducible from raw artifacts and commands.