agentic-kvc/analysis/characterization_todo_for_interns.md

# Agentic Workload Characterization TODO

Status: execution checklist for interns
Date: 2026-05-25
Last progress audit: 2026-05-25

## Progress Snapshot (2026-05-25, post-Window-1)

| Batch | State | Evidence |
|---|---|---|
| B0 Substrate audit | **DONE for new runs**, legacy still partial | A1+A2 instrumentation lands per-request unix timestamps and X-Request-Id passthrough; B3 sweep 2026-05-25 achieves 100% join coverage on all 5 policy runs |
| B1 Workload characterization | **DONE** | `window_1_results/kv_footprint_summary.json` (98304 B/token, p99 = 11.49 GiB); real reuse decomposition (`lmetric_reuse.json`: 93.2% intra-session, 5.7% cross, 1.1% shared); theoretical APC ceilings (`apc_upper_w600.json`: 79.6% intra / 80.3% any) |
| B2 PD interference | **DONE** | `outputs/b2_microbench/sweep/` 5 × 2 cells. Different-worker control idx 0.92-1.02 across 32× prefill size variation; same-worker TTFT idx scales 2.15× → 218×. Causal proof complete. |
| B3 5-policy routing sweep | **DONE** | `outputs/b3_sweep_20260525_095043/` lmetric/load_only/sticky (warm-cache) + unified/capped (isolated cold-start). Aggregated in `b3_policy_comparison.json`. Unified hits APC 79.4% (97% of ceiling) AND TTFT p90 7.24 s. |
| B4 SRR sweep | NOT DONE | Window 2 task. A4 loadgen + per-class SLO + λ binary search per policy. |
| B5 Failure attribution | NOT DONE | Window 2 task. Depends on B4 SRR boundaries. |
| B6 Audit package | **DONE for Window 1** | `current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` refreshed; Window 1 results aggregated in `window_1_results.md` + 8 PNG figures |

Reusable assets already in repo:

- `analysis/characterization/analyze.py` — B0+B1 CPU-only analyzer
- `analysis/characterization/summarize_runs.py` — existing-run audit producing the B6 scaffold
- `analysis/characterization/plot_current_results.py` — figure regeneration script
- `analysis/characterization/protocols.md` — B2–B6 protocol with required instrumentation, sweep, pass condition
- `analysis/characterization/current_results/` — current audit package (claim matrix + risk register + allowed-runs gate + 6 PNG figures)

Hard gates still blocking main claims:

1. Replayer/proxy must emit per-request dispatch + finish/error wall-clock timestamps (blocks B0 actual sequentiality, B4 SRR validity).
2. Per-request record must carry `session_id` + `hash_ids` + `cached_tokens` jointly (blocks B1 reuse decomposition).
3. Engine/proxy must log decode-step and prefill-chunk timestamps with worker id (blocks B2 interference index).
4. Proxy must log route decision, chosen worker, candidate scores, per-worker queue/KV/APC snapshot (blocks B3 hot-spot proof).


## 0. Purpose

We are not starting from the assumption that Unified routing or PUSH
migration is already the answer.

The first goal is to build a rigorous characterization package that proves:

1. which dimensions make agentic serving different;
2. where static PD-disaggregation works poorly;
3. where PD-colocation/cache-aware routing still has residual failure modes;
4. how these failure modes reduce sustainable request rate under SLO.

Only after these facts are established should we refine the positive system
design.

Primary system goal:

```text
maximize sustainable request rate under SLO
```

Prefill-decode interference and session hot-spot imbalance are mechanisms
that may reduce the sustainable request rate (SRR). They are not final
metrics in themselves.

## 1. Global Delivery Rules

Every task must produce data, figures, and an audit trail. A task is not
complete if it only produces a written conclusion.

Use this output layout:

```text
outputs/characterization/<date>/<task_name>/
├── manifest.json
├── raw/
├── summary.json
├── summary.md
├── figures/
└── audit.md
```

Required fields in `manifest.json`:

```json
{
  "git_commit": "",
  "host": "",
  "gpu_type": "",
  "gpu_count": 0,
  "trace_path": "",
  "trace_sha256": "",
  "policy": "",
  "launch_command": "",
  "request_limit": null,
  "time_scale": null,
  "session_sampling_method": "",
  "session_sequential": true,
  "start_time": "",
  "end_time": ""
}
```
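
A minimal sketch of how a task script could fill `trace_sha256` and check a manifest for missing keys. The helper names (`sha256_of_file`, `missing_manifest_fields`) are hypothetical and not part of the existing analyzers. The key list is copied from the template above.

```python
import hashlib

# Keys copied from the manifest.json template above.
REQUIRED_FIELDS = (
    "git_commit", "host", "gpu_type", "gpu_count", "trace_path",
    "trace_sha256", "policy", "launch_command", "request_limit",
    "time_scale", "session_sampling_method", "session_sequential",
    "start_time", "end_time",
)


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large traces need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_manifest_fields(manifest):
    """Return required keys that are absent from a loaded manifest dict."""
    return [key for key in REQUIRED_FIELDS if key not in manifest]
```

A task could refuse to write `summary.json` while `missing_manifest_fields(...)` is non-empty, so incomplete manifests are caught at run time rather than in review.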

Every comparison must report:

- attempted requests
- completed requests
- errors / timeouts
- goodput
- TTFT p50/p90/p99
- E2E p50/p90/p99
- TPOT p50/p90/p99
- per-worker queue metrics
- per-worker GPU utilization
- per-worker KV occupancy if available
- per-worker APC / cache-hit metrics
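
The counting and percentile part of this list can be sketched as below. The record fields (`status`, `ttft_s`, `e2e_s`, `tpot_s`) are assumed names, and goodput is taken here to mean completed requests per second of measurement window, which this document does not define. Percentiles use the nearest-rank method and cover successes only, which is why the attempted, completed, and error counts sit next to them.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for integer q in 1..100; None if empty."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records, window_s):
    """Summarize one run; latency percentiles cover successful requests only."""
    ok = [r for r in records if r["status"] == "ok"]
    summary = {
        "attempted": len(records),
        "completed": len(ok),
        "errors": len(records) - len(ok),
        # Assumed definition: completed requests per second of window.
        "goodput_rps": len(ok) / window_s,
    }
    for metric in ("ttft_s", "e2e_s", "tpot_s"):
        values = [r[metric] for r in ok]
        for q in (50, 90, 99):
            summary[f"{metric}_p{q}"] = percentile(values, q)
    return summary
```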

Every figure must be reproducible from raw data by a script committed or
saved alongside the artifact.

## 2. Batch 0: Benchmark Substrate Audit

Status: analyzer DONE (`analyze.py`); legacy-run sequentiality claim BLOCKED by missing dispatch/finish timestamps in `metrics.jsonl`. New replayer must add those fields before any `online_realistic` classification is allowed.

### Goal

Prove the load generator and trace replay are valid before trusting any
performance result.

The most important invariant:

```text
For online agentic serving, each session must have at most one in-flight turn.
Turn N+1 must not be sent before turn N completes.
```

### TODO

1. Implement or run an analyzer that reconstructs per-session request
   intervals:
   - dispatch timestamp
   - first-token timestamp
   - finish timestamp
   - error / timeout timestamp
2. Compute max concurrent in-flight turns per session.
3. Compute session start-time distribution.
4. Compute turn inter-arrival distribution.
5. Classify each existing run as one of:
   - `online_realistic`
   - `burst_stress`
   - `synthetic_microbench`
   - `invalid_for_online_claim`
6. For any run where session sequentiality is violated, write down exactly
   which claim it can still support.
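
Step 2 can be sketched as a sweep over dispatch and finish events per session. The field names (`session_id`, `dispatch_ts`, `finish_ts`) are assumptions about the replayer output. For failed requests, `finish_ts` is assumed to hold the error or timeout timestamp.

```python
from collections import defaultdict


def max_inflight_per_session(requests):
    """Return {session_id: peak number of concurrently in-flight turns}."""
    events = defaultdict(list)
    for req in requests:
        events[req["session_id"]].append((req["dispatch_ts"], 1))
        events[req["session_id"]].append((req["finish_ts"], -1))

    peaks = {}
    for session_id, session_events in events.items():
        # At equal timestamps, finishes (-1) sort before dispatches (+1), so a
        # turn sent exactly when the previous one ends is not an overlap.
        session_events.sort()
        current = peak = 0
        for _, delta in session_events:
            current += delta
            peak = max(peak, current)
        peaks[session_id] = peak
    return peaks
```

Any session with a peak above 1 violates the invariant, so its run cannot be classified as `online_realistic`.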

### Data Artifacts

- `session_concurrency.json`
- `session_arrival_stats.json`
- `turn_interval_stats.json`
- `trace_profile.json`
- `invalid_runs.md`

### Figures

- session start-time CDF
- per-session max in-flight histogram
- turns per session CDF
- turn inter-arrival CDF

### Audit Checks

The `audit.md` must answer:

1. Does the main trace satisfy `max_inflight_per_session == 1`?
2. If not, is the run explicitly labeled as stress or invalid?
3. Are attempted/completed/error counts included?
4. Are latency percentiles computed only over successes, and if so, is
   goodput also reported?

### Pass Criteria

- Main online-serving experiments must have `max_inflight_per_session == 1`.
- Any violation must be clearly labeled and excluded from SRR claims.

## 3. Batch 1: Workload Characterization

Status: DONE per the 2026-05-25 Progress Snapshot. Trace-shape items (1, 2, 3, 6, 8) ran on the full 7200 s GLM-5.1 trace and are recorded in `current_results/full_trace_summary.json`. Items 4 (KV footprint) and 5 (reuse decomposition), which needed `--kv-bytes-per-token` for the production model and joinable `cached_tokens`+`hash_ids` per request, now have evidence in `window_1_results/kv_footprint_summary.json` and `lmetric_reuse.json`. Item 7 (uncached append delta) depended on the same join.

### Goal

Establish agentic workload facts independent of any proposed system.

Required facts:

1. long input, short output;
2. large per-request KV footprint;
3. reuse is mostly intra-session;
4. session token mass is heavy-tailed;
5. total prompt length and effective uncached prefill work are different.

### TODO

1. Compute input token CDF.
2. Compute output token CDF.
3. Compute input/output ratio.
4. Estimate KV footprint per request:

   ```text
   kv_bytes_per_request = input_tokens * kv_bytes_per_token
   ```

5. Decompose reusable KV into:
   - intra-session reuse
   - cross-session reuse
   - shared/system-prefix reuse
6. Compute session-level skew:
   - turns per session
   - cumulative input tokens per session
   - cumulative output tokens per session
   - cumulative uncached tokens per session
   - top-k session contribution
7. Compute append / effective-prefill distribution:

   ```text
   uncached_tokens = input_tokens - cached_tokens
   ```

8. Compare total input length vs uncached tokens.
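
The two formulas in items 4 and 7 can be sketched as follows. The default of 98304 B/token is the value the Progress Snapshot attributes to `kv_footprint_summary.json`. The function names are hypothetical.

```python
KV_BYTES_PER_TOKEN = 98304  # value cited in the Progress Snapshot


def kv_footprint_gib(input_tokens, kv_bytes_per_token=KV_BYTES_PER_TOKEN):
    """Estimated KV footprint of one request in GiB (input tokens only)."""
    return input_tokens * kv_bytes_per_token / 2**30


def uncached_tokens(input_tokens, cached_tokens):
    """Effective prefill work: prompt tokens not served from cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must lie in [0, input_tokens]")
    return input_tokens - cached_tokens
```

For example, a 32768-token prompt at 98304 B/token is 3 GiB of KV state. If 30000 of those tokens are cached, the effective prefill is 2768 tokens.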

### Data Artifacts

- `workload_summary.json`
- `kv_footprint_summary.json`
- `reuse_decomposition.json`
- `session_skew.json`
- `append_delta_stats.json`

### Figures

- input/output token CDF
- input/output ratio CDF
- KV footprint CDF
- reuse decomposition stacked bar
- turns per session CDF
- per-session token mass Lorenz curve
- top-k sessions token contribution bar
- total input vs uncached tokens scatter

### Audit Checks

The `audit.md` must answer:

1. What are input p50/p90/p99?
2. What are output p50/p90/p99?
3. What is the estimated KV footprint p50/p90/p99?
4. What fraction of reuse is intra-session?
5. What fraction of total token mass comes from top 1% / 5% sessions?
6. Do long prompts often reduce to small uncached appends after cache reuse?
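
Audit question 5 can be answered with a short helper like the one below. It is a sketch: `session_tokens` is an assumed mapping from session id to cumulative token mass, and the top set is rounded up to at least one session.

```python
import math


def top_share(session_tokens, percent):
    """Fraction of total token mass held by the heaviest `percent`% of sessions."""
    masses = sorted(session_tokens.values(), reverse=True)
    total = sum(masses)
    if total == 0:
        return 0.0
    # Round up so a small trace still has at least one "top" session.
    k = max(1, math.ceil(percent * len(masses) / 100))
    return sum(masses[:k]) / total
```

The same sorted masses, accumulated in ascending order, give the points of the Lorenz curve listed under Figures.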

### Pass Criteria

The batch passes only if these facts can be stated numerically with raw data
links and plotted figures.

## 4. Batch 2: PD-Colo Prefill-Decode Interference Proof

Status: protocol DONE (`analysis/characterization/protocols.md` §"Batch 2 Protocol"); execution DONE per the 2026-05-25 Progress Snapshot (`outputs/b2_microbench/sweep/`). Execution previously waited on new engine instrumentation for decode-step and prefill-chunk timestamps.

### Goal

Prove that PD-colocation can suffer from prefill-decode interference under
high load, and quantify how much this affects TPOT, decode queueing, and SLO.

Hypothesis:

```text
When heavy uncached prefill overlaps with active decode on the same worker,
decode TPOT and/or decode queue delay increases.
```

### TODO

1. Run controlled microbenchmarks:
   - decode-only steady load;
   - decode load plus same-worker heavy prefill injection;
   - decode load plus different-worker heavy prefill injection.
2. Sweep uncached prefill sizes:
   - 2k
   - 8k
   - 16k
   - 32k
   - 64k
3. If supported, sweep chunked prefill size.
4. Log timestamps for:
   - decode steps;
   - prefill start/end;
   - prefill chunks;
   - queue admission;
   - request completion.
5. In trace replay, label decode steps by whether they overlap with
   same-worker prefill.
6. Compute:

   ```text
   interference_index =
     TPOT_p90(decode steps overlapping same-worker prefill)
     / TPOT_p90(decode steps without same-worker prefill)
   ```

7. Compare same-worker vs different-worker controls.
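
Steps 5 and 6 can be sketched as below. It assumes decode steps and prefill intervals are available as `(start_ts, end_ts)` pairs tagged with a worker id, and it uses per-step duration as a stand-in for TPOT. Both are assumptions about the instrumentation output, not a description of the existing analyzer.

```python
import math


def p90(values):
    """Nearest-rank 90th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[math.ceil(90 * len(ordered) / 100) - 1]


def interference_index(decode_steps, prefills_by_worker):
    """decode_steps: iterable of (worker_id, start_ts, end_ts).
    prefills_by_worker: dict worker_id -> list of (start_ts, end_ts).
    """
    overlapped, clean = [], []
    for worker, start, end in decode_steps:
        # Overlap counts only against prefill on the same worker.
        hit = any(
            p_start < end and start < p_end
            for p_start, p_end in prefills_by_worker.get(worker, ())
        )
        (overlapped if hit else clean).append(end - start)
    if not overlapped or not clean:
        return None  # index is undefined without both populations
    return p90(overlapped) / p90(clean)
```

Running the same function with the prefill intervals assigned to a different worker gives the control ratio for step 7.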

### Data Artifacts

- `interference_microbench_summary.json`
- `decode_step_timeseries.csv`
- `prefill_overlap_events.jsonl`
- `interference_index.json`
- `trace_overlap_summary.json`

### Figures

- TPOT time series with prefill overlap annotation
- interference index vs uncached prefill size
- same-worker vs different-worker TPOT boxplot
- chunk size vs TTFT/TPOT tradeoff
- trace replay overlap vs non-overlap TPOT comparison

### Audit Checks

The `audit.md` must answer:

1. Is the interference observed on the same worker?
2. Is the different-worker control significantly weaker?
3. Does interference grow with uncached prefill size?
4. Does the phenomenon appear in real trace replay, not only microbench?
5. Could the result be explained by global load instead of local colocation?

### Pass Criteria

- Same-worker overlap must measurably increase TPOT or decode queue delay.
- The effect must be weaker or absent in the different-worker control.
- The effect must be visible in at least one trace replay setting.

## 5. Batch 3: Session Hot-Spot Residual Imbalance Proof

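Status: protocol DONE; 5-policy routing sweep DONE per the 2026-05-25 Progress Snapshot (`outputs/b3_sweep_20260525_095043/`, aggregated in `b3_policy_comparison.json`). Before that sweep, the only signal was legacy `gpu_util.csv` (GPU-util imbalance visible); causal proof requires per-worker queue/KV/APC and a session→worker map from the instrumented proxy.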

### Goal

Prove that cache-aware/LMetric is a strong baseline but still leaves residual
hot-worker imbalance due to session skew and locality.

Hypothesis:

```text
Cache-aware routing preserves locality by attracting future turns to cached
workers. This is usually good, but heavy-tailed sessions can create hot
workers whose queue delay/SLO violations are much worse than the median
worker even when other workers still have headroom.
```

### TODO

1. Run the same session-causal trace with:
   - corrected LMetric/cache-aware;
   - load-only routing;
   - hard sticky routing;
   - current Unified hybrid, if available.
2. For each worker, record:
   - assigned session count;
   - cumulative input tokens;
   - cumulative uncached tokens;
   - cumulative output tokens;
   - request queue delay;
   - decode queue delay;
   - GPU utilization;
   - KV occupancy;
   - APC / cache-hit rate;
   - SLO violations.
3. For each session, record:
   - worker set used;
   - primary worker;
   - cumulative token mass;
   - number of turns;
   - latency contribution;
   - whether it appears in slow-request set.
4. Create a session-mass capped or equalized replay:
   - cap max session turns or token mass;
   - rerun LMetric/cache-aware;
   - compare hot-spot index.
5. Compute:

   ```text
   hotspot_index =
     max_worker_queue_delay_p90 / median_worker_queue_delay_p90
   ```

6. Compute locality/load tradeoff:

   ```text
   locality_gain = APC(policy) - APC(load_only)
   imbalance_cost =
     max_worker_latency_p90(policy) - median_worker_latency_p90(policy)
   ```
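
A minimal sketch of the two formulas above. The inputs are assumed to be per-worker p90 values keyed by worker id; the function names are hypothetical.

```python
import statistics


def hotspot_index(queue_delay_p90_by_worker):
    """max over median of per-worker queue-delay p90; None if median is 0."""
    values = list(queue_delay_p90_by_worker.values())
    median = statistics.median(values)
    if median == 0:
        return None  # ratio is undefined when the median worker never queues
    return max(values) / median


def locality_tradeoff(apc_policy, apc_load_only, latency_p90_by_worker):
    """Return locality_gain and imbalance_cost for one policy."""
    values = list(latency_p90_by_worker.values())
    return {
        "locality_gain": apc_policy - apc_load_only,
        "imbalance_cost": max(values) - statistics.median(values),
    }
```

With per-worker queue-delay p90 values of 1 s, 2 s, and 8 s, the hot-spot index is 8 / 2 = 4.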

### Data Artifacts

- `worker_balance_summary.json`
- `session_to_worker_map.json`
- `session_mass_summary.json`
- `routing_policy_comparison.json`
- `hotspot_index.json`
- `capped_session_replay_summary.json`

### Figures

- per-worker queue delay bar
- per-worker token mass bar
- GPU utilization timeline by worker
- KV occupancy timeline by worker
- APC vs queue delay scatter
- top sessions contribution bar
- policy tradeoff plot: APC vs hotspot_index
- original vs session-capped hot-spot comparison

### Audit Checks

The `audit.md` must answer:

1. Does LMetric/cache-aware still show worker-level skew?
2. Are SLO violations concentrated on hot workers or hot sessions?
3. Does load-only routing improve balance but reduce APC/locality?
4. Does hard sticky improve locality but worsen hot-spot/HOL?
5. Does session-mass capping reduce hot spots?

### Pass Criteria

- LMetric/cache-aware must be shown as strong but imperfect.
- There must be measurable residual hot-worker imbalance.
- The imbalance must correlate with session token mass or locality.

## 6. Batch 4: Sustainable Request Rate Sweep

Status: protocol DONE; execution NOT STARTED — requires open-loop session-causal loadgen and policy-comparable arrival process.

### Goal

Connect interference and hot-spot mechanisms to the final metric:

```text
SRR(SLO) = max arrival rate satisfying SLO in steady state
```

### TODO

1. Define provisional SLO thresholds. Use configurable values, for example:

   ```text
   TTFT_p90 <= T_ttft
   E2E_p90  <= T_e2e
   TPOT_p90 <= T_tpot
   error_rate <= epsilon
   queue length stable
   KV occupancy stable
   ```

2. Implement arrival-rate sweep:
   - Poisson session arrivals;
   - session-internal sequentiality;
   - warmup window;
   - steady-state measurement window.
3. For each arrival rate `lambda`, run:
   - PD-colo cache-aware/LMetric;
   - static PD-disagg;
   - current Unified hybrid;
   - optional hard sticky;
   - optional load-only.
4. Find maximum sustainable lambda for each policy.
5. Report instability reasons:
   - SLO violation;
   - queue growth;
   - KV occupancy growth;
   - error/timeout growth.
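
Step 4 can be sketched as a bisection over the arrival rate. It assumes sustainability is monotone in `lambda`, meaning that once a rate fails, every higher rate fails too. The `is_sustainable` callback is hypothetical; it stands for "run one full experiment at this rate and evaluate the SLO". `meets_slo` covers only the thresholded metrics, so the queue-length and KV-occupancy stability checks would need to be added separately.

```python
def meets_slo(summary, slo):
    """True if every thresholded metric in `slo` is within its limit."""
    return all(summary[name] <= limit for name, limit in slo.items())


def find_srr(is_sustainable, lo, hi, tol=0.05):
    """Largest sustainable arrival rate in [lo, hi], to within `tol`.

    Returns None if even `lo` fails, and `hi` if the whole range passes.
    """
    if not is_sustainable(lo):
        return None
    if is_sustainable(hi):
        return hi
    # Invariant: lo is sustainable, hi is not.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_sustainable(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Because each probe is a full run, the number of runs per policy grows with `log2((hi - lo) / tol)`, so a coarse `tol` keeps the sweep affordable.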

### Data Artifacts

- `srr_curve.json`
- `lambda_runs/<lambda>/summary.json`
- `slo_violation_reason.json`
- `goodput_vs_arrival_rate.json`
- `stability_summary.json`

### Figures

- SRR bar chart
- TTFT p90 vs arrival rate
- E2E p90 vs arrival rate
- TPOT p90 vs arrival rate
- goodput vs arrival rate
- error rate vs arrival rate
- queue length over time near failure point
- KV occupancy over time near failure point

### Audit Checks

The `audit.md` must answer:

1. Are session arrivals open-loop and Poisson?
2. Is session-internal sequentiality enforced?
3. How long are warmup and steady-state windows?
4. Is SRR failure persistent rather than transient?
5. Are attempted/completed/error counts reported at every lambda?
6. Are policies compared on the same trace and same arrival process?
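
For questions 1 and 6, one way to get an open-loop Poisson process that is identical across policies is to draw session start times from a seeded generator. This is a sketch with a hypothetical function name. It schedules only session starts; the turns inside each session must still be sent one at a time.

```python
import random


def poisson_session_starts(rate_per_s, n_sessions, seed=0):
    """Session start times (seconds) with exponential inter-arrival gaps."""
    rng = random.Random(seed)  # fixed seed: same arrivals for every policy
    t = 0.0
    starts = []
    for _ in range(n_sessions):
        t += rng.expovariate(rate_per_s)
        starts.append(t)
    return starts
```

Recording the seed in the run manifest lets a reviewer confirm that two policies saw the same arrival process.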

### Pass Criteria

- Each policy must have a measured SRR under the same SLO.
- Failure must be attributed to persistent SLO violation, queue growth, KV
  growth, or error growth.
- Data must be session-causal.

## 7. Batch 5: Failure Attribution Near SRR Boundary

Status: protocol DONE; execution NOT STARTED — depends on B2 instrumentation and B4 SRR boundary.

### Goal

At and around the PD-colo/LMetric failure point, determine whether SLO
violations are caused by prefill-decode interference, session hot spots, KV
pressure, cache misses, or other mechanisms.

### TODO

1. Select three arrival rates:

   ```text
   lambda = 0.9 * SRR
   lambda = 1.0 * SRR
   lambda = 1.1 * SRR
   ```

2. For every slow or SLO-violating request, assign labels:
   - same-worker prefill overlap;
   - hot worker queue;
   - high KV occupancy;
   - cache miss / large uncached append;
   - transfer wait;
   - P queue wait;
   - D admission wait;
   - unknown.
3. Produce per-request waterfall for representative slow requests.
4. Produce per-worker timeline around failure windows.
5. Summarize cause distribution.
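
Steps 2 and 5 can be sketched as below. The record fields and the three thresholds are illustrative assumptions, not measured values. Only four of the labels from step 2 are covered; the transfer, P queue, and D admission labels would follow the same pattern.

```python
from collections import Counter


def attribute_request(req, kv_high=0.9, hot_queue_s=1.0, large_append=8192):
    """Return every label that applies to one slow request (multi-label)."""
    labels = []
    if req.get("overlapped_same_worker_prefill"):
        labels.append("same_worker_prefill_overlap")
    if req.get("worker_queue_delay_s", 0.0) >= hot_queue_s:
        labels.append("hot_worker_queue")
    if req.get("kv_occupancy", 0.0) >= kv_high:
        labels.append("high_kv_occupancy")
    if req.get("uncached_tokens", 0) >= large_append:
        labels.append("large_uncached_append")
    return labels or ["unknown"]


def failure_breakdown(slow_requests):
    """Count how often each label appears across the slow-request set."""
    counts = Counter()
    for req in slow_requests:
        counts.update(attribute_request(req))
    return dict(counts)
```

Because a request can carry several labels, the counts can sum to more than the number of slow requests. The stacked-bar figure should state whether it plots labels or requests.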

### Data Artifacts

- `slow_request_attribution.jsonl`
- `failure_breakdown.json`
- `case_studies.md`
- `worker_failure_windows.json`

### Figures

- SLO violation cause stacked bar
- slow request waterfall
- worker timeline near failure
- prefill/decode/KV/queue stacked breakdown
- failure cause vs arrival rate

### Audit Checks

The `audit.md` must answer:

1. What fraction of slow requests overlap same-worker prefill?
2. What fraction are on hot workers?
3. What fraction happen under high KV occupancy?
4. What fraction are large uncached append requests?
5. For PD-disagg/Unified migration, how much time is spent in transfer, P queue wait, and D admission wait?
6. What remains unexplained?

### Pass Criteria

The batch must answer:

1. Why PD-colo/LMetric hits its SRR limit.
2. Why static PD-disagg hits its SRR limit.
3. If Unified/PUSH underperforms, whether the cause is trigger quality, cost
   model, transfer overhead, wrong load regime, or something else.

## 8. Batch 6: Audit Package

Status: scaffold DONE — all five final artifacts exist under `analysis/characterization/current_results/` and are regenerated by `summarize_runs.py` + `plot_current_results.py`. Future B2–B5 outputs must be merged into the same package by re-running `summarize_runs.py` after new runs.

### Goal

Make the whole characterization package reviewable by a strict systems
reviewer.

### TODO

1. Write a claim matrix:

   ```text
   claim -> data artifact -> figure -> script -> caveat -> reviewer risk
   ```

2. Write a figure index:
   - figure filename;
   - source data;
   - generation command;
   - intended claim.
3. Write a reviewer risk register:
   - loadgen validity risks;
   - trace representativeness risks;
   - metric bias risks;
   - implementation-specific risks;
   - generalization risks.
4. Write a reproduction script or command list.
5. Mark experiments that cannot support main claims.

### Final Artifacts

- `characterization_claim_matrix.md`
- `all_figures_index.md`
- `reviewer_risk_register.md`
- `reproduction_commands.sh`
- `main_claim_allowed_runs.md`

### Audit Checks

The final package must satisfy:

1. Every claim links to raw data.
2. Every figure can be regenerated.
3. Every experiment has a manifest.
4. Every caveat is explicit.
5. Invalid or stress-only runs are not used for online-serving claims.
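
Check 1 can be partly automated. The sketch below assumes the claim matrix has been parsed into a list of dicts with `claim` and `artifacts` keys. That structure is hypothetical, since the matrix itself is a Markdown file.

```python
import os


def unbacked_claims(claims, exists=os.path.exists):
    """Return claims that list no artifacts or point at a missing file."""
    missing = []
    for row in claims:
        paths = row.get("artifacts") or []
        if not paths or not all(exists(path) for path in paths):
            missing.append(row["claim"])
    return missing
```

The `exists` parameter is injectable so the check can be unit-tested without touching the filesystem.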

## 9. Priority Order

### Priority 1

Do these first:

1. Batch 0: Benchmark Substrate Audit
2. Batch 1: Workload Characterization
3. Batch 3: Session Hot-Spot Residual Imbalance Proof

Reason:

These define whether the trace and routing problem are real. Without them,
SRR sweeps and system experiments are not trustworthy.

### Priority 2

Do these after the substrate and workload facts are stable:

1. Batch 2: PD-Colo Prefill-Decode Interference Proof
2. Batch 5: Failure Attribution Near SRR Boundary

Reason:

These explain the mechanisms behind SLO/SRR failure and determine what the
positive system should actually fix.

### Priority 3

Do these after instrumentation and attribution are ready:

1. Batch 4: Sustainable Request Rate Sweep
2. Batch 6: Audit Package

Reason:

SRR sweeps are expensive. They should run only after trace validity,
logging, and attribution labels are ready.

## 10. Non-Negotiable Reviewer Rules

1. Do not use session-nonsequential loadgen for online-serving claims.
2. Do not compare latency percentiles without attempted/completed/error counts.
3. Do not use APC alone as a success metric.
4. Do not use average GPU utilization as proof of load balance.
5. Do not compare policies on different traces unless explicitly labeled.
6. Do not hide failed requests or timeouts.
7. Do not claim Unified/PUSH is the answer before failure attribution proves
   the relevant bottleneck and cost budget.
8. Treat corrected LMetric/cache-aware PD-colo as the main baseline.
9. Treat static PD-disagg as an important baseline, not a strawman.
10. Every result must be reproducible from raw artifacts and commands.