Window 1 results: combined B1' + B2 + B3 report and artifacts

analysis/characterization/window_1_results.md is the headline write-up
for Window 1: workload characterization (KV per request, real reuse
decomposition, APC theoretical ceilings), B3 5-policy sweep with
per-policy interpretation, B2 same-vs-different-worker interference
microbench with causal reading, and an explicit list of what Window 1
does *not* answer (deferred to B4 SRR sweep + B5 attribution).

Under window_1_results/:
- 5 raw result JSONs from the B3 sweep, the B2 microbench, the APC
  upper bound, and the KV footprint
- per-policy hotspot_index.json snapshots so render_window1_figures.py
  can plot per-worker TTFT p90 distributions
- 9 PNG figures (figures/) covering the headline claims

Three takeaways the figures pin down:
1) intra-session reuse dominates (93.2%), so session-affinity routing
   is the right primary lever
2) unified hybrid affinity hits 79.4% APC (99.8% of the 79.6%
   intra-session ceiling) AND cuts TTFT p90 from lmetric's 15.6s to
   7.24s
3) B2 different-worker control sits at idx ≈ 1.0 across 32× prefill-
   size variation; same-worker TTFT idx scales 2.15× -> 218×, which
   is the cleanest causal evidence for same-worker prefill-decode
   interference

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 23:25:09 +08:00
parent b7902061d1
commit 0c3220cbb8
23 changed files with 951 additions and 0 deletions

@@ -0,0 +1,171 @@
# Window 1 Results: B1' + B2 + B3
Status: Window 1 complete (CPU + 2 dash0 GPU windows on 2026-05-25)
Sweep: `outputs/b3_sweep_20260525_095043` (B3) + `outputs/b2_microbench/` (B2)
Trace: `traces/w600_r0.0015_st30.jsonl` (1214 requests / 274 sessions / 53.3 M input tokens)
Model: Qwen3-Coder-30B-A3B-Instruct (TP1 × 8 instances on H20)
Per-policy artifacts under `window_1_results/`. Figures under `window_1_results/figures/`.
## Headline
| Claim | Status | Evidence |
|---|---|---|
| Agentic workload reuse is overwhelmingly intra-session | **supported** | 93.2% of cached_tokens are intra-session (real); theoretical any-session APC ceiling 80.3% vs intra-session ceiling 79.6% → < 1pp gap |
| LMetric leaves 23 pp of APC on the table | **supported** | lmetric achieved 56.9% vs intra-session ceiling 79.6% (theoretical) |
| Hard session affinity recovers the locality lost by LMetric | **supported** | sticky APC 77.2% = 97% of theoretical ceiling |
| Hard affinity inflates same-worker prefill-decode interference | **supported** | sticky interference_index 13.65 vs lmetric 6.53 |
| Hybrid affinity (Unified) breaks the locality-vs-latency tradeoff | **supported** | unified hits 79.4% APC and TTFT p90 7.24 s (lmetric 15.6 s) simultaneously |
| Same-worker prefill-decode interference is causal, not correlation | **supported** | different-worker control idx ≈ 1.0; same-worker TTFT idx scales monotonically with prefill size (2.15× → 218×) |
| Heavy-tail sessions are *a* contributor to hot-spot, not the sole cause | **supported** | cap=8 truncated trace cuts 37% of requests; hotspot drops only 13% (2.24 → 1.94) |
## B1' Workload characterization
### Per-request KV footprint (Qwen3-Coder-30B-A3B)
`kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes = 2 × 48 × 4 × 128 × 2 = 98304 B`
Full GLM-5.1 trace (2.11 M requests, 1.31 M sessions):
| | p50 | p90 | p95 | p99 | max |
|---|---:|---:|---:|---:|---:|
| KV per request | 1.83 GiB | 8.04 GiB | 9.59 GiB | **11.49 GiB** | 18.5 GiB |
H20 has ~95 GiB usable per GPU. **A single p99 request occupies 12% of a single H20's HBM** purely for KV. Multi-request batching is bounded by this.
Figure: `figures/fig_kv_footprint_cdf.png`.
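
The footprint arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the function names are illustrative, the 95 GiB figure is the approximate usable HBM quoted above, and the 125,527-token request size is back-derived from the p99 byte count in the KV-footprint JSON rather than read from the trace.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """KV bytes per token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_gib_per_request(input_tokens: int, bytes_per_token: int) -> float:
    """KV footprint of one request in GiB."""
    return input_tokens * bytes_per_token / GIB


# Model shape taken from the formula above (48 layers, 4 KV heads, dim 128, 2-byte dtype).
per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=4,
                               head_dim=128, dtype_bytes=2)

# Assumption: 125,527 input tokens corresponds to the reported p99 request.
p99_gib = kv_gib_per_request(125_527, per_token)

# Assumption: ~95 GiB usable HBM per H20, as quoted in the text.
hbm_fraction = p99_gib / 95.0

print(per_token, round(p99_gib, 2), round(hbm_fraction, 3))
```

The same two functions can be reused for other models by swapping in their layer count, KV-head count, and head dimension.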
### Real reuse decomposition (from lmetric run on w600 trace)
| class | tokens | fraction |
|---|---:|---:|
| intra-session | 28.3 M | **93.2%** |
| cross-session | 1.72 M | 5.7% |
| shared / system-prefix | 0.34 M | 1.1% |
| unclassified | 0 | 0.0% |
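Session-affinity routing targets the 93.2% intra-session share directly, and by the theoretical ceilings intra-session reuse alone reaches 99% of the any-session bound (79.6% vs 80.3%). Shared-prefix reuse is only 1.1%, so there is no meaningful "system prompt" in this trace.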
Figure: `figures/fig_reuse_decomposition.png`.
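
The fractions in the table can be recomputed from the per-class token counts. The sketch below assumes the denominator is the sum of the classified tokens; that choice reproduces the reported fractions, whereas dividing by `total_cached_tokens` from the reuse-decomposition JSON gives slightly smaller values. The function name is illustrative, not the analyzer's.

```python
def reuse_fractions(token_counts: dict[str, int]) -> dict[str, float]:
    """Share of reused (cached) tokens per reuse class."""
    total = sum(token_counts.values())
    if total == 0:
        return {name: 0.0 for name in token_counts}
    return {name: count / total for name, count in token_counts.items()}


# Token counts as reported for the lmetric run on the w600 trace.
counts = {
    "intra": 28_278_380,
    "cross": 1_723_017,
    "shared": 336_180,
    "unclassified": 0,
}

fractions = reuse_fractions(counts)
for name, value in fractions.items():
    print(f"{name:>12}: {value:.1%}")
```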
### Theoretical APC ceilings on w600
Computed by building a block-level trie of `hash_ids` per session (intra-session) or globally (any-session), then walking each request's `hash_ids` to count its longest prefix-match against previously-seen prefixes.
| variant | upper bound | hit requests |
|---|---:|---:|
| any-session (perfect global cache) | **80.3%** | 961 / 1214 |
| intra-session only | **79.6%** | 914 / 1214 |
| shared-prefix only (pos 0, ≥8 sessions) | 0.10% | 107 / 1214 |
Gap "any − intra" is 0.7 pp → no meaningful cross-session sharing in this trace.
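
The ceiling computation described above can be sketched as follows. This is not the script that produced the numbers: the request field names `session_id` and `input_tokens` are assumptions (only `hash_ids` appears in the text), requests are assumed to be processed in arrival order, and each matched block is assumed to be worth `block_size` tokens, capped at the request length.

```python
from collections import defaultdict


class BlockTrie:
    """Trie over block hash ids; each path from the root is a seen prefix."""

    def __init__(self):
        self.root = {}

    def longest_prefix(self, hash_ids):
        """Number of leading blocks that match a previously inserted prefix."""
        node, matched = self.root, 0
        for h in hash_ids:
            node = node.get(h)
            if node is None:
                break
            matched += 1
        return matched

    def insert(self, hash_ids):
        node = self.root
        for h in hash_ids:
            node = node.setdefault(h, {})


def apc_upper_bound(requests, block_size=512, per_session=True):
    """Return (cached_token_ratio, n_hit_requests) for a perfect cache."""
    tries = defaultdict(BlockTrie)
    cached = total = hits = 0
    for req in requests:
        # One trie per session (intra-session) or one shared trie (any-session).
        key = req["session_id"] if per_session else None
        matched = tries[key].longest_prefix(req["hash_ids"])
        cached += min(matched * block_size, req["input_tokens"])
        total += req["input_tokens"]
        if matched:
            hits += 1
        tries[key].insert(req["hash_ids"])
    return (cached / total if total else 0.0), hits


# Toy trace: session s2 shares its first block with s1.
requests = [
    {"session_id": "s1", "hash_ids": [1, 2], "input_tokens": 1000},
    {"session_id": "s1", "hash_ids": [1, 2, 3], "input_tokens": 1500},
    {"session_id": "s2", "hash_ids": [1, 9], "input_tokens": 1000},
]
intra = apc_upper_bound(requests, per_session=True)
any_session = apc_upper_bound(requests, per_session=False)
print(intra, any_session)
```

On the toy trace the any-session bound exceeds the intra-session bound only by the one block that `s2` shares with `s1`, which mirrors the small gap reported for w600.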
## B3 5-policy routing sweep
8 vLLM instances on TP1, w600 trace, `--enable-prompt-tokens-details` so `cached_tokens` is reported per request.
| policy | TTFT p50/p90/p99 (s) | TPOT p50/p90/p99 (ms) | E2E p50/p90/p99 (s) | **APC** | interference | **hotspot** | n_slow |
|---|---|---|---|---:|---:|---:|---:|
| lmetric | 0.94 / 15.59 / 52.95 | 8.9 / 21.2 / 175.9 | 2.75 / 24.75 / 79.62 | 56.9% | 6.53 | 2.24 | 295 |
| load_only | 1.25 / 20.15 / 52.65 | 9.2 / 26.7 / 320.7 | 3.58 / 33.43 / 93.92 | 54.1% | 9.16 | **1.14** | 379 |
| sticky | 0.54 / 18.02 / 71.37 | 8.9 / 36.1 / 345.2 | 2.08 / 34.61 / 133.58 | 77.2% | **13.65** | 2.35 | 234 |
| **unified** | **0.50 / 7.24 / 42.02** | 8.1 / 17.1 / 118.1 | **1.75 / 17.89 / 68.18** | **79.4%** | n/a* | 3.35 | **189** |
| capped | 1.20 / 12.76 / 46.05 | 7.2 / 16.0 / 101.5 | 2.59 / 21.24 / 73.39 | 31.6% | 6.33 | 1.94 | 185 |
\*unified `engine_state` was overwritten by my analyzer's slice step before the `b3_analyze.sh` fix landed; vLLM and the patch worked correctly. The B2 microbench provides a cleaner interference proof.
**Mechanism indices**
- `interference_index` = TPOT_p90(decode overlapping same-worker prefill) / TPOT_p90(clean)
- `hotspot_index` = max(worker TTFT p90) / median(worker TTFT p90)
Figures: `fig_b3_latency_bars.png`, `fig_b3_apc_vs_upper.png`,
`fig_b3_apc_vs_hotspot.png`, `fig_b3_per_worker_ttft_p90.png`,
`fig_b3_failure_breakdown.png`.
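
A small sketch of the `hotspot_index` computation, using the per-worker TTFT p90 values from the lmetric hotspot JSON. With eight workers the choice of median matters: the reported value (2.238) is reproduced when the median is taken as the upper of the two middle workers (`statistics.median_high`), while an interpolating median would give about 2.25. That the analyzer uses the upper median is an inference from the published numbers, not something this write-up states.

```python
import statistics


def hotspot_index(per_worker_ttft_p90: dict[str, float]) -> float:
    """max(worker TTFT p90) / median(worker TTFT p90).

    Assumption: the median is the upper median (median_high), which is
    what reproduces the reported indices for an even worker count.
    """
    values = list(per_worker_ttft_p90.values())
    return max(values) / statistics.median_high(values)


# Per-worker TTFT p90 (seconds) for the lmetric run.
lmetric_ttft_p90 = {
    "http://127.0.0.1:8000": 28.18261351052206,
    "http://127.0.0.1:8001": 13.147308969072796,
    "http://127.0.0.1:8002": 13.818959677941162,
    "http://127.0.0.1:8003": 14.003642184572524,
    "http://127.0.0.1:8004": 31.339895512629305,
    "http://127.0.0.1:8005": 7.870992770011071,
    "http://127.0.0.1:8006": 14.149156623415186,
    "http://127.0.0.1:8007": 11.777357225219024,
}

lmetric_hotspot = hotspot_index(lmetric_ttft_p90)
print(round(lmetric_hotspot, 2))
```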
### Per-policy reading
- **lmetric** is the cache-aware baseline. Its 56.9% APC reaches only 72% of the intra-session ceiling; the missing 23 pp is the locality opportunity that unified picks up.
- **load_only** strips cache awareness. Hot-spot drops to 1.14 (best), but APC only drops 3 pp because the picker's `min(num_requests)` tie-break to instance 0 creates accidental stickiness at low concurrency.
- **sticky** locks each session to one worker. APC climbs to 77.2% (97% of ceiling) but interference doubles to 13.65 and TPOT p99 hits 345 ms.
- **unified** is the hybrid: the affinity gate `(cache_ratio>0.5 AND num_req ≤ 2×avg)` keeps locality where it pays and drops it where it would hurt. The result is APC 79.4% **and** a TTFT p90 less than half of lmetric's (7.24 s vs 15.59 s). The one bad worker (engine_4 at 37.7 s p90) drives `hotspot_index=3.35`, but the other seven workers are all under 19 s (the next highest is 18.3 s).
- **capped** runs lmetric on a turn-capped trace (max 8 turns/session). Removes 37% of requests but APC also crashes to 31.6% and hotspot only improves by 13%. This is the session-mass ablation: heavy sessions are *a* contributor to hot-spot but not the sole cause.
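
The unified affinity gate quoted above can be sketched as a predicate plus a fallback. Everything beyond the gate expression itself is an assumption made for illustration: `cache_ratio` is read as the fraction of the prompt already cached on the session's previous worker, `num_req` as that worker's in-flight request count, `avg` as the mean over all workers, and `fallback_worker` stands in for whatever the LMetric picker would choose.

```python
from typing import Sequence


def affinity_gate(cache_ratio: float, num_req: int, avg_num_req: float) -> bool:
    """Keep session affinity only if the cache pays off and the worker is not overloaded."""
    return cache_ratio > 0.5 and num_req <= 2 * avg_num_req


def pick_worker(affine_worker: int, cache_ratio: float,
                queue_depths: Sequence[int], fallback_worker: int) -> int:
    """Hypothetical router step: stay on the affine worker or fall back."""
    avg = sum(queue_depths) / len(queue_depths)
    if affinity_gate(cache_ratio, queue_depths[affine_worker], avg):
        return affine_worker
    return fallback_worker


# Worker 0 holds the session's cache and is only mildly busier than average.
print(pick_worker(0, 0.8, [3, 2, 1, 2], fallback_worker=2))
# Worker 0 is far above 2x the average load, so affinity is dropped.
print(pick_worker(0, 0.8, [9, 1, 1, 1], fallback_worker=2))
```

The second clause of the gate is what separates this policy from sticky: a hot worker loses its sessions' follow-up turns instead of accumulating them.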
### Slow-request cause breakdown (from `joined_analysis.label_slow_requests`)
| policy | n_slow | same-worker overlap | hot worker queue | cache miss large append | unknown |
|---|---:|---:|---:|---:|---:|
| lmetric | 295 | 69 (23%) | 68 (23%) | 94 (32%) | 64 (22%) |
| load_only | 379 | 108 (28%) | 33 (9%) | 151 (40%) | 87 (23%) |
| sticky | 234 | **134 (57%)** | 51 (22%) | **20 (9%)** | 29 (12%) |
| unified | 189 | 0 (no engine_state) | 116 (61%) | 18 (10%) | 55 (29%) |
| capped | 185 | 45 (24%) | 66 (36%) | 60 (32%) | 14 (8%) |
PD-colo failures are mixed-mechanism: lmetric has no single dominant cause.
Sticky concentrates failures into same-worker overlap (locality is on, cache misses are gone, but interference takes over).
## B2 PD-colo interference microbench
Setup: 2 vLLM instances on GPU 0 (decode endpoint) and GPU 1 (prefill endpoint). A continuous 4 req/s short-prompt decode load runs against GPU 0 for 60 s per cell. 4 large-prompt one-token "prefill injections" fire every 12 s, targeted at either the same instance (`same`) or the paired one (`different`). Decode requests are labeled overlap iff their `[t_first_token, t_finish]` intersects any injection window. We compare TPOT p90 (overlap vs clean) per cell.
| variant | prefill | n_overlap | n_clean | **TPOT idx** | **TTFT idx** |
|---|---:|---:|---:|---:|---:|
| different | 2k–65k | 12–126 | 114–228 | **0.92–1.02** | **0.96–1.00** |
| same | 2k | 12 | 228 | 1.16 | 2.15× |
| same | 8k | 19 | 221 | 1.90 | **12.1×** |
| same | 16k | 37 | 203 | 3.37 | **30.8×** |
| same | 32k | 67 | 173 | **7.89** | **94.6×** |
| same | 65k | 130 | 110 | 2.26* | **218×** |
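\*The 65k TPOT idx is suppressed because n_overlap > n_clean (130 vs 110): by the time the 65k prefill is finishing, the 4-second gap to the next injection is already filled with overlapping decodes. The "clean" decodes that remain are the ones that happened to land in the brief gaps between injections, so the clean baseline is itself inflated (clean TPOT p90 is 9.6 ms here vs about 7 ms in the other same-worker cells).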
Figures: `fig_b2_tpot_vs_prefill.png`, `fig_b2_ttft_vs_prefill.png`.
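
The overlap labeling and the per-cell index can be sketched as below. This is an illustration, not the code in `b2_interference.py`: the injection window is assumed to be a closed `(start, end)` interval per injection, and the percentile is assumed to use linear interpolation. The last lines recompute the 32k same-worker cell from the p90 values reported in the B2 JSON.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 1]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty sample")
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def interference_index(decodes, injection_windows, q=0.9):
    """TPOT_q(overlap) / TPOT_q(clean) for (t_first_token, t_finish, tpot) records."""
    overlap, clean = [], []
    for t_first, t_finish, tpot in decodes:
        hit = any(intervals_overlap(t_first, t_finish, start, end)
                  for start, end in injection_windows)
        (overlap if hit else clean).append(tpot)
    return percentile(overlap, q) / percentile(clean, q)


# Toy cell: one injection window, two overlapping and two clean decodes.
decodes = [
    (9.0, 9.5, 0.006),    # finishes before the injection: clean
    (11.0, 11.5, 0.012),  # inside the window: overlap
    (11.9, 12.5, 0.018),  # straddles the window end: overlap
    (12.1, 13.0, 0.006),  # starts after the window: clean
]
toy_index = interference_index(decodes, [(10.0, 12.0)])

# Reported p90 values for the same-worker 32k cell (seconds).
reported_32k = 0.054765997029314145 / 0.00694006813897027
print(round(toy_index, 2), round(reported_32k, 2))
```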
**Why this matters**
- The `different-worker` control sits at idx ≈ 1.0 across 32× variation in prefill size. This is the cleanest possible disproof of "any prefill anywhere hurts decode": prefill on a *different* worker is invisible to the decode worker.
- The `same-worker` curve is monotone in prefill size for TTFT (218× at 65k) and monotone-up-to-32k for TPOT (7.89×). The two ablations together establish causation: prefill-decode interference is a same-worker phenomenon and scales sharply with prefill mass.
- This is the mechanism behind the B3 sticky interference jump (13.65) and unified's single hot worker (engine_4 at 37.7 s TTFT p90).
## What Window 1 does *not* answer
These need Window 2 (B4 SRR sweep + B5 failure attribution near SRR boundary):
1. **Sustainable arrival rate (SRR) per policy under SLO**. B3 was driven by trace timestamps with strict session sequentiality; when 8 instances cannot keep up, requests pile up and the *effective* dispatch window stretches (lmetric: trace claims 600 s, actual replay 49 min). We measured *saturated* behavior but not the saturation point. B4 needs the A4 open-loop Poisson loadgen with per-class SLO thresholds.
2. **Failure breakdown at the SRR boundary**. B5 will rerun each policy at 0.9× / 1.0× / 1.1× of its SRR_max and label each SLO-violating request — gives the paper its causal failure-attribution table.
Optional / paper-polish runs (not blocking the story):
3. unified isolated rerun to capture `interference_index` (B2 already provides cleaner causal proof; skip unless reviewer asks).
4. B2 with the proxy in path — measure whether the production cache_aware routing actually pushes prefill and decode onto different workers in practice.
5. KV-occupancy timeline per worker — needs polling `vllm:gpu_cache_usage` during B3 reruns; useful for "KV pressure drives cache miss" subsection.
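
Item 1 relies on an open-loop Poisson load generator. The A4 loadgen itself is not part of this commit; the sketch below only illustrates what "open-loop" means here: arrival times are drawn up front from exponential inter-arrival gaps and do not wait for earlier requests to finish. The function name, rate, and seed are illustrative assumptions.

```python
import random


def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int = 0) -> list[float]:
    """Arrival timestamps of a Poisson process with the given rate.

    Open-loop: the schedule is fixed in advance and is independent of
    how quickly the serving side completes requests.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential inter-arrival gap
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


# Hypothetical operating point: 2 req/s over a 600 s window.
arrivals = poisson_arrivals(rate_per_s=2.0, duration_s=600.0, seed=1)
print(len(arrivals), round(arrivals[0], 3))
```

Sweeping `rate_per_s` while checking the SLO at each step is one way to locate the saturation point that the trace-timestamp replay in B3 could not expose.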
## Caveats and known data hygiene issues
- **APC contamination across B3 hot-sweep**: `lmetric` ran from cold; `load_only` and `sticky` ran on the same 8 vLLMs without restart. Empirical contamination is < 1% (verified by first-turn cached_tokens distribution), but `unified` and `capped` were rerun cold-start specifically to remove any residual concern.
- **Unified's `interference_index` is missing** because the original `b3_analyze.sh` unconditionally truncate-wrote sliced engine_state files; isolated runs that wrote engine_state into their own per-policy directory were overwritten. Fixed in commit `df32499`; capped was the first run to benefit and survived with intact 86 MB engine_state.
- **w600 is not the full GLM-5.1 trace** (1214 req vs 2.11 M). All B3/B2 percentiles are computed on the sample; only the KV-footprint statistics are computed on the full trace.
## Reproduction commands
```bash
# B3 5-policy sweep
bash scripts/b3_sweep.sh # lmetric, load_only, sticky (hot-cache)
bash scripts/b3_isolated_policy.sh unified <trace> <dir> # isolated cold-start
bash scripts/b3_isolated_policy.sh lmetric <capped> <dir> # capped variant
bash scripts/b3_analyze.sh outputs/b3_sweep_<TS>
python3 scripts/render_b3_report.py --sweep-dir outputs/b3_sweep_<TS>
# B2 interference microbench
# (launch 2 vLLM on ports 8100/8101 with --enable-prompt-tokens-details first)
python3 scripts/b2_interference.py \
--decode-endpoint http://127.0.0.1:8100 \
--prefill-endpoint http://127.0.0.1:8101 \
--model <model> \
--out-dir outputs/b2_microbench/sweep
python3 analysis/characterization/b2_sweep_analysis.py --sweep-dir outputs/b2_microbench/sweep
# Figures
python3 analysis/characterization/render_window1_figures.py \
--results-dir analysis/characterization/window_1_results \
--out-dir analysis/characterization/window_1_results/figures
```

@@ -0,0 +1,18 @@
{
"trace": "/home/admin/cpfs/wjh/agentic-kv/traces/w600_r0.0015_st30.jsonl",
"n_requests": 1214,
"n_sessions": 274,
"block_size": 512,
"shared_prefix_min_sessions": 8,
"total_input_tokens": 53335690,
"apc_upper_any_session": 0.8030439654947747,
"apc_upper_intra_session": 0.7956783534627564,
"apc_upper_shared_prefix_only": 0.0010271546126055554,
"cached_tokens_any_session": 42830904,
"cached_tokens_intra_session": 42438054,
"cached_tokens_shared_prefix_only": 54784,
"n_requests_any_hit": 961,
"n_requests_intra_hit": 914,
"n_requests_shared_hit": 107,
"n_shared_pos0_blocks": 1
}

@@ -0,0 +1,194 @@
{
"rows": [
{
"decode_endpoint": "http://127.0.0.1:8100",
"interference_index": 0.9868436853823819,
"n_decode_clean": 207,
"n_decode_overlap": 33,
"n_decode_total": 240,
"n_prefill_injections": 4,
"prefill_endpoint": "http://127.0.0.1:8101",
"prefill_size": 16384,
"tpot_p50_clean_s": 0.0061757058808297825,
"tpot_p50_overlap_s": 0.006127697048765241,
"tpot_p90_clean_s": 0.006862485770023231,
"tpot_p90_overlap_s": 0.006772200748173878,
"tpot_p99_clean_s": 0.007128368820806946,
"tpot_p99_overlap_s": 0.0070623818792478,
"ttft_p90_clean_s": 0.043039703369140626,
"ttft_p90_overlap_s": 0.04307723045349121,
"variant": "different"
},
{
"decode_endpoint": "http://127.0.0.1:8100",
"interference_index": 1.0176125863449343,
"n_decode_clean": 228,
"n_decode_overlap": 12,
"n_decode_total": 240,
"n_prefill_injections": 4,
"prefill_endpoint": "http://127.0.0.1:8101",
"prefill_size": 2048,
"tpot_p50_clean_s": 0.0062349300191860005,
"tpot_p50_overlap_s": 0.006218204594621754,
"tpot_p90_clean_s": 0.006892242576136734,
"tpot_p90_overlap_s": 0.007013632793619174,
"tpot_p99_clean_s": 0.007111345902837888,
"tpot_p99_overlap_s": 0.007131954732567373,
"ttft_p90_clean_s": 0.04290406703948975,
"ttft_p90_overlap_s": 0.040976309776306154,
"variant": "different"
},
{
"decode_endpoint": "http://127.0.0.1:8100",
"interference_index": 0.9221676118155049,
"n_decode_clean": 176,
"n_decode_overlap": 64,
"n_decode_total": 240,
"n_prefill_injections": 4,
"prefill_endpoint": "http://127.0.0.1:8101",
"prefill_size": 32768,
"tpot_p50_clean_s": 0.00620933012528853,
"tpot_p50_overlap_s": 0.005991364970351711,
"tpot_p90_clean_s": 0.0069098352181791054,
"tpot_p90_overlap_s": 0.006372026241186894,
"tpot_p99_clean_s": 0.007242970394365715,
"tpot_p99_overlap_s": 0.006935877366499467,
"ttft_p90_clean_s": 0.04308474063873291,
"ttft_p90_overlap_s": 0.04266033172607422,
"variant": "different"
},
{
"decode_endpoint": "http://127.0.0.1:8100",
"interference_index": 1.0162810692345416,
"n_decode_clean": 114,
"n_decode_overlap": 126,
"n_decode_total": 240,
"n_prefill_injections": 4,
"prefill_endpoint": "http://127.0.0.1:8101",
"prefill_size": 65536,
"tpot_p50_clean_s": 0.006080349286397299,
"tpot_p50_overlap_s": 0.006312949488861392,
"tpot_p90_clean_s": 0.0068880830148253785,
"tpot_p90_overlap_s": 0.007000228371283021,
"tpot_p99_clean_s": 0.007222196574162956,
"tpot_p99_overlap_s": 0.00723441562267265,
"ttft_p90_clean_s": 0.04367616176605225,
"ttft_p90_overlap_s": 0.04332089424133301,
"variant": "different"
},
{
"decode_endpoint": "http://127.0.0.1:8100",
"interference_index": 0.92169565663476,
"n_decode_clean": 220,
"n_decode_overlap": 20,
"n_decode_total": 240,
"n_prefill_injections": 4,
"prefill_endpoint": "http://127.0.0.1:8101",
"prefill_size": 8192,
"tpot_p50_clean_s": 0.006260122915711066,
"tpot_p50_overlap_s": 0.006120474651606396,
"tpot_p90_clean_s": 0.006968991684191154,
"tpot_p90_overlap_s": 0.006423289366442748,
"tpot_p99_clean_s": 0.007601349209294174,
"tpot_p99_overlap_s": 0.006715166592838788,
"ttft_p90_clean_s": 0.04314079284667969,
"ttft_p90_overlap_s": 0.042817187309265134,
"variant": "different"
},
{
"decode_endpoint": "http://127.0.0.1:8100",
"interference_index": 3.3716068170318985,
"n_decode_clean": 203,
"n_decode_overlap": 37,
"n_decode_total": 240,
"n_prefill_injections": 4,
"prefill_endpoint": "http://127.0.0.1:8100",
"prefill_size": 16384,
"tpot_p50_clean_s": 0.006435276281954062,
"tpot_p50_overlap_s": 0.009116151116111061,
"tpot_p90_clean_s": 0.0071605749804564195,
"tpot_p90_overlap_s": 0.024142643417974917,
"tpot_p99_clean_s": 0.008356584539317119,
"tpot_p99_overlap_s": 0.024809808827409838,
"ttft_p90_clean_s": 0.04402604103088379,
"ttft_p90_overlap_s": 1.3574100017547606,
"variant": "same"
},
{
"decode_endpoint": "http://127.0.0.1:8100",
"interference_index": 1.1589170446597312,
"n_decode_clean": 228,
"n_decode_overlap": 12,
"n_decode_total": 240,
"n_prefill_injections": 4,
"prefill_endpoint": "http://127.0.0.1:8100",
"prefill_size": 2048,
"tpot_p50_clean_s": 0.006142637946388938,
"tpot_p50_overlap_s": 0.007610858088791972,
"tpot_p90_clean_s": 0.006933137142296993,
"tpot_p90_overlap_s": 0.008034930807171445,
"tpot_p99_clean_s": 0.007201877651792584,
"tpot_p99_overlap_s": 0.0084272463153107,
"ttft_p90_clean_s": 0.043091440200805665,
"ttft_p90_overlap_s": 0.09247522354125978,
"variant": "same"
},
{
"decode_endpoint": "http://127.0.0.1:8100",
"interference_index": 7.891276559921504,
"n_decode_clean": 173,
"n_decode_overlap": 67,
"n_decode_total": 240,
"n_prefill_injections": 4,
"prefill_endpoint": "http://127.0.0.1:8100",
"prefill_size": 32768,
"tpot_p50_clean_s": 0.006226602226796776,
"tpot_p50_overlap_s": 0.012180752224392362,
"tpot_p90_clean_s": 0.00694006813897027,
"tpot_p90_overlap_s": 0.054765997029314145,
"tpot_p99_clean_s": 0.010443444107518053,
"tpot_p99_overlap_s": 0.058983875428787386,
"ttft_p90_clean_s": 0.04411859512329101,
"ttft_p90_overlap_s": 4.174754428863525,
"variant": "same"
},
{
"decode_endpoint": "http://127.0.0.1:8100",
"interference_index": 2.259323176730457,
"n_decode_clean": 110,
"n_decode_overlap": 130,
"n_decode_total": 240,
"n_prefill_injections": 4,
"prefill_endpoint": "http://127.0.0.1:8100",
"prefill_size": 65536,
"tpot_p50_clean_s": 0.0064652375500611585,
"tpot_p50_overlap_s": 0.020095128001588764,
"tpot_p90_clean_s": 0.009607415488272014,
"tpot_p90_overlap_s": 0.021706256481132124,
"tpot_p99_clean_s": 0.016912007837584522,
"tpot_p99_overlap_s": 0.16948255478733715,
"ttft_p90_clean_s": 0.06447408199310305,
"ttft_p90_overlap_s": 14.060086917877197,
"variant": "same"
},
{
"decode_endpoint": "http://127.0.0.1:8100",
"interference_index": 1.8961314610807898,
"n_decode_clean": 221,
"n_decode_overlap": 19,
"n_decode_total": 240,
"n_prefill_injections": 4,
"prefill_endpoint": "http://127.0.0.1:8100",
"prefill_size": 8192,
"tpot_p50_clean_s": 0.00617263052198622,
"tpot_p50_overlap_s": 0.008303543533941712,
"tpot_p90_clean_s": 0.007060385713673601,
"tpot_p90_overlap_s": 0.013387419479061859,
"tpot_p99_clean_s": 0.0076809098022152696,
"tpot_p99_overlap_s": 0.013849472662415166,
"ttft_p90_clean_s": 0.04307150840759277,
"ttft_p90_overlap_s": 0.52073073387146,
"variant": "same"
}
]
}

@@ -0,0 +1,133 @@
{
"rows": [
{
"policy": "capped",
"n_ok": 770,
"n_total": 770,
"ttft_p50_s": 1.195636051998008,
"ttft_p90_s": 12.762421467981767,
"ttft_p99_s": 46.05476947501302,
"tpot_p50_s": 0.007229394937166944,
"tpot_p90_s": 0.015995440982929352,
"tpot_p99_s": 0.10145225453431651,
"e2e_p50_s": 2.5921602529706433,
"e2e_p90_s": 21.238469071977306,
"e2e_p99_s": 73.38509433099534,
"apc_ratio": 0.3158312503528108,
"interference_index": 6.331064378362814,
"hotspot_index_ttft_p90": 1.9366915542605314,
"reuse_intra_frac": 0.9192657105586233,
"reuse_cross_frac": 0.0602232594931501,
"n_slow": 185,
"failure_counts": {
"cache_miss_large_append": 60,
"hot_worker_queue": 66,
"same_worker_prefill_overlap": 45,
"unknown": 14
}
},
{
"policy": "lmetric",
"n_ok": 1214,
"n_total": 1214,
"ttft_p50_s": 0.9369571270071901,
"ttft_p90_s": 15.592678204004187,
"ttft_p99_s": 52.95170431700535,
"tpot_p50_s": 0.008851506907892485,
"tpot_p90_s": 0.02120516549011311,
"tpot_p99_s": 0.17592118933357093,
"e2e_p50_s": 2.7527842019917443,
"e2e_p90_s": 24.75416105298791,
"e2e_p99_s": 79.61890332301846,
"apc_ratio": 0.5694312382571595,
"interference_index": 6.530231061794441,
"hotspot_index_ttft_p90": 2.237981740718548,
"reuse_intra_frac": 0.9321238805590836,
"reuse_cross_frac": 0.05679481258506571,
"n_slow": 295,
"failure_counts": {
"cache_miss_large_append": 94,
"hot_worker_queue": 68,
"same_worker_prefill_overlap": 69,
"unknown": 64
}
},
{
"policy": "load_only",
"n_ok": 1214,
"n_total": 1214,
"ttft_p50_s": 1.2542553890380077,
"ttft_p90_s": 20.14692750602262,
"ttft_p99_s": 52.64810254302574,
"tpot_p50_s": 0.00923045912795929,
"tpot_p90_s": 0.02672785480314115,
"tpot_p99_s": 0.3207044094773148,
"e2e_p50_s": 3.584156609023921,
"e2e_p90_s": 33.42658680601744,
"e2e_p99_s": 93.91839688795153,
"apc_ratio": 0.5412093853102866,
"interference_index": 9.16424627504275,
"hotspot_index_ttft_p90": 1.1400531308102801,
"reuse_intra_frac": 0.9353191550754928,
"reuse_cross_frac": 0.053372184678592026,
"n_slow": 379,
"failure_counts": {
"cache_miss_large_append": 151,
"hot_worker_queue": 33,
"same_worker_prefill_overlap": 108,
"unknown": 87
}
},
{
"policy": "sticky",
"n_ok": 1214,
"n_total": 1214,
"ttft_p50_s": 0.540947844972834,
"ttft_p90_s": 18.016640832996927,
"ttft_p99_s": 71.37327494798228,
"tpot_p50_s": 0.00894752275507555,
"tpot_p90_s": 0.0360956137329512,
"tpot_p99_s": 0.34523129428917954,
"e2e_p50_s": 2.0788628259906545,
"e2e_p90_s": 34.605129147996195,
"e2e_p99_s": 133.5824547969969,
"apc_ratio": 0.7720092868396378,
"interference_index": 13.651718321568111,
"hotspot_index_ttft_p90": 2.3493858974059214,
"reuse_intra_frac": 0.9327723488279339,
"reuse_cross_frac": 0.05495149683864246,
"n_slow": 234,
"failure_counts": {
"cache_miss_large_append": 20,
"hot_worker_queue": 51,
"same_worker_prefill_overlap": 134,
"unknown": 29
}
},
{
"policy": "unified",
"n_ok": 1213,
"n_total": 1214,
"ttft_p50_s": 0.4997710260213353,
"ttft_p90_s": 7.239999514014926,
"ttft_p99_s": 42.022206099005416,
"tpot_p50_s": 0.008079791456705824,
"tpot_p90_s": 0.017107906969874808,
"tpot_p99_s": 0.11808861252148231,
"e2e_p50_s": 1.7495028690318577,
"e2e_p90_s": 17.893827292020433,
"e2e_p99_s": 68.18008507299237,
"apc_ratio": 0.794261466256467,
"interference_index": null,
"hotspot_index_ttft_p90": 3.3497107140827365,
"reuse_intra_frac": 0.9311187350942534,
"reuse_cross_frac": 0.056702150437367635,
"n_slow": 189,
"failure_counts": {
"cache_miss_large_append": 18,
"hot_worker_queue": 116,
"unknown": 55
}
}
]
}

@@ -0,0 +1,114 @@
# B3 Routing Sweep Report
Sweep dir: `b3_sweep_20260525_095043`
Trace: w600_r0.0015_st30.jsonl (~1.2k reqs, 8 × TP1)
Policies present: lmetric, load_only, sticky, unified, capped
Policies pending: —
## Headline latencies + APC
| policy | ok/total | TTFT p50/p90/p99 (s) | TPOT p50/p90/p99 (ms) | E2E p50/p90/p99 (s) | APC |
|---|---:|---|---|---|---:|
| **lmetric** | 1214/1214 | 0.94/15.59/52.95 | 8.9/21.2/175.9 | 2.75/24.75/79.62 | 56.9% |
| **load_only** | 1214/1214 | 1.25/20.15/52.65 | 9.2/26.7/320.7 | 3.58/33.43/93.92 | 54.1% |
| **sticky** | 1214/1214 | 0.54/18.02/71.37 | 8.9/36.1/345.2 | 2.08/34.61/133.58 | 77.2% |
| **unified** | 1213/1214 | 0.50/7.24/42.02 | 8.1/17.1/118.1 | 1.75/17.89/68.18 | 79.4% |
| **capped** | 770/770 | 1.20/12.76/46.05 | 7.2/16.0/101.5 | 2.59/21.24/73.39 | 31.6% |
## Mechanism indices
| policy | interference_index | hotspot_index (TTFT p90) | intra-session reuse | cross-session reuse | n_slow |
|---|---:|---:|---:|---:|---:|
| **lmetric** | 6.53 | 2.24 | 93.2% | 5.7% | 295 |
| **load_only** | 9.16 | 1.14 | 93.5% | 5.3% | 379 |
| **sticky** | 13.65 | 2.35 | 93.3% | 5.5% | 234 |
| **unified** | — | 3.35 | 93.1% | 5.7% | 189 |
| **capped** | 6.33 | 1.94 | 91.9% | 6.0% | 185 |
- **interference_index** = TPOT_p90(decode overlapping same-worker prefill) / TPOT_p90(clean)
- **hotspot_index** = max(worker TTFT_p90) / median(worker TTFT_p90)
## Slow-request cause breakdown
| policy | n_slow | same-worker overlap | hot worker queue | cache miss large append | high KV | unknown |
|---|---:|---:|---:|---:|---:|---:|
| **lmetric** | 295 | 69 | 68 | 94 | 0 | 64 |
| **load_only** | 379 | 108 | 33 | 151 | 0 | 87 |
| **sticky** | 234 | 134 | 51 | 20 | 0 | 29 |
| **unified** | 189 | 0 | 116 | 18 | 0 | 55 |
| **capped** | 185 | 45 | 66 | 60 | 0 | 14 |
## Policy notes
- **lmetric** — cache-aware P_tokens × BS (main baseline)
- **load_only** — control: min(num_requests), no cache, no affinity
- **sticky** — control: hard session affinity (never break)
- **unified** — hybrid affinity + LMetric fallback
- **capped** — lmetric on per-session turn-capped trace
## Per-policy per-worker TTFT p90 (s)
### lmetric
| worker | TTFT p90 (s) |
|---|---:|
| http://127.0.0.1:8000 | 28.18 |
| http://127.0.0.1:8001 | 13.15 |
| http://127.0.0.1:8002 | 13.82 |
| http://127.0.0.1:8003 | 14.00 |
| http://127.0.0.1:8004 | 31.34 |
| http://127.0.0.1:8005 | 7.87 |
| http://127.0.0.1:8006 | 14.15 |
| http://127.0.0.1:8007 | 11.78 |
### load_only
| worker | TTFT p90 (s) |
|---|---:|
| http://127.0.0.1:8000 | 22.06 |
| http://127.0.0.1:8001 | 16.43 |
| http://127.0.0.1:8002 | 16.81 |
| http://127.0.0.1:8003 | 23.58 |
| http://127.0.0.1:8004 | 25.14 |
| http://127.0.0.1:8005 | 16.08 |
| http://127.0.0.1:8006 | 23.96 |
| http://127.0.0.1:8007 | 13.95 |
### sticky
| worker | TTFT p90 (s) |
|---|---:|
| http://127.0.0.1:8000 | 12.28 |
| http://127.0.0.1:8001 | 23.57 |
| http://127.0.0.1:8002 | 5.20 |
| http://127.0.0.1:8003 | 55.38 |
| http://127.0.0.1:8004 | 17.03 |
| http://127.0.0.1:8005 | 25.49 |
| http://127.0.0.1:8006 | 36.31 |
| http://127.0.0.1:8007 | 2.50 |
### unified
| worker | TTFT p90 (s) |
|---|---:|
| http://127.0.0.1:8000 | 11.26 |
| http://127.0.0.1:8001 | 3.61 |
| http://127.0.0.1:8002 | 16.18 |
| http://127.0.0.1:8003 | 9.31 |
| http://127.0.0.1:8004 | 37.73 |
| http://127.0.0.1:8005 | 18.33 |
| http://127.0.0.1:8006 | 3.63 |
| http://127.0.0.1:8007 | 7.77 |
### capped
| worker | TTFT p90 (s) |
|---|---:|
| http://127.0.0.1:8000 | 19.77 |
| http://127.0.0.1:8001 | 15.79 |
| http://127.0.0.1:8002 | 20.40 |
| http://127.0.0.1:8003 | 10.54 |
| http://127.0.0.1:8004 | 9.52 |
| http://127.0.0.1:8005 | 9.46 |
| http://127.0.0.1:8006 | 7.38 |
| http://127.0.0.1:8007 | 9.66 |

9 binary files not shown (sizes: 84, 79, 39, 38, 49, 58, 51, 36, and 31 KiB).

@@ -0,0 +1,26 @@
{
"formula": "kv_bytes_per_request = input_tokens * kv_bytes_per_token",
"kv_bytes_per_request": {
"count": 2114220,
"max": 19893878784.0,
"mean": 3306689367.3278427,
"min": 0.0,
"p50": 1969029120.0,
"p90": 8636507750.40001,
"p95": 10296164352.0,
"p99": 12339806208.0
},
"kv_bytes_per_token": 98304.0,
"kv_mib_per_request": {
"count": 2114220,
"max": 18972.28125,
"mean": 3153.5047219541957,
"min": 0.0,
"p50": 1877.8125,
"p90": 8236.415625000009,
"p95": 9819.1875,
"p99": 11768.15625
},
"status": "available",
"total_kv_gib": 6510940.188720703
}

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 2.237981740718548,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 34.71445541951107,
"http://127.0.0.1:8001": 21.922988962882666,
"http://127.0.0.1:8002": 23.936190764518685,
"http://127.0.0.1:8003": 26.22220957049285,
"http://127.0.0.1:8004": 40.318757307820505,
"http://127.0.0.1:8005": 12.26559703698149,
"http://127.0.0.1:8006": 27.904838753980588,
"http://127.0.0.1:8007": 18.430557113309625
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 28.18261351052206,
"http://127.0.0.1:8001": 13.147308969072796,
"http://127.0.0.1:8002": 13.818959677941162,
"http://127.0.0.1:8003": 14.003642184572524,
"http://127.0.0.1:8004": 31.339895512629305,
"http://127.0.0.1:8005": 7.870992770011071,
"http://127.0.0.1:8006": 14.149156623415186,
"http://127.0.0.1:8007": 11.777357225219024
},
"status": "supported"
}

@@ -0,0 +1,15 @@
{
"cross_session_tokens": 1723017,
"fractions": {
"cross": 0.05679481258506571,
"intra": 0.9321238805590836,
"shared": 0.011081306855850749,
"unclassified": 0.0
},
"intra_session_tokens": 28278380,
"shared_prefix_min_sessions": 8,
"shared_prefix_tokens": 336180,
"status": "supported",
"total_cached_tokens": 30371008,
"unclassified_tokens": 0
}

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 1.9366915542605314,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 23.81083881931848,
"http://127.0.0.1:8001": 18.139674991380897,
"http://127.0.0.1:8002": 29.116712999995805,
"http://127.0.0.1:8003": 19.245074290811324,
"http://127.0.0.1:8004": 17.230851700413044,
"http://127.0.0.1:8005": 15.86663371440958,
"http://127.0.0.1:8006": 16.707309890014592,
"http://127.0.0.1:8007": 23.93718611740042
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 19.772570010094213,
"http://127.0.0.1:8001": 15.786850639013576,
"http://127.0.0.1:8002": 20.403525242628533,
"http://127.0.0.1:8003": 10.535247699997853,
"http://127.0.0.1:8004": 9.52290979558602,
"http://127.0.0.1:8005": 9.455131393985376,
"http://127.0.0.1:8006": 7.379608143202497,
"http://127.0.0.1:8007": 9.661995008389932
},
"status": "supported"
}

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 2.237981740718548,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 34.71445541951107,
"http://127.0.0.1:8001": 21.922988962882666,
"http://127.0.0.1:8002": 23.936190764518685,
"http://127.0.0.1:8003": 26.22220957049285,
"http://127.0.0.1:8004": 40.318757307820505,
"http://127.0.0.1:8005": 12.26559703698149,
"http://127.0.0.1:8006": 27.904838753980588,
"http://127.0.0.1:8007": 18.430557113309625
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 28.18261351052206,
"http://127.0.0.1:8001": 13.147308969072796,
"http://127.0.0.1:8002": 13.818959677941162,
"http://127.0.0.1:8003": 14.003642184572524,
"http://127.0.0.1:8004": 31.339895512629305,
"http://127.0.0.1:8005": 7.870992770011071,
"http://127.0.0.1:8006": 14.149156623415186,
"http://127.0.0.1:8007": 11.777357225219024
},
"status": "supported"
}

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 1.1400531308102801,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 33.51168999259829,
"http://127.0.0.1:8001": 29.20308109278556,
"http://127.0.0.1:8002": 27.126518827211115,
"http://127.0.0.1:8003": 38.597240307606995,
"http://127.0.0.1:8004": 36.607777832809376,
"http://127.0.0.1:8005": 28.097025175404276,
"http://127.0.0.1:8006": 49.29610514297965,
"http://127.0.0.1:8007": 20.958507975534303
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 22.055091864388675,
"http://127.0.0.1:8001": 16.425856862741057,
"http://127.0.0.1:8002": 16.806352904380766,
"http://127.0.0.1:8003": 23.581166115606912,
"http://127.0.0.1:8004": 25.14397653030465,
"http://127.0.0.1:8005": 16.080231266201018,
"http://127.0.0.1:8006": 23.960470345703648,
"http://127.0.0.1:8007": 13.95184187250561
},
"status": "supported"
}

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 2.3493858974059214,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 30.185792533413043,
"http://127.0.0.1:8001": 47.49661003401852,
"http://127.0.0.1:8002": 22.069474861002554,
"http://127.0.0.1:8003": 83.73774532350944,
"http://127.0.0.1:8004": 22.03310715127737,
"http://127.0.0.1:8005": 33.024566102202556,
"http://127.0.0.1:8006": 61.65600914339302,
"http://127.0.0.1:8007": 6.077459598158019
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 12.284569517592924,
"http://127.0.0.1:8001": 23.570226482005094,
"http://127.0.0.1:8002": 5.202772857400123,
"http://127.0.0.1:8003": 55.37555769548635,
"http://127.0.0.1:8004": 17.031311958114394,
"http://127.0.0.1:8005": 25.48531596700202,
"http://127.0.0.1:8006": 36.31029207323453,
"http://127.0.0.1:8007": 2.4984901855932535
},
"status": "supported"
}

@@ -0,0 +1,24 @@
{
"hotspot_index_ttft_p90": 3.3497107140827365,
"per_worker_latency_p90_s": {
"http://127.0.0.1:8000": 41.42001512600109,
"http://127.0.0.1:8001": 12.4878579101933,
"http://127.0.0.1:8002": 22.462878945574648,
"http://127.0.0.1:8003": 15.501050900109117,
"http://127.0.0.1:8004": 39.956250199786155,
"http://127.0.0.1:8005": 36.69850301651168,
"http://127.0.0.1:8006": 10.116177947795954,
"http://127.0.0.1:8007": 20.35038618039107
},
"per_worker_ttft_p90_s": {
"http://127.0.0.1:8000": 11.264844838529825,
"http://127.0.0.1:8001": 3.6063860427122614,
"http://127.0.0.1:8002": 16.175747957825664,
"http://127.0.0.1:8003": 9.314684258581842,
"http://127.0.0.1:8004": 37.73397144810297,
"http://127.0.0.1:8005": 18.328030522551852,
"http://127.0.0.1:8006": 3.6328767628350773,
"http://127.0.0.1:8007": 7.772977900883419
},
"status": "supported"
}

@@ -0,0 +1,136 @@
{
"analyzed_records": 2114220,
"batch0": {
"attempted_requests": 2114220,
"completed_requests": null,
"error_requests": null,
"max_inflight_per_session": null,
"session_concurrency_status": "unavailable",
"session_sequential": null
},
"batch1": {
"append_status": "unavailable",
"input_stats": {
"count": 2114220,
"max": 202371.0,
"mean": 33637.38370084476,
"min": 0.0,
"p50": 20030.0,
"p90": 87855.1000000001,
"p95": 104738.0,
"p99": 125527.0
},
"kv_footprint_status": "available",
"output_stats": {
"count": 2114220,
"max": 132665.0,
"mean": 444.97059624826176,
"min": 0.0,
"p50": 80.0,
"p90": 811.0,
"p95": 2213.0,
"p99": 6614.810000000056
},
"reuse_status": "unavailable"
},
"classification": {
"label": "invalid_for_online_claim",
"reason": "actual dispatch/finish timestamps are unavailable, so online sequentiality cannot be proven",
"source": "auto",
"stress_indicators": []
},
"manifest": {
"canonical_trace_data_sources": {
"dash0_formatted_trace_dir": "~/ali-trace/trace-glm5.1-formatted/",
"dash0_raw_trace_dir": "~/ali-trace/trace-glm5.1/",
"usage_note": "Full trace analysis can be run CPU-only on dash0, or the needed JSONL files can be copied/rsynced from dash0 to this machine before running this analyzer."
},
"end_time": "2026-05-25T09:03:36.499002+00:00",
"figure_status": {
"reason": "matplotlib unavailable: ModuleNotFoundError(\"No module named 'matplotlib'\")",
"status": "skipped"
},
"git_commit": "",
"gpu_count": 0,
"gpu_type": "",
"host": "ds-6348bee4-1-765874c9c4-7zrvf",
"input_requirements": {
"actual_sequentiality_proof": [
"per-request dispatch timestamp",
"per-request finish or error/timeout timestamp",
"request_id join to trace/metrics when timing source is separate"
],
"metrics_jsonl": [
"request_id",
"session_id",
"trace_timestamp_s",
"input_length",
"output_length",
"latency_s",
"ttft_s",
"tpot_s",
"error",
"optional cached_tokens"
],
"reuse_decomposition": [
"cached_tokens or cache_hit",
"hash_ids",
"session_id"
],
"trace_jsonl": [
"chat_id",
"parent_chat_id",
"timestamp",
"input_length",
"output_length",
"turn",
"hash_ids",
"optional session_id"
]
},
"input_status": {
"analyzed_records": 2114220,
"breakdown_records": 0,
"merge_warnings": [],
"metrics_records": 0,
"trace_records": 2114220,
"trace_warnings": [],
"unmatched_breakdown": 0,
"unmatched_metrics": 0
},
"launch_command": "analysis/characterization/analyze.py --trace /home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl --kv-bytes-per-token 98304 --task-name full_trace_with_kv --output-root outputs/characterization --overwrite",
"output_dir": "outputs/characterization/2026-05-25/full_trace_with_kv",
"policy": "",
"request_limit": null,
"session_sampling_method": "",
"session_sequential": null,
"start_time": "2026-05-25T08:59:11.618919+00:00",
"time_scale": null,
"trace_file_info": {
"exists": true,
"mtime_s": 1778772033.2788928,
"path": "/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl",
"sha256": "",
"sha256_status": "skipped_use_--hash-inputs",
"size_bytes": 1561266372
},
"trace_path": "/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl",
"trace_sha256": ""
},
"outputs": [
"append_delta_stats.json",
"invalid_runs.md",
"kv_footprint_summary.json",
"manifest.json",
"raw/merged_requests.jsonl",
"raw/unmatched_breakdown.jsonl",
"raw/unmatched_metrics.jsonl",
"reuse_decomposition.json",
"session_arrival_stats.json",
"session_concurrency.json",
"session_skew.json",
"trace_profile.json",
"turn_interval_stats.json",
"workload_summary.json"
]
}