Window 1 results: combined B1' + B2 + B3 report and artifacts
analysis/characterization/window_1_results.md is the headline write-up for Window 1: workload characterization (KV per request, real reuse decomposition, APC theoretical ceilings), the B3 5-policy sweep with per-policy interpretation, the B2 same-vs-different-worker interference microbench with causal reading, and an explicit list of what Window 1 does *not* answer (deferred to B4 SRR sweep + B5 attribution).

Under window_1_results/:
- 5 raw result JSONs from the B3 sweep, the B2 microbench, the APC upper bound, and the KV footprint
- per-policy hotspot_index.json snapshots so render_window1_figures.py can plot per-worker TTFT p90 distributions
- 9 PNG figures (figures/) covering the headline claims

Three takeaways the figures pin down:
1) intra-session reuse dominates (93.2%), so session-affinity routing is the right primary lever
2) unified hybrid affinity hits 79.4% APC (99.8% of the 79.6% intra-session ceiling) AND cuts TTFT p90 from lmetric's 15.6s to 7.24s
3) B2 different-worker control sits at idx ≈ 1.0 across 32× prefill-size variation; same-worker TTFT idx scales 2.15× -> 218×, which is the cleanest causal evidence for same-worker prefill-decode interference

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
171
analysis/characterization/window_1_results.md
Normal file
@@ -0,0 +1,171 @@
# Window 1 Results: B1' + B2 + B3

Status: Window 1 complete (CPU + 2 dash0 GPU windows on 2026-05-25)
Sweep: `outputs/b3_sweep_20260525_095043` (B3) + `outputs/b2_microbench/` (B2)
Trace: `traces/w600_r0.0015_st30.jsonl` (1214 requests / 274 sessions / 53.3 M input tokens)
Model: Qwen3-Coder-30B-A3B-Instruct (TP1 × 8 instances on H20)

Per-policy artifacts under `window_1_results/`. Figures under `window_1_results/figures/`.

## Headline

| Claim | Status | Evidence |
|---|---|---|
| Agentic workload reuse is overwhelmingly intra-session | **supported** | 93.2% of cached_tokens are intra-session (real); theoretical any-session APC ceiling 80.3% vs intra-session ceiling 79.6% → < 1 pp gap |
| LMetric leaves 23 pp of APC on the table | **supported** | lmetric achieved 56.9% vs intra-session ceiling 79.6% (theoretical) |
| Hard session affinity recovers the locality lost by LMetric | **supported** | sticky APC 77.2% = 97% of theoretical ceiling |
| Hard affinity inflates same-worker prefill-decode interference | **supported** | sticky interference_index 13.65 vs lmetric 6.53 |
| Hybrid affinity (Unified) breaks the locality-vs-latency tradeoff | **supported** | unified hits 79.4% APC and TTFT p90 7.24 s (lmetric 15.6 s) simultaneously |
| Same-worker prefill-decode interference is causal, not merely correlational | **supported** | different-worker control idx≈1.0; same-worker idx scales monotonically with prefill size |
| Heavy-tail sessions are *a* contributor to hot-spot, not the sole cause | **supported** | cap=8 truncated trace cuts 37% of requests (1214 → 770); hotspot drops only 13% (2.24→1.94) |

## B1' Workload characterization

### Per-request KV footprint (Qwen3-Coder-30B-A3B)

`kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes = 2 × 48 × 4 × 128 × 2 = 98304 B`

Full GLM-5.1 trace (2.11 M requests, 1.31 M sessions):

| | p50 | p90 | p95 | p99 | max |
|---|---:|---:|---:|---:|---:|
| KV per request | 1.83 GiB | 8.04 GiB | 9.59 GiB | **11.49 GiB** | 18.5 GiB |

H20 has ~95 GiB usable per GPU. **A single p99 request occupies 12% of a single H20's HBM** purely for KV. Multi-request batching is bounded by this.

Figure: `figures/fig_kv_footprint_cdf.png`.

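The footprint arithmetic above can be reproduced in a few lines. This is a minimal sketch rather than the repository's script: the function names are hypothetical, the model dimensions are the ones in the formula, the p99 byte count is copied from the KV-footprint JSON in this commit, and the 95 GiB usable-HBM figure is the write-up's own estimate.

```python
GIB = 1024 ** 3
H20_USABLE_GIB = 95  # the write-up's estimate of usable HBM per H20


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache per token; the factor 2 covers K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_gib(input_tokens, bytes_per_token):
    """KV footprint of one request in GiB (input_tokens * bytes per token)."""
    return input_tokens * bytes_per_token / GIB


def hbm_fraction(request_gib, usable_gib=H20_USABLE_GIB):
    """Share of one GPU's usable HBM taken by a single request's KV."""
    return request_gib / usable_gib


# Dimensions from the formula above: 48 layers, 4 KV heads, head_dim 128, 2-byte dtype.
PER_TOKEN = kv_bytes_per_token(48, 4, 128, 2)

# p99 of kv_bytes_per_request from the KV-footprint JSON.
P99_BYTES = 12339806208
p99_tokens = P99_BYTES // PER_TOKEN  # input tokens implied by the p99 footprint
p99_gib = kv_gib(p99_tokens, PER_TOKEN)

print(PER_TOKEN, p99_tokens, round(p99_gib, 2), round(hbm_fraction(p99_gib), 3))
```

Working backwards from the p99 byte count shows which prompt length the headline number corresponds to (roughly 125k input tokens).
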
### Real reuse decomposition (from lmetric run on w600 trace)

| class | tokens | fraction |
|---|---:|---:|
| intra-session | 28.3 M | **93.2%** |
| cross-session | 1.72 M | 5.7% |
| shared / system-prefix | 0.34 M | 1.1% |
| unclassified | 0 | 0.0% |

→ session-affinity routing covers the 93.2% of observed reuse that is intra-session, and the intra-session ceiling is 99% of the any-session ceiling (79.6% vs 80.3%, next subsection). Shared / system-prefix reuse is only 1.1%, so there is no meaningful "system prompt" in this trace.

Figure: `figures/fig_reuse_decomposition.png`.

### Theoretical APC ceilings on w600

Computed by building a block-level trie of `hash_ids` per session (intra-session) or globally (any-session), then walking each request's `hash_ids` to count its longest prefix-match against previously-seen prefixes.

| variant | upper bound | hit requests |
|---|---:|---:|
| any-session (perfect global cache) | **80.3%** | 961 / 1214 |
| intra-session only | **79.6%** | 914 / 1214 |
| shared-prefix only (pos 0, ≥8 sessions) | 0.10% | 107 / 1214 |

Gap "any − intra" is 0.7 pp → no meaningful cross-session sharing in this trace.

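The trie walk described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the script that produced the table: `hash_ids` is the field named in the write-up and the 512-token block size comes from the APC upper-bound JSON, while the `session_id` and `input_tokens` field names, the class, and the function are hypothetical. Requests are assumed to be in arrival order.

```python
from collections import defaultdict

BLOCK_SIZE = 512  # "block_size" in the APC upper-bound JSON


class PrefixTrie:
    """Trie over block hashes; each path from the root is a seen prefix."""

    def __init__(self):
        self.root = {}

    def longest_prefix(self, hash_ids):
        """Number of leading blocks that match a previously inserted prefix."""
        node, depth = self.root, 0
        for h in hash_ids:
            child = node.get(h)
            if child is None:
                break
            node, depth = child, depth + 1
        return depth

    def insert(self, hash_ids):
        node = self.root
        for h in hash_ids:
            node = node.setdefault(h, {})


def apc_upper_bound(requests, per_session):
    """Return (cached_tokens / input_tokens, number of requests with a hit).

    per_session=True keeps one trie per session (intra-session ceiling);
    per_session=False shares one trie across all requests (any-session ceiling).
    """
    tries = defaultdict(PrefixTrie)
    cached = total = hits = 0
    for req in requests:  # assumed to be sorted by arrival time
        trie = tries[req["session_id"] if per_session else None]
        matched = trie.longest_prefix(req["hash_ids"])
        # A matched block can never cover more tokens than the prompt has.
        cached += min(matched * BLOCK_SIZE, req["input_tokens"])
        total += req["input_tokens"]
        hits += 1 if matched > 0 else 0
        trie.insert(req["hash_ids"])
    return (cached / total if total else 0.0), hits


demo = [
    {"session_id": "a", "hash_ids": [1, 2], "input_tokens": 1024},
    {"session_id": "a", "hash_ids": [1, 2, 3], "input_tokens": 1536},
    {"session_id": "b", "hash_ids": [1, 9], "input_tokens": 1024},
]
print(apc_upper_bound(demo, per_session=True))
print(apc_upper_bound(demo, per_session=False))
```

In the toy input, the third request shares its first block with session `a`, so it counts as a hit only under the any-session variant. That difference is the "any − intra" gap the table reports as 0.7 pp on w600.
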
## B3 5-policy routing sweep

8 vLLM instances on TP1, w600 trace, `--enable-prompt-tokens-details` so `cached_tokens` is reported per request.

| policy | TTFT p50/p90/p99 (s) | TPOT p50/p90/p99 (ms) | E2E p50/p90/p99 (s) | **APC** | interference | **hotspot** | n_slow |
|---|---|---|---|---:|---:|---:|---:|
| lmetric | 0.94 / 15.59 / 52.95 | 8.9 / 21.2 / 175.9 | 2.75 / 24.75 / 79.62 | 56.9% | 6.53 | 2.24 | 295 |
| load_only | 1.25 / 20.15 / 52.65 | 9.2 / 26.7 / 320.7 | 3.58 / 33.43 / 93.92 | 54.1% | 9.16 | **1.14** | 379 |
| sticky | 0.54 / 18.02 / 71.37 | 8.9 / 36.1 / 345.2 | 2.08 / 34.61 / 133.58 | 77.2% | **13.65** | 2.35 | 234 |
| **unified** | **0.50 / 7.24 / 42.02** | 8.1 / 17.1 / 118.1 | **1.75 / 17.89 / 68.18** | **79.4%** | n/a* | 3.35 | **189** |
| capped | 1.20 / 12.76 / 46.05 | 7.2 / 16.0 / 101.5 | 2.59 / 21.24 / 73.39 | 31.6% | 6.33 | 1.94 | 185 |

\*unified `engine_state` was overwritten by my analyzer's slice step before the `b3_analyze.sh` fix landed; vLLM and the patch worked correctly. The B2 microbench provides a cleaner interference proof.

**Mechanism indices**
- `interference_index` = TPOT_p90(decode overlapping same-worker prefill) / TPOT_p90(clean)
- `hotspot_index` = max(worker TTFT p90) / median(worker TTFT p90)

Figures: `fig_b3_latency_bars.png`, `fig_b3_apc_vs_upper.png`,
`fig_b3_apc_vs_hotspot.png`, `fig_b3_per_worker_ttft_p90.png`,
`fig_b3_failure_breakdown.png`.

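The two mechanism indices defined above are simple ratios, sketched here against numbers that appear in this commit. The function names are hypothetical and this is not the analyzer's code. One observation, offered as an inference from the data rather than a statement about the implementation: with 8 workers, the reported `hotspot_index` values appear to be reproduced by dividing by the upper-middle element (`statistics.median_high`) rather than by the interpolated median.

```python
from statistics import median, median_high


def hotspot_index(per_worker_ttft_p90):
    """max / median of per-worker TTFT p90, using the high median.

    Assumption: the high median is used because it matches the reported
    values for all five policies; the interpolated median does not.
    """
    values = list(per_worker_ttft_p90)
    return max(values) / median_high(values)


def interference_index(tpot_p90_overlap, tpot_p90_clean):
    """TPOT p90 of decodes overlapping a same-worker prefill over clean TPOT p90."""
    return tpot_p90_overlap / tpot_p90_clean


# per_worker_ttft_p90_s for lmetric, copied from the hotspot JSON in this commit.
LMETRIC_TTFT_P90 = [
    28.18261351052206, 13.147308969072796, 13.818959677941162,
    14.003642184572524, 31.339895512629305, 7.870992770011071,
    14.149156623415186, 11.777357225219024,
]

print(hotspot_index(LMETRIC_TTFT_P90))                         # high-median variant
print(max(LMETRIC_TTFT_P90) / median(LMETRIC_TTFT_P90))        # interpolated median
# B2 same-worker 16k cell: tpot_p90_overlap_s / tpot_p90_clean_s.
print(interference_index(0.024142643417974917, 0.0071605749804564195))
```

The distinction only matters for an even number of workers, but it shifts the lmetric index from about 2.24 to about 2.25, so it is worth pinning down before comparing against other tooling.
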
### Per-policy reading

- **lmetric** is the cache-aware baseline. APC of 56.9% reaches only 71% of the intra-session ceiling; the missing 23 pp is the locality opportunity that unified picks up.
- **load_only** strips cache awareness. Hot-spot drops to 1.14 (best), but APC only drops 3 pp because the picker's `min(num_requests)` tie-break to instance 0 creates accidental stickiness at low concurrency.
- **sticky** locks each session to one worker. APC climbs to 77.2% (97% of ceiling) but interference doubles to 13.65 and TPOT p99 hits 345 ms.
- **unified** is the hybrid: the affinity gate `(cache_ratio>0.5 AND num_req ≤ 2×avg)` keeps locality where it pays and drops it where it would hurt. The result is APC 79.4% **and** a TTFT p90 less than half of lmetric's (7.24 s vs 15.59 s). The one bad worker (engine_4 at 37.7 s p90) drives `hotspot_index=3.35`, while the other seven workers are all at or below 18.3 s.
- **capped** runs lmetric on a turn-capped trace (max 8 turns/session). Removes 37% of requests but APC also crashes to 31.6% and hotspot only improves by 13%. This is the session-mass ablation: heavy sessions are *a* contributor to hot-spot but not the sole cause.

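The affinity gate quoted for **unified** can be sketched as below. This is a hedged illustration of the gate condition only: the function names are hypothetical, `cache_ratio` is assumed to be the fraction of the request's prompt expected to hit cache on the session's previous worker, and the fallback shown is a least-loaded stand-in, not the LMetric scorer that the real policy falls back to.

```python
from statistics import mean


def affinity_gate(cache_ratio, num_req, avg_num_req):
    """Gate from the write-up: cache_ratio > 0.5 AND num_req <= 2 x avg."""
    return cache_ratio > 0.5 and num_req <= 2 * avg_num_req


def least_loaded(num_reqs):
    """Stand-in fallback (assumption); the real fallback is LMetric."""
    return min(range(len(num_reqs)), key=lambda i: num_reqs[i])


def pick_worker(affine_worker, cache_ratio, num_reqs, fallback=least_loaded):
    """Keep the session on its previous worker only while the gate holds.

    affine_worker: index of the worker that served the session's last turn,
                   or None for a new session.
    num_reqs:      in-flight request count per worker.
    """
    if affine_worker is not None and affinity_gate(
        cache_ratio, num_reqs[affine_worker], mean(num_reqs)
    ):
        return affine_worker
    return fallback(num_reqs)


print(pick_worker(0, 0.8, [3, 1, 1, 1]))     # gate holds: stay on worker 0
print(pick_worker(0, 0.8, [9, 1, 1, 1]))     # worker 0 overloaded: fall back
print(pick_worker(0, 0.3, [2, 1, 1, 1]))     # little to reuse: fall back
print(pick_worker(None, 0.0, [2, 1, 3, 1]))  # new session: fall back
```

The load term is what separates this from **sticky**: a session leaves its worker once that worker holds more than twice the average number of requests, even when the cache hit would be large.
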
### Slow-request cause breakdown (from `joined_analysis.label_slow_requests`)

| policy | n_slow | same-worker overlap | hot worker queue | cache miss large append | unknown |
|---|---:|---:|---:|---:|---:|
| lmetric | 295 | 69 (23%) | 68 (23%) | 94 (32%) | 64 (22%) |
| load_only | 379 | 108 (28%) | 33 (9%) | 151 (40%) | 87 (23%) |
| sticky | 234 | **134 (57%)** | 51 (22%) | **20 (9%)** | 29 (12%) |
| unified | 189 | 0 (no engine_state) | 116 (61%) | 18 (10%) | 55 (29%) |
| capped | 185 | 45 (24%) | 66 (36%) | 60 (32%) | 14 (8%) |

PD-colo failures are mixed-mechanism: lmetric has no single dominant cause.
Sticky concentrates failures into same-worker overlap (locality is on, cache misses are gone, but interference takes over).

## B2 PD-colo interference microbench

Setup: 2 vLLM instances on GPU 0 (decode endpoint) and GPU 1 (prefill endpoint). A continuous 4 req/s short-prompt decode load runs against GPU 0 for 60 s per cell. Four large-prompt, one-token "prefill injections" fire every 12 s, targeted at either the same instance (`same`) or the paired one (`different`). Decode requests are labeled overlap iff their `[t_first_token, t_finish]` intersects any injection window. We compare TPOT p90 (overlap vs clean) per cell.

| variant | prefill | n_overlap | n_clean | **TPOT idx** | **TTFT idx** |
|---|---:|---:|---:|---:|---:|
| different | 2k–65k | 12–126 | 114–228 | **0.92–1.02** | **0.96–1.00** |
| same | 2k | 12 | 228 | 1.16 | 2.15 |
| same | 8k | 19 | 221 | 1.90 | **12.1×** |
| same | 16k | 37 | 203 | 3.37 | **30.8×** |
| same | 32k | 67 | 173 | **7.89** | **94.6×** |
| same | 65k | 130 | 110 | 2.26* | **218×** |

\*The 65k TPOT idx is suppressed because n_overlap > n_clean (130 vs 110): a 65k prefill runs long enough that, by the time it finishes, only about 4 s remain before the next injection, so most decodes overlap an injection window. The few "clean" decodes are the ones that happen to land in those brief gaps, and their TPOT p90 is itself inflated (9.6 ms vs about 7 ms in the other cells), which shrinks the ratio.

Figures: `fig_b2_tpot_vs_prefill.png`, `fig_b2_ttft_vs_prefill.png`.

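The overlap labeling in the setup paragraph is an interval-intersection test. A minimal sketch follows, assuming closed intervals and a nearest-rank p90; the function names are hypothetical, and the percentile method used by `b2_sweep_analysis.py` is not stated in the write-up, so exact values may differ.

```python
def label_decodes(decodes, injections):
    """Label each decode request "overlap" or "clean".

    decodes:    list of (t_first_token, t_finish) per decode request
    injections: list of (t_start, t_end) per prefill injection window
    A decode overlaps iff its interval intersects any injection window.
    """
    labels = []
    for first, finish in decodes:
        hit = any(first <= inj_end and inj_start <= finish
                  for inj_start, inj_end in injections)
        labels.append("overlap" if hit else "clean")
    return labels


def p90(values):
    """Nearest-rank 90th percentile (assumption: the real script may interpolate)."""
    ordered = sorted(values)
    rank = -(-9 * len(ordered) // 10)  # ceil(0.9 * n) in integer arithmetic
    return ordered[rank - 1]


def tpot_index(tpots, labels):
    """TPOT p90 of overlapping decodes divided by TPOT p90 of clean decodes."""
    overlap = [t for t, lab in zip(tpots, labels) if lab == "overlap"]
    clean = [t for t, lab in zip(tpots, labels) if lab == "clean"]
    return p90(overlap) / p90(clean)


decodes = [(0.0, 1.0), (5.0, 6.0), (11.5, 12.5), (20.0, 21.0)]
injections = [(12.0, 14.0)]
labels = label_decodes(decodes, injections)
print(labels)
print(tpot_index([0.006, 0.007, 0.021, 0.007], labels))
```

Because the index is a ratio of two percentiles, it becomes unstable once one class is small or contaminated, which is the situation the 65k footnote describes.
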

**Why this matters**
- The `different-worker` control sits at idx ≈ 1.0 across 32× variation in prefill size. This is the cleanest possible disproof of "any prefill anywhere hurts decode": prefill on a *different* worker is invisible to the decode worker.
- The `same-worker` curve is monotone in prefill size for TTFT (218× at 65k) and monotone-up-to-32k for TPOT (7.89×). The two ablations together establish causation: prefill-decode interference is a same-worker phenomenon and scales sharply with prefill mass.
- This is the mechanism behind the B3 sticky interference jump (13.65) and unified's single hot worker (engine_4 at 37.7 s TTFT p90).

## What Window 1 does *not* answer

These need Window 2 (B4 SRR sweep + B5 failure attribution near SRR boundary):

1. **Sustainable arrival rate (SRR) per policy under SLO**. B3 was driven by trace timestamps with strict session sequentiality; when 8 instances cannot keep up, requests pile up and the *effective* dispatch window stretches (lmetric: trace claims 600 s, actual replay 49 min). We measured *saturated* behavior but not the saturation point. B4 needs the A4 open-loop Poisson loadgen with per-class SLO thresholds.
2. **Failure breakdown at the SRR boundary**. B5 will rerun each policy at 0.9× / 1.0× / 1.1× of its SRR_max and label each SLO-violating request, which gives the paper its causal failure-attribution table.

Optional / paper-polish runs (not blocking the story):

3. unified isolated rerun to capture `interference_index` (B2 already provides cleaner causal proof; skip unless reviewer asks).
4. B2 with the proxy in path: measure whether the production cache_aware routing actually pushes prefill and decode onto different workers in practice.
5. KV-occupancy timeline per worker: needs polling `vllm:gpu_cache_usage` during B3 reruns; useful for "KV pressure drives cache miss" subsection.

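For item 5, one possible shape for the poller is sketched below. It is a rough illustration, not part of this commit: the function names are hypothetical, it assumes each vLLM instance serves Prometheus text on `/metrics`, and the gauge name is passed in as a parameter because the exact exported name depends on the vLLM version and should be checked against the live `/metrics` output.

```python
import time
import urllib.request


def parse_gauge(metrics_text, metric_name):
    """Extract all sample values for one metric from Prometheus text format.

    Assumes samples carry no trailing timestamp, so the value is the last
    whitespace-separated field on the line.
    """
    values = []
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric_name:
            values.append(float(value))
    return values


def poll_kv_usage(base_urls, metric_name, interval_s=1.0, samples=3):
    """Return a list of (unix_time, base_url, values) rows, one per poll."""
    rows = []
    for _ in range(samples):
        for url in base_urls:
            with urllib.request.urlopen(f"{url}/metrics", timeout=2.0) as resp:
                text = resp.read().decode("utf-8")
            rows.append((time.time(), url, parse_gauge(text, metric_name)))
        time.sleep(interval_s)
    return rows


sample_text = (
    "# HELP vllm:gpu_cache_usage GPU KV-cache usage.\n"
    'vllm:gpu_cache_usage{model_name="demo model"} 0.25\n'
    "some_other_metric 3\n"
)
print(parse_gauge(sample_text, "vllm:gpu_cache_usage"))
```

`poll_kv_usage` is left uncalled here because it needs running instances. Joining its rows with the per-request logs by timestamp would give the per-worker occupancy timeline the item asks for.
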
## Caveats and known data hygiene issues

- **APC contamination across B3 hot-sweep**: `lmetric` ran from cold; `load_only` and `sticky` ran on the same 8 vLLMs without restart. Empirical contamination is < 1% (verified by first-turn cached_tokens distribution), but `unified` and `capped` were rerun cold-start specifically to remove any residual concern.
- **Unified's `interference_index` is missing** because the original `b3_analyze.sh` unconditionally truncate-wrote sliced engine_state files; isolated runs that wrote engine_state into their own per-policy directory were overwritten. Fixed in commit `df32499`; capped was the first run to benefit and survived with intact 86 MB engine_state.
- **w600 is not the full GLM-5.1 trace** (1214 req vs 2.11 M). All B3/B2 percentiles are computed on the w600 sample; only the B1' KV-footprint stats are computed on the full trace.

## Reproduction commands

```bash
# B3 5-policy sweep
bash scripts/b3_sweep.sh                                   # lmetric, load_only, sticky (hot-cache)
bash scripts/b3_isolated_policy.sh unified <trace> <dir>   # isolated cold-start
bash scripts/b3_isolated_policy.sh lmetric <capped> <dir>  # capped variant

bash scripts/b3_analyze.sh outputs/b3_sweep_<TS>
python3 scripts/render_b3_report.py --sweep-dir outputs/b3_sweep_<TS>

# B2 interference microbench
# (launch 2 vLLM on ports 8100/8101 with --enable-prompt-tokens-details first)
python3 scripts/b2_interference.py \
    --decode-endpoint http://127.0.0.1:8100 \
    --prefill-endpoint http://127.0.0.1:8101 \
    --model <model> \
    --out-dir outputs/b2_microbench/sweep
python3 analysis/characterization/b2_sweep_analysis.py --sweep-dir outputs/b2_microbench/sweep

# Figures
python3 analysis/characterization/render_window1_figures.py \
    --results-dir analysis/characterization/window_1_results \
    --out-dir analysis/characterization/window_1_results/figures
```
@@ -0,0 +1,18 @@
{
  "trace": "/home/admin/cpfs/wjh/agentic-kv/traces/w600_r0.0015_st30.jsonl",
  "n_requests": 1214,
  "n_sessions": 274,
  "block_size": 512,
  "shared_prefix_min_sessions": 8,
  "total_input_tokens": 53335690,
  "apc_upper_any_session": 0.8030439654947747,
  "apc_upper_intra_session": 0.7956783534627564,
  "apc_upper_shared_prefix_only": 0.0010271546126055554,
  "cached_tokens_any_session": 42830904,
  "cached_tokens_intra_session": 42438054,
  "cached_tokens_shared_prefix_only": 54784,
  "n_requests_any_hit": 961,
  "n_requests_intra_hit": 914,
  "n_requests_shared_hit": 107,
  "n_shared_pos0_blocks": 1
}
194
analysis/characterization/window_1_results/b2_sweep_summary.json
Normal file
@@ -0,0 +1,194 @@
{
  "rows": [
    {
      "decode_endpoint": "http://127.0.0.1:8100",
      "interference_index": 0.9868436853823819,
      "n_decode_clean": 207,
      "n_decode_overlap": 33,
      "n_decode_total": 240,
      "n_prefill_injections": 4,
      "prefill_endpoint": "http://127.0.0.1:8101",
      "prefill_size": 16384,
      "tpot_p50_clean_s": 0.0061757058808297825,
      "tpot_p50_overlap_s": 0.006127697048765241,
      "tpot_p90_clean_s": 0.006862485770023231,
      "tpot_p90_overlap_s": 0.006772200748173878,
      "tpot_p99_clean_s": 0.007128368820806946,
      "tpot_p99_overlap_s": 0.0070623818792478,
      "ttft_p90_clean_s": 0.043039703369140626,
      "ttft_p90_overlap_s": 0.04307723045349121,
      "variant": "different"
    },
    {
      "decode_endpoint": "http://127.0.0.1:8100",
      "interference_index": 1.0176125863449343,
      "n_decode_clean": 228,
      "n_decode_overlap": 12,
      "n_decode_total": 240,
      "n_prefill_injections": 4,
      "prefill_endpoint": "http://127.0.0.1:8101",
      "prefill_size": 2048,
      "tpot_p50_clean_s": 0.0062349300191860005,
      "tpot_p50_overlap_s": 0.006218204594621754,
      "tpot_p90_clean_s": 0.006892242576136734,
      "tpot_p90_overlap_s": 0.007013632793619174,
      "tpot_p99_clean_s": 0.007111345902837888,
      "tpot_p99_overlap_s": 0.007131954732567373,
      "ttft_p90_clean_s": 0.04290406703948975,
      "ttft_p90_overlap_s": 0.040976309776306154,
      "variant": "different"
    },
    {
      "decode_endpoint": "http://127.0.0.1:8100",
      "interference_index": 0.9221676118155049,
      "n_decode_clean": 176,
      "n_decode_overlap": 64,
      "n_decode_total": 240,
      "n_prefill_injections": 4,
      "prefill_endpoint": "http://127.0.0.1:8101",
      "prefill_size": 32768,
      "tpot_p50_clean_s": 0.00620933012528853,
      "tpot_p50_overlap_s": 0.005991364970351711,
      "tpot_p90_clean_s": 0.0069098352181791054,
      "tpot_p90_overlap_s": 0.006372026241186894,
      "tpot_p99_clean_s": 0.007242970394365715,
      "tpot_p99_overlap_s": 0.006935877366499467,
      "ttft_p90_clean_s": 0.04308474063873291,
      "ttft_p90_overlap_s": 0.04266033172607422,
      "variant": "different"
    },
    {
      "decode_endpoint": "http://127.0.0.1:8100",
      "interference_index": 1.0162810692345416,
      "n_decode_clean": 114,
      "n_decode_overlap": 126,
      "n_decode_total": 240,
      "n_prefill_injections": 4,
      "prefill_endpoint": "http://127.0.0.1:8101",
      "prefill_size": 65536,
      "tpot_p50_clean_s": 0.006080349286397299,
      "tpot_p50_overlap_s": 0.006312949488861392,
      "tpot_p90_clean_s": 0.0068880830148253785,
      "tpot_p90_overlap_s": 0.007000228371283021,
      "tpot_p99_clean_s": 0.007222196574162956,
      "tpot_p99_overlap_s": 0.00723441562267265,
      "ttft_p90_clean_s": 0.04367616176605225,
      "ttft_p90_overlap_s": 0.04332089424133301,
      "variant": "different"
    },
    {
      "decode_endpoint": "http://127.0.0.1:8100",
      "interference_index": 0.92169565663476,
      "n_decode_clean": 220,
      "n_decode_overlap": 20,
      "n_decode_total": 240,
      "n_prefill_injections": 4,
      "prefill_endpoint": "http://127.0.0.1:8101",
      "prefill_size": 8192,
      "tpot_p50_clean_s": 0.006260122915711066,
      "tpot_p50_overlap_s": 0.006120474651606396,
      "tpot_p90_clean_s": 0.006968991684191154,
      "tpot_p90_overlap_s": 0.006423289366442748,
      "tpot_p99_clean_s": 0.007601349209294174,
      "tpot_p99_overlap_s": 0.006715166592838788,
      "ttft_p90_clean_s": 0.04314079284667969,
      "ttft_p90_overlap_s": 0.042817187309265134,
      "variant": "different"
    },
    {
      "decode_endpoint": "http://127.0.0.1:8100",
      "interference_index": 3.3716068170318985,
      "n_decode_clean": 203,
      "n_decode_overlap": 37,
      "n_decode_total": 240,
      "n_prefill_injections": 4,
      "prefill_endpoint": "http://127.0.0.1:8100",
      "prefill_size": 16384,
      "tpot_p50_clean_s": 0.006435276281954062,
      "tpot_p50_overlap_s": 0.009116151116111061,
      "tpot_p90_clean_s": 0.0071605749804564195,
      "tpot_p90_overlap_s": 0.024142643417974917,
      "tpot_p99_clean_s": 0.008356584539317119,
      "tpot_p99_overlap_s": 0.024809808827409838,
      "ttft_p90_clean_s": 0.04402604103088379,
      "ttft_p90_overlap_s": 1.3574100017547606,
      "variant": "same"
    },
    {
      "decode_endpoint": "http://127.0.0.1:8100",
      "interference_index": 1.1589170446597312,
      "n_decode_clean": 228,
      "n_decode_overlap": 12,
      "n_decode_total": 240,
      "n_prefill_injections": 4,
      "prefill_endpoint": "http://127.0.0.1:8100",
      "prefill_size": 2048,
      "tpot_p50_clean_s": 0.006142637946388938,
      "tpot_p50_overlap_s": 0.007610858088791972,
      "tpot_p90_clean_s": 0.006933137142296993,
      "tpot_p90_overlap_s": 0.008034930807171445,
      "tpot_p99_clean_s": 0.007201877651792584,
      "tpot_p99_overlap_s": 0.0084272463153107,
      "ttft_p90_clean_s": 0.043091440200805665,
      "ttft_p90_overlap_s": 0.09247522354125978,
      "variant": "same"
    },
    {
      "decode_endpoint": "http://127.0.0.1:8100",
      "interference_index": 7.891276559921504,
      "n_decode_clean": 173,
      "n_decode_overlap": 67,
      "n_decode_total": 240,
      "n_prefill_injections": 4,
      "prefill_endpoint": "http://127.0.0.1:8100",
      "prefill_size": 32768,
      "tpot_p50_clean_s": 0.006226602226796776,
      "tpot_p50_overlap_s": 0.012180752224392362,
      "tpot_p90_clean_s": 0.00694006813897027,
      "tpot_p90_overlap_s": 0.054765997029314145,
      "tpot_p99_clean_s": 0.010443444107518053,
      "tpot_p99_overlap_s": 0.058983875428787386,
      "ttft_p90_clean_s": 0.04411859512329101,
      "ttft_p90_overlap_s": 4.174754428863525,
      "variant": "same"
    },
    {
      "decode_endpoint": "http://127.0.0.1:8100",
      "interference_index": 2.259323176730457,
      "n_decode_clean": 110,
      "n_decode_overlap": 130,
      "n_decode_total": 240,
      "n_prefill_injections": 4,
      "prefill_endpoint": "http://127.0.0.1:8100",
      "prefill_size": 65536,
      "tpot_p50_clean_s": 0.0064652375500611585,
      "tpot_p50_overlap_s": 0.020095128001588764,
      "tpot_p90_clean_s": 0.009607415488272014,
      "tpot_p90_overlap_s": 0.021706256481132124,
      "tpot_p99_clean_s": 0.016912007837584522,
      "tpot_p99_overlap_s": 0.16948255478733715,
      "ttft_p90_clean_s": 0.06447408199310305,
      "ttft_p90_overlap_s": 14.060086917877197,
      "variant": "same"
    },
    {
      "decode_endpoint": "http://127.0.0.1:8100",
      "interference_index": 1.8961314610807898,
      "n_decode_clean": 221,
      "n_decode_overlap": 19,
      "n_decode_total": 240,
      "n_prefill_injections": 4,
      "prefill_endpoint": "http://127.0.0.1:8100",
      "prefill_size": 8192,
      "tpot_p50_clean_s": 0.00617263052198622,
      "tpot_p50_overlap_s": 0.008303543533941712,
      "tpot_p90_clean_s": 0.007060385713673601,
      "tpot_p90_overlap_s": 0.013387419479061859,
      "tpot_p99_clean_s": 0.0076809098022152696,
      "tpot_p99_overlap_s": 0.013849472662415166,
      "ttft_p90_clean_s": 0.04307150840759277,
      "ttft_p90_overlap_s": 0.52073073387146,
      "variant": "same"
    }
  ]
}
@@ -0,0 +1,133 @@
{
  "rows": [
    {
      "policy": "capped",
      "n_ok": 770,
      "n_total": 770,
      "ttft_p50_s": 1.195636051998008,
      "ttft_p90_s": 12.762421467981767,
      "ttft_p99_s": 46.05476947501302,
      "tpot_p50_s": 0.007229394937166944,
      "tpot_p90_s": 0.015995440982929352,
      "tpot_p99_s": 0.10145225453431651,
      "e2e_p50_s": 2.5921602529706433,
      "e2e_p90_s": 21.238469071977306,
      "e2e_p99_s": 73.38509433099534,
      "apc_ratio": 0.3158312503528108,
      "interference_index": 6.331064378362814,
      "hotspot_index_ttft_p90": 1.9366915542605314,
      "reuse_intra_frac": 0.9192657105586233,
      "reuse_cross_frac": 0.0602232594931501,
      "n_slow": 185,
      "failure_counts": {
        "cache_miss_large_append": 60,
        "hot_worker_queue": 66,
        "same_worker_prefill_overlap": 45,
        "unknown": 14
      }
    },
    {
      "policy": "lmetric",
      "n_ok": 1214,
      "n_total": 1214,
      "ttft_p50_s": 0.9369571270071901,
      "ttft_p90_s": 15.592678204004187,
      "ttft_p99_s": 52.95170431700535,
      "tpot_p50_s": 0.008851506907892485,
      "tpot_p90_s": 0.02120516549011311,
      "tpot_p99_s": 0.17592118933357093,
      "e2e_p50_s": 2.7527842019917443,
      "e2e_p90_s": 24.75416105298791,
      "e2e_p99_s": 79.61890332301846,
      "apc_ratio": 0.5694312382571595,
      "interference_index": 6.530231061794441,
      "hotspot_index_ttft_p90": 2.237981740718548,
      "reuse_intra_frac": 0.9321238805590836,
      "reuse_cross_frac": 0.05679481258506571,
      "n_slow": 295,
      "failure_counts": {
        "cache_miss_large_append": 94,
        "hot_worker_queue": 68,
        "same_worker_prefill_overlap": 69,
        "unknown": 64
      }
    },
    {
      "policy": "load_only",
      "n_ok": 1214,
      "n_total": 1214,
      "ttft_p50_s": 1.2542553890380077,
      "ttft_p90_s": 20.14692750602262,
      "ttft_p99_s": 52.64810254302574,
      "tpot_p50_s": 0.00923045912795929,
      "tpot_p90_s": 0.02672785480314115,
      "tpot_p99_s": 0.3207044094773148,
      "e2e_p50_s": 3.584156609023921,
      "e2e_p90_s": 33.42658680601744,
      "e2e_p99_s": 93.91839688795153,
      "apc_ratio": 0.5412093853102866,
      "interference_index": 9.16424627504275,
      "hotspot_index_ttft_p90": 1.1400531308102801,
      "reuse_intra_frac": 0.9353191550754928,
      "reuse_cross_frac": 0.053372184678592026,
      "n_slow": 379,
      "failure_counts": {
        "cache_miss_large_append": 151,
        "hot_worker_queue": 33,
        "same_worker_prefill_overlap": 108,
        "unknown": 87
      }
    },
    {
      "policy": "sticky",
      "n_ok": 1214,
      "n_total": 1214,
      "ttft_p50_s": 0.540947844972834,
      "ttft_p90_s": 18.016640832996927,
      "ttft_p99_s": 71.37327494798228,
      "tpot_p50_s": 0.00894752275507555,
      "tpot_p90_s": 0.0360956137329512,
      "tpot_p99_s": 0.34523129428917954,
      "e2e_p50_s": 2.0788628259906545,
      "e2e_p90_s": 34.605129147996195,
      "e2e_p99_s": 133.5824547969969,
      "apc_ratio": 0.7720092868396378,
      "interference_index": 13.651718321568111,
      "hotspot_index_ttft_p90": 2.3493858974059214,
      "reuse_intra_frac": 0.9327723488279339,
      "reuse_cross_frac": 0.05495149683864246,
      "n_slow": 234,
      "failure_counts": {
        "cache_miss_large_append": 20,
        "hot_worker_queue": 51,
        "same_worker_prefill_overlap": 134,
        "unknown": 29
      }
    },
    {
      "policy": "unified",
      "n_ok": 1213,
      "n_total": 1214,
      "ttft_p50_s": 0.4997710260213353,
      "ttft_p90_s": 7.239999514014926,
      "ttft_p99_s": 42.022206099005416,
      "tpot_p50_s": 0.008079791456705824,
      "tpot_p90_s": 0.017107906969874808,
      "tpot_p99_s": 0.11808861252148231,
      "e2e_p50_s": 1.7495028690318577,
      "e2e_p90_s": 17.893827292020433,
      "e2e_p99_s": 68.18008507299237,
      "apc_ratio": 0.794261466256467,
      "interference_index": null,
      "hotspot_index_ttft_p90": 3.3497107140827365,
      "reuse_intra_frac": 0.9311187350942534,
      "reuse_cross_frac": 0.056702150437367635,
      "n_slow": 189,
      "failure_counts": {
        "cache_miss_large_append": 18,
        "hot_worker_queue": 116,
        "unknown": 55
      }
    }
  ]
}
114
analysis/characterization/window_1_results/b3_report.md
Normal file
@@ -0,0 +1,114 @@
# B3 Routing Sweep Report

Sweep dir: `b3_sweep_20260525_095043`
Trace: w600_r0.0015_st30.jsonl (~1.2k reqs, 8 × TP1)
Policies present: lmetric, load_only, sticky, unified, capped
Policies pending: none

## Headline latencies + APC

| policy | ok/total | TTFT p50/p90/p99 (s) | TPOT p50/p90/p99 (ms) | E2E p50/p90/p99 (s) | APC |
|---|---:|---|---|---|---:|
| **lmetric** | 1214/1214 | 0.94/15.59/52.95 | 8.9/21.2/175.9 | 2.75/24.75/79.62 | 56.9% |
| **load_only** | 1214/1214 | 1.25/20.15/52.65 | 9.2/26.7/320.7 | 3.58/33.43/93.92 | 54.1% |
| **sticky** | 1214/1214 | 0.54/18.02/71.37 | 8.9/36.1/345.2 | 2.08/34.61/133.58 | 77.2% |
| **unified** | 1213/1214 | 0.50/7.24/42.02 | 8.1/17.1/118.1 | 1.75/17.89/68.18 | 79.4% |
| **capped** | 770/770 | 1.20/12.76/46.05 | 7.2/16.0/101.5 | 2.59/21.24/73.39 | 31.6% |

## Mechanism indices

| policy | interference_index | hotspot_index (TTFT p90) | intra-session reuse | cross-session reuse | n_slow |
|---|---:|---:|---:|---:|---:|
| **lmetric** | 6.53 | 2.24 | 93.2% | 5.7% | 295 |
| **load_only** | 9.16 | 1.14 | 93.5% | 5.3% | 379 |
| **sticky** | 13.65 | 2.35 | 93.3% | 5.5% | 234 |
| **unified** | n/a | 3.35 | 93.1% | 5.7% | 189 |
| **capped** | 6.33 | 1.94 | 91.9% | 6.0% | 185 |

- **interference_index** = TPOT_p90(decode overlapping same-worker prefill) / TPOT_p90(clean)
- **hotspot_index** = max(worker TTFT_p90) / median(worker TTFT_p90)

## Slow-request cause breakdown

| policy | n_slow | same-worker overlap | hot worker queue | cache miss large append | high KV | unknown |
|---|---:|---:|---:|---:|---:|---:|
| **lmetric** | 295 | 69 | 68 | 94 | 0 | 64 |
| **load_only** | 379 | 108 | 33 | 151 | 0 | 87 |
| **sticky** | 234 | 134 | 51 | 20 | 0 | 29 |
| **unified** | 189 | 0 | 116 | 18 | 0 | 55 |
| **capped** | 185 | 45 | 66 | 60 | 0 | 14 |

## Policy notes

- **lmetric**: cache-aware P_tokens × BS (main baseline)
- **load_only**: control, min(num_requests), no cache, no affinity
- **sticky**: control, hard session affinity (never break)
- **unified**: hybrid affinity + LMetric fallback
- **capped**: lmetric on per-session turn-capped trace

## Per-policy per-worker TTFT p90 (s)

### lmetric

| worker | TTFT p90 (s) |
|---|---:|
| http://127.0.0.1:8000 | 28.18 |
| http://127.0.0.1:8001 | 13.15 |
| http://127.0.0.1:8002 | 13.82 |
| http://127.0.0.1:8003 | 14.00 |
| http://127.0.0.1:8004 | 31.34 |
| http://127.0.0.1:8005 | 7.87 |
| http://127.0.0.1:8006 | 14.15 |
| http://127.0.0.1:8007 | 11.78 |

### load_only

| worker | TTFT p90 (s) |
|---|---:|
| http://127.0.0.1:8000 | 22.06 |
| http://127.0.0.1:8001 | 16.43 |
| http://127.0.0.1:8002 | 16.81 |
| http://127.0.0.1:8003 | 23.58 |
| http://127.0.0.1:8004 | 25.14 |
| http://127.0.0.1:8005 | 16.08 |
| http://127.0.0.1:8006 | 23.96 |
| http://127.0.0.1:8007 | 13.95 |

### sticky

| worker | TTFT p90 (s) |
|---|---:|
| http://127.0.0.1:8000 | 12.28 |
| http://127.0.0.1:8001 | 23.57 |
| http://127.0.0.1:8002 | 5.20 |
| http://127.0.0.1:8003 | 55.38 |
| http://127.0.0.1:8004 | 17.03 |
| http://127.0.0.1:8005 | 25.49 |
| http://127.0.0.1:8006 | 36.31 |
| http://127.0.0.1:8007 | 2.50 |

### unified

| worker | TTFT p90 (s) |
|---|---:|
| http://127.0.0.1:8000 | 11.26 |
| http://127.0.0.1:8001 | 3.61 |
| http://127.0.0.1:8002 | 16.18 |
| http://127.0.0.1:8003 | 9.31 |
| http://127.0.0.1:8004 | 37.73 |
| http://127.0.0.1:8005 | 18.33 |
| http://127.0.0.1:8006 | 3.63 |
| http://127.0.0.1:8007 | 7.77 |

### capped

| worker | TTFT p90 (s) |
|---|---:|
| http://127.0.0.1:8000 | 19.77 |
| http://127.0.0.1:8001 | 15.79 |
| http://127.0.0.1:8002 | 20.40 |
| http://127.0.0.1:8003 | 10.54 |
| http://127.0.0.1:8004 | 9.52 |
| http://127.0.0.1:8005 | 9.46 |
| http://127.0.0.1:8006 | 7.38 |
| http://127.0.0.1:8007 | 9.66 |
(9 binary PNG figures added; file sizes 84, 79, 39, 38, 49, 58, 51, 36, and 31 KiB.)
@@ -0,0 +1,26 @@
{
  "formula": "kv_bytes_per_request = input_tokens * kv_bytes_per_token",
  "kv_bytes_per_request": {
    "count": 2114220,
    "max": 19893878784.0,
    "mean": 3306689367.3278427,
    "min": 0.0,
    "p50": 1969029120.0,
    "p90": 8636507750.40001,
    "p95": 10296164352.0,
    "p99": 12339806208.0
  },
  "kv_bytes_per_token": 98304.0,
  "kv_mib_per_request": {
    "count": 2114220,
    "max": 18972.28125,
    "mean": 3153.5047219541957,
    "min": 0.0,
    "p50": 1877.8125,
    "p90": 8236.415625000009,
    "p95": 9819.1875,
    "p99": 11768.15625
  },
  "status": "available",
  "total_kv_gib": 6510940.188720703
}
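
The `formula` field in this KV footprint summary can be checked by hand against the reported percentiles. The sketch below is a minimal illustration, not the analyzer's code: the function names are hypothetical, the per-token size is the `kv_bytes_per_token` value recorded in the summary, and the p50 input length (20030 tokens) is taken from the `input_stats` block of `summary.json`.

```python
KV_BYTES_PER_TOKEN = 98304  # kv_bytes_per_token as recorded in the summary
BYTES_PER_MIB = 1024 ** 2


def kv_bytes_per_request(input_tokens: int,
                         kv_bytes_per_token: int = KV_BYTES_PER_TOKEN) -> int:
    """Apply the stated formula: input_tokens * kv_bytes_per_token."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    return input_tokens * kv_bytes_per_token


def kv_mib_per_request(input_tokens: int,
                       kv_bytes_per_token: int = KV_BYTES_PER_TOKEN) -> float:
    """Same footprint expressed in MiB (2**20 bytes)."""
    return kv_bytes_per_request(input_tokens, kv_bytes_per_token) / BYTES_PER_MIB


p50_input_tokens = 20030  # p50 of input_stats in summary.json
p50_bytes = kv_bytes_per_request(p50_input_tokens)
p50_mib = kv_mib_per_request(p50_input_tokens)
print(p50_bytes, p50_mib)
```

Because the footprint is linear in input length, each percentile of `kv_bytes_per_request` is simply the matching input-length percentile scaled by the per-token size.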
@@ -0,0 +1,24 @@
{
  "hotspot_index_ttft_p90": 2.237981740718548,
  "per_worker_latency_p90_s": {
    "http://127.0.0.1:8000": 34.71445541951107,
    "http://127.0.0.1:8001": 21.922988962882666,
    "http://127.0.0.1:8002": 23.936190764518685,
    "http://127.0.0.1:8003": 26.22220957049285,
    "http://127.0.0.1:8004": 40.318757307820505,
    "http://127.0.0.1:8005": 12.26559703698149,
    "http://127.0.0.1:8006": 27.904838753980588,
    "http://127.0.0.1:8007": 18.430557113309625
  },
  "per_worker_ttft_p90_s": {
    "http://127.0.0.1:8000": 28.18261351052206,
    "http://127.0.0.1:8001": 13.147308969072796,
    "http://127.0.0.1:8002": 13.818959677941162,
    "http://127.0.0.1:8003": 14.003642184572524,
    "http://127.0.0.1:8004": 31.339895512629305,
    "http://127.0.0.1:8005": 7.870992770011071,
    "http://127.0.0.1:8006": 14.149156623415186,
    "http://127.0.0.1:8007": 11.777357225219024
  },
  "status": "supported"
}
@@ -0,0 +1,15 @@
{
  "cross_session_tokens": 1723017,
  "fractions": {
    "cross": 0.05679481258506571,
    "intra": 0.9321238805590836,
    "shared": 0.011081306855850749,
    "unclassified": 0.0
  },
  "intra_session_tokens": 28278380,
  "shared_prefix_min_sessions": 8,
  "shared_prefix_tokens": 336180,
  "status": "supported",
  "total_cached_tokens": 30371008,
  "unclassified_tokens": 0
}
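
The `fractions` in this reuse decomposition are numerically consistent with normalizing each category by the sum of the four category counts (30,337,577 tokens), which is slightly smaller than `total_cached_tokens` (30,371,008). The file does not explain that gap, so the sketch below only reproduces the arithmetic; `reuse_fractions` is a hypothetical helper, and the choice of denominator is an inference from the numbers.

```python
from typing import Dict


def reuse_fractions(intra: int, cross: int, shared: int,
                    unclassified: int = 0) -> Dict[str, float]:
    """Share of each reuse category over the sum of all categories.

    Assumption: the denominator is the sum of category counts, which is
    what the published fractions match; it is not total_cached_tokens.
    """
    counts = {
        "intra": intra,
        "cross": cross,
        "shared": shared,
        "unclassified": unclassified,
    }
    classified_total = sum(counts.values())
    if classified_total == 0:
        raise ValueError("no cached tokens to decompose")
    return {name: n / classified_total for name, n in counts.items()}


fractions = reuse_fractions(intra=28278380, cross=1723017, shared=336180)
classified_total = 28278380 + 1723017 + 336180
gap_vs_total_cached = 30371008 - classified_total

for name, share in fractions.items():
    print(f"{name}: {share:.4%}")
print("tokens not covered by any category count:", gap_vs_total_cached)
```

The intra-session share computed this way is the 93.2% figure quoted in the headline.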
@@ -0,0 +1,24 @@
{
  "hotspot_index_ttft_p90": 1.9366915542605314,
  "per_worker_latency_p90_s": {
    "http://127.0.0.1:8000": 23.81083881931848,
    "http://127.0.0.1:8001": 18.139674991380897,
    "http://127.0.0.1:8002": 29.116712999995805,
    "http://127.0.0.1:8003": 19.245074290811324,
    "http://127.0.0.1:8004": 17.230851700413044,
    "http://127.0.0.1:8005": 15.86663371440958,
    "http://127.0.0.1:8006": 16.707309890014592,
    "http://127.0.0.1:8007": 23.93718611740042
  },
  "per_worker_ttft_p90_s": {
    "http://127.0.0.1:8000": 19.772570010094213,
    "http://127.0.0.1:8001": 15.786850639013576,
    "http://127.0.0.1:8002": 20.403525242628533,
    "http://127.0.0.1:8003": 10.535247699997853,
    "http://127.0.0.1:8004": 9.52290979558602,
    "http://127.0.0.1:8005": 9.455131393985376,
    "http://127.0.0.1:8006": 7.379608143202497,
    "http://127.0.0.1:8007": 9.661995008389932
  },
  "status": "supported"
}
@@ -0,0 +1,24 @@
{
  "hotspot_index_ttft_p90": 2.237981740718548,
  "per_worker_latency_p90_s": {
    "http://127.0.0.1:8000": 34.71445541951107,
    "http://127.0.0.1:8001": 21.922988962882666,
    "http://127.0.0.1:8002": 23.936190764518685,
    "http://127.0.0.1:8003": 26.22220957049285,
    "http://127.0.0.1:8004": 40.318757307820505,
    "http://127.0.0.1:8005": 12.26559703698149,
    "http://127.0.0.1:8006": 27.904838753980588,
    "http://127.0.0.1:8007": 18.430557113309625
  },
  "per_worker_ttft_p90_s": {
    "http://127.0.0.1:8000": 28.18261351052206,
    "http://127.0.0.1:8001": 13.147308969072796,
    "http://127.0.0.1:8002": 13.818959677941162,
    "http://127.0.0.1:8003": 14.003642184572524,
    "http://127.0.0.1:8004": 31.339895512629305,
    "http://127.0.0.1:8005": 7.870992770011071,
    "http://127.0.0.1:8006": 14.149156623415186,
    "http://127.0.0.1:8007": 11.777357225219024
  },
  "status": "supported"
}
@@ -0,0 +1,24 @@
{
  "hotspot_index_ttft_p90": 1.1400531308102801,
  "per_worker_latency_p90_s": {
    "http://127.0.0.1:8000": 33.51168999259829,
    "http://127.0.0.1:8001": 29.20308109278556,
    "http://127.0.0.1:8002": 27.126518827211115,
    "http://127.0.0.1:8003": 38.597240307606995,
    "http://127.0.0.1:8004": 36.607777832809376,
    "http://127.0.0.1:8005": 28.097025175404276,
    "http://127.0.0.1:8006": 49.29610514297965,
    "http://127.0.0.1:8007": 20.958507975534303
  },
  "per_worker_ttft_p90_s": {
    "http://127.0.0.1:8000": 22.055091864388675,
    "http://127.0.0.1:8001": 16.425856862741057,
    "http://127.0.0.1:8002": 16.806352904380766,
    "http://127.0.0.1:8003": 23.581166115606912,
    "http://127.0.0.1:8004": 25.14397653030465,
    "http://127.0.0.1:8005": 16.080231266201018,
    "http://127.0.0.1:8006": 23.960470345703648,
    "http://127.0.0.1:8007": 13.95184187250561
  },
  "status": "supported"
}
@@ -0,0 +1,24 @@
{
  "hotspot_index_ttft_p90": 2.3493858974059214,
  "per_worker_latency_p90_s": {
    "http://127.0.0.1:8000": 30.185792533413043,
    "http://127.0.0.1:8001": 47.49661003401852,
    "http://127.0.0.1:8002": 22.069474861002554,
    "http://127.0.0.1:8003": 83.73774532350944,
    "http://127.0.0.1:8004": 22.03310715127737,
    "http://127.0.0.1:8005": 33.024566102202556,
    "http://127.0.0.1:8006": 61.65600914339302,
    "http://127.0.0.1:8007": 6.077459598158019
  },
  "per_worker_ttft_p90_s": {
    "http://127.0.0.1:8000": 12.284569517592924,
    "http://127.0.0.1:8001": 23.570226482005094,
    "http://127.0.0.1:8002": 5.202772857400123,
    "http://127.0.0.1:8003": 55.37555769548635,
    "http://127.0.0.1:8004": 17.031311958114394,
    "http://127.0.0.1:8005": 25.48531596700202,
    "http://127.0.0.1:8006": 36.31029207323453,
    "http://127.0.0.1:8007": 2.4984901855932535
  },
  "status": "supported"
}
@@ -0,0 +1,24 @@
{
  "hotspot_index_ttft_p90": 3.3497107140827365,
  "per_worker_latency_p90_s": {
    "http://127.0.0.1:8000": 41.42001512600109,
    "http://127.0.0.1:8001": 12.4878579101933,
    "http://127.0.0.1:8002": 22.462878945574648,
    "http://127.0.0.1:8003": 15.501050900109117,
    "http://127.0.0.1:8004": 39.956250199786155,
    "http://127.0.0.1:8005": 36.69850301651168,
    "http://127.0.0.1:8006": 10.116177947795954,
    "http://127.0.0.1:8007": 20.35038618039107
  },
  "per_worker_ttft_p90_s": {
    "http://127.0.0.1:8000": 11.264844838529825,
    "http://127.0.0.1:8001": 3.6063860427122614,
    "http://127.0.0.1:8002": 16.175747957825664,
    "http://127.0.0.1:8003": 9.314684258581842,
    "http://127.0.0.1:8004": 37.73397144810297,
    "http://127.0.0.1:8005": 18.328030522551852,
    "http://127.0.0.1:8006": 3.6328767628350773,
    "http://127.0.0.1:8007": 7.772977900883419
  },
  "status": "supported"
}
136
analysis/characterization/window_1_results/summary.json
Normal file
@@ -0,0 +1,136 @@
{
  "analyzed_records": 2114220,
  "batch0": {
    "attempted_requests": 2114220,
    "completed_requests": null,
    "error_requests": null,
    "max_inflight_per_session": null,
    "session_concurrency_status": "unavailable",
    "session_sequential": null
  },
  "batch1": {
    "append_status": "unavailable",
    "input_stats": {
      "count": 2114220,
      "max": 202371.0,
      "mean": 33637.38370084476,
      "min": 0.0,
      "p50": 20030.0,
      "p90": 87855.1000000001,
      "p95": 104738.0,
      "p99": 125527.0
    },
    "kv_footprint_status": "available",
    "output_stats": {
      "count": 2114220,
      "max": 132665.0,
      "mean": 444.97059624826176,
      "min": 0.0,
      "p50": 80.0,
      "p90": 811.0,
      "p95": 2213.0,
      "p99": 6614.810000000056
    },
    "reuse_status": "unavailable"
  },
  "classification": {
    "label": "invalid_for_online_claim",
    "reason": "actual dispatch/finish timestamps are unavailable, so online sequentiality cannot be proven",
    "source": "auto",
    "stress_indicators": []
  },
  "manifest": {
    "canonical_trace_data_sources": {
      "dash0_formatted_trace_dir": "~/ali-trace/trace-glm5.1-formatted/",
      "dash0_raw_trace_dir": "~/ali-trace/trace-glm5.1/",
      "usage_note": "Full trace analysis can be run CPU-only on dash0, or the needed JSONL files can be copied/rsynced from dash0 to this machine before running this analyzer."
    },
    "end_time": "2026-05-25T09:03:36.499002+00:00",
    "figure_status": {
      "reason": "matplotlib unavailable: ModuleNotFoundError(\"No module named 'matplotlib'\")",
      "status": "skipped"
    },
    "git_commit": "",
    "gpu_count": 0,
    "gpu_type": "",
    "host": "ds-6348bee4-1-765874c9c4-7zrvf",
    "input_requirements": {
      "actual_sequentiality_proof": [
        "per-request dispatch timestamp",
        "per-request finish or error/timeout timestamp",
        "request_id join to trace/metrics when timing source is separate"
      ],
      "metrics_jsonl": [
        "request_id",
        "session_id",
        "trace_timestamp_s",
        "input_length",
        "output_length",
        "latency_s",
        "ttft_s",
        "tpot_s",
        "error",
        "optional cached_tokens"
      ],
      "reuse_decomposition": [
        "cached_tokens or cache_hit",
        "hash_ids",
        "session_id"
      ],
      "trace_jsonl": [
        "chat_id",
        "parent_chat_id",
        "timestamp",
        "input_length",
        "output_length",
        "turn",
        "hash_ids",
        "optional session_id"
      ]
    },
    "input_status": {
      "analyzed_records": 2114220,
      "breakdown_records": 0,
      "merge_warnings": [],
      "metrics_records": 0,
      "trace_records": 2114220,
      "trace_warnings": [],
      "unmatched_breakdown": 0,
      "unmatched_metrics": 0
    },
    "launch_command": "analysis/characterization/analyze.py --trace /home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl --kv-bytes-per-token 98304 --task-name full_trace_with_kv --output-root outputs/characterization --overwrite",
    "output_dir": "outputs/characterization/2026-05-25/full_trace_with_kv",
    "policy": "",
    "request_limit": null,
    "session_sampling_method": "",
    "session_sequential": null,
    "start_time": "2026-05-25T08:59:11.618919+00:00",
    "time_scale": null,
    "trace_file_info": {
      "exists": true,
      "mtime_s": 1778772033.2788928,
      "path": "/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl",
      "sha256": "",
      "sha256_status": "skipped_use_--hash-inputs",
      "size_bytes": 1561266372
    },
    "trace_path": "/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl",
    "trace_sha256": ""
  },
  "outputs": [
    "append_delta_stats.json",
    "invalid_runs.md",
    "kv_footprint_summary.json",
    "manifest.json",
    "raw/merged_requests.jsonl",
    "raw/unmatched_breakdown.jsonl",
    "raw/unmatched_metrics.jsonl",
    "reuse_decomposition.json",
    "session_arrival_stats.json",
    "session_concurrency.json",
    "session_skew.json",
    "trace_profile.json",
    "turn_interval_stats.json",
    "workload_summary.json"
  ]
}