Characterization plan: progress snapshot + Claude work plan
- Add Progress Snapshot table to the intern TODO so per-batch status (DONE / partial / blocked-on-instrumentation) is visible at a glance.
- New analysis/claude_characterization_work_plan.md scopes the Phase A instrumentation tasks (A1-A5) plus Window 1 (B1'+B2+B3) and Window 2 (B4+B5) on dash0, with locked decisions for model, topology, trace, SLO style, and GPU phasing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -2,6 +2,35 @@
|
||||
|
||||
Status: execution checklist for interns
|
||||
Date: 2026-05-25
|
||||
Last progress audit: 2026-05-25
|
||||
|
||||
## Progress Snapshot (2026-05-25)
|
||||
|
||||
| Batch | State | Evidence |
|---|---|---|
| B0 Substrate audit | tool DONE, legacy runs partial | `analysis/characterization/analyze.py` implements per-session concurrency/arrival/inter-turn analyzer; legacy `metrics.jsonl` lacks dispatch/finish timestamps so actual sequentiality cannot be proven on old runs (correctly labeled in `current_results/`) |
| B1 Workload characterization | trace-shape DONE, reuse pending | `current_results/full_trace_summary.json` covers 2.11M req / 1.31M sessions from `051315-051317.jsonl`; KV-footprint and reuse decomposition still require `--kv-bytes-per-token` rerun and cached_tokens+hash_ids joined records |
| B2 PD interference | protocol DONE, run pending | `analysis/characterization/protocols.md` Batch 2 section ready; needs fresh GPU run with decode-step + prefill-chunk timestamps |
| B3 Hot-spot imbalance | partial; needs new instrumentation | Legacy `gpu_util.csv` shows imbalance but lacks per-worker queue delay and session→worker map |
| B4 SRR sweep | NOT DONE | No arrival-rate sweep artifacts; depends on session-causal open-loop loadgen |
| B5 Failure attribution | NOT DONE | Depends on B2/B4 outputs |
| B6 Audit package | scaffold DONE | `current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` + 6 figures committed |
|
||||
|
||||
Reusable assets already in repo:
|
||||
|
||||
- `analysis/characterization/analyze.py` — B0+B1 CPU-only analyzer
|
||||
- `analysis/characterization/summarize_runs.py` — existing-run audit producing the B6 scaffold
|
||||
- `analysis/characterization/plot_current_results.py` — figure regeneration script
|
||||
- `analysis/characterization/protocols.md` — B2–B6 protocol with required instrumentation, sweep, pass condition
|
||||
- `analysis/characterization/current_results/` — current audit package (claim matrix + risk register + allowed-runs gate + 6 PNG figures)
|
||||
|
||||
Hard gates still blocking main claims:
|
||||
|
||||
1. Replayer/proxy must emit per-request dispatch + finish/error wall-clock timestamps (blocks B0 actual sequentiality, B4 SRR validity).
|
||||
2. Per-request record must carry `session_id` + `hash_ids` + `cached_tokens` jointly (blocks B1 reuse decomposition).
|
||||
3. Engine/proxy must log decode-step and prefill-chunk timestamps with worker id (blocks B2 interference index).
|
||||
4. Proxy must log route decision, chosen worker, candidate scores, per-worker queue/KV/APC snapshot (blocks B3 hot-spot proof).
|
||||
|
||||
|
||||
## 0. Purpose
|
||||
|
||||
@@ -84,6 +113,8 @@ saved alongside the artifact.
|
||||
|
||||
## 2. Batch 0: Benchmark Substrate Audit
|
||||
|
||||
Status: analyzer DONE (`analyze.py`); legacy-run sequentiality claim BLOCKED by missing dispatch/finish timestamps in `metrics.jsonl`. New replayer must add those fields before any `online_realistic` classification is allowed.
|
||||
|
||||
### Goal
|
||||
|
||||
Prove the load generator and trace replay are valid before trusting any
|
||||
@@ -147,6 +178,8 @@ The `audit.md` must answer:
|
||||
|
||||
## 3. Batch 1: Workload Characterization
|
||||
|
||||
Status: trace-shape items (1, 2, 3, 6, 8) DONE on full 7200 s GLM-5.1 trace; recorded in `current_results/full_trace_summary.json`. Items 4 (KV footprint), 5 (reuse decomposition), 7 (uncached append delta) are PENDING because they need `--kv-bytes-per-token` for the production model and joinable `cached_tokens`+`hash_ids` per request.
|
||||
|
||||
### Goal
|
||||
|
||||
Establish agentic workload facts independent of any proposed system.
|
||||
@@ -225,6 +258,8 @@ links and plotted figures.
|
||||
|
||||
## 4. Batch 2: PD-Colo Prefill-Decode Interference Proof
|
||||
|
||||
Status: protocol DONE (`analysis/characterization/protocols.md` §"Batch 2 Protocol"); execution NOT STARTED — needs new engine instrumentation for decode-step and prefill-chunk timestamps.
|
||||
|
||||
### Goal
|
||||
|
||||
Prove that PD-colocation can suffer from prefill-decode interference under
|
||||
@@ -302,6 +337,8 @@ The `audit.md` must answer:
|
||||
|
||||
## 5. Batch 3: Session Hot-Spot Residual Imbalance Proof
|
||||
|
||||
Status: protocol DONE; partial signal from legacy `gpu_util.csv` (GPU-util imbalance visible) but causal proof NOT STARTED — needs per-worker queue/KV/APC and session→worker map from instrumented proxy.
|
||||
|
||||
### Goal
|
||||
|
||||
Prove that cache-aware/LMetric is a strong baseline but still leaves residual
|
||||
@@ -398,6 +435,8 @@ The `audit.md` must answer:
|
||||
|
||||
## 6. Batch 4: Sustainable Request Rate Sweep
|
||||
|
||||
Status: protocol DONE; execution NOT STARTED — requires open-loop session-causal loadgen and policy-comparable arrival process.
|
||||
|
||||
### Goal
|
||||
|
||||
Connect interference and hot-spot mechanisms to the final metric:
|
||||
@@ -476,6 +515,8 @@ The `audit.md` must answer:
|
||||
|
||||
## 7. Batch 5: Failure Attribution Near SRR Boundary
|
||||
|
||||
Status: protocol DONE; execution NOT STARTED — depends on B2 instrumentation and B4 SRR boundary.
|
||||
|
||||
### Goal
|
||||
|
||||
At and around the PD-colo/LMetric failure point, determine whether SLO
|
||||
@@ -542,6 +583,8 @@ The batch must answer:
|
||||
|
||||
## 8. Batch 6: Audit Package
|
||||
|
||||
Status: scaffold DONE — all five final artifacts exist under `analysis/characterization/current_results/` and are regenerated by `summarize_runs.py` + `plot_current_results.py`. Future B2–B5 outputs must be merged into the same package by re-running `summarize_runs.py` after new runs.
|
||||
|
||||
### Goal
|
||||
|
||||
Make the whole characterization package reviewable by a strict systems
|
||||
|
||||
360
analysis/claude_characterization_work_plan.md
Normal file
@@ -0,0 +1,360 @@
|
||||
# Claude Characterization Work Plan
|
||||
|
||||
Status: planning, awaiting dash0 idle
|
||||
Date: 2026-05-25
|
||||
Owner: Claude (not interns)
|
||||
Source of requirements: `analysis/characterization_todo_for_interns.md`
|
||||
|
||||
## Scope
|
||||
|
||||
This plan covers the four hard gates and the B2–B5 GPU experiments that the
|
||||
intern TODO marks as `NOT DONE` / `protocol DONE`. The B0 analyzer, the
|
||||
B1 trace-shape statistics, and the B6 audit scaffold are already done; this
|
||||
plan does **not** re-do them, only refreshes their inputs.
|
||||
|
||||
The work is split into:
|
||||
|
||||
- **Phase A (CPU-only)** — instrumentation + analyzer extensions. Can run
|
||||
on the local dev box; does **not** need dash0. Must finish before any
|
||||
GPU run.
|
||||
- **Phase B (dash0 GPU)** — controlled microbench + routing sweep + SRR
|
||||
sweep + failure attribution.
|
||||
- **Phase C (CPU-only)** — final audit package refresh.
|
||||
|
||||
## Phase A: Instrumentation + Analyzer (CPU-only, before dash0)
|
||||
|
||||
### A1. Replayer instrumentation — close Gate 1 + Gate 2
|
||||
|
||||
Files: `replayer/metrics.py`, `replayer/replay.py`
|
||||
|
||||
Add these fields to `RequestMetrics`:
|
||||
|
||||
```text
t_dispatch_unix      float      # absolute wall-clock when POST starts
t_first_token_unix   float      # absolute wall-clock at first stream chunk
t_finish_unix        float      # absolute wall-clock at stream done or error
proxy_request_id     str        # value sent in X-Request-Id (matches breakdown)
endpoint_url         str        # which proxy/instance the request hit
trace_hash_ids       list[int]  # carried from trace for reuse joins
```
|
||||
|
||||
Change `_dispatch_request` to:
|
||||
|
||||
- send a deterministic `X-Request-Id: <session_id>:<turn_id>` header (so
|
||||
proxy breakdown can be joined to metrics by exact key);
|
||||
- record `time.time()` (unix) at dispatch, first token, finish; keep
|
||||
`perf_counter` for the latency arithmetic.
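
A minimal sketch of the dual-clock pattern described above, assuming a hypothetical `stream_chunks(headers)` callable that stands in for the real streaming POST. The dataclass below is illustrative only and is not the existing `RequestMetrics`; only the field names come from this plan.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class RequestMetricsSketch:
    """Illustrative stand-in; field names follow the plan, the class is hypothetical."""
    session_id: str
    turn_id: int
    proxy_request_id: str
    endpoint_url: str
    trace_hash_ids: list = field(default_factory=list)
    t_dispatch_unix: float = 0.0
    t_first_token_unix: Optional[float] = None
    t_finish_unix: float = 0.0
    ttft_s: Optional[float] = None
    e2e_s: float = 0.0


def timed_request(session_id: str, turn_id: int, endpoint_url: str,
                  hash_ids: Iterable[int],
                  stream_chunks: Callable[[dict], Iterable]) -> RequestMetricsSketch:
    # Deterministic id so proxy breakdown rows can be joined by exact key.
    request_id = f"{session_id}:{turn_id}"
    headers = {"X-Request-Id": request_id}
    m = RequestMetricsSketch(session_id, turn_id, request_id,
                             endpoint_url, list(hash_ids))

    m.t_dispatch_unix = time.time()   # wall clock, for cross-process joins
    t0 = time.perf_counter()          # monotonic clock, for latency arithmetic
    try:
        for _chunk in stream_chunks(headers):
            if m.t_first_token_unix is None:
                m.t_first_token_unix = time.time()
                m.ttft_s = time.perf_counter() - t0
    finally:
        # Runs on normal completion and on error, matching "stream done or error".
        m.t_finish_unix = time.time()
        m.e2e_s = time.perf_counter() - t0
    return m
```

The wall-clock fields are only used to align rows across replayer, proxy, and engine logs; latencies are taken from `perf_counter` so that clock adjustments do not distort them.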
|
||||
|
||||
Acceptance: a 30-request smoke run produces `metrics.jsonl` where every
|
||||
row has those fields; `breakdown.json` rows from the proxy have the same
|
||||
`request_id` keys.
|
||||
|
||||
Effort: 1 small PR. Pure CPU.
|
||||
|
||||
### A2. Proxy instrumentation — close Gate 1 + Gate 3 + Gate 4
|
||||
|
||||
File: `scripts/cache_aware_proxy.py`
|
||||
|
||||
Changes:
|
||||
|
||||
1. Honor incoming `X-Request-Id`: if header present, use it instead of
|
||||
generating a new uuid. Falls back to uuid otherwise.
|
||||
2. Record on every breakdown row:
|
||||
- `session_id` (already on header, not currently stored)
|
||||
- `input_length`
|
||||
- `estimated_new_tokens` (already produced by router)
|
||||
- `candidate_scores` (list of `{url, p_tokens_score, cache_score, bs,
|
||||
occupancy}`)
|
||||
- `chosen_score`
|
||||
3. At route decision time, snapshot per-worker state:
|
||||
- `pending_prefill_tokens` per worker
|
||||
- `running_decode_requests` per worker
|
||||
- `kv_blocks_used` / `kv_blocks_total` per worker
|
||||
- `apc_hits` / `apc_queries` cumulative per worker
|
||||
Write to a separate `worker_state.jsonl` (one line per route decision)
|
||||
with `(t_decision_unix, request_id, per_worker_state)`.
|
||||
4. New endpoint `GET /worker_state` returns the latest snapshot per worker
|
||||
(for sanity / live debugging).
|
||||
|
||||
Acceptance: smoke run produces `breakdown.json` with new fields and a
|
||||
non-empty `worker_state.jsonl` that joins to breakdown by `request_id`.
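
The acceptance join can be sketched as follows. This is a hedged illustration, not the proxy's code: `append_worker_state` and `join_by_request_id` are hypothetical helper names, and the row layout assumes the `(t_decision_unix, request_id, per_worker_state)` tuple listed above is stored as a JSON object with those three keys.

```python
import json


def append_worker_state(path, t_decision_unix, request_id, per_worker_state):
    """Append one route-decision snapshot as a single JSONL line."""
    row = {
        "t_decision_unix": t_decision_unix,
        "request_id": request_id,
        "per_worker_state": per_worker_state,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")


def join_by_request_id(breakdown_rows, worker_state_path):
    """Attach the worker snapshot to each breakdown row; report rows without a match."""
    state = {}
    with open(worker_state_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                state[row["request_id"]] = row

    joined, unmatched = [], []
    for b in breakdown_rows:
        ws = state.get(b["request_id"])
        if ws is None:
            unmatched.append(b["request_id"])
            continue
        joined.append({
            **b,
            "t_decision_unix": ws["t_decision_unix"],
            "per_worker_state": ws["per_worker_state"],
        })
    return joined, unmatched
```

A non-empty `unmatched` list on the smoke run would indicate that the proxy is not honoring the incoming `X-Request-Id`.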
|
||||
|
||||
Effort: 1 medium PR. Pure CPU + light proxy work.
|
||||
|
||||
### A3. Engine-side step timestamps — close Gate 3 for B2
|
||||
|
||||
vLLM 0.18.1 already exposes:
|
||||
|
||||
- `vllm:request_prefill_time_seconds` (histogram, per-request)
|
||||
- `vllm:request_decode_time_seconds`
|
||||
- `vllm:time_per_output_token_seconds`
|
||||
- step-level scheduler stats via `engine.async_step` logging
|
||||
|
||||
For B2 we need decode-step and prefill-chunk timestamps with worker id.
|
||||
Plan:
|
||||
|
||||
1. Check whether each engine's `/metrics` endpoint can be polled at a
   high rate (e.g. 100 Hz) for per-engine scheduler counters
   (`num_running`, `num_waiting`, `gpu_cache_usage`,
   `prefix_cache_queries`, `prefix_cache_hits`). If so, sample them
   into `engine_state.jsonl` during runs.
|
||||
2. If finer step-level data is needed, patch one vLLM file
|
||||
(`vllm/engine/async_llm_engine.py` step loop or
|
||||
`vllm/v1/core/sched/scheduler.py`) to emit a JSONL line per
|
||||
scheduler step with `(t_unix, worker_id, num_prefill_tokens_scheduled,
|
||||
num_decode_steps, running_request_ids)`. Patch goes under `patches/`
|
||||
so it can be applied/reverted cleanly.
|
||||
3. Worker id mapping: when running TP1xDP8 or similar, each engine
|
||||
listens on a distinct port; `worker_id == endpoint_url`.
|
||||
|
||||
Acceptance: a single 10-minute run produces `engine_state.jsonl` from
|
||||
which a decode step at time T on worker W can be classified as
|
||||
"overlapping a same-worker prefill chunk" or not.
|
||||
|
||||
Effort: 1 medium investigation (decide poll vs patch) + 1 medium PR.
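
One possible reading of the overlap classification, sketched under assumptions: rows follow the per-step tuple proposed for the scheduler patch, a step counts as a decode step when `num_decode_steps > 0`, and "overlap" means a same-worker step with `num_prefill_tokens_scheduled > 0` falls within `window_s` seconds. The `window_s` parameter and the function name are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def label_decode_overlap(rows, window_s=0.0):
    """Return decode-step rows with an added boolean 'overlap' field.

    rows: dicts with t_unix, worker_id, num_prefill_tokens_scheduled,
    num_decode_steps (other keys are passed through untouched).
    """
    # Collect prefill step times per worker, sorted for binary search.
    prefill_times = defaultdict(list)
    for r in rows:
        if r["num_prefill_tokens_scheduled"] > 0:
            prefill_times[r["worker_id"]].append(r["t_unix"])
    for ts in prefill_times.values():
        ts.sort()

    labeled = []
    for r in rows:
        if r["num_decode_steps"] <= 0:
            continue  # not a decode step
        ts = prefill_times.get(r["worker_id"], [])
        # First prefill time >= t - window; overlap if it is also <= t + window.
        i = bisect_left(ts, r["t_unix"] - window_s)
        overlap = i < len(ts) and ts[i] <= r["t_unix"] + window_s
        labeled.append({**r, "overlap": overlap})
    return labeled
```

With `window_s=0.0` only steps that schedule prefill tokens and decode work in the same scheduler step are labeled as overlapping; a wider window is one way to cover a coarser, polled `engine_state.jsonl`.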
|
||||
|
||||
### A4. Open-loop session-causal loadgen for B4
|
||||
|
||||
File: `replayer/replay.py` (new mode) or new `replayer/srr_loadgen.py`
|
||||
|
||||
Current replayer dispatches by trace timestamps. SRR sweep needs:
|
||||
|
||||
- pool of session templates (each = ordered list of turns from the
|
||||
trace);
|
||||
- Poisson arrivals of new sessions at rate `lambda`;
|
||||
- within a session: strict sequentiality (turn N+1 waits for turn N
|
||||
finish);
|
||||
- per-run warmup window (e.g. 60s) + steady-state window (e.g. 300s);
|
||||
- attempted / completed / error counters per window.
|
||||
|
||||
Add a new mode `--mode srr --arrival-rate <lambda>
|
||||
--warmup-s 60 --steady-s 300 --session-pool-size N`. The trace
|
||||
file becomes the pool; sessions are drawn with replacement.
|
||||
|
||||
Acceptance: at `lambda = 0.5 sess/s`, the run shows exponential inter-
|
||||
arrival times and per-session sequentiality in `metrics.jsonl`. A
|
||||
`window_summary.json` lists warmup vs steady-state attempted/completed.
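
A minimal sketch of the arrival process, assuming session templates are plain lists of turns and that `send_turn` is a hypothetical blocking call that returns when the turn finishes. Exponential inter-arrival gaps at rate `lambda` give Poisson session arrivals; templates are drawn with replacement.

```python
import random


def poisson_session_arrivals(arrival_rate, warmup_s, steady_s, pool, seed=0):
    """Schedule new-session arrivals over warmup + steady-state windows."""
    rng = random.Random(seed)
    horizon = warmup_s + steady_s
    t = 0.0
    schedule = []
    while True:
        t += rng.expovariate(arrival_rate)  # exponential gap with mean 1/rate
        if t >= horizon:
            break
        schedule.append({
            "t_arrival_s": t,
            "window": "warmup" if t < warmup_s else "steady",
            "turns": rng.choice(pool),      # drawn with replacement
        })
    return schedule


def run_session(turns, send_turn):
    """Strict sequentiality: turn N+1 is sent only after turn N has finished."""
    results = []
    for turn in turns:
        results.append(send_turn(turn))
    return results
```

Because arrivals are scheduled independently of completions, the generator stays open-loop across sessions while each session remains causal internally.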
|
||||
|
||||
Effort: 1 medium PR.
|
||||
|
||||
### A5. Analyzer extensions
|
||||
|
||||
File: `analysis/characterization/analyze.py` (extend, do not rewrite)
|
||||
|
||||
Add:
|
||||
|
||||
1. **Joined-record builder.** Given `--metrics metrics.jsonl
|
||||
--breakdown breakdown.json --worker-state worker_state.jsonl
|
||||
--engine-state engine_state.jsonl`, produce
|
||||
`joined.jsonl` keyed on `request_id` with all fields merged.
|
||||
2. **Reuse decomposition (real).** Using joined records that carry
|
||||
`session_id` + `hash_ids` + `cached_tokens`, compute
|
||||
`intra_session` / `cross_session` / `shared_prefix` /
|
||||
`unclassified` cached-token mass. Replaces the current
|
||||
`status: unavailable` placeholder when fields are present.
|
||||
3. **Interference index.** Per decode step, label "overlap same-
|
||||
worker prefill" using `engine_state.jsonl`. Compute
|
||||
`TPOT_p90(overlap) / TPOT_p90(no_overlap)`.
|
||||
4. **Hotspot index.** Per worker queue delay p90, output
|
||||
`max_worker_q_p90 / median_worker_q_p90`.
|
||||
5. **Failure label.** For each slow / SLO-violating request, assign
|
||||
one of: `same_worker_prefill_overlap`, `hot_worker_queue`,
|
||||
`high_kv_occupancy`, `cache_miss_large_append`, `transfer_wait`,
|
||||
`p_queue_wait`, `d_admission_wait`, `unknown`.
|
||||
6. **Window summary.** For SRR runs, compute attempted/completed/
|
||||
error/goodput plus latency percentiles on the steady-state
|
||||
window only.
|
||||
|
||||
Acceptance: re-run analyzer on smoke output and confirm `reuse_decomposition`
|
||||
no longer says `unavailable`; `interference_index.json` produced when
|
||||
engine state present; `failure_breakdown.json` populated when
|
||||
labels assigned.
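
The two ratio metrics in items 3 and 4 can be sketched as below. The nearest-rank percentile is an assumption (the plan does not fix a percentile definition), and the function names are illustrative.

```python
from statistics import median


def p90(values):
    """Nearest-rank 90th percentile: smallest value with at least 90% of samples <= it."""
    xs = sorted(values)
    if not xs:
        raise ValueError("p90 of empty sample")
    rank = (9 * len(xs) + 9) // 10  # integer ceil(0.9 * n)
    return xs[rank - 1]


def interference_index(tpot_overlap, tpot_no_overlap):
    """TPOT_p90(overlap) / TPOT_p90(no_overlap)."""
    return p90(tpot_overlap) / p90(tpot_no_overlap)


def hotspot_index(queue_delay_by_worker):
    """max_worker_q_p90 / median_worker_q_p90 over per-worker queue-delay samples."""
    per_worker = [p90(v) for v in queue_delay_by_worker.values()]
    return max(per_worker) / median(per_worker)
```

Both indices equal 1.0 when there is no effect, so values well above 1.0 are the signal of interest; sample counts per bucket should be reported alongside them since p90 on a handful of steps is noisy.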
|
||||
|
||||
Effort: 1 large PR. CPU-only.
|
||||
|
||||
## Phase B: GPU experiments (needs dash0)
|
||||
|
||||
### B1' Workload characterization closure
|
||||
|
||||
Inputs: instrumented replayer + small smoke trace (≤500 req).
|
||||
|
||||
Steps:
|
||||
|
||||
1. Pick `kv_bytes_per_token` for the production model. For
|
||||
Qwen3-Coder TP1 the value depends on layer/head config; compute
|
||||
from `vllm.config` once at run start and record in manifest.
|
||||
2. Re-run analyzer on full GLM-5.1 trace with `--kv-bytes-per-token`.
|
||||
Output: KV footprint p50/p90/p99 in `kv_footprint_summary.json`.
|
||||
3. Run a 1k-request session-causal smoke replay with instrumented
|
||||
proxy. Use the joined records to populate real reuse decomposition
|
||||
for the small sample. (Full-trace replay is too expensive; sample
|
||||
is acceptable for the decomposition claim.)
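
For step 1, a hedged sketch of the usual per-token KV size for a standard multi-head or grouped-query attention stack, where each layer caches one key and one value vector per KV head. The configuration numbers in the usage lines are placeholders and are not the production model's config; the real values should be read from the model config at manifest time.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: 2 (K and V) x layers x KV heads x head dim x dtype size.

    Assumes plain MHA/GQA caching; compressed-KV schemes need a different formula.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_footprint_bytes(context_tokens, bytes_per_token):
    """KV footprint of one request holding `context_tokens` tokens in cache."""
    return context_tokens * bytes_per_token


# Placeholder config for illustration only (not taken from any real model).
example_bpt = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
example_footprint = kv_footprint_bytes(10_000, example_bpt)
```

Recording the four inputs in the manifest next to the derived value keeps the `--kv-bytes-per-token` argument auditable.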
|
||||
|
||||
Wall-clock: ~30 min GPU. Produces 2 figures: KV footprint CDF, reuse
|
||||
decomposition stacked bar.
|
||||
|
||||
### B2 PD-colo interference microbench
|
||||
|
||||
Setup: 1 combined instance on TP1. Two synthetic load generators:
|
||||
|
||||
1. **Decode-only steady load** — short-prompt sessions at fixed
|
||||
per-second arrival, designed to saturate decode without prefill
|
||||
contention.
|
||||
2. **Prefill injector** — single-shot long-prompt requests at
|
||||
controlled cadence; same worker (target the decode worker) vs
|
||||
different worker (route to a paired idle instance).
|
||||
|
||||
Sweep `uncached_prefill_tokens ∈ {2k, 8k, 16k, 32k, 64k}` × `{same,
|
||||
different} worker`.
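
The sweep grid above can be enumerated as a list of run cells. This is a small sketch; the `cell_id` naming is hypothetical, and reading "2k" as 2,000 tokens (rather than 2,048) is an assumption.

```python
from itertools import product

# Assumption: "k" means 1,000 tokens.
PREFILL_TOKENS = [2_000, 8_000, 16_000, 32_000, 64_000]
PLACEMENTS = ["same", "different"]


def build_b2_sweep():
    """Return one dict per (prefill size, placement) cell of the microbench."""
    return [
        {
            "cell_id": f"prefill{tokens}_{placement}",
            "uncached_prefill_tokens": tokens,
            "placement": placement,
        }
        for tokens, placement in product(PREFILL_TOKENS, PLACEMENTS)
    ]
```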
|
||||
|
||||
Outputs: `interference_microbench_summary.json`,
|
||||
`decode_step_timeseries.csv` (from `engine_state.jsonl`),
|
||||
`prefill_overlap_events.jsonl`, `interference_index.json`,
|
||||
TPOT-with-overlay figure, interference-index-vs-prefill-size figure.
|
||||
|
||||
Wall-clock: ~2–3 h GPU including warm-up between sweeps.
|
||||
|
||||
### B3 Routing sweep on session-causal trace
|
||||
|
||||
Setup: 8 combined instances (TP1 × DP8) with the cache-aware proxy.
|
||||
|
||||
Run the same session-causal trace (e.g. r=0.0015 st=30 850-req config
|
||||
from auto-mem `feedback-bench-config.md`) under five policies:
|
||||
|
||||
1. corrected LMetric / cache-aware (`--policy lmetric`)
|
||||
2. load-only (new policy `--policy load_only` — picks min running)
|
||||
3. hard sticky (new policy `--policy sticky` — once a session lands
|
||||
on a worker, never moves)
|
||||
4. current Unified hybrid (`--policy unified`)
|
||||
5. session-mass capped replay (filter the trace so no session exceeds
|
||||
`cap_turns` or `cap_input_tokens`; rerun policy 1)
|
||||
|
||||
Per run, collect: replayer metrics, proxy breakdown, worker_state,
|
||||
engine_state. Compute per-worker queue delay, GPU util, KV occupancy,
|
||||
APC, session-to-worker map.
|
||||
|
||||
Outputs: `worker_balance_summary.json`, `session_to_worker_map.json`,
|
||||
`session_mass_summary.json`, `routing_policy_comparison.json`,
|
||||
`hotspot_index.json`, `capped_session_replay_summary.json`,
|
||||
8 figures from the TODO list (§5.figures).
|
||||
|
||||
Wall-clock: 5 runs × ~13 min ≈ 65 min of replay; budgeted at ~1.5 h GPU with per-run overhead.
|
||||
|
||||
Implementation note: `load_only` and `sticky` are small additions to
|
||||
`scripts/cache_aware_proxy.py` — they reuse existing affinity / score
|
||||
machinery.
|
||||
|
||||
### B4 Sustainable Request Rate sweep
|
||||
|
||||
Setup: same 8 instances. Use Phase-A `--mode srr` loadgen.
|
||||
|
||||
SLO (locked per-class):
|
||||
|
||||
```text
|
||||
TTFT_p90 <= 2.0 s
|
||||
TPOT_p90 <= 0.15 s
|
||||
error_rate <= 0.5%
|
||||
queue length stable (no monotone growth over steady window)
|
||||
KV occupancy stable
|
||||
E2E_p90 <= T_class[c] for each output-length decile c
|
||||
```
|
||||
|
||||
`T_class[c]` is derived from a low-load reference run as
|
||||
`E2E_p90_low_load(c) * 2` (factor configurable). The reference run
|
||||
is done once and cached as `analysis/characterization/srr/slo_classes.json`.
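
A sketch of how `T_class[c]` could be derived from the low-load reference run, under assumptions: each reference row carries `requested_output_tokens` and `e2e_s` (the latter is a hypothetical field name), decile edges come from the reference run itself, and p90 uses the nearest-rank definition.

```python
from bisect import bisect_right


def p90(values):
    """Nearest-rank 90th percentile."""
    xs = sorted(values)
    return xs[(9 * len(xs) + 9) // 10 - 1]


def decile_edges(output_tokens):
    """Nine interior cut points splitting the sample into ten output-length classes."""
    xs = sorted(output_tokens)
    n = len(xs)
    return [xs[min(n - 1, (k * n) // 10)] for k in range(1, 10)]


def decile_class(tokens, edges):
    """Class index 0..9; a value equal to an edge goes to the upper class."""
    return bisect_right(edges, tokens)


def build_slo_classes(reference_rows, factor=2.0):
    """T_class[c] = factor * E2E_p90_low_load(c), per output-length decile c."""
    edges = decile_edges([r["requested_output_tokens"] for r in reference_rows])
    by_class = {}
    for r in reference_rows:
        c = decile_class(r["requested_output_tokens"], edges)
        by_class.setdefault(c, []).append(r["e2e_s"])
    t_class = {c: factor * p90(v) for c, v in sorted(by_class.items())}
    return {"edges": edges, "factor": factor, "T_class": t_class}
```

Storing the edges together with the thresholds lets later SRR runs assign requests to the same classes instead of recomputing deciles per run.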
|
||||
|
||||
For each policy, sweep `lambda` from low (clearly safe) to high (clearly
broken) with a doubling-then-bisection search:

```text
λ_low  = 0.1 sess/s
λ_high = double λ_low until the first SLO violation
binary-search [last passing λ, λ_high] for the max sustainable λ
```
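
The search above can be sketched as follows, assuming SLO feasibility is monotone in `lambda` and that `meets_slo(lam)` is a hypothetical callable that runs one warmup + steady-state window and returns whether every SLO clause held. `max_doublings` and `tol` are illustrative parameters.

```python
def find_max_sustainable_rate(meets_slo, lam_low=0.1, max_doublings=10, tol=0.05):
    """Return the largest passing rate found, within `tol` sess/s of the boundary.

    Returns None if even lam_low violates the SLO.
    """
    if not meets_slo(lam_low):
        return None

    # Phase 1: double until the first violation to bracket the boundary.
    lam_high = lam_low
    for _ in range(max_doublings):
        lam_high *= 2
        if not meets_slo(lam_high):
            break
        lam_low = lam_high
    else:
        # Never violated within the doubling budget: lam_low is only a lower bound.
        return lam_low

    # Phase 2: bisect between the last passing and first failing rate.
    while lam_high - lam_low > tol:
        mid = (lam_low + lam_high) / 2
        if meets_slo(mid):
            lam_low = mid
        else:
            lam_high = mid
    return lam_low
```

Each `meets_slo` call costs one full run, so the number of calls (doublings plus bisection steps) should be checked against the per-policy budget before launching the sweep.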
|
||||
|
||||
Policies covered: LMetric, static PD-disagg, Unified, hard sticky,
|
||||
load-only.
|
||||
|
||||
Outputs: `srr_curve.json`, `lambda_runs/<lambda>/summary.json`,
|
||||
`slo_violation_reason.json`, `goodput_vs_arrival_rate.json`,
|
||||
`stability_summary.json`, all 8 figures from §6.figures.
|
||||
|
||||
Wall-clock: this is the most expensive batch. With binary search,
|
||||
~6 lambda points × 5 policies × ~8 min (warmup + steady) ≈ 4 h GPU.
|
||||
|
||||
### B5 Failure attribution near SRR boundary
|
||||
|
||||
For each policy: pick `λ ∈ {0.9, 1.0, 1.1} × SRR`, run with full
|
||||
instrumentation, then run the analyzer's failure-label step.
|
||||
|
||||
Outputs: `slow_request_attribution.jsonl`, `failure_breakdown.json`,
|
||||
`case_studies.md`, `worker_failure_windows.json`, 5 figures from §7.
|
||||
|
||||
Wall-clock: 3 lambdas × 5 policies × 8 min ≈ 2 h GPU.
|
||||
|
||||
## Phase C: Audit package refresh (CPU)
|
||||
|
||||
Re-run `summarize_runs.py` and `plot_current_results.py` after each
|
||||
GPU batch. Final pass after B5: refresh `claim_matrix`, `risk_register`,
|
||||
`allowed_runs`, regenerate all figures, update
|
||||
`reproduction_commands.sh`.
|
||||
|
||||
Effort: ~1 h CPU.
|
||||
|
||||
## Sequencing & rough timeline
|
||||
|
||||
```text
|
||||
Phase A (CPU, before dash0):
|
||||
A1 + A2 (parallel) ~half day CPU
|
||||
A3 patch (scheduler.py) ~half day CPU
|
||||
A4 SRR loadgen ~half day CPU
|
||||
A5 analyzer extensions ~1 day CPU
|
||||
|
||||
Window 1 on dash0 (smoke + B1' + B2 + B3 only, ~5.5 h GPU):
|
||||
smoke validation of A1–A4 ~30 min GPU
|
||||
B1' KV footprint + reuse decomp ~30 min GPU
|
||||
B2 interference microbench ~3 h GPU
|
||||
B3 routing sweep (5 policies) ~1.5 h GPU
|
||||
Phase C partial refresh ~30 min CPU
|
||||
── HARD STOP, hand results back ──
|
||||
|
||||
Window 2 on dash0 (B4 + B5, ~6 h GPU, only after review):
|
||||
B4 SRR sweep (5 policies × bisect) ~4 h GPU
|
||||
B5 failure attribution ~2 h GPU
|
||||
Phase C final refresh ~1 h CPU
|
||||
```
|
||||
|
||||
## Decisions (locked 2026-05-25)
|
||||
|
||||
1. **Target model**: Qwen3-Coder-30B-A3B. Compute
|
||||
`kv_bytes_per_token` from this model's config at manifest time.
|
||||
2. **GPU topology**: TP1 × 8 vLLM instances (DP8). All proxies and
|
||||
sweeps assume 8 worker endpoints.
|
||||
3. **Trace for B3/B4**: `traces/w600_r0.0015_st30.jsonl` (~850
|
||||
requests). No resampling.
|
||||
4. **E2E SLO**: per-class. Split requests by `requested_output_tokens`
|
||||
decile, set separate E2E thresholds per class. No normalized-E2E
|
||||
headline.
|
||||
5. **vLLM scheduler patch**: accepted. The step-level JSONL log is
   produced by a patch under `patches/`. Polling per-engine `/metrics`
   is kept only as a sanity check.
|
||||
6. **GPU phasing**: hard stop after B2 and B3. Hand results back for
|
||||
review before committing to B4 SRR sweep or B5 attribution.
|
||||
|
||||
## What stays with the interns
|
||||
|
||||
- Re-running `summarize_runs.py` after each GPU batch (mechanical).
|
||||
- Reviewing the auto-generated `current_results.md` for typos.
|
||||
- Maintaining `main_claim_allowed_runs.md` if new traces are added.
|
||||
- Any task that only reads the audit package rather than extending it.
|
||||
|
||||
## Out of scope for this plan
|
||||
|
||||
- New routing policy design (Unified-v2 / PUSH variants).
|
||||
- Production-grade KV transfer engineering.
|
||||
- Any change to the production paper figures in
|
||||
`analysis/pd_sep_paper_section/`.
|
||||
- vLLM upstream contributions.
|
||||
|
||||
These are downstream of characterization; once B2/B3/B5 attribution is
|
||||
in, we decide separately.
|
||||