diff --git a/docs/harness-ablation/profile-driven-harness-design.md b/docs/harness-ablation/profile-driven-harness-design.md
new file mode 100644
index 0000000..31491a4
--- /dev/null
+++ b/docs/harness-ablation/profile-driven-harness-design.md
@@ -0,0 +1,250 @@
# Profile-Driven Harness Design

## Problem

The current harness is too rule-like. It can help in a narrow setup, but it can also overfit to a previous observation such as "TP=2 is enough" and then fail when the SLO or workload changes. Adding one more rule for each failure mode does not scale.

The harness should make AITuner behave more like a performance engineer:

1. Profile the workload and engine behavior.
2. Diagnose the active bottleneck from measurements.
3. Form hypotheses about which knob families can relieve that bottleneck.
4. Choose the next experiment that best disambiguates the hypotheses or improves the system.
5. Interpret the result and update the diagnosis.
6. Stop only when the measured search space is exhausted or the remaining hypotheses have low value.

The harness should not encode a fixed answer for qwen27b, TP=4, TPOT25, or any other testcase-specific threshold.

## Design Principles

### 1. Harness is an evidence system, not a prompt trick

The harness should provide structured evidence and a decision protocol. It should not simply add more natural-language hints to the prompt.

The output should include:

- `profile`: measured workload and engine traits.
- `bottleneck_diagnosis`: ranked bottleneck hypotheses with evidence.
- `candidate_actions`: candidate config changes with predicted effect and risk.
- `experiment_plan`: the next config to test, why it is informative, and what outcome would confirm or refute the hypothesis.
- `stop_decision`: emitted only when no useful experiment remains under the current measurement budget.

### 2. Separate profiling from tuning

AITuner currently learns mostly from full trial results. That is too coarse. The harness needs a profile layer that extracts bottleneck signals from every probe:

- workload profile: input/output token distributions, cache reuse, burstiness, selected-request count per sampling threshold;
- latency profile: TTFT and TPOT mean/p50/p95/p99 by threshold and optionally by input/output token bucket;
- engine profile: prefill/decode throughput, waiting/running request counts, KV cache usage, preemptions, GPU utilization if available;
- topology profile: TP/DP/EP product, per-GPU request rate, launch stability, memory pressure;
- measurement profile: search high/low/tolerance, which thresholds were measured, and whether failures were early-stop artifacts or true low-load infeasibility.

This profile should be persisted per trial so later decisions do not depend on prompt memory.
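As a purely illustrative example, the persisted record could look like the sketch below. `TrialProfile`, `LatencyStats`, and every field name are assumptions chosen to mirror the list above, not an existing AITuner type.

```python
# Illustrative sketch only: TrialProfile and its fields are assumptions that
# mirror the profile layer described above, not an existing AITuner API.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LatencyStats:
    mean_ms: float
    p50_ms: float
    p95_ms: float
    p99_ms: float


@dataclass
class TrialProfile:
    trial_id: str
    config: dict                          # topology + runtime knobs under test
    # latency profile, keyed by sampling threshold
    ttft_by_threshold: dict[float, LatencyStats] = field(default_factory=dict)
    tpot_by_threshold: dict[float, LatencyStats] = field(default_factory=dict)
    # engine profile
    prefill_throughput_tps: float | None = None
    decode_throughput_tps: float | None = None
    kv_cache_usage: float | None = None   # fraction of KV blocks in use
    preemptions: int = 0
    # measurement profile
    measured_thresholds: list[float] = field(default_factory=list)
    failure_reason: str | None = None     # e.g. launch failure vs. SLO miss
```

Because the record is structured and persisted per trial, later components can reload it directly instead of re-deriving signals from prompt history.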
### 3. Diagnose bottlenecks from counters and slopes

A professional tuning loop should not infer bottlenecks from a single failure reason. It should compare how metrics change across thresholds and configs.

Useful diagnostics:

- TTFT-bound: TTFT p95/p99 rises sharply with request rate, long prompts dominate failures, prefill throughput is saturated, and decode TPOT is still healthy.
- TPOT-bound: TPOT p95/p99 fails while TTFT is acceptable, decode throughput or per-token latency dominates, and high sequence concurrency or batching creates token-latency tails.
- Admission/queueing-bound: waiting requests and arrival lag grow; both TTFT and TPOT may degrade; utilization may be high even though per-request service time is not the only limiting factor.
- Memory/KV-bound: KV cache usage, preemptions, OOM/launch failures, or max-model-length/cache pressure limit throughput.
- Topology/communication-bound: increasing TP lowers per-request latency but may reduce per-GPU throughput; DP improves admission but may not reduce per-request latency.

The diagnosis should be a ranked list with confidence, not a single label:

```json
{
  "bottlenecks": [
    {
      "name": "decode_tpot",
      "confidence": 0.62,
      "evidence": ["tpot_p95 fails at last infeasible probe", "ttft_p95 remains below SLO"]
    },
    {
      "name": "prefill_ttft",
      "confidence": 0.31,
      "evidence": ["long prompt tail", "ttft_p99 grows near knee"]
    }
  ]
}
```

### 4. Candidate actions come from a knob-effect model

The harness should maintain a generic model of knob effects, not case-specific rules.

Examples:

| Knob family | Expected effect | Main risks |
|---|---|---|
| Increase TP | lower per-request compute latency, may help TTFT/TPOT tails | communication overhead, lower per-GPU efficiency, fewer independent replicas |
| Increase DP | more replicas, better admission/queueing | does not reduce per-request compute latency, may waste memory |
| Increase EP | MoE expert distribution, possible decode/prefill balance | launch complexity, communication, only relevant with model/engine evidence |
| Lower max-num-seqs | reduce decode contention and tail TPOT | underutilization, lower throughput |
| Raise max-num-seqs | improve admission/concurrency | TPOT/TTFT tails, memory pressure |
| Raise max-num-batched-tokens | improve prefill batching and GPU occupancy | long-prefill head-of-line (HoL) blocking, memory pressure |
| Lower max-num-batched-tokens | reduce long-prefill interference | underfilled prefill batches |
| Change block-size | cache fragmentation/allocator effects | launch/runtime instability, workload-specific |

The harness should score candidates with:

```text
score = expected_bottleneck_relief
      + information_gain
      + launch_safety
      - regression_risk
      - measurement_cost
```

This lets it choose between "exploit the current incumbent" and "explore an unmeasured topology" from evidence rather than a hardcoded order; a minimal scoring sketch appears after principle 6.

### 5. Use hypothesis-driven experiments

Each trial should answer a specific question.

Bad:

- "Try TP=4 because the previous run needed TP=4."

Good:

- "At the current best TP=2, the last infeasible probe is TPOT-bound. Increasing TP to 4 should reduce per-token compute latency but may reduce per-GPU efficiency. This trial tests whether the SLO is compute-latency limited or topology-overhead limited."

Every candidate should carry (these fields appear in the candidate sketch after principle 6):

- a hypothesis;
- the expected metric movement;
- the main risk;
- a confirm condition;
- a reject condition;
- a fallback knob family.

### 6. Stop is a decision under measured search limits

Early stop should not mean "we found something good." It should mean one of the following, sketched in code below:

- the configured `search.high` is saturated by a feasible incumbent; or
- the remaining high-value hypotheses are already measured or invalidated; or
- all remaining candidates have low expected information gain or high launch risk relative to the current best; or
- the baseline is infeasible even at the lowest load and the SLO is too tight.

For Fig18 raw mode, stop can be disabled or recorded as `stop` rather than filling remaining columns with best-so-far.
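To make the scoring formula from principle 4 and the per-candidate fields from principle 5 concrete, here is a minimal sketch. The `Candidate` shape, the equal weighting of the score terms, and all names are illustrative assumptions; in practice the component terms would be estimated from persisted `TrialProfile`s.

```python
# Illustrative candidate-scoring sketch for principles 4 and 5. All names
# and the equal term weighting are assumptions, not a fixed implementation.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    patch: dict              # config change, e.g. {"tensor_parallel_size": 4}
    hypothesis: str          # what this experiment is meant to test
    expected_movement: str   # e.g. "tpot_p95 drops at the knee threshold"
    confirm: str             # outcome that confirms the hypothesis
    reject: str              # outcome that refutes it
    fallback_family: str     # knob family to try next if rejected
    # score components, each normalized to [0, 1]
    expected_bottleneck_relief: float
    information_gain: float
    launch_safety: float
    regression_risk: float
    measurement_cost: float


def score(c: Candidate) -> float:
    # Direct transcription of the score formula in principle 4.
    return (c.expected_bottleneck_relief
            + c.information_gain
            + c.launch_safety
            - c.regression_risk
            - c.measurement_cost)


def pick_next(candidates: list[Candidate]) -> Candidate:
    # Exploit vs. explore falls out of the scoring: an unmeasured topology
    # with high information_gain can outrank a safe local refinement.
    return max(candidates, key=score)
```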
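Likewise, a minimal sketch of the stop decision in principle 6. The inputs are flags and scores computed elsewhere (for example by the scoring sketch above), and the cutoff value is an arbitrary placeholder, not a tuned constant.

```python
# Illustrative StopPolicy sketch for principle 6. Every branch maps to one
# of the stop conditions listed above; the 0.1 cutoff is a placeholder.
def should_stop(candidate_scores: list[float],
                incumbent_saturates_search_high: bool,
                baseline_infeasible_at_lowest_load: bool,
                min_score: float = 0.1) -> tuple[bool, str]:
    """Return (stop?, reason). Never stops merely because the current
    incumbent looks strong."""
    if incumbent_saturates_search_high:
        return True, "search.high saturated by a feasible incumbent"
    if baseline_infeasible_at_lowest_load:
        return True, "SLO too tight even at the lowest load"
    if not candidate_scores:
        return True, "all high-value hypotheses measured or invalidated"
    if max(candidate_scores) < min_score:
        return True, "remaining candidates have low expected value"
    return False, "a high-value experiment remains"
```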
## Proposed Architecture

### Components

1. `Profiler`
   - Input: trial result, probe history, engine logs/metrics if available, study spec.
   - Output: `TrialProfile`.

2. `BottleneckAnalyzer`
   - Input: recent `TrialProfile`s, workload profile, SLO.
   - Output: ranked bottleneck hypotheses.

3. `KnobEffectModel`
   - Generic mapping from bottleneck hypotheses to possible knob families.
   - Contains no model-specific or SLO-specific constants.

4. `ExperimentPlanner`
   - Generates candidate config patches.
   - Scores candidates by expected relief, information gain, risk, and cost.
   - Emits the next experiment and its rationale.

5. `StopPolicy`
   - Uses measured search coverage and remaining candidate scores.
   - Does not stop just because a strong incumbent appears.

6. `PromptRenderer`
   - Renders the structured plan for the LLM when LLM involvement is needed.
   - The LLM can choose among candidates or refine the rationale, but should not invent arbitrary directions without evidence.

### Data Flow

```text
study spec + workload trace
        |
        v
  workload profile
        |
trial result/probes/logs --> trial profile
        |
        v
bottleneck analyzer --> ranked hypotheses
        |
        v
knob effect model --> candidate action set
        |
        v
experiment planner --> next config / stop
        |
        v
AITuner evaluates config over configured search range
```
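A hedged sketch of how this data flow could be wired end to end. The component objects follow the names above, but their method signatures (`profile_workload`, `rank`, `candidates`, `decide`, `plan`, `profile_trial`) are assumptions for illustration only.

```python
# Illustrative wiring of the data flow above. Component method names and
# signatures are assumptions; each step corresponds to one arrow in the
# diagram.
def tuning_loop(study_spec, workload_trace,
                profiler, analyzer, knob_model, planner, stop_policy,
                evaluate):
    workload_profile = profiler.profile_workload(workload_trace)
    profiles = []  # persisted TrialProfiles, one per completed trial
    while True:
        hypotheses = analyzer.rank(profiles, workload_profile, study_spec.slo)
        candidates = knob_model.candidates(hypotheses, study_spec.constraints)
        stop, reason = stop_policy.decide(candidates, profiles,
                                          study_spec.search)
        if stop:
            return {"stop_reason": reason, "trials": profiles}
        plan = planner.plan(candidates)   # next config + rationale
        result = evaluate(plan.config)    # AITuner search over thresholds
        profiles.append(profiler.profile_trial(result, plan))
```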
## How This Handles the Current Failure

Current fixed TTFT4s + TPOT25 data:

- TP=2 reached `0.1992 req/s/GPU`.
- Later TP=2 local runtime probes did not improve on it.
- The min-prompt no-harness run found TP=4 at `0.2696-0.2758 req/s/GPU`.

A profile-driven harness should have reasoned:

1. The current incumbent is good but not proof of an optimum.
2. The failed probes still show SLO pressure, especially in token-latency/tail behavior.
3. Higher TP is a candidate action because it may reduce per-request compute latency.
4. TP=4 is an unmeasured topology hypothesis within the allowed constraints.
5. Before spending many local TP=2 runtime trials or stopping, test TP=4 as a high-information experiment.

This is not because TP=4 is hardcoded. If TP=4 had already been measured and underperformed, the planner would prefer local runtime refinement or stop. If the bottleneck were admission/queueing with TTFT/TPOT healthy, the planner might prefer DP or max-num-seqs instead.

## Implementation Plan

### Phase 1: Make the current harness evidence-first

- Persist a compact `TrialProfile` for every trial.
- Build ranked bottleneck hypotheses from probe sequences.
- Replace the single `active_bottleneck` label with ranked hypotheses and evidence.
- Replace deterministic local refinement with candidate scoring.
- Keep existing rules only as entries in the generic `KnobEffectModel`.

### Phase 2: Candidate scoring

- Generate candidates from:
  - adjacent topology moves;
  - same-topology runtime moves;
  - rollback/avoidance after failures;
  - a no-op stop.
- Score candidates by:
  - whether they directly target the top bottleneck;
  - whether they test an unmeasured high-information direction;
  - launch risk from prior failures;
  - expected impact on request_rate_per_gpu;
  - measurement cost.

### Phase 3: Controlled experiments

- Add ablations:
  - old strong-prompt no-harness;
  - min-prompt no-harness;
  - current rule harness;
  - profile-driven harness.
- Run each under the same SLO/workload/search.
- Report raw `perf[i]`, best-so-far, failed configs, and stop reason.

## Acceptance Criteria

The profile-driven harness is acceptable only if:

1. It does not encode model names, fixed SLO values, or known winning configs.
2. It can explain every proposal as a hypothesis tied to measured bottleneck evidence.
3. It does not early-stop while an unmeasured high-score candidate remains.
4. It improves on or matches min-prompt no-harness convergence on at least the qwen27b TTFT4s/TPOT25 setup.
5. It does not regress the previous stepped TTFT/TPOT50 setup.
6. It records enough evidence for a human engineer to audit why each trial was chosen.