docs/harness-ablation/profile-driven-harness-design.md
# Profile-Driven Harness Design

## Problem

The current harness is too rule-like. It can help in a narrow setup, but it can also overfit to a previous observation such as "TP=2 is enough" and then fail when the SLO or workload changes. Adding one more rule for each failure mode is not acceptable.

The harness should make AITuner behave more like a performance engineer:

1. Profile the workload and engine behavior.
2. Diagnose the active bottleneck from measurements.
3. Form hypotheses about which knob families can relieve that bottleneck.
4. Choose the next experiment that best disambiguates or improves the system.
5. Interpret the result and update the diagnosis.
6. Stop only when the measured search space is exhausted or the remaining hypotheses have low value.

The harness should not encode a fixed answer for qwen27b, TP=4, TPOT25, or any testcase-specific threshold.
## Design Principles

### 1. Harness is an evidence system, not a prompt trick

The harness should provide structured evidence and a decision protocol. It should not simply add more natural-language hints to the prompt.

The output should include:

- `profile`: measured workload and engine traits.
- `bottleneck_diagnosis`: ranked bottleneck hypotheses with evidence.
- `candidate_actions`: candidate config changes with predicted effect and risk.
- `experiment_plan`: the next config to test, why it is informative, and what outcome would confirm or refute the hypothesis.
- `stop_decision`: emitted only if no useful experiment remains under the current measurement budget.
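A minimal sketch of what one such output record could look like. The five top-level keys come from the list above; all nested field names here are illustrative assumptions, not a fixed schema:

```python
# Illustrative sketch of a single harness output record.
# Top-level keys follow the design; nested fields are assumptions.
harness_output = {
    "profile": {"ttft_p95_s": 3.1, "tpot_p95_ms": 27.4, "kv_cache_usage": 0.83},
    "bottleneck_diagnosis": [
        {"name": "decode_tpot", "confidence": 0.62,
         "evidence": ["tpot_p95 fails at last infeasible probe"]},
    ],
    "candidate_actions": [
        {"patch": {"tensor_parallel_size": 4},
         "predicted_effect": "lower per-token latency",
         "risk": "lower per-GPU efficiency"},
    ],
    "experiment_plan": {
        "next_config": {"tensor_parallel_size": 4},
        "rationale": "tests whether the SLO is compute-latency limited",
        "confirm_if": "tpot_p95 drops below SLO at the same threshold",
    },
    "stop_decision": None,  # set only when no useful experiment remains
}
```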
### 2. Separate profiling from tuning

AITuner currently learns mostly from full trial results. That is too coarse. The harness needs a profile layer that extracts bottleneck signals from every probe:

- workload profile: input/output token distributions, cache reuse, burstiness, selected-request count per sampling threshold;
- latency profile: TTFT and TPOT mean/p50/p95/p99 by threshold and optionally by input/output token bucket;
- engine profile: prefill/decode throughput, waiting/running request counts, KV cache usage, preemptions, GPU utilization if available;
- topology profile: TP/DP/EP product, per-GPU request rate, launch stability, memory pressure;
- measurement profile: search high/low/tolerance, which thresholds were measured, whether failures were early-stop artifacts or true low-load infeasibility.

This profile should be persisted per trial so later decisions do not depend on prompt memory.
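A per-trial record along these lines could be sketched as a dataclass. The field names and the sample values are illustrative assumptions, not a committed schema:

```python
from dataclasses import dataclass, field

# One persisted record per trial, assembled from probe measurements
# rather than prompt memory. Field names are illustrative.
@dataclass
class TrialProfile:
    config: dict                     # engine config under test
    workload: dict                   # token distributions, cache reuse, burstiness
    latency: dict                    # TTFT/TPOT percentiles keyed by threshold
    engine: dict                     # throughput, queue depths, KV usage, preemptions
    topology: dict                   # TP/DP/EP, per-GPU request rate, launch stability
    measurement: dict = field(default_factory=dict)  # search range, measured thresholds

    def latency_at(self, threshold: float) -> dict:
        """Look up the latency record for one sampled request-rate threshold."""
        return self.latency.get(threshold, {})

profile = TrialProfile(
    config={"tensor_parallel_size": 2},
    workload={"mean_input_tokens": 1800},
    latency={0.20: {"ttft_p95_s": 3.2, "tpot_p95_ms": 26.1}},
    engine={"preemptions": 0},
    topology={"tp": 2, "dp": 1},
)
```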
### 3. Diagnose bottlenecks from counters and slopes

A professional tuning loop should not infer bottlenecks from a single failure reason. It should compare how metrics change across thresholds and configs.

Useful diagnostics:

- TTFT-bound: TTFT p95/p99 rises sharply with request rate, long prompts dominate failures, prefill throughput is saturated, decode TPOT is still healthy.
- TPOT-bound: TPOT p95/p99 fails while TTFT is acceptable, decode throughput or per-token latency dominates, high sequence concurrency or batching creates token-latency tails.
- Admission/queueing-bound: waiting requests and arrival lag grow; both TTFT and TPOT may degrade; utilization may be high even though per-request service time is not the limiting factor.
- Memory/KV-bound: KV cache usage, preemptions, OOM or launch failures, or max-model-length/cache pressure limit throughput.
- Topology/communication-bound: increasing TP lowers per-request latency but may reduce per-GPU throughput; DP improves admission but may not reduce per-request latency.

The diagnosis should be a ranked list with confidence, not a single label:
```json
{
  "bottlenecks": [
    {
      "name": "decode_tpot",
      "confidence": 0.62,
      "evidence": ["tpot_p95 fails at last infeasible probe", "ttft_p95 remains below SLO"]
    },
    {
      "name": "prefill_ttft",
      "confidence": 0.31,
      "evidence": ["long prompt tail", "ttft_p99 grows near knee"]
    }
  ]
}
```
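One way such a ranking could be derived mechanically is to compare how far each SLO metric overshoots at the highest infeasible probe. This is a sketch under assumed probe and SLO shapes; the confidence formula is an illustrative placeholder, not the design's actual analyzer:

```python
def rank_bottlenecks(probes, ttft_slo_s, tpot_slo_ms):
    """Rank TTFT-bound vs TPOT-bound hypotheses from probe measurements.

    `probes` is a list of dicts with keys 'rate', 'ttft_p95_s', and
    'tpot_p95_ms' (an assumed shape matching the latency profile).
    Confidence is the normalized SLO overshoot at the highest-rate probe.
    """
    worst = max(probes, key=lambda p: p["rate"])
    overshoot = {
        "prefill_ttft": max(0.0, worst["ttft_p95_s"] / ttft_slo_s - 1.0),
        "decode_tpot": max(0.0, worst["tpot_p95_ms"] / tpot_slo_ms - 1.0),
    }
    total = sum(overshoot.values()) or 1.0  # avoid division by zero
    return [
        {"name": name, "confidence": round(val / total, 2)}
        for name, val in sorted(overshoot.items(), key=lambda kv: -kv[1])
    ]

probes = [
    {"rate": 0.20, "ttft_p95_s": 2.1, "tpot_p95_ms": 24.0},
    {"rate": 0.25, "ttft_p95_s": 3.0, "tpot_p95_ms": 31.0},
]
ranked = rank_bottlenecks(probes, ttft_slo_s=4.0, tpot_slo_ms=25.0)
```

With these sample probes, only TPOT overshoots its SLO at the highest probe, so `decode_tpot` ranks first.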
### 4. Candidate actions come from a knob-effect model

The harness should maintain a generic model of knob effects, not case-specific rules.

Examples:

| Knob family | Expected effect | Main risks |
|---|---|---|
| Increase TP | lower per-request compute latency, may help TTFT/TPOT tails | communication overhead, lower per-GPU efficiency, fewer independent replicas |
| Increase DP | more replicas, better admission/queueing | does not reduce per-request compute latency, may waste memory |
| Increase EP | MoE expert distribution, possible decode/prefill balance | launch complexity, communication, only relevant with model/engine evidence |
| Lower max-num-seqs | reduce decode contention and tail TPOT | underutilization, lower throughput |
| Raise max-num-seqs | improve admission/concurrency | TPOT/TTFT tails, memory pressure |
| Raise max-num-batched-tokens | improve prefill batching and GPU occupancy | long-prefill HoL blocking, memory pressure |
| Lower max-num-batched-tokens | reduce long-prefill interference | underfilled prefill batches |
| Change block-size | cache fragmentation/allocator effects | launch/runtime instability, workload-specific |
The harness should score candidates with:

```text
score = expected_bottleneck_relief
      + information_gain
      + launch_safety
      - regression_risk
      - measurement_cost
```

This lets it choose between "exploit the current incumbent" and "explore an unmeasured topology" from evidence rather than a hardcoded order.
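The scoring formula above could be sketched as follows. The equal default weights and the [0, 1] term values are illustrative assumptions, not calibrated choices:

```python
def score_candidate(c, weights=None):
    """Score one candidate action per the formula above.

    `c` maps each scoring term to a value in [0, 1]; the default
    weight of 1.0 for every term is an illustrative assumption.
    """
    w = weights or {}
    return (
        w.get("relief", 1.0) * c["expected_bottleneck_relief"]
        + w.get("info", 1.0) * c["information_gain"]
        + w.get("safety", 1.0) * c["launch_safety"]
        - w.get("risk", 1.0) * c["regression_risk"]
        - w.get("cost", 1.0) * c["measurement_cost"]
    )

# Exploit the incumbent (small runtime tweak) vs. explore an unmeasured topology.
exploit = {"expected_bottleneck_relief": 0.2, "information_gain": 0.1,
           "launch_safety": 0.9, "regression_risk": 0.1, "measurement_cost": 0.3}
explore = {"expected_bottleneck_relief": 0.5, "information_gain": 0.8,
           "launch_safety": 0.7, "regression_risk": 0.3, "measurement_cost": 0.4}
best = max([exploit, explore], key=score_candidate)
```

With these made-up term values, the unmeasured topology wins on information gain and expected relief, so exploration is chosen without any hardcoded ordering.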
### 5. Use hypothesis-driven experiments

Each trial should answer a specific question.

Bad:

- "Try TP=4 because the previous run needed TP=4."

Good:

- "At the current best TP=2, the last infeasible probe is TPOT-bound. Increasing TP to 4 should reduce per-token compute latency but may reduce per-GPU efficiency. This trial tests whether the SLO is compute-latency limited or topology-overhead limited."

Every candidate should have:

- hypothesis;
- expected metric movement;
- risk;
- confirm condition;
- reject condition;
- fallback next family.
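The six fields above could be carried as a "hypothesis card" per candidate. The class name and the TP=4 sample text are illustrative, echoing the good example above:

```python
from dataclasses import dataclass

# An illustrative hypothesis card carrying the six fields listed above.
@dataclass
class ExperimentCandidate:
    hypothesis: str
    expected_metric_movement: str
    risk: str
    confirm_condition: str
    reject_condition: str
    fallback_next_family: str

tp4 = ExperimentCandidate(
    hypothesis="TPOT-bound at TP=2; TP=4 should cut per-token compute latency",
    expected_metric_movement="tpot_p95 drops at the last infeasible threshold",
    risk="communication overhead lowers per-GPU efficiency",
    confirm_condition="tpot_p95 meets SLO at a higher feasible request rate",
    reject_condition="per-GPU request rate drops below the TP=2 incumbent",
    fallback_next_family="same-topology runtime knobs (max-num-seqs)",
)
```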
### 6. Stop is a decision under measured search limits

Early stop should not mean "we found something good." It should mean one of:

- the configured `search.high` is saturated by a feasible incumbent; or
- the remaining high-value hypotheses are already measured or invalidated; or
- all remaining candidates have low expected information gain or high launch risk relative to the current best; or
- the baseline is infeasible even at the lowest load and the SLO is too tight.

For Fig18 raw mode, stop can be disabled or recorded as `stop` rather than filling the remaining columns with best-so-far.
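The stop conditions above could be sketched as a small decision function. The argument shapes and the `min_score` cutoff are assumptions for illustration; note that a strong incumbent alone never triggers stop:

```python
def should_stop(search_high_saturated, candidates, min_score, baseline_infeasible):
    """Decide stop per the conditions above; returns (stop, reason).

    `candidates` holds scored, not-yet-measured candidates; `min_score`
    is an assumed value cutoff below which no experiment is worth running.
    """
    if baseline_infeasible:
        return True, "baseline infeasible at lowest load; SLO too tight"
    if search_high_saturated:
        return True, "feasible incumbent saturates configured search.high"
    if not candidates or max(c["score"] for c in candidates) < min_score:
        return True, "remaining candidates below value threshold"
    # A strong incumbent by itself is not a stop reason.
    return False, "high-value unmeasured candidate remains"

stop, reason = should_stop(
    search_high_saturated=False,
    candidates=[{"name": "tp4", "score": 0.9}],
    min_score=0.2,
    baseline_infeasible=False,
)
```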
## Proposed Architecture

### Components

1. `Profiler`
   - Input: trial result, probe history, engine logs/metrics if available, study spec.
   - Output: `TrialProfile`.

2. `BottleneckAnalyzer`
   - Input: recent `TrialProfile`s, workload profile, SLO.
   - Output: ranked bottleneck hypotheses.

3. `KnobEffectModel`
   - Generic mapping from bottleneck hypotheses to possible knob families.
   - Contains no model-specific or SLO-specific constants.

4. `ExperimentPlanner`
   - Generates candidate config patches.
   - Scores candidates by expected relief, information gain, risk, and cost.
   - Emits the next experiment and its rationale.

5. `StopPolicy`
   - Uses measured search coverage and remaining candidate scores.
   - Does not stop just because a strong incumbent appears.

6. `PromptRenderer`
   - Renders the structured plan for the LLM when LLM involvement is needed.
   - The LLM can choose among candidates or refine rationale, but should not invent arbitrary directions without evidence.
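A sketch of how these components could be wired into one decision step. The `Harness` class and the lambda stand-ins are illustrative; only the call order mirrors the component list above:

```python
# Sketch of the component wiring for one decision step.
class Harness:
    def __init__(self, profiler, analyzer, knob_model, planner, stop_policy):
        self.profiler = profiler
        self.analyzer = analyzer
        self.knob_model = knob_model
        self.planner = planner
        self.stop_policy = stop_policy
        self.profiles = []  # persisted TrialProfiles, one per trial

    def step(self, trial_result):
        """One step: profile -> diagnose -> candidates -> next config or stop."""
        self.profiles.append(self.profiler(trial_result))
        hypotheses = self.analyzer(self.profiles)
        candidates = self.knob_model(hypotheses)
        if self.stop_policy(candidates):
            return {"stop": True}
        return {"stop": False, "next_config": self.planner(candidates)}

# Minimal stand-in components, just to show the flow end to end.
harness = Harness(
    profiler=lambda r: {"result": r},
    analyzer=lambda ps: [{"name": "decode_tpot", "confidence": 0.6}],
    knob_model=lambda hs: [{"patch": {"tensor_parallel_size": 4}, "score": 0.9}],
    planner=lambda cs: max(cs, key=lambda c: c["score"])["patch"],
    stop_policy=lambda cs: not cs,
)
decision = harness.step({"best_rate": 0.1992})
```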
### Data Flow

```text
study spec + workload trace
            |
            v
     workload profile
            |
trial result/probes/logs --> trial profile
            |
            v
   bottleneck analyzer --> ranked hypotheses
            |
            v
    knob effect model --> candidate action set
            |
            v
   experiment planner --> next config / stop
            |
            v
AITuner evaluates config over configured search range
```
## How This Handles the Current Failure

Current fixed TTFT4s + TPOT25 data:

- TP=2 reached `0.1992 req/s/GPU`.
- Later TP=2 local runtime probes did not improve on it.
- Min-prompt no-harness found TP=4 at `0.2696-0.2758 req/s/GPU`.

A profile-driven harness should have reasoned:

1. The current incumbent is good but not proof of an optimum.
2. The failed probes still show SLO pressure, especially in token-latency/tail behavior.
3. Higher TP is a candidate action because it may reduce per-request compute latency.
4. TP=4 is an unmeasured topology hypothesis within the allowed constraints.
5. Before spending many local TP=2 runtime trials or stopping, test TP=4 as a high-information experiment.

This is not because TP=4 is hardcoded. If TP=4 had already been measured and underperformed, the planner would prefer local runtime refinement or stopping. If the bottleneck were admission/queueing with TTFT/TPOT healthy, the planner might prefer DP or max-num-seqs instead.
## Implementation Plan

### Phase 1: Make the current harness evidence-first

- Persist a compact `TrialProfile` for every trial.
- Build ranked bottleneck hypotheses from probe sequences.
- Replace the single `active_bottleneck` label with ranked hypotheses and evidence.
- Replace deterministic local refinement with candidate scoring.
- Keep existing rules only as entries in the generic `KnobEffectModel`.

### Phase 2: Candidate scoring

- Generate candidates from:
  - adjacent topology moves;
  - same-topology runtime moves;
  - rollback/avoidance after failures;
  - no-op stop.
- Score candidates by:
  - whether they directly target the top bottleneck;
  - whether they test an unmeasured high-information direction;
  - launch risk from prior failures;
  - expected impact on `request_rate_per_gpu`;
  - measurement cost.
### Phase 3: Controlled experiments

- Add ablations:
  - old strong-prompt no-harness;
  - min-prompt no-harness;
  - current rule harness;
  - profile-driven harness.
- Run each under the same SLO/workload/search.
- Report raw `perf[i]`, best-so-far, failed configs, and stop reason.
## Acceptance Criteria

The profile-driven harness is acceptable only if:

1. It does not encode model names, fixed SLO values, or known winning configs.
2. It can explain every proposal as a hypothesis tied to measured bottleneck evidence.
3. It does not early-stop while an unmeasured high-score candidate remains.
4. It improves on or matches min-prompt no-harness convergence on at least the qwen27b TTFT4s/TPOT25 setup.
5. It does not regress the previous stepped TTFT/TPOT50 setup.
6. It records enough evidence for a human engineer to audit why each trial was chosen.