docs/harness-ablation/profile-driven-harness-design.md
# Profile-Driven Harness Design

## Problem

The current harness is too rule-like. It can help in a narrow setup, but it can also overfit to a previous observation such as "TP=2 is enough" and then fail when the SLO or workload changes. Adding one more rule for each failure mode is not acceptable.

The harness should make AITuner behave more like a performance engineer:

1. Profile the workload and engine behavior.
2. Diagnose the active bottleneck from measurements.
3. Form hypotheses about which knob families can relieve that bottleneck.
4. Choose the next experiment that best disambiguates or improves the system.
5. Interpret the result and update the diagnosis.
6. Stop only when the measured search space is exhausted or the remaining hypotheses have low value.

The harness should not encode a fixed answer for qwen27b, TP=4, TPOT25, or any testcase-specific threshold.
## Design Principles

### 1. Harness is an evidence system, not a prompt trick

The harness should provide structured evidence and a decision protocol. It should not simply add more natural-language hints to the prompt.

The output should include:

- `profile`: measured workload and engine traits.
- `bottleneck_diagnosis`: ranked bottleneck hypotheses with evidence.
- `candidate_actions`: candidate config changes with predicted effect and risk.
- `experiment_plan`: the next config to test, why it is informative, and what outcome would confirm or refute the hypothesis.
- `stop_decision`: emitted only if no useful experiment remains under the current measurement budget.
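A minimal sketch of what one such output record could look like. The five top-level keys come from the list above; all nested field names here are illustrative assumptions, not a fixed schema:

```python
# Illustrative sketch of a single harness output record.
# Top-level keys follow the design; nested fields are assumptions.
harness_output = {
    "profile": {"ttft_p95_s": 3.1, "tpot_p95_ms": 27.4, "kv_cache_usage": 0.83},
    "bottleneck_diagnosis": [
        {"name": "decode_tpot", "confidence": 0.62,
         "evidence": ["tpot_p95 fails at last infeasible probe"]},
    ],
    "candidate_actions": [
        {"patch": {"tensor_parallel_size": 4},
         "predicted_effect": "lower per-token latency",
         "risk": "lower per-GPU efficiency"},
    ],
    "experiment_plan": {
        "next_config": {"tensor_parallel_size": 4},
        "rationale": "tests whether the SLO is compute-latency limited",
        "confirm_if": "tpot_p95 drops below SLO at the same threshold",
    },
    "stop_decision": None,  # set only when no useful experiment remains
}
```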
### 2. Separate profiling from tuning

AITuner currently learns mostly from full trial results. That is too coarse. The harness needs a profile layer that extracts bottleneck signals from every probe:

- workload profile: input/output token distributions, cache reuse, burstiness, selected-request count per sampling threshold;
- latency profile: TTFT and TPOT mean/p50/p95/p99 by threshold and optionally by input/output token bucket;
- engine profile: prefill/decode throughput, waiting/running request counts, KV cache usage, preemptions, GPU utilization if available;
- topology profile: TP/DP/EP product, per-GPU request rate, launch stability, memory pressure;
- measurement profile: search high/low/tolerance, which thresholds were measured, whether failures were early-stop artifacts or true low-load infeasibility.

This profile should be persisted per trial so later decisions do not depend on prompt memory.
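A per-trial record along these lines could be sketched as a dataclass. The field names and the sample values are illustrative assumptions, not a committed schema:

```python
from dataclasses import dataclass, field

# One persisted record per trial, assembled from probe measurements
# rather than prompt memory. Field names are illustrative.
@dataclass
class TrialProfile:
    config: dict                     # engine config under test
    workload: dict                   # token distributions, cache reuse, burstiness
    latency: dict                    # TTFT/TPOT percentiles keyed by threshold
    engine: dict                     # throughput, queue depths, KV usage, preemptions
    topology: dict                   # TP/DP/EP, per-GPU request rate, launch stability
    measurement: dict = field(default_factory=dict)  # search range, measured thresholds

    def latency_at(self, threshold: float) -> dict:
        """Look up the latency record for one sampled request-rate threshold."""
        return self.latency.get(threshold, {})

profile = TrialProfile(
    config={"tensor_parallel_size": 2},
    workload={"mean_input_tokens": 1800},
    latency={0.20: {"ttft_p95_s": 3.2, "tpot_p95_ms": 26.1}},
    engine={"preemptions": 0},
    topology={"tp": 2, "dp": 1},
)
```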
### 3. Diagnose bottlenecks from counters and slopes

A professional tuning loop should not infer bottlenecks from a single failure reason. It should compare how metrics change across thresholds and configs.

Useful diagnostics:

- TTFT-bound: TTFT p95/p99 rises sharply with request rate, long prompts dominate failures, prefill throughput is saturated, decode TPOT is still healthy.
- TPOT-bound: TPOT p95/p99 fails while TTFT is acceptable, decode throughput or per-token latency dominates, high sequence concurrency or batching creates token-latency tails.
- Admission/queueing-bound: waiting requests and arrival lag grow; both TTFT and TPOT may degrade; utilization may be high even though per-request service time is not the limiting factor.
- Memory/KV-bound: KV cache usage, preemptions, OOM or launch failures, or max-model-length/cache pressure limit throughput.
- Topology/communication-bound: increasing TP lowers per-request latency but may reduce per-GPU throughput; DP improves admission but may not reduce per-request latency.

The diagnosis should be a ranked list with confidence, not a single label:
```json
{
  "bottlenecks": [
    {
      "name": "decode_tpot",
      "confidence": 0.62,
      "evidence": ["tpot_p95 fails at last infeasible probe", "ttft_p95 remains below SLO"]
    },
    {
      "name": "prefill_ttft",
      "confidence": 0.31,
      "evidence": ["long prompt tail", "ttft_p99 grows near knee"]
    }
  ]
}
```
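One way such a ranking could be derived mechanically is to compare how far each SLO metric overshoots at the highest infeasible probe. This is a sketch under assumed probe and SLO shapes; the confidence formula is an illustrative placeholder, not the design's actual analyzer:

```python
def rank_bottlenecks(probes, ttft_slo_s, tpot_slo_ms):
    """Rank TTFT-bound vs TPOT-bound hypotheses from probe measurements.

    `probes` is a list of dicts with keys 'rate', 'ttft_p95_s', and
    'tpot_p95_ms' (an assumed shape matching the latency profile).
    Confidence is the normalized SLO overshoot at the highest-rate probe.
    """
    worst = max(probes, key=lambda p: p["rate"])
    overshoot = {
        "prefill_ttft": max(0.0, worst["ttft_p95_s"] / ttft_slo_s - 1.0),
        "decode_tpot": max(0.0, worst["tpot_p95_ms"] / tpot_slo_ms - 1.0),
    }
    total = sum(overshoot.values()) or 1.0  # avoid division by zero
    return [
        {"name": name, "confidence": round(val / total, 2)}
        for name, val in sorted(overshoot.items(), key=lambda kv: -kv[1])
    ]

probes = [
    {"rate": 0.20, "ttft_p95_s": 2.1, "tpot_p95_ms": 24.0},
    {"rate": 0.25, "ttft_p95_s": 3.0, "tpot_p95_ms": 31.0},
]
ranked = rank_bottlenecks(probes, ttft_slo_s=4.0, tpot_slo_ms=25.0)
```

With these sample probes, only TPOT overshoots its SLO at the highest probe, so `decode_tpot` ranks first.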
### 4. Candidate actions come from a knob-effect model

The harness should maintain a generic model of knob effects, not case-specific rules.

Examples:

| Knob family | Expected effect | Main risks |
|---|---|---|
| Increase TP | lower per-request compute latency, may help TTFT/TPOT tails | communication overhead, lower per-GPU efficiency, fewer independent replicas |
| Increase DP | more replicas, better admission/queueing | does not reduce per-request compute latency, may waste memory |
| Increase EP | MoE expert distribution, possible decode/prefill balance | launch complexity, communication, only relevant with model/engine evidence |
| Lower max-num-seqs | reduce decode contention and tail TPOT | underutilization, lower throughput |
| Raise max-num-seqs | improve admission/concurrency | TPOT/TTFT tails, memory pressure |
| Raise max-num-batched-tokens | improve prefill batching and GPU occupancy | long-prefill HoL blocking, memory pressure |
| Lower max-num-batched-tokens | reduce long-prefill interference | underfilled prefill batches |
| Change block-size | cache fragmentation/allocator effects | launch/runtime instability, workload-specific |
The harness should score candidates with:

```text
score = expected_bottleneck_relief
      + information_gain
      + launch_safety
      - regression_risk
      - measurement_cost
```

This lets it choose between "exploit the current incumbent" and "explore an unmeasured topology" from evidence rather than a hardcoded order.
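The scoring formula above could be sketched as follows. The equal default weights and the [0, 1] term values are illustrative assumptions, not calibrated choices:

```python
def score_candidate(c, weights=None):
    """Score one candidate action per the formula above.

    `c` maps each scoring term to a value in [0, 1]; the default
    weight of 1.0 for every term is an illustrative assumption.
    """
    w = weights or {}
    return (
        w.get("relief", 1.0) * c["expected_bottleneck_relief"]
        + w.get("info", 1.0) * c["information_gain"]
        + w.get("safety", 1.0) * c["launch_safety"]
        - w.get("risk", 1.0) * c["regression_risk"]
        - w.get("cost", 1.0) * c["measurement_cost"]
    )

# Exploit the incumbent (small runtime tweak) vs. explore an unmeasured topology.
exploit = {"expected_bottleneck_relief": 0.2, "information_gain": 0.1,
           "launch_safety": 0.9, "regression_risk": 0.1, "measurement_cost": 0.3}
explore = {"expected_bottleneck_relief": 0.5, "information_gain": 0.8,
           "launch_safety": 0.7, "regression_risk": 0.3, "measurement_cost": 0.4}
best = max([exploit, explore], key=score_candidate)
```

With these made-up term values, the unmeasured topology wins on information gain and expected relief, so exploration is chosen without any hardcoded ordering.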
### 5. Use hypothesis-driven experiments

Each trial should answer a specific question.

Bad:

- "Try TP=4 because the previous run needed TP=4."

Good:

- "At the current best TP=2, the last infeasible probe is TPOT-bound. Increasing TP to 4 should reduce per-token compute latency but may reduce per-GPU efficiency. This trial tests whether the SLO is compute-latency limited or topology-overhead limited."

Every candidate should have:

- hypothesis;
- expected metric movement;
- risk;
- confirm condition;
- reject condition;
- fallback next family.
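The six fields above could be carried as a "hypothesis card" per candidate. The class name and the TP=4 sample text are illustrative, echoing the good example above:

```python
from dataclasses import dataclass

# An illustrative hypothesis card carrying the six fields listed above.
@dataclass
class ExperimentCandidate:
    hypothesis: str
    expected_metric_movement: str
    risk: str
    confirm_condition: str
    reject_condition: str
    fallback_next_family: str

tp4 = ExperimentCandidate(
    hypothesis="TPOT-bound at TP=2; TP=4 should cut per-token compute latency",
    expected_metric_movement="tpot_p95 drops at the last infeasible threshold",
    risk="communication overhead lowers per-GPU efficiency",
    confirm_condition="tpot_p95 meets SLO at a higher feasible request rate",
    reject_condition="per-GPU request rate drops below the TP=2 incumbent",
    fallback_next_family="same-topology runtime knobs (max-num-seqs)",
)
```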
### 6. Stop is a decision under measured search limits

Early stop should not mean "we found something good." It should mean one of:

- the configured `search.high` is saturated by a feasible incumbent; or
- the remaining high-value hypotheses are already measured or invalidated; or
- all remaining candidates have low expected information gain or high launch risk relative to the current best; or
- the baseline is infeasible even at the lowest load and the SLO is too tight.

For Fig18 raw mode, stop can be disabled or recorded as `stop` rather than filling the remaining columns with best-so-far.
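The stop conditions above could be sketched as a small decision function. The argument shapes and the `min_score` cutoff are assumptions for illustration; note that a strong incumbent alone never triggers stop:

```python
def should_stop(search_high_saturated, candidates, min_score, baseline_infeasible):
    """Decide stop per the conditions above; returns (stop, reason).

    `candidates` holds scored, not-yet-measured candidates; `min_score`
    is an assumed value cutoff below which no experiment is worth running.
    """
    if baseline_infeasible:
        return True, "baseline infeasible at lowest load; SLO too tight"
    if search_high_saturated:
        return True, "feasible incumbent saturates configured search.high"
    if not candidates or max(c["score"] for c in candidates) < min_score:
        return True, "remaining candidates below value threshold"
    # A strong incumbent by itself is not a stop reason.
    return False, "high-value unmeasured candidate remains"

stop, reason = should_stop(
    search_high_saturated=False,
    candidates=[{"name": "tp4", "score": 0.9}],
    min_score=0.2,
    baseline_infeasible=False,
)
```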
## Proposed Architecture

### Components

1. `Profiler`
   - Input: trial result, probe history, engine logs/metrics if available, study spec.
   - Output: `TrialProfile`.

2. `BottleneckAnalyzer`
   - Input: recent `TrialProfile`s, workload profile, SLO.
   - Output: ranked bottleneck hypotheses.

3. `KnobEffectModel`
   - Generic mapping from bottleneck hypotheses to possible knob families.
   - Contains no model-specific or SLO-specific constants.

4. `ExperimentPlanner`
   - Generates candidate config patches.
   - Scores candidates by expected relief, information gain, risk, and cost.
   - Emits the next experiment and its rationale.

5. `StopPolicy`
   - Uses measured search coverage and remaining candidate scores.
   - Does not stop just because a strong incumbent appears.

6. `PromptRenderer`
   - Renders the structured plan for the LLM when LLM involvement is needed.
   - The LLM can choose among candidates or refine rationale, but should not invent arbitrary directions without evidence.
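A sketch of how these components could be wired into one decision step. The `Harness` class and the lambda stand-ins are illustrative; only the call order mirrors the component list above:

```python
# Sketch of the component wiring for one decision step.
class Harness:
    def __init__(self, profiler, analyzer, knob_model, planner, stop_policy):
        self.profiler = profiler
        self.analyzer = analyzer
        self.knob_model = knob_model
        self.planner = planner
        self.stop_policy = stop_policy
        self.profiles = []  # persisted TrialProfiles, one per trial

    def step(self, trial_result):
        """One step: profile -> diagnose -> candidates -> next config or stop."""
        self.profiles.append(self.profiler(trial_result))
        hypotheses = self.analyzer(self.profiles)
        candidates = self.knob_model(hypotheses)
        if self.stop_policy(candidates):
            return {"stop": True}
        return {"stop": False, "next_config": self.planner(candidates)}

# Minimal stand-in components, just to show the flow end to end.
harness = Harness(
    profiler=lambda r: {"result": r},
    analyzer=lambda ps: [{"name": "decode_tpot", "confidence": 0.6}],
    knob_model=lambda hs: [{"patch": {"tensor_parallel_size": 4}, "score": 0.9}],
    planner=lambda cs: max(cs, key=lambda c: c["score"])["patch"],
    stop_policy=lambda cs: not cs,
)
decision = harness.step({"best_rate": 0.1992})
```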
### Data Flow

```text
study spec + workload trace
            |
            v
     workload profile
            |
trial result/probes/logs --> trial profile
            |
            v
   bottleneck analyzer --> ranked hypotheses
            |
            v
    knob effect model --> candidate action set
            |
            v
   experiment planner --> next config / stop
            |
            v
AITuner evaluates config over configured search range
```
## How This Handles the Current Failure

Current fixed TTFT4s + TPOT25 data:

- TP=2 reached `0.1992 req/s/GPU`.
- Later TP=2 local runtime probes did not improve on it.
- Min-prompt no-harness found TP=4 at `0.2696-0.2758 req/s/GPU`.

A profile-driven harness should have reasoned:

1. The current incumbent is good but not proof of an optimum.
2. The failed probes still show SLO pressure, especially in token-latency/tail behavior.
3. Higher TP is a candidate action because it may reduce per-request compute latency.
4. TP=4 is an unmeasured topology hypothesis within the allowed constraints.
5. Before spending many local TP=2 runtime trials or stopping, test TP=4 as a high-information experiment.

This is not because TP=4 is hardcoded. If TP=4 had already been measured and underperformed, the planner would prefer local runtime refinement or stopping. If the bottleneck were admission/queueing with TTFT/TPOT healthy, the planner might prefer DP or max-num-seqs instead.
## Implementation Plan

### Phase 1: Make the current harness evidence-first

- Persist a compact `TrialProfile` for every trial.
- Build ranked bottleneck hypotheses from probe sequences.
- Replace the single `active_bottleneck` label with ranked hypotheses and evidence.
- Replace deterministic local refinement with candidate scoring.
- Keep existing rules only as entries in the generic `KnobEffectModel`.

### Phase 2: Candidate scoring

- Generate candidates from:
  - adjacent topology moves;
  - same-topology runtime moves;
  - rollback/avoidance after failures;
  - no-op stop.
- Score candidates by:
  - whether they directly target the top bottleneck;
  - whether they test an unmeasured high-information direction;
  - launch risk from prior failures;
  - expected impact on `request_rate_per_gpu`;
  - measurement cost.
### Phase 3: Controlled experiments

- Add ablations:
  - old strong-prompt no-harness;
  - min-prompt no-harness;
  - current rule harness;
  - profile-driven harness.
- Run each under the same SLO/workload/search.
- Report raw `perf[i]`, best-so-far, failed configs, and stop reason.
## Acceptance Criteria

The profile-driven harness is acceptable only if:

1. It does not encode model names, fixed SLO values, or known winning configs.
2. It can explain every proposal as a hypothesis tied to measured bottleneck evidence.
3. It does not early-stop while an unmeasured high-score candidate remains.
4. It improves on or matches min-prompt no-harness convergence on at least the qwen27b TTFT4s/TPOT25 setup.
5. It does not regress the previous stepped TTFT/TPOT50 setup.
6. It records enough evidence for a human engineer to audit why each trial was chosen.