# Profile-Driven Harness Design

## Problem

The current harness is too rule-like. It can help in a narrow setup, but it can also overfit to a previous observation such as "TP=2 is enough" and then fail when the SLO or workload changes. Adding one more rule for each failure mode is not acceptable.

The harness should make AITuner behave more like a performance engineer:

1. Profile the workload and engine behavior.
2. Diagnose the active bottleneck from measurements.
3. Form hypotheses about which knob families can relieve that bottleneck.
4. Choose the next experiment that best disambiguates the hypotheses or improves the system.
5. Interpret the result and update the diagnosis.
6. Stop only when the measured search space is exhausted or the remaining hypotheses have low value.

The harness should not encode a fixed answer for qwen27b, TP=4, TPOT25, or any testcase-specific threshold.

## Design Principles

### 1. Harness is an evidence system, not a prompt trick

The harness should provide structured evidence and a decision protocol. It should not simply add more natural-language hints to the prompt.

The output should include:

- `profile`: measured workload and engine traits.
- `bottleneck_diagnosis`: ranked bottleneck hypotheses with evidence.
- `candidate_actions`: candidate config changes with predicted effect and risk.
- `experiment_plan`: the next config to test, why it is informative, and what outcome would confirm or refute the hypothesis.
- `stop_decision`: emitted only if no useful experiment remains under the current measurement budget.
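
As a sketch of how this evidence could be carried between components, the shape below mirrors the field list above. The `HarnessOutput`, `BottleneckHypothesis`, and `CandidateAction` classes are hypothetical illustrations, not an existing AITuner API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BottleneckHypothesis:
    name: str            # e.g. "decode_tpot"
    confidence: float    # relative belief in [0, 1] given current evidence
    evidence: list       # measured observations supporting the hypothesis


@dataclass
class CandidateAction:
    patch: dict          # config delta, e.g. {"max_num_seqs": 64}
    predicted_effect: str
    risk: str


@dataclass
class HarnessOutput:
    profile: dict                               # measured workload and engine traits
    bottleneck_diagnosis: list                  # ranked BottleneckHypothesis entries
    candidate_actions: list                     # CandidateAction entries
    experiment_plan: Optional[dict] = None      # next config, rationale, confirm/refute conditions
    stop_decision: Optional[str] = None         # set only when no useful experiment remains
```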

### 2. Separate profiling from tuning

AITuner currently learns mostly from full trial results. That is too coarse. The harness needs a profile layer that extracts bottleneck signals from every probe:

- workload profile: input/output token distributions, cache reuse, burstiness, selected-request count per sampling threshold;
- latency profile: TTFT and TPOT mean/p50/p95/p99 by threshold and optionally by input/output token bucket;
- engine profile: prefill/decode throughput, waiting/running request counts, KV cache usage, preemptions, GPU utilization if available;
- topology profile: TP/DP/EP product, per-GPU request rate, launch stability, memory pressure;
- measurement profile: search high/low/tolerance, which thresholds were measured, whether failures were early-stop artifacts or true low-load infeasibility.

This profile should be persisted per trial so later decisions do not depend on prompt memory.
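
A minimal sketch of such a persisted record, assuming a hypothetical `TrialProfile` dataclass whose fields mirror the profile layers above; the exact metrics depend on what the engine exposes.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class TrialProfile:
    """Per-trial evidence persisted alongside the trial result (illustrative shape)."""
    trial_id: str
    config: dict        # topology + runtime knobs used for this trial
    workload: dict      # token distributions, cache reuse, burstiness, ...
    latency: dict       # ttft/tpot mean/p50/p95/p99 per measured threshold
    engine: dict        # prefill/decode throughput, KV usage, preemptions, ...
    topology: dict      # tp/dp/ep, per-GPU request rate, launch stability
    measurement: dict   # search high/low/tolerance, measured thresholds, failure kinds


def persist_profile(profile: TrialProfile, path: str) -> None:
    # One JSON file per trial so later planning does not rely on prompt memory.
    with open(path, "w") as f:
        json.dump(asdict(profile), f, indent=2)
```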

### 3. Diagnose bottlenecks from counters and slopes

A professional tuning loop should not infer bottlenecks from a single failure reason. It should compare how metrics change across thresholds and configs.

Useful diagnostics:

- TTFT-bound: TTFT p95/p99 rises sharply with request rate, long prompts dominate failures, prefill throughput is saturated, decode TPOT is still healthy.
- TPOT-bound: TPOT p95/p99 fails while TTFT is acceptable, decode throughput or per-token latency dominates, high sequence concurrency or batching creates token-latency tails.
- Admission/queueing-bound: waiting requests and arrival lag grow, both TTFT and TPOT may degrade, utilization may be high even though per-request service time is not the limiting factor.
- Memory/KV-bound: KV cache usage, preemptions, OOM/launch failures, or max model length/cache pressure limit throughput.
- Topology/communication-bound: increasing TP lowers per-request latency but may reduce per-GPU throughput; DP improves admission but may not reduce per-request latency.

The diagnosis should be a ranked list with confidence, not a single label:

```json
{
  "bottlenecks": [
    {
      "name": "decode_tpot",
      "confidence": 0.62,
      "evidence": ["tpot_p95 fails at last infeasible probe", "ttft_p95 remains below SLO"]
    },
    {
      "name": "prefill_ttft",
      "confidence": 0.31,
      "evidence": ["long prompt tail", "ttft_p99 grows near knee"]
    }
  ]
}
```
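
One possible way to produce this ranking is a small evidence-accumulation heuristic over the latest profile. The sketch below is illustrative only: the metric names, thresholds, and weights are assumptions, not tuned values.

```python
def rank_bottlenecks(latency: dict, engine: dict, slo: dict) -> list:
    """Score each hypothesis by how much of the observed evidence points at it (sketch)."""
    scores, evidence = {}, {}

    # TPOT-bound: token latency violates the SLO while TTFT still has headroom.
    if latency.get("tpot_p95", 0.0) > slo["tpot"] and latency.get("ttft_p95", 0.0) <= slo["ttft"]:
        scores["decode_tpot"] = scores.get("decode_tpot", 0) + 2
        evidence.setdefault("decode_tpot", []).append("tpot_p95 fails while ttft_p95 is within SLO")

    # TTFT-bound: first-token tail latency exceeds the SLO near the knee.
    if latency.get("ttft_p99", 0.0) > slo["ttft"]:
        scores["prefill_ttft"] = scores.get("prefill_ttft", 0) + 2
        evidence.setdefault("prefill_ttft", []).append("ttft_p99 exceeds SLO near knee")

    # Memory/KV-bound: preemptions or near-full KV cache at the failing threshold.
    if engine.get("preemptions", 0) > 0 or engine.get("kv_cache_usage", 0.0) > 0.95:
        scores["kv_memory"] = scores.get("kv_memory", 0) + 1
        evidence.setdefault("kv_memory", []).append("preemptions or near-full KV cache")

    total = sum(scores.values()) or 1
    return [{"name": k, "confidence": v / total, "evidence": evidence[k]}
            for k, v in sorted(scores.items(), key=lambda kv: -kv[1])]
```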

### 4. Candidate actions come from a knob-effect model

The harness should maintain a generic model of knob effects, not case-specific rules.

Examples:

| Knob family | Expected effect | Main risks |
|---|---|---|
| Increase TP | lower per-request compute latency, may help TTFT/TPOT tails | communication overhead, lower per-GPU efficiency, fewer independent replicas |
| Increase DP | more replicas, better admission/queueing | does not reduce per-request compute latency, may waste memory |
| Increase EP | MoE expert distribution, possible decode/prefill balance | launch complexity, communication, only relevant with model/engine evidence |
| Lower max-num-seqs | reduce decode contention and tail TPOT | underutilization, lower throughput |
| Raise max-num-seqs | improve admission/concurrency | TPOT/TTFT tails, memory pressure |
| Raise max-num-batched-tokens | improve prefill batching and GPU occupancy | long-prefill HoL blocking, memory pressure |
| Lower max-num-batched-tokens | reduce long-prefill interference | underfilled prefill batches |
| Change block-size | cache fragmentation/allocator effects | launch/runtime instability, workload-specific |
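
A sketch of what generic `KnobEffectModel` entries could look like as data, mapping bottleneck hypotheses to the knob families in the table above. The snake_case knob identifiers and risk strings are illustrative; nothing testcase-specific is encoded.

```python
# Hypothetical KnobEffectModel entries: bottleneck hypothesis -> candidate knob families.
KNOB_EFFECTS = {
    "decode_tpot": [
        {"knob": "tensor_parallel_size", "direction": "+", "risk": "per-GPU efficiency loss"},
        {"knob": "max_num_seqs", "direction": "-", "risk": "underutilization, lower throughput"},
    ],
    "prefill_ttft": [
        {"knob": "max_num_batched_tokens", "direction": "+", "risk": "HoL blocking, memory pressure"},
        {"knob": "tensor_parallel_size", "direction": "+", "risk": "communication overhead"},
    ],
    "admission_queueing": [
        {"knob": "data_parallel_size", "direction": "+", "risk": "no per-request latency gain"},
        {"knob": "max_num_seqs", "direction": "+", "risk": "TPOT/TTFT tails, memory pressure"},
    ],
    "kv_memory": [
        {"knob": "max_num_seqs", "direction": "-", "risk": "lower throughput"},
        {"knob": "block_size", "direction": "change", "risk": "workload-specific, launch instability"},
    ],
}
```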

The harness should score candidates with:

```text
score = expected_bottleneck_relief
      + information_gain
      + launch_safety
      - regression_risk
      - measurement_cost
```

This lets it choose between "exploit the current incumbent" and "explore an unmeasured topology" from evidence, not a hardcoded order.
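
A minimal sketch of this score as a weighted sum, assuming each term has already been normalized to a comparable scale. The weights and the `score_candidate` helper are illustrative, not tuned values.

```python
def score_candidate(c: dict, weights: dict = None) -> float:
    """Combine the terms above into one comparable number (all terms assumed normalized)."""
    w = weights or {"relief": 1.0, "info": 1.0, "safety": 0.5, "risk": 1.0, "cost": 0.5}
    return (w["relief"] * c["expected_bottleneck_relief"]
            + w["info"] * c["information_gain"]
            + w["safety"] * c["launch_safety"]
            - w["risk"] * c["regression_risk"]
            - w["cost"] * c["measurement_cost"])
```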

### 5. Use hypothesis-driven experiments

Each trial should answer a specific question.

Bad:

- "Try TP=4 because previous run needed TP=4."

Good:

- "At the current best TP=2, the last infeasible probe is TPOT-bound. Increasing TP to 4 should reduce per-token compute latency but may reduce per-GPU efficiency. This trial tests whether the SLO is compute-latency limited or topology-overhead limited."

Every candidate should have:

- hypothesis;
- expected metric movement;
- risk;
- confirm condition;
- reject condition;
- fallback next family.
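
The "Good" example above, written as a structured candidate record with these fields; all values are illustrative.

```python
# Candidate experiment record (values illustrative).
candidate = {
    "hypothesis": "The SLO is compute-latency limited at TP=2; higher TP reduces per-token latency.",
    "patch": {"tensor_parallel_size": 4},
    "expected_metric_movement": "tpot_p95 drops at the previously infeasible threshold",
    "risk": "lower per-GPU efficiency; possible launch/communication overhead",
    "confirm_condition": "feasible threshold rises and req/s/GPU improves over the TP=2 incumbent",
    "reject_condition": "tpot_p95 unchanged or req/s/GPU drops despite a healthy launch",
    "fallback_next_family": "same-topology runtime knobs (max-num-seqs, max-num-batched-tokens)",
}
```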

### 6. Stop is a decision under measured search limits

Early stop should not mean "we found something good." It should mean:

- the configured `search.high` is saturated by a feasible incumbent; or
- the remaining high-value hypotheses are already measured or invalidated; or
- all remaining candidates have low expected information gain or high launch risk relative to the current best; or
- the baseline is infeasible even at the lowest load and the SLO is too tight.

For Fig18 raw mode, stop can be disabled or recorded as `stop` rather than filling the remaining columns with best-so-far.
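
A sketch of a `StopPolicy` check built directly from these conditions; the argument names and the `min_useful_score` cutoff are assumptions.

```python
def should_stop(search_high_saturated: bool,
                remaining_candidates: list,
                baseline_infeasible_at_min_load: bool,
                min_useful_score: float = 0.0) -> bool:
    """Stop only under the listed conditions, never just because the incumbent looks strong (sketch)."""
    if baseline_infeasible_at_min_load:
        return True      # SLO too tight even at the lowest load
    if search_high_saturated:
        return True      # configured search.high already saturated by a feasible incumbent
    # Otherwise stop only when no remaining candidate is worth its cost and risk.
    return all(c["score"] <= min_useful_score for c in remaining_candidates)
```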

## Proposed Architecture

### Components

1. `Profiler`
   - Input: trial result, probe history, engine logs/metrics if available, study spec.
   - Output: `TrialProfile`.

2. `BottleneckAnalyzer`
   - Input: recent `TrialProfile`s, workload profile, SLO.
   - Output: ranked bottleneck hypotheses.

3. `KnobEffectModel`
   - Generic mapping from bottleneck hypotheses to possible knob families.
   - Contains no model-specific or SLO-specific constants.

4. `ExperimentPlanner`
   - Generates candidate config patches.
   - Scores candidates by expected relief, information gain, risk, and cost.
   - Emits the next experiment and rationale.

5. `StopPolicy`
   - Uses measured search coverage and remaining candidate scores.
   - Does not stop just because a strong incumbent appears.

6. `PromptRenderer`
   - Renders the structured plan for the LLM when LLM involvement is needed.
   - The LLM can choose among candidates or refine rationale, but should not invent arbitrary directions without evidence.
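
A rough sketch of how one planning iteration could wire these components together; all method names here are assumptions, not existing AITuner interfaces. The Data Flow diagram below shows the same loop.

```python
def tuning_step(profiler, analyzer, knob_model, planner, stop_policy, trial_result, history):
    """One planning iteration over the components above (method names are illustrative)."""
    profile = profiler.build(trial_result)       # -> TrialProfile
    history.append(profile)

    hypotheses = analyzer.rank(history)          # -> ranked bottleneck hypotheses
    candidates = planner.generate(hypotheses, knob_model, history)
    scored = planner.score(candidates, history)  # relief + info gain + safety - risk - cost

    if stop_policy.should_stop(scored, history):
        return {"stop_decision": "no high-value experiment remains"}
    best = max(scored, key=lambda c: c["score"])
    return {"experiment_plan": best}             # next config + rationale for AITuner to evaluate
```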

### Data Flow

```text
study spec + workload trace
        |
        v
workload profile
        |
trial result/probes/logs --> trial profile
        |
        v
bottleneck analyzer --> ranked hypotheses
        |
        v
knob effect model --> candidate action set
        |
        v
experiment planner --> next config / stop
        |
        v
AITuner evaluates config over configured search range
```

## How This Handles the Current Failure

Current fixed TTFT4s + TPOT25 data:

- TP=2 reached `0.1992 req/s/GPU`.
- Later TP=2 local runtime probes did not improve on it.
- Min-prompt no-harness found TP=4 with `0.2696-0.2758 req/s/GPU`.

A profile-driven harness should have reasoned:

1. The current incumbent is good but not proof of an optimum.
2. The failed probes still show SLO pressure, especially in token-latency/tail behavior.
3. Higher TP is a candidate action because it may reduce per-request compute latency.
4. TP=4 is an unmeasured topology hypothesis within the allowed constraints.
5. Before spending many local TP=2 runtime trials or stopping, test TP=4 as a high-information experiment.

This is not because TP=4 is hardcoded. If TP=4 had already been measured and underperformed, the planner would prefer local runtime refinement or stopping. If the bottleneck were admission/queueing with TTFT/TPOT healthy, the planner might prefer DP or max-num-seqs instead.

## Implementation Plan

### Phase 1: Make the current harness evidence-first

- Persist a compact `TrialProfile` for every trial.
- Build ranked bottleneck hypotheses from probe sequences.
- Replace the single `active_bottleneck` with ranked hypotheses and evidence.
- Replace deterministic local refinement with candidate scoring.
- Keep existing rules only as entries in the generic `KnobEffectModel`.

### Phase 2: Candidate scoring

- Generate candidates from:
  - adjacent topology moves;
  - same-topology runtime moves;
  - rollback/avoidance after failures;
  - no-op stop.
- Score candidates by:
  - whether they directly target the top bottleneck;
  - whether they test an unmeasured high-information direction;
  - launch risk from prior failures;
  - expected impact on `request_rate_per_gpu`;
  - measurement cost.

### Phase 3: Controlled experiments

- Add ablations:
  - old strong-prompt no-harness;
  - min-prompt no-harness;
  - current rule harness;
  - profile-driven harness.
- Run each under the same SLO/workload/search.
- Report raw `perf[i]`, best-so-far, failed configs, and stop reason.

## Acceptance Criteria

The profile-driven harness is acceptable only if:

1. It does not encode model names, fixed SLO values, or known winning configs.
2. It can explain every proposal as a hypothesis tied to measured bottleneck evidence.
3. It does not early-stop while an unmeasured high-score candidate remains.
4. It improves on or matches min-prompt no-harness convergence on at least the qwen27b TTFT4s/TPOT25 setup.
5. It does not regress the previous stepped TTFT/TPOT50 setup.
6. It records enough evidence for a human engineer to audit why each trial was chosen.
|