# Profile-Driven Harness Design
## Problem
The current harness is too rule-like. It can help in a narrow setup, but it can also overfit to a previous observation such as "TP=2 is enough" and then fail when the SLO or workload changes. Adding one more rule for each failure mode is not acceptable.
The harness should make AITuner behave more like a performance engineer:
- Profile the workload and engine behavior.
- Diagnose the active bottleneck from measurements.
- Form hypotheses about which knob families can relieve that bottleneck.
- Choose the next experiment that best disambiguates or improves the system.
- Interpret the result and update the diagnosis.
- Stop only when the measured search space is exhausted or the remaining hypotheses have low value.
The harness should not encode a fixed answer for qwen27b, TP=4, TPOT25, or any testcase-specific threshold.
## Design Principles
### 1. Harness is an evidence system, not a prompt trick
The harness should provide structured evidence and a decision protocol. It should not simply add more natural-language hints to the prompt.
The output should include the following fields (a sketch follows the list):

- `profile`: measured workload and engine traits.
- `bottleneck_diagnosis`: ranked bottleneck hypotheses with evidence.
- `candidate_actions`: candidate config changes with predicted effect and risk.
- `experiment_plan`: the next config to test, why it is informative, and what outcome would confirm or refute the hypothesis.
- `stop_decision`: emitted only if no useful experiment remains under the current measurement budget.
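One possible shape for this envelope, as a minimal Python sketch; the `Hypothesis` and `HarnessOutput` names and fields are illustrative, not an existing schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hypothesis:
    name: str            # e.g. "decode_tpot"
    confidence: float    # relative belief in [0, 1]
    evidence: list[str]  # measured observations supporting the hypothesis

@dataclass
class HarnessOutput:
    profile: dict                           # measured workload and engine traits
    bottleneck_diagnosis: list[Hypothesis]  # ranked, highest confidence first
    candidate_actions: list[dict]           # config patches with predicted effect and risk
    experiment_plan: Optional[dict]         # next config, rationale, confirm/refute conditions
    stop_decision: Optional[str] = None     # set only when no useful experiment remains
```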
### 2. Separate profiling from tuning
AITuner currently learns mostly from full trial results. That is too coarse. The harness needs a profile layer that extracts bottleneck signals from every probe:
- workload profile: input/output token distributions, cache reuse, burstiness, selected-request count per sampling threshold;
- latency profile: TTFT and TPOT mean/p50/p95/p99 by threshold and optionally by input/output token bucket;
- engine profile: prefill/decode throughput, waiting/running request counts, KV cache usage, preemptions, GPU utilization if available;
- topology profile: TP/DP/EP product, per-GPU request rate, launch stability, memory pressure;
- measurement profile: search high/low/tolerance, which thresholds were measured, whether failures were early-stop artifacts or true low-load infeasibility.
This profile should be persisted per trial so later decisions do not depend on prompt memory.
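As one way to make that persistence concrete, a minimal sketch follows; every field name here is illustrative rather than an existing schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrialProfile:
    """Compact, persisted record of the evidence gathered in one trial."""
    trial_id: str
    config: dict        # knob values used (TP/DP/EP, max-num-seqs, ...)
    workload: dict      # token distributions, cache reuse, burstiness
    latency: dict       # TTFT/TPOT mean/p50/p95/p99 per threshold
    engine: dict        # prefill/decode throughput, KV usage, preemptions
    topology: dict      # TP/DP/EP product, per-GPU request rate, launch stability
    measurement: dict   # search high/low/tolerance, measured thresholds, raw probes
    failure_kind: Optional[str] = None  # "early_stop_artifact" or "true_infeasible"
```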
### 3. Diagnose bottlenecks from counters and slopes
A professional tuning loop should not infer bottlenecks from a single failed reason. It should compare how metrics change across thresholds and configs.
Useful diagnostics:
- TTFT-bound: TTFT p95/p99 rises sharply with request rate, long prompts dominate failures, prefill throughput is saturated, decode TPOT is still healthy.
- TPOT-bound: TPOT p95/p99 fails while TTFT is acceptable, decode throughput or per-token latency dominates, high sequence concurrency or batching creates token latency tails.
- Admission/queueing-bound: waiting requests and arrival lag grow, both TTFT and TPOT may degrade, utilization may be high but service time per request is not the only issue.
- Memory/KV-bound: KV cache usage, preemptions, OOM/launch failures, or max model length/cache pressure limit throughput.
- Topology/communication-bound: increasing TP lowers per-request latency but may reduce per-GPU throughput; DP improves admission but may not reduce per-request latency.
The diagnosis should be a ranked list with confidence, not a single label:
```json
{
  "bottlenecks": [
    {
      "name": "decode_tpot",
      "confidence": 0.62,
      "evidence": ["tpot_p95 fails at last infeasible probe", "ttft_p95 remains below SLO"]
    },
    {
      "name": "prefill_ttft",
      "confidence": 0.31,
      "evidence": ["long prompt tail", "ttft_p99 grows near knee"]
    }
  ]
}
```
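To illustrate how such a ranking could be derived from probe sequences rather than a single failure reason, here is a sketch that reuses the `Hypothesis` dataclass from the envelope sketch above; the probe fields, thresholds, and confidence values are all illustrative assumptions:

```python
def diagnose(probes: list[dict], slo: dict) -> list[Hypothesis]:
    """Rank bottleneck hypotheses from per-threshold probe measurements.

    Each probe is assumed to carry request_rate, ttft_p95, tpot_p95,
    and waiting_requests for one measured threshold.
    """
    first, last = probes[0], probes[-1]
    hypotheses: list[Hypothesis] = []

    # TPOT-bound: token latency violates the SLO while TTFT stays healthy.
    if last["tpot_p95"] > slo["tpot"] and last["ttft_p95"] <= slo["ttft"]:
        hypotheses.append(Hypothesis(
            name="decode_tpot", confidence=0.6,
            evidence=["tpot_p95 fails at last infeasible probe",
                      "ttft_p95 remains below SLO"]))

    # TTFT-bound: first-token latency rises sharply with request rate.
    rate_span = max(last["request_rate"] - first["request_rate"], 1e-9)
    ttft_slope = (last["ttft_p95"] - first["ttft_p95"]) / rate_span
    if last["ttft_p95"] > slo["ttft"]:
        hypotheses.append(Hypothesis(
            name="prefill_ttft", confidence=0.4,
            evidence=[f"ttft_p95 slope {ttft_slope:.2f}s per unit rate"]))

    # Queueing-bound: waiting requests grow markedly across thresholds.
    if last["waiting_requests"] > 2 * max(first["waiting_requests"], 1):
        hypotheses.append(Hypothesis(
            name="admission_queueing", confidence=0.3,
            evidence=["waiting_requests grows across thresholds"]))

    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
```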
### 4. Candidate actions come from a knob-effect model
The harness should maintain a generic model of knob effects, not case-specific rules.
Examples:
| Knob family | Expected effect | Main risks |
|---|---|---|
| Increase TP | lower per-request compute latency, may help TTFT/TPOT tails | communication overhead, lower per-GPU efficiency, fewer independent replicas |
| Increase DP | more replicas, better admission/queueing | does not reduce per-request compute latency, may waste memory |
| Increase EP | MoE expert distribution, possible decode/prefill balance | launch complexity, communication, only relevant with model/engine evidence |
| Lower max-num-seqs | reduce decode contention and tail TPOT | underutilization, lower throughput |
| Raise max-num-seqs | improve admission/concurrency | TPOT/TTFT tails, memory pressure |
| Raise max-num-batched-tokens | improve prefill batching and GPU occupancy | long-prefill HoL blocking, memory pressure |
| Lower max-num-batched-tokens | reduce long-prefill interference | underfilled prefill batches |
| Change block-size | cache fragmentation/allocator effects | launch/runtime instability, workload-specific |
The harness should score candidates with:
```
score = expected_bottleneck_relief
      + information_gain
      + launch_safety
      - regression_risk
      - measurement_cost
```
This lets it choose between "exploit current incumbent" and "explore unmeasured topology" from evidence, not hardcoded order.
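A minimal sketch of that scoring, with hand-picked weights as stand-ins; the `Candidate` fields and every constant are assumptions, not calibrated values:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    patch: dict                  # config change, e.g. {"tensor_parallel_size": 4}
    targets_top_bottleneck: bool
    unmeasured_direction: bool   # no prior trial in this knob family / topology
    prior_launch_failures: int   # launch failures seen for similar configs
    predicted_relief: float      # expected SLO-headroom gain, normalized to [0, 1]
    trial_cost: float            # normalized measurement cost in [0, 1]

def score(c: Candidate) -> float:
    relief = c.predicted_relief if c.targets_top_bottleneck else 0.3 * c.predicted_relief
    information_gain = 0.5 if c.unmeasured_direction else 0.1
    launch_safety = 0.2 / (1 + c.prior_launch_failures)
    regression_risk = 0.1 * c.prior_launch_failures
    return relief + information_gain + launch_safety - regression_risk - c.trial_cost
```

Because information gain rewards unmeasured directions, an untested topology such as TP=4 can outscore one more local refinement of the incumbent without any hardcoded preference for it.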
### 5. Use hypothesis-driven experiments
Each trial should answer a specific question.
Bad:
- "Try TP=4 because previous run needed TP=4."
Good:
- "At the current best TP=2, the last infeasible probe is TPOT-bound. Increasing TP to 4 should reduce per-token compute latency but may reduce per-GPU efficiency. This trial tests whether the SLO is compute-latency limited or topology-overhead limited."
Every candidate should have, as sketched after this list:
- hypothesis;
- expected metric movement;
- risk;
- confirm condition;
- reject condition;
- fallback next family.
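A sketch of such a candidate record, with the TP=4 probe from the "Good" example expressed in this form; the predicate fields over the resulting profile are an assumed interface, not an existing API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    hypothesis: str                  # what we believe limits performance
    patch: dict                      # the config change under test
    expected_movement: str           # which metric should move, and in which direction
    risk: str                        # known failure modes of this move
    confirm: Callable[[dict], bool]  # predicate over the resulting profile
    reject: Callable[[dict], bool]
    fallback_family: str             # next knob family to try if rejected

tp4_probe = Experiment(
    hypothesis="the SLO is compute-latency limited, not topology-overhead limited",
    patch={"tensor_parallel_size": 4},
    expected_movement="tpot_p95 drops below the SLO at the previously infeasible rate",
    risk="lower per-GPU efficiency, fewer independent replicas",
    confirm=lambda p: p["tpot_p95"] <= p["slo_tpot"],
    reject=lambda p: p["req_rate_per_gpu"] < p["incumbent_rate"],
    fallback_family="max-num-seqs",
)
```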
### 6. Stop is a decision under measured search limits
Early stop should not mean "we found something good." It should mean:
- the configured `search.high` is saturated by a feasible incumbent; or
- the remaining high-value hypotheses are already measured or invalidated; or
- all remaining candidates have low expected information gain or high launch risk relative to the current best; or
- the baseline is infeasible even at the lowest load and the SLO is too tight.
For Fig18 raw mode, early stop can be disabled, or recorded as an explicit stop rather than filling the remaining columns with best-so-far values.
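A stop policy consistent with these conditions might look like the following sketch, reusing `Candidate` and `score()` from the scoring sketch above; the incumbent flags and the score floor are illustrative assumptions:

```python
from typing import Optional

def should_stop(candidates: list[Candidate], incumbent: dict) -> Optional[str]:
    """Return a stop reason, or None if a useful experiment remains."""
    if incumbent.get("baseline_infeasible_at_min_load"):
        return "SLO infeasible even at the lowest load"
    if incumbent.get("feasible_at_search_high"):
        return "search.high saturated by a feasible incumbent"
    if not candidates:
        return "all high-value hypotheses measured or invalidated"
    if max(score(c) for c in candidates) < 0.1:  # illustrative floor, not calibrated
        return "remaining candidates have low expected value or high risk"
    return None
```

Note that a strong incumbent alone never triggers a stop here; the incumbent only ends the search when it saturates the configured search range.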
## Proposed Architecture
### Components
- `Profiler`
  - Input: trial result, probe history, engine logs/metrics if available, study spec.
  - Output: `TrialProfile`.
- `BottleneckAnalyzer`
  - Input: recent `TrialProfile`s, workload profile, SLO.
  - Output: ranked bottleneck hypotheses.
- `KnobEffectModel`
  - Generic mapping from bottleneck hypotheses to possible knob families.
  - Contains no model-specific or SLO-specific constants.
- `ExperimentPlanner`
  - Generates candidate config patches.
  - Scores candidates by expected relief, information gain, risk, and cost.
  - Emits the next experiment and rationale.
- `StopPolicy`
  - Uses measured search coverage and remaining candidate scores.
  - Does not stop just because a strong incumbent appears.
- `PromptRenderer`
  - Renders the structured plan for the LLM when LLM involvement is needed.
  - The LLM can choose among candidates or refine rationale, but should not invent arbitrary directions without evidence.
### Data Flow
```
study spec + workload trace
            |
            v
     workload profile
            |
trial result/probes/logs --> trial profile
            |
            v
   bottleneck analyzer --> ranked hypotheses
            |
            v
    knob effect model --> candidate action set
            |
            v
   experiment planner --> next config / stop
            |
            v
AITuner evaluates config over configured search range
```
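Wired together, one iteration of the loop could look like the following sketch; `knob_effect_model` is a hypothetical function mapping hypotheses to `Candidate`s, and the other helpers come from the earlier sketches:

```python
def tuning_step(history: list[TrialProfile], slo: dict, incumbent: dict) -> dict:
    """One iteration: profile -> diagnose -> plan -> next experiment or stop."""
    latest = history[-1]
    hypotheses = diagnose(latest.measurement["probes"], slo)
    candidates = knob_effect_model(hypotheses, history)  # hypothetical mapping
    reason = should_stop(candidates, incumbent)
    if reason is not None:
        return {"stop_decision": reason}
    best = max(candidates, key=score)
    top = hypotheses[0] if hypotheses else None
    rationale = (f"targets {top.name} (confidence {top.confidence:.2f})"
                 if top else "exploration: no confident bottleneck hypothesis")
    return {
        "bottleneck_diagnosis": hypotheses,
        "experiment_plan": best.patch,
        "rationale": rationale,
    }
```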
## How This Handles the Current Failure
Data from the current fixed TTFT4s + TPOT25 setup:
- TP=2 reached `0.1992` req/s/GPU.
- Later TP=2 local runtime probes did not improve.
- Min-prompt no-harness found TP=4 with `0.2696-0.2758` req/s/GPU.
A profile-driven harness should have reasoned:
- The current incumbent is good but not proof of optimum.
- The failed probes still show SLO pressure, especially token-latency/tail behavior.
- Higher TP is a candidate action because it may reduce per-request compute latency.
- TP=4 is an unmeasured topology hypothesis within allowed constraints.
- Before spending many local TP=2 runtime trials or stopping, test TP=4 as a high-information experiment.
This is not because TP=4 is hardcoded. If TP=4 had already been measured and underperformed, the planner would prefer local runtime refinement or stop. If the bottleneck were admission/queueing with TTFT/TPOT healthy, the planner might prefer DP or max-num-seqs instead.
## Implementation Plan
### Phase 1: Make the current harness evidence-first
- Persist a compact `TrialProfile` for every trial.
- Build ranked bottleneck hypotheses from probe sequences.
- Replace the single `active_bottleneck` with ranked hypotheses and evidence.
- Replace deterministic local refinement with candidate scoring.
- Keep existing rules only as entries in the generic `KnobEffectModel`.
### Phase 2: Candidate scoring
- Generate candidates from:
- adjacent topology moves;
- same-topology runtime moves;
- rollback/avoidance after failures;
- no-op stop.
- Score candidates by:
- whether they directly target the top bottleneck;
- whether they test an unmeasured high-information direction;
- launch risk from prior failures;
- expected impact on request_rate_per_gpu;
- measurement cost.
### Phase 3: Controlled experiments
- Add ablations:
- old strong-prompt no-harness;
- min-prompt no-harness;
- current rule harness;
- profile-driven harness.
- Run each under the same SLO/workload/search.
- Report raw `perf[i]`, best-so-far, failed configs, and stop reason.
## Acceptance Criteria
The profile-driven harness is acceptable only if:
- It does not encode model names, fixed SLO values, or known winning configs.
- It can explain every proposal as a hypothesis tied to measured bottleneck evidence.
- It does not early-stop while an unmeasured high-score candidate remains.
- It improves or matches min-prompt no-harness convergence on at least the qwen27b TTFT4s/TPOT25 setup.
- It does not regress the previous stepped TTFT/TPOT50 setup.
- It records enough evidence for a human engineer to audit why each trial was chosen.