# Profile-Driven Harness Design
## Problem
The current harness is too rule-like. It can help in a narrow setup, but it can also overfit to a previous observation such as "TP=2 is enough" and then fail when the SLO or workload changes. Adding one more rule for each failure mode is not acceptable.
The harness should make AITuner behave more like a performance engineer:
- Profile the workload and engine behavior.
- Diagnose the active bottleneck from measurements.
- Form hypotheses about which knob families can relieve that bottleneck.
- Choose the next experiment that best disambiguates or improves the system.
- Interpret the result and update the diagnosis.
- Stop only when the measured search space is exhausted or the remaining hypotheses have low value.
The harness should not encode a fixed answer for qwen27b, TP=4, TPOT25, or any testcase-specific threshold.
## Design Principles
### 1. Harness is an evidence system, not a prompt trick
The harness should provide structured evidence and a decision protocol. It should not simply add more natural-language hints to the prompt.
The output should include the following fields (a sketch follows the list):

- `profile`: measured workload and engine traits.
- `bottleneck_diagnosis`: ranked bottleneck hypotheses with evidence.
- `candidate_actions`: candidate config changes with predicted effect and risk.
- `experiment_plan`: the next config to test, why it is informative, and what outcome would confirm or refute the hypothesis.
- `stop_decision`: emitted only if no useful experiment remains under the current measurement budget.
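One possible shape for this envelope, as a minimal Python sketch; the `Hypothesis` and `HarnessOutput` names and fields are illustrative, not an existing schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hypothesis:
    name: str            # e.g. "decode_tpot"
    confidence: float    # relative belief in [0, 1]
    evidence: list[str]  # measured observations supporting the hypothesis

@dataclass
class HarnessOutput:
    profile: dict                           # measured workload and engine traits
    bottleneck_diagnosis: list[Hypothesis]  # ranked, highest confidence first
    candidate_actions: list[dict]           # config patches with predicted effect and risk
    experiment_plan: Optional[dict]         # next config, rationale, confirm/refute conditions
    stop_decision: Optional[str] = None     # set only when no useful experiment remains
```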
### 2. Separate profiling from tuning
AITuner currently learns mostly from full trial results. That is too coarse. The harness needs a profile layer that extracts bottleneck signals from every probe:
- workload profile: input/output token distributions, cache reuse, burstiness, selected-request count per sampling threshold;
- latency profile: TTFT and TPOT mean/p50/p95/p99 by threshold and optionally by input/output token bucket;
- engine profile: prefill/decode throughput, waiting/running request counts, KV cache usage, preemptions, GPU utilization if available;
- topology profile: TP/DP/EP product, per-GPU request rate, launch stability, memory pressure;
- measurement profile: search high/low/tolerance, which thresholds were measured, whether failures were early-stop artifacts or true low-load infeasibility.
This profile should be persisted per trial so later decisions do not depend on prompt memory.
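As one way to make that persistence concrete, a minimal sketch follows; every field name here is illustrative rather than an existing schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrialProfile:
    """Compact, persisted record of the evidence gathered in one trial."""
    trial_id: str
    config: dict        # knob values used (TP/DP/EP, max-num-seqs, ...)
    workload: dict      # token distributions, cache reuse, burstiness
    latency: dict       # TTFT/TPOT mean/p50/p95/p99 per threshold
    engine: dict        # prefill/decode throughput, KV usage, preemptions
    topology: dict      # TP/DP/EP product, per-GPU request rate, launch stability
    measurement: dict   # search high/low/tolerance, measured thresholds, raw probes
    failure_kind: Optional[str] = None  # "early_stop_artifact" or "true_infeasible"
```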
### 3. Diagnose bottlenecks from counters and slopes
A professional tuning loop should not infer bottlenecks from a single failed reason. It should compare how metrics change across thresholds and configs.
Useful diagnostics:
- TTFT-bound: TTFT p95/p99 rises sharply with request rate, long prompts dominate failures, prefill throughput is saturated, decode TPOT is still healthy.
- TPOT-bound: TPOT p95/p99 fails while TTFT is acceptable, decode throughput or per-token latency dominates, high sequence concurrency or batching creates token latency tails.
- Admission/queueing-bound: waiting requests and arrival lag grow, both TTFT and TPOT may degrade, utilization may be high but service time per request is not the only issue.
- Memory/KV-bound: KV cache usage, preemptions, OOM/launch failures, or max model length/cache pressure limit throughput.
- Topology/communication-bound: increasing TP lowers per-request latency but may reduce per-GPU throughput; DP improves admission but may not reduce per-request latency.
The diagnosis should be a ranked list with confidence, not a single label:
```json
{
  "bottlenecks": [
    {
      "name": "decode_tpot",
      "confidence": 0.62,
      "evidence": ["tpot_p95 fails at last infeasible probe", "ttft_p95 remains below SLO"]
    },
    {
      "name": "prefill_ttft",
      "confidence": 0.31,
      "evidence": ["long prompt tail", "ttft_p99 grows near knee"]
    }
  ]
}
```
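To illustrate how such a ranking could be derived from probe sequences rather than a single failure reason, here is a sketch that reuses the `Hypothesis` dataclass from the envelope sketch above; the probe fields, thresholds, and confidence values are all illustrative assumptions:

```python
def diagnose(probes: list[dict], slo: dict) -> list[Hypothesis]:
    """Rank bottleneck hypotheses from per-threshold probe measurements.

    Each probe is assumed to carry request_rate, ttft_p95, tpot_p95,
    and waiting_requests for one measured threshold.
    """
    first, last = probes[0], probes[-1]
    hypotheses: list[Hypothesis] = []

    # TPOT-bound: token latency violates the SLO while TTFT stays healthy.
    if last["tpot_p95"] > slo["tpot"] and last["ttft_p95"] <= slo["ttft"]:
        hypotheses.append(Hypothesis(
            name="decode_tpot", confidence=0.6,
            evidence=["tpot_p95 fails at last infeasible probe",
                      "ttft_p95 remains below SLO"]))

    # TTFT-bound: first-token latency rises sharply with request rate.
    rate_span = max(last["request_rate"] - first["request_rate"], 1e-9)
    ttft_slope = (last["ttft_p95"] - first["ttft_p95"]) / rate_span
    if last["ttft_p95"] > slo["ttft"]:
        hypotheses.append(Hypothesis(
            name="prefill_ttft", confidence=0.4,
            evidence=[f"ttft_p95 slope {ttft_slope:.2f}s per unit rate"]))

    # Queueing-bound: waiting requests grow markedly across thresholds.
    if last["waiting_requests"] > 2 * max(first["waiting_requests"], 1):
        hypotheses.append(Hypothesis(
            name="admission_queueing", confidence=0.3,
            evidence=["waiting_requests grows across thresholds"]))

    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)
```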
### 4. Candidate actions come from a knob-effect model
The harness should maintain a generic model of knob effects, not case-specific rules.
Examples:
| Knob family | Expected effect | Main risks |
|---|---|---|
| Increase TP | lower per-request compute latency, may help TTFT/TPOT tails | communication overhead, lower per-GPU efficiency, fewer independent replicas |
| Increase DP | more replicas, better admission/queueing | does not reduce per-request compute latency, may waste memory |
| Increase EP | MoE expert distribution, possible decode/prefill balance | launch complexity, communication, only relevant with model/engine evidence |
| Lower max-num-seqs | reduce decode contention and tail TPOT | underutilization, lower throughput |
| Raise max-num-seqs | improve admission/concurrency | TPOT/TTFT tails, memory pressure |
| Raise max-num-batched-tokens | improve prefill batching and GPU occupancy | long-prefill HoL blocking, memory pressure |
| Lower max-num-batched-tokens | reduce long-prefill interference | underfilled prefill batches |
| Change block-size | cache fragmentation/allocator effects | launch/runtime instability, workload-specific |
The harness should score candidates with:
```
score = expected_bottleneck_relief
      + information_gain
      + launch_safety
      - regression_risk
      - measurement_cost
```
This lets it choose between "exploit current incumbent" and "explore unmeasured topology" from evidence, not hardcoded order.
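A minimal sketch of that scoring, with hand-picked weights as stand-ins; the `Candidate` fields and every constant are assumptions, not calibrated values:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    patch: dict                  # config change, e.g. {"tensor_parallel_size": 4}
    targets_top_bottleneck: bool
    unmeasured_direction: bool   # no prior trial in this knob family / topology
    prior_launch_failures: int   # launch failures seen for similar configs
    predicted_relief: float      # expected SLO-headroom gain, normalized to [0, 1]
    trial_cost: float            # normalized measurement cost in [0, 1]

def score(c: Candidate) -> float:
    relief = c.predicted_relief if c.targets_top_bottleneck else 0.3 * c.predicted_relief
    information_gain = 0.5 if c.unmeasured_direction else 0.1
    launch_safety = 0.2 / (1 + c.prior_launch_failures)
    regression_risk = 0.1 * c.prior_launch_failures
    return relief + information_gain + launch_safety - regression_risk - c.trial_cost
```

Because information gain rewards unmeasured directions, an untested topology such as TP=4 can outscore one more local refinement of the incumbent without any hardcoded preference for it.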
### 5. Use hypothesis-driven experiments
Each trial should answer a specific question.
Bad:
- "Try TP=4 because previous run needed TP=4."
Good:
- "At the current best TP=2, the last infeasible probe is TPOT-bound. Increasing TP to 4 should reduce per-token compute latency but may reduce per-GPU efficiency. This trial tests whether the SLO is compute-latency limited or topology-overhead limited."
Every candidate should have, as sketched after this list:
- hypothesis;
- expected metric movement;
- risk;
- confirm condition;
- reject condition;
- fallback next family.
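A sketch of such a candidate record, with the TP=4 probe from the "Good" example expressed in this form; the predicate fields over the resulting profile are an assumed interface, not an existing API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    hypothesis: str                  # what we believe limits performance
    patch: dict                      # the config change under test
    expected_movement: str           # which metric should move, and in which direction
    risk: str                        # known failure modes of this move
    confirm: Callable[[dict], bool]  # predicate over the resulting profile
    reject: Callable[[dict], bool]
    fallback_family: str             # next knob family to try if rejected

tp4_probe = Experiment(
    hypothesis="the SLO is compute-latency limited, not topology-overhead limited",
    patch={"tensor_parallel_size": 4},
    expected_movement="tpot_p95 drops below the SLO at the previously infeasible rate",
    risk="lower per-GPU efficiency, fewer independent replicas",
    confirm=lambda p: p["tpot_p95"] <= p["slo_tpot"],
    reject=lambda p: p["req_rate_per_gpu"] < p["incumbent_rate"],
    fallback_family="max-num-seqs",
)
```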
### 6. Stop is a decision under measured search limits
Early stop should not mean "we found something good." It should mean:
- the configured `search.high` is saturated by a feasible incumbent; or
- the remaining high-value hypotheses are already measured or invalidated; or
- all remaining candidates have low expected information gain or high launch risk relative to the current best; or
- the baseline is infeasible even at the lowest load and the SLO is too tight.
For Fig18 raw mode, early stop can be disabled, or recorded as an explicit stop rather than filling the remaining columns with best-so-far values.
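A stop policy consistent with these conditions might look like the following sketch, reusing `Candidate` and `score()` from the scoring sketch above; the incumbent flags and the score floor are illustrative assumptions:

```python
from typing import Optional

def should_stop(candidates: list[Candidate], incumbent: dict) -> Optional[str]:
    """Return a stop reason, or None if a useful experiment remains."""
    if incumbent.get("baseline_infeasible_at_min_load"):
        return "SLO infeasible even at the lowest load"
    if incumbent.get("feasible_at_search_high"):
        return "search.high saturated by a feasible incumbent"
    if not candidates:
        return "all high-value hypotheses measured or invalidated"
    if max(score(c) for c in candidates) < 0.1:  # illustrative floor, not calibrated
        return "remaining candidates have low expected value or high risk"
    return None
```

Note that a strong incumbent alone never triggers a stop here; the incumbent only ends the search when it saturates the configured search range.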
## Proposed Architecture
### Components
- `Profiler`
  - Input: trial result, probe history, engine logs/metrics if available, study spec.
  - Output: `TrialProfile`.
- `BottleneckAnalyzer`
  - Input: recent `TrialProfile`s, workload profile, SLO.
  - Output: ranked bottleneck hypotheses.
- `KnobEffectModel`
  - Generic mapping from bottleneck hypotheses to possible knob families.
  - Contains no model-specific or SLO-specific constants.
- `ExperimentPlanner`
  - Generates candidate config patches.
  - Scores candidates by expected relief, information gain, risk, and cost.
  - Emits the next experiment and rationale.
- `StopPolicy`
  - Uses measured search coverage and remaining candidate scores.
  - Does not stop just because a strong incumbent appears.
- `PromptRenderer`
  - Renders the structured plan for the LLM when LLM involvement is needed.
  - The LLM can choose among candidates or refine rationale, but should not invent arbitrary directions without evidence.
### Data Flow
```
study spec + workload trace
            |
            v
     workload profile
            |
trial result/probes/logs --> trial profile
            |
            v
   bottleneck analyzer --> ranked hypotheses
            |
            v
    knob effect model --> candidate action set
            |
            v
   experiment planner --> next config / stop
            |
            v
AITuner evaluates config over configured search range
```
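Wired together, one iteration of the loop could look like the following sketch; `knob_effect_model` is a hypothetical function mapping hypotheses to `Candidate`s, and the other helpers come from the earlier sketches:

```python
def tuning_step(history: list[TrialProfile], slo: dict, incumbent: dict) -> dict:
    """One iteration: profile -> diagnose -> plan -> next experiment or stop."""
    latest = history[-1]
    hypotheses = diagnose(latest.measurement["probes"], slo)
    candidates = knob_effect_model(hypotheses, history)  # hypothetical mapping
    reason = should_stop(candidates, incumbent)
    if reason is not None:
        return {"stop_decision": reason}
    best = max(candidates, key=score)
    top = hypotheses[0] if hypotheses else None
    rationale = (f"targets {top.name} (confidence {top.confidence:.2f})"
                 if top else "exploration: no confident bottleneck hypothesis")
    return {
        "bottleneck_diagnosis": hypotheses,
        "experiment_plan": best.patch,
        "rationale": rationale,
    }
```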
## How This Handles the Current Failure
Data from the current fixed TTFT4s + TPOT25 setup:
- TP=2 reached `0.1992` req/s/GPU.
- Later TP=2 local runtime probes did not improve.
- Min-prompt no-harness found TP=4 with `0.2696-0.2758` req/s/GPU.
A profile-driven harness should have reasoned:
- The current incumbent is good but not proof of optimum.
- The failed probes still show SLO pressure, especially token-latency/tail behavior.
- Higher TP is a candidate action because it may reduce per-request compute latency.
- TP=4 is an unmeasured topology hypothesis within allowed constraints.
- Before spending many local TP=2 runtime trials or stopping, test TP=4 as a high-information experiment.
This is not because TP=4 is hardcoded. If TP=4 had already been measured and underperformed, the planner would prefer local runtime refinement or stop. If the bottleneck were admission/queueing with TTFT/TPOT healthy, the planner might prefer DP or max-num-seqs instead.
## Implementation Plan
### Phase 1: Make the current harness evidence-first
- Persist a compact `TrialProfile` for every trial.
- Build ranked bottleneck hypotheses from probe sequences.
- Replace the single `active_bottleneck` with ranked hypotheses and evidence.
- Replace deterministic local refinement with candidate scoring.
- Keep existing rules only as entries in the generic `KnobEffectModel`.
### Phase 2: Candidate scoring
- Generate candidates from:
- adjacent topology moves;
- same-topology runtime moves;
- rollback/avoidance after failures;
- no-op stop.
- Score candidates by:
- whether they directly target the top bottleneck;
- whether they test an unmeasured high-information direction;
- launch risk from prior failures;
- expected impact on request_rate_per_gpu;
- measurement cost.
### Phase 3: Controlled experiments
- Add ablations:
- old strong-prompt no-harness;
- min-prompt no-harness;
- current rule harness;
- profile-driven harness.
- Run each under the same SLO/workload/search.
- Report raw `perf[i]`, best-so-far, failed configs, and stop reason.
## Acceptance Criteria
The profile-driven harness is acceptable only if:
- It does not encode model names, fixed SLO values, or known winning configs.
- It can explain every proposal as a hypothesis tied to measured bottleneck evidence.
- It does not early-stop while an unmeasured high-score candidate remains.
- It improves or matches min-prompt no-harness convergence on at least the qwen27b TTFT4s/TPOT25 setup.
- It does not regress the previous stepped TTFT/TPOT50 setup.
- It records enough evidence for a human engineer to audit why each trial was chosen.