agentic-kvc/analysis/characterization/agentic_dispatch_coupling.md
Gahow Wang 8ac41a8684 Agentic dispatch coupling: trace-replay session-sequentiality is realistic
The B3 audit flagged the trace replayer's "fire turn N+1 immediately
if turn N is behind schedule" semantics as a potential benchmark
crime, because under saturation the effective arrival process becomes
policy-dependent (slow policy -> longer session lifetimes -> more
concurrent in-flight -> harder system -> still slower). The audit
called this dispatch slip.

But in agentic workloads, turn N+1 is generated by a tool-call
response or an autonomous-loop step, not by a human reading the
previous reply. There is no inter-turn think-time. So the replayer's
"no think-time, sequential within session, fire-immediately-when-
ready" behavior is the correct model of agentic production, and the
feedback amplification is a real property of production systems
under saturation rather than an artifact of the replayer.

The note (analysis/characterization/agentic_dispatch_coupling.md)
lays out:
- The dispatch rule and the apparent feedback loop
- Why agentic workloads do not have user think-time
- Application of Little's Law: slower policy carries higher concurrent
  in-flight load, so the policy x feedback gap is real, not artifact
- Reframes B3 as the "production-replay" experiment and B4 as the
  orthogonal "controlled-load" experiment, complementary not
  hierarchical
- Calls the feedback amplification itself out as a finding worth
  reporting (e.g. unified's ~2x latency-p90 gap over lmetric in B3
  reflects both the routing improvement and the in-flight reduction)
- Contrasts with chat workloads (human think-time partially breaks
  the feedback loop, agentic removes that floor)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 01:00:25 +08:00


Agentic Dispatch Coupling: Why Session-Sequential Replay is the Realistic Mode

Date: 2026-05-26
Owner: characterization
Status: methodology note for paper framing

The observation

In replayer/replay.py:282-287, turn N of a session fires at:

t_fire(N) = max(
    turn_N_trace_timestamp,         # what the trace asked for
    turn_{N-1}_finish_wall_clock,   # but turn N-1 must complete first
)

When the system is fast, the second term loses → turn N fires at its trace timestamp → the replay matches the captured trace. When the system is slow, the second term dominates → turn N fires immediately after turn N-1 completes, with the trace timestamp ignored. The session's "inter-turn think time" collapses to zero.
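A minimal Python sketch of this rule, for illustration only. The function name `fire_time` and its arguments are hypothetical and are not taken from replayer/replay.py; times are assumed to be seconds since replay start.

```python
from typing import Optional


def fire_time(trace_timestamp: float, prev_finish: Optional[float]) -> float:
    """Return when turn N may fire, in seconds since replay start.

    trace_timestamp: offset of turn N in the captured trace (a lower bound).
    prev_finish: offset at which turn N-1 finished, or None for the first
        turn of a session. Both names are illustrative assumptions.
    """
    if prev_finish is None:
        return trace_timestamp
    return max(trace_timestamp, prev_finish)


# Fast system: turn N-1 finished early, so the trace timestamp wins.
on_schedule = fire_time(trace_timestamp=10.0, prev_finish=7.5)

# Slow system: turn N-1 finished late, so turn N fires on its completion.
behind_schedule = fire_time(trace_timestamp=10.0, prev_finish=14.2)
```

In the second call the 10.0 s trace timestamp no longer influences the result, which is the "think time collapses to zero" case described above.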

A first reading flags this as a benchmark concern: under saturation the arrival process becomes policy-dependent, so cross-policy latency comparisons are confounded by a feedback loop (slow policy → longer sessions → more concurrent in-flight → harder system → slower latency).

This note argues the opposite: the trace-replayer's behavior is the correct model of agentic workloads, and the feedback loop is a real property of production systems, not a methodology artifact.

Why agentic workloads do not have user think-time

In chat workloads, the turn N+1 message is composed by a human reading the turn N response. The inter-turn gap is dominated by human reading + typing speed, which is independent of how fast the server replied. The trace's timestamp captures the human cadence and is a faithful arrival process.

In agentic workloads, turn N+1 is generated by:

  • A tool-call response feeding back into the model context
  • An autonomous loop deciding the next action
  • A planner / executor stepping to the next subgoal

None of these wait for a human. Turn N+1 fires as soon as the infrastructure can hand the previous turn's output back to the next inference step. There is no think-time floor.

This means that in a real agentic system the wall-clock time between turn N finishing and turn N+1 dispatching contains no human delay and is close to zero (tool-execution time aside; see Open questions). If the model serving infrastructure slows down (high TTFT or E2E for turn N), turn N+1's dispatch slips by the same amount, which is exactly the behavior the replayer exhibits.

What B3's session-sequential dispatch is actually measuring

B3's trace replayer drives a workload that:

  • preserves the causal structure of the original trace (which turns belong to which session and in what order),
  • uses the original timestamps as a lower bound (turn N+1 cannot fire before its trace timestamp),
  • but lets turn N+1 fire immediately when the system has fallen behind.

For an agentic workload, this is the right model:

  1. The captured trace's timestamps reflect the production system's actual response speed at capture time — they already encode the round-trip time the model + tool stack delivered.
  2. When we replay against a different policy, what we want to measure is "what wall-clock would this session take under policy X" — which includes the same tool-call-driven cadence: each next turn fires as soon as it can.
  3. The "inter-turn gap" is not a fixed delay we should respect; it is an artifact of the captured system's speed that we are explicitly trying to replace.

So the replayer's behavior is not "broken under saturation"; it models agentic semantics correctly: no think-time, sequential within session, fire-immediately when ready.
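To make these three properties concrete, here is a toy single-session replay. It assumes a constant per-turn service latency, which real systems do not have; `replay_session`, `session_lifetime`, and the 4-turn trace are invented for illustration.

```python
from typing import List, Optional, Tuple


def replay_session(trace_ts: List[float], latency: float) -> List[Tuple[float, float]]:
    """Replay one session and return (fire, finish) offsets per turn.

    Turns run strictly in order. Each turn fires at its trace timestamp, or
    at the previous turn's finish time if that is later.
    """
    schedule: List[Tuple[float, float]] = []
    prev_finish: Optional[float] = None
    for ts in trace_ts:
        fire = ts if prev_finish is None else max(ts, prev_finish)
        finish = fire + latency  # assumption: constant service latency
        schedule.append((fire, finish))
        prev_finish = finish
    return schedule


def session_lifetime(schedule: List[Tuple[float, float]]) -> float:
    """Wall-clock span from the first fire to the last finish."""
    return schedule[-1][1] - schedule[0][0]


trace = [0.0, 5.0, 10.0, 15.0]  # hypothetical session, turns 5 s apart
fast = replay_session(trace, latency=2.0)
slow = replay_session(trace, latency=8.0)
```

With a 2 s latency every turn fires on its trace timestamp and the session spans 17 s. At 8 s every turn after the first fires on the previous turn's completion and the session spans 32 s.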

The feedback loop is a real production phenomenon

Once we accept the agentic semantics, the so-called "dispatch slip artifact" is not an artifact; it is real system behavior:

slow policy
    → each turn takes longer
    → each session lives in the system longer
    → at any moment, more sessions are concurrently in-flight
    → 8 workers' KV / queue pressure is higher
    → each request gets less per-worker capacity
    → each turn takes even longer
    → ...

By Little's Law: N_concurrent ≈ session_arrival_rate × mean_session_lifetime.

In our B3 data:

  • lmetric: mean session lifetime is much longer than the original trace's ~600 s span (lmetric's 1214-request replay took 49 min wall clock — sessions stayed alive ~8× longer than the trace captured).
  • unified: sessions drain ~3× faster than lmetric.

So under unified, the 8-worker pool sees fewer concurrent sessions than under lmetric — and this is what production would also see if the operator switched routing policies on the same incoming agentic load.

This is not a fairness violation. It is a faithful reflection of: "a faster routing policy is faster both because of its per-request behavior AND because it reduces the steady-state concurrent load it inflicts on itself".

A user running an agentic system does benefit from both effects when they pick a better policy. The combined "policy × system-feedback" gain is what the user actually experiences.
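The loop can be illustrated with a toy fixed-point model. This is a sketch under strong assumptions, not a fit to the B3 data: per-turn latency is assumed to grow linearly with the number of concurrent sessions, session lifetime is turns × latency (no think-time), and Little's Law closes the loop. All names and numbers below are hypothetical.

```python
from typing import Tuple


def steady_state(base_latency: float, arrival_rate: float,
                 turns_per_session: int, capacity: float,
                 iters: int = 200) -> Tuple[float, float]:
    """Iterate the feedback loop; return (concurrent sessions, turn latency).

    Assumed slowdown model: latency = base_latency * (1 + n / capacity).
    """
    offered = arrival_rate * turns_per_session * base_latency
    if offered >= capacity:
        raise ValueError("offered load exceeds capacity; no steady state")
    n = 0.0
    for _ in range(iters):
        latency = base_latency * (1.0 + n / capacity)  # more in flight, slower turns
        lifetime = turns_per_session * latency         # slower turns, longer sessions
        n = arrival_rate * lifetime                    # Little's Law
    return n, base_latency * (1.0 + n / capacity)


# Same incoming load for both policies; only the isolated per-turn latency differs.
load = dict(arrival_rate=0.5, turns_per_session=10, capacity=40.0)
n_fast, lat_fast = steady_state(base_latency=4.0, **load)
n_slow, lat_slow = steady_state(base_latency=5.0, **load)

isolated_gap = 5.0 / 4.0           # gap with no feedback
coupled_gap = lat_slow / lat_fast  # gap once the loop has settled
```

In this toy setting a 1.25× gap in isolated per-turn latency becomes roughly a 1.67× gap at steady state, because the slower policy also carries about 67 concurrent sessions against 40.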

What this means for B3 and B4 in the paper

| | B3 trace-driven replay | B4 open-loop Poisson |
|---|---|---|
| Arrival process | original trace timestamps with session-sequential "fire-on-finish" | Poisson session inter-arrival at fixed λ |
| Inter-turn think-time | none (matches agentic) | none (matches agentic) |
| Session lifetime under load | grows with policy slowness (feedback) | fixed by trace template plus per-turn latency |
| What latency at p90 measures | end-user latency under agentic feedback amplification | per-request behavior at the operator-chosen load level |
| What "fair across policies" means | same trace, same total session set; arrival process is policy-dependent on purpose | same λ, decoupled from policy throughput |
| When to use it | "if we run this real captured load through our system, what does the user see" | "what is the max sustainable rate (SRR) before SLO breaks, per policy" |

The two are complementary, not "B3 is unfair and B4 fixes it":

  • B3 answers the "in-production replay" question — including feedback amplification, which agentic users will actually experience.
  • B4 answers the "saturation envelope" question — what's the policy's intrinsic throughput at fixed load.
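For contrast with B3's fire-on-finish rule, an open-loop arrival schedule is fixed before any request completes. The sketch below draws session start times from a Poisson process using the standard library's `random.expovariate`. The function name and parameters are assumptions and do not describe the planned B4 load generator.

```python
import random
from typing import List


def poisson_session_arrivals(rate: float, duration: float, seed: int = 0) -> List[float]:
    """Return session start offsets (s) for a Poisson process.

    rate: sessions per second (the fixed lambda).
    duration: length of the arrival window in seconds.
    seed: fixes the schedule so every policy sees identical arrivals.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    rng = random.Random(seed)
    arrivals: List[float] = []
    t = rng.expovariate(rate)  # exponential inter-arrival gap
    while t < duration:
        arrivals.append(t)
        t += rng.expovariate(rate)
    return arrivals


# The schedule depends only on (rate, duration, seed), never on how fast
# the system under test responds.
schedule_a = poisson_session_arrivals(rate=2.0, duration=1000.0, seed=7)
schedule_b = poisson_session_arrivals(rate=2.0, duration=1000.0, seed=7)
```

Because the start times are computed up front, a slow policy cannot change when sessions arrive, which is what "decoupled from policy throughput" means in the table above.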

A paper that drops B3 in favor of B4 would understate how much the combined effect (policy + feedback) actually helps the user. A paper that drops B4 in favor of B3 would conflate the two effects and prevent a "policy X sustains higher λ" statement.

For the paper framing, this suggests four points:

  1. B3 is the production-replay experiment. Report latency percentiles as "end-to-end under captured agentic load with no-think-time sequencing". Acknowledge that the combined gap (e.g. unified TTFT p90 = 7.24 s vs lmetric 15.6 s) reflects both policy and feedback; call this out, do not hide it.

  2. B4 is the controlled-load experiment. Report SRR_max per policy under per-class SLO. This is the experiment that decouples policy from feedback and gives a sustainable-rate ranking.

  3. The feedback amplification itself is a finding to call out. It is why a "marginally better" routing policy (e.g. unified over lmetric in microbenchmarks) can deliver a much bigger gap in production (here ~2×): the feedback halves the in-flight count, which compounds on top of the per-request improvement.

  4. The contrast with chat workloads is a paper section (or at least a paragraph). Chat workloads have human think-time bounded by reading speed, so the feedback loop is partially broken: even if the server slows down, the user-driven inter-turn delay still puts a floor on how concentrated the load can become. Agentic workloads remove that floor.

Open questions

  • Is the feedback amplification quantifiable from B3 alone? We have total wall-clock per policy and per-session lifetime distributions; we can in principle attribute the policy-vs-feedback split by comparing B3's saturated-replay p90 to B4's at-fixed-λ p90 (when B4 runs).
  • Does it matter that the original trace was captured under one policy's behavior? The trace's timestamps were the production system's output at capture time. When we replay against a slower policy, we are asking "what if this same set of session+turns ran on a worse policy" — and the answer is "the sessions would live longer". This is precisely the counterfactual we want.
  • What happens if real tools have variable per-call latency? Our replayer assumes turn N+1 fires the instant turn N finishes. Real agentic systems have some tool-execution time between turns. This is a quantitative correction (raises the floor on inter-turn gap), not a qualitative one — the feedback loop still applies, just with a higher baseline.
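The third question can be sketched by adding a per-call tool delay to the dispatch rule. The helper below is hypothetical (the replayer as described above models no tool delay) and uses a constant serving latency and invented delays.

```python
from typing import List, Optional


def lifetime_with_tool_delay(trace_ts: List[float], latency: float,
                             tool_delays: List[float]) -> float:
    """Session wall-clock span when a tool call sits between turns.

    tool_delays[i] is the assumed tool-execution time between turn i
    finishing and turn i+1 becoming ready to fire.
    """
    if len(tool_delays) != len(trace_ts) - 1:
        raise ValueError("need one tool delay per inter-turn gap")
    first_fire = trace_ts[0]
    prev_finish: Optional[float] = None
    for i, ts in enumerate(trace_ts):
        if prev_finish is None:
            fire = ts
        else:
            # The trace timestamp stays a lower bound; the tool delay raises the floor.
            fire = max(ts, prev_finish + tool_delays[i - 1])
        prev_finish = fire + latency  # assumption: constant serving latency
    return prev_finish - first_fire


trace = [0.0, 5.0, 10.0, 15.0]  # hypothetical session, turns 5 s apart
no_tools = lifetime_with_tool_delay(trace, latency=8.0, tool_delays=[0.0, 0.0, 0.0])
with_tools = lifetime_with_tool_delay(trace, latency=8.0, tool_delays=[1.0, 0.5, 2.0])
```

Here the three tool delays (3.5 s in total) extend the saturated session from 32 s to 35.5 s: the floor rises, but each turn still fires as soon as its predecessor and tool call are done.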

Cross-reference

  • replayer/replay.py:282-287 — the dispatch rule
  • analysis/characterization/window_1_results.md §"What Window 1 does not answer" — current treatment as caveat
  • analysis/claude_characterization_work_plan.md §B4 — open-loop Poisson loadgen as the orthogonal measurement