# Agentic Dispatch Coupling: Why Session-Sequential Replay is the Realistic Mode

Date: 2026-05-26
Owner: characterization
Status: methodology note for paper framing

## The observation

In `replayer/replay.py:282-287`, turn N of a session fires at:

```
t_fire(N) = max(
    turn_N_trace_timestamp,         # what the trace asked for
    turn_{N-1}_finish_wall_clock,   # but turn N-1 must complete first
)
```

When the system is fast, the second term loses → turn N fires at its trace timestamp → the replay matches the captured trace.

When the system is slow, the second term dominates → turn N fires *immediately* after turn N-1 completes, with the trace timestamp ignored. The session's "inter-turn think time" collapses to zero.

A first reading flags this as a benchmark concern: under saturation the arrival process becomes policy-dependent, so cross-policy latency comparisons are confounded by a feedback loop (slow policy → longer sessions → more concurrent in-flight → harder system → slower latency).

This note argues the opposite: **the trace-replayer's behavior is the correct model of agentic workloads, and the feedback loop is a real property of production systems, not a methodology artifact**.

## Why agentic workloads do not have user think-time

In chat workloads, the turn N+1 message is composed by a human reading the turn N response. The inter-turn gap is dominated by human reading + typing speed, which is independent of how fast the server replied. The trace's timestamp captures the human cadence and is a faithful arrival process.

In **agentic workloads**, turn N+1 is generated by:

- A tool-call response feeding back into the model context
- An autonomous loop deciding the next action
- A planner / executor stepping to the next subgoal

None of these wait for a human. Turn N+1 fires as soon as the infrastructure can hand the previous turn's output back to the next inference step. There is no think-time floor.

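The fire-time rule from the observation above, and the way the inter-turn gap collapses once per-turn latency exceeds the trace gap, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code in `replayer/replay.py`: the names `fire_time` and `replay_session` are hypothetical, and a single constant `service_time` stands in for the real, load-dependent per-turn latency.

```python
from typing import List, Optional, Tuple


def fire_time(trace_timestamp: float, prev_finish: Optional[float]) -> float:
    """Earliest dispatch time for a turn: its trace timestamp, but never
    before the previous turn of the same session has finished."""
    if prev_finish is None:  # first turn of the session has no predecessor
        return trace_timestamp
    return max(trace_timestamp, prev_finish)


def replay_session(
    trace_timestamps: List[float], service_time: float
) -> Tuple[List[float], List[float]]:
    """Replay one session sequentially.

    Assumption for illustration only: every turn takes the same
    `service_time` seconds, independent of system load.
    Returns (fire_times, finish_times), one entry per turn.
    """
    fire_times: List[float] = []
    finish_times: List[float] = []
    prev_finish: Optional[float] = None
    for ts in trace_timestamps:
        t_fire = fire_time(ts, prev_finish)
        prev_finish = t_fire + service_time
        fire_times.append(t_fire)
        finish_times.append(prev_finish)
    return fire_times, finish_times


# Fast system: turns finish well inside the 10 s trace gap, so every turn
# fires at its trace timestamp.
fast_fire, fast_finish = replay_session([0.0, 10.0, 20.0], service_time=2.0)

# Slow system: each turn outlasts the trace gap, so each later turn fires
# the instant its predecessor finishes and the session stretches out.
slow_fire, slow_finish = replay_session([0.0, 10.0, 20.0], service_time=15.0)
```

With a 2 s service time the fire times stay at 0, 10, and 20 s. With a 15 s service time they slip to 0, 15, and 30 s, and the session finishes at 45 s instead of 22 s.
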
This means: in a real agentic system, **the wall-clock time between turn N finish and turn N+1 dispatch is essentially zero**. If the model serving infrastructure slows down (high TTFT or E2E for turn N), turn N+1's dispatch slips by the same amount, which is exactly the behavior the replayer exhibits.

## What B3's session-sequential dispatch is actually measuring

B3's trace replayer drives a workload that:

- preserves the *causal structure* of the original trace (which turns belong to which session and in what order),
- uses the *original timestamps as a lower bound* (turn N+1 cannot fire before its trace timestamp),
- *but* lets turn N+1 fire immediately when the system has fallen behind.

For an agentic workload, this is the right model:

1. The captured trace's timestamps reflect the **production system's actual response speed at capture time**: they already encode the round-trip time the model + tool stack delivered.
2. When we replay against a *different* policy, what we want to measure is "what wall-clock would this session take under policy X", which includes the same tool-call-driven cadence: each next turn fires as soon as it can.
3. The "inter-turn gap" is not a fixed delay we should respect; it is an artifact of the captured system's speed that we are explicitly trying to replace.

So the replayer's behavior is not "broken under saturation"; it is modeling agentic semantics correctly: **no think-time, sequential within session, fire-immediately when ready**.

## The feedback loop is a real production phenomenon

Once we accept the agentic semantics, the so-called "dispatch slip artifact" is not an artifact; it is a real system behavior:

```
slow policy
  → each turn takes longer
  → each session lives in the system longer
  → at any moment, more sessions are concurrently in-flight
  → 8 workers' KV / queue pressure is higher
  → each request gets less per-worker capacity
  → each turn takes even longer
  → ...
```

By Little's Law: `N_concurrent ≈ session_arrival_rate × mean_session_lifetime`.

In our B3 data:

- lmetric: mean session lifetime is much longer than the original trace's ~600 s span (lmetric's 1214-request replay took 49 min wall clock; sessions stayed alive ~8× longer than the trace captured).
- unified: sessions drain ~3× faster than lmetric.

So under unified, the 8-worker pool sees fewer concurrent sessions than under lmetric, and this is **what production would also see** if the operator switched routing policies on the same incoming agentic load.

**This is not a fairness violation**. It is a faithful reflection of: "a faster routing policy is faster both because of its per-request behavior AND because it reduces the steady-state concurrent load it inflicts on itself".

A user running an agentic system *does* benefit from both effects when they pick a better policy. The combined "policy × system-feedback" gain is what the user actually experiences.

## What this means for B3 and B4 in the paper

| | B3 trace-driven replay | B4 open-loop Poisson |
|---|---|---|
| Arrival process | original trace timestamps with session-sequential "fire-on-finish" | Poisson session inter-arrival at fixed λ |
| Inter-turn think-time | none (matches agentic) | none (matches agentic) |
| Session lifetime under load | *grows* with policy slowness (feedback) | *fixed* by trace template plus per-turn latency |
| What latency at p90 measures | end-user latency under agentic feedback amplification | per-request behavior at the operator-chosen load level |
| What "fair across policies" means | same trace, same total session set; arrival process is policy-dependent **on purpose** | same λ, decoupled from policy throughput |
| When to use it | "if we run this real captured load through our system, what does the user see" | "what is the max sustainable rate (SRR) before SLO breaks, per policy" |

The two are **complementary**, not "B3 is unfair and B4 fixes it":

- **B3 answers the "in-production replay" question**, including feedback amplification, which agentic users will actually experience.
- **B4 answers the "saturation envelope" question**: what is the policy's intrinsic throughput at fixed load.

A paper that drops B3 in favor of B4 would understate how much the **combined** effect (policy + feedback) actually helps the user. A paper that drops B4 in favor of B3 would conflate the two effects and prevent a "policy X sustains higher λ" statement.

## Recommended paper framing

1. **B3 is the production-replay experiment**. Report latency percentiles as "end-to-end under captured agentic load with no-think-time sequencing". Acknowledge that the *combined* gap (e.g. unified TTFT p90 = 7.24 s vs lmetric 15.6 s) reflects both policy and feedback; call this out, do not hide it.

2. **B4 is the controlled-load experiment**. Report `SRR_max` per policy under per-class SLO. This is the experiment that decouples policy from feedback and gives a sustainable-rate ranking.

3. **The feedback amplification itself is a finding to call out**. It is the reason why a "marginally better" routing policy (e.g. unified over lmetric in microbenchmarks) can deliver a much bigger gap in production (here ~2×): the feedback halves the in-flight count, which compounds on top of the per-request improvement.

4. **The contrast with chat workloads is a paper section** (or at least a paragraph). Chat workloads have human think-time bounded by reading speed, so the feedback loop is partially broken: even if the server slows down, the user-driven inter-turn delay still puts a floor on how concentrated the load can become. Agentic workloads remove that floor.

## Open questions

- **Is the feedback amplification quantifiable from B3 alone?** We have total wall-clock per policy and per-session lifetime distributions; we can in principle attribute the policy-vs-feedback split by comparing B3's saturated-replay p90 to B4's at-fixed-λ p90 (when B4 runs).
- **Does it matter that the original trace was captured under one policy's behavior?** The trace's timestamps were the production system's output at capture time. When we replay against a slower policy, we are asking "what if this same set of session+turns ran on a worse policy", and the answer is "the sessions would live longer". This is precisely the counterfactual we want.
- **What happens if real tools have variable per-call latency?** Our replayer assumes turn N+1 fires the instant turn N finishes. Real agentic systems have some tool-execution time between turns. This is a quantitative correction (raises the floor on inter-turn gap), not a qualitative one: the feedback loop still applies, just with a higher baseline.

## Cross-reference

- `replayer/replay.py:282-287`: the dispatch rule
- `analysis/characterization/window_1_results.md` §"What Window 1 does not answer": current treatment as caveat
- `analysis/claude_characterization_work_plan.md` §B4: open-loop Poisson loadgen as the orthogonal measurement