Measure inter-turn T_external on the raw production trace; add f3a CDF

The earlier conversation suggested agentic workloads might "have no human think-time" and therefore live in a strict closed-loop regime. The user pushed back: tool calls also take time and might restore a chatbot-like buffer between turns. To resolve this, we go to the actual data.

The previously published per-record formatted trace only carries arrival timestamps, so an arrival-to-arrival diff conflates W_turn and T_external. The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms, which lets us compute the pure inter-turn external gap

    T_external = next.request_ready_time_ms - prev.request_end_time_ms

for each pair of consecutive turns within a session.

Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions):

    p25  = 0.69 s
    p50  = 1.6 s
    p75  = 8.6 s
    p90  = 44 s
    mean = 37 s (heavy long tail; paused/abandoned sessions)

    39 % of gaps < 1 s
    67 % of gaps < 5 s
    87 % of gaps < 30 s

The bulk of the distribution is dominated by sub-second to a-few-seconds tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 = 7.3 s, lmetric 15.7 s), tail W_turn is already close to or above the 75th percentile of T_external (8.6 s), so dispatch coupling is the dominant regime for the majority of turns, not a corner case.

This corrects the earlier conflated arrival-to-arrival "median gap 11 s" figure (which folded W_turn into T_external). The true T_external median is 1.6 s.

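
To make the difference between the two gap definitions concrete, here is a minimal sketch on a made-up three-turn session. The helper name `gaps_for_session` and the toy timestamps are hypothetical and not part of this commit, and the sketch assumes the previous turn's W_turn can be read as `prev.end - prev.ready`. It only illustrates why an arrival-to-arrival diff overstates T_external.

```python
from typing import List, Tuple

# One turn = (request_ready_time_ms, request_end_time_ms); toy values only.
Turn = Tuple[int, int]


def gaps_for_session(turns: List[Turn]) -> Tuple[List[int], List[int]]:
    """Return (arrival-to-arrival diffs, T_external gaps) in ms for one session."""
    ordered = sorted(turns, key=lambda t: t[0])  # order turns by ready time
    arrival_diffs: List[int] = []
    external_gaps: List[int] = []
    for prev, nxt in zip(ordered, ordered[1:]):
        # Conflated metric: includes the previous turn's own duration.
        arrival_diffs.append(nxt[0] - prev[0])
        # Pure external gap: next.ready - prev.end; drop negative (overlapping) pairs.
        gap = nxt[0] - prev[1]
        if gap >= 0:
            external_gaps.append(gap)
    return arrival_diffs, external_gaps


toy_session = [(0, 9_000), (10_500, 18_000), (19_000, 21_000)]
arrival, external = gaps_for_session(toy_session)
print(arrival)   # [10500, 8500]
print(external)  # [1500, 1000]
```

In this toy session the arrival-to-arrival diffs are 10.5 s and 8.5 s, while the external gaps are only 1.5 s and 1.0 s; the remainder is the previous turn's own duration.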
Adds:
- scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator
- analysis/characterization/data/agentic_inter_turn_gap.json: 500-point CDF cache + summary stats, scp'd back from dash0
- scripts/plot_inter_turn_gap.py: local figure renderer
- figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and unified/lmetric TTFT p90 reference lines

Next step (per user): pull a chatbot trace through the same pipeline and compare distributions side by side; this will let §2.3 stop hand-waving about "no think-time" and instead present the regime split empirically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 12:37:32 +08:00
parent 555cabcf1f
commit 41232f49d3
4 changed files with 178 additions and 0 deletions
--- a/analysis/characterization/data/agentic_inter_turn_gap.json
+++ b/analysis/characterization/data/agentic_inter_turn_gap.json
--- a/figs/f3a_inter_turn_gap.png
+++ b/figs/f3a_inter_turn_gap.png
--- a/scripts/compute_inter_turn_gap_remote.py
+++ b/scripts/compute_inter_turn_gap_remote.py
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+"""Compute inter-turn T_external (next.ready - prev.end) on the raw agentic trace.
+
+Run on dash0 (the trace is at the path below; not co-located with the repo).
+Writes /tmp/agentic_inter_turn_gap.json which is then scp'd into the repo at
+analysis/characterization/data/agentic_inter_turn_gap.json for figure rebuild.
+
+Reproduce:
+    scp scripts/compute_inter_turn_gap_remote.py dash0:/tmp/
+    ssh dash0 'python3 /tmp/compute_inter_turn_gap_remote.py'
+    scp dash0:/tmp/agentic_inter_turn_gap.json analysis/characterization/data/
+"""
+import json
+from collections import defaultdict
+import numpy as np
+
+path = "/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317-raw.jsonl"
+sessions = defaultdict(list)
+
+n_total = 0
+n_kept = 0
+with open(path) as f:
+    for line in f:
+        try:
+            r = json.loads(line)
+        except Exception:
+            continue
+        n_total += 1
+        m = r.get("meta", {})
+        sid = m.get("session_id")
+        ready = m.get("request_ready_time_ms")
+        end = m.get("request_end_time_ms")
+        if sid is None or ready is None or end is None:
+            continue
+        if end <= 0 or ready <= 0 or end < ready:
+            continue
+        sessions[sid].append((int(ready), int(end)))
+        n_kept += 1
+
+print(f"records_total: {n_total}")
+print(f"records_kept: {n_kept}")
+print(f"sessions_total: {len(sessions)}")
+
+gaps_ms = []
+neg = 0
+for sid, turns in sessions.items():
+    if len(turns) < 2:
+        continue
+    turns.sort(key=lambda x: x[0])
+    for i in range(len(turns) - 1):
+        g = turns[i + 1][0] - turns[i][1]
+        if g < 0:
+            neg += 1
+            continue
+        gaps_ms.append(g)
+
+gaps = np.array(gaps_ms, dtype=np.float64) / 1000.0
+
+print(f"sessions_with_>=2_turns: {sum(1 for t in sessions.values() if len(t) >= 2)}")
+print(f"gaps_kept: {len(gaps)}")
+print(f"gaps_negative_dropped: {neg}")
+pcts = [1, 5, 25, 50, 75, 90, 95, 99]
+ps = {f"p{p}": float(np.percentile(gaps, p)) for p in pcts}
+print(f"stats_s: min={gaps.min():.3f} mean={gaps.mean():.3f} max={gaps.max():.3f} {ps}")
+for thr in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]:
+    pct = (gaps < thr).sum() / len(gaps) * 100
+    print(f"frac < {thr:5.1f}s : {pct:5.1f}%")
+
+n = len(gaps)
+arr = np.sort(gaps)
+idx_top = np.unique(np.round(np.geomspace(1, max(1, n // 100), 200)).astype(int)) - 1
+idx_rest = np.unique(np.linspace(n // 100, n - 1, 300).astype(int))
+idx = np.unique(np.concatenate([[0], idx_top, idx_rest, [n - 1]]))
+idx = idx[idx < n]
+samples = [{"rank_pct": float((i + 1) / n * 100), "gap_s": float(arr[i])} for i in idx]
+out = {
+    "n_gaps": n,
+    "n_sessions": sum(1 for t in sessions.values() if len(t) >= 2),
+    "negative_dropped": neg,
+    "stats_s": {**{"min": float(gaps.min()), "max": float(gaps.max()),
+                   "mean": float(gaps.mean())}, **ps},
+    "fraction_below": {f"{thr}s": float((gaps < thr).sum() / n)
+                        for thr in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0]},
+    "cdf_samples": samples,
+}
+open("/tmp/agentic_inter_turn_gap.json", "w").write(json.dumps(out))
+print("wrote /tmp/agentic_inter_turn_gap.json")
--- a/scripts/plot_inter_turn_gap.py
+++ b/scripts/plot_inter_turn_gap.py
@@ -0,0 +1,90 @@
+#!/usr/bin/env python3
+"""Plot the production trace inter-turn gap distribution.
+
+Inter-turn gap = next_turn.request_ready_time_ms - prev_turn.request_end_time_ms
+(i.e. T_external: the wall-clock between a turn finishing and the next turn
+of the same session arriving). This is the tool-call latency + any pause,
+not the conflated arrival-to-arrival interval.
+
+Data is pre-computed on dash0 by scripts/compute_inter_turn_gap_remote.py and cached under
+``analysis/characterization/data/agentic_inter_turn_gap.json`` (~23 KB).
+"""
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import numpy as np
+
+
+def load(cache_path: Path) -> tuple[np.ndarray, np.ndarray, dict]:
+    d = json.loads(cache_path.read_text())
+    samples = d["cdf_samples"]
+    xs = np.array([s["gap_s"] for s in samples])
+    ys = np.array([s["rank_pct"] for s in samples])
+    return xs, ys, d
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--data",
+        default="analysis/characterization/data/agentic_inter_turn_gap.json",
+    )
+    parser.add_argument("--out", default="figs/f3a_inter_turn_gap.png")
+    args = parser.parse_args()
+
+    xs, ys, d = load(Path(args.data))
+
+    fig, ax = plt.subplots(figsize=(9, 5.2))
+    ax.plot(xs, ys, color="#1f77b4", lw=2.2,
+            label=f"agentic trace (n={d['n_gaps']:,} gaps, "
+                  f"{d['n_sessions']:,} multi-turn sessions)")
+
+    p = d["stats_s"]
+    for pct, key in [(25, "p25"), (50, "p50"), (75, "p75"), (90, "p90")]:
+        v = p[key]
+        ax.scatter([v], [pct], color="#c44e52", s=55, zorder=5)
+        ax.annotate(f"p{pct} = {v:.2g}s",
+                    xy=(v, pct), xytext=(8, -4),
+                    textcoords="offset points",
+                    fontsize=10, color="#7a1d1d")
+
+    # Reference vertical lines: scheduler W_turn (TTFT p90 from our window_1 runs)
+    refs = [
+        ("lmetric TTFT p90 = 15.7s", 15.7, "#888"),
+        ("unified TTFT p90 = 7.3s", 7.3, "#444"),
+    ]
+    for label, v, color in refs:
+        ax.axvline(v, color=color, ls=":", lw=1.2, alpha=0.85)
+        ax.text(v * 1.05, 8, label, fontsize=8.5, color=color,
+                rotation=90, va="bottom")
+
+    ax.set_xscale("log")
+    ax.set_xlim(0.05, 2000)
+    ax.set_ylim(0, 102)
+    ax.set_xlabel(
+        "Inter-turn gap T_external (s, log scale) "
+        "— next_turn.ready − prev_turn.end"
+    )
+    ax.set_ylabel("Cumulative % of inter-turn intervals")
+    ax.set_title(
+        "Inter-turn external gap CDF — production agentic trace\n"
+        f"median T_external = {p['p50']:.2g}s; "
+        f"{int(d['fraction_below']['1.0s']*100)}% gaps < 1s, "
+        f"{int(d['fraction_below']['5.0s']*100)}% < 5s, "
+        f"{int(d['fraction_below']['30.0s']*100)}% < 30s"
+    )
+    ax.grid(True, which="both", alpha=0.3)
+    ax.legend(loc="lower right", framealpha=0.92, fontsize=9)
+
+    out_path = Path(args.out)
+    out_path.parent.mkdir(parents=True, exist_ok=True)
+    fig.savefig(out_path, dpi=150, bbox_inches="tight")
+    print(f"wrote {out_path}")
+
+
+if __name__ == "__main__":
+    main()
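
The cached `cdf_samples` can also be used to read off what share of inter-turn gaps falls below a given W_turn, which is the comparison the reference lines in the figure are making. The following is a rough sketch only: `frac_gaps_below` is a hypothetical helper that is not part of this commit, the sample points are invented for illustration, and linear interpolation between cached points is an assumption rather than something the scripts above do.

```python
from bisect import bisect_right
from typing import Dict, List


def frac_gaps_below(cdf_samples: List[Dict[str, float]], w_turn_s: float) -> float:
    """Approximate cumulative % of gaps below w_turn_s by linear interpolation."""
    pts = sorted((s["gap_s"], s["rank_pct"]) for s in cdf_samples)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    # Clamp outside the sampled range instead of extrapolating.
    if w_turn_s <= xs[0]:
        return ys[0]
    if w_turn_s >= xs[-1]:
        return ys[-1]
    hi = bisect_right(xs, w_turn_s)  # first sample strictly above w_turn_s
    x0, x1 = xs[hi - 1], xs[hi]
    y0, y1 = ys[hi - 1], ys[hi]
    return y0 + (y1 - y0) * (w_turn_s - x0) / (x1 - x0)


# Invented sample points in the same shape as the cache's "cdf_samples" entries.
toy_samples = [
    {"gap_s": 0.5, "rank_pct": 25.0},
    {"gap_s": 1.0, "rank_pct": 50.0},
    {"gap_s": 9.0, "rank_pct": 75.0},
    {"gap_s": 40.0, "rank_pct": 90.0},
    {"gap_s": 1000.0, "rank_pct": 100.0},
]
print(frac_gaps_below(toy_samples, 5.0))  # 62.5
```

With the real cache, one would load `analysis/characterization/data/agentic_inter_turn_gap.json` and pass `d["cdf_samples"]` together with a scheduler's TTFT p90; the result is only as precise as the sampling density of the cache allows.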