Commit Graph

6 Commits

Author SHA1 Message Date
075f5bbc22 trace: time_to_parent_chat annotation + thinktime trace variants
Adds `scripts/add_ttp_streaming.py`: one streaming pass over the 522 GB raw
glm5.1 trace to build {chat_id: (ready_ms, end_ms)} and join the real
inter-turn gap onto the COMPLETE formatted trace (no early-exit, low memory).
  time_to_parent_chat = (this.request_ready_time_ms - parent.request_end_time_ms)/1000
  = tool-exec + agent think-time; turn-1 -> null, negatives clamped to 0.

Ships the two ttp-annotated sampled traces (same anonymized data + one
timestamp-derived field; regenerated via sample_trace.py --seed 42 so they are
row-for-row identical to the non-ttp variants on all 9 shared fields):
  traces/w600_r0.0015_st30_ttp.jsonl          (1214 reqs)
  traces/w600_r0.0015_st30_first600s_ttp.jsonl (807 reqs)
They are needed to replay with --dispatch-mode thinktime without the
non-redistributable raw trace, so they are added to the .gitignore allowlist.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 20:58:49 +08:00
71b0747b3b 600s-truncated trace + LPWL 5-policy results
traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600
trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical
APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder,
lower-locality regime; whitelisted alongside the parent anonymized trace.

analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on
the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero
knobs — TTFT p90 7983ms vs tuned A+B 11562 (-31%), E2E p90 -16%, best request
balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is
E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B
was tuned on full w600 yet is beaten by the knob-free policy on this regime.
Includes the run_5policy_600s.sh repro driver.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 16:08:35 +08:00
08c3cf48aa Ship anonymized benchmark trace w600_r0.0015_st30 + provenance
Whitelist the sampled replay trace (1214 reqs / 274 sessions / ~600 s) past
the traces/ ignore so the repo is runnable without dash0 access. Metadata
only (token counts, opaque KV-block hashes, timing, session structure) — no
prompts/outputs/PII. traces/README documents schema, provenance (sampled
from the internal GLM-5.1 production trace via scripts/sample_trace.py), and
the regeneration command.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 11:54:43 +08:00
2b0ac70ee7 Phase 1 milestone: system-level analysis + reproducible report
- REPORT.md: self-contained milestone report covering baseline vs elastic
  setup, exact launch commands, benchmark params, results, log locations,
  and repo structure — sufficient for anyone to reproduce
- analysis/pd_separation_analysis.md §5: elastic P2P system-level breakdown
  (KV cache hit ratio, per-class TTFT, GPU util paradox explanation)
- scripts/cache_aware_proxy.py: round-robin P-instance selection replacing
  argmin(ongoing_tokens) to fix GPU load imbalance (3.0x → expected ~2x)
- scripts/launch_elastic_p2p.sh: one-command launch for elastic P2P config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 16:17:41 +08:00
445e491123 Add vLLM v0.18.1 source tree with KV transfer abort fix
third_party/vllm/ now tracked in git for direct patch management.
Based on vLLM v0.18.1 release with one patch applied:

  vllm/v1/core/sched/scheduler.py:
    Replace fatal assert with graceful skip when KV transfer callback
    arrives for an already-aborted request during PD disaggregated serving.

Future vLLM modifications should be made directly in third_party/vllm/
and committed normally. The patches/ directory is kept as documentation
of what changed from upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:30:38 +08:00
05592e6adc Agentic workload PD separation analysis with trace-driven benchmarks
Systematic study of prefill-decode disaggregation for agentic LLM workloads
using production GLM-5.1 coder trace (2.1M requests, 71B input tokens).

Key findings:
- Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7%
  without PD separation, matching PD-Sep's decode isolation benefit
- PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain
  when using the same cache-aware scheduler
- Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x
  vs decode AI <2), but absolute FLOPs drop 71% from cache hits
- For agentic MoE workloads, cache-aware routing > PD separation

Infrastructure:
- Trace sampler preserving session structure + hash_ids for prefix sharing
- Async trace replayer with streaming TTFT/TPOT/E2E measurement
- Unified cache-aware + token-level load-balanced global scheduler proxy
  supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes
- vLLM 0.18.1 scheduler patch for KV transfer abort race condition
- Roofline analysis tool for prefill/decode compute characterization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:21:57 +08:00