tim 986f351365 feat(sglang): drop streaming-session reqs with fill_ids < prefix_indices
Fix A from docs/E3_FINDINGS_ZH.md §3. The existing streaming-session
correction at the top of ScheduleBatch.prepare_for_extend zeroes
req.extend_input_len when len(fill_ids) <= len(prefix_indices), but
the per-req invariant later in the same function (assert
seq_len - pre_len == req.extend_input_len) is computed from raw
fill_ids/prefix_indices lengths and has no path to be satisfied
when fill_len < prefix_len. The result is an AssertionError that
crashes the entire decode worker.

Add a pre-filter pass at the start of prepare_for_extend that
detects this state, marks the affected reqs with FINISH_ABORT (so
the client gets an error response instead of the worker hanging),
and drops them from the batch before the correction loop runs. If
all reqs are filtered, populate empty tensor/list state and return
early so downstream model.forward sees a valid no-op batch.
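
The pre-filter described above can be sketched in isolation. This is a minimal illustration, not the patched SGLang code: the `Req` dataclass, the `prefilter_inconsistent` helper, and the string `"FINISH_ABORT"` marker are hypothetical stand-ins for the real scheduler types.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Req:
    """Hypothetical stand-in for a scheduler request; not SGLang's Req class."""
    rid: str
    fill_ids: List[int]
    prefix_indices: List[int]
    finished_reason: Optional[str] = None  # set to an abort marker when dropped


def prefilter_inconsistent(reqs: List[Req]) -> Tuple[List[Req], List[Req]]:
    """Split reqs into (kept, aborted) by comparing fill and prefix lengths."""
    kept, aborted = [], []
    for req in reqs:
        # Strictly fewer fill ids than prefix indices: the later invariant
        # cannot hold, so abort the request instead of asserting.
        if len(req.fill_ids) < len(req.prefix_indices):
            req.finished_reason = "FINISH_ABORT"  # assumed marker value
            aborted.append(req)
        else:
            kept.append(req)
    return kept, aborted


reqs = [
    Req("ok", fill_ids=list(range(10)), prefix_indices=list(range(4))),
    Req("bad", fill_ids=list(range(3)), prefix_indices=list(range(8))),
]
kept, aborted = prefilter_inconsistent(reqs)
print([r.rid for r in kept], [r.rid for r in aborted])
```

If `kept` comes back empty, the caller would take the early-return path and build an empty batch rather than entering the correction loop.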

This treats len(fill_ids) < len(prefix_indices) as an upstream state
inconsistency that should be reported to the client rather than
silently miscomputed. The narrower invariant after this filter:
prepare_for_extend's body only ever sees streaming-session reqs
where actual_extend_len > 0, which is the regime the existing
correction logic was designed for.

Reproduced by the first E3 run on 2026-05-12 02:51:21 UTC (rid
6f4318e93dd543a49dbf19248cfc1e6f, session 1000195, fill_len=6648,
prefix_len=43459). It was masked in E1/E2 because the cap-out failure
cascade prevented sessions from accumulating a committed prefix deep
enough to trigger the inconsistency.
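
Plugging the reported lengths into the two computations shows why the assertion cannot pass. The snippet assumes that `seq_len` and `pre_len` are the raw `fill_ids` and `prefix_indices` lengths and that the correction sets `extend_input_len` to zero in this case, as the message describes; it is a worked check, not the scheduler code.

```python
# Lengths reported for the failing request in the E3 run.
fill_len = 6648
prefix_len = 43459

# Assumption: the correction zeroes extend_input_len when fill_len <= prefix_len,
# and otherwise leaves it at the difference.
extend_input_len = 0 if fill_len <= prefix_len else fill_len - prefix_len

# Assumption: the invariant uses the raw lengths as seq_len and pre_len.
seq_len, pre_len = fill_len, prefix_len
raw_diff = seq_len - pre_len

print(raw_diff)                      # -36811
print(raw_diff == extend_input_len)  # False
```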

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 12:12:14 +08:00
2026-04-24 12:17:40 +00:00

Agentic PD Hybrid

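This project is a minimal experimental framework built on SGLang xPyD to answer one question:

Can session-aware / KV-cache-aware P/D routing reduce end-to-end latency for agentic coding workloads?

For a fuller but still concise description, see docs/PROJECT_OVERVIEW.md.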

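What it currently does

  • Launches a single-node SGLang P/D stack.
  • Replays the Ali coding agent trace and records request-level metrics.
  • Supports the default, sticky, and kv-aware routing policies.
  • Supports comparing the pd-disaggregation, kvcache-centric, and pd-colo mechanisms.
  • Supports micro-benchmark traces with small appends and multi-turn sessions.
  • Maintains a local patch set based on SGLang v0.5.10, kept in third_party/sglang.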

Environment

Use uv throughout:

uv sync

Default model path:

~/models/Qwen/Qwen3-Coder-30B-A3B-Instruct

The main test environment is currently a single node with 8 GPUs; the constraint is prefill + decode <= 8.
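
To see which xPyD shapes fit that budget, the constraint can be enumerated directly. This sketch assumes one GPU per worker and at least one prefill and one decode worker; `valid_splits` is a hypothetical helper, not part of the project CLI.

```python
from typing import List, Tuple


def valid_splits(total_gpus: int = 8) -> List[Tuple[int, int]]:
    """List (prefill, decode) worker counts with prefill + decode <= total_gpus."""
    return [
        (p, d)
        for p in range(1, total_gpus)
        for d in range(1, total_gpus - p + 1)
    ]


splits = valid_splits()
print(len(splits))   # 28
print(splits[:3])    # [(1, 1), (1, 2), (1, 3)]
```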

Common commands

Generate a small-append trace

uv run agentic-pd-hybrid make-small-append-trace \
  --output outputs/smoke-hotcap-30k-1k-256.jsonl \
  --session-count 4 \
  --turns-per-session 3 \
  --initial-input-length 30000 \
  --append-input-length 1000 \
  --output-length 256

Run a live benchmark

uv run agentic-pd-hybrid benchmark-live \
  --trace outputs/micro-serveable-varturn-30k-1k-256-20260424T0756Z.jsonl \
  --output-root outputs/live-serveable-varturn-30k-1k-256-hotcap \
  --mechanism kvcache-centric \
  --policy kv-aware \
  --kvcache-admission-mode worker \
  --prefill-workers 1 \
  --decode-workers 1 \
  --prefill-gpu-ids 0 \
  --decode-gpu-ids 1 \
  --transfer-backend mooncake \
  --target-duration-s 2000 \
  --session-sample-rate 1.0 \
  --min-turns 2 \
  --time-scale 1 \
  --concurrency-limit 1000

Replay only and write metrics

uv run agentic-pd-hybrid replay \
  --trace path/to/trace.jsonl \
  --policy kv-aware \
  --mechanism pd-disaggregation \
  --router-url http://127.0.0.1:8000 \
  --output outputs/replay.jsonl

Output

Each replay/benchmark run writes:

  • Request metrics: request-metrics.jsonl
  • Summary: request-metrics.jsonl.summary.json

Key things to look at:

  • E2E latency
  • TTFT / TPOT
  • execution mode
  • cached tokens
  • KV transfer blocks
  • error
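
For a quick look at a metrics file without the generated summary, a JSONL file can be reduced to a few numbers as sketched below. The field names `e2e_latency_s` and `error` are assumptions made for illustration; the actual keys in request-metrics.jsonl may differ and should be checked against a real output file.

```python
import json
import math
from typing import Dict, Iterable, List


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(lines: Iterable[str], latency_key: str = "e2e_latency_s",
              error_key: str = "error") -> Dict[str, float]:
    """Count requests and errors, and report p50/p99 latency of successful requests."""
    latencies, errors = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get(error_key):
            errors += 1
            continue  # failed requests are counted but excluded from latency stats
        latencies.append(float(record[latency_key]))
    latencies.sort()
    summary = {"requests": len(latencies) + errors, "errors": errors}
    if latencies:
        summary["p50"] = percentile(latencies, 50)
        summary["p99"] = percentile(latencies, 99)
    return summary


# In-memory sample with the assumed (hypothetical) field names.
sample = [
    '{"e2e_latency_s": 1.0, "error": null}',
    '{"e2e_latency_s": 3.0, "error": null}',
    '{"e2e_latency_s": 2.0, "error": null}',
    '{"error": "aborted"}',
]
print(summarize(sample))
```

To run it on a real file, pass an open file handle instead of the sample list, since the function only needs an iterable of lines.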

Maintenance conventions

  • Project code changes: feat: / fix: / docs:
  • SGLang changes: feat(sglang): ... / fix(sglang): ...
  • The baseline for third_party/sglang is a clean SGLang v0.5.10 snapshot.
  • Do not commit outputs/, logs, __pycache__, or virtual environments.