Initial commit: obsidian to gitea

This commit is contained in:
2026-05-07 15:04:41 +08:00
commit a57afa86b4
323 changed files with 42569 additions and 0 deletions



@@ -0,0 +1,32 @@
[ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production](https://arxiv.org/pdf/2505.09999)
Evict from the M (main) queue first
![[projects/kvcachecache/Dev.figs/250414-000021.png]]

| Cache config | S3FIFO   |
| ------------ | -------- |
| 1kGPU1kCPU   | 0.095005 |
| 1kGPU2kCPU   | 0.136413 |
| 1kGPU4kCPU   | 0.213832 |
Evict from the S (small) queue first
![[projects/kvcachecache/Dev.figs/250414-000021-1.png]]

| Cache config | S3FIFO   |
| ------------ | -------- |
| 1kGPU1kCPU   | 0.095005 |
| 1kGPU2kCPU   | 0.136413 |
| 1kGPU4kCPU   | 0.213832 |
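
The two settings above differ only in which queue the simulator tries to drain first when space is needed. Below is a minimal sketch of that switch under a simplified S3FIFO (no ghost queue, no per-queue quotas); the class and parameter names are hypothetical and this is not the actual simulator.

```python
from collections import deque

class S3FIFOSketch:
    """Simplified S3FIFO: new keys enter the small (S) FIFO and are promoted
    to the main (M) FIFO on reuse.  Queue size quotas and the ghost queue are
    omitted; `evict_main_first` is the (hypothetical) knob compared above."""

    def __init__(self, capacity, evict_main_first=False):
        self.capacity = capacity
        self.small, self.main = deque(), deque()
        self.freq = {}                       # key -> hits since (re)insertion
        self.evict_main_first = evict_main_first

    def _evict_from_small(self):
        key = self.small.popleft()
        if self.freq[key] > 0:               # reused while in S: promote to M
            self.freq[key] = 0
            self.main.append(key)
        else:                                # never reused: drop
            del self.freq[key]

    def _evict_from_main(self):
        key = self.main.popleft()
        if self.freq[key] > 0:               # reused while in M: second chance
            self.freq[key] = 0
            self.main.append(key)
        else:
            del self.freq[key]

    def _make_room(self):
        while len(self.freq) >= self.capacity:
            order = [(self.main, self._evict_from_main),
                     (self.small, self._evict_from_small)]
            if not self.evict_main_first:
                order.reverse()              # default: drain S before M
            for queue, evict in order:
                if queue:
                    evict()
                    break

    def access(self, key):
        """Record one access; returns True on a hit."""
        if key in self.freq:
            self.freq[key] += 1
            return True
        self._make_room()
        self.freq[key] = 0
        self.small.append(key)
        return False
```

The tables above were produced by the project's own simulator; this sketch only illustrates the eviction-priority switch being compared.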



@@ -0,0 +1,33 @@
# Trace
## Current problems
1. traceA and traceB come from different logs with different formats, so each needs its own processing scripts
2. The durations used in earlier tests were 1h for traceA and 2h for traceB: inconsistent, and not long enough
3. The trace processing and plotting scripts are not efficient enough
    - They load the whole trace into RAM: workable at 1h/4h, clearly infeasible at 24h
    - Even the simulator takes roughly 1h to run on 4h of data, far too slow for 24h
## TBD
- Preprocess 24h of traceA and traceB once with a unified pipeline to obtain a single, easy-to-use trace
- Optimize the existing scripts, replacing load-all with streaming, so that 24h traces can be processed (a sketch follows this list)
- Re-run the existing trace analyses on the new data
- After moving to 24h, the characteristics will likely fluctuate over the day; to show that they are stable over a reasonably long continuous window (hour level), we may need to present both 24h and 2h data (peak/off-peak, 9-11 / 22-24)
- The characteristics are expected to match the earlier conclusions; the intra-day fluctuation at 24h may also bring some new findings
- Suggestion from 大爷: a table relating the assumptions/observations of existing papers to what we observe in the real industrial trace: same conclusions? contradictory?
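
A minimal sketch of the intended load-all to streaming change, assuming the raw trace is exported as newline-delimited JSON with a per-record timestamp; the file name and field names are placeholders, not the actual log schema.

```python
import json
from collections import Counter

def iter_records(path):
    """Yield one parsed record at a time instead of loading the file into RAM."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue                      # skip illegal records

def requests_per_hour(path):
    """Example streaming aggregation: request count per hour of day."""
    counts = Counter()
    for rec in iter_records(path):
        # "timestamp" is assumed to be seconds since epoch (placeholder name).
        hour = int(rec["timestamp"]) // 3600 % 24
        counts[hour] += 1
    return counts

if __name__ == "__main__":
    print(requests_per_hour("traceA_24h.jsonl"))
```

Any per-record statistic (arrival rate, token lengths, reuse gaps) can be accumulated the same way, keeping memory constant regardless of trace length.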
## From vLLM v1 meetup
1. The concepts of prefill and decode are removed
![[projects/kvcachecache/Refine roadmap.figs/250414-000021.png]]
2. Not doing PD disaggregation naturally places high demands on the scheduler (it must avoid becoming compute-bound/memory-bound as far as possible). The vLLM team also plans a workload-specific scheduler, but theirs is aware of the arriving request workload, not the KVCache workload-awareness we worked on before. When refining our work we could also think about what workload-awareness can offer the scheduler. [Need to get familiar with the latest vLLM v1 scheduler first]
![[projects/kvcachecache/Refine roadmap.figs/250414-000021-1.png]]
3. Optimizations target specific scenarios (Reasoning/Coding/...); can our workload-aware approach take this kind of information into account and provide useful input for system design/optimization?
![[projects/kvcachecache/Refine roadmap.figs/250812-140723.png]]
# Response
- [ ] What does the workload imply for a global scheduler under PD disaggregation?
- [ ] How does the workload affect MoE? Does the hardware ratio change under different workloads?
- [ ] DeepSeek: 320 experts on 320 GPUs, one expert per GPU; if run on 8 GPUs, how are the experts distributed? What are the activation characteristics?


@@ -0,0 +1,64 @@
## Trace format convention
Q1: current time Aug 7 15:39, balbal111
A1: xxx
Q2: current time Aug 7 15:40, balbal111, xxx, blabal222
A2: yy
Q3: current time Aug 7 15:40, blabal222, yy, blabla333 -> current time Aug 7 15:40, balbal111, xxx, blabal222, yy, blabla333
(Q3 arrives carrying only the sliding-window context; the arrow shows the reconstructed full context.)
| Field | Type [feather] | Description |
| ------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------- |
| request_id | str | Unique identifier of this request |
| chat_id | str [int] | Unique identifier, incrementing from 0 |
| session_id [not supported yet] | str | Unique identifier of a session |
| parent_chat_id | str [int] | chat_id of the previous chat turn in the session; -1 if there is no previous turn |
| uid [not supported yet] | str | uid of the user the request comes from |
| time | str | Request arrival time, e.g. `"2025-02-18 23:52:48.827000"` |
| end_time | str [datetime] | Request end time, e.g. `"2025-02-18 23:53:00.854000"` |
| timestamp | float [datetime] | Request arrival time as a timestamp (in s) |
| first_latency | int | Time to first token, TTFT (in ms) |
| duration | int | Total request time, E2E latency (in ms) |
| input_token_length | int | Total number of input tokens |
| output_token_length | int | Total number of output tokens |
| usage | dict | Resource usage of this request, e.g. `{'input_tokens': 1195, 'output_tokens': 246, 'plugins': {'wanx': {'count': 1}}, 'total_tokens': 1441}` |
| token_ids | list | Input token list (token ids in the qwen vocab range) |
| input_text | str | Input prompt |
| messages | list | Context of this request, e.g. `[("system", "You are an assistant"), ("user", "hi"), ("assistant", "hello"), ("user", "world")]` |
| turn | int | Turn index of this request within its session |
| type | str | Workload tag |
| no_sp_messages | list | messages with the prefix-cache impact of the timestamp in the system prompt removed |
| no_sp_input_text | str | input_text with the prefix-cache impact of the timestamp in the system prompt removed |
| no_sp_sw_messages | list | messages with, additionally, the sliding-window impact removed (on top of no_sp) |
| no_sp_sw_input_text | str | input_text with, additionally, the sliding-window impact removed (on top of no_sp) |
| no_sp_token_ids | list | token_ids with the prefix-cache impact of the timestamp in the system prompt removed |
| no_sp_sw_token_ids | list | token_ids with, additionally, the sliding-window impact removed (on top of no_sp) |
| no_sp_sw_output_token_ids | list | If there is a next turn, token_ids taken from the next turn's answer; otherwise a randomly generated sequence of length output_token_length |
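
A small usage sketch of this format, assuming the unified trace is saved as a single Feather file and that chat_id/parent_chat_id are stored as ints per the bracketed feather types; the file name is a placeholder.

```python
import pandas as pd

# Load the unified trace (field types as in the table above).
df = pd.read_feather("trace_unified.feather")

# Example: time gap between a request and its parent turn in the same session,
# i.e. how long cached context sits before the next turn reuses it.
parents = df[["chat_id", "timestamp"]].rename(
    columns={"chat_id": "parent_chat_id", "timestamp": "parent_timestamp"})
multi_turn = df[df["parent_chat_id"] != -1].merge(parents, on="parent_chat_id")
reuse_gap_s = multi_turn["timestamp"] - multi_turn["parent_timestamp"]
print(reuse_gap_s.describe())
```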
## Processing pipeline
- pass 1: extract every field that can be read directly from the raw trace; parent_chat_id, session_id, type, and uid (in traceA) remain unavailable at this stage. Drop all illegal records and sort by timestamp
- pass 2: reconstruct sessions in a streaming fashion: set parent_chat_id and session_id, and update the turn field (because of the sliding window, simply counting the user messages is biased); see the sketch after this list
- pass 3: set type from the plugins
    - traceA
        - zhiwen_doc_search, pdf_extracter: file
        - tongyi_nlp_web_search, tongyi_nlp_deep_search, search: search
        - wanx: image
        - other: text
    - traceB
        - same system prompt qps > 0.5: api
        - other: file
- pass 4: remove the prefix mismatches caused by the timestamp in the system prompt and by the sliding window, and add the no_sp / no_sp_sw fields
- pass 5: add output_token_ids: the next turn's answer if there is one, otherwise a randomly generated list of length output_token_length
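
For pass 2, a rough streaming sketch under the assumption that a follow-up request is recognized because an earlier chat's messages plus its answer form a prefix of the new request's messages; this is exactly the step the TBD below asks to verify, the sliding window is ignored here, and the `answer` field is a placeholder.

```python
def link_sessions(records):
    """records: iterable of request dicts sorted by timestamp, each carrying
    'chat_id', 'messages' ([(role, text), ...]) and an 'answer' text.
    Sets parent_chat_id, session_id and turn in one streaming pass."""
    open_chats = {}        # full context (messages + answer) -> (chat_id, session_id, turn)
    next_session_id = 0
    for rec in records:
        msgs = [tuple(m) for m in rec["messages"]]
        prefix = tuple(msgs[:-1])            # context before the new user message
        if prefix in open_chats:             # continuation of an existing session
            parent_id, session_id, parent_turn = open_chats.pop(prefix)
            rec["parent_chat_id"] = parent_id
            rec["session_id"] = session_id
            rec["turn"] = parent_turn + 1
        else:                                # first turn of a new session
            rec["parent_chat_id"] = -1
            rec["session_id"] = next_session_id
            rec["turn"] = 1
            next_session_id += 1
        # Register this chat's context + answer so its next turn can find it.
        full_context = tuple(msgs + [("assistant", rec.get("answer", ""))])
        open_chats[full_context] = (rec["chat_id"], rec["session_id"], rec["turn"])
        yield rec
```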
## Open issues
- uid cannot be obtained from the new traceA
    - fig6: KV cache reuse by same uid
    - fig7: hit by uid count
    - fig8: reqs count by uid
    - fig10: number of turns by uid
## TBD
- [ ] Verify that the linking of consecutive chat turns within a session (i.e. the parent_chat_id assignment) is correct



@@ -0,0 +1,85 @@
## Current Problems
- The latest trace lacks the basis for classifying the workload type (previously this was decided from plugin calls: whether pdf_extractor/wanx/tongyi_nlp_web_search was invoked, giving file/image/search)
- The definition of a turn has become uncertain
  Manus: in agent mode, when the context grows too long and the LLM forgets earlier content, it has the LLM restate some of it itself ("let me think, I saw this before: xxx"), which effectively brings context from far back up to the end again
```
## Old
+---------+
| Human |
+---------+
| /|\
| |
\|/ |
+---------+
| LLM |
+---------+
## New
+---------+
| Human |
+---------+
| /|\
| |
\|/ |
+---------+
| LLM |------+
+---------+ |
/|\ | ----> "<Web_search> Do you understand?" "Yes, sir!"
| |
+------------+
```
- The output token length is missing
  The earlier trace had the `usage` field
![[projects/kvcachecache/Trace-Qwen3.figs/250812-140723.png]]
## Fields
```
__source__
__tag__:__hostname__
__tag__:__pack_id__
__tag__:__path__
__tag__:__receive_time__
__tag__:__service_name__
__tag__:__user_defined_id__
__tag__:_container_ip_
__tag__:_container_name_
__tag__:_image_name_
__tag__:_namespace_
__tag__:_pod_name_
__tag__:_pod_uid_
__tag__:eci_id
__time__
__topic__
code
context
ds_service_id
ds_service_name
interval
message
model
request_id
service_id
service_name
span_id
step
task_id
time
trace_id
user_id
```
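
A small sketch for pulling the useful columns out of such an export, assuming it is dumped as CSV with the field names listed above; the file name is a placeholder.

```python
import pandas as pd

# Only a handful of the exported columns matter for rebuilding the trace.
cols = ["request_id", "user_id", "model", "service_name",
        "time", "message", "context", "trace_id"]
df = pd.read_csv("qwen3_raw_export.csv", usecols=cols)

# Parse arrival times, drop records that fail to parse, and sort.
df["time"] = pd.to_datetime(df["time"], errors="coerce")
df = df.dropna(subset=["time"]).sort_values("time")
print(df.head())
```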
qwen-chat: 345339
tongyi: 421551
tob: 740251 + 740393


@@ -0,0 +1,21 @@
arXiv Submission: Paper Password for Ownership Claim
Dear Co-Authors,
I'm writing to share the paper password for our recent arXiv submission titled "KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider" (arXiv ID: 2506.02634). This password is required for each co-author to claim ownership of the paper on arXiv.
**Paper ID:** 2506.02634
**Paper Password:** fzw95
**Next Steps:**
1. Please use the paper ID and password above to claim ownership via the password form at https://arxiv.org/auth/need-paper-password.
2. Confirm with me once you've completed this step so we can ensure all co-authors are properly credited.
If you encounter any issues or need further assistance, reply to this email directly.
Thank you for your prompt attention to this!
Best regards,
Jiahao Wang
IPADS Shanghai Jiao Tong University


@@ -0,0 +1,19 @@
**Subject:** Action Required: Sign Consent to Publish Form for _KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider_
Dear Co-Authors,
As one of the lead authors, I'm writing to remind you that all authors are required to sign the **consent to publish form** for our paper to be published at USENIX ATC'25. This form grants permission for the paper, as well as any accompanying slides, audio, and/or video of our presentation, to be freely shared as part of USENIX's open-access commitment.
**Action Items:**
1. Please review and e-sign the form here: https://app.hellosign.com/s/79G82eZT.
FYI:
- Our paper title is "KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider".
- You can find the e-sign link at https://www.usenix.org/conference/atc25/instructions-presenters.
2. If you encounter any issues or questions, reply to this email or contact me directly.
Thank you all for your prompt attention to this! Let me know if you've already submitted your form; I'll follow up individually if needed to ensure everyone is covered.
Best regards,
Jiahao Wang
IPADS Shanghai Jiao Tong University


@@ -0,0 +1,19 @@
Request for Minor Revision to Submitted Paper Due to Open-Source Feedback
Dear ATC'25 Chairs,
I hope this message finds you well.
We are writing regarding our paper titled "KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider" (Paper ID: 477), submitted to ATC'25. Following its open-sourcing on GitHub, a community member kindly pointed out a minor misstatement about the trace anonymization method in the paper. We have reviewed their feedback and agree with the concern.
To ensure accuracy and clarity, we have updated the relevant statement in the paper accordingly. The change does not affect the overall results or conclusions, but we believe it is important to address the issue for the sake of correctness.
We would like to kindly request your permission to update the submitted version with this corrected version. Please let us know the appropriate process, or whether such an update is permissible at this stage.
We have attached our current version (kvcache.pdf) as well as the diff against the currently submitted version (diff.pdf).
We appreciate your understanding and look forward to your guidance.
Best regards,
Jiahao Wang
Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University


@@ -0,0 +1,6 @@
目前大语言模型LLM推理服务已成为云服务提供商的关键应用而跨请求缓存中间结果KVCache能够显著提升系统吞吐量并降低响应延迟然而现有研究多基于合成负载尚未充分揭示真实生产环境下 KVCache 的作用机制,例如缓存驱逐策略等系统决策高度依赖于工作负载。为此,本工作依托于阿里巴巴通义实验室,对通义千问在线服务的全量真实工作负载进行了脱敏采集与深入分析,发现单轮请求与多轮对话之间的缓存重用同等重要却表现各异,不同请求类型下的缓存重用时间窗口与概率虽差异显著,但对于某一固定类型的请求,其缓存重用模式高度可预测;且在 API 主导的场景中,容量有限的 GPU 本地缓存已足以满足需求。基于这些针对真实负载的观察,我们设计了一种基于工作负载感知的缓存驱逐策略,使缓存命中率由 14.5% 提升至 18.5%首词时延TTFT缩短约 25%,从而在真实业务场景下大幅提升了服务性能。
xingda version:
KVCache 缓存是当今大模型推理系统的关键组件,其系统设计与缓存特征密切相关。在本研究中,我们与阿里通义实验室合作,深入分析了千问线上脱敏的 KVCache 缓存特征。我们发现了几个全新的见解,包括:单轮对话场景也高度依赖 KVCache 缓存,不同负载的缓存分布在各时间段呈现规律性特征等。基于这些观察,我们设计了一种新型的负载感知的 KVCache 缓存替换策略,在真实数据集上将缓存命中率从 14.5% 提升至 18.5%同时将首词时延TTFT减少 25%。