Initial commit: obsidian to gitea
phd/weekly-report/24/241027.md

Objectives
- Analysis of QWen trace
- Customize vLLM (Ali ver) with new features
- Port XPURemoting to PhOS

Key Results
- Enhance the QWen trace's workload separation
- Get the vLLM KVCache hit rate for different open-source workloads
- Build a unified docker image for XPURemoting and PhOS

Last Week
- Get a unified workload taxonomy for the QWen trace on both the Web and App ends.
- Run vLLM (Ali ver) and start customizing it to add features (e.g. KVCache hit rate for different workloads).
- Build a new docker image that satisfies PhOS's base-image requirement together with the XPURemoting environment (statically linked PyTorch 1.13.1).

Next Week
- Customize vLLM to support new features such as KVCache schedule policy comparison.
phd/weekly-report/24/241103.md

Objectives
- Analysis of QWen trace
- Customize vLLM (Ali ver) with new features

Key Results
- Tokenize the Qwen trace with Qwen-agent and other tools [60%]
- Modify vLLM to support different KV cache block numbers
- Profile open-source datasets with different cache block counts

Last Week
- Use Qwen-agent to handle workloads with files, getting a more precise token length for these workloads.
- Modify vLLM's cache manager to support a specific number of KVCache blocks, then measure how the KV cache hit rate trends with block number across workloads.

Next Week
- Tokenize the whole Qwen trace, especially multimodal (image) workloads, and run measurements with these traces.
- Profile the KVCache hit rate on the actual trace and compare it with other open-source traces to find differences.
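The hit-rate-versus-block-count measurement described above can also be approximated offline. A minimal sketch, assuming LRU eviction and vLLM-style prefix caching, where a block only matches if its entire preceding prefix matches (the block size, function names, and trace format here are illustrative assumptions, not the actual Ali-vLLM internals):

```python
from collections import OrderedDict

BLOCK_SIZE = 16  # tokens per KV cache block; a common vLLM default

def prefix_blocks(token_ids, block_size=BLOCK_SIZE):
    """Split a prompt's token ids into hashable full-block prefixes."""
    blocks = []
    usable = len(token_ids) - len(token_ids) % block_size
    for i in range(0, usable, block_size):
        # Identify each block by the whole prefix up to its end,
        # mirroring prefix-caching semantics.
        blocks.append(tuple(token_ids[:i + block_size]))
    return blocks

def lru_hit_rate(trace, num_blocks, block_size=BLOCK_SIZE):
    """Replay a trace of prompts against an LRU block cache of fixed size."""
    cache = OrderedDict()  # block id -> None, ordered by recency
    hits = total = 0
    for token_ids in trace:
        for blk in prefix_blocks(token_ids, block_size):
            total += 1
            if blk in cache:
                hits += 1
                cache.move_to_end(blk)
            else:
                cache[blk] = None
                if len(cache) > num_blocks:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0
```

Sweeping `num_blocks` over the same trace then yields the hit-rate-vs-block-number curve directly.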
phd/weekly-report/24/241110.md

Objectives
- Analysis of QWen trace
- Customize vLLM (Ali ver) with new features

Key Results
- Tokenize the Qwen trace with Qwen-agent and other tools
- Profile the Qwen trace with different cache block counts

Last Week
- Use Qwen-agent to handle all workloads in the Qwen trace and get a precise token stream that simulates the actual online environment.
- Measure the performance and KVCache hit rate for different cache block counts using the real Qwen trace, running for one hour.

Next Week
- Check the tokenization results from the Qwen trace and revise them if needed.
- Measure KV cache performance with CPU memory.
phd/weekly-report/24/241117.md

Objective
- Customize vLLM (Ali ver) with new features

Key Results
- Test the modified vLLM, which supports CPU KV cache
- Profile and break down the modified vLLM on synthetic data and the real Qwen trace

Last Week
- Merge the vLLM branch that supports CPU KV cache, and use synthetic data and the real Qwen trace to measure performance and find bugs.
- Add breakdown-measurement support on the vLLM server side to measure the time spent copying KV blocks.

Next Week
- Run more tests on the vLLM that supports CPU KV cache.
- Try to optimize the current implementation.
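The server-side breakdown measurement amounts to accumulating wall-clock time per named phase (e.g. KV block swap-in/swap-out copies). A small sketch of such a helper; the phase names and hook points are hypothetical, not the actual vLLM patch:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class Breakdown:
    """Accumulate wall-clock time and call counts per named phase."""
    def __init__(self):
        self.totals = defaultdict(float)  # phase -> total seconds
        self.counts = defaultdict(int)    # phase -> number of calls

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        return {name: (self.totals[name], self.counts[name])
                for name in self.totals}
```

Wrapping each block-copy call site in `with breakdown.phase("swap_in"): ...` then gives per-phase totals without restructuring the surrounding code.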
phd/weekly-report/24/241124.md

Objective
- Workload-centric KV cache scheduling
- XPURemoting adaptation for PhOS

Key Results
- Refactor the vLLM benchmark tools to get more precise metrics
- Simulate different token lengths and hit rates to quantify the hit rate's effect
- Modify XPURemoting to support the new architecture

Last Week
- Implement a unified vLLM benchmark tool that yields more precise metric results and provides a unified request builder.
- Measure the effect of cache hit rate and try to define a good hit rate for real performance improvement.
- Merge XPURemoting with the new features and support for PhOS.

Next Week
- Define a `good hit rate` for KV cache scheduling.
- Finish the XPURemoting adaptation.
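The token-length/hit-rate simulation can be captured with a first-order model in which cached prefix tokens skip prefill compute, and a `good hit rate` falls out as the smallest hit rate meeting a TTFT target. A sketch under that assumption (all constants, names, and the linear-prefill approximation are illustrative; real attention cost is superlinear):

```python
def estimated_ttft(prompt_tokens, hit_rate,
                   prefill_tok_per_s=8000.0, overhead_s=0.02):
    """First-order TTFT model: cached prefix tokens skip prefill compute."""
    uncached = prompt_tokens * (1.0 - hit_rate)
    return overhead_s + uncached / prefill_tok_per_s

def min_good_hit_rate(prompt_tokens, slo_s, **kw):
    """Smallest hit rate (on a 1% grid) whose estimated TTFT meets the SLO."""
    for hr in (i / 100 for i in range(101)):
        if estimated_ttft(prompt_tokens, hr, **kw) <= slo_s:
            return hr
    return None  # SLO unreachable even with a 100% hit rate
```

The same sweep, run per workload class, shows how the "good" threshold shifts with prompt length.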
phd/weekly-report/24/241201.md

Objective
- Workload-centric KV cache scheduling
- XPURemoting adaptation for PhOS

Key Results
- Define the Good KVCache hit rate under different conditions [6/10]
- Prove the interference between different workloads in current vLLM
- Modify XPURemoting to support PhOS (v1)

Last Week
- Survey different KVCache schedule algorithms and summarize their common ground toward a definition of the Good KVCache hit rate.
- Profile the Ali trace in vLLM and group the workloads to demonstrate interference.
- Adapt XPURemoting to support the current PhOS API, and fully test the implementation on PhOS's open-source examples: [MR](https://ipads.se.sjtu.edu.cn:1312/scaleaisys/xpuremoting/-/merge_requests/25) for XPURemoting and [e80bf94](https://github.com/Gahow/PhoenixOS/commit/e80bf94075fcd6f53c97406dadfbe7f13fc16092) for PhOS.

Next Week
- Finish the definition of the Good KVCache hit rate.
phd/weekly-report/24/241208.md

Objectives
- Serverless KVCache caching
- PhOS profiling

Key Results
- Implement a workload-aware KVCache scheduler [3/10]
- Provide test apps for PhOS

Last Week
- Implement a simulator for the KVCache scheduler to quickly test different policies.
- Prepare and give a paper-sharing talk at Ali.
- Provide StableDiffusion single-GPU training, Llama2-13b multi-GPU training, and Llama2-70b multi-GPU inference scripts for PhOS profiling.

Next Week
- Implement a solution to reduce the KVCache memory requirement.
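A scheduler simulator of the kind described reduces to replaying block accesses against a fixed-capacity cache with a pluggable priority function, so policies can be swapped in one line. A minimal sketch (the trace format and names are assumptions):

```python
def simulate(trace, capacity, priority_fn):
    """Replay (timestamp, block_id) accesses against a cache of `capacity`
    blocks; on overflow, evict the block with the lowest priority score."""
    cache = {}  # block_id -> last access time
    hits = 0
    for t, blk in trace:
        if blk in cache:
            hits += 1
        elif len(cache) >= capacity:
            victim = min(cache, key=lambda b: priority_fn(b, cache[b], t))
            del cache[victim]
        cache[blk] = t  # insert, or refresh recency on a hit
    return hits / len(trace)

def lru(block_id, last_seen, now):
    """LRU falls out as 'priority = last access time'."""
    return last_seen
```

A workload-aware policy plugs in a different `priority_fn` (e.g. predicted reuse) and is compared against `lru` on the same trace.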
phd/weekly-report/24/241215.md

Objective
- Serverless KVCache caching

Key Results
- Test the workload-aware KVCache scheduler
- Implement the workload-aware policy in vLLM

Last Week
- Design a workload-aware schedule policy in the simulator and profile the KVCache reuse rate.
- Implement the designed policy in vLLM.

Next Week
- Profile the real performance of the new policy in vLLM and make some enhancements.
phd/weekly-report/24/241222.md

Objective
- Serverless KVCache caching

Key Results
- Implement the workload-aware policy in vLLM [8/10]
- Profile the workload-aware policy [3/10]
- Quantify the workload differences in the Qwen trace

Last Week
- Add a new design point to the cache policy, making it consider cache memory size and predicted reuse distance together. To do this, add a new monitor for each workload's reuse time interval and average number of tokens.
- Set up an offline (i.e. best-case) scheduling policy, then profile the default policy, our workload-aware policy, and the offline policy to show the performance difference in the CDF of TTFT.
- Implement a cache-block source tracker in vLLM to show where KVCache reuse comes from. It proves that 90% of KVCache reuse comes from multi-turn chat.

Next Week
- Improve the performance of our policy.
- Plot some formal figures.
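The design point of weighing cache memory size against predicted reuse distance can be expressed as a single keep-priority score per cached entry. A hypothetical sketch (the formula and parameter names are illustrative, not the actual policy): entries expected to be reused soon score high, and large entries are penalized so that priority roughly reflects expected hits per unit of cache memory.

```python
def cache_priority(avg_reuse_interval_s, num_tokens, now_s, last_used_s):
    """Workload-aware keep-priority (hypothetical scoring formula).

    avg_reuse_interval_s comes from the per-workload monitor of reuse
    time intervals; num_tokens stands in for the entry's memory size.
    """
    # Predicted time until the next reuse; overdue entries clamp to a
    # small floor, since they may be reused at any moment.
    expected_wait = max(avg_reuse_interval_s - (now_s - last_used_s), 1e-3)
    # Higher score = keep longer: soon-to-be-reused and small wins.
    return 1.0 / (expected_wait * num_tokens)
```

On eviction the entry with the lowest score goes first, which is exactly the shape of `priority_fn` a policy simulator or evictor consumes.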
phd/weekly-report/24/241229.md

Objective
- Serverless KVCache caching

Key Results
- Implement the workload-aware policy in vLLM
- Profile the workload-aware policy [3/10]

Last Week
- Implement a priority-based evictor (with priorities calculated by our policy) for both the GPU and CPU sides.
- Test our policy under relatively small cache memory, getting a 30% cache hit ratio and a 10% performance improvement. This shows our policy suits limited cache memory; for larger cache memory, it still needs some fine-tuning.

Next Week
- Improve our policy for larger cache memory.
- Analyze the new trace.
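A priority-based evictor like the one described can be sketched with a lazy min-heap, the standard trick for supporting priority updates in O(log n) by skipping stale entries on pop (the structure and names here are assumptions, not the actual vLLM patch; the same class would back both the GPU-side and CPU-side evictors):

```python
import heapq

class PriorityEvictor:
    """Min-heap evictor: evict() returns the lowest-priority block."""
    def __init__(self):
        self.heap = []     # (priority, block_id), may hold stale entries
        self.current = {}  # block_id -> latest priority

    def update(self, block_id, priority):
        # Push instead of re-heapifying; stale entries are skipped lazily.
        self.current[block_id] = priority
        heapq.heappush(self.heap, (priority, block_id))

    def evict(self):
        while self.heap:
            priority, block_id = heapq.heappop(self.heap)
            # Accept only if this entry still reflects the latest priority.
            if self.current.get(block_id) == priority:
                del self.current[block_id]
                return block_id
        return None  # nothing evictable
```

Updating a block's priority on every access keeps eviction order consistent with the policy without ever scanning the whole cache.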