Objectives
- Heterogeneous parallelism in a cluster
- Expert parallelism (EP) design for inference performance

Key Results
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement from a heterogeneous setup
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
- [4/10] Analyze correlations between MoE layers (suspended)

Last Week
- [For KR1] Read the vLLM code to understand how its TP/PP/DP work.
- [For KR1] Ran profiling tests with different configs over a more complete search space.
- [Surveying] Studied the autoscaling bottleneck at Ali.
- [Surveying] Explored the opportunity to profile kernels and derive the best compute graph to guide the parallelism config.
- [Misc] Prepared slides for the AIR project conclusion defense.

Next Week
- Survey the possibility of a universal, kernel-based parallelism config search (starting from the work related to NanoFlow).
- Check whether GPU bubbles can be used to run small models.
- Check the challenges of switching the parallelism config with context.