Objectives
- Heterogeneous parallelism in the cluster
- EP design for inference performance

Key Results
- [6/10] Profile vLLM to obtain its compute graph
- [2/10] Understand the feasibility of, and challenges in, automatically arranging the LLM inference compute graph
- [5/10] Profile different parallelism setups with real traces and analyze their differences
- [0/10] Meta-analysis of the theoretical maximum improvement with a heterogeneous setup
- [0/10] Fully understand how EP influences performance
- [0/10] Verify how dynamic EP influences performance
- [4/10] Analyze correlations between MoE layers (suspended)

Last Week
- [Surveying] Studied compute graph arrangement in traditional streaming/batch systems and compared it to LLM inference systems.
- [KR1] Profiled vLLM to obtain kernel time consumption and overlapping status.
- [Misc] Reviewed 3 papers as shadow PC for Round 2.
- [Misc] Prepared and delivered the AIR project conclusion defense with slides.

Next Week
- Summarize in a table the similarities and challenges of compute graph arrangement optimization between traditional streaming systems and LLM inference systems.