Which ASPLOS 2024 papers are worth paying attention to? (Answer)

Post Date:

2024-05-06

Blog Link:

I'll go through the Peking University (PKU) papers in session order.

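First is Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning, by @Charlie and @李秀红. It focuses on abstracting and optimizing partitioning and scheduling for distributed training, and it is one of the best papers of ASPLOS'24 (there are 6 best papers in total this time, 2 per cycle on average, which more or less covers every area ASPLOS touched on this year).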

Next is SoCFlow: Efficient and Scalable DNN Training on SoC-Clustered Edge Servers, by 徐大亮 from 刘譞哲's group. It targets distributed training on SoC-Clusters (to me it reads as distributed on-device learning).

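Then comes our group's work, MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN. The code is open-sourced at https://github.com/pku-liang/MAGIS for anyone interested. The main idea is to optimize DNN memory usage by coordinating graph transformation with graph scheduling. The dim graph abstraction and the construction of the fission tree are the fun parts. That said, the dim graph was mainly introduced so that the definition of the fission transform is complete; when actually optimizing DNN training, the dimension that matters most is still the batch-dim (although the sub-graphs it is applied to are fairly diverse). My visa was denied, so Prof. 梁云 and 子健 gave the talk and the lightning talk on my behalf, and 聪哥 helped put up the poster ...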
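To make the batch-dim point concrete, here is a minimal sketch in plain Python. It is not MAGIS's implementation: the ops `expand` and `collapse`, the helper names, and the "peak" metric (number of live intermediate elements) are all assumptions made up for illustration. It only shows why running a sub-graph on slices of the batch can lower the peak size of its intermediates while producing the same result.

```python
from typing import List, Tuple

Row = List[float]

def expand(row: Row, factor: int = 8) -> Row:
    """Hypothetical op: turns each sample into a much larger intermediate."""
    return [x * (k + 1) for x in row for k in range(factor)]

def collapse(row: Row) -> float:
    """Hypothetical op: reduces the intermediate back to one value per sample."""
    return sum(row)

def run_whole(batch: List[Row]) -> Tuple[List[float], int]:
    """Run expand -> collapse on the full batch.

    All intermediates are alive at the same time, so the peak is their total size.
    """
    inter = [expand(r) for r in batch]
    peak = sum(len(r) for r in inter)
    return [collapse(r) for r in inter], peak

def run_fissioned(batch: List[Row], parts: int) -> Tuple[List[float], int]:
    """Split the sub-graph along the batch dimension and run the pieces in sequence.

    Only one piece's intermediates are alive at a time, so the peak is the
    largest piece rather than the sum of all pieces.
    """
    chunk = (len(batch) + parts - 1) // parts  # ceil division
    outputs: List[float] = []
    peak = 0
    for start in range(0, len(batch), chunk):
        out, piece_peak = run_whole(batch[start:start + chunk])
        outputs.extend(out)
        peak = max(peak, piece_peak)
    return outputs, peak

batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
whole_out, whole_peak = run_whole(batch)
split_out, split_peak = run_fissioned(batch, parts=4)
```

In this toy setting the outputs match and the peak shrinks with the number of parts. A real system also has to weigh the extra latency of running the pieces sequentially, which this sketch ignores.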

Last are two ML PIM system papers by 李聪 and 周哲 from 孙广宇's group: PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization, and SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration. The former builds a DL framework for DRAM-PIM platforms, with the AE code open-sourced at https://github.com/leesou/PIM-DL-ASPLOS . The latter designs a PIM architecture for accelerating LLM speculative inference, together with a matching DSE framework.

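Beyond these, my personal favorite at this conference is the torch compiler paper, PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. It does quite well on both usability and performance, and together with OpenAI Triton it sets the current bar for industrial-grade ML compilers (the torch compiler backend can also target triton, so the two strengths combine). It is worth noting that when the torch compiler backend targets triton, it ultimately generates python code that contains the triton code for the fused operators plus the optimized code skeleton of the original function. Our work MAGIS does something similar: it ends up generating a python function that calls pytorch apis, which in principle can be fed directly into torch compiler (though there are some memory-management issues there, which I plan to improve later).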
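The "generate a python function from an optimized graph" idea can be sketched as follows. This is a toy illustration, not the code generator of MAGIS or of the torch compiler: the `Node` format, `emit_python`, `load`, and `StandInOps` are hypothetical names, and a stand-in op namespace is used so the example runs without PyTorch installed.

```python
from typing import Any, Callable, List, Tuple

# One scheduled node: (output name, op name, input names), already in execution order.
Node = Tuple[str, str, List[str]]

def emit_python(fn_name: str, inputs: List[str],
                schedule: List[Node], outputs: List[str]) -> str:
    """Emit Python source that replays the schedule as calls into an `ops` namespace."""
    lines = [f"def {fn_name}({', '.join(inputs)}):"]
    for out, op, args in schedule:
        lines.append(f"    {out} = ops.{op}({', '.join(args)})")
    lines.append(f"    return {', '.join(outputs)}")
    return "\n".join(lines)

def load(source: str, fn_name: str, ops: Any) -> Callable:
    """Compile the generated source with `ops` bound to the given namespace."""
    namespace = {"ops": ops}
    exec(source, namespace)
    return namespace[fn_name]

class StandInOps:
    """Stand-in for a tensor library, used only so the sketch runs anywhere."""
    @staticmethod
    def add(a, b):
        return a + b

    @staticmethod
    def mul(a, b):
        return a * b

schedule = [("t0", "add", ["x", "y"]), ("t1", "mul", ["t0", "x"])]
source = emit_python("generated", ["x", "y"], schedule, ["t1"])
generated = load(source, "generated", StandInOps)
print(source)
```

Because the generated source only refers to `ops.add` and `ops.mul`, passing the `torch` module as `ops` would make the same function call `torch.add` and `torch.mul`, and an ordinary python function like that is the kind of thing `torch.compile` accepts as input.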