Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning.
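
To make the three SBP placements concrete, here is a minimal NumPy sketch of how a logical matrix multiplication could be laid out across two simulated devices. It is an illustration under assumptions, not OneFlow's API: the helper names (`split`, `broadcast`, `gather_split`, `reduce_partial`), the two-device setup and the tensor shapes are all hypothetical, and "devices" are simply entries in a Python list.

```python
import numpy as np

# Hypothetical helpers, not part of OneFlow: each "device" is one list entry.

def split(tensor, axis, num_devices):
    """Split: each device holds one slice of the logical tensor along `axis`."""
    return np.array_split(tensor, num_devices, axis=axis)

def broadcast(tensor, num_devices):
    """Broadcast: each device holds a full copy of the logical tensor."""
    return [tensor.copy() for _ in range(num_devices)]

def gather_split(shards, axis):
    """Recover the logical tensor from split shards by concatenation."""
    return np.concatenate(shards, axis=axis)

def reduce_partial(shards):
    """Recover the logical tensor from partial-value shards by elementwise sum."""
    return np.sum(shards, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))   # logical input
w = rng.standard_normal((6, 2))   # logical weight
logical_y = x @ w                 # the result every placement must reproduce

# Data parallelism: x is split by rows, w is broadcast; y comes out split by rows.
y_dp = [xs @ ws for xs, ws in zip(split(x, 0, 2), broadcast(w, 2))]

# Model parallelism: x is broadcast, w is split by columns; y comes out split by columns.
y_mp = [xs @ ws for xs, ws in zip(broadcast(x, 2), split(w, 1, 2))]

# Partial-value: x is split by columns and w by rows, so each device holds a
# full-shaped partial sum; the logical y is the elementwise sum over devices.
y_partial = [xs @ ws for xs, ws in zip(split(x, 1, 2), split(w, 0, 2))]

print(np.allclose(gather_split(y_dp, 0), logical_y))
print(np.allclose(gather_split(y_mp, 1), logical_y))
print(np.allclose(reduce_partial(y_partial), logical_y))
```

In each case the per-device computation is the same local matrix product; only the placement of the inputs changes, which in turn determines how the output is distributed and how the logical tensor is recovered.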
2021: J. Yuan, Xinqi Li, Cheng Cheng, Juncheng Liu, Ran Guo, Shenghang Cai, Chi Yao, Fei Yang, Xiaodong Yi, Chuan Wu, Haoran Zhang, Jie Zhao
https://arxiv.org/pdf/2110.15032v6.pdf