CO2: Efficient Distributed Training with Full Communication-Computation Overlap

Jan 29, 2024 · Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

PDF · Cite · Code