Xu, Weixin
Muon is Scalable for LLM Training
Liu, Jingyuan, Su, Jianlin, Yao, Xingcheng, Jiang, Zhejun, Lai, Guokun, Du, Yulun, Qin, Yidao, Xu, Weixin, Lu, Enzhe, Yan, Junjie, Chen, Yanru, Zheng, Huabin, Liu, Yibo, Liu, Shaowei, Yin, Bohong, He, Weiran, Zhu, Han, Wang, Yuzhi, Wang, Jianzhou, Dong, Mengnan, Zhang, Zheng, Kang, Yongsheng, Zhang, Hao, Xu, Xinran, Zhang, Yutao, Wu, Yuxin, Zhou, Xinyu, Yang, Zhilin
Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need of hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Expert (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with much fewer training FLOPs compared to prior models. We open-source our distributed Muon implementation that is memory optimal and communication efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
Unleashing the Strengths of Unlabeled Data in Pan-cancer Abdominal Organ Quantification: the FLARE22 Challenge
Ma, Jun, Zhang, Yao, Gu, Song, Ge, Cheng, Ma, Shihao, Young, Adamo, Zhu, Cheng, Meng, Kangkang, Yang, Xin, Huang, Ziyan, Zhang, Fan, Liu, Wentao, Pan, YuanKe, Huang, Shoujin, Wang, Jiacheng, Sun, Mingze, Xu, Weixin, Jia, Dengqiang, Choi, Jae Won, Alves, Natália, de Wilde, Bram, Koehler, Gregor, Wu, Yajun, Wiesenfarth, Manuel, Zhu, Qiongjie, Dong, Guoqiang, He, Jian, Consortium, the FLARE Challenge, Wang, Bo
Quantitative organ assessment is an essential step in automated abdominal disease diagnosis and treatment planning. Artificial intelligence (AI) has shown great potential to automatize this process. However, most existing AI algorithms rely on many expert annotations and lack a comprehensive evaluation of accuracy and efficiency in real-world multinational settings. To overcome these limitations, we organized the FLARE 2022 Challenge, the largest abdominal organ analysis challenge to date, to benchmark fast, low-resource, accurate, annotation-efficient, and generalized AI algorithms. We constructed an intercontinental and multinational dataset from more than 50 medical groups, including Computed Tomography (CT) scans with different races, diseases, phases, and manufacturers. We independently validated that a set of AI algorithms achieved a median Dice Similarity Coefficient (DSC) of 90.0% by using 50 labeled scans and 2000 unlabeled scans, which can significantly reduce annotation requirements. They also enabled automatic extraction of key organ biology features, which was labor-intensive with traditional manual measurements. This opens the potential to use unlabeled data to boost performance and alleviate annotation shortages for modern AI models. Abdominal organs are high cancer incidence areas, such as liver cancer, kidney cancer, pancreas cancer, and gastric cancer [1]. Computed Tomography (CT) scanning has been a major imaging technology for the diagnosis and treatment of abdominal cancer because it can yield important prognostic information with fast imaging speed for cancer patients, which has been recommended by many clinical treatment guidelines. In order to quantify abdominal organs, radiologists and clinicians need to manually delineate organ boundaries in each slice of the 3D CT scans [2], [3]. However, manual segmentation is time-consuming and inherently subjective with inter-and intra-expert variability.