
 staleness


Recommendation Models

Neural Information Processing Systems

Although synchronous AR training is designed to have higher training efficiency, asynchronous PS training would be a better choice for training speed when there are stragglers (slow workers) in the shared cluster, especially under limited computing resources.
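The straggler effect described above can be illustrated with a little arithmetic. The following is a minimal sketch under simplifying assumptions, not the paper's cost model: the function names, the per-worker step times, and the batch size are all hypothetical, and communication cost and the effect of stale gradients on convergence are ignored.

```python
def sync_throughput(step_times, batch_size):
    """Samples per second when every step waits for the slowest worker."""
    return len(step_times) * batch_size / max(step_times)


def async_throughput(step_times, batch_size):
    """Samples per second when each worker proceeds at its own pace."""
    return sum(batch_size / t for t in step_times)


# Hypothetical seconds per step for four workers; the last one is a straggler.
step_times = [1.0, 1.0, 1.0, 4.0]

sync_rate = sync_throughput(step_times, batch_size=32)
async_rate = async_throughput(step_times, batch_size=32)
print(sync_rate, async_rate)
```

In this toy setting the synchronous rate is bounded by the straggler, while the asynchronous rate only loses the straggler's own share. Raw throughput is only one side of the trade-off, since asynchronous updates are computed on stale parameters.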


Supplementary Material

Neural Information Processing Systems

R(h). (23) Here, for simplicity, we abuse the symbol D in (22) by maximizing out h′ in the original D. In the top-left area of P, suppose only one example (marked by x, with vertical coordinate 1) is confidently labeled as positive, and the remaining examples are labeled with very low confidence, hence do not contribute to the risk R. Similarly, there is only one confidently labeled example () in the bottom-right area of P, and it is negative with vertical coordinate 1. Whenever λ > 2, the optimal h_λ is in (0, 1) and can be found by solving a quadratic equation. In contrast, di-MDD is immune to this problem because R is used only to determine h, while the di-MDD value itself is solely contributed by D. As in the scenario of large λ, we do not change the feature distribution of the source and target domains, hence keeping D(h) = 1 |h|.


Large Graph Property Prediction via Graph Segment Training

Neural Information Processing Systems

Learning to predict properties of a large graph is challenging because each prediction requires the knowledge of an entire graph, while the amount of memory available during training is bounded.





ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models

Wang, Yujia, Cao, Yuanpu, Chen, Jinghui

arXiv.org Artificial Intelligence

Federated learning (FL) has been extensively studied as a privacy-preserving training paradigm. Recently, the federated block coordinate descent scheme has become a popular option for training large-scale models, as it allows clients to train only a subset of the model locally instead of the entire model. However, in the era of large language models (LLMs), even a single block can contain a significant number of parameters, incurring substantial communication latency, particularly for resource-constrained clients. To address this challenge in federated training and fine-tuning of LLMs, we propose ParaBlock, a novel approach that establishes two parallel threads for communication and computation to enhance communication efficiency. We theoretically prove that the proposed ParaBlock achieves the same convergence rate as the standard federated block coordinate descent methods. Empirical evaluations on fine-tuning LLMs for general instruction following and mathematical reasoning confirm that ParaBlock not only maintains strong performance but also significantly improves communication efficiency.
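The idea of running communication and computation in two parallel threads can be sketched generically as a producer-consumer pattern. This is not ParaBlock's algorithm, which the abstract does not specify; `train_block`, `communicator`, and `run_client` are hypothetical names, and the "upload" is a stand-in that only appends to a list.

```python
import queue
import threading


def train_block(block_id):
    """Hypothetical stand-in for local training on one model block."""
    return f"update-{block_id}"


def communicator(outbox, sent):
    """Upload finished block updates until the None sentinel arrives."""
    while True:
        update = outbox.get()
        if update is None:
            break
        sent.append(update)  # stand-in for a network upload


def run_client(block_ids):
    """Train blocks in the main thread while a second thread uploads them."""
    outbox = queue.Queue()
    sent = []
    comm = threading.Thread(target=communicator, args=(outbox, sent))
    comm.start()
    for block_id in block_ids:
        # Hand the finished update off and move on to the next block
        # without waiting for the upload to complete.
        outbox.put(train_block(block_id))
    outbox.put(None)  # signal that no more updates will follow
    comm.join()
    return sent


uploaded = run_client([0, 1, 2])
print(uploaded)
```

The queue decouples the two threads, so the time spent uploading one block can overlap with training the next one. A real client would also have to handle the model state that arrives back from the server, which this sketch leaves out.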



VISAGNN: Versatile Staleness-Aware Efficient Training on Large-Scale Graphs

Xue, Rui

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) have shown exceptional success in graph representation learning and a wide range of real-world applications. However, scaling deeper GNNs poses challenges due to the neighbor explosion problem when training on large-scale graphs. To mitigate this, a promising class of GNN training algorithms utilizes historical embeddings to reduce computation and memory costs while preserving the expressiveness of the model. These methods leverage historical embeddings for out-of-batch nodes, effectively approximating full-batch training without losing any neighbor information, a limitation of traditional sampling methods. However, the staleness of these historical embeddings often introduces significant bias, acting as a bottleneck that can adversely affect model performance. In this paper, we propose a novel VersatIle Staleness-Aware GNN, named VISAGNN, which dynamically and adaptively incorporates staleness criteria into the large-scale GNN training process. By embedding staleness into the message passing mechanism, loss function, and historical embeddings during training, our approach enables the model to adaptively mitigate the negative effects of stale embeddings, thereby reducing estimation errors and enhancing downstream accuracy. Comprehensive experiments demonstrate the effectiveness of our method in overcoming the staleness issue of existing historical embedding techniques, showcasing its superior performance and efficiency on large-scale benchmarks, along with significantly faster convergence.
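One simple way to picture staleness-aware message passing is to down-weight cached neighbor embeddings by how long ago they were last refreshed. The sketch below assumes an exponential decay with a hypothetical rate `gamma` and measures staleness in training steps; it is an illustration only, not VISAGNN's actual formulation, which the abstract does not give.

```python
import math


def staleness_weight(current_step, last_update_step, gamma=0.1):
    """Assumed decay: weight shrinks as the cached embedding gets older."""
    return math.exp(-gamma * (current_step - last_update_step))


def aggregate(neighbors, current_step, gamma=0.1):
    """Staleness-weighted mean of neighbor embeddings.

    neighbors is a list of (embedding, last_update_step) pairs, where
    each embedding is a list of floats of the same length.
    """
    dim = len(neighbors[0][0])
    total = [0.0] * dim
    weight_sum = 0.0
    for embedding, last_update_step in neighbors:
        w = staleness_weight(current_step, last_update_step, gamma)
        weight_sum += w
        for i, value in enumerate(embedding):
            total[i] += w * value
    return [t / weight_sum for t in total]


fresh = ([1.0, 0.0], 10)  # refreshed at the current step
stale = ([0.0, 1.0], 0)   # cached ten steps ago
out = aggregate([fresh, stale], current_step=10)
print(out)
```

With these inputs the fresh neighbor dominates the aggregate, so an outdated cached embedding contributes less to the node's updated representation.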