Goto

Collaborating Authors

 model replica


Distributed Low-Communication Training with Decoupled Momentum Optimization

arXiv.org Artificial Intelligence

The training of large models demands substantial computational resources, typically available only in data centers with high-bandwidth interconnects. However, reducing the reliance on high-bandwidth interconnects between nodes enables the use of distributed compute resources as an alternative to centralized data center training. Building on recent advances in distributed model training, we propose an approach that further reduces communication by combining infrequent synchronizations across distributed model replicas with gradient momentum compression. In particular, we treat the optimizer momentum as a signal and decompose the Nesterov momentum into high- and low-frequency components via the discrete cosine transform (DCT). Only the high-frequency components are synchronized across model replicas every $H$ steps. Empirically, our method achieves up to a $16\times$ reduction in communication compared to the baseline DiLoCo, and it generalizes across architectures, including transformer-based language models and convolutional neural networks for images. Overall, this work advances the feasibility of training large models on distributed nodes with low-bandwidth interconnects.


On Machine Learning Approaches for Protein-Ligand Binding Affinity Prediction

arXiv.org Artificial Intelligence

Binding affinity optimization is crucial in early-stage drug discovery. While numerous machine learning methods exist for predicting ligand potency, their comparative efficacy remains unclear. This study evaluates the performance of classical tree-based models and advanced neural networks in protein-ligand binding affinity prediction. Our comprehensive benchmarking encompasses 2D models utilizing ligand-only RDKit embeddings and Large Language Model (LLM) ligand representations, as well as 3D neural networks incorporating bound protein-ligand conformations. We assess these models across multiple standard datasets, examining various predictive scenarios including classification, ranking, regression, and active learning. Results indicate that simpler models can surpass more complex ones in specific tasks, while 3D models leveraging structural information become increasingly competitive with larger training datasets containing compounds with labelled affinity data against multiple targets. Pre-trained 3D models, by incorporating protein pocket environments, demonstrate significant advantages in data-scarce scenarios for specific binding pockets. Additionally, LLM pretraining on 2D ligand data enhances complex model performance, providing versatile embeddings that outperform traditional RDKit features in computational efficiency. Finally, we show that combining 2D and 3D model strengths improves active learning outcomes beyond current state-of-the-art approaches. These findings offer valuable insights for optimizing machine learning strategies in drug discovery pipelines.


Large Scale Distributed Deep Networks

Neural Information Processing Systems

Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS.


Communication-Efficient Cooperative Multi-Agent PPO via Regulated Segment Mixture in Internet of Vehicles

arXiv.org Artificial Intelligence

Multi-Agent Reinforcement Learning (MARL) has become a classic paradigm to solve diverse, intelligent control tasks like autonomous driving in Internet of Vehicles (IoV). However, the widely assumed existence of a central node to implement centralized federated learning-assisted MARL might be impractical in highly dynamic scenarios, and the excessive communication overheads possibly overwhelm the IoV system. Therefore, in this paper, we design a communication efficient cooperative MARL algorithm, named RSM-MAPPO, to reduce the communication overheads in a fully distributed architecture. In particular, RSM-MAPPO enhances the multi-agent Proximal Policy Optimization (PPO) by incorporating the idea of segment mixture and augmenting multiple model replicas from received neighboring policy segments. Afterwards, RSM-MAPPO adopts a theory-guided metric to regulate the selection of contributive replicas to guarantee the policy improvement. Finally, extensive simulations in a mixed-autonomy traffic control scenario verify the effectiveness of the RSM-MAPPO algorithm.


Machine Learning Detection and Response: Safeguarding AI with MLDR

#artificialintelligence

In previous articles, we've discussed the ubiquity of AI-based systems and the risks they're facing; we've also described the common types of attacks against machine learning (ML) and built a list of adversarial ML tools and frameworks that are publicly available. Today, the time has come to talk about countermeasures. Over the past year, we've been working on something that fundamentally changes how we approach the security of ML and AI systems. Typically undertaken is a robustness-first approach which adds complexity to models, often at the expense of performance, efficacy, and training cost. To us, it felt like kicking the can down the road and not addressing the core problem โ€“ that ML is under attack. Back in 2019, the future founders of HiddenLayer worked closely together at a next-generation antivirus company.


Decentralized Federated Learning: A Segmented Gossip Approach

arXiv.org Machine Learning

The emerging concern about data privacy and security has motivated the proposal of federated learning, which allows nodes to only synchronize the locally-trained models instead their own original data. Conventional federated learning architecture, inherited from the parameter server design, relies on highly centralized topologies and the assumption of large nodes-to-server bandwidths. However, in real-world federated learning scenarios the network capacities between nodes are highly uniformly distributed and smaller than that in a dat-acenter. It is of great challenges for conventional federated learning approaches to efficiently utilize network capacities between nodes. In this paper, we propose a model segment level decentralized federated learning to tackle this problem. In particular, we propose a segmented gossip approach, which not only makes full utilization of node-to- node bandwidth, but also has good training convergence. The experimental results show that even the training time can be highly reduced as compared to centralized federated learning.


Large Scale Distributed Deep Networks

Neural Information Processing Systems

Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports for a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 100x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.