Appendix of A Deep Learning Dataloader with Shared Data Preparation
In this part, we show the I/O speed in the synchronous and asynchronous cases. Figure 3a shows the I/O speed for four jobs that start at different moments. We then compare RefCnt with the generic cache policy in the above cases. D = sample([0, 13333], 10000) means sampling a subset D of size 10000 from [0, 13333] uniformly at random. DSA always achieves the minimum number of misses.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
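The sampling notation above can be made concrete with a short Python sketch; the `sample` helper below is written for illustration and is not the paper's implementation:

```python
import random

def sample(id_range, k):
    """Draw a subset of k distinct sample IDs uniformly at random
    from the inclusive range [id_range[0], id_range[1]]."""
    lo, hi = id_range
    return random.sample(range(lo, hi + 1), k)

# D = sample([0, 13333], 10000): a 10000-element subset of {0, ..., 13333}
D = sample([0, 13333], 10000)
```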
Verifiable Split Learning via zk-SNARKs
Alaa, Rana, González-Ferreiro, Darío, Beis-Penedo, Carlos, Fernández-Veiga, Manuel, Díaz-Redondo, Rebeca P., Fernández-Vilas, Ana
Split learning is an approach to collaborative learning in which a deep neural network is divided at a cut layer into two parts: a client side and a server side. The client side executes its part of the model on its raw input data and sends the intermediate activation to the server side. This architecture is very useful for enabling collaborative training when data or resources are separated across devices. However, split learning lacks the ability to verify the correctness and honesty of the computations performed and exchanged between the parties. To this end, this paper proposes a verifiable split learning framework that integrates zk-SNARK proofs to ensure correctness and verifiability. The zk-SNARK proof and verification are generated for both forward propagation and backward propagation on the server side, guaranteeing verifiability on both sides. The verifiable split learning architecture is compared to a blockchain-enabled system for the same deep learning network, one that records updates but does not generate zero-knowledge proofs. From the comparison, it can be deduced that applying zk-SNARK proofs achieves verifiability and correctness, whereas the blockchain-based system is lightweight but unverifiable.
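The client/server split at a cut layer can be sketched as follows. This is a minimal NumPy illustration with hypothetical layer sizes; the zk-SNARK proof generation that the paper adds on top of this exchange is not shown:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer network split at the cut layer:
# the client holds W1, the server holds W2.
W1 = rng.standard_normal((16, 8))   # client-side weights
W2 = rng.standard_normal((8, 4))    # server-side weights

def client_forward(x):
    """Client runs its part on raw data and sends only the
    intermediate activation across the cut layer."""
    return np.maximum(x @ W1, 0.0)          # ReLU activation

def server_forward(activation):
    """Server completes the forward pass without ever seeing x."""
    return activation @ W2

x = rng.standard_normal((32, 16))           # a private client batch
smashed = client_forward(x)                 # crosses the network
logits = server_forward(smashed)
```

In the paper's framework, a zk-SNARK proof would accompany both the forward activation and the backward gradients so each party can verify the other's computation; producing such proofs requires a circuit compiler and is outside this sketch.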
Approximate Gradient Coding for Distributed Learning with Heterogeneous Stragglers
In this paper, we propose an optimally structured gradient coding scheme to mitigate the straggler problem in distributed learning. Conventional gradient coding methods often assume homogeneous straggler models or rely on excessive data replication, limiting performance in real-world heterogeneous systems. To address these limitations, we formulate an optimization problem minimizing residual error while ensuring unbiased gradient estimation by explicitly considering individual straggler probabilities. We derive closed-form solutions for optimal encoding and decoding coefficients via Lagrangian duality and convex optimization, and propose data allocation strategies that reduce both redundancy and computation load. We also analyze convergence behavior for $\lambda$-strongly convex and $\mu$-smooth loss functions. Numerical results show that our approach significantly reduces the impact of stragglers and accelerates convergence compared to existing methods.
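The unbiasedness requirement under heterogeneous straggler probabilities can be illustrated with a minimal sketch. This is not the paper's closed-form scheme; it only shows the core idea that scaling worker i's partial gradient by 1/(1 - p_i) makes the aggregate unbiased under independent straggling:

```python
import numpy as np

rng = np.random.default_rng(1)

# Full gradient split across 4 workers with heterogeneous
# straggler probabilities p[i] (worker i fails to respond w.p. p[i]).
partials = [rng.standard_normal(5) for _ in range(4)]
p = np.array([0.1, 0.3, 0.2, 0.4])
full_grad = np.sum(partials, axis=0)

def aggregate(responded):
    """Sum responding workers' partial gradients, each scaled by
    1/(1 - p_i) so the estimate is unbiased: E[g_hat] = full_grad."""
    g = np.zeros(5)
    for i in range(4):
        if responded[i]:
            g += partials[i] / (1.0 - p[i])
    return g

# Monte Carlo check of unbiasedness under independent straggling.
est = np.mean(
    [aggregate(rng.random(4) > p) for _ in range(20000)], axis=0
)
```

The scaling removes bias but inflates variance for unreliable workers, which is why the paper jointly optimizes encoding/decoding coefficients and data allocation rather than using this naive scheme.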
Coded Computing for Resilient Distributed Computing: A Learning-Theoretic Framework
Coded computing has emerged as a promising framework for tackling significant challenges in large-scale distributed computing, including the presence of slow, faulty, or compromised servers. In this approach, each worker node processes a combination of the data rather than the raw data itself, and the final result is then decoded from the collective outputs of the worker nodes. However, there is a significant gap between current coded computing approaches and the broader landscape of general distributed computing, particularly when it comes to machine learning workloads. To bridge this gap, we propose a novel foundation for coded computing that integrates the principles of learning theory and develops a framework that seamlessly adapts to machine learning applications. In this framework, the objective is to find the encoder and decoder functions that minimize the loss function, defined as the mean squared error between the estimated and true values. To facilitate the search for the optimal encoding and decoding functions, we show that the loss function can be upper-bounded by the sum of two terms: the generalization error of the decoding function and the training error of the encoding function. Focusing on the second-order Sobolev space, we then derive the optimal encoder and decoder.
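The idea that "each worker node processes a combination of the data" can be illustrated with a classical linear special case: encode two data blocks with a Vandermonde-style code across three workers, so that any two worker outputs suffice to decode. This is an illustrative fixed linear code, not the learned encoder/decoder the abstract's Sobolev-space framework derives:

```python
import numpy as np

rng = np.random.default_rng(2)

# Task: compute A @ x, with A split into k = 2 blocks A1, A2 and
# spread across n = 3 workers via a Vandermonde-style code.
A1, A2 = rng.standard_normal((2, 4, 4))
x = rng.standard_normal(4)
alphas = [0.0, 1.0, 2.0]                    # distinct evaluation points

# Encoder: worker j stores the combination A1 + alpha_j * A2 and
# returns (A1 + alpha_j * A2) @ x -- never the raw blocks.
outputs = [(A1 + a * A2) @ x for a in alphas]

# Decoder: any 2 of the 3 outputs determine A1 @ x and A2 @ x.
# Suppose worker 1 straggles; use workers 0 and 2.
a0, a2 = alphas[0], alphas[2]
y0, y2 = outputs[0], outputs[2]
A2x = (y2 - y0) / (a2 - a0)
A1x = y0 - a0 * A2x
result = np.concatenate([A1x, A2x])         # = the stacked blocks of A @ x
```

The paper's framework generalizes this picture: instead of a fixed linear code that decodes exactly, the encoder and decoder are functions learned to minimize mean squared error for general (e.g. nonlinear ML) computations.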
DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems
Dai, Yuanjun, He, Keqiang, Wang, An
Abstract--Existing batch size selection approaches in distributed machine learning rely on static allocation or simplistic heuristics that fail to adapt to heterogeneous, dynamic computing environments. We present DYNAMIX, a reinforcement learning framework that formulates batch size optimization as a sequential decision-making problem using Proximal Policy Optimization (PPO). Our approach employs a multi-dimensional state representation encompassing network-level metrics, system-level resource utilization, and training statistical efficiency indicators to enable informed decision-making across diverse computational resources. It eliminates the need for explicit system modeling while integrating seamlessly with existing distributed training frameworks. Through evaluations across diverse workloads, hardware configurations, and network conditions, DYNAMIX achieves up to a 6.3% improvement in final model accuracy and a 46% reduction in total training time. Our scalability experiments demonstrate that DYNAMIX maintains the best performance as cluster size increases to 32 nodes, while policy transfer experiments show that learned policies generalize effectively across related model architectures.

Distributed machine learning (DML) has emerged as the predominant paradigm for training increasingly complex models on expansive datasets. As model architectures grow in parameter count and computational demands, practitioners increasingly rely on distributed training across multiple computational nodes to maintain feasible training timelines. Within this paradigm, batch size selection is a critical hyperparameter that significantly influences both training efficiency and model convergence properties. While larger batch sizes generally improve hardware utilization through increased parallelism, they may adversely affect statistical efficiency, potentially degrading convergence rates and generalization performance [19], [32].
The optimization complexity intensifies substantially in heterogeneous distributed environments, characterized by variance in computational capabilities, network characteristics, and hardware specifications across training nodes. These heterogeneous configurations arise from several practical considerations: cost optimization through spot instance utilization [12], consolidation of diverse hardware generations within organizational clusters [13], and workload deployment in multi-tenant infrastructure [15]. Under such conditions, the conventional approach of uniform batch size allocation frequently leads to suboptimal resource utilization, as demonstrated by Jia et al. [16], who observed significant throughput degradation due to synchronization barriers in heterogeneous clusters. Existing approaches to batch size optimization in distributed environments fall into several distinct categories, each exhibiting particular limitations.
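The sequential decision-making formulation above can be sketched as a single control step. The state components and candidate batch sizes are synthetic placeholders chosen for illustration, and a simple heuristic stands in for the trained PPO policy that DYNAMIX actually learns:

```python
import numpy as np

rng = np.random.default_rng(3)

# Discrete action space: candidate per-node batch sizes.
BATCH_SIZES = [32, 64, 128, 256, 512]

def observe_state():
    """Multi-dimensional state: network-, system-, and training-level
    metrics (synthetic placeholder values here)."""
    return np.array([
        rng.uniform(10, 100),   # network bandwidth estimate
        rng.uniform(0, 1),      # GPU utilization
        rng.uniform(0, 1),      # memory pressure
        rng.uniform(0, 1),      # gradient-noise proxy (statistical efficiency)
    ])

def policy(state):
    """Placeholder for the trained PPO policy: a heuristic that
    favours larger batches when GPU utilization is low."""
    idx = int((1.0 - state[1]) * (len(BATCH_SIZES) - 1))
    return BATCH_SIZES[idx]

# One decision step of the control loop: observe, act, then the
# chosen batch size would be applied for the next training window.
state = observe_state()
batch_size = policy(state)
```

In the full system, the reward signal would combine throughput and statistical-efficiency terms, and PPO would update the policy from trajectories of such steps; none of that machinery is shown here.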