Distributed Low-Communication Training with Decoupled Momentum Optimization

Nedelkoski, Sasho, Acker, Alexander, Kao, Odej, Becker, Soeren, Scheinert, Dominik

arXiv.org Artificial Intelligence

The training of large models demands substantial computational resources, typically available only in data centers with high-bandwidth interconnects. However, reducing the reliance on high-bandwidth interconnects between nodes enables the use of distributed compute resources as an alternative to centralized data center training. Building on recent advances in distributed model training, we propose an approach that further reduces communication by combining infrequent synchronizations across distributed model replicas with gradient momentum compression. In particular, we treat the optimizer momentum as a signal and decompose the Nesterov momentum into high- and low-frequency components via the discrete cosine transform (DCT). Only the high-frequency components are synchronized across model replicas every $H$ steps. Empirically, our method achieves up to a $16\times$ reduction in communication compared to the baseline DiLoCo, and it generalizes across architectures, including transformer-based language models and convolutional neural networks for images. Overall, this work advances the feasibility of training large models on distributed nodes with low-bandwidth interconnects.
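As a rough illustration of the frequency split described in the abstract, the sketch below decomposes a flattened momentum vector with an orthonormal DCT and zeroes complementary coefficient bands; the vector size and the 10% high-frequency fraction are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a DCT-based momentum split, assuming SciPy.
# The keep-fraction and vector size are illustrative, not the paper's.
import numpy as np
from scipy.fft import dct, idct

def split_momentum(momentum: np.ndarray, high_fraction: float = 0.1):
    """Decompose a flattened momentum vector into low- and high-frequency
    parts via the orthonormal DCT-II."""
    coeffs = dct(momentum, norm="ortho")
    cutoff = int(len(coeffs) * (1.0 - high_fraction))
    low, high = coeffs.copy(), coeffs.copy()
    low[cutoff:] = 0.0   # low-frequency band: kept local
    high[:cutoff] = 0.0  # high-frequency band: synchronized every H steps
    return idct(low, norm="ortho"), idct(high, norm="ortho")

m = np.random.randn(1024).astype(np.float32)
m_low, m_high = split_momentum(m)
assert np.allclose(m_low + m_high, m, atol=1e-4)  # DCT is linear, so the split is exact
```

Because the DCT is orthonormal, the two parts sum back to the original momentum exactly, so restricting synchronization to one band discards no local information.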


165a59f7cf3b5c4396ba65953d679f17-AuthorFeedback.pdf

Neural Information Processing Systems

Thank you for your comments. "What's the difference between Deeploss-VGG/-Squeeze and the loss proposed in [29] (LPIPS)?" We wanted a consistent naming scheme in the paper, but we see that this can be confusing; we will consider renaming them to LPIPS-VGG and LPIPS-Squeeze. "Quantitative analysis on other score functions." We will provide such a measure in the updated paper.


Combining Discrete Wavelet and Cosine Transforms for Efficient Sentence Embedding

Salama, Rana, Youssef, Abdou, Diab, Mona

arXiv.org Artificial Intelligence

Wavelets have emerged as a cutting-edge technology in a number of fields. Concrete results of their application in image and signal processing suggest that wavelets can be effectively applied to Natural Language Processing (NLP) tasks that capture a variety of linguistic properties. In this paper, we leverage the power of applying Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We first evaluate, intrinsically and extrinsically, how wavelets can effectively be used to consolidate important information in a word vector while reducing its dimensionality. We then combine DWT with the Discrete Cosine Transform (DCT) to propose a non-parameterized model that compresses a sentence into a fixed-size vector that densely encodes locally varying word features. We show the efficacy of the proposed paradigm on downstream application models, yielding results comparable, and in some tasks even superior, to the original embeddings.
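To make the pipeline concrete, here is a minimal non-parameterized sketch of the DWT-plus-DCT idea, assuming PyWavelets and SciPy; the db2 wavelet, the embedding sizes, and the number of retained DCT coefficients K are illustrative assumptions rather than the paper's exact choices.

```python
# Hedged sketch: DWT to shrink each word vector, then DCT along the
# sentence axis to get a fixed-size sentence embedding. Assumes PyWavelets.
import numpy as np
import pywt
from scipy.fft import dct

def sentence_embedding(word_vectors: np.ndarray, K: int = 4) -> np.ndarray:
    # 1) DWT per word vector: keep approximation coefficients (roughly halved dim).
    approx = np.stack([pywt.dwt(v, "db2")[0] for v in word_vectors])
    # 2) DCT along the sentence (word) axis; keeping the first K coefficients
    #    makes the output size independent of sentence length.
    coeffs = dct(approx, axis=0, norm="ortho")[:K]
    return coeffs.reshape(-1)

emb = sentence_embedding(np.random.randn(12, 300))  # 12 words, 300-dim embeddings
```

Sentences shorter than K words would need padding for the output size to stay fixed; that edge case is omitted here for brevity.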


SCOPE for Hexapod Gait Generation

O'Connor, Jim, Nash, Jay B., Gezgin, Derin, Parker, Gary B.

arXiv.org Artificial Intelligence

Evolutionary methods have previously been shown to be effective for learning walking gaits on hexapod robots. However, the ability of these algorithms to evolve an effective policy rapidly degrades as the input space becomes more complex. This degradation is due to the exponential growth of the solution space that results from the increasing parameter count needed to handle a more complex input. To address this challenge, we introduce Sparse Cosine Optimized Policy Evolution (SCOPE). SCOPE utilizes the Discrete Cosine Transform (DCT) to learn directly from the feature coefficients of an input matrix. By truncating the coefficient matrix returned by the DCT, we can reduce the dimensionality of an input while retaining the highest-energy features of the original input. We demonstrate the effectiveness of this method by using SCOPE to learn the gait of a hexapod robot. The hexapod controller is given a matrix input containing time-series information of previous poses, which is then transformed into gait parameters by an evolved policy. In this task, the addition of SCOPE to a reference algorithm achieves a 20% increase in efficacy. SCOPE achieves this result by reducing the total input size of the time-series pose data from 2700 to 54, a 98% decrease. Additionally, SCOPE can compress an input to any output shape, provided that each output dimension is no greater than the corresponding input dimension. This paper demonstrates that SCOPE can significantly compress the size of an input to an evolved controller, resulting in a statistically significant gain in efficacy.
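The compression step lends itself to a short sketch: a 2-D DCT followed by truncation to the low-frequency corner, assuming SciPy. The 90x30 input and 9x6 output below mirror the 2700-to-54 reduction quoted above but are otherwise assumptions, and the evolved policy itself is omitted.

```python
# Hedged sketch of SCOPE-style input compression via 2-D DCT truncation.
import numpy as np
from scipy.fft import dctn

def scope_compress(x: np.ndarray, out_shape: tuple[int, int]) -> np.ndarray:
    """Compress x to out_shape (each output dim <= input dim) by keeping
    the top-left, highest-energy block of its 2-D DCT coefficients."""
    coeffs = dctn(x, norm="ortho")
    r, c = out_shape
    return coeffs[:r, :c]

pose_history = np.random.randn(90, 30)           # time-series pose matrix: 2700 values
features = scope_compress(pose_history, (9, 6))  # 54 coefficients fed to the policy
```

For smooth signals such as pose trajectories, the DCT concentrates most of the energy in the low-frequency corner, which is why this truncation retains the dominant features.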


Efficient Transformations in Deep Learning Convolutional Neural Networks

Yilmaz, Berk, Harvey, Daniel Fidel, Dhuri, Prajit

arXiv.org Artificial Intelligence

This study investigates the integration of signal processing transformations -- the Fast Fourier Transform (FFT), Walsh-Hadamard Transform (WHT), and Discrete Cosine Transform (DCT) -- within the ResNet50 convolutional neural network (CNN) model for image classification. The primary objective is to assess the trade-offs between computational efficiency, energy consumption, and classification accuracy during training and inference. In experiments on the CIFAR-100 dataset (100 classes, 60,000 images), incorporating WHT significantly reduced energy consumption while improving accuracy. Specifically, a baseline ResNet50 model achieved a testing accuracy of 66%, consuming an average of 25,606 kJ per model. In contrast, a modified ResNet50 incorporating WHT in the early convolutional layers achieved 74% accuracy, and an enhanced version with WHT applied to both early and late layers achieved 79% accuracy, with an average energy consumption of only 39 kJ per model. These results demonstrate the potential of WHT as a highly efficient and effective approach for energy-constrained CNN applications.
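As a hedged illustration of the kind of transform stage the study describes, the sketch below applies a normalized Walsh-Hadamard transform along the channel axis using SciPy's hadamard; where the transform sits inside ResNet50 and how it is normalized are assumptions here, not the paper's exact design.

```python
# Hedged sketch: normalized WHT over channel vectors, as a drop-in transform
# stage. Placement and normalization are illustrative assumptions.
import numpy as np
from scipy.linalg import hadamard

def wht(x: np.ndarray) -> np.ndarray:
    """Apply a normalized WHT along the last axis (length must be a power of 2)."""
    n = x.shape[-1]
    H = hadamard(n).astype(x.dtype) / np.sqrt(n)  # orthogonal, entries +-1/sqrt(n)
    return x @ H

feat = np.random.randn(8, 64).astype(np.float32)  # batch of 8, 64 channels
assert np.allclose(wht(wht(feat)), feat, atol=1e-4)  # normalized H is its own inverse
```

Since the unnormalized Hadamard matrix contains only +-1 entries, the transform can be computed without multiplications, which is one plausible source of the large energy savings reported above.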


Sepsyn-OLCP: An Online Learning-based Framework for Early Sepsis Prediction with Uncertainty Quantification using Conformal Prediction

Zhou, Anni, Beyah, Raheem, Kamaleswaran, Rishikesan, Xie, Yao

arXiv.org Artificial Intelligence

Sepsis is a life-threatening syndrome with high morbidity and mortality in hospitals. Early prediction of sepsis plays a crucial role in facilitating early interventions for septic patients. However, early sepsis prediction systems with uncertainty quantification and adaptive learning are scarce. This paper proposes Sepsyn-OLCP, a novel online learning algorithm for early sepsis prediction that integrates conformal prediction for uncertainty quantification and Bayesian bandits for adaptive decision-making. By combining the robustness of Bayesian models with the statistical uncertainty guarantees of conformal prediction methodologies, the algorithm delivers accurate and trustworthy predictions, addressing the critical need for reliable and adaptive systems in high-stakes healthcare applications such as early sepsis prediction. We evaluate the performance of Sepsyn-OLCP in terms of regret in a stochastic bandit setting, the area under the receiver operating characteristic curve (AUROC), and F-measure. Our results show that Sepsyn-OLCP outperforms existing individual models, increasing the AUROC of a neural network from 0.64 to 0.73 without retraining or high computational cost, and that the model-selection policy converges to the optimal strategy in the long run. In summary, we propose a novel reinforcement learning-based framework integrated with conformal prediction techniques that provides uncertainty quantification for early sepsis prediction, delivering accurate and trustworthy predictions for this high-stakes healthcare application.
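The conformal-prediction ingredient can be illustrated with a minimal split-conformal sketch for a binary risk score; the nonconformity score, the 0.1 miscoverage level, and all names below are illustrative assumptions, not components of Sepsyn-OLCP itself.

```python
# Hedged sketch of split conformal prediction for a binary classifier.
import numpy as np

def conformal_threshold(cal_scores, cal_labels, alpha=0.1):
    """Calibrate a nonconformity threshold on held-out data so that the
    prediction set covers the true label with probability >= 1 - alpha."""
    nonconf = np.where(cal_labels == 1, 1.0 - cal_scores, cal_scores)
    n = len(nonconf)
    q = np.ceil((n + 1) * (1.0 - alpha)) / n  # finite-sample correction
    return float(np.quantile(nonconf, min(q, 1.0), method="higher"))

def prediction_set(score, tau):
    """Include every label whose nonconformity falls within the threshold."""
    return {y for y in (0, 1) if (1.0 - score if y == 1 else score) <= tau}

rng = np.random.default_rng(0)
cal_scores = rng.random(200)                              # predicted P(sepsis)
cal_labels = (rng.random(200) < cal_scores).astype(int)   # synthetic calibration labels
tau = conformal_threshold(cal_scores, cal_labels)
print(prediction_set(0.9, tau))  # e.g. {1}, or {0, 1} if the model is uncertain
```

Under exchangeability of calibration and test data, the returned set contains the true label with probability at least 1 - alpha, which is the kind of distribution-free guarantee the framework builds on.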


HADL Framework for Noise Resilient Long-Term Time Series Forecasting

Dey, Aditya, Kusch, Jonas, Al Machot, Fadi

arXiv.org Artificial Intelligence

Long-term time series forecasting is critical in domains such as finance, economics, and energy, where accurate and reliable predictions over extended horizons drive strategic decision-making. Despite the progress in machine learning-based models, the impact of temporal noise in extended lookback windows remains underexplored, often degrading model performance and computational efficiency. In this paper, we propose a novel framework that addresses these challenges by integrating the Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT) to perform noise reduction and extract robust long-term features. These transformations enable the separation of meaningful temporal patterns from noise in both the time and frequency domains. To complement this, we introduce a lightweight low-rank linear prediction layer that not only reduces the influence of residual noise but also improves memory efficiency. Our approach demonstrates competitive robustness to noisy input, significantly reduces computational complexity, and achieves competitive or state-of-the-art forecasting performance across diverse benchmark datasets. Extensive experiments reveal that the proposed framework is particularly effective in scenarios with high noise levels or irregular patterns, making it well suited for real-world forecasting tasks. The code is available at https://github.com/forgee-master/HADL.
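A minimal sketch of this kind of pipeline appears below, assuming PyWavelets and SciPy: Haar-DWT denoising of the lookback window, DCT feature extraction, and a low-rank linear head. The wavelet choice, rank, and window sizes are illustrative assumptions; the authors' actual implementation is in the repository linked above.

```python
# Hedged sketch: DWT denoising + DCT features + low-rank linear forecaster.
import numpy as np
import pywt
from scipy.fft import dct

rng = np.random.default_rng(0)
L, H_out, rank = 512, 96, 8            # lookback, horizon, low-rank factor (assumed)

def extract_features(window: np.ndarray) -> np.ndarray:
    cA, cD = pywt.dwt(window, "haar")  # split into approximation + detail bands
    cD[:] = 0.0                        # crude denoising: drop the detail band
    denoised = pywt.idwt(cA, cD, "haar")
    return dct(denoised, norm="ortho") # frequency-domain long-term features

A = rng.normal(size=(L, rank)) / np.sqrt(L)         # low-rank factors: L*r + r*H
B = rng.normal(size=(rank, H_out)) / np.sqrt(rank)  # params instead of L*H
forecast = extract_features(rng.normal(size=L)) @ A @ B  # shape (H_out,)
```

The factorized head stores L*r + r*H parameters instead of L*H (here 4,864 versus 49,152), which is where the memory saving in such a layer comes from.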


Review for NeurIPS paper: Active Structure Learning of Causal DAGs via Directed Clique Trees

Neural Information Processing Systems

Additional Feedback:
- Line 42: "MEC" is used before it is defined.
- Line 63: Definition of directed cycle looks weird; possibly should be *- instead of *-*? (By this definition, e.g.
- I.e., is it the actual m(D), or the lower bound provided by Theorem 2?
- Appendix, lines 591-593: Please elaborate on the clique intervention lower bound, or provide a reference.

The lower bound is indeed kind of nice, but I still disagree with the authors on the clarity of presentation. The claim itself can be presented as a simple combinatorial statement, and the proof does not use any advanced techniques. In particular, I would encourage the authors to make sure that the proofs in the main paper can be followed without reference to the appendix or prior work.