Karamzade, Armin
Reinforcement Learning from Delayed Observations via World Models
Karamzade, Armin, Kim, Kyungmin, Kalsi, Montek, Fox, Roy
In standard reinforcement learning settings, agents typically assume that they observe the effects of their actions immediately after taking them. In practice, however, this assumption may not hold due to physical constraints, which can significantly degrade the performance of learning algorithms. In this paper, we address observation delays in partially observable environments. We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays. By reducing delayed POMDPs to delayed MDPs with world models, our methods can effectively handle partial observability, where existing approaches achieve sub-optimal performance or degrade quickly as observability decreases. Experiments suggest that one of our methods can outperform a naive model-based approach by up to 250%. Moreover, we evaluate our methods on visual delayed environments, showcasing for the first time delay-aware reinforcement learning for continuous control with visual observations.
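As a rough illustration of the mechanism (a minimal sketch, not the paper's exact algorithm): assuming a learned world model that provides an encoder, a latent dynamics function, and a policy (all names here are hypothetical), the agent can bridge an observation delay of d steps by encoding the most recent, d-step-old observation and imagining forward through the actions it has taken since.

```python
from collections import deque

class DelayedAgent:
    """Sketch: act under a d-step observation delay by rolling a learned
    world model forward through the actions taken since the delayed obs."""

    def __init__(self, encode, dynamics, policy, delay):
        self.encode = encode      # obs -> latent state (hypothetical)
        self.dynamics = dynamics  # (latent, action) -> next latent (hypothetical)
        self.policy = policy      # latent -> action (hypothetical)
        self.actions = deque(maxlen=delay)  # actions taken since that obs

    def act(self, delayed_obs):
        z = self.encode(delayed_obs)   # latent for the d-step-old observation
        for a in self.actions:         # replay buffered actions in imagination
            z = self.dynamics(z, a)
        action = self.policy(z)        # act on the estimated current latent
        self.actions.append(action)
        return action
```

The world model's latent state summarizes the observation history, which is what reduces the delayed POMDP to a delayed MDP that imagination can then bridge.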
Moonwalk: Inverse-Forward Differentiation
Krylov, Dmitrii, Karamzade, Armin, Fox, Roy
Backpropagation, while effective for gradient computation, falls short in addressing memory consumption, which limits scalability. This work explores forward-mode gradient computation as an alternative in invertible networks, showing its potential to reduce the memory footprint without substantial drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian product that accelerates the computation of forward gradients while retaining the memory reduction and preserving the fidelity of true gradients. Our method, Moonwalk, has a time complexity linear in the depth of the network, unlike the quadratic time complexity of naïve forward mode, and empirically reduces computation time by several orders of magnitude without allocating more memory. We further accelerate Moonwalk by combining it with reverse-mode differentiation to achieve time complexity comparable with backpropagation while maintaining a much smaller memory footprint. Finally, we showcase the robustness of our method across several architecture choices. Moonwalk is the first forward-based method to compute true gradients in invertible networks in computation time comparable to backpropagation, using significantly less memory.
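The core trick can be sketched in a few lines of JAX on a toy two-layer invertible network (illustrative definitions, not the paper's implementation): since the Jacobian of f⁻¹ is the inverse Jacobian of f, a VJP through each layer's inverse pushes the input gradient forward through the network, which is the vector-inverse-Jacobian product named above. How the initial input gradient is obtained and amortized is where the paper's complexity analysis lives; here it is taken in plain forward mode for clarity.

```python
import jax
import jax.numpy as jnp

def layer(x, w):       # toy invertible layer: elementwise positive scaling
    return x * jnp.exp(w)

def layer_inv(z, w):   # closed-form inverse of the layer above
    return z * jnp.exp(-w)

def loss(z):
    return jnp.sum(z ** 2)

def moonwalk_style_grads(x, w1, w2):
    z1 = layer(x, w1)
    z2 = layer(z1, w2)

    # Gradient w.r.t. the input, obtained in forward mode
    # (one JVP per input dimension; cheap when the input is small).
    g0 = jax.jacfwd(lambda x_: loss(layer(layer(x_, w1), w2)))(x)

    # Push the gradient *forward* through each layer with a VJP of the
    # layer's inverse: since d f^-1/dz = (df/dx)^-1, this is exactly the
    # vector-inverse-Jacobian product g_i = g_{i-1} J_{f_i}^{-1}.
    g1 = jax.vjp(lambda z: layer_inv(z, w1), z1)[1](g0)[0]  # dL/dz1
    g2 = jax.vjp(lambda z: layer_inv(z, w2), z2)[1](g1)[0]  # dL/dz2

    # Parameter gradients via per-layer VJPs against the activation grads.
    dw1 = jax.vjp(lambda w: layer(x, w), w1)[1](g1)[0]
    dw2 = jax.vjp(lambda w: layer(z1, w), w2)[1](g2)[0]
    return dw1, dw2
```

For this toy loss, dw1 and dw2 agree with what jax.grad of the composed network would return, since the inverse-VJP chain recovers the true activation gradients.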
Matching DNN Compression and Cooperative Training with Resources and Data Availability
Malandrino, Francesco, Di Giacomo, Giuseppe, Karamzade, Armin, Levorato, Marco, Chiasserini, Carla Fabiana
To make machine learning (ML) sustainable and apt to run on the diverse devices where the relevant data resides, it is essential to compress ML models as needed, while still meeting the required learning quality and time performance. However, how much and when an ML model should be compressed, and where its training should be executed, are hard decisions to make, as they depend on the model itself, the resources of the available nodes, and the data such nodes own. Existing studies focus on each of these aspects individually; they do not account for how such decisions can be made jointly and adapted to one another. In this work, we model the network system focusing on the training of DNNs, formalize the above multi-dimensional problem, and, given its NP-hardness, formulate an approximate dynamic programming problem that we solve through the PACT algorithmic framework. Importantly, PACT leverages a time-expanded graph representing the learning process, and a data-driven and theoretical approach to predicting the loss evolution to be expected as a consequence of training decisions. We prove that PACT's solutions can get as close to the optimum as desired, at the cost of an increased time complexity, and that, in any case, such complexity is polynomial. Numerical results also show that, even under the most disadvantageous settings, PACT outperforms state-of-the-art alternatives and closely matches the optimal energy cost.
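The time-expanded graph invites a shortest-path reading (a simplified sketch under invented node and cost structure, not PACT itself): nodes pair a time step with a model/placement configuration, edges are admissible decisions weighted by their energy cost, and a plan is feasible once the predicted loss meets the target before the deadline.

```python
import heapq

def min_energy_plan(start, is_goal, edges):
    """Cheapest plan in a time-expanded decision graph (Dijkstra).

    Nodes are hypothetical (time_step, model_config, machine) tuples.
    edges(node) yields (next_node, energy_cost) pairs, one per decision:
    e.g. train one epoch in place, compress the model, or migrate.
    is_goal(node) checks feasibility, e.g. predicted loss below the
    target before the deadline.
    """
    dist, frontier = {start: 0.0}, [(0.0, start)]
    while frontier:
        cost, node = heapq.heappop(frontier)
        if cost > dist.get(node, float("inf")):
            continue                       # stale queue entry
        if is_goal(node):
            return cost, node              # cheapest feasible plan
        for nxt, energy in edges(node):
            if cost + energy < dist.get(nxt, float("inf")):
                dist[nxt] = cost + energy
                heapq.heappush(frontier, (cost + energy, nxt))
    return None                            # no feasible plan exists
```

What makes is_goal evaluable before any training actually runs is PACT's data-driven and theoretical prediction of the loss evolution along each edge.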
Regularizing Recurrent Neural Networks via Sequence Mixup
Karamzade, Armin, Najafi, Amir, Motahari, Seyed Abolfazl
Recurrent neural networks are the basis of state-of-the-art models in natural language processing, including language modeling (Mikolov et al., 2011), machine translation (Cho et al., 2014) and named entity recognition (Lample et al., 2016). Complex learning tasks naturally require relatively large networks with millions of parameters, but large neural networks need more data and/or strong regularization techniques to be trained successfully and avoid overfitting. When collecting more data is not an option, as in the majority of real-world problems, data augmentation and regularization are the standard alternatives for overcoming this barrier. Data augmentation in natural language processing is limited and often task-specific (Kobayashi, 2018; Kafle et al., 2017). On the other hand, adopting regularization methods originally proposed for feed-forward (non-recurrent) networks must be done with extra care to avoid hurting the network's information flow between consecutive time steps. To overcome such limitations, we present Sequence Mixup: a set of training methods, regularization techniques, and data augmentation procedures for RNNs. Sequence Mixup can be considered the RNN generalization of input mixup (Zhang et al., 2017) and manifold mixup (Verma et al., 2018), which were originally introduced for feed-forward neural networks.
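As a rough sketch of what mixup through time can look like (one illustrative variant, not necessarily the paper's exact formulation): two training sequences share a recurrence, and at every step the hidden states produced from the two inputs are blended with a coefficient λ drawn from a Beta distribution, with targets mixed by the same λ.

```python
import numpy as np

def rnn_cell(h, x, Wh, Wx):
    """Toy vanilla-RNN step (illustrative; any recurrent cell works)."""
    return np.tanh(h @ Wh + x @ Wx)

def sequence_mixup_step(xs_a, xs_b, ys_a, ys_b, Wh, Wx, alpha=0.2, seed=0):
    """Blend two training sequences inside the recurrence.

    xs_a, xs_b: (T, input_dim) input sequences to mix
    ys_a, ys_b: their targets, mixed with the same coefficient
    """
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)             # mixing coefficient, as in mixup
    h = np.zeros(Wh.shape[0])
    for t in range(xs_a.shape[0]):
        h_a = rnn_cell(h, xs_a[t], Wh, Wx)   # step on sequence A's input
        h_b = rnn_cell(h, xs_b[t], Wh, Wx)   # step on sequence B's input
        h = lam * h_a + (1 - lam) * h_b      # mix hidden states every step
    y_mix = lam * ys_a + (1 - lam) * ys_b    # consistent target mixing
    return h, y_mix
```

Mixing inside the recurrence, rather than only at the inputs, is what carries the manifold-mixup idea across time steps without cutting the hidden-state information flow.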
Structure Learning of Sparse GGMs over Multiple Access Networks
Tavassolipour, Mostafa, Karamzade, Armin, Mirzaeifard, Reza, Motahari, Seyed Abolfazl, Shalmani, Mohammad-Taghi Manzuri
A central machine is interested in estimating the underlying structure of a sparse Gaussian Graphical Model (GGM) from datasets distributed across multiple local machines. The local machines can communicate with the central machine through a wireless multiple access channel. In this paper, we are interested in designing effective strategies where reliable learning is feasible under power and bandwidth limitations. Two approaches are proposed: the Signs method and the Uncoded method. In the Signs method, the local machines quantize their data into binary vectors, and an optimal channel coding scheme is used to reliably send the vectors to the central machine, where the structure is learned from the received data. In the Uncoded method, data symbols are scaled and transmitted directly through the channel, and the central machine uses the received noisy symbols to recover the structure. Theoretical results show that both methods can recover the structure with high probability given a sufficiently large sample size. Experimental results indicate the superiority of the Signs method over the Uncoded method in several settings.
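On the central machine's side, the Signs method admits a compact sketch (illustrative, ignoring the channel-coding layer): for zero-mean Gaussian variables, E[sign(X_i) sign(X_j)] = (2/π) arcsin(ρ_ij), so correlations can be estimated from the received sign bits alone, after which thresholding the precision matrix reads off the graph. The threshold and ridge term below are hypothetical tuning choices.

```python
import numpy as np

def structure_from_signs(sign_data, threshold=0.1):
    """Recover a sparse GGM structure from one-bit (sign) samples.

    sign_data: (n_samples, p) array of +/-1 bits received from the
               local machines (reliably, e.g. after channel coding)
    """
    # Arcsine law for zero-mean Gaussians:
    #   E[sign(X_i) sign(X_j)] = (2 / pi) * arcsin(rho_ij)
    sign_corr = sign_data.T @ sign_data / sign_data.shape[0]
    rho = np.sin(np.pi / 2 * np.clip(sign_corr, -1.0, 1.0))

    # Zeros of the precision (inverse covariance) matrix encode
    # missing edges in a GGM; a small ridge keeps the inverse stable.
    precision = np.linalg.inv(rho + 1e-6 * np.eye(rho.shape[1]))
    adjacency = np.abs(precision) > threshold
    np.fill_diagonal(adjacency, False)
    return adjacency
```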