Collaborating Authors

 Rengarajan, Desik


Hierarchical Multi-Agent Framework for Carbon-Efficient Liquid-Cooled Data Center Clusters

arXiv.org Artificial Intelligence

Reducing the environmental impact of cloud computing requires efficiently distributing workloads across geographically dispersed Data Center Clusters (DCCs) while simultaneously optimizing liquid and air (HVAC) cooling and time-shifting workloads within individual data centers (DCs). This paper introduces Green-DCC, a Reinforcement Learning (RL) based hierarchical controller that dynamically optimizes both workload placement and liquid cooling within a DCC. By incorporating factors such as weather, carbon intensity, and resource availability, Green-DCC addresses realistic constraints and interdependencies. We demonstrate how the system optimizes multiple data centers synchronously, enabling the use of digital twins, and compare the performance of various RL approaches in terms of carbon emissions and sustainability metrics, while also offering a framework and benchmark simulation for broader ML research in sustainability.
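The hierarchy described above couples a cluster-level workload allocator with per-data-center cooling control. The sketch below is a minimal, non-learned illustration of that structure with invented dynamics and function names (it is not the Green-DCC controller or its API): a top-level rule splits load toward low-carbon sites, and a low-level rule converts the assigned load and weather into cooling effort. In the actual framework, RL policies would replace both hand-written rules.

```python
# Hypothetical two-level sketch of a carbon-aware DCC controller (toy dynamics).
import numpy as np

rng = np.random.default_rng(0)

def top_level_assign(carbon_intensity, spare_capacity):
    """Softmax workload split that favours low-carbon, high-capacity sites."""
    score = -carbon_intensity / carbon_intensity.mean() + spare_capacity
    w = np.exp(score - score.max())
    return w / w.sum()

def low_level_cooling(load_mw, outside_temp_c, setpoint_c=24.0):
    """Toy proportional controller for one DC's liquid/HVAC cooling power (MW)."""
    heat = 0.7 * load_mw + 0.3 * max(outside_temp_c - setpoint_c, 0.0)
    return 1.2 * heat

carbon = rng.uniform(100, 500, size=3)     # gCO2/kWh at each site (made up)
spare = rng.uniform(0.5, 1.0, size=3)      # normalized spare capacity (made up)
split = top_level_assign(carbon, spare)    # top-level action: geographic workload split
total_load = 10.0                          # MW of shiftable work
emissions = sum(
    carbon[i] * (split[i] * total_load + low_level_cooling(split[i] * total_load, 30.0))
    for i in range(3)
)
print("workload split:", split.round(3), "| emissions proxy:", round(emissions, 1))
```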


Structured Reinforcement Learning for Media Streaming at the Wireless Edge

arXiv.org Artificial Intelligence

Media streaming is the dominant application over wireless edge (access) networks. The increasing softwarization of such networks has led to efforts at intelligent control, wherein application-specific actions may be taken dynamically to enhance the user experience. The goal of this work is to develop and demonstrate learning-based policies for optimal decision making to determine which clients to dynamically prioritize in a video streaming setting. We formulate the policy design question as a constrained Markov decision problem (CMDP) and observe that, by using a Lagrangian relaxation, we can decompose it into single-client problems. Further, the optimal policy takes a threshold form in the video buffer length, which enables us to design an efficient constrained reinforcement learning (CRL) algorithm to learn it. Specifically, we show that a natural policy gradient (NPG) based algorithm derived using the structure of our problem converges to the globally optimal policy. We then develop a simulation environment for training, and a real-world intelligent controller attached to a WiFi access point for evaluation. We empirically show that the structured learning approach enables fast learning. Furthermore, such a structured policy can be easily deployed due to its low computational complexity, with policy execution taking only about 15 µs. Using YouTube streaming experiments in a resource-constrained scenario, we demonstrate that the CRL approach can increase quality of experience (QoE) by over 30%.
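The two structural ingredients highlighted above, a per-client threshold policy on buffer length and a Lagrangian relaxation of the shared resource constraint, can be sketched with toy dynamics as follows. All parameters and update rules here are invented for illustration; this is not the paper's NPG-based algorithm, only the primal-dual shape of it.

```python
# Toy primal-dual sketch: threshold prioritization plus a Lagrange multiplier
# that prices the shared wireless resource constraint.
import numpy as np

rng = np.random.default_rng(1)
n_clients, horizon = 4, 200
threshold = np.full(n_clients, 5.0)   # buffer level (s) below which a client is prioritized
lam, budget, lr = 0.0, 2.0, 0.05      # dual variable, avg # prioritized clients allowed, step size

buffers = rng.uniform(2, 10, size=n_clients)
for t in range(horizon):
    # threshold policy: prioritize clients whose buffer is running low
    prioritized = buffers < threshold
    # toy buffer dynamics: prioritized clients download faster, everyone plays out video
    buffers = (buffers + np.where(prioritized, 1.5, 0.5) - 1.0).clip(0, 20)
    # dual (Lagrangian) update: raise the price when the resource constraint is violated
    lam = max(0.0, lam + lr * (prioritized.sum() - budget))
    # a real primal step would update policy parameters against reward - lam * cost;
    # here we simply shrink the thresholds when the resource price is high (illustrative only)
    threshold = np.clip(5.0 - 0.5 * lam, 1.0, 8.0) * np.ones(n_clients)

print("final price lam =", round(lam, 3), "| thresholds =", threshold.round(2))
```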


N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics

arXiv.org Artificial Intelligence

We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination. This method involves refining model outputs through an ensemble of critics and the model's own feedback. Drawing inspiration from human behavior, we explore whether LLMs can emulate the self-correction process observed in humans who often engage in self-reflection and seek input from others to refine their understanding of complex topics. Our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. We consistently observe performance improvements in LLMs for reducing toxicity and correcting factual errors.
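The refinement loop described above can be written schematically: generate a draft, collect feedback from an ensemble of critics, and revise until the critics are satisfied. The function names, prompts, and stopping rule below are hypothetical stand-ins, not the paper's implementation; the toy generator and critic exist only so the sketch runs without an LLM backend.

```python
# Schematic ensemble-of-critics self-refinement loop (hypothetical names and prompts).
from typing import Callable, List

def refine_with_critics(prompt: str,
                        generate: Callable[[str], str],
                        critics: List[Callable[[str, str], str]],
                        max_rounds: int = 3) -> str:
    """Iteratively revise an answer using feedback from an ensemble of critics."""
    answer = generate(prompt)
    for _ in range(max_rounds):
        feedback = [critic(prompt, answer) for critic in critics]
        if all(f.strip().lower() == "ok" for f in feedback):
            break  # every critic is satisfied with the current answer
        revision_prompt = (
            f"Question: {prompt}\nDraft answer: {answer}\n"
            "Critic feedback:\n" + "\n".join(f"- {f}" for f in feedback) +
            "\nRewrite the answer addressing the feedback."
        )
        answer = generate(revision_prompt)
    return answer

# toy stand-ins so the sketch is runnable without any model backend
toy_generate = lambda p: "the earth is flat" if "Draft" not in p else "the earth is round"
toy_critic = lambda p, a: "ok" if "round" in a else "factual error: the earth is not flat"
print(refine_with_critics("Describe the shape of the earth.", toy_generate, [toy_critic]))
```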


Federated Ensemble-Directed Offline Reinforcement Learning

arXiv.org Artificial Intelligence

Federated learning is an approach wherein clients learn collaboratively by sharing their locally trained models (not their data) with a federating agent, which periodically combines their models and returns the federated model to the clients for further refinement (Kairouz et al., 2021; Wang et al., 2021). Federated learning has seen much success in supervised learning applications due to its ability to generate well-trained models using small amounts of data at each client while preserving privacy and reducing the usage of communication resources. Recently, there has been growing interest in employing federated learning for online RL problems where each client collects data online by following its own Markovian trajectory while simultaneously updating the model parameters (Khodadadian et al., 2022; Nadiger et al., 2019; Qi et al., 2021). However, such an online learning approach requires sequential interactions with the environment or a simulator, which may not be feasible in many real-world applications. Instead, each client may have pre-collected operational data generated according to a client-specific behavior policy. The federated offline reinforcement learning problem is to learn the optimal policy using these heterogeneous offline data sets distributed across the clients and collected by different unknown behavior policies, without sharing the data explicitly. The framework of offline reinforcement learning (Levine et al., 2020) offers a way to learn the policy using only the offline data collected according to a behavior policy, without any direct interactions with the environment. However, naively combining an off-the-shelf offline RL algorithm such as TD3-BC (Fujimoto & Gu, 2021) with an off-the-shelf federated supervised learning approach such as FedAvg (McMahan et al., 2017) leads to a poorly performing policy, as we show later (see Figures 1-3). Federated offline RL is significantly more challenging than both its supervised learning counterpart and centralized offline RL, for several reasons.
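For concreteness, the naive baseline mentioned above can be sketched: each client runs an offline-RL update on its local data (represented here by a stand-in parameter vector), and the federating agent performs FedAvg-style weighted parameter averaging. The paper argues that this combination performs poorly when behavior policies are heterogeneous; the sketch, with made-up local training and data, only illustrates the federation step itself.

```python
# Naive baseline sketch: local offline training + FedAvg-style parameter averaging.
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters, in the style of McMahan et al. (2017)."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coeffs, client_weights))

def federated_round(global_w, local_train, client_data):
    """One round: broadcast the global model, train locally on offline data, aggregate."""
    updates, sizes = [], []
    for data in client_data:
        updates.append(local_train(global_w.copy(), data))  # stand-in for an offline RL step
        sizes.append(len(data))
    return fedavg(updates, sizes)

# toy local "training": pull parameters toward the mean of the client's offline data
toy_train = lambda w, data: w + 0.1 * (np.mean(data, axis=0) - w)
rng = np.random.default_rng(2)
clients = [rng.normal(loc=mu, size=(50, 4)) for mu in (0.0, 1.0, 3.0)]  # heterogeneous data
w = np.zeros(4)
for _ in range(10):
    w = federated_round(w, toy_train, clients)
print("federated parameters:", w.round(3))
```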


Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration

arXiv.org Artificial Intelligence

A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine-grained feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits offline demonstration data generated by a sub-optimal behavior policy for faster and more efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step that uses the offline demonstration data. The key idea is that by obtaining guidance from, rather than imitating, the offline data, LOGO orients its policy along the lines of the sub-optimal policy while still being able to learn beyond it and approach optimality. We provide a theoretical analysis of our algorithm and a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state observations. Further, we demonstrate the value of our approach by implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.
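The alternation described above, a reward-driven policy improvement step followed by a guidance step toward the demonstration data, can be sketched in a single-state toy problem. The update rules, decay schedule, and reward model below are invented for illustration and are not the actual LOGO update; the point is only that the guidance weight decays so the learner can eventually surpass the sub-optimal demonstrations.

```python
# Single-state sketch: REINFORCE-style improvement plus a decaying pull toward
# the action distribution observed in offline demonstrations (toy, not LOGO).
import numpy as np

rng = np.random.default_rng(3)
n_actions = 5
true_reward = np.array([0.0, 0.0, 0.2, 0.0, 1.0])   # sparse: only actions 2 and 4 pay off
demo_actions = rng.choice(n_actions, p=[.05, .05, .7, .1, .1], size=200)  # sub-optimal demos
demo_policy = np.bincount(demo_actions, minlength=n_actions) / len(demo_actions)

logits = np.zeros(n_actions)
for step in range(500):
    policy = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(n_actions, p=policy)
    r = true_reward[a] if rng.random() < 0.3 else 0.0   # sparse, noisy reward signal
    # policy improvement step (REINFORCE-style score-function update)
    grad = -policy
    grad[a] += 1.0
    logits += 0.5 * r * grad
    # guidance step: pull the policy toward the demonstration distribution, decaying over time
    beta = max(0.0, 1.0 - step / 300)
    logits += 0.05 * beta * (demo_policy - policy)

print("learned policy:", (np.exp(logits) / np.exp(logits).sum()).round(3))
```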


QFlow: A Reinforcement Learning Approach to High QoE Video Streaming over Wireless Networks

arXiv.org Machine Learning

Wireless Internet access has brought legions of heterogeneous applications all sharing the same resources. However, current wireless edge networks, which cater to worst- or average-case performance, lack the agility to best serve these diverse sessions. Simultaneously, software-reconfigurable infrastructure has become increasingly mainstream, to the point that dynamic per-packet and per-flow decisions are possible at multiple layers of the communications stack. Exploiting such reconfigurability requires the design of a system that can enable a configuration, measure its impact on application performance (Quality of Experience), and adaptively select a new configuration. Effectively, this feedback loop is a Markov Decision Process whose parameters are unknown. The goal of this work is to design, develop, and demonstrate QFlow, which instantiates this feedback loop as an application of reinforcement learning (RL). Our context is that of reconfigurable (priority) queueing, and we use the popular application of video streaming as our use case. We develop both model-free and model-based RL approaches, tailored to the problem of determining which clients should be assigned to which queue at each decision period. Through experimental validation, we show that the RL-based control policies on QFlow are able to schedule the right clients for prioritization in a high-load scenario, outperforming the status quo as well as the best known solutions, with over 25% improvement in QoE and a perfect QoE score of 5 over 85% of the time.
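As a toy illustration of the decision loop described above (which client to place in the priority queue at each decision period), the following sketch uses tabular Q-learning over discretized buffer states with invented dynamics and a stall-based reward. QFlow's actual state space, QoE model, and model-based variant differ; this only shows the model-free control loop in miniature.

```python
# Toy model-free sketch: Q-learning over which client to prioritize each period.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(4)
n_clients, n_levels = 3, 5                    # client buffers discretized into n_levels bins
Q = defaultdict(lambda: np.zeros(n_clients))  # state -> value of prioritizing each client
eps, alpha, gamma = 0.1, 0.2, 0.9

def step(buffers, prioritized):
    """Toy dynamics: the prioritized client refills its buffer faster; the QoE-like
    reward penalizes any client whose buffer empties (a stall)."""
    fill = np.where(np.arange(n_clients) == prioritized, 2.0, 0.5)
    buffers = np.clip(buffers + fill - 1.0, 0, n_levels - 1)
    return buffers, -float(np.sum(buffers == 0))

buffers = np.full(n_clients, 2.0)
state = tuple(buffers.astype(int))
for t in range(2000):
    action = rng.integers(n_clients) if rng.random() < eps else int(np.argmax(Q[state]))
    buffers, reward = step(buffers, action)
    nxt = tuple(buffers.astype(int))
    Q[state][action] += alpha * (reward + gamma * Q[nxt].max() - Q[state][action])
    state = nxt

print("states visited:", len(Q), "| greedy pick in last state:", int(np.argmax(Q[state])))
```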