Collaborating Authors

 Hajiesmaili, Mohammad


Heterogeneous Multi-Agent Bandits with Parsimonious Hints

arXiv.org Artificial Intelligence

We study a hinted heterogeneous multi-agent multi-armed bandits problem (HMA2B), where agents can query low-cost observations (hints) in addition to pulling arms. In this framework, each of the $M$ agents has a unique reward distribution over $K$ arms, and in $T$ rounds, an agent observes the reward of the arm it pulls only if no other agent pulls that arm. The goal is to maximize the total utility by querying the minimum number of hints necessary in place of exploratory arm pulls, achieving time-independent regret. We study HMA2B in both centralized and decentralized setups. Our main centralized algorithm, GP-HCLA, which is an extension of HCLA, uses a central decision-maker for arm-pulling and hint queries, achieving $O(M^4K)$ regret with $O(MK\log T)$ adaptive hints. In decentralized setups, we propose two algorithms, HD-ETC and EBHD-ETC, that allow agents to choose actions independently through collision-based communication and query hints uniformly until stopping, yielding $O(M^3K^2)$ regret with $O(M^3K\log T)$ hints, where the former requires knowledge of the minimum gap and the latter does not. Finally, we establish lower bounds to prove the optimality of our results and verify them through numerical simulations.
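
As a minimal sketch of the hint-then-commit idea (not the actual GP-HCLA procedure, which adapts its hint budget per agent-arm pair), the toy Python snippet below has a central decision-maker spend a fixed, hypothetical hint budget to estimate each agent's means and then commit to a collision-free assignment; all names and constants here are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
M, K, T = 3, 5, 10_000            # agents, arms, horizon (toy values)
mu = rng.uniform(0, 1, (M, K))    # unknown per-agent mean rewards

# Phase 1: query cheap hints (noisy reward observations) instead of pulling
# arms. A fixed hint budget per (agent, arm) pair is used for illustration;
# the paper's algorithm instead adapts its hints, using O(MK log T) of them.
hints_per_pair = 200
est = np.zeros((M, K))
for m in range(M):
    for k in range(K):
        est[m, k] = rng.normal(mu[m, k], 1.0, hints_per_pair).mean()

# Phase 2: commit to a collision-free assignment of agents to arms that
# maximizes estimated total utility (two agents pulling the same arm would
# observe nothing, so the central decision-maker assigns distinct arms).
rows, cols = linear_sum_assignment(-est)          # maximize sum of estimates
expected_utility = sum(mu[m, k] for m, k in zip(rows, cols)) * T
print("assignment:", dict(zip(rows.tolist(), cols.tolist())))
print("expected utility over the horizon:", round(expected_utility, 1))
```

Because the committed assignment is collision-free, no agent's observation is blocked, which is the property the centralized setup exploits.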


Heterogeneous Multi-agent Multi-armed Bandits on Stochastic Block Models

arXiv.org Artificial Intelligence

We study a novel heterogeneous multi-agent multi-armed bandit problem with a cluster structure induced by stochastic block models, which influences not only the graph topology but also the reward heterogeneity. Specifically, agents are distributed on random graphs drawn from stochastic block models - a generalization of the Erdős–Rényi model with heterogeneous edge probabilities: agents are grouped into clusters (known or unknown), and the edge probability for agents within the same cluster differs from that across clusters. In addition, the cluster structure of the stochastic block model also determines the reward heterogeneity: reward distributions of the same arm vary across agents in different clusters but remain consistent within a cluster, and rewards are independent samples from these distributions. This unifies the homogeneous and heterogeneous settings and accommodates varying degrees of heterogeneity. The objective is to minimize system-wide regret across all agents. To address this, we propose a novel algorithm applicable to both known and unknown cluster settings. The algorithm combines an averaging-based consensus approach with a newly introduced information aggregation and weighting technique, resulting in a UCB-type strategy. It accounts for graph randomness, leverages both intra-cluster (homogeneous) and inter-cluster (heterogeneous) information from rewards and graphs, and incorporates cluster detection for unknown cluster settings. We derive optimal instance-dependent regret upper bounds of order $\log{T}$ under sub-Gaussian rewards. Importantly, our regret bounds capture the degree of heterogeneity in the system (an additional layer of complexity), exhibit smaller constants, scale better for large systems, and impose significantly relaxed assumptions on edge probabilities. In contrast, prior works have not accounted for this refined problem complexity, rely on more stringent assumptions, and exhibit limited scalability.
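
The following toy sketch illustrates only the core pooling intuition behind a UCB-type strategy in this setting: agents in the same cluster share reward distributions, so their samples can be aggregated to shrink confidence widths. It assumes known clusters, full information sharing each round, and no graph randomness, all simplifications relative to the paper's actual consensus and weighting scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 4, 2000
clusters = [0, 0, 1, 1]                  # known cluster labels (toy)
mu = np.array([[0.9, 0.5, 0.4, 0.3],     # arm means for cluster 0
               [0.3, 0.4, 0.5, 0.9]])    # arm means for cluster 1

A = len(clusters)
n = np.zeros((A, K))                     # per-agent pull counts
s = np.zeros((A, K))                     # per-agent reward sums

for t in range(1, T + 1):
    for a, c in enumerate(clusters):
        peers = [i for i, ci in enumerate(clusters) if ci == c]
        # Pool statistics across same-cluster agents: rewards are i.i.d.
        # within a cluster, so pooling shrinks the confidence width.
        N = n[peers].sum(axis=0)
        mean = s[peers].sum(axis=0) / np.maximum(N, 1)
        ucb = np.where(N > 0,
                       mean + np.sqrt(2 * np.log(t) / np.maximum(N, 1)),
                       np.inf)
        k = int(np.argmax(ucb))
        n[a, k] += 1
        s[a, k] += rng.normal(mu[c, k], 0.1)

print("per-agent pull counts:\n", n.astype(int))
```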


Towards Environmentally Equitable AI

arXiv.org Artificial Intelligence

Nonetheless, the technological advancement of AI relies on computationally intensive calculations and has thus led to a surge in resource usage and energy consumption. Even putting aside the environmental toll of server manufacturing and supply chains, AI systems can impose a huge environmental cost on the communities and regions where they are deployed, including air/thermal pollution due to fossil fuel-based electricity generation and further stressed water resources due to AI's staggering water footprint [12, 25]. To make AI more environmentally friendly and ensure that its overall impacts on climate change are positive, recent studies have pursued multi-faceted approaches, including efficient training and inference [5], energy-efficient GPU and accelerator designs [19], carbon forecasting [14], carbon-aware task scheduling [1, 21], green cloud infrastructures [2], sustainable AI policies [10, 18], and more. Additionally, data center operators have increasingly adopted carbon-free energy (such as solar and wind power) and climate-conscious cooling systems, lowering their carbon footprint and direct water consumption [8]. Although these initiatives are encouraging, a worrisome outcome -- environmental inequity -- has unfortunately emerged [3]. That is, minimizing the total environmental cost of a globally deployed AI system across multiple regions does not necessarily mean that each region is treated equitably. In fact, the environmental cost of AI is often disproportionately higher in certain disadvantaged regions than in others. Even worse, AI's environmental inequity can be amplified by existing environmental-equity-agnostic resource allocation, load balancing, and scheduling algorithms, and compounded by enduring socioeconomic disparities between regions.


Combinatorial Logistic Bandits

arXiv.org Artificial Intelligence

We introduce a novel framework called combinatorial logistic bandits (CLogB), where in each round, a subset of base arms (called the super arm) is selected, with the outcome of each base arm being binary and its expectation following a logistic parametric model. The feedback is governed by a general arm triggering process. Our study covers CLogB with reward functions satisfying two smoothness conditions, capturing application scenarios such as online content delivery, online learning to rank, and dynamic channel allocation. We first propose a simple yet efficient algorithm, CLogUCB, utilizing a variance-agnostic exploration bonus. Under the 1-norm triggering probability modulated (TPM) smoothness condition, CLogUCB achieves a regret bound of $\tilde{O}(d\sqrt{\kappa KT})$, where $\tilde{O}$ ignores logarithmic factors, $d$ is the dimension of the feature vector, $\kappa$ represents the nonlinearity of the logistic model, and $K$ is the maximum number of base arms a super arm can trigger. This result improves on prior work by a factor of $\tilde{O}(\sqrt{\kappa})$. We then enhance CLogUCB with a variance-adaptive version, VA-CLogUCB, which attains a regret bound of $\tilde{O}(d\sqrt{KT})$ under the same 1-norm TPM condition, improving by another $\tilde{O}(\sqrt{\kappa})$ factor. VA-CLogUCB shows even greater promise under the stronger triggering probability and variance modulated (TPVM) condition, achieving a leading $\tilde{O}(d\sqrt{T})$ regret, thus removing the additional dependency on the action size $K$. Furthermore, we enhance the computational efficiency of VA-CLogUCB by eliminating the nonconvex optimization process when the context feature map is time-invariant while maintaining the tight $\tilde{O}(d\sqrt{T})$ regret. Finally, experiments on synthetic and real-world datasets demonstrate the superior performance of our algorithms compared to benchmark algorithms.
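
As a hedged illustration of a variance-agnostic exploration bonus, the sketch below runs a logistic bandit over single base arms (a degenerate super arm with no triggering process): the selection score is the estimated logistic mean plus an elliptical bonus $\beta \|x\|_{V^{-1}}$. The constant $\beta$ and the toy gradient-descent MLE are assumptions for illustration, not the paper's tuned quantities.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d, n_arms, T = 3, 20, 1000
theta_star = rng.normal(size=d) / np.sqrt(d)      # unknown logistic parameter
X = rng.normal(size=(n_arms, d)) / np.sqrt(d)     # base-arm feature vectors

theta = np.zeros(d)
V = np.eye(d)                                     # regularized design matrix
hist_x, hist_y = [], []

for t in range(T):
    # Variance-agnostic optimism: estimated mean plus an elliptical bonus
    # beta * ||x||_{V^{-1}}. beta is a hand-picked constant here; the paper
    # derives it from a self-normalized concentration bound.
    beta = 1.0
    Vinv = np.linalg.inv(V)
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", X, Vinv, X))
    k = int(np.argmax(sigmoid(X @ theta) + beta * bonus))
    y = float(rng.random() < sigmoid(X[k] @ theta_star))   # binary outcome
    hist_x.append(X[k]); hist_y.append(y)
    V += np.outer(X[k], X[k])
    # Refit the logistic MLE with a few gradient steps (a toy optimizer).
    Xh, yh = np.array(hist_x), np.array(hist_y)
    for _ in range(20):
        theta -= 0.5 * Xh.T @ (sigmoid(Xh @ theta) - yh) / len(yh)

print("parameter estimation error:", float(np.linalg.norm(theta - theta_star)))
```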


Multi-Agent Stochastic Bandits Robust to Adversarial Corruptions

arXiv.org Machine Learning

We study the problem of multi-agent multi-armed bandits with adversarial corruption in a heterogeneous setting, where each agent accesses a subset of arms. The adversary can corrupt the reward observations for all agents. Agents share these corrupted rewards with each other, and the objective is to maximize the cumulative total reward of all agents (and not be misled by the adversary). We propose a multi-agent cooperative learning algorithm that is robust to adversarial corruptions. For this newly devised algorithm, we demonstrate that an adversary with an unknown corruption budget $C$ only incurs an additive $O((L / L_{\min}) C)$ term to the standard regret of the model in non-corruption settings, where $L$ is the total number of agents, and $L_{\min}$ is the minimum number of agents with mutual access to an arm. As a by-product, our algorithm also improves the state-of-the-art regret bounds when reduced to both the single-agent and homogeneous multi-agent scenarios, tightening multiplicative $K$ (the number of arms) and $L$ (the number of agents) factors, respectively.
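
For intuition, the sketch below shows the classic single-agent way to absorb corruption: enlarge each confidence radius by $C/n$ so that a budget-$C$ adversary cannot force the elimination of the best arm. This is a stand-in for, not a reproduction of, the paper's multi-agent algorithm (whose corruption overhead scales as $(L/L_{\min})C$); the toy adversary, noise level, and known budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
K, C = 5, 50                        # arms; known corruption budget (toy)
mu = np.array([0.9, 0.7, 0.5, 0.3, 0.1])

active = list(range(K))
n = np.zeros(K)
s = np.zeros(K)
budget_left = C
t = 0
while len(active) > 1 and t < 50_000:
    for k in active:                 # pull each surviving arm once per phase
        t += 1
        r = rng.normal(mu[k], 0.1)
        if budget_left > 0 and k == 0:   # toy adversary: pushes down the best arm
            r -= 1.0
            budget_left -= 1
        n[k] += 1
        s[k] += r
    means = s[active] / n[active]
    # Widen the confidence radius by C / n to absorb up to C total corruption;
    # this is the classic single-agent enlargement trick, not the paper's
    # multi-agent scheme that leverages shared observations across agents.
    rad = np.sqrt(2 * np.log(max(t, 2)) / n[active]) + C / n[active]
    lcb_best = (means - rad).max()
    active = [k for k, m, w in zip(active, means, rad) if m + w >= lcb_best]

print("surviving arms:", active)
```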


LACS: Learning-Augmented Algorithms for Carbon-Aware Resource Scaling with Uncertain Demand

arXiv.org Artificial Intelligence

Motivated by an imperative to reduce the carbon emissions of cloud data centers, this paper studies the online carbon-aware resource scaling problem with unknown job lengths (OCSU) and applies it to the carbon-aware execution of computing workloads. The task is to dynamically scale the resources (e.g., the number of servers) assigned to a job of unknown length such that it is completed before a deadline, with the objective of reducing the carbon emissions of executing the workload. The total carbon emissions of executing a job originate from the emissions of running the job and the excess carbon emitted while switching between different scales (e.g., due to checkpoint and resume). Prior work on carbon-aware resource scaling has assumed accurate job length information, while other approaches have ignored switching losses and require carbon intensity forecasts. These assumptions prohibit the practical deployment of prior work for the online carbon-aware execution of scalable computing workloads. We propose LACS, a theoretically robust learning-augmented algorithm that solves OCSU. To achieve improved practical average-case performance, LACS integrates machine-learned predictions of job length. To achieve solid theoretical performance, LACS extends recent theoretical advances on online conversion with switching costs to handle the scenario in which the job length is unknown. Our experimental evaluations demonstrate that, on average, the carbon footprint of LACS lies within 1.2% of the online baseline that assumes perfect job length information and within 16% of the offline baseline that, in addition to the job length, also requires accurate carbon intensity forecasts. Furthermore, LACS achieves a 32% reduction in carbon footprint compared to the deadline-aware carbon-agnostic execution of the job.
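
The toy sketch below conveys the prediction-plus-robustness structure in a stripped-down form: follow the ML-predicted job length when choosing low-carbon slots, but force execution once the slack against a known upper bound on the job length runs out. Switching (checkpoint/resume) emissions and multi-server scaling are omitted, and J_max is an assumed bound, so this illustrates the principle rather than LACS itself.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 48                     # deadline, in time slots
J = 30                     # true job length in slot-units (unknown online)
J_hat = 24                 # ML-predicted job length (possibly wrong)
J_max = 40                 # assumed known upper bound on the job length
carbon = rng.uniform(0.2, 1.0, D)   # carbon intensity, revealed slot by slot

done, emissions = 0, 0.0
for t in range(D):
    if done >= J:
        break
    # Trust the prediction: run only in slots whose intensity falls below
    # the running quantile that would finish J_hat slot-units by the deadline.
    thresh = np.quantile(carbon[: t + 1], min(1.0, J_hat / D))
    run = carbon[t] <= thresh
    # Stay robust: once the slack against the *upper bound* J_max is gone,
    # run unconditionally so any job no longer than J_max still finishes.
    if J_max - done >= D - t:
        run = True
    if run:
        done += 1
        emissions += carbon[t]

print(f"finished on time: {done >= J}, emissions: {emissions:.2f}")
```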


Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond

arXiv.org Artificial Intelligence

We introduce a novel framework of combinatorial multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT), where the outcome of each arm is a $d$-dimensional multivariant random variable and the feedback follows a general arm triggering process. Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties of multivariant random variables. For CMAB-MT, we propose a general 1-norm multivariant and triggering probability-modulated smoothness condition, and an optimistic CUCB-MT algorithm built upon this condition. Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution, all of which meet the above smoothness condition and achieve matching or improved regret bounds compared to existing works. Through our new framework, we build the first connection between episodic RL and the CMAB literature, offering a new angle for solving episodic RL through the lens of CMAB, which may encourage more interactions between these two important directions.
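
The snippet below is a minimal CUCB-style loop for the simplest special case (scalar Bernoulli outcomes, a deterministic top-$m$ oracle, no probabilistic triggering); CUCB-MT generalizes this pattern to $d$-dimensional outcomes and triggered feedback. The confidence-width constant is a toy choice.

```python
import numpy as np

rng = np.random.default_rng(5)
K, m, T = 8, 3, 5000               # base arms, super-arm size, horizon
mu = rng.uniform(0.1, 0.9, K)      # unknown Bernoulli means

n = np.zeros(K)
s = np.zeros(K)
for t in range(1, T + 1):
    # Optimistic estimates: unpulled arms get an infinite index so each
    # base arm is explored at least once.
    with np.errstate(divide="ignore", invalid="ignore"):
        ucb = np.where(n > 0, s / n + np.sqrt(1.5 * np.log(t) / n), np.inf)
    S = np.argsort(-ucb)[:m]        # offline oracle: here, just the top-m arms
    for k in S:                     # semi-bandit feedback on each played arm
        n[k] += 1
        s[k] += rng.binomial(1, mu[k])

print("true top-m arms:", np.argsort(-mu)[:m])
print("most-pulled arms:", np.argsort(-n)[:m])
```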


Chasing Convex Functions with Long-term Constraints

arXiv.org Artificial Intelligence

This paper introduces and studies a novel class of online metric problems with long-term demand constraints motivated by emerging applications in the design of sustainable systems. In convex function chasing with a long-term constraint, an online player aims to satisfy a demand by making decisions in a normed vector space, paying a hitting cost based on time-varying convex cost functions which are revealed online, and a switching cost defined by the norm. The player is constrained to ensure that the entire demand is satisfied at or before the time horizon ends, and their objective is to minimize their total cost. The generality of this problem makes it applicable to a wide variety of online resource allocation problems; in this paper, we consider one such special case, discussing its connections to other online settings and suggesting broad new areas of inquiry in online optimization with long-term constraints. Our motivation to introduce these problems is rooted in an emerging class of carbon-aware control problems for sustainable systems. A shared objective involves minimizing carbon emissions by shifting flexible workloads temporally and/or spatially to better leverage low-carbon electricity generation (e.g., renewables such as solar and wind).
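
A minimal sketch of the structure, under strong simplifying assumptions (scalar decisions in $[0,1]$, linear hitting costs, and a hand-picked acceptance threshold and switching-cost weight standing in for the paper's decision rule): buy when the revealed cost is cheap, and force purchases once the remaining slots barely cover the remaining demand, so the long-term constraint is always met.

```python
import numpy as np

rng = np.random.default_rng(6)
T, demand = 24, 6.0
prices = rng.uniform(0.2, 1.0, T)    # slopes of linear hitting costs, online

x_prev, bought, cost = 0.0, 0.0, 0.0
for t in range(T):
    remaining = demand - bought
    # Toy acceptance rule: a threshold that loosens as the deadline nears.
    threshold = 0.2 + 0.8 * (t + 1) / T
    x = min(1.0, remaining) if prices[t] <= threshold else 0.0
    # Feasibility: with T - t slots left and a per-slot cap of 1, force
    # purchases so the entire demand is satisfied by the horizon's end.
    if remaining >= T - t:
        x = min(1.0, remaining)
    cost += prices[t] * x + 0.5 * abs(x - x_prev)   # hitting + switching cost
    bought += x
    x_prev = x

print(f"demand met: {bought >= demand - 1e-9}, total cost: {cost:.3f}")
```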


Robust Learning-Augmented Dictionaries

arXiv.org Artificial Intelligence

We present the first learning-augmented data structure for implementing dictionaries with optimal consistency and robustness. Our data structure, named RobustSL, is a skip list augmented by predictions of access frequencies of elements in a data sequence. With proper predictions, RobustSL has optimal consistency (achieves static optimality). At the same time, it maintains a logarithmic running time for each operation, ensuring optimal robustness, even if predictions are generated adversarially. Therefore, RobustSL has all the advantages of the recent learning-augmented data structures of Lin, Luo, and Woodruff (ICML 2022) and Cao et al. (arXiv 2023), while providing robustness guarantees that are absent in the previous work. Numerical experiments show that RobustSL outperforms alternative data structures using both synthetic and real datasets.
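
To convey the flavor of combining predictions with randomization in a skip list, the sketch below assigns tower heights from predicted access frequencies while mixing in a capped random height. The height rule here is an assumption for illustration; RobustSL's actual construction, which attains both static optimality and worst-case logarithmic operations, differs in its details.

```python
import math
import random

random.seed(7)

def random_height(cap):
    # Classic randomized skip-list height: promote with probability 1/2.
    h = 1
    while random.random() < 0.5 and h < cap:
        h += 1
    return h

def augmented_height(freq_pred, n):
    top = math.ceil(math.log2(n)) + 1
    # Prediction-guided rule: more frequent elements get taller towers, so
    # searches reach them after fewer level descents; with good predictions
    # this yields static-optimality-style behavior.
    guided = max(1, top - math.ceil(math.log2(1.0 / max(freq_pred, 1.0 / n ** 2))))
    # Mixing in a capped random height is one way to retain skip-list-like
    # randomness when predictions are unreliable (an illustrative choice,
    # not RobustSL's exact mechanism).
    return max(guided, random_height(top))

n = 1024
for f in [0.5, 0.05, 0.001, 1e-9]:    # hypothetical predicted access frequencies
    print(f"predicted frequency {f:g}: tower height {augmented_height(f, n)}")
```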


Online Conversion with Switching Costs: Robust and Learning-Augmented Algorithms

arXiv.org Artificial Intelligence

This paper introduces and studies online conversion with switching costs (OCS), a novel class of online problems motivated by emerging control problems in the design of sustainable systems. We consider both minimization (OCS-min) and maximization (OCS-max) variants of the problem. In OCS-min, an online player aims to purchase one item over a sequence of time-varying cost functions and decides the fractional amount of item to purchase in each round. The player must purchase the entire item before a deadline, and they incur a movement cost whenever their decision changes, i.e., whenever they purchase different amounts of the item in consecutive time steps. From the player's perspective, the goal is to minimize their total cost, including the total purchasing cost and any movement cost incurred over the time horizon. In OCS-max, the setting is almost the same, except the player sells an item fractionally according to time-varying price functions, so the goal is to maximize their total profit, and any movement costs are subtracted from the revenue. In both settings, the cost/price functions are revealed one by one in an online manner, and the player makes an irrevocable decision at each time step without the knowledge of future cost/price functions. Our motivation behind introducing OCS is an emerging class of carbon-aware problems such as carbon-aware electric vehicle (EV) charging [12] and carbon-aware compute shifting [1, 3, 22, 23, 46, 57], which have attracted significant attention in recent years.
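
As a sketch of the OCS-min structure under toy assumptions (known price bounds $[L, U]$, linear cost functions, a hand-picked movement-cost weight, and a classic one-way-search threshold schedule in place of the paper's pseudo-cost rule):

```python
import math
import numpy as np

rng = np.random.default_rng(8)
L, U, T = 0.2, 1.0, 24            # assumed price bounds and horizon (toy)
prices = rng.uniform(L, U, T)     # cost functions revealed online (linear here)

w, x_prev, cost = 0.0, 0.0, 0.0
for t, c in enumerate(prices):
    # Threshold schedule from classic online search: commit to holding a
    # fraction g(c) of the item once a price as low as c has appeared, so
    # lower prices unlock larger fractions.
    target = min(1.0, math.log(U / c) / math.log(U / L))
    x = max(0.0, target - w)
    if t == T - 1:                # compulsory trade: finish before the deadline
        x = 1.0 - w
    cost += c * x + 0.5 * abs(x - x_prev)   # purchase cost + movement cost
    w += x
    x_prev = x

print(f"fraction purchased: {w:.2f}, total cost: {cost:.3f}")
```

The compulsory trade in the final round mirrors the deadline constraint; the paper's algorithms additionally shape their decisions to hedge against the movement costs incurred when consecutive purchase amounts differ.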