Avadhanula, Vashist
Fully Dynamic Online Selection through Online Contention Resolution Schemes
Avadhanula, Vashist, Celli, Andrea, Colini-Baldeschi, Riccardo, Leonardi, Stefano, Russo, Matteo
We study fully dynamic online selection problems in an adversarial/stochastic setting that includes Bayesian online selection, prophet inequalities, posted price mechanisms, and stochastic probing problems subject to combinatorial constraints. In the classical ``incremental'' version of the problem, selected elements remain active until the end of the input sequence. In the fully dynamic version of the problem, by contrast, elements stay active for a limited time interval and then leave. This models, for example, the online matching of tasks to workers with task/worker-dependent working times, and the sequential posted pricing of perishable goods. A successful approach to online selection problems in the adversarial setting is given by the notion of Online Contention Resolution Scheme (OCRS), which uses a priori information to formulate a linear relaxation of the underlying optimization problem, whose optimal fractional solution is rounded online for any adversarial order of the input sequence. Our main contribution is a general method for constructing an OCRS for fully dynamic online selection problems. We then show how to employ such an OCRS to construct no-regret algorithms in a partial information model with semi-bandit feedback and adversarial inputs.
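To make the OCRS idea concrete, here is a minimal Python sketch of the classical 1/2-selectable scheme for the simplest (single-item, incremental) constraint, not the fully dynamic construction of the paper; the function name and interface are illustrative assumptions.

```python
import random

def single_item_ocrs(arrivals, x):
    """1/2-selectable OCRS for a single-item (rank-1 matroid) constraint.

    arrivals: (element, is_active) pairs in adversarial order, where
              element i is active independently with probability x[i].
    x:        fractional LP solution with sum of x values <= 1.
    Each element ends up selected with probability exactly x[i] / 2.
    """
    mass_before = 0.0                          # sum of x[j] over earlier arrivals
    for i, is_active in arrivals:
        if is_active:
            p_free = 1.0 - 0.5 * mass_before   # exact Pr[nothing selected yet]
            if random.random() < 1.0 / (2.0 * p_free):
                return i                       # accept and stop
        mass_before += x[i]
    return None                                # nothing selected
```

Since the total fractional mass is at most 1, the acceptance probability 1/(2 * p_free) never exceeds 1, and every element is selected with probability exactly x[i]/2 regardless of the arrival order, which is the order-oblivious guarantee an OCRS certifies.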
Bandits for Online Calibration: An Application to Content Moderation on Social Media Platforms
Avadhanula, Vashist, Baki, Omar Abdul, Bastani, Hamsa, Bastani, Osbert, Gocmen, Caner, Haimovich, Daniel, Hwang, Darren, Karamshuk, Dima, Leeper, Thomas, Ma, Jiayuan, Macnamara, Gregory, Mullett, Jake, Palow, Christopher, Park, Sung, Rajagopal, Varun S, Schaeffer, Kevin, Shah, Parikshit, Sinha, Deeksha, Stier-Moses, Nicolas, Xu, Peng
We describe the content moderation strategy currently employed by Meta to remove policy-violating content from its platforms. Meta relies on both handcrafted and learned risk models to flag potentially violating content for human review. Our approach aggregates these risk models into a single ranking score, calibrating them to prioritize more reliable risk models. A key challenge is that violation trends change over time, affecting which risk models are most reliable; we use a contextual bandit to update the calibration in response to such trends. Our system additionally handles production challenges such as the introduction of novel risk models and changes to existing ones. Our approach increases Meta's top-line metric for measuring the effectiveness of its content moderation strategy by 13%.
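As a rough illustration of the bandit feedback loop (the paper uses a contextual bandit; this non-contextual epsilon-greedy sketch is not Meta's production system, and every name in it is hypothetical), each arm is a candidate weighting for aggregating risk-model scores, and the reward is whether content prioritized under that weighting is confirmed violating by human review.

```python
import numpy as np

rng = np.random.default_rng(0)

class CalibrationBandit:
    """Epsilon-greedy bandit over candidate calibration weightings
    (illustrative sketch: arm = weight vector for aggregating
    risk-model scores into one ranking score)."""

    def __init__(self, arms, eps=0.1):
        self.arms = arms                        # list of weight vectors
        self.eps = eps
        self.counts = np.zeros(len(arms))
        self.values = np.zeros(len(arms))       # running mean reward per arm

    def select(self):
        if rng.random() < self.eps:             # explore a random weighting
            return int(rng.integers(len(self.arms)))
        return int(np.argmax(self.values))      # exploit the best so far

    def score(self, arm, risk_scores):
        """Aggregate one item's risk-model scores under the chosen arm."""
        return float(np.dot(self.arms[arm], risk_scores))

    def update(self, arm, reward):
        """reward = 1 if the prioritized content was truly violating."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```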
Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations
Garcelon, Evrard, Avadhanula, Vashist, Lazaric, Alessandro, Pirotta, Matteo
Consider an idealized content reviewing task in a large social media firm, where the objective is to identify harmful content that violates the platform's community standards. Given the large volume of content generated on a daily basis, it may not be possible to ask human reviewers to provide a thorough assessment of each piece of content. For this reason, the platform may automatically assign a badness score to each piece of content depending on its estimated level of severity. For example, a hate-speech-related post may be assigned a higher badness score than a clickbait post. Content with higher badness scores may then be prioritized for human review, which eventually leads to what we can consider a "ground-truth" evaluation of the severity of the content. The more accurate the badness score is in predicting the actual severity, the higher the chance that harmful content is passed for human review and properly identified. In practice, the badness score may be obtained by aggregating predictions returned by different automatic systems (e.g., rule-based and ML-based systems). For instance, the platform could rely on NLP-based classifiers for hostile speech detection, or CV-based classifiers for graphic images. As such, it is crucial to properly calibrate the predictions returned by each of these classifiers, to ensure that the scores can be compared meaningfully, and to return an aggregate, reliable badness score that correctly prioritizes the most harmful content for human review.
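A minimal sketch of the aggregation-and-prioritization step described above, assuming per-classifier predictions have already been calibrated to a common scale; all names are illustrative.

```python
import numpy as np

def top_k_for_review(predictions, weights, k):
    """Aggregate calibrated per-classifier predictions
    (array of shape n_items x n_classifiers) into one badness
    score per item, then return the indices of the k
    highest-scoring items for human review."""
    badness = predictions @ weights             # weighted aggregate score
    return np.argsort(badness)[::-1][:k]        # most severe first
```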
Improved Optimistic Algorithm For The Multinomial Logit Contextual Bandit
Agrawal, Priyank, Avadhanula, Vashist, Tulabandhula, Theja
We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments of cardinality at most $K$, out of $N$ items, so as to minimize the expected cumulative regret (loss of revenue). The feedback is given by a multinomial logit (MNL) choice model. This sequential decision-making problem is studied under the MNL contextual bandit framework. The existing algorithms for the MNL contextual bandit have frequentist regret guarantees of $\tilde{\mathrm{O}}(\kappa\sqrt{T})$, where $\kappa$ is an instance-dependent constant. Since $\kappa$ could be arbitrarily large, e.g., exponentially dependent on the model parameters, the existing regret guarantees can be substantially loose. We propose an optimistic algorithm with a carefully designed exploration bonus term and show that it enjoys $\tilde{\mathrm{O}}(\sqrt{T})$ regret. In our bounds, the $\kappa$ factor affects only the poly-log term and not the leading term of the regret bound.
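To illustrate the optimistic decision rule (a sketch under simplifying assumptions, not the paper's full algorithm): given upper-confidence estimates `v_ucb` of the MNL attraction weights, the learner offers the assortment of size at most $K$ maximizing optimistic expected revenue. Brute-force enumeration is used here purely for clarity.

```python
import itertools

def mnl_expected_revenue(assortment, v, r):
    """Expected revenue of an assortment under MNL with attraction
    weights v and per-item revenues r (no-purchase weight = 1.0)."""
    denom = 1.0 + sum(v[i] for i in assortment)
    return sum(v[i] * r[i] for i in assortment) / denom

def optimistic_assortment(v_ucb, r, K):
    """Offer the assortment of size <= K maximizing expected revenue
    under optimistic (upper-confidence) weights v_ucb."""
    n = len(v_ucb)
    best, best_rev = (), 0.0
    for k in range(1, K + 1):
        for S in itertools.combinations(range(n), k):
            rev = mnl_expected_revenue(S, v_ucb, r)
            if rev > best_rev:
                best, best_rev = S, rev
    return best
```

The paper's contribution lies in how the exploration bonus defining the optimistic weights is constructed from contexts, so that the instance constant $\kappa$ drops out of the leading regret term.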
Multi-armed Bandits with Cost Subsidy
Sinha, Deeksha, Sankararama, Karthik Abinav, Kazerouni, Abbas, Avadhanula, Vashist
In this paper, we consider a novel variant of the multi-armed bandit (MAB) problem, \emph{MAB with cost subsidy}, which models many real-life applications where the learning agent has to pay to select an arm and is concerned with optimizing both cumulative costs and rewards. We present two applications, the \emph{intelligent SMS routing} problem and the \emph{ad audience optimization} problem faced by a number of businesses (especially online platforms), and show how our problem uniquely captures key features of these applications. We show that naive generalizations of existing MAB algorithms, such as Upper Confidence Bound and Thompson Sampling, do not perform well for this problem. We then establish a fundamental lower bound of $\Omega(K^{1/3} T^{2/3})$ on the performance of any online learning algorithm for this problem, where $T$ is the time horizon and $K$ is the number of arms, highlighting its hardness in comparison to the classical MAB problem. We also present a simple variant of \textit{explore-then-commit} and establish near-optimal regret bounds for this algorithm. Lastly, we perform extensive numerical simulations to understand the behavior of a suite of algorithms for various instances and provide a practical guide for employing different algorithms.
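A minimal sketch of an explore-then-commit variant in this spirit, assuming rewards in $[0, 1]$ and a subsidy factor alpha such that any arm within a $(1 - \alpha)$ factor of the best reward is acceptable; the exploration length and all names are illustrative, not the paper's exact specification.

```python
import numpy as np

def etc_cost_subsidy(pull, costs, K, T, alpha=0.1):
    """Explore-then-commit for MAB with cost subsidy (illustrative).

    pull(a):  returns a stochastic reward in [0, 1] for arm a.
    costs[a]: known cost of pulling arm a.
    Assumes T is large enough that T > m * K.
    """
    m = max(1, int((T / K) ** (2 / 3)))        # exploration budget per arm
    means = np.array([np.mean([pull(a) for _ in range(m)])
                      for a in range(K)])      # empirical mean rewards
    # Arms whose empirical reward is "good enough" under the subsidy.
    feasible = [a for a in range(K)
                if means[a] >= (1 - alpha) * means.max()]
    best_arm = min(feasible, key=lambda a: costs[a])   # cheapest feasible arm
    for _ in range(T - m * K):                 # commit for the remaining rounds
        pull(best_arm)
    return best_arm
```

The per-arm exploration budget of order $(T/K)^{2/3}$ is what drives the $T^{2/3}$ scaling that matches the lower bound up to logarithmic factors.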