AI 'slop' is transforming social media - and a backlash is brewing

BBC News

Théodore remembers the AI slop that tipped him over the edge. The image was of two emaciated, impoverished South Asian children. For some reason, despite their boyish features, they had thick beards. One of them had no hands and only one foot. The other was holding a sign saying it was his birthday and asking for likes.


Grok Is Being Used to Mock and Strip Women in Hijabs and Sarees

WIRED

A substantial number of AI images generated or edited with Grok target women in religious and cultural clothing. Among the vast and growing library of nonconsensual sexualized edits that Grok has generated on request over the past week, many perpetrators have asked xAI's bot to make women put on or take off a hijab, a saree, a nun's habit, or another kind of modest religious or cultural clothing. In a review of 500 Grok images generated between January 6 and January 9, WIRED found that around 5 percent of the output depicted a woman who, as the result of user prompts, had been either stripped of or made to wear religious or cultural clothing. Indian sarees and modest Islamic wear were the most common examples in the output, which also featured Japanese school uniforms, burqas, and early 20th century-style bathing suits with long sleeves. "Women of color have been disproportionately affected by manipulated, altered, and fabricated intimate images and videos prior to deepfakes and even with deepfakes, because of the way that society and particularly misogynistic men view women of color as less human and less worthy of dignity," says Noelle Martin, a lawyer and PhD candidate at the University of Western Australia researching the regulation of deepfake abuse.


Contextual Bilevel Reinforcement Learning for Incentive Alignment

Neural Information Processing Systems

The optimal policy in various real-world strategic decision-making problems depends both on the environmental configuration and exogenous events. For these settings, we introduce Contextual Bilevel Reinforcement Learning (CB-RL), a stochastic bilevel decision-making model, where the lower level consists of solving a contextual Markov Decision Process (CMDP). CB-RL can be viewed as a Stackelberg Game where the leader and a random context beyond the leader's control together determine the setup of many MDPs, to which potentially multiple followers best respond. This framework extends beyond traditional bilevel optimization and finds relevance in diverse fields such as RLHF, tax design, reward shaping, contract theory, and mechanism design. We propose a stochastic Hyper Policy Gradient Descent (HPGD) algorithm to solve CB-RL and demonstrate its convergence. Notably, HPGD uses stochastic hypergradient estimates based on observations of the followers' trajectories. Therefore, it allows followers to use any training procedure and the leader to be agnostic of the specific algorithm, which aligns with various real-world scenarios. We further consider the setting when the leader can influence the training of followers and propose an accelerated algorithm. We empirically demonstrate the performance of our algorithm for reward shaping and tax design.
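
The abstract describes HPGD only at a high level; the toy below illustrates the core idea of a stochastic hypergradient step estimated purely from observed follower behaviour. It collapses the lower-level CMDP to a one-step contextual bandit and lets the leader's parameters enter as an additive reward-shaping term, so the hypergradient reduces to a score-function estimate. All names, the shaping mechanism, and the toy objective are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_contexts, n_actions = 3, 4
theta = np.zeros(n_actions)                    # leader parameters (reward shaping)
follower_logits = rng.normal(size=(n_contexts, n_actions))

def follower_policy(ctx):
    # The follower "best responds" with its own black-box procedure; here it
    # is a fixed softmax whose reward includes the leader's shaping term.
    z = follower_logits[ctx] + theta
    p = np.exp(z - z.max())
    return p / p.sum()

def leader_reward(ctx, a):
    return float(a == ctx % n_actions)         # toy upper-level objective

lr = 0.1
for _ in range(500):
    ctx = rng.integers(n_contexts)             # random context, beyond leader control
    p = follower_policy(ctx)
    a = rng.choice(n_actions, p=p)             # observe one follower trajectory
    grad_logp = -p                             # score-function hypergradient estimate:
    grad_logp[a] += 1.0                        # d log pi(a|ctx) / d theta = e_a - p
    theta += lr * leader_reward(ctx, a) * grad_logp
print("learned shaping:", np.round(theta, 2))
```

Because theta enters the follower's softmax additively, the policy-gradient identity gives an unbiased hypergradient from follower trajectories alone, matching the abstract's point that the leader stays agnostic to the follower's training procedure.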


Optimally Deceiving a Learning Leader in Stackelberg Games

Neural Information Processing Systems

Recent results in the ML community have revealed that learning algorithms used to compute the optimal strategy for the leader to commit to in a Stackelberg game are susceptible to manipulation by the follower. Such a learning algorithm operates by querying the best responses or the payoffs of the follower, who consequently can deceive the algorithm by responding as if their payoffs were different from what they actually are. For this strategic behavior to be successful, the main challenge faced by the follower is to pinpoint the payoffs that would make the learning algorithm compute a commitment such that best responding to it maximizes the follower's utility, according to the true payoffs. While this problem has been considered before, the related literature only focused on the simplified scenario in which the payoff space is finite, thus leaving the general version of the problem unanswered. In this paper, we fill this gap by showing that it is always possible for the follower to efficiently compute (near-)optimal payoffs for various scenarios of learning interaction between the leader and the follower.
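
As a concrete illustration of the deception at issue, the sketch below brute-forces fake follower payoffs in a tiny 2x2 Stackelberg game: the leader's "learning algorithm" computes an optimal commitment against the payoffs it is shown, while the follower keeps best responding per the lie but collects its true payoffs. The payoff matrices and the coarse search grid are invented for illustration; the paper treats general (non-finite) payoff spaces, which this enumeration deliberately sidesteps.

```python
import itertools
import numpy as np

L = np.array([[3.0, 1.0],                      # true leader payoffs; rows = leader
              [0.0, 2.0]])                     # action, cols = follower action
F_true = np.array([[1.0, 0.0],
                   [0.0, 2.0]])                # true follower payoffs

def commitment(F_reported, grid=101):
    # Leader's "learning algorithm": best mixed commitment against the
    # follower payoffs it was shown.
    best_x, best_v = 0.0, -np.inf
    for x in np.linspace(0.0, 1.0, grid):
        p = np.array([x, 1.0 - x])
        br = int(np.argmax(p @ F_reported))    # best response per reported payoffs
        v = p @ L[:, br]
        if v > best_v:
            best_x, best_v = x, v
    return best_x

def follower_value(x, F_reported):
    p = np.array([x, 1.0 - x])
    br = int(np.argmax(p @ F_reported))        # must keep responding per the lie...
    return p @ F_true[:, br]                   # ...but is paid per true payoffs

honest = follower_value(commitment(F_true), F_true)
best_v = honest
for f in itertools.product(np.linspace(0.0, 2.0, 5), repeat=4):
    F_rep = np.array(f).reshape(2, 2)
    best_v = max(best_v, follower_value(commitment(F_rep), F_rep))
print("honest value: %.2f  best deceptive value: %.2f" % (honest, best_v))
```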


Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game

Pásztor, Barna, Buening, Thomas Kleine, Krause, Andreas

arXiv.org Machine Learning

We introduce Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditionally on the Leader's action. This approach decomposes preference optimization into a refinement problem for the Follower and an optimization problem against an adversary for the Leader. Unlike Reinforcement Learning from Human Feedback (RLHF), which assigns scalar rewards to actions, or Nash Learning from Human Feedback (NLHF), which seeks a simultaneous-move equilibrium, SLHF leverages the asymmetry of sequential play to capture richer preference structures. The sequential design of SLHF naturally enables inference-time refinement, as the Follower learns to improve the Leader's actions, and these refinements can be leveraged through iterative sampling. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Experiments on large language models demonstrate that SLHF achieves strong alignment across diverse preference datasets, scales from 0.5B to 8B parameters, and yields inference-time refinements that transfer across model families without further fine-tuning.
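
The inference-time refinement the abstract mentions is easy to sketch: the Leader commits to an answer, and the Follower, conditioned on that answer, is sampled repeatedly to improve it. The skeleton below is a hedged reading of that loop; `generate`, the prompt template, and the toy callable "models" are placeholders, not the paper's templates or training setup.

```python
def generate(model, prompt: str) -> str:
    # Stand-in for an LLM sampling call; swap in a real model client here.
    return model(prompt)

def slhf_refine(leader, follower, query: str, n_rounds: int = 3) -> str:
    action = generate(leader, query)           # Leader commits to an action first
    for _ in range(n_rounds):                  # Follower refines, conditioning on it
        action = generate(
            follower,
            f"{query}\n\nDraft answer:\n{action}\n\nImprove the draft:",
        )
    return action

# Toy usage with trivially callable "models":
leader = lambda p: "initial draft"
follower = lambda p: "refined(" + p.split("Draft answer:\n")[1].split("\n\nImprove")[0] + ")"
print(slhf_refine(leader, follower, "What is the capital of France?"))
# -> refined(refined(refined(initial draft)))
```

Note how the asymmetry does the work: the Follower never answers from scratch, only improves a committed action, which is also why the abstract claims refinements transfer across model families.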


On Dynamic Programming Theory for Leader-Follower Stochastic Games

Dibangoye, Jilles Steeve, Marre, Thibaut Le, Sankur, Ocan, Schwarzentruber, François

arXiv.org Artificial Intelligence

Leader-follower general-sum stochastic games (LF-GSSGs) model sequential decision-making under asymmetric commitment, where a leader commits to a policy and a follower best responds, yielding a strong Stackelberg equilibrium (SSE) with leader-favourable tie-breaking. This paper introduces a dynamic programming (DP) framework that applies Bellman recursion over credible sets (state abstractions formally representing all rational follower best responses under partial leader commitments) to compute SSEs. We first prove that any LF-GSSG admits a lossless reduction to a Markov decision process (MDP) over credible sets. We further establish that synthesising an optimal memoryless deterministic leader policy is NP-hard, motivating the development of ε-optimal DP algorithms with provable guarantees on leader exploitability. Experiments on standard mixed-motive benchmarks, including security games, resource allocation, and adversarial planning, demonstrate empirical gains in leader value and runtime scalability over state-of-the-art methods.
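
For readers unfamiliar with the solution concept, the sketch below computes an SSE by grid search in a one-shot game, making the leader-favourable tie-breaking explicit: among the follower's tied best responses, the one best for the leader is assumed. This is only the equilibrium concept, not the paper's DP over credible sets, and the payoff matrices are invented.

```python
import numpy as np

# Payoffs: rows = leader action, cols = follower action; values are made up.
U_L = np.array([[2.0, 4.0],
                [1.0, 3.0]])
U_F = np.array([[1.0, 1.0],
                [0.0, 2.0]])

best = None
for x in np.linspace(0.0, 1.0, 101):           # leader's mixed commitment
    p = np.array([x, 1.0 - x])
    fvals = p @ U_F                            # follower's expected utilities
    ties = np.flatnonzero(np.isclose(fvals, fvals.max()))
    a_F = max(ties, key=lambda a: p @ U_L[:, a])  # leader-favourable tie-break
    v = p @ U_L[:, a_F]
    if best is None or v > best[0]:
        best = (v, x, a_F)
print("SSE leader value %.2f at commitment x=%.2f, follower plays %d" % best)
```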


Cacheback: Speculative Decoding With Nothing But Cache

Ma, Zhiyao, Gim, In, Zhong, Lin

arXiv.org Artificial Intelligence

We present Cacheback Decoding, a training-free and model-agnostic speculative decoding method that exploits the locality in language to accelerate Large Language Model (LLM) inference. Cacheback leverages only Least Recently Used (LRU) cache tables of token n-grams to generate draft sequences. Cacheback achieves state-of-the-art performance among comparable methods despite its minimalist design, and its simplicity allows easy integration into existing systems. Cacheback also shows potential for fast adaptation to new domains.
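
The mechanism is simple enough to sketch: keep an LRU table mapping recent (n-1)-grams to their observed next token, then greedily extend the current prefix from the table to form a draft that the target model verifies in one forward pass. The sketch below is inferred from the abstract alone; cache sizing, the choice of n, and the verification step are assumptions, not the paper's implementation.

```python
from collections import OrderedDict

class NGramDrafter:
    def __init__(self, n: int = 3, capacity: int = 4096):
        self.n = n
        self.capacity = capacity
        self.cache = OrderedDict()             # (n-1)-gram -> next token, LRU order

    def observe(self, tokens):
        # Record the continuation of every (n-1)-gram seen so far,
        # evicting the least recently used entry when over capacity.
        for i in range(len(tokens) - self.n + 1):
            key = tuple(tokens[i : i + self.n - 1])
            self.cache[key] = tokens[i + self.n - 1]
            self.cache.move_to_end(key)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)

    def draft(self, prefix, k: int = 8):
        # Greedily extend the prefix with cached continuations; the target
        # model then accepts or rejects the drafted tokens in one pass.
        out = list(prefix)
        for _ in range(k):
            key = tuple(out[-(self.n - 1):])
            if key not in self.cache:
                break
            out.append(self.cache[key])
        return out[len(prefix):]

d = NGramDrafter(n=3)
d.observe("the cat sat on the mat".split())
print(d.draft("on the".split()))               # -> ['mat'] from the cached bigram
```

Because the drafter is just a cache, it is training-free and model-agnostic, and refilling the cache with in-domain text is what would give the fast domain adaptation the abstract alludes to.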


Online Dynamic Pricing of Complementary Products

Mussi, Marco, Restelli, Marcello

arXiv.org Artificial Intelligence

Traditional pricing paradigms, once dominated by static models and rule-based heuristics, are increasingly being replaced by dynamic, data-driven approaches powered by machine learning algorithms. Despite their growing sophistication, most dynamic pricing algorithms optimize the price of each product independently, disregarding potential interactions among items. By neglecting these interdependencies in consumer demand across related goods, sellers may fail to capture the full potential of coordinated pricing strategies. In this paper, we address this problem by exploring dynamic pricing mechanisms designed explicitly for complementary products, aiming to exploit their joint demand structure to maximize overall revenue. We present an online learning algorithm that considers both positive and negative interactions between products' demands. The algorithm uses transaction data to identify advantageous complementary relationships among items by solving an integer programming problem, and then optimizes pricing strategies using data-driven, computationally efficient multi-armed bandit solutions based on heteroscedastic Gaussian processes. We validate our solution in a simulated environment and demonstrate that it improves revenue relative to a comparable learning algorithm that ignores such interactions.
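
To make the bandit layer concrete, here is a hedged GP-UCB sketch over joint prices for a single complementary pair. A constant noise level stands in for the paper's heteroscedastic model, the demand function is synthetic, and the complementarity-discovery integer program is omitted entirely.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
grid = np.array([(p1, p2) for p1 in np.linspace(1, 5, 9)
                          for p2 in np.linspace(1, 5, 9)])

def revenue(p):
    # Synthetic complementary demand: each product's demand falls in its own
    # price AND in the partner's price, plus observation noise.
    d1 = max(0.0, 10 - 2 * p[0] - p[1]) + rng.normal(0, 0.5)
    d2 = max(0.0, 10 - 2 * p[1] - p[0]) + rng.normal(0, 0.5)
    return p[0] * d1 + p[1] * d2

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.25)
X, y = [], []
for t in range(60):
    if X:
        gp.fit(np.array(X), np.array(y))
        mu, sd = gp.predict(grid, return_std=True)
        i = int(np.argmax(mu + 2.0 * sd))      # GP-UCB acquisition
    else:
        i = int(rng.integers(len(grid)))       # first pull at random
    X.append(grid[i])
    y.append(revenue(grid[i]))
print("best joint price tried:", grid[int(np.argmax(y))], "revenue %.1f" % max(y))
```

Pricing the pair jointly lets the acquisition rule discover that cutting one price can raise total revenue through the partner's demand, exactly the coordination an independent per-product bandit cannot exploit.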