Goto

Collaborating Authors

 composite action





Contextual semibandits via supervised learning oracles † ‡ Miroslav Dudík ‡

Neural Information Processing Systems

We study an online decision making problem where on each round a learner chooses a list of items based on some side information, receives a scalar feedback value for each individual item, and a reward that is linearly related to this feedback. These problems, known as contextual semibandits, arise in crowdsourcing, recommendation, and many other domains. This paper reduces contextual semibandits to supervised learning, allowing us to leverage powerful supervised learning methods in this partial-feedback setting. Our first reduction applies when the mapping from feedback to reward is known and leads to a computationally efficient algorithm with near-optimal regret. We show that this algorithm outperforms state-of-the-art approaches on real-world learning-to-rank datasets, demonstrating the advantage of oracle-based algorithms. Our second reduction applies to the previously unstudied setting when the linear mapping from feedback to reward is unknown. Our regret guarantees are superior to prior techniques that ignore the feedback.


Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior

Block, Adam, Jadbabaie, Ali, Pfrommer, Daniel, Simchowitz, Max, Tedrake, Russ

arXiv.org Machine Learning

Training dynamic agents from datasets of expert examples, known as imitation learning, promises to take advantage of the plentiful demonstrations available in the modern data environment, in an analogous manner to the recent successes of language models conducting unsupervised learning on enormous corpora of text [68, 71]. Imitation learning is especially exciting in robotics, where mass stores of pre-recorded demonstrations on Youtube [1] or cheaply collected simulated trajectories [43, 20] can be converted into learned robotic policies. For imitation learning to be a viable path toward generalist robotic behavior, it needs to be able to both represent and execute the complex behaviors exhibited in the demonstrated data. An approach that has shown tremendous promise is generative behavior cloning: fitting generative models, such as diffusion models [2, 19, 34], to expert demonstrations with pure supervised learning. In this paper, we ask: Under what conditions can generative behavior cloning imitate arbitrarily complex expert behavior? In this paper, we are interested in how algorithmic choices interface with the dynamics of the agent's environment to render imitation possible. The key challenge separating imitation learning from vanilla supervised learning is one of compounding error: when the learner executes the trained behavior in its environment, small mistakes can accumulate into larger ones; this in turn may bring the agent to regions of state space not seen during training, leading to larger-still deviations from intended trajectories.


Contextual semibandits via supervised learning oracles

Krishnamurthy, Akshay, Agarwal, Alekh, Dudik, Miro

Neural Information Processing Systems

We study an online decision making problem where on each round a learner chooses a list of items based on some side information, receives a scalar feedback value for each individual item, and a reward that is linearly related to this feedback. These problems, known as contextual semibandits, arise in crowdsourcing, recommendation, and many other domains. This paper reduces contextual semibandits to supervised learning, allowing us to leverage powerful supervised learning methods in this partial-feedback setting. Our first reduction applies when the mapping from feedback to reward is known and leads to a computationally efficient algorithm with near-optimal regret. We show that this algorithm outperforms state-of-the-art approaches on real-world learning-to-rank datasets, demonstrating the advantage of oracle-based algorithms. Our second reduction applies to the previously unstudied setting when the linear mapping from feedback to reward is unknown. Our regret guarantees are superior to prior techniques that ignore the feedback.


Contextual Semibandits via Supervised Learning Oracles

Krishnamurthy, Akshay, Agarwal, Alekh, Dudik, Miroslav

arXiv.org Machine Learning

Decision making with partial feedback, motivated by applications including personalized medicine [22] and content recommendation [17], is receiving increasing attention from the machine learning community. These problems are formally modeled as learning from bandit feedback, where a learner repeatedly takes an action and observes a reward for the action, with the goal of maximizing reward. While bandit learning captures many problems of interest, several applications have additional structure: the action is combinatorial in nature and more detailed feedback is provided. For example, in internet applications, we often recommend sets of items and record information about the user's interaction with each individual item (e.g., click). This additional feedback is unhelpful unless it relates to the overall reward (e.g., number of clicks), and, as in previous work, we assume a linear relationship. This interaction is known as the semibandit feedback model. Typical bandit and semibandit algorithms achieve reward that is competitive with the single best fixed action, i.e., the best medical treatment or the most popular news article for everyone. This is often inadequate for recommendation applications: while the most popular articles may get some clicks, personalizing content to the users is much more effective.


Temporal Composite Actions with Constraints

Doherty, Patrick (Linköping University) | Kvarnström, Jonas (Linköping University) | Szalas, Andrzej (Warzaw University)

AAAI Conferences

Complex mission or task specification languages play a fundamentally important role in human/robotic interaction. In realistic scenarios such as emergency response, specifying temporal, resource and other constraints on a mission is an essential component due to the dynamic and contingent nature of the operational environments. It is also desirable that in addition to having a formal semantics, the language should be sufficiently expressive, pragmatic and abstract. The main goal of this paper is to propose a mission specification language that meets these requirements. It is based on extending both the syntax and semantics of a well-established formalism for reasoning about action and change, Temporal Action Logic (TAL), in order to represent temporal composite actions with constraints. Fixpoints are required to specify loops and recursion in the extended language. The results include a sound and complete proof theory for this extension. To ensure that the composite language constructs are adequately grounded in the pragmatic operation of robotic systems, Task Specification Trees (TSTs) and their mapping to these constructs are proposed. The expressive and pragmatic adequacy of this approach is demonstrated using an emergency response scenario.


Towards Situated, Interactive, Instructable Agents in a Cognitive Architecture

Mohan, Shiwali (University of Michigan) | Laird, John E. (University of Michigan)

AAAI Conferences

This paper discusses the challenge of designing instructable agents that can learn through interaction with a human expert. Learning through instruction is a powerful paradigm for acquiring knowledge because it limits the complexity of the learning task in a variety of ways. To support learning through instruction, the agent must be able to effectively communicate its lack of knowledge to the human, comprehend instructions, and apply them to the ongoing task. Weidentify some problems of concern when designing instructable agents. We propose an agent design that addresses some of these problems. We instantiate this design in the Soar cognitive architecture and analyze its capabilities on a learning task.


Mathematical Structure of Quantum Decision Theory

Yukalov, V. I., Sornette, D.

arXiv.org Artificial Intelligence

One of the most complex systems is the human brain whose formalized functioning is characterized by decision theory. We present a "Quantum Decision Theory" of decision making, based on the mathematical theory of separable Hilbert spaces. This mathematical structure captures the effect of superposition of composite prospects, including many incorporated intentions, which allows us to explain a variety of interesting fallacies and anomalies that have been reported to particularize the decision making of real human beings. The theory describes entangled decision making, non-commutativity of subsequent decisions, and intention interference of composite prospects. We demonstrate how the violation of the Savage's sure-thing principle (disjunction effect) can be explained as a result of the interference of intentions, when making decisions under uncertainty. The conjunction fallacy is also explained by the presence of the interference terms. We demonstrate that all known anomalies and paradoxes, documented in the context of classical decision theory, are reducible to just a few mathematical archetypes, all of which finding straightforward explanations in the frame of the developed quantum approach.