Krishnamurthy, Akshay
Go for a Walk and Arrive at the Answer: Reasoning Over Paths in Knowledge Bases using Reinforcement Learning
Das, Rajarshi, Dhuliawala, Shehzaad, Zaheer, Manzil, Vilnis, Luke, Durugkar, Ishan, Krishnamurthy, Akshay, Smola, Alex, McCallum, Andrew
Knowledge bases (KB), both automatically and manually constructed, are often incomplete --- many valid facts can be inferred from the KB by synthesizing existing information. A popular approach to KB completion is to infer new relations by combinatory reasoning over the information found along other paths connecting a pair of entities. Given the enormous size of KBs and the exponential number of paths, previous path-based models have considered only the problem of predicting a missing relation given two entities or evaluating the truth of a proposed triple. Additionally, these methods have traditionally used random paths between fixed entity pairs or more recently learned to pick paths between them. We propose a new algorithm, MINERVA, which addresses the much more difficult and practical task of answering questions where the relation is known, but only one of the two entities is given. Since random walks are impractical in a setting with combinatorially many destinations from a start node, we present a neural reinforcement learning approach which learns how to navigate the graph conditioned on the input query to find predictive paths. Empirically, this approach obtains state-of-the-art results on several datasets, significantly outperforming prior methods.
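The navigation setting can be made concrete with a toy sketch: given a query (start entity, relation), an agent walks the KB graph edge by edge and earns reward 1 only if it stops at the correct answer entity. The snippet below uses a hypothetical three-entity graph and a uniform random walker in place of the paper's learned, query-conditioned neural policy; it illustrates only the environment interface, not MINERVA itself.

```python
import random

# Hypothetical toy knowledge base: entity -> list of (relation, target) edges.
KB = {
    "Paris":  [("capital_of", "France"), ("located_in", "Europe")],
    "France": [("in_continent", "Europe"), ("has_capital", "Paris")],
    "Europe": [("contains", "France")],
}

def rollout(start, answer, horizon=3, rng=random):
    """Walk the graph with a uniform random policy; reward 1 iff the walk
    ends at `answer`. MINERVA replaces this random edge choice with a
    learned policy conditioned on the query relation."""
    node = start
    for _ in range(horizon):
        edges = KB.get(node, [])
        if not edges:
            break
        _, node = rng.choice(edges)
    return 1.0 if node == answer else 0.0

random.seed(0)
# Query: ("Paris", "capital_of", ?), answer "France".
success_rate = sum(rollout("Paris", "France") for _ in range(1000)) / 1000
print(round(success_rate, 2))
```

A learned policy would push this success rate toward 1 by preferring edges predictive of the query relation.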
Model-Based Reinforcement Learning in Contextual Decision Processes
Sun, Wen, Jiang, Nan, Krishnamurthy, Akshay, Agarwal, Alekh, Langford, John
We study the sample complexity of model-based reinforcement learning in general contextual decision processes. We design new algorithms for RL with an abstract model class and analyze their statistical properties. Our algorithms have sample complexity governed by a new structural parameter called the witness rank, which we show to be small in several settings of interest, including Factored MDPs and reactive POMDPs. We also show that the witness rank of a problem is never larger than the recently proposed Bellman rank parameter governing the sample complexity of the model-free algorithm OLIVE (Jiang et al., 2017), the only other provably sample efficient algorithm at this level of generality. Focusing on the special case of Factored MDPs, we prove an exponential lower bound for all model-free approaches, including OLIVE, which when combined with our algorithmic results demonstrates exponential separation between model-based and model-free RL in some rich-observation settings.
Contextual bandits with surrogate losses: Margin bounds and efficient algorithms
Foster, Dylan J., Krishnamurthy, Akshay
We introduce a new family of margin-based regret guarantees for adversarial contextual bandit learning. Our results are based on multiclass surrogate losses. Using the ramp loss, we derive a universal margin-based regret bound in terms of the sequential metric entropy for a benchmark class of real-valued regression functions. The new margin bound serves as a complete contextual bandit analogue of the classical margin bound from statistical learning. The result applies to large nonparametric classes, improving on the best known results for Lipschitz contextual bandits (Cesa-Bianchi et al., 2017) and, as a special case, generalizes the dimension-independent Banditron regret bound (Kakade et al., 2008) to arbitrary linear classes with smooth norms. On the algorithmic side, we use the hinge loss to derive an efficient algorithm with a $\sqrt{dT}$-type mistake bound against benchmark policies induced by $d$-dimensional regression functions. This provides the first hinge loss-based solution to the open problem of Abernethy and Rakhlin (2009). With an additional i.i.d. assumption we give a simple oracle-efficient algorithm whose regret matches our generic metric entropy-based bound for sufficiently complex nonparametric classes. Under realizability assumptions our results also yield classical regret bounds.
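For reference, the two surrogate losses underlying these results have simple closed forms for a real-valued margin (in the multiclass setting the margin is the score gap between the chosen label and its best competitor); a minimal sketch:

```python
def ramp_loss(margin):
    """Ramp loss: a bounded, 1-Lipschitz surrogate for the 0-1 loss,
    used here to derive the universal margin-based regret bound."""
    return min(1.0, max(0.0, 1.0 - margin))

def hinge_loss(margin):
    """Hinge loss: the convex surrogate behind the efficient algorithm;
    unbounded below margin 1, which is what makes it optimizable."""
    return max(0.0, 1.0 - margin)

print(ramp_loss(0.5), hinge_loss(0.5))
```

Both upper-bound the 0-1 loss; the ramp loss additionally stays bounded for very negative margins, which matters for the regret analysis.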
Myopic Bayesian Design of Experiments via Posterior Sampling and Probabilistic Programming
Kandasamy, Kirthevasan, Neiswanger, Willie, Zhang, Reed, Krishnamurthy, Akshay, Schneider, Jeff, Poczos, Barnabas
We design a new myopic strategy for a wide class of sequential design of experiment (DOE) problems, where the aim is to collect data in order to fulfil a certain problem-specific goal. Our approach, Myopic Posterior Sampling (MPS), is inspired by the classical posterior (Thompson) sampling algorithm for multi-armed bandits and leverages the flexibility of probabilistic programming and approximate Bayesian inference to address a broad set of problems. Empirically, this general-purpose strategy is competitive with more specialised methods in a wide array of DOE tasks, and more importantly, enables addressing complex DOE goals where no existing method seems applicable. On the theoretical side, we leverage ideas from adaptive submodularity and reinforcement learning to derive conditions under which MPS achieves sublinear regret against natural benchmark policies.
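The classical Thompson sampling loop that MPS generalizes (sample a model from the posterior, act greedily for that sample, update the posterior) can be sketched for Beta-Bernoulli bandits; the probabilistic-programming machinery that extends this loop to general DOE goals is not shown here.

```python
import random

def thompson_sampling(true_means, rounds, seed=0):
    """Beta-Bernoulli Thompson sampling: the myopic sample-then-act-greedily
    loop that Myopic Posterior Sampling generalizes to richer DOE goals."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha, beta = [1] * k, [1] * k          # Beta(1, 1) priors per arm
    pulls = [0] * k
    for _ in range(rounds):
        # Sample one plausible mean per arm from the current posterior...
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)   # ...and act greedily.
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward                            # conjugate update
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
print(pulls)   # the best arm (index 2) receives most of the pulls
```

MPS replaces "arm with the largest sampled mean" by "experiment that is myopically optimal for the sampled model", with the posterior maintained by approximate inference.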
Semiparametric Contextual Bandits
Krishnamurthy, Akshay, Wu, Zhiwei Steven, Syrgkanis, Vasilis
This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for an action is modeled as a linear function of known action features confounded by a non-linear action-independent term. We design new algorithms that achieve $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is $d$-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenewald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches and our proofs require new concentration inequalities for self-normalized martingales.
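The reward model here is $r(a) = \langle \theta, x_a \rangle + f_t$, where the confounder $f_t$ is identical across actions within a round. A minimal illustration of why such a confounder is harmless in principle (this is not the paper's doubly-robust-inspired estimator, which must also handle adaptively chosen actions): differencing the rewards of two actions within a round cancels $f_t$ exactly, so least squares on differences recovers $\theta$.

```python
import random

def simulate_round(theta, rng):
    """Semiparametric reward model: r(a) = <theta, x_a> + f_t, where the
    confounder f_t is non-linear in the round but the same for all actions."""
    xs = [[rng.gauss(0, 1) for _ in theta] for _ in range(2)]  # 2 actions
    f_t = rng.gauss(0, 1) ** 2         # arbitrary action-independent term
    rewards = [sum(t * x for t, x in zip(theta, xa)) + f_t for xa in xs]
    return xs, rewards

rng = random.Random(1)
theta = [1.5, -0.5]
# Within-round differences satisfy r(a1) - r(a2) = <theta, x_a1 - x_a2>,
# with f_t cancelled, so we fit theta by ordinary least squares on them.
XtX = [[0.0, 0.0], [0.0, 0.0]]
Xty = [0.0, 0.0]
for _ in range(5000):
    (x1, x2), (r1, r2) = simulate_round(theta, rng)
    dx = [a - b for a, b in zip(x1, x2)]
    dr = r1 - r2
    for i in range(2):
        Xty[i] += dx[i] * dr
        for j in range(2):
            XtX[i][j] += dx[i] * dx[j]
# Solve the 2x2 normal equations by Cramer's rule.
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
theta_hat = [
    (XtX[1][1] * Xty[0] - XtX[0][1] * Xty[1]) / det,
    (XtX[0][0] * Xty[1] - XtX[1][0] * Xty[0]) / det,
]
print([round(t, 2) for t in theta_hat])
```

In the bandit setting only one action's reward is observed per round, which is why the paper needs a more careful centering construction rather than this two-observation differencing.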
On Polynomial Time PAC Reinforcement Learning with Rich Observations
Dann, Christoph, Jiang, Nan, Krishnamurthy, Akshay, Agarwal, Alekh, Langford, John, Schapire, Robert E.
We study the computational tractability of provably sample-efficient (PAC) reinforcement learning in episodic environments with high-dimensional observations. We present new sample efficient algorithms for environments with deterministic hidden state dynamics but stochastic rich observations. These methods represent computationally efficient alternatives to prior algorithms that rely on enumerating exponentially many functions. We show that the only known statistically efficient algorithm for the more general stochastic transition setting requires NP-hard computation which cannot be implemented via standard optimization primitives. We also present several examples that illustrate fundamental challenges of tractable PAC reinforcement learning in such general settings.
Off-policy evaluation for slate recommendation
Swaminathan, Adith, Krishnamurthy, Akshay, Agarwal, Alekh, Dudík, Miroslav, Langford, John, Jose, Damien, Zitouni, Imed
This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context---a common scenario in web search, ads, and recommendation. We build on techniques from combinatorial bandits to introduce a new practical estimator that uses logged data to estimate a policy's performance. A thorough empirical evaluation on real-world data reveals that our estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance. We derive conditions under which our estimator is unbiased---these conditions are weaker than prior heuristics for slate evaluation---and experimentally demonstrate a smaller bias than parametric approaches, even when these conditions are violated. Finally, our theory and experiments also show exponential savings in the amount of required data compared with general unbiased estimators.
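Under an additive-reward condition of the kind the unbiasedness analysis rests on, and for the special case of a uniform logging policy, a per-slot importance-weighting estimator can be sketched as follows. The reward table, target slate, and the simplified weight are illustrative assumptions of this sketch, not the paper's exact estimator.

```python
import random

rng = random.Random(7)
SLOTS, ITEMS = 3, 5
# Hypothetical per-slot item values; the slate reward is their sum (additive).
phi = [[rng.random() for _ in range(ITEMS)] for _ in range(SLOTS)]

target = [0, 1, 2]                      # deterministic target slate
true_value = sum(phi[j][target[j]] for j in range(SLOTS))

# Logged data: a uniform logging policy fills each slot independently.
n, total = 20000, 0.0
for _ in range(n):
    slate = [rng.randrange(ITEMS) for _ in range(SLOTS)]
    r = sum(phi[j][slate[j]] for j in range(SLOTS))
    # Per-slot weight: each matching slot contributes ITEMS (its inverse
    # logging probability); subtracting (SLOTS - 1) removes the cross-slot
    # double counting, which makes the estimator unbiased under additivity.
    w = sum(ITEMS * (slate[j] == target[j]) for j in range(SLOTS)) - (SLOTS - 1)
    total += w * r
estimate = total / n
print(round(true_value, 3), round(estimate, 3))
```

Whole-slate importance weighting would instead multiply per-slot inverse probabilities, and its variance grows exponentially in the number of slots; the per-slot decomposition is the source of the exponential data savings claimed above.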
Disagreement-based combinatorial pure exploration: Efficient algorithms and an analysis with localization
Cao, Tongyi, Krishnamurthy, Akshay
We design new algorithms for the combinatorial pure exploration problem in the multi-armed bandit framework. In this problem, we are given $K$ distributions and a collection of subsets $\mathcal{V} \subset 2^K$ of these distributions, and we would like to find the subset $v \in \mathcal{V}$ that has the largest cumulative mean, while collecting, in a sequential fashion, as few samples from the distributions as possible. We study both the fixed budget and fixed confidence settings, and our algorithms essentially achieve state-of-the-art performance in all settings, improving on previous guarantees for structures like matchings and submatrices that have large augmenting sets. Moreover, our algorithms can be implemented efficiently whenever the decision set $\mathcal{V}$ admits linear optimization. Our analysis involves precise concentration-of-measure arguments and a new algorithm for linear programming with exponentially many constraints.
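As a point of reference, the simplest fixed-confidence baseline samples every arm uniformly and stops once one subset's lower confidence bound on its total mean dominates every other subset's upper bound. The sketch below (Bernoulli arms and a small, explicitly enumerated collection of subsets, both hypothetical) illustrates the problem statement rather than the paper's more sample-efficient, oracle-based algorithms.

```python
import math
import random

def cpe_fixed_confidence(arms, subsets, delta=0.1, seed=0, max_samples=200000):
    """Naive fixed-confidence combinatorial pure exploration: sample all arms
    uniformly, maintain Hoeffding confidence intervals on each mean, and stop
    when the empirically best subset's LCB beats every rival subset's UCB."""
    rng = random.Random(seed)
    K = len(arms)
    sums = [0.0] * K
    n = 0
    best = subsets[0]
    while n * K < max_samples:
        n += 1
        for i in range(K):
            sums[i] += 1 if rng.random() < arms[i] else 0  # Bernoulli draw
        rad = math.sqrt(math.log(4 * K * n * n / delta) / (2 * n))
        means = [s / n for s in sums]
        best = max(subsets, key=lambda v: sum(means[i] for i in v))
        lcb_best = sum(means[i] - rad for i in best)
        if all(lcb_best >= sum(means[i] + rad for i in v)
               for v in subsets if v != best):
            break
    return best

arms = [0.9, 0.8, 0.2, 0.1]
V = [(0, 1), (2, 3), (0, 3)]
result = cpe_fixed_confidence(arms, V)
print(result)
```

This baseline enumerates $\mathcal{V}$ explicitly and samples all arms equally; the paper's algorithms replace both with a linear-optimization oracle and adaptive sampling, which is what makes large structured families tractable.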
Asynchronous Parallel Bayesian Optimisation via Thompson Sampling
Kandasamy, Kirthevasan, Krishnamurthy, Akshay, Schneider, Jeff, Poczos, Barnabas
We design and analyse variations of the classical Thompson sampling (TS) procedure for Bayesian optimisation (BO) in settings where function evaluations are expensive, but can be performed in parallel. Our theoretical analysis shows that a direct application of the sequential Thompson sampling algorithm in either synchronous or asynchronous parallel settings yields a surprisingly powerful result: making $n$ evaluations distributed among $M$ workers is essentially equivalent to performing $n$ evaluations in sequence. Further, by modeling the time taken to complete a function evaluation, we show that, under a time constraint, asynchronously parallel TS achieves asymptotically lower regret than both the synchronous and sequential versions. These results are complemented by an experimental analysis, showing that asynchronous TS outperforms a suite of existing parallel BO algorithms in simulations and in a hyper-parameter tuning application for convolutional neural networks. In addition, the proposed procedure is conceptually and computationally much simpler than existing work on parallel BO.
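The asynchronous scheme can be sketched with a simulated event queue: whenever a worker finishes, it immediately draws a fresh posterior sample and dispatches its next evaluation without waiting for its peers. The toy below substitutes Beta-Bernoulli arms and exponential evaluation times for the paper's GP posterior over an expensive function; those modelling choices are assumptions of the sketch.

```python
import heapq
import random

def async_thompson(true_means, workers, horizon_time, seed=0):
    """Asynchronous parallel TS (toy Bernoulli version): each worker, the
    moment it finishes, resamples from the posterior and is redispatched,
    with no synchronisation barrier across workers."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha, beta = [1] * k, [1] * k
    # Event queue of (finish_time, worker_id, arm) for in-flight evaluations.
    events = [(rng.expovariate(1.0), w, rng.randrange(k))
              for w in range(workers)]
    heapq.heapify(events)
    done = 0
    while events and events[0][0] <= horizon_time:
        t, w, arm = heapq.heappop(events)
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward                 # posterior update on completion
        beta[arm] += 1 - reward
        done += 1
        # Thompson step: sample plausible means, dispatch the argmax next.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        nxt = max(range(k), key=samples.__getitem__)
        heapq.heappush(events, (t + rng.expovariate(1.0), w, nxt))
    return done

done4 = async_thompson([0.3, 0.7], workers=4, horizon_time=100)
done1 = async_thompson([0.3, 0.7], workers=1, horizon_time=100)
print(done4, done1)   # more evaluations fit in the same wall-clock budget
```

With heterogeneous evaluation times, no worker ever idles at a synchronisation barrier, which is the mechanism behind the lower regret per unit time claimed above.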