There, a reward function is drawn from one of multiple possible reward models at the beginning of every episode, but the identity of the chosen reward model is not revealed to the agent. Hence, the latent state space, for which the dynamics are Markovian, is not given to the agent. We study the problem of learning a near-optimal policy for two-reward-mixing MDPs. Unlike existing approaches that rely on strong assumptions on the dynamics, we make no assumptions and study the problem in full generality.
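The per-episode mixing described above can be illustrated with a toy sketch. Everything here is an assumption for illustration: a deterministic 2-state MDP with two hypothetical reward models, where a model is drawn once per episode and its identity is hidden from the agent.

```python
import random

# Hypothetical toy setup: a 2-state, 2-action MDP with two candidate
# reward functions. The dynamics and rewards are illustrative only,
# not taken from the paper.
STATES, ACTIONS = [0, 1], [0, 1]
REWARD_MODELS = [
    {(s, a): float(s == a) for s in STATES for a in ACTIONS},  # model 0
    {(s, a): float(s != a) for s in STATES for a in ACTIONS},  # model 1
]

def step(state, action):
    """Markovian transition shared by both reward contexts."""
    return action  # deterministic toy dynamics

def run_episode(policy, horizon=5, rng=random):
    # A reward model is drawn once at the start of the episode; its
    # identity is hidden, so the agent sees only states and rewards.
    model = rng.choice(REWARD_MODELS)
    state, total = 0, 0.0
    for _ in range(horizon):
        action = policy(state)
        total += model[(state, action)]
        state = step(state, action)
    return total
```

With the policy `lambda s: s`, an episode's return is either 5.0 (model 0 was drawn) or 0.0 (model 1), making the agent's uncertainty about the latent context concrete.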
On Word-of-Mouth and Private-Prior Sequential Social Learning
Da Col, Andrea, Rojas, Cristian R., Krishnamurthy, Vikram
Social learning constitutes a fundamental framework for studying interactions among rational agents who observe each other's actions but lack direct access to individual beliefs. This paper investigates a specific social learning paradigm known as Word-of-Mouth (WoM), where a series of agents seeks to estimate the state of a dynamical system. The first agent receives noisy measurements of the state, while each subsequent agent relies solely on a degraded version of her predecessor's estimate. A defining feature of WoM is that the final agent's belief is publicly broadcast and subsequently adopted by all agents, in place of their own. We analyze this setting theoretically and through numerical simulations, noting that some agents benefit from using the belief of the last agent, while others experience performance deterioration.
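The WoM estimation chain can be sketched in a few lines. This is a minimal illustration, not the estimators analyzed in the paper: the state is a static scalar, the noise levels are made-up parameters, and each agent's "estimate" is simply the signal she receives.

```python
import random

def wom_chain(x, n_agents=4, meas_noise=0.1, degrade_noise=0.2, rng=random):
    """Toy Word-of-Mouth chain estimating a static scalar state x.

    Illustrative assumptions: Gaussian noise at each hop, and the raw
    received signal used as each agent's estimate.
    """
    # Agent 1 observes the state through measurement noise.
    signal = x + rng.gauss(0.0, meas_noise)
    estimates = [signal]
    # Each later agent sees only a degraded copy of her predecessor's
    # estimate, so noise accumulates along the chain.
    for _ in range(n_agents - 1):
        signal = estimates[-1] + rng.gauss(0.0, degrade_noise)
        estimates.append(signal)
    # The final agent's belief is broadcast and adopted by everyone,
    # replacing each agent's own estimate.
    return [estimates[-1]] * n_agents
```

Because every agent ends up holding the last (noisiest) estimate, early agents in the chain can be hurt by the broadcast while late agents lose nothing, mirroring the mixed benefit/deterioration noted in the abstract.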
analysis and our analysis of FRANCIS remains unchanged, we wish to note that in our own internal re-review we
We thank the reviewers for their thoughtful reviews; below we address their main concerns. This allows us to express the misspecification error (e.g., Eqn. 37 in the appendix) directly. Note that the results from Chi et al. We consider this work as a first step in this direction. Is a good representation sufficient for sample-efficient reinforcement learning?
Efficient First-Order Contextual Bandits: Prediction, Allocation, and Triangular Discrimination
A recurring theme in statistical learning, online learning, and beyond is that faster convergence rates are possible for problems with low noise, often quantified by the performance of the best hypothesis; such results are known as first-order or small-loss guarantees. While first-order guarantees are relatively well understood in statistical and online learning, adapting to low noise in contextual bandits (and more broadly, decision making) presents major algorithmic challenges. In a COLT 2017 open problem, Agarwal, Krishnamurthy, Langford, Luo, and Schapire asked whether first-order guarantees are even possible for contextual bandits and---if so---whether they can be attained by efficient algorithms. We give a resolution to this question by providing an optimal and efficient reduction from contextual bandits to online regression with the logarithmic (or, cross-entropy) loss. Our algorithm is simple and practical, readily accommodates rich function classes, and requires no distributional assumptions beyond realizability.
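The reduction described above pairs an online regression oracle with a randomized action-selection rule. As a hedged sketch of the general device behind such reductions, the snippet below implements inverse-gap weighting over predicted losses; the function name, the `gamma` parameter, and the toy inputs are illustrative assumptions, not the paper's exact algorithm.

```python
def igw_probabilities(pred_losses, gamma=10.0):
    """Inverse-gap weighting over a regression oracle's predictions.

    Actions far from the predicted best (large gap) are sampled with
    probability roughly inversely proportional to their gap; the
    remaining mass goes to the predicted best action. `gamma` is an
    illustrative exploration/learning-rate parameter.
    """
    k = len(pred_losses)
    best = min(range(k), key=lambda a: pred_losses[a])
    probs = [0.0] * k
    for a in range(k):
        if a != best:
            gap = pred_losses[a] - pred_losses[best]
            probs[a] = 1.0 / (k + gamma * gap)
    # Put all remaining probability mass on the predicted best action.
    probs[best] = 1.0 - sum(probs)
    return probs
```

For example, with predicted losses `[0.1, 0.5, 0.9]` the rule concentrates most mass on the first action while still exploring the others in proportion to how close their predicted losses are to the best.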