Park, Hongju
Graph Canonical Correlation Analysis
Park, Hongju, Bai, Shuyang, Ye, Zhenyao, Lee, Hwiyoung, Ma, Tianzhou, Chen, Shuo
CCA considers the following maximization problem: $\max_{a,b}\; a^\top X^\top Y b$ subject to $a^\top X^\top X a = 1$ and $b^\top Y^\top Y b = 1$, where the vectors $a$ and $b$ and the attained correlation are called the canonical vectors and the canonical correlation if they solve the above maximization. In classical canonical correlation analysis, the canonical vectors $a$ and $b$ have nonzero loadings for all $X$ and $Y$ variables. However, in a high-dimensional setting with $p, q \gg n$, the goal is to identify which subsets of $X$ are associated with which subsets of $Y$ and to estimate the strength of these associations, since the canonical correlation computed from the full dataset is inflated by estimation bias caused by overfitting. To ensure sparsity, shrinkage methods are commonly used. For example, Witten et al. (2009) propose sparse canonical correlation analysis (sCCA). The sCCA criterion can in general be expressed as $\max_{a,b}\; a^\top X^\top Y b$ subject to $a^\top X^\top X a \le 1$, $b^\top Y^\top Y b \le 1$, $P_1(a) \le k_1$, and $P_2(b) \le k_2$, where $P_1$ and $P_2$ are convex penalty functions for $a$ and $b$ with positive constants $k_1$ and $k_2$, respectively. A representative choice is the $\ell_1$ penalty, with $P_1(a) = \|a\|_1$ and $P_2(b) = \|b\|_1$. sCCA forces some loadings of the canonical vectors to zero and thus selects only subsets of correlated $X$ and $Y$ variables. However, sCCA methods may neither fully recover the correlated $X$ and $Y$ pairs nor capture multivariate-to-multivariate linkage patterns (see Figure 3), because the $\ell_1$ shrinkage tends to select only a small subset of the associated $X$ and $Y$ variables.
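For concreteness, the following is a minimal sketch of the sCCA criterion above, in the spirit of the penalized matrix decomposition of Witten et al. (2009): alternating soft-thresholding updates of a and b on centered data matrices X (n x p) and Y (n x q). Using fixed soft-threshold levels in place of the exact ℓ1-constraint search, and unit-norm scaling in place of the a'X'Xa constraint, are simplifications assumed here for brevity; this is not the graph-based method proposed in this work.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_cca(X, Y, k1=0.3, k2=0.3, n_iter=100):
    """Alternating sparse updates for max a' X'Y b with sparsity on a and b.
    Fixed threshold levels k1, k2 and unit-norm scaling are illustrative
    simplifications of the constrained criterion in Witten et al. (2009)."""
    C = X.T @ Y                       # cross-product matrix X'Y
    p, q = C.shape
    a = np.ones(p) / np.sqrt(p)       # initial canonical vectors
    b = np.ones(q) / np.sqrt(q)
    for _ in range(n_iter):
        a = soft_threshold(C @ b, k1)
        a /= np.linalg.norm(a) + 1e-12
        b = soft_threshold(C.T @ a, k2)
        b /= np.linalg.norm(b) + 1e-12
    rho = np.corrcoef(X @ a, Y @ b)[0, 1]   # correlation of the selected variates
    return a, b, rho
```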
Thompson Sampling in Partially Observable Contextual Bandits
Park, Hongju, Faradonbeh, Mohamad Kazem Shirani
Contextual bandits constitute a classical framework for decision-making under uncertainty. In this setting, the goal is to learn the arms of highest reward given the contextual information, while the unknown reward parameters of each arm can only be learned by experimenting with that specific arm. Accordingly, a fundamental problem is that of balancing exploration (i.e., pulling different arms to learn their parameters) versus exploitation (i.e., pulling the best arms to gain reward). To study this problem, the existing literature mostly considers perfectly observed contexts. However, the setting of partial context observations remains unexplored to date, despite being theoretically more general and practically more versatile. We study bandit policies for learning to select optimal arms based on the observed data, which are noisy linear functions of the unobserved context vectors. Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation. Specifically, we establish the following: (i) regret bounds that grow poly-logarithmically with time, (ii) square-root consistency of parameter estimation, and (iii) scaling of the regret with other quantities, including the dimension and the number of arms. Extensive numerical experiments with both real and synthetic data are presented as well, corroborating the efficacy of Thompson sampling. To establish the results, we introduce novel martingale techniques and concentration inequalities to address partially observed dependent random variables generated from unspecified distributions, and also leverage problem-dependent information to sharpen probabilistic bounds for time-varying suboptimality gaps. These techniques pave the way toward studying other decision-making problems with contextual information as well as partial observations.
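For intuition, here is a minimal, self-contained simulation of a Thompson sampling policy in a partially observed setting of this kind: the context of each round is unobserved, the policy only sees a noisy linear function of it, and each arm keeps Gaussian posterior (ridge-style) statistics over a parameter linking observations to rewards. The problem sizes, noise levels, and the choice to regress rewards directly on the observations are illustrative simplifications assumed here, not the exact algorithm or model analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (hypothetical, not taken from the paper).
d, n_arms, horizon = 5, 3, 2000
theta = rng.normal(size=(n_arms, d))      # true (unknown) arm parameters
A = rng.normal(size=(d, d))               # sensing matrix: observation = A @ context + noise

# Per-arm ridge/posterior statistics: B_i = lam*I + sum y y', f_i = sum reward * y
lam, sigma2 = 1.0, 1.0
B = np.stack([lam * np.eye(d) for _ in range(n_arms)])
f = np.zeros((n_arms, d))

regret = 0.0
for t in range(horizon):
    x = rng.normal(size=d)                # unobserved context
    y = A @ x + 0.5 * rng.normal(size=d)  # partial (noisy linear) observation

    # Thompson sampling: draw a parameter from each arm's Gaussian posterior,
    # then act greedily with respect to the sampled parameters.
    sampled_rewards = []
    for i in range(n_arms):
        B_inv = np.linalg.inv(B[i])
        theta_hat = B_inv @ f[i]
        theta_tilde = rng.multivariate_normal(theta_hat, sigma2 * B_inv)
        sampled_rewards.append(y @ theta_tilde)
    arm = int(np.argmax(sampled_rewards))

    # True expected rewards depend on the unobserved context x.
    means = theta @ x
    reward = means[arm] + 0.3 * rng.normal()
    regret += means.max() - means[arm]

    # Update the pulled arm's posterior with the observation-reward pair.
    B[arm] += np.outer(y, y)
    f[arm] += reward * y

print(f"cumulative regret after {horizon} rounds: {regret:.1f}")
```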
Worst-case Performance of Greedy Policies in Bandits with Imperfect Context Observations
Park, Hongju, Faradonbeh, Mohamad Kazem Shirani
Contextual bandits are canonical models for sequential decision-making under uncertainty in environments with time-varying components. In this setting, the expected reward of each bandit arm is the inner product of an unknown parameter with the context vector of that arm. The classical bandit settings heavily rely on assuming that the contexts are fully observed, while the study of the richer model of imperfectly observed contextual bandits remains immature. This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values. We establish that the non-asymptotic worst-case regret grows poly-logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms. Numerical analysis showcasing the efficiency of Greedy policies is also provided.
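To make the Greedy rule concrete, the sketch below shows only the arm-selection step, reusing the per-arm ridge statistics B and f and the noisy observation y from the Thompson sampling sketch above; the names and shapes are assumptions for illustration. Greedy plugs in the point estimates as if they were the true values, with no posterior sampling and no optimism bonus. This is an illustrative reading of the policy, not the authors' exact pseudocode.

```python
import numpy as np

def greedy_arm(B, f, y):
    """Greedy selection: act as if the ridge estimates were the true parameters.
    B: (n_arms, d, d) regularized Gram matrices, f: (n_arms, d) response sums,
    y: (d,) current noisy observation of the context."""
    theta_hat = np.stack([np.linalg.solve(B[i], f[i]) for i in range(len(f))])
    return int(np.argmax(theta_hat @ y))   # no posterior sampling, no optimism
```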
Efficient Algorithms for Learning to Control Bandits with Unobserved Contexts
Park, Hongju, Faradonbeh, Mohamad Kazem Shirani
Contextual bandits are commonly used for sequential decision-making with finitely many control actions. In this setting, available context observations can be utilized in a tractable way, thanks to the linearity of the relationship between the reward and the context vectors. The arms provide rewards depending on the contexts that represent their individual characteristics. The range of real-world applications is notably extensive, including personalized recommendations for Mobile Context-Aware Recommender Systems and mobile-health interventions [1, 2, 3]. To obtain satisfactory performance in bandits, the exploration-exploitation trade-off must be addressed. The theoretical analysis of efficient policies for multi-armed bandits goes back to algorithms that decide based on Upper-Confidence-Bounds (UCB) [4]. UCB employs an optimistic approximation of the unknown reward, built from the history of observations, to allow an appropriate degree of exploration. Further theoretical results for UCB in contextual bandits, as well as in other settings, are available in the literature [5, 6, 7, 8, 9]. Posterior sampling is another ubiquitous reinforcement learning algorithm that effectively balances exploitation versus exploration.
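As background on the optimism principle mentioned above, the snippet below sketches a generic LinUCB-style selection rule for a linear contextual bandit with fully observed contexts: the ridge estimate built from the history is inflated by an exploration bonus controlled by a tuning parameter alpha. The interface and parameter names are assumptions for illustration; this is standard background, not the algorithm developed in this work.

```python
import numpy as np

def ucb_arm(B, f, contexts, alpha=1.0):
    """Optimistic (UCB) selection for a linear contextual bandit.
    B: (n_arms, d, d) regularized Gram matrices, f: (n_arms, d) response sums,
    contexts: (n_arms, d) observed context of each arm, alpha: confidence width."""
    scores = []
    for i, x in enumerate(contexts):
        B_inv = np.linalg.inv(B[i])
        theta_hat = B_inv @ f[i]                  # ridge estimate from the history
        bonus = alpha * np.sqrt(x @ B_inv @ x)    # exploration bonus (confidence width)
        scores.append(theta_hat @ x + bonus)      # optimistic reward estimate
    return int(np.argmax(scores))
```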
Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits
Park, Hongju, Faradonbeh, Mohamad Kazem Shirani
Contextual multi-armed bandits are classical models in reinforcement learning for sequential decision-making associated with individual information. A widely used policy for bandits is Thompson Sampling, where samples from a data-driven probabilistic belief about the unknown parameters are used to select the control actions. For this computationally fast algorithm, performance analyses are available under full context observations. However, little is known for problems in which the contexts are not fully observed. We propose a Thompson Sampling algorithm for partially observable contextual multi-armed bandits, and establish theoretical performance guarantees. Technically, we show that the regret of the presented policy scales logarithmically with time and the number of arms, and linearly with the dimension. Further, we establish rates of learning the unknown parameters, and provide illustrative numerical analyses.