Contextual bandits with entropy-based human feedback
Seraj, Raihan, Meng, Lili, Sylvain, Tristan
In recent years, preference-based human feedback mechanisms have become essential for enhancing model performance across diverse applications, including conversational AI systems such as ChatGPT. However, existing approaches often neglect critical aspects, such as model uncertainty and the variability in feedback quality. To address these challenges, we introduce an entropy-based human feedback framework for contextual bandits, which dynamically balances exploration and exploitation by soliciting expert feedback only when model entropy exceeds a predefined threshold.

This work investigates how explicit human feedback can enhance contextual bandit (CB) performance. Building on successful integrations of human guidance in reinforcement learning (Christiano et al., 2017; MacGlashan et al., 2017) and conversational AI (Achiam et al., 2023), we distinguish two primary feedback paradigms: (1) action-based feedback, where experts directly prescribe optimal actions for specific contexts (Osa et al., 2018; Li et al., 2023), and (2) preference-based feedback, where humans compare pairs of learner-generated actions to express relative preferences (Christiano et al., 2017; Saha et al., 2023). While action-based methods require precise expert knowledge, we focus on preference feedback for its practical advantages in scalable data collection.
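A minimal sketch of the entropy-gating loop described above, assuming a softmax policy over per-action linear reward estimates and a simulated oracle standing in for the human's pairwise preference; all names, sizes, and the threshold value are illustrative assumptions, not the paper's implementation:

import numpy as np

rng = np.random.default_rng(0)
n_actions, dim, tau = 4, 8, 1.0              # tau: entropy threshold (assumed value)
W_true = rng.normal(size=(n_actions, dim))   # hidden reward model, simulation only
W = np.zeros((n_actions, dim))               # learner's per-action linear estimates

def policy(x):
    # Softmax over estimated rewards; a stand-in for the model's action distribution.
    logits = W @ x
    z = np.exp(logits - logits.max())
    return z / z.sum()

def entropy(p):
    return float(-np.sum(p * np.log(p + 1e-12)))

for t in range(2000):
    x = rng.normal(size=dim)                 # observe a context
    p = policy(x)
    if entropy(p) > tau:
        # Uncertain: show the expert two sampled candidate actions and take the
        # preferred one (preference feedback, simulated here by the true model).
        a1, a2 = rng.choice(n_actions, size=2, replace=False, p=p)
        a = a1 if W_true[a1] @ x >= W_true[a2] @ x else a2
    else:
        a = int(np.argmax(p))                # confident: exploit the estimate
    r = W_true[a] @ x + 0.1 * rng.normal()   # observe reward
    W[a] += 0.05 * (r - W[a] @ x) * x        # SGD update for the chosen arm

With entropy gating, expert queries concentrate early (the zero-initialized model is near-uniform) and taper off as the reward estimates sharpen.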
Generalizing Multi-Step Inverse Models for Representation Learning to Finite-Memory POMDPs
Wu, Lili, Evans, Ben, Islam, Riashat, Seraj, Raihan, Efroni, Yonathan, Lamb, Alex
Discovering an informative, or agent-centric, state representation that encodes only the relevant information while discarding the irrelevant is a key challenge towards scaling reinforcement learning algorithms and efficiently applying them to downstream tasks. Prior works studied this problem in high-dimensional Markovian environments, where the current observation may be a complex object but is sufficient to decode the informative state. In this work, we consider the problem of discovering the agent-centric state in the more challenging high-dimensional non-Markovian setting, where the state can be decoded from a sequence of past observations. We establish that generalized inverse models can be adapted to learn agent-centric state representations for this task. Our results include asymptotic theory in the deterministic dynamics setting as well as counter-examples for alternative intuitive algorithms. We complement these findings with a thorough empirical study of the agent-centric state discovery abilities of the different alternatives we put forward. Particularly notable is our analysis of past actions, where we show that these can be a double-edged sword: they make the algorithms more successful when used correctly and cause dramatic failure when used incorrectly.
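Since the abstract's central device is the multi-step inverse model, here is a hedged sketch of that objective adapted to the finite-memory setting: an encoder reads a short window of past observations, and a classifier predicts the first action taken between two encoded time points. The architecture, window length, and all sizes are assumptions for illustration, not the paper's setup:

import torch
import torch.nn as nn

obs_dim, mem, latent, n_actions = 16, 4, 8, 3

encoder = nn.Sequential(              # phi: window of `mem` observations -> latent
    nn.Flatten(), nn.Linear(obs_dim * mem, 64), nn.ReLU(), nn.Linear(64, latent))
head = nn.Sequential(                 # predicts a_t from (phi_t, phi_{t+k})
    nn.Linear(2 * latent, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam([*encoder.parameters(), *head.parameters()], lr=1e-3)

def multistep_inverse_loss(win_t, win_tk, a_t):
    # win_t, win_tk: (B, mem, obs_dim) observation windows at times t and t+k;
    # a_t: (B,) first action of the gap. Cross-entropy on that action.
    z_t, z_tk = encoder(win_t), encoder(win_tk)
    logits = head(torch.cat([z_t, z_tk], dim=-1))
    return nn.functional.cross_entropy(logits, a_t)

# One illustrative step on random tensors; a real run would sample the gap k
# and draw windows from trajectories collected in the environment.
win_t = torch.randn(32, mem, obs_dim)
win_tk = torch.randn(32, mem, obs_dim)
a_t = torch.randint(0, n_actions, (32,))
loss = multistep_inverse_loss(win_t, win_tk, a_t)
opt.zero_grad(); loss.backward(); opt.step()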
PcLast: Discovering Plannable Continuous Latent States
Koul, Anurag, Sujit, Shivakanth, Chen, Shaoru, Evans, Ben, Wu, Lili, Xu, Byron, Chari, Rajan, Islam, Riashat, Seraj, Raihan, Efroni, Yonathan, Molu, Lekan, Dudik, Miro, Langford, John, Lamb, Alex
Goal-conditioned planning benefits from learned low-dimensional representations of rich, high-dimensional observations. While compact latent representations, typically learned from variational autoencoders or inverse dynamics, enable goal-conditioned planning, they ignore state affordances, thus hampering sample-efficient planning. In this paper, we learn a representation that associates reachable states together for effective onward planning. We first learn a latent representation with multi-step inverse dynamics (to remove distracting information), and then transform this representation to associate reachable states together in $\ell_2$ space. Our proposals are rigorously tested in various simulation testbeds. Numerical results in reward-based and reward-free settings show significant improvements in sample efficiency and yield layered state abstractions that enable computationally efficient hierarchical planning.
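A hedged sketch of the second stage described above: given latents from a multi-step-inverse encoder, learn a map f so that states reachable within a few steps land close in $\ell_2$ while unrelated pairs stay apart. The hinge loss, margin, and sizes are illustrative assumptions, not the paper's exact objective:

import torch
import torch.nn as nn

latent = 8
f = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, latent))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

def reachability_loss(z, z_near, z_rand, margin=1.0):
    # z, z_near: latents a few environment steps apart on the same trajectory;
    # z_rand: latents of unrelated states. Pull near pairs together in l2 and
    # push random pairs at least `margin` apart (hinge).
    d_pos = (f(z) - f(z_near)).pow(2).sum(-1)
    d_neg = (f(z) - f(z_rand)).pow(2).sum(-1)
    return (d_pos + torch.relu(margin - d_neg)).mean()

# Illustrative step on random tensors standing in for encoder outputs:
z, z_near, z_rand = (torch.randn(64, latent) for _ in range(3))
loss = reachability_loss(z, z_near, z_rand)
opt.zero_grad(); loss.backward(); opt.step()

After training, nearest-neighbor queries in the transformed space approximate reachability, which is what makes the representation useful for onward and hierarchical planning.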
AutoCast++: Enhancing World Event Prediction with Zero-shot Ranking-based Context Retrieval
Yan, Qi, Seraj, Raihan, He, Jiawei, Meng, Lili, Sylvain, Tristan
Machine-based prediction of real-world events is garnering attention due to its potential for informed decision-making. Whereas traditional forecasting predominantly hinges on structured data like time-series, recent breakthroughs in language models enable predictions using unstructured text. In particular, Zou et al. (2022) unveil AutoCast, a new benchmark that employs news articles for answering forecasting queries. Nevertheless, existing methods still trail behind human performance. The cornerstone of accurate forecasting, we argue, lies in identifying a concise yet rich subset of news snippets from a vast corpus. With this motivation, we introduce AutoCast++, a zero-shot ranking-based context retrieval system tailored to sift through expansive news document collections for event forecasting. Our approach first re-ranks articles based on zero-shot question-passage relevance, homing in on semantically pertinent news. Following this, the chosen articles are subjected to zero-shot summarization to attain succinct context. Leveraging a pre-trained language model, we conduct both the relevance evaluation and article summarization without needing domain-specific training. Notably, recent articles can sometimes be at odds with preceding ones due to new facts or unanticipated incidents, leading to fluctuating temporal dynamics. To tackle this, our re-ranking mechanism gives preference to more recent articles, and we further regularize the multi-passage representation learning to align with human forecaster responses made on different dates. Empirical results underscore marked improvements across multiple metrics, improving performance on multiple-choice questions (MCQ) by 48% and on true/false (TF) questions by up to 8%.
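To make the recency-aware re-ranking step concrete, here is a minimal sketch: score each article for question-passage relevance with any zero-shot scorer, discount older articles, and keep the top-k for summarization. The exponential decay schedule, the relevance callable, and all field names are assumptions for illustration, not the paper's implementation:

from dataclasses import dataclass
from datetime import date
from math import exp

@dataclass
class Article:
    text: str
    published: date

def rerank(question, articles, relevance, query_date, half_life_days=30.0, k=5):
    # relevance(question, text) -> float is any zero-shot question-passage
    # scorer (e.g. a pre-trained LM); recency is folded in multiplicatively
    # via a half-life decay on article age.
    def score(a):
        age = max((query_date - a.published).days, 0)
        return relevance(question, a.text) * exp(-age * 0.693 / half_life_days)
    return sorted(articles, key=score, reverse=True)[:k]

# Toy usage with a keyword-overlap scorer standing in for the language model:
def overlap(q, t):
    return len(set(q.lower().split()) & set(t.lower().split()))

docs = [Article("election results announced", date(2022, 1, 10)),
        Article("sports scores from last night", date(2022, 3, 1))]
top = rerank("who won the election", docs, overlap, date(2022, 3, 5))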
Tsetlin Machine for Solving Contextual Bandit Problems
Seraj, Raihan, Sharma, Jivitesh, Granmo, Ole-Christoffer
This paper introduces an interpretable contextual bandit algorithm using Tsetlin Machines, which solve complex pattern recognition tasks using propositional logic. The proposed bandit learning algorithm relies on straightforward bit manipulation, thus simplifying computation and interpretation. We then present a mechanism for performing Thompson sampling with the Tsetlin Machine, given its non-parametric nature. Our empirical analysis shows that the Tsetlin Machine as a base contextual bandit learner outperforms other popular base learners on eight out of nine datasets. We further analyze the interpretability of our learner, investigating how arms are selected based on propositional expressions that model the context.
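Because the Tsetlin Machine is non-parametric, there is no posterior to sample from directly; a common workaround, used here purely as a hedged stand-in rather than the paper's exact mechanism, is bootstrapped Thompson sampling: maintain an ensemble of learners, feed each a Poisson-reweighted copy of every observation, and act greedily with respect to one randomly drawn member per round. Simple linear models stand in for the Tsetlin Machine learners in this sketch:

import numpy as np

rng = np.random.default_rng(1)
n_actions, dim, n_models = 3, 5, 10
W_true = rng.normal(size=(n_actions, dim))       # simulation ground truth only
ensemble = np.zeros((n_models, n_actions, dim))  # bootstrap replicas of the learner

for t in range(2000):
    x = rng.normal(size=dim)                     # observe a context
    m = rng.integers(n_models)                   # drawing a member ~ posterior sample
    a = int(np.argmax(ensemble[m] @ x))          # act greedily for that member
    r = W_true[a] @ x + 0.1 * rng.normal()       # observe reward
    for i in range(n_models):                    # online bootstrap: Poisson weights
        w = rng.poisson(1.0)
        if w:
            err = r - ensemble[i, a] @ x
            ensemble[i, a] += 0.05 * w * err * x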