Goto

Collaborating Authors

 Markov Models


Temporal and Object Quantification Networks

arXiv.org Machine Learning

We present Temporal and Object Quantification Networks (TOQ-Nets), a new class of neuro-symbolic networks with a structural bias that enables them to learn to recognize complex relational-temporal events. This is done by including reasoning layers that implement finite-domain quantification over objects and time. The structure allows them to generalize directly to input instances with varying numbers of objects in temporal sequences of varying lengths. We evaluate TOQ-Nets on input domains that require recognizing event-types in terms of complex temporal relational patterns. We demonstrate that TOQ-Nets can generalize from small amounts of data to scenarios containing more objects than were present during training and to temporal warpings of input sequences.


Thompson Sampling with a Mixture Prior

arXiv.org Machine Learning

We study Thompson sampling (TS) in online decision-making problems where the uncertain environment is sampled from a mixture distribution. This is relevant to multi-task settings, where a learning agent is faced with different classes of problems. We incorporate this structure in a natural way by initializing TS with a mixture prior -- dubbed MixTS -- and develop a novel, general technique for analyzing the regret of TS with such priors. We apply this technique to derive Bayes regret bounds for MixTS in both linear bandits and tabular Markov decision processes (MDPs). Our regret bounds reflect the structure of the problem and depend on the number of components and confidence width of each component of the prior. Finally, we demonstrate the empirical effectiveness of MixTS in both synthetic and real-world experiments.


Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

arXiv.org Machine Learning

Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" $\mu$ close to the optimal policy $\pi_\star$ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and horizon length $H$. We first design a sharp offline reduction algorithm -- which simply executes $\mu$ and runs offline policy optimization on the collected dataset -- that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $\mu$ and $\pi_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the environment. This implies that -- perhaps surprisingly -- the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $\mu$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies concentrability partially up to a certain time step.


Information Avoidance and Overvaluation in Sequential Decision Making under Epistemic Constraints

arXiv.org Artificial Intelligence

Decision makers involved in the management of civil assets and systems usually take actions under constraints imposed by societal regulations. Some of these constraints are related to epistemic quantities, as the probability of failure events and the corresponding risks. Sensors and inspectors can provide useful information supporting the control process (e.g. the maintenance process of an asset), and decisions about collecting this information should rely on an analysis of its cost and value. When societal regulations encode an economic perspective that is not aligned with that of the decision makers, the Value of Information (VoI) can be negative (i.e., information sometimes hurts), and almost irrelevant information can even have a significant value (either positive or negative), for agents acting under these epistemic constraints. We refer to these phenomena as Information Avoidance (IA) and Information OverValuation (IOV). In this paper, we illustrate how to assess VoI in sequential decision making under epistemic constraints (as those imposed by societal regulations), by modeling a Partially Observable Markov Decision Processes (POMDP) and evaluating non optimal policies via Finite State Controllers (FSCs). We focus on the value of collecting information at current time, and on that of collecting sequential information, we illustrate how these values are related and we discuss how IA and IOV can occur in those settings.


Unsupervised Machine Learning Hidden Markov Models in Python

#artificialintelligence

The Hidden Markov Model or HMM is all about learning sequences. A lot of the data that would be very useful for us to model is in sequences. Stock prices are sequences of prices. Language is a sequence of words. Credit scoring involves sequences of borrowing and repaying money, and we can use those sequences to predict whether or not you're going to default.


Measurable Monte Carlo Search Error Bounds

arXiv.org Artificial Intelligence

Monte Carlo planners can often return sub-optimal actions, even if they are guaranteed to converge in the limit of infinite samples. Known asymptotic regret bounds do not provide any way to measure confidence of a recommended action at the conclusion of search. In this work, we prove bounds on the sub-optimality of Monte Carlo estimates for non-stationary bandits and Markov decision processes. These bounds can be directly computed at the conclusion of the search and do not require knowledge of the true action-value. The presented bound holds for general Monte Carlo solvers meeting mild convergence conditions. We empirically test the tightness of the bounds through experiments on a multi-armed bandit and a discrete Markov decision process for both a simple solver and Monte Carlo tree search.


Safe Deep Q-Network for Autonomous Vehicles at Unsignalized Intersection

arXiv.org Artificial Intelligence

We propose a safe DRL approach for autonomous vehicle (AV) navigation through crowds of pedestrians while making a left turn at an unsignalized intersection. Our method uses two long-short term memory (LSTM) models that are trained to generate the perceived state of the environment and the future trajectories of pedestrians given noisy observations of their movement. A future collision prediction algorithm based on the future trajectories of the ego vehicle and pedestrians is used to mask unsafe actions if the system predicts a collision. The performance of our approach is evaluated in two experiments using the high-fidelity CARLA simulation environment. The first experiment tests the performance of our method at intersections that are similar to the training intersection and the second experiment tests our method at intersections with a different topology. For both experiments, our methods do not result in a collision with a pedestrian while still navigating the intersection at a reasonable speed.


Residual Feedback Learning for Contact-Rich Manipulation Tasks with Uncertainty

arXiv.org Artificial Intelligence

While classic control theory offers state of the art solutions in many problem scenarios, it is often desired to improve beyond the structure of such solutions and surpass their limitations. To this end, \emph{\gls{rpl}} offers a formulation to improve existing controllers with reinforcement learning (RL) by learning an additive "residual" to the output of a given controller. However, the applicability of such an approach highly depends on the structure of the controller. Often, internal feedback signals of the controller limit an RL algorithm to adequately change the policy and, hence, learn the task. We propose a new formulation that addresses these limitations by also modifying the feedback signals to the controller with an RL policy and show superior performance of our approach on a contact-rich peg-insertion task under position and orientation uncertainty. In addition, we use a recent impedance control architecture as control framework and show the difficulties of standard RPL. Furthermore, we introduce an adaptive curriculum for the given task to gradually increase the task difficulty in terms of position and orientation uncertainty. A video showing the results can be found at https://youtu.be/SAZm_Krze7U .


Drones for Medical Delivery Considering Different Demands Classes: A Markov Decision Process Approach for Managing Health Centers Dispatching Medical Products

arXiv.org Artificial Intelligence

We consider the problem of optimizing the distribution operations of a hub using drones to deliver medical supplies to different geographic regions. Drones are an innovative method with many benefits including low-contact delivery thereby reducing the spread of pandemic and vaccine-preventable diseases. While we focus on medical supply delivery for this work, it is applicable to drone delivery for many other applications, including food, postal items, and e-commerce delivery. In this paper, our goal is to address drone delivery challenges by optimizing the distribution operations at a drone hub that dispatch drones to different geographic locations generating stochastic demands for medical supplies. By considering different geographic locations, we consider different classes of demand that require different flight ranges, which is directly related to the amount of charge held in a drone battery. We classify the stochastic demands based on their distance from the drone hub, use a Markov decision process to model the problem, and perform computational tests using realistic data representing a prominent drone delivery company. We solve the problem using a reinforcement learning method and show its high performance compared with the exact solution found using dynamic programming. Finally, we analyze the results and provide insights for managing the drone hub operations.


Don't Get Yourself into Trouble! Risk-aware Decision-Making for Autonomous Vehicles

arXiv.org Artificial Intelligence

Risk is traditionally described as the expected likelihood of an undesirable outcome, such as collisions for autonomous vehicles. Accurately predicting risk or potentially risky situations is critical for the safe operation of autonomous vehicles. In our previous work, we showed that risk could be characterized by two components: 1) the probability of an undesirable outcome and 2) an estimate of how undesirable the outcome is (loss). This paper is an extension to our previous work. In this paper, using our trained deep reinforcement learning model for navigating around crowds, we developed a risk-based decision-making framework for the autonomous vehicle that integrates the high-level risk-based path planning with the reinforcement learning-based low-level control. We evaluated our method in a high-fidelity simulation such as CARLA. This work can improve safety by allowing an autonomous vehicle to one day avoid and react to risky situations.