Goto

Collaborating Authors

 Undirected Networks


Online Planning in POMDPs with Self-Improving Simulators

arXiv.org Artificial Intelligence

How can we plan efficiently in a large and complex environment when the time budget is limited? However, there are three main limitations of this "twophase" Given the original simulator of the environment, paradigm, where a simulator is learned offline and which may be computationally very demanding, we then used as-is for online simulation and planning. First, no propose to learn online an approximate but much planning is possible until the offline learning phase finishes, faster simulator that improves over time. To plan which can take a long time. Second, the separation of learning reliably and efficiently while the approximate simulator and planning raises a question on what data collection policy is learning, we develop a method that adaptively should be used during training to ensure good online prediction decides which simulator to use for every simulation, during planning. We empirically demonstrate that when based on a statistic that measures the accuracy the training data is collected by a uniform random policy, the of the approximate simulator. This allows us to learned influence predictors can perform poorly during online use the approximate simulator to replace the original planning, due to distribution shift. Third, completely replacing simulator for faster simulations when it is accurate the original simulator with the approximate one after enough under the current context, thus trading training implies a risk of poor planning performance in certain off simulation speed and accuracy. Experimental situations, which is hard to detect in advance.


Generalization Error Bounds on Deep Learning with Markov Datasets

arXiv.org Machine Learning

In this paper, we derive upper bounds on generalization errors for deep neural networks with Markov datasets. These bounds are developed based on Koltchinskii and Panchenko's approach for bounding the generalization error of combined classifiers with i.i.d. datasets. The development of new symmetrization inequalities in high-dimensional probability for Markov chains is a key element in our extension, where the pseudo-spectral gap of the infinitesimal generator of the Markov chain plays as a key parameter in these inequalities. We also propose a simple method to convert these bounds and other similar ones in traditional deep learning and machine learning to Bayesian counterparts for both i.i.d. and Markov datasets. Extensions to $m$-order homogeneous Markov chains such as AR and ARMA models and mixtures of several Markov data services are given, where the spectral method in functional analysis is used to derive these results.


Exploiting Semantic Epsilon Greedy Exploration Strategy in Multi-Agent Reinforcement Learning

arXiv.org Artificial Intelligence

Multi-agent reinforcement learning (MARL) can model many real world applications. However, many MARL approaches rely on epsilon greedy for exploration, which may discourage visiting advantageous states in hard scenarios. In this paper, we propose a new approach QMIX(SEG) for tackling MARL. It makes use of the value function factorization method QMIX to train per-agent policies and a novel Semantic Epsilon Greedy (SEG) exploration strategy. SEG is a simple extension to the conventional epsilon greedy exploration strategy, yet it is experimentally shown to greatly improve the performance of MARL. We first cluster actions into groups of actions with similar effects and then use the groups in a bi-level epsilon greedy exploration hierarchy for action selection. We argue that SEG facilitates semantic exploration by exploring in the space of groups of actions, which have richer semantic meanings than atomic actions. Experiments show that QMIX(SEG) largely outperforms QMIX and leads to strong performance competitive with current state-of-the-art MARL approaches on the StarCraft Multi-Agent Challenge (SMAC) benchmark.


Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes

arXiv.org Machine Learning

Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration, but must propose a near-optimal policy for an arbitrary reward function revealed only after exploring. In the the tabular setting, it is well known that this is a more difficult problem than PAC RL -- where the agent has access to the reward function during exploration -- with optimal sample complexities in the two settings differing by a factor of $|\mathcal{S}|$, the size of the state space. We show that this separation does not exist in the setting of linear MDPs. We first develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP with sample complexity scaling as $\mathcal{O}(d^2/\epsilon^2)$. We then show a matching lower bound of $\Omega(d^2/\epsilon^2)$ on PAC RL. To our knowledge, our approach is the first computationally efficient algorithm to achieve optimal $d$ dependence in linear MDPs, even in the single-reward PAC setting. Our algorithm relies on a novel procedure which efficiently traverses a linear MDP, collecting samples in any given "feature direction", and enjoys a sample complexity scaling optimally in the (linear MDP equivalent of the) maximal state visitation probability. We show that this exploration procedure can also be applied to solve the problem of obtaining "well-conditioned" covariates in linear MDPs.


Distantly supervised end-to-end medical entity extraction from electronic health records with human-level quality

arXiv.org Artificial Intelligence

Medical entity extraction (EE) is a standard procedure used as a first stage in medical texts processing. Usually Medical EE is a two-step process: named entity recognition (NER) and named entity normalization (NEN). We propose a novel method of doing medical EE from electronic health records (EHR) as a single-step multi-label classification task by fine-tuning a transformer model pretrained on a large EHR dataset. Our model is trained end-to-end in an distantly supervised manner using targets automatically extracted from medical knowledge base. We show that our model learns to generalize for entities that are present frequently enough, achieving human-level classification quality for most frequent entities. Our work demonstrates that medical entity extraction can be done end-to-end without human supervision and with human quality given the availability of a large enough amount of unlabeled EHR and a medical knowledge base.


Probability estimation and structured output prediction for learning preferences in last mile delivery

arXiv.org Artificial Intelligence

We study the problem of learning the preferences of drivers and planners in the context of last mile delivery. Given a data set containing historical decisions and delivery locations, the goal is to capture the implicit preferences of the decision-makers. We consider two ways to use the historical data: one is through a probability estimation method that learns transition probabilities between stops (or zones). This is a fast and accurate method, recently studied in a VRP setting. Furthermore, we explore the use of machine learning to infer how to best balance multiple objectives such as distance, probability and penalties. Specifically, we cast the learning problem as a structured output prediction problem, where training is done by repeatedly calling the TSP solver. Another important aspect we consider is that for last-mile delivery, every address is a potential client and hence the data is very sparse. Hence, we propose a two-stage approach that first learns preferences at the zone level in order to compute a zone routing; after which a penalty-based TSP computes the stop routing. Results show that the zone transition probability estimation performs well, and that the structured output prediction learning can improve the results further. We hence showcase a successful combination of both probability estimation and machine learning, all the while using standard TSP solvers, both during learning and to compute the final solution; this means the methodology is applicable to other, real-life, TSP variants, or proprietary solvers.


Safety-Aware Multi-Agent Apprenticeship Learning

arXiv.org Artificial Intelligence

As the rapid development of Artifical Intelligence in the current technology field, Reinforcement Learning has been proven as a powerful technique that allows autonomous agents to learn optimal behaviors (called policies) in unknown and complex environments through models of rewards and penalization. However, in order to make this technique (Reinforcement Learning) work correctly and get the precise reward function, which returns the feedback to the learning agent about when the agent behaves correctly or not, the reward function needs to be thoroughly specified. As a result, in real-world complex environments, such as autonomous driving, specifying a correct reward function could be one of the hard tasks to tackle for the Reinforcement Learning model designers. To this end, Apprenticeship Learning techniques, in which the agent can infer a reward function from expert behaviors, are of high interest due to the fact that they could result in highly specified reward function efficiently. However, for critical tasks such as autonomous driving, we need to critically consider about the safety-related issues, so as to we need to build techniques to automatically check and ensure that the inferred rewards functions and policies resulted from the Reinforcement Learning model fulfill the needed safety requirements of the critical tasks that we have mentioned previously. In order to have a well-designed Reinforcement Learning model, which is able to generate the highly-specified reward function satisfying the safety-related considerations, the technique called "Safety-Aware Apprenticeship Learning" was built in 2018[23], which would be introduced in detail in the later sections. Although the technique "Safety-Aware Apprenticeship Learning" has been built, it only considers Single-Agent scenario. In the other word, the current "Safety-Aware Apprenticeship Learning" technique can only be applied to one agent running in an isolated environment, a fact which limits the potential implementation of this technique.


Accelerated Intravascular Ultrasound Imaging using Deep Reinforcement Learning

arXiv.org Artificial Intelligence

ABSTRACT Intravascular ultrasound (IVUS) offers a unique perspective in the treatment of vascular diseases by creating a sequence of ultrasound-slices acquired from within the vessel. However, unlike conventional hand-held ultrasound, the thin catheter only provides room for a small number of physical channels for signal transfer from a transducer-array at the tip. For continued improvement of image quality and frame rate, we present the use of deep reinforcement learning to deal with the current physical information bottleneck. V aluable inspiration has come from the field of magnetic resonance imaging (MRI), where learned acquisition schemes have brought significant acceleration in image acquisition at competing image quality. To efficiently accelerate IVUS imaging, we propose a framework that utilizes deep reinforcement learning for an optimal adaptive acquisition policy on a per-frame basis enabled by actor-critic methods and Gumbel top-K sampling. Index T erms-- Deep reinforcement learning, intravascu-lar ultrasound, compressed sensing 1. INTRODUCTION Minimally invasive vascular interventions are increasingly guided by phased array intravascular imaging.


Optimal Estimation and Computational Limit of Low-rank Gaussian Mixtures

arXiv.org Machine Learning

Structural matrix-variate observations routinely arise in diverse fields such as multi-layer network analysis and brain image clustering. While data of this type have been extensively investigated with fruitful outcomes being delivered, the fundamental questions like its statistical optimality and computational limit are largely under-explored. In this paper, we propose a low-rank Gaussian mixture model (LrMM) assuming each matrix-valued observation has a planted low-rank structure. Minimax lower bounds for estimating the underlying low-rank matrix are established allowing a whole range of sample sizes and signal strength. Under a minimal condition on signal strength, referred to as the information-theoretical limit or statistical limit, we prove the minimax optimality of a maximum likelihood estimator which, in general, is computationally infeasible. If the signal is stronger than a certain threshold, called the computational limit, we design a computationally fast estimator based on spectral aggregation and demonstrate its minimax optimality. Moreover, when the signal strength is smaller than the computational limit, we provide evidences based on the low-degree likelihood ratio framework to claim that no polynomial-time algorithm can consistently recover the underlying low-rank matrix. Our results reveal multiple phase transitions in the minimax error rates and the statistical-to-computational gap. Numerical experiments confirm our theoretical findings. We further showcase the merit of our spectral aggregation method on the worldwide food trading dataset.


Reinforcement Learning Your Way: Agent Characterization through Policy Regularization

arXiv.org Artificial Intelligence

Recent advances in reinforcement learning (RL) have increased complexity which, especially for deep RL, has brought forth challenges related to explainability [1]. The opacity of state-of-the-art RL algorithms has led to model developers having a limited understanding of their agents' policies and no influence over learned strategies [2]. While concerns surrounding explainability have been noted for AI in general, it is only more recently that attempts have been made to explain RL systems [3, 1, 4, 5]. These attempts have resulted in a wide suite of methods requiring various degrees of expert knowledge, either about the state-action domain or about the specific RL algorithm. Further, they typically rely on post-hoc analysis of learned policies which give only observational assurances of agents' behaviour. We instead propose an intrinsic method of regularizing agents' actions based on a given prior. While current methods for RL regularization aim to improve training performance - e.g., by maximizing the entropy of the action distribution [6], or by minimising the distance to a prior sub-optimal state-action distribution [7] - our aim is the characterization of our agents' behaviours. We also extend the current regularization techniques to accommodate multi-agent systems which allows intrinsic characterization of individual agents. We provide a formal argument for the rationale of our method and demonstrate its efficacy in a toy problem where agents learn to navigate to a destination on a grid by performing, e.g., only right turns (under the premise that right turns are