Undirected Networks
Visuo-Tactile Transformers for Manipulation
Chen, Yizhou, Sipos, Andrea, Van der Merwe, Mark, Fazeli, Nima
Learning representations in the joint domain of vision and touch can improve manipulation dexterity, robustness, and sample-complexity by exploiting mutual information and complementary cues. Here, we present Visuo-Tactile Transformers (VTTs), a novel multimodal representation learning approach suited for model-based reinforcement learning and planning. Our approach extends the Visual Transformer \cite{dosovitskiy2021image} to handle visuo-tactile feedback. Specifically, VTT uses tactile feedback together with self and cross-modal attention to build latent heatmap representations that focus attention on important task features in the visual domain. We demonstrate the efficacy of VTT for representation learning with a comparative evaluation against baselines on four simulated robot tasks and one real-world block-pushing task. We conduct an ablation study over the components of VTT to highlight the importance of cross-modality in representation learning.
POETREE: Interpretable Policy Learning with Adaptive Decision Trees
Pace, Alizée, Chan, Alex J., van der Schaar, Mihaela
Building models of human decision-making from observed behaviour is critical to better understand, diagnose and support real-world policies such as clinical care. As established policy learning approaches remain focused on imitation performance, they fall short of explaining the demonstrated decision-making process. Policy Extraction through decision Trees (POETREE) is a novel framework for interpretable policy learning that is compatible with fully-offline and partially-observable clinical decision environments and builds probabilistic tree policies determining physician actions based on patients' observations and medical history. Fully-differentiable tree architectures are grown incrementally during optimization to adapt their complexity to the modelling task, and learn a representation of patient history through recurrence, resulting in decision tree policies that adapt over time with patient information. This policy learning method outperforms the state of the art on real and synthetic medical datasets, both in understanding, quantifying and evaluating observed behaviour and in accurately replicating it -- with potential to improve future decision support systems.
Gradient flows and randomised thresholding: sparse inversion and classification
Sparse inversion and classification problems are ubiquitous in modern data science and imaging. They are often formulated as non-smooth minimisation problems. In sparse inversion, we minimise, e.g., the sum of a data fidelity term and an L1/LASSO regulariser. In classification, we consider, e.g., the sum of a data fidelity term and a non-smooth Ginzburg--Landau energy. Standard (sub)gradient descent methods have been shown to be inefficient when approaching such problems. Splitting techniques are much more useful: here, the target function is partitioned into a sum of two subtarget functions -- each of which can be efficiently optimised. Splitting proceeds by performing optimisation steps alternately with respect to each of the two subtarget functions. In this work, we study splitting from a stochastic continuous-time perspective. Indeed, we define a differential inclusion that follows the negative subdifferential of one of the two subtarget functions at each point in time. The choice of the subtarget function is controlled by a binary continuous-time Markov process. The resulting dynamical system is a stochastic approximation of the underlying subgradient flow. We investigate this stochastic approximation for an L1-regularised sparse inversion flow and for a discrete Allen--Cahn equation minimising a Ginzburg--Landau energy. In both cases, we study the longtime behaviour of the stochastic dynamical system and its ability to approximate the underlying subgradient flow to any accuracy. We illustrate our theoretical findings in a simple sparse estimation problem and also in low- and high-dimensional classification problems.
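The deterministic splitting idea described in the abstract above can be illustrated with a minimal sketch. It is not the stochastic continuous-time scheme studied in the paper. It uses the standard forward-backward (proximal gradient) iteration for a least-squares data fidelity with an L1 regulariser, alternating a gradient step on the fidelity term with a soft-thresholding step for the L1 term. The function names, the toy problem, the step size, and the regularisation weight are all assumptions made for illustration.

```python
def soft_threshold(x, t):
    """Proximal map of t * |x|: shrink x towards zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def split_step(x, A, b, lam, step):
    """One splitting iteration for 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    # Step 1: gradient step on the smooth data-fidelity term.
    residual = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i
                for row, b_i in zip(A, b)]
    grad = [sum(A[i][j] * residual[i] for i in range(len(A)))
            for j in range(len(x))]
    y = [x_j - step * g_j for x_j, g_j in zip(x, grad)]
    # Step 2: exact minimisation step for the L1 term (soft-thresholding).
    return [soft_threshold(y_j, step * lam) for y_j in y]


def sparse_inversion(A, b, lam, step, iterations=500):
    """Alternate the two steps starting from the zero vector."""
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        x = split_step(x, A, b, lam, step)
    return x


# Hypothetical toy problem: identity forward operator, so the minimiser
# is simply the soft-thresholded data.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [3.0, 0.2, -1.5]
x_hat = sparse_inversion(A, b, lam=0.5, step=1.0)
```

In this toy case the small entry of `b` is set exactly to zero while the large entries are shrunk by `lam`, which is the sparsity-promoting effect of the L1 term. For a general operator `A`, the step size would need to be at most the reciprocal of the largest eigenvalue of `A^T A` for the iteration to converge.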
Contrastive Unsupervised Learning of World Model with Invariant Causal Features
Poudel, Rudra P. K., Pandya, Harit, Cipolla, Roberto
In this paper, we present a world model that learns causal features using the invariance principle. In particular, we use contrastive unsupervised learning to learn the invariant causal features, which enforces invariance across augmentations of irrelevant parts or styles of the observation. World-model-based reinforcement learning methods optimize representation learning and the policy independently. Thus, a naive contrastive loss implementation collapses due to a lack of supervisory signals to the representation learning module. We propose an intervention invariant auxiliary task to mitigate this issue. Specifically, we utilize depth prediction to explicitly enforce the invariance and use data augmentation as style intervention on the RGB observation space. Our design leverages unsupervised representation learning to learn the world model with invariant causal features. Our proposed method significantly outperforms current state-of-the-art model-based and model-free reinforcement learning methods on out-of-distribution point navigation tasks on the iGibson dataset. Moreover, our proposed model excels at the sim-to-real transfer of our perception learning module. Finally, we evaluate our approach on the DeepMind control suite and enforce invariance only implicitly since depth is not available. Nevertheless, our proposed model performs on par with the state-of-the-art counterpart.
Towards Adaptive Planning of Assistive-care Robot Tasks
Hamilton, Jordan, Stefanakos, Ioannis, Calinescu, Radu, Cámara, Javier
Whilst assistive robots [7] have been embedded into social and health care environments [1, 2, 10], they have largely been limited to simple applications, such as support for social and physical activities and hall monitoring, but often without considering potential interactions with humans. To expand the range of these applications, the human user and the robot need to interact in order to perform tasks together [4]. As such, this interaction, which is still underexplored in the social care domain, should be prioritised, with an emphasis on the safety of the human [3, 9]. To enable the development of applications that support such interaction and to ensure its safety, we propose an adaptive mission and path finding framework for an autonomous robot operating in a homecare environment. The framework models the environment as a graph, with nodes representing key locations within the environment where the robot can perform local tasks. Missions are modelled as a repertoire of locations within the environment where a task requires completion. The main contributions of our 'research preview' paper are: (i) a generalised approach for modelling environments as graphs with edges represented as levels of risk, (ii) a modified Dijkstra's algorithm for performing path finding in uncertain environments with a cost function to reduce risk, (iii) a simple predictive model of human behaviour that forecasts human intention, allowing for adaptive path finding using heat maps to artificially increase the risk associated with specific edges in the graph, (iv) a framework that combines modelling methods, adaptive path finding techniques and run-time probabilistic model generation for safety verification into an end-to-end solution for autonomous robotic mission planning, and (v) a simulation-based case study that shows the effectiveness of the framework.
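The idea of path finding over a graph whose edges carry risk levels, with risk raised on selected edges when a human is predicted to be nearby, can be sketched as follows. This is plain Dijkstra's algorithm with risk as the edge weight, not the authors' modified algorithm. The `extra_risk` argument is a hypothetical stand-in for the heat-map adjustment, and the location names and risk values are invented for illustration.

```python
import heapq


def lowest_risk_path(graph, start, goal, extra_risk=None):
    """Dijkstra's algorithm where each edge weight is a risk level.

    graph: dict mapping node -> dict of neighbour -> base risk (non-negative).
    extra_risk: optional dict mapping (node, neighbour) -> added risk, e.g.
        derived from a predicted human position (hypothetical).
    """
    extra_risk = extra_risk or {}
    best = {start: 0.0}
    previous = {}
    queue = [(0.0, start)]
    while queue:
        risk, node = heapq.heappop(queue)
        if node == goal:
            break
        if risk > best.get(node, float("inf")):
            continue  # stale queue entry, a lower-risk route was found
        for neighbour, base in graph.get(node, {}).items():
            new_risk = risk + base + extra_risk.get((node, neighbour), 0.0)
            if new_risk < best.get(neighbour, float("inf")):
                best[neighbour] = new_risk
                previous[neighbour] = node
                heapq.heappush(queue, (new_risk, neighbour))
    if goal not in best:
        return None, float("inf")
    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return path[::-1], best[goal]


# Hypothetical homecare layout; names and risk values are invented.
home = {
    "dock": {"hall": 1.0, "kitchen": 4.0},
    "hall": {"bedroom": 1.0, "kitchen": 1.0},
    "kitchen": {"bedroom": 1.0},
    "bedroom": {},
}
path, risk = lowest_risk_path(home, "dock", "bedroom")
# Predicting a person on the hall-bedroom edge raises that edge's risk.
adapted_path, adapted_risk = lowest_risk_path(
    home, "dock", "bedroom", extra_risk={("hall", "bedroom"): 5.0}
)
```

Without the adjustment the robot takes the direct route through the hall. With the added risk on the hall-bedroom edge it detours through the kitchen, which is the adaptive behaviour the abstract describes at a high level.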
Scheduling for Urban Air Mobility using Safe Learning
Murthy, Surya, Neogi, Natasha A., Bharadwaj, Suda
This work considers the scheduling problem for Urban Air Mobility (UAM) vehicles travelling between origin-destination pairs with both hard and soft trip deadlines. Each route is described by a discrete probability distribution over trip completion times (or delay) and over inter-arrival times of requests (or demand) for the route along with a fixed hard or soft deadline. Soft deadlines carry a cost that is incurred when the deadline is missed. An online, safe scheduler is developed that ensures that hard deadlines are never missed, and that the average cost of missing soft deadlines is minimized. The system is modelled as a Markov Decision Process (MDP) and safe model-based learning is used to find the probability distributions over route delays and demand. Monte Carlo Tree Search (MCTS) Earliest Deadline First (EDF) is used to safely explore the learned models in an online fashion and develop a near-optimal non-preemptive scheduling policy. These results are compared with Value Iteration (VI) and MCTS (Random) scheduling solutions.
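The Earliest Deadline First rule mentioned in the abstract above, together with the distinction between hard and soft deadlines, can be sketched in a few lines. This is only the basic non-preemptive EDF ordering on a single vehicle with known trip durations. It does not include the MDP model, the learned delay and demand distributions, or the MCTS search described in the abstract. The trip records and their field names are hypothetical.

```python
def edf_schedule(trips):
    """Non-preemptive Earliest Deadline First on a single vehicle.

    trips: list of dicts with keys 'name', 'duration', 'deadline', 'hard',
    and 'cost' (penalty if a soft deadline is missed). All values are
    hypothetical; durations are treated as known rather than random.
    """
    clock = 0
    order, soft_cost, hard_missed = [], 0, []
    for trip in sorted(trips, key=lambda t: t["deadline"]):
        clock += trip["duration"]  # run the trip to completion
        order.append(trip["name"])
        if clock > trip["deadline"]:
            if trip["hard"]:
                hard_missed.append(trip["name"])  # safety violation
            else:
                soft_cost += trip["cost"]  # soft miss only incurs a cost
    return order, soft_cost, hard_missed


# Hypothetical requests, all available at time 0.
trips = [
    {"name": "A", "duration": 3, "deadline": 4, "hard": True, "cost": 0},
    {"name": "B", "duration": 2, "deadline": 10, "hard": False, "cost": 5},
    {"name": "C", "duration": 4, "deadline": 6, "hard": False, "cost": 2},
]
order, soft_cost, hard_missed = edf_schedule(trips)
```

Here the hard-deadline trip runs first and finishes in time, while one soft deadline is missed and contributes its cost. A safe scheduler in the sense of the abstract would additionally have to guarantee that `hard_missed` stays empty under uncertain trip durations.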
Hierarchical Reinforcement Learning with AI Planning Models
Lee, Junkyu, Katz, Michael, Agravante, Don Joven, Liu, Miao, Tasse, Geraud Nangue, Klinger, Tim, Sohrabi, Shirin
Two common approaches to sequential decision-making are AI planning (AIP) and reinforcement learning (RL). Each has strengths and weaknesses. AIP is interpretable, easy to integrate with symbolic knowledge, and often efficient, but requires an up-front logical domain specification and is sensitive to noise; RL only requires specification of rewards and is robust to noise but is sample-inefficient and not easily supplied with external knowledge. We propose an integrative approach that combines high-level planning with RL, retaining interpretability, transfer, and efficiency, while allowing for robust learning of the lower-level plan actions. Our approach defines options in hierarchical reinforcement learning (HRL) from AIP operators by establishing a correspondence between the state transition model of an AI planning problem and the abstract state transition system of a Markov Decision Process (MDP). Options are learned by adding intrinsic rewards to encourage consistency between the MDP and AIP transition models. We demonstrate the benefit of our integrated approach by comparing the performance of RL and HRL algorithms in both MiniGrid and N-rooms environments, showing the advantage of our method over the existing ones.
Online Policy Optimization for Robust MDP
Dong, Jing, Li, Jingwei, Wang, Baoxiang, Zhang, Jingzhao
The rapid progress of reinforcement learning (RL) algorithms enables trained agents to navigate around complicated environments and solve complex tasks. The standard reinforcement learning methods, however, may fail catastrophically in another environment, even if the two environments only differ slightly in dynamics [Farebrother et al., 2018, Packer et al., 2018, Cobbe et al., 2019, Song et al., 2019, Raileanu and Fergus, 2021]. In practical applications, such mismatches in environment dynamics are common and can have a number of causes, e.g., model deviation due to incomplete data, unexpected perturbations and possible adversarial attacks. Part of the sensitivity of standard RL algorithms stems from the formulation of the underlying Markov decision process (MDP). In a sequence of interactions, an MDP assumes that the dynamics remain unchanged and that the trained agent is tested on the same dynamics thereafter. To model the potential mismatch between system dynamics, the framework of robust MDP is introduced to account for the uncertainty of the parameters of the MDP [Satia and Lave Jr, 1973, White III and Eldeib, 1994, Nilim and El Ghaoui, 2005, Iyengar, 2005].
Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees
Tiapkin, Daniil, Belomestny, Denis, Calandriello, Daniele, Moulines, Eric, Munos, Remi, Naumov, Alexey, Rowland, Mark, Valko, Michal, Menard, Pierre
We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon $H$ with $S$ states, and $A$ actions. The performance of an agent is measured by the regret after interacting with the environment for $T$ episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of posterior sampling that only needs a number of posterior samples logarithmic in $H$, $S$, $A$, and $T$ per state-action pair. For OPSRL we guarantee a high-probability regret bound of order at most $\widetilde{\mathcal{O}}(\sqrt{H^3SAT})$ ignoring $\text{poly}\log(HSAT)$ terms. The key novel technical ingredient is a new sharp anti-concentration inequality for linear forms which may be of independent interest. Specifically, we extend the normal approximation-based lower bound for Beta distributions by Alfers and Dinges [1984] to Dirichlet distributions. Our bound matches the lower bound of order $\Omega(\sqrt{H^3SAT})$, thereby answering the open problems raised by Agrawal and Jia [2017b] for the episodic setting.
Scheduling of Missions with Constrained Tasks for Heterogeneous Robot Systems
Vázquez, Gricel, Calinescu, Radu, Cámara, Javier
We present a formal tasK AllocatioN and scheduling apprOAch for multi-robot missions (KANOA). KANOA supports two important types of task constraints: task ordering, which requires the execution of several tasks in a specified order; and joint tasks, which are tasks that must be performed by more than one robot. To mitigate the complexity of robotic mission planning, KANOA handles the allocation of the mission tasks to robots and the scheduling of the allocated tasks separately. To that end, the task allocation problem is formalised in first-order logic and resolved using the Alloy model analyzer, and the task scheduling problem is encoded as a Markov decision process and resolved using the PRISM probabilistic model checker. We illustrate the application of KANOA through a case study in which a heterogeneous robotic team is assigned a hospital maintenance mission.