Undirected Networks
Human-centered collaborative robots with deep reinforcement learning
Ghadirzadeh, Ali, Chen, Xi, Yin, Wenjie, Yi, Zhengrong, Björkman, Mårten, Kragic, Danica
Human-centered collaborative systems require proactive robot behavior with precise timing, which in turn mandates awareness of human actions, state of the environment and the task being executed, [1-4]. Proactive robot behavior is achieved by (1) recognizing the current state of the human collaborator and the environment based on real-time observations, (2) human action prediction given the observations and the model of the task, and (3) generating robot actions in line with the prediction. Human action recognition may however be highly uncertain if the human collaborator is not executing a strictly defined task plan. This is true regardless of whether perception is based on motion-capture devices or image based pose estimation. For a robot to act in a proactive manner, while at the same time avoiding actions when the risk of making a mistake is too high, it is essential for the action-decision system to take this uncertainty into consideration. We therefore propose to train the perception system and the robot policy in an end-to-end fashion using reinforcement learning (RL). This is different from earlier studies in which human action recognition and prediction are typically decoupled from robot action policy training [3-7]. Our main objective is to improve the fluency in coordination between the human and robot partners by allowing the policy to explicitly weigh the benefits of timely actions to the risk of making a mistake when uncertainties are too high.
Coupling Learning of Complex Interactions
Complex applications such as big data analytics involve different forms of coupling relationships that reflect interactions between factors related to technical, business (domain-specific) and environmental (including socio-cultural and economic) aspects. There are diverse forms of couplings embedded in poor-structured and ill-structured data. Such couplings are ubiquitous, implicit and/or explicit, objective and/or subjective, heterogeneous and/or homogeneous, presenting complexities to existing learning systems in statistics, mathematics and computer sciences, such as typical dependency, association and correlation relationships. Modeling and learning such couplings thus is fundamental but challenging. This paper discusses the concept of coupling learning, focusing on the involvement of coupling relationships in learning systems. Coupling learning has great potential for building a deep understanding of the essence of business problems and handling challenges that have not been addressed well by existing learning theories and tools. This argument is verified by several case studies on coupling learning, including handling coupling in recommender systems, incorporating couplings into coupled clustering, coupling document clustering, coupled recommender algorithms and coupled behavior analysis for groups.
Prescribing Deep Attentive Score Prediction Attracts Improved Student Engagement
Lee, Youngnam, Kim, Byungsoo, Shin, Dongmin, Kim, JungHoon, Baek, Jineon, Lee, Jinhwan, Choi, Youngduck
Intelligent Tutoring Systems (ITSs) have been developed to provide students with personalized learning experiences by adaptively generating learning paths optimized for each individual. Within the vast scope of ITS, score prediction stands out as an area of study that enables students to construct individually realistic goals based on their current position. Via the expected score provided by the ITS, a student can instantaneously compare one's expected score to one's actual score, which directly corresponds to the reliability that the ITS can instill. In other words, refining the precision of predicted scores strictly correlates to the level of confidence that a student may have with an ITS, which will evidently ensue improved student engagement. However, previous studies have solely concentrated on improving the performance of a prediction model, largely lacking focus on the benefits generated by its practical application. In this paper, we demonstrate that the accuracy of the score prediction model deployed in a real-world setting significantly impacts user engagement by providing empirical evidence. To that end, we apply a state-of-the-art deep attentive neural network-based score prediction model to Santa, a multi-platform English ITS with approximately 780K users in South Korea that exclusively focuses on the TOEIC (Test of English for International Communications) standardized examinations. We run a controlled A/B test on the ITS with two models, respectively based on collaborative filtering and deep attentive neural networks, to verify whether the more accurate model engenders any student engagement. The results conclude that the attentive model not only induces high student morale (e.g. higher diagnostic test completion ratio, number of questions answered, etc.) but also encourages active engagement (e.g. higher purchase rate, improved total profit, etc.) on Santa.
Bandit Linear Control
Reinforcement learning studies sequential decision making problems where a learning agent repeatedly interacts with an environment and aims to improve her strategy over time based on the received feedback. One of the most fundamental tradeoffs in reinforcement learning theory is the exploration vs. exploitation tradeoff, that arises whenever the learner observes only partial feedback after each of her decisions, thus having to balance between exploring new strategies and exploiting those that are already known to perform well. The most basic and well-studied form of partial feedback is the so-called "bandit" feedback, where the learner only observes the cost of her chosen action on each decision round, while obtaining no information about the performance of other actions. Traditionally, the environment dynamics in reinforcement learning are modeled as a Markov Decision Process (MDP) with a finite number of possible states and actions. The MDP model has been studied and analyzed in numerous different settings and under various assumptions on the transition parameters, the nature of the reward functions, and the feedback model. Recently, a particular focus has been given to continuous state-action MDPs, and in particular, to a specific family of models in classic control where the state transition function is linear.
Sequential Transfer in Reinforcement Learning with a Generative Model
Tirinzoni, Andrea, Poiani, Riccardo, Restelli, Marcello
We are interested in how to design reinforcement learning agents that provably reduce the sample complexity for learning new tasks by transferring knowledge from previously-solved ones. The availability of solutions to related problems poses a fundamental trade-off: whether to seek policies that are expected to achieve high (yet sub-optimal) performance in the new task immediately or whether to seek information to quickly identify an optimal solution, potentially at the cost of poor initial behavior. In this work, we focus on the second objective when the agent has access to a generative model of state-action pairs. First, given a set of solved tasks containing an approximation of the target one, we design an algorithm that quickly identifies an accurate solution by seeking the state-action pairs that are most informative for this purpose. We derive PAC bounds on its sample complexity which clearly demonstrate the benefits of using this kind of prior knowledge. Then, we show how to learn these approximate tasks sequentially by reducing our transfer setting to a hidden Markov model and employing spectral methods to recover its parameters. Finally, we empirically verify our theoretical findings in simple simulated domains.
Deployment of Machine Learning Models
Online Courses Udemy - Deployment of Machine Learning Models Build Machine Learning Model APIs Created by Soledad Galli, Christopher Samiullah English [Auto] Students also bought Data Science: Natural Language Processing (NLP) in Python Recommender Systems and Deep Learning in Python Artificial Intelligence: Reinforcement Learning in Python Unsupervised Machine Learning Hidden Markov Models in Python Deep Learning: Recurrent Neural Networks in Python Preview this course GET COUPON CODE Description Learn how to put your machine learning models into production. Deployment of machine learning models, or simply, putting models into production, means making your models available to your other business systems. By deploying models, other systems can send data to them and get their predictions, which are in turn populated back into the company systems. Through machine learning model deployment, you and your business can begin to take full advantage of the model you built. When we think about data science, we think about how to build machine learning models, we think about which algorithm will be more predictive, how to engineer our features and which variables to use to make the models more accurate.
Enforcing Almost-Sure Reachability in POMDPs
Junges, Sebastian, Jansen, Nils, Seshia, Sanjit A.
Partially-Observable Markov Decision Processes (POMDPs) are a well-known formal model for planning scenarios where agents operate under limited information about their environment. In safety-critical domains, the agent must adhere to a policy satisfying certain behavioral constraints. We study the problem of synthesizing policies that almost-surely reach some goal state while a set of bad states is never visited. In particular, we present an iterative symbolic approach that computes a winning region, that is, a set of system configurations such that all policies that stay within this set are guaranteed to satisfy the constraints. The approach generalizes and improves previous work in terms of scalability and efficacy, as demonstrated in the empirical evaluation. Additionally, we show the applicability to safe exploration by restricting agent behavior to these winning regions.
Verification of indefinite-horizon POMDPs
Bork, Alexander, Junges, Sebastian, Katoen, Joost-Pieter, Quatmann, Tim
Markov decision processes are the model to reason about systems involving nondeterministic choice and probabilistic branching. They have widespread usage in planning and scheduling, robotics, and formal methods. In the latter, the key verification question is whether for any policy, i.e., for any resolution of the nondeterminism, the probability to reach the bad states is below a threshold [3]. The verification question may be efficiently analysed using a variety of techniques such as linear programming, value iteration, or policy iteration, readily available in mature tools such as Storm [15], Prism [22] and Modest [13]. However, those verification results are often overly pessimistic.
Dynamic Regret of Policy Optimization in Non-stationary Environments
Fei, Yingjie, Yang, Zhuoran, Wang, Zhaoran, Xie, Qiaomin
We consider reinforcement learning (RL) in episodic MDPs with adversarial full-information reward feedback and unknown fixed transition kernels. We propose two model-free policy optimization algorithms, POWER and POWER++, and establish guarantees for their dynamic regret. Compared with the classical notion of static regret, dynamic regret is a stronger notion as it explicitly accounts for the non-stationarity of environments. The dynamic regret attained by the proposed algorithms interpolates between different regimes of non-stationarity, and moreover satisfies a notion of adaptive (near-)optimality, in the sense that it matches the (near-)optimal static regret under slow-changing environments. The dynamic regret bound features two components, one arising from exploration, which deals with the uncertainty of transition kernels, and the other arising from adaptation, which deals with non-stationary environments. Specifically, we show that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction. To the best of our knowledge, our work is the first dynamic regret analysis of model-free RL algorithms in non-stationary environments.
On Bellman's Optimality Principle for zs-POSGs
Buffet, Olivier, Dibangoye, Jilles, Delage, Aurélien, Saffidine, Abdallah, Thomas, Vincent
Many non-trivial sequential decision-making problems are efficiently solved by relying on Bellman's optimality principle, i.e., exploiting the fact that sub-problems are nested recursively within the original problem. Here we show how it can apply to (infinite horizon) 2-player zero-sum partially observable stochastic games (zs-POSGs) by (i) taking a central planner's viewpoint, which can only reason on a sufficient statistic called occupancy state, and (ii) turning such problems into zero-sum occupancy Markov games (zs-OMGs). Then, exploiting the Lipschitz-continuity of the value function in occupancy space, one can derive a version of the HSVI algorithm (Heuristic Search Value Iteration) that provably finds an $\epsilon$-Nash equilibrium in finite time.