
 Ie, Eugene


Effective and General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping

arXiv.org Artificial Intelligence

In instruction conditioned navigation, agents interpret natural language and their surroundings to navigate through an environment. Datasets for studying this task typically contain pairs of these instructions and reference trajectories. Yet, most evaluation metrics used thus far fail to properly account for the latter, relying instead on insufficient similarity comparisons. We address fundamental flaws in previously used metrics and show how Dynamic Time Warping (DTW), a long-known method for measuring similarity between two time series, can be used to evaluate navigation agents. To this end, we define the normalized Dynamic Time Warping (nDTW) metric, which softly penalizes deviations from the reference path, is naturally sensitive to the order of the nodes composing each path, is suited for both continuous and graph-based evaluations, and can be efficiently calculated. Further, we define SDTW, which constrains nDTW to only successful paths. We collect human similarity judgments for simulated paths and find that nDTW correlates better with human rankings than all other metrics. We also demonstrate that using nDTW as a reward signal for Reinforcement Learning navigation agents improves their performance on both the Room-to-Room (R2R) and Room-for-Room (R4R) datasets. The R4R results in particular highlight the superiority of SDTW over previous success-constrained metrics.
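To make the metric concrete, here is a minimal sketch of nDTW and SDTW, assuming the exponential normalization exp(-DTW(R, Q) / (|R| * d_th)) described in the paper, with d_th the success threshold distance; the euclidean helper and the example paths are illustrative, not from the paper.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dtw(reference, query, dist=euclidean):
    """Dynamic Time Warping cost between two paths via the classic
    O(|R| * |Q|) dynamic program."""
    n, m = len(reference), len(query)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(reference[i - 1], query[j - 1])
            # Pair the current points and extend the cheapest of the
            # three admissible alignments (order-preserving warping).
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

def ndtw(reference, query, d_th, dist=euclidean):
    """normalized DTW: exp(-DTW(R, Q) / (|R| * d_th)), in (0, 1]."""
    return math.exp(-dtw(reference, query, dist) / (len(reference) * d_th))

def sdtw(reference, query, d_th, dist=euclidean):
    """Success-constrained nDTW: nDTW if the query ends within d_th of the
    reference endpoint, 0 otherwise."""
    if dist(query[-1], reference[-1]) > d_th:
        return 0.0
    return ndtw(reference, query, d_th, dist)

# A query that wobbles slightly off the reference is softly penalized
# rather than scored as a hard failure.
ref = [(0, 0), (1, 0), (2, 0), (3, 0)]
qry = [(0, 0), (1, 0.5), (2, 0.2), (3, 0)]
print(ndtw(ref, qry, d_th=3.0), sdtw(ref, qry, d_th=3.0))
```

Because the dynamic program only allows order-preserving alignments, visiting the right nodes in the wrong order is penalized, which is the property goal-only metrics lack.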


Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation

arXiv.org Artificial Intelligence

Advances in learning and representations have reinvigorated work that connects language to other modalities. A particularly exciting direction is Vision-and-Language Navigation (VLN), in which agents interpret natural language instructions and visual scenes to move through environments and reach goals. Despite recent progress, current research leaves unclear how much of a role language understanding plays in this task, especially because dominant evaluation metrics have focused on goal completion rather than the sequence of actions corresponding to the instructions. Here, we highlight shortcomings of current metrics for the Room-to-Room dataset (Anderson et al., 2018b) and propose a new metric, Coverage weighted by Length Score (CLS). We also show that the existing paths in the dataset are not ideal for evaluating instruction following because they are direct-to-goal shortest paths.

[Figure 1 caption: It's the journey, not just the goal. To give language its due place in VLN, we compose paths in the R2R dataset to create longer, twistier R4R paths (blue). Under commonly used metrics, agents that head straight to the goal (red) are not penalized for ignoring the language instructions: for instance, SPL yields a perfect 1.0 score for the red and only 0.17 for the orange path. In contrast, our proposed CLS metric measures fidelity to the reference path, strongly preferring the ...]
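The abstract names CLS without spelling it out; below is a sketch under the definitions given in the paper, assuming CLS = PC * LS, with path coverage PC(P, R) the mean over reference points r of exp(-d(r, P) / d_th) and length score LS = EPL / (EPL + |EPL - PL(P)|), where EPL = PC * PL(R) is the expected path length. The function and variable names are ours.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def path_length(path, dist=euclidean):
    """Total length of a piecewise-linear path."""
    return sum(dist(path[i], path[i + 1]) for i in range(len(path) - 1))

def path_coverage(pred, ref, d_th, dist=euclidean):
    """PC(P, R): average soft coverage of the reference points, where a
    reference point counts fully only if the predicted path passes near it."""
    return sum(math.exp(-min(dist(r, p) for p in pred) / d_th)
               for r in ref) / len(ref)

def cls(pred, ref, d_th, dist=euclidean):
    """Coverage weighted by Length Score: CLS = PC * LS."""
    pc = path_coverage(pred, ref, d_th, dist)
    epl = pc * path_length(ref, dist)  # expected path length
    if epl == 0.0:
        return 0.0
    # Length score penalizes paths much shorter or longer than expected.
    ls = epl / (epl + abs(epl - path_length(pred, dist)))
    return pc * ls
```

The length term is what separates CLS from pure coverage: an agent that sweeps the whole environment covers every reference point but is penalized for its inflated path length.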


Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology

arXiv.org Artificial Intelligence

Recommender systems have become ubiquitous, transforming user interactions with products, services and content in a wide variety of domains. In content recommendation, recommenders generally surface relevant and/or novel personalized content based on learned models of user preferences (e.g., as in collaborative filtering [Breese et al., 1998, Konstan et al., 1997, Srebro et al., 2004, Salakhutdinov and Mnih, 2007]) or predictive models of user responses to specific recommendations. Well-known applications of recommender systems include video recommendations on YouTube [Covington et al., 2016], movie recommendations on Netflix [Gomez-Uribe and Hunt, 2016] and playlist construction on Spotify [Jacobson et al., 2016]. It is increasingly common to train deep neural networks (DNNs) [van den Oord et al., 2013, Wang et al., 2015, Covington et al., 2016, Cheng et al., 2016] to predict user responses (e.g., click-through rates, content engagement, ratings, likes) to generate, score and serve candidate recommendations. Practical recommender systems largely focus on myopic prediction--estimating a user's immediate response to a recommendation--without considering the long-term impact on subsequent user behavior. This can be limiting: modeling a recommendation's stochastic impact on the future affords opportunities to trade off user engagement in the near-term for longer-term benefit (e.g., by probing a user's interests, or improving satisfaction).
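The abstract refers to a tractable decomposition without giving it; a minimal sketch in the spirit of the authors' SlateQ approach, which under a single-choice user model writes the slate value as a choice-weighted sum of item-level long-term values, might look as follows. The helper names and toy numbers are illustrative assumptions, not the paper's code.

```python
def slate_q(slate, choice_score, item_value):
    """Decompose the long-term value of a slate into item-level terms:
    Q(s, A) = sum over i in A of P(i | s, A) * Q_bar(s, i),
    with choice probabilities from a normalized conditional choice model.

    slate:        list of item ids composing the slate A
    choice_score: dict id -> v(s, i), unnormalized user-choice score
    item_value:   dict id -> Q_bar(s, i), item-level long-term value
    """
    total = sum(choice_score[i] for i in slate)
    return sum(choice_score[i] / total * item_value[i] for i in slate)

# Swapping Q_bar(s, i) for immediate expected engagement recovers the
# myopic recommender; the decomposition makes the long-term variant
# learnable at the item level rather than over exponentially many slates.
scores = {"a": 2.0, "b": 1.0, "c": 1.0}
ltv = {"a": 0.3, "b": 0.9, "c": 0.5}
print(slate_q(["a", "b"], scores, ltv))  # 0.5 = (2/3)*0.3 + (1/3)*0.9
```

The design point is tractability: item-level values Q_bar(s, i) can be trained with standard temporal-difference methods, while the combinatorial slate choice reduces to an optimization over the choice-weighted sum.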