Google Research
Diagnosing AI Explanation Methods with Folk Concepts of Behavior
Jacovi, Alon (Bar Ilan University and Google Research) | Bastings, Jasmijn (Google Research) | Gehrmann, Sebastian (Google Research) | Goldberg, Yoav (Bar Ilan University and the Allen Institute for Artificial Intelligence) | Filippova, Katja (Google Research)
We investigate a formalism for the conditions of a successful explanation of AI. We consider "success" to depend not only on what information the explanation contains, but also on what information the human explainee understands from it. Theory of mind literature discusses the folk concepts that humans use to understand and generalize behavior. We posit that folk concepts of behavior provide us with a "language" that humans understand behavior with. We use these folk concepts as a framework of social attribution by the human explainee--the information constructs that humans are likely to comprehend from explanations--by introducing a blueprint for an explanatory narrative (Figure 1) that explains AI behavior with these constructs. We then demonstrate that many XAI methods today can be mapped to folk concepts of behavior in a qualitative evaluation. This allows us to uncover their failure modes that prevent current methods from explaining successfully--i.e., the information constructs that are missing for any given XAI method, and whose inclusion can decrease the likelihood of misunderstanding AI behavior.
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text
Gehrmann, Sebastian (Google Research) | Clark, Elizabeth (Google Research) | Sellam, Thibault
Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural generation models have improved to the point where their outputs can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for evaluation research and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 generation papers from recent NLP conferences, assessing how well they already follow these suggestions, and identify the areas that require more drastic changes to the status quo.
Capturing Ambiguity in Crowdsourcing Frame Disambiguation
Dumitrache, Anca (Vrije Universiteit Amsterdam) | Aroyo, Lora (Vrije Universiteit Amsterdam) | Welty, Chris (Google Research)
FrameNet is a computational linguistics resource composed of semantic frames, high-level concepts that represent the meanings of words. In this paper, we present an approach to gather frame disambiguation annotations in sentences using a crowdsourcing approach with multiple workers per sentence to capture inter-annotator disagreement. We perform an experiment over a set of 433 sentences annotated with frames from the FrameNet corpus, and show that the aggregated crowd annotations achieve an F1 score greater than 0.67 as compared to expert linguists. We highlight cases where the crowd annotation was correct even though the expert is in disagreement, arguing for the need to have multiple annotators per sentence. Most importantly, we examine cases in which crowd workers could not agree, and demonstrate that these cases exhibit ambiguity, either in the sentence, frame, or the task itself, and argue that collapsing such cases to a single, discrete truth value (i.e. correct or incorrect) is inappropriate, creating arbitrary targets for machine learning.
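As an illustration of how such disagreement can be quantified (this is not the paper's own measure), the sketch below scores a sentence's ambiguity as the entropy of its crowd frame votes; the frame labels and vote counts are hypothetical.

```python
from collections import Counter
from math import log2

def frame_vote_entropy(annotations):
    """Entropy (bits) of crowd frame votes for one sentence.

    `annotations` is a list of frame labels chosen by different workers.
    Higher entropy suggests the sentence/frame pairing is ambiguous rather
    than the workers simply being wrong.
    """
    counts = Counter(annotations)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical crowd votes for two sentences annotated with FrameNet frames.
clear_case = ["Commerce_buy"] * 9 + ["Getting"]          # near-consensus
ambiguous_case = ["Awareness"] * 5 + ["Certainty"] * 5   # split vote

print(frame_vote_entropy(clear_case))      # ~0.47 bits: low ambiguity
print(frame_vote_entropy(ambiguous_case))  # 1.0 bit: maximal two-way ambiguity
```

Collapsing the second case to a single "correct" frame would discard exactly the signal the paper argues is meaningful.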
HARP: Hierarchical Representation Learning for Networks
Chen, Haochen (Stony Brook University) | Perozzi, Bryan (Google Research) | Hu, Yifan (Yahoo! Research) | Skiena, Steven (Stony Brook University)
We present HARP, a novel method for learning low dimensional embeddings of a graph's nodes which preserves higher-order structural features. Our proposed method achieves this by compressing the input graph prior to embedding it, effectively avoiding troublesome embedding configurations (i.e. local minima) which can pose problems for non-convex optimization. HARP works by finding a smaller graph which approximates the global structure of its input. This simplified graph is used to learn a set of initial representations, which serve as good initializations for learning representations in the original, detailed graph. We inductively extend this idea by decomposing a graph into a series of levels, and then embedding the hierarchy of graphs from the coarsest one to the original graph. HARP is a general meta-strategy to improve all of the state-of-the-art neural algorithms for embedding graphs, including DeepWalk, LINE, and Node2vec. Indeed, we demonstrate that applying HARP's hierarchical paradigm yields improved implementations for all three of these methods, as evaluated on classification tasks on real-world graphs such as DBLP, BlogCatalog, and CiteSeer, where we achieve a performance gain over the original implementations by up to 14% Macro F1.
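To make the coarsen-embed-refine paradigm concrete, here is a minimal Python sketch of the meta-strategy. The random edge-matching coarsener and the trivial neighbor-averaging `base_embed` are stand-ins for HARP's own coarsening scheme and for DeepWalk/LINE/node2vec, so this illustrates only the hierarchical structure, not the paper's exact algorithm.

```python
import random
import networkx as nx
import numpy as np

def coarsen(G):
    """One simplified coarsening level: contract a random edge matching.
    Returns the coarse graph and a map from each node to its supernode."""
    matched, mapping = set(), {}
    edges = list(G.edges())
    random.shuffle(edges)
    for u, v in edges:
        if u not in matched and v not in matched:
            matched.update((u, v))
            mapping[u] = mapping[v] = u          # merge v into u
    for n in G.nodes():
        mapping.setdefault(n, n)                 # unmatched nodes map to themselves
    coarse = nx.Graph()
    coarse.add_nodes_from(set(mapping.values()))
    for u, v in G.edges():
        if mapping[u] != mapping[v]:
            coarse.add_edge(mapping[u], mapping[v])
    return coarse, mapping

def base_embed(G, dim, init=None, iters=20):
    """Placeholder embedder: neighbor averaging from an optional warm start.
    In HARP this role is played by DeepWalk, LINE, or node2vec."""
    emb = {n: (init[n].copy() if init else np.random.randn(dim)) for n in G.nodes()}
    for _ in range(iters):
        for n in G.nodes():
            nbrs = list(G.neighbors(n))
            if nbrs:
                emb[n] = 0.5 * emb[n] + 0.5 * np.mean([emb[m] for m in nbrs], axis=0)
    return emb

def harp_embed(G, dim=8, levels=3):
    """Coarsen `levels` times, embed the coarsest graph, then prolongate each
    coarse embedding to the next finer graph as its initialization."""
    graphs, mappings = [G], []
    for _ in range(levels):
        coarse, mapping = coarsen(graphs[-1])
        graphs.append(coarse)
        mappings.append(mapping)
    emb = base_embed(graphs[-1], dim)                        # coarsest level
    for G_fine, mapping in zip(reversed(graphs[:-1]), reversed(mappings)):
        init = {n: emb[mapping[n]] for n in G_fine.nodes()}  # copy supernode vector
        emb = base_embed(G_fine, dim, init=init)             # refine at finer level
    return emb

emb = harp_embed(nx.karate_club_graph())
print(len(emb), emb[0].shape)   # 34 node vectors of dimension 8
```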
A Bayesian Clearing Mechanism for Combinatorial Auctions
Brero, Gianluca (University of Zurich) | Lahaie, Sébastien (Google Research)
We cast the problem of combinatorial auction design in a Bayesian framework in order to incorporate prior information into the auction process and minimize the number of rounds to convergence. We first develop a generative model of agent valuations and market prices such that clearing prices become maximum a posteriori estimates given observed agent valuations. This generative model then forms the basis of an auction process which alternates between refining estimates of agent valuations and computing candidate clearing prices. We provide an implementation of the auction using assumed density filtering to estimate valuations and expectation maximization to compute prices. An empirical evaluation over a range of valuation domains demonstrates that our Bayesian auction mechanism is highly competitive against the combinatorial clock auction in terms of rounds to convergence, even under the most favorable choices of price increment for this baseline.
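The alternating structure (refine valuation estimates, then post candidate prices) can be pictured with a heavily simplified single-item toy, sketched below under strong assumptions: Gaussian beliefs over each agent's value are updated by assumed density filtering (moment matching of the truncated posterior) from binary buy/pass responses, and the candidate price is naively set to the second-highest posterior mean. This shows only the shape of the loop; the paper's generative model, EM price computation, and combinatorial setting are not reproduced.

```python
import numpy as np
from scipy.stats import norm

def adf_update(mu, var, price, demands):
    """Assumed density filtering step: each agent's binary demand at `price`
    (True = willing to buy) truncates its Gaussian value belief; the truncated
    posterior is replaced by a Gaussian matching its first two moments."""
    sd = np.sqrt(var)
    z = (price - mu) / sd
    lam = norm.pdf(z) / np.clip(1 - norm.cdf(z), 1e-12, None)   # buy: value >= price
    gam = norm.pdf(z) / np.clip(norm.cdf(z), 1e-12, None)       # pass: value < price
    new_mu = np.where(demands, mu + sd * lam, mu - sd * gam)
    new_var = np.where(demands,
                       var * (1 + z * lam - lam ** 2),
                       var * (1 - z * gam - gam ** 2))
    return new_mu, np.maximum(new_var, 1e-9)

rng = np.random.default_rng(0)
true_values = rng.uniform(0, 100, size=5)        # hidden agent valuations
mu, var = np.full(5, 50.0), np.full(5, 900.0)    # prior beliefs N(50, 30^2)

price = 0.0
for round_ in range(30):
    demands = true_values >= price               # observed bids at the posted price
    if demands.sum() <= 1:                       # at most one demander: stop
        break
    mu, var = adf_update(mu, var, price, demands)
    price = float(np.sort(mu)[-2])               # naive candidate clearing price
print(round_, np.round(mu, 1), np.round(true_values, 1))
```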
Clustering Small Samples With Quality Guarantees: Adaptivity With One2all PPS
Cohen, Edith (Google Research, Tel Aviv University) | Chechik, Shiri (Tel Aviv University) | Kaplan, Haim (Tel Aviv University)
Clustering of data points is a fundamental tool in data analysis. We consider points X in a relaxed metric space, where the triangle inequality holds within a constant factor. A clustering of X is a partition of X defined by a set of points Q (centroids), according to the closest centroid. The cost of clustering X by Q is V(Q) = Σ_{x∈X} d(x, Q), where d(x, Q) is the distance from x to its closest centroid in Q. This formulation generalizes classic k-means clustering, which uses squared distances. Two basic tasks, parametrized by k ≥ 1, are cost estimation, which returns an (approximate) V(Q) for queries Q such that |Q| = k, and clustering, which returns an (approximate) minimizer of V(Q) of size |Q| = k. When the data set X is very large, we seek efficient constructions of small samples that can act as surrogates for performing these tasks. Existing constructions that provide quality guarantees, however, are either worst-case, and unable to benefit from the structure of real data sets, or make explicit strong assumptions on that structure. We show here how to avoid both these pitfalls using adaptive designs. The core of our design is the novel one2all probabilities, computed for a set M of centroids and α ≥ 1: the clustering cost of each Q with cost V(Q) ≥ V(M)/α can be estimated well from a sample of size O(α|M|ε⁻²). For cost estimation, we apply one2all with a bicriteria approximate M, while adaptively balancing |M| and α to optimize sample size per quality. For clustering, we present a wrapper that adaptively applies a base clustering algorithm to a sample S, using the smallest sample that provides the desired statistical guarantees on quality. We demonstrate experimentally the huge gains of using our adaptive methods instead of worst-case ones.
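To illustrate how a small weighted sample can act as a surrogate for cost estimation, the sketch below draws a probability-proportional-to-size sample, with probabilities proportional to each point's distance from a reference centroid set M (a generic choice for illustration, not the paper's one2all probabilities), and estimates V(Q) with inverse-probability weighting.

```python
import numpy as np

def cost(X, Q):
    """Exact clustering cost V(Q): sum over points of the distance to the closest centroid."""
    return sum(min(np.linalg.norm(x - q) for q in Q) for x in X)

def weighted_sample(X, M, size, rng):
    """Sample points with probability proportional to their distance to the reference
    set M, returning the sample and inverse-probability weights 1/(size * p_x)."""
    d = np.array([min(np.linalg.norm(x - m) for m in M) for x in X])
    p = (d + 1e-12) / (d + 1e-12).sum()
    idx = rng.choice(len(X), size=size, p=p)
    return X[idx], 1.0 / (size * p[idx])

def estimate_cost(sample, weights, Q):
    """Inverse-probability-weighted estimate of V(Q) from the sample."""
    return sum(w * min(np.linalg.norm(x - q) for q in Q) for x, w in zip(sample, weights))

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(c, 1.0, size=(300, 2)) for c in (0, 10, 20)])
M = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])    # rough reference centroids
S, w = weighted_sample(X, M, size=60, rng=rng)

Q = np.array([[1.0, 1.0], [9.0, 9.0], [21.0, 21.0]])      # a query centroid set
print(cost(X, Q), estimate_cost(S, w, Q))                 # exact cost vs. sample estimate
```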
Byte-Level Machine Reading Across Morphologically Varied Languages
Kenter, Tom (University of Amsterdam) | Jones, Llion (Google Research) | Hewlett, Daniel (Google)
The machine reading task, where a computer reads a document and answers questions about it, is important in artificial intelligence research. Recently, many models have been proposed to address it. Word-level models, which have words as units of input and output, have proven to yield state-of-the-art results when evaluated on English datasets. However, in morphologically richer languages, many more unique words exist than in English due to highly productive prefix and suffix mechanisms. This can hinder word-level models, since the vocabulary sizes required may be too large to allow for efficient computing. Multiple alternative input granularities have been proposed to avoid large input vocabularies, such as morphemes, character n-grams, and bytes. Bytes are advantageous as they provide a universal encoding format across languages and allow for a small vocabulary size, which, moreover, is identical for every input language. In this work, we investigate whether bytes are suitable as input units across morphologically varied languages. To test this, we introduce two large-scale machine reading datasets in morphologically rich languages, Turkish and Russian. We implement four byte-level models, representing the major types of machine reading models, and introduce a new seq2seq variant, called encoder-transformer-decoder. We show that, for all languages considered, there are byte-level models that outperform the current state-of-the-art word-level baseline. Moreover, the newly introduced encoder-transformer-decoder performs best on the most morphologically complex dataset, Turkish. The large-scale Turkish and Russian machine reading datasets are released to the public.
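The appeal of bytes as input units is easy to see in code. The sketch below is illustrative only (it is not one of the paper's models): it maps words from different languages into the same fixed vocabulary of 256 byte ids plus a padding id, with no language-specific preprocessing.

```python
def to_bytes(text, max_len=64, pad_id=256):
    """Encode text as a sequence of UTF-8 byte ids (0-255), padded or truncated
    to max_len. The vocabulary size is 256 (plus padding) for every language."""
    ids = list(text.encode("utf-8"))[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# The same tiny vocabulary covers morphologically rich languages: a Turkish word
# built from productive suffixes and a Russian word both become byte-id sequences.
for word in ["evlerimizden", "переподготовка", "reading"]:
    ids = to_bytes(word, max_len=32)
    print(word, len([i for i in ids if i != 256]), ids[:8])
```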
Scalable Feature Selection via Distributed Diversity Maximization
Zadeh, Sepehr Abbasi (Sharif University of Technology) | Ghadiri, Mehrdad (Sharif University of Technology) | Mirrokni, Vahab (Google Research) | Zadimoghaddam, Morteza (Google Research)
Feature selection is a fundamental problem in machine learning and data mining. The majority of feature selection algorithms are designed for running on a single machine (centralized setting) and are less applicable to very large datasets. Although there are some distributed methods to tackle this problem, most of them distribute the data horizontally, which is not suitable for datasets with a large number of features and a small number of instances. Thus, in this paper, we introduce a novel vertically distributable feature selection method in order to speed up this process and handle very large datasets in a scalable manner. In general, feature selection methods aim at selecting relevant and non-redundant features (Minimum Redundancy and Maximum Relevance). It is much harder to consider redundancy in a vertically distributed setting than in a centralized setting, since there is no global access to the whole data. To the best of our knowledge, this is the first attempt to solve the feature selection problem with a vertically distributed filter method that handles redundancy while achieving results consistently comparable to centralized methods. In this paper, we formalize the feature selection problem as a diversity maximization problem by introducing a mutual-information-based metric distance on the features. We show the effectiveness of our method by performing an extensive empirical study. In particular, we show that our distributed method outperforms state-of-the-art centralized feature selection algorithms on a variety of datasets. From a theoretical point of view, we prove that the greedy algorithm used in our method achieves an approximation factor of 1/4 for the diversity maximization problem in a distributed setting, with high probability. Furthermore, we improve this to an expected approximation factor of 8/25 by using multiplicity in our distribution.
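As a rough illustration of casting feature selection as diversity maximization, the sketch below runs a centralized greedy max-dispersion heuristic over a mutual-information-based distance. Both the distance 1/(1 + MI) and the centralized greedy are stand-ins, not the paper's metric or its distributed algorithm.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Mutual information (nats) between two features, estimated from a 2-D histogram.
    A stand-in for the paper's mutual-information-based distance."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def greedy_diverse_features(X, k):
    """Greedily pick k feature indices maximizing pairwise distance
    d(i, j) = 1 / (1 + MI(i, j)), so redundant (high-MI) features repel each other."""
    n_feat = X.shape[1]
    dist = np.zeros((n_feat, n_feat))
    for i in range(n_feat):
        for j in range(i + 1, n_feat):
            dist[i, j] = dist[j, i] = 1.0 / (1.0 + mutual_information(X[:, i], X[:, j]))
    selected = [int(np.argmax(dist.sum(axis=1)))]        # most "distant" feature overall
    while len(selected) < k:
        gains = dist[:, selected].sum(axis=1)            # distance to the chosen set
        gains[selected] = -np.inf
        selected.append(int(np.argmax(gains)))
    return selected

rng = np.random.default_rng(2)
base = rng.normal(size=(500, 3))
X = np.c_[base, base[:, 0] + 0.01 * rng.normal(size=500)]   # feature 3 duplicates feature 0
print(greedy_diverse_features(X, k=3))                      # avoids keeping both 0 and 3
```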
Deep Music: Towards Musical Dialogue
Bretan, Mason (Georgia Institute of Technology) | Oore, Sageev (Google Research) | Engel, Jesse (Google Research) | Eck, Douglas (Google Research) | Heck, Larry (Google Research)
Computer dialogue systems are designed with the intention of supporting meaningful interactions with humans. Common modes of communication include speech, text, and physical gestures. In this work we explore a communication paradigm in which the input and output channels consist of music. Specifically, we examine the musical interaction scenario of call and response. We present a system that utilizes a deep autoencoder to learn semantic embeddings of musical input. The system learns to transform these embeddings such that reconstructing from the transformed vectors produces appropriate musical responses. In order to generate a response, the system employs a combination of generation and unit selection. Selection is based on a nearest-neighbor search within the embedding space, and for real-time application the search space is pruned using vector quantization. The live demo consists of a person playing a MIDI keyboard and the computer generating a response that is played through a loudspeaker.
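A minimal sketch of the pruned unit-selection step, assuming a library of precomputed unit embeddings (random stand-ins here): a small k-means codebook quantizes the embedding space so that, at query time, the nearest-neighbor search is restricted to the query's quantization cell.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical library of response units, each already mapped to a point in the
# autoencoder's embedding space (random stand-ins for real musical embeddings).
unit_embeddings = rng.normal(size=(2000, 16))

def build_codebook(emb, n_codes=32, iters=10):
    """Vector quantization: a small k-means codebook used to prune the search."""
    codes = emb[rng.choice(len(emb), n_codes, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((emb[:, None, :] - codes[None]) ** 2).sum(-1), axis=1)
        for c in range(n_codes):
            members = emb[assign == c]
            if len(members):
                codes[c] = members.mean(axis=0)
    assign = np.argmin(((emb[:, None, :] - codes[None]) ** 2).sum(-1), axis=1)
    return codes, assign

def select_response(query, emb, codes, assign):
    """Unit selection: nearest-neighbor search restricted to the units whose
    quantization cell matches the query's nearest codeword (real-time pruning)."""
    cell = np.argmin(((codes - query) ** 2).sum(-1))
    candidates = np.flatnonzero(assign == cell)
    if len(candidates) == 0:                       # empty cell: fall back to full search
        candidates = np.arange(len(emb))
    best = candidates[np.argmin(((emb[candidates] - query) ** 2).sum(-1))]
    return int(best)

codes, assign = build_codebook(unit_embeddings)
# In the real system the query would come from encoding and transforming the live MIDI input.
query = rng.normal(size=16)
print(select_response(query, unit_embeddings, codes, assign))
```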
DECK: Discovering Event Composition Knowledge from Web Images for Zero-Shot Event Detection and Recounting in Videos
Gan, Chuang (Tsinghua University) | Sun, Chen (Google Research) | Nevatia, Ram (University of Southern California)
We address the problem of zero-shot event recognition in consumer videos. An event usually consists of multiple human-human and human-object interactions over a relatively long period of time. A common approach proceeds by representing videos with banks of object and action concepts, but requires additional user inputs to specify the desired concepts per event. In this paper, we provide a fully automatic algorithm to select representative and reliable concepts for event queries. This is achieved by discovering event composition knowledge (DECK) from web images. To evaluate our proposed method, we use the standard zero-shot event detection protocol (ZeroMED), but also introduce a novel zero-shot event recounting (ZeroMER) problem to select supporting evidence of the events. Our ZeroMER formulation aims to select video snippets that are relevant and diverse. Evaluation on the challenging TRECVID MED dataset shows that our proposed method achieves promising results on both tasks.
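One simple way to picture "relevant and diverse" snippet selection is an MMR-style greedy trade-off between a snippet's concept-relevance score and its similarity to snippets already chosen. The sketch below uses synthetic scores and features and is only a heuristic stand-in, not the paper's ZeroMER formulation.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical per-snippet data: concept-detector relevance scores for the event's
# selected concepts and feature vectors used to measure redundancy between snippets.
relevance = rng.uniform(size=50)
features = rng.normal(size=(50, 16))
features /= np.linalg.norm(features, axis=1, keepdims=True)

def recount(relevance, features, k=5, lam=0.7):
    """Greedy relevant-and-diverse snippet selection: each step adds the snippet
    with the best trade-off between relevance and similarity to those already chosen."""
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        sim_to_selected = (features @ features[selected].T).max(axis=1)
        score = lam * relevance - (1 - lam) * sim_to_selected
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return selected

print(recount(relevance, features))   # indices of chosen evidence snippets
```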