Pairwise coupling of convolutional neural networks for better explicability of classification systems

arXiv.org Machine Learning

We examine several aspects of explicability of a classification system built from neural networks. The first aspect is pairwise explicability, which is the ability to provide the most accurate prediction when the range of possibilities is narrowed to just two. Next we consider explicability in development, which means the ability to make incremental improvements in prediction accuracy based on observed deficiencies of the system. The inherent stochasticity of neural network based classifiers can be interpreted using likelihood randomness explicability. Finally, sureness explicability indicates the confidence of the classifying system to make any prediction at all. These concepts are examined in the framework of pairwise coupling, which is a non-trainable metamodel that originated during the development of support vector machines. Several methodologies are evaluated, of which the key one is shown to be the choice of the pairwise coupling method. We compare two methods: the established Wu-Lin-Weng method and the recently proposed Bayes covariant method. Our experiments indicate that the Wu-Lin-Weng method gives more weight to a single pairwise classifier, whereas the Bayes covariant method tries to balance information from the whole matrix of pairwise likelihoods. This translates into higher accuracy and better sureness predictions for the Bayes covariant method. Pairwise coupling methodology has its costs, especially in terms of the number of parameters (but not necessarily in terms of training costs). However, when additional explicability aspects beyond accuracy are desired in an application, the pairwise coupling models are a promising alternative to the established methodology.
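To make the pairwise coupling step concrete, here is a minimal sketch of one common formulation attributed to Wu, Lin and Weng: given a matrix of pairwise estimates r_ij (the probability of class i when only classes i and j are possible), find class probabilities p that minimize the sum over i and j != i of (r_ji p_i - r_ij p_j)^2 subject to the entries of p summing to 1. The function name and the choice to solve the KKT linear system directly are assumptions made for illustration; this is not the paper's code, and the Bayes covariant method is not shown.

```python
import numpy as np

def couple_wu_lin_weng(R):
    """Combine pairwise estimates R[i, j] ~ P(class i | class i or j) into
    one probability vector (illustrative sketch; diagonal of R is ignored)."""
    R = np.asarray(R, dtype=float)
    k = R.shape[0]

    # Quadratic form of the objective: Q[i, j] = -R[j, i] * R[i, j] off the
    # diagonal, Q[i, i] = sum over s != i of R[s, i]^2.
    Q = -(R * R.T)
    squares = R ** 2
    np.fill_diagonal(Q, squares.sum(axis=0) - np.diag(squares))

    # KKT system for: minimize p^T Q p subject to sum(p) = 1.
    kkt = np.zeros((k + 1, k + 1))
    kkt[:k, :k] = Q
    kkt[:k, k] = 1.0
    kkt[k, :k] = 1.0
    rhs = np.zeros(k + 1)
    rhs[k] = 1.0

    return np.linalg.solve(kkt, rhs)[:k]
```

When the pairwise estimates are perfectly consistent, that is r_ij = p_i / (p_i + p_j) for some p, the objective is zero at p, so the sketch recovers p exactly. Disagreement between methods only shows up on inconsistent matrices.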


Towards Understanding Gender Bias in Relation Extraction

arXiv.org Machine Learning

Recent developments in Neural Relation Extraction (NRE) have made significant strides towards Automated Knowledge Base Construction (AKBC). While much attention has been dedicated towards improvements in accuracy, there have been no attempts in the literature to our knowledge to evaluate social biases in NRE systems. We create WikiGenderBias, a distantly supervised dataset with a human annotated test set. WikiGenderBias has sentences specifically curated to analyze gender bias in relation extraction systems. We use WikiGenderBias to evaluate systems for bias and find that NRE systems exhibit gender biased predictions and lay groundwork for future evaluation of bias in NRE. We also analyze how name anonymization, hard debiasing for word embeddings, and counterfactual data augmentation affect gender bias in predictions and performance.


Adaptivity in Adaptive Submodularity

arXiv.org Machine Learning

Adaptive sequential decision making is one of the central challenges in machine learning and artificial intelligence. In such problems, the goal is to design an interactive policy that plans for an action to take, from a finite set of $n$ actions, given some partial observations. It has been shown that in many applications such as active learning, robotics, sequential experimental design, and active detection, the utility function satisfies adaptive submodularity, a notion that generalizes the notion of diminishing returns to policies. In this paper, we revisit the power of adaptivity in maximizing an adaptive monotone submodular function. We propose an efficient batch policy that with $O(\log n \times\log k)$ adaptive rounds of observations can achieve an almost tight $(1-1/e-\epsilon)$ approximation guarantee with respect to an optimal policy that carries out $k$ actions in a fully sequential setting. To complement our results, we also show that it is impossible to achieve a constant factor approximation with $o(\log n)$ adaptive rounds. We also extend our result to the case of adaptive stochastic minimum cost coverage where the goal is to reach a desired utility $Q$ with the cheapest policy. We first prove the conjecture by Golovin and Krause that the greedy policy achieves the asymptotically tight logarithmic approximation guarantee without resorting to stronger notions of adaptivity. We then propose a batch policy that provides the same guarantee in polylogarithmic adaptive rounds through a similar information-parallelism scheme. Our results shrink the adaptivity gap in adaptive submodular maximization by an exponential factor.
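As a point of reference for the diminishing-returns idea, the following sketch shows the classical non-adaptive greedy rule on a set-coverage objective, which is monotone and submodular. It is deliberately simpler than the setting above: there are no observations between picks, no batching, and no adaptive rounds. The function name and the toy data layout are assumptions for illustration only.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k named sets, each time taking the one that covers the
    most not-yet-covered elements (non-adaptive greedy, illustrative)."""
    covered = set()
    chosen = []
    for _ in range(k):
        candidates = [name for name in sets if name not in chosen]
        if not candidates:
            break
        # Marginal gain of a set is the number of new elements it adds.
        best = max(candidates, key=lambda name: len(sets[name] - covered))
        if not sets[best] - covered:
            break  # nothing left to gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

An adaptive policy would additionally condition each pick on the outcomes observed so far; the batch policies discussed above limit how often such observations are made.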


Online matrix factorization for Markovian data and applications to Network Dictionary Learning

arXiv.org Machine Learning

Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of a dependent data stream remains largely unexplored. In this paper, we show that the well-known OMF algorithm for i.i.d. [...] Furthermore, we extend the convergence result to the case when we can only approximately solve each step of the optimization problems in the algorithm. For applications, we demonstrate dictionary learning from a sequence of images generated by a Markov Chain Monte Carlo (MCMC) sampler. Lastly, by combining online nonnegative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning, which extracts 'network dictionary patches' from a given network in an online manner that encodes the main features of the network. We demonstrate this technique on real-world text data.

Introduction. In modern data analysis, a central step is to find a low-dimensional representation to better understand, compress, or convey the key phenomena captured in the data. Matrix factorization provides a powerful setting for one to describe data in terms of a linear combination of factors or atoms. In this setting, we have a data matrix $X \in \mathbb{R}^{d \times n}$, and we seek a factorization of $X$ into the product $WH$ for $W \in \mathbb{R}^{d \times r}$ and $H \in \mathbb{R}^{r \times n}$. This problem has gone by many names over the decades, each with different constraints: dictionary learning, factor analysis, topic modeling, component analysis. It has applications in text analysis, image reconstruction, medical imaging, bioinformatics, and many other scientific fields more generally [SGH02, BB05, BBL+07, CWS+11, TN12, BMB+15, RPZ+18]. Each column of the data matrix is approximated by a linear combination of the columns of the dictionary matrix.

Online matrix factorization is a problem setting where data are accessed in a streaming manner and the matrix factors should be updated each time. That is, we get draws of $X$ from some distribution $\pi$ and seek the best factorization such that the expected loss $\mathbb{E}_{X \sim \pi}\left[\lVert X - WH \rVert_F^2\right]$ is small. This is a relevant setting in today's data world, where large companies, scientific instruments, and healthcare systems are collecting massive amounts of data every day. One cannot compute with the entire [...] (arXiv:1911.01931v3)

There are several algorithms for computing factorizations of various kinds in an online context. Many of them have algorithmic convergence guarantees; however, all of these guarantees require that data are sampled at each iteration i.i.d. with respect to previous iterations. In all of the application examples mentioned above, one may make an argument for (nearly) identical distributions, but never for independence.
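The streaming setting can be sketched as follows. This is a minimal illustration in the spirit of the well-known aggregate-and-update scheme for online dictionary learning, with nonnegativity constraints added; it is not the paper's algorithm and does not model Markovian dependence in the stream. All function names, the projected-gradient coding step, and the unit-norm column constraint are assumptions chosen for the example.

```python
import numpy as np

def code_nonnegative(X, W, n_iter=50):
    """Approximately minimize ||X - W H||_F^2 over H >= 0 by projected
    gradient, starting from H = 0 (step 1/L keeps the loss non-increasing)."""
    H = np.zeros((W.shape[1], X.shape[1]))
    step = 1.0 / (np.linalg.norm(W.T @ W, 2) + 1e-12)
    for _ in range(n_iter):
        grad = W.T @ (W @ H - X)  # gradient of half the squared loss
        H = np.maximum(H - step * grad, 0.0)
    return H

def update_dictionary(W, A, B):
    """One pass of column-wise updates on the surrogate defined by the
    aggregates A = sum H H^T and B = sum X H^T."""
    W = W.copy()
    for j in range(W.shape[1]):
        if A[j, j] > 1e-12:  # skip atoms that have never been used
            u = W[:, j] + (B[:, j] - W @ A[:, j]) / A[j, j]
            u = np.maximum(u, 0.0)                     # keep atoms nonnegative
            W[:, j] = u / max(np.linalg.norm(u), 1.0)  # keep column norm <= 1
    return W

def online_nmf(stream, d, r, seed=0):
    """Process data matrices one at a time, updating the dictionary W."""
    rng = np.random.default_rng(seed)
    W = rng.random((d, r))
    W /= np.maximum(np.linalg.norm(W, axis=0), 1.0)
    A = np.zeros((r, r))
    B = np.zeros((d, r))
    for X in stream:
        H = code_nonnegative(X, W)
        A += H @ H.T
        B += X @ H.T
        W = update_dictionary(W, A, B)
    return W
```

Each data matrix is touched once and then discarded; only the small aggregates `A` and `B` are kept, which is what makes the scheme usable when the full data set cannot be held in memory.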


Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks

arXiv.org Machine Learning

Lipschitz constraints under L2 norm on deep neural networks are useful for provable adversarial robustness bounds, stable training, and Wasserstein distance estimation. While heuristic approaches such as the gradient penalty have seen much practical success, it is challenging to achieve similar practical performance while provably enforcing a Lipschitz constraint. In principle, one can design Lipschitz constrained architectures using the composition property of Lipschitz functions, but Anil et al. recently identified a key obstacle to this approach: gradient norm attenuation. They showed how to circumvent this problem in the case of fully connected networks by designing each layer to be gradient norm preserving. We extend their approach to train scalable, expressive, provably Lipschitz convolutional networks. In particular, we present the Block Convolution Orthogonal Parameterization (BCOP), an expressive parameterization of orthogonal convolution operations. We show that even though the space of orthogonal convolutions is disconnected, the largest connected component of BCOP with 2n channels can represent arbitrary BCOP convolutions over n channels. Our BCOP parameterization allows us to train large convolutional networks with provable Lipschitz bounds. Empirically, we find that it is competitive with existing approaches to provable adversarial robustness and Wasserstein distance estimation.
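A small numerical sketch of why gradient norm preservation matters: with purely linear layers, backpropagating through orthogonal weight matrices leaves the gradient norm unchanged, while layers that are 1-Lipschitz but contractive shrink it geometrically with depth. This uses dense random orthogonal matrices, not convolutions and not the BCOP parameterization; the helper names and the factor 0.5 are assumptions for illustration.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def backprop_norm(layers, grad):
    """Norm of a gradient after backpropagation through linear layers."""
    for w in reversed(layers):
        grad = w.T @ grad
    return np.linalg.norm(grad)

rng = np.random.default_rng(0)
n, depth = 16, 10
grad = rng.standard_normal(n)

# Orthogonal layers: Lipschitz constant 1 and gradient norm preserving.
orth_layers = [random_orthogonal(n, rng) for _ in range(depth)]
# Scaled layers: still 1-Lipschitz, but every layer halves the gradient.
shrunk_layers = [0.5 * w for w in orth_layers]

kept = backprop_norm(orth_layers, grad) / np.linalg.norm(grad)
attenuated = backprop_norm(shrunk_layers, grad) / np.linalg.norm(grad)
```

Here `kept` stays at 1 up to rounding, while `attenuated` equals 0.5 raised to the depth, which is the attenuation effect that norm-preserving layers are designed to avoid.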


The Bias-Expressivity Trade-off

arXiv.org Artificial Intelligence

Learning algorithms need bias to generalize and perform better than random guessing. We examine the flexibility (expressivity) of biased algorithms. An expressive algorithm can adapt to changing training data, altering its outcome based on changes in its input. We measure expressivity by using an information-theoretic notion of entropy on algorithm outcome distributions, demonstrating a trade-off between bias and expressivity. The degree to which an algorithm is biased is the degree to which it can outperform uniform random sampling, but it is also the degree to which it becomes inflexible. We derive bounds relating bias to expressivity, proving the necessary trade-offs inherent in trying to create strongly performing yet flexible algorithms.
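As a rough illustration of an entropy-based measure on outcome distributions, the sketch below computes Shannon entropy in bits. The paper's exact definitions of bias and expressivity are not given here, so treating plain Shannon entropy as the measure is an assumption; the function name is hypothetical.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy (in bits) of a discrete outcome distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# An algorithm that spreads its outcomes uniformly over 8 options.
uniform = [1 / 8] * 8
# An algorithm that always returns the same outcome.
point_mass = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The uniform distribution attains the maximum entropy for 8 outcomes (3 bits), while the point mass has entropy 0: an algorithm whose outcome never changes is maximally committed to one answer and cannot respond to its input.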


Comparing Efficiency of Expert Data Aggregation Methods

arXiv.org Artificial Intelligence

Expert estimation of objects takes place when there are no benchmark values of object weights, but these weights still have to be defined. That is why it is problematic to define the efficiency of expert estimation methods. We propose to define the efficiency of such methods based on the stability of their results under perturbations of input data. We compare two modifications of the combinatorial method of expert data aggregation (spanning tree enumeration). Using the example of these two methods, we illustrate two approaches to efficiency evaluation. The first approach is based on real data, obtained through estimation of a set of model objects by a group of experts. The second approach is based on simulation of the whole expert examination cycle (including expert estimates). When evaluating the efficiency of the two modifications, the simulation-based approach proved more robust and credible. Our experimental study confirms that if the weights of spanning trees are taken into consideration, the results of the combinatorial data aggregation method become more stable. So, the weighted spanning tree enumeration method has an advantage over the non-weighted method (and, consequently, over the logarithmic least squares and row geometric mean methods).
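For orientation, here is a minimal sketch of the row geometric mean method mentioned above as a baseline, together with a helper that perturbs a pairwise comparison matrix while keeping it reciprocal, so that stability under perturbation can be probed. It does not implement spanning tree enumeration. The function names and the log-normal noise model are assumptions for illustration.

```python
import numpy as np

def row_geometric_mean_weights(A):
    """Weights from a pairwise comparison matrix A, where A[i, j] estimates
    w_i / w_j: geometric mean of each row, normalized to sum to 1."""
    A = np.asarray(A, dtype=float)
    g = np.exp(np.log(A).mean(axis=1))
    return g / g.sum()

def perturb_reciprocal(A, scale, rng):
    """Multiply each upper-triangle entry by log-normal noise and reset the
    mirrored entry so that P[i, j] * P[j, i] = 1 still holds."""
    P = np.array(A, dtype=float)
    n = P.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            P[i, j] = P[i, j] * np.exp(rng.normal(0.0, scale))
            P[j, i] = 1.0 / P[i, j]
    return P
```

One way to probe stability in this spirit is to compare `row_geometric_mean_weights(A)` with the weights obtained from many perturbed copies of `A` and summarize the deviations.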


Not All Claims are Created Equal: Choosing the Right Approach to Assess Your Hypotheses

arXiv.org Artificial Intelligence

Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community. We address this gap by contrasting various hypothesis assessment techniques, especially those not commonly used in the field (such as evaluations based on Bayesian inference). Since these statistical techniques differ in the hypotheses they can support, we argue that practitioners should first decide their target hypothesis before choosing an assessment method. This is crucial because common fallacies, misconceptions, and misinterpretation surrounding hypothesis assessment methods often stem from a discrepancy between what one would like to claim versus what the method used actually assesses. Our survey reveals that these issues are omnipresent in the NLP research community. As a step forward, we provide best practices and guidelines tailored to NLP research, as well as an easy-to-use package called 'HyBayes' for Bayesian assessment of hypotheses, complementing existing tools.
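As one example of a Bayesian alternative to a p-value, the sketch below estimates the posterior probability that system A has a higher true accuracy than system B, given correct/total counts for each. It assumes independent uniform Beta(1, 1) priors and uses Monte Carlo sampling. It does not use or reproduce the HyBayes package; the function name and parameters are hypothetical.

```python
import numpy as np

def prob_a_better(correct_a, total_a, correct_b, total_b,
                  n_samples=100_000, seed=0):
    """Posterior probability that accuracy(A) > accuracy(B) under
    independent Beta(1, 1) priors, estimated by sampling."""
    rng = np.random.default_rng(seed)
    # Beta posterior: Beta(1 + successes, 1 + failures).
    theta_a = rng.beta(1 + correct_a, 1 + total_a - correct_a, n_samples)
    theta_b = rng.beta(1 + correct_b, 1 + total_b - correct_b, n_samples)
    return float(np.mean(theta_a > theta_b))
```

The result answers the question "how plausible is it that A is better than B?" directly, which is a different hypothesis from the one a null-hypothesis significance test addresses.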


The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents

arXiv.org Artificial Intelligence

We introduce dodecaDialogue: a set of 12 tasks that measures if a conversational agent can communicate engagingly with personality and empathy, ask questions, answer questions by utilizing knowledge resources, discuss topics and situations, and perceive and converse about images. By multi-tasking on such a broad large-scale set of data, we hope to both move towards and measure progress in producing a single unified agent that can perceive, reason and converse with humans in an open-domain setting. We show that such multi-tasking improves over a BERT pre-trained baseline, largely due to multi-tasking with very large dialogue datasets in a similar domain, and that the multi-tasking in general provides gains to both text and image-based tasks using several metrics in both the fine-tune and task transfer settings. We obtain state-of-the-art results on many of the tasks, providing a strong baseline for this challenge.


A perspective on multi-agent communication for information fusion

arXiv.org Artificial Intelligence

Collaborative decision making in multi-agent systems typically requires a predefined communication protocol among agents. Usually, agent-level observations are locally processed and information is exchanged using the predefined protocol, enabling the team to perform more efficiently than each agent operating in isolation. In this work, we consider the situation where agents with complementary sensing modalities must co-operate to achieve a common goal or task by learning an efficient communication protocol. We frame the problem within an actor-critic scheme, where the agents learn optimal policies in a centralized fashion while taking actions in a distributed manner. We provide an interpretation of the emergent communication between the agents. We observe that the information exchanged is not just an encoding of the raw sensor data but is, rather, a specific set of directive actions that depend on the overall task. Simulation results demonstrate the interpretability of the learnt communication in a variety of tasks.