Rational Shapley Values Artificial Intelligence

Explaining the predictions of opaque machine learning algorithms is an important and challenging task, especially as complex models are increasingly used to assist in high-stakes decisions such as those arising in healthcare and finance. Most popular tools for post-hoc explainable artificial intelligence (XAI) are either insensitive to context (e.g., feature attributions) or difficult to summarize (e.g., counterfactuals). In this paper, I introduce rational Shapley values, a novel XAI method that synthesizes and extends these seemingly incompatible approaches in a rigorous, flexible manner. I leverage tools from decision theory and causal modeling to formalize and implement a pragmatic approach that resolves a number of known challenges in XAI. By pairing the distribution of random variables with the appropriate reference class for a given explanation task, I illustrate through theory and experiments how user goals and knowledge can inform and constrain the solution set in an iterative fashion. The method compares favorably to state-of-the-art XAI tools in a range of quantitative and qualitative comparisons.
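
As background for the abstract above, the classical Shapley value that the method builds on can be sketched in a few lines. This is a minimal illustration of the standard definition only, not the rational variant the paper proposes. The toy model `f`, the instance `x`, and the all-zero `baseline` (standing in for a reference point) are assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n):
    """Exact Shapley values for a value function over n players (features)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            # Weight of a coalition of size r that excludes player i.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                coalition = frozenset(subset)
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


# Hypothetical toy model and inputs, chosen only for illustration.
def f(z):
    return 2 * z[0] + z[1] * z[2]


x = (1.0, 2.0, 3.0)          # instance to explain
baseline = (0.0, 0.0, 0.0)   # assumed reference point


def value(coalition):
    # Features in the coalition take the instance's values;
    # all other features are held at the baseline.
    z = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return f(z)


phi = shapley_values(value, len(x))
```

The attributions sum to the difference between the prediction at `x` and at `baseline`, which is the efficiency property of Shapley values. Changing the baseline changes the attributions, which is one simple way to see why the choice of reference matters.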

Practical Machine Learning Safety: A Survey and Primer Artificial Intelligence

Among different ML models, Deep Neural Networks (DNNs) [130] are well-known and widely used for their powerful representation learning from high-dimensional data such as images, texts, and speech. However, as ML algorithms enter sensitive real-world domains with trustworthiness, safety, and fairness prerequisites, the need for corresponding techniques and metrics for high-stakes domains is more apparent than before. Hence, researchers in different fields propose guidelines for Trustworthy AI [208], Safe AI [5], and Explainable AI [155] as stepping stones for next-generation Responsible AI [6, 247]. Furthermore, government reports and regulations on AI accountability [75], trustworthiness [216], and safety [31] are gradually giving rise to binding laws that protect citizens' data privacy, ensure fair data processing, and uphold safety for AI-based products. The development and deployment of ML algorithms for open-world tasks come with reliability and dependability limitations stemming from limitations in model performance, robustness, and uncertainty [156]. Unlike traditional code-based software, ML models have fundamental safety drawbacks, including performance limitations on their training set and run-time robustness issues in their operational domain.

Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style Machine Learning

Self-supervised representation learning has shown remarkable success in a number of domains. A common practice is to perform data augmentation via hand-crafted transformations intended to leave the semantics of the data invariant. We seek to understand the empirical success of this approach from a theoretical perspective. We formulate the augmentation process as a latent variable model by postulating a partition of the latent representation into a content component, which is assumed invariant to augmentation, and a style component, which is allowed to change. Unlike prior work on disentanglement and independent component analysis, we allow for both nontrivial statistical and causal dependencies in the latent space. We study the identifiability of the latent representation based on pairs of views of the observations and prove sufficient conditions that allow us to identify the invariant content partition up to an invertible mapping in both generative and discriminative settings. We find numerical simulations with dependent latent variables are consistent with our theory. Lastly, we introduce Causal3DIdent, a dataset of high-dimensional, visually complex images with rich causal dependencies, which we use to study the effect of data augmentations performed in practice.

Data-Driven Design-by-Analogy: State of the Art and Future Directions Artificial Intelligence

Design-by-Analogy (DbA) is a design methodology wherein new solutions are generated in a target domain based on inspiration drawn from a source domain through cross-domain analogical reasoning [1, 2, 3]. DbA is an active research area in engineering design, and various methods and tools have been proposed to support the implementation of its process [4, 5, 6, 7, 8]. Studies have shown that DbA can help designers mitigate design fixation [9] and improve design ideation outcomes [10]. Fig. 1 presents an example of DbA applications [11]. This case aims to solve an engineering design problem: How might we rectify the loud sonic boom generated when trains travel at high speeds through tunnels in atmospheric conditions [11, 12]? For potential design solutions to this problem, engineers explored structures in design fields other than trains, or in nature, that effectively "break" the sonic-boom effect. Looking to nature, engineers discovered that kingfisher birds can slice through the air and dive into the water at extremely high speeds to catch prey while barely making a splash. By analogy, engineers redesigned the train's front-end nose to mimic the geometry of the kingfisher's beak. This analogical design reduced noise and eliminated tunnel booms.

Non-negative matrix factorization algorithms greatly improve topic model fits Machine Learning

We report on the potential for using algorithms for non-negative matrix factorization (NMF) to improve parameter estimation in topic models. While several papers have studied connections between NMF and topic models, none have suggested leveraging these connections to develop new algorithms for fitting topic models. Importantly, NMF avoids the "sum-to-one" constraints on the topic model parameters, resulting in an optimization problem with simpler structure and more efficient computations. Building on recent advances in optimization algorithms for NMF, we show that first solving the NMF problem and then recovering the topic model fit can produce remarkably better fits, and in less time, than standard algorithms for topic models. While we focus primarily on maximum likelihood estimation, we show that this approach also has the potential to improve variational inference for topic models. Our methods are implemented in the R package fastTopics.
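
The "recover the topic model fit" step described above can be illustrated with a simple rescaling, assuming the NMF is a pair of non-negative matrices whose product gives per-document word rates. The sketch below is not the fastTopics API (that package is written in R). The function name, the variable names, and the random test matrices are all hypothetical.

```python
import numpy as np


def nmf_to_topic_model(L, F):
    """Rescale non-negative factors into sum-to-one topic model parameters.

    L: (documents x topics) non-negative loadings.
    F: (words x topics) non-negative factors.
    """
    col_sums = F.sum(axis=0)            # total mass of each topic's word column
    topics = F / col_sums               # each column is now a word distribution
    L_scaled = L * col_sums             # move the removed scale into the loadings
    sizes = L_scaled.sum(axis=1, keepdims=True)
    proportions = L_scaled / sizes      # each row is now a topic mixture
    return proportions, topics, sizes.ravel()


# Hypothetical strictly positive factors, used only to exercise the function.
rng = np.random.default_rng(0)
L = rng.gamma(1.0, 1.0, size=(5, 3))
F = rng.gamma(1.0, 1.0, size=(8, 3))

proportions, topics, sizes = nmf_to_topic_model(L, F)

# Per-document word rates under the NMF, normalized to probabilities.
rates = L @ F.T
word_probs = rates / rates.sum(axis=1, keepdims=True)
```

Under these assumptions, `proportions @ topics.T` reproduces `word_probs`, so the sum-to-one constraints are imposed only after the unconstrained factorization has been fit.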

Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets Artificial Intelligence

Ensembles of machine learning models yield improved system performance as well as robust and interpretable uncertainty estimates; however, their inference costs may often be prohibitively high. Ensemble Distribution Distillation is an approach that allows a single model to efficiently capture both the predictive performance and uncertainty estimates of an ensemble. For classification, this is achieved by training a Dirichlet distribution over the ensemble members' output distributions via the maximum likelihood criterion. Although theoretically principled, this criterion exhibits poor convergence when applied to large-scale tasks where the number of classes is very high. In our work, we analyze this effect and show that, for the Dirichlet log-likelihood criterion, classes with low probability induce larger gradients than high-probability classes. This forces the model to focus on the distribution of the ensemble tail-class probabilities. We propose a new training objective which minimizes the reverse KL-divergence to a Proxy-Dirichlet target derived from the ensemble. This loss resolves the gradient issues of Ensemble Distribution Distillation, as we demonstrate both theoretically and empirically on the ImageNet and WMT17 En-De datasets containing 1000 and 40,000 classes, respectively.
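
One way to see where such a gradient imbalance could come from is to differentiate the Dirichlet log-likelihood with respect to its concentration parameters. The notation below is an assumption made for illustration and is not taken from the paper: $\pi$ is one ensemble member's output distribution over $K$ classes, $\alpha$ holds the predicted concentration parameters, $\alpha_0 = \sum_k \alpha_k$, and $\psi$ is the digamma function. The paper's own analysis may be carried out with respect to different quantities.

```latex
\log \mathrm{Dir}(\pi \mid \alpha)
  = \log\Gamma(\alpha_0) - \sum_{k=1}^{K} \log\Gamma(\alpha_k)
    + \sum_{k=1}^{K} (\alpha_k - 1)\log \pi_k ,
\qquad
\frac{\partial}{\partial \alpha_k} \log \mathrm{Dir}(\pi \mid \alpha)
  = \psi(\alpha_0) - \psi(\alpha_k) + \log \pi_k .
```

The term $\log \pi_k$ is unbounded below as $\pi_k \to 0$, so a class to which the ensemble assigns very small probability can contribute a gradient of much larger magnitude than a class with probability near one, where $\log \pi_k$ is close to zero.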

Few-shot Learning for Topic Modeling Machine Learning

Topic models have been successfully used for analyzing text documents. However, with existing topic models, many documents are required for training. In this paper, we propose a neural network-based few-shot learning method that can learn a topic model from just a few documents. The neural networks in our model take a small number of documents as inputs, and output topic model priors. The proposed method trains the neural networks such that the expected test likelihood is improved when topic model parameters are estimated by maximizing the posterior probability using the priors based on the EM algorithm. Since each step in the EM algorithm is differentiable, the proposed method can backpropagate the loss through the EM algorithm to train the neural networks. The expected test likelihood is maximized by a stochastic gradient descent method using a set of multiple text corpora with an episodic training framework. In our experiments, we demonstrate that the proposed method achieves better perplexity than existing methods using three real-world text document sets.

Distributed NLI: Learning to Predict Human Opinion Distributions for Language Reasoning Artificial Intelligence

We introduce distributed NLI, a new NLU task whose goal is to predict the distribution of human judgements for natural language inference. We show that models can capture the human judgement distribution by applying additional distribution estimation methods, namely, Monte Carlo (MC) Dropout, Deep Ensemble, Re-Calibration, and Distribution Distillation. All four of these methods substantially outperform the softmax baseline. We show that MC Dropout is able to achieve decent performance without any distribution annotations, while Re-Calibration can give further substantial improvements when extra distribution annotations are provided, suggesting the value of multiple annotations per example in modeling the distribution of human judgements. Moreover, MC Dropout and Re-Calibration can achieve decent transfer performance on out-of-domain data. Despite these improvements, the best results are still far below the estimated human upper bound, indicating that the task of predicting the distribution of human judgements is still an open, challenging problem with large room for future improvements. We showcase the common errors for MC Dropout and Re-Calibration. Finally, we give guidelines on the usage of these methods with different levels of data availability and encourage future work on modeling the human opinion distribution for language reasoning.

From partners to populations: A hierarchical Bayesian account of coordination and convention Artificial Intelligence

Languages are powerful solutions to coordination problems: they provide stable, shared expectations about how the words we say correspond to the beliefs and intentions in our heads. Yet language use in a variable and non-stationary social environment requires linguistic representations to be flexible: old words acquire new ad hoc or partner-specific meanings on the fly. In this paper, we introduce a hierarchical Bayesian theory of convention formation that aims to reconcile the long-standing tension between these two basic observations. More specifically, we argue that the central computational problem of communication is not simply transmission, as in classical formulations, but learning and adaptation over multiple timescales. Under our account, rapid learning within dyadic interactions allows for coordination on partner-specific common ground, while social conventions are stable priors that have been abstracted away from interactions with multiple partners. We present new empirical data alongside simulations showing how our model provides a cognitive foundation for explaining several phenomena that have posed a challenge for previous accounts: (1) the convergence to more efficient referring expressions across repeated interaction with the same partner, (2) the gradual transfer of partner-specific common ground to novel partners, and (3) the influence of communicative context on which conventions eventually form.

Exploring Topic-Metadata Relationships with the STM: A Bayesian Approach Machine Learning

Topic models such as the Structural Topic Model (STM) estimate latent topical clusters within text. An important step in many topic modeling applications is to explore relationships between the discovered topical structure and metadata associated with the text documents. Methods used to estimate such relationships must take into account that the topical structure is not directly observed, but is itself estimated. The authors of the STM, for instance, perform repeated OLS regressions of sampled topic proportions on metadata covariates by using a Monte Carlo sampling technique known as the method of composition. In this paper, we propose two improvements: first, we replace OLS with the more appropriate Beta regression. Second, we suggest a fully Bayesian approach instead of the current blending of frequentist and Bayesian methods. We demonstrate our improved methodology by exploring relationships between Twitter posts by German members of parliament (MPs) and different metadata covariates.