Goto

Collaborating Authors

 Supervised Learning


Learning Adaptive Value of Information for Structured Prediction

Neural Information Processing Systems

Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to fine-tune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting non-homogeneity of the data.


Chiefs' Harrison Butker drills longest field goal in Super Bowl history, breaking record set in 1st half

FOX News

Fox News Flash top sports headlines are here. Check out what's clicking on Foxnews.com. The first score of the Super Bowl turned out to be historic, but the record didn't stand for long. Kansas City Chiefs kicker Harrison Butker drilled his second field goal of Super Bowl LVIII against the San Francisco 49ers, a 57-yarder that rewrote the record for longest kick in the "Big Game." And that previous record was hit in the first half by 49ers rookie kicker Jake Moody, who hit one home from 56 yards.


Online Structured Prediction with Fenchel--Young Losses and Improved Surrogate Regret for Online Multiclass Classification with Logistic Loss

arXiv.org Artificial Intelligence

This paper studies online structured prediction with full-information feedback. For online multiclass classification, van der Hoeven (2020) has obtained surrogate regret bounds independent of the time horizon, or \emph{finite}, by introducing an elegant \emph{exploit-the-surrogate-gap} framework. However, this framework has been limited to multiclass classification primarily because it relies on a classification-specific procedure for converting estimated scores to outputs. We extend the exploit-the-surrogate-gap framework to online structured prediction with \emph{Fenchel--Young losses}, a large family of surrogate losses including the logistic loss for multiclass classification, obtaining finite surrogate regret bounds in various structured prediction problems. To this end, we propose and analyze \emph{randomized decoding}, which converts estimated scores to general structured outputs. Moreover, by applying our decoding to online multiclass classification with the logistic loss, we obtain a surrogate regret bound of $O(B^2)$, where $B$ is the $\ell_2$-diameter of the domain. This bound is tight up to logarithmic factors and improves the previous bound of $O(dB^2)$ due to van der Hoeven (2020) by a factor of $d$, the number of classes.


On dimensionality of feature vectors in MPNNs

arXiv.org Artificial Intelligence

We revisit the classical result of Morris et al.~(AAAI'19) that message-passing graphs neural networks (MPNNs) are equal in their distinguishing power to the Weisfeiler--Leman (WL) isomorphism test. Morris et al.~show their simulation result with ReLU activation function and $O(n)$-dimensional feature vectors, where $n$ is the number of nodes of the graph. Recently, by introducing randomness into the architecture, Aamand et al.~(NeurIPS'22) were able to improve this bound to $O(\log n)$-dimensional feature vectors, although at the expense of guaranteeing perfect simulation only with high probability. In all these constructions, to guarantee equivalence to the WL test, the dimension of feature vectors in the MPNN has to increase with the size of the graphs. However, architectures used in practice have feature vectors of constant dimension. Thus, there is a gap between the guarantees provided by these results and the actual characteristics of architectures used in practice. In this paper we close this gap by showing that, for \emph{any} non-polynomial analytic (like the sigmoid) activation function, to guarantee that MPNNs are equivalent to the WL test, feature vectors of dimension $d=1$ is all we need, independently of the size of the graphs. Our main technical insight is that for simulating multi-sets in the WL-test, it is enough to use linear independence of feature vectors over rationals instead of reals. Countability of the set of rationals together with nice properties of analytic functions allow us to carry out the simulation invariant over the iterations of the WL test without increasing the dimension of the feature vectors.


What Will My Model Forget? Forecasting Forgotten Examples in Language Model Refinement

arXiv.org Artificial Intelligence

Language models deployed in the wild make errors. However, simply updating the model with the corrected error instances causes catastrophic forgetting -- the updated model makes errors on instances learned during the instruction tuning or upstream training phase. Randomly replaying upstream data yields unsatisfactory performance and often comes with high variance and poor controllability. To this end, we try to forecast upstream examples that will be forgotten due to a model update for improved controllability of the replay process and interpretability. We train forecasting models given a collection of online learned examples and corresponding forgotten upstream pre-training examples. We propose a partially interpretable forecasting model based on the observation that changes in pre-softmax logit scores of pretraining examples resemble that of online learned examples, which performs decently on BART but fails on T5 models. We further show a black-box classifier based on inner products of example representations achieves better forecasting performance over a series of setups. Finally, we show that we reduce forgetting of upstream pretraining examples by replaying examples that are forecasted to be forgotten, demonstrating the practical utility of forecasting example forgetting.


Universal Imitation Games

arXiv.org Artificial Intelligence

Alan Turing proposed in 1950 a framework called an imitation game to decide if a machine could think. Using mathematics developed largely after Turing -- category theory -- we analyze a broader class of universal imitation games (UIGs), which includes static, dynamic, and evolutionary games. In static games, the participants are in a steady state. In dynamic UIGs, "learner" participants are trying to imitate "teacher" participants over the long run. In evolutionary UIGs, the participants are competing against each other in an evolutionary game, and participants can go extinct and be replaced by others with higher fitness. We use the framework of category theory -- in particular, two influential results by Yoneda -- to characterize each type of imitation game. Universal properties in categories are defined by initial and final objects. We characterize dynamic UIGs where participants are learning by inductive inference as initial algebras over well-founded sets, and contrast them with participants learning by conductive inference over the final coalgebra of non-well-founded sets. We briefly discuss the extension of our categorical framework for UIGs to imitation games on quantum computers.


Understanding Neural Network Systems for Image Analysis using Vector Spaces and Inverse Maps

arXiv.org Artificial Intelligence

There is strong interest in developing mathematical methods that can be used to understand complex neural networks used in image analysis. In this paper, we introduce techniques from Linear Algebra to model neural network layers as maps between signal spaces. First, we demonstrate how signal spaces can be used to visualize weight spaces and convolutional layer kernels. We also demonstrate how residual vector spaces can be used to further visualize information lost at each layer. Second, we introduce the concept of invertible networks and an algorithm for computing input images that yield specific outputs. We demonstrate our approach on two invertible networks and ResNet18.


Manifold-based Verbalizer Space Re-embedding for Tuning-free Prompt-based Classification

arXiv.org Artificial Intelligence

Prompt-based classification adapts tasks to a cloze question format utilizing the [MASK] token and the filled tokens are then mapped to labels through pre-defined verbalizers. Recent studies have explored the use of verbalizer embeddings to reduce labor in this process. However, all existing studies require a tuning process for either the pre-trained models or additional trainable embeddings. Meanwhile, the distance between high-dimensional verbalizer embeddings should not be measured by Euclidean distance due to the potential for non-linear manifolds in the representation space. In this study, we propose a tuning-free manifold-based space re-embedding method called Locally Linear Embedding with Intra-class Neighborhood Constraint (LLE-INC) for verbalizer embeddings, which preserves local properties within the same class as guidance for classification. Experimental results indicate that even without tuning any parameters, our LLE-INC is on par with automated verbalizers with parameter tuning. And with the parameter updating, our approach further enhances prompt-based tuning by up to 3.2%. Furthermore, experiments with the LLaMA-7B&13B indicate that LLE-INC is an efficient tuning-free classification approach for the hyper-scale language models.


SCENE: Self-Labeled Counterfactuals for Extrapolating to Negative Examples

arXiv.org Artificial Intelligence

Detecting negatives (such as non-entailment relationships, unanswerable questions, and false claims) is an important and challenging aspect of many natural language understanding tasks. Though manually collecting challenging negative examples can help models detect them, it is both costly and domain-specific. In this work, we propose Self-labeled Counterfactuals for Extrapolating to Negative Examples (SCENE), an automatic method for synthesizing training data that greatly improves models' ability to detect challenging negative examples. In contrast with standard data augmentation, which synthesizes new examples for existing labels, SCENE can synthesize negative examples zero-shot from only positive ones. Given a positive example, SCENE perturbs it with a mask infilling model, then determines whether the resulting example is negative based on a self-training heuristic. With access to only answerable training examples, SCENE can close 69.6% of the performance gap on SQuAD 2.0, a dataset where half of the evaluation examples are unanswerable, compared to a model trained on SQuAD 2.0. Our method also extends to boolean question answering and recognizing textual entailment, and improves generalization from SQuAD to ACE-whQA, an out-of-domain extractive QA benchmark.


Continuous-time Riemannian SGD and SVRG Flows on Wasserstein Probabilistic Space

arXiv.org Artificial Intelligence

Recently, optimization on the Riemannian manifold has provided new insights to the optimization community. In this regard, the manifold taken as the probability measure metric space equipped with the second-order Wasserstein distance is of particular interest, since optimization on it can be linked to practical sampling processes. In general, the oracle (continuous) optimization method on Wasserstein space is Riemannian gradient flow (i.e., Langevin dynamics when minimizing KL divergence). In this paper, we aim to enrich the continuous optimization methods in the Wasserstein space by extending the gradient flow into the stochastic gradient descent (SGD) flow and stochastic variance reduction gradient (SVRG) flow. The two flows on Euclidean space are standard stochastic optimization methods, while their Riemannian counterparts are not explored yet. By leveraging the structures in Wasserstein space, we construct a stochastic differential equation (SDE) to approximate the discrete dynamics of desired stochastic methods in the corresponded random vector space. Then, the flows of probability measures are naturally obtained by applying Fokker-Planck equation to such SDE. Furthermore, the convergence rates of the proposed Riemannian stochastic flows are proven, and they match the results in Euclidean space.