Collaborating Authors

 Levinboim, Tomer


CausalLM is not optimal for in-context learning

arXiv.org Artificial Intelligence

Recent empirical evidence indicates that transformer-based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, than when using a causal language model (causalLM), whose auto-regressive attention prohibits in-context samples from attending to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, the convergence dynamics of causalLM follow those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks, using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
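The difference between the two attention patterns described in this abstract can be illustrated with a small sketch. It is not taken from the paper: the function names `causal_mask` and `prefix_mask` and the parameter `n_prefix` are hypothetical, and the only assumption is the usual convention that entry `[i, j]` of a mask is true when position `i` may attend to position `j`.

```python
import numpy as np

def causal_mask(n_tokens):
    # Hypothetical helper: position i may attend only to positions j <= i.
    return np.tril(np.ones((n_tokens, n_tokens), dtype=bool))

def prefix_mask(n_tokens, n_prefix):
    # Hypothetical helper: the first n_prefix positions (the in-context
    # samples) attend to each other in both directions; all remaining
    # positions keep the causal pattern.
    mask = causal_mask(n_tokens)
    mask[:n_prefix, :n_prefix] = True
    return mask

causal = causal_mask(5)
prefix = prefix_mask(5, n_prefix=3)

# Under the causal mask, sample 0 cannot see the later sample 2;
# under the prefix mask, it can.
print(causal[0, 2], prefix[0, 2])
```

In this sketch, positions outside the prefix behave identically under both masks, so the two models differ only in how the in-context samples see one another.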


Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization

arXiv.org Artificial Intelligence

Recent research in robust optimization has shown an overfitting-like phenomenon in which models trained against adversarial attacks exhibit higher robustness on the training set than on the test set. Although previous work provided theoretical explanations for this phenomenon using a robust PAC-Bayesian bound over the adversarial test error, related algorithmic derivations are at best only loosely connected to this bound, which implies that there is still a gap between their empirical success and our understanding of adversarial robustness theory. To close this gap, in this paper we consider a different form of the robust PAC-Bayesian bound and directly minimize it with respect to the model posterior. The derivation of the optimal solution connects PAC-Bayesian learning to the geometry of the robust loss surface through a Trace of Hessian (TrH) regularizer that measures the surface flatness. In practice, we restrict the TrH regularizer to the top layer only, which results in an analytical solution to the bound whose computational cost does not depend on the network depth. Finally, we evaluate our TrH regularization approach over CIFAR-10/100 and ImageNet using Vision Transformers (ViT) and compare against baseline adversarial robustness algorithms. Experimental results show that TrH regularization leads to improved ViT robustness that either matches or surpasses previous state-of-the-art approaches while requiring less memory and computation.


PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks

arXiv.org Artificial Intelligence

With the increasing abundance of pretrained models in recent years, the problem of selecting the best pretrained checkpoint for a particular downstream classification task has been gaining increased attention. Although several methods have recently been proposed to tackle the selection problem (e.g. LEEP, H-score), these methods resort to applying heuristics that are not well motivated by learning theory. In this paper we present PACTran, a theoretically grounded family of metrics for pretrained model selection and transferability measurement. We first show how to derive PACTran metrics from the optimal PAC-Bayesian bound under the transfer learning setting. We then empirically evaluate three metric instantiations of PACTran on a number of vision tasks (VTAB) as well as a language-and-vision (OKVQA) task. An analysis of the results shows PACTran is a more consistent and effective transferability measure compared to existing selection methods.


Bridging the Gap Between Practice and PAC-Bayes Theory in Few-Shot Meta-Learning

arXiv.org Machine Learning

Despite recent advances in its theoretical understanding, there still remains a significant gap in the ability of existing PAC-Bayesian theories on meta-learning to explain performance improvements in the few-shot learning setting, where the number of training examples in the target tasks is severely limited. This gap originates from an assumption in the existing theories which supposes that the number of training examples in the observed tasks and the number of training examples in the target tasks follow the same distribution, an assumption that rarely holds in practice. By relaxing this assumption, we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theories. Furthermore, we derive a new computationally-efficient PACMAML algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets.


Learning the Kernel Matrix with Low-Rank Multiplicative Shaping

AAAI Conferences

Selecting the optimal kernel is an important and difficult challenge in applying kernel methods to pattern recognition. To address this challenge, multiple kernel learning (MKL) aims to learn a kernel from a combination of base kernel functions that perform optimally on the task. In this paper, we propose a novel MKL-themed approach to combine base kernels that are multiplicatively shaped with low-rank positive semidefinite matrices. The proposed approach generalizes several popular MKL methods and thus provides more flexibility in modeling data. Computationally, we show how these low-rank matrices can be learned efficiently from data using convex quadratic programming. Empirical studies on several standard benchmark datasets for MKL show that the new approach often yields statistically significant improvements in prediction accuracy over very competitive single-kernel and other MKL methods.


Learning and Evaluating Human-Like NPC Behaviors in Dynamic Games

AAAI Conferences

We address the challenges of evaluating the fidelity of AI agents that are attempting to produce human-like behaviors in games. To create a believable and engaging game play experience, designers must ensure that their non-player characters (NPCs) behave in a human-like manner. Today, with the wide popularity of massively multiplayer online games, this goal may seem less important. However, if we can reliably produce human-like NPCs, this can open up an entirely new genre of game play. In this paper, we focus on emulating human behaviors in strategic game settings, using a Social Ultimatum Game as the testbed for developing and evaluating a set of metrics for comparing various autonomous agents to human behavior collected from live experiments.