AITopics | Ding, Nan

Ding, Nan

CausalLM is not optimal for in-context learning

Ding, Nan, Levinboim, Tomer, Wu, Jialin, Goodman, Sebastian, Soricut, Radu

arXiv.org Artificial Intelligence | Sep-2-2023

Recent empirical evidence indicates that transformer-based in-context learning performs better when using a prefix language model (prefixLM), in which in-context samples can all attend to each other, compared to causal language models (causalLM), which use auto-regressive attention that prohibits in-context samples from attending to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate, but that while prefixLM converges to the optimal solution of linear regression, causalLM convergence dynamics follow those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments over synthetic and real tasks and using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
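
As a rough illustration of the two attention patterns contrasted in the abstract above, the following sketch builds boolean attention masks with NumPy. The function names and the `prefix_len` parameter are illustrative assumptions, not taken from the paper; the in-context samples are assumed to occupy the first `prefix_len` positions of the sequence.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """mask[i, j] is True when position i may attend to position j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Causal mask, except the first prefix_len positions attend to each other freely."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

if __name__ == "__main__":
    # Rows are query positions, columns are key positions.
    print(causal_mask(4).astype(int))
    print(prefix_mask(4, prefix_len=3).astype(int))
```

Under the causal mask, an early in-context sample never sees later ones; under the prefix mask, every sample inside the prefix sees all the others, which is the difference the paper analyzes.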

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2308.06912

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Chen, Xi, Wang, Xiao, Changpinyo, Soravit, Piergiovanni, AJ, Padlewski, Piotr, Salz, Daniel, Goodman, Sebastian, Grycner, Adam, Mustafa, Basil, Beyer, Lucas, Kolesnikov, Alexander, Puigcerver, Joan, Ding, Nan, Rong, Keran, Akbari, Hassan, Mishra, Gaurav, Xue, Linting, Thapliyal, Ashish, Bradbury, James, Kuo, Weicheng, Seyedhosseini, Mojtaba, Jia, Chao, Ayan, Burcu Karagol, Riquelme, Carlos, Steiner, Andreas, Angelova, Anelia, Zhai, Xiaohua, Houlsby, Neil, Soricut, Radu

arXiv.org Artificial Intelligence | Jun-5-2023

Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2209.06794

Genre: Research Report > New Finding (0.45)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization

Wang, Zifan, Ding, Nan, Levinboim, Tomer, Chen, Xi, Soricut, Radu

arXiv.org Artificial Intelligence | Nov-22-2022

Recent research in robust optimization has shown an overfitting-like phenomenon in which models trained against adversarial attacks exhibit higher robustness on the training set compared to the test set. Although previous work provided theoretical explanations for this phenomenon using a robust PAC-Bayesian bound over the adversarial test error, related algorithmic derivations are at best only loosely connected to this bound, which implies that there is still a gap between their empirical success and our understanding of adversarial robustness theory. To close this gap, in this paper we consider a different form of the robust PAC-Bayesian bound and directly minimize it with respect to the model posterior. The derivation of the optimal solution connects PAC-Bayesian learning to the geometry of the robust loss surface through a Trace of Hessian (TrH) regularizer that measures the surface flatness. In practice, we restrict the TrH regularizer to the top layer only, which results in an analytical solution to the bound whose computational cost does not depend on the network depth. Finally, we evaluate our TrH regularization approach over CIFAR-10/100 and ImageNet using Vision Transformers (ViT) and compare against baseline adversarial robustness algorithms. Experimental results show that TrH regularization leads to improved ViT robustness that either matches or surpasses previous state-of-the-art approaches while at the same time requiring less memory and computational cost.
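
To make the top-layer restriction more concrete, here is a minimal sketch of the trace of the Hessian of a softmax cross-entropy loss with respect to the weights of a final linear layer. It uses the standard closed form $(1 - \lVert p \rVert^2)\,\lVert f \rVert^2$, where $p$ is the softmax output and $f$ is the feature vector fed to the top layer. Whether this matches the paper's exact regularizer is an assumption, and the function name `top_layer_trh` is hypothetical.

```python
import numpy as np

def top_layer_trh(W: np.ndarray, f: np.ndarray) -> float:
    """Trace of the Hessian of softmax cross-entropy w.r.t. W, where logits = W @ f.

    The Hessian is the Kronecker product of (diag(p) - p p^T) and (f f^T),
    so its trace is (1 - sum(p**2)) * sum(f**2). It does not depend on the label.
    """
    logits = W @ f
    logits = logits - logits.max()  # stabilise the softmax numerically
    p = np.exp(logits) / np.exp(logits).sum()
    return float((1.0 - np.sum(p ** 2)) * np.sum(f ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(10, 64))  # 10 classes, 64 features (arbitrary sizes)
    f = rng.normal(size=64)
    print(top_layer_trh(W, f))
```

The cost of this expression depends only on the number of classes and the feature width, which is in line with the abstract's remark that the computational cost does not depend on network depth.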

artificial intelligence, bayesian inference, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2211.12624

Country: North America > United States (0.92)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks

Ding, Nan, Chen, Xi, Levinboim, Tomer, Changpinyo, Beer, Soricut, Radu

arXiv.org Artificial Intelligence | Jul-19-2022

With the increasing abundance of pretrained models in recent years, the problem of selecting the best pretrained checkpoint for a particular downstream classification task has been gaining increased attention. Although several methods have recently been proposed to tackle the selection problem (e.g. LEEP, H-score), these methods resort to applying heuristics that are not well motivated by learning theory. In this paper we present PACTran, a theoretically grounded family of metrics for pretrained model selection and transferability measurement. We first show how to derive PACTran metrics from the optimal PAC-Bayesian bound under the transfer learning setting. We then empirically evaluate three metric instantiations of PACTran on a number of vision tasks (VTAB) as well as a language-and-vision (OKVQA) task. An analysis of the results shows PACTran is a more consistent and effective transferability measure compared to existing selection methods.

artificial intelligence, machine learning, test error corr, (16 more...)

arXiv.org Artificial Intelligence

2203.05126

Genre: Research Report > New Finding (0.66)

Industry: Health & Medicine > Diagnostic Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

Bridging the Gap Between Practice and PAC-Bayes Theory in Few-Shot Meta-Learning

Ding, Nan, Chen, Xi, Levinboim, Tomer, Goodman, Sebastian, Soricut, Radu

arXiv.org Machine Learning | May-28-2021

Despite recent advances in its theoretical understanding, there still remains a significant gap in the ability of existing PAC-Bayesian theories on meta-learning to explain performance improvements in the few-shot learning setting, where the number of training examples in the target tasks is severely limited. This gap originates from an assumption in the existing theories which supposes that the number of training examples in the observed tasks and the number of training examples in the target tasks follow the same distribution, an assumption that rarely holds in practice. By relaxing this assumption, we develop two PAC-Bayesian bounds tailored for the few-shot learning setting and show that two existing meta-learning algorithms (MAML and Reptile) can be derived from our bounds, thereby bridging the gap between practice and PAC-Bayesian theories. Furthermore, we derive a new computationally-efficient PACMAML algorithm, and show it outperforms existing meta-learning algorithms on several few-shot benchmark datasets.

neural network, pacmaml, survey article, (19 more...)

arXiv.org Machine Learning

2105.14099

Country:

North America (0.14)
Europe > Sweden (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.67)

Attention that does not Explain Away

Ding, Nan, Fan, Xinjie, Lan, Zhenzhong, Schuurmans, Dale, Soricut, Radu

arXiv.org Machine Learning | Sep-29-2020

... performance in a variety of machine learning tasks, such as machine translation (Vaswani et al., 2017; Dehghani et al., 2019), language modeling (Devlin et al., 2019; Yang et al., 2019), summarization (Cohan et al., 2018; Goodman et al., 2019), dialog (Mazaré et al., 2018; Cheng et al., 2019), image captioning (Sharma et al., 2018; Zhao et al., 2019), and visual question answering (Yu et al., 2019b; Tan and Bansal, 2019). One of the most important components of the Transformer architecture is its self-attention mechanism, applied universally to both the encoder and the decoder components. ... This is because for a GMM, not all Gaussian centers (lower layer neurons) are required to contribute in generating output data (upper layer neurons). The information of the centers that do not generate data is lost after observing the data. This "explaining-away" effect is related to the one in the directed graphical model, in the sense that the existence of the few contributed lower neurons "explain away" the other muted lower neurons on generating upper neurons. In order to compensate for this, we describe ...

artificial intelligence, natural language, neuron, (21 more...)

arXiv.org Machine Learning

2009.14308

Country: North America > United States > Minnesota (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)

On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators

Chen, Changyou, Ding, Nan, Carin, Lawrence

Neural Information Processing Systems | Feb-14-2020, 11:13:01 GMT

Recent advances in Bayesian learning with large-scale data have witnessed emergence of stochastic gradient MCMC algorithms (SG-MCMC), such as stochastic gradient Langevin dynamics (SGLD), stochastic gradient Hamiltonian MCMC (SGHMC), and the stochastic gradient thermostat. While finite-time convergence properties of the SGLD with a 1st-order Euler integrator have recently been studied, corresponding theory for general SG-MCMCs has not been explored. In this paper we consider general SG-MCMCs with high-order integrators, and develop theory to analyze finite-time convergence properties and their asymptotic invariant measures. Our theoretical results show faster convergence rates and more accurate invariant measures for SG-MCMCs with higher-order integrators. For example, with the proposed efficient 2nd-order symmetric splitting integrator, the mean square error (MSE) of the posterior average for the SGHMC achieves an optimal convergence rate of $L^{-4/5}$ at $L$ iterations, compared to $L^{-2/3}$ for the SGHMC and SGLD with 1st-order Euler integrators.

artificial intelligence, integrator, machine learning, (8 more...)

Neural Information Processing Systems

Genre: Research Report (0.42)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Cold-Start Reinforcement Learning with Softmax Policy Gradient

Ding, Nan, Soricut, Radu

Neural Information Processing Systems | Dec-31-2017

Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax value function that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity of maximum-likelihood approaches. We apply this new cold-start reinforcement learning method in training sequence generation models for structured output prediction problems. Empirical evidence validates this method on automatic summarization and image captioning tasks.

deep learning, neural network, value function, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.14)

Genre: Research Report (0.46)

Stochastic Gradient MCMC with Stale Gradients

Chen, Changyou, Ding, Nan, Li, Chunyuan, Zhang, Yizhe, Carin, Lawrence

Neural Information Processing Systems | Dec-31-2016

Stochastic gradient MCMC (SG-MCMC) has played an important role in large-scale Bayesian learning, with well-developed theoretical convergence properties. In such applications of SG-MCMC, it is becoming increasingly popular to employ distributed systems, where stochastic gradients are computed based on some outdated parameters, yielding what are termed stale gradients. While stale gradients could be directly used in SG-MCMC, their impact on convergence properties has not been well studied. In this paper we develop theory to show that while the bias and MSE of an SG-MCMC algorithm depend on the staleness of stochastic gradients, its estimation variance (relative to the expected estimate, based on a prescribed number of samples) is independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale gradients are computed asynchronously by a set of workers, our theory indicates a linear speedup on the decrease of estimation variance w.r.t. the number of workers. Experiments on synthetic data and deep neural networks validate our theory, demonstrating the effectiveness and scalability of SG-MCMC with stale gradients.

artificial intelligence, gradient, variance, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (0.68)
Europe > United Kingdom > England (0.14)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators

Chen, Changyou, Ding, Nan, Carin, Lawrence

arXiv.org Machine Learning | Oct-21-2016

Recent advances in Bayesian learning with large-scale data have witnessed emergence of stochastic gradient MCMC algorithms (SG-MCMC), such as stochastic gradient Langevin dynamics (SGLD), stochastic gradient Hamiltonian MCMC (SGHMC), and the stochastic gradient thermostat. While finite-time convergence properties of the SGLD with a 1st-order Euler integrator have recently been studied, corresponding theory for general SG-MCMCs has not been explored. In this paper we consider general SG-MCMCs with high-order integrators, and develop theory to analyze finite-time convergence properties and their asymptotic invariant measures. Our theoretical results show faster convergence rates and more accurate invariant measures for SG-MCMCs with higher-order integrators. For example, with the proposed efficient 2nd-order symmetric splitting integrator, the mean square error (MSE) of the posterior average for the SGHMC achieves an optimal convergence rate of $L^{-4/5}$ at $L$ iterations, compared to $L^{-2/3}$ for the SGHMC and SGLD with 1st-order Euler integrators. Furthermore, convergence results of decreasing-step-size SG-MCMCs are also developed, with the same convergence rates as their fixed-step-size counterparts for a specific decreasing sequence. Experiments on both synthetic and real datasets verify our theory, and show advantages of the proposed method in two large-scale real applications.
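
For readers unfamiliar with the baseline being improved upon, the sketch below runs SGLD with a 1st-order Euler integrator on a toy problem: inferring the mean of unit-variance Gaussian data under a flat prior. The toy model, step size, minibatch size, and function name are all illustrative assumptions. The update follows the convention $\theta \leftarrow \theta - h \nabla \tilde{U}(\theta) + \sqrt{2h}\,\xi$ with $\xi \sim \mathcal{N}(0, 1)$, where $\tilde{U}$ is a minibatch estimate of the negative log-posterior. The 2nd-order symmetric splitting integrator proposed in the paper is not shown.

```python
import numpy as np

def sgld_posterior_mean(data, step=1e-3, batch=10, iters=6000, burn_in=1000, seed=0):
    """Estimate the posterior mean of a Gaussian mean parameter with Euler-integrated SGLD.

    The step size should be small relative to 1 / len(data) for the update to stay stable.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    theta, total = 0.0, 0.0
    for t in range(iters):
        minibatch = rng.choice(data, size=batch)  # sampled with replacement
        # Unbiased minibatch estimate of grad U, where U = sum_i (theta - x_i)^2 / 2.
        grad = n * (theta - minibatch.mean())
        # Euler step: gradient drift plus injected Gaussian noise.
        theta = theta - step * grad + np.sqrt(2.0 * step) * rng.normal()
        if t >= burn_in:
            total += theta
    return total / (iters - burn_in)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(loc=2.0, scale=1.0, size=100)
    # Under a flat prior the exact posterior mean equals the sample mean.
    print(sgld_posterior_mean(data), data.mean())
```

The returned value is the posterior average whose mean square error the abstract's convergence rates describe.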

artificial intelligence, integrator, survey article, (19 more...)

arXiv.org Machine Learning

1610.06665

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)
