AITopics | Allen-Zhu, Zeyuan

Reverse Training to Nurse the Reversal Curse

Golovneva, Olga, Allen-Zhu, Zeyuan, Weston, Jason, Sukhbaatar, Sainbayar

arXiv.org Artificial Intelligence, May-7-2024

Large language models (LLMs) have a surprising failure: when trained on "A has a feature B", they do not generalize to "B is a feature of A", which is termed the Reversal Curse. Because of Zipf's law, this issue persists even when training on trillions of tokens, and hence even if we train on the entire internet. This work proposes an alternative training scheme, called reverse training, whereby all words are used twice, doubling the amount of available tokens. The LLM is trained in both forward and reverse directions by reversing the training strings while preserving (i.e., not reversing) chosen substrings, such as entities. We show that data-matched reverse-trained models provide superior performance to standard models on standard tasks, and compute-matched reverse-trained models provide far superior performance on reversal tasks, helping resolve the reversal curse issue.
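The string transformation described in the abstract can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and exact-match entity spans; the function names and the example sentence are hypothetical and are not taken from the paper.

```python
def reverse_preserving_entities(text, entities):
    """Reverse the word order of `text`, keeping each entity's words in order."""
    words = text.split()
    # Try longer entities first so overlapping candidates resolve to the longest match.
    entity_tokens = sorted(
        (e.split() for e in entities if e.split()), key=len, reverse=True
    )
    segments = []
    i = 0
    while i < len(words):
        for ent in entity_tokens:
            if words[i:i + len(ent)] == ent:
                # Keep the whole entity as one unit so it is not reversed internally.
                segments.append(" ".join(ent))
                i += len(ent)
                break
        else:
            segments.append(words[i])
            i += 1
    return " ".join(reversed(segments))


def build_training_strings(text, entities):
    """Return the forward string plus its entity-preserving reversal."""
    return [text, reverse_preserving_entities(text, entities)]


sentence = "Abraham Lincoln was born in Kentucky"
print(build_training_strings(sentence, ["Abraham Lincoln"]))
```

Each training string appears once in each direction, which is how the scheme uses every word twice.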

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2403.13799

Country: North America > United States > New York (0.14)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Allen-Zhu, Zeyuan, Li, Yuanzhi

arXiv.org Artificial Intelligence, Apr-8-2024

Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation. More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity. Notable insights include:

* The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train.
* Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model's knowledge capacity. Language models can autonomously identify and prioritize domains rich in knowledge, optimizing their storage capacity.
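The 7B example is direct arithmetic on the reported ratio of 2 bits per parameter. A small sketch, assuming that ratio applies uniformly; the helper name and the conversion to gigabytes are illustrative additions, not figures from the paper.

```python
BITS_PER_PARAMETER = 2  # capacity ratio reported in the abstract


def knowledge_capacity_bits(num_parameters, bits_per_parameter=BITS_PER_PARAMETER):
    """Estimated knowledge capacity in bits for a model of the given size."""
    return num_parameters * bits_per_parameter


capacity_bits = knowledge_capacity_bits(7_000_000_000)
# Conversion to gigabytes (8 bits per byte, 1e9 bytes per GB) is added for scale only.
capacity_gigabytes = capacity_bits / 8 / 1e9

print(capacity_bits)       # 14000000000
print(capacity_gigabytes)  # 1.75
```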

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2404.05405

Country: North America > United States > District of Columbia > Washington (0.24)

Genre: Research Report > New Finding (0.93)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.72)

Physics of Language Models: Part 3.1, Knowledge Storage and Extraction

Allen-Zhu, Zeyuan, Li, Yuanzhi

arXiv.org Artificial Intelligence, Dec-26-2023

Large language models (LLMs) can store a vast amount of world knowledge, often extractable via question-answering (e.g., "What is Abraham Lincoln's birthday?"). However, do they answer such questions based on exposure to similar questions during training (i.e., cheating), or by genuinely learning to extract knowledge from sources like Wikipedia? In this paper, we investigate this issue using a controlled biography dataset. We find a strong correlation between the model's ability to extract knowledge and various diversity measures of the training data. Essentially, for knowledge to be reliably extracted, it must be sufficiently augmented (e.g., through paraphrasing, sentence shuffling) during pretraining. Without such augmentation, knowledge may be memorized but not extractable, leading to 0% accuracy, regardless of subsequent instruction fine-tuning. To understand why this occurs, we employ (nearly) linear probing to demonstrate a strong connection between the observed correlation and how the model internally encodes knowledge -- whether it is linearly encoded in the hidden embeddings of entity names or distributed across other token embeddings in the training text. This paper provides several key recommendations for LLM pretraining in the industry: (1) rewrite the pretraining data -- using small, auxiliary models -- to provide knowledge augmentation, and (2) incorporate more instruction-finetuning data into the pretraining stage before it becomes too late.
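Of the augmentations named in the abstract, sentence shuffling is the simplest to illustrate. The sketch below assumes a biography already split into sentences; the function name and the fictional biography are hypothetical, and paraphrasing (which the abstract suggests doing with auxiliary models) is not covered.

```python
import random


def shuffled_variants(sentences, num_variants, seed=0):
    """Return `num_variants` copies of a biography with its sentences reordered."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    variants = []
    for _ in range(num_variants):
        order = list(sentences)
        rng.shuffle(order)
        variants.append(" ".join(order))
    return variants


# Fictional biography used purely for illustration.
biography = [
    "Jane Doe was born on March 3, 1980.",
    "She studied physics.",
    "She works in Boston.",
]
for variant in shuffled_variants(biography, num_variants=3):
    print(variant)
```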

accuracy, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2309.14316

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > New York > Kings County > New York City (0.14)
North America > United States > Illinois > Cook County (0.14)
(5 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Physics of Language Models: Part 1, Context-Free Grammar

Allen-Zhu, Zeyuan, Li, Yuanzhi

arXiv.org Artificial Intelligence, Oct-4-2023

We design controlled experiments to study HOW generative language models, like GPT, learn context-free grammars (CFGs) -- diverse language systems with a tree-like structure capturing many aspects of natural languages, programs, and logics. CFGs are as hard as pushdown automata, and can be ambiguous so that verifying if a string satisfies the rules requires dynamic programming. We construct synthetic data and demonstrate that even for difficult (long and ambiguous) CFGs, pre-trained transformers can learn to generate sentences with near-perfect accuracy and impressive diversity. More importantly, we delve into the physical principles behind how transformers learn CFGs. We discover that the hidden states within the transformer implicitly and precisely encode the CFG structure (such as putting tree node information exactly on the subtree boundary), and learn to form "boundary to boundary" attentions resembling dynamic programming. We also cover some extensions of CFGs as well as the robustness aspect of transformers against grammar mistakes. Overall, our research provides a comprehensive and empirical understanding of how transformers learn CFGs, and reveals the physical mechanisms utilized by transformers to capture the structure and rules of languages.
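The dynamic programming that the abstract refers to can be illustrated with the standard CYK membership test. This is a sketch, assuming a grammar in Chomsky normal form; the balanced-parentheses grammar and all names below are illustrative choices, not the synthetic CFGs used in the paper.

```python
def cyk_accepts(tokens, unary_rules, binary_rules, start="S"):
    """CYK membership test for a grammar in Chomsky normal form.

    unary_rules:  list of (lhs, terminal)
    binary_rules: list of (lhs, (left_nonterminal, right_nonterminal))
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[(i, j)] holds every nonterminal that derives tokens[i:j].
    table = {}
    for i, tok in enumerate(tokens):
        table[(i, i + 1)] = {lhs for lhs, term in unary_rules if term == tok}
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):  # split point between the two children
                left, right = table[(i, k)], table[(k, j)]
                for lhs, (b, c) in binary_rules:
                    if b in left and c in right:
                        cell.add(lhs)
            table[(i, j)] = cell
    return start in table[(0, n)]


# Balanced parentheses; "S -> S S" makes the grammar ambiguous.
UNARY = [("L", "("), ("R", ")")]
BINARY = [("S", ("S", "S")), ("S", ("L", "R")), ("S", ("L", "X")), ("X", ("S", "R"))]

print(cyk_accepts(list("(())()"), UNARY, BINARY))
print(cyk_accepts(list("(()"), UNARY, BINARY))
```

Each table cell is filled from pairs of shorter spans that meet at a split point, so membership is decided bottom-up over span boundaries.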

artificial intelligence, context-free grammar, natural language, (3 more...)

arXiv.org Artificial Intelligence

2305.13673

Genre: Research Report > Experimental Study (0.73)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.60)

Physics of Language Models: Part 3.2, Knowledge Manipulation

Allen-Zhu, Zeyuan, Li, Yuanzhi

arXiv.org Artificial Intelligence, Sep-25-2023

Language models can store vast amounts of factual knowledge, but their ability to use this knowledge for logical reasoning remains questionable. This paper explores a language model's ability to manipulate its stored knowledge during inference. We focus on four manipulation types: retrieval (e.g., "What is person A's attribute X"), classification (e.g., "Is A's attribute X even or odd?"), comparison (e.g., "Is A greater than B in attribute X?"), and inverse search (e.g., "Which person's attribute X equals T?"). We observe that pre-trained language models like GPT2/3/4 excel in knowledge retrieval but struggle with simple classification or comparison tasks unless Chain of Thoughts (CoTs) are employed during both training and inference. They also perform poorly in inverse knowledge search, irrespective of the prompts. Our primary contribution is a synthetic dataset for a controlled experiment that confirms these inherent weaknesses: a language model cannot efficiently manipulate knowledge from pre-training data, even when such knowledge is perfectly stored and fully extractable in the models, and despite adequate instruct fine-tuning.

artificial intelligence, knowledge manipulation, machine learning, (3 more...)

arXiv.org Artificial Intelligence

2309.14402

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.87)

LoRA: Low-Rank Adaptation of Large Language Models

Hu, Edward J., Shen, Yelong, Wallis, Phillip, Allen-Zhu, Zeyuan, Li, Yuanzhi, Wang, Shean, Chen, Weizhu

arXiv.org Artificial Intelligence, Jun-17-2021

The dominant paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, conventional fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying many independent instances of fine-tuned models, each with 175B parameters, is extremely expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning. LoRA performs on par with or better than fine-tuning in model quality on both GPT-3 and GPT-2, despite having fewer trainable parameters, a higher training throughput, and no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptations, which sheds light on the efficacy of LoRA. We release our implementation in GPT-2 at https://github.com/microsoft/LoRA.
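The idea of freezing the weights and adding a trainable low-rank update can be sketched in plain Python. This is a toy illustration and not the released implementation: the class name, the zero initialization of `B`, the small Gaussian initialization of `A`, and the `alpha / rank` scaling are assumptions that the abstract does not spell out, and the layer sizes in the usage lines are arbitrary.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (toy sketch)."""

    def __init__(self, frozen_weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.W = frozen_weight                  # d_out x d_in, never updated
        self.d_out = len(frozen_weight)
        self.d_in = len(frozen_weight[0])
        self.rank = rank
        # Assumed initialization: A small random, B zero, so the update starts at 0.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(self.d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(self.d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, low_rank)]

    def trainable_parameters(self):
        # Only A (rank x d_in) and B (d_out x rank) would be trained.
        return self.rank * (self.d_in + self.d_out)


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
print(layer.forward([1.0, 1.0]))      # [3.0, 7.0], identical to W @ x at initialization
print(layer.trainable_parameters())   # 4

# Parameter ratio for one hypothetical 4096 x 4096 matrix at rank 8.
print((4096 * 4096) // (8 * (4096 + 4096)))  # 256
```

Because `B @ A` has the same shape as `W`, the two can be summed into a single matrix after training, which is consistent with the abstract's claim of no additional inference latency.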

deep learning, lora, neural network, (20 more...)

arXiv.org Artificial Intelligence

2106.09685

Country: North America > United States > Louisiana (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Forward Super-Resolution: How Can GANs Learn Hierarchical Generative Models for Real-World Distributions

Allen-Zhu, Zeyuan, Li, Yuanzhi

arXiv.org Machine Learning, Jun-4-2021

In practice, by simply training a generator and a discriminator together consisting of multi-layer neural networks with non-linear activation functions, using local search algorithms such as stochastic gradient descent ascent (SGDA), the generator network can be trained efficiently to generate samples from highly-complicated distributions (such as the distribution of images). Despite the great empirical success of GANs, they remain one of the least understood models on the theory side of deep learning. Most existing theories focus on the statistical properties of GANs at the global optimum [15, 16, 20, 87]. However, on the training side, gradient descent ascent only enjoys efficient convergence to a global optimum when the loss function is convex-concave, or efficient convergence to a critical point in general settings [37, 38, 48, 53, 71, 73, 75, 77, 78]. Due to the extreme non-linearity of the networks in both the generator and the discriminator, it is highly unlikely that the training objective of GANs can be convex-concave. In particular, even if the generator and the discriminator are linear functions over prescribed feature mappings -- such as the neural tangent kernel (NTK) feature mappings [3, 8, 9, 17, 18, 32, 35, 40, 41, 47, 51, 54, 65, 69, 92, 97] -- the training objective can still be non-convex-concave.
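For intuition about the convex-concave case mentioned above, here is a deterministic gradient descent ascent loop on a toy objective. It is a sketch under stated assumptions: the objective f(x, y) = 0.5 x^2 + x y - 0.5 y^2 (strongly convex in x, strongly concave in y), the step size, and the function name are illustrative and unrelated to any GAN objective in the paper.

```python
def gradient_descent_ascent(grad_x, grad_y, x, y, step_size, num_steps):
    """Simultaneous GDA: descend in x (min player), ascend in y (max player)."""
    for _ in range(num_steps):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - step_size * gx, y + step_size * gy
    return x, y


# Toy saddle problem: f(x, y) = 0.5*x**2 + x*y - 0.5*y**2, saddle point at (0, 0).
x_final, y_final = gradient_descent_ascent(
    grad_x=lambda x, y: x + y,   # df/dx
    grad_y=lambda x, y: x - y,   # df/dy
    x=1.0, y=1.0, step_size=0.1, num_steps=200,
)
print(x_final, y_final)  # both very close to 0
```

With step size 0.1, each update shrinks the distance to the saddle point by a factor of about 0.906, so the iterates contract geometrically. No such guarantee holds once the objective stops being convex-concave.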

deep learning, neural network, (18 more...)

arXiv.org Machine Learning

2106.02619

Country: North America > United States (0.13)

Genre:

Instructional Material (0.85)
Research Report (0.64)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning

Allen-Zhu, Zeyuan, Li, Yuanzhi

arXiv.org Machine Learning, Dec-17-2020

We formally study how Ensemble of deep learning models can improve test accuracy, and how the superior performance of ensemble can be distilled into a single model using Knowledge Distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the SAME architecture, trained using the SAME algorithm on the SAME data set, and they only differ by the random seeds used in the initialization. We empirically show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory, especially differently from ensemble of random feature mappings or the neural-tangent-kernel feature mappings, and is potentially out of the scope of existing theorems. Thus, to properly understand ensemble and knowledge distillation in deep learning, we develop a theory showing that when data has a structure we refer to as "multi-view", then ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can also be provably distilled into a single model by training a single model to match the output of the ensemble instead of the true label. Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and how the "dark knowledge" is hidden in the outputs of the ensemble -- that can be used in knowledge distillation -- compared to the true data labels. In the end, we prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
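The two operations the abstract describes, averaging member outputs and training a student to match the ensemble output rather than the true label, can be sketched as below. The helper names are hypothetical, and treating the averaged outputs as logits with a cross-entropy matching loss is an assumption made for illustration, not a detail confirmed by the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_output(member_outputs):
    """Average the per-class outputs of independently trained members."""
    n = len(member_outputs)
    return [sum(column) / n for column in zip(*member_outputs)]


def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's soft distribution."""
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Two members that differ only by seed (outputs are made-up numbers).
teacher = ensemble_output([[2.0, 0.0, -1.0], [4.0, 0.0, -1.0]])
print(teacher)                                        # [3.0, 0.0, -1.0]
print(distillation_loss([3.0, 0.0, -1.0], teacher))   # smallest when student matches
print(distillation_loss([-1.0, 0.0, 3.0], teacher))   # larger for a mismatched student
```

The soft target keeps the relative weights of the non-top classes, which is the information a hard label discards.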

deep learning, ensemble, neural network, (14 more...)

arXiv.org Machine Learning

2012.09816

Country: North America > United States (0.27)

Genre: Research Report > New Finding (0.34)

Industry:

Education (0.65)
Materials > Chemicals > Industrial Gases > Liquified Gas (0.45)
Materials > Chemicals > Commodity Chemicals > Petrochemicals > LNG (0.45)
Energy > Oil & Gas > Midstream (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Feature Purification: How Adversarial Training Performs Robust Deep Learning

Allen-Zhu, Zeyuan, Li, Yuanzhi

arXiv.org Machine Learning, Sep-17-2020

Despite the empirical success of using Adversarial Training to defend deep learning models against adversarial perturbations, so far, it still remains rather unclear what the principles are behind the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call Feature Purification, where we show one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network; and more importantly, one of the goals of adversarial training is to remove such mixtures to purify hidden weights. We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Technically, we give, to the best of our knowledge, the first result proving that the following two can hold simultaneously for training a neural network with ReLU activation. (1) Training over the original data is indeed non-robust to small adversarial perturbations of some radius. (2) Adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against ANY perturbations of the same radius. Finally, we also prove a complexity lower bound, showing that low complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.
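As a concrete picture of an empirical perturbation algorithm such as FGM (the fast gradient method), the sketch below takes one normalized gradient step of a fixed L2 radius. It assumes a linear classifier with logistic loss purely for brevity, whereas the paper studies ReLU networks; the function names and numbers are illustrative.

```python
import math


def logistic_loss_input_gradient(w, x, y):
    """Gradient with respect to x of log(1 + exp(-y * <w, x>)), with y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def fgm_perturb(w, x, y, radius):
    """Move x by `radius` (in L2 norm) along the loss gradient direction."""
    grad = logistic_loss_input_gradient(w, x, y)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(x)  # no ascent direction available
    return [xi + radius * g / norm for xi, g in zip(x, grad)]


w, x, y = [3.0, 4.0], [1.0, 1.0], 1
x_adv = fgm_perturb(w, x, y, radius=0.5)
print(x_adv)  # approximately [0.7, 0.6]; the margin drops from 7.0 to about 4.5
```

Adversarial training would then fit the model on `x_adv` in place of `x`.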

deep learning, neural network, (17 more...)

arXiv.org Machine Learning

2005.10190

Country: North America > United States (0.13)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Gambling (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls

Allen-Zhu, Zeyuan, Hazan, Elad, Hu, Wei, Li, Yuanzhi

Neural Information Processing Systems, Feb-14-2020, 18:42:06 GMT

We propose a rank-k variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball. Our algorithm replaces the top singular-vector computation (1-SVD) in Frank-Wolfe with a top-k singular-vector computation (k-SVD), which can be done by repeatedly applying 1-SVD k times. Alternatively, our algorithm can be viewed as a rank-k restricted version of projected gradient descent. We show that our algorithm has a linear convergence rate when the objective function is smooth and strongly convex, and the optimal solution has rank at most k. This improves the convergence rate and the total time complexity of the Frank-Wolfe method and its variants.
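For reference, the classical Frank-Wolfe baseline with a 1-SVD step can be sketched as follows. This is not the rank-k variant proposed in the paper. It assumes the toy objective f(X) = 0.5 * ||X - M||_F^2, the standard step size 2 / (t + 2), and power iteration as an approximate 1-SVD; all names and the matrix `M` are illustrative.

```python
import math
import random


def top_singular_pair(G, iters=50, seed=0):
    """Approximate top singular vectors (u, v) of G by power iteration on G^T G."""
    rows, cols = len(G), len(G[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    for _ in range(iters):
        u = [sum(G[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(G[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    v_norm = math.sqrt(sum(c * c for c in v))
    v = [c / v_norm for c in v]
    u = [sum(G[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    u_norm = math.sqrt(sum(c * c for c in u))
    if u_norm == 0.0:
        return [0.0] * rows, v
    return [c / u_norm for c in u], v


def loss(X, M):
    return 0.5 * sum((x - m) ** 2 for rx, rm in zip(X, M) for x, m in zip(rx, rm))


def frank_wolfe(M, radius, num_steps):
    """Minimize 0.5 * ||X - M||_F^2 over the trace-norm ball of the given radius."""
    rows, cols = len(M), len(M[0])
    X = [[0.0] * cols for _ in range(rows)]
    for t in range(num_steps):
        # Negative gradient of the objective at X.
        neg_grad = [[M[i][j] - X[i][j] for j in range(cols)] for i in range(rows)]
        # Linear minimization over the ball: radius * u v^T for the top pair.
        u, v = top_singular_pair(neg_grad)
        eta = 2.0 / (t + 2)
        X = [[(1 - eta) * X[i][j] + eta * radius * u[i] * v[j] for j in range(cols)]
             for i in range(rows)]
    return X


M = [[0.5, 0.0], [0.0, 0.3]]  # trace norm 0.8, inside the unit ball
X = frank_wolfe(M, radius=1.0, num_steps=500)
print(loss(X, M))
```

Each step adds a rank-one matrix, and this baseline is known to converge at a sublinear O(1/t) rate; the abstract's contribution is a k-SVD step that achieves a linear rate under its stated assumptions.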

algorithm, artificial intelligence, optimization problem, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.55)
