Collaborating Authors

 Sun, Ziteng


InfAlign: Inference-aware language model alignment

arXiv.org Artificial Intelligence

Language model alignment has become a critical step in training modern generative language models. The goal of alignment is to finetune a reference model such that the win rate of a sample from the aligned model over a sample from the reference model is high, subject to a KL divergence constraint. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. However, the alignment objective does not capture such inference-time decoding procedures. We show that the existing alignment framework is sub-optimal in view of such inference-time methods. We then modify the alignment objective and propose a framework for inference-aware alignment (IAPO). We prove that for any inference-time decoding algorithm, the optimal solution that optimizes the inference-time win rate of the aligned policy against the reference policy is the solution to the typical RLHF problem with a transformation of the reward. This motivates us to provide the KL-regularized calibrate-and-transform RL (CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. We particularize our study to two important inference-time strategies: best-of-N sampling and best-of-N jailbreaking, where N responses are sampled from the model and the one with the highest or lowest reward is selected. We propose specific transformations for these strategies and demonstrate that our framework offers significant improvements over existing state-of-the-art methods for language model alignment. Empirically, we outperform baselines that are designed without taking inference-time decoding into consideration by 8-12% and 4-9% on inference-time win rates over the Anthropic helpfulness and harmlessness dialog benchmark datasets.
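
As a rough illustration of the inference-time procedure this abstract targets, the sketch below implements plain best-of-N selection against a scalar reward model. The sampler, reward function, and prompt are placeholders, and the calibrate-and-transform recipe of CTRL itself is not reproduced here; this only shows the decoding step that inference-aware alignment is meant to account for.

```python
import random

def best_of_n(sample_fn, reward_fn, prompt, n=8, pick="max"):
    """Draw n candidate responses and keep the one with the highest
    (best-of-N) or lowest (best-of-N jailbreaking) reward.

    sample_fn(prompt) -> str and reward_fn(prompt, response) -> float
    are placeholders for a language model sampler and a reward model.
    """
    candidates = [sample_fn(prompt) for _ in range(n)]
    scores = [reward_fn(prompt, c) for c in candidates]
    pick_fn = max if pick == "max" else min
    best_idx = pick_fn(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]

# Toy usage with a dummy sampler/reward, only to show the interface.
if __name__ == "__main__":
    responses = ["ok", "good", "great", "bad"]
    sample_fn = lambda prompt: random.choice(responses)
    reward_fn = lambda prompt, r: len(r)  # stand-in scalar reward
    print(best_of_n(sample_fn, reward_fn, "hello", n=4))
```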


Asymptotics of Language Model Alignment

arXiv.org Machine Learning

Let $p$ denote a generative language model. Let $r$ denote a reward model that returns a scalar that captures the degree to which a draw from $p$ is preferred. The goal of language model alignment is to alter $p$ to a new distribution $\phi$ that results in a higher expected reward while keeping $\phi$ close to $p.$ A popular alignment method is KL-constrained reinforcement learning (RL), which chooses a distribution $\phi_\Delta$ that maximizes $E_{\phi_{\Delta}} r(y)$ subject to a relative entropy constraint $KL(\phi_\Delta || p) \leq \Delta.$ Another simple alignment method is best-of-$N$, where $N$ samples are drawn from $p$ and the one with the highest reward is selected. In this paper, we offer a closed-form characterization of the optimal KL-constrained RL solution. We demonstrate that any alignment method that achieves a comparable trade-off between KL divergence and reward must approximate the optimal KL-constrained RL solution in terms of relative entropy. To further analyze the properties of alignment methods, we introduce two simplifying assumptions: we let the language model be memoryless, and the reward model be linear. Although these assumptions may not reflect complex real-world scenarios, they enable a precise characterization of the asymptotic behavior of both best-of-$N$ alignment and the KL-constrained RL method in terms of information-theoretic quantities. We prove that the reward of the optimal KL-constrained RL solution satisfies a large deviation principle, and we fully characterize its rate function. We also show that the rate of growth of the scaled cumulants of the reward is characterized by a proper Renyi cross entropy. Finally, we show that best-of-$N$ is asymptotically equivalent to the KL-constrained RL solution by proving that their expected rewards are asymptotically equal, and concluding that the two distributions must be close in KL divergence.
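
A minimal numerical sketch of the closed form referred to above, on a toy discrete alphabet: the KL-regularized optimum is the exponential tilting $\phi(y) \propto p(y)\exp(r(y)/\beta)$, where $\beta$ stands in for the multiplier induced by the constraint $KL(\phi_\Delta || p) \leq \Delta$. The distributions and rewards below are made up purely for illustration.

```python
import numpy as np

# Toy reference model p over a small discrete alphabet and a scalar reward r.
p = np.array([0.5, 0.3, 0.15, 0.05])
r = np.array([0.0, 1.0, 2.0, 3.0])

def tilted_policy(p, r, beta):
    """Exponential tilting phi(y) proportional to p(y) * exp(r(y)/beta),
    the usual closed form of the KL-regularized reward maximizer; beta
    plays the role of the multiplier induced by KL(phi || p) <= Delta."""
    w = p * np.exp(r / beta)
    return w / w.sum()

for beta in [4.0, 1.0, 0.25]:
    phi = tilted_policy(p, r, beta)
    exp_reward = float(phi @ r)
    kl = float(np.sum(phi * np.log(phi / p)))
    print(f"beta={beta:5.2f}  E[r]={exp_reward:.3f}  KL(phi||p)={kl:.3f}")
```

Smaller values of beta trade more KL divergence from p for higher expected reward, which is the trade-off the asymptotic analysis characterizes.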


Optimal Block-Level Draft Verification for Accelerating Speculative Decoding

arXiv.org Artificial Intelligence

Speculative decoding has been shown to be an effective method for lossless acceleration of large language models (LLMs) during inference. In each iteration, the algorithm first uses a smaller model to draft a block of tokens. The tokens are then verified by the large model in parallel, and only a subset of the tokens is kept to guarantee that the final output follows the distribution of the large model. In all prior speculative decoding works, the draft verification is performed token-by-token independently. In this work, we propose a better draft verification algorithm that provides additional wall-clock speedup without incurring additional computation cost or draft tokens. We first formulate the draft verification step as a block-level optimal transport problem. The block-level formulation allows us to consider a wider range of draft verification algorithms and obtain a higher expected number of accepted tokens in one draft block. We propose a verification algorithm that achieves the optimal accepted length for the block-level transport problem. We empirically evaluate our proposed block-level verification algorithm on a wide range of tasks and datasets, and observe consistent improvements in wall-clock speedup compared to the token-level verification algorithm. To the best of our knowledge, our work is the first to establish an improvement over speculative decoding through a better draft verification algorithm.
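
For contrast with the block-level scheme, here is a minimal sketch of the token-by-token verification rule the abstract says prior work uses: accept a drafted token x with probability min(1, p(x)/q(x)) and otherwise resample from the normalized residual max(p - q, 0). The distributions are toy arrays; the block-level optimal-transport verifier itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_token(draft_token, q, p):
    """Standard token-level verification: accept the small-model draft
    with prob min(1, p/q); on rejection, resample from the residual
    max(p - q, 0) renormalized. The output is distributed exactly as p."""
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

# Toy example: 4-symbol vocabulary, draft distribution q, target p.
q = np.array([0.4, 0.3, 0.2, 0.1])
p = np.array([0.25, 0.25, 0.25, 0.25])
draft = rng.choice(4, p=q)
print(verify_token(draft, q, p))
```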


SpecTr: Fast Speculative Decoding via Optimal Transport

arXiv.org Artificial Intelligence

Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, making it slow, and even prohibitive in certain tasks. One way to speed up sampling is $\textit{speculative decoding}$: use a small model to sample a $\textit{draft}$ (block or sequence of tokens), and then score all tokens in the draft with the large language model in parallel. A subset of the tokens in the draft are accepted (and the rest rejected) based on a statistical method that guarantees the final output follows the distribution of the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with $\textit{membership cost}$. This framework can be viewed as an extension of the well-known $\textit{maximal-coupling}$ problem. The new formulation enables us to generalize the speculative decoding method to allow for a set of $k$ candidates at the token level, which leads to an improved optimal membership cost. We show that the optimal draft selection algorithm (transport plan) can be computed via linear programming, whose best-known runtime is exponential in $k$. We then propose a valid draft selection algorithm whose acceptance probability is $(1-1/e)$-optimal multiplicatively. Moreover, it can be computed in time almost linear in the size of the domain of a single token. Using this new draft selection algorithm, we develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall-clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
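
The maximal-coupling view mentioned above can be made concrete on toy distributions: with a single draft, the best achievable acceptance probability of any valid verification scheme is the overlap sum_x min(p(x), q(x)) = 1 - TV(p, q). The sketch below only computes this quantity; the k-candidate transport plan and the (1-1/e)-optimal selection rule are not reproduced.

```python
import numpy as np

def max_coupling_acceptance(p, q):
    """Optimal single-draft acceptance probability under any valid
    verification scheme: the overlap sum_x min(p(x), q(x)),
    i.e. 1 minus the total variation distance between p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.minimum(p, q).sum())

p = [0.25, 0.25, 0.25, 0.25]   # target (large-model) distribution
q = [0.40, 0.30, 0.20, 0.10]   # draft (small-model) distribution
print(max_coupling_acceptance(p, q))  # 0.80 = 1 - TV(p, q)
```

Allowing k token-level candidates, as SpecTr does, raises the achievable acceptance probability beyond this single-draft overlap.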


Subset-Based Instance Optimality in Private Estimation

arXiv.org Artificial Intelligence

We propose a new definition of instance optimality for differentially private estimation algorithms. Our definition requires an optimal algorithm to compete, simultaneously for every dataset $D$, with the best private benchmark algorithm that (a) knows $D$ in advance and (b) is evaluated by its worst-case performance on large subsets of $D$. That is, the benchmark algorithm need not perform well when potentially extreme points are added to $D$; it only has to handle the removal of a small number of real data points that already exist. This makes our benchmark significantly stronger than those proposed in prior work. We nevertheless show, for real-valued datasets, how to construct private algorithms that achieve our notion of instance optimality when estimating a broad class of dataset properties, including means, quantiles, and $\ell_p$-norm minimizers. For means in particular, we provide a detailed analysis and show that our algorithm simultaneously matches or exceeds the asymptotic performance of existing algorithms under a range of distributional assumptions.
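
The instance-optimal constructions in the paper are more involved; as a point of reference only, the sketch below shows the standard worst-case baseline they are compared against for means: clip each real-valued point to a known range and add Laplace noise scaled to the resulting sensitivity. The range [lo, hi] is an assumed known bound and is not part of the paper's method, whose error instead adapts to the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_mean_laplace(data, lo, hi, eps):
    """Baseline eps-DP mean for real-valued data: clip each point to a
    known range [lo, hi] and add Laplace noise calibrated to the
    sensitivity (hi - lo) / n of the clipped mean. This is NOT the
    instance-optimal algorithm from the paper; its error depends on the
    worst-case range rather than on the dataset at hand."""
    data = np.clip(np.asarray(data, float), lo, hi)
    n = len(data)
    sensitivity = (hi - lo) / n
    return data.mean() + rng.laplace(scale=sensitivity / eps)

print(dp_mean_laplace(rng.normal(5.0, 1.0, size=1000), lo=0.0, hi=10.0, eps=1.0))
```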


The importance of feature preprocessing for differentially private linear optimization

arXiv.org Artificial Intelligence

Training machine learning models with differential privacy (DP) has received increasing interest in recent years. Among the most popular algorithms for training differentially private models are differentially private stochastic gradient descent (DPSGD) and its variants, where at each step gradients are clipped and combined with some noise. Given the increasing usage of DPSGD, we ask the question: is DPSGD alone sufficient to find a good minimizer for every dataset under privacy constraints? As a first step towards answering this question, we show that even for the simple case of linear classification, unlike in non-private optimization, (private) feature preprocessing is vital for differentially private optimization. In detail, we first show theoretically that there exists an example where, without feature preprocessing, DPSGD incurs a privacy error proportional to the maximum norm of features over all samples. We then propose DPSGD-F, an algorithm that combines DPSGD with feature preprocessing, and prove that for classification tasks it incurs a privacy error proportional to the diameter of the features $\max_{x, x' \in D} \|x - x'\|_2$. Finally, we demonstrate the practicality of our algorithm on image classification benchmarks.
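
A minimal sketch of the DPSGD step the abstract refers to, for a linear classifier with logistic loss: per-example gradients are clipped and Gaussian noise is added to their sum. The private feature-preprocessing step of DPSGD-F, which would estimate and subtract a feature center before running DPSGD, is only indicated by a comment; all data and hyperparameters are toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def dpsgd_step(w, X, y, lr, clip_norm, noise_mult):
    """One DPSGD step for logistic loss on a linear model: per-example
    gradients are clipped to clip_norm, summed, and Gaussian noise with
    standard deviation noise_mult * clip_norm is added."""
    grads = []
    for xi, yi in zip(X, y):
        margin = yi * (xi @ w)
        g = -yi * xi / (1.0 + np.exp(margin))      # per-example gradient
        norm = np.linalg.norm(g)
        grads.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    noisy_sum = np.sum(grads, axis=0) + rng.normal(
        scale=noise_mult * clip_norm, size=w.shape)
    return w - lr * noisy_sum / len(X)

# DPSGD-F would first privately estimate a feature center and subtract it
# from X, so the error scales with the feature diameter rather than the
# maximum feature norm; that preprocessing step is omitted in this sketch.
X = rng.normal(size=(128, 5)) + 10.0   # features with a large common offset
y = rng.choice([-1.0, 1.0], size=128)
w = np.zeros(5)
w = dpsgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_mult=1.0)
print(w)
```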


Concentration Bounds for Discrete Distribution Estimation in KL Divergence

arXiv.org Artificial Intelligence

Discrete distribution estimation, i.e., density estimation over discrete domains, is a fundamental problem in statistics with a rich history (see, e.g., [9, 10] for an overview and further references). In this work, we address a simple yet surprisingly ill-understood aspect of this question: what is the sample complexity of estimating an arbitrary discrete distribution in Kullback-Leibler (KL) divergence with vanishing probability of error? To describe the problem further, a few definitions are in order.
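
For orientation, a small sketch of the estimator family such questions are typically asked about: an add-constant (smoothed) empirical distribution, evaluated in KL divergence against the true distribution. The smoothing constant 1/2 is just one common choice and the simulation below is illustrative, not a reproduction of the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_constant_estimate(samples, k, c=0.5):
    """Add-constant estimator over a domain of size k: smoothed empirical
    frequencies (count + c) / (n + c*k). Smoothing keeps every symbol's
    probability positive, so the KL divergence KL(p || p_hat) stays finite."""
    counts = np.bincount(samples, minlength=k)
    return (counts + c) / (len(samples) + c * k)

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

k, n = 100, 10_000
p = rng.dirichlet(np.ones(k))            # arbitrary true distribution
samples = rng.choice(k, size=n, p=p)
p_hat = add_constant_estimate(samples, k)
print(kl(p, p_hat))
```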


Unified lower bounds for interactive high-dimensional estimation under information constraints

arXiv.org Artificial Intelligence

We consider the problem of parameter estimation under local information constraints, where the estimation algorithm has access to only limited information about each sample. These constraints can be of various types, including communication constraints, where each sample must be described using a few (e.g., a constant number of) bits; (local) privacy constraints, where each sample is obtained from a different user and the users seek to reveal as little as possible about their specific data; as well as many others, e.g., noisy communication channels or limited types of data access such as linear measurements. Such problems have received a lot of attention in recent years, motivated by applications such as data analytics in distributed systems and federated learning. Our main focus is on information-theoretic lower bounds for the minimax error rates (or, equivalently, the sample complexity) of these problems. Several recent works have provided different bounds that apply to specific constraints or work for specific parametric estimation problems, sometimes without allowing for interactive protocols. Indeed, handling interactive protocols is technically challenging, and several results in prior work exhibit flaws in their analysis. In particular, even the most basic Gaussian mean estimation problem using interactive communication remains, quite surprisingly, open. We present general, "plug-and-play" lower bounds for parametric estimation under information constraints that can be used for any local information constraint and allow for interactive protocols. Our abstract bound requires very simple (and natural) assumptions to hold for the underlying parametric family; in particular, we do not require the technical "regularity" conditions that are common in asymptotic statistics.
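
To make the setting concrete, here is one of the simplest information-constrained protocols the lower bounds apply to: each user holds a sample in [0, 1] and sends a single stochastically rounded bit, from which the server recovers an unbiased mean estimate. This toy protocol is non-interactive and purely illustrative; it is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_bit_mean(samples):
    """1-bit-per-user mean estimation: each user with x_i in [0, 1] sends
    a Bernoulli(x_i) bit; the average of the bits is an unbiased estimate
    of the population mean. A simple example of a communication-constrained
    protocol of the kind the minimax lower bounds cover."""
    bits = rng.random(len(samples)) < samples
    return bits.mean()

x = rng.beta(2.0, 5.0, size=10_000)      # users' samples, each in [0, 1]
print(one_bit_mean(x), x.mean())          # estimate vs. true sample mean
```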


Discrete Distribution Estimation under User-level Local Differential Privacy

arXiv.org Artificial Intelligence

We study discrete distribution estimation under user-level local differential privacy (LDP). In user-level $\varepsilon$-LDP, each user has $m\ge1$ samples and the privacy of all $m$ samples must be preserved simultaneously. We resolve the following dilemma: While on the one hand having more samples per user should provide more information about the underlying distribution, on the other hand, guaranteeing the privacy of all $m$ samples should make the estimation task more difficult. We obtain tight bounds for this problem under almost all parameter regimes. Perhaps surprisingly, we show that in suitable parameter regimes, having $m$ samples per user is equivalent to having $m$ times more users, each with only one sample. Our results demonstrate interesting phase transitions for $m$ and the privacy parameter $\varepsilon$ in the estimation risk. Finally, connecting with recent results on shuffled DP, we show that combined with random shuffling, our algorithm leads to optimal error guarantees (up to logarithmic factors) under the central model of user-level DP in certain parameter regimes. We provide several simulations to verify our theoretical findings.
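
For orientation, a minimal single-sample (m = 1) epsilon-LDP baseline: k-ary randomized response with the standard unbiased debiasing step. The user-level mechanisms in the paper, which must protect all m samples of a user simultaneously, are not reproduced; parameters below are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def k_rr(sample, k, eps):
    """k-ary randomized response: report the true symbol with probability
    e^eps / (e^eps + k - 1), otherwise a uniformly random other symbol."""
    p_true = np.exp(eps) / (np.exp(eps) + k - 1)
    if rng.random() < p_true:
        return sample
    other = rng.integers(k - 1)
    return other if other < sample else other + 1

def estimate_distribution(reports, k, eps):
    """Debias the empirical frequencies of the randomized reports to get an
    unbiased estimate of the true distribution."""
    p_true = np.exp(eps) / (np.exp(eps) + k - 1)
    p_other = 1.0 / (np.exp(eps) + k - 1)
    freq = np.bincount(reports, minlength=k) / len(reports)
    return (freq - p_other) / (p_true - p_other)

k, eps, n = 10, 1.0, 50_000
p = rng.dirichlet(np.ones(k))
data = rng.choice(k, size=n, p=p)
reports = np.array([k_rr(x, k, eps) for x in data])
print(np.abs(estimate_distribution(reports, k, eps) - p).sum())  # L1 error
```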


Learning with User-Level Privacy

arXiv.org Machine Learning

We propose and analyze algorithms to solve a range of learning tasks under user-level differential privacy constraints. Rather than guaranteeing only the privacy of individual samples, user-level DP protects a user's entire contribution ($m \ge 1$ samples), providing more stringent but more realistic protection against information leaks. We show that for high-dimensional mean estimation, empirical risk minimization with smooth losses, stochastic convex optimization, and learning hypothesis classes with finite metric entropy, the privacy cost decreases as $O(1/\sqrt{m})$ as users provide more samples. In contrast, when increasing the number of users $n$, the privacy cost decreases at a faster $O(1/n)$ rate. We complement these results with lower bounds showing the worst-case optimality of our algorithms for mean estimation and stochastic convex optimization. Our algorithms rely on novel techniques for private mean estimation in arbitrary dimension with error scaling as the concentration radius $\tau$ of the distribution rather than the entire range. Under uniform convergence, we derive an algorithm that privately answers a sequence of $K$ adaptively chosen queries with privacy cost proportional to $\tau$, and apply it to solve the learning tasks we consider.
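
A minimal sketch of the user-level mean-estimation idea in the central model: average each user's m samples, clip the per-user means to an interval of width proportional to an assumed concentration radius tau around a rough center, and add Gaussian noise across users. In the paper the clipping region is found privately and the mechanism handles adaptive queries; here the center and tau are simply assumed known, and all parameters are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def user_level_dp_mean(user_data, center, tau, eps, delta):
    """User-level DP mean (sketch): each user's m samples are averaged,
    the per-user means are clipped to [center - tau, center + tau], and
    Gaussian noise calibrated to the clipped range is added. Changing one
    user can move the clipped average by at most 2*tau / n."""
    user_means = np.array([u.mean() for u in user_data])
    clipped = np.clip(user_means, center - tau, center + tau)
    n = len(user_data)
    sensitivity = 2 * tau / n                      # one user changes entirely
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return clipped.mean() + rng.normal(scale=sigma)

users = [rng.normal(3.0, 1.0, size=50) for _ in range(200)]   # n=200, m=50
print(user_level_dp_mean(users, center=3.0, tau=1.0, eps=1.0, delta=1e-5))
```

Because the per-user means concentrate within radius roughly tau of the true mean, the noise scale depends on tau rather than on the full data range, which is the source of the $O(1/\sqrt{m})$ improvement described in the abstract.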