
Collaborating Authors

 Harchaoui, Zaid


Spectral Differential Network Analysis for High-Dimensional Time Series

arXiv.org Machine Learning

Spectral networks derived from multivariate time series data arise in many domains, from brain science to Earth science. Often, it is of interest to study how these networks change across conditions. For instance, to better understand epilepsy, it would be interesting to capture the changes in the brain connectivity network as a patient experiences a seizure, using electroencephalography data. A common approach estimates the network in each condition and takes their difference. Such estimates may behave poorly in high dimensions, since the networks themselves may not be sparse even when their difference is. We build upon this observation to develop an estimator of the difference in inverse spectral densities across two conditions. Using an L1 penalty on the difference, we establish consistency while requiring only the difference, not the individual networks, to be sparse. We illustrate the method on synthetic data, on electroencephalography data, and on optogenetic stimulation and micro-electrocorticography data.
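
A minimal numerical sketch of the objects involved, not the paper's estimator: it forms averaged-periodogram spectral matrices for each condition and soft-thresholds the difference of their inverses as a naive plug-in baseline. The function names, segment count, and threshold are illustrative choices.

```python
import numpy as np

def spectral_density(X, n_seg=8):
    """Averaged-periodogram estimate of the spectral density matrix.

    X is a (T, p) multivariate time series; returns an (n_freq, p, p) array
    of Hermitian spectral matrices, one per Fourier frequency.
    """
    T, p = X.shape
    L = T // n_seg
    segs = X[: n_seg * L].reshape(n_seg, L, p)
    segs = segs - segs.mean(axis=1, keepdims=True)
    F = np.fft.rfft(segs, axis=1)                       # (n_seg, n_freq, p)
    return np.einsum("sfi,sfj->fij", F, F.conj()) / (n_seg * L)

def naive_inverse_difference(S1, S2, lam=0.05):
    """Plug-in baseline: soft-threshold the difference of inverse spectral matrices.

    The paper's estimator instead targets the difference directly with an L1
    penalty and never inverts the individual (possibly non-sparse) estimates.
    """
    D = np.linalg.inv(S1) - np.linalg.inv(S2)
    shrink = np.maximum(1.0 - lam / np.maximum(np.abs(D), 1e-12), 0.0)
    return D * shrink                                   # entrywise complex soft-thresholding

rng = np.random.default_rng(0)
X1 = rng.standard_normal((2048, 5))   # condition 1 (e.g., baseline recording)
X2 = rng.standard_normal((2048, 5))   # condition 2 (e.g., during a seizure)
Delta = naive_inverse_difference(spectral_density(X1), spectral_density(X2))
print(Delta.shape)                    # (n_freq, 5, 5): one difference network per frequency
```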


From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

arXiv.org Artificial Intelligence

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation.
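
As a concrete instance of a token-level decoding algorithm, here is a hedged sketch of temperature-scaled top-k sampling from a vector of logits; the function name and default hyperparameters are illustrative, and any model producing next-token logits could plug in.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    """One step of token-level decoding: temperature-scaled top-k sampling.

    logits is a 1-D array over the vocabulary, as produced by any language
    model at the current position.  Setting temperature near 0 and top_k = 1
    recovers greedy decoding.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    if top_k is not None and top_k < len(scaled):
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled < cutoff, -np.inf, scaled)
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# A full generation loop then alternates model calls and sampling, e.g.:
#   while not finished: prefix.append(sample_next_token(model_logits(prefix)))
print(sample_next_token(np.random.default_rng(0).standard_normal(100)))
```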


Iterated Schr\"odinger bridge approximation to Wasserstein Gradient Flows

arXiv.org Machine Learning

We introduce a novel discretization scheme for Wasserstein gradient flows that involves successively computing Schrödinger bridges with the same marginals. This is different from both the forward/geodesic approximation and the backward/Jordan-Kinderlehrer-Otto (JKO) approximation. The proposed scheme has two advantages: one, it avoids the use of the score function, and, two, it is amenable to particle-based approximations using the Sinkhorn algorithm. Our proof hinges upon showing that the relative entropy between the Schrödinger bridge with the same marginals at temperature $\epsilon$ and the joint distribution of a stationary Langevin diffusion at times zero and $\epsilon$ is of order $o(\epsilon^2)$, with an explicit dependence given by the Fisher information. Owing to this inequality, we can show, using a triangular approximation argument, that the interpolated iterated application of the Schrödinger bridge approximation converges to the Wasserstein gradient flow, for a class of gradient flows including the heat flow. The results also provide a probabilistic and rigorous framework for the convergence of the self-attention mechanism in transformer networks to the solutions of heat flows, first observed in the inspiring machine learning work [SABP22].
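
A small sketch of the particle-based ingredient mentioned above: Sinkhorn iterations computing an entropic-OT (static Schrödinger bridge) coupling between two particle clouds with uniform marginals. The temperature, iteration count, and function name are illustrative, and this is only the inner building block, not the full iterated scheme.

```python
import numpy as np

def sinkhorn_bridge(x, y, eps=0.5, n_iter=300):
    """Entropic-OT coupling (static Schrödinger bridge) between particle clouds.

    x, y: (n, d) and (m, d) particle locations with uniform weights.
    Returns the (n, m) coupling for quadratic cost at temperature eps,
    obtained by alternating Sinkhorn scalings.
    """
    a = np.full(len(x), 1.0 / len(x))          # uniform source marginal
    b = np.full(len(y), 1.0 / len(y))          # uniform target marginal
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 2))              # particles approximating one marginal
y = rng.standard_normal((200, 2)) + 1.0        # particles approximating the other
pi = sinkhorn_bridge(x, y)
print(pi.sum(), pi.sum(axis=1)[:3])            # total mass ~1, rows match 1/200
```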


A Primal-Dual Algorithm for Faster Distributionally Robust Optimization

arXiv.org Machine Learning

We consider the penalized distributionally robust optimization (DRO) problem with a closed, convex uncertainty set, a setting that encompasses the $f$-DRO, Wasserstein-DRO, and spectral/$L$-risk formulations used in practice. We present Drago, a stochastic primal-dual algorithm that achieves a state-of-the-art linear convergence rate on strongly convex-strongly concave DRO problems. The method combines both randomized and cyclic components with mini-batching, which effectively handles the unique asymmetric nature of the primal and dual problems in DRO. We support our theoretical results with numerical benchmarks in classification and regression.
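
To make the saddle-point structure concrete, here is a toy full-batch primal-dual loop for a KL-penalized DRO objective with squared loss: exponentiated-gradient ascent on the per-example weights and gradient descent on the model. This only illustrates the primal-dual formulation Drago accelerates; it is not the Drago algorithm, and the function name and step sizes are made up for the example.

```python
import numpy as np

def primal_dual_dro(X, y, nu=0.1, lr_w=0.1, lr_q=0.5, n_epochs=300):
    """Toy primal-dual solver for a KL-penalized DRO least-squares objective.

    min_w max_{q in simplex}  sum_i q_i * l_i(w) - nu * KL(q || uniform),
    with l_i(w) = 0.5 * (x_i^T w - y_i)^2.  The dual variable q is updated by
    exponentiated gradient ascent, the primal variable w by gradient descent
    on the q-weighted loss.  Full-batch and unaccelerated: it only shows the
    saddle-point structure, not Drago's randomized/cyclic mini-batch scheme.
    """
    n, d = X.shape
    w, q = np.zeros(d), np.full(n, 1.0 / n)
    for _ in range(n_epochs):
        residual = X @ w - y
        losses = 0.5 * residual ** 2
        q = q * np.exp(lr_q * (losses - nu * (np.log(n * q) + 1.0)))   # dual ascent
        q /= q.sum()
        w -= lr_w * (X.T @ (q * residual))                             # primal descent
    return w, q

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)
w, q = primal_dual_dro(X, y)
print(q.min(), q.max())   # the adversarial weights concentrate on the hardest examples
```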


JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models

arXiv.org Artificial Intelligence

The permanence of online content, combined with enhanced authorship identification techniques, calls for stronger computational methods to protect the identity and privacy of online authors when needed, e.g., for blind reviews of scientific papers, anonymous online reviews, or anonymous interactions in mental health forums. In this paper, we propose an unsupervised, inference-time approach to authorship obfuscation that addresses its unique challenges: a lack of supervision data across diverse authorship styles and domains, and the need for a level of revision beyond simple paraphrasing to obfuscate authorship, all while preserving the original content and fluency. We introduce JAMDEC, a user-controlled, inference-time algorithm for authorship obfuscation that can in principle be applied to any text and authorship. Our approach builds on small language models such as GPT2-XL to avoid disclosing the original content to proprietary LLM APIs, while reducing the performance gap between small and large language models via algorithmic enhancements. The key idea behind our approach is to boost the creative power of smaller language models through constrained decoding, while also allowing for user-specified controls and flexibility. Experimental results demonstrate that our approach based on GPT2-XL outperforms previous state-of-the-art methods based on comparably small models, while performing competitively against GPT3.5 175B, a proprietary model two orders of magnitude larger.
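
A minimal sketch of the constrained-decoding idea, not JAMDEC itself: at each step the logits of desired content words are boosted and unwanted tokens are masked before selecting the next token. The helper name, bonus value, and toy vocabulary are assumptions made for illustration.

```python
import numpy as np

def constrained_step(logits, keyword_ids=(), banned_ids=(), boost=2.0, temperature=0.7):
    """One decoding step with soft lexical constraints over a small LM's logits.

    Tokens we want to keep (content words from the original text) get a logit
    bonus; tokens we want to avoid (e.g., stylistic giveaways) are masked out.
    Greedy selection is used here; sampling or beam search work the same way.
    """
    logits = np.asarray(logits, dtype=float).copy()
    if banned_ids:
        logits[list(banned_ids)] = -np.inf
    if keyword_ids:
        logits[list(keyword_ids)] += boost
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return int(np.argmax(probs))

# Toy 10-token vocabulary: encourage tokens 3 and 7, forbid token 1.
logits = np.random.default_rng(0).standard_normal(10)
print(constrained_step(logits, keyword_ids=[3, 7], banned_ids=[1]))
```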


MAUVE Scores for Generative Models: Theory and Practice

arXiv.org Artificial Intelligence

Generative artificial intelligence has made significant strides, producing text indistinguishable from human prose and remarkably photorealistic images. Automatically measuring how close the generated data distribution is to the target distribution is central to diagnosing existing models and developing better ones. We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images. These scores are statistical summaries of divergence frontiers capturing two types of errors in generative modeling. We explore three approaches to statistically estimate these scores: vector quantization, non-parametric estimation, and classifier-based estimation. We provide statistical bounds for the vector quantization approach. Empirically, we find that the proposed scores paired with a range of $f$-divergences and statistical estimation methods can quantify the gaps between the distributions of human-written text and those of modern neural language models by correlating with human judgments and identifying known properties of the generated texts. We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics. In conclusion, we present practical recommendations for using MAUVE effectively with language and image modalities.
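
A hedged sketch of the vector-quantization approach described above, assuming scikit-learn is available for k-means: both samples of embeddings are jointly quantized, turned into cluster histograms, and the area under the resulting divergence frontier is computed. The cluster count, the scaling constant c, and the grid of mixture weights are illustrative, not the tuned defaults of the MAUVE package.

```python
import numpy as np
from sklearn.cluster import KMeans

def mauve_vq(p_feats, q_feats, n_clusters=20, c=5.0, n_lambdas=50):
    """Vector-quantization estimate of a divergence-frontier (MAUVE-style) score.

    p_feats, q_feats: embedding matrices for human-written and model-generated
    samples.  Both samples are jointly quantized with k-means, converted to
    histograms over clusters, and the area under the frontier traced by
    exp(-c * KL) pairs against mixtures R = lam*P + (1-lam)*Q is returned.
    """
    joint = np.vstack([p_feats, q_feats])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(joint)
    p_hist = np.bincount(labels[: len(p_feats)], minlength=n_clusters) + 1e-6
    q_hist = np.bincount(labels[len(p_feats):], minlength=n_clusters) + 1e-6
    p, q = p_hist / p_hist.sum(), q_hist / q_hist.sum()

    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))

    xs, ys = [], []
    for lam in np.linspace(1e-3, 1.0 - 1e-3, n_lambdas):
        r = lam * p + (1.0 - lam) * q
        xs.append(np.exp(-c * kl(q, r)))   # error of mistaking the mixture for Q
        ys.append(np.exp(-c * kl(p, r)))   # error of mistaking the mixture for P
    order = np.argsort(xs)
    return float(np.trapz(np.array(ys)[order], np.array(xs)[order]))

rng = np.random.default_rng(0)
human = rng.standard_normal((500, 8))          # stand-in for human text embeddings
model = rng.standard_normal((500, 8)) + 0.5    # stand-in for model text embeddings
print(mauve_vq(human, model))                  # closer to 1 means closer distributions
```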


Faith and Fate: Limits of Transformers on Compositionality

arXiv.org Artificial Intelligence

Transformer large language models (LLMs) have sparked admiration for their exceptional performance on tasks that demand intricate multi-step reasoning. Yet, these models simultaneously show failures on surprisingly trivial problems. This begs the question: Are these errors incidental, or do they signal more substantial limitations? In an attempt to demystify transformer LLMs, we investigate the limits of these models across three representative compositional tasks -- multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer. We formulate compositional tasks as computation graphs to systematically quantify the level of complexity, and break down reasoning steps into intermediate sub-procedures. Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem-solving skills. To round off our empirical study, we provide theoretical arguments on abstract multi-step reasoning problems that highlight how autoregressive generations' performance can rapidly decay with increased task complexity.
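
To illustrate how a compositional task can be cast as a computation graph, here is a small sketch for multi-digit multiplication: nodes are input digits, single-digit partial products, and per-column sums with carries, and the edge count serves as a crude complexity proxy. The graph encoding is an assumption made for illustration, not the paper's exact construction.

```python
def multiplication_graph(a, b):
    """Computation graph for grade-school multiplication of two integers.

    Nodes are input digits a[i], b[j], partial products pp[i,j], and per-column
    sums col[k] with carries; edges record which values each step consumes.
    Returns the edge list and the reconstructed product so the decomposition
    can be checked against the direct answer.
    """
    da = [int(c) for c in str(a)][::-1]          # least-significant digit first
    db = [int(c) for c in str(b)][::-1]
    edges, cols = [], {}
    for i, x in enumerate(da):
        for j, y in enumerate(db):
            pp = f"pp[{i},{j}]"
            edges += [(f"a[{i}]", pp), (f"b[{j}]", pp), (pp, f"col[{i + j}]")]
            cols[i + j] = cols.get(i + j, 0) + x * y
    digits, carry = [], 0
    for k in range(max(cols) + 1):
        if carry:
            edges.append((f"col[{k - 1}]", f"col[{k}]"))   # carry dependency
        total = cols.get(k, 0) + carry
        digits.append(total % 10)
        carry = total // 10
    while carry:                                           # leftover high-order carries
        digits.append(carry % 10)
        carry //= 10
    product = int("".join(str(d) for d in digits[::-1]))
    return edges, product

edges, product = multiplication_graph(1234, 5678)
assert product == 1234 * 5678
print(len(edges), "edges; graph size grows quadratically with the number of digits")
```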


Distributionally Robust Optimization with Bias and Variance Reduction

arXiv.org Machine Learning

We consider the distributionally robust optimization (DRO) problem with spectral risk-based uncertainty set and $f$-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized conditional value-at-risk (CVaR) and average top-$k$ loss. We present Prospect, a stochastic gradient-based algorithm that only requires tuning a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3$\times$ faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains.
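
To fix the objective family in code, here is a short sketch of spectral risk measures with the average top-k loss and CVaR as special cases; it only evaluates the objective on given per-example losses and says nothing about Prospect's bias- and variance-reduced updates. Function names and levels are illustrative.

```python
import numpy as np

def spectral_risk(losses, sigma):
    """Spectral risk: dot product of ascending-sorted losses with weights sigma.

    sigma is non-negative, non-decreasing, and sums to one, so the largest
    losses receive the largest weights.
    """
    return float(np.dot(np.sort(losses), sigma))

def topk_weights(n, k):
    """Average top-k loss: uniform weight 1/k on the k largest losses."""
    sigma = np.zeros(n)
    sigma[-k:] = 1.0 / k
    return sigma

def cvar_weights(n, alpha):
    """CVaR at level alpha: uniform weight on (roughly) the worst alpha-fraction."""
    return topk_weights(n, max(1, int(np.ceil(alpha * n))))

losses = np.random.default_rng(0).exponential(size=1000)
print(spectral_risk(losses, cvar_weights(1000, 0.05)))   # mean of the worst 5%
print(spectral_risk(losses, topk_weights(1000, 10)))     # mean of the 10 largest losses
```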


FiLM: Fill-in Language Models for Any-Order Generation

arXiv.org Artificial Intelligence

Language models have become the backbone of today's AI systems. However, their predominant left-to-right generation limits the use of bidirectional context, which is essential for tasks that involve filling text in the middle. We propose the Fill-in Language Model (FiLM), a new language modeling approach that allows for flexible generation at any position without adhering to a specific generation order. Its training extends the masked language modeling objective by adopting varying mask probabilities sampled from a Beta distribution to enhance the generative capabilities of FiLM. During inference, FiLM can seamlessly insert missing phrases, sentences, or paragraphs, ensuring that the outputs are fluent and coherent with the surrounding context. In both automatic and human evaluations, FiLM outperforms existing infilling methods that rely on left-to-right language models trained on rearranged text segments. FiLM is easy to implement and can be either trained from scratch or fine-tuned from a left-to-right language model. Notably, as the model size grows, FiLM's perplexity approaches that of strong left-to-right language models of similar sizes, indicating FiLM's scalability and potential as a large language model.
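
A minimal sketch of the Beta-sampled masking described above: each training sequence draws its own mask rate from a Beta distribution before tokens are hidden. The MASK_ID constant and Beta parameters are placeholders, not the paper's settings.

```python
import numpy as np

MASK_ID = 0   # placeholder id for the [MASK] token

def film_mask(token_ids, a=2.5, b=2.5, rng=None):
    """Corrupt a training sequence with a Beta-sampled mask rate.

    Instead of a fixed masking rate, each sequence draws its own rate from
    Beta(a, b), exposing the model to everything from light infilling to
    near-complete generation.  The masked positions are the ones the model
    is trained to fill in.
    """
    rng = rng or np.random.default_rng()
    rate = rng.beta(a, b)
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < rate
    return np.where(mask, MASK_ID, token_ids), mask

tokens = np.arange(1, 13)                                   # a toy 12-token sequence
corrupted, mask = film_mask(tokens, rng=np.random.default_rng(0))
print(corrupted, int(mask.sum()), "positions to fill in")
```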


Statistical and Computational Guarantees for Influence Diagnostics

arXiv.org Machine Learning

Statistical machine learning models are increasingly used in fully or partially automated data analysis processes and artificial intelligence applications (Rudin, 2019). The automation of decisions that affect society has inspired a parallel effort to develop methods for identifying the factors behind specific decisions. The heightened scrutiny of the way statistical models now operate at large scale and at a fast pace has led to a renewed interest in statistical diagnostics such as the influence function (Cook and Weisberg, 1982; Koh and Liang, 2017; Schioppa et al., 2022; Louvet et al., 2022). The influence function, or influence curve, of a statistical estimator has been proposed to measure the sensitivity of the estimator to individual datapoints. Computing the influence of a particular datapoint boils down to computing an inverse-Hessian-vector product. Because the classical literature focused on least-squares-type estimators with small samples, the computational aspects received relatively little attention until recently (Koh and Liang, 2017; Schioppa et al., 2022), while the statistical aspects have mainly focused on classical large-sample asymptotics (Rousseeuw et al., 2011; Avella-Medina, 2017). The statistical analysis of influence functions for generalized linear models presents several challenges.
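
A small worked example of the inverse-Hessian-vector-product computation on L2-regularized logistic regression: the influence of a training point on a test loss is -grad_test^T H^{-1} grad_train. Here the Hessian is formed and solved explicitly for a tiny problem; at scale one would substitute conjugate gradient or stochastic approximations. All names and constants are illustrative assumptions.

```python
import numpy as np

def fit_logreg(X, y, lam=1e-2, lr=0.5, n_iter=500):
    """L2-regularized logistic regression by gradient descent (labels in {0, 1})."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y) + lam * w)
    return w

def influence(X, y, w, lam, train_x, train_y, test_x, test_y):
    """Influence of one training point on the loss at one test point.

    influence = -grad_test^T H^{-1} grad_train, with H the regularized
    empirical Hessian at the fitted w.  Here H is formed and solved
    explicitly; for large models the solve is replaced by conjugate gradient
    or stochastic inverse-Hessian-vector products.
    """
    p = 1.0 / (1.0 + np.exp(-X @ w))
    H = (X * (p * (1.0 - p))[:, None]).T @ X / len(y) + lam * np.eye(len(w))
    g_train = (1.0 / (1.0 + np.exp(-train_x @ w)) - train_y) * train_x
    g_test = (1.0 / (1.0 + np.exp(-test_x @ w)) - test_y) * test_x
    return float(-g_test @ np.linalg.solve(H, g_train))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X @ rng.standard_normal(5) + 0.2 * rng.standard_normal(200) > 0).astype(float)
w = fit_logreg(X, y)
print(influence(X, y, w, 1e-2, X[0], y[0], X[1], y[1]))
```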