Sanghai, Sumit
Integrating Planning into Single-Turn Long-Form Text Generation
Liang, Yi, Wu, You, Zhuang, Honglei, Chen, Li, Shen, Jiaming, Jia, Yiling, Qin, Zhen, Sanghai, Sumit, Wang, Xuanhui, Yang, Carl, Bendersky, Michael
Generating high-quality, in-depth textual documents, such as academic papers, news articles, Wikipedia entries, and books, remains a significant challenge for Large Language Models (LLMs). In this paper, we propose to use planning to generate long-form content. To achieve our goal, we generate intermediate steps via an auxiliary task that teaches the LLM to plan, reason, and structure before generating the final text. Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning. To overcome the scarcity of training data for these intermediate steps, we leverage LLMs to generate synthetic intermediate writing data, such as outlines, key information, and summaries, from existing full articles. Our experiments on datasets from two different domains, namely the scientific news dataset SciNews and the Wikipedia-based KILT-Wiki and FreshWiki datasets, demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents. We observed a +2.5% improvement in ROUGE-Lsum and a strong 3.60 overall win/loss ratio in human side-by-side (SxS) evaluation, with clear wins in organization, relevance, and verifiability.
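As a concrete illustration of the synthetic-data step described in the abstract, the minimal Python sketch below shows one way intermediate planning targets (outline, key information, summary) could be distilled from an existing article and packaged into a single-turn training example. The prompts, tags, and function names are illustrative assumptions, not the paper's exact recipe.

    # Minimal sketch: synthesize intermediate planning targets from an existing
    # article for the auxiliary task. Prompt wording, tags, and names are
    # illustrative assumptions rather than the paper's exact setup.
    def build_auxiliary_example(writing_prompt, article, llm):
        outline = llm("Write a section-by-section outline of this article:\n" + article)
        key_info = llm("List the key facts and entities this article relies on:\n" + article)
        summary = llm("Summarize this article in a few sentences:\n" + article)
        # A single target teaches the model to plan, reason, and structure
        # before emitting the final text, all in one turn.
        target = (
            "<outline>" + outline + "</outline>\n"
            "<key_info>" + key_info + "</key_info>\n"
            "<summary>" + summary + "</summary>\n"
            "<article>" + article + "</article>"
        )
        return {"input": writing_prompt, "target": target}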
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Ainslie, Joshua, Lee-Thorp, James, de Jong, Michiel, Zemlyanskiy, Yury, Lebrón, Federico, Sanghai, Sumit
Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.
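A minimal numpy sketch of the two ideas in the abstract, assuming per-head projection tensors: mean-pooling existing key/value heads into a smaller number of groups (the uptraining conversion) and letting each group of query heads share one key/value head at attention time. Shapes and names are illustrative, not the paper's implementation.

    import numpy as np

    def uptrain_kv_heads(k_proj, v_proj, num_groups):
        """Convert multi-head K/V projections to grouped-query form by
        mean-pooling each contiguous block of heads into one head.
        k_proj, v_proj: (num_heads, d_model, d_head) -> (num_groups, d_model, d_head)."""
        num_heads = k_proj.shape[0]
        assert num_heads % num_groups == 0
        per_group = num_heads // num_groups
        k_grouped = k_proj.reshape(num_groups, per_group, *k_proj.shape[1:]).mean(axis=1)
        v_grouped = v_proj.reshape(num_groups, per_group, *v_proj.shape[1:]).mean(axis=1)
        return k_grouped, v_grouped

    def grouped_query_attention(q, k, v, num_groups):
        """q: (num_q_heads, seq, d_head); k, v: (num_groups, seq, d_head)."""
        share = q.shape[0] // num_groups
        # each query head attends to the K/V head of its group
        k = np.repeat(k, share, axis=0)
        v = np.repeat(v, share, axis=0)
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        return weights @ v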
CoLT5: Faster Long-Range Transformers with Conditional Computation
Ainslie, Joshua, Lei, Tao, de Jong, Michiel, Ontañón, Santiago, Brahma, Siddhartha, Zemlyanskiy, Yury, Uthus, David, Guo, Mandy, Lee-Thorp, James, Tay, Yi, Sung, Yun-Hsuan, Sanghai, Sumit
Many natural language processing tasks benefit from long inputs, but processing long documents with Transformers is expensive -- not only due to quadratic attention complexity but also from applying feedforward and projection layers to every token. However, not all tokens are equally important, especially for longer documents. We propose CoLT5, a long-input Transformer model that builds on this intuition by employing conditional computation, devoting more resources to important tokens in both feedforward and attention layers. We show that CoLT5 achieves stronger performance than LongT5 with much faster training and inference, achieving SOTA on the long-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably make use of extremely long inputs, showing strong gains up to 64k input length.
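A small numpy sketch of the routing intuition, under simplifying assumptions (single-matrix branches stand in for the light and heavy feedforward blocks, and routing scores gate the heavy branch): every token gets the cheap computation, while only the top-k routed tokens also pass through the expensive one.

    import numpy as np

    def conditional_ffn(x, w_light, w_heavy, w_router, k):
        """Illustrative CoLT5-style conditional feedforward (not the paper's code).
        x: (seq_len, d_model). All tokens get the light branch; the k tokens with
        the highest routing score additionally get the heavy branch."""
        out = np.tanh(x @ w_light)                # light branch, every token
        scores = x @ w_router                     # (seq_len,) routing scores
        routed = np.argsort(-scores)[:k]          # indices of "important" tokens
        gate = scores[routed, None]               # gate heavy output by its score
        out[routed] = out[routed] + gate * np.tanh(x[routed] @ w_heavy)
        return out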
Functional Interpolation for Relative Positions Improves Long Context Transformers
Li, Shanda, You, Chong, Guruganesh, Guru, Ainslie, Joshua, Ontanon, Santiago, Zaheer, Manzil, Sanghai, Sumit, Yang, Yiming, Kumar, Sanjiv, Bhojanapalli, Srinadh
Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits on the input sequence lengths it can process, the choice of position encoding used during training can limit the performance of these models on longer inputs. We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts. We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple. We next empirically show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
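The abstract's "functional relative position encoding with progressive interpolation" can be sketched as follows: the relative distance i - j is normalized by the query position i (so the encoding interpolates gracefully as contexts grow), both quantities pass through a log transform, and a small learned function maps the result to an additive attention bias. The threshold L, constant c, and the stand-in mlp callable below are illustrative assumptions.

    import numpy as np

    def fire_bias(seq_len, mlp, L=32, c=1.0):
        """Illustrative FIRE-style additive bias for causal attention logits."""
        psi = lambda x: np.log(c * x + 1.0)
        bias = np.full((seq_len, seq_len), -np.inf)   # future positions stay masked
        for i in range(seq_len):
            for j in range(i + 1):
                bias[i, j] = mlp(psi(i - j) / psi(max(i, L)))
        return bias

    # usage with a trivial stand-in for the learned function
    bias = fire_bias(seq_len=8, mlp=lambda z: -2.0 * z)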
GLIMMER: generalized late-interaction memory reranker
de Jong, Michiel, Zemlyanskiy, Yury, FitzGerald, Nicholas, Sanghai, Sumit, Cohen, William W., Ainslie, Joshua
Memory-augmentation is a powerful approach for efficiently incorporating external information into language models, but leads to reduced performance relative to retrieving text. Recent work introduced LUMEN, a memory-retrieval hybrid that partially pre-computes memory and updates memory representations on the fly with a smaller live encoder. We propose GLIMMER, which improves on this approach through 1) exploiting free access to the powerful memory representations by applying a shallow reranker on top of memory to drastically improve retrieval quality at low cost, and 2) incorporating multi-task training to learn a general and higher quality memory and live encoder. GLIMMER achieves strong gains in performance at faster speeds compared to LUMEN and FiD on the KILT benchmark of knowledge-intensive tasks.
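Structurally, the reranking idea reads as below: a shallow scorer looks directly at the precomputed memory representations to pick the most promising passages, and only those pay the cost of the live encoder. The callables and names are placeholders for components the abstract describes only at a high level.

    def glimmer_answer(question, candidate_ids, memory, reranker, live_encoder,
                       decoder, top_k):
        """Illustrative GLIMMER-style flow: rerank over memory, then encode live."""
        # cheap reranker scores candidates from the precomputed memory alone
        ranked = sorted(candidate_ids,
                        key=lambda pid: reranker(question, memory[pid]),
                        reverse=True)
        # only the top passages are refined by the more expensive live encoder
        passage_reps = [live_encoder(question, memory[pid]) for pid in ranked[:top_k]]
        return decoder(question, passage_reps)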
Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute
de Jong, Michiel, Zemlyanskiy, Yury, FitzGerald, Nicholas, Ainslie, Joshua, Sanghai, Sumit, Sha, Fei, Cohen, William
Retrieval-augmented language models such as Fusion-in-Decoder are powerful, setting the state of the art on a variety of knowledge-intensive tasks. However, they are also expensive, due to the need to encode a large number of retrieved passages. Some work avoids this cost by pre-encoding a text corpus into a memory and retrieving dense representations directly. However, pre-encoding memory incurs a severe quality penalty as the memory representations are not conditioned on the current input. We propose LUMEN, a hybrid between these two extremes, pre-computing the majority of the retrieval representation and completing the encoding on the fly using a live encoder that is conditioned on the question and fine-tuned for the task. We show that LUMEN significantly outperforms pure memory on multiple question-answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget. Moreover, the advantage of LUMEN over FiD increases with model size.
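The hybrid described above splits encoding into a large offline pass and a small online pass; the sketch below shows the intended division of work, with the encoders and decoder left as opaque callables since the abstract does not specify their internals.

    def precompute_memory(passages, memory_encoder):
        """Offline: encode the whole corpus once with the large memory encoder."""
        return {pid: memory_encoder(text) for pid, text in passages.items()}

    def lumen_answer(question, retrieved_ids, memory, live_encoder,
                     question_encoder, decoder):
        """Online: a small live encoder finishes encoding each retrieved passage,
        conditioned on the question, before the decoder fuses them."""
        q_rep = question_encoder(question)
        passage_reps = [live_encoder(q_rep, memory[pid]) for pid in retrieved_ids]
        return decoder(q_rep, passage_reps)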
FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference
de Jong, Michiel, Zemlyanskiy, Yury, Ainslie, Joshua, FitzGerald, Nicholas, Sanghai, Sumit, Sha, Fei, Cohen, William
Fusion-in-Decoder (FiD) is a powerful retrieval-augmented language model that sets the state-of-the-art on many knowledge-intensive NLP tasks. However, the architecture used for FiD was chosen by making minimal modifications to a standard T5 model, which our analysis shows to be highly suboptimal for a retrieval-augmented model. In particular, FiD allocates the bulk of FLOPs to the encoder, while the majority of inference time results from memory bandwidth constraints in the decoder. We propose two simple changes to the FiD architecture to alleviate memory bandwidth constraints, and speed up inference by 7x. This allows us to use a much larger decoder at modest cost. We denote FiD with the above modifications as FiDO, and show that it strongly improves performance over existing FiD models for a wide range of inference budgets. For example, FiDO-Large-XXL performs faster inference than FiD-Base and achieves better performance than FiD-Large.
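The two changes are architectural: in the full paper they amount to layer-sparse cross-attention and multi-query attention in the decoder, which relieve the memory-bandwidth bottleneck and make a larger decoder affordable. A hedged configuration sketch, with example values that are not the paper's exact settings:

    from dataclasses import dataclass

    @dataclass
    class FiDODecoderConfig:
        """Illustrative knobs for a FiDO-style decoder; values are examples only."""
        num_layers: int = 24
        cross_attention_period: int = 6   # layer-sparse cross-attention: only every
                                          # 6th layer reads the retrieved passages
        num_kv_heads: int = 1             # multi-query attention in the decoder
        d_model: int = 4096               # decoder scaled up relative to the encoder

    def layer_has_cross_attention(layer_idx, cfg):
        # most decoder layers skip cross-attention over the encoded passages
        return (layer_idx + 1) % cfg.cross_attention_period == 0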
Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models
Vilnis, Luke, Zemlyanskiy, Yury, Murray, Patrick, Passos, Alexandre, Sanghai, Sumit
Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize. Alternatively, methods such as temperature sampling and its modifications (top-k sampling, nucleus sampling, typical decoding, and others) are embarrassingly parallel, but have no guarantees about duplicate samples. We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model, compatible with common sampling variations, with provable beam diversity under certain conditions, as well as being embarrassingly parallel and providing unbiased and consistent expectations from the original model. We demonstrate the effectiveness of our approach on WMT machine translation, more than halving the standard deviation when estimating expected BLEU score reward, and closing the BLEU score gap between independent sampling and beam search.
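Concretely, the "arithmetic code book" view can be sketched as follows: the model's sequence distribution partitions the unit interval, each code point in [0, 1) decodes to the sequence whose interval contains it, and an evenly spaced lattice of points with a shared random offset yields diverse samples that can be decoded independently in parallel. The next_token_probs callable below is an interface assumed for illustration.

    import numpy as np

    def decode_point(u, next_token_probs, eos, max_len):
        """Map a code point u in [0, 1) to a token sequence via the model's
        implicit arithmetic code (illustrative).
        next_token_probs(prefix) -> 1-D array of probabilities over the vocab."""
        lo, hi, prefix = 0.0, 1.0, []
        for _ in range(max_len):
            probs = next_token_probs(prefix)
            cum = np.cumsum(probs)
            frac = (u - lo) / (hi - lo)
            # which token's sub-interval of [lo, hi) contains u?
            tok = int(np.searchsorted(cum, frac, side="right"))
            tok = min(tok, len(probs) - 1)
            new_lo = lo + (hi - lo) * (cum[tok - 1] if tok > 0 else 0.0)
            new_hi = lo + (hi - lo) * cum[tok]
            lo, hi = new_lo, new_hi
            prefix.append(tok)
            if tok == eos:
                break
        return prefix

    def arithmetic_sample(next_token_probs, eos, n, max_len=20, rng=None):
        rng = rng or np.random.default_rng()
        offset = rng.random()
        # evenly spaced lattice of code points: diverse and embarrassingly parallel
        points = (offset + np.arange(n) / n) % 1.0
        return [decode_point(u, next_token_probs, eos, max_len) for u in points]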
ImPaKT: A Dataset for Open-Schema Knowledge Base Construction
Vilnis, Luke, Fisher, Zach, Kanagal, Bhargav, Murray, Patrick, Sanghai, Sumit
Large language models have ushered in a golden age of semantic parsing. The seq2seq paradigm allows for open-schema and abstractive attribute and relation extraction given only small amounts of finetuning data. Language model pretraining has simultaneously enabled great strides in natural language inference, reasoning about entailment and implication in free text. These advances motivate us to construct ImPaKT, a dataset for open-schema information extraction, consisting of around 2500 text snippets from the C4 corpus, in the shopping domain (product buying guides), professionally annotated with extracted attributes, types, attribute summaries (attribute schema discovery from idiosyncratic text), many-to-one relations between compound and atomic attributes, and implication relations. We release this data in the hope that it will be useful for fine-tuning semantic parsers for information extraction and knowledge base construction across a variety of domains. We evaluate the power of this approach by fine-tuning the open source UL2 language model on a subset of the dataset, extracting a set of implication relations from a corpus of product buying guides, and conducting human evaluations of the resulting predictions.
An Empirical Comparison of Algorithms for Aggregating Expert Predictions
Dani, Varsha, Madani, Omid, Pennock, David M, Sanghai, Sumit, Galebach, Brian
Predicting the outcomes of future events is a challenging problem for which a variety of solution methods have been explored and attempted. We present an empirical comparison of a variety of online and offline adaptive algorithms for aggregating experts' predictions of the outcomes of five years of US National Football League games (1319 games) using expert probability elicitations obtained from an Internet contest called ProbabilitySports. We find that it is difficult to improve over simple averaging of the predictions in terms of prediction accuracy, but that there is room for improvement in quadratic loss. Somewhat surprisingly, a Bayesian estimation algorithm which estimates the variance of each expert's prediction exhibits the most consistent superior performance over simple averaging among our collection of algorithms.
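For concreteness, here is a tiny worked example of the baseline and the loss being compared: simple averaging of expert probabilities, quadratic (Brier-style) loss against the game outcome, and a variance-weighted average as a rough stand-in for the Bayesian estimator mentioned above (the paper's actual estimator is more involved; this is only illustrative).

    import numpy as np

    def quadratic_loss(p, outcome):
        """Quadratic (Brier-style) loss for a binary outcome in {0, 1}."""
        return (p - outcome) ** 2

    def average_forecast(expert_probs):
        """Simple unweighted averaging of expert win probabilities."""
        return float(np.mean(expert_probs))

    def variance_weighted_forecast(expert_probs, expert_variances):
        """Illustrative stand-in for a variance-aware aggregator: experts with
        lower estimated variance get more weight."""
        w = 1.0 / np.asarray(expert_variances, dtype=float)
        return float(np.sum(w * np.asarray(expert_probs, dtype=float)) / np.sum(w))

    # e.g. three experts on one game that the favored team won (outcome = 1)
    probs = [0.70, 0.55, 0.90]
    print(quadratic_loss(average_forecast(probs), outcome=1))
    print(quadratic_loss(variance_weighted_forecast(probs, [0.04, 0.09, 0.02]), 1))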