AITopics | Kahn, Jacob

Collaborating Authors

Kahn, Jacob

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Altogether: Image Captioning via Re-aligning Alt-text

Xu, Hu, Huang, Po-Yao, Tan, Xiaoqing Ellen, Yeh, Ching-Feng, Kahn, Jacob, Jou, Christine, Ghosh, Gargi, Levy, Omer, Zettlemoyer, Luke, Yih, Wen-tau, Li, Shang-Wen, Xie, Saining, Feichtenhofer, Christoph

arXiv.org Artificial IntelligenceDec-28-2024

This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings. First, they caption images from scratch, ignoring existing alt-text metadata, and second, lack transparency if the captioners' training data (e.g. GPT) is unknown. In this paper, we study a principled approach Altogether based on the key idea to edit and re-align existing alt-texts associated with the images. To generate training data, we perform human annotation where annotators start with the existing alt-text and re-align it to the image content in multiple rounds, consequently constructing captions with rich visual concepts. This differs from prior work that carries out human annotation as a one-time description task solely based on images and annotator knowledge. We train a captioner on this data that generalizes the process of re-aligning alt-texts at scale. Our results show our Altogether approach leads to richer image captions that also improve text-to-image generation and zero-shot image classification tasks.

caption, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2410.17251

Country:

Oceania > Australia (0.14)
North America > United States (0.14)
Europe > Switzerland (0.14)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Fernandez, Jared, Wehrstedt, Luca, Shamis, Leonid, Elhoushi, Mostafa, Saladi, Kalyan, Bisk, Yonatan, Strubell, Emma, Kahn, Jacob

arXiv.org Artificial IntelligenceNov-20-2024

Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e. compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, overhead incurred from certain distributed communication strategies leads parallelization strategies previously thought to be sub-optimal in fact become preferable; and (2) scaling the total number of accelerators for large model training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2411.13055

Country: Europe > Italy (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Characterizing and Efficiently Accelerating Multimodal Generation Model Inference

Lee, Yejin, Sun, Anna, Hosmer, Basil, Acun, Bilge, Balioglu, Can, Wang, Changhan, Hernandez, Charles David, Puhrsch, Christian, Haziza, Daniel, Guessous, Driss, Massa, Francisco, Kahn, Jacob, Wan, Jeffrey, Reizenstein, Jeremy, Zhai, Jiaqi, Isaacson, Joe, Schlosser, Joel, Pino, Juan, Sadagopan, Kaushik Ram, Shamis, Leonid, Ma, Linjian, Hwang, Min-Jae, Chen, Mingda, Elhoushi, Mostafa, Rodriguez, Pedro, Pasunuru, Ram, Yih, Scott, Popuri, Sravya, Liu, Xing, Wu, Carole-Jean

arXiv.org Artificial IntelligenceSep-30-2024

Generative artificial intelligence (AI) technology is revolutionizing the computing industry. Not only its applications have broadened to various sectors but also poses new system design and optimization opportunities. The technology is capable of understanding and responding in multiple modalities. However, the advanced capability currently comes with significant system resource demands. To sustainably scale generative AI capabilities to billions of users in the world, inference must be fast and efficient. This paper pinpoints key system design and optimization opportunities by characterizing a family of emerging multi-modal generation models on real systems. Auto-regressive token generation is a critical latency performance bottleneck, typically dominated by GPU idle time. In addition to memory-intensive attention across the generative AI models, linear operations constitute significant inference latency due to the feed forward networks in Transformer-based models. We demonstrate that state-of-the-art optimization levers, spanning from applications to system software and hardware, set a 3.88x better baseline.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.00215

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)

Add feedback

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Sukhbaatar, Sainbayar, Golovneva, Olga, Sharma, Vasu, Xu, Hu, Lin, Xi Victoria, Rozière, Baptiste, Kahn, Jacob, Li, Daniel, Yih, Wen-tau, Weston, Jason, Li, Xian

arXiv.org Artificial IntelligenceMar-12-2024

We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Expert (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2403.07816

Country: Asia (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

The Framework Tax: Disparities Between Inference Efficiency in NLP Research and Deployment

Fernandez, Jared, Kahn, Jacob, Na, Clara, Bisk, Yonatan, Strubell, Emma

arXiv.org Artificial IntelligenceDec-22-2023

Increased focus on the computational efficiency of NLP systems has motivated the design of efficient model architectures and improvements to underlying hardware accelerators. However, the resulting increases in computational throughput and reductions in floating point operations have not directly translated to improvements in wall-clock inference latency. We demonstrate that these discrepancies can be largely attributed to bottlenecks introduced by deep learning frameworks. We denote this phenomenon as the \textit{framework tax}, and observe that the disparity is growing as hardware speed increases over time. In this work, we examine this phenomenon through a series of case studies analyzing the effects of model design decisions, framework paradigms, and hardware platforms on total model latency. Code is available at https://github.com/JaredFern/Framework-Tax.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2302.06117

Country:

North America > United States (0.14)
Europe > Italy (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

RA-DIT: Retrieval-Augmented Dual Instruction Tuning

Lin, Xi Victoria, Chen, Xilun, Chen, Mingda, Shi, Weijia, Lomeli, Maria, James, Rich, Rodriguez, Pedro, Kahn, Jacob, Szilvasy, Gergely, Lewis, Mike, Zettlemoyer, Luke, Yih, Scott

arXiv.org Artificial IntelligenceNov-5-2023

Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option by retrofitting any LLM with retrieval capabilities. Our approach operates in two distinct fine-tuning steps: (1) one updates a pre-trained LM to better use retrieved information, while (2) the other updates the retriever to return more relevant results, as preferred by the LM. By fine-tuning over tasks that require both knowledge utilization and contextual awareness, we demonstrate that each stage yields significant performance improvements, and using both leads to additional gains. Our best model, RA-DIT 65B, achieves state-of-the-art performance across a range of knowledge-intensive zero- and few-shot learning benchmarks, significantly outperforming existing in-context RALM approaches by up to +8.9% in 0-shot setting and +1.4% in 5-shot setting on average.

computational linguistic, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2310.01352

Country:

Europe (1.00)
Asia (1.00)
North America > Canada (0.68)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)

Genre: Research Report (0.82)

Industry:

Leisure & Entertainment (1.00)
Media > Music (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Flashlight: Enabling Innovation in Tools for Machine Learning

Kahn, Jacob, Pratap, Vineel, Likhomanenko, Tatiana, Xu, Qiantong, Hannun, Awni, Cai, Jeff, Tomasello, Paden, Lee, Ann, Grave, Edouard, Avidov, Gilad, Steiner, Benoit, Liptchinsky, Vitaliy, Synnaeve, Gabriel, Collobert, Ronan

arXiv.org Artificial IntelligenceJan-28-2022

As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increases, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward -- we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together.

artificial intelligence, machine learning, neural network, (19 more...)

arXiv.org Artificial Intelligence

2201.12465

Country: North America > United States > California (0.15)

Genre: Research Report > Promising Solution (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Differentiable Weighted Finite-State Transducers

Hannun, Awni, Pratap, Vineel, Kahn, Jacob, Hsu, Wei-Ning

arXiv.org Machine LearningOct-2-2020

E B. (2) The primary difference between ASG and CTC is the inclusion of a blank token, b, represented by the graph in figure 3a. Constructing CTC amounts to including the blank token graph when constructing the full token graph T. The intersection T Y then results in the CTC alignment graph (Figure 1b). Note, this version of CTC does not force transitions on b between repeats tokens. This requires remembering the previous state and hence is more involved (see Appendix A.1 for details). A benefit of constructing sequence-level criteria by composing operations on simpler graphs is the access to a large design space of loss functions with which we can encode useful priors. For example we could construct a "spike" CTC, a "duration-limited" CTC, or an "equally spaced" CTC by substituting the appropriate token graphs into equation 2 (see Appendix A.2 for details).

deep learning, graph, speech recognition, (21 more...)

arXiv.org Machine Learning

2010.01003

Country:

North America > United States (0.14)
Europe > Germany (0.14)

Genre:

Instructional Material (0.46)
Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.71)
(2 more...)

Add feedback