Goto

Collaborating Authors

 Shoham, Yoav


The Global AI Vibrancy Tool

arXiv.org Artificial Intelligence

This paper presents the latest version of the Global AI Vibrancy Tool (GVT), an interactive suite of visualizations designed to facilitate the comparison of AI vibrancy across 36 countries, using 42 indicators organized into 8 pillars. The tool offers customizable features that allow users to conduct in-depth country-level comparisons and longitudinal analyses of AI-related metrics, all based on publicly available data. By providing a transparent assessment of national progress in AI, it serves the diverse needs of policymakers, industry leaders, researchers, and the general public. Using weights for indicators and pillars developed by AI Index's panel of experts and combined into an index, the Global AI Vibrancy Ranking for 2023 places the United States first by a significant margin, followed by China and the United Kingdom. The ranking also highlights the rise of smaller nations such as Singapore when evaluated on both absolute and per capita bases. The tool offers three sub-indices for evaluating Global AI Vibrancy along different dimensions: the Innovation Index, the Economic Competitiveness Index, and the Policy, Governance, and Public Engagement Index.


Artificial Intelligence Index Report 2024

arXiv.org Artificial Intelligence

The 2024 Index is our most comprehensive to date and arrives at an important moment when AI's influence on society has never been more pronounced. This year, we have broadened our scope to more extensively cover essential trends such as technical advancements in AI, public perceptions of the technology, and the geopolitical dynamics surrounding its development. Featuring more original data than ever before, this edition introduces new estimates on AI training costs, detailed analyses of the responsible AI landscape, and an entirely new chapter dedicated to AI's impact on science and medicine. The AI Index report tracks, collates, distills, and visualizes data related to artificial intelligence (AI). Our mission is to provide unbiased, rigorously vetted, broadly sourced data in order for policymakers, researchers, executives, journalists, and the general public to develop a more thorough and nuanced understanding of the complex field of AI. The AI Index is recognized globally as one of the most credible and authoritative sources for data and insights on artificial intelligence. Previous editions have been cited in major newspapers, including the The New York Times, Bloomberg, and The Guardian, have amassed hundreds of academic citations, and been referenced by high-level policymakers in the United States, the United Kingdom, and the European Union, among other places. This year's edition surpasses all previous ones in size, scale, and scope, reflecting the growing significance that AI is coming to hold in all of our lives.


Artificial Intelligence Index Report 2023

arXiv.org Artificial Intelligence

Welcome to the sixth edition of the AI Index Report! This year, the report introduces more original data than any previous edition, including a new chapter on AI public opinion, a more thorough technical performance chapter, original analysis about large language and multimodal models, detailed trends in global AI legislation records, a study of the environmental impact of AI systems, and more. The AI Index Report tracks, collates, distills, and visualizes data related to artificial intelligence. Our mission is to provide unbiased, rigorously vetted, broadly sourced data in order for policymakers, researchers, executives, journalists, and the general public to develop a more thorough and nuanced understanding of the complex field of AI. The report aims to be the world's most credible and authoritative source for data and insights about AI.


Parallel Context Windows for Large Language Models

arXiv.org Artificial Intelligence

When applied to processing long text, Large Language Models (LLMs) are limited by their context window. Existing efforts to address this limitation involve training specialized architectures, and cannot be easily applied to off-the-shelf LLMs. We present Parallel Context Windows (PCW), a method that alleviates the context window restriction for any off-the-shelf LLM without further training. The key to the approach is to carve a long context into chunks (``windows''), restrict the attention mechanism to apply only within each window, and re-use the positional embeddings across the windows. Our main results test the PCW approach on in-context learning with models that range in size between 750 million and 178 billion parameters, and show substantial improvements for tasks with diverse input and output spaces. We show additional benefits in other settings where long context windows may be beneficial: multi-hop questions and retrieval-augmented question answering with multiple retrieved documents. Our results highlight Parallel Context Windows as a promising method for applying off-the-shelf LLMs in a range of settings that require long text sequences. We make our code publicly available at https://github.com/ai21labs/parallel-context-windows.


In-Context Retrieval-Augmented Language Models

arXiv.org Artificial Intelligence

Retrieval-Augmented Language Modeling (RALM) methods, which condition a language model (LM) on relevant documents from a grounding corpus during generation, were shown to significantly improve language modeling performance. In addition, they can mitigate the problem of factually inaccurate text generation and provide natural source attribution mechanism. Existing RALM approaches focus on modifying the LM architecture in order to facilitate the incorporation of external information, significantly complicating deployment. This paper considers a simple alternative, which we dub In-Context RALM: leaving the LM architecture unchanged and prepending grounding documents to the input, without any further training of the LM. We show that In-Context RALM that builds on off-the-shelf general purpose retrievers provides surprisingly large LM gains across model sizes and diverse corpora. We also demonstrate that the document retrieval and ranking mechanism can be specialized to the RALM setting to further boost performance. We conclude that In-Context RALM has considerable potential to increase the prevalence of LM grounding, particularly in settings where a pretrained LM must be used without modification or even via API access.


Generating Benchmarks for Factuality Evaluation of Language Models

arXiv.org Artificial Intelligence

Before deploying a language model (LM) within a given domain, it is important to measure its tendency to generate factually incorrect information in that domain. Existing factual generation evaluation methods focus on facts sampled from the LM itself, and thus do not control the set of evaluated facts and might under-represent rare and unlikely facts. We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality. FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements. We use our framework to create two benchmarks: Wiki-FACTOR and News-FACTOR. We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score correlates with perplexity, but the two metrics do not always agree on model ranking; and (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation, as measured by human annotators. We make our data and code publicly available in https://github.com/AI21Labs/factor.


Human or Not? A Gamified Approach to the Turing Test

arXiv.org Artificial Intelligence

"I believe that in 50 years' time it will be possible to make computers play the imitation game so well that an average interrogator will have no more than 70% chance of making the right identification after 5 minutes of questioning." Over the course of a month, the game was played by over 1.5 million users who engaged in anonymous two-minute chat sessions with either another human or an AI language model which was prompted to behave like humans. The task of the players was to correctly guess whether they spoke to a person or to an AI. This largest scale Turing-style test conducted to date revealed some interesting facts. For example, overall users guessed the identity of their partners correctly in only 68% of the games. In the subset of the games in which users faced an AI bot, users had even lower correct guess rates of 60% (that is, not much higher than chance). While this experiment calls for many extensions and refinements, these findings already begin to shed light on the inevitable near future which will commingle humans and AI. The famous Turing test, originally proposed by Alan Turing in 1950 as "the imitation game" (Turing, 1950), was proposed as an operational test of intelligence, namely, testing a machine's ability to exhibit behavior indistinguishable from that of a human. In this proposed test, a human evaluator engages in a natural language conversation with both another human and a machine, and tries to distinguish between them. If the evaluator is unable to tell which is which, the machine is said to have passed the test.


The AI Index 2021 Annual Report

arXiv.org Artificial Intelligence

Welcome to the fourth edition of the AI Index Report. This year we significantly expanded the amount of data available in the report, worked with a broader set of external organizations to calibrate our data, and deepened our connections with the Stanford Institute for Human-Centered Artificial Intelligence (HAI). The AI Index Report tracks, collates, distills, and visualizes data related to artificial intelligence. Its mission is to provide unbiased, rigorously vetted, and globally sourced data for policymakers, researchers, executives, journalists, and the general public to develop intuitions about the complex field of AI. The report aims to be the most credible and authoritative source for data and insights about AI in the world.


PMI-Masking: Principled masking of correlated spans

arXiv.org Machine Learning

Masking tokens uniformly at random constitutes a common flaw in the pretraining of Masked Language Models (MLMs) such as BERT. We show that such uniform masking allows an MLM to minimize its training objective by latching onto shallow local signals, leading to pretraining inefficiency and suboptimal downstream performance. To address this flaw, we propose PMI-Masking, a principled masking strategy based on the concept of Pointwise Mutual Information (PMI), which jointly masks a token n-gram if it exhibits high collocation over the corpus. PMI-Masking motivates, unifies, and improves upon prior more heuristic approaches that attempt to address the drawback of random uniform token masking, such as whole-word masking, entity/phrase masking, and random-span masking. Specifically, we show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of training. In the couple of years since BERT was introduced in a seminal paper by Devlin et al. (2019a), Masked Language Models (MLMs) have rapidly advanced the NLP frontier (Sun et al., 2019; Liu et al., 2019; Joshi et al., 2020; Raffel et al., 2019). At the heart of the MLM approach is the task of predicting a masked subset of the text given the remaining, unmasked text. The text itself is broken up into tokens, each token consisting of a word or part of a word; thus "chair" constitutes a single token, but out-of-vocabulary words like "eigen-value" are broken up into several sub-word tokens.


Towards the AI Index

AI Magazine

AI is all the rage these days, and with the attention comes the responsibility – our responsibility, as AI researchers and practitioners – to communicate clearly about our field. What areas are progressing, and at what pace? What are the successes and the challenges? Will AI improve our lives? For example, will it increase productivity, open up new business opportunities, lengthen life expectancy and improve its quality? Or is AI a threat to society? For example, will it eliminate jobs, accentuate biases, or lead to unfair concentration of knowledge and wealth? Many people are chiming in on these questions. Opinions diverge, which is to be expected in a fast-moving area with more unknowns than knowns. It is an important conversation to have. And precisely because of the inherent uncertainty, it’s important that the conversation be anchored in fact. It’s hard enough for AI practitioners to keep track of everything that’s going on in our exploding field and make sense of it, and it’s all the more daunting to outsiders. The goal of the AI Index is to provide precisely this factual basis to the conversation, in an open, not-for-profit fashion.