Shippole, Enrico
Bridging the Data Provenance Gap Across Text, Speech and Video
Longpre, Shayne, Singh, Nikhil, Cherep, Manuel, Tiwary, Kushagra, Materzynska, Joanna, Brannon, William, Mahari, Robert, Dey, Manan, Hamdy, Mohammed, Saxena, Nayan, Anis, Ahmad Mustafa, Alghamdi, Emad A., Chien, Vu Minh, Obeng-Marnu, Naana, Yin, Da, Qian, Kun, Li, Yizhi, Liang, Minnie, Dinh, An, Mohanty, Shrestha, Mataciunas, Deividas, South, Tobin, Zhang, Jianguo, Lee, Ariel N., Lund, Campbell S., Klamm, Christopher, Sileo, Damien, Misra, Diganta, Shippole, Enrico, Klyman, Kevin, Miranda, Lester JV, Muennighoff, Niklas, Ye, Seonghyeon, Kim, Seungone, Gupta, Vipul, Sharma, Vivek, Zhou, Xuhui, Xiong, Caiming, Villa, Luis, Biderman, Stella, Pentland, Alex, Hooker, Sara, Kabbara, Jad
Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities, covering popular text, speech, and video datasets, from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990 and 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media sources, such as YouTube, for their training sets, eclipsing all other sources since 2019. Second, tracing the chain of dataset derivations, we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely used text, speech, and video datasets carries non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates that measures of relative geographical and multilingual representation have failed to improve significantly since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, and that visibility into these questions is essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.
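A minimal sketch of why restrictions accumulate along derivation chains, assuming hypothetical dataset records (the names, fields, and licenses below are illustrative, not the audit's actual schema): a compilation released under a permissive license still inherits non-commercial terms from any of its upstream sources.

```python
# Hypothetical records: each dataset lists its own license and the
# datasets it was derived from. Illustrative only, not the audit schema.
DATASETS = {
    "youtube_clips": {"license": "non-commercial", "sources": []},
    "speech_corpus": {"license": "cc-by-4.0", "sources": ["youtube_clips"]},
    "mix_v1": {"license": "apache-2.0", "sources": ["speech_corpus"]},
}

def effective_license(name, datasets):
    """A dataset is effectively non-commercial if it, or any dataset in
    its chain of sources, carries a non-commercial restriction."""
    entry = datasets[name]
    if entry["license"] == "non-commercial":
        return "non-commercial"
    for src in entry["sources"]:
        if effective_license(src, datasets) == "non-commercial":
            return "non-commercial"
    return entry["license"]

print(effective_license("mix_v1", DATASETS))  # -> non-commercial
```

This is how a dataset population with under 33% restrictive licenses can still have over 80% of its source content encumbered: the restriction only has to appear once, anywhere upstream.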
Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
Crowson, Katherine, Baumann, Stefan Andreas, Birch, Alex, Abraham, Tanishq Mathew, Kaplan, Daniel Z., Shippole, Enrico
We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high-resolution (e.g. $1024 \times 1024$) directly in pixel-space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet $256^2$, and sets a new state-of-the-art for diffusion models on FFHQ-$1024^2$.
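A toy PyTorch sketch of the hourglass skeleton, under loud assumptions: this is not the authors' HDiT (which uses neighborhood attention at the high-resolution levels, among other components), but it shows the core idea of merging tokens before the expensive global-attention bottleneck and re-expanding them afterwards with a U-Net-style skip connection, which keeps attention cost from growing quadratically with pixel count.

```python
import torch
import torch.nn as nn

class HourglassSketch(nn.Module):
    """Toy one-level hourglass over image tokens (illustrative, not HDiT)."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.down = nn.PixelUnshuffle(2)            # 2x2 token merge: C -> 4C
        self.merge = nn.Conv2d(4 * dim, dim, 1)     # project back down to dim
        self.mid_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.up = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.PixelShuffle(2))
        self.fuse = nn.Conv2d(2 * dim, dim, 1)      # fuse skip + upsampled path

    def forward(self, x):                           # x: (B, dim, H, W), H and W even
        skip = x
        h = self.merge(self.down(x))                # (B, dim, H/2, W/2)
        b, c, hh, ww = h.shape
        t = h.flatten(2).transpose(1, 2)            # (B, H*W/4, dim) tokens
        t, _ = self.mid_attn(t, t, t)               # global attention on coarse grid
        h = t.transpose(1, 2).reshape(b, c, hh, ww)
        h = self.up(h)                              # back to (B, dim, H, W)
        return self.fuse(torch.cat([h, skip], dim=1))

y = HourglassSketch()(torch.randn(1, 64, 32, 32))   # -> (1, 64, 32, 32)
```

Because global attention only ever runs on the merged (coarse) grid, doubling the input resolution roughly quadruples token count at the cheap levels but not the quadratic-cost bottleneck, giving the near-linear scaling the abstract describes.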
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
Longpre, Shayne, Mahari, Robert, Chen, Anthony, Obeng-Marnu, Naana, Sileo, Damien, Brannon, William, Muennighoff, Niklas, Khazam, Nathan, Kabbara, Jad, Perisetla, Kartik, Wu, Xinyi, Shippole, Enrico, Bollacker, Kurt, Wu, Tongshuang, Villa, Luis, Pentland, Sandy, Hooker, Sara
The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices, which threaten data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, including their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs. closed datasets, with closed datasets monopolizing important categories: lower-resource languages, more creative tasks, richer topic variety, and newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission rates of 70%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: www.dataprovenance.org.
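A small sketch of how omission and error rates of this kind can be computed, using hypothetical audit rows (the datasets and licenses below are made up; the real audit lives at www.dataprovenance.org): the license shown on a hosting site is compared against the license assigned by the manual audit.

```python
# Hypothetical rows: license listed on the hosting site vs. the license
# assigned by manual audit. Illustrative values only.
AUDIT = [
    {"dataset": "a", "hosted": None, "audited": "cc-by-nc-4.0"},
    {"dataset": "b", "hosted": "mit", "audited": "cc-by-sa-3.0"},
    {"dataset": "c", "hosted": "apache-2.0", "audited": "apache-2.0"},
    {"dataset": "d", "hosted": None, "audited": "odc-by"},
]

omitted = [r for r in AUDIT if r["hosted"] is None]
listed = [r for r in AUDIT if r["hosted"] is not None]
wrong = [r for r in listed if r["hosted"] != r["audited"]]

print(f"omission rate: {len(omitted) / len(AUDIT):.0%}")           # 50% here
print(f"error rate among listed: {len(wrong) / len(listed):.0%}")  # 50% here
```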
YaRN: Efficient Context Window Extension of Large Language Models
Peng, Bowen, Quesnelle, Jeffrey, Fan, Honglu, Shippole, Enrico
Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x fewer tokens and 2.5x fewer training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing the previous state-of-the-art in context window extension. In addition, we demonstrate that YaRN can extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN have been made available and reproduced online, up to 128k context length, at https://github.com/jquesnelle/yarn
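A minimal sketch of YaRN's blended RoPE frequency interpolation, assuming LLaMA-style defaults (the function name and the alpha/beta/scale values are illustrative; the reference implementation is at the linked repository): high-frequency dimensions, whose wavelengths fit many times into the original context, are left untouched to preserve local positional detail, low-frequency dimensions are interpolated by the scale factor, and a linear ramp blends between the two regimes.

```python
import torch

def yarn_inv_freq(dim: int, base: float = 10000.0, scale: float = 16.0,
                  orig_ctx: int = 2048, alpha: float = 1.0, beta: float = 32.0):
    """Sketch of YaRN-style blended RoPE frequency interpolation."""
    # Standard RoPE inverse frequencies for a head dimension `dim`.
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    wavelength = 2 * torch.pi / inv_freq
    # Full rotations each dimension completes over the original context.
    ratio = orig_ctx / wavelength
    # gamma = 1: high frequency, keep as-is; gamma = 0: fully interpolate.
    gamma = torch.clamp((ratio - alpha) / (beta - alpha), 0.0, 1.0)
    return gamma * inv_freq + (1 - gamma) * inv_freq / scale
```

The full method also rescales attention logits by a temperature that grows logarithmically with the scale factor; that detail, and the fine-tuning recipe, are omitted here.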