Goto

Collaborating Authors

 laion-5b


Appendix (LAION-5B: An open large-scale dataset for training next generation image-text models) A Datasheet for LAION-5B dataset A.1 Motivation Q1

Neural Information Processing Systems

For what purpose was the dataset created? Was there a specific task in mind? YFCC with 100 million image/videos and associated metadata. Who created the dataset (e.g., which team, research group) and on behalf of which Who funded the creation of the dataset? This work was sponsored by Hugging Face and Stability AI. What do the instances that comprise the dataset represent (e.g., documents, photos, Are there multiple types of instances (e.g., movies, users, and ratings; We provide 5.8 billion image-text pairs.



LAION-5B: An open large-scale dataset for training next generation image-text models

Neural Information Processing Systems

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B - a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled with an openly available dataset of this scale. Additionally we provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content detection.




A major AI training data set contains millions of examples of personal data

MIT Technology Review

The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, is that "anything you put online can [be] and probably has been scraped." The researchers found thousands of instances of validated identity documents--including images of credit cards, driver's licenses, passports, and birth certificates--as well as over 800 validated job application documents (including rรฉsumรฉs and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. A number of the rรฉsumรฉs disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When rรฉsumรฉs were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references). When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models.


LAION-5B: An open large-scale dataset for training next generation image-text models

Neural Information Processing Systems

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community.


AI Tools Are Secretly Training on Real Images of Children

WIRED

Over 170 images and personal details of children from Brazil have been scraped by an open-source dataset without their knowledge or consent, and used to train AI, claims a new report from Human Rights Watch released Monday. The images have been scraped from content posted as recently as 2023 and as far back as the mid-1990s, according to the report, long before any internet user might anticipate that their content might be used to train AI. Human Rights Watch claims that personal details of these children, alongside links to their photographs, were included in LAION-5B, a dataset that has been a popular source of training data for AI startups. "Their privacy is violated in the first instance when their photo is scraped and swept into these datasets. And then these AI tools are trained on this data and therefore can create realistic imagery of children," says Hye Jung Han, children's rights and technology researcher at Human Rights Watch and the researcher who found these images.


Building AI Safely Is Getting Harder and Harder

The Atlantic - Technology

This is Atlantic Intelligence, an eight-week series in which The Atlantic's leading thinkers on AI will help you understand the complexity and opportunities of this groundbreaking technology. The bedrock of the AI revolution is the internet, or more specifically, the ever-expanding bounty of data that the web makes available to train algorithms. ChatGPT, Midjourney, and other generative-AI models "learn" by detecting patterns in massive amounts of text, images, and videos scraped from the internet. The process entails hoovering up huge quantities of books, art, memes, and, inevitably, the troves of racist, sexist, and illicit material distributed across the web. Earlier this week, Stanford researchers found a particularly alarming example of that toxicity: The largest publicly available image data set used to train AIs, LAION-5B, reportedly contains more than 1,000 images depicting the sexual abuse of children, out of more than 5 billion in total.


A New Tool Helps Artists Thwart AI--With a Middle Finger

WIRED

When artificial intelligence image generators first rolled out, they seemed like magic. Churning out detailed imagery in minutes was, from one angle, a technical marvel. From another angle, though, it looked like mere mimicry. The models were trained on billions of images without anyone asking the humans behind them for permission. "They have sucked the creative juices of millions of artists," says Eva Toorenent, an illustrator who serves as the Netherlands adviser for the European Guild for Artificial Intelligence Regulation.