AITopics | laion-5b

laion-5b

Appendix (LAION-5B: An open large-scale dataset for training next generation image-text models) A Datasheet for LAION-5B dataset A.1 Motivation Q1

Neural Information Processing Systems | Feb-11-2026, 02:26:53 GMT

For what purpose was the dataset created? Was there a specific task in mind? ... YFCC with 100 million image/videos and associated metadata.
Who created the dataset (e.g., which team, research group) and on behalf of which ...
Who funded the creation of the dataset? This work was sponsored by Hugging Face and Stability AI.
What do the instances that comprise the dataset represent (e.g., documents, photos, ... Are there multiple types of instances (e.g., movies, users, and ratings; ... We provide 5.8 billion image-text pairs.

artificial intelligence, machine learning, natural language, (20 more...)

Genre: Research Report (0.46)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

a1859debfb3b59d094f3504d5ebb6c25-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems | Feb-11-2026, 02:26:49 GMT

arxiv preprint arxiv, dataset, laion-5b, (13 more...)

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > Canada > Ontario > Toronto (0.04)
(4 more...)

Genre: Research Report (0.48)

Industry:

Information Technology (0.67)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

LAION-5B: An open large-scale dataset for training next generation image-text models

Neural Information Processing Systems | Dec-24-2025, 21:47:27 GMT

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on the expensive, accurate labels used in standard unimodal supervised learning for vision. The resulting models showed strong text-guided image generation and transfer to downstream tasks, while performing remarkably well at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen have made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available to the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32 billion contain English-language text. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled by an openly available dataset of this scale. Additionally, we provide several nearest-neighbor indices, an improved web interface for dataset exploration and subset generation, and detection scores for watermarks, NSFW content, and toxic content.

dataset, laion-5b, open large-scale dataset, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.59)

a1859debfb3b59d094f3504d5ebb6c25-Supplemental-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems | Aug-17-2025, 08:07:36 GMT

artificial intelligence, machine learning, natural language, (21 more...)

Country: Europe > Italy (0.04)

Genre: Research Report (0.46)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(4 more...)

LAION-5B: An open large-scale dataset for training next generation image-text models

Neural Information Processing Systems | Aug-17-2025, 08:07:31 GMT

Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community.

artificial intelligence, machine learning, natural language, (18 more...)

Country:

Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > Canada > Ontario > Toronto (0.04)
(4 more...)

Genre: Research Report (0.48)

Industry:

Information Technology (0.67)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

A major AI training data set contains millions of examples of personal data

MIT Technology Review | Jul-18-2025, 13:08:26 GMT

The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, is that "anything you put online can [be] and probably has been scraped." The researchers found thousands of instances of validated identity documents--including images of credit cards, driver's licenses, passports, and birth certificates--as well as over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. A number of the résumés disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references). When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models.

artificial intelligence, information, machine learning, (9 more...)

Industry:

Information Technology > Security & Privacy (0.72)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.38)

LAION-5B: An open large-scale dataset for training next generation image-text models

Neural Information Processing Systems | Jan-18-2025, 07:36:33 GMT

dataset, generation image-text model, open large-scale dataset, (2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.41)

AI Tools Are Secretly Training on Real Images of Children

WIRED | Jun-10-2024, 04:01:00 GMT

Over 170 images and personal details of children from Brazil have been scraped by an open-source dataset without their knowledge or consent, and used to train AI, claims a new report from Human Rights Watch released Monday. The images have been scraped from content posted as recently as 2023 and as far back as the mid-1990s, according to the report, long before any internet user might anticipate that their content might be used to train AI. Human Rights Watch claims that personal details of these children, alongside links to their photographs, were included in LAION-5B, a dataset that has been a popular source of training data for AI startups. "Their privacy is violated in the first instance when their photo is scraped and swept into these datasets. And then these AI tools are trained on this data and therefore can create realistic imagery of children," says Hye Jung Han, children's rights and technology researcher at Human Rights Watch and the researcher who found these images.

dataset, real image, secretly training, (9 more...)

Country: South America > Brazil (0.26)

Industry: Law > Civil Rights & Constitutional Law (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.33)

Building AI Safely Is Getting Harder and Harder

The Atlantic - Technology | Dec-22-2023, 19:22:00 GMT

This is Atlantic Intelligence, an eight-week series in which The Atlantic's leading thinkers on AI will help you understand the complexity and opportunities of this groundbreaking technology. The bedrock of the AI revolution is the internet, or more specifically, the ever-expanding bounty of data that the web makes available to train algorithms. ChatGPT, Midjourney, and other generative-AI models "learn" by detecting patterns in massive amounts of text, images, and videos scraped from the internet. The process entails hoovering up huge quantities of books, art, memes, and, inevitably, the troves of racist, sexist, and illicit material distributed across the web. Earlier this week, Stanford researchers found a particularly alarming example of that toxicity: The largest publicly available image data set used to train AIs, LAION-5B, reportedly contains more than 1,000 images depicting the sexual abuse of children, out of more than 5 billion in total.

birhane, laion data, wong, (15 more...)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.53)
Law > Criminal Law (0.37)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.36)

A New Tool Helps Artists Thwart AI--With a Middle Finger

WIRED | Oct-12-2023, 19:30:00 GMT

When artificial intelligence image generators first rolled out, they seemed like magic. Churning out detailed imagery in minutes was, from one angle, a technical marvel. From another angle, though, it looked like mere mimicry. The models were trained on billions of images without anyone asking the humans behind them for permission. "They have sucked the creative juices of millions of artists," says Eva Toorenent, an illustrator who serves as the Netherlands adviser for the European Guild for Artificial Intelligence Regulation.

illustrator, middle finger, tool help artist thwart ai, (7 more...)

Country:

Europe > Netherlands (0.26)
Asia > Middle East > Jordan (0.06)

Technology: Information Technology > Artificial Intelligence (1.00)
