EleutherAI
Towards Best Practices for Open Datasets for LLM Training
Baack, Stefan, Biderman, Stella, Odrozek, Kasia, Skowron, Aviya, Bdeir, Ayah, Bommarito, Jillian, Ding, Jennifer, Gahntz, Maximilian, Keller, Paul, Langlais, Pierre-Carl, Lindahl, Greg, Majstorovic, Sebastian, Marda, Nik, Penedo, Guilherme, Van Segbroeck, Maarten, Wang, Jennifer, von Werra, Leandro, Baker, Mitchell, Belião, Julie, Chmielinski, Kasia, Fadaee, Marzieh, Gutermuth, Lisa, Kydlíček, Hynek, Leppert, Greg, Lewis-Jong, EM, Larsen, Solana, Longpre, Shayne, Lungati, Angela Oduor, Miller, Cullen, Miller, Victor, Ryabinin, Max, Siminyu, Kathleen, Strait, Andrew, Surman, Mark, Tumadóttir, Anna, Weber, Maurice, Weiss, Rebecca, White, Lee, Wolf, Thomas
Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in places such as the EU and Japan, this is allowed under certain restrictions, while in the United States, the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend of limiting information about training data causes harm by hindering transparency, accountability, and innovation in the broader ecosystem, denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing, there are no such models (trained at a meaningful scale) due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.
- Asia > Japan (0.24)
- North America > United States > New York (0.04)
- Europe > France (0.04)
- Law > Intellectual Property & Technology Law (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Law > Litigation (0.88)
Apple, NVIDIA and Anthropic reportedly used YouTube transcripts without permission to train AI models
Some of the world's largest tech companies trained their AI models on a dataset that included transcripts of more than 173,000 YouTube videos without permission, a new investigation from Proof News has found. The dataset, which was created by a nonprofit company called EleutherAI, contains transcripts of YouTube videos from more than 48,000 channels and was used by Apple, NVIDIA and Anthropic among other companies. The findings of the investigation spotlight AI's uncomfortable truth: the technology is largely built on the backs of data siphoned from creators without their consent or compensation. The dataset doesn't include any videos or images from YouTube, but contains video transcripts from the platform's biggest creators including Marques Brownlee and MrBeast, as well as large news publishers like The New York Times, the BBC, and ABC News. Subtitles from videos belonging to Engadget are also part of the dataset.
Revealed: The Authors Whose Pirated Books Are Powering Generative AI
One of the most troubling issues around generative AI is simple: It's being made in secret. To produce humanlike answers to questions, systems such as ChatGPT process huge quantities of written material. But few people outside of companies such as Meta and OpenAI know the full extent of the texts these programs have been trained on. Some training text comes from Wikipedia and other online writing, but high-quality generative AI requires higher-quality input than is usually found on the internet--that is, it requires the kind found in books. Authors including Sarah Silverman, Richard Kadrey, and Christopher Golden have sued Meta over the use of their books, but neither the lawsuit itself nor the commentary surrounding it has offered a look under the hood: We have not previously known for certain whether LLaMA was trained on Silverman's, Kadrey's, or Golden's books, or any others, for that matter.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)
The open-source AI boom is built on Big Tech's handouts. How long will it last?
Companies like Google--which revealed at its annual product showcase this week that it is throwing generative AI at everything it has, from Gmail to Photos to Maps--were too busy looking over their shoulders to see the real competition coming, writes Google engineer Luke Sernau: "While we've been squabbling, a third faction has been quietly eating our lunch." Greater access to these models has helped drive innovation--it can also help catch their flaws. AI won't thrive if just a few mega-rich companies get to gatekeep this technology or decide how it is used. But this open-source boom is precarious. Most open-source releases still stand on the shoulders of giant models put out by big firms with deep pockets.
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.80)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.76)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.60)
A List of 1 Billion+ Parameter LLMs
There are already over 50 different 1B+ parameter LLMs accessible via open-source checkpoints or proprietary APIs. That’s not counting any private models or models with academic papers but no available API or model weights. There’s even more if you count fine-tuned models like Alpaca or InstructGPT. A list of the ones I know about (this is an evolving document). GPT-J (6B) (EleutherAI) GPT-Neo (1.3B, 2.7B, 20B) (EleutherAI) Pythia (1B, 1.4B, 2.8B, 6.9B, 12B) Polyglot (1.3B, 3.8B, 5.8B) J1 (
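The parameter counts that name these models can be estimated from a few architectural numbers. As a rough rule of thumb (an illustrative sketch, not how any of these labs report sizes), a decoder-only transformer has about 12 · n_layer · d_model² weights in its blocks plus a token-embedding matrix:

```python
def approx_params(n_layer: int, d_model: int, vocab: int) -> int:
    """Rough parameter count for a decoder-only transformer.

    Each block contributes ~12 * d_model^2 weights (4 d^2 for the
    attention projections, 8 d^2 for the MLP); the embedding matrix
    adds vocab * d_model. Biases and layer norms are ignored.
    """
    return 12 * n_layer * d_model**2 + vocab * d_model

# GPT-Neo 1.3B uses 24 layers, hidden size 2048, ~50k vocabulary:
print(round(approx_params(24, 2048, 50257) / 1e9, 2))  # ~1.31
```

The same formula puts GPT-J (28 layers, hidden size 4096) at roughly 5.8B, consistent with its "6B" label.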
Man ends his life after an AI chatbot 'encouraged' him to sacrifice himself to stop climate change
A Belgian man reportedly ended his life following a six-week-long conversation about the climate crisis with an artificial intelligence (AI) chatbot. According to his widow, who chose to remain anonymous, Pierre (not the man's real name) became extremely eco-anxious when he found refuge in Eliza, an AI chatbot on an app called Chai. Eliza consequently encouraged him to put an end to his life after he proposed sacrificing himself to save the planet. "Without these conversations with the chatbot, my husband would still be here," the man's widow told Belgian news outlet La Libre. According to the newspaper, Pierre, who was in his thirties and a father of two young children, worked as a health researcher and led a somewhat comfortable life, at least until his obsession with climate change took a dark turn.
- Media > News (0.57)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.38)
EleutherAI: Going Beyond "Open Science" to "Science in the Open"
Phang, Jason, Bradley, Herbie, Gao, Leo, Castricato, Louis, Biderman, Stella
Over the past two years, EleutherAI has established itself as a radically novel initiative aimed at both promoting open-source research and conducting research in a transparent, openly accessible and collaborative manner. EleutherAI's approach to research goes beyond transparency: by doing research entirely in public, anyone in the world can observe and contribute at every stage. Our work has been received positively and has resulted in several high-impact projects in Natural Language Processing and other fields. In this paper, we describe our experience doing public-facing machine learning research, the benefits we believe this approach brings, and the pitfalls we have encountered.
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Norway > Eastern Norway > Oslo (0.04)
Google bans deepfake-generating AI from Colab – TechCrunch
Google has banned the training of AI systems that can be used to generate deepfakes on its Google Colaboratory platform. The updated terms of use, spotted over the weekend by Unite.ai and BleepingComputer, includes deepfakes-related work in the list of disallowed projects. Colaboratory, or Colab for short, spun out from an internal Google Research project in late 2017. It's designed to allow anyone to write and execute arbitrary Python code through a web browser, particularly code for machine learning, education and data analysis. For the purpose, Google provides both free and paying Colab users access to hardware including GPUs and Google's custom-designed, AI-accelerating tensor processing units (TPUs).
Global Big Data Conference
Large language models capable of writing poems, summaries, and computer code are driving the demand for "natural language processing (NLP) as a service." As these models become more capable -- and accessible, relatively speaking -- appetite in the enterprise for them is growing. According to a 2021 survey from John Snow Labs and Gradient Flow, 60% of tech leaders indicated that their NLP budgets grew by at least 10% compared to 2020, while a third -- 33% -- said that their spending climbed by more than 30%. Well-resourced providers like OpenAI, Cohere, and AI21 Labs are reaping the benefits. As of March, OpenAI said that GPT-3 was being used in more than 300 different apps by "tens of thousands" of developers and producing 4.5 billion words per day.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.50)