EleutherAI
Towards Best Practices for Open Datasets for LLM Training
Baack, Stefan, Biderman, Stella, Odrozek, Kasia, Skowron, Aviya, Bdeir, Ayah, Bommarito, Jillian, Ding, Jennifer, Gahntz, Maximilian, Keller, Paul, Langlais, Pierre-Carl, Lindahl, Greg, Majstorovic, Sebastian, Marda, Nik, Penedo, Guilherme, Van Segbroeck, Maarten, Wang, Jennifer, von Werra, Leandro, Baker, Mitchell, Belião, Julie, Chmielinski, Kasia, Fadaee, Marzieh, Gutermuth, Lisa, Kydlíček, Hynek, Leppert, Greg, Lewis-Jong, EM, Larsen, Solana, Longpre, Shayne, Lungati, Angela Oduor, Miller, Cullen, Miller, Victor, Ryabinin, Max, Siminyu, Kathleen, Strait, Andrew, Surman, Mark, Tumadóttir, Anna, Weber, Maurice, Weiss, Rebecca, White, Lee, Wolf, Thomas
Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in places such as the EU and Japan, this is allowed under certain restrictions, while in the United States, the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend of limiting information about training data causes harm by hindering transparency, accountability, and innovation in the broader ecosystem, denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing, there are no such models (trained at a meaningful scale) due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.
- Asia > Japan (0.24)
- North America > United States > New York (0.04)
- Europe > France (0.04)
- Law > Intellectual Property & Technology Law (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Law > Litigation (0.88)
Apple, NVIDIA and Anthropic reportedly used YouTube transcripts without permission to train AI models
Some of the world's largest tech companies trained their AI models on a dataset that included transcripts of more than 173,000 YouTube videos without permission, a new investigation from Proof News has found. The dataset, which was created by a nonprofit company called EleutherAI, contains transcripts of YouTube videos from more than 48,000 channels and was used by Apple, NVIDIA and Anthropic among other companies. The findings of the investigation spotlight AI's uncomfortable truth: the technology is largely built on the backs of data siphoned from creators without their consent or compensation. The dataset doesn't include any videos or images from YouTube, but contains video transcripts from the platform's biggest creators including Marques Brownlee and MrBeast, as well as large news publishers like The New York Times, the BBC, and ABC News. Subtitles from videos belonging to Engadget are also part of the dataset.
Revealed: The Authors Whose Pirated Books Are Powering Generative AI
One of the most troubling issues around generative AI is simple: It's being made in secret. To produce humanlike answers to questions, systems such as ChatGPT process huge quantities of written material. But few people outside of companies such as Meta and OpenAI know the full extent of the texts these programs have been trained on. Some training text comes from Wikipedia and other online writing, but high-quality generative AI requires higher-quality input than is usually found on the internet--that is, it requires the kind found in books. Authors including Sarah Silverman, Richard Kadrey, and Christopher Golden have sued Meta over the use of their books, but neither the lawsuit itself nor the commentary surrounding it has offered a look under the hood: We have not previously known for certain whether LLaMA was trained on Silverman's, Kadrey's, or Golden's books, or any others, for that matter.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)
The open-source AI boom is built on Big Tech's handouts. How long will it last?
Companies like Google--which revealed at its annual product showcase this week that it is throwing generative AI at everything it has, from Gmail to Photos to Maps--were too busy looking over their shoulders to see the real competition coming, writes Google engineer Luke Sernau: "While we've been squabbling, a third faction has been quietly eating our lunch." Greater access to these models has helped drive innovation--it can also help catch their flaws. AI won't thrive if just a few mega-rich companies get to gatekeep this technology or decide how it is used. But this open-source boom is precarious. Most open-source releases still stand on the shoulders of giant models put out by big firms with deep pockets.
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.80)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.76)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.60)
A List of 1 Billion+ Parameter LLMs
There are already over 50 different 1B+ parameter LLMs accessible via open-source checkpoints or proprietary APIs. That’s not counting any private models or models with academic papers but no available API or model weights. There’s even more if you count fine-tuned models like Alpaca or InstructGPT. A list of the ones I know about (this is an evolving document). GPT-J (6B) (EleutherAI) GPT-Neo (1.3B, 2.7B, 20B) (EleutherAI) Pythia (1B, 1.4B, 2.8B, 6.9B, 12B) Polyglot (1.3B, 3.8B, 5.8B) J1 (
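The parameter counts that name these models can be estimated from a few architectural numbers. As a rough rule of thumb (an illustrative sketch, not how any of these labs report sizes), a decoder-only transformer has about 12 · n_layer · d_model² weights in its blocks plus a token-embedding matrix:

```python
def approx_params(n_layer: int, d_model: int, vocab: int) -> int:
    """Rough parameter count for a decoder-only transformer.

    Each block contributes ~12 * d_model^2 weights (4 d^2 for the
    attention projections, 8 d^2 for the MLP); the embedding matrix
    adds vocab * d_model. Biases and layer norms are ignored.
    """
    return 12 * n_layer * d_model**2 + vocab * d_model

# GPT-Neo 1.3B uses 24 layers, hidden size 2048, ~50k vocabulary:
print(round(approx_params(24, 2048, 50257) / 1e9, 2))  # ~1.31
```

The same formula puts GPT-J (28 layers, hidden size 4096) at roughly 5.8B, consistent with its "6B" label.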
Man ends his life after an AI chatbot 'encouraged' him to sacrifice himself to stop climate change
A Belgian man reportedly ended his life following a six-week-long conversation about the climate crisis with an artificial intelligence (AI) chatbot. According to his widow, who chose to remain anonymous, Pierre (not the man's real name) became extremely eco-anxious when he found refuge in Eliza, an AI chatbot on an app called Chai. Eliza consequently encouraged him to put an end to his life after he proposed sacrificing himself to save the planet. "Without these conversations with the chatbot, my husband would still be here," the man's widow told Belgian news outlet La Libre. According to the newspaper, Pierre, who was in his thirties and a father of two young children, worked as a health researcher and led a somewhat comfortable life, at least until his obsession with climate change took a dark turn.
- Media > News (0.57)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.38)
EleutherAI: Going Beyond "Open Science" to "Science in the Open"
Phang, Jason, Bradley, Herbie, Gao, Leo, Castricato, Louis, Biderman, Stella
Over the past two years, EleutherAI has established itself as a radically novel initiative aimed at both promoting open-source research and conducting research in a transparent, openly accessible and collaborative manner. EleutherAI's approach to research goes beyond transparency: by doing research entirely in public, anyone in the world can observe and contribute at every stage. Our work has been received positively and has resulted in several high-impact projects in Natural Language Processing and other fields. In this paper, we describe our experience doing public-facing machine learning research, the benefits we believe this approach brings, and the pitfalls we have encountered.
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Norway > Eastern Norway > Oslo (0.04)
Google bans deepfake-generating AI from Colab – TechCrunch
Google has banned the training of AI systems that can be used to generate deepfakes on its Google Colaboratory platform. The updated terms of use, spotted over the weekend by Unite.ai and BleepingComputer, includes deepfakes-related work in the list of disallowed projects. Colaboratory, or Colab for short, spun out from an internal Google Research project in late 2017. It's designed to allow anyone to write and execute arbitrary Python code through a web browser, particularly code for machine learning, education and data analysis. For the purpose, Google provides both free and paying Colab users access to hardware including GPUs and Google's custom-designed, AI-accelerating tensor processing units (TPUs).
Global Big Data Conference
Large language models capable of writing poems, summaries, and computer code are driving the demand for "natural language processing (NLP) as a service." As these models become more capable -- and accessible, relatively speaking -- appetite in the enterprise for them is growing. According to a 2021 survey from John Snow Labs and Gradient Flow, 60% of tech leaders indicated that their NLP budgets grew by at least 10% compared to 2020, while a third -- 33% -- said that their spending climbed by more than 30%. Well-resourced providers like OpenAI, Cohere, and AI21 Labs are reaping the benefits. As of March, OpenAI said that GPT-3 was being used in more than 300 different apps by "tens of thousands" of developers and producing 4.5 billion words per day.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.50)