Apple, NVIDIA and Anthropic reportedly used YouTube transcripts without permission to train AI models

Engadget 

Some of the world's largest tech companies trained their AI models, without permission, on a dataset that included transcripts of more than 173,000 YouTube videos, a new investigation from Proof News has found. The dataset, which was created by a nonprofit called EleutherAI, contains transcripts of YouTube videos from more than 48,000 channels and was used by Apple, NVIDIA and Anthropic, among other companies. The findings spotlight AI's uncomfortable truth: the technology is largely built on data siphoned from creators without their consent or compensation. The dataset doesn't include any videos or images from YouTube, but it does contain video transcripts from the platform's biggest creators, including Marques Brownlee and MrBeast, as well as from large news publishers like The New York Times, the BBC and ABC News. Subtitles from videos belonging to Engadget are also part of the dataset.