Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

Lyu, Mengyao, Li, Yan, Zhong, Huasong, Yang, Wenhao, Chen, Hui, Han, Jungong, Ding, Guiguang, Yang, Zhenheng

arXiv.org Artificial Intelligence

The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the supervised fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research. However, the stability and generalizability of these methods are compromised by their vulnerability to experimental setups and validation protocols, often falling short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b). Built upon LLMs, multi-modal LLMs (MLLMs), with their sheer token volume and heightened heterogeneity of data sources, amplify both the significance and the complexity of data selection. To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate. To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as a diversity indicator and use a multi-modal rich styler to identify data instruction patterns. In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of data points under varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation. Across 10+ experimental settings, validated on 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data.
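The abstract's selection recipe — rank candidates by capability score, then diversify by interaction style rather than by embedding clusters or greedy sampling — can be sketched roughly as follows. This is a minimal illustration only: the field names (`score`, `style`) and the round-robin policy are assumptions for the sketch, not the paper's actual mmSSR implementation.

```python
from collections import defaultdict

def select_instruction_data(candidates, budget):
    """Budgeted selection sketch: bucket candidates by interaction style,
    sort each bucket by quality score, then draw round-robin across styles
    so high-scoring data arrives in diversified forms."""
    by_style = defaultdict(list)
    for item in candidates:
        by_style[item["style"]].append(item)
    for items in by_style.values():
        items.sort(key=lambda x: x["score"], reverse=True)  # best first

    selected = []
    buckets = list(by_style.values())
    i = 0
    while len(selected) < budget and any(buckets):
        bucket = buckets[i % len(buckets)]
        if bucket:
            selected.append(bucket.pop(0))  # take the bucket's top item
        i += 1
    return selected
```

Because there is no clustering or greedy set-cover step, a pass like this stays linear (plus per-bucket sorts) and scales to millions of candidates under any budget.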


I Cloned My Voice and My Mother Couldn't Tell the Difference

Slate

This article is from Understanding AI, a newsletter that explores how A.I. works and how it's changing our world. A couple of weeks ago, I used A.I. software to clone my voice. The resulting audio sounded pretty convincing to me, but I wanted to see what others thought. So I created a test audio file based on the first 12 paragraphs of this article that I wrote. Seven randomly chosen paragraphs were my real voice, while the other five were generated by A.I. I asked members of my family to see if they could tell the difference.


The Digital Insider

#artificialintelligence

Low-code and no-code platforms are used to build applications, websites, mobile apps, forms, dashboards, data pipelines, and integrations. No-code platforms help business users, sometimes termed citizen developers, to migrate from spreadsheets, extend beyond email collaborations, and transition from manual task execution to using tools and automations across departments. Low-code platforms are usually for technologists and provide ways to deliver and support software with little or no coding. "You have to remember low code is just a fancy term for abstraction. We are abstracting away non-essential elements in order to simplify the user experience," says Gordon Allott, President and CEO of K3.


How Descript's generative AI makes video editing as easy as updating text

#artificialintelligence

Check out the on-demand sessions from the Low-Code/No-Code Summit to learn how to successfully innovate and achieve efficiency by upskilling and scaling citizen developers. A podcaster steps up to a mic to do a review of a new chicken nugget brand. As he begins talking and recording himself on his laptop, real-time speech-to-text transcribes his comments: "So these nuggets are, um, made from chicken, but they're made to um, um, um, um, emulate the taste of, like, like, non chicken nuggets." That doesn't sound very professional; on his screen, he strikes through those filler words -- and while he's at it, boosts the podcast's sound quality before publishing it for his audience. This is one use case for audio-video editing tool Descript, which today announced a significant product update and a $50 million series C round led by the OpenAI Startup Fund. "The whole concept of Descript -- editing video like a doc -- is only possible because of AI [artificial intelligence]," said Jay LeBoeuf, Descript's head of business and corporate development.
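The filler-word workflow described above — strike words in the transcript and have the corresponding audio cut for you — depends on word-level timestamps that map each token to a span of the recording. A minimal sketch of that mapping (not Descript's actual implementation; the filler list and padding value are illustrative assumptions):

```python
FILLERS = {"um", "uh", "like"}  # illustrative; real editors let users choose

def keep_segments(words, pad=0.02):
    """words: list of (token, start_sec, end_sec) from speech-to-text.
    Returns the cleaned transcript and the (start, end) audio spans to keep,
    merging spans separated by gaps no larger than `pad` seconds."""
    segments = []
    kept = []
    for token, start, end in words:
        if token.strip(",.").lower() in FILLERS:
            continue  # striking the word drops its audio span
        kept.append(token)
        if segments and start - segments[-1][1] <= pad:
            segments[-1] = (segments[-1][0], end)  # extend previous span
        else:
            segments.append((start, end))          # start a new span
    return " ".join(kept), segments
```

A renderer would then concatenate only the kept spans, so deleting text in the document is equivalent to cutting the tape.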


Descript's text-based video editor now lets you write scripts as you go

Engadget

Descript aims to simplify video editing by making it a matter of tweaking transcripts, but now you don't even need to have ready-made audio. The company has redesigned Descript with a new interface that includes a writing tool. You can write a script in Overdub on the fly and either use text-to-speech to vocalize your narration or replace it with your own recording later. This could mainly be helpful if your content doesn't have any spoken-word material, but it might also come in handy if you're not comfortable speaking. The app as a whole now centers on "Scenes," or distinct visual segments (pictured above).


The Future of Podcasting is AI

#artificialintelligence

Roughly speaking, about 22,000 new podcasts are launched each month. There are close to 2.5 million (more than 71 million episodes) in the Apple Podcasts directory right now, according to Podcast Industry Insights. And those are just the ones we know about. "They're going direct to their listeners, selling premium content and having big success," says Andy Taylor, formerly of BBC Radio and founder of Cardiff-based R&D consultancy Bwlb. And that's to say nothing of the growing volume of podcast-like content, whether created by brands for promotion or by event producers that want, for example, to make talks available on demand. Every piece of content needs to be produced and distributed, whether by audio professionals or by folks learning the craft. Therefore, the more they can automate large swaths of production, the more they can focus on the content. "The different places audio is being published have just exploded," explains Jonathan Wyner, chief engineer at M Works Mastering and a professor at Berklee College of Music in Boston. "With all those contexts, there is a real motivation and imperative for creators to be more versatile." Not to mention more productive and efficient. Artificial intelligence (AI) -- software that can automate tasks previously done by humans -- holds the key to handling the tsunami of podcast content. Not only can AI speed up production, it can make podcasts sound better and set the stage for the audio experiences of tomorrow. "AI basically helps take care of repetitive tasks to quicken the workflow of the podcaster," explains Manos Chourdakis, research engineer at Nomono, which develops AI-based podcasting tools. "For example, with AI, you don't have to listen to a whole podcast to find where someone said something wrong, then replace or remove it."


Everyone will be able to clone their voice in the future

#artificialintelligence

Cloning your voice using artificial intelligence is simultaneously tedious and simple: hallmarks of a technology that's just about mature and ready to go public. All you need to do is talk into a microphone for 30 minutes or so, reading a script as carefully as you can (in my case: the voiceover from a David Attenborough documentary). After starting and stopping dozens of times to re-record your flubs and mumbles, you'll send off the resulting audio files to be processed and, in a few hours' time, be told that a copy of your voice is ready and waiting. Then, you can type anything you want into a chatbox, and your AI clone will say it back to you, with the resulting audio realistic enough to fool even friends and family -- at least for a few moments. The fact that such a service even exists may be news to many, and I don't believe we've begun to fully consider the impact easy access to this technology will have.


4 Ways Your Startup Can Use AI Right Now (Without Breaking The Bank)

#artificialintelligence

Artificial intelligence (AI) is what computer scientist Andrew Ng calls "the new electricity." However, despite its abilities and appeal, AI is not a fit for every situation. In my earlier article, I presented five scenarios in which to avoid investing in AI. To find out if your startup needs AI, start by prioritizing your business problems. Frame the best approach to solve these challenges and evaluate how technology can help you.


Anthony Bourdain's voice-cloning for new doc called into question: It's 'a slippery slope'

FOX News

Fox News Flash top entertainment and celebrity headlines are here. Check out what's clicking today in entertainment. The revelation that a documentary filmmaker used voice-cloning software to make the late chef Anthony Bourdain say words he never spoke has drawn criticism amid ethical concerns about use of the powerful technology. The movie "Roadrunner: A Film About Anthony Bourdain" appeared in cinemas Friday and mostly features real footage of the beloved celebrity chef and globe-trotting television host before he died in 2018. But its director, Morgan Neville, told The New Yorker that a snippet of dialogue was created using artificial intelligence technology.


Descript lets you edit videos by tweaking text scripts

Engadget

Video editing is often a time-consuming process, but Descript is trying to take the sting out of it a bit with its latest suite of tools. Descript Video transcribes your footage and turns it into a text document. Changes that you make there are reflected in your video edit. Cutting a flubbed line is as simple as deleting the transcribed text. You can even dub over any misspeaks by changing the words in the text editor -- Descript's AI-based tech can add audio in your own voice.
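The overdub behavior described here — change a word in the transcript and have synthesized audio in your own voice spliced into the track — can be sketched as an edit timeline. This is an illustrative sketch, not Descript's implementation: `tts` stands in for a hypothetical voice-clone function, and the data shapes are assumptions.

```python
def overdub_edit(segments, edits, tts):
    """Build a splice plan for an audio track.

    segments: list of (start_sec, end_sec, text) word-aligned spans.
    edits:    {segment_index: replacement_text} from the transcript editor.
    tts:      stand-in for a text-to-speech model trained on the speaker.
    Returns a timeline of ("keep", start, end) and ("insert", clip) ops
    that a renderer could apply to produce the corrected audio.
    """
    timeline = []
    for i, (start, end, text) in enumerate(segments):
        if i in edits:
            timeline.append(("insert", tts(edits[i])))  # splice in new audio
        else:
            timeline.append(("keep", start, end))       # reuse original span
    return timeline
```

Fixing a misspoken word is then just `overdub_edit(spans, {bad_index: "corrected word"}, voice_model)` — the text document and the audio stay in lockstep.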