Visual Acoustic Fields
Li, Yuelei, Kim, Hyunjin, Zhan, Fangneng, Qiu, Ri-Zhao, Ji, Mazeyu, Shan, Xiaojun, Zou, Xueyan, Liang, Paul, Pfister, Hanspeter, Wang, Xiaolong
Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/.
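To make the two-module design concrete, here is a minimal PyTorch sketch of how the interfaces could look. The class names, feature dimensions, and tensor shapes are illustrative assumptions, not the authors' implementation: a stand-in for the conditional diffusion generator maps features rendered from the feature-augmented 3DGS to a waveform, and the localizer embeds a query sound and scores it against per-Gaussian features to pick a hit position.

```python
# Minimal sketch (not the authors' code) of the two-module interface described above.
# All module names, dimensions, and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoundGenerator(nn.Module):
    """Stand-in for the conditional diffusion model: maps multiscale features
    rendered from a feature-augmented 3DGS to an audio waveform."""
    def __init__(self, feat_dim=256, audio_len=16000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                 nn.Linear(512, audio_len))

    def forward(self, rendered_feats):            # (B, feat_dim)
        return self.net(rendered_feats)           # (B, audio_len) waveform

class SoundLocalizer(nn.Module):
    """Stand-in for sound localization: embeds a query sound and scores it
    against per-Gaussian features to find the most likely hit position."""
    def __init__(self, feat_dim=256, audio_len=16000):
        super().__init__()
        self.audio_encoder = nn.Linear(audio_len, feat_dim)

    def forward(self, sound, gaussian_feats, gaussian_xyz):
        # sound: (B, audio_len); gaussian_feats: (N, feat_dim); gaussian_xyz: (N, 3)
        q = self.audio_encoder(sound)                                          # (B, feat_dim)
        sim = F.cosine_similarity(q[:, None], gaussian_feats[None], dim=-1)    # (B, N)
        best = sim.argmax(dim=-1)                  # index of best-matching Gaussian per query
        return gaussian_xyz[best]                  # (B, 3) predicted hit positions

# Toy usage with random tensors standing in for a real scene.
feats = torch.randn(4, 256)                        # features rendered at 4 query pixels
wave = SoundGenerator()(feats)                     # generated hitting sounds
pos = SoundLocalizer()(wave, torch.randn(1000, 256), torch.randn(1000, 3))
print(wave.shape, pos.shape)
```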
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Florida > Orange County > Orlando (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
Meta's Movie Gen Makes Convincing AI Video Clips
Meta just announced its own media-focused AI model, called Movie Gen, that can be used to generate realistic video and audio clips. The company shared multiple 10-second clips generated with Movie Gen, including a Moo Deng-esque baby hippo swimming around, to demonstrate its capabilities. While the tool is not yet available for use, the Movie Gen announcement comes shortly after the company's Meta Connect event, which showcased new and refreshed hardware and the latest version of its large language model, Llama 3.2. Going beyond straightforward text-to-video generation, the Movie Gen model can make targeted edits to an existing clip, like adding an object into someone's hands or changing the appearance of a surface. In one of the example videos from Meta, a woman wearing a VR headset was transformed to look like she was wearing steampunk binoculars.
- Leisure & Entertainment (0.94)
- Media > Film (0.73)
- Information Technology > Services (0.51)
Noah Raford Can Help You Prepare for a Not-So-Nice Future
Lauren Goode: Alright, I'm gonna ask the question that everyone's wondering about: What is a futurist? Gideon Lichfield: Well, I mean, I think some people imagine it's just, you know, a guy who sits around making predictions about the future, and there are probably some people who do just that. But Noah calls himself an applied futurist, by which he means that he studies trends--technological, economic, demographic, political, you name it. And then he works within institutions like the government to help them take those trends into account in their decision-making and their policies. So how should they think about the impact of AI, for instance?
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Communications > Mobile (0.40)
Improving automated segmentation of radio shows with audio embeddings
Berlage, Oberon, Lux, Klaus-Michael, Graus, David
Audio features have proven useful for improving the performance of automated topic segmentation systems. This study explores the novel task of using audio embeddings for automated, topically coherent segmentation of radio shows. We created three different audio embedding generators using multi-class classification tasks on three datasets from different domains. We evaluate the topic segmentation performance of the audio embeddings and compare it against a text-only baseline. We find that a setup including audio embeddings generated through a non-speech sound event classification task significantly outperforms our text-only baseline by 32.3% in F1 measure. In addition, we find that different classification tasks yield audio embeddings that vary in segmentation performance.
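As a rough illustration of how audio embeddings can be combined with text features for segmentation, the sketch below scores topic boundaries TextTiling-style from the similarity of adjacent windows. The embedding dimensions, threshold, and random inputs are assumptions for illustration, not the paper's setup.

```python
# Minimal sketch (assumptions, not the authors' pipeline): TextTiling-style
# boundary detection where each window is represented by the concatenation of
# a text embedding and an audio embedding from a sound-event classifier.
import numpy as np

def boundary_scores(text_emb, audio_emb):
    """text_emb: (T, d_text), audio_emb: (T, d_audio), one row per time window.
    Returns T-1 depth scores; low similarity between adjacent windows suggests
    a topic boundary."""
    x = np.concatenate([text_emb, audio_emb], axis=1)
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sims = np.sum(x[:-1] * x[1:], axis=1)        # cosine similarity of adjacent windows
    return 1.0 - sims                            # higher score = more likely boundary

def segment(text_emb, audio_emb, threshold=0.5):
    scores = boundary_scores(text_emb, audio_emb)
    return [i + 1 for i, s in enumerate(scores) if s > threshold]

# Toy usage with random embeddings standing in for real transcript/audio features.
rng = np.random.default_rng(0)
print(segment(rng.normal(size=(20, 768)), rng.normal(size=(20, 128))))
```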
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Asia > Japan > Kyūshū & Okinawa > Okinawa (0.04)
- (7 more...)