Supplementary Material: Learning Representations from Audio-Visual Spatial Alignment

Neural Information Processing Systems

These are transformer networks with base dimension 512 and expansion ratio 4. In other words, the output dimensionality of the linear transformations with parameters W_key, W_qry, W_val, W_0, and W_2 is 512, and that of W_1 is 2048. Models are pre-trained to optimize loss (7) for the AVC task, or loss (9) for the AVTS and AVSA tasks. As originally proposed, lateral connections are implemented with a 1×1 convolution that maps all feature maps into a 128-dimensional space, followed by a 3×3 convolution for increased smoothing. Thus, all pixels for which the state-of-the-art model was less than 75% confident were kept unlabeled. These low-confidence regions were also ignored when computing evaluation metrics.
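
For concreteness, the dimensions above can be sketched in PyTorch. This is an illustrative reconstruction, not the authors' released code; in particular, the head count (8) and the lateral connection's input channel count (512) are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

D_MODEL = 512      # base dimension: output size of W_key, W_qry, W_val, W_0, W_2
EXPANSION = 4      # expansion ratio: W_1 maps 512 -> 2048

class TransformerBlock(nn.Module):
    """One transformer layer with the dimensions described above."""
    def __init__(self, d_model=D_MODEL, n_heads=8, expansion=EXPANSION):
        super().__init__()
        # W_key, W_qry, W_val and the output projection W_0 all produce d_model features
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # W_1: 512 -> 2048, W_2: 2048 -> 512
        self.ffn = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),
            nn.ReLU(),
            nn.Linear(expansion * d_model, d_model),
        )

    def forward(self, x):
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        return self.norm2(x + self.ffn(x))

# Lateral connection as described: a 1x1 convolution into a 128-dimensional
# space, followed by a 3x3 convolution for smoothing (512 input channels assumed).
lateral = nn.Sequential(
    nn.Conv2d(in_channels=512, out_channels=128, kernel_size=1),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
)
```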


Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF

Neural Information Processing Systems

This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a trained network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a regularized interpreter module is trained to take hidden layer representations of the targeted network as input and produce time activations of pre-learnt NMF components as intermediate outputs. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision. We demonstrate our method's applicability on popular benchmarks, including a real-world multi-label classification task.
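
The interpreter design can be sketched as follows: a minimal PyTorch illustration in which a fixed, pre-learnt NMF dictionary W is combined with an interpreter that maps hidden activations of the target network to non-negative time activations H. All names and shapes below (N_FREQ, K, hidden_dim) are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

# Pre-learnt NMF model of the magnitude spectrogram: V (freq x time) ~= W @ H,
# with W (freq x K) fixed and H (K x time) the time activations to be predicted.
N_FREQ, K = 513, 100

class NMFInterpreter(nn.Module):
    """Maps a hidden-layer representation of the classifier to NMF time activations H."""
    def __init__(self, hidden_dim, n_components=K):
        super().__init__()
        self.to_h = nn.Sequential(
            nn.Linear(hidden_dim, n_components),
            nn.ReLU(),  # keep activations non-negative, as NMF requires
        )

    def forward(self, hidden):          # hidden: (time, hidden_dim)
        return self.to_h(hidden).T      # H: (n_components, time)

# Interpretation: re-synthesize the parts of the input explained by the components.
W = torch.rand(N_FREQ, K)               # pre-learnt (fixed) NMF dictionary
interpreter = NMFInterpreter(hidden_dim=256)
hidden = torch.rand(50, 256)            # 50 time frames taken from the target network
H = interpreter(hidden)
V_interpretation = W @ H                # listenable magnitude spectrogram (freq x time)
```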




Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization

Berghi, Davide, Jackson, Philip J. B.

arXiv.org Artificial Intelligence

Conventional audio-visual approaches for active speaker detection (ASD) typically rely on visually pre-extracted face tracks and the corresponding single-channel audio to find the speaker in a video. Therefore, they tend to fail whenever the speaker's face is not visible. We demonstrate that a simple audio convolutional recurrent neural network (CRNN) trained with spatial input features extracted from multichannel audio can perform simultaneous horizontal active speaker detection and localization (ASDL), independently of the visual modality. To address the time and cost of generating ground truth labels to train such a system, we propose a new self-supervised training pipeline that embraces a "student-teacher" learning approach. A conventional pre-trained active speaker detector is adopted as a "teacher" network to provide the positions of the speakers as pseudo-labels. The multichannel audio "student" network is trained to generate the same results. At inference, the student network can generalize and also locate occluded speakers that the teacher network cannot detect visually, yielding considerable improvements in recall rate. Experiments on the TragicTalkers dataset show that an audio network trained with the proposed self-supervised learning approach can exceed the performance of typical audio-visual methods and produce results competitive with costly conventional supervised training. We demonstrate that improvements can be achieved when minimal manual supervision is introduced in the learning pipeline. Further gains may be sought with larger training sets and by integrating vision with the multichannel audio system.
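
The student-teacher pipeline can be sketched as a pseudo-label regression loop. Everything below, including the network shapes, the MSE objective, and the placeholder teacher, is an illustrative assumption rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class AudioCRNN(nn.Module):
    """Student: CRNN mapping multichannel spatial audio features to a horizontal
    (azimuth) position per frame. Layer sizes are illustrative."""
    def __init__(self, n_feats=64, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),               # pool feature axis, keep time axis
        )
        self.gru = nn.GRU(32 * n_feats // 2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, 1, time, n_feats)
        z = self.conv(x)                        # (batch, 32, time, n_feats // 2)
        z = z.permute(0, 2, 1, 3).flatten(2)    # (batch, time, 32 * n_feats // 2)
        z, _ = self.gru(z)
        return self.head(z).squeeze(-1)         # (batch, time)

# One illustrative training step with dummy data (the real pipeline uses
# TragicTalkers and a pre-trained visual ASD model as the teacher).
def visual_teacher(_frames):                    # placeholder for the visual teacher
    return torch.rand(4, 50)                    # pseudo-label position per frame

spatial_feats = torch.rand(4, 1, 50, 64)        # (batch, channels, time, features)
student = AudioCRNN()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

with torch.no_grad():
    pseudo_pos = visual_teacher(None)           # teacher provides pseudo-labels
pred = student(spatial_feats)
loss = nn.functional.mse_loss(pred, pseudo_pos)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```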


Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech

Lee, Jiyoung, Chung, Joon Son, Chung, Soo-Whan

arXiv.org Artificial Intelligence

The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices learnt from facial characteristics. Inspired by the fact that people can imagine someone's voice when they look at their face, we introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework learnt from visible attributes, called Face-TTS. This is the first time that face images are used as a condition to train a TTS model. We jointly train cross-modal biometrics and TTS models to preserve speaker identity between face images and generated speech segments. We also propose a speaker feature binding loss to enforce the similarity of the generated and the ground truth speech segments in speaker embedding space. Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers. We train and evaluate the model on the LRS3 dataset, an in-the-wild audio-visual corpus containing background noise and diverse speaking styles. The project page is https://facetts.github.io.
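
A speaker feature binding loss of this kind can be sketched as one minus the cosine similarity between speaker embeddings of the generated and ground truth speech. `spk_enc` and all shapes below are hypothetical stand-ins, not the Face-TTS code.

```python
import torch
import torch.nn.functional as F

def speaker_binding_loss(spk_enc, generated_mel, target_mel):
    """Pull the generated speech toward the ground truth in speaker-embedding
    space. `spk_enc` is assumed to map a mel spectrogram to a speaker embedding;
    the name and interface are ours, not from the released code."""
    e_gen = F.normalize(spk_enc(generated_mel), dim=-1)
    e_ref = F.normalize(spk_enc(target_mel), dim=-1)
    return 1.0 - (e_gen * e_ref).sum(dim=-1).mean()   # 1 - cosine similarity

# Usage with a stand-in speaker encoder and dummy 80-bin mel features:
spk_enc = torch.nn.Linear(80, 256)
loss = speaker_binding_loss(spk_enc, torch.rand(4, 80), torch.rand(4, 80))
```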


Is Artificial Intelligence about to transform the sync industry? - Music Business Worldwide

#artificialintelligence

There's been plenty of discussion and debate on MBW's pages regarding the impact that Artificial Intelligence might have on the music business in the future. Obviously, there's its potentially seismic effect on the way musicians make music – whether that's AI producing non-human music from scratch, or providing tools that artists and songwriters can use to compose and perform in the studio. But there's also AI's application to more practical B2B tools to consider. Just last week, for example, we heard from Canada-based LANDR, which has launched an AI tool that helpfully sifts through its huge catalog of samples for those looking for a specific sound. Today (September 4), a new twist on AI arrives via a fresh partnership between production music library Audio Network and Singapore-based machine learning company Musiio.

  Industry: Media > Music (1.00)

Audio Network Partners with Musiio to Harness the Power of Artificial Intelligence (AI)

#artificialintelligence

Audio Network Limited, one of the world's largest independent creators and publishers of original high-quality music for use in film, television, advertising and digital media, continues its focus on technology by partnering with Musiio to explore the power of AI to improve customer service and delivery. This industry first will equip the global music company with an added interface to their existing search platform, to make their catalogue of over 170,000 tracks even more discoverable, whilst keeping the human touch that Audio Network has always been known for. Singapore-based Musiio provides a new way of "listening" to music at scale, easily searching up to one million tracks in under two seconds and supercharging a team of music researchers to increase their efficiency in responding to music briefs. "AI has been on the fringes of the music industry for the last few years, with talk of labels signing algorithms. But recently, more commercial and practical uses of this powerful computing technology have begun to surface," explained Musiio CEO and co-founder Hazel Savage.

  Country: Asia > Singapore (0.27)
  Genre: Press Release (0.34)
  Industry: Media > Music (0.63)