Pengi: An Audio Language Model for Audio Tasks

Neural Information Processing Systems

In the domain of audio processing, Transfer Learning has facilitated the rise of Self-Supervised Learning and Zero-Shot Learning techniques. These approaches have led to the development of versatile models capable of tackling a wide array of tasks while delivering state-of-the-art performance. However, current models inherently lack the capacity to produce the requisite language for open-ended tasks such as Audio Captioning or Audio Question Answering. We introduce Pengi, a novel Audio Language Model that leverages Transfer Learning by framing all audio tasks as text-generation tasks. It takes an audio recording and text as input, and generates free-form text as output.
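The abstract's input/output contract (audio plus a text prompt in, free-form text out) can be made concrete with a small sketch. Everything below is an illustrative assumption rather than Pengi's actual architecture: a stand-in audio encoder pools mel features into a fixed-length embedding prefix, which is prepended to the embedded prompt and fed to a toy causal Transformer.

```python
# Minimal sketch of an audio-and-text-to-text model. Module names and all
# sizes are toy assumptions; Pengi itself couples pretrained audio/text
# encoders with a pretrained language model.
import torch
import torch.nn as nn

class AudioPrefixLM(nn.Module):
    def __init__(self, vocab_size=1000, n_mels=64, d_model=256, prefix_len=8):
        super().__init__()
        self.prefix_len, self.d_model = prefix_len, d_model
        # Stand-in audio encoder: mean-pool mel frames, then map the pooled
        # vector to `prefix_len` embedding vectors that act as an audio prefix.
        self.audio_mapper = nn.Linear(n_mels, prefix_len * d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, mel, prompt_ids):
        # mel: (batch, frames, n_mels); prompt_ids: (batch, prompt_len)
        pooled = mel.mean(dim=1)
        prefix = self.audio_mapper(pooled).view(-1, self.prefix_len, self.d_model)
        seq = torch.cat([prefix, self.token_emb(prompt_ids)], dim=1)
        # Causal mask so each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        return self.lm_head(self.backbone(seq, mask=mask))

model = AudioPrefixLM()
logits = model(torch.randn(1, 100, 64), torch.randint(0, 1000, (1, 5)))
print(logits.shape)  # (1, 8 + 5, 1000): next-token logits over the sequence
```

Training such a model would supervise next-token prediction on the target text; at inference, free-form output is produced by repeatedly sampling a token from the final logits and appending it to the prompt.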



OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models

Chen, Shengkai, Yin, Yifang, Cao, Jinming, Xiang, Shili, Liu, Zhenguang, Zimmermann, Roger

arXiv.org Artificial Intelligence

Audio-visual segmentation aims to separate sounding objects from videos by predicting pixel-level masks based on audio signals. Existing methods primarily concentrate on closed-set scenarios and direct audio-visual alignment and fusion, which limits their capability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training-free, language-based approach that, for the first time, effectively aligns audio and visual modalities using text as a proxy for open-vocabulary Audio-Visual Segmentation (AVS). Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio-to-text prompt generation, 2) LLM-guided prompt translation, and 3) text-to-visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that relies on the most appropriate foundation models by fully leveraging their capabilities to enable more effective knowledge transfer to the downstream AVS task. Moreover, we present a model-agnostic framework, OpenAVS-ST, that enables the integration of OpenAVS with any advanced supervised AVS model via pseudo-label-based self-training. This approach enhances performance by effectively utilizing large-scale unlabeled data when available. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of OpenAVS. It surpasses existing unsupervised, zero-shot, and few-shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F-score, respectively, in challenging scenarios.
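The three inference stages read naturally as a pipeline. The sketch below wires them together; the helper names, signatures, and stubbed outputs are assumptions for illustration, to be replaced by whichever audio captioner, LLM, and text-prompted segmenter one actually plugs in.

```python
# Sketch of the three-stage, training-free OpenAVS pipeline as the abstract
# describes it. Each helper is a placeholder for a foundation model call.
from typing import List

def audio_to_text(audio_path: str) -> str:
    """Stage 1: an audio captioning model describes the soundtrack."""
    return "a dog barking and a car engine idling"    # stubbed output

def llm_translate_prompts(audio_caption: str) -> List[str]:
    """Stage 2: an LLM translates the audio description into visual-object
    prompts, e.g. nouns a segmenter can ground in the frame."""
    return ["dog", "car"]                             # stubbed output

def segment_by_text(frame, prompts: List[str]) -> dict:
    """Stage 3: a text-prompted segmenter returns one mask per prompt."""
    return {p: None for p in prompts}                 # stubbed masks

def openavs_infer(audio_path: str, frame) -> dict:
    caption = audio_to_text(audio_path)               # 1) audio -> text
    prompts = llm_translate_prompts(caption)          # 2) LLM prompt translation
    return segment_by_text(frame, prompts)            # 3) text -> masks

print(openavs_infer("clip.wav", frame=None))
```

Because the stages communicate only through text, any component can be swapped out independently, which is the flexibility the abstract claims for the architecture.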



Penguins Can Make Cake

AI Magazine

Until quite recently, it was taken for granted in AI, and cognitive science more broadly, that activity resulted from the creation and execution of plans. In 1985, several researchers, including myself, independently realized that plans and planning are not necessary, or necessarily useful, in activity. Since this time, a number of alternatives have been proposed. This analysis is equally applicable to any other computational problem. Thus, you could conclude that vision is impossible because it requires exponential computation in the number of pixels or that, on the average, business data processing takes exponential work in the number of records.


Penguins Can Make Cake

Chapman, David

AI Magazine

Since this time, a number of alternatives have been proposed. Ginsberg's article, "Universal Planning: An (Almost) Universally Bad Idea," analyzes one such alternative, Marcel Schoppers's universal plans. He also extends this analysis to a number of other systems, including Pengi (Agre and Chapman 1987), which was designed by Phil Agre and myself. Ginsberg's criticisms of universal plans rest on a counting argument. Using universal plans, he says, is infeasible because their size is exponential in the number of possible domain states; representing such a plan is infeasible in even quite small realistic domains. Presumably, in realistic cases, the number of sensors is large enough that a universal plan could not fit in your head. I'm sympathetic to such arguments, having made similar ones to the effect that classical planning is infeasible (Agre and Chapman 1988; Chapman 1987b). I don't understand the details of Schoppers's ideas, so I'm not sure whether this critique of universal plans per se is correct. However, I show that these arguments do not extend to Pengi. Ginsberg calls Pengi an approximate universal plan, by which he means it is like a universal plan except that it does not correctly specify what to do in every situation. However, Pengi's operation involves no plans, universal or approximate, and Pengi and universal plans, although they share some motivations, have little to do with each other as technical proposals. Since this is a counting argument, the conclusion is equally applicable to any other computational problem: you could conclude that vision is impossible because it requires exponential computation in the number of pixels or that, on the average, business data processing takes exponential work in the number of records. There are two reasons not to be concerned about this apparent problem; they involve structure and state. Realistic problems have a lot of structure to them, and this structure can be exploited to exponentially reduce the computation's size. I present a Pengi-like system, Blockhead, which efficiently solves the fruitcake problem; the way it solves it elucidates this point. The fruitcake problem is to stack a set of labeled blocks so that they spell the word fruitcake. I show Blockhead solving a problem involving 45 blocks, in which there are 45! (about 10^56) configurations; most configurations are impossible under the rules of the domain, and the remainder can be categorized relatively cheaply to permit abstraction and approximation. Blockhead does the right thing in every configuration, so it is not by approximation that it succeeds. Indeed, Ginsberg makes this point himself: "[planning couldn't work if] there were no rhyme or reason to things."
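The 45-block count in the snippet above is easy to verify; a quick arithmetic check (Python here purely for illustration) confirms the order of magnitude:

```python
# The counting argument's number: 45 distinct labeled blocks admit 45!
# orderings, the state count a table-driven "universal plan" would have
# to cover entry by entry.
import math

n_configs = math.factorial(45)
print(f"{n_configs:.2e}")  # ~1.20e+56, i.e. about 10^56 configurations
```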