Goto

Collaborating Authors

 Representation & Reasoning


Tell What You Hear From What You See - Video to Audio Generation Through Text

Neural Information Processing Systems

The content of visual and audio scenes is multi-faceted such that a video stream can be paired with various audio streams and vice-versa. Thereby, in video-to-audio generation task, it is imperative to introduce steering approaches for controlling the generated audio. While Video-to-Audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and optional textual description (caption) of the audio. Such a framework has two unique advantages: i) Video-to-Audio generation process can be refined and controlled via text which complements the context of the visual information, and ii) The model can suggest what audio to generate for the video by generating audio captions.




Fast Projected Newton-like Method for Precision Matrix Estimation under Total Positivity

Neural Information Processing Systems

The precision matrix in such a distribution is an M-matrix. This problem can be formulated as a signconstrained log-determinant program. Current algorithms are designed using the block coordinate descent method or the proximal point algorithm, which becomes computationally challenging in high-dimensional cases due to the requirement to solve numerous nonnegative quadratic programs or large-scale linear systems. To address this issue, we propose a novel algorithm based on the two-metric projection method, incorporating a carefully designed search direction and variable partitioning scheme. Our algorithm substantially reduces computational complexity, and its theoretical convergence is established. Experimental results on synthetic and real-world datasets demonstrate that our proposed algorithm provides a significant improvement in computational efficiency compared to the state-of-the-art methods.


Voxel Proposal Network via Multi-Frame Knowledge Distillation for Semantic Scene Completion Kairui Yang

Neural Information Processing Systems

Semantic scene completion is a difficult task that involves completing the geometry and semantics of a scene from point clouds in a large-scale environment. Many current methods use 3D/2D convolutions or attention mechanisms, but these have limitations in directly constructing geometry and accurately propagating features from related voxels, the completion likely fails while propagating features in a single pass without considering multiple potential pathways. And they are generally only suitable for static scenes and struggle to handle dynamic aspects. This paper introduces Voxel Proposal Network (VPNet) that completes scenes from 3D and Bird's-Eye-View (BEV) perspectives. It includes Confident Voxel Proposal based on voxel-wise coordinates to propose confident voxels with high reliability for completion.


An Information Theoretic Perspective on Conformal Prediction Qualcomm AI Research

Neural Information Processing Systems

Conformal Prediction (CP) is a distribution-free uncertainty estimation framework that constructs prediction sets guaranteed to contain the true answer with a userspecified probability. Intuitively, the size of the prediction set encodes a general notion of uncertainty, with larger sets associated with higher degrees of uncertainty. In this work, we leverage information theory to connect conformal prediction to other notions of uncertainty. More precisely, we prove three different ways to upper bound the intrinsic uncertainty, as described by the conditional entropy of the target variable given the inputs, by combining CP with information theoretical inequalities. Moreover, we demonstrate two direct and useful applications of such connection between conformal prediction and information theory: (i) more principled and effective conformal training objectives that generalize previous approaches and enable end-to-end training of machine learning models from scratch, and (ii) a natural mechanism to incorporate side information into conformal prediction. We empirically validate both applications in centralized and federated learning settings, showing our theoretical results translate to lower inefficiency (average prediction set size) for popular CP methods.


CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming Ali TehraniJamsaz, Arijit Bhattacharjee, Le Chen, Nesreen K. Ahmed Amir Yazdanbakhsh

Neural Information Processing Systems

Recent advancements in Large Language Models (LLMs) have renewed interest in automatic programming language translation. Encoder-decoder transformer models, in particular, have shown promise in translating between different programming languages. However, translating between a language and its high-performance computing (HPC) extensions remains underexplored due to challenges such as complex parallel semantics. In this paper, we introduce CodeRosetta, an encoder-decoder transformer model designed specifically for translating between programming languages and their HPC extensions.


Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models

Neural Information Processing Systems

Text-to-image diffusion models have shown remarkable success in generating personalized subjects based on a few reference images. However, current methods often fail when generating multiple subjects simultaneously, resulting in mixed identities with combined attributes from different subjects. In this work, we present MuDI, a novel framework that enables multi-subject personalization by effectively decoupling identities from multiple subjects. Our main idea is to utilize segmented subjects generated by a foundation model for segmentation (Segment Anything) for both training and inference, as a form of data augmentation for training and initialization for the generation process. Moreover, we further introduce a new metric to better evaluate the performance of our method on multisubject personalization. Experimental results show that our MuDI can produce high-quality personalized images without identity mixing, even for highly similar subjects as shown in Figure 1. Specifically, in human evaluation, MuDI obtains twice the success rate for personalizing multiple subjects without identity mixing over existing baselines and is preferred over 70% against the strongest baseline. Our project page is at https://mudi-t2i.github.io/.


A Graph Theoretic Additive Approximation of Optimal Transport

Neural Information Processing Systems

Transportation cost is an attractive similarity measure between probability distributions due to its many useful theoretical properties. However, solving optimal transport exactly can be prohibitively expensive. Therefore, there has been significant effort towards the design of scalable approximation algorithms.


A Condition Number for Joint Optimization of Cycle-Consistent Networks

Neural Information Processing Systems

A recent trend in optimizing maps such as dense correspondences between objects or neural networks between pairs of domains is to optimize them jointly. In this context, there is a natural cycle-consistency constraint, which regularizes composite maps associated with cycles, i.e., they are forced to be identity maps. However, as there is an exponential number of cycles in a graph, how to sample a subset of cycles becomes critical for efficient and effective enforcement of the cycleconsistency constraint. This paper presents an algorithm that select a subset of weighted cycles to minimize a condition number of the induced joint optimization problem. Experimental results on benchmark datasets justify the effectiveness of our approach for optimizing dense correspondences between 3D shapes and neural networks for predicting dense image flows.