
Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Neural Information Processing Systems

Discriminatively localizing sounding objects in cocktail-party scenarios, i.e., mixed sound scenes, is commonplace for humans but remains challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we learn robust object representations by aggregating the candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are selected by matching audio and visual object category distributions, where audiovisual consistency serves as the self-supervised signal. Experimental results on both realistic and synthesized cocktail-party videos demonstrate that our model is superior at filtering out silent objects and localizing sounding objects of different classes. Code is available at https://github.com/DTaoo/
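
To make the distribution-matching step concrete, here is a minimal sketch of comparing audio and visual object-category distributions with a KL divergence as the self-supervised consistency signal. Tensor names, shapes, and the spatial pooling choice are illustrative assumptions, not the released implementation.

```python
# Hedged sketch: matching audio and visual category distributions as a
# self-supervised signal. All names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def category_matching_loss(audio_logits, visual_class_maps):
    """audio_logits: (B, C) class scores predicted from the sound.
    visual_class_maps: (B, C, H, W) class-aware localization maps."""
    # Audio category distribution.
    p_audio = F.softmax(audio_logits, dim=1)                    # (B, C)
    # Visual category distribution: pool localization maps over space.
    visual_scores = visual_class_maps.flatten(2).mean(dim=2)    # (B, C)
    p_visual = F.softmax(visual_scores, dim=1)
    # KL divergence between the two distributions as the consistency loss.
    return F.kl_div(p_visual.log(), p_audio, reduction="batchmean")

# Toy usage: 2 videos, 5 object classes, 14x14 localization maps.
loss = category_matching_loss(torch.randn(2, 5), torch.randn(2, 5, 14, 14))
```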


LRM-Zero: Training Large Reconstruction Models with Synthesized Data

Neural Information Processing Systems

We present LRM-Zero, a Large Reconstruction Model (LRM) trained entirely on synthesized 3D data, achieving high-quality sparse-view 3D reconstruction. The core of LRM-Zero is our procedural 3D dataset, Zeroverse, which is automatically synthesized from simple primitive shapes with random texturing and augmentations (e.g., height fields, boolean differences, and wireframes). Unlike previous 3D datasets (e.g., Objaverse), which are often captured or crafted by humans to approximate real 3D data, Zeroverse completely ignores realistic global semantics but is rich in complex geometric and texture details that are locally similar to, or even more intricate than, real objects. We demonstrate that LRM-Zero, trained on our fully synthesized Zeroverse, achieves high visual quality in the reconstruction of real-world objects, competitive with models trained on Objaverse. We also analyze several critical design choices of Zeroverse that contribute to LRM-Zero's capability and training stability. Our work demonstrates that 3D reconstruction, one of the core tasks in 3D vision, can potentially be addressed without the semantics of real-world objects. Zeroverse's procedural synthesis code and interactive visualization are available at: https://desaixie.github.io/lrm-zero/.
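
As a rough illustration of primitive-based procedural synthesis with boolean differences, the sketch below composes random spheres and boxes via signed distance functions (SDFs) and subtracts a cutting sphere. The SDF formulation is used purely for exposition; this is not the released Zeroverse generation code.

```python
# Hedged sketch of procedural shape synthesis in the spirit of Zeroverse:
# union random primitives, then apply a boolean difference, all expressed
# with signed distance functions. Illustration only.
import numpy as np

def sphere_sdf(p, center, radius):
    return np.linalg.norm(p - center, axis=-1) - radius

def box_sdf(p, center, half_extents):
    q = np.abs(p - center) - half_extents
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=-1)
    inside = np.minimum(q.max(axis=-1), 0.0)
    return outside + inside

rng = np.random.default_rng(0)

def random_shape_sdf(p):
    # Union (min) of a few random primitives, minus (max with negation) a sphere.
    d = np.full(p.shape[:-1], np.inf)
    for _ in range(3):
        c = rng.uniform(-0.5, 0.5, size=3)
        if rng.random() < 0.5:
            d = np.minimum(d, sphere_sdf(p, c, rng.uniform(0.2, 0.5)))
        else:
            d = np.minimum(d, box_sdf(p, c, rng.uniform(0.1, 0.4, size=3)))
    cut = sphere_sdf(p, rng.uniform(-0.5, 0.5, size=3), rng.uniform(0.2, 0.4))
    return np.maximum(d, -cut)   # boolean difference: shape minus cutting sphere

# Evaluate on a coarse grid; negative values lie inside the synthesized shape.
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 32)] * 3, indexing="ij"), axis=-1)
occupancy = random_shape_sdf(grid) < 0
```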


Deterministic Langevin Monte Carlo with Normalizing Flows for Bayesian Inference

Neural Information Processing Systems

We propose a general-purpose Bayesian inference algorithm for expensive likelihoods, replacing the stochastic term in the Langevin equation with a deterministic density gradient term. The particle density is evaluated from the current particle positions using a Normalizing Flow (NF), which is differentiable and generalizes well in high dimensions. We take advantage of NF preconditioning and NF-based Metropolis-Hastings updates for faster convergence. We show on various examples that the method is competitive against state-of-the-art sampling methods.
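
A minimal sketch of the deterministic Langevin update, x <- x + eps * (grad log p(x) - grad log q(x)), is given below, with a simple Gaussian fit standing in for the Normalizing Flow density estimator. The target, step size, and density fit are illustrative assumptions only.

```python
# Hedged sketch of a deterministic Langevin update: the Brownian noise term is
# replaced by the gradient of the log particle density. A Gaussian fit stands
# in for the Normalizing Flow used in the paper; names are illustrative.
import numpy as np

def grad_log_target(x):
    # Example target: standard normal in d dimensions, grad log p(x) = -x.
    return -x

def grad_log_gaussian_fit(x, particles, eps=1e-6):
    # Stand-in for the NF: fit a Gaussian to the particles and differentiate.
    mu = particles.mean(axis=0)
    cov = np.cov(particles, rowvar=False) + eps * np.eye(particles.shape[1])
    return -np.linalg.solve(cov, (x - mu).T).T   # grad log N(x; mu, cov)

rng = np.random.default_rng(0)
particles = rng.normal(loc=3.0, scale=0.5, size=(500, 2))  # poor initialization
step = 0.1
for _ in range(200):
    drift = grad_log_target(particles) - grad_log_gaussian_fit(particles, particles)
    particles = particles + step * drift

print(particles.mean(axis=0), particles.std(axis=0))  # approaches 0 and 1
```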



Color-Oriented Redundancy Reduction in Dataset Distillation

Neural Information Processing Systems

Dataset Distillation (DD) is designed to generate condensed representations of extensive image datasets, enhancing training efficiency. Despite recent advances, there remains considerable potential for improvement, particularly in addressing the notable redundancy within the color space of distilled images. In this paper, we propose AutoPalette, a framework that minimizes color redundancy at both the individual-image and overall-dataset levels. At the image level, we employ a palette network, a specialized neural network, to dynamically allocate colors from a reduced color space to each pixel. The palette network identifies essential areas in synthetic images for model training and consequently assigns more unique colors to them. At the dataset level, we develop a color-guided initialization strategy to minimize redundancy among images: representative images with the least replicated color patterns are selected based on information gain. A comprehensive performance study involving various datasets and evaluation scenarios demonstrates the superior performance of our proposed color-aware DD compared to existing DD methods.
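
The sketch below illustrates one plausible reading of the color-guided selection step: quantize candidate images to a reduced color space and greedily pick those whose histograms add the most entropy (information gain) to the selected pool. All functions and parameters here are illustrative assumptions, not the AutoPalette implementation.

```python
# Hedged sketch of color-guided image selection via histogram entropy gain.
import numpy as np

def color_histogram(img, bins=8):
    # img: (H, W, 3) uint8. Joint RGB histogram over a reduced color space.
    q = (img.astype(np.int64) * bins) // 256                      # quantize channels
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

def greedy_color_selection(images, k):
    hists = [color_histogram(im) for im in images]
    selected, pool = [], np.zeros_like(hists[0])
    for _ in range(k):
        base = entropy(pool / len(selected)) if selected else 0.0
        best, best_gain = None, -np.inf
        for i, h in enumerate(hists):
            if i in selected:
                continue
            gain = entropy((pool + h) / (len(selected) + 1)) - base
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        pool = pool + hists[best]
    return selected

rng = np.random.default_rng(0)
imgs = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(20)]
print(greedy_color_selection(imgs, k=5))
```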


Supplement to WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking

Neural Information Processing Systems

SIDER is a dataset for predicting side effects from small-molecule structure. It contains 27 classification tasks, corresponding to the 27 system organ classes in the MedDRA classification [1]. A closer look at the MedDRA system organ classification on its website reveals the claim that "System Organ Classes (SOCs) which are groupings by aetiology (e.g. [...]). In addition, there is a SOC to contain issues pertaining to products and one to contain social circumstances." Indeed, two of the 27 tasks are named "Social circumstances" and "Product issues", corresponding to this claim. Predicting such labels from molecular structure alone is futile and therefore does not serve the purpose of a benchmarking dataset. Another problematic example in MoleculeNet is the PCBA dataset, originally used in [44]. As stated in the original paper, "It should be noted that we did not perform any preprocessing of our datasets, such as removing potential experimental artifacts", whereas we have demonstrated the importance of removing experimental artifacts in the data processing pipeline in the main text. Further issues with MoleculeNet can be found in [52]. For Therapeutics Data Commons (TDC) [24], we applied the filters from our pipeline to its small-molecule-related tasks and found issues with them; the promiscuity filter was not applied due to its long running time.


WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking

Neural Information Processing Systems

While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation and placed less emphasis on establishing best benchmarking practices.


G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models

Neural Information Processing Systems

Worldwide geolocalization aims to predict the precise, coordinate-level location of photos taken anywhere on Earth. It is very challenging due to 1) the difficulty of capturing subtle location-aware visual semantics, and 2) the heterogeneous geographical distribution of image data. As a result, existing studies have clear limitations when scaled to a worldwide context: they may easily confuse distant images with similar visual content, or fail to adapt to locations with differing amounts of relevant data. To resolve these limitations, we propose G3, a novel framework based on Retrieval-Augmented Generation (RAG).
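
The retrieval half of such a RAG pipeline can be sketched as follows: embed the query photo, retrieve the most similar geo-tagged references by cosine similarity, and pass their coordinates to the multimodal generator as context. The encoder, database, and shapes below are placeholders rather than G3 components.

```python
# Hedged sketch of retrieval for RAG-style geolocalization. Placeholders only.
import numpy as np

def cosine_retrieve(query_emb, db_embs, k=5):
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
db_embs = rng.normal(size=(10_000, 512))           # embeddings of geo-tagged photos
db_coords = rng.uniform([-90, -180], [90, 180], size=(10_000, 2))  # (lat, lon)
query_emb = rng.normal(size=512)                   # embedding of the query photo

top = cosine_retrieve(query_emb, db_embs, k=5)
# The retrieved coordinates would be inserted into the generator's prompt.
print(db_coords[top])
```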


ACIL: Analytic Class-Incremental Learning with Absolute Memorization and Privacy Protection

Neural Information Processing Systems

Class-incremental learning (CIL) learns a classification model from training data of different classes arriving progressively. Existing CIL either suffers from serious accuracy loss due to catastrophic forgetting, or invades data privacy by revisiting used exemplars. Inspired by linear learning formulations, we propose analytic class-incremental learning (ACIL), which achieves absolute memorization of past knowledge while avoiding breaches of data privacy (i.e., without storing historical data). Absolute memorization holds in the sense that class-incremental learning with ACIL on present data gives results identical to those of its joint-learning counterpart, which consumes both present and historical samples. This equality is theoretically validated. Data privacy is ensured since no historical data are involved during the learning process. Empirical validations demonstrate ACIL's competitive accuracy with near-identical results across various incremental task settings (e.g., 5-50 phases). This also allows ACIL to outperform the state-of-the-art methods in large-phase scenarios (e.g., 25 and 50 phases).
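
The "absolute memorization" property can be illustrated with a generic recursive ridge-regression (recursive least-squares) sketch, in which phase-by-phase updates reproduce the joint closed-form solution exactly without storing past samples. This demonstrates the underlying linear-algebra identity, not the exact ACIL algorithm.

```python
# Hedged sketch: recursive ridge regression over data phases matches the joint
# solution exactly, without revisiting old data. Generic illustration only.
import numpy as np

def joint_solution(X, Y, gamma=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ Y)

def incremental_update(W, R, X_new, Y_new):
    """W: current weights (d, c); R: current inverse regularized Gram (d, d)."""
    n = X_new.shape[0]
    # Woodbury update of R = (sum_k X_k^T X_k + gamma I)^{-1}.
    K = np.linalg.solve(np.eye(n) + X_new @ R @ X_new.T, X_new @ R)
    R_new = R - R @ X_new.T @ K
    # Weight update uses only the new phase's data plus the running statistics.
    W_new = W + R_new @ X_new.T @ (Y_new - X_new @ W)
    return W_new, R_new

rng = np.random.default_rng(0)
d, c, gamma = 16, 4, 1.0
phases = [(rng.normal(size=(50, d)), rng.normal(size=(50, c))) for _ in range(5)]

# Incremental (phase-by-phase) learning without storing historical data.
W, R = np.zeros((d, c)), np.eye(d) / gamma
for X_k, Y_k in phases:
    W, R = incremental_update(W, R, X_k, Y_k)

# Joint learning on all phases at once gives the identical weights.
X_all = np.vstack([X for X, _ in phases])
Y_all = np.vstack([Y for _, Y in phases])
print(np.allclose(W, joint_solution(X_all, Y_all, gamma)))  # True
```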


Understanding Emergent Abilities of Language Models from the Loss Perspective

Neural Information Processing Systems

Recent studies have called into question the belief that emergent abilities [58] in language models are exclusive to large models. This skepticism arises from two observations: 1) smaller models can also exhibit high performance on emergent abilities, and 2) there is doubt about the discontinuous metrics used to measure these abilities. In this paper, we propose to study emergent abilities through the lens of pre-training loss, instead of model size or training compute. We demonstrate that Transformer models with the same pre-training loss but different model and data sizes achieve the same performance on various downstream tasks, given a fixed data corpus, tokenization, and model architecture. We also discover that a model exhibits emergent abilities on certain tasks--regardless of the continuity of metrics--when its pre-training loss falls below a specific threshold. Before reaching this threshold, its performance remains at the level of random guessing. This inspires us to redefine emergent abilities as those that manifest in models with lower pre-training losses, highlighting that these abilities cannot be predicted by merely extrapolating the performance trends of models with higher pre-training losses.