Goto

Collaborating Authors

Neural Information Processing Systems

We would like to thank you for your time and valuable feedback, and for helping us to improve our manuscript. We hope to have correctly understood your questions, and will try to exhaustively address all your comments. We agree to be more specific about what we mean by "other types of interventions" in footnote 1, p. 3, and will revise the footnote accordingly. We thank reviewer 4 for the additional comments on the manuscript. Combining this method with A-ICP is interesting future work.


N-Gram Graph: Simple Unsupervised Representation for Graphs, with Applications to Molecules

Neural Information Processing Systems

Machine learning techniques have recently been adopted in various applications in medicine, biology, chemistry, and material engineering. An important task is to predict the properties of molecules, which serves as the main subroutine in many downstream applications such as virtual screening and drug design. Despite the increasing interest, the key challenge is to construct proper representations of molecules for learning algorithms. This paper introduces N-gram graph, a simple unsupervised representation for molecules.
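As a rough sketch of the walk-based idea behind such a representation (the dynamic-programming recursion and all names below are our own illustration, not the paper's reference code): vertex embeddings along each length-n walk are combined multiplicatively and then pooled over the graph, with one pooled vector per walk length.

```python
import numpy as np

def ngram_graph_embedding(adj, vertex_emb, n_max=3):
    """Sketch of an N-gram-style graph embedding.

    adj:        (num_nodes, num_nodes) adjacency matrix of the molecular graph
    vertex_emb: (num_nodes, dim) embedding of each atom
    Returns a (n_max * dim,) vector: one pooled embedding per walk length.
    """
    # f_n[v] aggregates embeddings of length-n walks ending at node v;
    # the element-wise product with vertex_emb extends each walk by one node.
    f_n = vertex_emb.copy()
    levels = [f_n.sum(axis=0)]          # 1-grams: just the vertex embeddings
    for _ in range(2, n_max + 1):
        f_n = (adj @ f_n) * vertex_emb  # extend every walk by one hop
        levels.append(f_n.sum(axis=0))  # pool walks of this length
    return np.concatenate(levels)

# toy 3-atom chain with 4-dimensional atom embeddings
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
emb = np.random.default_rng(0).normal(size=(3, 4))
print(ngram_graph_embedding(adj, emb).shape)  # (12,)
```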


Stability and Generalizability in SDE Diffusion Models with Measure-Preserving Dynamics Liu Li, Sarah Cechnicka

Neural Information Processing Systems

Inverse problems describe the process of estimating causal factors from a set of measurements or data. The mapping from often incomplete or degraded data to parameters is ill-posed, so data-driven iterative solutions are required, for example when reconstructing clean images from poor signals. Diffusion models have shown promise as potent generative tools for solving inverse problems due to their superior reconstruction quality and their compatibility with iterative solvers. However, most existing approaches are limited to linear inverse problems represented as Stochastic Differential Equations (SDEs). This simplification falls short of addressing the challenging nature of real-world problems, leading to amplified cumulative errors and biases. We explain this gap through the lens of the measure-preserving dynamics of Random Dynamical Systems (RDS), with which we analyse the Temporal Distribution Discrepancy, and we thereby introduce a theoretical framework based on RDS for SDE diffusion models.
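For background, the SDE view referenced here is the standard score-based formulation; the generic Euler-Maruyama reverse-time step below is illustrative and is not the paper's RDS-based method (all names are our own).

```python
import numpy as np

def reverse_sde_step(x, t, dt, score_fn, drift_fn, diffusion_fn, rng):
    """One Euler-Maruyama step of a reverse-time diffusion SDE:
        dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dW.
    """
    g = diffusion_fn(t)
    drift = drift_fn(x, t) - (g ** 2) * score_fn(x, t)
    noise = rng.normal(size=x.shape)
    return x + drift * dt + g * np.sqrt(abs(dt)) * noise
```

Iterating this step from t = T down to t close to 0 (with dt < 0) maps noise to a sample; the cumulative errors the abstract refers to arise across exactly such discretized steps.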


On the Comparison between Multi-modal and Single-modal Contrastive Learning Andi Han

Neural Information Processing Systems

Multi-modal contrastive learning with language supervision has presented a paradigm shift in modern machine learning. By pre-training on web-scale datasets, multi-modal contrastive learning can learn high-quality representations that exhibit impressive robustness and transferability. Despite its empirical success, the theoretical understanding is still in its infancy, especially regarding its comparison with single-modal contrastive learning. In this work, we introduce a feature learning theory framework that provides a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning. Based on a data generation model consisting of signal and noise, our analysis considers a ReLU network trained with the InfoMax objective function. Through a trajectory-based optimization analysis and a characterization of generalization on downstream tasks, we identify the signal-to-noise ratio (SNR) as the critical factor governing the downstream generalizability of both multi-modal and single-modal contrastive learning. Through the cooperation between the two modalities, multi-modal learning can achieve better feature learning, leading to improved downstream performance compared to single-modal learning. Our analysis provides a unified framework that characterizes the optimization and generalization of both single-modal and multi-modal contrastive learning. Empirical experiments on both synthetic and real-world datasets further corroborate our theoretical findings.
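A common concrete instance of a multi-modal contrastive objective is a symmetric InfoNCE loss over paired embeddings from the two modalities. The sketch below is illustrative (the temperature value and all names are our own choices), not the paper's exact InfoMax objective.

```python
import numpy as np

def _log_softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def symmetric_infonce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE between two modalities.

    z_a, z_b: (batch, dim) L2-normalized embeddings of paired samples;
    the i-th rows of z_a and z_b form a positive pair, and all other
    rows in the batch act as negatives.
    """
    logits = z_a @ z_b.T / temperature
    loss_a2b = -np.diag(_log_softmax(logits)).mean()    # match a_i to b_i
    loss_b2a = -np.diag(_log_softmax(logits.T)).mean()  # and b_i to a_i
    return 0.5 * (loss_a2b + loss_b2a)
```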


Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection

Neural Information Processing Systems

Serialization-based methods, which serialize the 3D voxels and group them into multiple sequences before inputting them to Transformers, have demonstrated their effectiveness in 3D object detection. However, serializing 3D voxels into 1D sequences inevitably sacrifices the spatial proximity of voxels. This issue is hard to address by enlarging the group size in existing serialization-based methods, due to the quadratic complexity of Transformers with respect to sequence length. Inspired by recent advances in state space models (SSMs), we present a Voxel SSM, termed Voxel Mamba, which employs a group-free strategy to serialize the whole space of voxels into a single sequence. The linear complexity of SSMs makes our group-free design feasible, alleviating the loss of spatial proximity among voxels. To further enhance spatial proximity, we propose a Dual-scale SSM Block to establish a hierarchical structure, enabling a larger receptive field in the 1D serialization curve as well as more complete local regions in 3D space. Moreover, we implicitly apply window partitioning under the group-free framework through positional encoding, which further enhances spatial proximity by encoding voxel positional information. Our experiments on the Waymo Open Dataset and the nuScenes dataset show that Voxel Mamba not only achieves higher accuracy than state-of-the-art methods, but also demonstrates significant advantages in computational efficiency.
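Group-free serialization of this kind relies on a space-filling curve to order all voxels at once. As a stand-in illustration (the paper's actual curve and implementation may differ), a Morton (Z-order) key interleaves the bits of the voxel coordinates so that sorting by the key roughly preserves 3D spatial proximity in the resulting 1D sequence.

```python
def morton_key(x, y, z, bits=10):
    """Interleave the bits of integer voxel coordinates into a single
    Z-order (Morton) key; sorting voxels by this key yields a 1D
    sequence that roughly preserves 3D spatial proximity."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

# serialize all voxels into one group-free sequence
voxels = [(5, 2, 7), (5, 3, 7), (0, 0, 1)]
sequence = sorted(voxels, key=lambda v: morton_key(*v))
```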


Appendix

Neural Information Processing Systems

We additionally used one free parameter γ, as in [24], which stretches or compresses the kernel on the time axis and is inferred from the data (a schematic sketch follows at the end of this section). As the non-linearity, we took the same sigmoidal non-linearity as in the vertical pathway.

B Data

We used three different sets of data. Published recordings in response to the chirp stimulus (for details, see [5]) were used for model fitting. To test generalization performance, we used published recordings in response to sine flicker (for details, see [34]) as well as newly recorded responses to natural movies (see Section B.2).
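Returning to the model component above: γ rescales the kernel's time axis before the convolution and the sigmoidal non-linearity. A minimal sketch under our own naming assumptions (the actual kernel and fitting procedure are specified in the paper and in [24]):

```python
import numpy as np

def stretched_kernel_response(stimulus, kernel, gamma, dt=1.0):
    """Illustrative role of the free parameter gamma: it rescales the
    kernel's time axis (gamma > 1 compresses, gamma < 1 stretches)
    before the kernel is convolved with the stimulus; the result is
    passed through a sigmoidal non-linearity."""
    t = np.arange(len(kernel)) * dt
    # resample the kernel on a stretched/compressed time axis: k(gamma * t)
    k_gamma = np.interp(gamma * t, t, kernel, left=0.0, right=0.0)
    drive = np.convolve(stimulus, k_gamma, mode="same")
    return 1.0 / (1.0 + np.exp(-drive))  # sigmoidal non-linearity
```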


2e6d9c6052e99fcdfa61d9b9da273ca2-AuthorFeedback.pdf

Neural Information Processing Systems

Benefit of the two-step approach (R1, R2): As R1 suggested, we examined 2-OPT's robustness as measured by the 90% … While 2-OPT's computation per iteration can be substantial, the multiple (1000) restarts of SGD used by 2-OPT can be … Goldstein-Price is available in Table 1.

"Authors compare with related [non-myopic] methods only on a set of synthetic problems extracted from another …": … GLASSES and will include comparisons in the final version. We're awaiting email replies from Lam et al.

"The only contribution seems to be the optimization of the acquisition function which is done using stochastic gradient …": Our secondary contribution is to show that this is practical (i.e., fast enough to use in practice) and provides …

"Variance on the optimization traces for EI and LCB" (R1): We think this may be because EI and LCB explore less …

Define Q (R1, R2): This was a typo. We'll include them in the appendix in the final version.


Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models Yang Jiao

Neural Information Processing Systems

Large Multimodal Models (LMMs) are a hot research topic in computer vision and have demonstrated remarkable potential across multiple disciplinary fields. A recent trend is to further extend and enhance the perception capabilities of LMMs. Current methods follow the paradigm of adapting visual task outputs to language-oriented formats. This adaptation allows such LMMs to be developed conveniently with minimal modifications; however, it overlooks the inductive biases within diverse visual tasks and hinders the learning of perception capabilities. To address this issue, we propose a novel LMM architecture named Lumen, which decouples the learning of perception capabilities into task-agnostic and task-specific stages. First, Lumen promotes fine-grained vision-language concept alignment, which is the fundamental capability for various visual tasks; the output of the task-agnostic stage is thus a shared representation for all vision-centric tasks addressed in this paper. Afterward, task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders with negligible training effort. Comprehensive experimental results on a series of vision-centric and VQA benchmarks indicate that our Lumen model not only achieves or surpasses the performance of existing LMM-based approaches across a range of vision-centric tasks, but also maintains general visual understanding and instruction-following capabilities.
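A schematic of the decoupling described above (module names, task heads, and dimensions are our own illustration, not Lumen's actual architecture): a task-agnostic shared representation is routed to lightweight task-specific decoders.

```python
import torch
import torch.nn as nn

class TaskRouter(nn.Module):
    """Sketch of routing a shared, task-agnostic representation to
    lightweight task-specific decoder heads; heads and output sizes
    here are purely illustrative."""
    def __init__(self, dim=256):
        super().__init__()
        self.decoders = nn.ModuleDict({
            "detection":    nn.Linear(dim, 4),   # e.g., box coordinates
            "segmentation": nn.Linear(dim, 1),   # e.g., mask logits
            "pose":         nn.Linear(dim, 17),  # e.g., keypoint logits
        })

    def forward(self, shared_rep, task):
        # route the shared vision-language representation to one head
        return self.decoders[task](shared_rep)

router = TaskRouter()
rep = torch.randn(8, 256)         # task-agnostic shared representation
boxes = router(rep, "detection")  # task-specific decoding
```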


Supplementary Material: TorchSpatial: A Location Encoding Framework and Benchmark for Spatial Representation Learning

Neural Information Processing Systems

Author ordering is determined by coin flip.

For what purpose was the dataset created? Was there a specific task in mind? In order to systematically compare the location encoders' performance and their impact on the …

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., …)? …

Who funded the creation of the dataset? Dr. Gengchen Mai acknowledges the Microsoft Research …

What do the instances that comprise the dataset represent (e.g., documents, photos, people, …)? The instances in all 17 datasets represent images.


TorchSpatial: A Location Encoding Framework and Benchmark for Spatial Representation Learning

Neural Information Processing Systems

Spatial representation learning (SRL) aims at learning general-purpose neural network representations from various types of spatial data (e.g., points, polylines, polygons, networks, images, etc.) in their native formats. Learning good spatial representations is a fundamental problem for various downstream applications such as species distribution modeling, weather forecasting, trajectory generation, geographic question answering, etc. Even though SRL has become the foundation of almost all geospatial artificial intelligence (GeoAI) research, we have not yet seen significant efforts to develop an extensive deep learning framework and benchmark to support SRL model development and evaluation. To fill this gap, we propose TorchSpatial, a learning framework and benchmark for location (point) encoding, which is one of the most fundamental data types of spatial representation learning. TorchSpatial contains three key components: 1) a unified location encoding framework that consolidates 15 commonly recognized location encoders, ensuring scalability and reproducibility of the implementations; 2) the LocBench benchmark tasks encompassing 7 geo-aware image classification and 10 geo-aware image regression datasets; 3) a comprehensive suite of evaluation metrics to quantify geo-aware models' overall performance as well as their geographic bias, with a novel Geo-Bias Score metric. Finally, we provide a detailed analysis and insights into the model performance and geographic bias of different location encoders. We believe TorchSpatial will foster future advancement of spatial representation learning and spatial fairness in GeoAI research. The TorchSpatial model framework and LocBench benchmark are available at https://github.com/seai-lab/TorchSpatial, and the Geo-Bias Score evaluation framework is available at https://github.com/seai-lab/PyGBS.
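To make the notion of a location encoder concrete, here is a minimal multi-scale sinusoidal encoder of the general kind such frameworks consolidate (the scales, normalization, and names are illustrative; the 15 actual encoders are in the TorchSpatial repository).

```python
import numpy as np

def sinusoidal_location_encoding(lon, lat, num_scales=8,
                                 min_lambda=1.0, max_lambda=360.0):
    """Encode a (lon, lat) point with sin/cos features at geometrically
    spaced wavelengths, so nearby locations get similar vectors while
    multiple scales capture both local and global structure."""
    scales = min_lambda * (max_lambda / min_lambda) ** (
        np.arange(num_scales) / max(num_scales - 1, 1))
    feats = []
    for coord in (lon, lat):
        for lam in scales:
            feats.extend([np.sin(2 * np.pi * coord / lam),
                          np.cos(2 * np.pi * coord / lam)])
    return np.array(feats)  # shape: (2 coords * num_scales * 2,)

vec = sinusoidal_location_encoding(-122.4, 37.8)
print(vec.shape)  # (32,)
```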