Plotting

Intrinsic dimension of data representations in deep neural networks

Neural Information Processing Systems

Deep neural networks progressively transform their inputs across multiple processing layers. What are the geometrical properties of the representations learned by these networks? Here we study the intrinsic dimensionality (ID) of data representations, i.e. the minimal number of parameters needed to describe a representation. We find that, in a trained network, the ID is orders of magnitude smaller than the number of units in each layer. Across layers, the ID first increases and then progressively decreases in the final layers. Remarkably, the ID of the last hidden layer predicts classification accuracy on the test set. These results cannot be obtained from linear dimensionality estimates (e.g., with principal component analysis), nor from representations that have been artificially linearized. They are found neither in untrained networks nor in networks trained on randomized labels. This suggests that neural networks that can generalize are those that transform the data into low-dimensional, but not necessarily flat, manifolds.
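
The paper's ID estimates rely on nearest-neighbour statistics rather than linear projections. Below is a minimal sketch of a TwoNN-style estimator of intrinsic dimension, which uses only the ratio of each point's distances to its two nearest neighbours; the exact estimator and preprocessing applied to network activations may differ from this illustration.

```python
# Minimal sketch of a TwoNN-style intrinsic dimension estimator.
# Assumes NumPy and scikit-learn; the paper's exact estimator and
# preprocessing of layer activations may differ.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_id(X, discard_fraction=0.1):
    """Estimate ID from the ratio of each point's distances to its
    two nearest neighbours (Facco et al., 2017)."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = np.sort(dist[:, 2] / dist[:, 1])   # r2 / r1; dist[:, 0] is the point itself
    mu = mu[: int(len(mu) * (1.0 - discard_fraction))]  # drop the noisiest ratios
    # Maximum-likelihood estimate for mu ~ Pareto(1, d).
    return len(mu) / np.sum(np.log(mu))

# Example: a 2-D manifold embedded in 50 ambient dimensions.
rng = np.random.default_rng(0)
X = np.tanh(rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 50)))
print(two_nn_id(X))   # close to 2, far below the ambient dimension of 50
```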


WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Neural Information Processing Systems

Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess the performance of LLMs in providing a complete perspective on conflicts from the retrieved documents, rather than choosing one answer over another, when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage and RAG with two contradictory passages.
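
As a rough illustration of the "RAG with two contradictory passages" scenario, the sketch below builds a single prompt that presents both passages and asks the model to cover both perspectives. The prompt template, the toy instance, and the llm.generate call are hypothetical placeholders, not the benchmark's actual format or evaluation harness.

```python
# Illustrative sketch of the "RAG with two contradictory passages" scenario.
# The prompt template, instance fields, and llm.generate() call are
# hypothetical placeholders, not WikiContradict's actual format.
def build_conflict_prompt(question, passage_a, passage_b):
    return (
        "Answer the question using ONLY the context below. If the context "
        "contains conflicting information, describe both perspectives "
        "instead of choosing one.\n\n"
        f"Context 1: {passage_a}\n"
        f"Context 2: {passage_b}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Invented toy instance for illustration only.
instance = {
    "question": "In which year was the bridge opened?",
    "passage_a": "The bridge was opened to traffic in 1931.",
    "passage_b": "The bridge first opened in 1932 after a one-year delay.",
}
prompt = build_conflict_prompt(**instance)
# response = llm.generate(prompt)   # plug in any closed or open-source LLM here
# A complete answer should acknowledge both 1931 and 1932 rather than pick one.
```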


Generalizing Consistency Policy to Visual RL with Prioritized Proximal Experience Regularization

Neural Information Processing Systems

With high-dimensional state spaces, visual reinforcement learning (RL) faces significant challenges in exploitation and exploration, resulting in low sample efficiency and poor training stability. Although consistency models, as time-efficient diffusion models, have been validated in online state-based RL, it remains an open question whether they can be extended to visual RL. In this paper, we investigate the impact of non-stationary distributions and the actor-critic framework on the consistency policy in online RL, and find that the consistency policy is unstable during training, especially in visual RL with high-dimensional state spaces. To this end, we suggest sample-based entropy regularization to stabilize policy training, and propose a consistency policy with prioritized proximal experience regularization (CP3ER) to improve sample efficiency. CP3ER achieves new state-of-the-art (SOTA) performance in 21 tasks across the DeepMind Control Suite and Meta-World. To the best of our knowledge, CP3ER is the first method to apply diffusion/consistency models to visual RL and demonstrates the potential of consistency models in visual RL.
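
Since a consistency policy is sampled rather than evaluated through a tractable log-density, one plausible reading of "sample-based entropy regularization" is a particle-based entropy estimate computed from several sampled actions per state. The sketch below shows such a regularizer added to a generic actor loss; the estimator, the loss, and all shapes are illustrative assumptions and may differ from CP3ER's actual formulation.

```python
# Hedged sketch: a particle-based entropy regularizer for a policy that can
# only be sampled from (no closed-form log-probability). The estimator and
# the way it enters the actor loss are assumptions for illustration; CP3ER's
# actual regularizer and prioritized proximal experience scheme may differ.
import torch

def sample_based_entropy(actions):
    """Entropy estimate from K sampled actions per state.

    actions: (batch, K, action_dim). The log nearest-neighbour distance grows
    with the spread of the samples (Kozachenko-Leonenko style, up to additive
    constants that do not affect gradients)."""
    dists = torch.cdist(actions, actions)                              # (B, K, K)
    dists = dists + 1e9 * torch.eye(actions.shape[1], device=actions.device)
    nn_dist = dists.min(dim=-1).values.clamp_min(1e-8)                 # (B, K)
    return nn_dist.log().mean(dim=-1)                                  # (B,)

def actor_loss(q_values, actions, alpha=0.05):
    """Maximize Q for sampled actions while keeping them spread out."""
    return (-q_values.mean(dim=-1) - alpha * sample_based_entropy(actions)).mean()
```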


Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

Neural Information Processing Systems

We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., the number of hidden nodes per layer) is much larger than the number of training data points. We show that the expected 0-1 loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a neural tangent random feature (NTRF) model.
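
To make the NTRF construction concrete, the sketch below builds the feature map explicitly for a tiny network: each input is mapped to the gradient of the network output with respect to the parameters at random initialization, and a linear (ridge) model is fit on those features. This only illustrates the construction; the paper's bound concerns much wider ReLU networks trained with SGD, not this toy example.

```python
# Sketch of a neural tangent random feature (NTRF) style construction:
# features are the gradients of the network output w.r.t. the parameters at
# random initialization, with a linear model fit on top. Toy-sized for
# illustration; the paper's analysis concerns much wider ReLU networks.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 1))

def ntrf_features(x):
    """Gradient of the scalar network output w.r.t. all parameters at init."""
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.flatten() for g in grads])

X = torch.randn(64, 10)
y = torch.sign(X[:, 0])                                   # toy binary labels
Phi = torch.stack([ntrf_features(x) for x in X])          # (64, num_params)

# Ridge regression in the fixed NTRF feature space.
lam = 1e-3
w = torch.linalg.solve(Phi.T @ Phi + lam * torch.eye(Phi.shape[1]), Phi.T @ y)
print((torch.sign(Phi @ w) == y).float().mean())          # training accuracy
```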


Response to Reviewer #1: prior NTK results all focus on the square loss, while we establish the connection between the NTK and NNs trained by minimizing the cross-entropy loss

Neural Information Processing Systems

Q1. "The result establishes a connection to some kernel method in previous work. We clarify that our result in Section 3.2 is not a re-derivation of existing result. Therefore our results on the connection to NTK is still new and significant. Q2. "The generalization bound is only shown for the network at a randomly chosen step... any of the final step" Our generalization bound at a randomly chosen step matches the standard results for stochastic optimization. We will study it in our future work. Q3. "... how the over-parameterization requirement of this paper compares to those in related works." We will add this remark in our revision. Q1. "... width requirement is still very stringent" We clarify that the proof is correct. Evaluation of the first term of the bound in Corollary 3.10 Corollary 3.10 in Figure 1(b) by varying the level of label noise, i.e., ratio of the labels that are flipped. We will add these experimental results in the camera ready. Q4. "Suggestion: the connection to NTK is rather straightforward... in the first page?


DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

Neural Information Processing Systems

In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset.
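
Once the correction ratios are available, off-policy evaluation reduces to a weighted average over the fixed dataset. The sketch below shows that final step only, with placeholder ratios standing in for DualDICE's output; estimating the ratios themselves is the paper's contribution and is not reproduced here.

```python
# Sketch of the final step of off-policy evaluation with stationary
# distribution corrections: the target policy's value is a correction-weighted
# average of rewards in the fixed dataset. `corrections` is a placeholder for
# DualDICE's output; estimating those ratios is not shown here.
import numpy as np

def ope_estimate(rewards, corrections, gamma=0.99):
    """rewards[i] and corrections[i] belong to the i-th (s, a) pair in the
    dataset; corrections[i] ~ d^pi(s, a) / d^D(s, a)."""
    # E_{(s,a) ~ D}[w(s,a) r(s,a)] is the per-step value of pi under its
    # normalized discounted stationary distribution; dividing by (1 - gamma)
    # converts it to a discounted return.
    return np.mean(corrections * rewards) / (1.0 - gamma)

rng = np.random.default_rng(0)
rewards = rng.uniform(size=1000)                       # placeholder rewards
corrections = rng.lognormal(sigma=0.3, size=1000)      # placeholder ratios
print(ope_estimate(rewards, corrections))
```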


DualDICE continues to provide more accurate and stable results compared to the baselines, especially in continuous-action, exploring settings

Neural Information Processing Systems

We thank the reviewers for their close reading of the paper and helpful feedback. We are also excited to apply ideas from DualDICE to the policy improvement problem, as mentioned by the reviewers, and are exploring several potential approaches to this problem. Figure 1: We perform off-policy evaluation (OPE) on additional control tasks (Acrobot and Pendulum) using our method compared to a number of baselines. We find that our method continues to perform well against previous OPE methods (baselines not shown perform even worse).


Customized Subgraph Selection and Encoding for Drug-drug Interaction Prediction

Neural Information Processing Systems

Subgraph-based methods have proven to be effective and interpretable in predicting drug-drug interactions (DDIs), which are essential for medical practice and drug development. Subgraph selection and encoding are critical stages in these methods, yet customizing these components remains underexplored due to the high cost of manual adjustments. In this study, inspired by the success of neural architecture search (NAS), we propose a method to search for data-specific components within subgraph-based frameworks. Specifically, we introduce extensive subgraph selection and encoding spaces that account for the diverse contexts of drug interactions in DDI prediction. To address the challenge of large search spaces and high sampling costs, we design a relaxation mechanism that uses an approximation strategy to efficiently explore optimal subgraph configurations. This approach allows for robust exploration of the search space. Extensive experiments demonstrate the effectiveness and superiority of the proposed method, with the discovered subgraphs and encoding functions highlighting the model's adaptability.
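
The relaxation mechanism is in the spirit of differentiable NAS, where a discrete choice among candidate components is replaced by a softmax-weighted mixture whose weights are learned by gradient descent. The sketch below shows that generic pattern for a choice among candidate encoders; the candidate operations and module names are placeholders, not the paper's actual subgraph selection and encoding spaces.

```python
# Hedged sketch of the generic continuous relaxation used in differentiable
# NAS: a discrete choice among candidate encoders becomes a softmax-weighted
# mixture, so architecture weights can be learned by gradient descent. The
# candidate encoders are generic placeholders, not the paper's search space.
import torch
import torch.nn as nn

class RelaxedEncoderChoice(nn.Module):
    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)
        # One architecture logit per candidate encoder.
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x):
        weights = torch.softmax(self.alpha, dim=0)
        # Mixture of all candidates during the search; argmax(alpha) is kept
        # as the discrete choice once the search converges.
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

dim = 64
choice = RelaxedEncoderChoice([
    nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),   # placeholder MLP encoder
    nn.Linear(dim, dim),                             # placeholder linear encoder
    nn.Identity(),                                   # skip / no transformation
])
out = choice(torch.randn(8, dim))   # (8, 64) mixture of candidate encodings
```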


R1: We have violations after CI since we do early stopping; satisfying them till the end can sometimes hurt overall performance

Neural Information Processing Systems

Thank you for your detailed comments. We will make all the clarifications below in the next version. We note that our formulation in Sec 3.2 can work with arbitrary (including non-linear) constraints over the output variables; learning the constraints automatically is a direction for future work. We missed a slight detail: we increase l over successive Λ updates using an AP.


Approximately Pareto-optimal Solutions for Bi-Objective k-Clustering Problems

Neural Information Processing Systems

As a major unsupervised learning method, clustering has received a lot of attention over multiple decades. The various clustering problems that have been studied intensively include, e.g., the k-means problem and the k-center problem. However, in applications, it is common that good clusterings should optimize multiple objectives (e.g., visualizing data on a map by clustering districts into areas that are both geographically compact but also homogeneous with respect to the data). We study combinations of different objectives, for example optimizing k-center and k-means simultaneously or optimizing k-center with respect to two different metrics. Usually these objectives are conflicting and cannot be optimized simultaneously, making it necessary to find trade-offs. We develop novel algorithms for approximating the set of Pareto-optimal solutions for various combinations of two objectives. Our algorithms achieve provable approximation guarantees and we demonstrate in several experiments that the approximate Pareto front contains good clusterings that cannot be found by considering one of the objectives separately.
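
As a toy illustration of the bi-objective setting, the sketch below generates a handful of candidate clusterings, scores each one under both the k-means and the k-center cost, and keeps the non-dominated ones as an approximate Pareto front. The candidate-generation step is a naive placeholder; the paper's algorithms come with provable approximation guarantees that this sketch does not.

```python
# Toy sketch of an approximate Pareto front over two clustering objectives
# (k-means cost vs. k-center cost). Candidates are just k-means runs with
# different seeds plus one greedy k-center solution; the paper's algorithms
# and guarantees are not reproduced here.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def costs(X, centers):
    d = cdist(X, centers).min(axis=1)
    return (d ** 2).sum(), d.max()           # (k-means cost, k-center cost)

def greedy_k_center(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = cdist(X, np.array(centers)).min(axis=1)
        centers.append(X[d.argmax()])          # farthest-point traversal
    return np.array(centers)

def pareto_front(points):
    """Keep cost pairs that are not dominated in both objectives."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

rng = np.random.default_rng(0)
X, k = rng.normal(size=(500, 2)), 5
candidates = [KMeans(n_clusters=k, n_init=1, random_state=s).fit(X).cluster_centers_
              for s in range(10)]
candidates.append(greedy_k_center(X, k))
print(sorted(pareto_front([costs(X, c) for c in candidates])))
```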