

On the Expressivity and Sample Complexity of Node-Individualized Graph Neural Networks

Neural Information Processing Systems

Graph neural networks (GNNs) employing message passing for graph classification are inherently limited by the expressive power of the Weisfeiler-Leman (WL) test for graph isomorphism. Node individualization schemes, which assign unique identifiers to nodes (e.g., by adding random noise to features), are a common approach for achieving universal expressiveness. However, the ability of GNNs endowed with individualization schemes to generalize beyond the training data is still an open question. To address this question, this paper presents a theoretical analysis of the sample complexity of such GNNs from a statistical learning perspective, employing Vapnik-Chervonenkis (VC) dimension and covering number bounds. We demonstrate that node individualization schemes that are permutation-equivariant result in lower sample complexity, and design novel individualization schemes that exploit these results. As an application of this analysis, we also develop a novel architecture that can perform substructure identification (i.e., subgraph isomorphism) while having a lower VC dimension compared to competing methods. Finally, our theoretical findings are validated experimentally on both synthetic and real-world datasets.
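For concreteness, below is a minimal sketch of the simplest node-individualization scheme alluded to above: appending random identifiers to node features before message passing. It is written in plain PyTorch; the function name `individualize` and the identifier dimension are choices of this sketch, not the paper's own schemes.

```python
# Minimal sketch: random-feature node individualization for a message-passing GNN.
# This illustrates the generic idea from the abstract, not the paper's novel schemes.
import torch

def individualize(x: torch.Tensor, id_dim: int = 8) -> torch.Tensor:
    """Append i.i.d. Gaussian identifiers to the node feature matrix x of shape
    [n_nodes, d]. Unique identifiers make nodes distinguishable, which lifts
    message-passing GNNs past the 1-WL expressiveness barrier, at the cost of
    breaking exact permutation equivariance of the resulting features."""
    ids = torch.randn(x.size(0), id_dim, device=x.device)
    return torch.cat([x, ids], dim=-1)

# Usage with a PyTorch Geometric-style graph:
# out = gnn(individualize(data.x), data.edge_index)
```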


Supplementary Materials for "Evaluation beyond Task Performance: Analyzing Concepts in AlphaZero playing Hex"

Neural Information Processing Systems

Appendix A reports implementation details, hyperparameters and compute requirements. Appendix B gives more details on each concept introduced in the main body of the paper. Appendix C demonstrates how AlphaZero often wastes moves. Appendix D has additional results across the different architectures. We use agents trained by Jones [5]. See Table 1 for hyperparameters and relative agent strengths.


Test-Time Adaptation Induces Stronger Accuracy and Agreement-on-the-Line

Neural Information Processing Systems

Recently, Miller et al. [32] and Baek et al. [3] empirically demonstrated strong linear correlations between in-distribution (ID) and out-of-distribution (OOD) accuracy, as well as between ID and OOD agreement. These trends, coined accuracy-on-the-line (ACL) and agreement-on-the-line (AGL), enable OOD model selection and performance estimation without labeled data. However, these phenomena also break down for certain shifts, such as CIFAR10-C Gaussian Noise, posing a critical bottleneck. In this paper, we make a key finding that recent test-time adaptation (TTA) methods not only improve OOD performance but also drastically strengthen the ACL and AGL trends in models, even on shifts where models showed very weak correlations before. To analyze this, we revisit the theoretical conditions from Miller et al. [32] that outline the types of distribution shifts needed for perfect ACL in linear models. Surprisingly, these conditions are satisfied after applying TTA to deep models in the penultimate feature embedding space. In particular, TTA collapses complex distribution shifts into ones that can be expressed by a single "scaling" variable in the feature space. Our results show that by combining TTA with AGL-based estimation methods, we can estimate the OOD performance of models with high precision for a broader set of distribution shifts. This yields a simple recipe for selecting the best hyperparameters and adaptation strategy without any OOD labeled data.
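As an illustration of how AGL can be used for label-free OOD estimation, here is a rough sketch in NumPy/SciPy. It follows the general probit-scaled linear-fit recipe from the ACL/AGL literature rather than the exact estimator of Baek et al. [3]; the function names and the choice of a least-squares fit are assumptions of this sketch.

```python
# Rough sketch of AGL-based OOD accuracy estimation: fit the agreement line
# (which needs no OOD labels) in probit space and, assuming agreement-on-the-line,
# reuse it to map ID accuracy to predicted OOD accuracy.
import numpy as np
from scipy.stats import norm

def probit(p):
    # probit scaling commonly used in the ACL/AGL literature
    return norm.ppf(np.clip(p, 1e-6, 1 - 1e-6))

def estimate_ood_accuracy(id_acc, id_agree, ood_agree):
    """id_acc: per-model ID accuracy [n_models];
    id_agree, ood_agree: pairwise model agreement rates [n_pairs].
    Returns predicted per-model OOD accuracy under the AGL assumption that the
    accuracy line and the agreement line share slope and intercept."""
    slope, intercept = np.polyfit(probit(np.asarray(id_agree)),
                                  probit(np.asarray(ood_agree)), deg=1)
    return norm.cdf(slope * probit(np.asarray(id_acc)) + intercept)
```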



NoiseGPT: Label Noise Detection and Rectification through Probability Curvature

Neural Information Processing Systems

Machine learning craves high-quality data, which is a major bottleneck in realistic deployment, since collecting and labeling data takes abundant resources and massive human labor. Unfortunately, label noise, where images are paired with incorrect labels, exists ubiquitously in all kinds of datasets, significantly degrading the learning performance of deep networks. Learning with Label Noise (LNL) has been a common strategy for mitigating the influence of noisy labels. However, existing LNL methods either require pre-training that exploits the memorization effect to separate clean data from noisy data, or rely on dataset assumptions that do not extend to diverse scenarios. Thanks to the development of Multimodal Large Language Models (MLLMs), which possess massive knowledge and hold In-Context Learning (ICL) ability, this paper proposes NoiseGPT to effectively leverage MLLMs as a knowledge expert for label noise detection and rectification. Specifically, we observe a probability curvature effect of MLLMs, where clean and noisy examples reside on curvatures with different smoothness, which in turn enables the detection of label noise.
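The abstract does not spell out how the curvature effect is turned into a detector, so the following is only an illustrative, generic sketch of a curvature-style noise score: compare the MLLM's probability of the given label on the original image with its average probability over perturbed copies. The callables `mllm_label_prob` and `perturb`, and the interpretation of the score, are placeholders, not NoiseGPT's actual procedure.

```python
# Illustrative sketch of a curvature-style label-noise score (not NoiseGPT itself).
import numpy as np

def curvature_score(image, label, mllm_label_prob, perturb, k: int = 8) -> float:
    """mllm_label_prob(image, label) -> probability the MLLM assigns to `label`;
    perturb(image) -> a slightly perturbed copy of the image.
    The gap between the original and perturbed probabilities probes the local
    curvature of the probability surface around the example; examples can then
    be flagged as noisy by thresholding this score."""
    p_orig = mllm_label_prob(image, label)
    p_pert = float(np.mean([mllm_label_prob(perturb(image), label) for _ in range(k)]))
    return p_orig - p_pert
```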


Learning from Offline Foundation Features with Tensor Augmentations

Neural Information Processing Systems

We introduce Learning from Offline Foundation Features with Tensor Augmentations (LOFF-TA), an efficient training scheme designed to harness the capabilities of foundation models in limited-resource settings where their direct development is not feasible. LOFF-TA involves training a compact classifier on cached feature embeddings from a frozen foundation model, resulting in up to 37× faster training and up to 26× lower GPU memory usage. Because the embeddings of augmented images would be too numerous to store, yet the augmentation process is essential for training, we propose to apply tensor augmentations to the cached embeddings of the original non-augmented images. LOFF-TA makes it possible to leverage the power of foundation models, regardless of their size, in settings with limited computational capacity. Moreover, LOFF-TA can be used to apply foundation models to high-resolution images without increasing compute. In certain scenarios, we find that training with LOFF-TA yields better results than directly fine-tuning the foundation model.
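A minimal sketch of the caching-plus-tensor-augmentation recipe described above is given below, in PyTorch. The specific augmentations (additive Gaussian noise and a mild random rescaling) and the two-layer classifier head are placeholders of this sketch, not necessarily the choices made in the paper.

```python
# Minimal LOFF-TA-style sketch: cache embeddings from a frozen foundation model once,
# augment the cached tensors directly, and train a compact classifier on them.
import torch
import torch.nn as nn

@torch.no_grad()
def cache_embeddings(frozen_model: nn.Module, loader, device: str = "cpu"):
    """Run the frozen foundation model a single time and store its embeddings offline."""
    frozen_model.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        feats.append(frozen_model(x.to(device)).cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def tensor_augment(z: torch.Tensor, noise_std: float = 0.1, scale_range: float = 0.1):
    """Augment cached embeddings in place of image augmentations: additive Gaussian
    noise plus a mild random per-sample rescaling (placeholder augmentations)."""
    scale = 1.0 + scale_range * (2 * torch.rand(z.size(0), 1) - 1)
    return scale * z + noise_std * torch.randn_like(z)

def make_classifier(emb_dim: int, n_classes: int) -> nn.Module:
    """Compact head trained on the cached, augmented embeddings."""
    return nn.Sequential(nn.Linear(emb_dim, 512), nn.ReLU(), nn.Linear(512, n_classes))
```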


Can Large Language Models Explore In-Context?

Neural Information Processing Systems

We investigate the extent to which contemporary Large Language Models (LLMs) can engage in exploration, a core capability in reinforcement learning and decision making. We focus on the native performance of existing LLMs, without training interventions. We deploy LLMs as agents in simple multi-armed bandit environments, specifying the environment description and interaction history entirely in-context, i.e., within the LLM prompt.
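The setup described above can be made concrete with a short sketch: a Bernoulli multi-armed bandit in which the environment description and the full interaction history are rendered into the prompt at every round. Here `query_llm` is a hypothetical stand-in for an LLM call that returns an arm index; it is not an interface from the paper.

```python
# Sketch of an in-context bandit episode: the LLM sees the task description and
# the entire interaction history in its prompt and must pick the next arm.
import random

def run_bandit_episode(query_llm, arm_means, horizon: int = 100, seed: int = 0):
    rng = random.Random(seed)
    history = []  # (arm, reward) pairs, shown verbatim in the prompt
    for t in range(horizon):
        prompt = (
            f"You are choosing among {len(arm_means)} slot machines to maximize "
            f"total reward over {horizon} rounds.\n"
            + "\n".join(f"Round {i}: arm {a} -> reward {r}"
                        for i, (a, r) in enumerate(history))
            + f"\nRound {t}: which arm do you pull? Answer with a single number."
        )
        arm = query_llm(prompt) % len(arm_means)  # hypothetical LLM call returning an int
        reward = 1 if rng.random() < arm_means[arm] else 0
        history.append((arm, reward))
    return history
```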


Supplementary Material and Datasheet for the WorldStrat Dataset (J. Cornebise, I. Oršolić, F. Kalaitzis, 2022-06-16)

Neural Information Processing Systems

Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)?

LCCS comprises 23 classes and 14 sub-classes.

The dataset, along with its machine-readable metadata, is hosted on the CERN-backed Zenodo data repository: https://zenodo.org/record/6810792. Its long-term maintenance is discussed in the Datasheet. This includes reproducible code for the benchmarks of Section 4 of [Cornebise et al., 2022a], following the ML Reproducibility Checklist [Pineau et al., 2021a,b]. The project also has its own website, available at https://worldstrat.github.io/. The authors hereby state that they bear all responsibility in case of violation of rights, etc., and confirm that the data license is as follows: the low-resolution imagery, labels, metadata, and pretrained models are released under Creative Commons with Attribution 4.0 International (CC BY 4.0).

The mean of the cloud coverage over the Sentinel-2 product areas is 7.98%, with a standard deviation of 14.22. The quantiles are: 0.025: 0.00%; 0.25: 0.00%; 0.5: 0.66%; 0.75: 10.05%; 0.975: 49.95%. It is important to note that this cloud cover percentage, as mentioned in the article and datasheet, is calculated over the entire product area of the provider, which varies in size but is much larger than the 2.5 km area we target. This means that even an image with a large cloud cover percentage can be cloud-free and, in extreme (though unlikely) cases, vice versa. There are also considerable differences across sampled regions and land cover types; simple examples are rainforests and non-desert equatorial regions. Using a strict no-cloud policy would either make sampling enough low-resolution images impossible or make the temporal difference extremely large (up to 7 years for some AOIs). With that in mind, we strove to keep the cloud coverage as low as possible, ideally under 5%, while keeping the temporal difference as small as possible.
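For reference, the cloud-coverage statistics quoted above could be recomputed from the dataset metadata along the lines of the sketch below; the metadata file name and the `cloud_cover` column name are assumptions of this sketch, not taken from the release.

```python
# Sketch: recompute the per-product cloud-coverage mean, standard deviation and
# quantiles from the dataset metadata (hypothetical file and column names).
import pandas as pd

meta = pd.read_csv("metadata.csv")   # hypothetical path to the released metadata
cc = meta["cloud_cover"]             # per-product cloud cover, in percent
print(f"mean = {cc.mean():.2f}%, std = {cc.std():.2f}")
print(cc.quantile([0.025, 0.25, 0.5, 0.75, 0.975]))
```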