
The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies

Neural Information Processing Systems

We study the relationship between the frequency of a function and the speed at which a neural network learns it. We build on recent results that show that the dynamics of overparameterized neural networks trained with gradient descent can be well approximated by a linear system. When normalized training data is uniformly distributed on a hypersphere, the eigenfunctions of this linear system are spherical harmonic functions. We derive the corresponding eigenvalues for each frequency after introducing a bias term in the model. This bias term had been omitted from the linear network model without significantly affecting previous theoretical results. However, we show theoretically and experimentally that a shallow neural network without bias cannot represent or learn simple, low-frequency functions with odd frequencies. Our results lead to specific predictions of the time it will take a network to learn functions of varying frequency. These predictions match the empirical behavior of both shallow and deep networks.
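As an illustration of the predicted frequency dependence, the sketch below (our own, not the authors' code) trains a shallow ReLU network with bias on targets cos(kθ) sampled on the unit circle and records how many full-batch gradient steps it takes to reach a fixed mean-squared error for each frequency k. The width, learning rate, tolerance, and step budget are arbitrary choices made for illustration.

```python
# Hypothetical sketch: steps needed for a shallow ReLU network (with bias)
# to fit a target of frequency k on the unit circle.
import numpy as np
import torch
import torch.nn as nn

def time_to_fit(k, n_train=256, width=2048, lr=0.01, tol=1e-3, max_steps=20_000):
    # sample points on the unit circle and set the target to cos(k * theta)
    theta = np.random.uniform(0.0, 2.0 * np.pi, n_train)
    x = torch.tensor(np.stack([np.cos(theta), np.sin(theta)], axis=1), dtype=torch.float32)
    y = torch.tensor(np.cos(k * theta), dtype=torch.float32).unsqueeze(1)

    # shallow ReLU network *with* bias terms (per the abstract, bias is needed
    # to represent functions containing odd frequencies)
    net = nn.Sequential(nn.Linear(2, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()

    for step in range(max_steps):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
        if loss.item() < tol:
            return step          # steps needed to reach the target accuracy
    return max_steps             # did not converge within the step budget

for k in [1, 2, 4, 8]:
    print(f"frequency k={k}: {time_to_fit(k)} steps")
```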


Reviewer 1: However, when it is applied as convolution, phase shifts are further included, and therefore the eigenvectors of H

Neural Information Processing Systems

We thank the reviewers for their extremely helpful suggestions. Below we discuss the most significant points. The kernel is an even function, and so its decomposition includes only cosine functions. The target accuracy depends on both δ and ε. In each graph the predictions (in orange) were scaled by a single multiplicative constant to fit the measurements.
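The statement that an even kernel decomposes into cosine functions only can be checked numerically. The snippet below uses a stand-in even function (not the paper's kernel) and estimates its low-order Fourier coefficients; the sine coefficients come out numerically zero.

```python
# Minimal numerical check: an even function of the angle has a Fourier series
# with cosine terms only. The kernel here is a placeholder, not the paper's.
import numpy as np

theta = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
kern = np.abs(np.cos(theta)) + 0.5 * np.cos(theta)  # any even function of theta

for n in range(1, 6):
    a_n = 2 * np.mean(kern * np.cos(n * theta))  # cosine coefficient
    b_n = 2 * np.mean(kern * np.sin(n * theta))  # sine coefficient (should vanish)
    print(f"n={n}: a_n={a_n:+.4f}, b_n={b_n:+.2e}")
```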


What Makes Partial-Label Learning Algorithms Effective?

Neural Information Processing Systems

A partial label (PL) specifies a set of candidate labels for an instance, and partial-label learning (PLL) trains multi-class classifiers with PLs. Recently, many methods that incorporate techniques from other domains have shown strong potential. The expectation that stronger techniques would enhance performance has resulted in prominent PLL methods becoming not only highly complicated but also quite different from one another, making it challenging to choose the best direction for future algorithm design. While it is exciting to see higher performance, this leaves open a fundamental question: what makes a PLL method effective? We present a comprehensive empirical analysis of this question and summarize the success of PLL so far into a set of minimal algorithm design principles. Our findings reveal that high accuracy on benchmark-simulated datasets with PLs can misleadingly amplify the perceived effectiveness of some general techniques, which may improve representation learning but have limited impact on addressing the inherent challenges of PLs. We further identify the common behavior among successful PLL methods as a progressive transition from uniform to one-hot pseudo-labels, highlighting the critical role of mini-batch PL purification in achieving top performance. Based on our findings, we introduce a minimal working algorithm that is surprisingly simple yet effective, and propose an improved strategy to implement the design principles, suggesting a promising direction for improvements in PLL.
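The design principle described above, a progressive transition from uniform to one-hot pseudo-labels restricted to each candidate set, can be sketched as follows. This is our reading of the principle, not the code of any particular PLL method; the moving-average update and the momentum value are illustrative choices.

```python
# Hedged sketch of mini-batch pseudo-label purification for partial labels:
# start uniform over each candidate set, then sharpen toward one-hot using
# the model's confidence restricted to the candidates.
import torch
import torch.nn.functional as F

def purify_pseudo_labels(logits, candidate_mask, pseudo, momentum=0.9):
    """logits: (B, C) model outputs; candidate_mask: (B, C) 0/1 candidate sets;
    pseudo: (B, C) current pseudo-labels (rows sum to 1)."""
    probs = F.softmax(logits, dim=1) * candidate_mask          # drop non-candidates
    probs = probs / probs.sum(dim=1, keepdim=True).clamp(min=1e-12)
    # moving-average update: gradually shifts from uniform toward one-hot
    return momentum * pseudo + (1.0 - momentum) * probs

# toy usage: 2 samples, 4 classes
candidate_mask = torch.tensor([[1., 1., 0., 0.], [0., 1., 1., 1.]])
pseudo = candidate_mask / candidate_mask.sum(dim=1, keepdim=True)   # uniform start
logits = torch.randn(2, 4)
pseudo = purify_pseudo_labels(logits, candidate_mask, pseudo)
loss = -(pseudo * F.log_softmax(logits, dim=1)).sum(dim=1).mean()   # weighted CE
```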




Bayes Consistency vs. H-Consistency: The Interplay between Surrogate Loss Functions and the Scoring Function Class

Neural Information Processing Systems

A fundamental question in multiclass classification concerns understanding the consistency properties of surrogate risk minimization algorithms, which minimize an (often convex) surrogate to the multiclass 0-1 loss. In particular, the framework of calibrated surrogates has played an important role in analyzing Bayes consistency of such algorithms, i.e., in studying convergence to a Bayes optimal classifier (Zhang, 2004; Tewari and Bartlett, 2007). However, follow-up work has suggested this framework can be of limited value when studying H-consistency; in particular, concerns have been raised that even when the data comes from an underlying linear model, minimizing certain convex calibrated surrogates over linear scoring functions fails to recover the true model (Long and Servedio, 2013). In this paper, we investigate this apparent conundrum. We find that while some calibrated surrogates can indeed fail to provide H-consistency when minimized over a natural-looking but naïvely chosen scoring function class F, the situation can potentially be remedied by minimizing them over a more carefully chosen class of scoring functions F. In particular, for the popular one-vs-all hinge and logistic surrogates, both of which are calibrated (and therefore provide Bayes consistency) under realizable models, but were previously shown to pose problems for realizable H-consistency, we derive a form of scoring function class F that enables H-consistency. When H is the class of linear models, the class F consists of certain piecewise linear scoring functions that are characterized by the same number of parameters as in the linear case, and minimization over which can be performed using an adaptation of the min-pooling idea from neural network training. Our experiments confirm that the one-vs-all surrogates, when trained over this class of nonlinear scoring functions F, yield better linear multiclass classifiers than when trained over standard linear scoring functions.
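One way to picture a min-pooled piecewise-linear scoring class with the same parameter count as K linear scorers is sketched below, trained with the one-vs-all logistic surrogate. The particular construction (a minimum over pairwise differences of linear scores) is our assumption for illustration only; the paper derives its own specific form of F.

```python
# Illustrative sketch: min-pooled piecewise-linear scores with K weight vectors,
# trained with a one-vs-all logistic surrogate. Not the paper's exact class F.
import torch
import torch.nn.functional as F

def min_pool_scores(X, W, b):
    """X: (n, d); W: (K, d); b: (K,).
    score_y(x) = min_{y' != y} [(w_y - w_{y'}) . x + (b_y - b_{y'})]."""
    lin = X @ W.t() + b                               # (n, K) ordinary linear scores
    diffs = lin.unsqueeze(2) - lin.unsqueeze(1)       # diffs[i, y, y'] = lin_y - lin_{y'}
    diffs = diffs + 1e9 * torch.eye(W.shape[0])       # mask out the y' == y entries
    return diffs.min(dim=2).values                    # (n, K) min-pooled scores

def ova_logistic_loss(scores, y):
    """One-vs-all logistic surrogate: sum_k log(1 + exp(-sign_k * s_k))."""
    signs = 2.0 * F.one_hot(y, scores.shape[1]).float() - 1.0
    return F.softplus(-signs * scores).sum(dim=1).mean()

# toy run on random data: same parameter count as K linear scorers
n, d, K = 200, 5, 3
X, y = torch.randn(n, d), torch.randint(0, K, (n,))
W = (0.1 * torch.randn(K, d)).requires_grad_()
b = torch.zeros(K, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = ova_logistic_loss(min_pool_scores(X, W, b), y)
    loss.backward()
    opt.step()
pred = min_pool_scores(X, W, b).argmax(dim=1)         # class with largest min-pooled score
```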



Thompson Sampling For Combinatorial Bandits: Polynomial Regret and Mismatched Sampling Paradox

Neural Information Processing Systems

We consider Thompson Sampling (TS) for linear combinatorial semi-bandits and subgaussian rewards. We propose the first known TS algorithm whose finite-time regret does not scale exponentially with the dimension of the problem. We further show the mismatched sampling paradox: a learner who knows the reward distributions and samples from the correct posterior distribution can perform exponentially worse than a learner who does not know the rewards and simply samples from a well-chosen Gaussian posterior.
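For context, the snippet below sketches a generic Gaussian-posterior Thompson Sampling loop for a linear combinatorial semi-bandit in which the learner picks the top-m base arms each round. It is a standard template, not the specific posterior or analysis of the paper, and all constants are arbitrary.

```python
# Generic Gaussian-posterior TS sketch for a combinatorial semi-bandit
# (choose m of d base arms per round, observe per-arm noisy rewards).
import numpy as np

d, m, T, sigma = 10, 3, 2000, 0.5
rng = np.random.default_rng(0)
theta_star = rng.normal(size=d)             # unknown mean reward of each base arm

counts = np.zeros(d)                        # times each base arm was played
sums = np.zeros(d)                          # sum of its observed rewards
best = np.sort(theta_star)[-m:].sum()       # best achievable per-round mean reward
cum_regret = 0.0

for t in range(T):
    # per-arm Gaussian posterior with a flat-ish prior centered at 0
    mean = sums / np.maximum(counts, 1)
    std = sigma / np.sqrt(counts + 1)
    sample = rng.normal(mean, std)                      # posterior sample
    action = np.argsort(sample)[-m:]                    # combinatorial oracle: top-m arms
    rewards = theta_star[action] + sigma * rng.normal(size=m)   # semi-bandit feedback
    counts[action] += 1
    sums[action] += rewards
    cum_regret += best - theta_star[action].sum()

print(f"cumulative regret after {T} rounds: {cum_regret:.1f}")
```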


Appendix A Distribution Shift in Graph-Structured Data

Neural Information Processing Systems

Distribution shift arises when the joint distribution differs between the source domain and the target domain [7, 69]. Assuming that the relationship between the input and class variables is unchanged, there are two kinds of distribution shift, i.e., covariate shift and label shift (prior probability shift) [70].

A.1 Covariate Shift

Covariate shift [71] refers to changes in the distribution of the input variables and can be defined formally as follows:

Definition 1 (Covariate Shift). P_S(X) ≠ P_T(X) while P_S(Y | X) = P_T(Y | X), where S and T denote the source and target domains.

However, in graph-structured data, a node's representation is affected not only by its attributes but also by the graph structure. Thus, covariate shift in graph data can be decoupled into feature shift and structure shift [26].

A.2 Label Shift

Label shift refers to changes in the distribution of the class variable Y, i.e., P_S(Y) ≠ P_T(Y) while P_S(X | Y) = P_T(X | Y).
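As a rough illustration of the three kinds of shift discussed here, the helper below computes simple, hand-picked indicators of feature shift, structure shift, and label shift between a source graph and a target graph. These diagnostics are our own simplifications for intuition, not the estimators used in the cited works.

```python
# Hedged illustration: crude indicators of feature, structure, and label shift.
import numpy as np

def shift_indicators(X_s, A_s, y_s, X_t, A_t, y_t, num_classes):
    """X: node features (n, d); A: adjacency (n, n); y: node labels (n,)."""
    # feature shift: distance between mean feature vectors, a proxy for P(X)
    feat_shift = np.linalg.norm(X_s.mean(axis=0) - X_t.mean(axis=0))
    # structure shift: difference in average node degree (a crude structural summary)
    struct_shift = abs(A_s.sum(axis=1).mean() - A_t.sum(axis=1).mean())
    # label shift: total variation distance between class marginals P(Y)
    p_s = np.bincount(y_s, minlength=num_classes) / len(y_s)
    p_t = np.bincount(y_t, minlength=num_classes) / len(y_t)
    label_shift = 0.5 * np.abs(p_s - p_t).sum()
    return feat_shift, struct_shift, label_shift

# toy usage with random graphs
rng = np.random.default_rng(0)
X_s, X_t = rng.normal(size=(100, 16)), rng.normal(loc=0.5, size=(120, 16))
A_s = (rng.random((100, 100)) < 0.05).astype(float)
A_t = (rng.random((120, 120)) < 0.10).astype(float)
y_s, y_t = rng.integers(0, 4, 100), rng.integers(0, 4, 120)
print(shift_indicators(X_s, A_s, y_s, X_t, A_t, y_t, num_classes=4))
```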


Revisiting, Benchmarking and Understanding Unsupervised Graph Domain Adaptation

Neural Information Processing Systems

Unsupervised Graph Domain Adaptation (UGDA) involves the transfer of knowledge from a label-rich source graph to an unlabeled target graph under domain discrepancies. Despite the proliferation of methods designed for this emerging task, the lack of standard experimental settings and fair performance comparisons makes it challenging to understand which models perform well, and under which conditions, across different scenarios. To fill this gap, we present the first comprehensive benchmark for unsupervised graph domain adaptation, named GDABench, which encompasses 16 algorithms across diverse adaptation tasks. Through extensive experiments, we observe that the performance of current UGDA models varies significantly across different datasets and adaptation scenarios. Specifically, we recognize that when the source and target graphs face significant distribution shifts, it is imperative to formulate strategies to effectively address and mitigate graph structural shifts. We also find that with appropriate neighbourhood aggregation mechanisms, simple GNN variants can even surpass state-of-the-art UGDA baselines. To facilitate reproducibility, we have developed an easy-to-use library, PyGDA, for training and evaluating existing UGDA methods, providing a standardized platform in this community.
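As an example of the kind of "simple GNN variant" referred to above, the sketch below implements a single degree-normalized neighbourhood-aggregation layer in plain PyTorch. It is not taken from PyGDA or from any specific UGDA baseline, and the GCN-style aggregation chosen here is only one of many possible mechanisms.

```python
# Generic sketch of a simple GNN layer with degree-normalized aggregation.
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One GCN-style layer: relu(D^{-1/2} (A + I) D^{-1/2} X W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, X, A):
        A_hat = A + torch.eye(A.shape[0])                 # add self-loops
        d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
        A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.lin(A_norm @ X))           # aggregate, then transform

# toy usage: a random undirected graph with 5 nodes and 8-dimensional features
A = (torch.rand(5, 5) > 0.5).float()
A = ((A + A.t()) > 0).float().fill_diagonal_(0)
X = torch.randn(5, 8)
h = SimpleGNNLayer(8, 16)(X, A)
```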