Appendix for " Beyond the Signs: Nonparametric Tensor Completion via Sign Series "

Neural Information Processing Systems

The appendix consists of proofs (Section A), additional theoretical results (Section B), and numerical experiments (Section C). When $g$ is strictly increasing, the mapping $x \mapsto g(x)$ is sign preserving; see Section B.2 for constructive examples. Based on the definition of the classification loss $L(\cdot,\cdot)$, the function $\mathrm{Risk}(\cdot)$ relies only on the sign pattern of the tensor. The equality (2) is attained when $z = \mathrm{sgn}(\theta - \pi)$ or $\theta = \pi$.
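For concreteness, a minimal worked restatement of the two facts above, assuming the notation of the main paper (a strictly increasing link $g$, a signal value $\theta$ scaled to $[-1,1]$, and a level $\pi$); this is a hedged paraphrase rather than a verbatim excerpt:

```latex
% Sign preservation: for strictly increasing g, thresholding g(theta) at g(pi)
% yields the same sign as thresholding theta at pi.
\[
  \operatorname{sgn}\!\big(g(\theta) - g(\pi)\big) \;=\; \operatorname{sgn}(\theta - \pi).
\]
% Sign-series identity (for theta in [-1, 1]): the signal is recovered by
% integrating its sign pattern over all levels pi, which is why the risk
% depends on the tensor only through its signs.
\[
  \theta \;=\; \tfrac{1}{2}\int_{-1}^{1} \operatorname{sgn}(\theta - \pi)\, d\pi .
\]
```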


Beyond the Signs: Nonparametric Tensor Completion via Sign Series

Neural Information Processing Systems

We consider the problem of tensor estimation from noisy observations with possibly missing entries. A nonparametric approach to tensor completion is developed based on a new model which we coin as sign representable tensors. The model represents the signal tensor of interest using a series of structured sign tensors. Unlike earlier methods, the sign series representation effectively addresses both low- and high-rank signals, while encompassing many existing tensor models, including CP models, Tucker models, single index models, and structured tensors with repeating entries, as special cases. We provably reduce the tensor estimation problem to a series of structured classification tasks, and we develop a learning reduction machinery to empower existing low-rank tensor algorithms for more challenging high-rank estimation. Excess risk bounds, estimation errors, and sample complexities are established. We demonstrate that our approach outperforms previous methods on two datasets, one on human brain connectivity networks and the other on topic data mining.
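To make the sign-series idea concrete, here is a minimal numpy sketch (not the authors' implementation) showing how averaging sign tensors over a grid of levels recovers a bounded signal tensor; the tensor size, the grid resolution, and the use of the true signs in place of estimated ones are assumptions made to keep the demo self-contained:

```python
import numpy as np

# Sign-series identity (sketch): a bounded value theta in [-1, 1] satisfies
#   theta = 0.5 * integral_{-1}^{1} sgn(theta - pi) d(pi),
# so averaging the sign tensors sgn(Theta - pi_h) over a uniform grid of
# levels pi_h recovers Theta up to O(1/H) discretization error.

rng = np.random.default_rng(0)
theta = np.tanh(rng.standard_normal((20, 20, 20)))    # hypothetical signal tensor in [-1, 1]

H = 200
levels = np.linspace(-1.0, 1.0, 2 * H + 1)            # grid of thresholds pi_h
sign_series = np.stack([np.sign(theta - pi) for pi in levels])

theta_hat = sign_series.mean(axis=0)                  # aggregate the sign series
print("max |theta_hat - theta|:", np.abs(theta_hat - theta).max())
```

In the actual learning reduction, each sign tensor would instead be estimated from noisy, partially observed entries via a weighted, structured classification step, and those estimates would be aggregated in the same way.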



Stateful ODE-Nets using Basis Function Expansions

Neural Information Processing Systems

The recently introduced class of ordinary differential equation networks (ODE-Nets) establishes a fruitful connection between deep learning and dynamical systems. In this work, we reconsider formulations of the weights as continuous-in-depth functions using linear combinations of basis functions, which enables us to leverage parameter transformations such as function projections. In turn, this view allows us to formulate a novel stateful ODE-Block that handles stateful layers. The benefits of this new ODE-Block are twofold: first, it enables incorporating meaningful continuous-in-depth batch normalization layers to achieve state-of-the-art performance; second, it enables compressing the weights through a change of basis, without retraining, while maintaining near state-of-the-art performance and reducing both inference time and memory footprint. Performance is demonstrated by applying our stateful ODE-Block to (a) image classification tasks using convolutional units and (b) sentence-tagging tasks using transformer encoder units.
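As a rough illustration of the basis-function view and of compression via a change of basis, consider the numpy sketch below; the piecewise-constant and Legendre bases, the depth grid, and the least-squares projection are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

# A continuous-in-depth weight entry is a function of depth t in [0, 1],
#   w(t) = sum_k c_k * phi_k(t),
# stored through its basis coefficients c_k.  Compression here means projecting
# w(t) onto a smaller basis (fewer coefficients) without any retraining.

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)                          # depth grid

K = 16                                                   # original basis size
edges = np.linspace(0.0, 1.0, K + 1)
phi = np.stack([(t >= edges[k]) & (t < edges[k + 1])     # piecewise-constant basis
                for k in range(K)], axis=1).astype(float)
phi[-1, -1] = 1.0                                        # include the right endpoint

# hypothetical "trained" coefficients tracing a smooth depth profile
c = np.cos(2.0 * np.pi * np.linspace(0.0, 1.0, K)) + 0.05 * rng.standard_normal(K)
w = phi @ c                                              # w(t) evaluated on the grid

# change of basis: least-squares projection onto M Legendre polynomials
M = 6
psi = np.polynomial.legendre.legvander(2.0 * t - 1.0, M - 1)   # shape (len(t), M)
c_small, *_ = np.linalg.lstsq(psi, w, rcond=None)
w_small = psi @ c_small

print(f"coefficients per weight entry: {K} -> {M}")
print("relative L2 projection error:", np.linalg.norm(w - w_small) / np.linalg.norm(w))
```

Because the projection acts only on the coefficients, it can be applied to an already-trained network, which is what allows the reported reduction in memory footprint without retraining.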




We sincerely thank all the reviewers for their helpful comments

Neural Information Processing Systems

We sincerely thank all the reviewers for their helpful comments.
A: The actual accuracy is 75.8%.
A: In Fig. 3, the curves are not simple linear regressions, and their exact form is unknown to us.
A: We set the magnitude of RandAugment to 9 with a standard deviation of 0.5 in all networks.
We found that "resolution and depth are
The performance of TinyNets is about 0.3-3.8%
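One common way to reproduce the quoted RandAugment setting (magnitude 9, magnitude standard deviation 0.5) is via the timm library; the `hparams` values and the surrounding transform pipeline below are illustrative assumptions, not the authors' training code:

```python
from timm.data.auto_augment import rand_augment_transform
from torchvision import transforms

# 'rand-m9-mstd0.5' = RandAugment with magnitude 9 and magnitude std 0.5,
# matching the setting quoted above.  The fill color and translate constant
# are illustrative defaults rather than values from the paper.
ra = rand_augment_transform(
    config_str="rand-m9-mstd0.5",
    hparams={"translate_const": 117, "img_mean": (124, 116, 104)},
)

train_tfms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    ra,                      # applied to PIL images before tensor conversion
    transforms.ToTensor(),
])
```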


Supplementary Material for MoSo. Haoru Tan, Sitong Wu, Yukang Chen

Neural Information Processing Systems

Before the proof, we first revisit the definition of MoSo. The MoSo score $M(z)$ for a specific sample $z$ selected from the training set $S$ is defined through the empirical risk $L(S/z, \cdot)$ on the leave-one-out set $S/z$, and it measures how the optimal empirical risk changes when $z$ is removed from training. The MoSo score can be efficiently approximated with linear complexity by an estimator $\hat{M}(z)$, expressed as an expectation over training iterations. Hence, we have $L(S, w) = \frac{1}{N}\big((N-1)\,L(S/z, w) + l(z, w)\big)$ (8). Here, we use the Taylor approximation again to approximate $L$. Moreover, numerical instability may sometimes occur when factors such as $N$ or $T$ are too large, so we completely ignore this insignificant constant in our applications. Supposing the loss function is $l$-Lipschitz continuous, the gradient norm with respect to the network parameters is upper-bounded by $g$, and the learning rate is a constant $\eta$, the approximation error of Eq. (2) is bounded by
$$|M(z) - \hat{M}(z)| \;\le\; O\big((l\eta + 1)gT + \eta g T\big), \qquad (12)$$
where $T$ is the maximum number of iterations. By jointly considering Eq. (16) and Eq. (18) and then taking the summation from $t = 1$ to $T$, Proposition 1.2 is proven.
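To make the gradient-based approximation concrete, here is a small numpy sketch of a MoSo-style score: it averages, over saved training checkpoints, the learning rate times the inner product between a sample's gradient and the leave-one-out average gradient. The linear-regression model, the checkpointing scheme, and the exact scaling are illustrative assumptions rather than the paper's formula:

```python
import numpy as np

# MoSo-style first-order score (sketch, assumed form):
#   score(z) ~ E_t [ eta * < grad L(S/z, w_t), grad l(z, w_t) > ],
# i.e. samples whose gradients align with the average gradient of the rest of
# the training set across checkpoints receive higher scores.

rng = np.random.default_rng(0)
N, D, T = 200, 10, 15
X = rng.standard_normal((N, D))
y = X @ rng.standard_normal(D) + 0.1 * rng.standard_normal(N)

def per_sample_grads(w):
    # gradients of the squared loss l(z, w) = 0.5 * (x.w - y)^2 for every sample
    residual = X @ w - y                        # shape (N,)
    return residual[:, None] * X                # shape (N, D)

# collect per-sample gradients at T checkpoints of a plain full-batch SGD run
eta = 0.01
w = np.zeros(D)
checkpoint_grads = []
for _ in range(T):
    g = per_sample_grads(w)
    checkpoint_grads.append(g)
    w -= eta * g.mean(axis=0)

# score: alignment with the average gradient of the *other* samples
scores = np.zeros(N)
for g in checkpoint_grads:
    total = g.sum(axis=0)
    avg_rest = (total[None, :] - g) / (N - 1)   # leave-one-out average gradient
    scores += eta * np.sum(g * avg_rest, axis=1)
scores /= T

print("5 highest-scoring (most informative) sample indices:", np.argsort(-scores)[:5])
```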


Data Pruning via Moving-one-Sample-out. Haoru Tan, Sitong Wu, Yukang Chen

Neural Information Processing Systems

In this paper, we propose a novel data-pruning approach called moving-one-sample-out (MoSo), which aims to identify and remove the least informative samples from the training set. The core insight behind MoSo is to determine the importance of each sample by assessing its impact on the optimal empirical risk. This is achieved by measuring the extent to which the empirical risk changes when a particular sample is excluded from the training set. Instead of using the computationally expensive leave-one-out retraining procedure, we propose an efficient first-order approximator that only requires gradient information from different training stages. The key idea behind our approximation is that samples whose gradients are consistently aligned with the average gradient of the training set are more informative and should receive higher scores. Intuitively, if the gradient of a specific sample is consistent with the average gradient vector, then optimizing the network with that sample yields a similar effect on all remaining samples. Experimental results demonstrate that MoSo effectively mitigates severe performance degradation at high pruning ratios and achieves satisfactory performance across various settings.
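At a usage level, pruning with such scores reduces to keeping the highest-scoring fraction of the training set; the sketch below assumes a precomputed array of per-sample scores (named `moso_scores` here purely for illustration, computed e.g. as sketched after the supplementary excerpt above) and a chosen pruning ratio:

```python
import numpy as np

def prune_dataset(indices, moso_scores, prune_ratio):
    """Keep the top (1 - prune_ratio) fraction of samples by importance score.

    `moso_scores` is assumed to hold one score per training sample
    (higher = more informative); how it is computed is sketched separately.
    """
    n_keep = int(round(len(indices) * (1.0 - prune_ratio)))
    keep = np.argsort(-np.asarray(moso_scores))[:n_keep]
    return [indices[i] for i in keep]

# toy usage with random placeholder scores
rng = np.random.default_rng(0)
all_indices = list(range(1000))
scores = rng.standard_normal(1000)                           # placeholder scores
kept = prune_dataset(all_indices, scores, prune_ratio=0.8)   # prune 80% of samples
print(f"kept {len(kept)} of {len(all_indices)} samples")
```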