Adaptive Subspace Modeling With Functional Tucker Decomposition

arXiv.org Machine Learning

Tensors provide a structured representation for multidimensional data, yet discretization can obscure important information when such data originates from continuous processes. We address this limitation by introducing a functional Tucker decomposition (FTD) that embeds mode-wise continuity constraints directly into the decomposition. The FTD employs reproducing kernel Hilbert spaces (RKHS) to model continuous modes without requiring an a-priori basis, while preserving the multi-linear subspace structure of the Tucker model. Through RKHS-driven representation, the model yields adaptive and expressive factor descriptions that enable targeted modeling of subspaces. The value of this approach is demonstrated in domain-variant tensor classification. In particular, we illustrate its effectiveness with classification tasks in hyperspectral imaging and multivariate time series analysis, highlighting the benefits of combining structural decomposition with functional adaptability.
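For orientation, the multilinear subspace structure that the abstract says FTD preserves can be illustrated with an ordinary, discrete Tucker decomposition. The sketch below computes a truncated higher-order SVD with NumPy. It is not the functional, RKHS-based method of the paper, and the names `unfold`, `mode_product`, `hosvd`, and `reconstruct` are illustrative assumptions.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (shape J x I_mode) along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)   # (..., I_mode)
    result = moved @ matrix.T               # (..., J)
    return np.moveaxis(result, -1, mode)

def hosvd(tensor, ranks):
    """Truncated HOSVD: one orthonormal factor per mode plus a core tensor."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of the unfolding span the mode subspace.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, factor in enumerate(factors):
        core = mode_product(core, factor.T, mode)   # project onto each subspace
    return core, factors

def reconstruct(core, factors):
    """Map the core back to the original space through the factor matrices."""
    out = core
    for mode, factor in enumerate(factors):
        out = mode_product(out, factor, mode)
    return out

# Usage: a random tensor with exact multilinear rank (2, 3, 2).
rng = np.random.default_rng(0)
true_core = rng.standard_normal((2, 3, 2))
true_factors = [rng.standard_normal((n, r)) for n, r in zip((5, 6, 4), (2, 3, 2))]
data = reconstruct(true_core, true_factors)
core, factors = hosvd(data, ranks=(2, 3, 2))
```

In a functional variant, the factor matrix of a continuous mode would be replaced by functions evaluated at sample locations; how the paper parameterizes those functions is not stated in this abstract.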


Self-Regularized Learning Methods

arXiv.org Machine Learning

We introduce a general framework for analyzing learning algorithms based on the notion of self-regularization, which captures implicit complexity control without requiring explicit regularization. This is motivated by previous observations that many algorithms, such as gradient-descent based learning, exhibit implicit regularization. In a nutshell, for a self-regularized algorithm the complexity of the predictor is inherently controlled by that of the simplest comparator achieving the same empirical risk. This framework is sufficiently rich to cover both classical regularized empirical risk minimization and gradient descent. Building on self-regularization, we provide a thorough statistical analysis of such algorithms including minmax-optimal rates, where it suffices to show that the algorithm is self-regularized -- all further requirements stem from the learning problem itself. Finally, we discuss the problem of data-dependent hyperparameter selection, providing a general result which yields minmax-optimal rates up to a double logarithmic factor and covers data-driven early stopping for RKHS-based gradient descent.
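As a rough illustration of data-driven early stopping for RKHS-based gradient descent, the sketch below runs functional gradient descent on the squared loss with a Gaussian kernel and keeps the iterate with the lowest hold-out error. The hold-out rule, the kernel, the bandwidth, and the function names are assumptions made for this example, not the selection procedure analyzed in the paper.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between two 1-D sample arrays."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def kernel_gd_early_stopping(x_tr, y_tr, x_val, y_val, max_iters=500, bandwidth=0.2):
    """Kernel gradient descent; returns the iterate with the best hold-out error."""
    n = len(x_tr)
    k_tr = gaussian_kernel(x_tr, x_tr, bandwidth)
    k_val = gaussian_kernel(x_val, x_tr, bandwidth)
    # Step size 1 / lambda_max(K / n) keeps the iteration stable.
    step = n / np.linalg.eigvalsh(k_tr)[-1]
    alpha = np.zeros(n)                      # f = sum_i alpha_i k(x_i, .)
    best_alpha, best_iter = alpha.copy(), 0
    best_err = np.mean(y_val ** 2)           # error of the zero predictor
    for t in range(1, max_iters + 1):
        residual = k_tr @ alpha - y_tr
        alpha = alpha - step * residual / n  # functional gradient step
        val_err = np.mean((k_val @ alpha - y_val) ** 2)
        if val_err < best_err:
            best_alpha, best_iter, best_err = alpha.copy(), t, val_err
    return best_alpha, best_iter, best_err

# Usage on noisy samples of a sine curve (synthetic data for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 80)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(80)
alpha, stop_iter, val_err = kernel_gd_early_stopping(x[:60], y[:60], x[60:], y[60:])
```

Here the number of iterations plays the role of the hyperparameter: no explicit penalty term appears, and the stopping time alone controls the complexity of the fitted function.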




c86ff2d301940fce9357de92c5222b44-Supplemental-Conference.pdf

Neural Information Processing Systems

Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. While a general analysis of when SGD works has been elusive, there has been a lot of recent progress in understanding the convergence of Gradient Flow (GF) on the population loss, partly due to the simplicity that a continuous-time analysis buys us.




Appendix Expanded

Neural Information Processing Systems

Notable instances of this architecture include, e.g., [33, 37, 51, 105], and the spectral approaches proposed in, e.g., [14, 29, 64, 81], all of which descend from early work in [65, 80, 102, 97]. For k = 1, the power of the algorithm has been completely characterized [4, 63]. In general, a different mapping M() could be used, depending on the neighborhood information that we would like to aggregate. The following result relates the power of the k-WL and δ-k-WL. Proposition 1 (restated, Proposition 1 in the main text).
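For k = 1, the k-WL test is the classical color refinement procedure, which can be sketched in a few lines. The code below refines node colors on the disjoint union of two graphs so that colors are comparable, then compares the per-graph color histograms. The adjacency-dictionary input format and the function names are assumptions for this example; the δ-k-WL variant discussed in the excerpt is not implemented here.

```python
from collections import Counter

def refine(adjacency):
    """Color refinement (1-WL) on a graph given as {node: [neighbours]}."""
    colors = {v: 0 for v in adjacency}
    for _ in range(len(adjacency)):
        # New signature: own color plus the sorted multiset of neighbour colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
            for v in adjacency
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        refined = {v: palette[signatures[v]] for v in adjacency}
        # Refinement never merges classes, so an equal class count means stable.
        stable = len(set(refined.values())) == len(set(colors.values()))
        colors = refined
        if stable:
            break
    return colors

def one_wl_indistinguishable(graph_a, graph_b):
    """True if 1-WL assigns both graphs the same stable color histogram."""
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in graph_a.items()}
    union.update({("b", v): [("b", u) for u in nbrs] for v, nbrs in graph_b.items()})
    colors = refine(union)
    hist_a = Counter(c for (tag, _), c in colors.items() if tag == "a")
    hist_b = Counter(c for (tag, _), c in colors.items() if tag == "b")
    return hist_a == hist_b

# Usage: a 6-cycle and two disjoint triangles are both 2-regular on 6 nodes.
six_cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
same = one_wl_indistinguishable(six_cycle, two_triangles)
```

The pair in the usage example is a standard case of non-isomorphic graphs that color refinement cannot separate, which is the kind of limitation that motivates higher-order variants.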


Appendix: An Adaptive Kernel Approach to Federated Learning of Heterogeneous Causal Effects

Neural Information Processing Systems

For example, if an individual appears in all of the sources, the trained model would be biased by the data of this individual (there is an imbalance caused by the use of more data from this particular individual than from the others). Hence, this condition would ensure that such bias does not exist. To address such a problem, we propose a pre-training step to exclude such duplicated individuals. The pre-training steps are summarized as follows: (1) Suppose that an individual can be uniquely identified via a set of features. The causal effects are unidentifiable if the confounders are unobserved.


Appendix for Task-Free Continual Learning Via Online Discrepancy Distance Learning

Neural Information Processing Systems

Theorem 1. Let Pi represent the distribution of all seen training samples (including all previous ... A good trade-off between the model's complexity and generalization performance, observed from Eq. (12), is allowing each component to learn the underlying data distribution of a unique target set. By satisfying the ideal selection process (Eq. (22) of the paper) and also considering that each component Gt finished the training on Mkt at Tkt, we assume that the dynamic expansion model G can be seen as a single model h trained on all previously learnt memories ... Maximal Interfered Retrieval (MIR) [1] is one of the most popular memory-based approaches, which uses a memory buffer with a sample selection criterion. Since Pi would involve several underlying data distributions as the number of training steps (i) increases, the diversity in the memory plays an important role to ensure a tight GB in Eq. (15). G be a single model which consists of a classifier h ∈ H and a VAE model v. M be a memory buffer updated at the training step Ti. Figure 1: The learning process of the proposed ODDL-S, which consists of three phases.