Neural Information Processing Systems

Let $A$ be an $n \times n$ Hermitian matrix and let $B$ be the $(n-1) \times (n-1)$ matrix constructed by deleting the $i$-th row and $i$-th column of $A$. Denote $\Phi = [\phi(x_1), \dots, \phi(x_n)]^\top \in \mathbb{R}^{n \times D}$, where $D$ is the dimension of the feature space $\mathcal{H}$. Performing a rank-$n$ singular value decomposition (SVD) on $\Phi$ gives $\Phi = H \Sigma V^\top$, where $H \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times n}$ is a diagonal matrix whose diagonal elements are the singular values of $\Phi$, and $V \in \mathbb{R}^{D \times n}$. $F(\alpha)$ in Eq. (21) is proven differentiable, with $p$-th gradient component $\partial F(\alpha)/\partial \alpha_p$. A reduced gradient descent algorithm [26] is then adopted to optimize Eq. (21). The three deep neural networks are pre-trained on ImageNet [5].
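The rank-$n$ SVD step described above is easy to reproduce numerically. A minimal NumPy sketch, where the feature matrix `Phi` is a random stand-in for $[\phi(x_1),\dots,\phi(x_n)]^\top$ and the dimensions are illustrative:

```python
import numpy as np

# Stand-in feature matrix Phi with n samples and feature dimension D (n <= D).
rng = np.random.default_rng(0)
n, D = 5, 8
Phi = rng.standard_normal((n, D))

# Rank-n (thin) SVD: Phi = H @ Sigma @ V^T with H (n x n), Sigma (n x n), V (D x n).
H, s, Vt = np.linalg.svd(Phi, full_matrices=False)
Sigma = np.diag(s)
V = Vt.T

# The three factors reconstruct Phi up to numerical precision.
assert np.allclose(Phi, H @ Sigma @ V.T)
print(H.shape, Sigma.shape, V.shape)
```

Passing `full_matrices=False` is what makes the decomposition rank-$n$: the full SVD would instead return a $D \times D$ right factor.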


And for any $\alpha > 0$, the Laplacian satisfies $\Delta G_\lambda \le c_1 2^\alpha G_\lambda + 2Md$, where $M = c_2 3^\alpha / c_1 + c_2^2$. 2. If $G(x) = \|g(x)\|^2$ is $C^1$, then $\|\nabla G(x)\| \le c_2 \|g(x)\|$ and $\langle g(x), \nabla G(x) \rangle \ge c_1 G(x)$. Proof. Claim 1. Note that $\nabla G_\lambda = 2 \nabla^2 F_\lambda \, g_\lambda$ and that $\tfrac{1}{2} c_1 I \preceq \nabla^2 F_\lambda$.

Neural Information Processing Systems

As a remark, using the Krylov-Bogoliubov existence theorem (see Corollary 11.8 of [6]), fixed points of (4) exist as long as one can show $\{\rho_t, t \ge 0\}$ is tight. The learning rate is set differently for each task. Obviously, the HV indicator (Eq. (10)) can also be used as an objective function for optimizing solution sets. For example, [25, 7] greedily add new points to obtain the highest expected HV improvement. However, the landscape of the HV indicator is piecewise constant (similar to the 0-1 loss in classification) and is difficult to optimize with gradient descent. In particular, for all the dominated points in the solution set, the gradient is zero.
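The zero-gradient behavior for dominated points can be seen directly in a small example. A minimal 2-objective (minimization) hypervolume computation; the helper `hypervolume_2d`, the reference point, and the point sets are illustrative stand-ins, not taken from the paper:

```python
def hypervolume_2d(points, ref):
    """Hypervolume dominated by a set of 2-D points (minimization) w.r.t. ref."""
    # Keep only points that strictly dominate the reference point; sweep by x.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:                          # non-dominated slice adds area
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

ref = (4.0, 4.0)
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
base = hypervolume_2d(front, ref)

# A dominated point contributes nothing, and perturbing it changes nothing:
# the HV indicator has exactly zero gradient with respect to that point.
assert hypervolume_2d(front + [(2.5, 2.5)], ref) == base
assert hypervolume_2d(front + [(2.6, 2.6)], ref) == base
```

Moving the dominated point from (2.5, 2.5) to (2.6, 2.6) leaves the indicator value unchanged, which is the piecewise-constant landscape the text refers to.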


Hessian-guided Perturbed Wasserstein Gradient Flows for Escaping Saddle Points

Yamamoto, Naoya, Kim, Juno, Suzuki, Taiji

arXiv.org Machine Learning

Wasserstein gradient flow (WGF) is a common method to perform optimization over the space of probability measures. While WGF is guaranteed to converge to a first-order stationary point, for nonconvex functionals the converged solution does not necessarily satisfy the second-order optimality condition; i.e., it could converge to a saddle point. In this work, we propose a new algorithm for probability measure optimization, perturbed Wasserstein gradient flow (PWGF), that achieves second-order optimality for general nonconvex objectives. PWGF enhances WGF by injecting noisy perturbations near saddle points via a Gaussian process-based scheme. By pushing the measure forward along a random vector field generated from a Gaussian process, PWGF helps the solution escape saddle points efficiently by perturbing the solution towards the smallest eigenvalue direction of the Wasserstein Hessian. We theoretically derive the computational complexity for PWGF to achieve a second-order stationary point. Furthermore, we prove that PWGF converges to a global optimum in polynomial time for strictly benign objectives.
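The GP-based perturbation scheme itself is involved, but the underlying idea, injecting noise when the flow stalls at a stationary point so that the measure can fall toward a descent direction, can be illustrated with a toy particle discretization. This is a crude sketch on a 1-D double-well potential, not the paper's algorithm; isotropic Gaussian noise stands in for the Gaussian-process perturbation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy potential V(x) = x**4/4 - x**2/2: an unstable stationary point at x = 0,
# minima at x = +/-1. Particles approximate the probability measure.
grad_V = lambda x: x**3 - x

x = np.zeros(200)                 # all particles start at the stationary point 0
step, noise = 0.1, 0.01
for _ in range(500):
    g = grad_V(x)
    if np.abs(g).max() < 1e-3:    # near-stationary: inject a small perturbation
        x = x + noise * rng.standard_normal(x.size)
        g = grad_V(x)
    x = x - step * g              # plain particle gradient step

# Without the perturbation the particles would stay at x = 0 forever;
# with it they escape and settle near the minima at +/-1.
print(np.abs(x).mean())
```

The unperturbed flow is stuck because the gradient vanishes identically at $x = 0$; the perturbation breaks the symmetry, after which ordinary gradient steps suffice.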


ROCK: A variational formulation for occupation kernel methods in Reproducing Kernel Hilbert Spaces

Rielly, Victor, Lahouel, Kamel, Nguyen, Chau, Jedynak, Bruno

arXiv.org Machine Learning

We present a representer theorem for a large class of weak-formulation problems. We provide examples of applications of our formulation both in traditional machine learning and numerical methods, as well as in new and emerging techniques. Finally, we apply our formulation to generalize the multivariate occupation kernel (MOCK) method for learning dynamical systems from data, proposing the more general Riesz occupation kernel (ROCK) method. Our generalized methods are both more computationally efficient and more performant on most of the benchmarks we test against.


Sinkhorn algorithms and linear programming solvers for optimal partial transport problems

Bai, Yikun

arXiv.org Artificial Intelligence

In this note, we generalize the classical optimal partial transport (OPT) problem by modifying the mass destruction/creation term to function-based terms, introducing what we term ``generalized optimal partial transport'' problems. We then discuss the dual formulation of these problems and the associated Sinkhorn solver. Finally, we explore how these new OPT problems relate to classical optimal transport (OT) problems and introduce a linear programming solver tailored for these generalized scenarios.
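For context, the balanced entropic-OT Sinkhorn iteration that such solvers build on fits in a few lines. This is the classical scheme, not the partial-transport variant introduced in the paper, and the grid, cost, and regularization strength are illustrative:

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=1000):
    """Classical balanced entropic-OT Sinkhorn iterations (not the partial variant)."""
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)             # alternate scaling to match column marginal b
        u = a / (K @ v)               # ... and row marginal a
    return u[:, None] * K * v[None, :]   # transport plan diag(u) K diag(v)

# Two small histograms on a 1-D grid with squared-distance cost.
x = np.linspace(0, 1, 6)
a = np.full(6, 1 / 6)
b = np.full(6, 1 / 6)
C = (x[:, None] - x[None, :]) ** 2

P = sinkhorn(a, b, C)
# At convergence the plan's marginals match a and b.
assert np.allclose(P.sum(axis=1), a, atol=1e-6)
assert np.allclose(P.sum(axis=0), b, atol=1e-6)
```

The partial-transport generalization relaxes exactly these two marginal constraints, replacing them with (function-based) mass destruction/creation penalties.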


Understanding Test-Time Augmentation

Kimura, Masanari

arXiv.org Artificial Intelligence

Test-Time Augmentation (TTA) is a powerful heuristic that leverages data augmentation at test time, averaging the model's outputs over augmented copies of the input. Despite the experimental effectiveness of TTA, there has been insufficient discussion of its theoretical aspects. In this paper, we aim to give theoretical guarantees for TTA and to clarify its behavior.
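The heuristic itself is simple to state in code. A minimal sketch, where the "model" (a logistic unit), the augmentation (additive jitter standing in for image flips/crops), and the averaging count are all illustrative stand-ins:

```python
import numpy as np

# A stand-in "model": class-1 probability from a 1-D input (a logistic unit).
def predict(x):
    return 1.0 / (1.0 + np.exp(-x))

def tta_predict(x, augment, n=64, rng=None):
    """Average the model's outputs over n random augmentations of the input."""
    rng = rng or np.random.default_rng(0)
    return float(np.mean([predict(augment(x, rng)) for _ in range(n)]))

# Illustrative augmentation: additive jitter, a 1-D stand-in for flips/crops.
jitter = lambda x, rng: x + 0.3 * rng.standard_normal()

p_plain = predict(1.0)
p_tta = tta_predict(1.0, jitter)
print(p_plain, p_tta)
```

Note that TTA averages the *outputs*, not the inputs; for a nonlinear model the two differ, which is precisely what makes its theoretical behavior worth analyzing.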


Learning High-Dimensional Nonparametric Differential Equations via Multivariate Occupation Kernel Functions

Rielly, Victor, Lahouel, Kamel, Lew, Ethan, Wells, Michael, Haney, Vicky, Jedynak, Bruno

arXiv.org Artificial Intelligence

Learning a nonparametric system of ordinary differential equations (ODEs) from $n$ trajectory snapshots in a $d$-dimensional state space requires learning $d$ functions of $d$ variables. Explicit formulations scale quadratically in $d$ unless additional knowledge about system properties, such as sparsity and symmetries, is available. In this work, we propose a linear approach to learning using the implicit formulation provided by vector-valued Reproducing Kernel Hilbert Spaces. By rewriting the ODEs in a weaker integral form, which we subsequently minimize, we derive our learning algorithm. The minimization problem's solution for the vector field relies on multivariate occupation kernel functions associated with the solution trajectories. We validate our approach through experiments on highly nonlinear simulated and real data, where $d$ may exceed 100. We further demonstrate the versatility of the proposed method by learning a nonparametric first-order quasilinear partial differential equation.
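The weak-form recipe can be sketched in one dimension: approximate each occupation-kernel feature $\Psi_i(y) = \int k(x_i(t), y)\,dt$ by a Riemann sum over snapshots, match the trajectory increments $x(T) - x(0)$, and solve a small regularized linear system. The kernel, bandwidth, regularization, and toy system below are illustrative choices, not the paper's:

```python
import numpy as np

# Learn f in dx/dt = f(x) from snapshots via the integral (weak) form
#   x(T) - x(0) = \int_0^T f(x(t)) dt.
k = lambda u, v: np.exp(-(u - v) ** 2 / 0.5)   # Gaussian kernel (bandwidth assumed)

# Simulated snapshot trajectories of the true system dx/dt = -x.
ts = np.linspace(0, 1, 50)
dt = ts[1] - ts[0]
trajs = [x0 * np.exp(-ts) for x0 in (0.5, 1.0, 1.5, 2.0)]

y = np.array([tr[-1] - tr[0] for tr in trajs])             # increments x(T) - x(0)
G = np.array([[k(ti[:, None], tj[None, :]).sum() * dt * dt # double trajectory
               for tj in trajs] for ti in trajs])          # integrals of k
alpha = np.linalg.solve(G + 1e-6 * np.eye(len(trajs)), y)  # regularized system

def f_hat(x):
    """Estimated vector field: a combination of occupation-kernel features."""
    return sum(a * (k(tr, x).sum() * dt) for a, tr in zip(alpha, trajs))

print(f_hat(1.0))   # the true field is f(x) = -x, so this should be negative
```

The key scaling point from the abstract: the linear system has one row per trajectory, not per state dimension, which is what lets the implicit formulation avoid the quadratic-in-$d$ cost of explicit bases.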


The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making

Yu, Shujian, Li, Hongming, Løkse, Sigurd, Jenssen, Robert, Príncipe, José C.

arXiv.org Artificial Intelligence

The Cauchy-Schwarz (CS) divergence was developed by Pr\'{i}ncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the resulting conditional CS divergence can be estimated simply with a kernel density estimator from given samples. We illustrate the advantages (e.g., the rigorous faithfulness guarantee, the lower computational complexity, the higher statistical power, and the greater flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of the conditional CS divergence in two machine learning tasks related to time-series data and sequential inference, namely time-series clustering and uncertainty-guided exploration for sequential decision making.
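For intuition, the unconditional CS divergence already admits a simple plug-in kernel estimator. A sketch with Gaussian kernels and an illustrative bandwidth; the paper's contribution is the *conditional* extension, which is not shown here:

```python
import numpy as np

def cs_divergence(X, Y, sigma=1.0):
    """Plug-in kernel estimator of the (unconditional) Cauchy-Schwarz divergence
    D_CS(p, q) = -log( <p, q>^2 / (<p, p> <q, q>) ), with 1-D samples X, Y."""
    k = lambda A, B: np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * sigma ** 2))
    pq = k(X, Y).mean()   # estimates the cross inner product <p, q>
    pp = k(X, X).mean()   # estimates <p, p>
    qq = k(Y, Y).mean()   # estimates <q, q>
    return -np.log(pq ** 2 / (pp * qq))

rng = np.random.default_rng(0)
same = cs_divergence(rng.normal(0, 1, 500), rng.normal(0, 1, 500))
far = cs_divergence(rng.normal(0, 1, 500), rng.normal(3, 1, 500))

# Near zero for samples from the same distribution; grows as they separate.
assert same < far
```

By the Cauchy-Schwarz inequality the population divergence is nonnegative and vanishes only when $p = q$, which the empirical quantities reflect up to kernel and sampling bias.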


Learning low-dimensional state embeddings and metastable clusters from time series data

Sun, Yifan, Duan, Yaqi, Gong, Hao, Wang, Mengdi

arXiv.org Machine Learning

This paper studies how to find compact state embeddings from high-dimensional Markov state trajectories, where the transition kernel has a small intrinsic rank. In the spirit of the diffusion map, we propose an efficient method for learning a low-dimensional state embedding and capturing the process's dynamics. This idea also leads to a kernel reshaping method for more accurate nonparametric estimation of the transition function. The state embedding can be used to cluster states into metastable sets, thereby identifying the slow dynamics. Sharp statistical error bounds and misclassification rates are proved. Experiments on a simulated dynamical system show that the state clustering method indeed reveals metastable structures. We also experiment with time series generated by the layers of a Deep-Q-Network while it plays an Atari game. The embedding method identifies game states as similar if they share similar future events, even though their raw data are far apart.
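The embedding idea can be illustrated on a tiny Markov chain with two metastable blocks: estimate the transition matrix from a trajectory, take a low-rank SVD, and read off state embeddings. The chain, trajectory length, and rank below are illustrative stand-ins, not the paper's estimator:

```python
import numpy as np

# A 4-state chain with two metastable blocks {0,1} and {2,3}:
# transitions within a block are frequent, between blocks rare.
P = np.array([[0.49, 0.49, 0.01, 0.01],
              [0.49, 0.49, 0.01, 0.01],
              [0.01, 0.01, 0.49, 0.49],
              [0.01, 0.01, 0.49, 0.49]])

rng = np.random.default_rng(0)
states = [0]
for _ in range(20000):
    states.append(rng.choice(4, p=P[states[-1]]))

# Empirical transition matrix from the observed trajectory.
counts = np.zeros((4, 4))
for s, t in zip(states[:-1], states[1:]):
    counts[s, t] += 1
P_hat = counts / counts.sum(axis=1, keepdims=True)

# Rank-2 state embedding from the SVD of P_hat: states with similar futures
# (rows of the transition matrix) land close together.
U, s, Vt = np.linalg.svd(P_hat)
embed = U[:, :2] * s[:2]
d_same = np.linalg.norm(embed[0] - embed[1])    # same metastable block
d_cross = np.linalg.norm(embed[0] - embed[2])   # different blocks
assert d_same < d_cross
```

Clustering the embedded points (e.g., with k-means) then recovers the two metastable sets, which is the slow-dynamics structure the abstract describes.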


Learning Curves, Model Selection and Complexity of Neural Networks

Murata, Noboru, Yoshizawa, Shuji, Amari, Shun-ichi

Neural Information Processing Systems

Learning curves show how a neural network improves as the number of training examples increases and how this improvement is related to the network complexity. The present paper clarifies the asymptotic properties of, and the relation between, two learning curves: one concerning the predictive (generalization) loss and the other the training loss. The result gives a natural definition of the complexity of a neural network. Moreover, it provides a new criterion for model selection.