A Deterministic Streaming Sketch for Ridge Regression

arXiv.org Machine Learning

We provide a deterministic space-efficient algorithm for estimating ridge regression. For $n$ data points with $d$ features and a large enough regularization parameter, we provide a solution within $\varepsilon$ L$_2$ error using only $O(d/\varepsilon)$ space. This is the first $o(d^2)$ space algorithm for this classic problem. The algorithm sketches the covariance matrix by variants of Frequent Directions, which implies it can operate in insertion-only streams and a variety of distributed data settings. In comparisons to randomized sketching algorithms on synthetic and real-world datasets, our algorithm has less empirical error using less space and similar time.
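
As a rough illustration of the sketch-then-solve idea, the following minimal Python sketch uses the basic Frequent Directions procedure rather than the specific variants the abstract refers to. The function names `fd_ridge` and `_shrink`, the buffer size of `2 * ell` rows, and the choice to accumulate $A^\top b$ exactly are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def _shrink(buf, ell):
    """Compress the buffer so that at most ell rows stay non-zero."""
    _, s, vt = np.linalg.svd(buf, full_matrices=False)
    # Subtract the (ell+1)-th squared singular value from all of them;
    # if there are at most ell directions, nothing has to be discarded.
    delta = s[ell] ** 2 if len(s) > ell else 0.0
    s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    out = np.zeros_like(buf)
    out[: len(s)] = s_shrunk[:, None] * vt
    return out, min(ell, len(s))


def fd_ridge(A, b, ell, gamma):
    """Stream over the rows of A, sketching A^T A and accumulating A^T b.

    Hypothetical helper: returns the approximate ridge solution and the
    ell x d sketch B, where B^T B stands in for A^T A.
    """
    _, d = A.shape
    buf = np.zeros((2 * ell, d))
    used = 0
    Atb = np.zeros(d)
    for a_i, b_i in zip(A, b):
        if used == 2 * ell:          # buffer full: compress before inserting
            buf, used = _shrink(buf, ell)
        buf[used] = a_i
        used += 1
        Atb += b_i * a_i             # kept exactly, costs only O(d) space
    buf, _ = _shrink(buf, ell)       # final compression down to ell rows
    B = buf[:ell]
    x = np.linalg.solve(B.T @ B + gamma * np.eye(d), Atb)
    return x, B
```

The sketch size `ell` plays the role of $1/\varepsilon$: the working memory is a buffer of `2 * ell` rows of length $d$ instead of a full $d \times d$ covariance matrix.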


Nested Barycentric Coordinate System as an Explicit Feature Map

arXiv.org Machine Learning

We propose a new embedding method which is particularly well-suited for settings where the sample size greatly exceeds the ambient dimension. Our technique consists of partitioning the space into simplices and then embedding the data points into features corresponding to the simplices' barycentric coordinates. We then train a linear classifier in the rich feature space obtained from the simplices. The decision boundary may be highly non-linear, though it is linear within each simplex (and hence piecewise-linear overall). Further, our method can approximate any convex body. We give generalization bounds based on empirical margin and a novel hybrid sample compression technique. An extensive empirical evaluation shows that our method consistently outperforms a range of popular kernel embedding methods.
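
To make the barycentric features concrete, here is a minimal sketch of computing the coordinates of a point with respect to a single simplex. It does not reproduce the nested partition described in the abstract; the function name `barycentric_coordinates` and the vertex layout (one vertex per row) are assumptions for illustration.

```python
import numpy as np


def barycentric_coordinates(vertices, x):
    """Barycentric coordinates of x w.r.t. a simplex with d+1 vertices in R^d.

    Solves sum_i lam_i * v_i = x together with sum_i lam_i = 1.
    """
    V = np.asarray(vertices, dtype=float)          # shape (d + 1, d)
    T = np.vstack([V.T, np.ones(len(V))])          # shape (d + 1, d + 1)
    rhs = np.append(np.asarray(x, dtype=float), 1.0)
    return np.linalg.solve(T, rhs)


# Triangle in the plane; the point lies inside, so all coordinates are >= 0.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
lam = barycentric_coordinates(triangle, (0.25, 0.25))
```

A function that is linear in these coordinates is linear inside the simplex, which is how a linear classifier on such features can produce a piecewise-linear decision boundary.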


Mutual Information-based State-Control for Intrinsically Motivated Reinforcement Learning

arXiv.org Machine Learning

In reinforcement learning, an agent learns to reach a set of goals by means of an external reward signal. In the natural world, intelligent organisms learn from internal drives, bypassing the need for external signals, which is beneficial for a wide range of tasks. Motivated by this observation, we propose to formulate an intrinsic objective as the mutual information between the goal states and the controllable states. This objective encourages the agent to take control of its environment. Subsequently, we derive a surrogate objective of the proposed reward function, which can be optimized efficiently. Lastly, we evaluate the developed framework in different robotic manipulation and navigation tasks and demonstrate the efficacy of our approach. A video showing experimental results is available at https://youtu.be/CT4CKMWBYz0.


Wasserstein Exponential Kernels

arXiv.org Machine Learning

In the context of kernel methods, the similarity between data points is encoded by the kernel function, which is often defined in terms of the Euclidean distance, a common example being the squared exponential kernel. Recently, other distances relying on optimal transport theory - such as the Wasserstein distance between probability distributions - have shown their practical relevance for different machine learning techniques. In this paper, we study the use of exponential kernels defined in terms of the regularized Wasserstein distance and discuss their positive definiteness. More specifically, we define Wasserstein feature maps and illustrate their usefulness for supervised learning problems involving shapes and images. Empirically, Wasserstein squared exponential kernels are shown to yield smaller classification errors on small training sets of shapes, compared to analogous classifiers using Euclidean distances.
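
The shape of such a kernel can be illustrated with a deliberately simplified case. The sketch below swaps in the unregularized 2-Wasserstein distance between two one-dimensional empirical samples of equal size, which has a closed form via sorting; it is not the regularized distance studied in the abstract. The function names and the bandwidth parameter `sigma` are assumptions for this example.

```python
import numpy as np


def wasserstein2_sq_1d(x, y):
    """Squared 2-Wasserstein distance between two equal-size 1D samples.

    In one dimension the optimal coupling matches sorted values.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean((x - y) ** 2))


def wasserstein_sq_exp_kernel(x, y, sigma=1.0):
    """Squared exponential kernel with the Wasserstein distance plugged in."""
    return float(np.exp(-wasserstein2_sq_1d(x, y) / (2.0 * sigma ** 2)))
```

Replacing the Euclidean distance by a transport distance changes only the inner distance computation; the exponential form of the kernel is unchanged.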


$\epsilon$-shotgun: $\epsilon$-greedy Batch Bayesian Optimisation

arXiv.org Machine Learning

Bayesian optimisation is a popular, surrogate model-based approach for optimising expensive black-box functions. Given a surrogate model, the next location to expensively evaluate is chosen via maximisation of a cheap-to-query acquisition function. We present an $\epsilon$-greedy procedure for Bayesian optimisation in batch settings in which the black-box function can be evaluated multiple times in parallel. Our $\epsilon$-shotgun algorithm leverages the model's prediction, uncertainty, and the approximated rate of change of the landscape to determine the spread of batch solutions to be distributed around a putative location. The initial target location is selected either in an exploitative fashion on the mean prediction, or -- with probability $\epsilon$ -- from elsewhere in the design space. This results in locations that are more densely sampled in regions where the function is changing rapidly and in locations predicted to be good (i.e., close to predicted optima), with more scattered samples in regions where the function is flatter and/or of poorer quality. We empirically evaluate the $\epsilon$-shotgun methods on a range of synthetic functions and two real-world problems, finding that they perform at least as well as state-of-the-art batch methods and in many cases exceed their performance.
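
The target-selection step can be sketched as follows. This is only the $\epsilon$-greedy choice of the initial location, under the assumptions that the objective is minimised, that the surrogate mean has already been evaluated on a finite candidate set, and that the design space is a box; the function name `select_target` and its parameters are hypothetical, and the rule for spreading the rest of the batch is not reproduced here.

```python
import numpy as np


def select_target(candidates, mean_pred, lower, upper, epsilon, rng):
    """Pick the initial target location in an epsilon-greedy fashion.

    With probability epsilon, explore: sample uniformly from the box
    [lower, upper]. Otherwise, exploit: take the candidate with the best
    (here: smallest) predicted mean.
    """
    if rng.random() < epsilon:
        return rng.uniform(lower, upper)
    return np.asarray(candidates)[int(np.argmin(mean_pred))]


rng = np.random.default_rng(0)
candidates = np.array([[0.1, 0.2], [0.5, 0.5], [0.9, 0.3]])
mean_pred = np.array([1.2, 0.4, 0.9])   # surrogate means (made-up values)
lower, upper = np.zeros(2), np.ones(2)

greedy = select_target(candidates, mean_pred, lower, upper, 0.0, rng)
explore = select_target(candidates, mean_pred, lower, upper, 1.0, rng)
```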


Sharpe Ratio in High Dimensions: Cases of Maximum Out of Sample, Constrained Maximum, and Optimal Portfolio Choice

arXiv.org Machine Learning

In this paper, we analyze the maximum Sharpe ratio when the number of assets in a portfolio is larger than its time span. One obstacle in this large dimensional setup is the singularity of the sample covariance matrix of the excess asset returns. To solve this issue, we benefit from a technique called nodewise regression, which was developed by Meinshausen and Buhlmann (2006). It provides a sparse/weakly sparse and consistent estimate of the precision matrix, using the Lasso method. We analyze three issues. One of the key results in our paper is that mean-variance efficiency for the portfolios in large dimensions is established. Then tied to that result, we also show that the maximum out-of-sample Sharpe ratio can be consistently estimated in this large portfolio of assets. Furthermore, we provide convergence rates and see that the number of assets slows down the convergence up to a logarithmic factor. Then, we establish consistency of the maximum Sharpe ratio when the portfolio weights add up to one, and also provide a new formula and an estimate for the constrained maximum Sharpe ratio. Finally, we provide consistent estimates of the Sharpe ratios of the global minimum variance portfolio and Markowitz's (1952) mean variance portfolio. In terms of assumptions, we allow for time series data. Simulation and out-of-sample forecasting exercises show that our new method performs well compared to factor- and shrinkage-based techniques.
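
For orientation, the standard textbook expressions below show why a precision-matrix estimate is needed. The notation is an assumption for this note and not taken from the paper: $\mu$ is the vector of expected excess returns, $\Sigma$ the (positive definite) covariance matrix of excess returns, $w$ the portfolio weights, and $\mathbf{1}$ the vector of ones. The paper's new constrained formula is not reproduced here.

```latex
\[
\mathrm{MSR}
  = \max_{w \neq 0} \frac{w^\top \mu}{\sqrt{w^\top \Sigma\, w}}
  = \sqrt{\mu^\top \Sigma^{-1} \mu},
\qquad
w^\star \propto \Sigma^{-1} \mu ,
\]
\[
w_{\mathrm{GMV}}
  = \frac{\Sigma^{-1} \mathbf{1}}{\mathbf{1}^\top \Sigma^{-1} \mathbf{1}} .
\]
```

Both expressions depend on $\Sigma^{-1}$. When the number of assets exceeds the number of time periods, the sample covariance matrix is singular and cannot be inverted, so a direct estimate of the precision matrix is required.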


Proximity Preserving Binary Code using Signed Graph-Cut

arXiv.org Machine Learning

We introduce a binary embedding framework, called Proximity Preserving Code (PPC), which learns similarity and dissimilarity between data points to create a compact and affinity-preserving binary code. This code can be used for fast, memory-efficient approximate nearest-neighbor search. Our framework is flexible, enabling different proximity definitions between data points. In contrast to previous methods that extract binary codes based on unsigned graph partitioning, our system models the attractive and repulsive forces in the data by incorporating positive and negative graph weights. The proposed framework is shown to boil down to finding the minimal cut of a signed graph, a problem known to be NP-hard. We offer an efficient approximation and achieve superior results by constructing the code bit after bit. We show that the proposed approximation is superior to the commonly used spectral methods with respect to both accuracy and complexity. Thus, it is useful for many other problems that can be translated into signed graph cut.


Online Passive-Aggressive Total-Error-Rate Minimization

arXiv.org Machine Learning

We provide a new online learning algorithm which utilizes online passive-aggressive learning (PA) and total-error-rate minimization (TER) for binary classification. The PA learning provides not only large-margin training but also the capacity to handle non-separable data. The TER learning, on the other hand, minimizes an objective function based on an approximated classification error. We propose an online PATER algorithm which combines those useful properties. In addition, we present a weighted PATER algorithm to improve the ability to cope with data imbalance problems. Experimental results demonstrate that the proposed PATER algorithms achieve better performance in terms of efficiency and effectiveness than the existing state-of-the-art online learning algorithms on real-world data sets.
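
For readers unfamiliar with the PA component, the following minimal sketch shows the classic passive-aggressive update in its PA-I form (hinge loss with aggressiveness parameter `C`). It is not the PATER algorithm proposed in the abstract, and the function name `pa1_update` is an assumption for illustration.

```python
import numpy as np


def pa1_update(w, x, y, C=1.0):
    """One PA-I step for a linear binary classifier with label y in {-1, +1}.

    Passive: if the margin is at least 1, the weights are left unchanged.
    Aggressive: otherwise move just far enough to fix the example,
    with the step size capped at C to tolerate non-separable data.
    """
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    loss = max(0.0, 1.0 - y * float(np.dot(w, x)))
    sq_norm = float(np.dot(x, x))
    if loss == 0.0 or sq_norm == 0.0:
        return w
    tau = min(C, loss / sq_norm)
    return w + tau * y * x


w0 = np.zeros(2)
w1 = pa1_update(w0, [1.0, 0.0], +1)   # margin violated: weights move
w2 = pa1_update(w1, [1.0, 0.0], +1)   # margin satisfied: no change
```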


Does the Markov Decision Process Fit the Data: Testing for the Markov Property in Sequential Decision Making

arXiv.org Machine Learning

The Markov assumption (MA) is fundamental to the empirical validity of reinforcement learning. In this paper, we propose a novel Forward-Backward Learning procedure to test MA in sequential decision making. The proposed test does not assume any parametric form on the joint distribution of the observed data and plays an important role in identifying the optimal policy in high-order Markov decision processes and partially observable MDPs. We apply our test to both synthetic datasets and a real data example from mobile health studies to illustrate its usefulness.


Semiparametric Bayesian Forecasting of Spatial Earthquake Occurrences

arXiv.org Machine Learning

Self-exciting Hawkes processes are used to model events which cluster in time and space, and have been widely studied in seismology under the name of the Epidemic Type Aftershock Sequence (ETAS) model. In the ETAS framework, the occurrence of the mainshock earthquakes in a geographical region is assumed to follow an inhomogeneous spatial point process, and aftershock events are then modelled via a separate triggering kernel. Most previous studies of the ETAS model have relied on point estimates of the model parameters due to the complexity of the likelihood function, and the difficulty in estimating an appropriate mainshock distribution. In order to take estimation uncertainty into account, we instead propose a fully Bayesian formulation of the ETAS model which uses a nonparametric Dirichlet process mixture prior to capture the spatial mainshock process. Direct inference for the resulting model is problematic due to the strong correlation of the parameters for the mainshock and triggering processes, so we instead use an auxiliary latent variable routine to perform efficient inference.
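
The self-exciting mechanism can be illustrated with a purely temporal conditional intensity. The sketch below assumes a constant background rate `mu` and an exponential triggering kernel with parameters `alpha` and `beta`; these are simplifications chosen for brevity, whereas ETAS models typically use an Omori-type power-law decay and also include a spatial component. The function name `hawkes_intensity` is hypothetical.

```python
import numpy as np


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity at time t of a temporal Hawkes process.

    lambda(t) = mu + sum over past events t_i < t of
                alpha * beta * exp(-beta * (t - t_i))
    Each past event raises the rate of future events, and the effect
    decays over time.
    """
    past = np.asarray(event_times, dtype=float)
    past = past[past < t]
    triggered = np.sum(alpha * beta * np.exp(-beta * (t - past)))
    return mu + float(triggered)
```

Here `mu` plays the role of the background (mainshock) rate and the sum plays the role of the aftershock contribution.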