 Computational Learning Theory


Hierarchical robust aggregation of sales forecasts at aggregated levels in e-commerce, based on exponential smoothing and Holt's linear trend method

arXiv.org Machine Learning

We revisit the interest of classical statistical techniques for sales forecasting like exponential smoothing and extensions thereof (such as Holt's linear trend method). We do so by considering ensemble forecasts, given by several instances of these classical techniques tuned with different (sets of) parameters, and by forming convex combinations of the elements of ensemble forecasts over time, in a robust and sequential manner. The machine-learning theory behind this is called "robust online aggregation", or "prediction with expert advice", or "prediction of individual sequences" (see Cesa-Bianchi and Lugosi, 2006). We apply this methodology to a hierarchical data set of sales provided by the e-commerce company Cdiscount and output forecasts at the levels of subsubfamilies, subfamilies and families of items sold, for various forecasting horizons (up to 6 weeks ahead). The performance achieved is better than what would be obtained by optimally tuning the classical techniques on a train set and using their forecasts on the test set. The performance is also good from an intrinsic point of view (in terms of mean absolute percentage error). While getting these better forecasts of sales at the levels of subsubfamilies, subfamilies and families is interesting per se, we also suggest using them as additional features when forecasting demand at the item level.
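
As a rough illustration of the idea of forming convex combinations of exponential smoothing forecasts sequentially, the sketch below aggregates several simple exponential smoothing "experts" with the exponentially weighted average forecaster, a standard algorithm from the prediction-with-expert-advice literature. It is not claimed to be the aggregation rule used in the paper; the function names, the squared loss, and the learning rate `eta` are assumptions made for this example.

```python
import math

def ses_forecasts(series, alpha):
    """One-step-ahead simple exponential smoothing forecasts.

    The level is initialised to the first observation (an assumption of this
    sketch), so the first forecast trivially equals series[0].
    """
    level = series[0]
    forecasts = []
    for y in series:
        forecasts.append(level)                  # forecast issued before seeing y
        level = alpha * y + (1 - alpha) * level  # smoothing update after seeing y
    return forecasts

def aggregate(series, alphas, eta=0.1):
    """Exponentially weighted average of SES experts under squared loss.

    At each step the experts are weighted by exp(-eta * cumulative loss),
    using only losses from earlier steps, so the combination is sequential.
    """
    experts = [ses_forecasts(series, a) for a in alphas]
    cum_loss = [0.0] * len(alphas)
    combined = []
    for t, y in enumerate(series):
        best = min(cum_loss)  # subtract the minimum for numerical stability
        weights = [math.exp(-eta * (loss - best)) for loss in cum_loss]
        total = sum(weights)
        preds = [expert[t] for expert in experts]
        # Convex combination of the expert forecasts.
        combined.append(sum(w / total * p for w, p in zip(weights, preds)))
        # Reveal the observation and update each expert's cumulative loss.
        for k, p in enumerate(preds):
            cum_loss[k] += (p - y) ** 2
    return combined
```

For example, `aggregate(weekly_sales, alphas=[0.1, 0.5, 0.9])` returns one combined forecast per time step, where `weekly_sales` is a hypothetical list of observed values. Because the weights are non-negative and sum to one, each combined forecast lies between the smallest and largest expert forecast at that step.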


A combinatorial conjecture from PAC-Bayesian machine learning

arXiv.org Machine Learning

We present a proof of a combinatorial conjecture from the second author's Ph.D. thesis. The proof relies on binomial and multinomial sums identities. We also discuss the relevance of the conjecture in the context of PAC-Bayesian machine learning.


SAT Heritage: a community-driven effort for archiving, building and running more than thousand SAT solvers

arXiv.org Artificial Intelligence

SAT research has a long history of source code and binary releases, thanks to competitions organized every year. However, since every cycle of competitions has its own set of rules and an ad hoc way of publishing source code and binaries, compiling or even running any solver may be harder than it seems. Moreover, more than a thousand solvers have been published so far, some of them released in the early 1990s. If the SAT community wants to archive and be able to keep track of all the solvers that made its history, it urgently needs to make a substantial effort. We propose to initiate a community-driven effort to archive and to allow easy compilation and running of all SAT solvers that have been released so far. We rely on the best tools for archiving and building binaries (thanks to Docker, GitHub and Zenodo) and provide a consistent and easy way to do so. Thanks to our tool, building (or running) a solver from its source (or from its binary) can be done in one line.


Provably Good Solutions to the Knapsack Problem via Neural Networks of Bounded Size

arXiv.org Machine Learning

In view of the undisputed success of neural networks and due to the remarkable recent improvements in their ability to solve a huge variety of practical problems, the development of a satisfying and rigorous mathematical understanding of their performance is one of the main challenges in the field of learning theory. Against this background, we study the expressive power of neural networks through the example of the classical NP-hard Knapsack Problem. Our main contribution is a class of recurrent neural networks (RNNs) with rectified linear units that are iteratively applied to each item of a Knapsack instance and thereby compute optimal or provably good solution values. In order to find optimum Knapsack solutions, an RNN of depth four and width depending quadratically on the profit of an optimum Knapsack solution is sufficient. We also prove the following tradeoff between the size of an RNN and the quality of the computed Knapsack solution: For Knapsack instances consisting of $n$ items, an RNN of depth five and width $w$ computes a solution of value at least $1-\mathcal{O}(n^2/\sqrt{w})$ times the optimum solution value. Our results build upon a dynamic programming formulation of the Knapsack Problem as well as a careful rounding of profit values that is also at the core of the well-known fully polynomial-time approximation scheme for the Knapsack Problem. Finally, we point out that similar results can be achieved for other optimization problems that can be solved by dynamic programming, such as various Shortest Path Problems and the Longest Common Subsequence Problem.
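
The two classical ingredients the abstract mentions, a dynamic program over profit values and a rounding of profits as in the textbook fully polynomial-time approximation scheme, can be sketched in plain Python as follows. This is only the standard algorithmic background, not the paper's neural network construction; the function names and the choice to return a value rather than an item set are assumptions of this example.

```python
def knapsack_min_weight_dp(profits, weights, capacity):
    """Exact DP over integer profits.

    min_weight[p] is the least total weight of a subset with profit exactly p.
    Returns the largest profit whose minimum weight fits in the capacity.
    """
    total = sum(profits)
    inf = float("inf")
    min_weight = [0] + [inf] * total
    for p_i, w_i in zip(profits, weights):
        # Iterate profits downwards so each item is used at most once.
        for p in range(total, p_i - 1, -1):
            candidate = min_weight[p - p_i] + w_i
            if candidate < min_weight[p]:
                min_weight[p] = candidate
    return max(p for p in range(total + 1) if min_weight[p] <= capacity)

def knapsack_fptas_value(profits, weights, capacity, eps):
    """Profit rounding: a lower bound on the value of the subset found,
    which is at least (1 - eps) times the optimum value."""
    # Items that do not fit on their own can never be packed.
    items = [(p, w) for p, w in zip(profits, weights) if w <= capacity]
    if not items:
        return 0.0
    scale = eps * max(p for p, _ in items) / len(items)
    if scale <= 0:
        return 0.0  # all remaining profits are zero
    rounded = [int(p // scale) for p, _ in items]  # round profits down
    best_rounded = knapsack_min_weight_dp(rounded, [w for _, w in items], capacity)
    return scale * best_rounded
```

Rounding shrinks the profit range the DP has to cover to roughly `n / eps` per item, which is what makes the running time polynomial in `n` and `1 / eps`; the price is a loss of at most `scale` per item, hence at most `eps` times the largest single profit in total.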


Technical Perspective: Algorithm Selection as a Learning Problem

Communications of the ACM

The following paper by Gupta and Roughgarden--"Data-Driven Algorithm Design"--addresses the issue that the best algorithm to use for many problems depends on what the input "looks like." Certain algorithms work better for certain types of inputs, whereas other algorithms work better for others. This is especially the case for NP-hard problems, where we do not expect to ever have algorithms that work well on all inputs: instead, we often have various heuristics that each work better in different settings. Moreover, heuristic strategies often have parameters or hyperparameters that must be set in some way. The authors present a theoretical formulation and analysis of algorithm selection using the well-developed framework of PAC-learning to analyze fundamental learning questions.


Data-Driven Algorithm Design

Communications of the ACM

The best algorithm for a computational problem generally depends on the "relevant inputs," a concept that depends on the application domain and often defies formal articulation. Although there is a large literature on empirical approaches to selecting the best algorithm for a given application domain, there has been surprisingly little theoretical analysis of the problem. Our framework captures several state-of-the-art empirical and theoretical approaches to the problem, and our results identify conditions under which these approaches are guaranteed to perform well. We interpret our results in the contexts of learning greedy heuristics, instance feature-based algorithm selection, and parameter tuning in machine learning. Rigorously comparing algorithms is hard. Two different algorithms for a computational problem generally have incomparable performance: one algorithm is better on some inputs but worse on the others. The simplest and most common solution in the theoretical analysis of algorithms is to summarize the performance of an algorithm using a single number, such as its worst-case performance or its average-case performance with respect to an input distribution. This approach effectively advocates using the algorithm with the best summarizing value (e.g., the smallest worst-case running time). Solving a problem "in practice" generally means identifying an algorithm that works well for most or all instances of interest. When the "instances of interest" are easy to specify formally in advance (say, planar graphs), the traditional analysis approaches often give accurate performance predictions and identify useful algorithms.
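
To make "learning greedy heuristics" concrete, here is a minimal sketch of algorithm selection by empirical risk minimization: a one-parameter family of greedy knapsack heuristics is evaluated on sample instances, and the parameter with the best average value is kept. The particular family (ordering items by value divided by size raised to a power `rho`), the finite candidate grid, and all names are assumptions chosen for illustration, not a statement of the paper's exact setup.

```python
def greedy_knapsack(values, sizes, capacity, rho):
    """Greedy heuristic: consider items in decreasing order of value / size**rho
    and pack each one that still fits. Sizes are assumed to be positive."""
    order = sorted(
        range(len(values)),
        key=lambda i: values[i] / sizes[i] ** rho,
        reverse=True,
    )
    total, remaining = 0, capacity
    for i in order:
        if sizes[i] <= remaining:
            remaining -= sizes[i]
            total += values[i]
    return total

def select_rho(sample_instances, candidate_rhos):
    """Pick the candidate with the best average value on the sample.

    Each instance is a (values, sizes, capacity) tuple drawn from the
    application domain of interest.
    """
    def average_value(rho):
        return sum(
            greedy_knapsack(v, s, c, rho) for v, s, c in sample_instances
        ) / len(sample_instances)
    return max(candidate_rhos, key=average_value)
```

With `rho = 0` the heuristic orders items by value alone, and with `rho = 1` by value density; which one is better depends on the instances, which is exactly why the choice is made from data.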


Proper Learning, Helly Number, and an Optimal SVM Bound

arXiv.org Machine Learning

The classical PAC sample complexity bounds are stated for any Empirical Risk Minimizer (ERM) and contain an extra logarithmic factor $\log(1/{\epsilon})$ which is known to be necessary for ERM in general. It has been recently shown by Hanneke (2016) that the optimal sample complexity of PAC learning for any VC class C is achieved by a particular improper learning algorithm, which outputs a specific majority-vote of hypotheses in C. This leaves the question of when this bound can be achieved by proper learning algorithms, which are restricted to always output a hypothesis from C. In this paper we aim to characterize the classes for which the optimal sample complexity can be achieved by a proper learning algorithm. We identify that these classes can be characterized by the dual Helly number, which is a combinatorial parameter that arises in discrete geometry and abstract convexity. In particular, under general conditions on C, we show that the dual Helly number is bounded if and only if there is a proper learner that obtains the optimal joint dependence on $\epsilon$ and $\delta$. As further implications of our techniques we resolve a long-standing open problem posed by Vapnik and Chervonenkis (1974) on the performance of the Support Vector Machine by proving that the sample complexity of SVM in the realizable case is $\Theta((n/{\epsilon})+(1/{\epsilon})\log(1/{\delta}))$, where $n$ is the dimension. This gives the first optimal PAC bound for Halfspaces achieved by a proper learning algorithm, and moreover is computationally efficient.


Surrogate Assisted Optimisation for Travelling Thief Problems

AAAI Conferences

The travelling thief problem (TTP) is a multi-component optimisation problem involving two interdependent NP-hard components: the travelling salesman problem (TSP) and the knapsack problem (KP). Recent state-of-the-art TTP solvers modify the underlying TSP and KP solutions in an iterative and interleaved fashion. The TSP solution (cyclic tour) is typically changed in a deterministic way, while changes to the KP solution typically involve a random search, effectively resulting in a quasi-meandering exploration of the TTP solution space. Once a plateau is reached, the iterative search of the TTP solution space is restarted by using a new initial TSP tour. We propose to make the search more efficient through an adaptive surrogate model (based on a customised form of Support Vector Regression) that learns the characteristics of initial TSP tours that lead to good TTP solutions. The model is used to filter out non-promising initial TSP tours, in effect reducing the amount of time spent to find a good TTP solution. Experiments on a broad range of benchmark TTP instances indicate that the proposed approach filters out a considerable number of non-promising initial tours, at the cost of missing only a small number of the best TTP solutions.


HyperVAE: A Minimum Description Length Variational Hyper-Encoding Network

arXiv.org Machine Learning

We propose a framework called HyperVAE for encoding distributions of distributions. When a target distribution is modeled by a VAE, its neural network parameters \theta are drawn from a distribution p(\theta) which is modeled by a hyper-level VAE. We propose a variational inference method using Gaussian mixture models to implicitly encode the parameters \theta into a low dimensional Gaussian distribution. Given a target distribution, we predict the posterior distribution of the latent code, then use a matrix-network decoder to generate a posterior distribution q(\theta). HyperVAE can encode the parameters \theta in full, in contrast to common hyper-network practices, which generate only the scale and bias vectors as target-network parameters. Thus HyperVAE preserves much more information about the model for each task in the latent space. We discuss HyperVAE using the minimum description length (MDL) principle and show that it helps HyperVAE to generalize. We evaluate HyperVAE in density estimation tasks, outlier detection and discovery of novel design classes, demonstrating its efficacy.


Solving Weighted Abduction via Max-SAT Solvers

AAAI Conferences

Abduction is a form of inference that seeks the best explanation for a given observation. Because it provides a reasoning process based on background knowledge, it is used in applications that need convincing explanations. In this study, we consider weighted abduction, which is one of the commonly used mathematical models for abduction. The main difficulty associated with applying weighted abduction to real problems is its computational complexity. A state-of-the-art method formulates weighted abduction as an integer linear programming (ILP) problem and solves it using efficient ILP solvers; however, it is still limited to solving problems that include at most 100 rules of background knowledge and observations. In this study, we first formulate the weighted abduction problem as a Max-SAT problem whose hard clauses are mostly Horn clauses. Then, we propose to solve the problem using modern Max-SAT solvers. In our experiments, the proposed method solved the problems much faster than the state-of-the-art ILP-based weighted abduction method.