Neural Network Architecture Beyond Width and Depth
This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyper-parameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyper-parameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture a nested network (NestNet). A NestNet of height s is built with each hidden neuron activated by a NestNet of height ≤ s − 1.
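The recursive construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's construction: the function names, the random weights, the scalar input and output, and the choice to give every hidden neuron its own sub-network of height exactly s − 1 are all hypothetical. A height-1 network is taken to be an ordinary ReLU network.

```python
import random


def relu(x):
    return max(0.0, x)


def make_nestnet(height, width, depth, rng):
    """Build a scalar function R -> R with a nested architecture.

    height == 1: a standard fully connected ReLU network.
    height  > 1: same layout, but each hidden neuron is activated by
    its own NestNet of height - 1 instead of ReLU (an assumption made
    for this sketch).
    """
    layers = []
    fan_in = 1
    for _ in range(depth):
        layer = []
        for _ in range(width):
            weights = [rng.uniform(-1.0, 1.0) for _ in range(fan_in)]
            bias = rng.uniform(-1.0, 1.0)
            if height == 1:
                activation = relu
            else:
                activation = make_nestnet(height - 1, width, depth, rng)
            layer.append((weights, bias, activation))
        layers.append(layer)
        fan_in = width
    out_weights = [rng.uniform(-1.0, 1.0) for _ in range(fan_in)]
    out_bias = rng.uniform(-1.0, 1.0)

    def forward(x):
        hidden = [x]
        for layer in layers:
            # Affine map followed by the neuron's own activation.
            hidden = [
                activation(sum(w * h for w, h in zip(weights, hidden)) + bias)
                for weights, bias, activation in layer
            ]
        return sum(w * h for w, h in zip(out_weights, hidden)) + out_bias

    return forward


def count_activation_networks(height, width, depth):
    """Number of sub-networks used as activations, at all nesting levels."""
    if height == 1:
        return 0
    per_neuron = 1 + count_activation_networks(height - 1, width, depth)
    return width * depth * per_neuron
```

Calling `make_nestnet(2, 2, 2, random.Random(0))` returns a function that can be evaluated at any real number. In this sketch the number of nested sub-networks grows geometrically with the height, which gives a rough feel for why height acts as a third architectural dimension.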
Robust Regression Revisited: Acceleration and Improved Estimation Rates
Parameter estimation in generalized linear models, such as linear and logistic regression problems, is among the most fundamental and well-studied statistical optimization problems. It serves as the primary workhorse in statistical studies arising from a variety of disciplines, ranging from economics [Smi12] and biology [VGSM05] to the social sciences [Gor10].
Robust Regression Revisited: Acceleration and Improved Estimation Rates
We study fast algorithms for statistical regression problems under the strong contamination model, where the goal is to approximately optimize a generalized linear model (GLM) given adversarially corrupted samples. Prior works in this line of research were based on the robust gradient descent framework of [PSBR20], a first-order method using biased gradient queries, or the Sever framework of [DKK+19], an iterative outlier-removal method calling a stationary point finder. We present nearly-linear time algorithms for robust regression problems with improved runtime or estimation guarantees compared to the state-of-the-art.
Supplementary Material for Mixture weights optimisation for Alpha-Divergence Variational Inference. Kamélia Daudel, Randal Douc
Assume that p and k are as in (A1). Then, the two following assertions hold. A.3 The case α < 1 for the Power Descent algorithm. Let α ≠ 1, η ∈ (0,1], κ be such that (α − 1)κ ≥ 0 and let the initial probability measure µ_1 ∈ M_1(T) be such that Ψ_α(µ_1) < ∞. A common way to approximate intractable integrals of the form (16) is to resort to Importance Sampling methods, and in that case we are also interested in ensuring that the support of the variational approximation q ∈ Q (with q = µk in our case) is included in the support of p. Seeking to solve the Variational Inference optimisation problem inf D_α(µK||P) for α < 1 enables this to happen, as opposed to the case α ≥ 1 for which the α-divergence exhibits the so-called mode-seeking property [2, 3, 4]. As a whole, well-chosen samplers and variance reduction methods appear to be a necessity even in the case α = 1 so that the obtained Monte Carlo estimator of θ ↦ b_{µ,α}(θ) does not suffer from too large a variance.
Mixture weights optimisation for Alpha-Divergence Variational Inference
This paper focuses on α-divergence minimisation methods for Variational Inference. We consider the case where the posterior density is approximated by a mixture model and we investigate algorithms optimising the mixture weights of this mixture model by α-divergence minimisation, without any information on the underlying distribution of its mixture components parameters. The Power Descent, defined for all α ≠ 1, is one such algorithm and we establish in our work the full proof of its convergence towards the optimal mixture weights when α < 1. Since the α-divergence recovers the widely-used exclusive Kullback-Leibler when α → 1, we then extend the Power Descent to the case α = 1 and show that we obtain an Entropic Mirror Descent. This leads us to investigate the link between Power Descent and Entropic Mirror Descent: first-order approximations allow us to introduce the Rényi Descent, a novel algorithm for which we prove an O(1/N) convergence rate. Lastly, we compare numerically the behavior of the unbiased Power Descent and of the biased Rényi Descent and we discuss the potential advantages of one algorithm over the other.
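To make the α = 1 case concrete, here is a minimal sketch of an Entropic Mirror Descent on mixture weights for the exclusive Kullback-Leibler divergence. It is a toy under stated assumptions rather than the paper's algorithm: the components and the target are discrete distributions on a finite set so that the gradient is exact (the paper works with densities and Monte Carlo estimates), and the function names, the step size `eta` and the iteration count are hypothetical.

```python
import math


def mixture(weights, components):
    """Probability vector of the mixture sum_j weights[j] * components[j]."""
    n_points = len(components[0])
    return [
        sum(w * c[x] for w, c in zip(weights, components))
        for x in range(n_points)
    ]


def exclusive_kl_gradient(weights, components, target):
    """Gradient of KL(mixture || target) with respect to the weights."""
    q = mixture(weights, components)
    return [
        sum(c[x] * (math.log(q[x] / target[x]) + 1.0) for x in range(len(q)))
        for c in components
    ]


def entropic_mirror_descent(components, target, eta=0.5, n_iter=2000):
    """Multiplicative update on the simplex, starting from uniform weights."""
    n_components = len(components)
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        grad = exclusive_kl_gradient(weights, components, target)
        # Exponentiated-gradient step, then renormalise onto the simplex.
        unnormalised = [w * math.exp(-eta * g) for w, g in zip(weights, grad)]
        total = sum(unnormalised)
        weights = [u / total for u in unnormalised]
    return weights


# Toy problem: the target is itself a mixture of the three components.
components = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
true_weights = [0.5, 0.3, 0.2]
target = mixture(true_weights, components)
estimated_weights = entropic_mirror_descent(components, target)
```

Because the update is multiplicative, the weights stay positive and sum to one at every iteration without any explicit projection step.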
Recursive Bayesian Networks: Generalising and Unifying Probabilistic Context-Free Grammars and Dynamic Bayesian Networks
Probabilistic context-free grammars (PCFGs) and dynamic Bayesian networks (DBNs) are widely used sequence models with complementary strengths and limitations. While PCFGs allow for nested hierarchical dependencies (tree structures), their latent variables (non-terminal symbols) have to be discrete. In contrast, DBNs allow for continuous latent variables, but the dependencies are strictly sequential (chain structure). Therefore, neither can be applied if the latent variables are assumed to be continuous and also to have a nested hierarchical dependency structure. In this paper, we present Recursive Bayesian Networks (RBNs), which generalise and unify PCFGs and DBNs, combining their strengths and containing both as special cases. RBNs define a joint distribution over tree-structured Bayesian networks with discrete or continuous latent variables. The main challenge lies in performing joint inference over the exponential number of possible structures and the continuous variables. We provide two solutions: 1) For arbitrary RBNs, we generalise inside and outside probabilities from PCFGs to the mixed discrete-continuous case, which allows for maximum posterior estimates of the continuous latent variables via gradient descent, while marginalising over network structures.
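For orientation, the classical inside probabilities that RBNs are said to generalise can be sketched for a purely discrete PCFG in Chomsky normal form. This covers only the standard discrete case, not the mixed discrete-continuous extension; the function name, the dictionary-based rule encoding and the toy grammar are assumptions made for this example.

```python
from collections import defaultdict


def inside_probability(words, lexical, binary, start="S"):
    """Probability that `start` derives `words` under a PCFG in CNF.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    """
    n = len(words)
    # inside[(i, j)][A] = P(A derives words[i:j]), summed over all trees.
    inside = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for (parent, terminal), prob in lexical.items():
            if terminal == word:
                inside[(i, i + 1)][parent] += prob
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the children
                for (parent, left, right), prob in binary.items():
                    p_left = inside[(i, k)].get(left, 0.0)
                    p_right = inside[(k, j)].get(right, 0.0)
                    if p_left and p_right:
                        inside[(i, j)][parent] += prob * p_left * p_right
    return inside[(0, n)].get(start, 0.0)


# Toy grammar: S -> S S (0.3) | "a" (0.7)
lexical_rules = {("S", "a"): 0.7}
binary_rules = {("S", "S", "S"): 0.3}
p_two = inside_probability(["a", "a"], lexical_rules, binary_rules)
p_three = inside_probability(["a", "a", "a"], lexical_rules, binary_rules)
```

The sum over split points and rules is what marginalises over the exponentially many tree structures in polynomial time.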
Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis
We introduce Resilient Multiple Choice Learning (rMCL), an extension of the MCL approach for conditional distribution estimation in regression settings where multiple targets may be sampled for each training input. Multiple Choice Learning is a simple framework to tackle multimodal density estimation, using the Winner-Takes-All (WTA) loss for a set of hypotheses. In regression settings, the existing MCL variants focus on merging the hypotheses, thereby eventually sacrificing the diversity of the predictions. In contrast, our method relies on a novel learned scoring scheme underpinned by a mathematical framework based on Voronoi tessellations of the output space, from which we can derive a probabilistic interpretation. After empirically validating rMCL with experiments on synthetic data, we further assess its merits on the sound source localization task, demonstrating its practical usefulness and the relevance of its interpretation.
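The Winner-Takes-All loss mentioned above can be illustrated with a short sketch. It shows only the plain WTA step for one training input with several targets, not the learned scoring scheme of rMCL; the function name, the squared Euclidean distance and the averaging over targets are assumptions made for this example.

```python
def wta_loss(hypotheses, targets):
    """Winner-Takes-All loss for one input.

    hypotheses: list of K predicted points (lists of floats)
    targets:    list of observed target points for the same input
    Returns the mean, over targets, of the squared distance to the
    closest hypothesis, and the index of the winning hypothesis for
    each target.
    """
    losses = []
    winners = []
    for target in targets:
        distances = [
            sum((h - t) ** 2 for h, t in zip(hypothesis, target))
            for hypothesis in hypotheses
        ]
        best = min(range(len(hypotheses)), key=distances.__getitem__)
        winners.append(best)
        losses.append(distances[best])  # only the winner is penalised
    return sum(losses) / len(losses), winners


hypotheses = [[0.0, 0.0], [1.0, 1.0]]
targets = [[0.9, 1.0], [0.0, 0.1]]
loss, winners = wta_loss(hypotheses, targets)
```

Each target is assigned to its nearest hypothesis, so the hypotheses implicitly partition the output space into Voronoi cells, which is the geometric picture the abstract refers to.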
CATER: Intellectual Property Protection on Text Generation APIs via Conditional Watermarks
Previous works have validated that text generation APIs can be stolen through imitation attacks, causing IP violations. In order to protect the IP of text generation APIs, recent work has introduced a watermarking algorithm and utilized the null-hypothesis test as a post-hoc ownership verification on the imitation models. However, we find that it is possible to detect those watermarks via sufficient statistics of the frequencies of candidate watermarking words. To address this drawback, in this paper, we propose a novel Conditional wATERmarking framework (CATER) for protecting the IP of text generation APIs. An optimization method is proposed to decide the watermarking rules that can minimize the distortion of overall word distributions while maximizing the change of conditional word selections. Theoretically, we prove that it is infeasible for even the savviest attacker (they know how CATER works) to reveal the used watermarks from a large pool of potential word pairs based on statistical inspection. Empirically, we observe that high-order conditions lead to an exponential growth of suspicious (unused) watermarks, making our crafted watermarks more stealthy. In addition, CATER can effectively identify IP infringement under architectural mismatch and cross-domain imitation attacks, with negligible impairments on the generation quality of victim APIs. We envision our work as a milestone for stealthily protecting the IP of text generation APIs.
Scale-invariant Learning by Physics Inversion
Solving inverse problems, such as parameter estimation and optimal control, is a vital part of science. Many experiments repeatedly collect data and rely on machine learning algorithms to quickly infer solutions to the associated inverse problems. We find that state-of-the-art training techniques are not well-suited to many problems that involve physical processes. The highly nonlinear behavior, common in physical processes, results in strongly varying gradients that lead first-order optimizers like SGD or Adam to compute suboptimal optimization directions. We propose a novel hybrid training approach that combines higher-order optimization methods with machine learning techniques. We take updates from a scale-invariant inverse problem solver and embed them into the gradient-descent-based learning pipeline, replacing the regular gradient of the physical process. We demonstrate the capabilities of our method on a variety of canonical physical systems, showing that it yields significant improvements on a wide range of optimization and learning problems.