The recent empirical success of unsupervised cross-domain mapping algorithms, between two domains that share common characteristics, is not well-supported by theoretical justifications. This lacuna is especially troubling, given the clear ambiguity in such mappings. We work with the adversarial training method called the Wasserstein GAN and derive a novel generalization bound, which limits the risk between the learned mapping $h$ and the target mapping $y$, by a sum of two terms: (i) the risk between $h$ and the most distant alternative mapping that was learned by the same cross-domain mapping algorithm, and (ii) the minimal Wasserstein GAN divergence between the target domain and the domain obtained by applying a hypothesis $h^*$ on the samples of the source domain, where $h^*$ is a hypothesis selected by the same algorithm. The bound is directly related to Occam's razor and encourages the selection of the minimal architecture that supports a small Wasserstein GAN divergence. The bound leads to multiple algorithmic consequences, including a method for hyperparameters selection and for an early stopping in cross-domain mapping GANs. We also demonstrate a novel capability for unsupervised learning of estimating confidence in the mapping of every specific sample. Lastly, we show how non-minimal architectures can be effectively trained by an inverted knowledge distillation, in which a minimal architecture is used to train a larger one, leading to higher quality outputs.
We present a method to represent 3D objects using higher order functions, where an object is encoded directly into the weights and biases of a small `mapping' network by a larger encoder network. This mapping network can be used to reconstruct 3D objects by applying its encoded transformation to points sampled from a simple canonical space. We first demonstrate that an encoder network can produce mappings that reconstruct objects from single images more accurately than state of the art point set reconstruction methods. Next, we show that our method yields meaningful gains for robot motion planning problems that use this object representation for collision avoidance. We also demonstrate that our formulation allows for a novel method of object interpolation in a latent function space, where we compose the roots of the reconstruction functions for various objects to generate new, coherent objects. Finally, we demonstrate the coding efficiency of our approach: encoding objects directly as a neural network is highly parameter efficient when compared with object representations that encode the object of interest as a latent vector `codeword'. Our smallest reconstruction network has only about 7000 parameters and shows reconstruction quality generally better than state-of-the-art codeword-based object representation architectures with millions of parameters.
Machine understanding of complex images is a key goal of artificial intelligence. One challenge underlying this task is that visual scenes contain multiple inter-related objects, and that global context plays an important role in interpreting the scene. A natural modeling framework for capturing such effects is structured prediction, which optimizes over complex labels, while modeling within-label interactions. However, it is unclear what principles should guide the design of a structured prediction model that utilizes the power of deep learning components. Here we propose a design principle for such architectures that follows from a natural requirement of permutation invariance.