Marwood, David
Making Images from Images: Interleaving Denoising and Transformation
Baluja, Shumeet, Marwood, David, Baluja, Ashwin
Simply by rearranging the regions of an image, we can create a new image of any subject matter. The definition of regions is user-definable, ranging from regularly- and irregularly-shaped blocks to concentric rings, or even individual pixels. Our method extends and improves recent work in the generation of optical illusions by simultaneously learning not only the content of the images, but also the parameterized transformations required to transform the desired images into each other. By learning the image transforms, we allow any source image to be pre-specified; any existing image (e.g. the Mona Lisa) can be transformed to a novel subject. We formulate this process as a constrained optimization problem and address it by interleaving the steps of image diffusion with an energy minimization step. Unlike previous methods, increasing the number of regions actually makes the problem easier and improves results. We demonstrate our approach in both pixel and latent spaces. Creative extensions, such as using infinite copies of the source image and employing multiple source images, are also given.
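A minimal sketch of the interleaving idea described above is given below. It is not the authors' implementation: the `denoise_step` callable, the equal-size square blocks, the L2 region-matching energy, and the Hungarian-algorithm solve are all illustrative assumptions chosen to make the alternation between diffusion and energy minimization concrete.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def square_regions(shape, block):
    """Partition an image of shape (H, W, C) into equal square blocks."""
    h, w = shape[0], shape[1]
    return [(slice(i, i + block), slice(j, j + block))
            for i in range(0, h, block) for j in range(0, w, block)]

def project_onto_rearrangement(source, estimate, regions):
    """Energy-minimization step: find the one-to-one assignment of source
    regions to positions in `estimate` that minimizes total L2 error, then
    rebuild the image purely from (rearranged) source regions."""
    cost = np.array([[np.sum((source[rs] - estimate[rt]) ** 2)
                      for rt in regions] for rs in regions])
    src_idx, tgt_idx = linear_sum_assignment(cost)   # optimal permutation
    out = np.empty_like(estimate)
    for s, t in zip(src_idx, tgt_idx):
        out[regions[t]] = source[regions[s]]
    return out

def interleaved_sampling(source, denoise_step, num_steps, regions):
    """Alternate a reverse-diffusion update with the projection above, so the
    final sample is exactly a rearrangement of the source image's regions."""
    x = np.random.randn(*source.shape)
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)                        # hypothetical diffusion update
        x = project_onto_rearrangement(source, x, regions)
    return x
```

The same loop works in a latent space by applying the projection to latent regions instead of pixel blocks; the key design choice is that the constraint (being a rearrangement of the source) is re-imposed at every denoising step rather than only at the end.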
Diversity and Diffusion: Observations on Synthetic Image Distributions with Stable Diffusion
Marwood, David, Baluja, Shumeet, Alon, Yair
Recent progress in text-to-image (TTI) systems, such as Stable Diffusion, Imagen, and DALL-E 2, has made it possible to create realistic images with simple text prompts. It is tempting to use these systems to eliminate the manual task of obtaining natural images for training a new machine learning classifier. However, in all of the experiments performed to date, classifiers trained solely with synthetic images perform poorly at inference, despite the images used for training appearing realistic. Examining this apparent incongruity in detail gives insight into the limitations of the underlying image generation processes. Through the lens of diversity in image creation vs. accuracy of what is created, we dissect the differences in semantic mismatches in what is modeled in synthetic vs. natural images. This will elucidate the roles of the image-language model, CLIP, and the image generation model, diffusion. We find four issues that limit the usefulness of TTI systems for this task: ambiguity, adherence to prompt, lack of diversity, and inability to represent the underlying concept. We further present surprising insights into the geometry of CLIP embeddings.
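One of the four issues above, lack of diversity, can be probed with a simple diagnostic in CLIP embedding space. The sketch below is an illustrative measure, not the paper's methodology; the variables `synthetic_emb` and `natural_emb` are assumed to be (N, D) arrays of CLIP image embeddings obtained from any CLIP model for images of the same class.

```python
import numpy as np

def mean_pairwise_cosine(embeddings):
    """Average pairwise cosine similarity of a set of CLIP image embeddings.
    Higher values indicate the images cluster tightly (lower diversity)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T
    n = len(e)
    return (sims.sum() - n) / (n * (n - 1))          # exclude self-similarity

# Hypothetical usage: a markedly higher score for synthetic_emb than for
# natural_emb would indicate the TTI system covers a narrower slice of the
# class than natural photographs do.
# diversity_gap = mean_pairwise_cosine(synthetic_emb) - mean_pairwise_cosine(natural_emb)
```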
Table-Based Neural Units: Fully Quantizing Networks for Multiply-Free Inference
Covell, Michele, Marwood, David, Baluja, Shumeet, Johnston, Nick
In this work, we propose to quantize all parts of standard classification networks and replace the activation-weight multiply step with a simple table-based lookup. This approach results in networks that are free of floating-point operations and free of multiplications, suitable for direct FPGA and ASIC implementations. It also provides us with two simple measures of per-layer and network-wide compactness as well as insight into the distribution characteristics of activation-output and weight values. We run controlled studies across different quantization schemes, both fixed and adaptive and, within the set of adaptive approaches, both parametric and model-free. We implement our approach to quantization with minimal, localized changes to the training process, allowing us to benefit from advances in training continuous-valued network architectures. We apply our approach successfully to AlexNet, ResNet, and MobileNet. We show results that are within 1.6% of the reported, non-quantized performance on MobileNet using only 40 entries in our table. This performance gap narrows to zero when we allow tables with 320 entries. Our results give the best accuracies among multiply-free networks.
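The table-based lookup can be sketched as follows. This is a minimal illustration under assumed sizes (32 activation levels, 100 weight values) and a dense-layer setting; it is not the paper's exact configuration, and in a real FPGA/ASIC deployment the table entries would themselves be fixed-point rather than floats.

```python
import numpy as np

NUM_ACT_LEVELS = 32      # quantized activation values a layer may emit (assumed)
NUM_WEIGHT_BINS = 100    # distinct weight values after clustering (assumed)

act_levels = np.linspace(0.0, 6.0, NUM_ACT_LEVELS)      # e.g. a quantized ReLU6 range
weight_vals = np.linspace(-1.0, 1.0, NUM_WEIGHT_BINS)   # shared weight values

# Precompute every possible activation x weight product once, offline.
product_table = np.outer(act_levels, weight_vals)       # (32, 100) lookup table

def table_based_dense(act_idx, weight_idx):
    """Dense layer using only table lookups and additions.
    act_idx:    (batch, in_dim)   integer indices into act_levels
    weight_idx: (in_dim, out_dim) integer indices into weight_vals
    """
    # Look up precomputed products instead of multiplying at inference time.
    products = product_table[act_idx[:, :, None], weight_idx[None, :, :]]
    return products.sum(axis=1)                          # accumulate with adds only
```

Because both operands are drawn from small, fixed sets, the full product table is tiny and the inference path contains no multiplications at all.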
No Multiplication? No Floating Point? No Problem! Training Networks for Efficient Inference
Baluja, Shumeet, Marwood, David, Covell, Michele, Johnston, Nick
For successful deployment of deep neural networks on highly resource-constrained devices (hearing aids, earbuds, wearables), we must simplify the types of operations and the memory/power resources required during inference. Almost all recent neural-network training algorithms rely on gradient-based learning, which has moved the research field away from discrete-valued inference with hard thresholds toward smooth, continuous-valued activation functions (Werbos, 1974; Rumelhart et al., 1986). Unfortunately, this means inference is done with floating-point operations, making it difficult to deploy on an increasingly large set of low-cost, limited-memory, low-power hardware in both commercial (Lane et al., 2015) and research settings (Bourzac, 2017). A different body of research has focused on quantizing and clustering network weights (Yi et al., 2008; Courbariaux et al., 2016; Rastegari et al., 2016; Deng et al., 2017; Wu et al., 2018). Completely avoiding inference-time floating-point operations is one of the simplest ways to design networks for these highly constrained environments, and it allows the inference network to realize the power-saving gains available with fixed-point processing (Finnerty & Ratigner, 2017). By quantizing both our in-network non-linearities and our network weights, we can move to simple, compact networks without floating-point operations, without multiplications, and without non-linear function computations. Our approach allows us to explore the spectrum of possible networks, ranging from fully continuous versions down to networks with bi-level weights and activations. The activations in our networks emit only a small number of predefined, quantized values (typically 32), and all of the network's weights are drawn from a small number of unique values (typically 100-1000) found by employing a novel periodic adaptive clustering step during training. Our results show that quantization can be done with little or no loss of performance on both regression tasks (auto-encoding) and multi-class classification tasks (ImageNet). The memory needed to deploy our quantized networks is less than one-third of the equivalent architecture that uses floating-point operations.
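The periodic adaptive clustering step mentioned above can be illustrated with the sketch below. The use of k-means, the cluster count of 256, and the `cluster_every` schedule are assumptions for exposition; the paper's exact clustering procedure and schedule may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(weights, num_values=256):
    """Snap a weight tensor onto `num_values` shared values via k-means,
    so every weight is drawn from a small set of unique values."""
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=num_values, n_init=1).fit(flat)
    quantized = km.cluster_centers_[km.labels_].reshape(weights.shape)
    return quantized, km.cluster_centers_.ravel()

# Hypothetical training loop: every `cluster_every` steps, replace the
# continuous weights with their clustered versions, then continue
# gradient-based training so the network adapts to the discrete values.
# for step in range(total_steps):
#     train_one_step(model)                       # standard SGD/Adam update
#     if step % cluster_every == 0:
#         for layer in model.layers:
#             layer.W, _ = cluster_weights(layer.W, num_values=256)
```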