Goto

Collaborating Authors

 Baluja, Shumeet


Evolving Deeper LLM Thinking

arXiv.org Artificial Intelligence

We explore an evolutionary search strategy for scaling inference time compute in Large Language Models. The proposed approach, Mind Evolution, uses a language model to generate, recombine and refine candidate responses. The proposed approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.


Making Images from Images: Interleaving Denoising and Transformation

arXiv.org Artificial Intelligence

Simply by rearranging the regions of an image, we can create a new image of any subject matter. The definition of regions is user definable, ranging from regularly and irregularly-shaped blocks, concentric rings, or even individual pixels. Our method extends and improves recent work in the generation of optical illusions by simultaneously learning not only the content of the images, but also the parameterized transformations required to transform the desired images into each other. By learning the image transforms, we allow any source image to be pre-specified; any existing image (e.g. the Mona Lisa) can be transformed to a novel subject. We formulate this process as a constrained optimization problem and address it through interleaving the steps of image diffusion with an energy minimization step. Unlike previous methods, increasing the number of regions actually makes the problem easier and improves results. We demonstrate our approach in both pixel and latent spaces. Creative extensions, such as using infinite copies of the source image and employing multiple source images, are also given.


Diversity and Diffusion: Observations on Synthetic Image Distributions with Stable Diffusion

arXiv.org Artificial Intelligence

Recent progress in text-to-image (TTI) systems, such as StableDiffusion, Imagen, and DALL-E 2, have made it possible to create realistic images with simple text prompts. It is tempting to use these systems to eliminate the manual task of obtaining natural images for training a new machine learning classifier. However, in all of the experiments performed to date, classifiers trained solely with synthetic images perform poorly at inference, despite the images used for training appearing realistic. Examining this apparent incongruity in detail gives insight into the limitations of the underlying image generation processes. Through the lens of diversity in image creation vs.accuracy of what is created, we dissect the differences in semantic mismatches in what is modeled in synthetic vs. natural images. This will elucidate the roles of the image-languag emodel, CLIP, and the image generation model, diffusion. We find four issues that limit the usefulness of TTI systems for this task: ambiguity, adherence to prompt, lack of diversity, and inability to represent the underlying concept. We further present surprising insights into the geometry of CLIP embeddings.


Hiding Images in Plain Sight: Deep Steganography

Neural Information Processing Systems

Steganography is the practice of concealing a secret message within another, ordinary, message. Commonly, steganography is used to unobtrusively hide a small message within the noisy regions of a larger image. In this study, we attempt to place a full size color image within another image of the same size. Deep neural networks are simultaneously trained to create the hiding and revealing processes and are designed to specifically work as a pair. The system is trained on images drawn randomly from the ImageNet database, and works well on natural images from a wide variety of sources.


Table-Based Neural Units: Fully Quantizing Networks for Multiply-Free Inference

arXiv.org Machine Learning

In this work, we propose to quantize all parts of standard classification networks and replace the activation-weight--multiply step with a simple table-based lookup. This approach results in networks that are free of floating-point operations and free of multiplications, suitable for direct FPGA and ASIC implementations. It also provides us with two simple measures of per-layer and network-wide compactness as well as insight into the distribution characteristics of activationoutput and weight values. We run controlled studies across different quantization schemes, both fixed and adaptive and, within the set of adaptive approaches, both parametric and model-free. We implement our approach to quantization with minimal, localized changes to the training process, allowing us to benefit from advances in training continuous-valued network architectures. We apply our approach successfully to AlexNet, ResNet, and MobileNet. We show results that are within 1.6% of the reported, non-quantized performance on MobileNet using only 40 entries in our table. This performance gap narrows to zero when we allow tables with 320 entries. Our results give the best accuracies among multiply-free networks.


No Multiplication? No Floating Point? No Problem! Training Networks for Efficient Inference

arXiv.org Machine Learning

A different body of research has focused on quantizing and clustering network weights (Yi et al., 2008; Courbariaux et al., 2016; Rastegari et al., 2016; Deng et al., 2017; Wu et al., 2018). For successful deployment of deep neural networks on highly resource constrained devices (hearing aids, earbuds, wearables), we must simplify the types of operations and the memory/power resources required during inference. Completely avoiding inference-time floating point operations is one of the simplest ways to design networks for these highly constrained environments. By quantizing both our in-network non-linearities and our network weights, we can move to simple, compact networks without floating point operations, without multiplications, and without nonlinear function computations. Our approach allows us to explore the spectrum of possible networks, ranging from fully continuous versions down to networks with bi-level weights and activations. Our results show that quantization can be done with little or no loss of performance on both regression tasks (auto-encoding) and multi-class classification tasks (ImageNet). The memory needed to deploy our quantized networks is less than one-third of the equivalent architecture that uses floating-point operations. The activations in our networks emit only a small number of predefined, quantized values (typically 32) and all of the network's weight are drawn from a small number of unique values (typically 100-1000) found by employing a novel periodic adaptive clustering step during training. Almost all recent neural-network training algorithms rely on gradient-based learning. This has moved the research field away from using discrete-valued inference, with hard thresholds, to smooth, continuous-valued activation functions (Werbos, 1974; Rumelhart et al., 1986). Unfortunately, this causes inference to be done with floating-point operations, making it difficult to deploy on an increasinglylarge set of low-cost, limited-memory, low-power hardware in both commercial (Lane et al., 2015) and research settings (Bourzac, 2017). Avoiding all floating point operations allows the inference network to realize the power-saving gains available with fixed-point processing (Finnerty & Ratigner, 2017).


Learning to Attack: Adversarial Transformation Networks

AAAI Conferences

With the rapidly increasing popularity of deep neural networks for image recognition tasks, a parallel interest in generating adversarial examples to attack the trained models has arisen. To date, these approaches have involved either directly computing gradients with respect to the image pixels or directly solving an optimization on the image pixels. We generalize this pursuit in a novel direction: can a separate network be trained to efficiently attack another fully trained network? We demonstrate that it is possible, and that the generated attacks yield startling insights into the weaknesses of the target network. We call such a network an Adversarial Transformation Network (ATN). ATNs transform any input into an adversarial attack on the target network, while being minimally perturbing to the original inputs and the target network's outputs. Further, we show that ATNs are capable of not only causing the target network to make an error, but can be constructed to explicitly control the type of misclassification made. We demonstrate ATNs on both simple MNIST-digit classifiers and state-of-the-art ImageNet classifiers deployed by Google, Inc.: Inception ResNet-v2.


Hiding Images in Plain Sight: Deep Steganography

Neural Information Processing Systems

Steganography is the practice of concealing a secret message within another, ordinary, message. Commonly, steganography is used to unobtrusively hide a small message within the noisy regions of a larger image. In this study, we attempt to place a full size color image within another image of the same size. Deep neural networks are simultaneously trained to create the hiding and revealing processes and are designed to specifically work as a pair. The system is trained on images drawn randomly from the ImageNet database, and works well on natural images from a wide variety of sources. Beyond demonstrating the successful application of deep learning to hiding images, we carefully examine how the result is achieved and explore extensions. Unlike many popular steganographic methods that encode the secret message within the least significant bits of the carrier image, our approach compresses and distributes the secret image's representation across all of the available bits.


Labeling the Features Not the Samples: Efficient Video Classification with Minimal Supervision

AAAI Conferences

Feature selection is essential for effective visual recognition. We propose an efficient joint classifier learning and feature selection method that discovers sparse, compact representations of input features from a vast sea of candidates, with an almost unsupervised formulation. Our method requires only the following knowledge, which we call the feature sign - whether or not a particular feature has on average stronger values over positive samples than over negatives. We show how this can be estimated using as few as a single labeled training sample per class. Then, using these feature signs, we extend an initial supervised learning problem into an (almost) unsupervised clustering formulation that can incorporate new data without requiring ground truth labels. Our method works both as a feature selection mechanism and as a fully competitive classifier. It has important properties, low computational cost annd excellent accuracy, especially in difficult cases of very limited training data. We experiment on large-scale recognition in video and show superior speed and performance to established feature selection approaches such as AdaBoost, Lasso, greedy forward-backward selection, and powerful classifiers such as SVM.