Natural Language
Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
Few-shot learning is valuable in many real-world applications, but learning a generalizable model without overfitting to the few labeled datapoints is challenging. In this work, we focus on Few-shot Learning with Auxiliary Data (FLAD), a training paradigm that assumes access to auxiliary data during few-shot learning in hopes of improving generalization. Previous works have proposed automated methods for mixing auxiliary and target data, but these methods typically scale linearly (or worse) with the number of auxiliary datasets, limiting their practicality. In this work we relate FLAD to the explore-exploit dilemma that is central to the multi-armed bandit setting and derive algorithms whose computational complexity is independent of the number of auxiliary datasets, allowing us to scale to 100× more auxiliary datasets than prior methods. We propose two algorithms - EXP3-FLAD and UCB1-FLAD - and compare them with prior FLAD methods that either explore or exploit, finding that the combination of exploration and exploitation is crucial. Through extensive experimentation, we find that our methods outperform all pre-existing FLAD methods by 4% and lead to the first 3 billion parameter language models that outperform the 175 billion parameter GPT-3. Overall, our work suggests that the discovery of better, more efficient mixing strategies for FLAD may provide a viable path towards substantially improving generalization in few-shot learning. All of our code is available at github.com/alon-albalak/FLAD.
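To make the bandit framing concrete, here is a minimal sketch (not the authors' implementation) of an EXP3-style sampler that treats each auxiliary dataset as an arm. The reward is a placeholder for some bounded usefulness signal, e.g., how well the sampled auxiliary batch's gradient aligns with the target task; that choice is an assumption here. The point of the design is that the expensive part of each step, drawing and training on an auxiliary batch, involves only the one sampled dataset, rather than evaluating every dataset as linear-cost mixing methods do.

```python
import math
import random

class EXP3DatasetSampler:
    """EXP3-style bandit sampler over auxiliary datasets (illustrative sketch).

    Each auxiliary dataset is one arm; `reward` should be a bounded signal
    of how useful the sampled dataset's batch was for the target task
    (e.g., gradient alignment) -- an assumption, not the paper's exact recipe.
    """

    def __init__(self, num_datasets: int, gamma: float = 0.1):
        self.k = num_datasets
        self.gamma = gamma                  # exploration rate
        self.weights = [1.0] * num_datasets

    def _probs(self):
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def sample(self) -> int:
        """Pick which auxiliary dataset to draw the next training batch from."""
        self.last_probs = self._probs()     # cached for the importance weight
        return random.choices(range(self.k), weights=self.last_probs)[0]

    def update(self, arm: int, reward: float):
        """Importance-weighted exponential update for the chosen arm only."""
        estimated = reward / self.last_probs[arm]
        self.weights[arm] *= math.exp(self.gamma * estimated / self.k)
```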
Local Hyper-Flow Diffusion
Recently, hypergraphs have attracted a lot of attention due to their ability to capture complex relations among entities. This surge of interest has produced data of increasing size and complexity that exhibit interesting small-scale and local structure, e.g., small-scale communities and localized node rankings around a given set of seed nodes. Popular and principled ways to capture this local structure are the local hypergraph clustering problem and the related seed set expansion problem. In this work, we propose the first local diffusion method that achieves an edge-size-independent Cheeger-type guarantee for the problem of local hypergraph clustering, while applying to a rich class of higher-order relations that covers a number of previously studied special cases.
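For context, such guarantees are typically stated in terms of hypergraph conductance; the sketch below uses the standard definition, which the paper's cut function generalizes, so the notation is the common convention rather than theirs:

```latex
\Phi(S) \;=\; \frac{\operatorname{cut}(S)}{\min\{\operatorname{vol}(S),\ \operatorname{vol}(V \setminus S)\}},
\qquad
\operatorname{vol}(S) \;=\; \sum_{v \in S} d_v,
```

where cut(S) sums the weights of hyperedges with at least one node on each side of the partition. An edge-size-independent Cheeger-type guarantee bounds the conductance of the returned cluster without any dependence on the maximum hyperedge size.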
Rethinking Score Distillation as a Bridge Between Image Distributions
David McAllister, Songwei Ge, Jia-Bin Huang, David W. Jacobs
Score distillation sampling (SDS) has proven to be an important tool, enabling the use of large-scale diffusion priors for tasks operating in data-poor domains. Unfortunately, SDS has a number of characteristic artifacts that limit its usefulness in general-purpose applications. In this paper, we make progress toward understanding the behavior of SDS and its variants by viewing them as solving an optimal-cost transport path from a source distribution to a target distribution. Under this new interpretation, these methods seek to transport corrupted images (source) to the natural image distribution (target). We argue that current methods' characteristic artifacts are caused by (1) linear approximation of the optimal path and (2) poor estimates of the source distribution. We show that calibrating the text conditioning of the source distribution can produce high-quality generation and translation results with little extra overhead. Our method can be easily applied across many domains, matching or beating the performance of specialized methods. We demonstrate its utility in text-to-2D, text-based NeRF optimization, translating paintings to real images, optical illusion generation, and 3D sketch-to-real. We compare our method to existing approaches for score distillation sampling and show that it can produce high-frequency details with realistic colors.
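For reference, the quantity these methods reinterpret is the standard score distillation gradient introduced with DreamFusion; the notation below follows the usual convention and is not specific to this paper:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,
      \bigl(\hat{\epsilon}_\phi(\mathbf{x}_t;\, y, t) - \epsilon\bigr)\,
      \tfrac{\partial \mathbf{x}}{\partial \theta} \right],
\qquad
\mathbf{x}_t = \alpha_t\, \mathbf{x}(\theta) + \sigma_t\, \epsilon,
\quad \epsilon \sim \mathcal{N}(0, \mathbf{I}),
```

where \(\mathbf{x}(\theta)\) is the rendered image, \(\hat{\epsilon}_\phi\) is the diffusion model's noise prediction conditioned on text \(y\), and \(w(t)\) is a timestep weighting. Under the transport view above, the residual \(\hat{\epsilon}_\phi - \epsilon\) is a (linearized) step along a path from the corrupted source distribution toward the natural image distribution.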
I just watched Gmail generate AI responses for me - and they were scarily accurate
The Google I/O keynote took place earlier this week, and the company took the stage to unveil new features across all of its product offerings. This included AI upgrades to the Google Workspace suite of applications, which millions of users rely on every day to get their work done, including Google Docs, Meet, Slides, Gmail, and Vids. The features unveiled this year focused on practicality. They embed AI features into the Google apps you already use every day to speed up your daily workflow by performing tedious and time-consuming tasks, such as cleaning out your inbox. Everyone can relate to being bombarded with emails.
Assessing Social and Intersectional Biases in Contextualized Word Representations
Social bias in machine learning has drawn significant attention, with work ranging from demonstrations of bias in a multitude of applications, curating definitions of fairness for different contexts, to developing algorithms to mitigate bias. In natural language processing, gender bias has been shown to exist in context-free word embeddings. Recently, contextual word representations have outperformed word embeddings in several downstream NLP tasks. These word representations are conditioned on their context within a sentence, and can also be used to encode the entire sentence. In this paper, we analyze the extent to which state-of-the-art models for contextual word representations, such as BERT and GPT-2, encode biases with respect to gender, race, and intersectional identities. Towards this, we propose assessing bias at the contextual word level. This novel approach captures the contextual effects of bias missing in context-free word embeddings, yet avoids confounding effects that underestimate bias at the sentence encoding level. We demonstrate evidence of bias at the corpus level, find varying evidence of bias in embedding association tests, show in particular that racial bias is strongly encoded in contextual word models, and observe that bias effects for intersectional minorities are exacerbated beyond their constituent minority identities. Further, evaluating bias effects at the contextual word level captures biases that are not captured at the sentence level, confirming the need for our novel approach.
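As a concrete illustration of the kind of embedding association test referenced above, here is a minimal WEAT-style sketch adapted to contextual vectors; the function names and inputs are illustrative, not the paper's code. Each input vector would be, e.g., a BERT hidden state for the target word in one sentence context.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Difference in mean cosine similarity of word vector w to sets A vs. B."""
    return (np.mean([cosine(w, a) for a in A])
            - np.mean([cosine(w, b) for b in B]))

def weat_effect_size(X, Y, A, B):
    """WEAT effect size between target sets X, Y and attribute sets A, B.

    X, Y, A, B are lists of (contextual) word vectors; a large positive
    value means X is more strongly associated with A than Y is.
    """
    x_assoc = [association(x, A, B) for x in X]
    y_assoc = [association(y, A, B) for y in Y]
    pooled = np.std(x_assoc + y_assoc, ddof=1)
    return (np.mean(x_assoc) - np.mean(y_assoc)) / pooled
```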
Unpaired Image-to-Image Translation with Density Changing Regularization
Unpaired image-to-image translation aims to map an input image to another domain such that the output looks like an image from the target domain while important semantic information is preserved. Inferring the optimal mapping with unpaired data is impossible without making any assumptions. In this paper, we make a density changing assumption: image patches of high probability density in one domain should be mapped to patches of high probability density in the other domain. We then propose an efficient way to enforce this assumption: we train flows as density estimators and penalize the variance of density changes. Despite its simplicity, our method achieves the best performance on benchmark datasets and needs only 56-86% of the training time of the existing state-of-the-art method. The training and evaluation code are available at https://github.com/Mid-Push/
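A minimal sketch of how the variance penalty described above might look, assuming per-patch log-densities have already been produced by flows trained on each domain (this interface is an assumption for illustration, not the authors' code):

```python
import torch

def density_change_penalty(logp_source: torch.Tensor,
                           logp_target: torch.Tensor) -> torch.Tensor:
    """Penalize the variance of per-patch density changes.

    logp_source: log-densities of the input patches under a flow trained
                 on the source domain, shape (num_patches,).
    logp_target: log-densities of the translated patches under a flow
                 trained on the target domain, same shape.
    If high-density patches map to high-density patches, the change in
    log-density should be roughly constant across patches, so we drive
    its variance toward zero.
    """
    change = logp_target - logp_source
    return change.var()
```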
Inside OpenAI's Empire
OpenAI started as a non-profit dedicated to building safe A.I. Now, they're obsessed with building artificial general intelligence by any means necessary - even if they don't quite know what that is.
Masked Pre-training Enables Universal Zero-shot Denoiser
Yi Jin
In this work, we observe that a model trained on vast general images with a masking strategy naturally embeds knowledge of their distribution, and thus spontaneously attains the underlying potential for strong image denoising. Based on this observation, we propose a novel zero-shot denoising paradigm, i.e., Masked Pre-train then Iterative fill (MPI). MPI first trains a model via masking and then employs the pre-trained weights for high-quality zero-shot image denoising on a single noisy image. Concretely, MPI comprises two key procedures: 1) Masked Pre-training trains the model to reconstruct massive natural images under random masking, yielding generalizable representations and the potential for valid zero-shot denoising on images with varying noise degradation and even distinct image types; 2) Iterative filling then applies the pre-trained weights to a single noisy image, repeatedly masking it and assembling the reconstructions of the masked regions into the denoised output.
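The following is a minimal sketch of the kind of iterative fill loop this paradigm describes, assuming a pre-trained masked-reconstruction model with signature model(masked_image, mask); the interface and the hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import torch

@torch.no_grad()
def iterative_fill(model, noisy, steps=400, mask_ratio=0.3):
    """Zero-shot denoising sketch in the spirit of masked pre-train then fill.

    Assumes `model(masked_image, mask)` returns a full reconstruction of the
    image; this interface, step count, and mask ratio are illustrative.
    """
    accum = torch.zeros_like(noisy)
    counts = torch.zeros_like(noisy)
    for _ in range(steps):
        mask = (torch.rand_like(noisy) < mask_ratio).float()  # 1 = hidden pixel
        pred = model(noisy * (1.0 - mask), mask)
        # Keep predictions only where pixels were hidden: the network never
        # saw the noise there, so its reconstruction is a denoised estimate.
        accum += pred * mask
        counts += mask
    # Average the per-pixel estimates; fall back to the noisy value for any
    # pixel that was never masked across the loop.
    return torch.where(counts > 0, accum / counts.clamp(min=1.0), noisy)
```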
Reformulating Zero-shot Action Recognition for Multi-label Actions (Supplementary Material)
Standard video models expect frames with equal height and width, so we crop a square region around the actor and resize it to the network-specific dimensions (112 × 112). We present some examples of AVA video frames with their annotations, as well as the generated crops, in Figure 1. This square crop can cause multiple actors to appear within one clip, as seen in the second example, but it ensures that the aspect ratio of the person is not altered, which is necessary because this is the manner in which the video model was trained.
Figure 1: Examples of original ground-truth bounding boxes (left) in the AVA dataset, with the cropped actors on the right.
For PS-ZSAR, prediction confidences are obtained from the softmax probabilities output by our pair-wise similarity function.
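A minimal sketch of the square crop-and-resize described above; OpenCV is an illustrative choice for resizing, and the box format and helper name are assumptions, not the authors' code:

```python
import numpy as np
import cv2

def square_actor_crop(frame, box, out_size=112):
    """Crop a square region around an actor box and resize it.

    frame: H x W x 3 image array; box: (x1, y1, x2, y2) in pixels.
    The square side is the larger box dimension, so the actor's aspect
    ratio is preserved, though neighboring actors may fall inside it.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    h, w = frame.shape[:2]
    side = int(min(max(x2 - x1, y2 - y1), w, h))  # clamp to frame size
    # Center the square on the box, shifting it to stay inside the frame.
    left = int(np.clip(cx - side / 2.0, 0, w - side))
    top = int(np.clip(cy - side / 2.0, 0, h - side))
    crop = frame[top:top + side, left:left + side]
    return cv2.resize(crop, (out_size, out_size))
```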
Bringing Image Structure to Video via Frame-Clip Consistency of Object Tokens
Recent action recognition models have achieved impressive results by integrating objects, their locations, and their interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. On the other hand, one often does have access to a small set of annotated images, either within or outside the domain of interest. Here we ask how such images can be leveraged for downstream video understanding tasks. We propose a learning framework, StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images, available only during training, can improve a video model.