Precision-Recall Divergence Optimization for Generative Modeling with GANs and Normalizing Flows
Achieving a balance between image quality (precision) and diversity (recall) is a significant challenge in the domain of generative models. Current state-of-the-art models primarily rely on optimizing heuristics, such as the Fréchet Inception Distance. While recent developments have introduced principled methods for evaluating precision and recall, they have yet to be successfully integrated into the training of generative models. Our main contribution is a novel training method for generative models, such as Generative Adversarial Networks and Normalizing Flows, which explicitly optimizes a user-defined trade-off between precision and recall. More precisely, we show that achieving a specified precision-recall trade-off corresponds to minimizing a unique f-divergence from a family we call the PR-divergences. Conversely, any f-divergence can be written as a linear combination of PR-divergences and corresponds to a weighted precision-recall trade-off. Through comprehensive evaluations, we show that our approach improves the performance of existing state-of-the-art models, such as BigGAN, in terms of either precision or recall when tested on datasets such as ImageNet.
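For readers unfamiliar with the notation, a minimal LaTeX sketch of the objects involved; the f-divergence definition is standard, while the decomposition line and the weighting measure Γ_f are schematic notation of ours, not the paper's exact construction:

```latex
% Standard f-divergence between target P and model Q,
% for convex f with f(1) = 0:
D_f(P \,\|\, Q) \;=\; \mathbb{E}_{x \sim Q}\!\left[\, f\!\Big(\tfrac{dP}{dQ}(x)\Big) \right].
% The abstract's claim, schematically: each precision-recall trade-off
% lambda > 0 picks out one member D_{lambda-PR} of this family, and any
% f-divergence decomposes as a weighted combination over lambda:
D_f(P \,\|\, Q) \;=\; \int_0^{\infty} D_{\lambda\text{-}PR}(P \,\|\, Q)\, \mathrm{d}\Gamma_f(\lambda),
% where the weighting measure Gamma_f (notation ours) depends on f.
```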
Nonconvex Low-Rank Tensor Completion from Noisy Data
Changxiao Cai, Gen Li, H. Vincent Poor, Yuxin Chen
We study a completion problem of broad practical interest: the reconstruction of a low-rank symmetric tensor from highly incomplete and randomly corrupted observations of its entries. While a variety of prior work has been dedicated to this problem, prior algorithms either are computationally too expensive for large-scale applications, or come with sub-optimal statistical guarantees. Focusing on "incoherent" and well-conditioned tensors of a constant CP rank, we propose a two-stage nonconvex algorithm -- (vanilla) gradient descent following a rough initialization -- that achieves the best of both worlds. Specifically, the proposed nonconvex algorithm faithfully completes the tensor and retrieves individual tensor factors within nearly linear time, while at the same time enjoying near-optimal statistical guarantees (i.e.
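As a rough illustration of the two-stage approach, a schematic LaTeX sketch under the symmetric rank-r CP model; the loss, scaling, and initialization shown here are plausible forms, not necessarily the paper's exact choices:

```latex
% Symmetric CP model: observe a random subset Omega of the entries of
% T^* = \sum_{s=1}^{r} (u_s^*)^{\otimes 3}, each corrupted by noise.
% A natural nonconvex least-squares objective over U \in R^{d \times r}:
f(U) \;=\; \sum_{(i,j,k) \in \Omega} \Big( \textstyle\sum_{s=1}^{r} U_{is} U_{js} U_{ks} \;-\; T_{ijk} \Big)^{2}.
% Stage 1: a rough (e.g. spectral) initialization U^{0};
% Stage 2: vanilla gradient descent,
U^{t+1} \;=\; U^{t} \;-\; \eta \,\nabla f(U^{t}), \qquad t = 0, 1, \ldots
```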
A Experimental Protocol
We selected hyperparameters using the four disjoint validation corruptions provided with CIFAR-10-C and ImageNet-C [12]. As the other benchmarks are test sets only and do not provide validation sets, we reused the hyperparameters found on the corruption validation sets and did not perform any additional tuning. We considered the following hyperparameters when performing a grid search. Beyond the learning rate and the number of gradient steps, we also evaluated a simple "threshold": performing adaptation only when the marginal entropy was greater than 50% of its maximum value (log 1000 for ImageNet-C), though we found that this resulted in slightly worse validation performance. We also considered different values of the prior strength N for single-point BN adaptation, and we found that N = 16 performed best on the validation sets, as suggested by Schneider et al. [40].
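A minimal sketch of the entropy-threshold rule described above, assuming PyTorch; the helper name and batching convention are ours, not the paper's:

```python
import torch
import torch.nn.functional as F

def should_adapt(logits: torch.Tensor, num_classes: int, frac: float = 0.5) -> bool:
    """Hypothetical helper for the 'threshold' heuristic: adapt only when
    the marginal entropy of the model's predictions exceeds `frac` of the
    maximum possible entropy, log(num_classes) (log 1000 for ImageNet-C)."""
    # logits: (batch_or_augmentations, num_classes); average to the marginal
    probs = F.softmax(logits, dim=-1).mean(dim=0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(num_classes)))
    return bool(entropy > frac * max_entropy)
```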
Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context
Aditya Kanade (Microsoft Research, Bangalore, India), Shuvendu K. Lahiri (Microsoft Research)
Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating. Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding.
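A schematic of the masking idea, assuming a hypothetical `monitor` object whose `valid_continuations` method wraps the static analysis; the real MGD interface and token-level handling are more involved:

```python
import torch

def monitor_guided_step(logits, tokenizer, monitor, prefix):
    """Schematic decoding step: a static-analysis `monitor` (interface
    hypothetical) returns identifiers valid at this point, e.g.
    type-correct members of a receiver object. Tokens that cannot start
    any valid identifier are masked out before the next token is chosen."""
    allowed = monitor.valid_continuations(prefix)  # e.g. from IDE static analysis
    mask = torch.full_like(logits, float("-inf"))
    for ident in allowed:
        first_token = tokenizer.encode(ident)[0]   # first subtoken of each identifier
        mask[first_token] = 0.0
    return (logits + mask).argmax().item()         # greedy choice among permitted tokens
```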
Clip-OGD: An Experimental Design for Adaptive Neyman Allocation
From clinical development of cancer therapies to investigations into partisan bias, adaptive sequential designs have become an increasingly popular method for causal inference, as they offer the possibility of improved precision over their non-adaptive counterparts. However, even in simple settings (e.g. two treatments) the extent to which adaptive designs can improve precision is not sufficiently well understood. In this work, we study the problem of Adaptive Neyman Allocation in a design-based potential outcomes framework, where the experimenter seeks to construct an adaptive design which is nearly as efficient as the optimal (but infeasible) non-adaptive Neyman design, which has access to all potential outcomes. Motivated by connections to online optimization, we propose Neyman Ratio and Neyman Regret as two (equivalent) performance measures of adaptive designs for this problem.
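For concreteness, the classical two-arm Neyman allocation below is standard; the regret expression is only a schematic form with notation of ours, not the paper's precise definition:

```latex
% Classical (non-adaptive) Neyman allocation for two arms: treat each
% unit with the fixed probability
p^{\ast} \;=\; \frac{\sigma_1}{\sigma_0 + \sigma_1},
% where sigma_1, sigma_0 are the standard deviations of the treatment
% and control potential outcomes; this minimizes the variance of the
% difference-in-means estimator among all fixed allocations. A Neyman
% Regret for an adaptive design with round-t treatment probabilities p_t
% then compares its cumulative per-round variance terms f_t to the
% infeasible fixed optimum (schematic form, notation ours):
\mathcal{R}_T \;=\; \sum_{t=1}^{T} f_t(p_t) \;-\; \min_{p \in (0,1)} \sum_{t=1}^{T} f_t(p).
```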
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing objects from an open set of categories in diverse environments. One way to address this challenge is to leverage multi-modal models, such as CLIP, to provide image and text features in a shared embedding space, which effectively bridges the gap between closed-vocabulary and open-vocabulary recognition. Hence, existing methods often adopt a two-stage framework to tackle the problem, where the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process involves extracting features from raw images multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline, but also remarkably yields a better accuracy-cost trade-off. The resulting single-stage system, called FC-CLIP, benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to a larger input resolution than the one used during contrastive image-text pretraining.
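A minimal sketch of the single-stage idea in PyTorch-style code; all module names (`frozen_clip_backbone`, `mask_head`) and shapes are ours, not the released FC-CLIP implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def encode_once(frozen_clip_backbone, image):
    """Single-stage idea: run the (frozen) convolutional CLIP backbone
    exactly once per image; both heads reuse these features, instead of
    re-encoding every predicted mask as in two-stage pipelines."""
    return frozen_clip_backbone(image)  # dense feature map

def fc_clip_forward(frozen_clip_backbone, mask_head, image, text_embeddings):
    """Schematic FC-CLIP-style forward pass (module names are ours)."""
    feats = encode_once(frozen_clip_backbone, image)
    masks, mask_embeds = mask_head(feats)        # class-agnostic masks + per-mask embeddings
    mask_embeds = F.normalize(mask_embeds, dim=-1)
    logits = mask_embeds @ text_embeddings.T     # open-vocab classification in CLIP text space
    return masks, logits
```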