Cohen, Scott
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
Slyman, Eric, Lee, Stefan, Cohen, Scott, Kafle, Kushal
Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.
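As a rough illustration of the kind of pipeline this abstract describes, the sketch below implements a generic SemDeDup-style pruning loop: cluster L2-normalized embeddings, then greedily drop near-duplicates within each cluster whose cosine similarity to an already-kept example exceeds a threshold. The cluster count, the threshold, and the optional priority hook (where a fairness-aware selection rule in the spirit of FairDeDup could be plugged in) are illustrative assumptions, not the papers' actual settings.

    # Minimal sketch of a SemDeDup-style pruning loop with a pluggable selection
    # hook. n_clusters, sim_thresh, and the `priority` argument are illustrative
    # assumptions, not settings from the paper.
    import numpy as np
    from sklearn.cluster import KMeans

    def semantic_dedup(embeddings, n_clusters=100, sim_thresh=0.95, priority=None):
        """Return indices of examples kept after embedding-space deduplication.

        `priority(member_indices)` reorders a cluster before the greedy pass;
        a fairness-aware ordering (the FairDeDup idea) would be supplied there.
        """
        # L2-normalize so dot products are cosine similarities.
        emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)

        kept = []
        for c in range(n_clusters):
            members = np.where(labels == c)[0]
            order = priority(members) if priority is not None else members
            cluster_kept = []
            for i in order:
                # Drop i if it is a near-duplicate of anything already kept in this cluster.
                if cluster_kept and float(np.max(emb[cluster_kept] @ emb[i])) >= sim_thresh:
                    continue
                cluster_kept.append(i)
            kept.extend(cluster_kept)
        return np.array(sorted(kept))

With priority=None, the first member encountered in each duplicate group survives; choosing that survivor differently is exactly where a fairness-aware rule would act.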
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hua, Hang, Shi, Jing, Kafle, Kushal, Jenni, Simon, Zhang, Daoan, Collomosse, John, Cohen, Scott, Luo, Jiebo
Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite their impressive ability to perform complex reasoning, current VLMs often struggle to effectively and precisely capture the compositional information on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the VLMs' compositionality for aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate the models' performance on this new task, we propose a new evaluation metric named ITM-IoU, for which our experiments show a high correlation to human evaluation. In addition, we provide a comprehensive experimental analysis of existing mainstream VLMs, covering both fully supervised learning and in-context learning settings. We find that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models (e.g., GPT-4V, Gemini Pro Vision) with strong abilities to perform multimodal in-context learning are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction.
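To make the task format concrete, here is a minimal sketch of how a prediction for one image-text pair might be represented and scored with a generic intersection-over-union over mismatch tuples. The Mismatch fields, the example aspect classes, and the exact-match scoring below are assumptions made only for illustration; the paper's ITM-IoU metric is defined in more detail than this sketch.

    # Hypothetical representation of one FineMatch-style prediction and a generic
    # IoU over mismatch tuples. Field names, example aspect classes, and the
    # exact-match scoring are assumptions, not the paper's definitions.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Mismatch:
        phrase: str        # mismatched aspect phrase found in the caption
        aspect_class: str  # assumed aspect taxonomy label
        correction: str    # proposed replacement that matches the image

    def tuple_iou(predicted: set, reference: set) -> float:
        """Intersection-over-union over exact mismatch-tuple matches."""
        if not predicted and not reference:
            return 1.0  # a clean pair (0 mismatches) predicted as clean
        return len(predicted & reference) / len(predicted | reference)

    # One correct detection, one missed mismatch, one spurious prediction -> 1/3.
    reference = {Mismatch("red car", "attribute", "blue car"),
                 Mismatch("two dogs", "counting", "three dogs")}
    predicted = {Mismatch("red car", "attribute", "blue car"),
                 Mismatch("park", "entity", "beach")}
    print(tuple_iou(predicted, reference))  # 0.333...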
Latent Feature-Guided Diffusion Models for Shadow Removal
Mei, Kangfu, Figueroa, Luis, Lin, Zhe, Ding, Zhihong, Cohen, Scott, Patel, Vishal M.
Recovering textures under shadows has remained a challenging problem due to the difficulty of inferring shadow-free scenes from shadow images. In this paper, we propose the use of diffusion models, as they offer a promising approach to gradually refine the details of shadow regions during the diffusion process. Our method improves this process by conditioning on a learned latent feature space that inherits the characteristics of shadow-free images, thus avoiding the limitation of conventional methods that condition on degraded images only. Additionally, we propose to alleviate potential local optima during training by fusing noise features with the diffusion network. We demonstrate the effectiveness of our approach, which outperforms existing methods on shadow removal benchmarks.
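A toy sketch of the conditioning idea, assuming a PyTorch setup: rather than feeding the denoiser only the degraded (shadow) input, a learned latent encoder supplies a feature intended to carry shadow-free characteristics, and that feature is fused into the denoising network at every step. The tiny conv stacks, the pooling to a single latent vector, and the crude timestep map below are placeholders, not the paper's architecture.

    # Toy latent-conditioned diffusion denoiser; layer choices are illustrative only.
    import torch
    import torch.nn as nn

    class LatentConditionedDenoiser(nn.Module):
        def __init__(self, channels=64, latent_dim=128):
            super().__init__()
            self.latent_encoder = nn.Sequential(          # maps the shadow image to a latent
                nn.Conv2d(3, latent_dim, 3, padding=1),   # feature meant to look "shadow-free"
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.denoiser = nn.Sequential(
                nn.Conv2d(3 + latent_dim + 1, channels, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(channels, 3, 3, padding=1),     # predicts the noise residual
            )

        def forward(self, noisy, shadow_img, t):
            # noisy, shadow_img: (B, 3, H, W); t: (B,) integer diffusion timesteps.
            b, _, h, w = noisy.shape
            latent = self.latent_encoder(shadow_img).expand(-1, -1, h, w)
            t_map = t.view(b, 1, 1, 1).float().expand(-1, -1, h, w)  # crude timestep embedding
            return self.denoiser(torch.cat([noisy, latent, t_map], dim=1))

The point the sketch tries to capture is that the conditioning signal is a learned feature intended to resemble shadow-free content, rather than the raw degraded image itself.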
Answering Questions about Data Visualizations using Efficient Bimodal Fusion
Kafle, Kushal, Shrestha, Robik, Price, Brian, Cohen, Scott, Kanan, Christopher
Chart question answering (CQA) is a newly proposed visual question answering (VQA) task where an algorithm must answer questions about data visualizations, e.g. bar charts, pie charts, and line graphs. CQA requires capabilities that natural-image VQA algorithms lack: fine-grained measurements, optical character recognition, and handling out-of-vocabulary words in both questions and answers. Without modifications, state-of-the-art VQA algorithms perform poorly on this task. Here, we propose a novel CQA algorithm called parallel recurrent fusion of image and language (PReFIL). PReFIL first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question. Despite its simplicity, PReFIL greatly surpasses state-of-the-art systems and human baselines on both the FigureQA and DVQA datasets. Additionally, we demonstrate that PReFIL can be used to reconstruct tables by asking a series of questions about a chart.
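The fusion idea can be sketched roughly as follows, assuming a PyTorch setup: the question embedding is tiled over every location of a CNN feature map, fused with 1x1 convolutions, and the fused locations are then aggregated by a recurrent layer before answer classification. The layer sizes, the single fusion branch, and the bidirectional GRU aggregator here are simplifications rather than the exact PReFIL architecture.

    # Simplified bimodal fusion sketch; dimensions and components are assumptions.
    import torch
    import torch.nn as nn

    class BimodalFusionSketch(nn.Module):
        def __init__(self, q_vocab=5000, q_dim=256, img_ch=512, fused=256, n_answers=100):
            super().__init__()
            self.q_embed = nn.Embedding(q_vocab, q_dim)
            self.q_lstm = nn.LSTM(q_dim, q_dim, batch_first=True)
            self.fuse = nn.Sequential(                       # 1x1 convs over [image ; question]
                nn.Conv2d(img_ch + q_dim, fused, 1), nn.ReLU(),
                nn.Conv2d(fused, fused, 1), nn.ReLU(),
            )
            self.aggregate = nn.GRU(fused, fused, batch_first=True, bidirectional=True)
            self.classify = nn.Linear(2 * fused, n_answers)

        def forward(self, img_feats, question_tokens):
            # img_feats: (B, C, H, W) from a CNN backbone; question_tokens: (B, T) token ids.
            _, (q, _) = self.q_lstm(self.q_embed(question_tokens))
            q = q[-1]                                               # (B, q_dim) question summary
            B, C, H, W = img_feats.shape
            q_map = q.view(B, -1, 1, 1).expand(-1, -1, H, W)        # tile question over locations
            fused = self.fuse(torch.cat([img_feats, q_map], dim=1)) # (B, fused, H, W)
            seq = fused.flatten(2).transpose(1, 2)                  # (B, H*W, fused)
            _, h = self.aggregate(seq)                              # recurrent aggregation
            return self.classify(torch.cat([h[0], h[1]], dim=1))    # answer logits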
Sherlock: Scalable Fact Learning in Images
Elhoseiny, Mohamed (Rutgers University) | Cohen, Scott (Adobe Research) | Chang, Walter (Adobe Research) | Price, Brian (Adobe Research) | Elgammal, Ahmed (Rutgers University)
We study scalable and uniform understanding of facts in images. Existing visual recognition systems are typically modeled differently for each fact type, such as objects, actions, and interactions. We propose a setting where all these facts can be modeled simultaneously, with the capacity to understand an unbounded number of facts in a structured way. The training data comes as structured facts in images, including (1) objects (e.g., <boy>), (2) attributes (e.g., <boy, tall>), (3) actions (e.g., <boy, playing>), and (4) interactions (e.g., <boy, riding, a horse>). Each fact has a semantic language view (e.g., <boy, playing>) and a visual view (an image exhibiting this fact). We show that learning visual facts in a structured way enables not only uniform but also generalizable visual understanding. We propose and investigate recent and strong approaches from the multiview learning literature and also introduce two representation learning models as potential baselines. We apply the investigated methods to several datasets that we augmented with structured facts, as well as a large-scale dataset of more than 202,000 facts and 814,000 images. Our experiments show the advantage of relating facts through structure: on bidirectional fact retrieval, the proposed models outperform the designed baselines.
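A minimal sketch of the two-view setup, under the assumption of a standard embedding-plus-ranking-loss formulation: the language view of a fact <S, P, O> and the visual view of an image are projected into a shared space, and matching pairs are pulled together relative to in-batch mismatches. The encoders and the hinge-style loss are generic placeholders, not the specific multiview models compared in the paper.

    # Generic two-view fact embedding sketch; encoders and loss are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FactEmbedder(nn.Module):
        def __init__(self, vocab=10000, dim=300, img_feat_dim=2048):
            super().__init__()
            self.word = nn.Embedding(vocab, dim, padding_idx=0)   # index 0 pads missing P/O slots
            self.lang_proj = nn.Linear(3 * dim, dim)              # <S, P, O> slots concatenated
            self.img_proj = nn.Linear(img_feat_dim, dim)          # on top of CNN image features

        def encode_fact(self, s, p, o):
            slots = torch.cat([self.word(s), self.word(p), self.word(o)], dim=-1)
            return F.normalize(self.lang_proj(slots), dim=-1)

        def encode_image(self, img_feat):
            return F.normalize(self.img_proj(img_feat), dim=-1)

    def ranking_loss(fact_emb, img_emb, margin=0.2):
        """Push matching (fact, image) pairs above in-batch mismatches by a margin."""
        sims = fact_emb @ img_emb.t()                 # (B, B) cosine similarities
        pos = sims.diag().unsqueeze(1)                # similarity of each true pair
        loss = torch.clamp(margin + sims - pos, min=0)
        loss.fill_diagonal_(0)                        # do not penalize the true pairs themselves
        return loss.mean()

Because single-word facts like <boy> leave the P and O slots empty, padding those slots (index 0 here) is one simple way to keep objects, attributes, actions, and interactions in a single uniform encoder.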
SURGE: Surface Regularized Geometry Estimation from a Single Image
Wang, Peng, Shen, Xiaohui, Russell, Bryan, Cohen, Scott, Price, Brian, Yuille, Alan L.
This paper introduces an approach to regularize 2.5D surface normal and depth predictions at each pixel given a single input image. The approach infers and reasons about the underlying 3D planar surfaces depicted in the image to snap predicted normals and depths to inferred planar surfaces, all while maintaining fine detail within objects. Our approach comprises two components: (i) a four-stream convolutional neural network (CNN) where depths, surface normals, and likelihoods of planar region and planar boundary are predicted at each pixel, followed by (ii) a dense conditional random field (DCRF) that integrates the four predictions such that the normals and depths are compatible with each other and regularized by the planar region and planar boundary information. The DCRF is formulated such that gradients can be passed to the surface normal and depth CNNs via backpropagation. In addition, we propose new planar-wise metrics to evaluate geometry consistency within planar surfaces, which are more tightly related to downstream 3D editing applications. We show that our regularization yields a 30% relative improvement in planar consistency on the NYU v2 dataset.
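The planar regularization intuition can be illustrated with a small sketch, assuming simple pinhole intrinsics and a single given planar-region mask: fit one 3D plane to the back-projected depth points inside the region, then re-solve each pixel's depth so it lies on that plane. The paper instead couples depths and normals jointly inside a dense CRF with backpropagation through the inference; this is only the snap-to-plane idea in isolation.

    # Toy snap-to-plane illustration; intrinsics format and the single-region
    # assumption are simplifications, not the paper's DCRF formulation.
    import numpy as np

    def snap_depth_to_plane(depth, plane_mask, K):
        """Refit depths inside a planar region onto a single least-squares plane."""
        fx, fy, cx, cy = K                                   # simple pinhole intrinsics
        v, u = np.where(plane_mask)
        z = depth[v, u]
        x = (u - cx) / fx * z                                # back-project to 3D
        y = (v - cy) / fy * z
        pts = np.stack([x, y, np.ones_like(z)], axis=1)
        # Least-squares plane z = a*x + b*y + c over the region's 3D points.
        a, b, c = np.linalg.lstsq(pts, z, rcond=None)[0]
        snapped = depth.copy()
        # Re-solve each pixel's depth so its back-projection lies on the fitted plane.
        denom = 1.0 - a * (u - cx) / fx - b * (v - cy) / fy
        snapped[v, u] = c / np.maximum(denom, 1e-6)
        return snapped

The fitted plane's normal, proportional to (a, b, -1), is likewise the direction toward which the corresponding per-pixel normal predictions would be regularized.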