This Looks Like Those: Illuminating Prototypical Concepts Using Multiple Visualizations Brandon Zhao
Existing work in prototype-based image classification uses a "this looks like that" reasoning process, which dissects a test image by finding prototypical parts and combining evidence from these prototypes to make a final classification. However, all of the existing prototypical part-based image classifiers provide only one-to-one comparisons, where a single training image patch serves as a prototype to compare with a part of our test image. With these single-image comparisons, it can often be difficult to identify the underlying concept being compared (e.g., "is it comparing the color or the shape?"). Our proposed method modifies the architecture of prototype-based networks to instead learn prototypical concepts which are visualized using multiple image patches. Having multiple visualizations of the same prototype allows us to more easily identify the concept captured by that prototype (e.g., "the test image and the related training patches are all the same shade of blue"), and allows our model to create richer, more interpretable visual explanations. Our experiments show that our "this looks like those" reasoning process can be applied as a modification to a wide range of existing prototypical image classification networks while achieving comparable accuracy on benchmark datasets.
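To make the "this looks like that/those" scoring step concrete, here is a minimal sketch of a ProtoPNet-style prototype head, assuming a CNN backbone that produces patch embeddings; the class name, the log-similarity form, and all hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PrototypeScorer(nn.Module):
    """Illustrative prototype scoring head (a sketch, not the paper's code).

    Each prototype is a vector in feature space; a test image is scored by
    comparing every spatial patch embedding against every prototype and
    combining the resulting similarities with a linear layer.
    """

    def __init__(self, num_prototypes: int, feat_dim: int, num_classes: int, eps: float = 1e-4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))
        self.classifier = nn.Linear(num_prototypes, num_classes, bias=False)
        self.eps = eps

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, D, H, W) patch embeddings from a CNN backbone
        B, D, H, W = feats.shape
        patches = feats.permute(0, 2, 3, 1).reshape(B, H * W, D)          # (B, HW, D)
        protos = self.prototypes.unsqueeze(0).expand(B, -1, -1).contiguous()
        d2 = torch.cdist(patches, protos) ** 2                            # (B, HW, P) squared distances
        sim = torch.log((d2 + 1.0) / (d2 + self.eps))                     # large when a patch is close to a prototype
        sim_max = sim.max(dim=1).values                                   # best-matching patch per prototype
        return self.classifier(sim_max)                                   # class logits
```

In a "this looks like those" explanation, each prototype would then be visualized by several of its nearest training patches rather than a single one, which is what lets a user identify the shared concept.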
Neural Multi-Objective Combinatorial Optimization with Diversity Enhancement (Appendix)
A Reference point and hypervolume ratio
The normalized hypervolume (HV) ratio is calculated by normalizing the HV of the obtained solution set, computed with respect to the reference point.
NHDE-P, which deploys NHDE to PMOCO [14], employs a hypernetwork conditioned on the weight λ and the diversity factor w of the corresponding subproblem. Specifically, given λ and w, the hypernetwork generates the decoder parameters of the heterogeneous graph attention (HGA) model, which has an encoder-decoder architecture, i.e., θ_dec = θ_dec(λ, w). Following [14], the hypernetwork adopts a simple MLP with two 256-dimensional hidden layers and ReLU activations. The MLP first maps an input with M + 2 dimensions to a hidden embedding h(λ, w), which is then used to generate the decoder parameters by linear projection. NHDE-M, which deploys NHDE to MDRL [15], consists of three processes. In the inference process, the submodel is used to solve the corresponding subproblem.
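The following is a minimal sketch of a hypernetwork with the stated shapes (an M + 2 dimensional input, two 256-dimensional ReLU hidden layers, and a linear projection to the decoder parameters). The split of the input into M dimensions for λ and 2 dimensions for the diversity factor, as well as all names, are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DecoderHypernetwork(nn.Module):
    """Sketch of an MLP hypernetwork that emits decoder parameters.

    Input: the M-dimensional weight vector lambda concatenated with a
    2-dimensional encoding of the diversity factor w (M + 2 dims in total).
    Output: a flat vector that the caller reshapes into the decoder parameters.
    """

    def __init__(self, num_objectives: int, decoder_param_size: int, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(num_objectives + 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # single linear projection from the hidden embedding to the decoder parameters
        self.to_params = nn.Linear(hidden_dim, decoder_param_size)

    def forward(self, lam: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # lam: (B, M) preference weights; w: (B, 2) encoding of the diversity factor
        h = self.embed(torch.cat([lam, w], dim=-1))   # hidden embedding h(lambda, w)
        return self.to_params(h)                      # flat theta_dec, reshaped by the caller
```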
Supplementary Material for "AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery"
The data is publicly available at https://allclear.cs.cornell.edu. In this section, we include a datasheet for our dataset following the methodology of "Datasheets for Datasets" (Gebru et al. [2021]); the prompts from Gebru et al. [2021] are shown in blue, and our responses follow each prompt.
For what purpose was the dataset created? Was there a specific task in mind? The dataset was created to facilitate research and development on cloud removal in satellite imagery. Specifically, our task is more temporally aligned than previous benchmarks.
Dog owners who ruminate about work stress may pass anxiety to their pooch: study
If your job has you feeling tense, your dog might be feeling it too. A new study published in Scientific Reports finds that stress from work can affect your dog at home. The research, led by Tanya Mitropoulos and Allison Andrukonis, shows that when dog owners dwell on work problems after hours, a habit known as "work-related rumination," their pets show more signs of stress. Researchers surveyed 85 working dog owners.
G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models
Worldwide geolocalization aims to predict the precise, coordinate-level location of photos taken anywhere on Earth. It is very challenging due to 1) the difficulty of capturing subtle location-aware visual semantics, and 2) the heterogeneous geographical distribution of image data. As a result, existing studies have clear limitations when scaled to a worldwide context: they may easily confuse distant images with similar visual content, or fail to adapt to locations with widely different amounts of relevant data. To resolve these limitations, we propose G3, a novel framework based on Retrieval-Augmented Generation (RAG).
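As a rough illustration of the retrieval step in a RAG-style geolocalization pipeline, here is a minimal sketch; the embedding model, reference database, and hand-off to a large multi-modality model are generic placeholders and not G3's actual components.

```python
import numpy as np

def retrieve_geo_candidates(query_emb: np.ndarray,
                            db_embs: np.ndarray,
                            db_coords: np.ndarray,
                            k: int = 5):
    """Return the k geo-tagged reference images most similar to the query.

    query_emb: (D,) embedding of the query photo (e.g., from a vision encoder)
    db_embs:   (N, D) embeddings of geo-tagged reference images
    db_coords: (N, 2) latitude/longitude of each reference image
    """
    # cosine similarity between the query and every reference image
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = d @ q
    top = np.argsort(-sims)[:k]
    # the retrieved coordinates and similarities would then serve as context
    # for a large multi-modality model that predicts the final coordinates
    return db_coords[top], sims[top]
```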
Conditional Score Guidance for Text-Driven Image-to-Image Translation Hyunsoo Lee
We present a novel algorithm for text-driven image-to-image translation based on a pretrained text-to-image diffusion model. Our method aims to generate a target image by selectively editing regions of interest in a source image, defined by a modifying text, while preserving the remaining parts. In contrast to existing techniques that solely rely on a target prompt, we introduce a new score function that additionally considers both the source image and the source text prompt, tailored to address specific translation tasks. To this end, we derive the conditional score function in a principled way, decomposing it into the standard score and a guiding term for target image generation. To compute the gradient of the guiding term, we assume a Gaussian posterior distribution and estimate its mean and variance, adjusting the gradient without additional training. In addition, to improve the quality of the conditional score guidance, we incorporate a simple yet effective mixup technique, which combines two cross-attention maps derived from the source and target latents. This strategy is effective for promoting a desirable fusion of the invariant parts in the source image and the edited regions aligned with the target prompt, leading to high-fidelity target image generation. Through comprehensive experiments, we demonstrate that our approach achieves outstanding image-to-image translation performance on various tasks. Code is available at https://github.com/Hleephilip/CSG.
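As a rough illustration of combining two cross-attention maps, here is a minimal sketch assuming the maps from the source and target latents are available as tensors; the edit mask, the blending coefficient, and the function name are assumptions for illustration, not the paper's exact mixup rule.

```python
import torch

def mixup_cross_attention(src_attn: torch.Tensor,
                          tgt_attn: torch.Tensor,
                          edit_mask: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """Blend source and target cross-attention maps (illustrative only).

    src_attn, tgt_attn: (heads, tokens, H, W) cross-attention maps computed
    from the source and target latents, respectively.
    edit_mask: (H, W) mask that is 1 on regions to be edited, 0 elsewhere.
    Outside the edited region the source attention is kept to preserve content;
    inside it the two maps are mixed so the edit follows the target prompt.
    """
    mask = edit_mask.view(1, 1, *edit_mask.shape)
    mixed = alpha * src_attn + (1.0 - alpha) * tgt_attn
    return (1.0 - mask) * src_attn + mask * mixed
```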
Flexible neural representation for physics prediction
Damian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber, Li F. Fei-Fei, Josh Tenenbaum, Daniel L. Yamins
Humans have a remarkable capacity to understand the physical dynamics of objects in their environment, flexibly capturing complex structures and interactions at multiple levels of detail. Inspired by this ability, we propose a hierarchical particlebased object representation that covers a wide variety of types of three-dimensional objects, including both arbitrary rigid geometrical shapes and deformable materials. We then describe the Hierarchical Relation Network (HRN), an end-to-end differentiable neural network based on hierarchical graph convolution, that learns to predict physical dynamics in this representation. Compared to other neural network baselines, the HRN accurately handles complex collisions and nonrigid deformations, generating plausible dynamics predictions at long time scales in novel settings, and scaling to large scene configurations. These results demonstrate an architecture with the potential to form the basis of next-generation physics predictors for use in computer vision, robotics, and quantitative cognitive science.
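To make the graph-convolution step over a particle representation concrete, here is a minimal sketch of one round of message passing on a particle graph; the module name, the message/update networks, and the edge format are illustrative assumptions rather than the HRN's actual architecture.

```python
import torch
import torch.nn as nn

class ParticleMessagePassing(nn.Module):
    """One step of message passing over a particle graph (illustrative sketch).

    Each particle carries a state vector; a message is computed per directed
    edge from the sender and receiver states and summed at the receiver. In an
    HRN-style model, such a step would be applied both within and across levels
    of the particle hierarchy.
    """

    def __init__(self, state_dim: int, msg_dim: int = 64):
        super().__init__()
        self.msg_fn = nn.Sequential(nn.Linear(2 * state_dim, msg_dim), nn.ReLU())
        self.update_fn = nn.Linear(state_dim + msg_dim, state_dim)

    def forward(self, states: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # states: (N, state_dim) particle states; edges: (E, 2) [sender, receiver] indices
        senders, receivers = edges[:, 0], edges[:, 1]
        msgs = self.msg_fn(torch.cat([states[senders], states[receivers]], dim=-1))
        agg = torch.zeros(states.size(0), msgs.size(-1), device=states.device)
        agg.index_add_(0, receivers, msgs)            # sum incoming messages per particle
        return self.update_fn(torch.cat([states, agg], dim=-1))
```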
Graph Convolutions Enrich the Self-Attention in Transformers!
Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective.
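To illustrate the graph-filter view of self-attention, here is a minimal sketch: the softmax attention matrix is treated as the adjacency matrix of a token graph, so standard attention output A·V is a first-order filter, and a polynomial in A gives a more general filter. The polynomial coefficients and function name are assumptions for illustration, not the paper's learned filter design.

```python
import torch

def polynomial_graph_filter_attention(q, k, v, coeffs):
    """Self-attention viewed as a graph filter, generalized to a polynomial (sketch).

    Standard self-attention computes A @ v with A = softmax(q k^T / sqrt(d)),
    i.e., a first-order filter on the token graph defined by A. A polynomial
    graph filter replaces A with sum_j coeffs[j] * A^j.
    q, k, v: (T, d) token matrices; coeffs: list of filter coefficients.
    """
    d = q.size(-1)
    A = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (T, T) token graph
    out = coeffs[0] * v                                            # j = 0 term: identity filter
    Apow = A
    for c in coeffs[1:]:
        out = out + c * (Apow @ v)                                 # higher-order neighborhood terms
        Apow = Apow @ A
    return out
```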