
Collaborating Authors: Daw, Arka


Open World Scene Graph Generation using Vision Language Models

Dutta, Amartya, Mehrab, Kazi Sajeed, Sawhney, Medha, Neog, Abhilash, Khurana, Mridul, Fatemi, Sepideh, Pradhan, Aanish, Maruf, M., Lourentzou, Ismini, Daw, Arka, Karpatne, Anuj

arXiv.org Artificial Intelligence

Scene-Graph Generation (SGG) seeks to recognize objects in an image and distill their salient pairwise relationships. Most methods depend on dataset-specific supervision to learn the variety of interactions, restricting their usefulness in open-world settings involving novel objects and/or relations. Even methods that leverage large Vision Language Models (VLMs) typically require benchmark-specific fine-tuning. We introduce Open-World SGG, a training-free, efficient, model-agnostic framework that taps directly into the pretrained knowledge of VLMs to produce scene graphs with zero additional learning. Casting SGG as a zero-shot structured-reasoning problem, our method combines multimodal prompting, embedding alignment, and a lightweight pair-refinement strategy, enabling inference over unseen object vocabularies and relation sets. To assess this setting, we formalize an Open-World evaluation protocol that measures performance when no SGG-specific data have been observed, in terms of either objects or relations. Experiments on Visual Genome, Open Images V6, and the Panoptic Scene Graph (PSG) dataset demonstrate the capacity of pretrained VLMs to perform relational understanding without task-level training.
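The embedding-alignment idea in this abstract can be sketched as a nearest-neighbor lookup in embedding space. The vectors, vocabulary, and function names below are purely illustrative stand-ins for real VLM text embeddings, not the paper's implementation:

```python
# Toy sketch: map a VLM's free-form object name onto an open vocabulary by
# cosine similarity between embedding vectors (all vectors here are made up).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def align_label(query_emb, vocab_embs):
    """Return the vocabulary entry whose embedding is closest to the query."""
    return max(vocab_embs, key=lambda name: cosine(query_emb, vocab_embs[name]))

# Toy embeddings standing in for real VLM text embeddings.
vocab = {"dog": [1.0, 0.1, 0.0], "bicycle": [0.0, 1.0, 0.2], "tree": [0.1, 0.0, 1.0]}
print(align_label([0.9, 0.2, 0.05], vocab))  # closest entry: "dog"
```

In the open-world setting, the same lookup works for any vocabulary supplied at inference time, since no label set is baked in during training.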


Masking the Gaps: An Imputation-Free Approach to Time Series Modeling with Missing Data

Neog, Abhilash, Daw, Arka, Khorasgani, Sepideh Fatemi, Karpatne, Anuj

arXiv.org Artificial Intelligence

A significant challenge in time-series (TS) modeling is the presence of missing values in real-world TS datasets. Traditional two-stage frameworks, involving imputation followed by modeling, suffer from two key drawbacks: (1) the propagation of imputation errors into subsequent TS modeling, and (2) the trade-offs between imputation efficacy and imputation complexity. While one-stage approaches attempt to address these limitations, they often struggle to scale or to fully leverage partially observed features. To this end, we propose a novel imputation-free approach for handling missing values in time series, termed Missing Feature-aware Time Series Modeling (MissTSM), with two main innovations. First, we develop a novel embedding scheme that treats every combination of time-step and feature (or channel) as a distinct token. Second, we introduce a novel Missing Feature-Aware Attention (MFAA) layer to learn latent representations at every time-step based on partially observed features. We evaluate the effectiveness of MissTSM in handling missing values over multiple benchmark datasets.
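The core mechanism of a missing-feature-aware attention layer can be illustrated with a masked softmax: missing tokens receive zero weight and the remaining weights renormalize, so no imputed value ever enters the attention. This is a minimal sketch of that idea, not the paper's MFAA implementation:

```python
# Minimal sketch: attention weights that ignore missing feature tokens.
# Masked positions contribute nothing; observed weights renormalize to 1.
import math

def masked_softmax(logits, observed):
    """observed[i] is False where the feature token is missing."""
    exps = [math.exp(l) if obs else 0.0 for l, obs in zip(logits, observed)]
    total = sum(exps)
    return [e / total for e in exps]

weights = masked_softmax([0.5, 1.2, -0.3, 0.8], [True, False, True, True])
print(weights)  # second entry is 0.0; the observed entries sum to 1
```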


AI-generated Image Detection: Passive or Watermark?

Guo, Moyang, Hu, Yuepeng, Jiang, Zhengyuan, Li, Zeyu, Sadovnik, Amir, Daw, Arka, Gong, Neil

arXiv.org Artificial Intelligence

While text-to-image models offer numerous benefits, they also pose significant societal risks. Detecting AI-generated images is crucial for mitigating these risks. Detection methods can be broadly categorized into passive and watermark-based approaches: passive detectors rely on artifacts present in AI-generated images, whereas watermark-based detectors proactively embed watermarks into such images. A key question is which type of detector performs better in terms of effectiveness, robustness, and efficiency. However, the current literature lacks a comprehensive understanding of this issue. In this work, we aim to bridge that gap by developing ImageDetectBench, the first comprehensive benchmark to compare the effectiveness, robustness, and efficiency of passive and watermark-based detectors. Our benchmark includes four datasets, each containing a mix of AI-generated and non-AI-generated images. We evaluate five passive detectors and four watermark-based detectors against eight types of common perturbations and three types of adversarial perturbations. Our benchmark results reveal several interesting findings. For instance, watermark-based detectors consistently outperform passive detectors, both in the presence and absence of perturbations. Based on these insights, we provide recommendations for detecting AI-generated images, e.g., when both types of detectors are applicable, watermark-based detectors should be the preferred choice. Our code and data are publicly available at https://github.com/moyangkuo/ImageDetectBench.git.
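The benchmark's robustness protocol, scoring the same detector on clean and perturbed images, can be sketched with a toy score-thresholding detector. The dataset, detector, and perturbation below are stand-ins, not the benchmark's actual models or transforms:

```python
# Toy sketch of a robustness evaluation loop: accuracy of a passive-style
# artifact-score detector before and after a common perturbation.
import random

random.seed(0)

# Toy dataset: AI-generated images carry higher artifact scores on average.
dataset = [("ai", 0.6 + 0.4 * random.random()) for _ in range(50)] + \
          [("real", 0.4 * random.random()) for _ in range(50)]

def detector(score):
    return "ai" if score > 0.5 else "real"

def accuracy(perturb_strength=0.0):
    correct = 0
    for label, score in dataset:
        s = score + random.uniform(-perturb_strength, perturb_strength)
        correct += detector(s) == label
    return correct / len(dataset)

print(accuracy(0.0), accuracy(0.4))  # the perturbation degrades accuracy
```

A watermark-based detector would be evaluated in the same loop, which is what makes the two families directly comparable under identical perturbations.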


Hiding-in-Plain-Sight (HiPS) Attack on CLIP for Targeted Object Removal from Images

Daw, Arka, Chung, Megan Hong-Thanh, Mahbub, Maria, Sadovnik, Amir

arXiv.org Artificial Intelligence

Machine learning models are known to be vulnerable to adversarial attacks, but traditional attacks have mostly focused on single modalities. With the rise of large multi-modal models (LMMs) like CLIP, which combine vision and language capabilities, new vulnerabilities have emerged. However, prior work on multimodal targeted attacks aims to completely change the model's output to what the adversary wants. In many realistic scenarios, an adversary might seek to make only subtle modifications to the output, so that the changes go unnoticed by downstream models or even by humans. We introduce Hiding-in-Plain-Sight (HiPS) attacks, a novel class of adversarial attacks that subtly modify model predictions by selectively concealing target object(s), as if the target object were absent from the scene. We propose two HiPS attack variants, HiPS-cls and HiPS-cap, and demonstrate their effectiveness in transferring to downstream image captioning models, such as CLIP-Cap, for targeted object removal from image captions.
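A HiPS-style objective can be sketched as pushing the image embedding away from a caption that mentions the target object and toward the same caption with the object removed. The loss shape, embeddings, and captions below are illustrative assumptions, not the paper's exact formulation:

```python
# Hedged sketch of a HiPS-style objective: lower loss means the image embeds
# closer to the object-free caption than to the caption naming the object.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hips_loss(img_emb, cap_with_target, cap_without_target):
    return cosine(img_emb, cap_with_target) - cosine(img_emb, cap_without_target)

# Toy embeddings: the adversarial image drifts toward the object-free caption.
clean   = [1.0, 0.0]
adv     = [0.2, 0.9]
cap_in  = [1.0, 0.1]   # e.g. "a dog on a bench"
cap_out = [0.1, 1.0]   # e.g. "a bench"
print(hips_loss(clean, cap_in, cap_out), hips_loss(adv, cap_in, cap_out))
```

In an actual attack, the pixel perturbation would be optimized by gradient descent on such a loss; here the two embeddings simply illustrate the before/after states.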


A Unified Framework for Forward and Inverse Problems in Subsurface Imaging using Latent Space Translations

Gupta, Naveen, Sawhney, Medha, Daw, Arka, Lin, Youzuo, Karpatne, Anuj

arXiv.org Artificial Intelligence

In subsurface imaging, learning the mapping from velocity maps to seismic waveforms (forward problem) and from waveforms to velocity (inverse problem) is important for several applications. While traditional techniques for solving forward and inverse problems are computationally prohibitive, there is a growing interest in leveraging recent advances in deep learning to learn the mapping between velocity maps and seismic waveform images directly from data. Despite the variety of architectures explored in previous works, several open questions remain, such as the effect of latent space sizes, the importance of manifold learning, the complexity of translation models, and the value of jointly solving forward and inverse problems. We propose a unified framework, termed the Generalized Forward-Inverse (GFI) framework, to systematically characterize prior research in this area, building on the assumption of manifolds and latent space translations. We show that GFI encompasses previous works in deep learning for subsurface imaging, which can be viewed as specific instantiations of GFI. We also propose two new model architectures within the framework of GFI: Latent U-Net and Invertible X-Net, leveraging the power of U-Nets for domain translation and the ability of IU-Nets to simultaneously learn forward and inverse translations, respectively. We show that our proposed models achieve state-of-the-art (SOTA) performance for forward and inverse problems on a wide range of synthetic datasets, and also investigate their zero-shot effectiveness on two real-world-like datasets.
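The latent-space-translation decomposition described here composes three learned maps: encode into a latent space, translate between latent spaces, and decode into the target domain. The toy linear maps below are assumptions standing in for trained networks, just to make the composition concrete:

```python
# Toy rendering of the GFI decomposition (stand-in linear maps, not the
# paper's networks): forward problem = encode -> latent translation -> decode.
def encode_v(v):      return [x * 0.5 for x in v]   # velocity -> latent
def translate_vw(z):  return [x + 1.0 for x in z]   # latent velocity -> latent waveform
def decode_w(z):      return [x * 2.0 for x in z]   # latent -> waveform

def forward_model(v):
    """Velocity map -> seismic waveform via latent-space translation."""
    return decode_w(translate_vw(encode_v(v)))

print(forward_model([2.0, 4.0]))  # [4.0, 6.0]
```

The inverse problem mirrors this with a waveform encoder, a latent translation in the opposite direction, and a velocity decoder; architectures like Invertible X-Net share one invertible translation for both directions.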


Hierarchical Conditioning of Diffusion Models Using Tree-of-Life for Studying Species Evolution

Khurana, Mridul, Daw, Arka, Maruf, M., Uyeda, Josef C., Dahdul, Wasila, Charpentier, Caleb, Bakış, Yasin, Bart, Henry L. Jr., Mabee, Paula M., Lapp, Hilmar, Balhoff, James P., Chao, Wei-Lun, Stewart, Charles, Berger-Wolf, Tanya, Karpatne, Anuj

arXiv.org Artificial Intelligence

A central problem in biology is to understand how organisms evolve and adapt to their environment by acquiring variations in the observable characteristics or traits of species across the tree of life. With the growing availability of large-scale image repositories in biology and recent advances in generative modeling, there is an opportunity to accelerate the discovery of evolutionary traits automatically from images. Toward this goal, we introduce Phylo-Diffusion, a novel framework for conditioning diffusion models with phylogenetic knowledge represented in the form of HIERarchical Embeddings (HIER-Embeds). We also propose two new experiments for perturbing the embedding space of Phylo-Diffusion: trait masking and trait swapping, inspired by counterpart experiments of gene knockout and gene editing/swapping. Our work represents a novel methodological advance in generative modeling to structure the embedding space of diffusion models using tree-based knowledge. Our work also opens a new chapter of research in evolutionary biology by using generative models to visualize evolutionary changes directly from images. We empirically demonstrate the usefulness of Phylo-Diffusion in capturing meaningful trait variations for fishes and birds, revealing novel insights about the biological mechanisms of their evolution.
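The trait-masking experiment described here can be illustrated on a hierarchical embedding where each taxonomic level contributes one sub-vector; masking a level zeroes its contribution before conditioning. The level structure and values below are assumed for illustration, not the paper's HIER-Embeds:

```python
# Toy sketch of trait masking on a hierarchical embedding: each tree-of-life
# level contributes a sub-vector, and masking a level zeroes its contribution.
def hier_embed(levels):
    """Concatenate per-level embeddings, e.g. order, family, genus, species."""
    return [x for level in levels for x in level]

def mask_level(levels, idx):
    return [([0.0] * len(lv) if i == idx else lv) for i, lv in enumerate(levels)]

levels = [[0.2, 0.1], [0.5, 0.3], [0.7, 0.9], [0.4, 0.6]]
print(hier_embed(mask_level(levels, 2)))  # third (genus) sub-vector zeroed
```

Trait swapping would analogously replace one species' sub-vector at a chosen level with another species' sub-vector before generation.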


Learning the boundary-to-domain mapping using Lifting Product Fourier Neural Operators for partial differential equations

Kashi, Aditya, Daw, Arka, Meena, Muralikrishnan Gopalakrishnan, Lu, Hao

arXiv.org Artificial Intelligence

Neural operators such as the Fourier Neural Operator (FNO) have been shown to provide resolution-independent deep learning models that can learn mappings between function spaces. For example, an initial condition can be mapped to the solution of a partial differential equation (PDE) at a future time-step using a neural operator. Despite the popularity of neural operators, their use to predict solution functions over a domain given only data over the boundary (such as a spatially varying Dirichlet boundary condition) remains unexplored. In this paper, we refer to such problems as boundary-to-domain problems; they have a wide range of applications in areas such as fluid mechanics, solid mechanics, and heat transfer. We present a novel FNO-based architecture, named Lifting Product FNO (or LP-FNO), which can map arbitrary boundary functions defined on the lower-dimensional boundary to a solution in the entire domain. Specifically, two FNOs defined on the lower-dimensional boundary are lifted into the higher dimensional domain using our proposed lifting product layer. We demonstrate the efficacy and resolution independence of the proposed LP-FNO for the 2D Poisson equation.
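The dimension-lifting step can be pictured with an outer product: two feature vectors living on a 1D boundary combine into a field over the 2D domain. This is a toy illustration of the idea, under the assumption that the lifting product behaves like an outer product of the two boundary branches, not the paper's actual layer:

```python
# Toy sketch of a lifting product: features from two boundary FNO branches,
# shapes (n,) and (m,), combine into a 2D domain feature field of shape (n, m).
def lifting_product(branch_a, branch_b):
    """Outer product of two boundary feature vectors."""
    return [[a * b for b in branch_b] for a in branch_a]

field = lifting_product([1.0, 2.0], [3.0, 4.0, 5.0])
print(field)  # [[3.0, 4.0, 5.0], [6.0, 8.0, 10.0]]
```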


Beyond Discriminative Regions: Saliency Maps as Alternatives to CAMs for Weakly Supervised Semantic Segmentation

Maruf, M., Daw, Arka, Dutta, Amartya, Bu, Jie, Karpatne, Anuj

arXiv.org Artificial Intelligence

In recent years, several Weakly Supervised Semantic Segmentation (WS3) methods have been proposed that use class activation maps (CAMs) generated by a classifier to produce pseudo-ground truths for training segmentation models. While CAMs are good at highlighting discriminative regions (DR) of an image, they are known to disregard regions of the object that do not contribute to the classifier's prediction, termed non-discriminative regions (NDR). In contrast, attribution methods such as saliency maps provide an alternative approach for assigning a score to every pixel based on its contribution to the classification prediction. This paper provides a comprehensive comparison between saliencies and CAMs for WS3. Our study includes multiple perspectives on understanding their similarities and dissimilarities. Moreover, we provide new evaluation metrics that perform a comprehensive assessment of WS3 performance of alternative methods w.r.t. CAMs. We demonstrate the effectiveness of saliencies in addressing the limitation of CAMs through our empirical studies on benchmark datasets. Furthermore, we propose random cropping as a stochastic aggregation technique that improves the performance of saliency, making it a strong alternative to CAM for WS3.
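The random-cropping aggregation proposed here can be sketched as averaging per-crop saliency back into a full-size map. The saliency function below is a trivial stand-in (a real attribution method would backpropagate through a classifier), and the crop scheme is an assumption for illustration:

```python
# Sketch of random cropping as stochastic aggregation: recompute saliency on
# random crops and average the overlapping contributions per pixel.
import random

random.seed(0)

def saliency(image):  # stand-in for a real attribution method
    return [[abs(px) for px in row] for row in image]

def aggregate(image, n_crops=10, size=2):
    h, w = len(image), len(image[0])
    acc = [[0.0] * w for _ in range(h)]
    cnt = [[0] * w for _ in range(h)]
    for _ in range(n_crops):
        top = random.randrange(h - size + 1)
        left = random.randrange(w - size + 1)
        crop = [row[left:left + size] for row in image[top:top + size]]
        sal = saliency(crop)
        for i in range(size):
            for j in range(size):
                acc[top + i][left + j] += sal[i][j]
                cnt[top + i][left + j] += 1
    # pixels never covered by a crop default to 0.0
    return [[a / c if c else 0.0 for a, c in zip(ar, cr)] for ar, cr in zip(acc, cnt)]

img = [[0.1, -0.5, 0.3], [0.7, 0.2, -0.9], [-0.4, 0.6, 0.8]]
print(aggregate(img))
```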


Mitigating Propagation Failures in Physics-informed Neural Networks using Retain-Resample-Release (R3) Sampling

Daw, Arka, Bu, Jie, Wang, Sifan, Perdikaris, Paris, Karpatne, Anuj

arXiv.org Artificial Intelligence

Despite the success of PINNs, it is known that PINNs sometimes fail to converge to the correct solution in problems involving complicated PDEs. This is reflected in several recent studies on characterizing the "failure modes" of PINNs (Wang et al., 2021; 2022c; Krishnapriyan et al., 2021), although a thorough understanding of the connection between PINN failure modes and sampling strategies is missing. Many of these failure modes are related to the susceptibility of PINNs to getting stuck at trivial solutions acting as poor local minima, due to the unique optimization challenges of PINNs. In particular, training PINNs is different from conventional deep learning problems, as we only have access to the correct solution on the initial and/or boundary points, while for all interior points we can only compute PDE residuals. Also, minimizing PDE residuals does not guarantee convergence to a correct solution, since many trivial solutions of commonly observed PDEs show zero residuals. In this paper, we provide a novel perspective on the failure modes of PINNs by hypothesizing that training PINNs relies on successful "propagation" of the solution from initial and/or boundary condition points to interior points. We show that PINNs with poor sampling strategies can get stuck at trivial solutions if there are propagation failures, characterized by highly imbalanced PDE residual fields. To mitigate propagation failures, we propose a novel Retain-Resample-Release (R3) sampling algorithm that can incrementally accumulate collocation points in regions of high PDE residuals with little to no computational overhead.
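The Retain-Resample-Release loop can be sketched on a 1D toy problem: keep collocation points whose PDE residual exceeds the mean (Retain), drop the rest (Release), and refill the population with fresh uniform samples (Resample). The threshold rule and residual function below are simplified illustrations, not the paper's exact procedure:

```python
# Simplified R3 sampling sketch on a toy 1D residual that blows up near x=0.8,
# so collocation points accumulate there over iterations.
import random

random.seed(0)

def pde_residual(x):  # toy residual: large near x = 0.8
    return abs(x - 0.8) ** -0.5 if x != 0.8 else 100.0

def r3_step(points):
    residuals = [pde_residual(x) for x in points]
    mean_r = sum(residuals) / len(residuals)
    retained = [x for x, r in zip(points, residuals) if r > mean_r]  # Retain
    released = len(points) - len(retained)                           # Release
    resampled = [random.random() for _ in range(released)]           # Resample
    return retained + resampled  # population size stays constant

pts = [random.random() for _ in range(100)]
for _ in range(5):
    pts = r3_step(pts)
print(len(pts))  # still 100: high-residual points kept, the rest refreshed
```

Because released points are replaced uniformly, the scheme keeps exploring the domain while concentrating the retained budget where residuals are large.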


Multi-task Learning for Source Attribution and Field Reconstruction for Methane Monitoring

Daw, Arka, Yeo, Kyongmin, Karpatne, Anuj, Klein, Levente

arXiv.org Artificial Intelligence

Inferring the source information of greenhouse gases, such as methane, from spatially sparse sensor observations is an essential element in mitigating climate change. While it is well understood that the complex behavior of the atmospheric dispersion of such pollutants is governed by the advection-diffusion equation, it is difficult to directly apply the governing equations to identify the source location and magnitude (inverse problem) because the observations are spatially sparse and noisy, i.e., the pollution concentration is known only at the sensor locations, and sensor sensitivity is limited. Here, we develop a multi-task learning framework that can provide high-fidelity reconstruction of the concentration field and identify emission characteristics of the pollution sources, such as their location and emission strength, from sparse sensor observations. We demonstrate that our proposed framework is able to achieve accurate reconstruction of the methane concentrations from sparse sensor measurements, as well as precisely pinpoint the location and emission strength of these pollution sources.
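A multi-task setup of this kind is typically trained on a weighted sum of per-task losses, here one term for field reconstruction and one for source attribution. The weighting and loss choices below are assumptions for illustration, not the paper's training objective:

```python
# Toy sketch of a weighted multi-task objective: a shared representation feeds
# a field-reconstruction head and a source-attribution head.
def multitask_loss(recon_pred, recon_true, src_pred, src_true, w=0.5):
    recon = sum((p - t) ** 2 for p, t in zip(recon_pred, recon_true)) / len(recon_true)
    source = sum((p - t) ** 2 for p, t in zip(src_pred, src_true)) / len(src_true)
    return w * recon + (1 - w) * source

print(multitask_loss([1.0, 2.0], [1.0, 3.0], [0.5], [1.5]))  # 0.5*0.5 + 0.5*1.0 = 0.75
```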