Goto

Collaborating Authors

 Baghshah, Mahdieh Soleymani


Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation

arXiv.org Artificial Intelligence

Text-to-image generative models have made significant advancements in recent years; however, accurately capturing intricate details in textual prompts, such as entity missing, attribute binding errors, and incorrect relationships remains a formidable challenge. In response, we present an innovative, training-free method that directly addresses these challenges by incorporating tailored objectives to account for textual constraints. Unlike layout-based approaches that enforce rigid structures and limit diversity, our proposed approach offers a more flexible arrangement of the scene by imposing just the extracted constraints from the text, without any unnecessary additions. These constraints are formulated as losses-entity missing, entity mixing, attribute binding, and spatial relationships, integrated into a unified loss that is applied in the first generation stage. Furthermore, we introduce a feedback-driven system for fine-grained initial noise refinement. This system integrates a verifier that evaluates the generated image, identifies inconsistencies, and provides corrective feedback. Leveraging this feedback, our refinement method first targets the unmet constraints by refining the faulty attention maps caused by initial noise, through the optimization of selective losses associated with these constraints. Subsequently, our unified loss function is reapplied to proceed the second generation phase. Experimental results demonstrate that our method, relying solely on our proposed objective functions, significantly enhances compositionality, achieving a 24% improvement in human evaluation and a 25% gain in spatial relationships. Furthermore, our fine-grained noise refinement proves effective, boosting performance by up to 5%. Code is available at https://github.com/hadi-hosseini/noise-refinement.


VQEL: Enabling Self-Developed Symbolic Language in Agents through Vector Quantization in Emergent Language Games

arXiv.org Artificial Intelligence

In the field of emergent language, efforts have traditionally focused on developing communication protocols through interactions between agents in referential games. However, the aspect of internal language learning, where language serves not only as a communicative tool with others but also as a means for individual thinking, self-reflection, and problem-solving remains underexplored. Developing a language through self-play, without another agent's involvement, poses a unique challenge. It requires an agent to craft symbolic representations and train them using direct gradient methods. The challenge here is that if an agent attempts to learn symbolic representations through self-play using conventional modeling and techniques such as REINFORCE, the solution will offer no advantage over previous multi-agent approaches. We introduce VQEL, a novel method that incorporates Vector Quantization into the agents' architecture, enabling them to autonomously invent and develop discrete symbolic representations in a self-play referential game. Following the self-play phase, agents can enhance their language through reinforcement learning and interactions with other agents in the mutual-play phase. Our experiments across various datasets demonstrate that VQEL not only outperforms the traditional REINFORCE method but also benefits from improved control and reduced susceptibility to collapse, thanks to the incorporation of vector quantization.


CER: Confidence Enhanced Reasoning in LLMs

arXiv.org Artificial Intelligence

Ensuring the reliability of Large Language Models (LLMs) in complex reasoning tasks remains a formidable challenge, particularly in scenarios that demand precise mathematical calculations and knowledge-intensive open-domain generation. In this work, we introduce an uncertainty-aware framework designed to enhance the accuracy of LLM responses by systematically incorporating model confidence at critical decision points. We propose an approach that encourages multi-step reasoning in LLMs and quantify the confidence of intermediate answers such as numerical results in mathematical reasoning and proper nouns in open-domain generation. Then, the overall confidence of each reasoning chain is evaluated based on confidence of these critical intermediate steps. Finally, we aggregate the answer of generated response paths in a way that reflects the reliability of each generated content (as opposed to self-consistency in which each generated chain contributes equally to majority voting). We conducted extensive experiments in five datasets, three mathematical datasets and two open-domain datasets, using four LLMs. The results consistently validate the effectiveness of our novel confidence aggregation method, leading to an accuracy improvement of up to 7.4% and 5.8% over baseline approaches in math and open-domain generation tasks, respectively. Code is publicly available at https://github.com/ Aquasar11/CER.


Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

arXiv.org Artificial Intelligence

Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information enhancing factual and updated grounding. Recent advances in multimodal learning have led to the development of Multimodal RAG, incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges to Multimodal RAG, distinguishing it from traditional unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We precisely review training strategies, robustness enhancements, and loss functions, while also exploring the diverse Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases. Resources are available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.


RODEO: Robust Outlier Detection via Exposing Adaptive Out-of-Distribution Samples

arXiv.org Artificial Intelligence

In recent years, there have been significant improvements in various forms of image outlier detection. However, outlier detection performance under adversarial settings lags far behind that in standard settings. This is due to the lack of effective exposure to adversarial scenarios during training, especially on unseen outliers, leading to detection models failing to learn robust features. To bridge this gap, we introduce RODEO, a data-centric approach that generates effective outliers for robust outlier detection. More specifically, we show that incorporating outlier exposure (OE) and adversarial training can be an effective strategy for this purpose, as long as the exposed training outliers meet certain characteristics, including diversity, and both conceptual differentiability and analogy to the inlier samples. We leverage a text-to-image model to achieve this goal. We demonstrate both quantitatively and qualitatively that our adaptive OE method effectively generates ``diverse'' and ``near-distribution'' outliers, leveraging information from both text and image domains. Moreover, our experimental results show that utilizing our synthesized outliers significantly enhances the performance of the outlier detector, particularly in adversarial settings.


Inductive Biases for Zero-shot Systematic Generalization in Language-informed Reinforcement Learning

arXiv.org Artificial Intelligence

Sample efficiency and systematic generalization are two long-standing challenges in reinforcement learning. Previous studies have shown that involving natural language along with other observation modalities can improve generalization and sample efficiency due to its compositional and open-ended nature. However, to transfer these properties of language to the decision-making process, it is necessary to establish a proper language grounding mechanism. One approach to this problem is applying inductive biases to extract fine-grained and informative representations from the observations, which makes them more connectable to the language units. We provide architecture-level inductive biases for modularity and sparsity mainly based on Neural Production Systems (NPS). Alongside NPS, we assign a central role to memory in our architecture. It can be seen as a high-level information aggregator which feeds policy/value heads with comprehensive information and simultaneously guides selective attention in NPS through attentional feedback. Our results in the BabyAI environment suggest that the proposed model's systematic generalization and sample efficiency are improved significantly compared to previous models. An extensive ablation study on variants of the proposed method is conducted, and the effectiveness of each employed technique on generalization, sample efficiency, and training stability is specified.


Dilated Balanced Cross Entropy Loss for Medical Image Segmentation

arXiv.org Artificial Intelligence

A novel method for tackling the problem of imbalanced data in medical image segmentation is proposed in this work. In balanced cross entropy (CE) loss, which is a type of weighted CE loss, the weight assigned to each class is the in-verse of the class frequency. These balancing weights are expected to equalize the effect of each class on the overall loss and prevent the model from being biased towards the majority class. But, as it has been shown in previous studies, this method degrades the performance by a large margin. Therefore, balanced CE is not a popular loss in medical segmentation tasks, and usually a region-based loss, like the Dice loss, is used to address the class imbalance problem. In the pro-posed method, the weighting of cross entropy loss for each class is based on a dilated area of each class mask, and balancing weights are assigned to each class together with its surrounding pixels. The goal of this study is to show that the performance of balanced CE loss can be greatly improved my modifying its weighting strategy. Experiments on different datasets show that the proposed dilated balanced CE (DBCE) loss outperforms the balanced CE loss by a large margin and produces superior results compared to CE loss, and its performance is similar to the performance of the combination of Dice and CE loss. This means that a weighted cross entropy loss with the right weighing strategy can be as effective as a region-based loss in handling the problem of class imbalance in medical segmentation tasks.


CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives

arXiv.org Artificial Intelligence

Grounding the instruction in the environment is a key step in solving language-guided goal-reaching reinforcement learning problems. In automated reinforcement learning, a key concern is to enhance the model's ability to generalize across various tasks and environments. In goal-reaching scenarios, the agent must comprehend the different parts of the instructions within the environmental context in order to complete the overall task successfully. In this work, we propose CAREL (Cross-modal Auxiliary REinforcement Learning) as a new framework to solve this problem using auxiliary loss functions inspired by video-text retrieval literature and a novel method called instruction tracking, which automatically keeps track of progress in an environment. The results of our experiments suggest superior sample efficiency and systematic generalization for this framework in multi-modal reinforcement learning problems. Our code base is available here.


LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions

arXiv.org Artificial Intelligence

Why do gradient-based explanations struggle with Transformers, and how can we improve them? We identify gradient flow imbalances in Transformers that violate FullGrad-completeness, a critical property for attribution faithfulness that CNNs naturally possess. To address this issue, we introduce LibraGrad -- a theoretically grounded post-hoc approach that corrects gradient imbalances through pruning and scaling of backward paths, without changing the forward pass or adding computational overhead. We evaluate LibraGrad using three metric families: Faithfulness, which quantifies prediction changes under perturbations of the most and least relevant features; Completeness Error, which measures attribution conservation relative to model outputs; and Segmentation AP, which assesses alignment with human perception. Extensive experiments across 8 architectures, 4 model sizes, and 4 datasets show that LibraGrad universally enhances gradient-based methods, outperforming existing white-box methods -- including Transformer-specific approaches -- across all metrics. We demonstrate superior qualitative results through two complementary evaluations: precise text-prompted region highlighting on CLIP models and accurate class discrimination between co-occurring animals on ImageNet-finetuned models -- two settings on which existing methods often struggle. LibraGrad is effective even on the attention-free MLP-Mixer architecture, indicating potential for extension to other modern architectures. Our code is freely available at https://github.com/NightMachinery/LibraGrad.


Improving 3D Few-Shot Segmentation with Inference-Time Pseudo-Labeling

arXiv.org Artificial Intelligence

In recent years, few-shot segmentation (FSS) models have emerged as a promising approach in medical imaging analysis, offering remarkable adaptability to segment novel classes with limited annotated data. Existing approaches to few-shot segmentation have often overlooked the potential of the query itself, failing to fully utilize the valuable information it contains. However, treating the query as unlabeled data provides an opportunity to enhance prediction accuracy. Specifically in the domain of medical imaging, the volumetric structure of queries offers a considerable source of valuable information that can be used to improve the target slice segmentation. In this work, we present a novel strategy to efficiently leverage the intrinsic information of the query sample for final segmentation during inference. First, we use the support slices from a reference volume to generate an initial segmentation score for the query slices through a prototypical approach. Subsequently, we apply a confidence-aware pseudo-labeling procedure to transfer the most informative parts of query slices to the support set. The final prediction is performed based on the new expanded support set, enabling the prediction of a more accurate segmentation mask for the query volume. Extensive experiments show that the proposed method can effectively boost performance across diverse settings and datasets.