Goto

Collaborating Authors

 noun




Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing

Neural Information Processing Systems

Large-scale text-to-image generative models have been a ground-breaking development in generative AI, with diffusion models showing their astounding ability to synthesize convincing images following an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques are susceptible to unintended modifications of regions outside the targeted area, such as on the background or on distractor objects which have some semantic or visual relationship with the targeted object. According to our experimental findings, inaccurate cross-attention maps are at the root of this problem. Based on this observation, we propose $\textit{Dynamic Prompt Learning}$ ($DPL$) to force cross-attention maps to focus on correct $\textit{noun}$ words in the text prompt. By updating the dynamic tokens for nouns in the textual input with the proposed leakage repairment losses, we achieve fine-grained image editing over particular objects while preventing undesired changes to other image regions. Our method $DPL$, based on the publicly available $\textit{Stable Diffusion}$, is extensively evaluated on a wide range of images, and consistently obtains superior results both quantitatively (CLIP score, Structure-Dist) and qualitatively (on user-evaluation). We show improved prompt editing results for Word-Swap, Prompt Refinement, and Attention Re-weighting, especially for complex multi-object scenes.


TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation

Neural Information Processing Systems

Current referring expression comprehension algorithms can effectively detect or segment objects indicated by nouns, but how to understand verb reference is still under-explored. As such, we study the challenging problem of task oriented detection, which aims to find objects that best afford an action indicated by verbs like sit comfortably on. Towards a finer localization that better serves downstream applications like robot interaction, we extend the problem into task oriented instance segmentation. A unique requirement of this task is to select preferred candidates among possible alternatives. Thus we resort to the transformer architecture which naturally models pair-wise query relationships with attention, leading to the TOIST method. In order to leverage pre-trained noun referring expression comprehension models and the fact that we can access privileged noun ground truth during training, a novel noun-pronoun distillation framework is proposed. Noun prototypes are generated in an unsupervised manner and contextual pronoun features are trained to select prototypes. As such, the network remains noun-agnostic during inference. We evaluate TOIST on the large-scale task oriented dataset COCO-Tasks and achieve +10.7% higher $\rm{mAP^{box}}$ than the best-reported results.


DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors

Barmina, Gianluca, Norman, Nathalie Carmen Hau, Schneider-Kamp, Peter, Poech, Lukas Galke

arXiv.org Artificial Intelligence

We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has a higher discriminatory power which allows to better distinguish well-performing models from low-performing ones.


Slovak Conceptual Dictionary

Blšták, Miroslav

arXiv.org Artificial Intelligence

When solving tasks in the field of natural language processing, we sometimes need dictionary tools, such as lexicons, word form dictionaries or knowledge bases. However, the availability of dictionary data is insufficient in many languages, especially in the case of low resourced languages. In this article, we introduce a new conceptual dictionary for the Slovak language as the first linguistic tool of this kind. Since Slovak language is a language with limited linguistic resources and there are currently not available any machine-readable linguistic data sources with a sufficiently large volume of data, many tasks which require automated processing of Slovak text achieve weaker results compared to other languages and are almost impossible to solve.


Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation

Shi, Chuancheng, Li, Shangze, Guo, Shiming, Xie, Simiao, Wu, Wenhua, Dou, Jingtong, Wu, Chao, Xiao, Canran, Wang, Cong, Cheng, Zifeng, Shen, Fei, Chua, Tat-Seng

arXiv.org Artificial Intelligence

Multilingual text-to-image (T2I) models have advanced rapidly in terms of visual realism and semantic alignment, and are now widely utilized. Y et outputs vary across cultural contexts: because language carries cultural connotations, images synthesized from multilingual prompts should preserve cross-lingual cultural consistency. W e conduct a comprehensive analysis showing that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. Analyses of two representative models indicate that the issue stems not from missing cultural knowledge but from insufficient activation of culture-related representations. W e propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers. Guided by this finding, we introduce two complementary alignment strategies: (1) inference-time cultural activation that amplifies the identified neurons without backbone fine-tuned; and (2) layer-targeted cultural enhancement that updates only culturally relevant layers. Experiments on our CultureBench demonstrate consistent improvements over strong baselines in cultural consistency while preserving fidelity and diversity.


8e08227323cd829e449559bb381484b7-Paper.pdf

Neural Information Processing Systems

For example, training such a system to model block-stacking might naturally generalize to stacking other types of objects because of structure in the real world that has been partially captured by the language describing it.