compound divergence
DiNeR: a Large Realistic Dataset for Evaluating Compositional Generalization
Hu, Chengang, Liu, Xiao, Feng, Yansong
Most of the existing compositional generalization datasets are synthetically-generated, resulting in a lack of natural language variation. While there have been recent attempts to introduce non-synthetic datasets for compositional generalization, they suffer from either limited data scale or a lack of diversity in the forms of combinations. To better investigate compositional generalization with more linguistic phenomena and compositional diversity, we propose the DIsh NamE Recognition (DiNeR) task and create a large realistic Chinese dataset. Given a recipe instruction, models are required to recognize the dish name composed of diverse combinations of food, actions, and flavors. Our dataset consists of 3,811 dishes and 228,114 recipes, and involves plenty of linguistic phenomena such as anaphora, omission and ambiguity. We provide two strong baselines based on T5 and large language models (LLMs). This work contributes a challenging task, baseline methods to tackle the task, and insights into compositional generalization in the context of dish name recognition. Code and data are available at https://github.com/Jumpy-pku/DiNeR.
MAQA: A Multimodal QA Benchmark for Negation
Li, Judith Yue, Jansen, Aren, Huang, Qingqing, Lee, Joonseok, Ganti, Ravi, Kuzmin, Dima
Multimodal learning can benefit from the representation power of pretrained Large Language Models (LLMs). However, state-of-the-art transformer based LLMs often ignore negations in natural language and there is no existing benchmark to quantitatively evaluate whether multimodal transformers inherit this weakness. In this study, we present a new multimodal question answering (QA) benchmark adapted from labeled music videos in AudioSet (Gemmeke et al., 2017) with the goal of systematically evaluating if multimodal transformers can perform complex reasoning to recognize new concepts as negation of previously learned concepts. We show that with standard fine-tuning approach multimodal transformers are still incapable of correctly interpreting negation irrespective of model size. However, our experiments demonstrate that augmenting the original training task distributions with negated QA examples allow the model to reliably reason with negation. To do this, we describe a novel data generation procedure that prompts the 540B-parameter PaLM model to automatically generate negated QA examples as compositions of easily accessible video tags. The generated examples contain more natural linguistic patterns and the gains compared to template-based task augmentation approach are significant.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- North America > Dominican Republic (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
Visually Grounded Concept Composition
Zhang, Bowen, Hu, Hexiang, Qiu, Linlu, Shaw, Peter, Sha, Fei
We investigate ways to compose complex concepts in texts from primitive ones while grounding them in images. We propose Concept and Relation Graph (CRG), which builds on top of constituency analysis and consists of recursively combined concepts with predicate functions. Meanwhile, we propose a concept composition neural network called Composer to leverage the CRG for visually grounded concept learning. Specifically, we learn the grounding of both primitive and all composed concepts by aligning them to images and show that learning to compose leads to more robust grounding results, measured in text-to-image matching accuracy. Notably, our model can model grounded concepts forming at both the finer-grained sentence level and the coarser-grained intermediate level (or word-level). Composer leads to pronounced improvement in matching accuracy when the evaluation data has significant compound divergence from the training data.
Measuring Compositional Generalization: A Comprehensive Method on Realistic Data
Keysers, Daniel, Schärli, Nathanael, Scales, Nathan, Buisman, Hylke, Furrer, Daniel, Kashubin, Sergii, Momchev, Nikola, Sinopalnikov, Danila, Stafiniak, Lukasz, Tihon, Tibor, Tsarkov, Dmitry, Wang, Xiao, van Zee, Marc, Bousquet, Olivier
State-of-the-art machine learning methods exhibit limited compositional generalization. At the same time, there is a lack of realistic benchmarks that comprehensively measure this ability, which makes it challenging to find and evaluate improvements. We introduce a novel method to systematically construct such benchmarks by maximizing compound divergence while guaranteeing a small atom divergence between train and test sets, and we quantitatively compare this method to other approaches for creating compositional generalization benchmarks. We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. We also demonstrate how our method can be used to create new compositionality benchmarks on top of the existing SCAN dataset, which confirms these findings.
- Oceania > Australia (0.04)
- North America > United States > Tennessee (0.04)
- North America > United States > Massachusetts > Norfolk County > Brookline (0.04)
- (6 more...)
- Media > Film (1.00)
- Leisure & Entertainment (1.00)