Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason over knowledge present in both visual and textual data.
In order to build more useful AI systems, a natural inclination is to try to make them more agentic. But while agents built from language models are touted as the next big advance [Wang et al., 2024],
Understanding the semantics of natural language utterances is a fundamental problem in machine learning. The semantics of an utterance is usually invariant to permutations of some of its components.
Our bound addresses the second question; it suggests that learning algorithms that bias towards models with small variation across the source threat model exhibit a smaller drop in robustness to particular unforeseen attacks.