Goto

Collaborating Authors

 open image


impacts

Neural Information Processing Systems

The primary goal of PACBench is to catalyze the development of more capable, reliable, and physically grounded VLMs and their fine-tuned variants, often called VLAs for real-world robotic applications. Because VLA fine-tuning typically relies on low-level trajectory data rather than higher level reasoning, probing the underlying VLM's understanding of object Properties, action Affordances, and physical Constraints (PAC) gives us a grounded lens into the capabilities that downstream robotic policies will inherit. By diagnosing PAC weaknesses in the base model, researchers can distinguish whether a VLA's performance stems from genuine physical common sense or simply memorized motion patterns, and thus guide targeted improvements in model architectures, training methodologies, and dataset curation. In doing so, PACBench helps ensure that robotic systems become more predictable, less prone to errors from a lack of physical understanding, and better equipped for safe, effective collaboration in complex, everyday environments. By providing a fine-grained diagnostic tool, PACBench can help researchers and developers identify specific weaknesses in current models, thereby guiding targeted improvements in model architectures, training methodologies, and dataset curation. This, in turn, can lead to robotic systems that are more predictable, less prone to errors stemming from a lack of physical common sense, and better able to perform a wide range of useful tasks. The open release of our benchmark and its diverse data sources (including web-scale images, real-world humanoid captures, and simulated scenarios) is intended to foster broad community engagement and accelerate progress in this crucial area of AI. While any advancement in AI capabilities warrants ongoing consideration of its societal implications, our work focuses on enhancing the fundamental understanding and robustness of AI systems, which we see as a positive step towards more responsible AI development.




Culture Affordance Atlas: Reconciling Object Diversity Through Functional Mapping

arXiv.org Artificial Intelligence

Culture shapes the objects people use and for what purposes, yet mainstream Vision-Language (VL) datasets frequently exhibit cultural biases, disproportionately favoring higher-income, Western contexts. This imbalance reduces model generalizability and perpetuates performance disparities, especially impacting lower-income and non-Western communities. To address these disparities, we propose a novel function-centric framework that categorizes objects by the functions they fulfill, across diverse cultural and economic contexts. We implement this framework by creating the Culture Affordance Atlas, a re-annotated and culturally grounded restructuring of the Dollar Street dataset spanning 46 functions and 288 objects publicly available at https://lit.eecs.umich.edu/CultureAffordance-Atlas/index.html. Through extensive empirical analyses using the CLIP model, we demonstrate that function-centric labels substantially reduce socioeconomic performance gaps between high- and low-income groups by a median of 6 pp (statistically significant), improving model effectiveness for lower-income contexts. Furthermore, our analyses reveals numerous culturally essential objects that are frequently overlooked in prominent VL datasets. Our contributions offer a scalable pathway toward building inclusive VL datasets and equitable AI systems.


Supplementary Materials A Extended Related Work (2)

Neural Information Processing Systems

We first discuss attacks that use physical objects as triggers, then discuss a few related works which use light as a trigger. We conclude by discussing the single proposed defense against physical backdoor attacks. As mentioned briefly in 2, [ 10 ] designs a backdoor attack against lane detection systems for autonomous vehicles. This attack expands the scope of physical backdoor attacks by attacking detection rather than classification models. Furthermore, it confirms the result from [ 43 ] that even when digitally altered images are used to poison a dataset, the triggers can be activated using physical objects (traffic cones in this setting) in real world scenarios. A second work [ 31 ] evaluates the effectiveness of using facial characteristics as backdoor triggers.


Finding Naturally Occurring Physical Backdoors in Image Datasets Emily Wenger University of Chicago Roma Bhattacharjee

Neural Information Processing Systems

Extensive literature on backdoor poison attacks has studied attacks and defenses for backdoors using "digital trigger patterns." In contrast, "physical backdoors" use physical objects as triggers, have only recently been identified, and are qualitatively different enough to resist most defenses targeting digital trigger backdoors. Research on physical backdoors is limited by access to large datasets containing real images of physical objects co-located with misclassification targets . Building these datasets is time-and labor-intensive. This work seeks to address the challenge of accessibility for research on physical backdoor attacks.


PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

arXiv.org Artificial Intelligence

Vision-Language Models (VLMs) are increasingly pivotal for generalist robot manipulation, enabling tasks such as physical reasoning, policy generation, and failure detection. However, their proficiency in these high-level applications often assumes a deep understanding of low-level physical prerequisites, a capability that remains largely unverified. For robots to perform actions reliably, they must comprehend intrinsic object properties (e.g., material, weight), action affordances (e.g., graspable, stackable), and physical constraints (e.g., stability, reachability, or an object's state, such as being closed). Despite the widespread use of VLMs in manipulation tasks, we argue that off-the-shelf models may lack this granular, physically grounded understanding, as such prerequisites are often overlooked during training. To address this critical gap, we introduce PAC Bench, a comprehensive benchmark designed to systematically evaluate VLMs on their understanding of core Properties, Affordances, and Constraints (PAC) from a task executability perspective. PAC Bench features a diverse dataset with over 30,000 annotations, comprising 673 real-world images (115 object classes, 15 property types, and 1 to 3 affordances defined per class), 100 real-world humanoid-view scenarios, and 120 unique simulated constraint scenarios across four tasks. Our evaluations reveal significant gaps in the ability of current VLMs to grasp fundamental physical concepts, highlighting limitations in their suitability for reliable robot manipulation and pointing to key areas for targeted research. PAC Bench also serves as a standardized benchmark for rigorously evaluating physical reasoning in VLMs and guiding the development of more robust, physically grounded models for robotic applications. Project Page: https://pacbench.github.io/


How to classify photos in 600 classes using nine million Open Images

#artificialintelligence

If you're looking build an image classifier but need training data, look no further than Google Open Images. This massive image dataset contains over 30 million images and 15 million bounding boxes. Plus, Open Images is much more open and accessible than certain other image datasets at this scale. For example, ImageNet has restrictive licensing. However, it's not easy for developers on single machines to sift through that much data.You need to download and process multiple metadata files, and roll their own storage space (or apply for access to a Google Cloud bucket).


nocaps: novel object captioning at scale

arXiv.org Artificial Intelligence

Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task. Dubbed 'nocaps', for novel object captioning at scale, our benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, more than 500 object classes seen in test images have no training captions (hence, nocaps). We evaluate several existing approaches to novel object captioning on our challenging benchmark. In automatic evaluations these approaches show modest improvements over a strong baseline trained only on image-caption data. However, even when using ground-truth object detections, the results are significantly weaker than our human baseline - indicating substantial room for improvement.


No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World

arXiv.org Machine Learning

Modern machine learning systems such as image classifiers rely heavily on large scale data sets for training. Such data sets are costly to create, thus in practice a small number of freely available, open source data sets are widely used. We suggest that examining the geo-diversity of open data sets is critical before adopting a data set for use cases in the developing world. We analyze two large, publicly available image data sets to assess geo-diversity and find that these data sets appear to exhibit an observable amerocentric and eurocentric representation bias. Further, we analyze classifiers trained on these data sets to assess the impact of these training distributions and find strong differences in the relative performance on images from different locales. These results emphasize the need to ensure geo-representation when constructing data sets for use in the developing world.