Generative AI


Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities

arXiv.org Artificial Intelligence

The various limitations of Generative AI, such as hallucinations and model failures, have made it crucial to understand the role of different modalities in Visual Language Model (VLM) predictions. Our work investigates how the integration of information from image and text modalities influences the performance and behavior of VLMs in visual question answering (VQA) and reasoning tasks. We measure this effect through answer accuracy, reasoning quality, model uncertainty, and modality relevance. We study the interplay between text and image modalities in different configurations where visual content is essential for solving the VQA task. Our contributions include (1) the Semantic Interventions (SI)-VQA dataset, (2) a benchmark study of various VLM architectures under different modality configurations, and (3) the Interactive Semantic Interventions (ISI) tool. The SI-VQA dataset serves as the foundation for the benchmark, while the ISI tool provides an interface to test and apply semantic interventions in image and text inputs, enabling more fine-grained analysis. Our results show that complementary information between modalities improves answer and reasoning quality, while contradictory information harms model performance and confidence. Image text annotations have minimal impact on accuracy and uncertainty, slightly increasing image relevance. Attention analysis confirms the dominant role of image inputs over text in VQA tasks. In this study, we evaluate state-of-the-art VLMs that allow us to extract attention coefficients for each modality. A key finding is PaliGemma's harmful overconfidence, which poses a higher risk of silent failures compared to the LLaVA models. This work sets the foundation for rigorous analysis of modality integration, supported by datasets specifically designed for this purpose.


Social Conjuring: Multi-User Runtime Collaboration with AI in Building Virtual 3D Worlds

arXiv.org Artificial Intelligence

Generative artificial intelligence has shown promise in prompting virtual worlds into existence, yet little attention has been given to understanding how this process unfolds as social interaction. We present Social Conjurer, a framework for AI-augmented dynamic 3D scene co-creation, where multiple users collaboratively build and modify virtual worlds in real-time. Through an expanded set of interactions, including social and tool-based engagements as well as spatial reasoning, our framework facilitates the creation of rich, diverse virtual environments. Findings from a preliminary user study (N=12) provide insight into the user experience of this approach, how social contexts shape the prompting of spatial environments, and perspective on social applications of prompt-based 3D co-creation. In addition to highlighting the potential of AI-supported multi-user world creation and offering new pathways for AI-augmented creative processes in VR, this article presents a set of implications for designing human-centered interfaces that incorporate AI models into 3D content generation.


A Likelihood Based Approach to Distribution Regression Using Conditional Deep Generative Models

arXiv.org Machine Learning

In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings.


Microsoft's Copilot AI gets a voice and the ability to see websites you browse

Engadget

Beyond debuting new features for Copilot AI PCs and Windows 11's 2024 update, Microsoft is also giving its Copilot AI a makeover on the web, mobile and desktop. That includes a slightly friendlier interface wherever you access it, along with new capabilities like Copilot Voice, which allows you to talk conversationally with the AI assistant. Ultimately, Microsoft is aiming for Copilot to be seen as more than just a party trick for generative AI search and image creation -- it's trying to make it a core part of your daily workflow. That starts with a cleaner and simpler UI that makes Copilot look different from a boring old search engine. You'll also be able to access Copilot from within WhatsApp, which could be useful if you want to avoid Meta's AI assistant.


Copilot's AI will be able to 'see' and talk to you, Microsoft says

PCWorld

Microsoft is rolling out its next feature update of Windows 11, the Windows 11 2024 Update, beginning today. But Microsoft obviously isn't done yet, and it's offering a sneak peek at new Copilot experiences which will debut this fall, including Copilot Voice, Copilot Vision, and Copilot Daily, among others. On the surface, the new additions to Copilot sound similar to the multimodal ChatGPT (or GPT-4o) that OpenAI launched earlier this year, where ChatGPT can now "see" and an Advanced Voice feature means that you can have conversations with it. But there are some key differences between what Microsoft and OpenAI are offering, and only some of Microsoft's Copilot innovations will be available right away. It's probably safe to say, though, that Copilot Voice will be the most important addition -- and Copilot Vision may not be.


Hidden traces of humanity: what AI images reveal about our world

The Guardian

When faced with a bit of downtime, many of my friends will turn to the same party game. It's based on the surrealist game Exquisite Corpse, and involves translating brief written descriptions into rapidly made drawings and back again. One group calls it Telephone Pictionary; another refers to it as Writey-Drawey. The internet tells me it is also called Eat Poop You Cat, a sequence of words surely inspired by one of the game's results. As recently as three years ago, it was rare to encounter text-to-image or image-to-text mistranslations in daily life, which made the outrageous outcomes of the game feel especially novel. But we have since entered a new era of image-making. With the aid of AI image generators like Dall-E 3, Stable Diffusion and Midjourney, and the generative features integrated into Adobe's Creative Cloud programs, you can now transform a sentence or phrase into a highly detailed image in mere seconds. Images, likewise, can be nearly instantly translated into descriptive text.


PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System

arXiv.org Artificial Intelligence

Generative Artificial Intelligence (GenAI) is becoming ubiquitous in our daily lives. The increase in computational power and data availability has led to a proliferation of both single- and multi-modal models. As the GenAI ecosystem matures, the need for extensible and model-agnostic risk identification frameworks is growing. To meet this need, we introduce the Python Risk Identification Toolkit (PyRIT), an open-source framework designed to enhance red teaming efforts in GenAI systems. PyRIT is a model- and platform-agnostic tool that enables red teamers to probe for and identify novel harms, risks, and jailbreaks in multimodal generative AI models. Its composable architecture facilitates the reuse of core building blocks and allows for extensibility to future models and modalities. This paper details the challenges specific to red teaming generative AI systems, the development and features of PyRIT, and its practical applications in real-world scenarios.


Generative Diffusion-based Contract Design for Efficient AI Twins Migration in Vehicular Embodied AI Networks

arXiv.org Artificial Intelligence

Embodied AI is a rapidly advancing field that bridges the gap between cyberspace and physical space, enabling a wide range of applications. This evolution has led to the development of the Vehicular Embodied AI NETwork (VEANET), where advanced AI capabilities are integrated into vehicular systems to enhance autonomous operations and decision-making. Embodied agents, such as Autonomous Vehicles (AVs), are autonomous entities that can perceive their environment and take actions to achieve specific goals, actively interacting with the physical world. Embodied twins are digital models of these embodied agents, with various embodied AI twins for intelligent applications in cyberspace. In VEANET, embodied AI twins act as in-vehicle AI assistants to perform diverse tasks supporting autonomous driving using generative AI models. Because AVs have limited computational resources, they often offload computationally intensive tasks, such as constructing and updating embodied AI twins, to nearby Roadside Units (RSUs). However, due to the rapid mobility of AVs and the limited coverage of a single RSU, embodied AI twins must be migrated dynamically from the current RSU to other RSUs in real time, which raises the challenge of selecting suitable RSUs for efficient embodied AI twin migration. Given information asymmetry, AVs cannot know the detailed information of RSUs. To this end, in this paper, we construct a multi-dimensional contract theoretical model between AVs and alternative RSUs. Considering that AVs may exhibit irrational behavior, we utilize prospect theory instead of expected utility theory to model the actual utilities of AVs. Finally, we employ a generative diffusion model-based algorithm to identify the optimal contract designs. Numerical results demonstrate the effectiveness of the proposed scheme compared with traditional deep reinforcement learning algorithms.


Generative AI Application for Building Industry

arXiv.org Artificial Intelligence

This paper investigates the transformative potential of generative AI technologies, particularly large language models (LLMs), within the building industry. By leveraging these advanced AI tools, the study explores their application across key areas such as energy code compliance, building design optimization, and workforce training. The research highlights how LLMs can automate labor-intensive processes, significantly improving efficiency, accuracy, and safety in building practices. The paper also addresses the challenges associated with interpreting complex visual and textual data in architectural plans and regulatory codes, proposing innovative solutions to enhance AI-driven compliance checking and design processes. Additionally, the study considers the broader implications of AI integration, including the development of AI-powered tools for comprehensive code compliance across various regulatory domains and the potential for AI to revolutionize workforce training through realistic simulations. This paper provides a comprehensive analysis of the current capabilities of generative AI in the building industry while outlining future directions for research and development, aiming to pave the way for smarter, more sustainable, and responsive construction practices.


What is the Role of Large Language Models in the Evolution of Astronomy Research?

arXiv.org Artificial Intelligence

ChatGPT and other state-of-the-art large language models (LLMs) are rapidly transforming multiple fields, offering powerful tools for a wide range of applications. These models, commonly trained on vast datasets, exhibit human-like text generation capabilities, making them useful for research tasks such as ideation, literature review, coding, drafting, and outreach. We conducted a study involving 13 astronomers at different career stages and research fields to explore LLM applications across diverse tasks over several months and to evaluate their performance in research-related activities. This work was accompanied by an anonymous survey assessing participants' experiences and attitudes towards LLMs. We provide a detailed analysis of the tasks attempted and the survey answers, along with specific output examples. Our findings highlight both the potential and limitations of LLMs in supporting research while also addressing general and research-specific ethical considerations. We conclude with a series of recommendations, emphasizing the need for researchers to complement LLMs with critical thinking and domain expertise, ensuring these tools serve as aids rather than substitutes for rigorous scientific inquiry.