Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving

Zhang, Enming, Gong, Peizhe, Dai, Xingyuan, Lv, Yisheng, Miao, Qinghai

arXiv.org Artificial Intelligence

Assessing the safety of vision-language models (VLMs) in autonomous driving is particularly important; however, existing work mainly focuses on traditional benchmark evaluations. As interactive components within autonomous driving systems, VLMs must maintain strong safety cognition during interactions. From this perspective, we propose a novel evaluation method: the Safety Cognitive Driving Benchmark (SCD-Bench). To address the large-scale annotation challenge for SCD-Bench, we develop the Autonomous Driving Image-Text Annotation System (ADA). Additionally, to ensure data quality in SCD-Bench, our dataset undergoes manual refinement by experts with professional knowledge in autonomous driving. We further develop an automated evaluation method based on large language models (LLMs). To verify its effectiveness, we compare its evaluation results with those of expert human evaluations, achieving a consistency rate of 99.74%. Preliminary experimental results indicate that existing open-source models still lack sufficient safety cognition, showing a significant gap compared to GPT-4o. Notably, lightweight models (1B-4B) demonstrate minimal safety cognition. Since lightweight models are crucial for autonomous driving systems, this poses a significant challenge for integrating VLMs into the field.
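The consistency check the abstract describes can be sketched as a simple agreement rate between the automated LLM-based evaluator's labels and expert labels. The function and label values below are illustrative, not from the paper:

```python
# Hypothetical sketch: agreement between an automated judge and expert labels.
# Label values ("safe"/"unsafe") are made up for illustration.

def consistency_rate(auto_labels, expert_labels):
    """Fraction of items where the automated evaluator agrees with experts."""
    if len(auto_labels) != len(expert_labels):
        raise ValueError("label lists must be the same length")
    if not auto_labels:
        return 0.0
    matches = sum(a == e for a, e in zip(auto_labels, expert_labels))
    return matches / len(auto_labels)

auto = ["safe", "unsafe", "safe", "safe"]
expert = ["safe", "unsafe", "safe", "unsafe"]
rate = consistency_rate(auto, expert)  # 3 of 4 labels agree
```

On the paper's full benchmark, the same computation over the LLM judge's outputs and the expert annotations yields the reported 99.74%.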


Every AI is at your command--Get ChatGPT, Gemini, Midjourney for $40

Popular Science

Just like streaming services, juggling multiple subscriptions for different AI tools is overwhelming and costly, but this platform breaks you free from monthly fees. You gain lifetime access to a suite of powerful features that can streamline your creative process for just $40--less than three months of a single ChatGPT subscription. The idea behind 1minAI is simple yet groundbreaking. Instead of juggling separate accounts and subscriptions, I now enjoy the benefits of top-tier AI tools like ChatGPT, Gemini, and Midjourney, all organized neatly in one dashboard. This integration allows me to generate text, create images, and refine content seamlessly, saving time and money.


Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Stanić, Aleksandar, Caelles, Sergi, Tschannen, Michael

arXiv.org Artificial Intelligence

Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models have achieved strong performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLM-as-controller setup more robust, and removes the need for human engineering of in-context examples.
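The LLM-as-controller pattern described above can be illustrated with a toy example: the controller emits a short program that composes visual tools, and a small runtime executes it. The tool names, the stubbed tools, and the "generated" program string below are all illustrative stand-ins, since the paper's actual tool set and prompts are not shown here:

```python
# Toy LLM-as-controller sketch: visual tools are stubbed as plain functions,
# and the "generated program" is a hand-written stand-in for LLM output.

def find_objects(image, label):          # stub visual tool: filter by label
    return [obj for obj in image if obj == label]

def count(objects):                      # stub visual tool: count detections
    return len(objects)

TOOLS = {"find_objects": find_objects, "count": count}

# A program a controller LLM might generate for "How many cats are there?"
generated_program = "result = count(find_objects(image, 'cat'))"

def run_program(program, image):
    env = dict(TOOLS, image=image)       # expose tools and the input image
    exec(program, env)                   # execute the generated program
    return env["result"]

answer = run_program(generated_program, ["cat", "dog", "cat"])
```

In a real system, `find_objects` would wrap an open-vocabulary detector and the program string would come from the LLM's response to the question plus in-context examples.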


Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Hu, Yushi, Stretcu, Otilia, Lu, Chun-Ta, Viswanathan, Krishnamurthy, Hata, Kenji, Luo, Enming, Krishna, Ranjay, Fuxman, Ariel

arXiv.org Artificial Intelligence

Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.
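VPD's filtering step (sample candidate programs, execute them, keep one that verifies against a known answer) can be sketched as follows. The candidate program strings and the execution harness are illustrative; the paper's actual programs invoke specialized vision models rather than plain list operations:

```python
# Illustrative sketch of VPD's sample-execute-verify filtering. Candidate
# programs are hand-written stand-ins for LLM-sampled programs.

def execute(program, image):
    env = {"image": image}
    try:
        exec(program, env)
        return env.get("result")
    except Exception:
        return None                      # error-prone programs simply fail

def select_correct(candidates, image, verified_answer):
    """Return the first sampled program whose output matches the answer."""
    for program in candidates:
        if execute(program, image) == verified_answer:
            return program
    return None

candidates = [
    "result = len(image)",               # spurious: counts every object
    "result = image.count('violin')",    # correct: counts only violins
]
chosen = select_correct(candidates, ["violin", "person"], 1)
```

The selected program is then translated into a natural-language description of its reasoning steps, which becomes the distillation target for the VLM.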


Recursive Visual Programming

Ge, Jiaxin, Subramanian, Sanjay, Shi, Baifeng, Herzig, Roei, Darrell, Trevor

arXiv.org Artificial Intelligence

Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms of both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which simplifies generated routines, provides more efficient problem solving, and can manage more complex data structures. RVP approaches VQA tasks with an iterative, recursive code-generation approach, allowing decomposition of complicated problems into smaller parts. Notably, RVP is capable of dynamic type assignment, i.e., as the system recursively generates a new piece of code, it autonomously determines the appropriate return type and crafts the requisite code to generate that output. We show RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks through coding.
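The recursive decomposition and dynamic return types described above can be illustrated with a toy solver: a compound question is split into sub-questions answered by recursive calls, and each call chooses its own return type (a bool for existence checks, an int for counting). The routing logic below stands in for code RVP would generate, and the question grammar is invented for the example:

```python
# Toy sketch of recursive decomposition with dynamic return types.
# The scene is a flat list of labels; real RVP operates on images.

def solve(question, scene):
    if " and " in question:              # decompose into two sub-questions
        left, right = question.split(" and ", 1)
        return solve(left, scene) and solve(right, scene)   # bool result
    if question.startswith("how many"):
        label = question.split()[-1]
        return scene.count(label)                           # int result
    if question.startswith("is there a"):
        return question.split()[-1] in scene                # bool result
    raise ValueError("unsupported question")

scene = ["cat", "cat", "sofa"]
compound = solve("is there a sofa and is there a cat", scene)
count = solve("how many cat", scene)
```

Each recursive call resembles generating a fresh sub-routine for a smaller problem, with the return type chosen to fit what the parent call needs.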


The General Directorate of Armaments awards the Tornado contract to the Preligens scale up - Actu IA

#artificialintelligence

The French defense procurement agency (DGA) has awarded Preligens a seven-year contract worth around €240 million. Named TORNADE (Traitement Optique et Radar par Neurones Artificiels via Détecteurs), it covers the acquisition of software licenses for AI solutions that process and exploit large amounts of data, and will benefit in particular the Ministry of Defence's joint intelligence function. Today, a huge amount of data comes from sensors, spy satellites, commercial satellites, drones, and airplanes, and humans struggle to process it fast enough for authorities to make timely decisions. The current geopolitical context shows how crucial it is for governments and defense ministries to be reactive, and underscores the importance of satellite imagery. Preligens' AI image processing technology fully meets these needs.


Top Posts June 20-26: 20 Basic Linux Commands for Data Science Beginners - KDnuggets

#artificialintelligence

Decision Tree Algorithm, Explained by Nagesh Singh Chauhan
21 Cheat Sheets for Data Science Interviews by Nate Rosidi
15 Python Coding Interview Questions You Must Know For Data Science by Nate Rosidi
Naïve Bayes Algorithm: Everything You Need to Know by Nagesh Singh Chauhan
14 Essential Git Commands for Data Scientists by Abid Ali Awan
Top Programming Languages and Their Uses by Claire D. Costa
3 Ways Understanding Bayes Theorem Will Improve Your Data Science by Nicole Janeway Bills
DBSCAN Clustering Algorithm in Machine Learning by Nagesh Singh Chauhan
The Complete Collection of Data Science Books – Part 2 by Abid Ali Awan
5 Different Ways to Load Data in Python by Ahmad Anis


US military wants $29.8m for IT to boost AI intel analysis

#artificialintelligence

The US Northern Command, the military command group designated to protect North America from attack, has lobbied Congress for $29.8m to expand its IT infrastructure to better support machine-learning technologies. The request is part of the command's unfunded priorities list for fiscal year 2023, a wish list of all the gear and tech US NORTHCOM and the North American Aerospace Defense Command (NORAD) reckon is needed for building and testing new weapons or operating monitoring systems, totaling some $135 million. The requested IT funding will, we're told, be used to procure cloud computing infrastructure to run AI workloads from US NORTHCOM and NORAD's joint operations center, according to Defense News this week. The goal is to build smart cloud-hosted systems that can process incoming data, generate insight and decision options, and make this intelligence available across the Dept of Defense for leaders to consider. "Maintaining our strategic advantage begins with improving domain awareness globally, including in the approaches to North America," General Glen VanHerck, US NORTHCOM and NORAD's commander, said in a statement before the Senate Armed Services Committee in late March.


Meet DALL-E, the A.I. That Draws Anything at Your Command

#artificialintelligence

A half decade ago, the world's leading A.I. labs built systems that could identify objects in digital images and even generate images on their own, including flowers, dogs, cars and faces. A few years later, they built systems that could do much the same with written language, summarizing articles, answering questions, generating tweets and even writing blog posts. Now, researchers are combining those technologies to create new forms of A.I. DALL-E is a notable step forward because it juggles both language and images and, in some cases, grasps the relationship between the two. "We can now use multiple, intersecting streams of information to create better and better technology," said Oren Etzioni, chief executive of the Allen Institute for Artificial Intelligence, an artificial intelligence lab in Seattle. The technology is not perfect.


SoK: A Study of the Security on Voice Processing Systems

Chang, Robert, Kuo, Logan, Liu, Arthur, Sehatbakhsh, Nader

arXiv.org Artificial Intelligence

As the use of Voice Processing Systems (VPS) becomes more prevalent in our daily lives through the increased reliance on applications such as commercial voice recognition devices and major text-to-speech software, the attacks on these systems grow increasingly complex, varied, and constantly evolving. With the use cases for VPS rapidly expanding into new spaces and purposes, the potential consequences for privacy are increasingly dangerous. In addition, the growing number and increased practicality of over-the-air attacks have made system failures much more probable. In this paper, we identify and classify a range of distinct attacks on voice processing systems. Over the years, research has shifted from specialized, untargeted attacks that cause system malfunction and denial of service to more general, targeted attacks that can force an outcome controlled by an adversary. The machine learning systems and deep neural networks at the core of modern voice processing systems were built with a focus on performance and scalability rather than security. It is therefore critical to reassess the developing voice processing landscape and to identify the state of current attacks and defenses, so that we may suggest future developments and theoretical improvements.