 scene analysis


Zero-Shot Scene Understanding with Multimodal Large Language Models for Automated Vehicles

arXiv.org Artificial Intelligence

Scene understanding is critical for various downstream tasks in autonomous driving, including facilitating driver-agent communication and enhancing human-centered explainability of autonomous vehicle (AV) decisions. This paper evaluates the capability of four multimodal large language models (MLLMs), including relatively small models, to understand scenes in a zero-shot, in-context learning setting. Additionally, we explore whether combining these models using an ensemble approach with majority voting can enhance scene understanding performance. Our experiments demonstrate that GPT-4o, the largest model, outperforms the others in scene understanding. However, the performance gap between GPT-4o and the smaller models is relatively modest, suggesting that advanced techniques such as improved in-context learning, retrieval-augmented generation (RAG), or fine-tuning could further optimize the smaller models' performance. We also observe mixed results with the ensemble approach: while some scene attributes show improvement in performance metrics such as F1-score, others experience a decline. These findings highlight the need for more sophisticated ensemble techniques to achieve consistent gains across all scene attributes. This study underscores the potential of leveraging MLLMs for scene understanding and provides insights into optimizing their performance for autonomous driving applications.
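The majority-voting ensemble mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the attribute names (`weather`, `road`), the label values, and the tie-breaking rule (the label that appears first among the model outputs wins) are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label in `predictions`.

    Ties are broken in favor of the label that appears first
    (an assumption; the abstract does not state a tie-breaking rule).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


def ensemble_scene_labels(per_model):
    """Combine per-model outputs attribute by attribute.

    `per_model` is a list of dicts, one per model, each mapping a
    scene attribute (hypothetical names) to that model's predicted label.
    """
    attributes = per_model[0].keys()
    return {a: majority_vote([m[a] for m in per_model]) for a in attributes}


# Hypothetical outputs from four models for one scene.
outputs = [
    {"weather": "rain", "road": "highway"},
    {"weather": "rain", "road": "urban"},
    {"weather": "clear", "road": "urban"},
    {"weather": "rain", "road": "highway"},
]
combined = ensemble_scene_labels(outputs)
```

With an even number of models, ties such as the 2-2 split on `road` above are possible. A plain vote also gives a weaker model the same weight as a stronger one, which is one plausible reason an ensemble can lower the score on some attributes.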


Nano Drone-based Indoor Crime Scene Analysis

arXiv.org Artificial Intelligence

Technologies such as robotics, Artificial Intelligence (AI), and Computer Vision (CV) can be applied to crime scene analysis (CSA) to help protect lives, facilitate justice, and deter crime, but an overview of the tasks that can be automated has been lacking. Here we follow a speculative prototyping approach. First, the STAIR tool is used to rapidly review the literature and identify tasks that appear to have received little attention, such as accessing crime sites through a window, mapping and gathering evidence, and analyzing blood smears. Second, we present a prototype of a small drone that implements these three tasks with 75%, 85%, and 80% performance, respectively, to perform a minimal analysis of an indoor crime scene. Lessons learned are reported to guide future work in the area.


A system of vision sensor based deep neural networks for complex driving scene analysis in support of crash risk assessment and prevention

arXiv.org Artificial Intelligence

To assist human drivers and autonomous vehicles in assessing crash risks, driving scene analysis using dash cameras on vehicles and deep learning algorithms is of paramount importance. Although these technologies are increasingly available, driving scene analysis for this purpose remains a challenge. This is mainly due to the lack of large annotated image datasets for analyzing crash risk indicators and crash likelihood, and the lack of an effective method for extracting the large amount of required information from complex driving scenes. To fill this gap, this paper develops a scene analysis system. The Multi-Net of the system includes two multi-task neural networks that perform scene classification to provide four labels for each scene. The system combines DeepLab v3 and YOLO v3 to detect and locate risky pedestrians and the nearest vehicles. All identified information can provide situational awareness to autonomous vehicles or human drivers for identifying crash risks from the surrounding traffic. To address the scarcity of annotated image datasets for studying traffic crashes, two new datasets were developed in this work and made publicly available; they proved effective in training the proposed deep neural networks. The paper further evaluates the performance of the Multi-Net and the efficiency of the developed system. Comprehensive scene analysis is further illustrated with representative examples. Results demonstrate the effectiveness of the developed system and datasets for driving scene analysis, and their value for crash risk assessment and crash prevention.


Acoustic scene analysis with multi-head attention networks

arXiv.org Machine Learning

Acoustic Scene Classification (ASC) is a challenging task, as a single scene may involve multiple events that contain complex sound patterns. For example, a cooking scene may contain several sound sources, including silverware clinking, chopping, and frying. ASC is further complicated by the fact that classes of different activities can have overlapping sound patterns (e.g., both cooking and dishwashing can include the sound of silverware clinking). In this paper, we propose a multi-head attention network to model the complex temporal input structures for ASC. The proposed network takes the audio's time-frequency representation as input and uses standard VGG plus LSTM layers to extract a high-level feature representation. Furthermore, it applies multiple attention heads to summarize various patterns of sound events into a fixed-dimensional representation for final scene classification. The whole network is trained end-to-end with back-propagation. Experimental results confirm that our model discovers meaningful sound patterns through the attention mechanism, without explicit supervision of the alignment. We evaluated the proposed model on the DCASE 2018 Task 5 dataset and achieved competitive performance, on par with the previous winner's results.
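The idea of using several attention heads to summarize a variable-length sequence into a fixed-dimensional vector can be sketched as below. This is a generic dot-product attention pooling written in plain Python, not the paper's architecture: the per-head query vectors, the dot-product scoring, and the concatenation of head summaries are assumptions for illustration. In the paper the frame features would come from the VGG and LSTM layers and the parameters would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, heads):
    """Summarize T frame vectors into one fixed-length vector.

    frames: list of T feature vectors, each of dimension D.
    heads:  list of H query vectors of dimension D (hypothetical
            stand-ins for learned attention parameters).
    Returns a flat list of length H * D, independent of T.
    """
    dim = len(frames[0])
    pooled = []
    for query in heads:
        # Score each time step against this head's query.
        scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in frames]
        weights = softmax(scores)
        # Attention-weighted average of the frames for this head.
        summary = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
        pooled.extend(summary)
    return pooled


# Toy input: three frames of dimension 2, two heads.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
heads = [[0.0, 0.0], [10.0, 0.0]]
summary = attention_pool(frames, heads)
```

Because each head produces one weighted average over time, the output length depends only on the number of heads and the feature dimension, so clips of different durations map to vectors of the same size for the final classifier.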


Google's A.I. Is Training Itself to Count Calories In Food Photos

#artificialintelligence

Whether by accident or design, the details of Google's plans for artificial intelligence (AI) have been elusive. In some cases, there's no real mystery, just nothing all that exciting to talk about. AI technology is the foundation of the company's search engine, and the most obvious reason for Google's high-profile, $400M acquisition of DeepMind in 2014 is to use the UK firm's expertise in deep learning--a subset of AI research, but more on that later--to bolster that core capability. But the Googleplex has absorbed other bright minds from the field of AI, as well as some of the most buzzed-about companies in robotics, with only some of that collective braintrust officially allocated to driverless cars, delivery drones or other publicly announced robotics or AI-related projects. What, exactly, are Google's AI experts up to?


How to see a simple world: an exegesis of some computer programs for scene analysis.

Classics

The Platonic assumption that the world is made up entirely of objects with flat surfaces obviously does not hold; and yet, as with so many other simplifications of reality for the sake of tractability, it has been immensely productive in establishing a paradigm for scene analysis. There is a coherent evolving body of research based on the notion that a polyhedral world is the simplest we can consider without eliminating any of the essential aspects of scene analysis, namely, the picture-taking process, models, lighting, support, occlusion, and so on. The thesis is that once we achieve ways of dealing intelligently with those aspects for a simple, but nonetheless real, world we could then consider the fuzzy world of teddy bears (Michie, 1974) and the like. This should not be taken as suggesting that each of those aspects presents simply a separate, independent subproblem to be solved. The most important question to be faced was how to write programs that coordinate the use of these separate, but interrelated, knowledge systems to achieve sensible picture interpretations. Roberts (Roberts, 1965) was the first to give an answer to this question. We shall examine his answer in some detail, because he exposed in it the issues that became themes of the first decade of scene analysis.


An Accommodating Edge Follower

Classics

This edge follower could easily find the outlines of white cubes on a black table, but was prone to error in less carefully controlled environments. Our studies of its inadequacies have stimulated the development of a more powerful edge follower, which overcomes most of the limitations of the old one. This program is currently the initial stage of visual processing in the Stanford hand-eye system (2). It has demonstrated an ability to track weak edges under adverse lighting conditions.

2. HARDWARE

The edge follower uses a standard vidicon television camera, modified to provide computer control of orientation (a pan-tilt head), focal length (a lens turret), color filter, focus, and target voltage. The lens iris is set manually. The pan-tilt head, lens turret, and focus motor

*This research was supported by the Advanced Research Projects Agency of the Department of Defense under Contract No. SD-183. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Advanced Research Projects Agency of the U.S. Government.


Scene Analysis Based on Imperfect Edge Data

Classics

This system accepts as input a scene represented as a line drawing. Based on a set of known object models, the program attempts to determine the identity and location of each object viewed. The most significant feature of the system is its ability to deal with imperfect input data. This ability appears essential in light of our current stock of preprocessing techniques and the variation that is possible in real world data.

INTRODUCTION

A hand-eye system is a problem solving system with an eye (camera) for input and a hand (manipulator) for output. Such a system must have at least 1) a set of scene analysis (perception) programs which interpret the real world in a meaningful way, 2) a set of manipulation programs which control movement of the hand in 3-space, and 3) an executive (problem solver, strategy) program which directs the perceptual and motor processes toward a desired goal.