AITopics | Object-Oriented Architecture

Collaborating Authors

Object-Oriented Architecture

News Overviews Instructional Materials AI-Alerts Classics

ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

Zhu, William Y., Ye, Keren, Ke, Junjie, Yu, Jiahui, Guibas, Leonidas, Milanfar, Peyman, Yang, Feng

arXiv.org Artificial IntelligenceAug-7-2024

Recognizing and disentangling visual attributes from objects is a foundation to many computer vision applications. While large vision language representations like CLIP had largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively-learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling a to-be-measured and retrieved object-attribute relation as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) on this reformulation and naturally distilling its knowledge of image-object-attribute relations to use towards attribute recognition. Specifically, for each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence encoding the attribute's relation to objects on the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets, Visual Attribute in the Wild (VAW), and our newly-proposed Visual Genome Attribute Ranking (VGARank).

computer vision and pattern recognition, recognition, retrieval, (10 more...)

arXiv.org Artificial Intelligence

2408.04102

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

A Robotics-Inspired Scanpath Model Reveals the Importance of Uncertainty and Semantic Object Cues for Gaze Guidance in Dynamic Scenes

Mengers, Vito, Roth, Nicolas, Brock, Oliver, Obermayer, Klaus, Rolfs, Martin

arXiv.org Artificial IntelligenceAug-2-2024

How we perceive objects around us depends on what we actively attend to, yet our eye movements depend on the perceived objects. Still, object segmentation and gaze behavior are typically treated as two independent processes. Drawing on an information processing pattern from robotics, we present a mechanistic model that simulates these processes for dynamic real-world scenes. Our image-computable model uses the current scene segmentation for object-based saccadic decision-making while using the foveated object to refine its scene segmentation recursively. To model this refinement, we use a Bayesian filter, which also provides an uncertainty estimate for the segmentation that we use to guide active scene exploration. We demonstrate that this model closely resembles observers' free viewing behavior, measured by scanpath statistics, including foveation duration and saccade amplitude distributions used for parameter fitting and higher-level statistics not used for fitting. These include how object detections, inspections, and returns are balanced and a delay of returning saccades without an explicit implementation of such temporal inhibition of return. Extensive simulations and ablation studies show that uncertainty promotes balanced exploration and that semantic object cues are crucial to form the perceptual units used in object-based attention. Moreover, we show how our model's modular design allows for extensions, such as incorporating saccadic momentum or pre-saccadic attention, to further align its output with human scanpaths.

saccade, scanpath, segmentation, (17 more...)

arXiv.org Artificial Intelligence

2408.01322

Country:

Europe > Germany (0.04)
North America > Canada (0.04)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.84)

Add feedback

ALPI: Auto-Labeller with Proxy Injection for 3D Object Detection using 2D Labels Only

Lahlali, Saad, Granger, Nicolas, Borgne, Hervé Le, Pham, Quoc-Cuong

arXiv.org Artificial IntelligenceJul-24-2024

3D object detection plays a crucial role in various applications such as autonomous vehicles, robotics and augmented reality. However, training 3D detectors requires a costly precise annotation, which is a hindrance to scaling annotation to large datasets. To address this challenge, we propose a weakly supervised 3D annotator that relies solely on 2D bounding box annotations from images, along with size priors. One major problem is that supervising a 3D detection model using only 2D boxes is not reliable due to ambiguities between different 3D poses and their identical 2D projection. We introduce a simple yet effective and generic solution: we build 3D proxy objects with annotations by construction and add them to the training dataset. Our method requires only size priors to adapt to new classes. To better align 2D supervision with 3D detection, our method ensures depth invariance with a novel expression of the 2D losses. Finally, to detect more challenging instances, our annotator follows an offline pseudo-labelling scheme which gradually improves its 3D pseudo-labels. Extensive experiments on the KITTI dataset demonstrate that our method not only performs on-par or above previous works on the Car category, but also achieves performance close to fully supervised methods on more challenging classes. We further demonstrate the effectiveness and robustness of our method by being the first to experiment on the more challenging nuScenes dataset. We additionally propose a setting where weak labels are obtained from a 2D detector pre-trained on MS-COCO instead of human annotations.

annotation, box annotation, detection, (16 more...)

arXiv.org Artificial Intelligence

2407.17197

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(8 more...)

Genre: Research Report (0.50)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.66)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.51)

Add feedback

VisMin: Visual Minimal-Change Understanding

Awal, Rabiul, Ahmadi, Saba, Zhang, Le, Agrawal, Aishwarya

arXiv.org Artificial IntelligenceJul-23-2024

Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar \textit{captions} given an image. In this paper, we introduce a new, challenging benchmark termed \textbf{Vis}ual \textbf{Min}imal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: \textit{object}, \textit{attribute}, \textit{count}, and \textit{spatial relation}. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP's general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at \url{https://vismin.net/}.

benchmark, caption, edit instruction, (12 more...)

arXiv.org Artificial Intelligence

2407.16772

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Sports > Tennis (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

OCTrack: Benchmarking the Open-Corpus Multi-Object Tracking

Qian, Zekun, Han, Ruize, Feng, Wei, Hou, Junhui, Song, Linqi, Wang, Song

arXiv.org Artificial IntelligenceJul-19-2024

We study a novel yet practical problem of open-corpus multi-object tracking (OCMOT), which extends the MOT into localizing, associating, and recognizing generic-category objects of both seen (base) and unseen (novel) classes, but without the category text list as prompt. To study this problem, the top priority is to build a benchmark. In this work, we build OCTrackB, a large-scale and comprehensive benchmark, to provide a standard evaluation platform for the OC-MOT problem. Compared to previous datasets, OCTrackB has more abundant and balanced base/novel classes and the corresponding samples for evaluation with less bias. We also propose a new multi-granularity recognition metric to better evaluate the generative object recognition in OCMOT. By conducting the extensive benchmark evaluation, we report and analyze the results of various state-of-the-art methods, which demonstrate the rationale of OCMOT, as well as the usefulness and advantages of OCTrackB.

category, computer vision, proceedings, (12 more...)

arXiv.org Artificial Intelligence

2407.14047

Country:

Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
North America > United States > South Carolina (0.04)
North America > United States > Maine (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.48)

Add feedback

Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI

Liu, Yang, Chen, Weixing, Bai, Yongjie, Luo, Jingzhou, Song, Xinshuai, Jiang, Kaixuan, Li, Zhida, Zhao, Ganlong, Lin, Junyi, Li, Guanbin, Gao, Wen, Lin, Liang

arXiv.org Artificial IntelligenceJul-18-2024

Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace and the physical world. Recently, the emergence of Multi-modal Large Models (MLMs) and World Models (WMs) have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities, making them a promising architecture for the brain of embodied agents. However, there is no comprehensive survey for Embodied AI in the era of MLMs. In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI. Our analysis firstly navigates through the forefront of representative works of embodied robots and simulators, to fully understand the research focuses and their limitations. Then, we analyze four main research targets: 1) embodied perception, 2) embodied interaction, 3) embodied agent, and 4) sim-to-real adaptation, covering the state-of-the-art methods, essential paradigms, and comprehensive datasets. Additionally, we explore the complexities of MLMs in virtual and real embodied agents, highlighting their significance in facilitating interactions in dynamic digital and physical environments. Finally, we summarize the challenges and limitations of embodied AI and discuss their potential future directions. We hope this survey will serve as a foundational reference for the research community and inspire continued innovation. The associated project can be found at https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List.

large language model, machine learning, reinforcement learning, (22 more...)

arXiv.org Artificial Intelligence

2407.06886

Country: Asia > China (0.67)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.65)

Industry:

Health & Medicine (1.00)
Education > Educational Setting (0.67)
Information Technology > Robotics & Automation (0.45)
Energy > Oil & Gas (0.45)

Technology:

Information Technology > Artificial Intelligence > Robots > Manipulation (1.00)
Information Technology > Artificial Intelligence > Robots > Locomotion (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
(6 more...)

Add feedback

OVExp: Open Vocabulary Exploration for Object-Oriented Navigation

Wei, Meng, Wang, Tai, Chen, Yilun, Wang, Hanqing, Pang, Jiangmiao, Liu, Xihui

arXiv.org Artificial IntelligenceJul-12-2024

Object-oriented embodied navigation aims to locate specific objects, defined by category or depicted in images. Existing methods often struggle to generalize to open vocabulary goals without extensive training data. While recent advances in Vision-Language Models (VLMs) offer a promising solution by extending object recognition beyond predefined categories, efficient goal-oriented exploration becomes more challenging in an open vocabulary setting. We introduce OVExp, a learning-based framework that integrates VLMs for Open-Vocabulary Exploration. OVExp constructs scene representations by encoding observations with VLMs and projecting them onto top-down maps for goal-conditioned exploration. Goals are encoded in the same VLM feature space, and a lightweight transformer-based decoder predicts target locations while maintaining versatile representation abilities. To address the impracticality of fusing dense pixel embeddings with full 3D scene reconstruction for training, we propose constructing maps using low-cost semantic categories and transforming them into CLIP's embedding space via the text encoder. The simple but effective design of OVExp significantly reduces computational costs and demonstrates strong generalization abilities to various navigation settings. Experiments on established benchmarks show OVExp outperforms previous zero-shot methods, can generalize to diverse scenes, and handle different goal modalities.

navigation, ovexp, semantic map, (13 more...)

arXiv.org Artificial Intelligence

2407.09016

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)

Add feedback

OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

Krishnan, Akshay, Kundu, Abhijit, Maninis, Kevis-Kokitsi, Hays, James, Brown, Matthew

arXiv.org Artificial IntelligenceJul-11-2024

We propose OmniNOCS, a large-scale monocular dataset with 3D Normalized Object Coordinate Space (NOCS) maps, object masks, and 3D bounding box annotations for indoor and outdoor scenes. OmniNOCS has 20 times more object classes and 200 times more instances than existing NOCS datasets (NOCS-Real275, Wild6D). We use OmniNOCS to train a novel, transformer-based monocular NOCS prediction model (NOCSformer) that can predict accurate NOCS, instance masks and poses from 2D object detections across diverse classes. It is the first NOCS model that can generalize to a broad range of classes when prompted with 2D boxes. We evaluate our model on the task of 3D oriented bounding box prediction, where it achieves comparable results to state-of-the-art 3D detection methods such as Cube R-CNN. Unlike other 3D detection methods, our model also provides detailed and accurate 3D object shape and segmentation. We propose a novel benchmark for the task of NOCS prediction based on OmniNOCS, which we hope will serve as a useful baseline for future work in this area. Our dataset and code will be at the project website: https://omninocs.github.io.

dataset, nocsformer, omninocs, (14 more...)

arXiv.org Artificial Intelligence

2407.08711

Country:

South America > Brazil (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report (0.64)

Industry:

Transportation > Ground > Road (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Li, Xin, Miao, Deshui, He, Zhenyu, Wang, Yaowei, Lu, Huchuan, Yang, Ming-Hsuan

arXiv.org Artificial IntelligenceJul-10-2024

Tracking and segmenting multiple similar objects with complex or separate parts in long-term videos is inherently challenging due to the ambiguity of target parts and identity confusion caused by occlusion, background clutter, and long-term variations. In this paper, we propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries to address the above issues. Specifically, we construct a spatial-semantic network comprising a semantic embedding block and spatial dependencies modeling block to associate the pretrained ViT features with global semantic features and local spatial features, providing a comprehensive target representation. In addition, we develop a masked cross-attention module to generate object queries that focus on the most discriminative parts of target objects during query propagation, alleviating noise accumulation and ensuring effective long-term query propagation. The experimental results show that the proposed method set a new state-of-the-art performance on multiple datasets, including the DAVIS2017 test (89.1%), YoutubeVOS 2019 (88.5%), MOSE (75.1%), LVOS test (73.0%), and LVOS val (75.1%), which demonstrate the effectiveness and generalization capacity of the proposed method. We will make all source code and trained models publicly available.

dataset, segmentation, video, (14 more...)

arXiv.org Artificial Intelligence

2407.0776

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.48)

Add feedback

DOZE: A Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments

Ma, Ji, Dai, Hongming, Mu, Yao, Wu, Pengying, Wang, Hao, Chi, Xiaowei, Fei, Yang, Zhang, Shanghang, Liu, Chang

arXiv.org Artificial IntelligenceJul-8-2024

Zero-Shot Object Navigation (ZSON) requires agents to autonomously locate and approach unseen objects in unfamiliar environments and has emerged as a particularly challenging task within the domain of Embodied AI. Existing datasets for developing ZSON algorithms lack consideration of dynamic obstacles, object attribute diversity, and scene texts, thus exhibiting noticeable discrepancies from real-world situations. To address these issues, we propose a Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments (DOZE) that comprises ten high-fidelity 3D scenes with over 18k tasks, aiming to mimic complex, dynamic real-world scenarios. Specifically, DOZE scenes feature multiple moving humanoid obstacles, a wide array of open-vocabulary objects, diverse distinct-attribute objects, and valuable textual hints. Besides, different from existing datasets that only provide collision checking between the agent and static obstacles, we enhance DOZE by integrating capabilities for detecting collisions between the agent and moving obstacles. This novel functionality enables the evaluation of the agents' collision avoidance abilities in dynamic environments. We test four representative ZSON methods on DOZE, revealing substantial room for improvement in existing approaches concerning navigation efficiency, safety, and object recognition accuracy. Our dataset can be found at https://DOZE-Dataset.github.io/.

large language model, natural language, object-oriented architecture, (14 more...)

arXiv.org Artificial Intelligence

2402.19007

Country: Asia > China (0.47)

Genre: Research Report (0.82)

Industry: Energy > Oil & Gas > Upstream (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.66)

Add feedback