AITopics | Object-Oriented Architecture

Collaborating Authors

Object-Oriented Architecture

News Overviews Instructional Materials AI-Alerts Classics

Reviews: Modelling and unsupervised learning of symmetric deformable object categories

Neural Information Processing SystemsOct-7-2024, 06:07:53 GMT

Summary: This work propose an approach to model symmetries in deformable object categories in an unsupervised manner. This approach has been demonstrated to work for objects with bilateral symmetry (identifying symmetries in human faces using CelebA dataset, cats using cats head dataset, on cars with a synthetic car dataset), and finally for rotational symmetry on a protein structure. Pros: Overview of the problem and associated challenges. The proposed approach seems a natural way to establish dense correspondences for non-rigid objects given two views of same object category (say example in Figure-3). In my opinion, correspondences for non-rigid/deformable objects is far more important problem than symmetry (with a potential impact on numerous problems including non rigid 3D reconstruction, wide baseline disparity estimation, human analysis etc).

artificial intelligence, object-oriented architecture, symmetry, (8 more...)

Neural Information Processing Systems

Genre: Personal > Opinion (0.39)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.84)

Add feedback

SharpSLAM: 3D Object-Oriented Visual SLAM with Deblurring for Agile Drones

Davletshin, Denis, Zhura, Iana, Cheremnykh, Vladislav, Rybiyanov, Mikhail, Fedoseev, Aleksey, Tsetserukou, Dzmitry

arXiv.org Artificial IntelligenceOct-7-2024

The paper focuses on the algorithm for improving the quality of 3D reconstruction and segmentation in DSP-SLAM by enhancing the RGB image quality. SharpSLAM algorithm developed by us aims to decrease the influence of high dynamic motion on visual object-oriented SLAM through image deblurring, improving all aspects of object-oriented SLAM, including localization, mapping, and object reconstruction. The experimental results revealed noticeable improvement in object detection quality, with F-score increased from 82.9% to 86.2% due to the higher number of features and corresponding map points. The RMSE of signed distance function has also decreased from 17.2 to 15.4 cm. Furthermore, our solution has enhanced object positioning, with an increase in the IoU from 74.5% to 75.7%. SharpSLAM algorithm has the potential to highly improve the quality of 3D reconstruction and segmentation in DSP-SLAM and to impact a wide range of fields, including robotics, autonomous vehicles, and augmented reality.

algorithm, international conference, reconstruction, (17 more...)

arXiv.org Artificial Intelligence

2410.05405

Country:

Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Asia > Russia (0.04)

Genre: Research Report (1.00)

Industry: Transportation > Ground > Road (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.82)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.67)

Add feedback

Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

Xu, Chengyuan, Kumaran, Radha, Stier, Noah, Yu, Kangyou, Höllerer, Tobias

arXiv.org Artificial IntelligenceOct-6-2024

Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment. AR research has long focused on the potential of context awareness, demonstrating novel capabilities that leverage the semantics in the 3D environment for various object-level interactions. Meanwhile, the computer vision community has made leaps in neural vision-language understanding to enhance environment perception for autonomous tasks. In this work, we introduce a multimodal 3D object representation that unifies both semantic and linguistic knowledge with the geometric representation, enabling user-guided machine learning involving physical objects. We first present a fast multimodal 3D reconstruction pipeline that brings linguistic understanding to AR by fusing CLIP vision-language features into the environment and object models. We then propose "in-situ" machine learning, which, in conjunction with the multimodal representation, enables new tools and interfaces for users to interact with physical spaces and objects in a spatially and linguistically meaningful manner. We demonstrate the usefulness of the proposed system through two real-world AR applications on Magic Leap 2: a) spatial search in physical environments with natural language and b) an intelligent inventory system that tracks object changes over time. We also make our full implementation and demo data available at (https://github.com/cy-xu/spatially_aware_AI) to encourage further exploration and research in spatially aware AI.

application, proceedings, representation, (13 more...)

arXiv.org Artificial Intelligence

2410.04652

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > New York > New York County > New York City (0.05)
Oceania > Australia > New South Wales > Sydney (0.04)
(4 more...)

Genre: Research Report (0.50)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

Unsupervised learning of object frames by dense equivariant image labelling

James Thewlis, Hakan Bilen, Andrea Vedaldi

Neural Information Processing SystemsOct-4-2024, 06:06:48 GMT

One of the key challenges of visual perception is to extract abstract models of 3D objects and object categories from visual measurements, which are affected by complex nuisance factors such as viewpoint, occlusion, motion, and deformations. Starting from the recent idea of viewpoint factorization, we propose a new approach that, given a large number of images of an object and no other supervision, can extract a dense object-centric coordinate frame. This coordinate frame is invariant to deformations of the images and comes with a dense equivariant labelling neural network that can map image pixels to their corresponding object coordinates. We demonstrate the applicability of this method to simple articulated objects and deformable objects such as human faces, learning embeddings from random synthetic transformations or optical flow correspondences, all without any manual supervision.

correspondence, landmark, proc, (16 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.89)

Add feedback

CLIP-Clique: Graph-based Correspondence Matching Augmented by Vision Language Models for Object-based Global Localization

Matsuzaki, Shigemichi, Tanaka, Kazuhito, Shintani, Kazuhiro

arXiv.org Artificial IntelligenceOct-3-2024

This letter proposes a method of global localization on a map with semantic object landmarks. One of the most promising approaches for localization on object maps is to use semantic graph matching using landmark descriptors calculated from the distribution of surrounding objects. These descriptors are vulnerable to misclassification and partial observations. Moreover, many existing methods rely on inlier extraction using RANSAC, which is stochastic and sensitive to a high outlier rate. To address the former issue, we augment the correspondence matching using Vision Language Models (VLMs). Landmark discriminability is improved by VLM embeddings, which are independent of surrounding objects. In addition, inliers are estimated deterministically using a graph-theoretic approach. We also incorporate pose calculation using the weighted least squares considering correspondence similarity and observation completeness to improve the robustness. We confirmed improvements in matching and pose estimation accuracy through experiments on ScanNet and TUM datasets.

artificial intelligence, natural language, object-oriented architecture, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/LRA.2024.3474482

2410.03054

Country:

Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.88)

Add feedback

QDGset: A Large Scale Grasping Dataset Generated with Quality-Diversity

Huber, Johann, Hélénon, François, Kappel, Mathilde, Páez-Ubieta, Ignacio de Loyola, Puente, Santiago T., Gil, Pablo, Amar, Faïz Ben, Doncieux, Stéphane

arXiv.org Artificial IntelligenceOct-3-2024

Recent advances in AI have led to significant results in robotic learning, but skills like grasping remain partially solved. Many recent works exploit synthetic grasping datasets to learn to grasp unknown objects. However, those datasets were generated using simple grasp sampling methods using priors. Recently, Quality-Diversity (QD) algorithms have been proven to make grasp sampling significantly more efficient. In this work, we extend QDG-6DoF, a QD framework for generating object-centric grasps, to scale up the production of synthetic grasping datasets. We propose a data augmentation method that combines the transformation of object meshes with transfer learning from previous grasping repertoires. The conducted experiments show that this approach reduces the number of required evaluations per discovered robust grasp by up to 20%. We used this approach to generate QDGset, a dataset of 6DoF grasp poses that contains about 3.5 and 4.5 times more grasps and objects, respectively, than the previous state-of-the-art. Our method allows anyone to easily generate data, eventually contributing to a large-scale collaborative dataset of synthetic grasps.

dataset, gripper, robotic and automation, (15 more...)

arXiv.org Artificial Intelligence

2410.02319

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Spain > Valencian Community > Alicante Province > Alicante (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.46)

Add feedback

Class-Agnostic Visio-Temporal Scene Sketch Semantic Segmentation

Kütük, Aleyna, Sezgin, Tevfik Metin

arXiv.org Artificial IntelligenceSep-30-2024

Scene sketch semantic segmentation is a crucial task for various applications including sketch-to-image retrieval and scene understanding. Existing sketch segmentation methods treat sketches as bitmap images, leading to the loss of temporal order among strokes due to the shift from vector to image format. Moreover, these methods struggle to segment objects from categories absent in the training data. In this paper, we propose a Class-Agnostic Visio-Temporal Network (CAVT) for scene sketch semantic segmentation. CAVT employs a class-agnostic object detector to detect individual objects in a scene and groups the strokes of instances through its post-processing module. This is the first approach that performs segmentation at both the instance and stroke levels within scene sketches. Furthermore, there is a lack of free-hand scene sketch datasets with both instance and stroke-level class annotations. To fill this gap, we collected the largest Free-hand Instance- and Stroke-level Scene Sketch Dataset (FrISS) that contains 1K scene sketches and covers 403 object classes with dense annotations. Extensive experiments on FrISS and other datasets demonstrate the superior performance of our method over state-of-the-art scene sketch segmentation models. The code and dataset will be made public after acceptance.

category, dataset, sketch, (16 more...)

arXiv.org Artificial Intelligence

2410.00266

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Middle East > Republic of Türkiye (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Asia > Cambodia > Siem Reap Province > Siem Reap (0.04)

Genre: Research Report > Promising Solution (0.46)

Industry:

Transportation (1.00)
Consumer Products & Services (1.00)
Leisure & Entertainment > Sports (0.93)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models

Yu, Qiaojun, Huang, Siyuan, Yuan, Xibin, Jiang, Zhengkai, Hao, Ce, Li, Xin, Chang, Haonan, Wang, Junbo, Liu, Liu, Li, Hongsheng, Gao, Peng, Lu, Cewu

arXiv.org Artificial IntelligenceSep-30-2024

Previous studies on robotic manipulation are based on a limited understanding of the underlying 3D motion constraints and affordances. To address these challenges, we propose a comprehensive paradigm, termed UniAff, that integrates 3D object-centric manipulation and task understanding in a unified formulation. Specifically, we constructed a dataset labeled with manipulation-related key attributes, comprising 900 articulated objects from 19 categories and 600 tools from 12 categories. Furthermore, we leverage MLLMs to infer object-centric representations for manipulation tasks, including affordance recognition and reasoning about 3D motion constraints. Comprehensive experiments in both simulation and real-world settings indicate that UniAff significantly improves the generalization of robotic manipulation for tools and articulated objects. We hope that UniAff will serve as a general baseline for unified robotic manipulation tasks in the future. Images, videos, dataset, and code are published on the project website at:https://sites.google.com/view/uni-aff/home

affordance, manipulation, uniaff, (14 more...)

arXiv.org Artificial Intelligence

2409.20551

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > United States > Florida > Hillsborough County > University (0.04)
Asia > Singapore > Central Region > Singapore (0.04)
(2 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.46)

Add feedback

OpenObject-NAV: Open-Vocabulary Object-Oriented Navigation Based on Dynamic Carrier-Relationship Scene Graph

Tang, Yujie, Wang, Meiling, Deng, Yinan, Zheng, Zibo, Zhong, Jiagui, Yue, Yufeng

arXiv.org Artificial IntelligenceSep-27-2024

In everyday life, frequently used objects like cups often have unfixed positions and multiple instances within the same category, and their carriers frequently change as well. As a result, it becomes challenging for a robot to efficiently navigate to a specific instance. To tackle this challenge, the robot must capture and update scene changes and plans continuously. However, current object navigation approaches primarily focus on semantic-level and lack the ability to dynamically update scene representation. This paper captures the relationships between frequently used objects and their static carriers. It constructs an open-vocabulary Carrier-Relationship Scene Graph (CRSG) and updates the carrying status during robot navigation to reflect the dynamic changes of the scene. Based on the CRSG, we further propose an instance navigation strategy that models the navigation process as a Markov Decision Process. At each step, decisions are informed by Large Language Model's commonsense knowledge and visual-language feature similarity. We designed a series of long-sequence navigation tasks for frequently used everyday items in the Habitat simulator. The results demonstrate that by updating the CRSG, the robot can efficiently navigate to moved targets. Additionally, we deployed our algorithm on a real robot and validated its practical effectiveness.

large language model, natural language, navigation, (17 more...)

arXiv.org Artificial Intelligence

2409.18743

Country:

Asia > China > Zhejiang Province > Ningbo (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.64)

Add feedback

Semantically-Driven Disambiguation for Human-Robot Interaction

Dogan, Fethiye Irmak, Liu, Weiyu, Leite, Iolanda, Chernova, Sonia

arXiv.org Artificial IntelligenceSep-25-2024

Ambiguities are common in human-robot interaction, especially when a robot follows user instructions in a large collocated space. For instance, when the user asks the robot to find an object in a home environment, the object might be in several places depending on its varying semantic properties (e.g., a bowl can be in the kitchen cabinet or on the dining room table, depending on whether it is clean/dirty, full/empty and the other objects around it). Previous works on object semantics have predicted such relationships using one shot-inferences which are likely to fail for ambiguous or partially understood instructions. This paper focuses on this gap and suggests a semantically-driven disambiguation approach by utilizing follow-up clarifications to handle such uncertainties. To achieve this, we first obtain semantic knowledge embeddings, and then these embeddings are used to generate clarifying questions by following an iterative process. The evaluation of our method shows that our approach is model agnostic, i.e., applicable to different semantic embedding models, and follow-up clarifications improve the performance regardless of the embedding model. Additionally, our ablation studies show the significance of informative clarifications and iterative predictions to enhance system accuracies.

clarification, knowledge, prediction, (17 more...)

arXiv.org Artificial Intelligence

2409.17004

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report (1.00)

Industry: Appliances & Durable Goods (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.68)
Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.61)

Add feedback