AITopics

2410.18195

Country:

North America > United States (0.14)
Europe > Italy (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (1.00)

arXiv.org Artificial Intelligence, Oct-17-2024

Skill Generalization with Verbs

Ma, Rachel, Lam, Lyndon, Spiegel, Benjamin A., Ganeshan, Aditya, Patel, Roma, Abbatematteo, Ben, Paulius, David, Tellex, Stefanie, Konidaris, George

It is imperative that robots can understand natural language commands issued by humans. Such commands typically contain verbs that signify what action should be performed on a given object and that are applicable to many objects. We propose a method for generalizing manipulation skills to novel objects using verbs. Our method learns a probabilistic classifier that determines whether a given object trajectory can be described by a specific verb. We show that this classifier accurately generalizes to novel object categories with an average accuracy of 76.69% across 13 object categories and 14 verbs. We then perform policy search over the object kinematics to find an object trajectory that maximizes classifier prediction for a given verb. Our method allows a robot to generate a trajectory for a novel object based on a verb, which can then be used as input to a motion planner. We show that our model can generate trajectories that are usable for executing five verb commands applied to novel instances of two different object categories on a real robot.
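
As a rough illustration of the search step described in this abstract, the sketch below hill-climbs over the waypoints of a one-joint trajectory to raise a verb score. Everything in it is an assumption made for illustration: `toy_verb_score` is a hand-written stand-in for the learned classifier, the single revolute joint stands in for the object kinematics, and `search_trajectory` is a generic random search rather than the authors' procedure.

```python
import math
import random


def toy_verb_score(trajectory):
    """Hypothetical stand-in for a learned verb classifier.

    Returns a pseudo-probability that a 1-DoF joint-angle trajectory
    (radians) matches the verb "open": larger net opening scores higher.
    """
    net_change = trajectory[-1] - trajectory[0]
    return 1.0 / (1.0 + math.exp(-4.0 * (net_change - 0.5)))


def search_trajectory(score_fn, steps=10, iters=200, limit=math.pi / 2, seed=0):
    """Hill-climb over joint-angle waypoints, clamped to [0, limit]."""
    rng = random.Random(seed)
    best = [0.0] * steps  # start with the joint at rest
    best_score = score_fn(best)
    for _ in range(iters):
        # Perturb every waypoint, respecting the joint limits.
        candidate = [min(limit, max(0.0, q + rng.gauss(0.0, 0.1))) for q in best]
        candidate[0] = 0.0  # the object always starts closed
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:  # keep only improvements
            best, best_score = candidate, candidate_score
    return best, best_score


baseline = toy_verb_score([0.0] * 10)
trajectory, score = search_trajectory(toy_verb_score)
print(f"baseline score {baseline:.3f}, searched score {score:.3f}")
```

In a real system the returned waypoints would be handed to a motion planner, as the abstract notes; here they only show how a classifier score can act as the search objective.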

machine learning, natural language, trajectory

doi: 10.1109/IROS55552.2023.10341472

2410.14118

Country:

North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Los Angeles County > Pomona (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

arXiv.org Artificial Intelligence, Oct-15-2024

Affordance-Centric Policy Learning: Sample Efficient and Generalisable Robot Policy Learning using Affordance-Centric Task Frames

Rana, Krishan, Abou-Chakra, Jad, Garg, Sourav, Lee, Robert, Reid, Ian, Suenderhauf, Niko

Affordances are central to robotic manipulation, where most tasks can be simplified to interactions with task-specific regions on objects. By focusing on these key regions, we can abstract away task-irrelevant information, simplifying the learning process and enhancing generalisation. In this paper, we propose an affordance-centric policy-learning approach that centres and appropriately orients a task frame on these affordance regions, allowing us to achieve both intra-category invariance, where policies can generalise across different instances within the same object category, and spatial invariance, which enables consistent performance regardless of object placement in the environment. We propose a method to leverage existing generalist large vision models to extract and track these affordance frames, and demonstrate that our approach can learn manipulation tasks using behaviour cloning from as few as 10 demonstrations, with equivalent generalisation to an image-based policy trained on 305 demonstrations. We provide video demonstrations on our project site: https://affordance-policy.github.io.
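
The spatial invariance mentioned in this abstract can be illustrated with a small 2D sketch: a point expressed relative to a frame attached to the object keeps the same coordinates when the object and the point move together. The helper names (`to_task_frame`, `move_rigidly`) and all numbers are hypothetical, and the planar setting is a simplification assumed here, not the paper's formulation.

```python
import math


def to_task_frame(point, frame_origin, frame_angle):
    """Express a 2D world-frame point in a frame located at frame_origin
    and rotated by frame_angle (radians) relative to the world."""
    dx = point[0] - frame_origin[0]
    dy = point[1] - frame_origin[1]
    c, s = math.cos(frame_angle), math.sin(frame_angle)
    # Apply the inverse rotation R(-frame_angle) to the offset.
    return (c * dx + s * dy, -s * dx + c * dy)


def move_rigidly(point, angle, shift):
    """Rotate a world-frame point about the world origin, then translate it."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * point[0] - s * point[1] + shift[0],
            s * point[0] + c * point[1] + shift[1])


# Hypothetical scene: an affordance frame on an object and a gripper nearby.
frame_origin, frame_angle = (1.0, 2.0), 0.3
gripper = (1.5, 2.2)
before = to_task_frame(gripper, frame_origin, frame_angle)

# Place the object (and the gripper with it) somewhere else in the workspace.
turn, shift = 0.8, (3.0, -1.0)
after = to_task_frame(move_rigidly(gripper, turn, shift),
                      move_rigidly(frame_origin, turn, shift),
                      frame_angle + turn)

print(before, after)  # the two task-frame coordinates agree up to rounding
```

A policy that only ever sees task-frame coordinates therefore receives the same input for both placements, which is the intuition behind training on fewer demonstrations.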

artificial intelligence, machine learning, object-oriented architecture

2410.12124

Country:

Oceania > Australia > Queensland (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.34)

Neural Information Processing Systems, Oct-11-2024, 03:20:46 GMT

Find What You Want: Learning Demand-conditioned Object Attribute Space for Demand-driven Navigation

The task of Visual Object Navigation (VON) involves an agent's ability to locate a particular object within a given scene. To successfully accomplish the VON task, two essential conditions must be fulfilled: 1) the user knows the name of the desired object; and 2) the user-specified object is actually present within the scene. To meet these conditions, a simulator can incorporate predefined object names and positions into the metadata of the scene. However, in real-world scenarios, it is often challenging to ensure that these conditions are always met. Humans in an unfamiliar environment may not know which objects are present in the scene, or they may mistakenly specify an object that is not actually present.

agent, demand-driven navigation, learning demand-conditioned object attribute space

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.40)
Information Technology > Artificial Intelligence > Natural Language (0.39)

arXiv.org Artificial Intelligence, Oct-11-2024

VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking

Qian, Zekun, Han, Ruize, Hou, Junhui, Song, Linqi, Feng, Wei

Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens. In this paper, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate localization and classification (detection) of the time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object association (tracking). Experimental results underscore that VOVTrack outperforms existing methods, establishing itself as a state-of-the-art solution for the open-vocabulary tracking task.

category, machine learning, natural language

2410.08529

Country:

Asia > China > Hong Kong (0.04)
Asia > China > Tianjin Province > Tianjin (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > Italy > Tuscany > Florence (0.04)

Genre:

Research Report > Promising Solution (0.54)
Research Report > New Finding (0.34)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.35)

Neural Information Processing Systems, Oct-9-2024, 10:38:11 GMT

Zero-Shot Semantic Segmentation

Semantic segmentation models are limited in their ability to scale to large numbers of object classes. In this paper, we introduce the new task of zero-shot semantic segmentation: learning pixel-wise classifiers for never-seen object categories with zero training examples. To this end, we present a novel architecture, ZS3Net, combining a deep visual segmentation model with an approach to generate visual representations from semantic word embeddings. In this way, ZS3Net addresses pixel classification tasks where both seen and unseen categories are faced at test time (so-called generalized zero-shot classification). Performance is further improved by a self-training step that relies on automatic pseudo-labeling of pixels from unseen classes.

category, segmentation model, zero-shot semantic segmentation

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.64)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.64)

Neural Information Processing Systems, Oct-8-2024, 06:56:52 GMT

Reviews: Zero-Shot Transfer with Deictic Object-Oriented Representation in Reinforcement Learning

Post rebuttal: I now understand the middle ground in which this paper is positioned, and the difference from propositional OO representations where you don't necessarily care which instance of an object type you're dealing with, which significantly reduces the dimensionality of learning transition dynamics. But this is still similar to other work on graph neural networks for model learning in fully relational representations, like Relation Networks by Santoro et al., and Interaction Networks by Battaglia et al., which in the worst case learn T * n * (n-1) relations for n objects for T types of relations. However, this paper does do a nice job of formalizing from the OO-MDP and Propositional MDP setting, as opposed to the two papers I mentioned, which do not and which focus on the physical dynamics case. I am willing to increase my score based on this, but still do not think it is novel enough to be accepted. This is very similar to relational MDPs, but they learn transition dynamics in this relational attribute space rather than real state space.
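
To make the reviewer's worst-case count concrete, the short sketch below evaluates T * n * (n-1) for a few sizes. The function name and the example values of n and T are illustrative assumptions; only the formula comes from the review.

```python
def worst_case_relations(num_objects, num_relation_types):
    """Ordered object pairs (n * (n - 1)) times the number of relation types T."""
    return num_relation_types * num_objects * (num_objects - 1)


# Hypothetical sizes, chosen only to show the quadratic growth in n.
for n in (5, 10, 20):
    print(n, worst_case_relations(n, num_relation_types=3))
# 5 objects -> 60, 10 objects -> 270, 20 objects -> 1140
```

Doubling the number of objects roughly quadruples the count, which is the dimensionality concern the reviewer raises for fully relational representations.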

deictic object-oriented representation, reinforcement learning, transition dynamic

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.43)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.43)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)

Neural Information Processing Systems, Oct-7-2024, 16:02:27 GMT

Reviews: Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation

An approach for joint estimation of 3D Layout, 3D Object Detection, Camera Pose Estimation and 'Holistic Scene Understanding' (as defined in Song et al. (2015)) is proposed. More specifically, deep nets, functional mappings (e.g., projections from 3D to 2D points) and loss functions are combined to obtain a holistic interpretation of a scene illustrated in a single RGB image. The proposed approach is shown to outperform 3DGP (Choi et al. (2013)) and IM2CAD (Izadinia et al. (2017)) on the SUN RGB-D dataset. Review Summary: The paper is well written and presents an intuitive approach which is illustrated to work well when compared to two baselines. For some of the tasks, e.g., 3D Layout estimation, stronger baselines exist, and as a reviewer/reader I can't assess how the proposed approach compares.

baseline, camera pose estimation, fair comparison

Genre: Summary/Review (0.39)

Technology:

Information Technology > Artificial Intelligence > Vision > Video Understanding (0.62)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.40)

Neural Information Processing Systems, Oct-7-2024, 13:21:09 GMT

Reviews: Object-Oriented Dynamics Predictor

This paper addresses the problem of action-conditional video prediction via a deep neural network whose architecture specifically aims to represent object positions, relationships, and interactions. The learned models are shown empirically to generalize to novel object configurations and to be robust to minor changes in object appearance. Technical Quality: As far as I can tell the paper is technically sound. The experiments are well-designed to support the main claims. I especially appreciated the attempts to study whether the network is truly capturing object-based knowledge as a human might expect (rather than simply being a really fancy pixel-to-pixel model).

contribution, experiment, object-oriented dynamic predictor

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.73)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.52)

Neural Information Processing Systems, Oct-7-2024, 11:12:50 GMT

Reviews: Learning Hierarchical Semantic Image Manipulation through Structured Representations

In this paper, a new method for image manipulation is proposed. The proposed method incorporates a hierarchical framework and provides both interactive and automatic semantic object-level image manipulation. In the interactive manipulation setting, the user can select a bounding box where image editing for adding and removing objects will be applied. The proposed network architecture consists of a foreground output stream, which produces predictions of a binary object mask, and a background output stream, which produces per-pixel label maps. As a result, the proposed image manipulation method generates the output image by filling in the pixel-level textures guided by the semantic layout.

evaluation, learning hierarchical semantic image manipulation, structured representation

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.40)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.39)