AITopics | Object-Oriented Architecture

Collaborating Authors

Object-Oriented Architecture

News Overviews Instructional Materials AI-Alerts Classics

Cross-Modal Concept Learning and Inference for Vision-Language Models

Zhang, Yi, Zhang, Ce, Tang, Yushun, He, Zhihai

arXiv.org Artificial IntelligenceJul-28-2023

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole image matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-model concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method is able to improve the performance upon the current state-of-the-art methods by large margins, for example, by up to 8.0% improvement on few-shot learning and by up to 1.3% for domain generalization.

computer vision, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2307.1546

Country:

Asia > China > Heilongjiang Province > Harbin (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.48)

Add feedback

'What are you referring to?' Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges

Chiyah-Garcia, Javier, Suglia, Alessandro, Eshghi, Arash, Hastie, Helen

arXiv.org Artificial IntelligenceJul-28-2023

Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee. Addressees usually detect such ambiguities immediately and work with the speaker to repair it using meta-communicative, Clarificational Exchanges (CE): a Clarification Request (CR) and a response. Here, we argue that the ability to generate and respond to CRs imposes specific constraints on the architecture and objective functions of multi-modal, visually grounded dialogue models. We use the SIMMC 2.0 dataset to evaluate the ability of different state-of-the-art model architectures to process CEs, with a metric that probes the contextual updates that arise from them in the model. We find that language-based models are able to encode simple multi-modal semantic information and process some CEs, excelling with those related to the dialogue history, whilst multi-modal models can use additional learning objectives to obtain disentangled object representations, which become crucial to handle complex referential ambiguities across modalities overall.

machine learning, natural language, object-oriented architecture, (19 more...)

arXiv.org Artificial Intelligence

2307.15554

Country:

North America > Dominican Republic (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(9 more...)

Genre:

Personal > Interview (0.83)
Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.49)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.49)

Add feedback

Learning Dynamic Attribute-factored World Models for Efficient Multi-object Reinforcement Learning

Feng, Fan, Magliacane, Sara

arXiv.org Artificial IntelligenceJul-18-2023

In many reinforcement learning tasks, the agent has to learn to interact with many objects of different types and generalize to unseen combinations and numbers of objects. Often a task is a composition of previously learned tasks (e.g. block stacking). These are examples of compositional generalization, in which we compose object-centric representations to solve complex tasks. Recent works have shown the benefits of object-factored representations and hierarchical abstractions for improving sample efficiency in these settings. On the other hand, these methods do not fully exploit the benefits of factorization in terms of object attributes. In this paper, we address this opportunity and introduce the Dynamic Attribute FacTored RL (DAFT-RL) framework. In DAFT-RL, we leverage object-centric representation learning to extract objects from visual inputs. We learn to classify them in classes and infer their latent parameters. For each class of object, we learn a class template graph that describes how the dynamics and reward of an object of this class factorize according to its attributes. We also learn an interaction pattern graph that describes how objects of different classes interact with each other at the attribute level. Through these graphs and a dynamic interaction graph that models the interactions between objects, we can learn a policy that can then be directly applied in a new environment by just estimating the interactions and latent parameters. We evaluate DAFT-RL in three benchmark datasets and show our framework outperforms the state-of-the-art in generalizing across unseen objects with varying attributes and latent parameters, as well as in the composition of previously learned tasks.

machine learning, object-oriented architecture, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2307.09205

Country:

North America > United States > California > Orange County > Irvine (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Hong Kong (0.04)

Genre:

Research Report (0.50)
Workflow (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

NeuSE: Neural SE(3)-Equivariant Embedding for Consistent Spatial Understanding with Objects

Fu, Jiahui, Du, Yilun, Singh, Kurran, Tenenbaum, Joshua B., Leonard, John J.

arXiv.org Artificial IntelligenceJul-10-2023

We present NeuSE, a novel Neural SE(3)-Equivariant Embedding for objects, and illustrate how it supports object SLAM for consistent spatial understanding with long-term scene changes. NeuSE is a set of latent object embeddings created from partial object observations. It serves as a compact point cloud surrogate for complete object models, encoding full shape information while transforming SE(3)-equivariantly in tandem with the object in the physical world. With NeuSE, relative frame transforms can be directly derived from inferred latent codes. Our proposed SLAM paradigm, using NeuSE for object shape and pose characterization, can operate independently or in conjunction with typical SLAM systems. It directly infers SE(3) camera pose constraints that are compatible with general SLAM pose graph optimization, while also maintaining a lightweight object-centric map that adapts to real-world changes. Our approach is evaluated on synthetic and real-world sequences featuring changed objects and shows improved localization accuracy and change-aware mapping capability, when working either standalone or jointly with a common SLAM pipeline.

latent code, representation, trajectory, (16 more...)

arXiv.org Artificial Intelligence

2303.07308

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.68)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.46)

Add feedback

Active Class Selection for Few-Shot Class-Incremental Learning

McClurg, Christopher, Ayub, Ali, Tyagi, Harsh, Rajtmajer, Sarah M., Wagner, Alan R.

arXiv.org Artificial IntelligenceJul-5-2023

For real-world applications, robots will need to continually learn in their environments through limited interactions with their users. Toward this, previous works in few-shot class incremental learning (FSCIL) and active class selection (ACS) have achieved promising results but were tested in constrained setups. Therefore, in this paper, we combine ideas from FSCIL and ACS to develop a novel framework that can allow an autonomous agent to continually learn new objects by asking its users to label only a few of the most informative objects in the environment. To this end, we build on a state-of-the-art (SOTA) FSCIL model and extend it with techniques from ACS literature. We further integrate a potential field-based navigation technique with our model to develop a complete framework that can allow an agent to process and reason on its sensory data through the FIASco model, navigate towards the most informative object in the environment, gather data about the object through its sensors and incrementally update the FIASco model. A primary challenge faced by robots deployed in the real world is continual adaptation to dynamic environments. Central to this challenge is object recognition (Ayub & Wagner, 2020d), a task typically requiring labeled examples. In this work, we address the problem of parsimonious object labelling wherein a robot may request labels for a small number of objects about which it knows least. In recent years, several works have been directed toward the problem of Few-Shot Class Incremental Learning (FSCIL) (Tao et al., 2020; Ayub & Wagner, 2020c) to develop models of incremental object learning that can learn from limited training data for each object class. The literature has made significant progress toward developing robots that can continually learn new objects from limited training data while preserving knowledge of previous objects. However, existing methods make strong assumptions about the training data that are rarely true in the real world. For example, FSCIL assumes that in each increment the robot will receive a fully labeled image dataset for the object classes in that increment, and the robot will not receive more data for these classes again (Tao et al., 2020; Ayub & Wagner, 2020c;d). In real world environments, however, robots will most likely encounter many unlabeled objects in their environment, and they will have to direct their learning toward a smaller subset of those unknown objects. Active learning is a subfield of machine learning that focuses on improving the learning efficiency of models by selectively seeking labels from within a large unlabeled data pool (Settles, 2009; Ayub & Fendley, 2022). Related to active learning is active class selection (ACS) in which a model seeks labels for specific object classes (Lomasky et al., 2007).

artificial intelligence, machine learning, object-oriented architecture, (16 more...)

arXiv.org Artificial Intelligence

2307.02641

Country:

North America > United States > Pennsylvania (0.04)
Europe > Sweden > Skåne County > Malmö (0.04)

Genre: Research Report (0.82)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.49)
(2 more...)

Add feedback

Deep R Programming

Gagolewski, Marek

arXiv.org Artificial IntelligenceJun-28-2023

Deep R Programming is a comprehensive and in-depth introductory course on one of the most popular languages for data science. It equips ambitious students, professionals, and researchers with the knowledge and skills to become independent users of this potent environment so that they can tackle any problem related to data wrangling and analytics, numerical computing, statistics, and machine learning. This textbook is a non-profit project. Its online and PDF versions are freely available at .

data mining, machine learning, programming language, (27 more...)

arXiv.org Artificial Intelligence

doi: 10.5281/zenodo.7490464

2301.01188

Country:

Europe > Poland > Masovia Province > Warsaw (0.04)
Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
(3 more...)

Genre:

Summary/Review (1.00)
Research Report > Experimental Study (0.67)
Research Report > New Finding (0.67)
Instructional Material > Course Syllabus & Notes (0.65)

Industry:

Banking & Finance (1.00)
Transportation (0.92)
Information Technology (0.92)
(5 more...)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Software Engineering (1.00)
Information Technology > Information Management (1.00)
(9 more...)

Add feedback

VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations

Zhao, Tiancheng, Zhang, Tianqi, Zhu, Mingwei, Shen, Haozhan, Lee, Kyusong, Lu, Xiaopeng, Yin, Jianwei

arXiv.org Artificial IntelligenceJun-22-2023

Vision-Language Pretraining (VLP) models have recently successfully facilitated many cross-modal downstream tasks. Most existing works evaluated their systems by comparing the fine-tuned downstream task performance. However, only average downstream task accuracy provides little information about the pros and cons of each VLP method, let alone provides insights on how the community can improve the systems in the future. Inspired by the CheckList for testing natural language processing, we exploit VL-CheckList, a novel framework to understand the capabilities of VLP models. The proposed method divides the image-texting ability of a VLP model into three categories: objects, attributes, and relations, and uses a novel taxonomy to further break down these three aspects. We conduct comprehensive studies to analyze seven recently popular VLP models via the proposed framework. Results confirm the effectiveness of the proposed method by revealing fine-grained differences among the compared models that were not visible from downstream task-only evaluation. Further results show promising research direction in building better VLP models. Our data and code are available at: https://github.com/om-ai-lab/VL-CheckList.

natural language, object-oriented architecture, vlp model, (16 more...)

arXiv.org Artificial Intelligence

2207.00221

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.71)

Add feedback

World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Ma, Ziqiao, Pan, Jiayi, Chai, Joyce

arXiv.org Artificial IntelligenceJun-14-2023

The ability to connect language units to their referents in the physical world, referred to as grounding, is crucial to learning and understanding grounded meanings of words. While humans demonstrate fast mapping in new word learning, it remains unclear whether modern vision-language models can truly represent language with their grounded meanings and how grounding may further bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning. As an initial attempt, we propose object-oriented BERT (OctoBERT), a novel visually-grounded language model by pre-training on image-text pairs highlighting grounding as an objective. Through extensive experiments and analysis, we demonstrate that OctoBERT is a more coherent and fast grounded word learner, and that the grounding ability acquired during pre-training helps the model to learn unseen words more rapidly and robustly. Our code is available at https://github.com/sled-group/world-to-words

machine learning, natural language, object-oriented architecture, (17 more...)

arXiv.org Artificial Intelligence

2306.08685

Country:

North America > United States > Michigan (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(8 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Education (0.49)
Energy (0.30)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.88)

Add feedback

GeneCIS: A Benchmark for General Conditional Image Similarity

Vaze, Sagar, Carion, Nicolas, Misra, Ishan

arXiv.org Artificial IntelligenceJun-13-2023

We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open-set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States. Project page at https://sgvaze.github.io/genecis/.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2306.07969

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia (0.04)
Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.67)
(2 more...)

Add feedback

Read, look and detect: Bounding box annotation from image-caption pairs

Sanchez, Eduardo Hugo

arXiv.org Artificial IntelligenceJun-9-2023

Various methods have been proposed to detect objects while reducing the cost of data annotation. For instance, weakly supervised object detection (WSOD) methods rely only on image-level annotations during training. Unfortunately, data annotation remains expensive since annotators must provide the categories describing the content of each image and labeling is restricted to a fixed set of categories. In this paper, we propose a method to locate and label objects in an image by using a form of weaker supervision: image-caption pairs. By leveraging recent advances in vision-language (VL) models and self-supervised vision transformers (ViTs), our method is able to perform phrase grounding and object detection in a weakly supervised manner. Our experiments demonstrate the effectiveness of our approach by achieving a 47.51% recall@1 score in phrase grounding on Flickr30k Entities and establishing a new state-of-the-art in object detection by achieving 21.1 mAP 50 and 10.5 mAP 50:95 on MS COCO when exclusively relying on image-caption pairs.

artificial intelligence, object-oriented architecture, proceedings, (14 more...)

arXiv.org Artificial Intelligence

2306.06149

Country:

Asia > China > Jiangsu Province > Yancheng (0.04)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.68)

Add feedback