AITopics | Object-Oriented Architecture

Object-Oriented Architecture

Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras

Yu, Jun, Dai, Yutong, Liu, Xiaokang, Huang, Jin, Shen, Yishan, Zhang, Ke, Zhou, Rong, Adhikarla, Eashan, Ye, Wenxuan, Liu, Yixin, Kong, Zhaoming, Zhang, Kai, Yin, Yilong, Namboodiri, Vinod, Davison, Brian D., Moore, Jason H., Chen, Yong

arXiv.org Artificial Intelligence Apr-29-2024

Multi-task learning (MTL) is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to single-task learning (STL), MTL offers a suite of benefits that enhance both the training process and the inference efficiency. MTL's key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the past twenty years, MTL has become widely recognized as a flexible and effective approach in various fields, including computer vision (CV), natural language processing (NLP), recommendation systems, disease prognosis and diagnosis, and robotics. This survey provides a comprehensive overview of the evolution of MTL, encompassing the technical aspects of cutting-edge methods from traditional approaches to deep learning and the latest trend of pretrained foundation models. Our survey methodically categorizes MTL techniques into five key areas: regularization, relationship learning, feature propagation, optimization, and pre-training. This categorization not only chronologically outlines the development of MTL but also dives into various specialized strategies within each category. Furthermore, the survey reveals how MTL evolves from handling a fixed set of tasks to embracing a more flexible approach free from task or modality constraints. It explores the concepts of task-promptable and task-agnostic training, along with the capacity for zero-shot learning (ZSL), which unleashes the untapped potential of this historically coveted learning paradigm. Overall, we hope this survey provides the research community with a comprehensive overview of the advancements in MTL from its inception in 1997 to the present in 2023. We address present challenges and look ahead to future possibilities, shedding light on the opportunities and potential avenues for MTL research in a broad manner. This project is publicly available at https://github.com/junfish/Awesome-Multitask-Learning.

multi-objective optimization, overlapped tensor trace norm, unsupervised domain adaptation, (17 more...)

arXiv.org Artificial Intelligence

2404.18961

Country:

Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
Africa > Senegal > Kolda Region > Kolda (0.04)
South America > Colombia > Meta Department > Villavicencio (0.04)
(15 more...)

Genre:

Research Report > Promising Solution (1.00)
Overview (1.00)
Research Report > New Finding (0.92)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Education (1.00)
(7 more...)

Technology:

Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
(8 more...)

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

Li, Dingzhe, Jin, Yixiang, A, Yong, Yu, Hongze, Shi, Jun, Hao, Xiaoshuai, Hao, Peng, Liu, Huaping, Sun, Fuchun, Fang, Bin

arXiv.org Artificial Intelligence Apr-28-2024

The realization of universal robots is an ultimate goal of researchers. However, a key hurdle in achieving this goal lies in the robots' ability to manipulate objects in their unstructured surrounding environments according to different tasks. The learning-based approach is considered an effective way to address generalization. The impressive performance of foundation models in the fields of computer vision and natural language suggests the potential of embedding foundation models into manipulation tasks as a viable path toward achieving general manipulation capability. However, we believe achieving general manipulation capability requires an overarching framework akin to autonomous driving. This framework should encompass multiple functional modules, with different foundation models assuming distinct roles in facilitating general manipulation capability. This survey focuses on the contributions of foundation models to robot learning for manipulation. We propose a comprehensive framework and detail how foundation models can address challenges in each module of the framework. Furthermore, we examine current approaches, outline challenges, suggest future research directions, and identify potential risks associated with integrating foundation models into this domain.

arxiv preprint arxiv, foundation model, manipulation, (13 more...)

arXiv.org Artificial Intelligence

2404.18201

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report (1.00)

Industry:

Education (1.00)
Leisure & Entertainment (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
(2 more...)

Garbage Segmentation and Attribute Analysis by Robotic Dogs

Xu, Nuo, Liao, Jianfeng, Meng, Qiwei, Song, Wei

arXiv.org Artificial Intelligence Apr-28-2024

Efficient waste management and recycling heavily rely on garbage exploration and identification. In this study, we propose GSA2Seg (Garbage Segmentation and Attribute Analysis), a novel visual approach that utilizes quadruped robotic dogs as autonomous agents to address waste management and recycling challenges in diverse indoor and outdoor environments. Equipped with an advanced visual perception system, including visual sensors and instance segmentators, the robotic dogs adeptly navigate their surroundings, diligently searching for common garbage items. Inspired by open-vocabulary algorithms, we introduce an innovative method for object attribute analysis. By combining garbage segmentation and attribute analysis techniques, the robotic dogs accurately determine the state of the trash, including its position and placement properties. This information enhances the robotic arm's grasping capabilities, facilitating successful garbage retrieval. Additionally, we contribute an image dataset, named GSA2D, to support evaluation. Through extensive experiments on GSA2D, this paper provides a comprehensive analysis of GSA2Seg's effectiveness. Dataset available: https://www.kaggle.com/datasets/hellob/gsa2d-2024.

dataset, gsa2seg, segmentation, (15 more...)

arXiv.org Artificial Intelligence

2404.18112

Country:

Asia > Japan (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report > Promising Solution (0.48)

Industry: Water & Waste Management > Solid Waste Management (0.70)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.36)

Subobject-level Image Tokenization

Chen, Delong, Cahyawijaya, Samuel, Liu, Jianfeng, Wang, Baoyuan, Fung, Pascale

arXiv.org Artificial Intelligence Apr-23-2024

Transformer-based vision models typically tokenize images into fixed-size square patches as input units, which lacks adaptability to image content and overlooks the inherent pixel grouping structure. Inspired by the subword tokenization widely adopted in language models, we propose an image tokenizer at a subobject level, where the subobjects are represented by semantically meaningful image segments obtained by segmentation models (e.g., segment anything models). To implement a learning system based on subobject tokenization, we first introduce a Direct Segment Anything Model (DirectSAM) that efficiently produces comprehensive segmentation of subobjects, then embed subobjects into compact latent vectors and feed them into a large language model for vision language learning. Empirical results demonstrate that our subobject-level tokenization significantly facilitates efficient learning of translating images into object and attribute descriptions compared to the traditional patch-level tokenization.
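For readers unfamiliar with the baseline this abstract contrasts against, the fixed-size patch tokenization it mentions can be sketched as follows. This is a minimal illustration only, not the paper's method: the function name `patchify`, the list-of-rows image layout, and the row-major flattening are all assumptions made for the example, and the subobject tokenizer (DirectSAM) is not reproduced here.

```python
def patchify(image, patch):
    """Split a 2D grid (a list of equal-length rows) into non-overlapping
    patch x patch tiles and flatten each tile into one token vector.

    Hypothetical helper for illustration; real vision models operate on
    multi-channel tensors and project each patch with a learned layer.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            # Flatten the tile row by row into a single token.
            tokens.append([value for row in tile for value in row])
    return tokens


# A 4x4 toy "image" whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
```

Every token covers the same square region regardless of what the image contains, which is the lack of content adaptability the abstract points to.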

subobject, subobject token, tokenization, (15 more...)

arXiv.org Artificial Intelligence

2402.14327

Country:

North America > United States > California > Los Angeles County > Long Beach (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(9 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Resilience through Scene Context in Visual Referring Expression Generation

Junker, Simeon, Zarrieß, Sina

arXiv.org Artificial Intelligence Apr-18-2024

Scene context is well known to facilitate humans' perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models' visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.

expression, information, representation, (16 more...)

arXiv.org Artificial Intelligence

2404.12289

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Czechia > Prague (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(11 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Rethinking 3D Dense Caption and Visual Grounding in A Unified Framework through Prompt-based Localization

Luo, Yongdong, Lin, Haojia, Zheng, Xiawu, Jiang, Yigeng, Chao, Fei, Hu, Jie, Jiang, Guannan, Zhang, Songan, Ji, Rongrong

arXiv.org Artificial Intelligence Apr-17-2024

3D Visual Grounding (3DVG) and 3D Dense Captioning (3DDC) are two crucial tasks in various 3D applications, which require both shared and complementary information in localization and visual-language relationships. Therefore, existing approaches adopt the two-stage "detect-then-describe/discriminate" pipeline, which relies heavily on the performance of the detector, resulting in suboptimal performance. Inspired by DETR, we propose a unified framework, 3DGCTR, to jointly solve these two distinct but closely related tasks in an end-to-end fashion. The key idea is to reconsider the prompt-based localization ability of the 3DVG model. In this way, the 3DVG model with a well-designed prompt as input can assist the 3DDC task by extracting localization information from the prompt. In terms of implementation, we integrate a Lightweight Caption Head into the existing 3DVG network with a Caption Text Prompt as a connection, effectively harnessing the existing 3DVG model's inherent localization capacity, thereby boosting 3DDC capability. This integration facilitates simultaneous multi-task training on both tasks, mutually enhancing their performance. Extensive experimental results demonstrate the effectiveness of this approach. Specifically, on the ScanRefer dataset, 3DGCTR surpasses the state-of-the-art 3DDC method by 4.3% in CIDEr@0.5IoU in MLE training and improves upon the SOTA 3DVG method by 3.16% in Acc@0.25IoU.
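The Acc@0.25IoU figure quoted above is a thresholded-overlap accuracy: a predicted box counts as correct when its intersection-over-union with the ground-truth box reaches the threshold. A minimal sketch of that computation follows, assuming axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`. The function names, the box layout, and the inclusive `>=` comparison are assumptions for illustration and are not taken from the paper or the ScanRefer evaluation code.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def accuracy_at_iou(predictions, targets, threshold=0.25):
    """Fraction of predictions whose IoU with the paired target meets the
    threshold. The inclusive comparison is an assumption of this sketch."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("need one prediction per target")
    hits = sum(box_iou_3d(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(targets)


# Toy data: the first pair overlaps with IoU 1/3, the second pair is disjoint.
predictions = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
targets = [(1, 0, 0, 3, 2, 2), (5, 5, 5, 6, 6, 6)]
acc_025 = accuracy_at_iou(predictions, targets, threshold=0.25)
acc_050 = accuracy_at_iou(predictions, targets, threshold=0.5)
```

With this toy data, one of the two predictions passes at a threshold of 0.25 and none pass at 0.5, which shows why results are usually reported at more than one threshold.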

proceedings, query, visual grounding, (14 more...)

arXiv.org Artificial Intelligence

2404.11064

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Fujian Province > Xiamen (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Evaluating Text-to-Image Synthesis: Survey and Taxonomy of Image Quality Metrics

Hartwig, Sebastian, Engel, Dominik, Sick, Leon, Kniesel, Hannah, Payer, Tristan, Poonam, Poonam, Glöckler, Michael, Bäuerle, Alex, Ropinski, Timo

arXiv.org Artificial Intelligence Apr-15-2024

Recent advances in text-to-image synthesis, enabled through a combination of language and vision foundation models, have led to a proliferation of available tools and increased attention to the field. When conducting text-to-image synthesis, a central goal is to ensure that the content between text and image is aligned. As such, there exist numerous evaluation metrics that aim to mimic human judgement. However, it is often unclear which metric to use for evaluating text-to-image synthesis systems as their evaluation is highly nuanced. In this work, we provide a comprehensive overview of existing text-to-image evaluation metrics. Based on our findings, we propose a new taxonomy for categorizing these metrics. Our taxonomy is grounded in the assumption that there are two main quality criteria, namely compositionality and generality, which ideally map to human preferences. Ultimately, we derive guidelines for practitioners conducting text-to-image evaluation, discuss open challenges of evaluation mechanisms, and surface limitations of current metrics.

dataset, evaluation, proceedings, (16 more...)

arXiv.org Artificial Intelligence

2403.11821

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Germany (0.04)
North America > United States > New York > New York County > New York City (0.04)
(12 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.87)

Industry:

Media > Photography (1.00)
Information Technology (0.93)
Health & Medicine (0.67)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
(3 more...)

GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation

Khanna, Mukul, Ramrakhya, Ram, Chhablani, Gunjan, Yenamandra, Sriram, Gervet, Theophile, Chang, Matthew, Kira, Zsolt, Chaplot, Devendra Singh, Batra, Dhruv, Mottaghi, Roozbeh

arXiv.org Artificial Intelligence Apr-9-2024

The Embodied AI community has made significant strides in visual navigation tasks, exploring targets from 3D coordinates, objects, language descriptions, and images. However, these navigation models often handle only a single input modality as the target. With the progress achieved so far, it is time to move towards universal navigation models capable of handling various goal types, enabling more effective user interaction with robots. To facilitate this goal, we propose GOAT-Bench, a benchmark for the universal navigation task referred to as GO to AnyThing (GOAT). In this task, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image in an open-vocabulary fashion. We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities, the role of explicit and implicit scene memories, their robustness to noise in goal specifications, and the impact of memory in lifelong scenarios.

agent, category, navigation, (17 more...)

arXiv.org Artificial Intelligence

2404.06609

Country: North America > United States > Illinois > Champaign County > Urbana (0.04)

Genre: Research Report (0.50)

Industry:

Appliances & Durable Goods (0.68)
Materials (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Object-conditioned Bag of Instances for Few-Shot Personalized Instance Recognition

Michieli, Umberto, Moon, Jijoong, Kim, Daehyun, Ozay, Mete

arXiv.org Artificial Intelligence Apr-1-2024

Nowadays, users demand increased personalization of vision systems to localize and identify personal instances of objects (e.g., my dog rather than dog) from a few-shot dataset only. Despite outstanding results of deep networks on classical label-abundant benchmarks (e.g., those of the latest YOLOv8 model for standard object detection), they struggle to maintain within-class variability to represent different instances rather than object categories only. We construct an Object-conditioned Bag of Instances (OBoI) based on multi-order statistics of extracted features, where generic object detection models are extended to search and identify personal instances from the OBoI's metric space, without the need for backpropagation. By relying on multi-order statistics, OBoI achieves consistently superior accuracy in distinguishing different instances. In the results, we achieve 77.1% personal object recognition accuracy in the case of 18 personal instances, showing about 12% relative gain over the state of the art.

multi-order statistics, oboi, statistics, (16 more...)

arXiv.org Artificial Intelligence

2404.01397

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.36)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.35)

Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP

Abbasi, Reza, Samiei, Mohammad, Rohban, Mohammad Hossein, Baghshah, Mahdieh Soleymani

arXiv.org Artificial Intelligence Mar-27-2024

Vision-language models, such as CLIP, have shown promising Out-of-Distribution (OoD) generalization under various types of distribution shifts. Recent studies attempted to investigate the leading cause of this capability. In this work, we follow the same path, but focus on a specific type of OoD data - images with novel compositions of attribute-object pairs - and study whether such models can successfully classify those images into composition classes. We carefully designed an authentic image test dataset called ImageNet-AO, consisting of attributes for objects that are unlikely to be encountered in the CLIP training sets. We found that CLIPs trained with large datasets such as OpenAI CLIP, LAION-400M, and LAION-2B show orders-of-magnitude improvement in effective compositional OoD generalization compared to both supervised models and CLIPs trained with smaller datasets, such as CC-12M and YFCC-15M. Our results provide evidence that the scale and diversity of training data and language supervision play a key role in unlocking the compositional generalization abilities of vision-language models.

dataset, distribution accuracy, generalization, (13 more...)

arXiv.org Artificial Intelligence

2403.18525

Country:

Asia > Middle East > Iran > Tehran Province > Tehran (0.05)
Europe > Poland (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.39)
