AITopics | Wang, Ruiyu

Collaborating Authors

Wang, Ruiyu

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Lifelong Learning with Task-Specific Adaptation: Addressing the Stability-Plasticity Dilemma

Wang, Ruiyu, Wang, Sen, Zuo, Xinxin, Sun, Qiang

arXiv.org Artificial IntelligenceMar-8-2025

Lifelong learning (LL) aims to continuously acquire new knowledge while retaining previously learned knowledge. A central challenge in LL is the stability-plasticity dilemma, which requires models to balance the preservation of previous knowledge (stability) with the ability to learn new tasks (plasticity). While parameter-efficient fine-tuning (PEFT) has been widely adopted in large language models, its application to lifelong learning remains underexplored. To bridge this gap, this paper proposes AdaLL, an adapter-based framework designed to address the dilemma through a simple, universal, and effective strategy. AdaLL co-trains the backbone network and adapters under regularization constraints, enabling the backbone to capture task-invariant features while allowing the adapters to specialize in task-specific information. Unlike methods that freeze the backbone network, AdaLL incrementally enhances the backbone's capabilities across tasks while minimizing interference through backbone regularization. This architectural design significantly improves both stability and plasticity, effectively eliminating the stability-plasticity dilemma. Extensive experiments demonstrate that AdaLL consistently outperforms existing methods across various configurations, including dataset choices, task sequences, and task scales.

adapter, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2503.06213

Country: North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report > New Finding (0.67)

Industry: Education > Educational Setting > Continuing Education (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models

Wang, Ruiyu, Yuan, Yu, Sun, Shizhao, Bian, Jiang

arXiv.org Artificial IntelligenceFeb-5-2025

Creating Computer-Aided Design (CAD) models requires significant expertise and effort. Text-to-CAD, which converts textual descriptions into CAD parametric sequences, is crucial in streamlining this process. Recent studies have utilized ground-truth parametric sequences, known as sequential signals, as supervision to achieve this goal. However, CAD models are inherently multimodal, comprising parametric sequences and corresponding rendered visual objects. Besides,the rendering process from parametric sequences to visual objects is many-to-one. Therefore, both sequential and visual signals are critical for effective training. In this work, we introduce CADFusion, a framework that uses Large Language Models (LLMs) as the backbone and alternates between two training stages: the sequential learning (SL) stage and the visual feedback (VF) stage. In the SL stage, we train LLMs using ground-truth parametric sequences, enabling the generation of logically coherent parametric sequences. In the VF stage, we reward parametric sequences that render into visually preferred objects and penalize those that do not, allowing LLMs to learn how rendered visual objects are perceived and evaluated. These two stages alternate throughout the training, ensuring balanced learning and preserving benefits of both signals. Experiments demonstrate that CADFusion significantly improves performance, both qualitatively and quantitatively.

large language model, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2501.19054

Country:

Asia (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (1.00)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement

Jin, Shutong, Wang, Ruiyu, Chen, Kuangyi, Pokorny, Florian T.

arXiv.org Artificial IntelligenceDec-1-2024

Scene rearrangement, like table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop a representation that integrates generation, segmentation, and feature encoding into a single step to produce object-level representations. Additionally, we introduce perspective control, thus enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down views. The efficacy of our method is demonstrated through its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy and execution success rate of 87% and 67%, respectively.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2410.22059

Country: Europe > Germany (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Double Oracle Neural Architecture Search for Game Theoretic Deep Learning Models

Aung, Aye Phyu Phyu, Wang, Xinrun, Wang, Ruiyu, Chan, Hau, An, Bo, Li, Xiaoli, Senthilnath, J.

arXiv.org Artificial IntelligenceOct-7-2024

In this paper, we propose a new approach to train deep learning models using game theory concepts including Generative Adversarial Networks (GANs) and Adversarial Training (AT) where we deploy a double-oracle framework using best response oracles. GAN is essentially a two-player zero-sum game between the generator and the discriminator. The same concept can be applied to AT with attacker and classifier as players. Training these models is challenging as a pure Nash equilibrium may not exist and even finding the mixed Nash equilibrium is difficult as training algorithms for both GAN and AT have a large-scale strategy space. Extending our preliminary model DO-GAN, we propose the methods to apply the double oracle framework concept to Adversarial Neural Architecture Search (NAS for GAN) and Adversarial Training (NAS for AT) algorithms. We first generalize the players' strategies as the trained models of generator and discriminator from the best response oracles. We then compute the meta-strategies using a linear program. For scalability of the framework where multiple network models of best responses are stored in the memory, we prune the weakly-dominated players' strategies to keep the oracles from becoming intractable. Finally, we conduct experiments on MNIST, CIFAR-10 and TinyImageNet for DONAS-GAN. We also evaluate the robustness under FGSM and PGD attacks on CIFAR-10, SVHN and TinyImageNet for DONAS-AT. We show that all our variants have significant improvements in both subjective qualitative evaluation and quantitative metrics, compared with their respective base architectures.

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2410.04764

Country:

North America > United States > Nebraska (0.14)
North America > United States > Massachusetts (0.14)

Genre:

Personal (0.68)
Research Report (0.50)

Industry:

Education (0.46)
Leisure & Entertainment > Games (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Feature Extractor or Decision Maker: Rethinking the Role of Visual Encoders in Visuomotor Policies

Wang, Ruiyu, Zhuang, Zheyu, Jin, Shutong, Ingelhag, Nils, Kragic, Danica, Pokorny, Florian T.

arXiv.org Artificial IntelligenceSep-30-2024

An end-to-end (E2E) visuomotor policy is typically treated as a unified whole, but recent approaches using out-of-domain (OOD) data to pretrain the visual encoder have cleanly separated the visual encoder from the network, with the remainder referred to as the policy. We propose Visual Alignment Testing, an experimental framework designed to evaluate the validity of this functional separation. Our results indicate that in E2E-trained models, visual encoders actively contribute to decision-making resulting from motor data supervision, contradicting the assumed functional separation. In contrast, OOD-pretrained models, where encoders lack this capability, experience an average performance drop of 42% in our benchmark results, compared to the state-of-the-art performance achieved by E2E policies. We believe this initial exploration of visual encoders' role can provide a first step towards guiding future pretraining methods to address their decision-making ability, such as developing task-conditioned or context-aware encoders.

artificial intelligence, encoder, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2409.20248

Country: Europe > Sweden (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.96)

Add feedback

Can Language Model Understand Word Semantics as A Chatbot? An Empirical Study of Language Model Internal External Mismatch

Zhao, Jinman, Zhang, Xueyan, Yue, Xingyu, Chen, Weizhe, Qian, Zifan, Wang, Ruiyu

arXiv.org Artificial IntelligenceSep-20-2024

Current common interactions with language models is through full inference. This approach may not necessarily align with the model's internal knowledge. Studies show discrepancies between prompts and internal representations. Most focus on sentence understanding. We study the discrepancy of word semantics understanding in internal and external mismatch across Encoder-only, Decoder-only, and Encoder-Decoder pre-trained language models.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2409.13972

Country:

Asia (0.69)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.15)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.94)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

UniPredict: Large Language Models are Universal Tabular Classifiers

Wang, Ruiyu, Wang, Zifeng, Sun, Jimeng

arXiv.org Artificial IntelligenceJan-16-2024

Tabular data prediction is a fundamental machine learning task for many applications. Existing methods predominantly employ discriminative modeling and operate under the assumption of a fixed target column, necessitating re-training for every new predictive task. Inspired by the generative power of large language models (LLMs), this paper exploits the idea of building universal tabular data predictors based on generative modeling, namely UniPredict. Here, we demonstrate the scalability of an LLM to extensive tabular datasets, enabling it to comprehend diverse tabular inputs and predict target variables following the provided instructions. Specifically, we train a single LLM on an aggregation of 169 tabular datasets with diverse targets and compare its performance against baselines that are trained on each dataset separately. We observe this versatile UniPredict model demonstrates an advantage over other models, ranging from 5.4% to 13.4%, when compared with the best tree-boosting baseline and the best neural network baseline, respectively. We further test UniPredict in few-shot learning settings on another 62 tabular datasets. Our method achieves strong performance in quickly adapting to new tasks. In low-resource few-shot setup, we observed a 100%+ performance advantage compared with XGBoost, and significant margin over all baselines. We envision that UniPredict sheds light on developing a universal tabular data prediction system that learns from data at scale and serves a wide range of prediction tasks.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2310.03266

Country: North America > United States > Illinois (0.14)

Genre: Research Report > New Finding (0.67)

Industry:

Leisure & Entertainment (1.00)
Education > Educational Setting (1.00)
Banking & Finance (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Large Language Models on Lexical Semantic Change Detection: An Evaluation

Wang, Ruiyu, Choi, Matthew

arXiv.org Artificial IntelligenceDec-10-2023

Lexical Semantic Change Detection stands out as one of the few areas where Large Language Models (LLMs) have not been extensively involved. Traditional methods like PPMI, and SGNS remain prevalent in research, alongside newer BERT-based approaches. Despite the comprehensive coverage of various natural language processing domains by LLMs, there is a notable scarcity of literature concerning their application in this specific realm. In this work, we seek to bridge this gap by introducing LLMs into the domain of Lexical Semantic Change Detection. Our work presents novel prompting solutions and a comprehensive evaluation that spans all three generations of language models, contributing to the exploration of LLMs in this research area.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2312.06002

Country:

North America > Canada > Ontario > Toronto (0.29)
Asia > Middle East (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Add feedback

Video Transformers under Occlusion: How Physics and Background Attributes Impact Large Models for Robotic Manipulation

Jin, Shutong, Wang, Ruiyu, Zahid, Muhammad, Pokorny, Florian T.

arXiv.org Artificial IntelligenceOct-11-2023

As transformer architectures and dataset sizes continue to scale, the need to understand the specific dataset factors affecting model performance becomes increasingly urgent. This paper investigates how object physics attributes (color, friction coefficient, shape) and background characteristics (static, dynamic, background complexity) influence the performance of Video Transformers in trajectory prediction tasks under occlusion. Beyond mere occlusion challenges, this study aims to investigate three questions: How do object physics attributes and background characteristics influence the model performance? What kinds of attributes are most influential to the model generalization? Is there a data saturation point for large transformer model performance within a single task? To facilitate this research, we present OccluManip, a real-world video-based robot pushing dataset comprising 460,000 consistent recordings of objects with different physics and varying backgrounds. 1.4 TB and in total 1278 hours of high-quality videos of flexible temporal length along with target object trajectories are collected, accommodating tasks with different temporal requirements. Additionally, we propose Video Occlusion Transformer (VOT), a generic video-transformer-based network achieving an average 96% accuracy across all 18 sub-datasets provided in OccluManip. OccluManip and VOT will be released at: https://github.com/ShutongJIN/OccluManip.git

large language model, machine learning, natural language, (6 more...)

arXiv.org Artificial Intelligence

2310.02044

Genre: Research Report (0.89)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

Add feedback

Tree-of-Mixed-Thought: Combining Fast and Slow Thinking for Multi-hop Visual Reasoning

Hu, Pengbo, Qi, Ji, Li, Xingyu, Li, Hong, Wang, Xinqi, Quan, Bing, Wang, Ruiyu, Zhou, Yi

arXiv.org Artificial IntelligenceAug-20-2023

There emerges a promising trend of using large language models (LLMs) to generate code-like plans for complex inference tasks such as visual reasoning. This paradigm, known as LLM-based planning, provides flexibility in problem solving and endows better interpretability. However, current research is mostly limited to basic scenarios of simple questions that can be straightforward answered in a few inference steps. Planning for the more challenging multi-hop visual reasoning tasks remains under-explored. Specifically, under multi-hop reasoning situations, the trade-off between accuracy and the complexity of plan-searching becomes prominent. The prevailing algorithms either address the efficiency issue by employing the fast one-stop generation or adopt a complex iterative generation method to improve accuracy. Both fail to balance the need for efficiency and performance. Drawing inspiration from the dual system of cognition in the human brain, the fast and the slow think processes, we propose a hierarchical plan-searching algorithm that integrates the one-stop reasoning (fast) and the Tree-of-thought (slow). Our approach succeeds in performance while significantly saving inference steps. Moreover, we repurpose the PTR and the CLEVER datasets, developing a systematic framework for evaluating the performance and efficiency of LLMs-based plan-search algorithms under reasoning tasks at different levels of difficulty. Extensive experiments demonstrate the superiority of our proposed algorithm in terms of performance and efficiency. The dataset and code will be release soon.

artificial intelligence, natural language, relation, (15 more...)

arXiv.org Artificial Intelligence

2308.09658

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback