AITopics | Tai, Yu-Wing

Collaborating Authors

Tai, Yu-Wing

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Dynamic Path Navigation for Motion Agents with LLM Reasoning

Zhao, Yubo, Wu, Qi, Wang, Yifan, Tai, Yu-Wing, Tang, Chi-Keung

arXiv.org Artificial IntelligenceMar-10-2025

Large Language Models (LLMs) have demonstrated strong generalizable reasoning and planning capabilities. However, their efficacies in spatial path planning and obstacle-free trajectory generation remain underexplored. Leveraging LLMs for navigation holds significant potential, given LLMs' ability to handle unseen scenarios, support user-agent interactions, and provide global control across complex systems, making them well-suited for agentic planning and humanoid motion generation. As one of the first studies in this domain, we explore the zero-shot navigation and path generation capabilities of LLMs by constructing a dataset and proposing an evaluation protocol. Specifically, we represent paths using anchor points connected by straight lines, enabling movement in various directions. This approach offers greater flexibility and practicality compared to previous methods while remaining simple and intuitive for LLMs. We demonstrate that, when tasks are well-structured in this manner, modern LLMs exhibit substantial planning proficiency in avoiding obstacles while autonomously refining navigation with the generated motion to reach the target. Further, this spatial reasoning ability of a single LLM motion agent interacting in a static environment can be seamlessly generalized in multi-motion agents coordination in dynamic environments. Unlike traditional approaches that rely on single-step planning or local policies, our training-free LLM-based method enables global, dynamic, closed-loop planning, and autonomously resolving collision issues.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.07323

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents

Liu, Xinhang, Tang, Chi-Keung, Tai, Yu-Wing

arXiv.org Artificial IntelligenceFeb-28-2025

Constructing photorealistic virtual worlds has applications across various fields, but it often requires the extensive labor of highly trained professionals to operate conventional 3D modeling software. To democratize this process, we introduce WorldCraft, a system where large language model (LLM) agents leverage procedural generation to create indoor and outdoor scenes populated with objects, allowing users to control individual object attributes and the scene layout using intuitive natural language commands. In our framework, a coordinator agent manages the overall process and works with two specialized LLM agents to complete the scene creation: ForgeIt, which integrates an ever-growing manual through auto-verification to enable precise customization of individual objects, and ArrangeIt, which formulates hierarchical optimization problems to achieve a layout that balances ergonomic and aesthetic considerations. Additionally, our pipeline incorporates a trajectory control agent, allowing users to animate the scene and operate the camera through natural language interactions. Our system is also compatible with off-the-shelf deep 3D generators to enrich scene assets. Through evaluations and comparisons with state-of-the-art methods, we demonstrate the versatility of WorldCraft, ranging from single-object customization to intricate, large-scale interior and exterior scene designs. This system empowers non-professionals to bring their creative visions to life.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.15601

Country:

Asia (0.14)
North America > United States (0.14)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

Wang, Zixuan, Tai, Yu-Wing, Tang, Chi-Keung

arXiv.org Artificial IntelligenceOct-4-2024

We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) tasks often make single-pass inferences from text descriptions. While straightforward, this design struggles to produce high-quality audio when given complex text conditions. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions, and calls the agent for audio generation. Consequently, Audio-Agent generates high-quality audio that is closely aligned with the provided text or video while also supporting variable-length generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio, a process that can be tedious and time-consuming. We propose a simpler approach by fine-tuning a pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, to obtain both semantic and temporal conditions to bridge video and audio modality. Thus our framework provides a comprehensive solution for both TTA and VTA tasks without substantial computational overhead in training. Multimodal deep generative models have gained increasing attention these years. Essentially, the models are trained to perform tasks based on different kinds of input called modalities, mimicking how humans make decisions from different kinds of senses such as vision and smell Suzuki & Matsuo (2022).

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2410.03335

Genre: Research Report (1.00)

Industry:

Media > Music (0.46)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.55)

Add feedback

Reward-RAG: Enhancing RAG with Reward Driven Supervision

Nguyen, Thang, Chin, Peter, Tai, Yu-Wing

arXiv.org Artificial IntelligenceOct-3-2024

In this paper, we introduce Reward-RAG, a novel approach designed to enhance the Retrieval-Augmented Generation (RAG) model through Reward-Driven Supervision. Unlike previous RAG methodologies, which focus on training language models (LMs) to utilize external knowledge retrieved from external sources, our method adapts retrieval information to specific domains by employing CriticGPT to train a dedicated reward model. This reward model generates synthesized datasets for fine-tuning the RAG encoder, aligning its outputs more closely with human preferences. The versatility of our approach allows it to be effectively applied across various domains through domain-specific fine-tuning. We evaluate Reward-RAG on publicly available benchmarks from multiple domains, comparing it to state-of-the-art methods. Our experimental results demonstrate significant improvements in performance, highlighting the effectiveness of Reward-RAG in improving the relevance and quality of generated responses. These findings underscore the potential of integrating reward models with RAG to achieve superior outcomes in natural language generation tasks.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2410.0378

Country:

South America > Uruguay (0.29)
North America > United States (0.28)
Europe > Switzerland (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Research Report > Promising Solution (0.86)

Industry:

Leisure & Entertainment > Sports > Soccer (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.71)
Health & Medicine > Pharmaceuticals & Biotechnology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification

Liu, Jianmeng, Liu, Yichen, Zhang, Yuyao, Meng, Zeyuan, Tai, Yu-Wing, Tang, Chi-Keung

arXiv.org Artificial IntelligenceJun-8-2024

Recent conditional 3D completion works have mainly relied on CLIP or BERT to encode textual information, which cannot support complex instruction. Meanwhile, large language models (LLMs) have shown great potential in multi-modal understanding and generation tasks. Inspired by the recent advancements of LLM, we present Volume Patch LLM (VP-LLM), which leverages LLMs to perform conditional 3D completion in a single-forward pass. To integrate a 3D model into the LLM tokenization configuration, the incomplete 3D object is first divided into small patches that can be encoded independently. These encoded patches are then fed into an LLM along with the text prompt, instructing the LLM to capture the relations between these patches as well as injecting semantic meanings into the 3D object. Our results demonstrate a strong ability of LLMs to interpret complex text instructions and understand 3D objects, surpassing state-of-the-art diffusion-based 3D completion models in generation quality.

arxiv preprint arxiv, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2406.05543

Country:

Asia (0.28)
Europe > Netherlands (0.14)
Oceania > Australia (0.14)

Genre: Research Report > New Finding (0.86)

Industry:

Transportation > Air (0.70)
Aerospace & Defense > Aircraft (0.69)
Transportation > Passenger (0.69)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

C3LLM: Conditional Multimodal Content Generation Using Large Language Models

Wang, Zixuan, Duan, Qinkai, Tai, Yu-Wing, Tang, Chi-Keung

arXiv.org Artificial IntelligenceMay-25-2024

C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities, synthesizing the given conditional information, and making multimodal generation in a discrete manner. Our contributions are as follows. First, we adapt a hierarchical structure for audio generation tasks with pre-trained audio codebooks. Specifically, we train the LLM to generate audio semantic tokens from the given conditions, and further use a non-autoregressive transformer to generate different levels of acoustic tokens in layers to better enhance the fidelity of the generated audio. Second, based on the intuition that LLMs were originally designed for discrete tasks with the next-word prediction method, we use the discrete representation for audio generation and compress their semantic meanings into acoustic tokens, similar to adding "acoustic vocabulary" to LLM. Third, our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model, providing more versatility in an end-to-end fashion. Our C3LLM achieves improved results through various automated evaluation metrics, providing better semantic alignment compared to previous methods.

information, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2405.16136

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

C3Net: Compound Conditioned ControlNet for Multimodal Content Generation

Zhang, Juntao, Liu, Yuehuai, Tai, Yu-Wing, Tang, Chi-Keung

arXiv.org Artificial IntelligenceNov-29-2023

We present Compound Conditioned ControlNet, C3Net, a novel generative neural architecture taking conditions from multiple modalities and synthesizing multimodal contents simultaneously (e.g., image, text, audio). C3Net adapts the ControlNet architecture to jointly train and make inferences on a production-ready diffusion model and its trainable copies. Specifically, C3Net first aligns the conditions from multi-modalities to the same semantic latent space using modality-specific encoders based on contrastive training. Then, it generates multimodal outputs based on the aligned latent space, whose semantic information is combined using a ControlNet-like architecture called Control C3-UNet. Correspondingly, with this system design, our model offers an improved solution for joint-modality generation through learning and explaining multimodal conditions instead of simply taking linear interpolations on the latent space. Meanwhile, as we align conditions to a unified latent space, C3Net only requires one trainable Control C3-UNet to work on multimodal semantic information. Furthermore, our model employs unimodal pretraining on the condition alignment stage, outperforming the non-pretrained alignment even on relatively scarce training data and thus demonstrating high-quality compound condition generation. We contribute the first high-quality tri-modal validation set to validate quantitatively that C3Net outperforms or is on par with first and contemporary state-of-the-art multimodal generation. Our codes and tri-modal dataset will be released.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2311.17951

Country: North America > United States > Pennsylvania (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Distill Gold from Massive Ores: Efficient Dataset Distillation via Critical Samples Selection

Xu, Yue, Li, Yong-Lu, Cui, Kaitong, Wang, Ziyu, Lu, Cewu, Tai, Yu-Wing, Tang, Chi-Keung

arXiv.org Artificial IntelligenceNov-29-2023

Data-efficient learning has garnered significant attention, especially given the current trend of large multi-modal models. Recently, dataset distillation becomes an effective approach for data-efficiency; however, the distillation process itself can still be inefficient. In this work, we model the dataset distillation task within the context of information transport. By observing the substantial data redundancy inherent in the distillation, we argue to put more emphasis on the samples' utility for the distillation task. We introduce and validate a family of data utility estimators and optimal data selection methods to exploit the most valuable samples. This strategy significantly reduces the training costs and extends various existing distillation algorithms to larger and more diversified datasets, e.g., in some cases only 0.04% training data is sufficient for comparable distillation performance. Our method consistently enhances the distillation algorithms, even on much larger-scale and more heterogeneous datasets, e.g. ImageNet-1K and Kinetics-400. This paradigm opens up new avenues in the dynamics of distillation and paves the way for efficient dataset distillation. Our code is available on https://github.com/silicx/GoldFromOres .

artificial intelligence, distillation, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2305.18381

Country: Asia > China (0.14)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Cascade-DETR: Delving into High-Quality Universal Object Detection

Ye, Mingqiao, Ke, Lei, Li, Siyuan, Tai, Yu-Wing, Tang, Chi-Keung, Danelljan, Martin, Yu, Fisher

arXiv.org Artificial IntelligenceJul-20-2023

Object localization in general environments is a fundamental part of vision systems. While dominating on the COCO benchmark, recent Transformer-based detection methods are not competitive in diverse domains. Moreover, these methods still struggle to very accurately estimate the object bounding boxes in complex environments. We introduce Cascade-DETR for high-quality universal object detection. We jointly tackle the generalization to diverse domains and localization accuracy by proposing the Cascade Attention layer, which explicitly integrates object-centric information into the detection decoder by limiting the attention to the previous box prediction. To further enhance accuracy, we also revisit the scoring of queries. Instead of relying on classification scores, we predict the expected IoU of the query, leading to substantially more well-calibrated confidences. Lastly, we introduce a universal object detection benchmark, UDB10, that contains 10 datasets from diverse domains. While also advancing the state-of-the-art on COCO, Cascade-DETR substantially improves DETR-based detectors on all datasets in UDB10, even by over 10 mAP in some cases. The improvements under stringent quality requirements are even more pronounced. Our code and models will be released at https://github.com/SysCV/cascade-detr.

artificial intelligence, dataset, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2307.11035

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Mask-Free Video Instance Segmentation

Ke, Lei, Danelljan, Martin, Ding, Henghui, Tai, Yu-Wing, Tang, Chi-Keung, Yu, Fisher

arXiv.org Artificial IntelligenceMar-28-2023

The recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance. Our code and trained models are available at https://github.com/SysCV/MaskFreeVis.

artificial intelligence, machine learning, segmentation, (18 more...)

arXiv.org Artificial Intelligence

2303.15904

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback