AITopics | Xu, Jiaqi

Xu, Jiaqi

MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models

Yan, Qiao, Yuan, Yuchen, Hu, Xiaowei, Wang, Yihan, Xu, Jiaqi, Li, Jinpeng, Fu, Chi-Wing, Heng, Pheng-Ann

arXiv.org Artificial Intelligence, Feb-28-2025

The increasing use of vision-language models (VLMs) in healthcare applications presents great challenges related to hallucinations, in which the models may generate seemingly plausible results that are in fact incorrect. Such hallucinations can jeopardize clinical decision making, potentially harming diagnosis and treatment. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune successfully improves the ability of several existing models to manage hallucinations and boosts their zero-shot performance on downstream visual-question-answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Code and dataset will be available at MedHallTune.

large language model, medhalltune, natural language, (17 more...)

2502.20780

Country: Asia > China (0.29)

Genre: Research Report > New Finding (0.89)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)

InterFormer: Towards Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction

Zeng, Zhichen, Liu, Xiaolong, Hang, Mengyue, Liu, Xiaoyi, Zhou, Qinghai, Yang, Chaofei, Liu, Yiqun, Ruan, Yichen, Chen, Laming, Chen, Yuxin, Hao, Yujia, Xu, Jiaqi, Nie, Jade, Liu, Xi, Zhang, Buyun, Wen, Wei, Yuan, Siyang, Wang, Kai, Chen, Wen-Yen, Han, Yiping, Li, Huayu, Yang, Chunzhi, Long, Bo, Yu, Philip S., Tong, Hanghang, Yang, Jiyan

arXiv.org Artificial Intelligence, Jan-7-2025

Click-through rate (CTR) prediction, which predicts the probability of a user clicking an ad, is a fundamental task in recommender systems. The emergence of heterogeneous information, such as user profile and behavior sequences, depicts user interests from different aspects. A mutually beneficial integration of heterogeneous information is the cornerstone towards the success of CTR prediction. However, most of the existing methods suffer from two fundamental limitations, including (1) insufficient inter-mode interaction due to the unidirectional information flow between modes, and (2) aggressive information aggregation caused by early summarization, resulting in excessive information loss. To address the above limitations, we propose a novel module named InterFormer to learn heterogeneous information interaction in an interleaving style. To achieve better interaction learning, InterFormer enables bidirectional information flow for mutually beneficial learning across different modes. To avoid aggressive information aggregation, we retain complete information in each data mode and use a separate bridging arch for effective information selection and summarization. Our proposed InterFormer achieves state-of-the-art performance on three public datasets and a large-scale industrial dataset.

artificial intelligence, machine learning, natural language, (17 more...)

2411.09852

Country: North America > United States > Illinois (0.28)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)

EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

Xu, Jiaqi, Zou, Xinyi, Huang, Kunzhe, Chen, Yunkuo, Liu, Bo, Cheng, MengLi, Shi, Xing, Huang, Jun

arXiv.org Artificial Intelligence, Jul-5-2024

This paper presents EasyAnimate, an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes. We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a motion module block. It is used to capture temporal dynamics, thereby ensuring the production of consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate video with different styles. It can also generate videos with different frame rates and resolutions during both training and inference phases, suitable for both images and videos. Moreover, we introduce slice VAE, a novel approach to condense the temporal axis, facilitating the generation of long duration videos. Currently, EasyAnimate exhibits the proficiency to generate videos with 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT models training (both the baseline model and LoRA model), and end-to-end video inference. Code is available at: https://github.com/aigc-apps/EasyAnimate. We are continuously working to enhance the performance of our method.

artificial intelligence, machine learning, natural language, (19 more...)

2405.18991

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Survey of Reasoning with Foundation Models

Sun, Jiankai, Zheng, Chuanyang, Xie, Enze, Liu, Zhengying, Chu, Ruihang, Qiu, Jianing, Xu, Jiaqi, Ding, Mingyu, Li, Hongyang, Geng, Mengzhe, Wu, Yue, Wang, Wenhai, Chen, Junsong, Yin, Zhangyue, Ren, Xiaozhe, Fu, Jie, He, Junxian, Yuan, Wu, Liu, Qi, Liu, Xihui, Li, Yu, Dong, Hao, Cheng, Yu, Zhang, Ming, Heng, Pheng Ann, Dai, Jifeng, Luo, Ping, Wang, Jingdong, Wen, Ji-Rong, Qiu, Xipeng, Guo, Yike, Xiong, Hui, Liu, Qun, Li, Zhenguo

arXiv.org Artificial Intelligence, Jan-25-2024

Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation. It serves as a fundamental methodology in the field of Artificial General Intelligence (AGI). With the ongoing development of foundation models, e.g., Large Language Models (LLMs), there is a growing interest in exploring their abilities in reasoning tasks. In this paper, we introduce seminal foundation models proposed or adaptable for reasoning, highlighting the latest advancements in various reasoning tasks, methods, and benchmarks. We then delve into the potential future directions behind the emergence of reasoning abilities within foundation models. We also discuss the relevance of multimodal learning, autonomous agents, and super alignment in the context of reasoning. By discussing these future research directions, we hope to inspire researchers in their exploration of this field, stimulate further advancements in reasoning with foundation models, and contribute to the development of AGI.

artificial general intelligence, large language model, machine learning, (29 more...)

2312.11562

Country:

Europe (1.00)
Asia > China (1.00)
Asia > Middle East (0.67)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre:

Research Report > Promising Solution (1.00)
Research Report > New Finding (1.00)
Overview (1.00)
(2 more...)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(13 more...)

FaceChain: A Playground for Human-centric Artificial Intelligence Generated Content

Liu, Yang, Yu, Cheng, Shang, Lei, He, Yongyi, Wu, Ziheng, Wang, Xingjun, Xu, Chao, Xie, Haoyu, Wang, Weida, Zhao, Yuze, Zhu, Lin, Cheng, Chen, Chen, Weitao, Yao, Yuan, Zhou, Wenmeng, Xu, Jiaqi, Wang, Qiang, Chen, Yingda, Xie, Xuansong, Sun, Baigui

arXiv.org Artificial Intelligence, Dec-13-2023

Recent advancements in personalized image generation have unveiled the intriguing capability of pre-trained text-to-image models to learn identity information from a collection of portrait images. However, existing solutions struggle to produce truthful details and usually suffer from several defects: (i) the generated face exhibits its own unique characteristics, i.e., facial shape and facial feature positioning may not resemble key characteristics of the input, and (ii) the synthesized face may contain warped, blurred or corrupted regions. In this paper, we present FaceChain, a personalized portrait generation framework that combines a series of customized image-generation models and a rich set of face-related perceptual understanding models (e.g., face detection, deep face embedding extraction, and facial attribute recognition) to tackle the aforementioned challenges and to generate truthful personalized portraits with only a handful of portrait images as input. Concretely, we inject several SOTA face models into the generation procedure, achieving more efficient label-tagging, data-processing, and model post-processing compared to previous solutions such as DreamBooth (Ruiz et al., 2023), InstantBooth (Shi et al., 2023), or other LoRA-only approaches (Hu et al., 2021). Besides, based on FaceChain, we further develop several applications to build a broader playground that better shows its value, including virtual try-on and 2D talking head. We hope it can grow to serve the burgeoning needs of the communities. Note that this is ongoing work that will be consistently refined and improved upon. FaceChain is open-sourced under the Apache-2.0 license at https://github.com/modelscope/facechain.

artificial intelligence, machine learning, portrait, (16 more...)

2308.14256

Genre: Research Report (0.50)

Industry: Information Technology (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling

Xu, Jiaqi, Liu, Bo, Chen, Yunkuo, Cheng, Mengli, Shi, Xing

arXiv.org Artificial Intelligence, Mar-10-2023

Video-and-language understanding has a variety of applications in industry, such as video question answering, text-video retrieval and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal encoders and feature fusion modules, which consume large amounts of GPU memory. In particular, they have difficulty dealing with the dense video frames and long text that are prevalent in industrial applications. In this paper, we propose MuLTI, a highly accurate and memory-efficient video-and-language understanding model that achieves efficient and effective feature fusion through feature sampling and attention modules. Therefore, MuLTI can handle longer sequences with limited GPU memory. Then, we introduce an attention-based adapter to the encoders, which finetunes the shallow features to improve the model's performance with low GPU memory consumption. Finally, to further improve the model's performance, we introduce a new pretraining task named Multiple Choice Modeling to bridge the task gap between pretraining and downstream tasks and enhance the model's ability to align the video and the text. Benefiting from the efficient feature fusion module, the attention-based adapter and the new pretraining task, MuLTI achieves state-of-the-art performance on multiple datasets. Implementation and pretrained models will be released.

artificial intelligence, machine learning, natural language, (16 more...)

2303.05707

Genre: Research Report (0.82)

Industry: Education (0.71)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

SurRoL: An Open-source Reinforcement Learning Centered and dVRK Compatible Platform for Surgical Robot Learning

Xu, Jiaqi, Li, Bin, Lu, Bo, Liu, Yun-Hui, Dou, Qi, Heng, Pheng-Ann

arXiv.org Artificial Intelligence, Aug-30-2021

Autonomous surgical execution relieves tedious routines and surgeons' fatigue. Recent learning-based methods, especially reinforcement learning (RL) based methods, achieve promising performance for dexterous manipulation, which usually requires simulation to collect data efficiently and reduce hardware cost. The existing learning-based simulation platforms for medical robots suffer from limited scenarios and simplified physical interactions, which degrade the real-world performance of learned policies. In this work, we designed SurRoL, an RL-centered simulation platform for surgical robot learning compatible with the da Vinci Research Kit (dVRK). SurRoL integrates a user-friendly RL library for algorithm development and a real-time physics engine, which is able to support more PSM/ECM scenarios and more realistic physical interactions. Ten learning-based surgical tasks that are common in real autonomous surgical execution are built into the platform. We evaluate SurRoL using RL algorithms in simulation, provide in-depth analysis, deploy the trained policies on the real dVRK, and show that SurRoL achieves better transferability in the real world.

artificial intelligence, health & medicine, simulation, (18 more...)

2108.13035

Genre: Research Report (0.50)

Industry: Health & Medicine > Surgery (0.94)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
