Chen, Long
Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
Hu, Zijing, Zhang, Fengda, Chen, Long, Kuang, Kun, Li, Jiahui, Gao, Kaifeng, Xiao, Jun, Wang, Xin, Zhu, Wenwu
Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and the corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse rewards, where feedback is available only at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named $\text{B}^2\text{-DiffuRL}$, employs two strategies: \textbf{B}ackward progressive training and \textbf{B}ranch-based sampling. First, backward progressive training initially focuses on the final timesteps of the denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty caused by sparse rewards. Second, we perform branch-based sampling within each training interval. By comparing samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps learn effective policies rather than unnecessary ones. $\text{B}^2\text{-DiffuRL}$ is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of $\text{B}^2\text{-DiffuRL}$ in improving prompt-image alignment and maintaining diversity in generated images. The code for this work is available.
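As a rough illustration of the two strategies described above, the following minimal Python sketch combines backward progressive training (the trained interval starts at the final denoising steps and grows toward earlier, noisier timesteps) with branch-based sampling (several rollouts share a common prefix, so reward differences within a branch can be credited to the trained interval). The names `denoise_step`, `reward_model`, `policy_update`, and `sample_noise` are hypothetical stand-ins, not the authors' API.

    import random

    T = 50              # total denoising timesteps
    NUM_BRANCHES = 4    # rollouts sharing a common prefix

    def train(denoise_step, reward_model, policy_update, sample_noise, prompts,
              stages=5, iters_per_stage=100):
        for stage in range(1, stages + 1):
            # Backward progressive training: train only the last `interval_start`
            # denoising steps, extending the interval toward earlier timesteps.
            interval_start = stage * (T // stages)
            for _ in range(iters_per_stage):
                prompt = random.choice(prompts)
                x = sample_noise()
                # Frozen prefix shared by all branches.
                for t in range(T, interval_start, -1):
                    x = denoise_step(x, t, prompt, train=False)
                # Branch-based sampling: branches differ only inside the trained
                # interval, so reward gaps are attributable to those actions.
                rewards, trajectories = [], []
                for _ in range(NUM_BRANCHES):
                    xb, logps = x, []
                    for t in range(interval_start, 0, -1):
                        xb, logp = denoise_step(xb, t, prompt, train=True, return_logp=True)
                        logps.append(logp)
                    trajectories.append(logps)
                    rewards.append(reward_model(xb, prompt))
                baseline = sum(rewards) / len(rewards)       # within-branch baseline
                advantages = [r - baseline for r in rewards]
                policy_update(trajectories, advantages)      # e.g., a policy-gradient step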
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
Renz, Katrin, Chen, Long, Arani, Elahe, Sinavski, Oleg
Integrating large language models (LLMs) into autonomous driving has attracted significant attention with the hope of improving generalization and explainability. However, existing methods often focus on either driving or vision-language understanding; achieving both high driving performance and extensive language understanding remains challenging. In addition, the dominant approach to vision-language understanding is visual question answering. However, for autonomous driving this is only useful if it is aligned with the action space; otherwise, the model's answers could be inconsistent with its behavior. Therefore, we propose a model that handles three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. Our model, SimLingo, is based on a vision-language model (VLM) and works using only camera input, excluding expensive sensors like LiDAR. SimLingo obtains state-of-the-art performance on the widely used CARLA simulator on the Bench2Drive benchmark and is the winning entry of the CARLA Challenge 2024. Additionally, we achieve strong results on a wide variety of language-related tasks while maintaining high driving performance.
Learning Perceptive Humanoid Locomotion over Challenging Terrain
Sun, Wandong, Cao, Baoshi, Chen, Long, Su, Yongbo, Liu, Yang, Xie, Zongwu, Liu, Hong
Humanoid robots are engineered to navigate terrains akin to those encountered by humans, which necessitates human-like locomotion and perceptual abilities. Currently, the most reliable controllers for humanoid motion rely exclusively on proprioception, an approach that becomes both dangerous and unreliable when coping with rugged terrain. Although integrating height maps into perception can enable proactive gait planning, robust utilization of this information remains a significant challenge, especially when exteroceptive perception is noisy. To surmount these challenges, we propose a solution based on a teacher-student distillation framework. In this paradigm, an oracle policy accesses noise-free data to establish an optimal reference policy, while the student policy not only imitates the teacher's actions but also simultaneously trains a world model with a variational information bottleneck for sensor denoising and state estimation. Extensive evaluations demonstrate that our approach markedly enhances performance in scenarios characterized by unreliable terrain estimation. Moreover, in rigorous testing across challenging urban settings and off-road environments, the robot successfully traversed 2 km of varied terrain without external intervention.
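To make the student objective concrete, here is a minimal PyTorch sketch, written under my own assumptions rather than from the paper's implementation, of a loss that combines imitation of the oracle (teacher) policy with a variational information bottleneck (VIB) world-model term for denoising noisy exteroception; `student.encode`, `student.act`, and `student.decode` are hypothetical interfaces.

    import torch
    import torch.nn.functional as F

    def student_loss(student, teacher, noisy_obs, clean_state, beta=1e-3):
        # Encode noisy observations into a Gaussian latent (reparameterization trick).
        mu, logvar = student.encode(noisy_obs)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

        # Imitation: match the oracle action computed from noise-free data.
        with torch.no_grad():
            target_action = teacher(clean_state)
        imitation = F.mse_loss(student.act(z), target_action)

        # World model: reconstruct the denoised state from the latent.
        reconstruction = F.mse_loss(student.decode(z), clean_state)

        # Information bottleneck: KL divergence between the latent posterior and N(0, I).
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

        return imitation + reconstruction + beta * kl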
WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation
Nie, Dujun, Guo, Xianda, Duan, Yiqun, Zhang, Ruijun, Chen, Long
Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent progress in Vision-Language Model (VLM)-based agents has demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by Vision-Language Models (VLMs). It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav proposes an online-maintained Curiosity Value Map as part of the world model memory to provide dynamic configuration for the navigation policy. By decomposing the task according to a human-like thinking process, WMNav effectively alleviates the impact of model hallucination by making decisions based on the feedback difference between the world model's plan and the observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization.
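Purely as an illustration of the decision loop suggested by the abstract, and not of WMNav's actual interfaces, a highly simplified sketch might look as follows; the curiosity value map, proposer, world model, and environment objects are hypothetical, and the feedback-difference re-planning step is omitted.

    def navigate(world_model, curiosity_map, proposer, env, goal, max_steps=200):
        obs = env.reset()
        for _ in range(max_steps):
            # Two-stage proposer: broad exploration until a goal hypothesis exists,
            # then precise localization around it.
            stage = "localize" if curiosity_map.goal_hypothesis(goal) else "explore"
            candidates = proposer.propose(obs, stage)

            # The world model predicts the outcome of each candidate action, and the
            # curiosity value map retains those predictions as memory.
            predictions = [world_model.predict(obs, a, goal) for a in candidates]
            curiosity_map.update(obs, candidates, predictions)

            # Execute the candidate whose predicted outcome is most valuable.
            action = max(zip(candidates, predictions), key=lambda pair: pair[1].value)[0]
            obs, done = env.step(action)
            if done:
                break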
Tuning-Free Structured Sparse PCA via Deep Unfolding Networks
Chen, Long, Xiu, Xianchao
Sparse principal component analysis (PCA) is a well-established dimensionality reduction technique that is often used for unsupervised feature selection (UFS). However, determining the regularization parameters is rather challenging, and conventional approaches, including grid search and Bayesian optimization, not only incur substantial computational costs but also exhibit high sensitivity. To address these limitations, we first establish a structured sparse PCA formulation by integrating the $\ell_1$-norm and $\ell_{2,1}$-norm to capture the local and global structures, respectively. Building upon the off-the-shelf alternating direction method of multipliers (ADMM) optimization framework, we then design an interpretable deep unfolding network that translates iterative optimization steps into trainable neural architectures. This innovation enables automatic learning of the regularization parameters, effectively bypassing the empirical tuning requirements of conventional methods. Numerical experiments on benchmark datasets validate the advantages of the proposed method over existing state-of-the-art methods. Our code will be accessible at https://github.com/xianchaoxiu/SPCA-Net.
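A plausible form of the structured sparse PCA objective described above, written here under my own notational assumptions (the paper's exact formulation may differ), is

$$
\min_{\mathbf{W}} \; -\operatorname{tr}\!\left(\mathbf{W}^{\top}\mathbf{X}^{\top}\mathbf{X}\mathbf{W}\right) + \lambda_{1}\,\|\mathbf{W}\|_{1} + \lambda_{2}\,\|\mathbf{W}\|_{2,1} \quad \text{s.t. } \mathbf{W}^{\top}\mathbf{W} = \mathbf{I},
$$

where $\mathbf{X}$ is the centered data matrix and $\mathbf{W}$ collects the loading vectors; the $\ell_1$ term promotes element-wise (local) sparsity, the $\ell_{2,1}$ term promotes row-wise (global) sparsity for feature selection, and the unfolded ADMM network treats $\lambda_1$, $\lambda_2$ (and the step sizes) as learnable parameters instead of hand-tuned hyperparameters.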
Learning Humanoid Locomotion with World Model Reconstruction
Sun, Wandong, Chen, Long, Su, Yongbo, Cao, Baoshi, Liu, Yang, Xie, Zongwu
Humanoid robots are designed to navigate environments accessible to humans using their legs. However, classical research has primarily focused on controlled laboratory settings, resulting in a gap in developing controllers for navigating complex real-world terrains. This challenge mainly arises from the limitations and noise in sensor data, which hinder the robot's understanding of itself and the environment. In this study, we introduce World Model Reconstruction (WMR), an end-to-end learning-based approach for blind humanoid locomotion across challenging terrains. We propose training an estimator to explicitly reconstruct the world state and utilize it to enhance the locomotion policy. The locomotion policy takes inputs entirely from the reconstructed information. The policy and the estimator are trained jointly; however, the gradient between them is intentionally cut off. This ensures that the estimator focuses solely on world reconstruction, independent of the locomotion policy's updates. We evaluated our model on rough, deformable, and slippery surfaces in real-world scenarios, demonstrating robust adaptability and resistance to interference. The robot successfully completed a 3.2 km hike without any human assistance, mastering terrains covered with ice and snow.
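The "gradient cut-off" between the estimator and the policy is essentially a stop-gradient; a minimal PyTorch sketch of one joint update, assuming privileged ground-truth world state is available in simulation (this is my reading of the abstract, not the released code), could look like this:

    import torch.nn.functional as F

    def joint_update(estimator, policy, obs_history, true_world_state,
                     policy_loss_fn, opt_est, opt_pi):
        # Estimator: explicitly reconstruct the world state from proprioceptive history.
        recon = estimator(obs_history)
        est_loss = F.mse_loss(recon, true_world_state)   # supervised with simulator state
        opt_est.zero_grad()
        est_loss.backward()
        opt_est.step()

        # Policy: acts entirely on the reconstructed information, but the gradient is
        # cut off with .detach(), so policy updates never reach the estimator.
        pi_loss = policy_loss_fn(policy, recon.detach()) # e.g., a PPO surrogate loss
        opt_pi.zero_grad()
        pi_loss.backward()
        opt_pi.step()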
SleepGMUformer: A gated multimodal temporal neural network for sleep staging
Zhao, Chenjun, Niu, Xuesen, Yu, Xinglin, Chen, Long, Lv, Na, Zhou, Huiyu, Zhao, Aite
Sleep staging is a central aspect of sleep assessment and research; its accuracy is relevant not only to the assessment of sleep quality [3] but also to achieving early intervention for sleep disorders and related psychiatric disorders [4]. Polysomnography (PSG) is a multi-parameter study of sleep [5], a test that diagnoses sleep disorders through different types of physiological signals recorded during sleep, such as electroencephalography (EEG), electrocardiography (ECG), electrooculography (EOG), electromyography (EMG), oro-nasal airflow, and oxygen saturation [6]. According to the Rechtschaffen and Kales (R&K) rules, PSG signals are usually divided into 30-second segments and classified into six sleep stages, namely wakefulness (Wake), four non-rapid eye movement (NREM) stages (i.e., S1, S2, S3, and S4), and rapid eye movement (REM). In 2007, the American Academy of Sleep Medicine (AASM) revised the R&K criteria, merging the NREM stages S3 and S4 into a single deep-sleep stage. Sleep specialists typically utilize these criteria for the manual classification of sleep stages, a process that is not only labor-intensive but also prone to subjective bias [7]. Therefore, automated sleep staging is a more efficient alternative to manual methods and has greater clinical value [8].
KARST: Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission for Visual Classification
Zhu, Yue, Diao, Haiwen, Gao, Shang, Chen, Long, Lu, Huchuan
Fine-tuning pre-trained vision models for specific tasks is a common practice in computer vision. However, this process becomes more expensive as models grow larger. Recently, parameter-efficient fine-tuning (PEFT) methods have emerged as a popular solution to improve training efficiency and reduce storage needs by tuning additional low-rank modules within pre-trained backbones. Despite their advantages, they struggle with limited representation capabilities and misalignment with pre-trained intermediate features. To address these issues, we introduce an innovative Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission (KARST) for various recognition tasks. Specifically, its multi-kernel design extends Kronecker projections horizontally and separates adaptation matrices into multiple complementary spaces, reducing parameter dependency and creating more compact subspaces. Besides, it incorporates extra learnable re-scaling factors to better align with pre-trained feature distributions, allowing for more flexible and balanced feature aggregation. Extensive experiments validate that our KARST outperforms other PEFT counterparts with a negligible inference cost due to its re-parameterization characteristics. Code is publicly available at: https://github.com/Lucenova/KARST.
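For intuition, a minimal PyTorch sketch of a multi-kernel Kronecker adapter with learnable re-scaling factors is given below, written under my own assumptions about shapes and initialization rather than from the official KARST code; because the update is a plain weight delta, it can be folded into the frozen weight at inference (re-parameterization), leaving no extra cost.

    import torch
    import torch.nn as nn

    class MultiKernelKroneckerAdapter(nn.Module):
        def __init__(self, base_linear: nn.Linear, num_kernels: int = 4, a_dim: int = 8):
            super().__init__()
            self.base = base_linear
            for p in self.base.parameters():
                p.requires_grad_(False)                      # keep the backbone frozen
            out_f, in_f = base_linear.weight.shape
            assert out_f % a_dim == 0 and in_f % a_dim == 0
            self.A = nn.ParameterList(
                [nn.Parameter(torch.randn(a_dim, a_dim) * 1e-3) for _ in range(num_kernels)])
            self.B = nn.ParameterList(
                [nn.Parameter(torch.zeros(out_f // a_dim, in_f // a_dim)) for _ in range(num_kernels)])
            self.scale = nn.Parameter(torch.ones(num_kernels))   # re-scaling factors

        def delta_weight(self) -> torch.Tensor:
            # Sum of re-scaled Kronecker products, each spanning a complementary subspace.
            return sum(s * torch.kron(A, B) for s, A, B in zip(self.scale, self.A, self.B))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + x @ self.delta_weight().t()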
FAS: Fast ANN-SNN Conversion for Spiking Large Language Models
Chen, Long, Song, Xiaotian, Song, Andy, Chen, BaDong, Lv, Jiancheng, Sun, Yanan
Spiking large language models (LLMs) have been shown to be a good alternative to conventional LLMs in various scenarios. Existing methods for creating spiking LLMs, i.e., direct training and ANN-SNN conversion, often suffer from performance degradation and relatively high computational costs. To address these issues, we propose a novel Fast ANN-SNN conversion strategy (FAS) that transforms LLMs into spiking LLMs in two stages. The first stage employs full-parameter fine-tuning of pre-trained models, so it does not require any direct training from scratch. The second stage introduces a coarse-to-fine calibration method to reduce conversion errors and improve accuracy. Our experiments on both language and vision-language tasks across four different scales of LLMs demonstrate that FAS can achieve state-of-the-art performance with significantly reduced inference latency and computational costs. For example, FAS takes only 8 timesteps to achieve an accuracy 3% higher than that of the OPT-7B model, while reducing energy consumption by 96.63%.
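The abstract does not spell out the calibration details, so the following is only a generic sketch, under my own assumptions, of what a coarse-to-fine threshold calibration for ANN-SNN conversion can look like: a per-layer threshold is first set coarsely from ANN activation statistics and then refined so that spiking firing rates match the ANN activations on calibration data (`spike_rate_fn` is a hypothetical differentiable simulator of layer firing rates).

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def coarse_calibration(ann_activations):
        # Coarse stage: one threshold per layer, e.g. a high percentile of the
        # ANN activations recorded on a small calibration set.
        return {name: torch.quantile(act.flatten(), 0.99)
                for name, act in ann_activations.items()}

    def fine_calibration(thresholds, ann_activations, spike_rate_fn, lr=1e-2, steps=100):
        # Fine stage: treat each threshold as a learnable scalar and adjust it so
        # the spiking firing rate matches the ANN activation, layer by layer.
        params = {name: torch.nn.Parameter(th.clone()) for name, th in thresholds.items()}
        opt = torch.optim.Adam(params.values(), lr=lr)
        for _ in range(steps):
            loss = sum(F.mse_loss(spike_rate_fn(name, th), ann_activations[name])
                       for name, th in params.items())
            opt.zero_grad()
            loss.backward()
            opt.step()
        return {name: th.detach() for name, th in params.items()}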
3D Foundation AI Model for Generalizable Disease Detection in Head Computed Tomography
Zhu, Weicheng, Huang, Haoxu, Tang, Huanze, Musthyala, Rushabh, Yu, Boyang, Chen, Long, Vega, Emilio, O'Donnell, Thomas, Dehkharghani, Seena, Frontera, Jennifer A., Masurkar, Arjun V., Melmed, Kara, Razavian, Narges
Head computed tomography (CT) imaging is a widely used imaging modality with a multitude of medical indications, particularly in assessing pathology of the brain, skull, and cerebrovascular system. It is commonly the first-line imaging in neurologic emergencies given its rapid image acquisition, safety, low cost, and ubiquity. Deep learning models may facilitate detection of a wide range of diseases. However, the scarcity of high-quality labels and annotations, particularly among less common conditions, significantly hinders the development of powerful models. To address this challenge, we introduce FM-CT: a Foundation Model for Head CT for generalizable disease detection, trained using self-supervised learning. Our approach pre-trains a deep learning model on a large, diverse dataset of 361,663 non-contrast 3D head CT scans without the need for manual annotations, enabling the model to learn robust, generalizable features. To investigate the potential of self-supervised learning in head CT, we employed both discrimination with self-distillation and masked image modeling, and we constructed our model in 3D rather than at the slice level (2D) to exploit the structure of head CT scans more comprehensively and efficiently. The model's downstream classification performance is evaluated using an internal dataset and three external datasets, encompassing both in-distribution (ID) and out-of-distribution (OOD) data. Our results demonstrate that, on scarce annotated datasets, the self-supervised foundation model significantly improves performance on downstream diagnostic tasks compared to models trained from scratch and previous 3D CT foundation models. This work highlights the effectiveness of self-supervised learning in medical imaging and sets a new benchmark for head CT image analysis in 3D, enabling broader use of artificial intelligence for head CT-based diagnosis.
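As a small illustration of one of the two pre-training signals, masked image modeling on 3D volumes, here is a sketch of random 3D patch masking under my own assumptions (the self-distillation branch and FM-CT's actual patching scheme are not shown):

    import torch

    def mask_3d_patches(volume: torch.Tensor, patch: int = 16, mask_ratio: float = 0.6):
        # volume: (B, 1, D, H, W) with D, H, W divisible by `patch`.
        B, _, D, H, W = volume.shape
        grid = (D // patch) * (H // patch) * (W // patch)
        n_mask = int(grid * mask_ratio)
        mask = torch.zeros(B, grid, dtype=torch.bool)
        for b in range(B):
            mask[b, torch.randperm(grid)[:n_mask]] = True      # choose patches to hide
        # Expand the flat patch mask back onto the voxel grid.
        mask_3d = mask.view(B, D // patch, H // patch, W // patch)
        mask_vox = (mask_3d.repeat_interleave(patch, 1)
                            .repeat_interleave(patch, 2)
                            .repeat_interleave(patch, 3))
        masked_volume = volume * (~mask_vox).unsqueeze(1)       # zero out masked voxels
        return masked_volume, mask                              # target: recover masked patches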