AITopics | Li, Liang

Collaborating Authors

Li, Liang

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Frequency Dynamic Convolution for Dense Image Prediction

Chen, Linwei, Gu, Lin, Li, Liang, Yan, Chenggang, Fu, Ying

arXiv.org Artificial IntelligenceMar-24-2025

While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. FDConv divides this budget into frequency-based groups with disjoint Fourier indices, enabling the construction of frequency-diverse weights without increasing the parameter cost. To further enhance adaptability, we propose Kernel Spatial Modulation (KSM) and Frequency Band Modulation (FBM). KSM dynamically adjusts the frequency response of each filter at the spatial level, while FBM decomposes weights into distinct frequency bands in the frequency domain and modulates them dynamically based on local content. Extensive experiments on object detection, segmentation, and classification validate the effectiveness of FDConv. We demonstrate that when applied to ResNet-50, FDConv achieves superior performance with a modest increase of +3.6M parameters, outperforming previous methods that require substantial increases in parameter budgets (e.g., CondConv +90M, KW +76.5M). Moreover, FDConv seamlessly integrates into a variety of architectures, including ConvNeXt, Swin-Transformer, offering a flexible and efficient solution for modern vision tasks. The code is made publicly available at https://github.com/Linwei-Chen/FDConv.

artificial intelligence, machine learning, proceedings, (16 more...)

arXiv.org Artificial Intelligence

2503.18783

Country: Asia > China (0.28)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Collective Behavior Clone with Visual Attention via Neural Interaction Graph Prediction

Li, Kai, Ma, Zhao, Li, Liang, Zhao, Shiyu

arXiv.org Artificial IntelligenceMar-9-2025

In this paper, we propose a framework, collective behavioral cloning (CBC), to learn the underlying interaction mechanism and control policy of a swarm system. Given the trajectory data of a swarm system, we propose a graph variational autoencoder (GVAE) to learn the local interaction graph. Based on the interaction graph and swarm trajectory, we use behavioral cloning to learn the control policy of the swarm system. To demonstrate the practicality of CBC, we deploy it on a real-world decentralized vision-based robot swarm system. A visual attention network is trained based on the learned interaction graph for online neighbor selection. Experimental results show that our method outperforms previous approaches in predicting both the interaction graph and swarm actions with higher accuracy. This work offers a promising approach for understanding interaction mechanisms and swarm dynamics in future swarm robotics research. Code and data are available.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2503.06869

Country:

Asia > China (0.14)
North America > United States > Montana > Roosevelt County (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

MobiLLM: Enabling LLM Fine-Tuning on the Mobile Device via Server Assisted Side Tuning

Li, Liang, Yang, Xingke, Wu, Wen, Wang, Hao, Ohtsuki, Tomoaki, Fu, Xin, Pan, Miao, Shen, Xuemin

arXiv.org Artificial IntelligenceFeb-27-2025

Large Language Model (LLM) at mobile devices and its potential applications never fail to fascinate. However, on-device LLM fine-tuning poses great challenges due to extremely high memory requirements and slow training speeds. Even with parameter-efficient fine-tuning (PEFT) methods that update only a small subset of parameters, resource-constrained mobile devices cannot afford them. In this paper, we propose MobiLLM to enable memory-efficient transformer LLM fine-tuning on a mobile device via server-assisted side-tuning. Particularly, MobiLLM allows the resource-constrained mobile device to retain merely a frozen backbone model, while offloading the memory and computation-intensive backpropagation of a trainable side-network to a high-performance server. Unlike existing fine-tuning methods that keep trainable parameters inside the frozen backbone, MobiLLM separates a set of parallel adapters from the backbone to create a backpropagation bypass, involving only one-way activation transfers from the mobile device to the server with low-width quantization during forward propagation. In this way, the data never leaves the mobile device while the device can remove backpropagation through the local backbone model and its forward propagation can be paralyzed with the server-side execution. Thus, MobiLLM preserves data privacy while significantly reducing the memory and computational burdens for LLM fine-tuning. Through extensive experiments, we demonstrate that MobiLLM can enable a resource-constrained mobile device, even a CPU-only one, to fine-tune LLMs and significantly reduce convergence time and memory usage.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2502.20421

Country: North America > United States > Louisiana (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Telecommunications (0.68)
Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Energy-Efficient Split Learning for Fine-Tuning Large Language Models in Edge Networks

Li, Zuguang, Wu, Shaohua, Li, Liang, Zhang, Songge

arXiv.org Artificial IntelligenceJan-13-2025

In this letter, we propose an energy-efficient split learning (SL) framework for fine-tuning large language models (LLMs) using geo-distributed personal data at the network edge, where LLMs are split and alternately across massive mobile devices and an edge server. Considering the device heterogeneity and channel dynamics in edge networks, a \underline{C}ut l\underline{A}yer and computing \underline{R}esource \underline{D}ecision (CARD) algorithm is developed to minimize training delay and energy consumption. Simulation results demonstrate that the proposed approach reduces the average training delay and server's energy consumption by 70.8% and 53.1%, compared to the benchmarks, respectively.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.0009

Country: Asia > China (0.30)

Genre: Research Report > New Finding (0.48)

Industry:

Energy (0.90)
Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method

Huang, Zhenpeng, Li, Xinhao, Li, Jiaqi, Wang, Jing, Zeng, Xiangyu, Liang, Cheng, Wu, Tao, Chen, Xi, Li, Liang, Wang, Limin

arXiv.org Artificial IntelligenceDec-31-2024

Multimodal Large Language Models (MLLMs) have shown significant progress in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges due to the need for real-time processing of continuous online video streams. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark specifically designed to evaluate models' ability to perceive, memorize, and reason within online video contexts. It features six core task types across three temporal contexts-past, present, and future-forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Third, we proposed an offline-to-online learning paradigm, designing an interleaved dialogue format for online video data and constructing an instruction-tuning dataset tailored for online video training. This framework led to the development of VideoChat-Online, a robust and efficient model for online video understanding. Despite the lower computational cost and higher efficiency, VideoChat-Online outperforms existing state-of-the-art offline and online models across popular offline video benchmarks and OVBench, demonstrating the effectiveness of our model architecture and training strategy.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2501.00584

Country: Asia > China (0.14)

Genre:

Research Report (1.00)
Instructional Material > Online (0.34)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Add feedback

EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation

Han, Shuhao, Fan, Haotian, Fu, Jiachen, Li, Liang, Li, Tao, Cui, Junhui, Wang, Yunqiu, Tai, Yang, Sun, Jingwei, Guo, Chunle, Li, Chongyi

arXiv.org Artificial IntelligenceDec-25-2024

Recently, Text-to-Image (T2I) generation models have achieved significant advancements. Correspondingly, many automated metrics have emerged to evaluate the image-text alignment capabilities of generative models. However, the performance comparison among these automated metrics is limited by existing small datasets. Additionally, these datasets lack the capacity to assess the performance of automated metrics at a fine-grained level. In this study, we contribute an EvalMuse-40K benchmark, gathering 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. In the construction process, we employ various strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of our benchmark. This allows us to comprehensively evaluate the effectiveness of image-text alignment metrics for T2I models. Meanwhile, we introduce two new methods to evaluate the image-text alignment capabilities of T2I models: FGA-BLIP2 which involves end-to-end fine-tuning of a vision-language model to produce fine-grained image-text alignment scores and PN-VQA which adopts a novel positive-negative VQA manner in VQA models for zero-shot fine-grained evaluation. Both methods achieve impressive performance in image-text alignment evaluations. We also use our methods to rank current AIGC models, in which the results can serve as a reference source for future study and promote the development of T2I generation. The data and code will be made publicly available.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.1815

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.65)

Add feedback

Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning

Tu, Yunbin, Li, Liang, Su, Li, Huang, Qingming

arXiv.org Artificial IntelligenceDec-18-2024

Video has emerged as a favored multimedia format on the internet. To better gain video contents, a new topic HIREST is presented, including video retrieval, moment retrieval, moment segmentation, and step-captioning. The pioneering work chooses the pre-trained CLIP-based model for video retrieval, and leverages it as a feature extractor for other three challenging tasks solved in a multi-task learning paradigm. Nevertheless, this work struggles to learn the comprehensive cognition of user-preferred content, due to disregarding the hierarchies and association relations across modalities. In this paper, guided by the shallow-to-deep principle, we propose a query-centric audio-visual cognition (QUAG) network to construct a reliable multi-modal representation for moment retrieval, segmentation and step-captioning. Specifically, we first design the modality-synergistic perception to obtain rich audio-visual content, by modeling global contrastive alignment and local fine-grained interaction between visual and audio modalities. Then, we devise the query-centric cognition that uses the deep-level query to perform the temporal-channel filtration on the shallow-level audio-visual representation. This can cognize user-preferred content and thus attain a query-centric audio-visual representation for three tasks. Extensive experiments show QUAG achieves the SOTA results on HIREST. Further, we test QUAG on the query-based video summarization task and verify its good generalization.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2412.13543

Country: Asia > China (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

A Decade of Deep Learning: A Survey on The Magnificent Seven

Azizov, Dilshod, Manzoor, Muhammad Arslan, Bojkovic, Velibor, Wang, Yingxu, Wang, Zixiao, Iklassov, Zangir, Zhao, Kailong, Li, Liang, Liu, Siwei, Zhong, Yu, Liu, Wei, Liang, Shangsong

arXiv.org Artificial IntelligenceDec-13-2024

At the core of this transformation is the development of multi-layered neural network architectures that facilitate automatic feature extraction from raw data, significantly improving the efficiency on machine learning tasks. Given the rapid pace of these advancements, an accessible manual is necessary to distill the key advances of the past decade. With this in mind, we introduce a study which highlights the evolution of deep learning, largely attributed to powerful algorithms. Among the multitude of breakthroughs, certain algorithms, including Residual Networks (ResNets), Transformers, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Graph Neural Networks (GNNs), Contrastive Language-Image Pretraining (CLIP) and Diffusion models, have emerged as the cornerstones and driving forces behind the discipline. We select these algorithms via a survey targeting a broad spectrum of academics and professionals with the aim of encapsulating the essence of the most influential algorithms over the past decade. In this work, we provide details on the selection methodology, exploring the mentioned architectures in a broader context of the history of deep learning. We present an overview of selected core architectures, their mathematical underpinnings, and the algorithmic procedures that define the subsequent extensions and variants of these models, their applications, and their challenges and potential future research directions. In addition, we explore the practical aspects related to these algorithms, such as training and optimization methods, normalization techniques, and rate scheduling strategies that are essential for their effective implementation. Therefore, our manuscript serves as a practical survey for understanding and applying these crucial algorithms and aims to provide a manual for experienced researchers transitioning into deep learning from other domains, as well as for beginners seeking to grasp the trending algorithms.

artificial intelligence, arxiv preprint arxiv, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2412.16188

Country:

Asia (0.28)
Europe (0.28)
North America > United States (0.14)

Genre:

Overview (1.00)
Research Report > New Finding (0.46)

Industry:

Information Technology (0.92)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
Leisure & Entertainment > Games (0.67)
(2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Multi-robot autonomous 3D reconstruction using Gaussian splatting with Semantic guidance

Zeng, Jing, Ye, Qi, Liu, Tianle, Xu, Yang, Li, Jin, Xu, Jinming, Li, Liang, Chen, Jiming

arXiv.org Artificial IntelligenceDec-3-2024

Implicit neural representations and 3D Gaussian splatting (3DGS) have shown great potential for scene reconstruction. Recent studies have expanded their applications in autonomous reconstruction through task assignment methods. However, these methods are mainly limited to single robot, and rapid reconstruction of large-scale scenes remains challenging. Additionally, task-driven planning based on surface uncertainty is prone to being trapped in local optima. To this end, we propose the first 3DGS-based centralized multi-robot autonomous 3D reconstruction framework. To further reduce time cost of task generation and improve reconstruction quality, we integrate online open-vocabulary semantic segmentation with surface uncertainty of 3DGS, focusing view sampling on regions with high instance uncertainty. Finally, we develop a multi-robot collaboration strategy with mode and task assignments improving reconstruction quality while ensuring planning efficiency. Our method demonstrates the highest reconstruction quality among all planning methods and superior planning efficiency compared to existing multi-robot methods. We deploy our method on multiple robots, and results show that it can effectively plan view paths and reconstruct scenes with high quality.

artificial intelligence, machine learning, reconstruction, (18 more...)

arXiv.org Artificial Intelligence

2412.02249

Country: Asia > China > Zhejiang Province (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Robots > Robot Planning & Action (0.48)

Add feedback

CalliffusionV2: Personalized Natural Calligraphy Generation with Flexible Multi-modal Control

Liao, Qisheng, Li, Liang, Fei, Yulang, Xia, Gus

arXiv.org Artificial IntelligenceOct-3-2024

From oracle bone script to seal script, from clerical script to standard In this paper, we introduce CalliffusionV2, a novel system script, the evolution of Chinese characters bears witness to designed to produce natural Chinese calligraphy with flexible the development of Chinese culture. This influence extends multi-modal control. Unlike previous approaches that beyond China, impacting other East Asian countries such as rely solely on image or text inputs and lack fine-grained Korea and Japan, where Chinese calligraphy has also played control, our system leverages both images to guide generations a significant role. Despite its historical significance, in modern at fine-grained levels and natural language texts times, mastering calligraphy requires a significant time to describe the features of generations. CalliffusionV2 excels investment that many people today find difficult to accommodate at creating a broad range of characters and can quickly in their busy lives.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2410.03787

Country:

Asia > Japan (0.24)
Asia > China (0.24)
North America > United States > Texas (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback