Li, Jun
Invertible Koopman neural operator for data-driven modeling of partial differential equations
Jin, Yuhong, Cong, Andong, Hou, Lei, Gao, Qiang, Ge, Xiangdong, Zhu, Chonglong, Feng, Yongzhi, Li, Jun
INN is introduced to eliminate dependency on reconstruction loss. Koopman operator is parameterized in frequency space to ensure resolution-invariance. By preprocessing, such as interpolation, IKNO is available for non-Cartesian domains. In various numerical and real-world examples, IKNO performs over FNO and KNO. R. ChinaA R T I C L E I N F OKeywords: Deep learning Invertible neural network Koopman operator Data-driven modeling Neural operator Partial differential equations A B S T R A C T Koopman operator theory is a popular candidate for data-driven modeling because it provides a global linearization representation for nonlinear dynamical systems. However, existing Koopman operator-based methods suffer from shortcomings in constructing the well-behaved observable function and its inverse and are inefficient enough when dealing with partial differential equations (PDEs). To address these issues, this paper proposes the Invertible Koopman Neural Operator (IKNO), a novel data-driven modeling approach inspired by the Koopman operator theory and neural operator. IKNO leverages an Invertible Neural Network to parameterize observable function and its inverse simultaneously under the same learnable parameters, explicitly guaranteeing the reconstruction relation, thus eliminating the dependency on the reconstruction loss, which is an essential improvement over the original Koopman Neural Operator (KNO). The structured linear matrix inspired by the Koopman operator theory is parameterized to learn the evolution of observables' low-frequency modes in the frequency space rather than directly in the observable space, sustaining IKNO is resolution-invariant like other neural operators. Moreover, with preprocessing such as interpolation and dimension expansion, IKNO can be extended to operator learning tasks defined on non-Cartesian domains. We fully support the above claims based on rich numerical and real-world examples and demonstrate the effectiveness of IKNO and superiority over other neural operators.1. Introduction Complex nonlinear dynamical systems are ubiquitous in many engineering fields, such as aerospace and vibration control, and modeling these systems is an important research topic [1-3]. Traditional knowledge-driven modeling approaches usually use a priori expertise to build a set of differential or algebraic equations to describe or explain phenomena of interest, having achieved relative maturity. However, in many scenarios, some key parameters, even expressions of systems of concern, may be difficult to measure or give accurately, making establishing a physical model that can accurately characterize systems' evolution challenging. In recent years, as big data technology and computer performance have improved, data-driven modeling approaches have gained extensive attention from researchers, providing a feasible route to solve the aforementioned problems [4-6].
Visual and Text Prompt Segmentation: A Novel Multi-Model Framework for Remote Sensing
Zi, Xing, Jin, Kairui, Tao, Xian, Li, Jun, Braytee, Ali, Shah, Rajiv Ratn, Prasad, Mukesh
Pixel-level segmentation is essential in remote sensing, where foundational vision models like CLIP and Segment Anything Model(SAM) have demonstrated significant capabilities in zero-shot segmentation tasks. Despite their advances, challenges specific to remote sensing remain substantial. Firstly, The SAM without clear prompt constraints, often generates redundant masks, and making post-processing more complex. Secondly, the CLIP model, mainly designed for global feature alignment in foundational models, often overlooks local objects crucial to remote sensing. This oversight leads to inaccurate recognition or misplaced focus in multi-target remote sensing imagery. Thirdly, both models have not been pre-trained on multi-scale aerial views, increasing the likelihood of detection failures. To tackle these challenges, we introduce the innovative VTPSeg pipeline, utilizing the strengths of Grounding DINO, CLIP, and SAM for enhanced open-vocabulary image segmentation. The Grounding DINO+(GD+) module generates initial candidate bounding boxes, while the CLIP Filter++(CLIP++) module uses a combination of visual and textual prompts to refine and filter out irrelevant object bounding boxes, ensuring that only pertinent objects are considered. Subsequently, these refined bounding boxes serve as specific prompts for the FastSAM model, which executes precise segmentation. Our VTPSeg is validated by experimental and ablation study results on five popular remote sensing image segmentation datasets.
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
Li, Jun, Liu, Che, Bai, Wenjia, Arcucci, Rossella, Bercea, Cosmin I., Schnabel, Julia A.
Visual Language Models (VLMs) have demonstrated impressive capabilities in visual grounding tasks. However, their effectiveness in the medical domain, particularly for abnormality detection and localization within medical images, remains underexplored. A major challenge is the complex and abstract nature of medical terminology, which makes it difficult to directly associate pathological anomaly terms with their corresponding visual features. In this work, we introduce a novel approach to enhance VLM performance in medical abnormality detection and localization by leveraging decomposed medical knowledge. Instead of directly prompting models to recognize specific abnormalities, we focus on breaking down medical concepts into fundamental attributes and common visual patterns. This strategy promotes a stronger alignment between textual descriptions and visual features, improving both the recognition and localization of abnormalities in medical images.We evaluate our method on the 0.23B Florence-2 base model and demonstrate that it achieves comparable performance in abnormality grounding to significantly larger 7B LLaVA-based medical VLMs, despite being trained on only 1.5% of the data used for such models. Experimental results also demonstrate the effectiveness of our approach in both known and previously unseen abnormalities, suggesting its strong generalization capabilities.
Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model
Jin, Jiarui, Wang, Haoyu, Li, Hongyan, Li, Jun, Pan, Jiahui, Hong, Shenda
Electrocardiogram (ECG) is essential for the clinical diagnosis of arrhythmias and other heart diseases, but deep learning methods based on ECG often face limitations due to the need for high-quality annotations. Although previous ECG self-supervised learning (eSSL) methods have made significant progress in representation learning from unannotated ECG data, they typically treat ECG signals as ordinary time-series data, segmenting the signals using fixed-size and fixed-step time windows, which often ignore the form and rhythm characteristics and latent semantic relationships in ECG signals. In this work, we introduce a novel perspective on ECG signals, treating heartbeats as words and rhythms as sentences. Based on this perspective, we first designed the QRS-Tokenizer, which generates semantically meaningful ECG sentences from the raw ECG signals. Building on these, we then propose HeartLang, a novel self-supervised learning framework for ECG language processing, learning general representations at form and rhythm levels. Additionally, we construct the largest heartbeat-based ECG vocabulary to date, which will further advance the development of ECG language processing. We evaluated HeartLang across six public ECG datasets, where it demonstrated robust competitiveness against other eSSL methods. Our data and code are publicly available at https://github.com/PKUDigitalHealth/HeartLang. Electrocardiogram (ECG) is a common type of clinical data used to monitor cardiac activity, and is frequently employed in diagnosing cardiac diseases or conditions impairing myocardial function (Hong et al., 2020; Liu et al., 2021). A primary limitation of using supervised deep learning methods for ECG signal analysis is their dependency on largescale, expert-reviewed, annotated high-quality data.
SFADNet: Spatio-temporal Fused Graph based on Attention Decoupling Network for Traffic Prediction
Wu, Mei, Weng, Wenchao, Li, Jun, Lin, Yiqian, Chen, Jing, Seng, Dewen
In recent years, traffic flow prediction has played a crucial role in the management of intelligent transportation systems. However, traditional prediction methods are often limited by static spatial modeling, making it difficult to accurately capture the dynamic and complex relationships between time and space, thereby affecting prediction accuracy. This paper proposes an innovative traffic flow prediction network, SFADNet, which categorizes traffic flow into multiple traffic patterns based on temporal and spatial feature matrices. For each pattern, we construct an independent adaptive spatio-temporal fusion graph based on a cross-attention mechanism, employing residual graph convolution modules and time series modules to better capture dynamic spatio-temporal relationships under different fine-grained traffic patterns. Extensive experimental results demonstrate that SFADNet outperforms current state-of-the-art baselines across four large-scale datasets.
Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation
An, Xiaoqi, Zhao, Lin, Gong, Chen, Li, Jun, Yang, Jian
With the rapid development of autonomous driving, LiDAR-based 3D Human Pose Estimation (3D HPE) is becoming a research focus. However, due to the noise and sparsity of LiDAR-captured point clouds, robust human pose estimation remains challenging. Most of the existing methods use temporal information, multi-modal fusion, or SMPL optimization to correct biased results. In this work, we try to obtain sufficient information for 3D HPE only by modeling the intrinsic properties of low-quality point clouds. Hence, a simple yet powerful method is proposed, which provides insights both on modeling and augmentation of point clouds. Specifically, we first propose a concise and effective density-aware pose transformer (DAPT) to get stable keypoint representations. By using a set of joint anchors and a carefully designed exchange module, valid information is extracted from point clouds with different densities. Then 1D heatmaps are utilized to represent the precise locations of the keypoints. Secondly, a comprehensive LiDAR human synthesis and augmentation method is proposed to pre-train the model, enabling it to acquire a better human body prior. We increase the diversity of point clouds by randomly sampling human positions and orientations and by simulating occlusions through the addition of laser-level masks. Extensive experiments have been conducted on multiple datasets, including IMU-annotated LidarHuman26M, SLOPER4D, and manually annotated Waymo Open Dataset v2.0 (Waymo), HumanM3. Our method demonstrates SOTA performance in all scenarios. In particular, compared with LPFormer on Waymo, we reduce the average MPJPE by $10.0mm$. Compared with PRN on SLOPER4D, we notably reduce the average MPJPE by $20.7mm$.
Learning Based MPC for Autonomous Driving Using a Low Dimensional Residual Model
Li, Yaoyu, Huang, Chaosheng, Yang, Dongsheng, Liu, Wenbo, Li, Jun
In this paper, a learning based Model Predictive Control (MPC) using a low dimensional residual model is proposed for autonomous driving. One of the critical challenge in autonomous driving is the complexity of vehicle dynamics, which impedes the formulation of accurate vehicle model. Inaccurate vehicle model can significantly impact the performance of MPC controller. To address this issue, this paper decomposes the nominal vehicle model into invariable and variable elements. The accuracy of invariable component is ensured by calibration, while the deviations in the variable elements are learned by a low-dimensional residual model. The features of residual model are selected as the physical variables most correlated with nominal model errors. Physical constraints among these features are formulated to explicitly define the valid region within the feature space. The formulated model and constraints are incorporated into the MPC framework and validated through both simulation and real vehicle experiments. The results indicate that the proposed method significantly enhances the model accuracy and controller performance.
Citywide Electric Vehicle Charging Demand Prediction Approach Considering Urban Region and Dynamic Influences
Kuang, Haoxuan, Deng, Kunxiang, You, Linlin, Li, Jun
Electric vehicle charging demand prediction is important for vacant charging pile recommendation and charging infrastructure planning, thus facilitating vehicle electrification and green energy development. The performance of previous spatio-temporal studies is still far from satisfactory nowadays because urban region attributes and multivariate temporal influences are not adequately taken into account. To tackle these issues, we propose a learning approach for citywide electric vehicle charging demand prediction, named CityEVCP. To learn non-pairwise relationships in urban areas, we cluster service areas by the types and numbers of points of interest in the areas and develop attentive hypergraph networks accordingly. Graph attention mechanisms are employed for information propagation between neighboring areas. Additionally, we propose a variable selection network to adaptively learn dynamic auxiliary information and improve the Transformer encoder utilizing gated mechanisms for fluctuating charging time-series data. Experiments on a citywide electric vehicle charging dataset demonstrate the performances of our proposed approach compared with a broad range of competing baselines. Furthermore, we demonstrate the impact of dynamic influences on prediction results in different areas of the city and the effectiveness of our area clustering method.
Kernel Orthogonality does not necessarily imply a Decrease in Feature Map Redundancy in CNNs: Convolutional Similarity Minimization
Belmekki, Zakariae, Li, Jun, Reuter, Patrick, Jรกuregui, David Antonio Gรณmez, Jenkins, Karl
Convolutional Neural Networks (CNNs) have been heavily used in Deep Learning due to their success in various tasks. Nonetheless, it has been observed that CNNs suffer from redundancy in feature maps, leading to inefficient capacity utilization. Efforts to mitigate and solve this problem led to the emergence of multiple methods, amongst which is kernel orthogonality through variant means. In this work, we challenge the common belief that kernel orthogonality leads to a decrease in feature map redundancy, which is, supposedly, the ultimate objective behind kernel orthogonality. We prove, theoretically and empirically, that kernel orthogonality has an unpredictable effect on feature map similarity and does not necessarily decrease it. Based on our theoretical result, we propose an effective method to reduce feature map similarity independently of the input of the CNN. This is done by minimizing a novel loss function we call Convolutional Similarity. Empirical results show that minimizing the Convolutional Similarity increases the performance of classification models and can accelerate their convergence. Furthermore, using our proposed method pushes towards a more efficient use of the capacity of models, allowing the use of significantly smaller models to achieve the same levels of performance.
Adversarial Federated Consensus Learning for Surface Defect Classification Under Data Heterogeneity in IIoT
Cui, Jixuan, Li, Jun, Mei, Zhen, Ni, Yiyang, Chen, Wen, Li, Zengxiang
The challenge of data scarcity hinders the application of deep learning in industrial surface defect classification (SDC), as it's difficult to collect and centralize sufficient training data from various entities in Industrial Internet of Things (IIoT) due to privacy concerns. Federated learning (FL) provides a solution by enabling collaborative global model training across clients while maintaining privacy. However, performance may suffer due to data heterogeneity-discrepancies in data distributions among clients. In this paper, we propose a novel personalized FL (PFL) approach, named Adversarial Federated Consensus Learning (AFedCL), for the challenge of data heterogeneity across different clients in SDC. First, we develop a dynamic consensus construction strategy to mitigate the performance degradation caused by data heterogeneity. Through adversarial training, local models from different clients utilize the global model as a bridge to achieve distribution alignment, alleviating the problem of global knowledge forgetting. Complementing this strategy, we propose a consensus-aware aggregation mechanism. It assigns aggregation weights to different clients based on their efficacy in global knowledge learning, thereby enhancing the global model's generalization capabilities. Finally, we design an adaptive feature fusion module to further enhance global knowledge utilization efficiency. Personalized fusion weights are gradually adjusted for each client to optimally balance global and local features. Compared with state-of-the-art FL methods like FedALA, the proposed AFedCL method achieves an accuracy increase of up to 5.67% on three SDC datasets.