Dual-Stream Cross-Modal Representation Learning via Residual Semantic Decorrelation
Xuecheng Li, Weikuan Jia, Alisher Kurbonaliev, Alisher Qurbonaliev, Rustam Khudzhamkulov, Shuhratjon Ismoilov, Javhariddin Eshmatov, Yuanjie Zheng
arXiv.org Artificial Intelligence
Cross-modal learning has become a fundamental paradigm for integrating heterogeneous information sources such as images, text, and structured attributes. However, multimodal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. In particular, high-variance modalities tend to overshadow weaker but semantically important signals, while naïve fusion strategies entangle modality-shared and modality-specific factors in an uncontrolled manner. This makes it difficult to understand which modality actually drives a prediction and to maintain robustness when some modalities are noisy or missing. To address these challenges, we propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), a simple yet effective framework that disentangles modality-specific and modality-shared information through residual decomposition and explicit semantic decorrelation constraints. DSRSD-Net introduces: (1) a dual-stream representation learning module that separates intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) a residual semantic alignment head that maps shared factors from different modalities into a common space using a combination of contrastive and regression-style objectives; and (3) a decorrelation and orthogonality loss that regularizes the covariance structure of the shared space while enforcing orthogonality between shared and private streams, thereby suppressing cross-modal redundancy and preventing feature collapse. Experimental results on two large-scale educational benchmarks demonstrate that DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
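The abstract's core mechanics can be sketched in a few lines: a residual projection that splits each modality's features into shared and private streams, a decorrelation penalty on the covariance of the shared space, and an orthogonality penalty between streams. This is a minimal NumPy illustration of those three pieces, not the authors' implementation; the fixed orthonormal basis `w` stands in for the learned shared projection, and the loss forms are plausible assumptions rather than the paper's exact objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_split(x, w):
    """Split features into shared and private streams via residual projection.

    w: (d, k) matrix with orthonormal columns spanning the assumed shared
    subspace (learned in the paper; fixed here for illustration).
    shared  = projection of x onto span(w)
    private = x - shared  (the residual)
    """
    shared = x @ w @ w.T
    private = x - shared
    return shared, private

def decorrelation_loss(z):
    """Penalize off-diagonal entries of the feature covariance of z,
    suppressing redundancy among shared-space dimensions."""
    zc = z - z.mean(axis=0, keepdims=True)
    cov = zc.T @ zc / (len(z) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float((off_diag ** 2).sum())

def orthogonality_loss(shared, private):
    """Penalize overlap between the shared and private streams
    (mean squared per-sample inner product)."""
    return float((np.sum(shared * private, axis=1) ** 2).mean())

# Toy check on random features for one modality.
d, k = 8, 3
w, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal basis (reduced QR)
x = rng.normal(size=(32, d))
shared, private = residual_split(x, w)

# With an exact subspace projection, the two streams are orthogonal by
# construction, so this loss is ~0 up to float error; decorrelation of the
# raw shared features is generally nonzero.
print(orthogonality_loss(shared, private))
print(decorrelation_loss(shared))
```

In the full model the shared streams from different modalities would additionally be pulled together by a contrastive/regression alignment head; the losses above would then act as regularizers alongside that objective.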
Dec-9-2025
- Country:
  - Asia
    - China (0.04)
    - Tajikistan (0.05)
- Genre:
  - Instructional Material > Course Syllabus & Notes (0.66)
  - Research Report
    - Experimental Study (0.46)
    - New Finding (0.46)
- Industry:
- Education > Educational Setting (0.68)
- Health & Medicine (0.93)
- Technology: