When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product
Wu, Youqi, Zhang, Jingwei, Farnia, Farzan
–arXiv.org Artificial Intelligence
State-of-the-art embeddings often capture distinct yet complementary discriminative features: For instance, one image embedding model may excel at distinguishing fine-grained textures, while another focuses on object-level structure. Motivated by this observation, we propose a principled approach to fuse such complementary representations through kernel multiplication. Multiplying the kernel similarity functions of two embeddings allows their discriminative structures to interact, producing a fused representation whose kernel encodes the union of the clusters identified by each parent embedding. This formulation also provides a natural way to construct joint kernels for paired multi-modal data (e.g., image-text tuples), where the product of modality-specific kernels inherits structure from both domains. We highlight that this kernel product is mathematically realized via the Kronecker product of the embedding feature maps, yielding our proposed KrossFuse framework for embedding fusion. To address the computational cost of the resulting high-dimensional Kronecker space, we further develop RP-KrossFuse, a scalable variant that leverages random projections for efficient approximation. As a key application, we use this framework to bridge the performance gap between cross-modal embeddings (e.g., CLIP, BLIP) and unimodal experts (e.g., DINOv2, E5). Experiments show that RP-KrossFuse effectively integrates these models, enhancing modality-specific performance while preserving cross-modal alignment. The project code is available at https://github.com/yokiwuuu/KrossFuse.
arXiv.org Artificial Intelligence
Oct-31-2025
- Country:
- Asia
- China > Hong Kong (0.04)
- Middle East > Israel (0.04)
- North America > United States
- Florida > Miami-Dade County
- Miami (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Florida > Miami-Dade County
- Asia
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (1.00)
- Research Report
- Industry:
- Education (0.45)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning
- Neural Networks (1.00)
- Performance Analysis > Accuracy (0.46)
- Natural Language > Text Processing (0.67)
- Representation & Reasoning (1.00)
- Vision (1.00)
- Machine Learning
- Data Science > Data Mining (1.00)
- Sensing and Signal Processing > Image Processing (0.93)
- Artificial Intelligence
- Information Technology