AITopics | late fusion

Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning

Neural Information Processing Systems | Jun-23-2026, 03:21:02 GMT

We propose Modality-Aware Sharpness-Aware Minimization (MSAM), a model-agnostic framework that applies to many modalities and supports early and late fusion scenarios.

artificial intelligence, machine learning, natural language, (20 more...)

Country: North America > United States (0.46)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Government (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

HybSpecNet: A Critical Analysis of Architectural Instability in Hybrid-Domain Spectral GNNs

Goksu, Huseyin

arXiv.org Artificial Intelligence | Nov-21-2025

Spectral Graph Neural Networks offer a principled approach to graph filtering but face a fundamental "Stability-vs-Adaptivity" trade-off. This trade-off is dictated by the choice of spectral domain. Filters in the finite [-1, 1] domain (e.g., ChebyNet) are numerically stable at high polynomial degrees (K) but are static and low-pass, causing them to fail on heterophilic graphs. Conversely, filters in the semi-infinite [0, ∞) domain (e.g., KrawtchoukNet) are highly adaptive and achieve SOTA results on heterophily by learning non-low-pass responses. However, as we demonstrate, these adaptive filters can also suffer from numerical instability, leading to catastrophic performance collapse at high K. In this paper, we propose to resolve this trade-off by designing a hybrid-domain GNN, HybSpecNet, which combines a stable `ChebyNet` branch with an adaptive `KrawtchoukNet` branch. We first demonstrate that a "naive" hybrid architecture, which fuses the branches via concatenation, successfully unifies performance at low K, achieving strong results on both homophilic and heterophilic benchmarks. However, we then prove that this naive architecture fails the stability test. Our K-ablation experiments show that this architecture catastrophically collapses at K=25, exactly mirroring the collapse of its unstable `KrawtchoukNet` branch. We identify this critical finding as "Instability Poisoning," where `NaN`/`Inf` gradients from the adaptive branch destroy the training of the model. Finally, we propose and validate an advanced architecture that uses "Late Fusion" to completely isolate the gradient pathways. We demonstrate that this successfully solves the instability problem, remaining perfectly stable up to K=30 while retaining its SOTA performance across all graph types. This work identifies a critical architectural pitfall in hybrid GNN design and provides a robust architectural solution.
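The stability claim for the finite [-1, 1] domain can be illustrated with the standard Chebyshev three-term recurrence, on which ChebyNet-style filters are built. The sketch below is not taken from the HybSpecNet paper; the function name `chebyshev_values` is hypothetical, and it only evaluates scalar polynomials rather than a graph filter. It shows that the values stay bounded by 1 inside [-1, 1] even at degree 30, while they grow rapidly just outside that interval.

```python
def chebyshev_values(x: float, max_degree: int) -> list[float]:
    """Return [T_0(x), ..., T_max_degree(x)] using T_k = 2x*T_{k-1} - T_{k-2}."""
    if max_degree < 0:
        raise ValueError("max_degree must be non-negative")
    values = [1.0]  # T_0(x) = 1
    if max_degree >= 1:
        values.append(x)  # T_1(x) = x
    for _ in range(2, max_degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values


if __name__ == "__main__":
    K = 30  # illustrative degree, chosen to match the K range in the abstract
    for x in (-1.0, -0.3, 0.5, 1.0, 1.5):
        largest = max(abs(v) for v in chebyshev_values(x, K))
        print(f"x={x:+.1f}  max |T_k(x)| for k <= {K}: {largest:.3e}")
```

Only the last input (x = 1.5, outside the domain) produces large magnitudes; this is a property of the polynomial basis and says nothing about how the paper's late-fusion architecture isolates gradients.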

artificial intelligence, krawtchouknet, machine learning, (12 more...)

arXiv: 2511.16101

Country:

North America > United States (0.15)
Asia > Middle East > Republic of Türkiye (0.14)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.53)

DM-QPMNET: Dual-modality fusion network for cell segmentation in quantitative phase microscopy

Chakraborty, Rajatsubhra, Espinosa-Momox, Ana, Haskin, Riley, Xu, Depeng, Porras-Aguilar, Rosario

arXiv.org Artificial Intelligence | Nov-4-2025

Cell segmentation in single-shot quantitative phase microscopy (ssQPM) faces challenges from traditional thresholding methods that are sensitive to noise and cell density, while deep learning approaches using simple channel concatenation fail to exploit the complementary nature of polarized intensity images and phase maps. We introduce DM-QPMNet, a dual-encoder network that treats these as distinct modalities with separate encoding streams. Our architecture fuses modality-specific features at intermediate depth via multi-head attention, enabling polarized edge and texture representations to selectively integrate complementary phase information. This content-aware fusion preserves training stability while adding principled multi-modal integration through dual-source skip connections and per-modality normalization at minimal overhead. Our approach demonstrates substantial improvements over monolithic concatenation and single-modality baselines, showing that modality-specific encoding with learnable fusion effectively exploits ssQPM's simultaneous capture of complementary illumination and phase cues for robust cell segmentation.

artificial intelligence, machine learning, segmentation, (18 more...)

arXiv: 2511.00218

Country: North America > United States > North Carolina (0.15)

Genre: Research Report (0.82)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Deep Sequence-to-Sequence Models for GNSS Spoofing Detection

Zelinka, Jan, Kost, Oliver, Hrúz, Marek

arXiv.org Artificial Intelligence | Oct-24-2025

We present a data generation framework designed to simulate spoofing attacks and randomly place attack scenarios worldwide. We apply deep neural network-based models for spoofing detection, utilizing Long Short-Term Memory networks and Transformer-inspired architectures. These models are specifically designed for online detection and are trained using the generated dataset. Our results demonstrate that deep learning models can accurately distinguish spoofed signals from genuine ones, achieving high detection performance. The best results are achieved by Transformer-inspired architectures with early fusion of the inputs, resulting in an error rate of 0.16%. Unencrypted civilian global navigation satellite system (GNSS) signals are vulnerable to spoofing attacks, which pose a significant threat.

artificial intelligence, machine learning, trajectory, (19 more...)

arXiv: 2510.1989

Country: Europe (0.47)

Genre: Research Report > New Finding (0.87)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Does a Technique for Building Multimodal Representation Matter? -- Comparative Analysis

Pawłowski, Maciej, Wróblewska, Anna, Sysko-Romańczuk, Sylwia

arXiv.org Artificial Intelligence | Aug-8-2025

Creating a meaningful representation by fusing single modalities (e.g., text, images, or audio) is the core concept of multimodal learning. Although several techniques for building multimodal representations have proven successful, they have not yet been compared. It has therefore been unclear which technique can be expected to yield the best results in a given scenario and which factors should be considered when choosing one. This paper explores the most common techniques for building multimodal data representations -- late fusion, early fusion, and the sketch -- and compares them in classification tasks. Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M. In general, our results confirm that multimodal representations are able to boost the performance of unimodal models from 0.919 to 0.969 accuracy on Amazon Reviews and from 0.907 to 0.918 AUC on MovieLens25M. However, experiments on both MovieLens datasets indicate the importance of input data that is meaningful for the given task. In this article, we show that the choice of technique for building a multimodal representation is crucial to obtaining the highest possible model performance, which also depends on a proper combination of modalities. This choice relies on: the influence that each modality has on the analyzed machine learning (ML) problem; the type of ML task; and the memory constraints during the training and prediction phases.
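To make the distinction between two of the compared techniques concrete, here is a minimal sketch under generic textbook definitions, not the paper's implementation. Early fusion is taken to mean concatenating per-modality feature vectors before a single model; late fusion is taken to mean combining the class probabilities produced by separate unimodal models. The function names and the weighted-average combination rule are assumptions for illustration, and the sketch technique is not covered.

```python
from typing import Optional, Sequence


def early_fusion(*modality_features: Sequence[float]) -> list[float]:
    """Concatenate per-modality feature vectors into one input vector."""
    fused: list[float] = []
    for features in modality_features:
        fused.extend(features)
    return fused


def late_fusion(
    per_modality_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> list[float]:
    """Weighted average of class-probability vectors from unimodal models."""
    n_models = len(per_modality_probs)
    if n_models == 0:
        raise ValueError("need at least one probability vector")
    if weights is None:
        weights = [1.0] * n_models  # assumption: equal trust in each modality
    if len(weights) != n_models:
        raise ValueError("one weight per modality is required")
    total = sum(weights)
    n_classes = len(per_modality_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, per_modality_probs)) / total
        for c in range(n_classes)
    ]


if __name__ == "__main__":
    text_features, image_features = [0.1, 0.9], [0.5]
    print(early_fusion(text_features, image_features))  # input for one joint model
    print(late_fusion([[0.8, 0.2], [0.4, 0.6]]))        # outputs of two unimodal models
```

The memory constraint mentioned in the abstract is visible even here: early fusion grows the input dimension of a single model, whereas late fusion keeps each model small but requires one model per modality.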

artificial intelligence, machine learning, natural language, (17 more...)

doi: 10.3390/s23052381

arXiv: 2206.06367

Country: North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (1.00)
Media > Film (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Multi-view mid fusion: a universal approach for learning in an HDLSS setting

Houthuys, Lynn

arXiv.org Artificial Intelligence | Jul-9-2025

The high-dimensional low-sample-size (HDLSS) setting presents significant challenges in various applications where the feature dimension far exceeds the number of available samples. This paper introduces a universal approach for learning in HDLSS settings using multi-view mid fusion techniques. It shows how existing mid fusion multi-view methods perform well in an HDLSS setting even if no inherent views are provided. Three view construction methods are proposed that split the high-dimensional feature vectors into smaller subsets, each representing a different view. Extensive experimental validation across model types and learning tasks confirms the effectiveness and generalization of the approach. We believe the work in this paper lays the foundation for further research into the universal benefits of multi-view mid fusion learning.
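The abstract does not say which three view construction methods the paper proposes, so the sketch below only illustrates the general idea of splitting one high-dimensional feature vector into disjoint views. Both strategies shown (contiguous blocks and a seeded random partition) and all function names are assumptions made for illustration.

```python
import random


def contiguous_views(n_features: int, n_views: int) -> list[list[int]]:
    """Partition feature indices into n_views contiguous, near-equal blocks."""
    if not 1 <= n_views <= n_features:
        raise ValueError("need 1 <= n_views <= n_features")
    base, extra = divmod(n_features, n_views)
    views, start = [], 0
    for v in range(n_views):
        size = base + (1 if v < extra else 0)  # first `extra` views get one more
        views.append(list(range(start, start + size)))
        start += size
    return views


def random_views(n_features: int, n_views: int, seed: int = 0) -> list[list[int]]:
    """Partition feature indices into n_views disjoint random subsets."""
    if not 1 <= n_views <= n_features:
        raise ValueError("need 1 <= n_views <= n_features")
    indices = list(range(n_features))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[v::n_views]) for v in range(n_views)]


def take_view(sample: list[float], view: list[int]) -> list[float]:
    """Extract the features of one sample that belong to a given view."""
    return [sample[i] for i in view]


if __name__ == "__main__":
    sample = [float(i) for i in range(10)]
    for view in contiguous_views(len(sample), 3):
        print(view, take_view(sample, view))
```

Each resulting view can then be passed to whatever multi-view learner is in use; the index lists are computed once and reused for every sample.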

artificial intelligence, data mining, machine learning, (18 more...)

arXiv: 2507.06026

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Belgium (0.04)
Asia > Singapore (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

GRAM: Generative Recommendation via Semantic-aware Multi-granular Late Fusion

Lee, Sunkyung, Choi, Minjin, Choi, Eunseong, Kim, Hye-young, Lee, Jongwuk

arXiv.org Artificial Intelligence | Jun-3-2025

Generative recommendation is an emerging paradigm that leverages the extensive knowledge of large language models by formulating recommendations into a text-to-text generation task. However, existing studies face two key limitations in (i) incorporating implicit item relationships and (ii) utilizing rich yet lengthy item information. To address these challenges, we propose a Generative Recommender via semantic-Aware Multi-granular late fusion (GRAM), introducing two synergistic innovations. First, we design semantic-to-lexical translation to encode implicit hierarchical and collaborative item relationships into the vocabulary space of LLMs. Second, we present multi-granular late fusion to integrate rich semantics efficiently with minimal information loss. It employs separate encoders for multi-granular prompts, delaying the fusion until the decoding stage. Experiments on four benchmark datasets show that GRAM outperforms eight state-of-the-art generative recommendation models, achieving significant improvements of 11.5-16.0% in Recall@5 and 5.3-13.6% in NDCG@5. The source code is available at https://github.com/skleee/GRAM.

large language model, machine learning, natural language, (19 more...)

arXiv: 2506.01673

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Review for NeurIPS paper: Deep Multimodal Fusion by Channel Exchanging

Neural Information Processing Systems | Jan-23-2025, 07:07:44 GMT

GuessWhat?! Visual object discovery through multi-modal dialogue.

fusion, modality, proceedings, (11 more...)

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.06)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.34)
Information Technology > Data Science > Data Integration (0.32)
Information Technology > Artificial Intelligence > Vision (0.32)

MARIA: a Multimodal Transformer Model for Incomplete Healthcare Data

Caruso, Camillo Maria, Soda, Paolo, Guarrasi, Valerio

arXiv.org Artificial Intelligence | Dec-19-2024

In healthcare, the integration of multimodal data is pivotal for developing comprehensive diagnostic and predictive models. However, managing missing data remains a significant challenge in real-world applications. We introduce MARIA (Multimodal Attention Resilient to Incomplete datA), a novel transformer-based deep learning model designed to address these challenges through an intermediate fusion strategy. Unlike conventional approaches that depend on imputation, MARIA utilizes a masked self-attention mechanism, which processes only the available data without generating synthetic values. This approach enables it to effectively handle incomplete datasets, enhancing robustness and minimizing biases introduced by imputation methods. We evaluated MARIA against 10 state-of-the-art machine learning and deep learning models across 8 diagnostic and prognostic tasks. The results demonstrate that MARIA outperforms existing methods in terms of performance and resilience to varying levels of data incompleteness, underscoring its potential for critical healthcare applications.
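The idea of attending only to available data, without imputing missing values, can be sketched with a masked softmax. This is a generic single-query attention example and not MARIA's architecture; the function names, the use of `None` for missing entries, and the scaled dot-product scoring are assumptions for illustration.

```python
import math
from typing import Optional, Sequence


def masked_softmax(scores: Sequence[float], available: Sequence[bool]) -> list[float]:
    """Softmax restricted to available positions; missing ones get weight 0."""
    kept = [s for s, a in zip(scores, available) if a]
    if not kept:
        raise ValueError("at least one entry must be available")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, available)]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: Sequence[float],
    keys: Sequence[Optional[Sequence[float]]],
    values: Sequence[Optional[Sequence[float]]],
    available: Sequence[bool],
) -> list[float]:
    """Scaled dot-product attention for one query that skips missing entries."""
    scale = math.sqrt(len(query))
    # Missing keys get a placeholder score; the mask removes it afterwards.
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale if ok else 0.0
        for key, ok in zip(keys, available)
    ]
    weights = masked_softmax(scores, available)
    dim = len(next(v for v, ok in zip(values, available) if ok))
    return [
        sum(w * v[j] for w, v, ok in zip(weights, values, available) if ok)
        for j in range(dim)
    ]


if __name__ == "__main__":
    keys = [[1.0, 0.0], None, [1.0, 0.0]]   # second feature is missing
    values = [[2.0], None, [4.0]]
    print(attend([1.0, 0.0], keys, values, [True, False, True]))
```

Because the missing entry never contributes to the weighted sum, no synthetic value has to be generated for it, which is the property the abstract contrasts with imputation.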

artificial intelligence, data mining, machine learning, (21 more...)

arXiv: 2412.1481

Country:

North America > United States > California (0.28)
Europe > Sweden > Västerbotten County > Umeå (0.04)
North America > Canada (0.04)
Europe > Italy (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Diagnostic Medicine (1.00)
Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Late fusion ensembles for speech recognition on diverse input audio representations

Jezidžić, Marin, Mihelčić, Matej

arXiv.org Artificial Intelligence | Dec-1-2024

We explore diverse representations of speech audio and their effect on the performance of a late fusion ensemble of E-Branchformer models applied to the Automatic Speech Recognition (ASR) task. Although it is generally known that ensemble methods often improve system performance, even for speech recognition, it is interesting to explore how ensembles of complex state-of-the-art models, such as medium-sized and large E-Branchformers, cope in this setting when their base models are trained on diverse representations of the input speech audio. The results are evaluated on four widely used benchmark datasets (Librispeech, Aishell, Gigaspeech, and TEDLIUMv2) and show that improvements of 1% to 14% can still be achieved over state-of-the-art models trained using comparable techniques on these datasets. A noteworthy observation is that such an ensemble offers improvements even with the use of language models, although the gap is closing.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv: 2412.01861

Country:

Europe > Croatia > Zagreb County > Zagreb (0.05)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Asia (0.04)

Genre: Research Report > Promising Solution (0.54)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
