multimodal data
- Europe > Portugal > Braga > Braga (0.04)
- North America > United States (0.04)
- Europe > Germany (0.04)
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
Beyond Unimodal: Generalising Neural Processes for Multimodal Uncertainty Estimation
Uncertainty estimation is an important research area for making deep neural networks (DNNs) more trustworthy. While extensive research on uncertainty estimation has been conducted with unimodal data, uncertainty estimation for multimodal data remains a challenge. Neural processes (NPs) have been demonstrated to be an effective uncertainty estimation method for unimodal data, providing the reliability of Gaussian processes with efficient and powerful DNNs. While NPs hold significant potential for multimodal uncertainty estimation, their adaptation to multimodal data has not been carefully studied. To bridge this gap, we propose Multimodal Neural Processes (MNPs), generalising NPs for multimodal uncertainty estimation. Built on the NP framework, MNPs consist of several novel and principled mechanisms tailored to the characteristics of multimodal data. In extensive empirical evaluations, our method achieves state-of-the-art multimodal uncertainty estimation performance, showing appealing robustness against noisy samples and reliable out-of-distribution detection, with faster computation than the current state-of-the-art multimodal uncertainty estimation method.
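As a concrete illustration of the neural process family the abstract above builds on, the sketch below implements a minimal conditional neural process with a naive per-modality context encoder and mean-pooling fusion. The layer sizes, two-modality setup, and fusion rule are illustrative assumptions, not the MNP mechanisms proposed in the paper.

```python
# Minimal conditional neural process with naive multimodal fusion.
# NOT the MNP architecture from the paper; shapes and fusion are assumptions.
import torch
import torch.nn as nn

class TinyConditionalNP(nn.Module):
    def __init__(self, x_dims=(8, 16), y_dim=1, r_dim=32):
        super().__init__()
        # One context encoder per modality (hypothetical two-modality setup).
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d + y_dim, r_dim), nn.ReLU(), nn.Linear(r_dim, r_dim))
             for d in x_dims]
        )
        # Decoder maps (fused context representation, target x) to a Gaussian.
        self.decoder = nn.Sequential(
            nn.Linear(r_dim + x_dims[0], r_dim), nn.ReLU(), nn.Linear(r_dim, 2 * y_dim)
        )

    def forward(self, ctx_xs, ctx_y, target_x):
        # Encode each modality's context set and average over context points.
        reps = [enc(torch.cat([x, ctx_y], dim=-1)).mean(dim=0)
                for enc, x in zip(self.encoders, ctx_xs)]
        r = torch.stack(reps).mean(dim=0)  # naive cross-modal fusion by averaging
        h = self.decoder(torch.cat([r.expand(target_x.size(0), -1), target_x], dim=-1))
        mu, pre_sigma = h.chunk(2, dim=-1)
        return mu, 0.1 + 0.9 * nn.functional.softplus(pre_sigma)  # predictive mean / std

# Usage: 5 context points per modality, 3 targets in modality 0's input space.
model = TinyConditionalNP()
mu, sigma = model([torch.randn(5, 8), torch.randn(5, 16)], torch.randn(5, 1), torch.randn(3, 8))
```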
RegBN: Batch Normalization of Multimodal Data with Regularization
Recent years have witnessed a surge of interest in integrating high-dimensional data captured by multisource sensors, driven by the impressive success of neural networks in integrating multimodal data. However, the integration of heterogeneous multimodal data poses a significant challenge, as confounding effects and dependencies among such heterogeneous data sources introduce unwanted variability and bias, leading to suboptimal performance of multimodal models. Therefore, it becomes crucial to normalize the low- or high-level features extracted from data modalities before their fusion takes place. This paper introduces RegBN, a novel approach for multimodal Batch Normalization with REGularization. RegBN uses the Frobenius norm as a regularizer term to address the side effects of confounders and underlying dependencies among different data sources. The proposed method generalizes well across multiple modalities and eliminates the need for learnable parameters, simplifying training and inference.
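The core idea of removing dependencies between modality features before fusion can be illustrated with a closed-form, Frobenius-norm-regularized projection, sketched below. This ridge-style residualization is a simplified stand-in for intuition only and is not the RegBN algorithm itself.

```python
# Simplified sketch of Frobenius-norm-regularized feature decorrelation before
# fusion. Illustrative stand-in only; not the exact RegBN method from the paper.
import numpy as np

def decorrelate(f_a: np.ndarray, f_b: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Remove the component of modality A's features predictable from modality B.

    Solves W = argmin_W ||f_a - f_b W||_F^2 + lam * ||W||_F^2 (ridge regression)
    and returns the residual f_a - f_b W, i.e. A's features with the linear
    dependency on B stripped out.
    """
    d_b = f_b.shape[1]
    w = np.linalg.solve(f_b.T @ f_b + lam * np.eye(d_b), f_b.T @ f_a)
    return f_a - f_b @ w

# Usage: batch of 64 samples, 128-dim image features, 32-dim tabular features.
img_feat, tab_feat = np.random.randn(64, 128), np.random.randn(64, 32)
img_clean = decorrelate(img_feat, tab_feat)  # fuse img_clean with tab_feat downstream
```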
CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders
A vital and rapidly growing application, remote sensing offers vast yet sparsely labeled, spatially aligned multimodal data; this makes self-supervised learning algorithms invaluable. We present CROMA: a framework that combines contrastive and reconstruction self-supervised objectives to learn rich unimodal and multimodal representations. Our method separately encodes masked-out multispectral optical and synthetic aperture radar samples, aligned in space and time, and performs cross-modal contrastive learning. Another encoder fuses these sensors, producing joint multimodal encodings that are used to predict the masked patches via a lightweight decoder. We show that these objectives are complementary when leveraged on spatially aligned multimodal data. We also introduce X- and 2D-ALiBi, which spatially biases our cross- and self-attention matrices. These strategies improve representations and allow our models to effectively extrapolate to images up to 17.6× larger at test time.
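ALiBi-style spatial biasing, which CROMA extends to two dimensions, can be sketched as adding a distance-proportional penalty to the attention logits. The Euclidean patch distance and geometric per-head slopes below are assumptions; the paper's exact X- and 2D-ALiBi formulation may differ.

```python
# Rough sketch of a 2D ALiBi-style spatial bias for patch self-attention: each
# head's attention logits are penalised in proportion to the spatial distance
# between patches. Euclidean distance and geometric head slopes are assumptions.
import torch

def alibi_2d_bias(grid_h: int, grid_w: int, num_heads: int) -> torch.Tensor:
    """Return a (num_heads, N, N) bias to add to attention logits, N = grid_h * grid_w."""
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    dist = torch.cdist(coords, coords)                                  # pairwise Euclidean
    # ALiBi-style per-head slopes: a geometric sequence in (0, 1].
    slopes = 2.0 ** (-8.0 * torch.arange(1, num_heads + 1) / num_heads)
    return -slopes.view(-1, 1, 1) * dist                                # (heads, N, N)

# Usage: add to attention scores before softmax for a 14x14 patch grid, 8 heads.
bias = alibi_2d_bias(14, 14, 8)  # scores = q @ k.T / sqrt(d) + bias
```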
Physics-Informed Neural Koopman Machine for Interpretable Longitudinal Personalized Alzheimer's Disease Forecasting
Hrusanov, Georgi, Vu, Duy-Thanh, Can, Duy-Cat, Tascedda, Sophie, Ryan, Margaret, Bodelet, Julien, Koscielska, Katarzyna, Magnus, Carsten, Chén, Oliver Y.
Early forecasting of individual cognitive decline in Alzheimer's disease (AD) is central to disease evaluation and management. Despite advances, existing methodological frameworks still struggle to integrate multimodal data for longitudinal personalized forecasting while maintaining interpretability. To address this gap, we present the Neural Koopman Machine (NKM), a new machine learning architecture inspired by dynamical systems and attention mechanisms, designed to forecast multiple cognitive scores simultaneously using multimodal genetic, neuroimaging, proteomic, and demographic data. NKM integrates analytical (α) and biological (β) knowledge to guide feature grouping and control the hierarchical attention mechanisms that extract relevant patterns. By implementing Fusion Group-Aware Hierarchical Attention within the Koopman operator framework, NKM transforms complex nonlinear trajectories into interpretable linear representations. To demonstrate NKM's efficacy, we applied it to the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. Our results suggest that NKM consistently outperforms both traditional machine learning methods and deep learning models in forecasting trajectories of cognitive decline. Specifically, NKM (1) forecasts changes in multiple cognitive scores simultaneously, (2) quantifies differential biomarker contributions to predicting distinct cognitive scores, and (3) identifies the brain regions most predictive of cognitive deterioration. Together, NKM advances personalized, interpretable forecasting of future cognitive decline in AD from past multimodal data through an explainable, explicit system and reveals potential multimodal biological underpinnings of AD progression.
- Asia > Middle East > Jordan (0.14)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- (7 more...)
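The Koopman-operator idea at the heart of NKM, lifting nonlinear trajectories into a latent space where the dynamics are linear, can be sketched with a generic Koopman autoencoder. The code below omits NKM's Fusion Group-Aware Hierarchical Attention and multimodal feature grouping; all sizes and names are placeholders.

```python
# Generic Koopman-autoencoder sketch: a nonlinear encoder lifts the state into a
# latent space where a single linear operator K advances time. Not the NKM model.
import torch
import torch.nn as nn

class KoopmanAE(nn.Module):
    def __init__(self, in_dim: int = 20, z_dim: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, in_dim))
        self.K = nn.Linear(z_dim, z_dim, bias=False)  # linear Koopman operator

    def forward(self, x_t: torch.Tensor, steps: int = 1) -> torch.Tensor:
        """Forecast the state `steps` visits ahead by rolling K in latent space."""
        z = self.enc(x_t)
        for _ in range(steps):
            z = self.K(z)
        return self.dec(z)

# Usage: predict next-visit features; train with reconstruction + prediction losses.
model = KoopmanAE()
x_now = torch.randn(32, 20)                                   # 32 subjects, 20 features
x_next_pred = model(x_now, steps=1)                           # one-step forecast
recon_loss = nn.functional.mse_loss(model(x_now, steps=0), x_now)  # autoencoding term
```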
A review on data fusion in multimodal learning analytics and educational data mining
Chango, Wilson, Lara, Juan A., Cerezo, Rebeca, Romero, Cristóbal
New educational models such as Smart Learning environments use digital and context-aware devices to facilitate the learning process. In this new educational scenario, a huge quantity of multimodal student data from a variety of different sources can be captured, fused, and analyzed. This offers researchers and educators a unique opportunity to discover new knowledge to better understand the learning process and to intervene if necessary. However, data fusion approaches and techniques must be applied correctly in order to combine the various sources of Multimodal Learning Analytics (MLA) data. These sources or modalities in MLA include audio, video, electrodermal activity data, eye-tracking, user logs, and click-stream data, but also learning artifacts and more natural human signals such as gestures, gaze, speech, or writing. This survey introduces data fusion in Learning Analytics (LA) and Educational Data Mining (EDM) and how these data fusion techniques have been applied in Smart Learning. It shows the current state of the art by reviewing the main publications, the main types of fused educational data, and the data fusion approaches and techniques used in EDM/LA, as well as the main open problems, trends, and challenges in this specific research area.
- South America > Ecuador (0.04)
- South America > Brazil (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (2 more...)
- Research Report (1.00)
- Overview (1.00)
- Instructional Material > Course Syllabus & Notes (0.46)
- Education > Educational Setting > Online (1.00)
- Education > Educational Technology > Educational Software > Computer Based Training (0.94)
- Health & Medicine > Therapeutic Area (0.93)
- Education > Educational Setting > Higher Education (0.68)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Data Science > Data Integration (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
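As a toy illustration of the two fusion families such surveys typically contrast, the sketch below compares feature-level (early) fusion, which concatenates per-modality features before a single model, with decision-level (late) fusion, which averages per-modality predictions. The synthetic data and logistic-regression models are placeholders, not drawn from the review.

```python
# Toy comparison of early (feature-level) vs. late (decision-level) fusion of
# multimodal learner data. Synthetic data; models are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
clickstream = rng.normal(size=(n, 5))   # e.g. counts of actions per session
gaze = rng.normal(size=(n, 3))          # e.g. fixation statistics
passed = (clickstream[:, 0] + gaze[:, 0] > 0).astype(int)  # synthetic outcome

# Early fusion: one classifier over the concatenated feature vector.
early = LogisticRegression().fit(np.hstack([clickstream, gaze]), passed)

# Late fusion: one classifier per modality, predictions averaged afterwards.
m_click = LogisticRegression().fit(clickstream, passed)
m_gaze = LogisticRegression().fit(gaze, passed)
late_prob = 0.5 * (m_click.predict_proba(clickstream)[:, 1] + m_gaze.predict_proba(gaze)[:, 1])
```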
Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque
Arana, Lukas, Etxaniz, Julen, Salaberria, Ander, Azkune, Gorka
Current Multimodal Large Language Models (MLLMs) exhibit very strong performance on several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque-instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results pave the way for developing MLLMs for other low-resource languages by openly releasing our resources.
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (5 more...)
Improved Multimodal Deep Learning with Variation of Information
Deep learning has been successfully applied to multimodal representation learning problems, where a common strategy is to learn joint representations shared across multiple modalities on top of layers of modality-specific networks. Nonetheless, the question of how to learn a good association between data modalities remains open; in particular, a good generative model of multimodal data should be able to reason about a missing data modality given the remaining modalities. In this paper, we propose a novel multimodal representation learning framework that explicitly aims at this goal. Rather than learning with maximum likelihood, we train the model to minimize the variation of information. We provide theoretical insight into why the proposed learning objective is sufficient to estimate the data-generating joint distribution of multimodal data. We apply our method to restricted Boltzmann machines and introduce learning methods based on contrastive divergence and multi-prediction training. In addition, we extend the approach to deep networks with a recurrent encoding structure to fine-tune the whole network. In experiments, we demonstrate state-of-the-art visual recognition performance on the MIR-Flickr database and the PASCAL VOC 2007 database with and without text features.
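Up to data-entropy constants, minimising the variation of information VI(X;Y) = H(X|Y) + H(Y|X) amounts to maximising both conditional log-likelihoods, log p(x|y) + log p(y|x). The sketch below shows that surrogate with two small conditional Gaussian predictors; these stand-ins are illustrative and are not the paper's restricted Boltzmann machine or recurrent-encoder models.

```python
# Schematic variation-of-information surrogate: minimise the sum of the negative
# conditional log-likelihoods in both cross-modal directions. Toy Gaussian decoders
# stand in for the paper's RBM / recurrent encoder models.
import torch
import torch.nn as nn

class CondGaussian(nn.Module):
    """Predicts a fixed-variance Gaussian over one modality given the other."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))

    def neg_log_prob(self, cond: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Negative log-likelihood up to constants for a unit-variance Gaussian.
        return 0.5 * ((self.net(cond) - target) ** 2).sum(dim=-1).mean()

p_x_given_y = CondGaussian(d_in=16, d_out=32)  # e.g. image features given text
p_y_given_x = CondGaussian(d_in=32, d_out=16)  # e.g. text features given image

x, y = torch.randn(128, 32), torch.randn(128, 16)
# Minimising this surrogate tightens cross-modal prediction in both directions.
voi_surrogate = p_x_given_y.neg_log_prob(y, x) + p_y_given_x.neg_log_prob(x, y)
voi_surrogate.backward()
```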
VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation
Park, Hyeongcheol, Seo, Jiyoung, Jang, MinHyuk, Park, Hogun, Baek, Ha Dam, Chang, Gyusam, Im, Hyeonsoo, Kim, Sangpil
Multimodal Knowledge Graphs (MMKGs), which represent explicit knowledge across multiple modalities, play a pivotal role by complementing the implicit knowledge of Multimodal Large Language Models (MLLMs) and enabling more grounded reasoning via Retrieval Augmented Generation (RAG). However, existing MMKGs are generally limited in scope: they are often constructed by augmenting pre-existing knowledge graphs, which restricts their knowledge, resulting in outdated or incomplete knowledge coverage, and they often support only a narrow range of modalities, such as text and visual information. These limitations restrict applicability to multimodal tasks, particularly as recent MLLMs adopt richer modalities like video and audio. Therefore, we propose the Visual-Audio-Text Knowledge Graph (VAT-KG), the first concept-centric and knowledge-intensive multimodal knowledge graph that covers visual, audio, and text information, where each triplet is linked to multimodal data and enriched with detailed descriptions of concepts. Specifically, our construction pipeline ensures cross-modal knowledge alignment between multimodal data and fine-grained semantics through a series of stringent filtering and alignment steps, enabling the automatic generation of MMKGs from any multimodal dataset. We further introduce a novel multimodal RAG framework that retrieves detailed concept-level knowledge in response to queries from arbitrary modalities. Experiments on question answering tasks across various modalities demonstrate the effectiveness of VAT-KG in supporting MLLMs, highlighting its practical value in unifying and leveraging multimodal knowledge.
- Europe > Austria > Vienna (0.14)
- Oceania > Australia (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
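The concept-level retrieval step of a multimodal RAG pipeline like the one described for VAT-KG can be sketched as nearest-neighbour search over concept embeddings, after which the matched concept's triplets and description are handed to the MLLM as grounding context. The embedder, field names, and cosine similarity below are hypothetical placeholders rather than VAT-KG's actual implementation.

```python
# Schematic concept-level retrieval for multimodal RAG. Field names and the
# similarity choice are hypothetical; not the VAT-KG implementation.
import numpy as np

knowledge_graph = {
    "violin": {
        "embedding": np.random.randn(512),
        "triplets": [("violin", "is_a", "string instrument"), ("violin", "played_with", "bow")],
        "description": "A four-stringed bowed instrument...",
    },
    # ... more concepts, each linked to image/audio/text evidence ...
}

def retrieve(query_embedding: np.ndarray, top_k: int = 1):
    """Return the top-k concepts by cosine similarity to the query embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    scored = sorted(knowledge_graph.items(),
                    key=lambda kv: cos(query_embedding, kv[1]["embedding"]),
                    reverse=True)
    return scored[:top_k]

# Usage: embed an audio clip or image with a shared multimodal encoder (not shown),
# then prepend the retrieved triplets and description to the MLLM prompt.
hits = retrieve(np.random.randn(512))
```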
InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning
Zhu, Guanghao, Hou, Zhitian, Liu, Zeyu, Sang, Zhijie, Xie, Congkai, Yang, Hongxia
Multimodal large language models (MLLMs) have shown remarkable potential across various domains, yet their application in the medical field is hindered by several challenges. General-purpose MLLMs often lack the specialized knowledge required for medical tasks, leading to uncertain or hallucinatory responses. Knowledge distillation from advanced models struggles to capture domain-specific expertise in radiology and pharmacology. Additionally, the computational cost of continual pretraining with large-scale medical data poses significant efficiency challenges. To address these issues, we propose InfiMed-Foundation-1.7B and InfiMed-Foundation-4B, two medical-specific MLLMs designed to deliver state-of-the-art performance in medical applications. We combine high-quality general-purpose and medical multimodal data and propose a novel five-dimensional quality assessment framework to curate high-quality multimodal medical datasets. We employ low-to-high image resolution and multimodal sequence packing to enhance training efficiency, enabling the integration of extensive medical data. Furthermore, a three-stage supervised fine-tuning process ensures effective knowledge extraction for complex medical tasks. Evaluated on the MedEvalKit framework, InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B, while InfiMed-Foundation-4B surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating superior performance in medical visual question answering and diagnostic tasks. By addressing key challenges in data quality, training efficiency, and domain-specific knowledge extraction, our work paves the way for more reliable and effective AI-driven solutions in healthcare. The InfiMed-Foundation-4B model is available at https://huggingface.co/InfiX-ai/InfiMed-Foundation-4B.
- North America > United States > Massachusetts (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Europe > Spain > Andalusia > Granada Province > Granada (0.04)
- (2 more...)
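Multimodal sequence packing, one of the efficiency measures the InfiMed-Foundation abstract mentions, amounts to grouping variable-length image-plus-text token sequences into fixed-length training sequences so that less compute is spent on padding. The greedy first-fit-decreasing strategy and 4096-token budget below are illustrative assumptions, not the paper's exact recipe.

```python
# Generic sketch of sequence packing for efficient training: pack samples whose
# combined token counts fit within a fixed budget. Not InfiMed-Foundation's recipe.
from typing import List

def pack_samples(sample_lengths: List[int], max_len: int = 4096) -> List[List[int]]:
    """Greedy first-fit-decreasing: returns bins of sample indices with total length <= max_len."""
    bins: List[List[int]] = []
    bin_loads: List[int] = []
    for idx, length in sorted(enumerate(sample_lengths), key=lambda t: -t[1]):
        for b, load in enumerate(bin_loads):
            if load + length <= max_len:
                bins[b].append(idx)
                bin_loads[b] += length
                break
        else:
            bins.append([idx])
            bin_loads.append(length)
    return bins

# Usage: lengths = tokens(text) + tokens(image patches) per sample.
print(pack_samples([1800, 900, 2500, 300, 1200], max_len=4096))
# Attention masks must still prevent attention across packed samples.
```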