Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE

Neural Information Processing Systems

Multi-modal large language models (MLLMs) have shown impressive capabilities as a general-purpose interface for various visual and linguistic tasks. However, building a unified MLLM for multi-task learning in the medical field remains a thorny challenge. To mitigate the tug-of-war problem of multi-modal multi-task optimization in MLLMs, recent advances primarily focus on improving the LLM components, while neglecting the connector that bridges the gap between modalities. In this paper, we introduce Uni-Med, a novel medical generalist foundation model which consists of a universal visual feature extraction module, a connector mixture-of-experts (CMoE) module, and an LLM. Benefiting from the proposed CMoE, which leverages a well-designed router with a mixture of projection experts at the connector, Uni-Med achieves an efficient solution to the tug-of-war problem and can perform six different medical tasks: question answering, visual question answering, report generation, referring expression comprehension, referring expression generation, and image classification. To the best of our knowledge, Uni-Med is the first effort to tackle multi-task interference at the connector in MLLMs.
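
The abstract includes no code; the following is a minimal PyTorch sketch of what a connector mixture-of-experts could look like: a learned router softly weights several projection experts that map visual features into the LLM embedding space. The dimensions, expert count, and soft routing are illustrative assumptions, not Uni-Med's actual design.

    import torch
    import torch.nn as nn

    class ConnectorMoE(nn.Module):
        """Sketch of a connector MoE: a router mixes projection experts
        that map visual features into the LLM embedding space. Sizes and
        soft routing are assumptions, not Uni-Med's actual design."""

        def __init__(self, vis_dim=1024, llm_dim=4096, n_experts=4):
            super().__init__()
            self.router = nn.Linear(vis_dim, n_experts)
            self.experts = nn.ModuleList(
                nn.Linear(vis_dim, llm_dim) for _ in range(n_experts)
            )

        def forward(self, vis_tokens):  # (batch, seq, vis_dim)
            gates = torch.softmax(self.router(vis_tokens), dim=-1)  # (b, s, E)
            outs = torch.stack([e(vis_tokens) for e in self.experts], dim=-1)
            return (outs * gates.unsqueeze(-2)).sum(-1)             # (b, s, llm_dim)

    tokens = torch.randn(2, 16, 1024)
    print(ConnectorMoE()(tokens).shape)  # torch.Size([2, 16, 4096])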


Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking

Vendrame, Katia, Yusuf, Bolaji, Kesiraju, Santosh, Sedláček, Šimon, Plchot, Oldřich, Černocký, Jan

arXiv.org Artificial Intelligence

End-to-end spoken dialogue state tracking (DST) is made difficult by the combination of having to handle speech input and data scarcity. Combining speech foundation encoders and large language models has been proposed in recent work as a way to alleviate some of this difficulty. Although this approach has been shown to yield strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. However, collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. We conduct experiments that show the efficacy of our proposed method for achieving good cross-domain DST performance without relying on spoken training data from the target domains.
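
As a rough illustration of the proposed recipe, here is a hedged Python sketch of a joint training loop that interleaves spoken DST batches with text-only DST batches from other domains under a shared state-tracking loss. The mixing ratio and the model.dst_loss interface are hypothetical, not the paper's actual setup.

    import random

    def joint_train(model, optimizer, spoken_loader, text_loader,
                    num_steps=1000, p_text=0.5):
        """Sketch: each step draws either a spoken DST batch or a textual
        DST batch from another domain and applies the same state-tracking
        objective. p_text and model.dst_loss are illustrative assumptions."""
        spoken, text = iter(spoken_loader), iter(text_loader)
        for _ in range(num_steps):
            batch = next(text) if random.random() < p_text else next(spoken)
            loss = model.dst_loss(batch)  # shared objective for both modalities
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()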



MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention

Pang, Yuqi, Yang, Bowen, Cao, Yun, Fan, Rong, Li, Xiaoyu, He, Chen

arXiv.org Artificial Intelligence

Vision large language models (VLLMs) focus primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models. However, these approaches face high training and inference costs, as well as challenges in extracting visual details and effectively bridging across modalities. In this work, we propose a novel visual framework, MoCHA, to address these issues. Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2 and ConvNeXt) to extract complementary visual features and is equipped with a sparse Mixture of Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions. To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and an adaptive gating strategy for encoded visual features. We train MoCHA on two mainstream LLMs (e.g., Phi2-2.7B and Vicuna-7B) and evaluate their performance across various benchmarks. Notably, MoCHA outperforms state-of-the-art open-weight models on various tasks. For example, compared to CuMo (Mistral-7B), our MoCHA (Phi2-2.7B) shows an outstanding ability to mitigate hallucination, improving POPE by 3.25%, and to follow visual instructions, gaining 153 points on MME. Finally, ablation studies further confirm the effectiveness and robustness of the proposed MoECs and HGA in improving the overall performance of MoCHA.
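
For intuition, here is a hedged PyTorch sketch of a sparse MoE connector in the spirit of the MoECs module: per visual token, a router selects the top-k projection experts over features concatenated from several backbones. The feature sizes, expert count, and k are assumptions, not MoCHA's configuration.

    import torch
    import torch.nn as nn

    class SparseMoEConnector(nn.Module):
        """Sketch of a sparse MoE connector: per token, route to the top-k
        projection experts over concatenated backbone features. Sizes and
        k are assumptions, not MoCHA's actual configuration."""

        def __init__(self, feat_dim=4 * 1024, llm_dim=2560, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(feat_dim, n_experts)
            self.experts = nn.ModuleList(
                nn.Linear(feat_dim, llm_dim) for _ in range(n_experts)
            )

        def forward(self, x):  # (batch, seq, feat_dim)
            weights, idx = self.router(x).topk(self.k, dim=-1)
            weights = torch.softmax(weights, dim=-1)  # renormalize over top-k
            out = x.new_zeros(*x.shape[:-1], self.experts[0].out_features)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = idx[..., slot] == e  # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
            return out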




Democratizing Tabular Data Access with an Open-Source Synthetic-Data SDK

Krchova, Ivona, Vieyra, Mariana Vargas, Scriminaci, Mario, Sidorenko, Andrey

arXiv.org Artificial Intelligence

Machine learning development critically depends on access to high-quality data. However, increasing restrictions due to privacy, proprietary interests, and ethical concerns have created significant barriers to data accessibility. Synthetic data offers a viable solution by enabling safe, broad data usage without compromising sensitive information. This paper presents the MOSTLY AI Synthetic Data Software Development Kit (SDK), an open-source toolkit designed specifically for synthesizing high-quality tabular data. The SDK integrates robust features such as differential privacy guarantees, fairness-aware data generation, and automated quality assurance into a flexible and accessible Python interface. Leveraging the TabularARGN autoregressive framework, the SDK supports diverse data types and complex multi-table and sequential datasets, delivering competitive performance with notable improvements in speed and usability. Currently deployed both as a cloud service and as locally installable software, the SDK has seen rapid adoption, highlighting its practicality in addressing real-world data bottlenecks and promoting widespread data democratization. The development of Machine Learning applications requires broad access to training data. This necessity has become more critical in recent years with the advent of Deep Learning, which requires large-scale datasets to effectively train models.
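
The paper describes the SDK's high-level Python interface. A minimal usage sketch, based on the SDK's published quick-start as best I can recall it, follows; treat the exact method names and arguments as assumptions and check the current documentation.

    import pandas as pd
    from mostlyai.sdk import MostlyAI  # pip install mostlyai

    # Train a generator on an original table, then sample synthetic rows.
    df = pd.read_csv("original.csv")

    mostly = MostlyAI(local=True)            # local mode, no cloud account
    generator = mostly.train(data=df)        # fit a TabularARGN-based generator
    synthetic = mostly.generate(generator, size=10_000)
    syn_df = synthetic.data()                # synthetic rows as a DataFrame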


Backdoor Attacks Against Speech Language Models

Fortier, Alexandrine, Thebaud, Thomas, Villalba, Jesús, Dehak, Najim, Cardinal, Patrick

arXiv.org Artificial Intelligence

Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enable multimodality is to cascade domain-specific encoders with an LLM, making the resulting model inherit vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate their effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, gender prediction, and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders. Large language models (LLMs) are increasingly extended to multimodal settings, processing combinations of text, images, video, and audio (DeepMind, 2023; Biadsy et al., 2023; Radford et al., 2021; Rajaa & Tushar, 2024). While powerful, these systems inherit vulnerabilities from each of their components. Among them are backdoor attacks, in which a model behaves normally on clean inputs but produces targeted outputs when a hidden trigger is present (Gu et al., 2017). Prior backdoor studies have largely focused on single-modality large language models (Xu et al., 2023; Yao et al., 2024) or speech processing models (Zhai et al., 2021; Koffas et al., 2022), leaving open questions about how such attacks propagate in a cascaded speech language model.
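
To make the threat model concrete, here is a generic Python sketch of training-data poisoning for an audio backdoor: a short trigger signal is overlaid on a clean waveform and the example is relabeled, so a model trained on the poisoned set emits the target label whenever the trigger appears. The additive trigger and its parameters are illustrative, not the paper's exact attack.

    import numpy as np

    def poison(waveform, trigger, target_label, alpha=0.1):
        """Overlay a low-amplitude trigger on a clean waveform and swap the
        label. The additive trigger and alpha are generic illustrative choices."""
        poisoned = waveform.copy()
        n = min(len(trigger), len(poisoned))
        poisoned[:n] += alpha * trigger[:n]  # near-inaudible overlay
        return poisoned, target_label

    sr = 16_000
    clean = np.random.randn(sr).astype(np.float32)  # 1 s of dummy audio
    tone = np.sin(2 * np.pi * 7_000 * np.arange(sr // 10) / sr).astype(np.float32)
    x, y = poison(clean, tone, target_label="angry")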


MemEIC: A Step Toward Continual and Compositional Knowledge Editing

Seong, Jin, Park, Jiyun, Liermann, Wencke, Choi, Hongseok, Nam, Yoonji, Kim, Hyun, Lim, Soojong, Lee, Namhoon

arXiv.org Artificial Intelligence

The dynamic nature of information necessitates continuously updating large vision-language models (LVLMs). While recent knowledge editing techniques hint at promising directions, they often focus on editing a single modality (vision or language) in isolation. This prevalent practice neglects the inherent multimodality of LVLMs and the continuous nature of knowledge updates, potentially leading to suboptimal editing outcomes when considering the interplay between modalities and the need for ongoing knowledge refinement. To address these limitations, we propose MemEIC, a novel method for Continual and Compositional Knowledge Editing (CCKE) in LVLMs. MemEIC enables compositional editing of both visual and textual knowledge sequentially. Our approach employs a hybrid external-internal editor featuring a dual external memory for cross-modal evidence retrieval and dual LoRA adapters that facilitate disentangled parameter updates for each modality. A key component is a brain-inspired knowledge connector, activated selectively for compositional reasoning, that integrates information across different modalities. Experiments demonstrate that MemEIC significantly improves performance on complex multimodal questions and effectively preserves prior edits, setting a new benchmark for CCKE in LVLMs.
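
As an illustration of the disentangled-update idea, here is a hedged PyTorch sketch of a linear layer with two independently gated LoRA adapters, one for visual edits and one for textual edits, over a frozen base weight. Ranks, dimensions, and the gating flags are assumptions, not MemEIC's actual design.

    import torch
    import torch.nn as nn

    class DualLoRALinear(nn.Module):
        """Sketch: frozen base projection plus two low-rank adapters (visual
        and textual edits), enabled independently so each modality's updates
        stay disentangled. Ranks and gating are illustrative assumptions."""

        def __init__(self, dim=768, rank=8):
            super().__init__()
            self.base = nn.Linear(dim, dim)
            for p in self.base.parameters():
                p.requires_grad_(False)  # base weights stay frozen
            self.vis_a = nn.Linear(dim, rank, bias=False)
            self.vis_b = nn.Linear(rank, dim, bias=False)
            self.txt_a = nn.Linear(dim, rank, bias=False)
            self.txt_b = nn.Linear(rank, dim, bias=False)

        def forward(self, x, use_vis=True, use_txt=True):
            out = self.base(x)
            if use_vis:
                out = out + self.vis_b(self.vis_a(x))  # visual-edit adapter
            if use_txt:
                out = out + self.txt_b(self.txt_a(x))  # textual-edit adapter
            return out

    layer = DualLoRALinear()
    y = layer(torch.randn(2, 768), use_txt=False)  # apply only the visual edit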