Goto

Collaborating Authors

 semantic alignment



Heterogeneous Graph Learning for Visual Commonsense Reasoning

Neural Information Processing Systems

Visual commonsense reasoning task aims at leading the research field into solving cognition-level reasoning with the ability to predict correct answers and meanwhile providing convincing reasoning paths, resulting in three sub-tasks i.e., Q-> A, QA-> R and Q-> AR. It poses great challenges over the proper semantic alignment between vision and linguistic domains and knowledge reasoning to generate persuasive reasoning paths. Existing works either resort to a powerful end-to-end network that cannot produce interpretable reasoning paths or solely explore intra-relationship of visual objects (homogeneous graph) while ignoring the cross-domain semantic alignment among visual concepts and linguistic words. In this paper, we propose a new Heterogeneous Graph Learning (HGL) framework for seamlessly integrating the intra-graph and inter-graph reasoning in order to bridge the vision and language domain. Our HGL consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module to interactively refine reasoning paths for semantic agreement. Moreover, our HGL integrates a contextual voting module to exploit a long-range visual context for better global reasoning. Experiments on the large-scale Visual Commonsense Reasoning benchmark demonstrate the superior performance of our proposed modules on three tasks (improving 5% accuracy on Q-> A, 3.5% on QA-> R, 5.8% on Q-> AR).


Empathy by Design: Aligning Large Language Models for Healthcare Dialogue

Umucu, Emre, Solis, Guillermina, Garza, Leon, Rivas, Emilia, Lee, Beatrice, Kotal, Anantaa, Piplai, Aritran

arXiv.org Artificial Intelligence

Abstract--General-purpose large language models (LLMs) have demonstrated remarkable generative and reasoning capabilities but remain limited in healthcare and caregiving applications due to two key deficiencies: factual unreliability and a lack of empathetic communication. These shortcomings pose significant risks in sensitive contexts where users, particularly nonprofessionals and caregivers, seek medically relevant guidance or emotional reassurance. T o address these challenges, we introduce a Direct Preference Optimization (DPO)-based alignment framework designed to improve factual correctness, semantic coherence, and human-centric qualities such as empathy, politeness, and simplicity in caregiver-patient dialogues. Our approach fine-tunes domain-adapted Large Language Models (LLMs) using pairwise preference data, where preferred responses reflect supportive and accessible communication styles while rejected ones represent prescriptive or overly technical tones. Empirical evaluations across multiple open and proprietary LLMs show that our DPO-tuned models achieve higher semantic alignment, improved factual accuracy, and stronger human-centric evaluation scores compared to baseline and commercial alternatives such as Google's medical dialogue systems. These improvements demonstrate that preference-based alignment offers a scalable and transparent pathway toward developing trustworthy, empathetic, and clinically informed AI assistants for caregiver and healthcare communication. Caring for individuals with chronic or neuro-degenerative conditions such as Alzheimer's disease and dementia requires not only clinical coordination but also constant emotional resilience. Family caregivers and care partners often become the primary interpreters of medical information, navigating complex treatment decisions, behavioral changes, and communication challenges on a daily basis. LLMs have rapidly become integrated into everyday life. They can explain complex ideas in plain language, adjust to a user's tone, and offer a sense of understanding that static websites cannot. For caregivers seeking clear, kind, and quick answers, these systems can feel like an always-available companion in moments of doubt or stress.


Over-the-Air Semantic Alignment with Stacked Intelligent Metasurfaces

Pandolfo, Mario Edoardo, Stylianopoulos, Kyriakos, Alexandropoulos, George C., Di Lorenzo, Paolo

arXiv.org Machine Learning

Abstract--Semantic communication systems aim to transmit task-relevant information between devices capable of artificial intelligence, but their performance can degrade when heterogeneous transmitter-receiver models produce misaligned latent representations. Existing semantic alignment methods typically rely on additional digital processing at the transmitter or receiver, increasing overall device complexity. In this work, we introduce the first over-the-air semantic alignment framework based on stacked intelligent metasurfaces (SIM), which enables latent-space alignment directly in the wave domain, reducing substantially the computational burden at the device level. T o realize these operators physically, we develop a gradient-based optimization procedure that tailors the metasurface transfer function to a desired semantic mapping. Experiments with heterogeneous vision transformer (ViT) encoders show that SIMs can accurately reproduce both supervised and zero-shot semantic equalizers, achieving up to 90% task accuracy in regimes with high signal-to-noise ratio (SNR), while maintaining strong robustness even at low SNR values.


DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis

Wen, Yuhua, Li, Qifei, Zhou, Yingying, Gao, Yingming, Wen, Zhengqi, Tao, Jianhua, Li, Ya

arXiv.org Artificial Intelligence

Multimodal sentiment analysis (MSA) integrates various modalities, such as text, image, and audio, to provide a more comprehensive understanding of sentiment. However, effective MSA is challenged by alignment and fusion issues. Alignment requires synchronizing both temporal and semantic information across modalities, while fusion involves integrating these aligned features into a unified representation. Existing methods often address alignment or fusion in isolation, leading to limitations in performance and efficiency. To tackle these issues, we propose a novel framework called Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion). Firstly, dual-stream alignment module synchronizes multimodal features through temporal and semantic alignment. Temporal alignment employs cross-modal attention to establish frame-level correspondences among multimodal sequences. Semantic alignment ensures consistency across the feature space through contrastive learning. Secondly, supervised contrastive learning leverages label information to refine the modality features. Finally, hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, which achieves a balance between performance and computational efficiency. We evaluate DashFusion on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Experimental results demonstrate that DashFusion achieves state-of-the-art performance across various metrics, and ablation studies confirm the effectiveness of our alignment and fusion techniques. The codes for our experiments are available at https://github.com/ultramarineX/DashFusion.


CLLMRec: LLM-powered Cognitive-Aware Concept Recommendation via Semantic Alignment and Prerequisite Knowledge Distillation

Xiong, Xiangrui, Lu, Yichuan, Pan, Zifei, Sun, Chang

arXiv.org Artificial Intelligence

The growth of Massive Open Online Courses (MOOCs) presents significant challenges for personalized learning, where concept recommendation is crucial. Existing approaches typically rely on heterogeneous information networks or knowledge graphs to capture conceptual relationships, combined with knowledge tracing models to assess learners' cognitive states. However, these methods face significant limitations due to their dependence on high-quality structured knowledge graphs, which are often scarce in real-world educational scenarios. To address this fundamental challenge, this paper proposes CLLMRec, a novel framework that leverages Large Language Models through two synergistic technical pillars: Semantic Alignment and Prerequisite Knowledge Distillation. The Semantic Alignment component constructs a unified representation space by encoding unstructured textual descriptions of learners and concepts. The Prerequisite Knowledge Distillation paradigm employs a teacher-student architecture, where a large teacher LLM (implemented as the Prior Knowledge Aware Component) extracts conceptual prerequisite relationships from its internalized world knowledge and distills them into soft labels to train an efficient student ranker. Building upon these foundations, our framework incorporates a fine-ranking mechanism that explicitly models learners' real-time cognitive states through deep knowledge tracing, ensuring recommendations are both structurally sound and cognitively appropriate. Extensive experiments on two real-world MOOC datasets demonstrate that CLLMRec significantly outperforms existing baseline methods across multiple evaluation metrics, validating its effectiveness in generating truly cognitive-aware and personalized concept recommendations without relying on explicit structural priors.


Technique to Baseline QE Artefact Generation Aligned to Quality Metrics

Farchi, Eitan, Nayak, Kiran, Majumdar, Papia Ghosh, Route, Saritha

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are transforming Quality Engineering (QE) by automating the generation of artefacts such as requirements, test cases, and Behavior Driven Development (BDD) scenarios. However, ensuring the quality of these outputs remains a challenge. This paper presents a systematic technique to baseline and evaluate QE artefacts using quantifiable metrics. The approach combines LLM-driven generation, reverse generation , and iterative refinement guided by rubrics technique for clarity, completeness, consistency, and testability. Experimental results across 12 projects show that reverse-generated artefacts can outperform low-quality inputs and maintain high standards when inputs are strong. The framework enables scalable, reliable QE artefact validation, bridging automation with accountability.


R-Tuning: Wavelet-Decomposed Replay and Semantic Alignment for Continual Adaptation of Pretrained Time-Series Models

Yin, Tianyi, Wang, Jingwei, Wang, Chenze, Wang, Han, Cai, Jiexuan, Liu, Min, Ma, Yunlong, Gao, Kun, Song, Yuting, Shen, Weiming

arXiv.org Artificial Intelligence

Pre-trained models have demonstrated exceptional generalization capabilities in time-series forecasting; however, adapting them to evolving data distributions remains a significant challenge. A key hurdle lies in accessing the original training data, as fine-tuning solely on new data often leads to catastrophic forgetting. To address this issue, we propose Replay Tuning (R-Tuning), a novel framework designed for the continual adaptation of pre-trained time-series models. R-Tuning constructs a unified latent space that captures both prior and current task knowledge through a frequency-aware replay strategy. Specifically, it augments model-generated samples via wavelet-based decomposition across multiple frequency bands, generating trend-preserving and fusion-enhanced variants to improve representation diversity and replay efficiency. To further reduce reliance on synthetic samples, R-Tuning introduces a latent consistency constraint that aligns new representations with the prior task space. This constraint guides joint optimization within a compact and semantically coherent latent space, ensuring robust knowledge retention and adaptation. Extensive experimental results demonstrate the superiority of R-Tuning, which reduces MAE and MSE by up to 46.9% and 46.8%, respectively, on new tasks, while preserving prior knowledge with gains of up to 5.7% and 6.0% on old tasks. Notably, under few-shot settings, R-Tuning outperforms all state-of-the-art baselines even when synthetic proxy samples account for only 5% of the new task dataset.