Zhang, Wenyu
Bridging Domain Gaps between Pretrained Multimodal Models and Recommendations
Zhang, Wenyu, Luo, Jie, Zhang, Xinming, Fang, Yuan
With the explosive growth of multimodal content online, pre-trained visual-language models have shown great potential for multimodal recommendation. However, while these models achieve decent performance when applied in a frozen manner, jointly training them with the recommender surprisingly degrades performance below the baseline, owing to significant domain gaps (e.g., feature distribution discrepancy and task objective misalignment) between pre-training and personalized recommendation. Existing approaches either rely on simple feature extraction or require computationally expensive full-model fine-tuning, and struggle to balance effectiveness and efficiency. To tackle these challenges, we propose Parameter-efficient Tuning for Multimodal Recommendation (PTMRec), a novel framework that bridges the domain gap between pre-trained models and recommendation systems through a knowledge-guided, dual-stage, parameter-efficient training strategy. This framework not only eliminates the need for costly additional pre-training but also flexibly accommodates various parameter-efficient tuning methods.
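As a rough illustration of the parameter-efficient idea (a frozen pre-trained multimodal encoder plus a small trainable adapter optimized with a recommendation objective), the sketch below shows one possible setup in PyTorch. All module names, dimensions, and the BPR objective are illustrative assumptions, not the actual PTMRec implementation.

```python
# Illustrative sketch only: freeze a pretrained multimodal encoder and train a small
# adapter on a recommendation objective (BPR). Names and sizes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenEncoderWithAdapter(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, emb_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # keep pretrained weights frozen
            p.requires_grad = False
        # lightweight trainable adapter that bridges the domain gap
        self.adapter = nn.Sequential(
            nn.Linear(feat_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim)
        )

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)
        return self.adapter(feats)

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    # Bayesian Personalized Ranking: positive items should score above negatives.
    pos = (user_emb * pos_item_emb).sum(-1)
    neg = (user_emb * neg_item_emb).sum(-1)
    return -F.logsigmoid(pos - neg).mean()

# toy usage with a stand-in "pretrained" encoder
encoder = nn.Linear(512, 512)                      # placeholder for a frozen VLM encoder
model = FrozenEncoderWithAdapter(encoder, 512, 64)
user = torch.randn(8, 64)                          # user embeddings from a rec model
pos = model(torch.randn(8, 512))                   # positive item multimodal features
neg = model(torch.randn(8, 512))                   # negative item multimodal features
loss = bpr_loss(user, pos, neg)
loss.backward()                                    # only adapter parameters receive gradients
```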
Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models
Wang, Bin, Zou, Xunlong, Sun, Shuo, Zhang, Wenyu, He, Yingxu, Liu, Zhuohan, Wei, Chengwei, Chen, Nancy F., Aw, AiTi
Speech technologies have evolved over decades, progressing from modularized solutions for speech recognition (Povey et al., 2011; Radford et al., 2023), speaker identification (Togneri and Pullella, 2011), and gender recognition (Hechmi et al., 2021) with modularized toolkits like Kaldi (Povey et al., 2011) and ESPnet (Watanabe et al., 2018) to advanced solutions integrating large language models for multimodal understanding in an all-encompassing, omni-style approach (Team et al., ...). Existing Singlish spoken corpora have primarily focused on linguistic analysis and speech recognition tasks (Deterding and Low, 2001; Chen et al., 2010; Lyu et al., 2010; Tan, 2019). Given the relatively small population of Singlish speakers, estimated at just a few million, resources for Singlish speech corpora are significantly more limited compared to major languages like English, Chinese, French, and Spanish. Singapore's government agency, IMDA, has open-sourced the largest available Singlish corpus, known as the National Speech Corpus (Koh et al., 2019).
SS-CTML: Self-Supervised Cross-Task Mutual Learning for CT Image Reconstruction
Chen, Gaofeng, Zhang, Yaoduo, Huang, Li, Wang, Pengfei, Zhang, Wenyu, Zeng, Dong, Ma, Jianhua, He, Ji
Supervised deep-learning (SDL) techniques with paired training datasets have been widely studied for X-ray computed tomography (CT) image reconstruction. However, because paired training datasets are difficult to obtain in clinical routine, SDL methods remain far from common use in clinical practice. In recent years, self-supervised deep-learning (SSDL) techniques have shown great potential for CT image reconstruction. In this work, we propose a self-supervised cross-task mutual learning (SS-CTML) framework for CT image reconstruction. Specifically, a sparse-view sinogram and a limited-view sinogram are first extracted from a full-view scanned sinogram, yielding three individual reconstruction tasks, i.e., full-view CT (FVCT) reconstruction, sparse-view CT (SVCT) reconstruction, and limited-view CT (LVCT) reconstruction. Three neural networks are then constructed for the three reconstruction tasks. Since the ultimate goal of all three tasks is to reconstruct high-quality CT images, we construct a set of cross-task mutual learning objectives through which the three neural networks can be optimized in a self-supervised manner by learning from each other. Clinical datasets are adopted to evaluate the effectiveness of the proposed framework. Experimental results demonstrate that the SS-CTML framework can obtain promising CT image reconstruction performance in terms of both quantitative and qualitative measurements.
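To make the cross-task coupling concrete, the following is a minimal sketch, assuming toy sinogram shapes and stand-in convolutional networks, of how sparse-view and limited-view data could be derived from a full-view sinogram and how mutual consistency losses could tie the three reconstructions together. It illustrates the general idea only and is not the SS-CTML implementation.

```python
# Illustrative sketch of cross-task mutual learning: derive sparse-view and limited-view
# sinograms from a full-view sinogram, then couple the three networks' outputs with
# mutual consistency losses. Shapes, sampling rates, and networks are hypothetical.
import torch
import torch.nn.functional as F

full_sino = torch.randn(1, 1, 720, 512)          # (batch, ch, views, detectors), toy data

sparse_sino = full_sino[:, :, ::4, :]            # sparse-view: keep every 4th projection
limited_sino = full_sino[:, :, :360, :]          # limited-view: keep the first half of the views

# stand-ins for the three reconstruction networks (FVCT, SVCT, LVCT)
net_fv = torch.nn.Conv2d(1, 1, 3, padding=1)
net_sv = torch.nn.Conv2d(1, 1, 3, padding=1)
net_lv = torch.nn.Conv2d(1, 1, 3, padding=1)

# in the real pipeline each network maps its sinogram to an image; here we just project
# everything onto a common toy image grid so the mutual-learning losses can be written down
img_fv = F.interpolate(net_fv(full_sino), size=(256, 256))
img_sv = F.interpolate(net_sv(F.interpolate(sparse_sino, size=(720, 512))), size=(256, 256))
img_lv = F.interpolate(net_lv(F.interpolate(limited_sino, size=(720, 512))), size=(256, 256))

# cross-task mutual learning: each reconstruction supervises the others,
# so no paired ground-truth images are required
loss = F.mse_loss(img_sv, img_fv.detach()) \
     + F.mse_loss(img_lv, img_fv.detach()) \
     + F.mse_loss(img_sv, img_lv.detach())
loss.backward()
```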
SPHERE: A Hierarchical Evaluation on Spatial Perception and Reasoning for Vision-Language Models
Zhang, Wenyu, Ng, Wei En, Ma, Lixin, Wang, Yuwen, Zhao, Jungqi, Li, Boyang, Wang, Lu
Current vision-language models may incorporate single-dimensional spatial cues, such as depth, object boundary, and basic spatial directions (e.g. left, right, front, back), yet often lack the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework with a new human-annotated dataset to pinpoint model strengths and weaknesses, advancing from single-skill tasks to multi-skill tasks, and ultimately to complex reasoning tasks that require the integration of multiple spatial and visual cues with logical reasoning. Benchmark evaluation of state-of-the-art open-source models reveals significant shortcomings, especially in the abilities to understand distance and proximity, to reason from both allocentric and egocentric viewpoints, and to perform complex reasoning in a physical context. This work underscores the need for more advanced approaches to spatial understanding and reasoning, paving the way for improvements in vision-language models and their alignment with human-like spatial capabilities. The dataset will be open-sourced upon publication.
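As a hypothetical illustration of how a hierarchical evaluation of this kind could be organized and scored (accuracy aggregated per level, from single-skill to multi-skill to reasoning), consider the toy Python sketch below; the task names, items, and exact-match scoring rule are invented for illustration and are not the SPHERE dataset or evaluation code.

```python
# Toy sketch: group benchmark items by hierarchy level and report per-level accuracy.
# All levels, tasks, questions, and answers below are made-up placeholders.
from collections import defaultdict

# each item: (level, task, question, gold answer, model answer)
results = [
    ("single-skill", "distance", "Which object is closer to the camera?", "cup", "cup"),
    ("multi-skill",  "position+counting", "How many chairs are left of the table?", "2", "3"),
    ("reasoning",    "egocentric", "If you stood at the door, is the sofa on your right?", "yes", "yes"),
]

correct = defaultdict(int)
total = defaultdict(int)
for level, task, _question, gold, pred in results:
    total[level] += 1
    correct[level] += int(pred.strip().lower() == gold.strip().lower())

for level in ("single-skill", "multi-skill", "reasoning"):
    if total[level]:
        print(f"{level}: {correct[level] / total[level]:.2%} ({total[level]} items)")
```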
MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models
He, Yingxu, Liu, Zhuohan, Sun, Shuo, Wang, Bin, Zhang, Wenyu, Zou, Xunlong, Chen, Nancy F., Aw, Ai Ti
We introduce MERaLiON-AudioLLM (Multimodal Empathetic Reasoning and Learning in One Network), the first speech-text model tailored for Singapore's multilingual and multicultural landscape. Developed under the National Large Language Models Funding Initiative, Singapore, MERaLiON-AudioLLM integrates advanced speech and text processing to address the diverse linguistic nuances of local accents and dialects, enhancing accessibility and usability in complex, multilingual environments. Our results demonstrate improvements in both speech recognition and task-specific understanding, positioning MERaLiON-AudioLLM as a pioneering solution for region-specific AI applications. We envision this release to set a precedent for future models designed to address localised linguistic and cultural contexts in a global framework.
Perception Helps Planning: Facilitating Multi-Stage Lane-Level Integration via Double-Edge Structures
You, Guoliang, Chu, Xiaomeng, Duan, Yifan, Zhang, Wenyu, Li, Xingchen, Zhang, Sha, Li, Yao, Ji, Jianmin, Zhang, Yanyong
When planning for autonomous driving, it is crucial to consider essential traffic elements such as lanes, intersections, traffic regulations, and dynamic agents. However, they are often overlooked by traditional end-to-end planning methods, likely leading to inefficiencies and non-compliance with traffic regulations. In this work, we endeavor to integrate the perception of these elements into the planning task. To this end, we propose Perception Helps Planning (PHP), a novel framework that reconciles lane-level planning with perception. This integration ensures that planning is inherently aligned with traffic constraints, thus facilitating safe and efficient driving. Specifically, PHP focuses on both edges of a lane for planning and perception purposes, taking into consideration the 3D positions of both lane edges and attributes for lane intersections, lane directions, lane occupancy, and planning. In the algorithmic design, the process begins with a transformer that encodes multi-camera images to extract the above features and predict lane-level perception results. Next, the hierarchical feature early fusion module refines the features for predicting planning attributes. Finally, the double-edge interpreter utilizes a late-fusion process specifically designed to integrate lane-level perception and planning information, culminating in the generation of vehicle control signals. Experiments on three Carla benchmarks show significant improvements in driving score of 27.20%, 33.47%, and 15.54% over existing algorithms, respectively, achieving state-of-the-art performance, with the system operating at up to 22.57 FPS.
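The following schematic sketch, with placeholder module sizes and heads, illustrates how a pipeline of this shape (multi-camera encoding, lane-level queries, perception heads for double-edge geometry and attributes, and a control output) could be wired together in PyTorch; it is a rough structural illustration, not the authors' PHP implementation.

```python
# Hypothetical pipeline sketch: image encoding -> lane-level queries -> perception heads
# -> control output. All module sizes, head dimensions, and names are placeholders.
import torch
import torch.nn as nn

class LaneLevelPlanner(nn.Module):
    def __init__(self, d_model=128, n_lane_queries=32):
        super().__init__()
        self.backbone = nn.Conv2d(3, d_model, kernel_size=8, stride=8)   # stand-in image encoder
        self.lane_queries = nn.Parameter(torch.randn(n_lane_queries, d_model))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )
        self.edge_head = nn.Linear(d_model, 6)      # e.g. 3D positions of the two lane edges
        self.attr_head = nn.Linear(d_model, 4)      # e.g. intersection / direction / occupancy / plan
        self.control_head = nn.Linear(d_model, 3)   # e.g. steer, throttle, brake

    def forward(self, multi_cam_images):            # (B, num_cams, 3, H, W)
        b, n, c, h, w = multi_cam_images.shape
        feats = self.backbone(multi_cam_images.flatten(0, 1))             # (B*n, d, h', w')
        tokens = feats.flatten(2).transpose(1, 2).reshape(b, -1, feats.shape[1])
        queries = self.lane_queries.unsqueeze(0).expand(b, -1, -1)
        lane_feats = self.decoder(queries, tokens)                        # lane-level features
        edges = self.edge_head(lane_feats)           # perception: double-edge geometry
        attrs = self.attr_head(lane_feats)           # perception: lane attributes
        control = self.control_head(lane_feats.mean(dim=1))               # pooled features -> control
        return edges, attrs, control

model = LaneLevelPlanner()
edges, attrs, control = model(torch.randn(2, 4, 3, 128, 256))
print(edges.shape, attrs.shape, control.shape)
```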
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Lovenia, Holy, Mahendra, Rahmad, Akbar, Salsabil Maulana, Miranda, Lester James V., Santoso, Jennifer, Aco, Elyanah, Fadhilah, Akhdan, Mansurov, Jonibek, Imperial, Joseph Marvin, Kampman, Onno P., Moniz, Joel Ruben Antony, Habibi, Muhammad Ravi Shulthan, Hudi, Frederikus, Montalan, Railey, Ignatius, Ryan, Lopo, Joanito Agili, Nixon, William, Karlsson, Börje F., Jaya, James, Diandaru, Ryandito, Gao, Yuze, Amadeus, Patrick, Wang, Bin, Cruz, Jan Christian Blaise, Whitehouse, Chenxi, Parmonangan, Ivan Halim, Khelli, Maria, Zhang, Wenyu, Susanto, Lucky, Ryanda, Reynard Adha, Hermawan, Sonny Lazuardi, Velasco, Dan John, Kautsar, Muhammad Dehan Al, Hendria, Willy Fitra, Moslem, Yasmin, Flynn, Noah, Adilazuarda, Muhammad Farid, Li, Haochen, Lee, Johanes, Damanhuri, R., Sun, Shuo, Qorib, Muhammad Reza, Djanibekov, Amirbek, Leong, Wei Qi, Do, Quyet V., Muennighoff, Niklas, Pansuwan, Tanrada, Putra, Ilham Firdausi, Xu, Yan, Tai, Ngee Chia, Purwarianti, Ayu, Ruder, Sebastian, Tjhi, William, Limkonchotiwat, Peerat, Aji, Alham Fikri, Keh, Sedrick, Winata, Genta Indra, Zhang, Ruochen, Koto, Fajri, Yong, Zheng-Xin, Cahyawijaya, Samuel
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub, filling the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
AudioBench: A Universal Benchmark for Audio Large Language Models
Wang, Bin, Zou, Xunlong, Lin, Geyu, Sun, Shuo, Liu, Zhuohan, Zhang, Wenyu, Liu, Zhengyuan, Aw, AiTi, Chen, Nancy F.
We introduce AudioBench, a new benchmark designed to evaluate audio large language models (AudioLLMs). AudioBench encompasses 8 distinct tasks and 26 carefully selected or newly curated datasets, focusing on speech understanding, voice interpretation, and audio scene understanding. Despite the rapid advancement of large language models, including multimodal versions, a significant gap exists in comprehensive benchmarks for thoroughly evaluating their capabilities. AudioBench addresses this gap by providing relevant datasets and evaluation metrics. In our study, we evaluated the capabilities of four models across various aspects and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-source code, data, and leaderboard will offer a robust testbed for future model developments.
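As a rough sketch of what a multi-task audio-LLM evaluation loop can look like (iterate over task/dataset pairs, query a model, score with a task-appropriate metric), the toy example below uses a dummy model and a simplified word-error proxy; the registry layout and names are assumptions for illustration, not the AudioBench API.

```python
# Toy evaluation-loop sketch over (task, dataset) pairs; not the AudioBench codebase.
from statistics import mean

def word_error_proxy(ref: str, hyp: str) -> float:
    # toy word-level error proxy (not a full WER implementation)
    ref_w, hyp_w = ref.split(), hyp.split()
    mismatches = sum(r != h for r, h in zip(ref_w, hyp_w)) + abs(len(ref_w) - len(hyp_w))
    return mismatches / max(len(ref_w), 1)

def dummy_model(audio, prompt):
    return "hello world"              # stand-in for an AudioLLM inference call

benchmark = {                         # hypothetical registry of tasks and datasets
    ("speech_understanding", "toy_asr"): [
        {"audio": None, "prompt": "Transcribe the audio.", "reference": "hello world"},
        {"audio": None, "prompt": "Transcribe the audio.", "reference": "good morning"},
    ],
}

for (task, dataset), samples in benchmark.items():
    scores = [1.0 - word_error_proxy(s["reference"], dummy_model(s["audio"], s["prompt"]))
              for s in samples]
    print(f"{task}/{dataset}: mean score = {mean(scores):.2f}")
```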
Evidentially Calibrated Source-Free Time-Series Domain Adaptation with Temporal Imputation
Ragab, Mohamed, Gong, Peiliang, Eldele, Emadeldeen, Zhang, Wenyu, Wu, Min, Foo, Chuan-Sheng, Zhang, Daoqiang, Li, Xiaoli, Chen, Zhenghua
Source-free domain adaptation (SFDA) aims to adapt a model pre-trained on a labeled source domain to an unlabeled target domain without access to source data, preserving the source domain's privacy. While SFDA is prevalent in computer vision, it remains largely unexplored in time series analysis. Existing SFDA methods, designed for visual data, struggle to capture the inherent temporal dynamics of time series, hindering adaptation performance. This paper proposes MAsk And imPUte (MAPU), a novel and effective approach for time series SFDA. MAPU addresses the critical challenge of temporal consistency by introducing a novel temporal imputation task. This task involves randomly masking time series signals and leveraging a dedicated temporal imputer to recover the original signal within the learned embedding space, bypassing the complexities of noisy raw data. Notably, MAPU is the first method to explicitly address temporal consistency in the context of time series SFDA. Additionally, it offers seamless integration with existing SFDA methods, providing greater flexibility. We further introduce E-MAPU, which incorporates evidential uncertainty estimation to address the overconfidence issue inherent in softmax predictions. To achieve that, we leverage evidential deep learning to obtain a better-calibrated pre-trained model and adapt the target encoder to map out-of-support target samples to a new feature representation closer to the source domain's support. This fosters better alignment, ultimately enhancing adaptation performance. Extensive experiments on five real-world time series datasets demonstrate that both MAPU and E-MAPU achieve significant performance gains compared to existing methods. These results highlight the effectiveness of our proposed approaches for tackling various time series domain adaptation problems.
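The temporal-imputation idea can be sketched roughly as follows: randomly mask segments of a time series, encode both the masked and the clean signal, and train an imputer to recover the clean signal's embedding. The architectures, mask ratio, and shapes below are hypothetical stand-ins, not the released MAPU code.

```python
# Illustrative sketch of masking + temporal imputation in the learned embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.GRU(input_size=9, hidden_size=64, batch_first=True)   # shared feature encoder
imputer = nn.GRU(input_size=64, hidden_size=64, batch_first=True)  # dedicated temporal imputer

x = torch.randn(16, 128, 9)                      # (batch, time steps, channels), toy signals

# randomly mask a fraction of time steps
mask = (torch.rand(x.shape[:2]) < 0.125).unsqueeze(-1)
x_masked = x.masked_fill(mask, 0.0)

with torch.no_grad():
    target_emb, _ = encoder(x)                   # embedding of the clean signal (target)

masked_emb, _ = encoder(x_masked)                # embedding of the masked signal
recovered_emb, _ = imputer(masked_emb)           # imputer tries to recover the clean embedding

# imputation loss in the learned embedding space, bypassing noisy raw data
loss = F.mse_loss(recovered_emb, target_emb)
loss.backward()
```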
Source-Free Domain Adaptation Guided by Vision and Vision-Language Pre-Training
Zhang, Wenyu, Shen, Li, Foo, Chuan-Sheng
Source-free domain adaptation (SFDA) aims to adapt a source model trained on a fully-labeled source domain to a related but unlabeled target domain. While the source model is a key avenue for acquiring target pseudolabels, the generated pseudolabels may exhibit source bias. In the conventional SFDA pipeline, a feature extractor pre-trained on large-scale data (e.g., ImageNet) is used to initialize the source model at the start of source training and is subsequently discarded. Despite having diverse features important for generalization, the pre-trained feature extractor can overfit to the source data distribution during source training and forget relevant target domain knowledge. Rather than discarding this valuable knowledge, we introduce an integrated framework to incorporate pre-trained networks into the target adaptation process. The proposed framework is flexible and allows us to plug modern pre-trained networks into the adaptation process to leverage their stronger representation learning capabilities. For adaptation, we propose the Co-learn algorithm to improve target pseudolabel quality collaboratively through the source model and a pre-trained feature extractor. Building on the recent success of the vision-language model CLIP in zero-shot image recognition, we present an extension, Co-learn++, to further incorporate CLIP's zero-shot classification decisions. We evaluate on 3 benchmark datasets and include more challenging scenarios such as open-set, partial-set and open-partial SFDA. Experimental results demonstrate that our proposed strategy improves adaptation performance and can be successfully integrated with existing SFDA methods.
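A simplified sketch of the general idea behind collaborative pseudolabeling is shown below: combine predictions from the adapted source model with those of an auxiliary pretrained (or CLIP zero-shot) classifier and keep only target samples where the two agree confidently. This illustrates the concept only, with made-up thresholds and random tensors; it is not the actual Co-learn / Co-learn++ procedure.

```python
# Conceptual sketch: agreement-and-confidence filtering of pseudolabels from two predictors.
import torch
import torch.nn.functional as F

num_classes = 10
source_logits = torch.randn(32, num_classes)      # source model predictions on a target batch
aux_logits = torch.randn(32, num_classes)         # auxiliary pretrained / zero-shot predictions

p_src = F.softmax(source_logits, dim=1)
p_aux = F.softmax(aux_logits, dim=1)

pseudo_src = p_src.argmax(dim=1)
pseudo_aux = p_aux.argmax(dim=1)
confidence = torch.maximum(p_src.max(dim=1).values, p_aux.max(dim=1).values)

# accept a pseudolabel when both predictors agree and at least one is confident
accept = (pseudo_src == pseudo_aux) & (confidence > 0.8)
pseudolabels = pseudo_src[accept]
print(f"kept {accept.sum().item()} / 32 pseudolabeled samples")
```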