Luo, Yucong
A Survey on Knowledge-Oriented Retrieval-Augmented Generation
Cheng, Mingyue, Luo, Yucong, Ouyang, Jie, Liu, Qi, Liu, Huijie, Li, Li, Yu, Shuo, Zhang, Bohou, Cao, Jiawei, Ma, Jie, Wang, Daoyu, Chen, Enhong
Retrieval-Augmented Generation (RAG) has gained significant attention in recent years for its potential to enhance natural language understanding and generation by combining large-scale retrieval systems with generative models. RAG leverages external knowledge sources, such as documents, databases, or structured data, to improve model performance and generate more accurate and contextually relevant outputs. This survey aims to provide a comprehensive overview of RAG by examining its fundamental components, including retrieval mechanisms, generation processes, and the integration between the two. We discuss the key characteristics of RAG, such as its ability to augment generative models with dynamic external knowledge, and the challenges associated with aligning retrieved information with generative objectives. We also present a taxonomy that categorizes RAG methods, ranging from basic retrieval-augmented approaches to more advanced models incorporating multi-modal data and reasoning capabilities. Additionally, we review the evaluation benchmarks and datasets commonly used to assess RAG systems, along with a detailed exploration of its applications in fields such as question answering, summarization, and information retrieval. Finally, we highlight emerging research directions and opportunities for improving RAG systems, such as enhanced retrieval efficiency, model interpretability, and domain-specific adaptations. This paper concludes by outlining the prospects for RAG in addressing real-world challenges and its potential to drive further advancements in natural language processing.
HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation
Ouyang, Jie, Pan, Tingyue, Cheng, Mingyue, Yan, Ruiran, Luo, Yucong, Lin, Jiaying, Liu, Qi
While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures temporal knowledge evolution in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG.
TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization
Luo, Yucong, Cheng, Mingyue, Ouyang, Jie, Tao, Xiaoyu, Liu, Qi
Text-to-image generative models excel in creating images from text but struggle with ensuring alignment and consistency between outputs and prompts. This paper introduces TextMatch, a novel framework that leverages multimodal optimization to address image-text discrepancies in text-to-image (T2I) generation and editing. TextMatch employs a scoring strategy powered by large language models (LLMs) and visual question-answering (VQA) models to evaluate semantic consistency between prompts and generated images. By integrating multimodal in-context learning and chain of thought reasoning, our method dynamically refines prompts through iterative optimization. This process ensures that the generated images better capture user intent of, resulting in higher fidelity and relevance. Extensive experiments demonstrate that TextMatch significantly improves text-image consistency across multiple benchmarks, establishing a reliable framework for advancing the capabilities of text-to-image generative models. Our code is available at https://anonymous.4open.science/r/TextMatch-F55C/.
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation
Luo, Yucong, Qin, Qitao, Zhang, Hao, Cheng, Mingyue, Yan, Ruiran, Wang, Kefan, Ouyang, Jie
Sequential recommendation (SR) systems have evolved significantly over the past decade, transitioning from traditional collaborative filtering to deep learning approaches and, more recently, to large language models (LLMs). While the adoption of LLMs has driven substantial advancements, these models inherently lack collaborative filtering information, relying primarily on textual content data neglecting other modalities and thus failing to achieve optimal recommendation performance. To address this limitation, we propose Molar, a Multimodal large language sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively. Molar employs an MLLM to generate unified item representations from both textual and non-textual data, facilitating comprehensive multimodal modeling and enriching item embeddings. Additionally, it incorporates collaborative filtering signals through a post-alignment mechanism, which aligns user representations from content-based and ID-based models, ensuring precise personalization and robust performance. By seamlessly combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to superior recommendation accuracy. Extensive experiments validate that Molar significantly outperforms traditional and LLM-based baselines, highlighting its strength in utilizing multimodal data and collaborative signals for sequential recommendation tasks. The source code is available at https://anonymous.4open.science/r/Molar-8B06/.
Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study
Yu, Shuo, Cheng, Mingyue, Yang, Jiqian, Ouyang, Jie, Luo, Yucong, Lei, Chenyi, Liu, Qi, Chen, Enhong
Retrieval-augmented generation (RAG) is increasingly recognized as an effective approach for mitigating the hallucination of large language models (LLMs) through the integration of external knowledge. While numerous efforts, most studies focus on a single type of externeal knowledge source. However, in real-world applications, most situations involve diverse knowledge from various sources, yet this area has been less explored. The main dilemma is the lack of a suitable dataset containing multiple knowledge sources and pre-exploration of the associated issues. To address these challenges, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse and complementary domains. Based on this dataset, we further develop a plug-and-play RAG framework, PruningRAG, whose main characteristic is to employ multi-granularity pruning strategies for optimizing the integration of relevant information and minimizing misleading context. Building upon the standardized dataset and PruningRAG, we also report a series of experimental results, as well as insightful findings. Our dataset and code are publicly available\footnote{https://github.com/USTCAGI/PruningRAG}, with the aim of advancing future research in the RAG community.
Hierarchical Multimodal LLMs with Semantic Space Alignment for Enhanced Time Series Classification
Tao, Xiaoyu, Pan, Tingyue, Cheng, Mingyue, Luo, Yucong
Leveraging large language models (LLMs) has garnered increasing attention and introduced novel perspectives in time series classification. However, existing approaches often overlook the crucial dynamic temporal information inherent in time series data and face challenges in aligning this data with textual semantics. To address these limitations, we propose HiTime, a hierarchical multi-modal model that seamlessly integrates temporal information into LLMs for multivariate time series classification (MTSC). Our model employs a hierarchical feature encoder to capture diverse aspects of time series data through both data-specific and task-specific embeddings. To facilitate semantic space alignment between time series and text, we introduce a dual-view contrastive alignment module that bridges the gap between modalities. Additionally, we adopt a hybrid prompting strategy to fine-tune the pre-trained LLM in a parameter-efficient manner. By effectively incorporating dynamic temporal features and ensuring semantic alignment, HiTime enables LLMs to process continuous time series data and achieves state-of-the-art classification performance through text generation. Extensive experiments on benchmark datasets demonstrate that HiTime significantly enhances time series classification accuracy compared to most competitive baseline methods. Our findings highlight the potential of integrating temporal features into LLMs, paving the way for advanced time series analysis. The code is publicly available for further research and validation. Our codes are publicly available1.
Revisiting the Solution of Meta KDD Cup 2024: CRAG
Ouyang, Jie, Luo, Yucong, Cheng, Mingyue, Wang, Daoyu, Yu, Shuo, Liu, Qi, Chen, Enhong
This paper presents the solution of our team APEX in the Meta KDD CUP 2024: CRAG Comprehensive RAG Benchmark Challenge. The CRAG benchmark addresses the limitations of existing QA benchmarks in evaluating the diverse and dynamic challenges faced by Retrieval-Augmented Generation (RAG) systems. It provides a more comprehensive assessment of RAG performance and contributes to advancing research in this field. We propose a routing-based domain and dynamic adaptive RAG pipeline, which performs specific processing for the diverse and dynamic nature of the question in all three stages: retrieval, augmentation, and generation. Our method achieved superior performance on CRAG and ranked 2nd for Task 2&3 on the final competition leaderboard. Our implementation is available at this link: https://github.com/USTCAGI/CRAG-in-KDD-Cup2024.
Advancing Time Series Classification with Multimodal Language Modeling
Cheng, Mingyue, Chen, Yiheng, Liu, Qi, Liu, Zhiding, Luo, Yucong
For the advancements of time series classification, scrutinizing previous studies, most existing methods adopt a common learning-to-classify paradigm - a time series classifier model tries to learn the relation between sequence inputs and target label encoded by one-hot distribution. Although effective, this paradigm conceals two inherent limitations: (1) encoding target categories with one-hot distribution fails to reflect the comparability and similarity between labels, and (2) it is very difficult to learn transferable model across domains, which greatly hinder the development of universal serving paradigm. In this work, we propose InstructTime, a novel attempt to reshape time series classification as a learning-to-generate paradigm. Relying on the powerful generative capacity of the pre-trained language model, the core idea is to formulate the classification of time series as a multimodal understanding task, in which both task-specific instructions and raw time series are treated as multimodal inputs while the label information is represented by texts. To accomplish this goal, three distinct designs are developed in the InstructTime. Firstly, a time series discretization module is designed to convert continuous time series into a sequence of hard tokens to solve the inconsistency issue across modal inputs. To solve the modality representation gap issue, for one thing, we introduce an alignment projected layer before feeding the transformed token of time series into language models. For another, we highlight the necessity of auto-regressive pre-training across domains, which can facilitate the transferability of the language model and boost the generalization performance. Extensive experiments are conducted over benchmark datasets, whose results uncover the superior performance of InstructTime and the potential for a universal foundation model in time series classification.
Unlocking the Potential of Large Language Models for Explainable Recommendations
Luo, Yucong, Cheng, Mingyue, Zhang, Hao, Lu, Junyu, Liu, Qi, Chen, Enhong
Generating user-friendly explanations regarding why an item is recommended has become increasingly common, largely due to advances in language generation technology, which can enhance user trust and facilitate more informed decision-making when using online services. However, existing explainable recommendation systems focus on using small-size language models. It remains uncertain what impact replacing the explanation generator with the recently emerging large language models (LLMs) would have. Can we expect unprecedented results? In this study, we propose LLMXRec, a simple yet effective two-stage explainable recommendation framework aimed at further boosting the explanation quality by employing LLMs. Unlike most existing LLM-based recommendation works, a key characteristic of LLMXRec is its emphasis on the close collaboration between previous recommender models and LLM-based explanation generators. Specifically, by adopting several key fine-tuning techniques, including parameter-efficient instructing tuning and personalized prompt techniques, controllable and fluent explanations can be well generated to achieve the goal of explanation recommendation. Most notably, we provide three different perspectives to evaluate the effectiveness of the explanations. Finally, we conduct extensive experiments over several benchmark recommender models and publicly available datasets. The experimental results not only yield positive results in terms of effectiveness and efficiency but also uncover some previously unknown outcomes. To facilitate further explorations in this area, the full code and detailed original results are open-sourced at https://github.com/GodFire66666/LLM_rec_explanation/.
FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification
Cheng, Mingyue, Liu, Qi, Liu, Zhiding, Li, Zhi, Luo, Yucong, Chen, Enhong
Deep learning-based algorithms, e.g., convolutional networks, have significantly facilitated multivariate time series classification (MTSC) task. Nevertheless, they suffer from the limitation in modeling long-range dependence due to the nature of convolution operations. Recent advancements have shown the potential of transformers to capture long-range dependence. However, it would incur severe issues, such as fixed scale representations, temporal-invariant and quadratic time complexity, with transformers directly applicable to the MTSC task because of the distinct properties of time series data. To tackle these issues, we propose FormerTime, an hierarchical representation model for improving the classification capacity for the MTSC task. In the proposed FormerTime, we employ a hierarchical network architecture to perform multi-scale feature maps. Besides, a novel transformer encoder is further designed, in which an efficient temporal reduction attention layer and a well-informed contextual positional encoding generating strategy are developed. To sum up, FormerTime exhibits three aspects of merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strength of both transformers and convolutional networks, and (3) tacking the efficiency challenges incurred by the self-attention mechanism. Extensive experiments performed on $10$ publicly available datasets from UEA archive verify the superiorities of the FormerTime compared to previous competitive baselines.