Information Retrieval
Multimodal semantic retrieval for product search
Liu, Dong, Ramos, Esther Lopez
Semantic retrieval (also known as dense retrieval) based on textual data has been extensively studied for both web search and product search application fields, where the relevance of a query and a potential target document is computed by their dense vector representation comparison. Product image is crucial for e-commence search interactions and is a key factor for customers at product explorations. But its impact for semantic retrieval has not been well studied yet. In this research, we build a multimodal representation for product items in e-commerece search in contrast to pure-text representation of products, and investigate the impact of such representations. The models are developed and evaluated on e-commerce datasets. We demonstrate that a multimodal representation scheme for a product can show improvement either on purchase recall or relevance accuracy in semantic retrieval. Additionally, we provide numerical analysis for exclusive matches retrieved by a multimodal semantic retrieval model versus a text-only semantic retrieval model, to demonstrate the validation of multimodal solutions.
A Proposed Large Language Model-Based Smart Search for Archive System
Nguyen, Ha Dung, Nguyen, Thi-Hoang Anh, Nguyen, Thanh Binh
This study presents a novel framework for smart search in digital archival systems, leveraging the capabilities of Large Language Models (LLMs) to enhance information retrieval. By employing a Retrieval-Augmented Generation (RAG) approach, the framework enables the processing of natural language queries and transforming non-textual data into meaningful textual representations. The system integrates advanced metadata generation techniques, a hybrid retrieval mechanism, a router query engine, and robust response synthesis, the results proved search precision and relevance. We present the architecture and implementation of the system and evaluate its performance in four experiments concerning LLM efficiency, hybrid retrieval optimizations, multilingual query handling, and the impacts of individual components. Obtained results show significant improvements over conventional approaches and have demonstrated the potential of AI-powered systems to transform modern archival practices.
Causal Claims in Economics
Garg, Prashant, Fetzer, Thiemo
We analyze over 44,000 NBER and CEPR working papers from 1980 to 2023 using a custom language model to construct knowledge graphs that map economic concepts and their relationships. We distinguish between general claims and those documented via causal inference methods (e.g., DiD, IV, RDD, RCTs). We document a substantial rise in the share of causal claims-from roughly 4% in 1990 to nearly 28% in 2020-reflecting the growing influence of the "credibility revolution." We find that causal narrative complexity (e.g., the depth of causal chains) strongly predicts both publication in top-5 journals and higher citation counts, whereas non-causal complexity tends to be uncorrelated or negatively associated with these outcomes. Novelty is also pivotal for top-5 publication, but only when grounded in credible causal methods: introducing genuinely new causal edges or paths markedly increases both the likelihood of acceptance at leading outlets and long-run citations, while non-causal novelty exhibits weak or even negative effects. Papers engaging with central, widely recognized concepts tend to attract more citations, highlighting a divergence between factors driving publication success and long-term academic impact. Finally, bridging underexplored concept pairs is rewarded primarily when grounded in causal methods, yet such gap filling exhibits no consistent link with future citations. Overall, our findings suggest that methodological rigor and causal innovation are key drivers of academic recognition, but sustained impact may require balancing novel contributions with conceptual integration into established economic discourse.
Human-inspired Perspectives: A Survey on AI Long-term Memory
He, Zihong, Lin, Weizhe, Zheng, Hao, Zhang, Fan, Jones, Matt W., Aitchison, Laurence, Xu, Xuhai, Liu, Miao, Kristensson, Per Ola, Shen, Junxiao
With the rapid advancement of AI systems, their abilities to store, retrieve, and utilize information over the long term - referred to as long-term memory - have become increasingly significant. These capabilities are crucial for enhancing the performance of AI systems across a wide range of tasks. However, there is currently no comprehensive survey that systematically investigates AI's long-term memory capabilities, formulates a theoretical framework, and inspires the development of next-generation AI long-term memory systems. This paper begins by introducing the mechanisms of human long-term memory, then explores AI long-term memory mechanisms, establishing a mapping between the two. Based on the mapping relationships identified, we extend the current cognitive architectures and propose the Cognitive Architecture of Self-Adaptive Long-term Memory (SALM). SALM provides a theoretical framework for the practice of AI long-term memory and holds potential for guiding the creation of next-generation long-term memory driven AI systems. Finally, we delve into the future directions and application prospects of AI long-term memory.
Large Language Models, Knowledge Graphs and Search Engines: A Crossroads for Answering Users' Questions
Hogan, Aidan, Dong, Xin Luna, Vrandeฤiฤ, Denny, Weikum, Gerhard
Much has been discussed about how Large Language Models, Knowledge Graphs and Search Engines can be combined in a synergistic manner. A dimension largely absent from current academic discourse is the user perspective. In particular, there remain many open questions regarding how best to address the diverse information needs of users, incorporating varying facets and levels of difficulty. This paper introduces a taxonomy of user information needs, which guides us to study the pros, cons and possible synergies of Large Language Models, Knowledge Graphs and Search Engines. From this study, we derive a roadmap for future research.
Quantifying Relational Exploration in Cultural Heritage Knowledge Graphs with LLMs: A Neuro-Symbolic Approach
This paper introduces a neuro-symbolic approach for relational exploration in cultural heritage knowledge graphs, leveraging Large Language Models (LLMs) for explanation generation and a novel mathematical framework to quantify the interestingness of relationships. We demonstrate the importance of interestingness measure using a quantitative analysis, by highlighting its impact on the overall performance of our proposed system, particularly in terms of precision, recall, and F1-score. Using the Wikidata Cultural Heritage Linked Open Data (WCH-LOD) dataset, our approach yields a precision of 0.70, recall of 0.68, and an F1-score of 0.69, representing an improvement compared to graph-based (precision: 0.28, recall: 0.25, F1-score: 0.26) and knowledge-based baselines (precision: 0.45, recall: 0.42, F1-score: 0.43). Furthermore, our LLM-powered explanations exhibit better quality, reflected in BLEU (0.52), ROUGE-L (0.58), and METEOR (0.63) scores, all higher than the baseline approaches. We show a strong correlation (0.65) between interestingness measure and the quality of generated explanations, validating its effectiveness. The findings highlight the importance of LLMs and a mathematical formalization for interestingness in enhancing the effectiveness of relational exploration in cultural heritage knowledge graphs, with results that are measurable and testable. We further show that the system enables more effective exploration compared to purely knowledge-based and graph-based methods. Keywords Knowledge Graphs, Large Language Models (LLMs), Explainable AI (XAI), Cultural Heritage, Neuro-Symbolic AI, Interestingness Score, Contextual Relevance, Relational Search 1. Introduction The digitization of cultural heritage artifacts and historical records has generated a vast amount of knowledge encoded in the form of interconnected knowledge graphs (KGs) [1, 2]. Unlocking meaningful insights from these KGs requires more than simple keyword searches [3].
GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian
Dorkin, Aleksei, Sirts, Kairit
Effective lemmatization enhances various downstream NLP We present GliLem--a novel hybrid tasks, including information retrieval based on lexical lemmatization system for Estonian that search and text analysis. Although dense vector enhances the highly accurate rule-based retrieval is gaining traction in information retrieval, morphological analyzer Vabamorf with an lexical search methods remain highly relevant, external disambiguation module based on particularly in modern hybrid systems. Lexical GliNER--an open vocabulary NER model search excels as a first-stage retriever due to its that is able to match text spans with text labels efficiency with inverted indices, and provides reliable in natural language. We leverage the exact term matching that dense retrievers may flexibility of a pre-trained GliNER model miss (Gao et al., 2021). Recent research demonstrates to improve the lemmatization accuracy of that lexical and dense retrieval are complementary, Vabamorf by 10% compared to its original lexical matching providing a strong foundation disambiguation module and achieve an for precise word-level matches, while dense improvement over the token classificationbased retrieval captures semantic relationships and handles baseline. To measure the impact vocabulary mismatches. The complementary of improvements in lemmatization accuracy nature of these approaches has led to state-of-theart on the information retrieval downstream hybrid systems that outperform either method task, we first created an information alone (Lee et al., 2023).
Retrieval-Augmented Generation with Graphs (GraphRAG)
Han, Haoyu, Wang, Yu, Shomer, Harry, Guo, Kai, Ding, Jiayuan, Lei, Yongjia, Halappanavar, Mahantesh, Rossi, Ryan A., Mukherjee, Subhabrata, Tang, Xianfeng, He, Qi, Hua, Zhigang, Long, Bo, Zhao, Tong, Shah, Neil, Javari, Amin, Xia, Yinglong, Tang, Jiliang
Retrieval-augmented generation (RAG) is a powerful technique that enhances downstream task execution by retrieving additional information, such as knowledge, skills, and tools from external sources. Graph, by its intrinsic "nodes connected by edges" nature, encodes massive heterogeneous and relational information, making it a golden resource for RAG in tremendous real-world applications. As a result, we have recently witnessed increasing attention on equipping RAG with Graph, i.e., GraphRAG. However, unlike conventional RAG, where the retriever, generator, and external data sources can be uniformly designed in the neural-embedding space, the uniqueness of graph-structured data, such as diverse-formatted and domain-specific relational knowledge, poses unique and significant challenges when designing GraphRAG for different domains. Given the broad applicability, the associated design challenges, and the recent surge in GraphRAG, a systematic and up-to-date survey of its key concepts and techniques is urgently desired. Following this motivation, we present a comprehensive and up-to-date survey on GraphRAG. Our survey first proposes a holistic GraphRAG framework by defining its key components, including query processor, retriever, organizer, generator, and data source. Furthermore, recognizing that graphs in different domains exhibit distinct relational patterns and require dedicated designs, we review GraphRAG techniques uniquely tailored to each domain. Finally, we discuss research challenges and brainstorm directions to inspire cross-disciplinary opportunities.
TACLR: A Scalable and Efficient Retrieval-based Method for Industrial Product Attribute Value Identification
Su, Yindu, Zou, Huike, Sun, Lin, Zhang, Ting, Yang, Haiyang, Chen, Liyu, Lo, David, Zhang, Qingheng, Han, Shuguang, Chen, Jufeng
Product Attribute Value Identification (PAVI) involves identifying attribute values from product profiles, a key task for improving product search, recommendations, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as inferring implicit values, handling out-of-distribution (OOD) values, and producing normalized outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR formulates PAVI as an information retrieval task by encoding product profiles and candidate values into embeddings and retrieving values based on their similarity to the item embedding. It leverages contrastive training with taxonomy-aware hard negative sampling and employs adaptive inference with dynamic thresholds. TACLR offers three key advantages: (1) it effectively handles implicit and OOD values while producing normalized outputs; (2) it scales to thousands of categories, tens of thousands of attributes, and millions of values; and (3) it supports efficient inference for high-load industrial scenarios. Extensive experiments on proprietary and public datasets validate the effectiveness and efficiency of TACLR. Moreover, it has been successfully deployed in a real-world e-commerce platform, processing millions of product listings daily while supporting dynamic, large-scale attribute taxonomies.
Multilingual Open QA on the MIA Shared Task
Yarrabelly, Navya, Mittal, Saloni, Todi, Ketan, Hasegawa, Kimihiro
Cross-lingual information retrieval (CLIR) ~\cite{shi2021cross, asai2021one, jiang2020cross} for example, can find relevant text in any language such as English(high resource) or Telugu (low resource) even when the query is posed in a different, possibly low-resource, language. In this work, we aim to develop useful CLIR models for this constrained, yet important, setting where we do not require any kind of additional supervision or labelled data for retrieval task and hence can work effectively for low-resource languages. \par We propose a simple and effective re-ranking method for improving passage retrieval in open question answering. The re-ranker re-scores retrieved passages with a zero-shot multilingual question generation model, which is a pre-trained language model, to compute the probability of the input question in the target language conditioned on a retrieved passage, which can be possibly in a different language. We evaluate our method in a completely zero shot setting and doesn't require any training. Thus the main advantage of our method is that our approach can be used to re-rank results obtained by any sparse retrieval methods like BM-25. This eliminates the need for obtaining expensive labelled corpus required for the retrieval tasks and hence can be used for low resource languages.