Information Retrieval
Improving Retrieval in Theme-specific Applications using a Corpus Topical Taxonomy
Kang, SeongKu, Agarwal, Shivam, Jin, Bowen, Lee, Dongha, Yu, Hwanjo, Han, Jiawei
Document retrieval has greatly benefited from the advancements of large-scale pre-trained language models (PLMs). However, their effectiveness is often limited in theme-specific applications for specialized areas or industries, due to unique terminologies, incomplete contexts of user queries, and specialized search intents. To capture the theme-specific information and improve retrieval, we propose to use a corpus topical taxonomy, which outlines the latent topic structure of the corpus while reflecting the aspects users are interested in. We introduce the ToTER (Topical Taxonomy Enhanced Retrieval) framework, which identifies the central topics of queries and documents with the guidance of the taxonomy, and exploits their topical relatedness to supplement missing contexts. As a plug-and-play framework, ToTER can be flexibly employed to enhance various PLM-based retrievers. Through extensive quantitative, ablative, and exploratory experiments on two real-world datasets, we ascertain the benefits of using topical taxonomy for retrieval in theme-specific applications and demonstrate the effectiveness of ToTER.
Backtracing: Retrieving the Cause of the Query
Wang, Rose E., Wirawarn, Pawan, Khattab, Omar, Goodman, Noah, Demszky, Dorottya
Many online content portals allow users to ask questions to supplement their understanding (e.g., of lectures). While information retrieval (IR) systems may provide answers for such user queries, they do not directly assist content creators -- such as lecturers who want to improve their content -- identify segments that _caused_ a user to ask those questions. We introduce the task of backtracing, in which systems retrieve the text segment that most likely caused a user query. We formalize three real-world domains for which backtracing is important in improving content delivery and communication: understanding the cause of (a) student confusion in the Lecture domain, (b) reader curiosity in the News Article domain, and (c) user emotion in the Conversation domain. We evaluate the zero-shot performance of popular information retrieval methods and language modeling methods, including bi-encoder, re-ranking and likelihood-based methods and ChatGPT. While traditional IR systems retrieve semantically relevant information (e.g., details on "projection matrices" for a query "does projecting multiple times still lead to the same point?"), they often miss the causally relevant context (e.g., the lecturer states "projecting twice gets me the same answer as one projection"). Our results show that there is room for improvement on backtracing and it requires new retrieval approaches. We hope our benchmark serves to improve future retrieval systems for backtracing, spawning systems that refine content generation and identify linguistic triggers influencing user queries. Our code and data are open-sourced: https://github.com/rosewang2008/backtracing.
Artificial Intelligence Exploring the Patent Field
Advanced language-processing and machine-learning techniques promise massive efficiency improvements in the previously largely manual field of patent and technical knowledge management. This field presents large-scale and complex data with very precise contents and language representation of those contents. In particular, patent texts can differ from everyday texts in various aspects, which entails significant opportunities and challenges. This paper presents a systematic overview of patent-related tasks and popular methodologies with a special focus on evolving and promising techniques. Language processing and particularly large language models as well as the recent boost of general generative methods promise to become game changers in the patent field. The patent literature and the fact-based argumentative procedures around patents appear almost as an ideal use case. However, patents entail a number of difficulties with which existing models struggle. The paper introduces fundamental aspects of patents and patent-related data that affect technology that aims to explore or manage them. It further reviews existing methods and approaches and points out how important reliable and unbiased evaluation metrics become. Although research has made substantial progress on certain tasks, the performance across many others remains suboptimal, sometimes because of either the special nature of patents and their language or inconsistencies between legal terms and the everyday meaning of terms. Moreover, few methods have yet demonstrated the ability to produce satisfactory text for specific sections of patents. By pointing out key developments, opportunities, and gaps, we aim to encourage further research and accelerate the advancement of this field.
Pfeed: Generating near real-time personalized feeds using precomputed embedding similarities
Gebre, Binyam, Ranta, Karoliina, Elzen, Stef van den, Kuiper, Ernst, Baars, Thijs, Heskes, Tom
In personalized recommender systems, embeddings are often used to encode customer actions and items, and retrieval is then performed in the embedding space using approximate nearest neighbor search. However, this approach can lead to two challenges: 1) user embeddings can restrict the diversity of interests captured and 2) the need to keep them up-to-date requires an expensive, real-time infrastructure. In this paper, we propose a method that overcomes these challenges in a practical, industrial setting. The method dynamically updates customer profiles and composes a feed every two minutes, employing precomputed embeddings and their respective similarities. We tested and deployed this method to personalise promotional items at Bol, one of the largest e-commerce platforms of the Netherlands and Belgium. The method enhanced customer engagement and experience, leading to a significant 4.9% uplift in conversions.
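The idea of composing a feed from precomputed similarities can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that precomputed embeddings and their similarities are used, so the neighbor table, the item names, the summed-similarity scoring rule, and the function `compose_feed` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical precomputed item-to-item similarities:
# item -> list of (neighbor item, similarity score). Illustrative data only.
PRECOMPUTED_NEIGHBORS = {
    "tent": [("sleeping_bag", 0.9), ("camping_stove", 0.8)],
    "novel": [("bookmark", 0.7), ("reading_lamp", 0.6)],
}


def compose_feed(recent_items, neighbors, k=3):
    """Rank candidates by their summed similarity to recently viewed items.

    No embedding model is evaluated at request time; only table lookups
    into the precomputed neighbor lists are performed.
    """
    scores = defaultdict(float)
    seen = set(recent_items)
    for item in recent_items:
        for candidate, similarity in neighbors.get(item, []):
            if candidate not in seen:  # do not recommend what was just viewed
                scores[candidate] += similarity
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [candidate for candidate, _ in ranked[:k]]


feed = compose_feed(["tent", "novel"], PRECOMPUTED_NEIGHBORS)
print(feed)
```

Because the customer profile here is just a list of recent items, several distinct interests (camping and reading in this toy data) can appear in one feed, and refreshing the feed only requires re-running the lookup with an updated list.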
Reliable, Adaptable, and Attributable Language Models with Retrieval
Asai, Akari, Zhong, Zexuan, Chen, Danqi, Koh, Pang Wei, Zettlemoyer, Luke, Hajishirzi, Hannaneh, Yih, Wen-tau
Parametric language models (LMs), which are trained on vast amounts of web data, exhibit remarkable flexibility and capability. However, they still face practical challenges such as hallucinations, difficulty in adapting to new data distributions, and a lack of verifiability. In this position paper, we advocate for retrieval-augmented LMs to replace parametric LMs as the next generation of LMs. By incorporating large-scale datastores during inference, retrieval-augmented LMs can be more reliable, adaptable, and attributable. Despite their potential, retrieval-augmented LMs have yet to be widely adopted due to several obstacles: specifically, current retrieval-augmented LMs struggle to leverage helpful text beyond knowledge-intensive tasks such as question answering, have limited interaction between retrieval and LM components, and lack the infrastructure for scaling. To address these, we propose a roadmap for developing general-purpose retrieval-augmented LMs. This involves a reconsideration of datastores and retrievers, the exploration of pipelines with improved retriever-LM interaction, and significant investment in infrastructure for efficient training and inference.
LLM vs. Lawyers: Identifying a Subset of Summary Judgments in a Large UK Case Law Dataset
Izzidien, Ahmed, Sargeant, Holli, Steffek, Felix
To undertake computational research of the law, efficiently identifying datasets of court decisions that relate to a specific legal issue is a crucial yet challenging endeavour. This study addresses a gap in the literature on large legal corpora: how to isolate a particular class of cases, here summary judgments, from a large corpus of UK court decisions. We introduce a comparative analysis of two computational methods: (1) a traditional natural language processing-based approach leveraging expert-generated keywords and logical operators and (2) an innovative application of the Claude 2 large language model to classify cases based on content-specific prompts. We use the Cambridge Law Corpus of 356,011 UK court decisions and determine that the large language model achieves a weighted F1 score of 0.94 versus 0.78 for keywords. Despite iterative refinement, the search logic based on keywords fails to capture nuances in legal language. We identify and extract 3,102 summary judgment cases, enabling us to map their distribution across various UK courts over a temporal span. The paper marks a pioneering step in employing advanced natural language processing to tackle core legal research tasks, demonstrating how these technologies can bridge systemic gaps and enhance the accessibility of legal information. We share the extracted dataset metrics to support further research on summary judgments.
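For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a weighted F1 score is commonly computed: the per-class F1 scores are averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` and the toy labels (1 for summary judgment, 0 otherwise) are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * n / total  # weight by the class's share of true labels
    return score


# Toy labels (hypothetical): 1 = summary judgment, 0 = other decision.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support matters when the target class is rare, as it is likely to be when a few thousand cases are drawn from a corpus of several hundred thousand decisions.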
Search Intenion Network for Personalized Query Auto-Completion in E-Commerce
Bao, Wei, Zhang, Mi, Zhang, Tao, Huo, Chengfu
Query Auto-Completion (QAC), as an important part of the modern search engine, plays a key role in complementing user queries and helping them refine their search intentions. Today's QAC systems in real-world scenarios face two major challenges: 1) intention equivocality (IE): during the user's typing process, the prefix often contains a combination of characters and subwords, which makes the current intention ambiguous and difficult to model. 2) intention transfer (IT): previous works make personalized recommendations
Generally, in search engines, traditional QAC system follows a two-stage method: matching and ranking. In the matching phase, a sufficient number of candidate queries matching the prefix are recalled from the history log. In the ranking stage, the candidate historical frequency features [3, 24, 45] and semantic features [22, 32, 44] are used to obtain the final list ranking order. Finally, due to the limitation of display space, several top ranked candidates will be provided to users.
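The traditional two-stage method of matching and ranking described in this entry can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's model: ranking uses historical frequency only (semantic features are omitted), and the function `complete` and the example history log are hypothetical.

```python
from collections import Counter


def complete(prefix, history_log, k=3):
    """Two-stage query auto-completion over a log of past queries."""
    frequency = Counter(history_log)
    # Stage 1 (matching): recall logged queries that start with the prefix.
    candidates = [query for query in frequency if query.startswith(prefix)]
    # Stage 2 (ranking): order by historical frequency, ties alphabetically.
    candidates.sort(key=lambda query: (-frequency[query], query))
    # Display space is limited, so only the top-k candidates are returned.
    return candidates[:k]


# Hypothetical history log of past queries.
log = [
    "iphone case", "iphone charger", "iphone case", "ipad",
    "iphone 15", "iphone case", "iphone charger",
]
print(complete("iph", log, k=2))
```

A short prefix such as "ip" matches both "ipad" and the "iphone" queries in this log, which gives a small-scale picture of the ambiguity that the entry calls intention equivocality.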
Query Augmentation by Decoding Semantics from Brain Signals
Ye, Ziyi, Zhan, Jingtao, Ai, Qingyao, Liu, Yiqun, de Rijke, Maarten, Lioma, Christina, Ruotsalo, Tuukka
Query augmentation is a crucial technique for refining semantically imprecise queries. Traditionally, query augmentation relies on extracting information from initially retrieved, potentially relevant documents. If the quality of the initially retrieved documents is low, then the effectiveness of query augmentation would be limited as well. We propose Brain-Aug, which enhances a query by incorporating semantic information decoded from brain signals. Brain-Aug generates the continuation of the original query with a prompt constructed with brain signal information and a ranking-oriented inference approach. Experimental results on fMRI (functional magnetic resonance imaging) datasets show that Brain-Aug produces semantically more accurate queries, leading to improved document ranking performance. Such improvement brought by brain signals is particularly notable for ambiguous queries.
HearHere: Mitigating Echo Chambers in News Consumption through an AI-based Web System
Jeon, Youngseung, Kim, Jaehoon, Park, Sohyun, Ko, Yunyong, Ryu, Seongeun, Kim, Sang-Wook, Han, Kyungsik
This practice can lead to more rational decision-making that is not heavily influenced by specific opinions or positions [12, 22, 23]. As the Internet is a primary source of information for many people and the volume of online information is immense, effectively helping people consume and share information from diverse perspectives is necessary but challenging [57, 93]. Researchers have proposed various support methods for this, including the development and use of computer technology. In particular, artificial intelligence (AI)-based recommendation systems have been designed to support efficient information consumption by learning users' demographic characteristics or online activity patterns and providing tailored information based on their preferences [77]. Although computer technology plays an important role in enabling people to access and share online information, it should be noted that providing information solely based on individuals' preferences and tendencies can inadvertently contribute to the formation of echo chambers [77], a phenomenon where individuals are exposed primarily to like-minded groups or information, leading to a reinforcement of shared narratives [28]. Research has shown that echo chambers can have many negative outcomes, including the creation and dissemination of biased information [77], increased susceptibility to fake news [8, 27], resistance towards accepting scientific evidence [63], and the adoption of unbalanced perspectives [36]. To prevent users from becoming polarized towards a specific political stance, many studies have proposed the use of computer-based tools designed to present information from diverse perspectives [31, 48, 53, 62].
Crafting Knowledge: Exploring the Creative Mechanisms of Chat-Based Search Engines
Ma, Lijia, Xu, Xingchen, Tan, Yong
In the domain of digital information dissemination, search engines act as pivotal conduits linking information seekers with providers. The advent of chat-based search engines utilizing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG), exemplified by Bing Chat, marks an evolutionary leap in the search ecosystem. They demonstrate metacognitive abilities in interpreting web information and crafting responses with human-like understanding and creativity. Nonetheless, the intricate nature of LLMs renders their "cognitive" processes opaque, challenging even their designers' understanding. This research aims to dissect the mechanisms through which an LLM-powered chat-based search engine, specifically Bing Chat, selects information sources for its responses. To this end, an extensive dataset has been compiled through engagements with New Bing, documenting the websites it cites alongside those listed by the conventional search engine. Employing natural language processing (NLP) techniques, the research reveals that Bing Chat exhibits a preference for content that is not only readable and formally structured, but also demonstrates lower perplexity levels, indicating a unique inclination towards text that is predictable by the underlying LLM. Further enriching our analysis, we procure an additional dataset through interactions with the GPT-4 based knowledge retrieval API, unveiling a congruent text preference between the RAG API and Bing Chat. This consensus suggests that these text preferences intrinsically emerge from the underlying language models, rather than being explicitly crafted by Bing Chat's developers. Moreover, our investigation documents a greater similarity among websites cited by RAG technologies compared to those ranked highest by conventional search engines.
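Perplexity, the measure of predictability referred to above, is conventionally defined as the exponential of the average negative log-likelihood per token. The sketch below assumes that per-token natural-log probabilities have already been obtained from some language model; the function name `perplexity` and the example probabilities are illustrative and do not come from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-probability per token).

    token_logprobs: natural-log probabilities a language model assigned
    to each token of a text (assumed to be given).
    """
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# A text whose tokens each received probability 0.9 is more predictable
# (lower perplexity) than one whose tokens each received probability 0.1.
predictable = [math.log(0.9)] * 5
surprising = [math.log(0.1)] * 5
print(perplexity(predictable) < perplexity(surprising))
```

Under this definition, a text in which every token is assigned probability 0.5 has a perplexity of 2, so lower values indicate text that the model finds easier to predict.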