Information Retrieval
Numbers Matter! Bringing Quantity-awareness to Retrieval Systems
Almasian, Satya, Bruseva, Milena, Gertz, Michael
Quantitative information plays a crucial role in understanding and interpreting the content of documents. Many user queries contain quantities and cannot be resolved without understanding their semantics, e.g., ``car that costs less than $10k''. Yet, modern search engines apply the same ranking mechanisms for both words and quantities, overlooking magnitude and unit information. In this paper, we introduce two quantity-aware ranking techniques designed to rank both the quantity and textual content either jointly or independently. These techniques incorporate quantity information in available retrieval systems and can address queries with numerical conditions equal, greater than, and less than. To evaluate the effectiveness of our proposed models, we introduce two novel quantity-aware benchmark datasets in the domains of finance and medicine and compare our method against various lexical and neural models. The code and data are available under https://github.com/satya77/QuantityAwareRankers.
Comparing Complex Concepts with Transformers: Matching Patent Claims Against Natural Language Text
Blume, Matthias, Heidari, Ghobad, Hewel, Christoph
An entity defending itself against infringement may attempt to A key capability in managing patent applications or a patent invalidate a patent by finding novelty-destroying prior art to that portfolio is comparing claims to other text, e.g. a patent patent. In all cases, the key task is to search through a set of specification. Because the language of claims is different from documents and determine whether those documents cover all language used elsewhere in the patent application or in non-patent aspects of each claim of the subject patent application or granted text, this has been challenging for computer based natural patent. Thus, a claim of a subject patent (application) may be language processing. We test two new LLM-based approaches considered a query to an information retrieval system whose and find that both provide substantially better performance than objective is to retrieve a document or set of documents that previously published values. The ability to match dense contain all aspects of that claim.
Position: Measure Dataset Diversity, Don't Just Claim It
Zhao, Dora, Andrews, Jerone T. A., Papakyriakopoulos, Orestis, Xiang, Alice
Machine learning (ML) datasets, often perceived as neutral, inherently encapsulate abstract and disputed social constructs. Dataset curators frequently employ value-laden terms such as diversity, bias, and quality to characterize datasets. Despite their prevalence, these terms lack clear definitions and validation. Our research explores the implications of this issue by analyzing "diversity" across 135 image and text datasets. Drawing from social sciences, we apply principles from measurement theory to identify considerations and offer recommendations for conceptualizing, operationalizing, and evaluating diversity in datasets. Our findings have broader implications for ML research, advocating for a more nuanced and precise approach to handling value-laden properties in dataset construction.
Automatic Generation of Web Censorship Probe Lists
Tang, Jenny, Alvarez, Leo, Brar, Arjun, Hoang, Nguyen Phong, Christin, Nicolas
Domain probe lists--used to determine which URLs to probe for Web censorship--play a critical role in Internet censorship measurement studies. Indeed, the size and accuracy of the domain probe list limits the set of censored pages that can be detected; inaccurate lists can lead to an incomplete view of the censorship landscape or biased results. Previous efforts to generate domain probe lists have been mostly manual or crowdsourced. This approach is time-consuming, prone to errors, and does not scale well to the ever-changing censorship landscape. In this paper, we explore methods for automatically generating probe lists that are both comprehensive and up-to-date for Web censorship measurement. We start from an initial set of 139,957 unique URLs from various existing test lists consisting of pages from a variety of languages to generate new candidate pages. By analyzing content from these URLs (i.e., performing topic and keyword extraction), expanding these topics, and using them as a feed to search engines, our method produces 119,255 new URLs across 35,147 domains. We then test the new candidate pages by attempting to access each URL from servers in eleven different global locations over a span of four months to check for their connectivity and potential signs of censorship. Our measurements reveal that our method discovered over 1,400 domains--not present in the original dataset--we suspect to be blocked. In short, automatically updating probe lists is possible, and can help further automate censorship measurements at scale.
Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans
Machine Learning (ML) is increasingly used to automate impactful decisions, which leads to concerns regarding their correctness, reliability, and fairness. We envision highly-automated software platforms to assist data scientists with developing, validating, monitoring, and analysing their ML pipelines. In contrast to existing work, our key idea is to extract "logical query plans" from ML pipeline code relying on popular libraries. Based on these plans, we automatically infer pipeline semantics and instrument and rewrite the ML pipelines to enable diverse use cases without requiring data scientists to manually annotate or rewrite their code. First, we developed such an abstract ML pipeline representation together with machinery to extract it from Python code. Next, we used this representation to efficiently instrument static ML pipelines and apply provenance tracking, which enables lightweight screening for common data preparation issues. Finally, we built machinery to automatically rewrite ML pipelines to perform more advanced what-if analyses and proposed using multi-query optimisation for the resulting workloads. In future work, we aim to interactively assist data scientists as they work on their ML pipelines.
Probabilistic Routing for Graph-Based Approximate Nearest Neighbor Search
Lu, Kejing, Xiao, Chuan, Ishikawa, Yoshiharu
Approximate nearest neighbor search (ANNS) in high-dimensional spaces is a pivotal challenge in the field of machine learning. In recent years, graph-based methods have emerged as the superior approach to ANNS, establishing a new state of the art. Although various optimizations for graph-based ANNS have been introduced, they predominantly rely on heuristic methods that lack formal theoretical backing. This paper aims to enhance routing within graph-based ANNS by introducing a method that offers a probabilistic guarantee when exploring a node's neighbors in the graph. We formulate the problem as probabilistic routing and develop two baseline strategies by incorporating locality-sensitive techniques. Subsequently, we introduce PEOs, a novel approach that efficiently identifies which neighbors in the graph should be considered for exact distance calculation, thus significantly improving efficiency in practice. Our experiments demonstrate that equipping PEOs can increase throughput on commonly utilized graph indexes (HNSW and NSSG) by a factor of 1.6 to 2.5, and its efficiency consistently outperforms the leading-edge routing technique by 1.1 to 1.4 times.
Robust Neural Information Retrieval: An Adversarial and Out-of-distribution Perspective
Liu, Yu-An, Zhang, Ruqing, Guo, Jiafeng, de Rijke, Maarten, Fan, Yixing, Cheng, Xueqi
According to the global overview report from Digital 2023, nearly 82% of Internet users between 18 and 64 have used a search engine or web portal in the past month. Specifically, IR is the process of finding and providing relevant information in response to the user query from a large collection of data. Recently, with advances in deep learning, neural IR models have witnessed significant progress [51, 53]. With the development of training methodologies such as pre-training [44, 100] and fine-tuning [73, 117, 162], neural IR models have demonstrated remarkable effectiveness in learning query-document relevance patterns. Why is robustness important in IR? In real-world deployment of neural IR models, an aspect equally essential as their effectiveness is their robustness. A good IR system must not only exhibit high effectiveness under normal conditions but also demonstrate robustness in the face of abnormal conditions. The natural openness of IR systems makes them vulnerable to intrusion, and the consequences can be severe. For example: (i) Search engines are vulnerable to black hat SEO attacks, necessitating significant efforts to curb these infringements.
TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries
Liang, Renjie, Li, Li, Zhang, Chongzhi, Wang, Jing, Zhu, Xizhou, Sun, Aixin
In this paper, we propose the task of \textit{Ranked Video Moment Retrieval} (RVMR) to locate a ranked list of matching moments from a collection of videos, through queries in natural language. Although a few related tasks have been proposed and studied by CV, NLP, and IR communities, RVMR is the task that best reflects the practical setting of moment search. To facilitate research in RVMR, we develop the TVR-Ranking dataset, based on the raw videos and existing moment annotations provided in the TVR dataset. Our key contribution is the manual annotation of relevance levels for 94,442 query-moment pairs. We then develop the $NDCG@K, IoU\geq \mu$ evaluation metric for this new task and conduct experiments to evaluate three baseline models. Our experiments show that the new RVMR task brings new challenges to existing models and we believe this new dataset contributes to the research on multi-modality search. The dataset is available at \url{https://github.com/Ranking-VMR/TVR-Ranking}
Enhancing Mobile "How-to" Queries with Automated Search Results Verification and Reranking
Ding, Lei, Bheemanpally, Jeshwanth, Zhang, Yi
Many people use search engines to find online guidance to solve computer or mobile device problems. Users frequently encounter challenges in identifying effective solutions from search results, often wasting time trying ineffective solutions that seem relevant yet fail to solve real problems. This paper introduces a novel approach to improving the accuracy and relevance of online technical support search results through automated search results verification and reranking. Taking "How-to" queries specific to on-device execution as a starting point, we developed the first solution that allows an AI agent to interpret and execute step-by-step instructions in the search results in a controlled Android environment. We further integrated the agent's findings into a reranking mechanism that orders search results based on the success indicators of the tested solutions. The paper details the architecture of our solution and a comprehensive evaluation of the system through a series of tests across various application domains. The results demonstrate a significant improvement in the quality and reliability of the top-ranked results. Our findings suggest a paradigm shift in how search engine ranking for online technical support help can be optimized, offering a scalable and automated solution to the pervasive challenge of finding effective and reliable online help.
Artificial Intuition: Efficient Classification of Scientific Abstracts
Sakhrani, Harsh, Pervez, Naseela, Kumar, Anirudh Ravi, Morstatter, Fred, Reed, Alexandra Graddy, Belz, Andrea
It is desirable to coarsely classify short scientific texts, such as grant or publication abstracts, for strategic insight or research portfolio management. These texts efficiently transmit dense information to experts possessing a rich body of knowledge to aid interpretation. Yet this task is remarkably difficult to automate because of brevity and the absence of context. To address this gap, we have developed a novel approach to generate and appropriately assign coarse domain-specific labels. We show that a Large Language Model (LLM) can provide metadata essential to the task, in a process akin to the augmentation of supplemental knowledge representing human intuition, and propose a workflow. As a pilot study, we use a corpus of award abstracts from the National Aeronautics and Space Administration (NASA). We develop new assessment tools in concert with established performance metrics.