AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Approximate Nearest Neighbor Search with Window Filters

Engels, Joshua, Landrum, Benjamin, Yu, Shangdi, Dhulipala, Laxman, Shun, Julian

arXiv.org Artificial IntelligenceFeb-1-2024

The nearest neighbor search problem has been widely studied for more than 30 years (Arya & Mount, 1993). Given Although this problem has many motivating examples, there a dataset D, the problem requires the construction of an is a dearth of papers examining it in the literature. Some vector index that can efficiently answer queries of the form "what databases analyze window search-like problem instances is the closest vector to x in D?" Solving this problem exactly as an additional feature of their system, but this analysis degrades to a brute force linear search in high dimensions is typically secondary to their main approach and too slow (Rubinstein, 2018), so instead both theoreticians and for large-scale real-world systems; as far as we are aware, practitioners focus on the relaxed c-approximate nearest we are the first to propose, analyze, and experiment with a neighbor search problem (ANNS), which asks "what is a non-trivial solution to the window search problem.

approximate nearest neighbor search, filter fraction, query, (8 more...)

arXiv.org Artificial Intelligence

2402.00943

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Maryland (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Add feedback

A First Look at Information Highlighting in Stack Overflow Answers

Ahmed, Shahla Shaan, Wang, Shaowei, Tian, Yuan, Tse-Hsun, null, Chen, null, Zhang, Haoxiang

arXiv.org Artificial IntelligenceJan-31-2024

Context: Navigating the knowledge of Stack Overflow (SO) remains challenging. To make the posts vivid to users, SO allows users to write and edit posts with Markdown or HTML so that users can leverage various formatting styles (e.g., bold, italic, and code) to highlight the important information. Nonetheless, there have been limited studies on the highlighted information. Objective: We carried out the first large-scale exploratory study on the information highlighted in SO answers in our recent study. To extend our previous study, we develop approaches to automatically recommend highlighted content with formatting styles using neural network architectures initially designed for the Named Entity Recognition task. Method: In this paper, we studied 31,169,429 answers of Stack Overflow. For training recommendation models, we choose CNN and BERT models for each type of formatting (i.e., Bold, Italic, Code, and Heading) using the information highlighting dataset we collected from SO answers. Results: Our models based on CNN architecture achieve precision ranging from 0.71 to 0.82. The trained model for automatic code content highlighting achieves a recall of 0.73 and an F1 score of 0.71, outperforming the trained models for other formatting styles. The BERT models have even lower recalls and F1 scores than the CNN models. Our analysis of failure cases indicates that the majority of the failure cases are missing identification (i.e., the model misses the content that is supposed to be highlighted) due to the models tend to learn the frequently highlighted words while struggling to learn less frequent words. Conclusion: Our findings suggest that it is possible to develop recommendation models for highlighting information for answers with different formatting styles on Stack Overflow.

information, italic, stack overflow, (13 more...)

arXiv.org Artificial Intelligence

2401.01472

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > Canada > Manitoba (0.04)
Europe > Italy (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

Commonsense for Zero-Shot Natural Language Video Localization

Holla, Meghana, Lourentzou, Ismini

arXiv.org Artificial IntelligenceJan-31-2024

Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.

enhancement, information, query, (13 more...)

arXiv.org Artificial Intelligence

2312.17429

Country:

North America > United States > Virginia (0.04)
North America > United States > Illinois (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.35)

Add feedback

How Does Beam Search improve Span-Level Confidence Estimation in Generative Sequence Labeling?

Hashimoto, Kazuma, Naim, Iftekhar, Raman, Karthik

arXiv.org Artificial IntelligenceJan-30-2024

Sequence labeling is a core task in text understanding for IE/IR systems. Text generation models have increasingly become the go-to solution for such tasks (e.g., entity extraction and dialog slot filling). While most research has focused on the labeling accuracy, a key aspect -- of vital practical importance -- has slipped through the cracks: understanding model confidence. More specifically, we lack a principled understanding of how to reliably gauge the confidence of a model in its predictions for each labeled span. This paper aims to provide some empirical insights on estimating model confidence for generative sequence labeling. Most notably, we find that simply using the decoder's output probabilities \textbf{is not} the best in realizing well-calibrated confidence estimates. As verified over six public datasets of different tasks, we show that our proposed approach -- which leverages statistics from top-$k$ predictions by a beam search -- significantly reduces calibration errors of the predictions of a generative sequence labeling model.

computational linguistic, equation, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2212.10767

Country:

Asia > Middle East > Qatar (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Pennsylvania (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.73)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.67)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

NNOSE: Nearest Neighbor Occupational Skill Extraction

Zhang, Mike, van der Goot, Rob, Kan, Min-Yen, Plank, Barbara

arXiv.org Artificial IntelligenceJan-30-2024

The labor market is changing rapidly, prompting increased interest in the automatic extraction of occupational skills from text. With the advent of English benchmark job description datasets, there is a need for systems that handle their diversity well. We tackle the complexity in occupational skill datasets tasks -- combining and leveraging multiple datasets for skill extraction, to identify rarely observed skills within a dataset, and overcoming the scarcity of skills across datasets. In particular, we investigate the retrieval-augmentation of language models, employing an external datastore for retrieving similar skills in a dataset-unifying manner. Our proposed method, \textbf{N}earest \textbf{N}eighbor \textbf{O}ccupational \textbf{S}kill \textbf{E}xtraction (NNOSE) effectively leverages multiple datasets by retrieving neighboring skills from other datasets in the datastore. This improves skill extraction \emph{without} additional fine-tuning. Crucially, we observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30\% span-F1 in cross-dataset settings.

computational linguistic, dataset, datastore, (13 more...)

arXiv.org Artificial Intelligence

2401.17092

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Washington > King County > Seattle (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(15 more...)

Genre: Research Report (1.00)

Industry: Banking & Finance > Economy (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.46)
(2 more...)

Add feedback

History-Aware Conversational Dense Retrieval

Mo, Fengran, Qu, Chen, Mao, Kelong, Zhu, Tianyu, Su, Zhan, Huang, Kaiyu, Nie, Jian-Yun

arXiv.org Artificial IntelligenceJan-29-2024

Conversational search facilitates complex information retrieval by enabling multi-turn interactions between users and the system. Supporting such interactions requires a comprehensive understanding of the conversational inputs to formulate a good search query based on historical information. In particular, the search query should include the relevant information from the previous conversation turns. However, current approaches for conversational dense retrieval primarily rely on fine-tuning a pre-trained ad-hoc retriever using the whole conversational search session, which can be lengthy and noisy. Moreover, existing approaches are limited by the amount of manual supervision signals in the existing datasets. To address the aforementioned issues, we propose a History-Aware Conversational Dense Retrieval (HAConvDR) system, which incorporates two ideas: context-denoised query reformulation and automatic mining of supervision signals based on the actual impact of historical turns. Experiments on two public conversational search datasets demonstrate the improved history modeling capability of HAConvDR, in particular for long conversations with topic shifts.

historical turn, query, retrieval, (14 more...)

arXiv.org Artificial Intelligence

2401.16659

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > United States > New York > Suffolk County > Huntington (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Navigating the Post-API Dilemma Search Engine Results Pages Present a Biased View of Social Media Data

Poudel, Amrit, Weninger, Tim

arXiv.org Artificial IntelligenceJan-27-2024

Recent decisions to discontinue access to social media APIs are having detrimental effects on Internet research and the field of computational social science as a whole. This lack of access to data has been dubbed the Post-API era of Internet research. Fortunately, popular search engines have the means to crawl, capture, and surface social media data on their Search Engine Results Pages (SERP) if provided the proper search query, and may provide a solution to this dilemma. In the present work we ask: does SERP provide a complete and unbiased sample of social media data? Is SERP a viable alternative to direct API-access? To answer these questions, we perform a comparative analysis between (Google) SERP results and nonsampled data from Reddit and Twitter/X. We find that SERP results are highly biased in favor of popular posts; against political, pornographic, and vulgar posts; are more positive in their sentiment; and have large topical gaps. Overall, we conclude that SERP is not a viable alternative to social media API access.

media data, serp, social media data, (15 more...)

arXiv.org Artificial Intelligence

2401.15479

Country:

North America > United States > New York (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Ukraine (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Industry:

Information Technology > Services (1.00)
Media (0.96)
Government (0.68)
Health & Medicine > Therapeutic Area > Immunology (0.47)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.83)
(2 more...)

Add feedback

Query Complexity of Tournament Solutions

Maiti, Arnab, Dey, Palash

arXiv.org Artificial IntelligenceJan-27-2024

A directed graph where there is exactly one edge between every pair of vertices is called a {\em tournament}. Finding the "best" set of vertices of a tournament is a well studied problem in social choice theory. A {\em tournament solution} takes a tournament as input and outputs a subset of vertices of the input tournament. However, in many applications, for example, choosing the best set of drugs from a given set of drugs, the edges of the tournament are given only implicitly and knowing the orientation of an edge is costly. In such scenarios, we would like to know the best set of vertices (according to some tournament solution) by "querying" as few edges as possible. We, in this paper, precisely study this problem for commonly used tournament solutions: given an oracle access to the edges of a tournament T, find $f(T)$ by querying as few edges as possible, for a tournament solution f. We first show that the set of Condorcet non-losers in a tournament can be found by querying $2n-\lfloor \log n \rfloor -2$ edges only and this is tight in the sense that every algorithm for finding the set of Condorcet non-losers needs to query at least $2n-\lfloor \log n \rfloor -2$ edges in the worst case, where $n$ is the number of vertices in the input tournament. We then move on to study other popular tournament solutions and show that any algorithm for finding the Copeland set, the Slater set, the Markov set, the bipartisan set, the uncovered set, the Banks set, and the top cycle must query $\Omega(n^2)$ edges in the worst case. On the positive side, we are able to circumvent our strong query complexity lower bound results by proving that, if the size of the top cycle of the input tournament is at most $k$, then we can find all the tournament solutions mentioned above by querying $O(nk + \frac{n\log n}{\log(1-\frac{1}{k})})$ edges only.

algorithm, query complexity, tournament, (15 more...)

arXiv.org Artificial Intelligence

1611.06189

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Virginia (0.04)
(4 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.66)

Add feedback

Roq: Robust Query Optimization Based on a Risk-aware Learned Cost Model

Kamali, Amin, Kantere, Verena, Zuzarte, Calisto, Corvinelli, Vincent

arXiv.org Artificial IntelligenceJan-26-2024

Query optimizers in relational database management systems (RDBMSs) search for execution plans expected to be optimal for a given queries. They use parameter estimates, often inaccurate, and make assumptions that may not hold in practice. Consequently, they may select execution plans that are suboptimal at runtime, when these estimates and assumptions are not valid, which may result in poor query performance. Therefore, query optimizers do not sufficiently support robust query optimization. Recent years have seen a surge of interest in using machine learning (ML) to improve efficiency of data systems and reduce their maintenance overheads, with promising results obtained in the area of query optimization in particular. In this paper, inspired by these advancements, and based on several years of experience of IBM Db2 in this journey, we propose Robust Optimization of Queries, (Roq), a holistic framework that enables robust query optimization based on a risk-aware learning approach. Roq includes a novel formalization of the notion of robustness in the context of query optimization and a principled approach for its quantification and measurement based on approximate probabilistic ML. It also includes novel strategies and algorithms for query plan evaluation and selection. Roq also includes a novel learned cost model that is designed to predict query execution cost and the associated risks and performs query optimization accordingly. We demonstrate experimentally that Roq provides significant improvements to robust query optimization compared to the state-of-the-art.

query, robustness, suboptimality, (17 more...)

arXiv.org Artificial Intelligence

2401.1521

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > United States > Arizona > Maricopa County > Phoenix (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.34)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)

Add feedback

LongFin: A Multimodal Document Understanding Model for Long Financial Domain Documents

Masry, Ahmed, Hajian, Amir

arXiv.org Artificial IntelligenceJan-26-2024

Document AI is a growing research field that focuses on the comprehension and extraction of information from scanned and digital documents to make everyday business operations more efficient. Numerous downstream tasks and datasets have been introduced to facilitate the training of AI models capable of parsing and extracting information from various document types such as receipts and scanned forms. Despite these advancements, both existing datasets and models fail to address critical challenges that arise in industrial contexts. Existing datasets primarily comprise short documents consisting of a single page, while existing models are constrained by a limited maximum length, often set at 512 tokens. Consequently, the practical application of these methods in financial services, where documents can span multiple pages, is severely impeded. To overcome these challenges, we introduce LongFin, a multimodal document AI model capable of encoding up to 4K tokens. We also propose the LongForms dataset, a comprehensive financial dataset that encapsulates several industrial challenges in financial documents. Through an extensive evaluation, we demonstrate the effectiveness of the LongFin model on the LongForms dataset, surpassing the performance of existing public models while maintaining comparable results on existing single-page benchmarks.

dataset, encoder, longfin, (15 more...)

arXiv.org Artificial Intelligence

2401.1505

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)

Genre: Research Report (0.40)

Industry: Banking & Finance > Financial Services (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)

Add feedback