AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, followed by other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

Chen, Yang, Hu, Hexiang, Luan, Yi, Sun, Haitian, Changpinyo, Soravit, Ritter, Alan, Chang, Ming-Wei

arXiv.org Artificial Intelligence, Oct-17-2023

Pre-trained vision and language models have demonstrated state-of-the-art capabilities over existing tasks involving images and texts, including visual question answering. However, it remains unclear whether these models possess the capability to answer questions that are not only querying visual content but knowledge-intensive and information-seeking. In this study, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits models to use fine-grained knowledge that was learned during their pre-training. Furthermore, we show that accurate visual entity recognition can be used to improve performance on InfoSeek by retrieving relevant documents, showing a significant space for improvement.

dataset, knowledge, visual entity, (14 more...)

arXiv.org Artificial Intelligence

2302.11713

Country:

North America > United States > Washington > King County > Seattle (0.04)
Europe > France > Bourgogne-Franche-Comté > Doubs > Besançon (0.04)
Asia > Armenia (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.76)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval

Zhang, Peitian, Liu, Zheng, Xiao, Shitao, Dou, Zhicheng, Yao, Jing

arXiv.org Artificial Intelligence, Oct-17-2023

Inverted file structure is a common technique for accelerating dense retrieval. It clusters documents based on their embeddings; during search, it probes the clusters nearest to an input query and evaluates only the documents within them using subsequent codecs, thus avoiding the expensive cost of exhaustive traversal. However, the clustering is always lossy, so relevant documents can be missing from the probed clusters, which degrades retrieval quality. In contrast, lexical matching, such as overlap of salient terms, tends to be a strong feature for identifying relevant documents. In this work, we present the Hybrid Inverted Index (HI$^2$), where the embedding clusters and salient terms work collaboratively to accelerate dense retrieval. To make the best of both effectiveness and efficiency, we devise a cluster selector and a term selector to construct compact inverted lists and search through them efficiently. Moreover, we leverage simple unsupervised algorithms as well as end-to-end knowledge distillation to learn these two modules, with the latter further boosting the effectiveness. Based on comprehensive experiments on popular retrieval benchmarks, we verify that clusters and terms indeed complement each other, enabling HI$^2$ to achieve lossless retrieval quality with competitive efficiency across various index settings. Our code and checkpoint are publicly available at https://github.com/namespace-Pt/Adon/tree/HI2.
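The idea of combining cluster-based and term-based inverted lists can be illustrated with a toy sketch. This is not the authors' implementation: the learned cluster selector and term selector described in the abstract are replaced here by nearest-centroid assignment and caller-supplied terms, and every name (`HybridInvertedIndex`, `add`, `search`, `n_probe`) is hypothetical.

```python
from collections import defaultdict


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class HybridInvertedIndex:
    """Toy index: a document is reachable through its embedding cluster
    AND through its salient terms (both selectors are stand-ins)."""

    def __init__(self, centroids):
        self.centroids = centroids             # cluster centre vectors
        self.cluster_lists = defaultdict(set)  # cluster id -> doc ids
        self.term_lists = defaultdict(set)     # term -> doc ids
        self.embeddings = {}                   # doc id -> embedding

    def _nearest_clusters(self, vec, n):
        ranked = sorted(range(len(self.centroids)),
                        key=lambda c: dot(vec, self.centroids[c]),
                        reverse=True)
        return ranked[:n]

    def add(self, doc_id, embedding, salient_terms):
        self.embeddings[doc_id] = embedding
        cluster = self._nearest_clusters(embedding, 1)[0]
        self.cluster_lists[cluster].add(doc_id)
        for term in salient_terms:
            self.term_lists[term].add(doc_id)

    def search(self, query_vec, query_terms, n_probe=1, top_k=3):
        # Candidates = union of probed cluster lists and matched term lists.
        candidates = set()
        for cluster in self._nearest_clusters(query_vec, n_probe):
            candidates |= self.cluster_lists.get(cluster, set())
        for term in query_terms:
            candidates |= self.term_lists.get(term, set())
        # Only the candidates are scored, not the whole collection.
        scored = sorted(candidates,
                        key=lambda d: dot(query_vec, self.embeddings[d]),
                        reverse=True)
        return scored[:top_k]


index = HybridInvertedIndex(centroids=[[1.0, 0.0], [0.0, 1.0]])
index.add("d1", [0.9, 0.1], ["inverted", "index"])
index.add("d2", [0.2, 0.8], ["dense", "retrieval"])
index.add("d3", [0.45, 0.55], ["inverted", "file"])

hits = index.search([0.6, 0.4], ["inverted"], n_probe=1)
print(hits)  # ['d1', 'd3']
```

In this toy data, `d3` falls in a cluster that is not probed, so a cluster-only search returns just `d1`; the term list for "inverted" brings `d3` back into the candidate set.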

efficiency, hi 2, retrieval, (14 more...)

arXiv.org Artificial Intelligence

2210.05521

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Czechia > Prague (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(18 more...)

Genre: Research Report > Experimental Study (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Data Science (0.67)

BLM-17m: A Large-Scale Dataset for Black Lives Matter Topic Detection on Twitter

Kemik, Hasan, Özateş, Nusret, Asgari-Chenaghlu, Meysam, Li, Yang, Cambria, Erik

arXiv.org Artificial Intelligence, Oct-17-2023

Protection of human rights is one of the most important problems of our world. In this paper, our aim is to provide a dataset covering one of the most significant human rights contradictions of recent months, which affected the whole world: the George Floyd incident. We propose a labeled dataset for topic detection that contains 17 million tweets. These tweets were collected from 25 May 2020 to 21 August 2020, covering 89 days from the start of the incident. We labeled the dataset by monitoring the most trending news topics from global and local newspapers. In addition, we present two baselines, TF-IDF and LDA. We evaluated the results of these two methods with three different k values for the metrics of precision, recall, and F1-score. The collected dataset is available at https://github.com/MeysamAsgariC/BLMT.
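The evaluation metrics named above can be sketched as follows. The abstract does not say what k denotes; this sketch assumes it is a cutoff on a ranked list of predicted topic terms, and the function name and sample data are hypothetical.

```python
def precision_recall_f1_at_k(ranked, relevant, k):
    """Precision, recall and F1 over the top-k items of a ranked list.

    Assumption: k is a rank cutoff; the paper may define it differently.
    """
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Made-up example: predicted topic terms vs. a labeled topic set.
ranked_terms = ["protest", "police", "election", "justice", "weather"]
gold_terms = {"protest", "police", "justice"}

for k in (2, 3, 5):
    p, r, f = precision_recall_f1_at_k(ranked_terms, gold_terms, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```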

cambria, sentiment analysis, tweet, (14 more...)

arXiv.org Artificial Intelligence

2105.01331

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry: Law > Civil Rights & Constitutional Law (1.00)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.91)

A 'Green' Search Engine Sees Danger--and Opportunity--in the Generative AI Revolution

WIRED, Oct-16-2023, 08:51:57 GMT

In the era of search wars fought between giants, it's tough to be small. Berlin-based Ecosia offers a search engine for the climate-conscious, promising to be carbon-negative by investing all of its profits into planting trees--more than 180 million of them since it launched in 2009. It's not likely to topple Google, but it has won a stable clientele of around 20 million users with that green branding and by repackaging search results from Microsoft's Bing. But after a decade of little change in the search business, everything is now in flux, thanks to generative AI. "I've never seen so much change in the market as in the last six months," says Christian Kroll, Ecosia's CEO. The tumult has forced Ecosia to rethink its business plan in order to compete with new chatbot-like search engines built on large language models.

generative ai revolution, microsoft, search engine, (7 more...)

WIRED

Country: North America > United States (0.06)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.77)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.66)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.59)

AutoML in Heavily Constrained Applications

Neutatz, Felix, Lindauer, Marius, Abedjan, Ziawasch

arXiv.org Artificial Intelligence, Oct-16-2023

Optimizing a machine learning pipeline for a task at hand requires careful configuration of various hyperparameters, typically supported by an AutoML system that optimizes the hyperparameters for the given training dataset. Yet, depending on the AutoML system's own second-order meta-configuration, the performance of the AutoML process can vary significantly. Current AutoML systems cannot automatically adapt their own configuration to a specific use case. Further, they cannot compile user-defined application constraints on the effectiveness and efficiency of the pipeline and its generation. In this paper, we propose CAML, which uses meta-learning to automatically adapt its own AutoML parameters, such as the search strategy, the validation strategy, and the search space, for a task at hand. The dynamic AutoML strategy of CAML takes user-defined constraints into account and obtains constraint-satisfying pipelines with high predictive performance.

automl configuration, configuration, constraint, (15 more...)

arXiv.org Artificial Intelligence

2306.16913

Country:

Europe > Portugal > Lisbon > Lisbon (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
(3 more...)

From Cloze to Comprehension: Retrofitting Pre-trained Masked Language Model to Pre-trained Machine Reader

Xu, Weiwen, Li, Xin, Zhang, Wenxuan, Zhou, Meng, Lam, Wai, Si, Luo, Bing, Lidong

arXiv.org Artificial Intelligence, Oct-16-2023

We present Pre-trained Machine Reader (PMR), a novel method for retrofitting pre-trained masked language models (MLMs) to pre-trained machine reading comprehension (MRC) models without acquiring labeled data. PMR can resolve the discrepancy between model pre-training and downstream fine-tuning of existing MLMs. To build the proposed PMR, we constructed a large volume of general-purpose and high-quality MRC-style training data by using Wikipedia hyperlinks and designed a Wiki Anchor Extraction task to guide the MRC-style pre-training. Apart from its simplicity, PMR effectively solves extraction tasks, such as Extractive Question Answering and Named Entity Recognition. PMR shows tremendous improvements over existing approaches, especially in low-resource scenarios. When applied to the sequence classification task in the MRC formulation, PMR enables the extraction of high-quality rationales to explain the classification process, thereby providing greater prediction explainability. PMR also has the potential to serve as a unified model for tackling various extraction and classification tasks in the MRC formulation.

computational linguistic, pmr, proceedings, (13 more...)

arXiv.org Artificial Intelligence

2212.04755

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Syria (0.05)
Asia > Japan (0.05)
(9 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment > Sports > Football (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
(2 more...)

Unsupervised Domain Adaption for Neural Information Retrieval

Dominguez, Carlos, Campos, Jon Ander, Agirre, Eneko, Azkune, Gorka

arXiv.org Artificial Intelligence, Oct-13-2023

Neural information retrieval requires costly annotated data for each target domain to be competitive. Synthetic annotation by query generation using Large Language Models or rule-based string manipulation has been proposed as an alternative, but their relative merits have not been analysed. In this paper, we compare both methods head-to-head using the same neural IR architecture. We focus on the BEIR benchmark, which includes test datasets from several domains with no training data, and explore two scenarios: zero-shot, where the supervised system is trained in a large out-of-domain dataset (MS-MARCO); and unsupervised domain adaptation, where, in addition to MS-MARCO, the system is fine-tuned in synthetic data from the target domain. Our results indicate that Large Language Models outperform rule-based methods in all scenarios by a large margin, and, more importantly, that unsupervised domain adaptation is effective compared to applying a supervised IR system in a zero-shot fashion. In addition we explore several sizes of open Large Language Models to ...

Figure 1: Experimental design: (left) a supervised retriever is trained with manual annotations from MS-MARCO; (middle) an unsupervised retriever is trained with automatically generated queries for MS-MARCO documents; (right) an unsupervised domain adaptation retriever is trained with both MS-MARCO manual annotations and automatically generated queries in-domain BEIR dataset documents. Evaluation is performed in BEIR producing two scenarios: zero-shot (left and middle retrievers); unsupervised domain adaptation (right retriever).

checkpoint, dataset, query, (15 more...)

arXiv.org Artificial Intelligence

2310.0935

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Texas (0.05)
South America (0.04)
(5 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

BibRank: Automatic Keyphrase Extraction Platform Using Metadata

Eldallal, Abdelrhman, Barbu, Eduard

arXiv.org Artificial Intelligence, Oct-13-2023

Automatic Keyphrase Extraction involves identifying essential phrases in a document. These keyphrases are crucial in various tasks such as document classification, clustering, recommendation, indexing, searching, summarization, and text simplification. This paper introduces a platform that integrates keyphrase datasets and facilitates the evaluation of keyphrase extraction algorithms. The platform includes BibRank, an automatic keyphrase extraction algorithm that leverages a rich dataset obtained by parsing bibliographic data in BibTeX format. BibRank combines innovative weighting techniques with positional, statistical, and word co-occurrence information to extract keyphrases from documents. The platform proves valuable for researchers and developers seeking to enhance their keyphrase extraction algorithms and advance the field of natural language processing.

algorithm, dataset, keyphrase extraction algorithm, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.3390/info14100549

2310.09151

Country:

Europe > Estonia > Tartu County > Tartu (0.05)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)

Genre:

Research Report (1.00)
Overview (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval

Kumar, Ramnath, Mittal, Anshul, Gupta, Nilesh, Kusupati, Aditya, Dhillon, Inderjit, Jain, Prateek

arXiv.org Artificial Intelligence, Oct-13-2023

Dense embedding-based retrieval is now the industry standard for semantic search and ranking problems, like obtaining relevant web documents for a given query. Such techniques use a two-stage process: (a) contrastive learning to train a dual encoder to embed both the query and documents and (b) approximate nearest neighbor search (ANNS) for finding similar documents for a given query. These two stages are disjoint; the learned embeddings might be ill-suited for the ANNS method and vice versa, leading to suboptimal performance. In this work, we propose End-to-end Hierarchical Indexing -- EHI -- that jointly learns both the embeddings and the ANNS structure to optimize retrieval performance. EHI uses a standard dual encoder model for embedding queries and documents while learning an inverted file index (IVF) style tree structure for efficient ANNS. To ensure stable and efficient learning of the discrete tree-based ANNS structure, EHI introduces the notion of dense path embedding that captures the position of a query/document in the tree. We demonstrate the effectiveness of EHI on several benchmarks, including the de facto industry-standard MS MARCO (Dev set and TREC DL19) datasets. For example, with the same compute budget, EHI outperforms the state of the art (SOTA) by 0.6% (MRR@10) on the MS MARCO dev set and by 4.2% (nDCG@10) on the TREC DL19 benchmark.
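For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how MRR@10 and nDCG@10 are commonly computed. It is not taken from the paper: the function names are hypothetical, and the nDCG variant assumes linear gains with a log2 rank discount, which is one of several conventions in use.

```python
import math


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mrr_at_k(runs, k=10):
    """Mean reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids, gains, k=10):
    """nDCG with linear gain and log2 discount (an assumed convention).

    `gains` maps doc id -> graded relevance; missing ids count as 0.
    """
    dcg = sum(gains.get(doc_id, 0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1)
               for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up example: two queries for MRR, one graded query for nDCG.
runs = [(["a", "b", "c"], {"b"}), (["x", "y"], {"x"})]
print(mrr_at_k(runs))                             # 0.75
print(ndcg_at_k(["a", "b"], {"a": 1, "b": 2}))
```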

dataset, ehi, query, (14 more...)

arXiv.org Artificial Intelligence

2310.08891

Genre:

Research Report > Experimental Study (0.67)
Research Report > New Finding (0.67)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

The Computational Complexity of Finding Stationary Points in Non-Convex Optimization

Hollender, Alexandros, Zampetakis, Manolis

arXiv.org Machine Learning, Oct-13-2023

Finding approximate stationary points, i.e., points where the gradient is approximately zero, of non-convex but smooth objective functions $f$ over unrestricted $d$-dimensional domains is one of the most fundamental problems in classical non-convex optimization. Nevertheless, the computational and query complexity of this problem are still not well understood when the dimension $d$ of the problem is independent of the approximation error. In this paper, we show the following computational and query complexity results: 1. The problem of finding approximate stationary points over unrestricted domains is PLS-complete. 2. For $d = 2$, we provide a zero-order algorithm for finding $\varepsilon$-approximate stationary points that requires at most $O(1/\varepsilon)$ value queries to the objective function. 3. We show that any algorithm needs at least $\Omega(1/\varepsilon)$ queries to the objective function and/or its gradient to find $\varepsilon$-approximate stationary points when $d=2$. Combined with the above, this characterizes the query complexity of this problem to be $\Theta(1/\varepsilon)$. 4. For $d = 2$, we provide a zero-order algorithm for finding $\varepsilon$-KKT points in constrained optimization problems that requires at most $O(1/\sqrt{\varepsilon})$ value queries to the objective function. This closes the gap between the works of Bubeck and Mikulincer [2020] and Vavasis [1993] and characterizes the query complexity of this problem to be $\Theta(1/\sqrt{\varepsilon})$. 5. Combining our results with the recent result of Fearnley et al. [2022], we show that finding approximate KKT points in constrained optimization is reducible to finding approximate stationary points in unconstrained optimization but the converse is impossible.
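As a reading aid, the central notion and the two tight bounds stated above can be written out. The choice of the Euclidean norm is an assumption; the abstract does not name one.

```latex
% Assumed norm: Euclidean (not specified in the abstract).
\[
  x \text{ is an } \varepsilon\text{-approximate stationary point of } f
  \iff \lVert \nabla f(x) \rVert_2 \le \varepsilon .
\]
% Query complexity for d = 2, as stated in results 2--4 of the abstract:
\[
  \text{stationary points (unconstrained): } \Theta(1/\varepsilon),
  \qquad
  \text{KKT points (constrained): } \Theta(1/\sqrt{\varepsilon}).
\]
```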

artificial intelligence, information retrieval, natural language, (19 more...)

arXiv.org Machine Learning

2310.09157

Country:

North America > United States (0.14)
North America > Canada > Ontario > Toronto (0.14)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
(2 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.94)
