AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Academic competitions

Escalante, Hugo Jair, Kruchinina, Aleksandra

arXiv.org Artificial IntelligenceNov-30-2023

Academic challenges comprise effective means for (i) advancing the state of the art, (ii) putting in the spotlight of a scientific community specific topics and problems, as well as (iii) closing the gap for under represented communities in terms of accessing and participating in the shaping of research fields. Competitions can be traced back for centuries and their achievements have had great influence in our modern world. Recently, they (re)gained popularity, with the overwhelming amounts of data that is being generated in different domains, as well as the need of pushing the barriers of existing methods, and available tools to handle such data. This chapter provides a survey of academic challenges in the context of machine learning and related fields. We review the most influential competitions in the last few years and analyze challenges per area of knowledge. The aims of scientific challenges, their goals, major achievements and expectations for the next few years are reviewed.

competition, hugo jair escalante, proceedings, (10 more...)

arXiv.org Artificial Intelligence

2312.00268

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
(23 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.46)

Industry:

Information Technology (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Education (0.93)
(2 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(5 more...)

Add feedback

Beyond Two-Tower Matching: Learning Sparse Retrievable Cross-Interactions for Recommendation

Su, Liangcai, Yan, Fan, Zhu, Jieming, Xiao, Xi, Duan, Haoyi, Zhao, Zhou, Dong, Zhenhua, Tang, Ruiming

arXiv.org Artificial IntelligenceNov-29-2023

Two-tower models are a prevalent matching framework for recommendation, which have been widely deployed in industrial applications. The success of two-tower matching attributes to its efficiency in retrieval among a large number of items, since the item tower can be precomputed and used for fast Approximate Nearest Neighbor (ANN) search. However, it suffers two main challenges, including limited feature interaction capability and reduced accuracy in online serving. Existing approaches attempt to design novel late interactions instead of dot products, but they still fail to support complex feature interactions or lose retrieval efficiency. To address these challenges, we propose a new matching paradigm named SparCode, which supports not only sophisticated feature interactions but also efficient retrieval. Specifically, SparCode introduces an all-to-all interaction module to model fine-grained query-item interactions. Besides, we design a discrete code-based sparse inverted index jointly trained with the model to achieve effective and efficient model inference. Extensive experiments have been conducted on open benchmark datasets to demonstrate the superiority of our framework. The results show that SparCode significantly improves the accuracy of candidate item matching while retaining the same level of retrieval efficiency with two-tower models. Our source code will be available at MindSpore/models.

codebook, interaction, sparcode, (12 more...)

arXiv.org Artificial Intelligence

2311.18213

Country:

Asia > Taiwan > Taiwan Province > Taipei (0.05)
Asia > China > Guangdong Province > Shenzhen (0.05)
Oceania > Australia > Queensland (0.04)
(10 more...)

Genre: Research Report > New Finding (0.34)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
Information Technology > Information Management (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.69)

Add feedback

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

Wei, Cong, Chen, Yang, Chen, Haonan, Hu, Hexiang, Zhang, Ge, Fu, Jie, Ritter, Alan, Chen, Wenhu

arXiv.org Artificial IntelligenceNov-28-2023

Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To approach such different information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are keys to UniIR's generalization ability. Additionally, we construct the M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.

dataset, instruction, uniir, (15 more...)

arXiv.org Artificial Intelligence

2311.17136

Country:

North America > United States > Texas > Hays County > San Marcos (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Afghanistan (0.04)
(27 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Government > Regional Government > Europe Government (0.94)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Add feedback

Experimental Analysis of Large-scale Learnable Vector Storage Compression

Zhang, Hailin, Zhao, Penghao, Miao, Xupeng, Shao, Yingxia, Liu, Zirui, Yang, Tong, Cui, Bin

arXiv.org Artificial IntelligenceNov-27-2023

Learnable embedding vector is one of the most important applications in machine learning, and is widely used in various database-related domains. However, the high dimensionality of sparse data in recommendation tasks and the huge volume of corpus in retrieval-related tasks lead to a large memory consumption of the embedding table, which poses a great challenge to the training and deployment of models. Recent research has proposed various methods to compress the embeddings at the cost of a slight decrease in model quality or the introduction of other overheads. Nevertheless, the relative performance of these methods remains unclear. Existing experimental comparisons only cover a subset of these methods and focus on limited metrics. In this paper, we perform a comprehensive comparative analysis and experimental evaluation of embedding compression. We introduce a new taxonomy that categorizes these techniques based on their characteristics and methodologies, and further develop a modular benchmarking framework that integrates 14 representative methods. Under a uniform test environment, our benchmark fairly evaluates each approach, presents their strengths and weaknesses under different memory budgets, and recommends the best method based on the use case. In addition to providing useful guidelines, our study also uncovers the limitations of current methods and suggests potential directions for future research.

compression method, memory budget, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2311.15578

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Middle East > Jordan (0.04)
Asia > China > Shandong Province > Dongying (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
(2 more...)

Add feedback

Leveraging deep active learning to identify low-resource mobility functioning information in public clinical notes

Le, Tuan-Dung, Miao, Zhuqi, Alvarado, Samuel, Smith, Brittany, Paiva, William, Thieu, Thanh

arXiv.org Artificial IntelligenceNov-27-2023

Function is increasingly recognized as an important indicator of whole-person health, although it receives little attention in clinical natural language processing research. We introduce the first public annotated dataset specifically on the Mobility domain of the International Classification of Functioning, Disability and Health (ICF), aiming to facilitate automatic extraction and analysis of functioning information from free-text clinical notes. We utilize the National NLP Clinical Challenges (n2c2) research dataset to construct a pool of candidate sentences using keyword expansion. Our active learning approach, using query-by-committee sampling weighted by density representativeness, selects informative sentences for human annotation. We train BERT and CRF models, and use predictions from these models to guide the selection of new sentences for subsequent annotation iterations. Our final dataset consists of 4,265 sentences with a total of 11,784 entities, including 5,511 Action entities, 5,328 Mobility entities, 306 Assistance entities, and 639 Quantification entities. The inter-annotator agreement (IAA), averaged over all entity types, is 0.72 for exact matching and 0.91 for partial matching. We also train and evaluate common BERT models and state-of-the-art Nested NER models. The best F1 scores are 0.84 for Action, 0.7 for Mobility, 0.62 for Assistance, and 0.71 for Quantification. Empirical results demonstrate promising potential of NER models to accurately extract mobility functioning information from clinical text. The public availability of our annotated dataset will facilitate further research to comprehensively capture functioning information in electronic health records (EHRs).

dataset, entity type, information, (12 more...)

arXiv.org Artificial Intelligence

2311.15946

Country:

North America > United States > Florida > Hillsborough County > Tampa (0.14)
South America > Uruguay > Maldonado > Maldonado (0.05)
North America > United States > Oklahoma > Payne County > Stillwater (0.04)
(6 more...)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.46)

Industry: Health & Medicine > Health Care Technology > Medical Record (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.95)

Add feedback

SCStory: Self-supervised and Continual Online Story Discovery

Yoon, Susik, Meng, Yu, Lee, Dongha, Han, Jiawei

arXiv.org Artificial IntelligenceNov-26-2023

We present a framework SCStory for online story discovery, that helps people digest rapidly published news article streams in real-time without human annotations. To organize news article streams into stories, existing approaches directly encode the articles and cluster them based on representation similarity. However, these methods yield noisy and inaccurate story discovery results because the generic article embeddings do not effectively reflect the story-indicative semantics in an article and cannot adapt to the rapidly evolving news article streams. SCStory employs self-supervised and continual learning with a novel idea of story-indicative adaptive modeling of news article streams. With a lightweight hierarchical embedding module that first learns sentence representations and then article representations, SCStory identifies story-relevant information of news articles and uses them to discover stories. The embedding module is continuously updated to adapt to evolving news streams with a contrastive learning objective, backed up by two unique techniques, confidence-aware memory replay and prioritized-augmentation, employed for label absence and data scarcity problems. Thorough experiments on real and the latest news data sets demonstrate that SCStory outperforms existing state-of-the-art algorithms for unsupervised online story discovery.

proceedings, representation, scstory, (17 more...)

arXiv.org Artificial Intelligence

2312.03725

Country:

Africa (0.28)
North America > United States > Texas > Travis County > Austin (0.05)
Europe > Ukraine (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Media > News (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)

Add feedback

Overview of the VLSP 2022 -- Abmusu Shared Task: A Data Challenge for Vietnamese Abstractive Multi-document Summarization

Tran, Mai-Vu, Le, Hoang-Quynh, Can, Duy-Cat, Nguyen, Quoc-An

arXiv.org Artificial IntelligenceNov-26-2023

This paper reports the overview of the VLSP 2022 - Vietnamese abstractive multi-document summarization (Abmusu) shared task for Vietnamese News. This task is hosted at the 9$^{th}$ annual workshop on Vietnamese Language and Speech Processing (VLSP 2022). The goal of Abmusu shared task is to develop summarization systems that could create abstractive summaries automatically for a set of documents on a topic. The model input is multiple news documents on the same topic, and the corresponding output is a related abstractive summary. In the scope of Abmusu shared task, we only focus on Vietnamese news summarization and build a human-annotated dataset of 1,839 documents in 600 clusters, collected from Vietnamese news in 8 categories. Participated models are evaluated and ranked in terms of \texttt{ROUGE2-F1} score, the most typical evaluation metric for document summarization problem.

abmusu, baseline, summarization, (13 more...)

arXiv.org Artificial Intelligence

2311.15525

Country:

Asia > Vietnam > Hanoi > Hanoi (0.05)
North America > United States > Maryland > Baltimore (0.04)

Genre:

Research Report (1.00)
Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.48)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.34)

Add feedback

A Corpus for Named Entity Recognition in Chinese Novels with Multi-genres

Zhao, Hanjie, Xie, Jinge, Yan, Yuchen, Jia, Yuxiang, Ye, Yawen, Zan, Hongying

arXiv.org Artificial IntelligenceNov-26-2023

Entities like person, location, organization are important for literary text analysis. The lack of annotated data hinders the progress of named entity recognition (NER) in literary domain. To promote the research of literary NER, we build the largest multi-genre literary NER corpus containing 263,135 entities in 105,851 sentences from 260 online Chinese novels spanning 13 different genres. Based on the corpus, we investigate characteristics of entities from different genres. We propose several baseline NER models and conduct cross-genre and cross-domain experiments. Experimental results show that genre difference significantly impact NER performance though not as much as domain difference like literary domain and news domain. Compared with NER in news domain, literary NER still needs much improvement and the Out-of-Vocabulary (OOV) problem is more challenging due to the high variety of entities in literary works.

computational linguistic, corpus, entity recognition, (13 more...)

arXiv.org Artificial Intelligence

2311.15509

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > China > Henan Province > Zhengzhou (0.04)
North America > United States > Illinois (0.04)
(4 more...)

Genre:

Overview (0.68)
Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Add feedback

Relevance feedback strategies for recall-oriented neural information retrieval

Kats, Timo, van der Putten, Peter, Scholtes, Jan

arXiv.org Artificial IntelligenceNov-25-2023

In a number of information retrieval applications (e.g., patent search, literature review, due diligence, etc.), preventing false negatives is more important than preventing false positives. However, approaches designed to reduce review effort (like "technology assisted review") can create false negatives, since they are often based on active learning systems that exclude documents automatically based on user feedback. Therefore, this research proposes a more recall-oriented approach to reducing review effort. More specifically, through iteratively re-ranking the relevance rankings based on user feedback, which is also referred to as relevance feedback. In our proposed method, the relevance rankings are produced by a BERT-based dense-vector search and the relevance feedback is based on cumulatively summing the queried and selected embeddings. Our results show that this method can reduce review effort between 17.85% and 59.04%, compared to a baseline approach (of no feedback), given a fixed recall target.

experiment, paragraph, similarity method, (15 more...)

arXiv.org Artificial Intelligence

2311.1511

Country:

Europe > Netherlands > South Holland > Leiden (0.04)
Europe > Netherlands > Limburg > Maastricht (0.04)
Europe > Switzerland > Vaud (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.73)

Add feedback

Code Search Debiasing:Improve Search Results beyond Overall Ranking Performance

Zhang, Sheng, Li, Hui, Wang, Yanlin, Wei, Zhao, Xiu, Yong, Wang, Juhong, Ji, Rongong

arXiv.org Artificial IntelligenceNov-24-2023

Code search engine is an essential tool in software development. Many code search methods have sprung up, focusing on the overall ranking performance of code search. In this paper, we study code search from another perspective by analyzing the bias of code search models. Biased code search engines provide poor user experience, even though they show promising overall performance. Due to different development conventions (e.g., prefer long queries or abbreviations), some programmers will find the engine useful, while others may find it hard to get desirable search results. To mitigate biases, we develop a general debiasing framework that employs reranking to calibrate search results. It can be easily plugged into existing engines and handle new code search biases discovered in the future. Experiments show that our framework can effectively reduce biases. Meanwhile, the overall ranking performance of code search gets improved after debiasing.

code snippet, query, snippet, (15 more...)

arXiv.org Artificial Intelligence

2311.14901

Country:

Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China > Fujian Province > Xiamen (0.04)

Genre:

Research Report (0.64)
Overview (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback