AITopics | Information Retrieval

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Cross-modal Retrieval for Knowledge-based Visual Question Answering

Lerner, Paul, Ferret, Olivier, Guinaudeau, Camille

arXiv.org Artificial Intelligence, Jan-11-2024

Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions, and is foremost complementary with mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono- and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.
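As a rough illustration of how mono-modal and cross-modal retrieval can complement each other, the sketch below scores each knowledge-base entity by image-to-image similarity and by image-to-text similarity, then mixes the two. The weighted-sum fusion, the `alpha` parameter, the function names, and the toy vectors are all assumptions made for illustration; the abstract does not say how the authors combine the two signals, and real embeddings would come from a dual encoder such as CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_entities(query_image_vec, kb, alpha=0.5):
    """Rank knowledge-base entities for a query image.

    kb maps an entity name to (image_vec, text_vec), i.e. hypothetical
    embeddings of a reference image and of the entity name.
    alpha weights the mono-modal score against the cross-modal score
    (an assumed fusion rule, not the paper's).
    """
    scored = []
    for name, (image_vec, text_vec) in kb.items():
        mono = cosine(query_image_vec, image_vec)   # image -> image
        cross = cosine(query_image_vec, text_vec)   # image -> text
        scored.append((alpha * mono + (1 - alpha) * cross, name))
    return [name for _, name in sorted(scored, reverse=True)]


# Toy example with made-up 2-d vectors.
kb = {
    "Entity A": ([1.0, 0.0], [0.9, 0.1]),
    "Entity B": ([0.0, 1.0], [0.1, 0.9]),
}
ranking = rank_entities([1.0, 0.0], kb)
```

Setting `alpha` to 1.0 or 0.0 reduces the sketch to purely mono-modal or purely cross-modal retrieval, which makes it easy to compare the three settings on the same data.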

information retrieval, natural language, question answering, (17 more...)

2401.05736

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York > New York County > New York City (0.05)
Asia > Middle East > Saudi Arabia > Mecca Province > Jeddah (0.04)
(13 more...)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.92)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.89)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.73)

Learning Unsupervised Semantic Document Representation for Fine-grained Aspect-based Sentiment Analysis

Fu, Hao-Ming, Cheng, Pu-Jen

arXiv.org Artificial Intelligence, Jan-11-2024

Document representation is the core of many NLP tasks on machine understanding. A general representation learned in an unsupervised manner reserves generality and can be used for various applications. In practice, sentiment analysis (SA) has been a challenging task that is regarded to be deeply semantic-related and is often used to assess general representations. Existing methods on unsupervised document representation learning can be separated into two families: sequential ones, which explicitly take the ordering of words into consideration, and non-sequential ones, which do not explicitly do so. However, both of them suffer from their own weaknesses. In this paper, we propose a model that overcomes difficulties encountered by both families of methods. Experiments show that our model outperforms state-of-the-art methods on popular SA datasets and a fine-grained aspect-based SA by a large margin.

representation, target sentence, vector, (10 more...)

doi: 10.1145/3331184.3331320

2401.0621

Country:

Asia > Taiwan (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.93)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.74)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.64)

DREQ: Document Re-Ranking Using Entity-based Query Understanding

Chatterjee, Shubham, Mackie, Iain, Dalton, Jeff

arXiv.org Artificial Intelligence, Jan-11-2024

While entity-oriented neural IR models have advanced significantly, they often overlook a key nuance: the varying degrees of influence individual entities within a document have on its overall relevance. Addressing this gap, we present DREQ, an entity-oriented dense document re-ranking model. Uniquely, we emphasize the query-relevant entities within a document's representation while simultaneously attenuating the less relevant ones, thus obtaining a query-specific entity-centric document representation. We then combine this entity-centric document representation with the text-centric representation of the document to obtain a "hybrid" representation of the document. We learn a relevance score for the document using this hybrid representation. Using four large-scale benchmarks, we show that DREQ outperforms state-of-the-art neural and non-neural re-ranking methods, highlighting the effectiveness of our entity-oriented representation approach.
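The weighting idea in this abstract can be sketched in a few lines: entities that match the query get more weight in the entity-centric vector, which is then joined with the text vector. Everything specific here is an assumption for illustration (dot-product scoring, softmax weights, concatenation as the "hybrid" step, and the names `softmax` and `hybrid_representation`); the abstract does not describe DREQ's actual architecture at this level.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_representation(query_vec, entity_vecs, text_vec):
    """Build a query-specific document vector (illustrative only).

    1. Score each entity embedding against the query (dot product).
    2. Turn scores into weights, so query-relevant entities dominate
       and less relevant ones are attenuated.
    3. Concatenate the weighted entity vector with the text vector.
    """
    scores = [sum(q * e for q, e in zip(query_vec, ev)) for ev in entity_vecs]
    weights = softmax(scores)
    dim = len(entity_vecs[0])
    entity_centric = [
        sum(w * ev[i] for w, ev in zip(weights, entity_vecs))
        for i in range(dim)
    ]
    return entity_centric + list(text_vec)


# Toy example: the first entity aligns with the query, the second does not.
doc_vec = hybrid_representation(
    query_vec=[1.0, 0.0],
    entity_vecs=[[1.0, 0.0], [0.0, 1.0]],
    text_vec=[0.5],
)
```

In a learned model the relevance score would be produced by a trained layer on top of such a vector; the sketch stops at building the representation.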

computing machinery, proceedings, query, (11 more...)

2401.05939

Country:

North America > United States > New York > New York County > New York City (0.07)
North America > United States > Alaska (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(6 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

Entity Recognition from Colloquial Text

Babaian, Tamara, Xu, Jennifer

arXiv.org Artificial Intelligence, Jan-9-2024

Extraction of concepts and entities of interest from non-formal texts such as social media posts and informal communication is an important capability for decision support systems in many domains, including healthcare, customer relationship management, and others. Despite the recent advances in training large language models for a variety of natural language processing tasks, the developed models and techniques have mainly focused on formal texts and do not perform as well on colloquial data, which is characterized by a number of distinct challenges. In our research, we focus on the healthcare domain and investigate the problem of symptom recognition from colloquial texts by designing and evaluating several training strategies for BERT-based model fine-tuning. These strategies are distinguished by the choice of the base model, the training corpora, and application of term perturbations in the training data. The best-performing models trained using these strategies outperform the state-of-the-art specialized symptom recognizer by a large margin. Through a series of experiments, we have found specific patterns of model behavior associated with the training strategies we designed. We present design principles for training strategies for effective entity recognition in colloquial texts based on our findings.

colloquial text, recognition, symptom, (16 more...)

doi: 10.1016/j.dss.2024.114172

2401.04853

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > Ontario > Toronto (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Health Care Technology (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.86)

Sibyl: Forecasting Time-Evolving Query Workloads

Huang, Hanxian, Siddiqui, Tarique, Alotaibi, Rana, Curino, Carlo, Leeka, Jyoti, Jindal, Alekh, Zhao, Jishen, Camacho-Rodriguez, Jesus, Tian, Yuanyuan

arXiv.org Artificial Intelligence, Jan-8-2024

Database systems often rely on historical query traces to perform workload-based performance tuning. However, real production workloads are time-evolving, making historical queries ineffective for optimizing future workloads. To address this challenge, we propose Sibyl, an end-to-end machine learning-based framework that accurately forecasts a sequence of future queries, with the entire query statements, in various prediction windows. Drawing insights from real-workloads, we propose template-based featurization techniques and develop a stacked-LSTM with an encoder-decoder architecture for accurate forecasting of query workloads. We also develop techniques to improve forecasting accuracy over large prediction

For workload-based optimization, the input workload plays a crucial role and needs to be a good representation of the expected workload. Traditionally, historical query traces have been used as input workloads with the assumption that workloads are mostly static. However, as we discuss in Section 2, many real workloads exhibit highly recurring query structures with changing patterns in both their arrival intervals and data accesses. For instance, query templates are often shared across users, teams, and applications, but may be customized with different parameter values to access varying data at different points in time. Consider a log analysis query that reports errors for different devices and error types: "SELECT * FROM T WHERE deviceType = ? AND errorType = ? AND eventDate BETWEEN ?
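The notion of a query template (one shared query shape, many parameter values) can be made concrete with a small sketch. The regex-based literal masking below is a deliberately naive assumption for illustration, and `to_template` and `group_by_template` are hypothetical names; the excerpt does not describe how Sibyl itself extracts templates, and a production system would more likely use a SQL parser.

```python
import re

# Matches single-quoted string literals (with '' escapes) or bare numbers.
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def to_template(sql):
    """Replace literal parameter values with '?' placeholders."""
    return _LITERAL.sub("?", sql)


def group_by_template(queries):
    """Group concrete queries by the template they instantiate."""
    groups = {}
    for q in queries:
        groups.setdefault(to_template(q), []).append(q)
    return groups


# Two hypothetical log-analysis queries that differ only in parameters.
queries = [
    "SELECT * FROM T WHERE deviceType = 'phone' AND errorType = 'crash'",
    "SELECT * FROM T WHERE deviceType = 'tablet' AND errorType = 'timeout'",
]
groups = group_by_template(queries)
```

Once queries are grouped this way, each template yields a time series of arrivals and parameter values, which is the kind of input a forecasting model could be trained on.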

query, template, workload, (14 more...)

doi: 10.1145/3639308

2401.03723

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > United States > Texas > Harris County > Houston (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.93)

Synthetic Query Generation for Privacy-Preserving Deep Retrieval Systems using Differentially Private Language Models

Carranza, Aldo Gael, Farahani, Rezsa, Ponomareva, Natalia, Kurakin, Alex, Jagielski, Matthew, Nasr, Milad

arXiv.org Artificial Intelligence, Jan-8-2024

We address the challenge of ensuring differential privacy (DP) guarantees in training deep retrieval systems. Training these systems often involves the use of contrastive-style losses, which are typically non-per-example decomposable, making them difficult to directly DP-train with since common techniques require per-example gradient. To address this issue, we propose an approach that prioritizes ensuring query privacy prior to training a deep retrieval system. Our method employs DP language models (LMs) to generate private synthetic queries representative of the original data. These synthetic queries can be used in downstream retrieval system training without compromising privacy. Our approach demonstrates a significant enhancement in retrieval quality compared to direct DP-training, all while maintaining query-level privacy guarantees. This work highlights the potential of harnessing LMs to overcome limitations in standard DP-training methods.

arxiv preprint arxiv, privacy, query, (14 more...)

2305.05973

Country:

North America > United States > New York > New York County > New York City (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > British Indian Ocean Territory > Diego Garcia (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Data Science > Data Mining > Big Data (0.40)

MaskSearch: Querying Image Masks at Scale

He, Dong, Zhang, Jieyu, Daum, Maureen, Ratner, Alexander, Balazinska, Magdalena

arXiv.org Artificial Intelligence, Jan-8-2024

Machine learning tasks over image databases often generate masks that annotate image content (e.g., saliency maps, segmentation maps, depth maps) and enable a variety of applications (e.g., determine if a model is learning spurious correlations or if an image was maliciously modified to mislead a model). While queries that retrieve examples based on mask properties are valuable to practitioners, existing systems do not support them efficiently. In this paper, we formalize the problem and propose MaskSearch, a system that focuses on accelerating queries over databases of image masks while guaranteeing the correctness of query results. MaskSearch leverages a novel indexing technique and an efficient filter-verification query execution framework. Experiments with our prototype show that MaskSearch, using indexes approximately 5% of the compressed data size, accelerates individual queries by up to two orders of magnitude and consistently outperforms existing methods on various multi-query workloads that simulate dataset exploration and analysis processes.

masksearch, predicate, query, (17 more...)

2305.02375

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Texas > Harris County > Houston (0.04)
North America > United States > Indiana > Marion County > Indianapolis (0.04)
(3 more...)

Genre: Research Report (0.40)

Industry:

Health & Medicine > Diagnostic Medicine (0.68)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.57)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

A Span-based Model for Extracting Overlapping PICO Entities from RCT Publications

Zhang, Gongbo, Zhou, Yiliang, Hu, Yan, Xu, Hua, Weng, Chunhua, Peng, Yifan

arXiv.org Artificial Intelligence, Jan-7-2024

Objectives: Extraction of PICO (Populations, Interventions, Comparison, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method PICOX to extract overlapping PICO entities. Materials and Methods: PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then it uses a multi-label classifier to assign one or more PICO labels to a span candidate. PICOX was evaluated using one of the best-performing baselines, EBM-NLP, and three more datasets, i.e., PICO-Corpus, and RCT publications on Alzheimer's Disease or COVID-19, using entity-level precision, recall, and F1 scores. Results: PICOX achieved superior precision, recall, and F1 scores across the board, with the micro F1 score improving from 45.05 to 50.87 (p << 0.01). On the PICO-Corpus, PICOX obtained higher recall and F1 scores than the baseline and improved the micro recall score from 56.66 to 67.33. On the COVID-19 dataset, PICOX also outperformed the baseline and improved the micro F1 score from 77.10 to 80.32. On the AD dataset, PICOX demonstrated comparable F1 scores with higher precision when compared to the baseline. Conclusion: PICOX excels in identifying overlapping entities and consistently surpasses a leading baseline across multiple datasets. Ablation studies reveal that its data augmentation strategy effectively minimizes false positives and improves precision.

computational linguistic, dataset, picox, (14 more...)

2401.06791

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(5 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.55)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.46)
Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Building Efficient and Effective OpenQA Systems for Low-Resource Languages

Budur, Emrah, Özçelik, Rıza, Soylu, Dilara, Khattab, Omar, Güngör, Tunga, Potts, Christopher

arXiv.org Artificial Intelligence, Jan-7-2024

Question answering (QA) is the task of answering questions posed in natural language with free-form natural language answers extracted from a given passage. In the OpenQA variant, only a question text is given, and the system must retrieve relevant passages from an unstructured knowledge source and use them to provide answers, which is the case in the mainstream QA systems on the Web. QA systems currently are mostly limited to the English language due to the lack of large-scale labeled QA datasets in non-English languages. In this paper, we show that effective, low-cost OpenQA systems can be developed for low-resource languages. The key ingredients are (1) weak supervision using machine-translated labeled datasets and (2) a relevant unstructured knowledge source in the target language. Furthermore, we show that only a few hundred gold assessment examples are needed to reliably evaluate these systems. We apply our method to Turkish as a challenging case study, since English and Turkish are typologically very distinct. We present SQuAD-TR, a machine translation of SQuAD2.0, and we build our OpenQA system by adapting ColBERT-QA for Turkish. We obtain a performance improvement of 9-34% in the EM score and 13-33% in the F1 score compared to the BM25-based and DPR-based baseline QA reader models by using two versions of Wikipedia dumps spanning two years. Our results show that SQuAD-TR makes OpenQA feasible for Turkish, which we hope encourages researchers to build OpenQA systems in other low-resource languages. We make all the code, models, and the dataset publicly available.

dataset, retrieved, retriever, (16 more...)

2401.0359

Country:

Europe > United Kingdom (0.14)
North America > United States > Washington > King County > Seattle (0.14)
South America > Venezuela (0.04)
(34 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media (0.93)
Leisure & Entertainment (0.93)
Health & Medicine (0.67)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

CAPTAIN at COLIEE 2023: Efficient Methods for Legal Information Retrieval and Entailment Tasks

Nguyen, Chau, Nguyen, Phuong, Tran, Thanh, Nguyen, Dat, Trieu, An, Pham, Tin, Dang, Anh, Nguyen, Le-Minh

arXiv.org Artificial Intelligence, Jan-7-2024

The Competition on Legal Information Extraction/Entailment (COLIEE) is held annually to encourage advancements in the automatic processing of legal texts. Processing legal documents is challenging due to the intricate structure and meaning of legal language. In this paper, we outline our strategies for tackling Task 2, Task 3, and Task 4 in the COLIEE 2023 competition. Our approach involved utilizing appropriate state-of-the-art deep learning methods, designing methods based on domain characteristics observation, and applying meticulous engineering practices and methodologies to the competition. As a result, our performance in these tasks has been outstanding, with first places in Task 2 and Task 3, and promising results in Task 4. Our source code is available at https://github.com/Nguyen2015/CAPTAIN-COLIEE2023/tree/coliee2023.

captain, coliee 2023, nguyen, (16 more...)

2401.03551

Country:

Europe > Portugal > Braga > Braga (0.05)
Asia > Japan > Honshū > Tōhoku (0.05)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.05)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
