AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

A Knowledge Graph-Based Search Engine for Robustly Finding Doctors and Locations in the Healthcare Domain

Kejriwal, Mayank, Haidarian, Hamid, Chiu, Min-Hsueh, Xiang, Andy, Shrestha, Deep, Javed, Faizan

arXiv.org Artificial IntelligenceOct-8-2023

Efficiently finding doctors and locations is an important search problem for patients in the healthcare domain, for which traditional information retrieval methods tend not to work optimally. In the last ten years, knowledge graphs (KGs) have emerged as a powerful way to combine the benefits of gleaning insights from semi-structured data using semantic modeling, natural language processing techniques like information extraction, and robust querying using structured query languages like SPARQL and Cypher. In this short paper, we present a KG-based search engine architecture for robustly finding doctors and locations in the healthcare domain. Early results demonstrate that our approach can lead to significantly higher coverage for complex queries without degrading quality.

doctor and location, healthcare domain, knowledge graph-based search engine

arXiv.org Artificial Intelligence

2310.05258

Genre: Research Report (0.69)

Industry: Health & Medicine (0.80)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Add feedback

On the Query Complexity of Training Data Reconstruction in Private Learning

Mukherjee, Prateeti, Lokam, Satya

arXiv.org Machine LearningOct-8-2023

We analyze the number of queries that a whitebox adversary needs to make to a private learner in order to reconstruct its training data. For (ϵ, δ) DP learners with training data drawn from any arbitrary compact metric space, we provide the first known lower bounds on the adversary's query complexity as a function of the learner's privacy parameters. Our results are minimax optimal for every ϵ 0, δ [0, 1], covering both ϵ-DP and (0, δ) DP as corollaries. Beyond this, we obtain query complexity lower bounds for (α, ϵ) Rényi DP learners that are valid for any α > 1, ϵ 0. Finally, we analyze data reconstruction attacks on locally compact metric spaces via the framework of Metric DP, a generalization of DP that accounts for the underlying metric structure of the data. In this setting, we provide the first known analysis of data reconstruction in unbounded, high dimensional spaces and obtain query complexity lower bounds that are nearly tight modulo logarithmic factors.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2303.16372

Country:

North America > United States > New York > New York County > New York City (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > Santa Clara County > Santa Clara (0.04)
(2 more...)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Add feedback

Zero-shot Cross-lingual Transfer without Parallel Corpus

Zhang, Yuyang, Han, Xiaofeng, Wang, Baojun

arXiv.org Artificial IntelligenceOct-7-2023

Recently, although pre-trained language models have achieved great success on multilingual NLP (Natural Language Processing) tasks, the lack of training data on many tasks in low-resource languages still limits their performance. One effective way of solving that problem is to transfer knowledge from rich-resource languages to low-resource languages. However, many previous works on cross-lingual transfer rely heavily on the parallel corpus or translation models, which are often difficult to obtain. We propose a novel approach to conduct zero-shot cross-lingual transfer with a pre-trained model. It consists of a Bilingual Task Fitting module that applies task-related bilingual information alignment; a self-training module generates pseudo soft and hard labels for unlabeled data and utilizes them to conduct self-training. We got the new SOTA on different tasks without any dependencies on the parallel corpus or translation models.

cross-lingual transfer, dataset, target language, (16 more...)

arXiv.org Artificial Intelligence

2310.04726

Country:

Europe > Germany (0.04)
North America > Canada (0.04)
Europe > France (0.04)
(2 more...)

Genre:

Instructional Material (0.55)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)

Add feedback

An Efficient Content-based Time Series Retrieval System

Yeh, Chin-Chia Michael, Chen, Huiyuan, Dai, Xin, Zheng, Yan, Wang, Junpeng, Lai, Vivian, Fan, Yujie, Der, Audrey, Zhuang, Zhongfang, Wang, Liang, Zhang, Wei, Phillips, Jeff M.

arXiv.org Artificial IntelligenceOct-5-2023

A Content-based Time Series Retrieval (CTSR) system is an information retrieval system for users to interact with time series emerged from multiple domains, such as finance, healthcare, and manufacturing. For example, users seeking to learn more about the source of a time series can submit the time series as a query to the CTSR system and retrieve a list of relevant time series with associated metadata. By analyzing the retrieved metadata, users can gather more information about the source of the time series. Because the CTSR system is required to work with time series data from diverse domains, it needs a high-capacity model to effectively measure the similarity between different time series. On top of that, the model within the CTSR system has to compute the similarity scores in an efficient manner as the users interact with the system in real-time. In this paper, we propose an effective and efficient CTSR model that outperforms alternative models, while still providing reasonable inference runtimes. To demonstrate the capability of the proposed method in solving business problems, we compare it against alternative models using our in-house transaction data. Our findings reveal that the proposed model is the most suitable solution compared to others for our transaction data problem.

content-based time sery retrieval system

arXiv.org Artificial Intelligence

2310.03919

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.53)

Add feedback

MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

Jin, Qiao, Kim, Won, Chen, Qingyu, Comeau, Donald C., Yeganova, Lana, Wilbur, W. John, Lu, Zhiyong

arXiv.org Artificial IntelligenceOct-3-2023

Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely-integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.

medcpt, representation, retriever, (12 more...)

arXiv.org Artificial Intelligence

doi: 10.1093/bioinformatics/btad651

2307.00589

Country:

Europe > Germany (0.05)
North America > United States > New York > New York County > New York City (0.04)
Europe > Switzerland (0.04)
(3 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.93)
Health & Medicine > Pharmaceuticals & Biotechnology (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Automatic Quality Assessment of Wikipedia Articles -- A Systematic Literature Review

Moás, Pedro Miguel, Lopes, Carla Teixeira

arXiv.org Artificial IntelligenceOct-3-2023

Wikipedia is the world's largest online encyclopedia, but maintaining article quality through collaboration is challenging. Wikipedia designed a quality scale, but with such a manual assessment process, many articles remain unassessed. We review existing methods for automatically measuring the quality of Wikipedia articles, identifying and comparing machine learning algorithms, article features, quality metrics, and used datasets, examining 149 distinct studies, and exploring commonalities and gaps in them. The literature is extensive, and the approaches follow past technological trends. However, machine learning is still not widely used by Wikipedia, and we hope that our analysis helps future researchers change that reality.

international conference, new york city, wikipedia, (10 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3625286

2310.02235

Country:

North America > United States > California > San Francisco County > San Francisco (0.28)
North America > United States > New York > New York County > New York City (0.14)
Europe > Austria > Vienna (0.14)
(22 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.93)

Industry:

Information Technology (0.93)
Health & Medicine (0.67)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)
(2 more...)

Add feedback

EALM: Introducing Multidimensional Ethical Alignment in Conversational Information Retrieval

Yu, Yiyao, Wang, Junjie, Zhang, Yuxiang, Zhang, Lin, Yang, Yujiu, Sakai, Tetsuya

arXiv.org Artificial IntelligenceOct-2-2023

Artificial intelligence (AI) technologies should adhere to human norms to better serve our society and avoid disseminating harmful or misleading information, particularly in Conversational Information Retrieval (CIR). Previous work, including approaches and datasets, has not always been successful or sufficiently robust in taking human norms into consideration. To this end, we introduce a workflow that integrates ethical alignment, with an initial ethical judgment stage for efficient data screening. To address the need for ethical judgment in CIR, we present the QA-ETHICS dataset, adapted from the ETHICS benchmark, which serves as an evaluation tool by unifying scenarios and label meanings. However, each scenario only considers one ethical concept. Therefore, we introduce the MP-ETHICS dataset to evaluate a scenario under multiple ethical concepts, such as justice and Deontology. In addition, we suggest a new approach that achieves top performance in both binary and multi-label ethical judgment tasks. Our research provides a practical method for introducing ethical alignment into the CIR workflow. The data and code are available at https://github.com/wanng-ide/ealm .

dataset, ethical concept, ethics, (14 more...)

arXiv.org Artificial Intelligence

2310.0097

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > China > Guangdong Province > Shenzhen (0.05)
Asia > China > Beijing > Beijing (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.34)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

Query Rewriting for Effective Misinformation Discovery

Kazemi, Ashkan, Abzaliev, Artem, Deng, Naihao, Hou, Rui, Hale, Scott A., Pérez-Rosas, Verónica, Mihalcea, Rada

arXiv.org Artificial IntelligenceOct-2-2023

We propose a novel system to help fact-checkers formulate search queries for known misinformation claims and effectively search across multiple social media platforms. We introduce an adaptable rewriting strategy, where editing actions for queries containing claims (e.g., swap a word with its synonym; change verb tense into present simple) are automatically learned through offline reinforcement learning. Our model uses a decision transformer to learn a sequence of editing actions that maximizes query retrieval metrics such as mean average precision. We conduct a series of experiments showing that our query rewriting system achieves a relative increase in the effectiveness of the queries of up to 42%, while producing editing action sequences that are human interpretable.

computational linguistic, query, sequence, (12 more...)

arXiv.org Artificial Intelligence

2210.07467

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Texas (0.04)
North America > United States > Michigan (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry:

Media > News (0.62)
Information Technology (0.46)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits

Di, Qiwei, Jin, Tao, Wu, Yue, Zhao, Heyang, Farnoud, Farzad, Gu, Quanquan

arXiv.org Machine LearningOct-2-2023

Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems. While substantial efforts have been made to minimize the cumulative regret in dueling bandits, a notable gap in the current research is the absence of regret bounds that account for the inherent uncertainty in pairwise comparisons between the dueling arms. Intuitively, greater uncertainty suggests a higher level of difficulty in the problem. To bridge this gap, this paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde O\big(d\sqrt{\sum_{t=1}^T\sigma_t^2} + d\big)$, where $\sigma_t$ is the variance of the pairwise comparison in round $t$, $d$ is the dimension of the context vectors, and $T$ is the time horizon. Our regret bound naturally aligns with the intuitive expectation in scenarios where the comparison is deterministic, the algorithm only suffers from an $\tilde O(d)$ regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.

data mining, inequality hold, machine learning, (23 more...)

arXiv.org Machine Learning

2310.00968

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > Virginia > Albemarle County > Charlottesville (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.47)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.34)

Add feedback

Holistic Evaluation of Language Models

Liang, Percy, Bommasani, Rishi, Lee, Tony, Tsipras, Dimitris, Soylu, Dilara, Yasunaga, Michihiro, Zhang, Yian, Narayanan, Deepak, Wu, Yuhuai, Kumar, Ananya, Newman, Benjamin, Yuan, Binhang, Yan, Bobby, Zhang, Ce, Cosgrove, Christian, Manning, Christopher D., Ré, Christopher, Acosta-Navas, Diana, Hudson, Drew A., Zelikman, Eric, Durmus, Esin, Ladhak, Faisal, Rong, Frieda, Ren, Hongyu, Yao, Huaxiu, Wang, Jue, Santhanam, Keshav, Orr, Laurel, Zheng, Lucia, Yuksekgonul, Mert, Suzgun, Mirac, Kim, Nathan, Guha, Neel, Chatterji, Niladri, Khattab, Omar, Henderson, Peter, Huang, Qian, Chi, Ryan, Xie, Sang Michael, Santurkar, Shibani, Ganguli, Surya, Hashimoto, Tatsunori, Icard, Thomas, Zhang, Tianyi, Chaudhary, Vishrav, Wang, William, Li, Xuechen, Mai, Yifan, Zhang, Yuhui, Koreeda, Yuta

arXiv.org Artificial IntelligenceOct-1-2023

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

civilcomment raft boolq narrativeqa naturalquestion, civilcomment raft mmlu boolqnarrativeqa naturalquestion, demographic representation and stereotypical association, (13 more...)

arXiv.org Artificial Intelligence

2211.0911

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.13)
North America > United States > Washington > King County > Seattle (0.13)
(49 more...)

Genre:

Research Report > New Finding (1.00)
Personal (1.00)
Overview (1.00)

Industry:

Media > News (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law (1.00)
(11 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(8 more...)

Add feedback