AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Imitation of Life: A Search Engine for Biologically Inspired Design

Emuna, Hen, Borenstein, Nadav, Qian, Xin, Kang, Hyeonsu, Chan, Joel, Kittur, Aniket, Shahaf, Dafna

arXiv.org Artificial IntelligenceDec-19-2023

Biologically Inspired Design (BID), or Biomimicry, is a problem-solving methodology that applies analogies from nature to solve engineering challenges. For example, Speedo engineers designed swimsuits based on shark skin. Finding relevant biological solutions for real-world problems poses significant challenges, both due to the limited biological knowledge engineers and designers typically possess and to the limited BID resources. Existing BID datasets are hand-curated and small, and scaling them up requires costly human annotations. In this paper, we introduce BARcode (Biological Analogy Retriever), a search engine for automatically mining bio-inspirations from the web at scale. Using advances in natural language understanding and data programming, BARcode identifies potential inspirations for engineering challenges. Our experiments demonstrate that BARcode can retrieve inspirations that are valuable to engineers and designers tackling real-world problems, as well as recover famous historical BID examples. We release data and code; we view BARcode as a step towards addressing the challenges that have historically hindered the practical application of BID to engineering innovation.

bar code, query, right attr, (16 more...)

arXiv.org Artificial Intelligence

2312.12681

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > New York > New York County > New York City (0.04)
(8 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.61)
(2 more...)

Add feedback

UniGen: A Unified Generative Framework for Retrieval and Question Answering with Large Language Models

Li, Xiaoxi, Zhou, Yujia, Dou, Zhicheng

arXiv.org Artificial IntelligenceDec-18-2023

Generative information retrieval, encompassing two major tasks of Generative Document Retrieval (GDR) and Grounded Answer Generation (GAR), has gained significant attention in the area of information retrieval and natural language processing. Existing methods for GDR and GAR rely on separate retrieval and reader modules, which hinder simultaneous optimization. To overcome this, we present \textbf{UniGen}, a \textbf{Uni}fied \textbf{Gen}erative framework for retrieval and question answering that integrates both tasks into a single generative model leveraging the capabilities of large language models. UniGen employs a shared encoder and two distinct decoders for generative retrieval and question answering. To facilitate the learning of both tasks, we introduce connectors, generated by large language models, to bridge the gaps between query inputs and generation targets, as well as between document identifiers and answers. Furthermore, we propose an iterative enhancement strategy that leverages generated answers and retrieved documents to iteratively improve both tasks. Through extensive experiments on the MS MARCO and NQ datasets, we demonstrate the effectiveness of UniGen, showcasing its superior performance in both the retrieval and the question answering tasks.

arxiv preprint arxiv, language model, qa task, (14 more...)

arXiv.org Artificial Intelligence

2312.11036

Country:

Asia > China > Beijing > Beijing (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Arabic Fine-Grained Entity Recognition

Liqreina, Haneen, Jarrar, Mustafa, Khalilia, Mohammed, El-Shangiti, Ahmed Oumar, Abdul-Mageed, Muhammad

arXiv.org Artificial IntelligenceDec-18-2023

Traditional NER systems are typically trained to recognize coarse-grained entities, and less attention is given to classifying entities into a hierarchy of fine-grained lower-level subtypes. This article aims to advance Arabic NER with fine-grained entities. We chose to extend Wojood (an open-source Nested Arabic Named Entity Corpus) with subtypes. In particular, four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with 31 subtypes. To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines, which yielded 5, 614 changes. Second, all mentions of GPE, LOC, ORG, and FAC (~44K) in Wojood are manually annotated with the LDC's ACE sub-types. We refer to this extended version of Wojood as WojoodF ine. To evaluate our annotations, we measured the inter-annotator agreement (IAA) using both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively. To compute the baselines of WojoodF ine, we fine-tune three pre-trained Arabic BERT encoders in three settings: flat NER, nested NER and nested NER with subtypes and achieved F1 score of 0.920, 0.866, and 0.885, respectively. Our corpus and models are open-source and available at https://sina.birzeit.edu/wojood/.

corpus, subtype, wojood, (13 more...)

arXiv.org Artificial Intelligence

2310.17333

Country:

Asia > Middle East > Palestine > Gaza Strip > Gaza Governorate > Gaza (0.14)
Africa > Sudan (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(15 more...)

Genre: Research Report (0.82)

Industry: Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Limited Query Graph Connectivity Test

Guo, Mingyu, Li, Jialiang, Neumann, Aneta, Neumann, Frank, Nguyen, Hung

arXiv.org Artificial IntelligenceDec-18-2023

We propose a combinatorial optimisation model called Limited Query Graph Connectivity Test. We consider a graph whose edges have two possible states (On/Off). The edges' states are hidden initially. We could query an edge to reveal its state. Given a source s and a destination t, we aim to test s-t connectivity by identifying either a path (consisting of only On edges) or a cut (consisting of only Off edges). We are limited to B queries, after which we stop regardless of whether graph connectivity is established. We aim to design a query policy that minimizes the expected number of queries. Our model is mainly motivated by a cyber security use case where we need to establish whether an attack path exists in a network, between a source and a destination. Edge query is resolved by manual effort from the IT admin, which is the motivation behind query minimization. Our model is highly related to monotone Stochastic Boolean Function Evaluation (SBFE). There are two existing exact algorithms for SBFE that are prohibitively expensive. We propose a significantly more scalable exact algorithm. While previous exact algorithms only scale for trivial graphs (i.e., past works experimented on at most 20 edges), we empirically demonstrate that our algorithm is scalable for a wide range of much larger practical graphs (i.e., Windows domain network graphs with tens of thousands of edges). We propose three heuristics. Our best-performing heuristic is via reducing the search horizon of the exact algorithm. The other two are via reinforcement learning (RL) and Monte Carlo tree search (MCTS). We also derive an anytime algorithm for computing the performance lower bound. Experimentally, we show that all our heuristics are near optimal. The exact algorithm based heuristic outperforms all, surpassing RL, MCTS and 8 existing heuristics ported from SBFE and related literature.

algorithm, exact algorithm, query, (14 more...)

arXiv.org Artificial Intelligence

2302.13036

Country:

North America > United States > Minnesota (0.04)
Oceania > Australia > South Australia > Adelaide (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(4 more...)

Genre: Research Report (0.82)

Industry:

Health & Medicine (1.00)
Information Technology > Security & Privacy (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.72)

Add feedback

Identifying Planetary Names in Astronomy Papers: A Multi-Step Approach

Shapurian, Golnaz, Kurtz, Michael J, Accomazzi, Alberto

arXiv.org Artificial IntelligenceDec-17-2023

The automatic identification of planetary feature names in astronomy publications presents numerous challenges. These features include craters, defined as roughly circular depressions resulting from impact or volcanic activity; dorsas, which are elongate raised structures or wrinkle ridges; and lacus, small irregular patches of dark, smooth material on the Moon, referred to as "lake" (Planetary Names Working Group, n.d.). Many feature names overlap with places or people's names that they are named after, for example, Syria, Tempe, Einstein, and Sagan, to name a few (U.S. Geological Survey, n.d.). Some feature names have been used in many contexts, for instance, Apollo, which can refer to mission, program, sample, astronaut, seismic, seismometers, core, era, data, collection, instrument, and station, in addition to the crater on the Moon. Some feature names can appear in the text as adjectives, like the lunar craters Black, Green, and White. Some feature names in other contexts serve as directions, like craters West and South on the Moon. Additionally, some features share identical names across different celestial bodies, requiring disambiguation, such as the Adams crater, which exists on both the Moon and Mars. We present a multi-step pipeline combining rule-based filtering, statistical relevance analysis, part-of-speech (POS) tagging, named entity recognition (NER) model, hybrid keyword harvesting, knowledge graph (KG) matching, and inference with a locally installed large language model (LLM) to reliably identify planetary names despite these challenges. When evaluated on a dataset of astronomy papers from the Astrophysics Data System (ADS), this methodology achieves an F1-score over 0.97 in disambiguating planetary feature names.

feature name, information retrieval, large language model, (19 more...)

arXiv.org Artificial Intelligence

2312.08579

Country:

Asia > Middle East > Syria (0.24)
North America > United States > New Mexico (0.14)

Genre: Research Report (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (0.68)
Energy > Oil & Gas (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.30)

Add feedback

Search Still Matters: Information Retrieval in the Era of Generative AI

Hersh, William R.

arXiv.org Artificial IntelligenceDec-17-2023

Objective: Information retrieval (IR, also known as search) systems are ubiquitous in modern times. How does the emergence of generative artificial intelligence (AI), based on large language models (LLMs), fit into the IR process? Process: This perspective explores the use of generative AI in the context of the motivations, considerations, and outcomes of the IR process with a focus on the academic use of such systems. Conclusions: There are many information needs, from simple to complex, that motivate use of IR. Users of such systems, particularly academics, have concerns for authoritativeness, timeliness, and contextualization of search. While LLMs may provide functionality that aids the IR process, the continued need for search systems, and research into their improvement, remains essential.

genera, informa, search system, (14 more...)

arXiv.org Artificial Intelligence

2311.1855

Country:

North America > United States > Virginia > Arlington County > Arlington (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Epidemiology (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.89)

Add feedback

Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off

Liu, Yu-An, Zhang, Ruqing, Zhang, Mingkun, Chen, Wei, de Rijke, Maarten, Guo, Jiafeng, Cheng, Xueqi

arXiv.org Artificial IntelligenceDec-16-2023

Neural ranking models (NRMs) have shown great success in information retrieval (IR). But their predictions can easily be manipulated using adversarial examples, which are crafted by adding imperceptible perturbations to legitimate documents. This vulnerability raises significant concerns about their reliability and hinders the widespread deployment of NRMs. By incorporating adversarial examples into training data, adversarial training has become the de facto defense approach to adversarial attacks against NRMs. However, this defense mechanism is subject to a trade-off between effectiveness and adversarial robustness. In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs. We decompose the robust ranking error into two components, i.e., a natural ranking error for effectiveness evaluation and a boundary ranking error for assessing adversarial robustness. Then, we define the perturbation invariance of a ranking model and prove it to be a differentiable upper bound on the boundary ranking error for attainable computation. Informed by our theoretical analysis, we design a novel \emph{perturbation-invariant adversarial training} (PIAT) method for ranking models to achieve a better effectiveness-robustness trade-off. We design a regularized surrogate loss, in which one term encourages the effectiveness to be maximized while the regularization term encourages the output to be smooth, so as to improve adversarial robustness. Experimental results on several ranking models demonstrate the superiority of PITA compared to existing adversarial defenses.

boundary ranking error, ranking error, robustness, (14 more...)

arXiv.org Artificial Intelligence

2312.10329

Country:

Asia > Middle East > Jordan (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology (0.35)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.46)

Add feedback

Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language

Drchal, Jan, Ullrich, Herbert, Mlynář, Tomáš, Moravec, Václav

arXiv.org Artificial IntelligenceDec-15-2023

This article presents a pipeline for automated fact-checking leveraging publicly available Language Models and data. The objective is to assess the accuracy of textual claims using evidence from a ground-truth evidence corpus. The pipeline consists of two main modules -- the evidence retrieval and the claim veracity evaluation. Our primary focus is on the ease of deployment in various languages that remain unexplored in the field of automated fact-checking. Unlike most similar pipelines, which work with evidence sentences, our pipeline processes data on a paragraph level, simplifying the overall architecture and data requirements. Given the high cost of annotating language-specific fact-checking training data, our solution builds on the Question Answering for Claim Generation (QACG) method, which we adapt and use to generate the data for all models of the pipeline. Our strategy enables the introduction of new languages through machine translation of only two fixed datasets of moderate size. Subsequently, any number of training samples can be generated based on an evidence corpus in the target language. We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines, as well as to our codebase that may be used to reproduce the results.We comprehensively evaluate the pipelines for all four languages, including human annotations and per-sample difficulty assessment using Pointwise V-information. The presented experiments are based on full Wikipedia snapshots to promote reproducibility. To facilitate implementation and user interaction, we develop the FactSearch application featuring the proposed pipeline and the preliminary feedback on its performance.

dataset, paragraph, pipeline, (16 more...)

arXiv.org Artificial Intelligence

2312.10171

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Indiana > Marion County > Indianapolis (0.05)
Europe > Czechia > Prague (0.04)
(17 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Government > Regional Government > North America Government > United States Government (0.67)
Media > News (0.46)
Government > Voting & Elections (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.94)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)
(2 more...)

Add feedback

C-Pack: Packaged Resources To Advance General Chinese Embedding

Xiao, Shitao, Liu, Zheng, Zhang, Peitian, Muennighoff, Niklas

arXiv.org Artificial IntelligenceDec-15-2023

We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.

arxiv preprint arxiv, c-mtp, dataset, (13 more...)

arXiv.org Artificial Intelligence

2309.07597

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report > Promising Solution (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Communications > Social Media (0.68)
(4 more...)

Add feedback

Evaluation of semantic relations impact in query expansion-based retrieval systems

Massai, Lorenzo

arXiv.org Artificial IntelligenceDec-15-2023

With the increasing demand of intelligent systems capable of operating in different contexts (e.g. users on the move) the correct interpretation of the user-need by such systems has become crucial to give consistent answers to the user questions. The most effective applications addressing such task are in the fields of natural language processing and semantic expansion of terms. These techniques are aimed at estimating the goal of an input query reformulating it as an intent, commonly relying on textual resources built exploiting different semantic relations like \emph{synonymy}, \emph{antonymy} and many others. The aim of this paper is to generate such resources using the labels of a given taxonomy as source of information. The obtained resources are integrated into a plain classifier for reformulating a set of input queries as intents and tracking the effect of each relation, in order to quantify the impact of each semantic relation on the classification. As an extension to this, the best tradeoff between improvement and noise introduction when combining such relations is evaluated. The assessment is made generating the resources and their combinations and using them for tuning the classifier which is used to reformulate the user questions as labels. The evaluation employs a wide and varied taxonomy as a use-case, exploiting its labels as basis for the semantic expansion and producing several corpora with the purpose of enhancing the pseudo-queries estimation.

expansion, relation, semantic relation, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.knosys.2023.111183

2203.1623

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Louisiana (0.04)
(6 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.53)

Add feedback