AITopics | document id

document id

Information about AI from the News, Publications, and Conferences

Embedding-Based Context-Aware Reranker

Yuan, Ye, Shabani, Mohammad Amin, Liu, Siqi

arXiv.org Artificial Intelligence, Oct-16-2025

Retrieval-Augmented Generation (RAG) systems rely on retrieving relevant evidence from a corpus to support downstream generation. The common practice of splitting a long document into multiple shorter passages enables finer-grained and targeted information retrieval. However, it also introduces challenges when a correct retrieval would require inference across passages, such as resolving coreference, disambiguating entities, and aggregating evidence scattered across multiple sources. Many state-of-the-art (SOTA) reranking methods, despite utilizing powerful large pretrained language models with potentially high inference costs, still neglect the aforementioned challenges. Therefore, we propose Embedding-Based Context-Aware Reranker (EBCAR), a lightweight reranking framework operating directly on embeddings of retrieved passages with enhanced cross-passage understandings through the structural information of the passages and a hybrid attention mechanism, which captures both high-level interactions across documents and low-level relationships within each document. We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.

large language model, machine learning, natural language, (19 more...)

2510.13329

Country:

Europe (0.93)
Asia (0.93)
North America > United States (0.68)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)

Improving Word Sense Disambiguation in Neural Machine Translation with Salient Document Context

Rippeth, Elijah, Carpuat, Marine, Duh, Kevin, Post, Matt

arXiv.org Artificial Intelligence, Nov-26-2023

Lexical ambiguity is a challenging and pervasive problem in machine translation (MT). We introduce a simple and scalable approach to resolve translation ambiguity by incorporating a small amount of extra-sentential context in neural MT. Our approach requires no sense annotation and no change to standard model architectures. Since actual document context is not available for the vast majority of MT training data, we collect related sentences for each input to construct pseudo-documents. Salient words from pseudo-documents are then encoded as a prefix to each source sentence to condition the generation of the translation. To evaluate, we release DocMuCoW, a challenge set for translation disambiguation based on the English-German MuCoW (Raganato et al., 2020) augmented with document IDs. Extensive experiments show that our method translates ambiguous source words better than strong sentence-level baselines and comparable document-level baselines while reducing training costs.

computational linguistic, proceedings, translation, (13 more...)

2311.15507

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(25 more...)

Genre: Research Report > Experimental Study (0.46)

Industry: Health & Medicine (0.47)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

ACID: Abstractive, Content-Based IDs for Document Retrieval with Language Models

Li, Haoxin, Keung, Phillip, Cheng, Daniel, Kasai, Jungo, Smith, Noah A.

arXiv.org Artificial Intelligence, Nov-14-2023

Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a new approach for end-to-end document retrieval that directly generates document identifiers given an input query. Techniques for designing effective, high-quality document IDs remain largely unexplored. We introduce ACID, in which each document's ID is composed of abstractive keyphrases generated by a large language model, rather than an integer ID sequence as done in past work. We compare our method with the current state-of-the-art technique for ID generation, which produces IDs through hierarchical clustering of document embeddings. We also examine simpler methods to generate natural-language document IDs, including the naive approach of using the first k words of each document as its ID or words with high BM25 scores in that document. We show that using ACID improves top-10 and top-20 accuracy by 15.6% and 14.4% (relative) respectively versus the state-of-the-art baseline on the MSMARCO 100k retrieval task, and 4.4% and 4.0% respectively on the Natural Questions 100k retrieval task. Our results demonstrate the effectiveness of human-readable, natural-language IDs in generative retrieval with LMs. The code for reproducing our results and the keyword-augmented datasets will be released on formal publication.
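The two simpler ID schemes mentioned in the abstract (first k words, and words with high BM25 scores) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the regex tokenizer, the default k = 5, and the BM25 parameters k1 = 1.2 and b = 0.75 are all assumptions chosen for illustration, and each term is scored against its own document using a common BM25 formulation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def first_k_words_id(doc, k=5):
    # Naive ID: the first k tokens of the document.
    return " ".join(tokenize(doc)[:k])


def bm25_top_words_id(docs, doc_index, k=5, k1=1.2, b=0.75):
    # ID built from the k terms of docs[doc_index] with the highest
    # BM25 weight, computed over the whole (non-empty) collection.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    toks = tokenized[doc_index]
    term_freq = Counter(toks)
    scores = {}
    for term, tf in term_freq.items():
        idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
        scores[term] = idf * tf * (k1 + 1) / norm

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return " ".join(ranked[:k])
```

Calling `first_k_words_id(doc, k=3)` returns the opening three tokens, while `bm25_top_words_id(docs, i)` favors terms that are frequent in document `i` but rare elsewhere. ACID itself replaces both with keyphrases generated by a language model, which this sketch does not attempt.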

document id, generative retrieval, retrieval, (15 more...)

2311.08593

Country:

North America > United States > Washington > King County > Seattle (0.14)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Trends in AI--March 2022

#artificialintelligence, Mar-10-2022, 22:48:13 GMT

If there's one extension I'd like to see made to this survey, it is a more in-depth inclusion of recent multimodal works that rely on prompting, such as Multimodal Few-Shot Learning with Frozen Language Models³, which we've highlighted in a previous blog post.

inference, learning, retrieval, (16 more...)

Country: Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Overview (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.97)
Information Technology > Communications > Social Media (0.90)

Let's build a Full-Text Search engine - Artem Krylysov

#artificialintelligence, Jul-28-2020, 21:32:56 GMT

Full-Text Search is one of those tools people use every day without realizing it. If you ever googled "golang coverage report" or tried to find "indoor wireless camera" on an e-commerce website, you used some kind of full-text search. Full-Text Search (FTS) is a technique for searching text in a collection of documents. A document can refer to a web page, a newspaper article, an email message, or any structured text. Today we are going to build our own FTS engine.
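To make the idea of searching text in a collection of documents concrete, here is a minimal sketch of one common approach, an inverted index that maps each token to the documents containing it. The excerpt above does not say how the article's engine is built, so the inverted-index design, the function names, the regex tokenizer, and the all-terms-must-match query semantics are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    # Map each token to the set of document IDs (list positions) containing it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    # Return sorted IDs of documents that contain every query token.
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)
```

With an index built over a list of strings, `search(index, "wireless camera")` returns the positions of the documents that contain both words. A practical engine would typically add steps such as stop-word removal and stemming before indexing.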

artificial intelligence, information retrieval, natural language, (16 more...)

Industry: Information Technology > Services > e-Commerce Services (0.55)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.41)

How machine learning is revolutionizing journalism - ICIJ

#artificialintelligence, Aug-22-2018, 13:52:21 GMT

The rise of the machine has freed ICIJ members globally to pore over millions of documents in a custom-built search engine. But even this next-level research has posed substantial challenges: for example, what to do when certain phrases return an indigestible 150,000 results? Clearly, the next step to speeding up our research was to intelligently filter information relevant to each investigation. Here's how we streamlined the previously daunting process, giving us both unprecedented flexibility and the required search success rate. In leaks like the Paradise Papers, we dealt with millions of documents (including PDFs, photos, and emails) that traditional platforms like Excel can't process.

information retrieval, machine learning, rapidminer, (20 more...)

Industry: Media > News (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.36)
