AITopics | Information Retrieval

Information Retrieval

Our accustomed systems for retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

Binary Embedding-based Retrieval at Tencent

Gan, Yukang, Ge, Yixiao, Zhou, Chang, Su, Shupeng, Xu, Zhouchuan, Xu, Xuyuan, Hui, Quanchao, Chen, Xiang, Wang, Yexin, Shan, Ying

arXiv.org Artificial Intelligence, Feb-17-2023

Large-scale embedding-based retrieval (EBR) is the cornerstone of search-related industrial applications. Given a user query, an EBR system aims to identify relevant information from a large corpus of documents that may be tens or hundreds of billions in size. Storage and computation become expensive and inefficient with massive document collections and highly concurrent queries, making it difficult to scale up further. To tackle this challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension. Specifically, we compress the full-precision query and document embeddings, generally formulated as float vectors, into a composition of multiple binary vectors using a lightweight transformation model with residual multilayer perceptron (MLP) blocks. We can therefore tailor the number of bits for different applications to trade off accuracy loss and cost savings. Importantly, we enable task-agnostic efficient training of the binarization model using a new embedding-to-embedding strategy. We also exploit the compatible training of binary embeddings so that the BEBR engine can support indexing among multiple embedding versions within a unified system. To further realize efficient search, we propose Symmetric Distance Calculation (SDC) to achieve lower response time than Hamming codes. We successfully deployed BEBR in Tencent products, including Sogou, Tencent Video, QQ World, etc. The binarization algorithm can be seamlessly generalized to various tasks with multiple modalities. Extensive experiments on offline benchmarks and online A/B tests demonstrate the efficiency and effectiveness of our method, which saves 30% to 50% of index costs with almost no loss of accuracy at the system level.
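The idea of approximating one float vector by a composition of several binary vectors can be illustrated with a small sketch. This is not the BEBR model: the abstract describes a learned transformation with residual MLP blocks, whereas the code below swaps in a simple greedy rule (sign bits of the remaining residual, scaled by its mean absolute value). The function names `residual_binarize`, `reconstruct`, and `l2_error`, and the toy vector, are hypothetical.

```python
import math


def residual_binarize(vec, num_levels):
    """Greedy residual binarization (an illustrative stand-in, not BEBR).

    Each level stores one sign bit per dimension plus a single float scale.
    The next level binarizes whatever the previous levels failed to capture.
    """
    residual = list(vec)
    levels = []
    for _ in range(num_levels):
        # Assumption: use the mean absolute residual as the per-level scale.
        scale = sum(abs(x) for x in residual) / len(residual)
        bits = [1 if x >= 0 else -1 for x in residual]
        levels.append((scale, bits))
        residual = [x - scale * b for x, b in zip(residual, bits)]
    return levels


def reconstruct(levels):
    """Sum the scaled binary vectors back into a float approximation."""
    dim = len(levels[0][1])
    out = [0.0] * dim
    for scale, bits in levels:
        for i, b in enumerate(bits):
            out[i] += scale * b
    return out


def l2_error(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    embedding = [0.9, -0.3, 0.1, -0.7]  # toy vector, not real data
    for n in (1, 2, 3):
        approx = reconstruct(residual_binarize(embedding, n))
        print(n, "level(s): reconstruction error", l2_error(embedding, approx))
```

With this rule the reconstruction error cannot increase as levels are added, which mirrors the trade-off the abstract describes: more bits per dimension cost more storage but lose less accuracy.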

data mining, information retrieval, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2302.08714

Country:

Asia > China (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > Taiwan > Taiwan Province > Taipei (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Cost-Effective Online Contextual Model Selection

Liu, Xuefeng, Xia, Fangfang, Stevens, Rick L., Chen, Yuxin

arXiv.org Artificial Intelligence, Feb-17-2023

How can we collect the most useful labels to learn a model selection policy, when presented with arbitrary heterogeneous data streams? In this paper, we formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context. The goal is to output the best model for any given context without obtaining an excessive amount of labels. In particular, we focus on the task of selecting pre-trained classifiers, and propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection. In comparison to prior art, our algorithm does not assume a globally optimal model. We provide rigorous theoretical analysis for the regret and query complexity under both adversarial and stochastic settings. Our experiments on several benchmark classification datasets demonstrate the algorithm's effectiveness in terms of both regret and query complexity. Notably, to achieve the same accuracy, CAMS incurs less than 10% of the label cost when compared to the best online model selection baselines on CIFAR10.

classifier, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2207.0603

Country: North America > United States (0.92)

Genre: Research Report (1.00)

Industry:

Energy (1.00)
Health & Medicine > Therapeutic Area (0.51)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.55)

Foundation Models for Natural Language Processing -- Pre-trained Language Models Integrating Media

Paaß, Gerhard, Giesselbach, Sven

arXiv.org Artificial Intelligence, Feb-16-2023

This open access book provides a comprehensive overview of the state of the art in research and applications of Foundation Models and is intended for readers familiar with basic Natural Language Processing (NLP) concepts. In recent years, a revolutionary new paradigm has been developed for training models for NLP. These models are first pre-trained on large collections of text documents to acquire general syntactic knowledge and semantic information. Then, they are fine-tuned for specific tasks, which they can often solve with superhuman accuracy. When the models are large enough, they can be instructed by prompts to solve new tasks without any fine-tuning. Moreover, they can be applied to a wide range of different media and problem domains, ranging from image and video processing to robot control learning. Because they provide a blueprint for solving many tasks in artificial intelligence, they have been called Foundation Models. After a brief introduction to basic NLP models, the main pre-trained language models (BERT, GPT, and the sequence-to-sequence transformer) are described, as well as the concepts of self-attention and context-sensitive embedding. Then, different approaches to improving these models are discussed, such as expanding the pre-training criteria, increasing the length of input texts, or including extra knowledge. An overview of the best-performing models for about twenty application areas is then presented, e.g., question answering, translation, story generation, dialog systems, and generating images from text. For each application area, the strengths and weaknesses of current models are discussed, and an outlook on further developments is given. In addition, links are provided to freely available program code. A concluding chapter summarizes the economic opportunities, mitigation of risks, and potential developments of AI.

large language model, machine learning, pattern recognition, (32 more...)

arXiv.org Artificial Intelligence

2302.08575

Country:

Europe > Ukraine > Kyiv Oblast > Kyiv (0.13)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.13)
North America > Canada > Ontario > Toronto (0.13)
(43 more...)

Genre:

Workflow (1.00)
Summary/Review (1.00)
Research Report > Promising Solution (1.00)
(4 more...)

Industry:

Transportation > Passenger (1.00)
Media > Television (1.00)
Media > News (1.00)
(21 more...)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
(23 more...)

Search-Engine-augmented Dialogue Response Generation with Cheaply Supervised Query Production

Wang, Ante, Song, Linfeng, Liu, Qi, Mi, Haitao, Wang, Longyue, Tu, Zhaopeng, Su, Jinsong, Yu, Dong

arXiv.org Artificial Intelligence, Feb-15-2023

Knowledge-aided dialogue response generation aims at augmenting chatbots with relevant external knowledge in the hope of generating more informative responses. The majority of previous work assumes that the relevant knowledge is given as input or retrieved from a static pool of knowledge. However, this assumption does not hold in real-world settings, where knowledge is continually updated and a chatbot has to dynamically retrieve useful knowledge. We propose a dialogue model that can access the vast and dynamic information from any search engine for response generation. As the core module, a query producer is used to generate queries from a dialogue context to interact with a search engine. We design a training algorithm using cheap noisy supervision for the query producer, where the signals are obtained by comparing retrieved articles with the next dialogue response. As a result, the query producer is adjusted without any human annotation of gold queries, making it easily transferable to other domains and search engines. Experiments show that our query producer can achieve R@1 and R@5 rates of 62.4% and 74.8% for retrieving gold knowledge, and the overall model generates better responses than strong knowledge-aided baselines using BART and other typical systems.
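For readers unfamiliar with the R@1 and R@5 figures, the following is a minimal sketch of how a recall-at-k rate of this kind is commonly computed when each query has a single gold item. It assumes the usual definition (the fraction of queries whose gold item appears among the top k retrieved results); the abstract does not spell out its exact evaluation protocol, and the function names and toy data are hypothetical.

```python
def recall_at_k(ranked_ids, gold_id, k):
    """Return 1.0 if the gold item is within the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(examples, k):
    """Average recall@k over (ranked_ids, gold_id) pairs."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in examples) / len(examples)


if __name__ == "__main__":
    # Toy retrieval results (made up): each entry is (ranking, gold item).
    examples = [
        (["a", "b", "c"], "a"),  # gold at rank 1
        (["b", "a", "c"], "a"),  # gold at rank 2
        (["b", "c", "d"], "a"),  # gold not retrieved
    ]
    for k in (1, 2):
        print("R@%d =" % k, mean_recall_at_k(examples, k))
```

Because the top-5 list always contains the top-1 result, R@5 can never be lower than R@1, which is consistent with the reported 62.4% and 74.8%.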

artificial intelligence, information retrieval, natural language, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.artint.2023.103874

2302.093

Country:

North America > United States > Mississippi > Lee County > Tupelo (0.04)
North America > United States > Tennessee > Davidson County > Nashville (0.04)
North America > Canada > Ontario > Toronto (0.04)
(5 more...)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (1.00)
Media (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Large-Scale Knowledge Synthesis and Complex Information Retrieval from Biomedical Documents

Saxena, Shreya, Sangani, Raj, Prasad, Siva, Kumar, Shubham, Athale, Mihir, Awhad, Rohan, Vaddina, Vishal

arXiv.org Artificial Intelligence, Feb-14-2023

Recent advances in the healthcare industry have led to an abundance of unstructured data, making it challenging to perform tasks such as efficient and accurate information retrieval at scale. Our work offers an all-in-one scalable solution for extracting and exploring complex information from large-scale research documents, which would otherwise be tedious. First, we briefly explain our knowledge synthesis process to extract helpful information from unstructured text data of research documents. Then, on top of the knowledge extracted from the documents, we perform complex information retrieval using three major components: Paragraph Retrieval, Triplet Retrieval from Knowledge Graphs, and Complex Question Answering (QA). These components combine lexical and semantic-based methods to retrieve paragraphs and triplets and perform faceted refinement for filtering these search results. The complexity of biomedical queries and documents necessitates using a QA system capable of handling queries more complex than factoid queries, which we evaluate qualitatively on the COVID-19 Open Research Dataset (CORD-19) to demonstrate the effectiveness and value-add.

information, information retrieval, machine learning, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/BigData55660.2022.10020725

2302.06854

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
Asia > China (0.04)

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Enhancing Model Performance in Multilingual Information Retrieval with Comprehensive Data Engineering Techniques

Zhang, Qi, Yang, Zijian, Huang, Yilun, Chen, Ze, Cai, Zijian, Wang, Kangxu, Zheng, Jiewen, He, Jiarong, Gao, Jin

arXiv.org Artificial Intelligence, Feb-14-2023

In this paper, we present our solution to the Multilingual Information Retrieval Across a Continuum of Languages (MIRACL) challenge of WSDM CUP 2023 (https://project-miracl.github.io/). Our solution focuses on enhancing the ranking stage, where we fine-tune pre-trained multilingual transformer-based models with the MIRACL dataset. Our model improvement is mainly achieved through diverse data engineering techniques, including the collection of additional relevant training data, data augmentation, and negative sampling. Our fine-tuned model effectively determines the semantic relevance between queries and documents, resulting in a significant improvement in the efficiency of the multilingual information retrieval process. Finally, our team is pleased to achieve remarkable results in this challenging competition, securing 2nd place in the Surprise-Languages track with a score of 0.835 and 3rd place in the Known-Languages track with an average nDCG@10 score of 0.716 across the 16 known languages on the final leaderboard.
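The nDCG@10 metric cited above can be sketched as follows. This is a generic textbook formulation, not the MIRACL evaluation code: it assumes linear gain (the relevance label itself) and a log2 rank discount, while some implementations use an exponential gain instead. The function names and toy labels are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_rels, judged_rels, k=10):
    """nDCG@k for one query.

    ranked_rels: relevance labels of the returned documents, in ranked order.
    judged_rels: labels of all judged documents for the query, used to build
                 the ideal ranking. Returns 0.0 when nothing is relevant.
    """
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Toy labels (made up): one relevant document, returned at rank 2.
    print(ndcg_at_k([0, 1, 0], judged_rels=[1, 0, 0], k=10))
```

A leaderboard figure such as an average nDCG@10 across languages would then be the mean of per-query scores within each language, averaged over the languages; the exact aggregation used by the challenge is not stated in the abstract.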

information retrieval, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2302.0701

Country:

Asia > Singapore (0.05)
North America > United States > New York > New York County > New York City (0.04)
Asia > China > Guangdong Province > Guangzhou (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

AI Generates Articles with Potentially Risky YMYL Content - Bytefeed - News Powered by AI

#artificialintelligence, Feb-13-2023, 18:30:26 GMT

Artificial Intelligence (AI) is becoming increasingly popular in the world of content creation. AI-generated articles are now being used to create serious, Your Money or Your Life (YMYL) content for websites and other digital platforms. The use of AI-generated articles has been growing steadily over the past few years as more businesses recognize its potential to produce high quality, engaging content quickly and efficiently. AI can be used to generate both short form and long form pieces that cover a wide range of topics from finance and health care to travel and lifestyle. AI-generated YMYL content is particularly useful for businesses looking to provide accurate information on important topics such as financial advice, medical advice, legal advice or any other type of topic where accuracy is essential.

ai generate article, news powered, risky ymyl content, (3 more...)

#artificialintelligence

Industry:

Law (0.59)
Health & Medicine (0.59)

Technology:

Information Technology > Artificial Intelligence > Applied AI (0.58)
Information Technology > Information Management > Search (0.40)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.40)

Improving Out-of-Distribution Generalization of Neural Rerankers with Contextualized Late Interaction

Zhang, Xinyu, Li, Minghan, Lin, Jimmy

arXiv.org Artificial Intelligence, Feb-13-2023

Recent progress in information retrieval finds that representing queries and documents as multiple vectors yields a robust bi-encoder retriever on out-of-distribution datasets. In this paper, we explore whether late interaction, the simplest form of multi-vector representation, is also helpful to neural rerankers that only use the [CLS] vector to compute the similarity score. Although, intuitively, the attention mechanism of rerankers at the previous layers already gathers the token-level information, we find that adding late interaction still brings an extra 5% improvement on average on out-of-distribution datasets, with little increase in latency and no degradation in in-domain effectiveness. Through extensive experiments and analysis, we show that the finding is consistent across different model sizes and first-stage retrievers of diverse natures and that the improvement is more prominent on longer queries.
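As a rough illustration of late interaction, the sketch below uses the MaxSim formulation popularized by ColBERT: every query token vector is matched to its most similar document token vector, and the per-token maxima are summed. Whether the paper uses exactly this scoring inside its reranker is an assumption here, and the function names and two-dimensional toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best document match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


if __name__ == "__main__":
    # Toy token embeddings (made up); real systems use contextualized vectors.
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
    doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token
    ranked = sorted(
        [("doc_a", doc_a), ("doc_b", doc_b)],
        key=lambda item: maxsim_score(query, item[1]),
        reverse=True,
    )
    print([name for name, _ in ranked])
```

In contrast to a single [CLS]-vector score, this keeps a token-level signal in the final similarity, which is the property the abstract investigates for rerankers.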

information retrieval, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2302.06589

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.28)
North America > United States > New York > New York County > New York City (0.05)
North America > United States > District of Columbia > Washington (0.05)
(7 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Large Scale Multi-Lingual Multi-Modal Summarization Dataset

Verma, Yash, Jangra, Anubhav, Kumar, Raghvendra, Saha, Sriparna

arXiv.org Artificial Intelligence, Feb-13-2023

Significant developments in techniques such as encoder-decoder models have enabled us to represent information comprising multiple modalities. This information can further enhance many downstream tasks in the field of information retrieval and natural language processing; however, improvements in multi-modal techniques and their performance evaluation require large-scale multi-modal data which offers sufficient diversity. Multi-lingual modeling for a variety of tasks like multi-modal summarization, text generation, and translation leverages information derived from high-quality multi-lingual annotated data. In this work, we present the current largest multi-lingual multi-modal summarization dataset (M3LS), which consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair. It is derived from news articles published by the British Broadcasting Corporation (BBC) over a decade and spans 20 languages, targeting diversity across five language roots. It is also the largest summarization dataset for 13 languages and includes cross-lingual summarization data for 2 languages. We formally define the multi-lingual multi-modal summarization task utilizing our dataset and report baseline scores from various state-of-the-art summarization techniques in a multi-lingual setting. We also compare it with many similar datasets to analyze the uniqueness and difficulty of M3LS.

information retrieval, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2302.0656

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > Papua New Guinea (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(3 more...)

Genre: Research Report (0.84)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

The Search Engine Showdown is Far from Over

#artificialintelligence, Feb-12-2023, 09:36:08 GMT

Back in the 1990s, the search engine category was a hot space. Yahoo, Netscape, AOL, Ask Jeeves, AltaVista, Google search, MSN and others were vying to capture the dominant position. With time, they all fizzled out. Post 2000 was the era of Google Search, the undisputed winner of the space until quite recently. The tide is turning and the crown of Google Search is under threat.

bing, microsoft, search engine, (11 more...)

#artificialintelligence

Industry: Information Technology > Services (0.94)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.69)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.52)
(2 more...)
