AITopics | Information Retrieval


Information Retrieval

Our accustomed systems for retrieving particular bits of information no longer meet the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but it still usually requires time-consuming serial searching of one database after another, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound bite? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.


Learning Optimal Control and Dynamical Structure of Global Trajectory Search Problems with Diffusion Models

Graebner, Jannik, Li, Anjian, Sinha, Amlan, Beeson, Ryne

arXiv.org Artificial Intelligence, Dec-29-2024

Spacecraft trajectory design is a global search problem, where previous work has revealed specific solution structures that can be captured with data-driven methods. This paper explores two global search problems in the circular restricted three-body problem: a hybrid cost function of minimum fuel and time of flight, and transfers to energy-dependent invariant manifolds. These problems display a fundamental structure either in the optimal control profile or in the use of dynamical structures. We build on our prior generative machine learning framework to apply diffusion models to learn the conditional probability distribution of the search problem and analyze the model's capability to capture these structures.

diffusion model, information retrieval, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2410.02976

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)


Text2Insight: Transform natural language text into insights seamlessly using multi-model architecture

Sain, Pradeep

arXiv.org Artificial Intelligence, Dec-27-2024

The growing demand for dynamic, user-centric data analysis and visualization is evident across domains like healthcare, finance, and research. Traditional visualization tools often fail to meet individual user needs due to their static and predefined nature. To address this gap, Text2Insight is introduced as an innovative solution that delivers customized data analysis and visualizations based on user-defined natural language requirements. Leveraging a multi-model architecture, Text2Insight transforms user inputs into actionable insights and dynamic visualizations. The methodology begins with analyzing the input dataset to extract structural details such as columns and values. A pre-trained Llama3 model converts the user's natural language query into an SQL query, which is further refined using a Named Entity Recognition (NER) model for accuracy. A chart predictor determines the most suitable visualization type, while the Llama3 model generates insights based on the SQL query's results. The output is a user-friendly and visually informative chart. To enhance analysis capabilities, the system integrates a question-answering model and a predictive model using the BERT framework. These models provide insights into historical data and predict future trends. Performance evaluation of Text2Insight demonstrates its effectiveness, achieving high accuracy (99%), precision (100%), recall (99%), and F1-score (99%), with a BLEU score of 0.5. The question-answering model attained an accuracy of 89% and the predictive model achieved 70% accuracy. These results validate Text2Insight as a robust and viable solution for transforming natural language text into dynamic, user-specific data analysis and visualizations.

information retrieval, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2412.19718

Country:

Asia > India > Maharashtra > Mumbai (0.04)
Asia > India > Tamil Nadu > Chennai (0.04)
Asia > India > Karnataka > Bengaluru (0.04)
(10 more...)

Genre:

Research Report > Promising Solution (1.00)
Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (1.00)

Industry:

Leisure & Entertainment > Sports > Cricket (1.00)
Education (1.00)
Health & Medicine (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(3 more...)


Semi-Supervised Learning from Small Annotated Data and Large Unlabeled Data for Fine-grained PICO Entity Recognition

Chen, Fangyi, Zhang, Gongbo, Fang, Yilu, Peng, Yifan, Weng, Chunhua

arXiv.org Artificial Intelligence, Dec-26-2024

Objective: Extracting PICO elements (Participants, Intervention, Comparison, and Outcomes) from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model to extract PICO entities with fine granularities. Materials and Methods: Using a corpus of 2,511 abstracts with PICO mentions from 4 public datasets, we developed a semi-supervised method to facilitate the training of a NER model, FinePICO, by combining limited annotated data of PICO entities and abundant unlabeled data. For evaluation, we divided the entire dataset into two subsets: a smaller group with annotations and a larger group without annotations. We then established the theoretical lower and upper performance bounds based on the performance of supervised learning models trained solely on the small, annotated subset and on the entire set with complete annotations, respectively. Finally, we evaluated FinePICO on both the smaller annotated subset and the larger, initially unannotated subset. We measured the performance of FinePICO using precision, recall, and F1. Results: Our method achieved precision/recall/F1 of 0.567/0.636/0.60, respectively, using a small set of annotated samples, outperforming the baseline model (F1: 0.437) by more than 16%. The model demonstrates generalizability to a different PICO framework and to another corpus, and consistently outperforms the benchmark in diverse experimental settings (p-value < 0.001). Conclusion: This study contributes a generalizable and effective semi-supervised approach to named entity recognition leveraging large unlabeled data together with small, annotated data. It also initially supports fine-grained PICO extraction.
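As a quick check on how the three reported figures relate, the sketch below recomputes F1 as the harmonic mean of the precision and recall quoted in the abstract. The function name `f1_score` is chosen here for illustration and is not taken from the paper; the baseline value 0.437 is the one quoted above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract.
reported_f1 = f1_score(0.567, 0.636)

# Absolute difference from the quoted baseline F1 of 0.437.
gain_over_baseline = reported_f1 - 0.437

print(round(reported_f1, 2))
print(round(gain_over_baseline, 3))
```

The recomputed value rounds to the reported 0.60, and the gap to the baseline is a little over 0.16, which suggests the "more than 16%" in the abstract is an absolute difference in F1 rather than a relative improvement.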

information retrieval, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2412.19346

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars

Kornilov, Albert, Shavrina, Tatiana

arXiv.org Artificial Intelligence, Dec-26-2024

Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources rely primarily on formal descriptions of grammar and vocabulary. In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank. This set of benchmarks offers the first comprehensive evaluation of language models' in-context ability to accurately interpret and extract linguistic features, providing a critical resource for scaling NLP to low-resource languages. The code and data are publicly available at https://github.com/al-the-eigenvalue/RAG-on-grammars.

information retrieval, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2411.15577

Country:

Europe (0.67)
North America (0.46)
Oceania > Australia (0.28)

Genre: Research Report > New Finding (0.34)

Industry: Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.94)
(2 more...)


FOR: Finetuning for Object Level Open Vocabulary Image Retrieval

Levi, Hila, Heller, Guy, Levi, Dan

arXiv.org Artificial Intelligence, Dec-25-2024

As working with large datasets becomes standard, the task of accurately retrieving images containing objects of interest by an open set textual query gains practical importance. The current leading approach utilizes a pre-trained CLIP model without any adaptation to the target domain, balancing accuracy and efficiency through additional post-processing. In this work, we propose FOR: Finetuning for Object-centric Open-vocabulary Image Retrieval, which allows finetuning on a target dataset using closed-set labels while keeping the visual-language association crucial for open vocabulary retrieval. FOR is based on two design elements: a specialized decoder variant of the CLIP head customized for the intended task, and its coupling within a multi-objective training framework. Together, these design choices result in a significant increase in accuracy, showcasing improvements of up to 8 mAP@50 points over SoTA across three datasets. Additionally, we demonstrate that FOR is also effective in a semi-supervised setting, achieving impressive results even when only a small portion of the dataset is labeled.

information retrieval, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2412.18806

Country:

North America > United States (1.00)
Europe (0.93)
North America > Canada (0.68)
Asia > Middle East > Israel (0.14)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Sports (0.93)
Media (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
(3 more...)


TSDS: Data Selection for Task-Specific Model Finetuning

Liu, Zifan, Karbasi, Amin, Rekatsinas, Theodoros

arXiv.org Artificial Intelligence, Dec-24-2024

Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data. We connect our optimization problem to nearest neighbor search and design efficient algorithms to compute the optimal solution based on approximate nearest neighbor search techniques. We evaluate our method on data selection for both continued pretraining and instruction tuning of language models. We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset and beats the baseline selection methods by 1.5 points in F1 score on average. Our code is available at https://github.com/ZifanL/TSDS.
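The link between data selection and nearest neighbor search can be illustrated with a deliberately simplified sketch. This is not the TSDS algorithm: it omits the optimal-transport alignment loss, the diversity regularizer, and the kernel density estimation described above, and it uses brute-force search instead of approximate nearest neighbors. The function names and the toy 2-D "embeddings" are assumptions made for illustration only.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_by_nearest_neighbors(candidates, targets, k, budget):
    """Pick up to `budget` candidates that most often appear among the
    k nearest neighbors of the target examples (brute-force search)."""
    votes = Counter()
    for target in targets:
        ranked = sorted(
            range(len(candidates)),
            key=lambda i: euclidean(candidates[i], target),
        )
        votes.update(ranked[:k])
    return [idx for idx, _ in votes.most_common(budget)]


# Toy data: two candidates sit close to the target examples.
candidates = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (9.0, 9.0)]
targets = [(5.0, 5.1), (4.9, 5.0)]

selected = select_by_nearest_neighbors(candidates, targets, k=2, budget=2)
print(sorted(selected))
```

In this toy setup the two candidates near the target examples are the ones selected. A voting scheme like this one would happily pick near-duplicates, which is the failure mode the regularizer in the abstract is meant to reduce.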

artificial intelligence, information retrieval, natural language, (20 more...)

arXiv.org Artificial Intelligence

2410.11303

Country:

Europe (1.00)
North America > United States > Wisconsin (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.89)


A Survey of Query Optimization in Large Language Models

Song, Mingyang, Zheng, Mao

arXiv.org Artificial Intelligence, Dec-23-2024

Query Optimization (QO) refers to techniques aimed at enhancing the efficiency and quality of Large Language Models (LLMs) in understanding and answering queries, especially complex ones in scenarios like Retrieval-Augmented Generation (RAG). Specifically, RAG mitigates the limitations of LLMs by dynamically retrieving and leveraging up-to-date relevant information, which provides a cost-effective solution to the challenge of LLMs producing plausible but potentially inaccurate responses. Recently, as RAG evolves and incorporates multiple components that influence its performance, QO has emerged as a critical element, playing a pivotal role in determining the effectiveness of RAG's retrieval stage in accurately sourcing the necessary multiple pieces of evidence to answer queries correctly. In this paper, we trace the evolution of QO techniques by summarizing and analyzing significant studies. Through an organized framework and categorization, we aim to consolidate existing QO techniques in RAG, elucidate their technological foundations, and highlight their potential to enhance the versatility and applications of LLMs.
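One way to picture why query rewriting matters for multi-evidence retrieval is the sketch below, which is an illustration and not a method taken from the survey. It assumes a complex question has already been decomposed by hand into two sub-queries (a real system would typically ask an LLM to do this), retrieves for each with a toy token-overlap scorer, and merges the rankings with reciprocal rank fusion. The documents, sub-queries, and function names are all hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def retrieve(query, docs, top_k):
    """Rank documents by token overlap with the query (toy retriever)."""
    q = tokens(query)
    ranked = sorted(
        range(len(docs)),
        key=lambda i: (-len(q & tokens(docs[i])), i),
    )
    # Keep only documents that share at least one token with the query.
    return [i for i in ranked[:top_k] if q & tokens(docs[i])]


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several rankings: each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = [
    "marie curie was born in warsaw",
    "warsaw is the capital of poland",
    "paris is the capital of france",
]

# Hand-written decomposition of "Which country's capital was Marie Curie born in?"
sub_queries = [
    "where was marie curie born",
    "warsaw is the capital of which country",
]

fused = reciprocal_rank_fusion([retrieve(q, docs, top_k=2) for q in sub_queries])
print(fused)
```

Each sub-query surfaces a different piece of evidence, and the fused ranking places both supporting documents ahead of the distractor.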

computational linguistic, language model, query, (14 more...)

arXiv.org Artificial Intelligence

2412.17558

Country:

Europe > Austria > Vienna (0.14)
Asia > Thailand > Bangkok > Bangkok (0.05)
North America > Mexico > Mexico City > Mexico City (0.05)
(9 more...)

Genre: Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)


Efficient fine-tuning methodology of text embedding models for information retrieval: contrastive learning penalty (clp)

Yu, Jeongsu

arXiv.org Artificial Intelligence, Dec-23-2024

Text embedding models play a crucial role in natural language processing, particularly in information retrieval, by mapping text data into a semantically rich vector space. The importance of information retrieval has been further highlighted with the recent utilization of RAG (Retrieval-Augmented Generation) (Lewis et al., 2020) to address the issues of hallucination and outdated information in large language models (LLMs). Pre-trained text embedding models on a massive corpus have significantly improved the quality of text representation. BGE M3-Embedding (Chen et al., 2024) is a representative model that shows outstanding performance in multilingual text embedding and information retrieval. This study proposes an efficient fine-tuning methodology to enhance the information retrieval performance of pre-trained text embedding models by specializing them to a specific domain: 1. Efficient Training Data Selection Technique: Applies ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation) (Xiong et al., 2020) for selecting negative samples in the training data.
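To make the role of negative sample selection concrete, here is a minimal sketch of a standard contrastive (InfoNCE-style) loss with hard negatives. It is not the paper's contrastive learning penalty, whose details are not given in this abstract, and it is only loosely in the spirit of ANCE: negatives are chosen by brute-force dot-product scoring rather than through an approximate nearest neighbor index. The toy vectors, the temperature value, and the function names are assumptions for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def hardest_negatives(query, corpus, positive_idx, n):
    """Indices of the n non-positive documents scoring highest for the query."""
    scored = [
        (dot(query, doc), i)
        for i, doc in enumerate(corpus)
        if i != positive_idx
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:n]]


def info_nce(query, positive, negatives, temperature=0.1):
    """Negative log-softmax of the positive's score among all candidates."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


# Toy 2-D embeddings; index 0 is the positive passage for the query.
corpus = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.0)]
query = (1.0, 0.0)

neg_idx = hardest_negatives(query, corpus, positive_idx=0, n=2)
hard_loss = info_nce(query, corpus[0], [corpus[i] for i in neg_idx])
easy_loss = info_nce(query, corpus[0], [corpus[3]])

print(neg_idx, hard_loss > easy_loss)
```

The loss is larger when the negatives resemble the positive, which is why selecting hard negatives gives the model a stronger training signal than sampling easy ones.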

information retrieval, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2412.17364

Country: Asia > South Korea (0.15)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)


How to Change the Default Search Engine in Google Chrome

WIRED, Dec-22-2024, 12:30:00 GMT

Part of the reason Google decided to start developing its own Chrome browser--all the way back in 2008--was to funnel people toward all of its web apps, from Google Docs to Gmail to Google Maps. And of course, Chrome has Google's search engine built right in. However, if you love Google Chrome but you've decided you've had enough of Google search, you can change the default search engine in the browser. You can switch to Bing, DuckDuckGo, or whichever alternative search engine you like. Maybe you feel you've spent enough of your life scrolling through Google's sponsored links, or perhaps you'd rather use a search engine without any AI in it.

default search engine, google chrome, search engine, (3 more...)

WIRED

Industry:

Information Technology > Software (1.00)
Information Technology > Services (0.67)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)


Survey on Abstractive Text Summarization: Dataset, Models, and Metrics

Nnadi, Gospel Ozioma, Bertini, Flavio

arXiv.org Artificial Intelligence, Dec-22-2024

Readers and scholars often desire a concise summary (Too Long; Didn't Read - TL;DR) of texts to effectively prioritize information. However, creating document summaries is mentally taxing and time-consuming, especially considering the overwhelming volume of documents produced annually, as depicted in Figure 1 by [2] and in Figure 2. [3] reported over 100,000 scientific articles on the coronavirus pandemic in 2020; although these articles contain brief abstracts, the sheer volume poses challenges for researchers and medical professionals in quickly extracting relevant knowledge on a specific topic. An automatically generated multi-document summarization could be valuable, providing readers with essential information and reducing the need to access original files unless refinement is necessary. Text summarization has garnered significant research attention, proving useful in search engines, news clustering, timeline generation, and various other applications. The objective of text summarization is to create a brief, coherent, factually consistent, and readable document that retains the essential information from the source, whether the source is a single document or multiple documents. In Single Document Summarization (SDS) only one input document is used, eliminating the need for additional processing to assess relationships between inputs. This method is suitable for summarizing standalone documents such as emails, legal contracts, and financial reports. The primary goal of Multi Document Summarization (MDS) is to gather information from several texts addressing the same topic, often composed at different times or representing diverse perspectives. The overarching objective is to produce information reports that are both succinct and comprehensive, consolidating varied opinions from documents that explore a topic through multiple viewpoints.

evolutionary algorithm, information retrieval, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2412.17165

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Law (1.00)
Media > News (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
(2 more...)
