Goto

Collaborating Authors

 Information Retrieval


Learning Optimal Control and Dynamical Structure of Global Trajectory Search Problems with Diffusion Models

arXiv.org Artificial Intelligence

Spacecraft trajectory design is a global search problem, where previous work has revealed specific solution structures that can be captured with data-driven methods. This paper explores two global search problems in the circular restricted three-body problem: hybrid cost function of minimum fuel/time-of-flight and transfers to energy-dependent invariant manifolds. These problems display a fundamental structure either in the optimal control profile or the use of dynamical structures. We build on our prior generative machine learning framework to apply diffusion models to learn the conditional probability distribution of the search problem and analyze the model's capability to capture these structures.


Text2Insight: Transform natural language text into insights seamlessly using multi-model architecture

arXiv.org Artificial Intelligence

The growing demand for dynamic, user-centric data analysis and visualization is evident across domains like healthcare, finance, and research. Traditional visualization tools often fail to meet individual user needs due to their static and predefined nature. To address this gap, Text2Insight is introduced as an innovative solution that delivers customized data analysis and visualizations based on user-defined natural language requirements. Leveraging a multi-model architecture, Text2Insight transforms user inputs into actionable insights and dynamic visualizations. The methodology begins with analyzing the input dataset to extract structural details such as columns and values. A pre-trained Llama3 model converts the user's natural language query into an SQL query, which is further refined using a Named Entity Recognition (NER) model for accuracy. A chart predictor determines the most suitable visualization type, while the Llama3 model generates insights based on the SQL query's results. The output is a user-friendly and visually informative chart. To enhance analysis capabilities, the system integrates a question-answering model and a predictive model using the BERT framework. These models provide insights into historical data and predict future trends. Performance evaluation of Text2Insight demonstrates its effectiveness, achieving high accuracy (99%), precision (100%), recall (99%), and F1-score (99%), with a BLEU score of 0.5. The question-answering model attained an accuracy of 89% and the predictive model achieved 70% accuracy. These results validate Text2Insight as a robust and viable solution for transforming natural language text into dynamic, user-specific data analysis and visualizations.


Semi-Supervised Learning from Small Annotated Data and Large Unlabeled Data for Fine-grained PICO Entity Recognition

arXiv.org Artificial Intelligence

Objective: Extracting PICO elements -- Participants, Intervention, Comparison, and Outcomes -- from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model to extract PICO entities with fine granularities. Materials and Methods: Using a corpus of 2,511 abstracts with PICO mentions from 4 public datasets, we developed a semi-supervised method to facilitate the training of a NER model, FinePICO, by combining limited annotated data of PICO entities and abundant unlabeled data. For evaluation, we divided the entire dataset into two subsets: a smaller group with annotations and a larger group without annotations. We then established the theoretical lower and upper performance bounds based on the performance of supervised learning models trained solely on the small, annotated subset and on the entire set with complete annotations, respectively. Finally, we evaluated FinePICO on both the smaller annotated subset and the larger, initially unannotated subset. We measured the performance of FinePICO using precision, recall, and F1. Results: Our method achieved precision/recall/F1 of 0.567/0.636/0.60, respectively, using a small set of annotated samples, outperforming the baseline model (F1: 0.437) by more than 16\%. The model demonstrates generalizability to a different PICO framework and to another corpus, which consistently outperforms the benchmark in diverse experimental settings (p-value \textless0.001). Conclusion: This study contributes a generalizable and effective semi-supervised approach to named entity recognition leveraging large unlabeled data together with small, annotated data. It also initially supports fine-grained PICO extraction.


From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars

arXiv.org Artificial Intelligence

Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources rely primarily on formal descriptions of grammar and vocabulary. In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank. This set of benchmarks offers the first comprehensive evaluation of language models' in-context ability to accurately interpret and extract linguistic features, providing a critical resource for scaling NLP to low-resource languages. The code and data are publicly available at \url{https://github.com/al-the-eigenvalue/RAG-on-grammars}.


FOR: Finetuning for Object Level Open Vocabulary Image Retrieval

arXiv.org Artificial Intelligence

As working with large datasets becomes standard, the task of accurately retrieving images containing objects of interest by an open set textual query gains practical importance. The current leading approach utilizes a pre-trained CLIP model without any adaptation to the target domain, balancing accuracy and efficiency through additional post-processing. In this work, we propose FOR: Finetuning for Object-centric Open-vocabulary Image Retrieval, which allows finetuning on a target dataset using closed-set labels while keeping the visual-language association crucial for open vocabulary retrieval. FOR is based on two design elements: a specialized decoder variant of the CLIP head customized for the intended task, and its coupling within a multi-objective training framework. Together, these design choices result in a significant increase in accuracy, showcasing improvements of up to 8 mAP@50 points over SoTA across three datasets. Additionally, we demonstrate that FOR is also effective in a semi-supervised setting, achieving impressive results even when only a small portion of the dataset is labeled.


TSDS: Data Selection for Task-Specific Model Finetuning

arXiv.org Artificial Intelligence

Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data. We connect our optimization problem to nearest neighbor search and design efficient algorithms to compute the optimal solution based on approximate nearest neighbor search techniques. We evaluate our method on data selection for both continued pretraining and instruction tuning of language models. We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset and beats the baseline selection methods by 1.5 points in F1 score on average. Our code is available at https://github.com/ZifanL/TSDS.


A Survey of Query Optimization in Large Language Models

arXiv.org Artificial Intelligence

\textit{Query Optimization} (QO) refers to techniques aimed at enhancing the efficiency and quality of Large Language Models (LLMs) in understanding and answering queries, especially complex ones in scenarios like Retrieval-Augmented Generation (RAG). Specifically, RAG mitigates the limitations of LLMs by dynamically retrieving and leveraging up-to-date relevant information, which provides a cost-effective solution to the challenge of LLMs producing plausible but potentially inaccurate responses. Recently, as RAG evolves and incorporates multiple components that influence its performance, QO has emerged as a critical element, playing a pivotal role in determining the effectiveness of RAG's retrieval stage in accurately sourcing the necessary multiple pieces of evidence to answer queries correctly. In this paper, we trace the evolution of QO techniques by summarizing and analyzing significant studies. Through an organized framework and categorization, we aim to consolidate existing QO techniques in RAG, elucidate their technological foundations, and highlight their potential to enhance the versatility and applications of LLMs.


Efficient fine-tuning methodology of text embedding models for information retrieval: contrastive learning penalty (clp)

arXiv.org Artificial Intelligence

Text embedding models play a crucial role in natural language processing, particularly in information retrieval, by mapping text data into a semantically rich vector space. The importance of information retrieval has been further highlighted with the recent utilization of RAG (Retrieval-Augmented Generation) (Lewis et al., 2020) to address the issues of hallucination and outdated information in large language models (LLMs). Pre-trained text embedding models on a massive corpus have significantly improved the quality of text representation. BGE M3-Embedding (Chen et al., 2024) is a representative model that shows outstanding performance in multilingual text embedding and information retrieval. This study proposes an efficient fine-tuning methodology to enhance the information retrieval performance of pre-trained text embedding models by specializing them to a specific domain: 1. Efficient Training Data Selection Technique: Applies ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation) (Xiong et al., 2020) for selecting negative samples in the training data.


How to Change the Default Search Engine in Google Chrome

WIRED

Part of the reason Google decided to start developing its own Chrome browser--all the way back in 2008--was to funnel people toward all of its web apps, from Google Docs to Gmail to Google Maps. And of course, Chrome has Google's search engine built right in. However, if you love Google Chrome but you've decided you've had enough of Google search, you can change the default search engine in the browser. You can switch to Bing, DuckDuckGo, or whichever alternative search engine you like. Maybe you feel you've spent enough of your life scrolling through Google's sponsored links, or perhaps you'd rather use a search engine without any AI in it.


Survey on Abstractive Text Summarization: Dataset, Models, and Metrics

arXiv.org Artificial Intelligence

Readers and scholars often desire a concise summary (Too Long; Didn't Read - TL;DR) of texts to effectively prioritize information. However, creating document summaries is mentally taxing and time-consuming, especially considering the overwhelming volume of documents produced annually, as depicted in Figure 1 by [2], Figure 2, [3] reported over 100,000 scientific articles on the Corona virus pandemic in 2020, though these articles contain brief abstracts of the article, the sheer volume poses challenges for researchers and medical professionals in quickly extracting relevant knowledge on a specific topic. An automatically generated multi-document summarization could be valuable, providing readers with essential information and reducing the need to access original files unless refinement is necessary. Text summarization has garnered significant research attention, proving useful in search engines, news clustering, timeline generation, and various other applications. The objective of text summarization is to create a brief, coherent, factually consistent, and readable document that retains the essential information from the source document, whether it is a single or multi-document. In Single Document Summarization (SDS) only one input document is used, eliminating the need for additional processing to assess relationships between inputs. This method is suitable for summarizing standalone documents such as emails, legal contracts, financial reports and so on. The primary goal of Multi Document Summarization (MDS) is to gather information from several texts addressing the same topic, often composed at different times or representing diverse perspectives. The overarching objective is to produce information reports that are both succinct and comprehensive, consolidating varied opinions from documents that explore a topic through multiple viewpoints.