AITopics | Information Retrieval

Collaborating Authors

Information Retrieval

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people. Searching traditional indexes of print publications has been aided by computerized databases, but still usually requires time-consuming serial searching of one database after the other, and then moving on to other methods of searching for internet sources. And what if the information being sought is a sound byte? A video clip? Yesterday's e-mail exchange between respected scientists? Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.

News Overviews Instructional Materials AI-Alerts Classics

Evaluation of Semantic Search and its Role in Retrieved-Augmented-Generation (RAG) for Arabic Language

Mahboub, Ali, Za'ter, Muhy Eddin, Al-Rfooh, Bashar, Estaitia, Yazan, Jaljuli, Adnan, Hakouz, Asma

arXiv.org Artificial IntelligenceMay-30-2024

The abundance of information has driven the development of semantic search technologies that surpass traditional keyword-based search engines by understanding the context and intent of user queries through natural language processing (NLP) and machine learning [1]. Unlike conventional search methods that focus on matching keywords, semantic search interprets the meaning and relationships between words, aiming to mimic human understanding. This advancement enhances user experience across various applications, including web search engines, knowledge discovery, and personalized content recommendation systems, and most recently Retriever-Augmented Generation (RAG) [2]. RAG represents an innovative approach at the crossroads of information retrieval and natural language generation, leveraging the strengths of both fields to refine Artificial Intelligence (AI) based systems ability to comprehend and generate human-like text [3, 4]. By combining a sophisticated retrieval mechanism with a powerful generation model, RAG systems can produce detailed, contextually relevant responses that significantly improve standalone language models limitations in terms of precision and human-like generation.

encoder, query, semantic search, (16 more...)

arXiv.org Artificial Intelligence

2403.1835

Country: Asia > Middle East > Jordan > Amman Governorate > Amman (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

A Declarative System for Optimizing AI Workloads

Liu, Chunwei, Russo, Matthew, Cafarella, Michael, Cao, Lei, Chen, Peter Baille, Chen, Zui, Franklin, Michael, Kraska, Tim, Madden, Samuel, Vitagliano, Gerardo

arXiv.org Artificial IntelligenceMay-29-2024

A long-standing goal of data management systems has been to build systems which can compute quantitative insights over large corpora of unstructured data in a cost-effective manner. Until recently, it was difficult and expensive to extract facts from company documents, data from scientific papers, or metrics from image and video corpora. Today's models can accomplish these tasks with high accuracy. However, a programmer who wants to answer a substantive AI-powered query must orchestrate large numbers of models, prompts, and data operations. For even a single query, the programmer has to make a vast number of decisions such as the choice of model, the right inference method, the most cost-effective inference hardware, the ideal prompt design, and so on. The optimal set of decisions can change as the query changes and as the rapidly-evolving technical landscape shifts. In this paper we present Palimpzest, a system that enables anyone to process AI-powered analytical queries simply by defining them in a declarative language. The system uses its cost optimization framework to implement the query plan with the best trade-offs between runtime, financial cost, and output data quality. We describe the workload of AI-powered analytics tasks, the optimization methods that Palimpzest uses, and the prototype system itself. We evaluate Palimpzest on tasks in Legal Discovery, Real Estate Search, and Medical Schema Matching. We show that even our simple prototype offers a range of appealing plans, including one that is 3.3x faster and 2.9x cheaper than the baseline method, while also offering better data quality. With parallelism enabled, Palimpzest can produce plans with up to a 90.3x speedup at 9.1x lower cost relative to a single-threaded GPT-4 baseline, while obtaining an F1-score within 83.5% of the baseline. These require no additional work by the user.

alimpzest, optimization, workload, (15 more...)

arXiv.org Artificial Intelligence

2405.14696

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Arizona (0.04)
(4 more...)

Genre: Research Report > Experimental Study (0.45)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Information Technology (0.93)
Banking & Finance > Real Estate (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

Add feedback

Semantic In-Domain Product Identification for Search Queries

Sharma, Sanat, Kumar, Jayant, Naik, Twisha, Lu, Zhaoyu, Srikantan, Arvind, King, Tracy Holloway

arXiv.org Artificial IntelligenceMay-29-2024

Accurate explicit and implicit product identification in search queries is critical for enhancing user experiences, especially at a company like Adobe which has over 50 products and covers queries across hundreds of tools. In this work, we present a novel approach to training a product classifier from user behavioral data. Our semantic model led to >25% relative improvement in CTR (click through rate) across the deployed surfaces; a >50% decrease in null rate; a 2x increase in the app cards surfaced, which helps drive product visibility.

app card, dataset, query, (12 more...)

arXiv.org Artificial Intelligence

2404.09091

Country:

North America > United States > California > Santa Clara County > San Jose (0.06)
North America > United States > District of Columbia > Washington (0.05)

Genre: Research Report (0.71)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.50)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

Navigable Graphs for High-Dimensional Nearest Neighbor Search: Constructions and Limits

Diwan, Haya, Gou, Jinrui, Musco, Cameron, Musco, Christopher, Suel, Torsten

arXiv.org Artificial IntelligenceMay-28-2024

There has been significant recent interest in graph-based nearest neighbor search methods, many of which are centered on the construction of navigable graphs over high-dimensional point sets. A graph is navigable if we can successfully move from any starting node to any target node using a greedy routing strategy where we always move to the neighbor that is closest to the destination according to a given distance function. The complete graph is navigable for any point set, but the important question for applications is if sparser graphs can be constructed. While this question is fairly well understood in low-dimensions, we establish some of the first upper and lower bounds for high-dimensional point sets. First, we give a simple and efficient way to construct a navigable graph with average degree $O(\sqrt{n \log n })$ for any set of $n$ points, in any dimension, for any distance function. We compliment this result with a nearly matching lower bound: even under the Euclidean metric in $O(\log n)$ dimensions, a random point set has no navigable graph with average degree $O(n^{\alpha})$ for any $\alpha < 1/2$. Our lower bound relies on sharp anti-concentration bounds for binomial random variables, which we use to show that the near-neighborhoods of a set of random points do not overlap significantly, forcing any navigable graph to have many edges.

graph, navigable graph, probability, (15 more...)

arXiv.org Artificial Intelligence

2405.1868

Country: North America > United States > New York (0.05)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.92)

Add feedback

KSW: Khmer Stop Word based Dictionary for Keyword Extraction

Thuon, Nimol, Zhang, Wangrui, Thuon, Sada

arXiv.org Artificial IntelligenceMay-27-2024

This paper introduces KSW, a Khmer-specific approach to keyword extraction that leverages a specialized stop word dictionary. Due to the limited availability of natural language processing resources for the Khmer language, effective keyword extraction has been a significant challenge. KSW addresses this by developing a tailored stop word dictionary and implementing a preprocessing methodology to remove stop words, thereby enhancing the extraction of meaningful keywords. Our experiments demonstrate that KSW achieves substantial improvements in accuracy and relevance compared to previous methods, highlighting its potential to advance Khmer text processing and information retrieval. The KSW resources, including the stop word dictionary, are available at the following GitHub repository: (https://github.com/back-kh/KSWv2-Khmer-Stop-Word-based-Dictionary-for-Keyword-Extraction.git).

extraction, keyword extraction, stop word, (13 more...)

arXiv.org Artificial Intelligence

2405.1739

Country:

Asia > Cambodia (0.05)
Europe > Germany > North Rhine-Westphalia > Cologne Region > Cologne (0.04)
Europe > Belgium (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)

Add feedback

XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser

Cheng, Xianfu, Zhang, Hang, Yang, Jian, Li, Xiang, Zhou, Weixiao, Wu, Kui, Liu, Fei, Zhang, Wei, Sun, Tao, Li, Tongliang, Li, Zhoujun

arXiv.org Artificial IntelligenceMay-27-2024

In the domain of document AI, semi-structured form parsing plays a crucial role. This task leverages techniques from key information extraction (KIE), dealing with inputs that range from plain text to intricate modal data comprising images and structural layouts. The advent of pre-trained multimodal models has driven the extraction of key information from form documents in different formats such as PDFs and images. Nonetheless, the endeavor of form parsing is still encumbered by notable challenges like subpar capabilities in multi-lingual parsing and diminished recall in contexts rich in text and visuals. In this work, we introduce a simple but effective \textbf{M}ultimodal and \textbf{M}ultilingual semi-structured \textbf{FORM} \textbf{PARSER} (\textbf{XFormParser}), which is anchored on a comprehensive pre-trained language model and innovatively amalgamates semantic entity recognition (SER) and relation extraction (RE) into a unified framework, enhanced by a novel staged warm-up training approach that employs soft labels to significantly refine form parsing accuracy without amplifying inference overhead. Furthermore, we have developed a groundbreaking benchmark dataset, named InDFormBench, catering specifically to the parsing requirements of multilingual forms in various industrial contexts. Through rigorous testing on established multilingual benchmarks and InDFormBench, XFormParser has demonstrated its unparalleled efficacy, notably surpassing the state-of-the-art (SOTA) models in RE tasks within language-specific setups by achieving an F1 score improvement of up to 1.79\%. Our framework exhibits exceptionally improved performance across tasks in both multi-language and zero-shot contexts when compared to existing SOTA benchmarks. The code is publicly available at https://github.com/zhbuaa0/layoutlmft.

extraction, information, xformparser, (13 more...)

arXiv.org Artificial Intelligence

2405.17336

Country:

Europe > Austria > Vienna (0.14)
Asia > China > Beijing > Beijing (0.04)
Europe > Italy > Piedmont > Turin Province > Turin (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
(2 more...)

Add feedback

Hybrid Context Retrieval Augmented Generation Pipeline: LLM-Augmented Knowledge Graphs and Vector Database for Accreditation Reporting Assistance

Edwards, Candace

arXiv.org Artificial IntelligenceMay-24-2024

In higher education, accreditation is a quality assurance process, where an institution demonstrates a commitment to delivering high quality programs and services to their students. For business schools nationally and internationally the Association to Advance Collegiate Schools of Business (AACSB) accreditation is the gold standard. For a business school to receive and subsequently maintain accreditation, the school must undertake a rigorous, time consuming reporting and peer review process, to demonstrate alignment with the AACSB Standards. For this project we create a hybrid context retrieval augmented generation pipeline that can assist in the documentation alignment and reporting process necessary for accreditation. We implement both a vector database and knowledge graph, as knowledge stores containing both institutional data and AACSB Standard data. The output of the pipeline can be used by institution stakeholders to build their accreditation report, dually grounded by the context from the knowledge stores. To develop our knowledge graphs we utilized both a manual construction process as well as an LLM Augmented Knowledge Graph approach. We evaluated the pipeline using the RAGAs framework and observed optimal performance on answer relevancy and answer correctness metrics.

knowledge graph, node, query, (14 more...)

arXiv.org Artificial Intelligence

2405.15436

Country:

North America > United States > Texas (0.14)
North America > United States > Hawaii (0.04)
North America > United States > Wisconsin (0.04)
(6 more...)

Genre:

Overview (0.67)
Research Report (0.66)

Industry: Education > Educational Setting > Higher Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.68)

Add feedback

Cardinality Estimation on Hyper-relational Knowledge Graphs

Teng, Fei, Li, Haoyang, Di, Shimin, Chen, Lei

arXiv.org Artificial IntelligenceMay-24-2024

Cardinality Estimation (CE) for query is to estimate the number of results without execution, which is an effective index in query optimization. Recently, CE over has achieved great success in knowledge graphs (KGs) that consist of triple facts. To more precisely represent facts, current researchers propose hyper-relational KGs (HKGs) to represent a triple fact with qualifiers, where qualifiers provide additional context to the fact. However, existing CE methods over KGs achieve unsatisfying performance on HKGs due to the complexity of qualifiers in HKGs. Also, there is only one dataset for HKG query cardinality estimation, i.e., WD50K-QE, which is not comprehensive and only covers limited patterns. The lack of querysets over HKG also becomes a bottleneck to comprehensively investigate CE problems on HKGs. In this work, we first construct diverse and unbiased hyper-relational querysets over three popular HKGs for investigating CE. Besides, we also propose a novel qualifier-attached graph neural network (GNN) model that effectively incorporates qualifier information and adaptively combines outputs from multiple GNN layers, to accurately predict the cardinality. Our experiments illustrate that the proposed hyper-relational query encoder outperforms all state-of-the-art CE methods over three popular HKGs on the diverse and unbiased benchmark.

proceedings, qualifier, query, (12 more...)

arXiv.org Artificial Intelligence

2405.15231

Country:

North America > United States > California > San Francisco County > San Francisco (0.28)
North America > United States > New York > New York County > New York City (0.15)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
(16 more...)

Genre: Research Report (0.41)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.88)

Add feedback

Amortized nonmyopic active search via deep imitation learning

Nguyen, Quan, Sarkar, Anindya, Garnett, Roman

arXiv.org Artificial IntelligenceMay-23-2024

Active search formalizes a specialized active learning setting where the goal is to collect members of a rare, valuable class. The state-of-the-art algorithm approximates the optimal Bayesian policy in a budget-aware manner, and has been shown to achieve impressive empirical performance in previous work. However, even this approximate policy has a superlinear computational complexity with respect to the size of the search problem, rendering its application impractical in large spaces or in real-time systems where decisions must be made quickly. We study the amortization of this policy by training a neural network to learn to search. To circumvent the difficulty of learning from scratch, we appeal to imitation learning techniques to mimic the behavior of the expert, expensive-to-compute policy. Our policy network, trained on synthetic data, learns a beneficial search strategy that yields nonmyopic decisions carefully balancing exploration and exploitation. Extensive experiments demonstrate our policy achieves competitive performance at real-world tasks that closely approximates the expert's at a fraction of the cost, while outperforming cheaper baselines.

learning, policy network, query, (15 more...)

arXiv.org Artificial Intelligence

2405.15031

Country:

North America > Haiti (0.14)
Africa > Côte d'Ivoire (0.04)
Asia > Philippines (0.04)
Africa > Malawi (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.89)
(2 more...)

Add feedback

Retrieval-Augmented Mining of Temporal Logic Specifications from Data

Saveri, Gaia, Bortolussi, Luca

arXiv.org Artificial IntelligenceMay-23-2024

The integration of cyber-physical systems (CPS) into everyday life raises the critical necessity of ensuring their safety and reliability. An important step in this direction is requirement mining, i.e. inferring formally specified system properties from observed behaviors, in order to discover knowledge about the system. Signal Temporal Logic (STL) offers a concise yet expressive language for specifying requirements, particularly suited for CPS, where behaviors are typically represented as time series data. This work addresses the task of learning STL requirements from observed behaviors in a data-driven manner, focusing on binary classification, i.e. on inferring properties of the system which are able to discriminate between regular and anomalous behaviour, and that can be used both as classifiers and as monitors of the compliance of the CPS to desirable specifications. We present a novel framework that combines Bayesian Optimization (BO) and Information Retrieval (IR) techniques to simultaneously learn both the structure and the parameters of STL formulae, without restrictions on the STL grammar. Specifically, we propose a framework that leverages a dense vector database containing semantic-preserving continuous representations of millions of formulae, queried for facilitating the mining of requirements inside a BO loop. We demonstrate the effectiveness of our approach in several signal classification applications, showing its ability to extract interpretable insights from system executions and advance the state-of-the-art in requirement mining for CPS.

formulae, specification, trajectory, (12 more...)

arXiv.org Artificial Intelligence

2405.14355

Country:

Europe > Switzerland (0.04)
Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
Europe > Italy > Friuli Venezia Giulia > Trieste Province > Trieste (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback