Information Retrieval
Evaluation of Semantic Search and its Role in Retrieved-Augmented-Generation (RAG) for Arabic Language
Mahboub, Ali, Za'ter, Muhy Eddin, Al-Rfooh, Bashar, Estaitia, Yazan, Jaljuli, Adnan, Hakouz, Asma
The abundance of information has driven the development of semantic search technologies that surpass traditional keyword-based search engines by understanding the context and intent of user queries through natural language processing (NLP) and machine learning [1]. Unlike conventional search methods that focus on matching keywords, semantic search interprets the meaning and relationships between words, aiming to mimic human understanding. This advancement enhances user experience across various applications, including web search engines, knowledge discovery, and personalized content recommendation systems, and most recently Retriever-Augmented Generation (RAG) [2]. RAG represents an innovative approach at the crossroads of information retrieval and natural language generation, leveraging the strengths of both fields to refine Artificial Intelligence (AI) based systems ability to comprehend and generate human-like text [3, 4]. By combining a sophisticated retrieval mechanism with a powerful generation model, RAG systems can produce detailed, contextually relevant responses that significantly improve standalone language models limitations in terms of precision and human-like generation.
A Declarative System for Optimizing AI Workloads
Liu, Chunwei, Russo, Matthew, Cafarella, Michael, Cao, Lei, Chen, Peter Baille, Chen, Zui, Franklin, Michael, Kraska, Tim, Madden, Samuel, Vitagliano, Gerardo
A long-standing goal of data management systems has been to build systems which can compute quantitative insights over large corpora of unstructured data in a cost-effective manner. Until recently, it was difficult and expensive to extract facts from company documents, data from scientific papers, or metrics from image and video corpora. Today's models can accomplish these tasks with high accuracy. However, a programmer who wants to answer a substantive AI-powered query must orchestrate large numbers of models, prompts, and data operations. For even a single query, the programmer has to make a vast number of decisions such as the choice of model, the right inference method, the most cost-effective inference hardware, the ideal prompt design, and so on. The optimal set of decisions can change as the query changes and as the rapidly-evolving technical landscape shifts. In this paper we present Palimpzest, a system that enables anyone to process AI-powered analytical queries simply by defining them in a declarative language. The system uses its cost optimization framework to implement the query plan with the best trade-offs between runtime, financial cost, and output data quality. We describe the workload of AI-powered analytics tasks, the optimization methods that Palimpzest uses, and the prototype system itself. We evaluate Palimpzest on tasks in Legal Discovery, Real Estate Search, and Medical Schema Matching. We show that even our simple prototype offers a range of appealing plans, including one that is 3.3x faster and 2.9x cheaper than the baseline method, while also offering better data quality. With parallelism enabled, Palimpzest can produce plans with up to a 90.3x speedup at 9.1x lower cost relative to a single-threaded GPT-4 baseline, while obtaining an F1-score within 83.5% of the baseline. These require no additional work by the user.
Semantic In-Domain Product Identification for Search Queries
Sharma, Sanat, Kumar, Jayant, Naik, Twisha, Lu, Zhaoyu, Srikantan, Arvind, King, Tracy Holloway
Accurate explicit and implicit product identification in search queries is critical for enhancing user experiences, especially at a company like Adobe which has over 50 products and covers queries across hundreds of tools. In this work, we present a novel approach to training a product classifier from user behavioral data. Our semantic model led to >25% relative improvement in CTR (click through rate) across the deployed surfaces; a >50% decrease in null rate; a 2x increase in the app cards surfaced, which helps drive product visibility.
Navigable Graphs for High-Dimensional Nearest Neighbor Search: Constructions and Limits
Diwan, Haya, Gou, Jinrui, Musco, Cameron, Musco, Christopher, Suel, Torsten
There has been significant recent interest in graph-based nearest neighbor search methods, many of which are centered on the construction of navigable graphs over high-dimensional point sets. A graph is navigable if we can successfully move from any starting node to any target node using a greedy routing strategy where we always move to the neighbor that is closest to the destination according to a given distance function. The complete graph is navigable for any point set, but the important question for applications is if sparser graphs can be constructed. While this question is fairly well understood in low-dimensions, we establish some of the first upper and lower bounds for high-dimensional point sets. First, we give a simple and efficient way to construct a navigable graph with average degree $O(\sqrt{n \log n })$ for any set of $n$ points, in any dimension, for any distance function. We compliment this result with a nearly matching lower bound: even under the Euclidean metric in $O(\log n)$ dimensions, a random point set has no navigable graph with average degree $O(n^{\alpha})$ for any $\alpha < 1/2$. Our lower bound relies on sharp anti-concentration bounds for binomial random variables, which we use to show that the near-neighborhoods of a set of random points do not overlap significantly, forcing any navigable graph to have many edges.
KSW: Khmer Stop Word based Dictionary for Keyword Extraction
Thuon, Nimol, Zhang, Wangrui, Thuon, Sada
This paper introduces KSW, a Khmer-specific approach to keyword extraction that leverages a specialized stop word dictionary. Due to the limited availability of natural language processing resources for the Khmer language, effective keyword extraction has been a significant challenge. KSW addresses this by developing a tailored stop word dictionary and implementing a preprocessing methodology to remove stop words, thereby enhancing the extraction of meaningful keywords. Our experiments demonstrate that KSW achieves substantial improvements in accuracy and relevance compared to previous methods, highlighting its potential to advance Khmer text processing and information retrieval. The KSW resources, including the stop word dictionary, are available at the following GitHub repository: (https://github.com/back-kh/KSWv2-Khmer-Stop-Word-based-Dictionary-for-Keyword-Extraction.git).
XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser
Cheng, Xianfu, Zhang, Hang, Yang, Jian, Li, Xiang, Zhou, Weixiao, Wu, Kui, Liu, Fei, Zhang, Wei, Sun, Tao, Li, Tongliang, Li, Zhoujun
In the domain of document AI, semi-structured form parsing plays a crucial role. This task leverages techniques from key information extraction (KIE), dealing with inputs that range from plain text to intricate modal data comprising images and structural layouts. The advent of pre-trained multimodal models has driven the extraction of key information from form documents in different formats such as PDFs and images. Nonetheless, the endeavor of form parsing is still encumbered by notable challenges like subpar capabilities in multi-lingual parsing and diminished recall in contexts rich in text and visuals. In this work, we introduce a simple but effective \textbf{M}ultimodal and \textbf{M}ultilingual semi-structured \textbf{FORM} \textbf{PARSER} (\textbf{XFormParser}), which is anchored on a comprehensive pre-trained language model and innovatively amalgamates semantic entity recognition (SER) and relation extraction (RE) into a unified framework, enhanced by a novel staged warm-up training approach that employs soft labels to significantly refine form parsing accuracy without amplifying inference overhead. Furthermore, we have developed a groundbreaking benchmark dataset, named InDFormBench, catering specifically to the parsing requirements of multilingual forms in various industrial contexts. Through rigorous testing on established multilingual benchmarks and InDFormBench, XFormParser has demonstrated its unparalleled efficacy, notably surpassing the state-of-the-art (SOTA) models in RE tasks within language-specific setups by achieving an F1 score improvement of up to 1.79\%. Our framework exhibits exceptionally improved performance across tasks in both multi-language and zero-shot contexts when compared to existing SOTA benchmarks. The code is publicly available at https://github.com/zhbuaa0/layoutlmft.
Hybrid Context Retrieval Augmented Generation Pipeline: LLM-Augmented Knowledge Graphs and Vector Database for Accreditation Reporting Assistance
In higher education, accreditation is a quality assurance process, where an institution demonstrates a commitment to delivering high quality programs and services to their students. For business schools nationally and internationally the Association to Advance Collegiate Schools of Business (AACSB) accreditation is the gold standard. For a business school to receive and subsequently maintain accreditation, the school must undertake a rigorous, time consuming reporting and peer review process, to demonstrate alignment with the AACSB Standards. For this project we create a hybrid context retrieval augmented generation pipeline that can assist in the documentation alignment and reporting process necessary for accreditation. We implement both a vector database and knowledge graph, as knowledge stores containing both institutional data and AACSB Standard data. The output of the pipeline can be used by institution stakeholders to build their accreditation report, dually grounded by the context from the knowledge stores. To develop our knowledge graphs we utilized both a manual construction process as well as an LLM Augmented Knowledge Graph approach. We evaluated the pipeline using the RAGAs framework and observed optimal performance on answer relevancy and answer correctness metrics.
Cardinality Estimation on Hyper-relational Knowledge Graphs
Teng, Fei, Li, Haoyang, Di, Shimin, Chen, Lei
Cardinality Estimation (CE) for query is to estimate the number of results without execution, which is an effective index in query optimization. Recently, CE over has achieved great success in knowledge graphs (KGs) that consist of triple facts. To more precisely represent facts, current researchers propose hyper-relational KGs (HKGs) to represent a triple fact with qualifiers, where qualifiers provide additional context to the fact. However, existing CE methods over KGs achieve unsatisfying performance on HKGs due to the complexity of qualifiers in HKGs. Also, there is only one dataset for HKG query cardinality estimation, i.e., WD50K-QE, which is not comprehensive and only covers limited patterns. The lack of querysets over HKG also becomes a bottleneck to comprehensively investigate CE problems on HKGs. In this work, we first construct diverse and unbiased hyper-relational querysets over three popular HKGs for investigating CE. Besides, we also propose a novel qualifier-attached graph neural network (GNN) model that effectively incorporates qualifier information and adaptively combines outputs from multiple GNN layers, to accurately predict the cardinality. Our experiments illustrate that the proposed hyper-relational query encoder outperforms all state-of-the-art CE methods over three popular HKGs on the diverse and unbiased benchmark.
Amortized nonmyopic active search via deep imitation learning
Nguyen, Quan, Sarkar, Anindya, Garnett, Roman
Active search formalizes a specialized active learning setting where the goal is to collect members of a rare, valuable class. The state-of-the-art algorithm approximates the optimal Bayesian policy in a budget-aware manner, and has been shown to achieve impressive empirical performance in previous work. However, even this approximate policy has a superlinear computational complexity with respect to the size of the search problem, rendering its application impractical in large spaces or in real-time systems where decisions must be made quickly. We study the amortization of this policy by training a neural network to learn to search. To circumvent the difficulty of learning from scratch, we appeal to imitation learning techniques to mimic the behavior of the expert, expensive-to-compute policy. Our policy network, trained on synthetic data, learns a beneficial search strategy that yields nonmyopic decisions carefully balancing exploration and exploitation. Extensive experiments demonstrate our policy achieves competitive performance at real-world tasks that closely approximates the expert's at a fraction of the cost, while outperforming cheaper baselines.
Retrieval-Augmented Mining of Temporal Logic Specifications from Data
Saveri, Gaia, Bortolussi, Luca
The integration of cyber-physical systems (CPS) into everyday life raises the critical necessity of ensuring their safety and reliability. An important step in this direction is requirement mining, i.e. inferring formally specified system properties from observed behaviors, in order to discover knowledge about the system. Signal Temporal Logic (STL) offers a concise yet expressive language for specifying requirements, particularly suited for CPS, where behaviors are typically represented as time series data. This work addresses the task of learning STL requirements from observed behaviors in a data-driven manner, focusing on binary classification, i.e. on inferring properties of the system which are able to discriminate between regular and anomalous behaviour, and that can be used both as classifiers and as monitors of the compliance of the CPS to desirable specifications. We present a novel framework that combines Bayesian Optimization (BO) and Information Retrieval (IR) techniques to simultaneously learn both the structure and the parameters of STL formulae, without restrictions on the STL grammar. Specifically, we propose a framework that leverages a dense vector database containing semantic-preserving continuous representations of millions of formulae, queried for facilitating the mining of requirements inside a BO loop. We demonstrate the effectiveness of our approach in several signal classification applications, showing its ability to extract interpretable insights from system executions and advance the state-of-the-art in requirement mining for CPS.