AITopics | semantic operator

semantic operator

Beyond Relational: Semantic-Aware Multi-Modal Analytics with LLM-Native Query Optimization

Zhu, Junhao, Chen, Lu, Ke, Xiangyu, Fang, Ziquan, Li, Tianyi, Gao, Yunjun, Jensen, Christian S.

arXiv.org Artificial Intelligence, Nov-26-2025

Multi-modal analytical processing has the potential to transform applications in e-commerce, healthcare, entertainment, and beyond. However, real-world adoption remains elusive due to the limited ability of traditional relational query operators to capture query semantics. The emergence of foundation models, particularly large language models (LLMs), opens up new opportunities to develop flexible, semantic-aware data analytics systems that transcend the relational paradigm. We present Nirvana, a multi-modal data analytics framework that incorporates programmable semantic operators while leveraging both logical and physical query optimization strategies tailored for LLM-driven semantic query processing. Nirvana addresses two key challenges. First, it features an agentic logical optimizer that uses natural language-specified transformation rules and random-walk-based search to explore vast spaces of semantically equivalent query plans -- far beyond the capabilities of conventional optimizers. Second, it introduces a cost-aware physical optimizer that selects the most effective LLM backend for each operator using a novel improvement-score metric. To further enhance efficiency, Nirvana incorporates computation reuse and evaluation pushdown techniques guided by model capability hypotheses. Experimental evaluations on three real-world benchmarks demonstrate that Nirvana reduces end-to-end runtime by 10%--85% and system processing costs by 76% on average, outperforming state-of-the-art systems in both efficiency and scalability.

artificial intelligence, large language model, natural language, (18 more...)

arXiv: 2511.1983

Genre:

Research Report (0.64)
Workflow (0.46)

Industry:

Leisure & Entertainment (1.00)
Information Technology (1.00)
Media > Film (0.95)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)

SemBench: A Benchmark for Semantic Query Processing Engines

Lao, Jiale, Zimmerer, Andreas, Ovcharenko, Olga, Cong, Tianji, Russo, Matthew, Vitagliano, Gerardo, Cochez, Michael, Özcan, Fatma, Gupta, Gautam, Hottelier, Thibaud, Jagadish, H. V., Kissel, Kris, Schelter, Sebastian, Kipf, Andreas, Trummer, Immanuel

arXiv.org Artificial Intelligence, Nov-4-2025

We present a benchmark targeting a novel class of systems: semantic query processing engines. These systems rely inherently on the generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with semantic operators, configured by natural language instructions, that are evaluated via LLMs and enable users to perform various operations on multimodal data. Our benchmark introduces diversity across three key dimensions: scenarios, modalities, and operators. Included are scenarios ranging from movie review analysis to medical question-answering. Within these scenarios, we cover different data modalities, including images, audio, and text. Finally, the queries involve a diverse set of operators, including semantic filters, joins, mappings, ranking, and classification operators. We evaluated our benchmark on three academic systems (LOTUS, Palimpzest, and ThalamusDB) and one industrial system, Google BigQuery. Although these results reflect a snapshot of systems under continuous development, our study offers crucial insights into their current strengths and weaknesses, illuminating promising directions for future research.
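To make the idea of a semantic operator concrete, here is a minimal Python sketch of a semantic filter and a semantic map, each configured by a natural language instruction. It is an illustration only and does not reflect the API of LOTUS, Palimpzest, ThalamusDB, or BigQuery. The names `sem_filter`, `sem_map`, and `toy_llm` are hypothetical, and `toy_llm` is a keyword-based stub standing in for a real model call so that the example runs offline.

```python
from typing import Callable, List

# Hypothetical stand-in type for an LLM call: prompt in, text out.
LLM = Callable[[str], str]


def sem_filter(rows: List[str], instruction: str, llm: LLM) -> List[str]:
    """Keep the rows for which the model answers 'yes' to the instruction."""
    kept = []
    for row in rows:
        answer = llm(f"{instruction}\nRecord: {row}\nAnswer yes or no.")
        if answer.strip().lower().startswith("yes"):
            kept.append(row)
    return kept


def sem_map(rows: List[str], instruction: str, llm: LLM) -> List[str]:
    """Transform each row by asking the model to apply the instruction."""
    return [llm(f"{instruction}\nRecord: {row}") for row in rows]


def toy_llm(prompt: str) -> str:
    """Keyword-based stub (an assumption for this sketch, not a real model)."""
    record = prompt.split("Record: ", 1)[1].split("\n", 1)[0]
    if "yes or no" in prompt:
        return "yes" if "great" in record else "no"
    return record.upper()


reviews = ["great plot", "dull pacing", "great cast"]
positive = sem_filter(reviews, "Is this review positive?", toy_llm)
shouted = sem_map(positive, "Rewrite the review in capitals.", toy_llm)
print(positive)
print(shouted)
```

In a real engine the per-row model calls dominate cost, which is why the systems listed on this page focus on optimizing how such operators are executed.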

artificial intelligence, large language model, natural language, (20 more...)

arXiv: 2511.01716

Country: North America > United States > New York (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment (0.89)
Health & Medicine > Therapeutic Area (0.68)
Media > Film (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)

Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven Analytics

Russo, Matthew, Kraska, Tim

arXiv.org Artificial Intelligence, Sep-4-2025

With advances in large language models (LLMs), researchers are creating new systems that can perform AI-driven analytics over large unstructured datasets. Recent work has explored executing such analytics queries using semantic operators -- a declarative set of AI-powered data transformations with natural language specifications. However, even when optimized, these operators can be expensive to execute on millions of records, and their iterator execution semantics make them ill-suited for interactive data analytics tasks. In another line of work, Deep Research systems have demonstrated an ability to answer natural language question(s) over large datasets. These systems use one or more LLM agent(s) to plan their execution, process the dataset(s), and iteratively refine their answer. However, these systems do not explicitly optimize their query plans, which can lead to poor plan execution. For AI-driven analytics to excel, we need a runtime that combines the optimized execution of semantic operators with the flexibility and more dynamic execution of Deep Research systems. As a first step towards this vision, we build a prototype that enables Deep Research agents to write and execute optimized semantic operator programs. We evaluate our prototype and demonstrate that it can outperform a handcrafted semantic operator program and open Deep Research systems on two basic queries. Compared to a standard open Deep Research agent, our prototype achieves up to 1.95x better F1-score. Furthermore, even if we give the agent access to semantic operators as tools, our prototype still achieves cost and runtime savings of up to 76.8% and 72.7% thanks to its optimized execution.

artificial intelligence, large language model, natural language, (17 more...)

arXiv: 2509.02751

Country:

North America > United States > California (0.28)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)

Genre: Research Report (0.41)

Industry: Information Technology (0.30)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.90)

SABER: A SQL-Compatible Semantic Document Processing System Based on Extended Relational Algebra

Lee, Changjae, Zhao, Zhuoyue, Xiong, Jinjun

arXiv.org Artificial Intelligence, Sep-3-2025

The emergence of large language models (LLMs) has enabled a new class of semantic data processing systems (SDPSs) to support declarative queries against unstructured documents. Existing SDPSs, however, lack a unified algebraic foundation, making their queries difficult to compose, reason about, and optimize. We propose a new semantic algebra, SABER (Semantic Algebra Based on Extended Relational algebra), which opens the possibility of logical plan construction, optimization, and formal correctness guarantees for semantic operations. We further propose to implement SABER in a SQL-compatible syntax so that it natively supports mixed structured/unstructured data processing. With SABER, we showcase the feasibility of providing a unified interface for existing SDPSs so that it can effectively mix and match any semantically-compatible operator implementation from any SDPS, greatly enhancing SABER's applicability for community contributions.

artificial intelligence, large language model, natural language, (18 more...)

arXiv: 2509.00277

Country: North America > United States > California (0.28)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.93)
Government (0.68)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.59)

AgenticData: An Agentic Data Analytics System for Heterogeneous Data

Sun, Ji, Li, Guoliang, Zhou, Peiyao, Ma, Yihui, Xu, Jingzhe, Li, Yuan

arXiv.org Artificial Intelligence, Aug-8-2025

Existing unstructured data analytics systems rely on experts to write code and manage complex analysis workflows, making them both expensive and time-consuming. To address these challenges, we introduce AgenticData, an innovative agentic data analytics system that allows users to simply pose natural language (NL) questions while autonomously analyzing data sources across multiple domains, including both unstructured and structured data. First, AgenticData employs a feedback-driven planning technique that automatically converts an NL query into a semantic plan composed of relational and semantic operators. We propose a multi-agent collaboration strategy by utilizing a data profiling agent for discovering relevant data, a semantic cross-validation agent for iterative optimization based on feedback, and a smart memory agent for maintaining short-term context and long-term knowledge. Second, we propose a semantic optimization model to refine and execute semantic plans effectively. Our system, AgenticData, has been tested using three benchmarks. Experimental results showed that AgenticData achieved superior accuracy on both easy and difficult tasks, significantly outperforming state-of-the-art methods.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv: 2508.05002

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
(3 more...)

Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems

Sun, Zhaoyan, Wang, Jiayi, Zhao, Xinyang, Wang, Jiachi, Li, Guoliang

arXiv.org Artificial Intelligence, Jul-3-2025

Traditional Data+AI systems utilize data-driven techniques to optimize performance, but they rely heavily on human experts to orchestrate system pipelines, enabling them to adapt to changes in data, queries, tasks, and environments. For instance, while there are numerous data science tools available, developing a pipeline planning system to coordinate these tools remains challenging. This difficulty arises because existing Data+AI systems have limited capabilities in semantic understanding, reasoning, and planning. Fortunately, we have witnessed the success of large language models (LLMs) in enhancing semantic understanding, reasoning, and planning abilities. It is crucial to incorporate LLM techniques to revolutionize data systems for orchestrating Data+AI applications effectively. To achieve this, we propose the concept of a 'Data Agent' - a comprehensive architecture designed to orchestrate Data+AI ecosystems, which focuses on tackling data-related tasks by integrating knowledge comprehension, reasoning, and planning capabilities. We delve into the challenges involved in designing data agents, such as understanding data/queries/environments/tools, orchestrating pipelines/workflows, optimizing and executing pipelines, and fostering pipeline self-reflection. Furthermore, we present examples of data agent systems, including a data science agent, data analytics agents (such as unstructured data analytics agent, semantic structured data analytics agent, data lake analytics agent, and multi-modal data analytics agent), and a database administrator (DBA) agent. We also outline several open challenges associated with designing data agent systems.

data mining, large language model, natural language, (17 more...)

arXiv: 2507.01599

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.87)

Abacus: A Cost-Based Optimizer for Semantic Operator Systems

Russo, Matthew, Sudhir, Sivaprasad, Vitagliano, Gerardo, Liu, Chunwei, Kraska, Tim, Madden, Samuel, Cafarella, Michael

arXiv.org Artificial Intelligence, Jun-18-2025

LLMs enable an exciting new class of data processing applications over large collections of unstructured documents. Several new programming frameworks have enabled developers to build these applications by composing them out of semantic operators: a declarative set of AI-powered data transformations with natural language specifications. These include LLM-powered maps, filters, joins, etc. used for document processing tasks such as information extraction, summarization, and more. While systems of semantic operators have achieved strong performance on benchmarks, they can be difficult to optimize. An optimizer for this setting must determine how to physically implement each semantic operator in a way that optimizes the system globally. Existing optimizers are limited in the number of optimizations they can apply, and most (if not all) cannot optimize system quality, cost, or latency subject to constraint(s) on the other dimensions. In this paper we present Abacus, an extensible, cost-based optimizer which searches for the best implementation of a semantic operator system given a (possibly constrained) optimization objective. Abacus estimates operator performance by leveraging a minimal set of validation examples and, if available, prior beliefs about operator performance. We evaluate Abacus on document processing workloads in the biomedical and legal domains (BioDEX; CUAD) and multi-modal question answering (MMQA). We demonstrate that systems optimized by Abacus achieve 18.7%-39.2% better quality and up to 23.6x lower cost and 4.2x lower latency than the next best system.
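The constrained optimization problem described in this abstract can be illustrated with a small Python sketch: pick one physical implementation per semantic operator so that estimated quality is maximized while total estimated cost stays under a budget. This is not Abacus's search algorithm, which the abstract does not detail. The sketch uses plain exhaustive enumeration, assumes plan quality is the product of per-operator qualities, and all names (`Impl`, `best_plan`) and numbers are hypothetical.

```python
from itertools import product
from typing import Dict, List, NamedTuple, Optional


class Impl(NamedTuple):
    name: str
    quality: float  # estimated quality, e.g. measured on validation examples
    cost: float     # estimated cost per run, in arbitrary units


def best_plan(
    candidates: Dict[str, List[Impl]], max_cost: float
) -> Optional[Dict[str, Impl]]:
    """Exhaustively pick one implementation per operator.

    Maximizes the product of per-operator qualities (an assumption of this
    sketch) subject to the total cost not exceeding max_cost.
    Returns None when no combination fits the budget.
    """
    best, best_quality = None, -1.0
    for combo in product(*candidates.values()):
        if sum(impl.cost for impl in combo) > max_cost:
            continue  # violates the cost constraint
        quality = 1.0
        for impl in combo:
            quality *= impl.quality
        if quality > best_quality:
            best, best_quality = dict(zip(candidates, combo)), quality
    return best


# Hypothetical operators and estimates, for illustration only.
candidates = {
    "filter": [Impl("small-model", 0.80, 1.0), Impl("large-model", 0.95, 5.0)],
    "map": [Impl("small-model", 0.70, 2.0), Impl("large-model", 0.90, 8.0)],
}
plan = best_plan(candidates, max_cost=10.0)
print({op: impl.name for op, impl in plan.items()})
```

Exhaustive enumeration grows exponentially with the number of operators, which is one reason a dedicated optimizer is needed for realistic pipelines.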

data mining, large language model, machine learning, (25 more...)

arXiv: 2505.14661

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)

Genre:

Research Report (1.00)
Workflow (0.93)

Industry: Government > Military (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Semantic Integrity Constraints: Declarative Guardrails for AI-Augmented Data Processing Systems

Lee, Alexander W., Chan, Justin, Fu, Michael, Kim, Nicolas, Mehta, Akshay, Raghavan, Deepti, Cetintemel, Ugur

arXiv.org Artificial Intelligence, Mar-1-2025

The emergence of AI-augmented Data Processing Systems (DPSs) has introduced powerful semantic operators that extend traditional data management capabilities with LLM-based processing. However, these systems face fundamental reliability (a.k.a. trust) challenges, as LLMs can generate erroneous outputs, limiting their adoption in critical domains. Existing approaches to LLM constraints--ranging from user-defined functions to constrained decoding--are fragmented, imperative, and lack semantics-aware integration into query execution. To address this gap, we introduce Semantic Integrity Constraints (SICs), a novel declarative abstraction that extends traditional database integrity constraints to govern and optimize semantic operators within DPSs. SICs integrate seamlessly into the relational model, allowing users to specify common classes of constraints (e.g., grounding and soundness) while enabling query-aware enforcement and optimization strategies. In this paper, we present the core design of SICs, describe their formal integration into query execution, and detail our conception of grounding constraints, a key SIC class that ensures factual consistency of generated outputs. In addition, we explore novel enforcement mechanisms, combining proactive (constrained decoding) and reactive (validation and recovery) techniques to optimize efficiency and reliability. Our work establishes SICs as a foundational framework for trustworthy, high-performance AI-augmented data processing, paving the way for future research in constraint-driven optimizations, adaptive enforcement, and enterprise-scale deployments.
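The reactive enforcement style mentioned in this abstract (validation and recovery) can be sketched in Python as a validate-then-retry loop around a generator. The sketch assumes a deliberately simplified notion of grounding, namely that the output must appear verbatim in the source text; the paper's formal definition is not given in the abstract. The names `is_grounded`, `extract_with_grounding`, and `flaky_generator` are hypothetical, and the generator is a stub rather than a real LLM.

```python
from typing import Callable, Optional


def is_grounded(output: str, source: str) -> bool:
    """Simplified grounding check: non-empty output found verbatim in source."""
    text = output.strip().lower()
    return bool(text) and text in source.lower()


def extract_with_grounding(
    source: str,
    generate: Callable[[str, int], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Validate each generation and retry on violation (reactive enforcement).

    Returns the first grounded output, or None if every attempt fails.
    """
    for attempt in range(max_attempts):
        candidate = generate(source, attempt)
        if is_grounded(candidate, source):
            return candidate
    return None


def flaky_generator(source: str, attempt: int) -> str:
    """Stub model: hallucinates on the first attempt, then answers correctly."""
    return "ibuprofen" if attempt == 0 else "aspirin"


note = "Patient was prescribed aspirin 100 mg daily."
result = extract_with_grounding(note, flaky_generator)
print(result)
```

Returning `None` instead of an ungrounded value lets the caller decide how to recover, for example by leaving the field empty or escalating to a human reviewer.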

constraint, operator, semantic operator, (14 more...)

arXiv: 2503.006

Country:

North America > United States > New York > New York County > New York City (0.05)
North America > United States > Rhode Island (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(10 more...)

Genre: Research Report (0.65)

Industry:

Health & Medicine (1.00)
Information Technology > Software (0.81)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.54)

AnDB: Breaking Boundaries with an AI-Native Database for Universal Semantic Analysis

Wang, Tianqing, Xue, Xun, Li, Guoliang, Wang, Yong

arXiv.org Artificial Intelligence, Feb-19-2025

In this demonstration, we present AnDB, an AI-native database that supports traditional OLTP workloads and innovative AI-driven tasks, enabling unified semantic analysis across structured and unstructured data. While structured data analytics is mature, challenges remain in bridging the semantic gap between user queries and unstructured data. AnDB addresses these issues by leveraging cutting-edge AI-native technologies, allowing users to perform semantic queries using intuitive SQL-like statements without requiring AI expertise. This approach eliminates the ambiguity of traditional text-to-SQL systems and provides a seamless end-to-end optimization for analyzing all data types. AnDB automates query processing by generating multiple execution plans and selecting the optimal one through its optimizer, which balances accuracy, execution time, and financial cost based on user policies and internal optimizing mechanisms. AnDB future-proofs data management infrastructure, empowering users to effectively and efficiently harness the full potential of all kinds of data without starting from scratch.
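Choosing among several execution plans under a user policy can be sketched as a weighted scoring problem. The Python example below is a minimal illustration, not AnDB's optimizer: the scoring formula, the normalization by the maximum time and cost, the weights, and the plan names and estimates are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Plan:
    name: str
    accuracy: float  # estimated accuracy in [0, 1]
    seconds: float   # estimated execution time (assumed positive)
    dollars: float   # estimated financial cost (assumed positive)


def choose_plan(
    plans: List[Plan], w_accuracy: float, w_time: float, w_cost: float
) -> Plan:
    """Return the plan with the highest policy-weighted score.

    Time and cost are normalized by the largest value among the candidates
    so that the three weights are comparable (a choice made for this sketch).
    """
    max_seconds = max(p.seconds for p in plans)
    max_dollars = max(p.dollars for p in plans)

    def score(p: Plan) -> float:
        return (
            w_accuracy * p.accuracy
            - w_time * p.seconds / max_seconds
            - w_cost * p.dollars / max_dollars
        )

    return max(plans, key=score)


# Hypothetical candidate plans and estimates, for illustration only.
plans = [
    Plan("llm-on-every-row", 0.95, 120.0, 4.00),
    Plan("vector-prefilter-then-llm", 0.90, 30.0, 0.80),
    Plan("vector-similarity-only", 0.70, 2.0, 0.01),
]
accurate = choose_plan(plans, w_accuracy=1.0, w_time=0.0, w_cost=0.0)
balanced = choose_plan(plans, w_accuracy=1.0, w_time=0.1, w_cost=0.1)
frugal = choose_plan(plans, w_accuracy=0.2, w_time=1.0, w_cost=1.0)
print(accurate.name, balanced.name, frugal.name)
```

Changing only the weights moves the choice from the most accurate plan to the cheapest one, which is the kind of trade-off a policy-driven optimizer has to expose.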

operator, unstructured data, vector similarity, (14 more...)

arXiv: 2502.13805

Country:

North America > Canada > Ontario > Toronto (0.05)
Asia > China > Zhejiang Province > Hangzhou (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre:

Workflow (0.71)
Research Report (0.66)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Databases (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.71)

Relational Generalized Few-Shot Learning

Shi, Xiahan, Salewski, Leonard, Schiegg, Martin, Akata, Zeynep, Welling, Max

arXiv.org Machine Learning, Jul-22-2019

Transferring learned models to novel tasks is a challenging problem, particularly if only very few labeled examples are available. Although this few-shot learning setup has received a lot of attention recently, most proposed methods focus on discriminating novel classes only. Instead, we consider the extended setup of generalized few-shot learning (GFSL), where the model is required to perform classification on the joint label space consisting of both previously seen and novel classes. We propose a graph-based framework that explicitly models relationships between all seen and novel classes in the joint label space. Our model Graph-convolutional Global Prototypical Networks (GcGPN) incorporates these inter-class relations using graph-convolution in order to embed novel class representations into the existing space of previously seen classes in a globally consistent manner. Our approach ensures both fast adaptation and global discrimination, which is the major challenge in GFSL. We demonstrate the benefits of our model on two challenging benchmark datasets.

artificial intelligence, label space, machine learning, (17 more...)

arXiv: 1907.09557

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)