Query Processing
Improving Scientific Document Retrieval with Concept Coverage-based Query Set Generation
Kang, SeongKu, Jin, Bowen, Kweon, Wonbin, Zhang, Yu, Lee, Dongha, Han, Jiawei, Yu, Hwanjo
In specialized fields like the scientific domain, constructing large-scale human-annotated datasets poses a significant challenge due to the need for domain expertise. Recent methods have employed large language models to generate synthetic queries, which serve as proxies for actual user queries. However, they lack control over the generated content, often resulting in incomplete coverage of the academic concepts in documents. We introduce the Concept Coverage-based Query set Generation (CCQGen) framework, designed to generate a set of queries with comprehensive coverage of a document's concepts. A key distinction of CCQGen is that it adaptively adjusts the generation process based on the previously generated queries: it identifies concepts not sufficiently covered by the previous queries and leverages them as conditions for subsequent query generation. This guides each new query to complement the previous ones, aiding in a thorough understanding of the document. Extensive experiments demonstrate that CCQGen significantly enhances query quality and retrieval performance.
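To make the coverage-adaptive loop concrete, here is a minimal, self-contained sketch of the idea described above. The concept extractor and query generator are naive stand-ins (capitalized-token matching and a template string), and all function names are hypothetical; this is not CCQGen's implementation.

```python
# Sketch of a coverage-adaptive query generation loop in the spirit of the abstract.
# All components are simple placeholders, not the paper's actual modules.

def extract_concepts(document: str) -> set[str]:
    # Stand-in concept extractor: treat capitalized multi-character tokens as "concepts".
    return {tok.strip(".,") for tok in document.split() if tok[0].isupper() and len(tok) > 3}

def uncovered_concepts(concepts: set[str], queries: list[str]) -> set[str]:
    # A concept counts as covered if any previous query mentions it.
    covered = {c for c in concepts for q in queries if c.lower() in q.lower()}
    return concepts - covered

def generate_query(document: str, focus: set[str]) -> str:
    # Stand-in for an LLM call conditioned on the not-yet-covered concepts.
    return "What does the document say about " + ", ".join(sorted(focus)[:3]) + "?"

def coverage_based_query_set(document: str, k: int) -> list[str]:
    concepts = extract_concepts(document)
    queries: list[str] = []
    for _ in range(k):
        # Condition the next query on whatever earlier queries have not yet touched.
        focus = uncovered_concepts(concepts, queries) or concepts
        queries.append(generate_query(document, focus))
    return queries
```

The point of the sketch is the control flow: each iteration conditions the next query on the concepts the earlier queries have not yet covered, so the queries complement one another rather than repeat.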
SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL
Lee, Jimin, Baek, Ingeol, Kim, Byeongjeong, Lee, Hwanhee
Text-to-SQL aims to convert natural language questions into executable SQL queries. While previous approaches, such as skeleton-masked selection, have demonstrated strong performance by retrieving similar training examples to guide large language models (LLMs), they struggle in real-world scenarios where such examples are unavailable. To overcome this limitation, we propose Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL (SAFE-SQL), a novel framework that improves SQL generation by generating and filtering self-augmented examples. SAFE-SQL first prompts an LLM to generate multiple Text-to-SQL examples relevant to the test input. It then filters these examples through three relevance assessments, constructing high-quality in-context learning examples. Using these self-generated examples, SAFE-SQL surpasses previous zero-shot and few-shot Text-to-SQL frameworks, achieving higher execution accuracy. Notably, our approach provides additional performance gains in extra-hard and unseen scenarios, where conventional methods often fail.
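The generate-then-filter idea can be illustrated with a short sketch. The candidate proposer and the relevance scores below are simple placeholders (an LLM stub and token-overlap heuristics), not SAFE-SQL's prompts or its three assessment criteria.

```python
# Sketch of generate-then-filter in-context example construction: propose candidate
# (question, SQL) examples for the test input, score each, and keep only the best.
# Proposal and scoring are naive stand-ins for illustration only.

from dataclasses import dataclass

@dataclass
class Example:
    question: str
    sql: str

def propose_example(test_question: str, schema: str) -> Example:
    # Stand-in for an LLM call that writes a related question and its SQL for the schema.
    return Example(question=f"A question similar to: {test_question}",
                   sql=f"SELECT * FROM example_table  -- would reference tables in: {schema}")

def token_overlap(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(1, len(sa | sb))

def build_in_context_examples(test_question: str, schema: str,
                              n_candidates: int = 8, threshold: float = 0.3) -> list[Example]:
    candidates = [propose_example(test_question, schema) for _ in range(n_candidates)]
    kept = []
    for ex in candidates:
        # Placeholder checks standing in for the paper's three relevance assessments.
        question_rel = token_overlap(ex.question, test_question)
        schema_rel = token_overlap(ex.sql, schema)
        if (question_rel + schema_rel) / 2 >= threshold:
            kept.append(ex)
    return kept
```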
Jinyang Li
Text-to-SQL parsing, which aims at converting natural language questions into executable SQL queries, has gained increasing attention in recent years. In particular, GPT-4 and Claude-2 have shown impressive results on this task. However, most of the prevalent benchmarks, such as Spider and WikiSQL, focus on database schemas with only a few rows of database values, leaving a gap between academic study and real-world applications.
On the query complexity of sampling from non-log-concave distributions
We study the problem of sampling from a $d$-dimensional distribution with density $p(x)\propto e^{-f(x)}$, which does not necessarily satisfy good isoperimetric conditions. Specifically, we show that for any $L,M$ satisfying $LM\ge d\ge 5$, $\epsilon\in \left(0,\frac{1}{32}\right)$, and any algorithm with query accesses to the value of $f(x)$ and $\nabla f(x)$, there exists an $L$-log-smooth distribution with second moment at most $M$ such that the algorithm requires $\left(\frac{LM}{d\epsilon}\right)^{\Omega(d)}$ queries to compute a sample whose distribution is within $\epsilon$ in total variation distance to the target distribution. We complement the lower bound with an algorithm requiring $\left(\frac{LM}{d\epsilon}\right)^{\mathcal O(d)}$ queries, thereby characterizing the tight (up to the constant in the exponent) query complexity for sampling from the family of non-log-concave distributions. Our results are in sharp contrast with the recent work of Huang et al. (COLT'24), where an algorithm with quasi-polynomial query complexity was proposed for sampling from a non-log-concave distribution when $M=\mathtt{poly}(d)$. Their algorithm works under the stronger condition that all distributions along the trajectory of the Ornstein-Uhlenbeck process, starting from the target distribution, are $\mathcal O(1)$-log-smooth. We investigate this condition and prove that it is strictly stronger than requiring the target distribution to be $\mathcal O(1)$-log-smooth. Additionally, we study this condition in the context of mixtures of Gaussians. Finally, we place our results within the broader theme of ``sampling versus optimization'', as studied in Ma et al. (PNAS'19). We show that for a wide range of parameters, sampling is strictly easier than optimization by a super-exponential factor in the dimension $d$.
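For readability, the matching bounds stated in the abstract can be collected into one display; $Q_{\mathrm{sample}}$ is introduced here only as shorthand for the number of queries to $f$ and $\nabla f$ needed to produce a sample within total variation distance $\epsilon$ of the target:
$$\left(\frac{LM}{d\epsilon}\right)^{\Omega(d)} \;\le\; Q_{\mathrm{sample}} \;\le\; \left(\frac{LM}{d\epsilon}\right)^{\mathcal{O}(d)}, \qquad LM \ge d \ge 5,\quad \epsilon \in \left(0, \tfrac{1}{32}\right).$$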
Zeroth-Order Negative Curvature Finding: Escaping Saddle Points without Gradients
Nanjing University of Information Science & Technology; Harbin Institute of Technology
We consider escaping saddle points of nonconvex problems where only function evaluations can be accessed. Although a variety of methods have been proposed, the majority require either second- or first-order information, and only a few exploit zeroth-order methods; in particular, negative curvature finding, which has proven to be the most efficient technique for escaping saddle points, has rarely been realized with zeroth-order information. To fill this gap, we propose two zeroth-order negative curvature finding frameworks that can replace Hessian-vector product computations without increasing the iteration complexity. We apply the proposed frameworks to ZO-GD, ZO-SGD, ZO-SCSG, and ZO-SPIDER, and prove that these ZO algorithms converge to (ε, δ)-approximate second-order stationary points with lower query complexity than prior zeroth-order works for finding local minima.
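The key ingredient is replacing Hessian-vector products with quantities computable from function values alone. Below is a minimal sketch (not the paper's estimator) of a finite-difference Hessian-vector product built from zeroth-order gradient estimates, plus a crude shifted power iteration that probes for negative curvature; the shift constant is an assumed bound on the Hessian spectral norm, and each Hessian-vector product here costs about 4d function evaluations.

```python
# Zeroth-order (function-value-only) negative curvature probe: a rough illustration,
# not the paper's framework.

import numpy as np

def zo_gradient(f, x: np.ndarray, h: float = 1e-5) -> np.ndarray:
    """Central-difference gradient estimate using only function evaluations."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def zo_hvp(f, x: np.ndarray, v: np.ndarray, h: float = 1e-4) -> np.ndarray:
    """Estimate H(x) @ v as a finite difference of zeroth-order gradients."""
    return (zo_gradient(f, x + h * v) - zo_gradient(f, x - h * v)) / (2 * h)

def estimate_min_curvature(f, x: np.ndarray, iters: int = 50, seed: int = 0) -> float:
    """Power iteration on shift*I - H to find the most negative curvature direction."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.size)
    v /= np.linalg.norm(v)
    shift = 10.0  # assumed upper bound on the Hessian spectral norm
    for _ in range(iters):
        w = shift * v - zo_hvp(f, x, v)          # one power-iteration step
        v = w / (np.linalg.norm(w) + 1e-12)
    return float(v @ zo_hvp(f, x, v))            # Rayleigh quotient v^T H v
```

On the toy saddle f(x) = x₀² − x₁² at the origin, estimate_min_curvature returns a value close to −2, exposing the escape direction along the second coordinate.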
On Margin-Based Cluster Recovery with Oracle Queries
Marco Bressan
We study an active cluster recovery problem where, given a set of n points and an oracle answering queries like "are these two points in the same cluster?", the task is to recover exactly all clusters using as few queries as possible. We begin by introducing a simple but general notion of margin between clusters that captures, as special cases, the margins used in previous works, the classic SVM margin, and standard notions of stability for center-based clusterings. Under our margin assumptions we design algorithms that, in a variety of settings, recover all clusters exactly using only O(log n) queries.
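The oracle model is easy to picture with a naive baseline: keep one representative per discovered cluster and query each new point against the representatives. The sketch below does exactly that in O(nk) queries; it is not the paper's margin-based algorithm, which exploits the margin structure to achieve exact recovery with far fewer queries.

```python
# Naive same-cluster-oracle baseline: illustrates the query model, not the paper's method.

from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def recover_clusters(points: Sequence[T],
                     same_cluster: Callable[[T, T], bool]) -> list[list[T]]:
    clusters: list[list[T]] = []
    for p in points:
        for cluster in clusters:
            if same_cluster(p, cluster[0]):   # one oracle query per existing cluster
                cluster.append(p)
                break
        else:
            clusters.append([p])              # p starts a new cluster
    return clusters
```

With ground-truth labels at hand, the oracle can be simulated by a function that simply compares the labels of the two points.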
The Query Complexity of Cake Cutting
We consider the query complexity of cake cutting in the standard query model and give lower and upper bounds for computing approximately envy-free, perfect, and equitable allocations with the minimum number of cuts. The lower bounds are tight for computing contiguous envy-free allocations among n = 3 players and for computing perfect and equitable allocations with a minimum number of cuts between n = 2 players. For ε-envy-free allocations with contiguous pieces, we also give an upper bound of O(n/ε) and a lower bound of Ω(log(1/ε)) queries for any number n ≥ 3 of players. We also formalize moving-knife procedures and show that a large subclass of this family, which captures all the known moving-knife procedures, can be simulated efficiently with arbitrarily small error in the Robertson-Webb query model.
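The Robertson-Webb model referenced above admits a compact illustration: an EVAL query returns a player's value for an interval, and a CUT query returns the point at which an interval reaches a prescribed value. The sketch below simulates CUT by binary search over a given valuation (in the model it is a single oracle call) and uses it for the classic two-player cut-and-choose protocol; the valuation type and function names are illustrative, not from the paper.

```python
# Minimal simulation of Robertson-Webb EVAL/CUT queries and two-player cut-and-choose.

from typing import Callable

Valuation = Callable[[float, float], float]  # value of interval [x, y]; total cake value is 1

def cut(value: Valuation, x: float, alpha: float, tol: float = 1e-9) -> float:
    """CUT query: find y with value(x, y) = alpha (simulated here by binary search)."""
    lo, hi = x, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if value(x, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def cut_and_choose(v1: Valuation, v2: Valuation) -> tuple[tuple[float, float], tuple[float, float]]:
    """Envy-free division for two players: player 1 cuts, player 2 chooses."""
    y = cut(v1, 0.0, v1(0.0, 1.0) / 2)           # CUT: player 1 halves the cake by her own measure
    left, right = (0.0, y), (y, 1.0)
    if v2(*left) >= v2(*right):                   # EVAL: player 2 takes the piece she prefers
        return right, left                        # (piece for player 1, piece for player 2)
    return left, right

# Example: with uniform valuations the cut lands near 0.5 and neither player envies the other.
uniform: Valuation = lambda x, y: max(0.0, y - x)
p1_piece, p2_piece = cut_and_choose(uniform, uniform)
```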
Evaluating Entity Retrieval in Electronic Health Records: a Semantic Gap Perspective
Zhao, Zhengyun, Yuan, Hongyi, Liu, Jingjing, Chen, Haichao, Ying, Huaiyuan, Zhou, Songchi, Yu, Sheng
Entity retrieval plays a crucial role in the utilization of Electronic Health Records (EHRs) and is applied across a wide range of clinical practices. However, a comprehensive evaluation of this task is lacking due to the absence of a public benchmark. In this paper, we propose the development and release of a novel benchmark for evaluating entity retrieval in EHRs, with a particular focus on the semantic gap issue. Using discharge summaries from the MIMIC-III dataset, we incorporate ICD codes and prescription labels associated with the notes as queries, and annotate relevance judgments using GPT-4. In total, we use 1,000 patient notes, generate 1,246 queries, and provide over 77,000 relevance annotations. To offer the first assessment of the semantic gap, we introduce a novel classification system for relevance matches. Leveraging GPT-4, we categorize each relevant pair into one of five categories: string, synonym, abbreviation, hyponym, and implication. Using the proposed benchmark, we evaluate several retrieval methods, including BM25, query expansion, and state-of-the-art dense retrievers. Our findings show that BM25 provides a strong baseline but struggles with semantic matches. Query expansion significantly improves performance, though it slightly reduces string match capabilities. Dense retrievers outperform traditional methods, particularly for semantic matches, and general-domain dense retrievers often surpass those trained specifically in the biomedical domain.
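As a rough illustration of how a lexical baseline like the BM25 runs reported above could be scored against relevance judgments, here is a small sketch using the rank_bm25 package as a convenient stand-in; the paper's exact indexing pipeline, query construction, and metrics may differ, and recall@k is only one of several reasonable measures.

```python
# Sketch of a BM25 retrieval baseline scored with recall@k; illustrative only.

from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def recall_at_k(notes: list[str], queries: dict[str, set[int]], k: int = 10) -> float:
    """`queries` maps each query string to the set of relevant note indices."""
    bm25 = BM25Okapi([tokenize(n) for n in notes])
    hits, total = 0, 0
    for query, relevant in queries.items():
        scores = bm25.get_scores(tokenize(query))
        top_k = sorted(range(len(notes)), key=lambda i: scores[i], reverse=True)[:k]
        hits += len(relevant & set(top_k))
        total += len(relevant)
    return hits / max(1, total)
```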