Data Complexity
Data Complexity of Querying Description Logic Knowledge Bases under Cost-Based Semantics
Bienvenu, Meghyn, Manière, Quentin
In this paper, we study the data complexity of querying inconsistent weighted description logic (DL) knowledge bases under recently introduced cost-based semantics. In a nutshell, the idea is to assign each interpretation a cost based on the weights of the violated axioms and assertions; certain and possible query answers are then determined by considering all (resp. some) interpretations having optimal or bounded cost. Whereas the initial study of cost-based semantics focused on DLs between $\mathcal{EL}_\bot$ and $\mathcal{ALCO}$, we consider DLs that may contain inverse roles and role inclusions, thus covering prominent DL-Lite dialects. Our data complexity analysis goes significantly beyond existing results by sharpening several lower bounds and pinpointing the precise complexity of optimal-cost certain answer semantics (no non-trivial upper bound was previously known). Moreover, while all existing results show the intractability of cost-based semantics, our most challenging and surprising result establishes that, for $\text{DL-Lite}^\mathcal{H}_\mathsf{bool}$ ontologies and a fixed cost bound, certain answers for instance queries and possible answers for conjunctive queries can be computed using first-order rewriting and thus enjoy the lowest possible data complexity ($\mathsf{TC}^0$).
- Europe > Germany > Saxony > Leipzig (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Europe > France > Nouvelle-Aquitaine > Gironde > Bordeaux (0.04)
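As a schematic rendering of the cost-based semantics described above (the weight function $w$ and the notation are ours, not necessarily the paper's): each interpretation $\mathcal{I}$ is charged the weights of the axioms and assertions of the weighted KB $\mathcal{K}$ that it violates,

$$\mathrm{cost}(\mathcal{I}) \;=\; \sum_{\alpha \in \mathcal{K} \,:\, \mathcal{I} \not\models \alpha} w(\alpha),$$

so that, under cost bound $k$, a tuple $\vec{a}$ is a certain answer to $q$ if $\mathcal{I} \models q(\vec{a})$ for every interpretation $\mathcal{I}$ with $\mathrm{cost}(\mathcal{I}) \le k$, and a possible answer if this holds for some such $\mathcal{I}$; optimal-cost semantics replaces the fixed bound $k$ with $\min_{\mathcal{J}} \mathrm{cost}(\mathcal{J})$.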
Curriculum-Guided Layer Scaling for Language Model Pretraining
Singh, Karanpartap, Band, Neil, Adeli, Ehsan
As the cost of pretraining large language models grows, there is continued interest in strategies to improve learning efficiency during this core training stage. Motivated by cognitive development, where humans gradually build knowledge as their brains mature, we propose Curriculum-Guided Layer Scaling (CGLS), a framework for compute-efficient pretraining that synchronizes increasing data difficulty with model growth through progressive layer stacking (i.e., gradually adding layers during training). At the 100M parameter scale, using a curriculum transitioning from synthetic short stories to general web data, CGLS outperforms baseline methods on the question-answering benchmarks PIQA and ARC. Our results show that progressively increasing model depth alongside sample difficulty leads to better generalization and zero-shot performance on various downstream benchmarks. Altogether, our findings demonstrate that CGLS unlocks the potential of progressive stacking, offering a simple yet effective strategy for improving generalization on knowledge-intensive and reasoning tasks.

Large language models (LLMs) are typically pretrained in a single, continuous pass, processing all tokens with a uniform amount of computation regardless of their complexity or relevance to downstream tasks of interest. While this approach has shown remarkable success in large-scale models like GPT-4 (OpenAI et al., 2023) and Llama 3 (Dubey et al., 2024), it differs significantly from how humans learn, often leading to models that excel at generating coherent text but struggle with long-context reasoning across varied tasks (Schnabel et al., 2025). Recent works like Phi-3 (Abdin et al., 2024), MiniCPM (Hu et al., 2024), and others (Feng et al., 2024) have explored mid-training, adjusting the training data distribution partway through training by incorporating higher-quality, multilingual, or long-form text. However, this coarse-grained curriculum is applied to fixed model architectures. Inspired by how humans progressively build knowledge alongside their physically growing brains, we explore whether gradually scaling a model in tandem with increasingly complex data can enable more efficient and effective learning.
- North America > United States (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > Jordan (0.04)
- Education (1.00)
- Health & Medicine (0.68)
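A minimal sketch of the progressive layer stacking idea behind CGLS, under our own simplifying assumptions (fresh layers appended at stage boundaries; class, function, and dataset names are illustrative, not the authors' code):

```python
# Sketch of curriculum-guided layer scaling: each curriculum stage appends
# fresh transformer layers and switches to a harder data source.
import torch.nn as nn

class GrowingLM(nn.Module):
    def __init__(self, d_model=256, n_heads=4, vocab_size=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList()   # grown over the course of training
        self.head = nn.Linear(d_model, vocab_size)
        self.d_model, self.n_heads = d_model, n_heads

    def grow(self, n_new):
        # Progressive stacking: append fresh layers while earlier layers
        # keep the weights they have already learned.
        for _ in range(n_new):
            self.layers.append(nn.TransformerEncoderLayer(
                self.d_model, self.n_heads, batch_first=True))

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for layer in self.layers:
            x = layer(x)
        return self.head(x)

# Hypothetical curriculum: (data source, layers to add) per stage.
stages = [("synthetic_short_stories", 4), ("general_web_text", 4)]
model = GrowingLM()
for source, n_new in stages:
    model.grow(n_new)
    # train(model, load_dataset(source))  # ordinary LM training loop
```

Keeping the earlier layers' trained weights while appending new ones is what makes the stacking "progressive"; the paper's contribution is synchronizing each growth step with a jump in data difficulty.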
Knowledge Distillation for Variational Quantum Convolutional Neural Networks on Heterogeneous Data
Yu, Kai, Cai, Binbin, Lin, Song
Distributed quantum machine learning faces significant challenges due to heterogeneous client data and variations in local model structures, which hinder global model aggregation. To address these challenges, we propose a knowledge distillation framework for variational quantum convolutional neural networks (VQCNNs) on heterogeneous data. The framework features a quantum gate number estimation mechanism based on client data, which guides the construction of resource-adaptive VQCNN circuits. Particle swarm optimization is employed to efficiently generate personalized quantum models tailored to local data characteristics. During aggregation, a knowledge distillation strategy integrating both soft-label and hard-label supervision consolidates knowledge from heterogeneous clients using a public dataset, forming a global model while avoiding parameter exposure and privacy leakage. Theoretical analysis shows that the proposed framework benefits from high-dimensional quantum representations, offering advantages over classical approaches, and minimizes communication by exchanging only model indices and test outputs. Extensive simulations on the PennyLane platform validate the effectiveness of the gate number estimation and distillation-based aggregation. Experimental results demonstrate that the aggregated global model achieves accuracy close to that of fully supervised centralized training. These results show that the proposed method can effectively handle heterogeneity, reduce resource consumption, and maintain performance, highlighting its potential for scalable and privacy-preserving distributed quantum learning.
- Asia > China > Fujian Province > Fuzhou (0.04)
- Asia > China > Beijing > Beijing (0.04)
- South America > Colombia > Huila Department > Neiva (0.04)
- Europe > Switzerland (0.04)
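A minimal sketch of the soft-plus-hard-label distillation step described above, in plain PyTorch (the VQCNN clients are abstracted as per-client logits on the shared public dataset; the temperature and mixing weight are our assumptions, not the paper's hyperparameters):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, client_logits_list, hard_labels,
                 temperature=2.0, alpha=0.5):
    # alpha balances soft-label supervision (KL to the averaged client
    # predictions) against hard-label cross-entropy.
    teacher_probs = torch.stack(
        [F.softmax(logits / temperature, dim=-1)
         for logits in client_logits_list]).mean(dim=0)
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    teacher_probs, reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard
```

Note that only client outputs on the public dataset enter the loss, which is consistent with the abstract's claim that no model parameters are exchanged.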
Analysing Temporal Reasoning in Description Logics Using Formal Grammars
Bourgaux, Camille, Gnatenko, Anton, Thomazo, Michaël
We establish a correspondence between (fragments of) $\mathcal{TEL}^\bigcirc$, a temporal extension of the $\mathcal{EL}$ description logic with the LTL operator $\bigcirc^k$, and certain kinds of formal grammars, in particular conjunctive grammars (context-free grammars equipped with the operation of intersection). This connection implies that $\mathcal{TEL}^\bigcirc$ does not possess the property of ultimate periodicity of models, and further leads to undecidability of query answering in $\mathcal{TEL}^\bigcirc$, closing a question left open since the introduction of $\mathcal{TEL}^\bigcirc$. Moreover, it also allows us to establish decidability of query answering for some interesting new fragments of $\mathcal{TEL}^\bigcirc$, and to reuse for this purpose existing tools and algorithms for conjunctive grammars.
- North America > United States (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
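For readers unfamiliar with conjunctive grammars, here is the standard textbook example (due to Okhotin, not taken from this paper): the grammar

$$S \to AB \,\&\, DC, \qquad A \to aA \mid \varepsilon, \quad B \to bBc \mid \varepsilon, \quad C \to cC \mid \varepsilon, \quad D \to aDb \mid \varepsilon$$

generates the non-context-free language $\{a^n b^n c^n \mid n \ge 0\}$: the first conjunct enforces equally many $b$'s and $c$'s, the second equally many $a$'s and $b$'s, and a word is derived only if it satisfies both.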
Tractable Responsibility Measures for Ontology-Mediated Query Answering
Bienvenu, Meghyn, Figueira, Diego, Lafourcade, Pierre
Recent work on quantitative approaches to explaining query answers employs responsibility measures to assign scores to facts in order to quantify their respective contributions to obtaining a given answer. In this paper, we study the complexity of computing such responsibility scores in the setting of ontology-mediated query answering, focusing on a recently introduced family of Shapley-value-based responsibility measures defined in terms of weighted sums of minimal supports (WSMS). By exploiting results from the database setting, we show that such measures enjoy polynomial data complexity for classes of ontology-mediated queries that are first-order rewritable, whereas the problem becomes $\#\mathsf{P}$-hard when the ontology language can encode reachability queries (via axioms like $\exists R.A \sqsubseteq A$). To better understand the tractability frontier, we next explore the combined complexity of WSMS computation. We prove that intractability arises already for atomic queries if the ontology language supports conjunction, as well as for unions of "well-behaved" conjunctive queries, even in the absence of an ontology. By contrast, our study yields positive results for common DL-Lite dialects: by means of careful analysis, we identify classes of structurally restricted conjunctive queries (which intuitively disallow undesirable interactions between query atoms) that admit tractable WSMS computation.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France (0.04)
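For orientation, recall the generic Shapley value of a fact, standard in the database literature (the WSMS family studied in the paper is a specific Shapley-value-based variant whose exact definition we do not reproduce here): for a Boolean query $q$ and a database $D$, the responsibility of a fact $f \in D$ is

$$\mathrm{Sh}_q(f) \;=\; \sum_{S \subseteq D \setminus \{f\}} \frac{|S|!\,(|D|-|S|-1)!}{|D|!}\,\bigl(q(S \cup \{f\}) - q(S)\bigr),$$

i.e. $f$'s average marginal contribution to making the query true, over all orders in which facts can be added.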
CQE under Epistemic Dependencies: Algorithms and Experiments (extended version)
Marconi, Lorenzo, Ricci, Flavia, Rosati, Riccardo
We investigate Controlled Query Evaluation (CQE) over ontologies, where information disclosure is regulated by epistemic dependencies (EDs), a family of logical rules recently proposed for the CQE framework. In particular, we combine EDs with the notion of optimal GA censors, i.e. maximal sets of ground atoms that are entailed by the ontology and can be safely revealed. We focus on answering Boolean unions of conjunctive queries (BUCQs) with respect to the intersection of all optimal GA censors, an approach that has been shown in other contexts to ensure strong security guarantees with favorable computational behavior. First, we characterize the security of this intersection-based approach and identify a class of EDs (namely, full EDs) for which it remains safe. Then, for a subclass of EDs and for $\text{DL-Lite}_{\mathcal{R}}$ ontologies, we show that answering BUCQs under the above CQE semantics is in $\mathsf{AC}^0$ in data complexity, by presenting a suitable, detailed first-order rewriting algorithm. Finally, we report on experiments conducted in two different evaluation scenarios, showing the practical feasibility of our rewriting function.
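As a generic illustration of the first-order rewriting paradigm invoked here (a toy example of ours, not the paper's censor-aware algorithm): given the DL-Lite axiom $\mathsf{Professor} \sqsubseteq \mathsf{Teacher}$, the Boolean query $\exists x\, \mathsf{Teacher}(x)$ rewrites to

$$\exists x\, \bigl(\mathsf{Teacher}(x) \lor \mathsf{Professor}(x)\bigr),$$

which is evaluated directly over the stored data with no further reasoning; since evaluating a fixed first-order query is feasible in $\mathsf{AC}^0$ in data complexity, exhibiting such a rewriting yields the stated upper bound.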
A Geometric View of Data Complexity: Efficient Local Intrinsic Dimension Estimation with Diffusion Models
High-dimensional data commonly lies on low-dimensional submanifolds, and estimating the local intrinsic dimension (LID) of a datum -- i.e. the dimension of the submanifold it belongs to -- is a longstanding problem. LID can be understood as the number of local factors of variation: the more factors of variation a datum has, the more complex it tends to be. Estimating this quantity has proven useful in contexts ranging from generalization in neural networks to detection of out-of-distribution data, adversarial examples, and AI-generated text. The recent successes of deep generative models present an opportunity to leverage them for LID estimation, but current methods based on generative models produce inaccurate estimates, require more than a single pre-trained model, are computationally intensive, or do not exploit the best available deep generative models: diffusion models (DMs). In this work, we show that the Fokker-Planck equation associated with a DM can provide an LID estimator which addresses the aforementioned deficiencies.
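For background (standard diffusion-model material, not a contribution of this paper): if the forward noising process follows the SDE $\mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t$, then the marginal densities $p_t$ satisfy the Fokker-Planck equation

$$\frac{\partial p_t(x)}{\partial t} \;=\; -\nabla \cdot \bigl(f(x,t)\, p_t(x)\bigr) + \frac{g(t)^2}{2}\,\Delta p_t(x),$$

and it is from this PDE, evaluated with a single pre-trained DM, that the paper derives its LID estimator.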
Uncovering Fairness through Data Complexity as an Early Indicator
Ferreira, Juliett Suárez, Slavkovik, Marija, Casillas, Jorge
Fairness is a central concern in machine learning (ML) applications. To date, there has been no study of how disparities in classification complexity between privileged and unprivileged groups influence the fairness of solutions, even though such disparities can serve as a preliminary indicator of potential unfairness. In this work, we investigate this gap. Specifically, we focus on synthetic datasets designed to capture a variety of biases, ranging from historical bias to measurement and representational bias, and evaluate how differences in various complexity metrics correlate with group fairness metrics. We then apply association rule mining to identify patterns that link disproportionate complexity differences between groups with fairness-related outcomes, offering data-centric indicators to guide bias mitigation. Our findings are further validated by applying them to real-world problems, providing evidence that quantifying group-wise classification complexity can uncover early indicators of potential fairness challenges. This investigation helps practitioners proactively address bias in classification tasks.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Switzerland (0.04)
- Europe > Spain > Catalonia (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education (1.00)
- Health & Medicine > Therapeutic Area (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
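A minimal sketch of the group-wise complexity comparison described in the abstract above, using Fisher's discriminant ratio as an example metric (the metric choice and all names are our illustration, not the authors' pipeline):

```python
import numpy as np

def fisher_ratio(X, y):
    # Max over features of between-class / within-class variance; higher
    # values mean the classes are easier to separate (lower complexity).
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return float(np.max(num / den))

def complexity_gap(X, y, group):
    # Difference in classification complexity between the privileged
    # (group == 1) and unprivileged (group == 0) subsets; a large gap is
    # the kind of early, data-centric fairness indicator the paper studies.
    return (fisher_ratio(X[group == 1], y[group == 1])
            - fisher_ratio(X[group == 0], y[group == 0]))
```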
Your Model is Overconfident, and Other Lies We Tell Ourselves
Mickus, Timothee, Sinha, Aman, Vázquez, Raúl
The difficulty intrinsic to a given example, rooted in its inherent ambiguity, is a key yet often overlooked factor in evaluating neural NLP models. We investigate the interplay and divergence among various metrics for assessing intrinsic difficulty, including annotator dissensus, training dynamics, and model confidence. Through a comprehensive analysis using 29 models on three datasets, we reveal that while correlations exist among these metrics, their relationships are neither linear nor monotonic. By disentangling these dimensions of uncertainty, we aim to refine our understanding of data complexity and its implications for evaluating and improving NLP models.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- (14 more...)
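A minimal sketch of how two of the difficulty proxies named in the abstract above can be compared (the data layout and names are our assumptions):

```python
import numpy as np
from scipy.stats import spearmanr

def label_entropy(annotations):
    # Dissensus per example: entropy of the annotator label distribution.
    out = []
    for labels in annotations:            # one list of labels per example
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        out.append(float(-(p * np.log(p)).sum()))
    return np.array(out)

# Given per-example `confidence` (probability of the predicted class),
# a rank correlation probes the relationship; note the paper finds it is
# neither linear nor monotonic, so a single rho cannot capture it fully.
# rho, _ = spearmanr(label_entropy(annotations), confidence)
```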
On Deciding the Data Complexity of Answering Linear Monadic Datalog Queries with LTL Operators (Extended Version)
Artale, Alessandro, Gnatenko, Anton, Ryzhikov, Vladislav, Zakharyaschev, Michael
Our concern is the data complexity of answering linear monadic datalog queries whose atoms in the rule bodies can be prefixed by operators of the linear temporal logic LTL. We first observe that, for data complexity, answering any connected query with operators $\bigcirc/\bigcirc^-$ (at the next/previous moment) is either in $\mathsf{AC}^0$, or in $\mathsf{ACC}^0 \setminus \mathsf{AC}^0$, or $\mathsf{NC}^1$-complete, or $\mathsf{LogSpace}$-hard and in $\mathsf{NLogSpace}$. Then we show that the problem of deciding $\mathsf{LogSpace}$-hardness of answering such queries is $\mathsf{PSpace}$-complete, while checking membership in the classes $\mathsf{AC}^0$ and $\mathsf{ACC}^0$, as well as $\mathsf{NC}^1$-completeness, can be done in $\mathsf{ExpSpace}$. Finally, we prove that membership in $\mathsf{AC}^0$ or in $\mathsf{ACC}^0$, $\mathsf{NC}^1$-completeness, and $\mathsf{LogSpace}$-hardness are undecidable for queries with operators $\Diamond_f/\Diamond_p$ (sometime in the future/past), provided that $\mathsf{NC}^1 \ne \mathsf{NLogSpace}$ and $\mathsf{LogSpace} \ne \mathsf{NLogSpace}$.
- Europe > Italy (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Asia > China (0.04)
- (14 more...)
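A schematic example of the query class in question (our illustration, not drawn from the paper): the program

$$G(x) \leftarrow A(x), \qquad G(x) \leftarrow B(x) \wedge \bigcirc G(x)$$

is linear (at most one IDB atom per rule body) and monadic (the IDB predicate $G$ is unary), and makes $G(x)$ hold at a timestamp $n$ exactly when $x$ satisfies $B$ from $n$ onwards until some moment at which it satisfies $A$.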