Goto

Collaborating Authors

 Ontologies



Clinical semantics for lung cancer prediction

arXiv.org Artificial Intelligence

Background: Existing clinical prediction models often represent patient data using features that ignore the semantic relationships between clinical concepts. This study integrates domain-specific semantic information by mapping the SNOMED medical term hierarchy into a low-dimensional hyperbolic space using Poincarรฉ embeddings, with the aim of improving lung cancer onset prediction. Methods: Using a retrospective cohort from the Optum EHR dataset, we derived a clinical knowledge graph from the SNOMED taxonomy and generated Poincarรฉ embeddings via Riemannian stochastic gradient descent. These embeddings were then incorporated into two deep learning architectures, a ResNet and a Transformer model. Models were evaluated for discrimination (area under the receiver operating characteristic curve) and calibration (average absolute difference between observed and predicted probabilities) performance. Results: Incorporating pre-trained Poincarรฉ embeddings resulted in modest and consistent improvements in discrimination performance compared to baseline models using randomly initialized Euclidean embeddings. ResNet models, particularly those using a 10-dimensional Poincarรฉ embedding, showed enhanced calibration, whereas Transformer models maintained stable calibration across configurations. Discussion: Embedding clinical knowledge graphs into hyperbolic space and integrating these representations into deep learning models can improve lung cancer onset prediction by preserving the hierarchical structure of clinical terminologies used for prediction. This approach demonstrates a feasible method for combining data-driven feature extraction with established clinical knowledge.


Disentangling concept semantics via multilingual averaging in Sparse Autoencoders

arXiv.org Artificial Intelligence

Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Langue Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese and then pass these texts as prompts to the Gemma 2B LLM. Using the open source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average . We then correlate the conceptual averages with a ground truth mapping between ontology classes. Our results give a strong indication that the conceptual average aligns to the true relationship between classes when compared with a single language by itself. The result hints at a new technique which enables mechanistic interpretation of internal network states with higher accuracy.


Hierarchical Conformal Classification

arXiv.org Machine Learning

Conformal prediction (CP) is a powerful framework for quantifying uncertainty in machine learning models, offering reliable predictions with finite-sample coverage guarantees. When applied to classification, CP produces a prediction set of possible labels that is guaranteed to contain the true label with high probability, regardless of the underlying classifier. However, standard CP treats classes as flat and unstructured, ignoring domain knowledge such as semantic relationships or hierarchical structure among class labels. This paper presents hierarchical conformal classification (HCC), an extension of CP that incorporates class hierarchies into both the structure and semantics of prediction sets. We formulate HCC as a constrained optimization problem whose solutions yield prediction sets composed of nodes at different levels of the hierarchy, while maintaining coverage guarantees. To address the combinatorial nature of the problem, we formally show that a much smaller, well-structured subset of candidate solutions suffices to ensure coverage while upholding optimality. An empirical evaluation on three new benchmarks consisting of audio, image, and text data highlights the advantages of our approach, and a user study shows that annotators significantly prefer hierarchical over flat prediction sets.


Query Logs Analytics: A Aystematic Literature Review

arXiv.org Artificial Intelligence

In the digital era, user interactions with various resources such as databases, data warehouses, websites, and knowledge graphs (KGs) are increasingly mediated through digital platforms. These interactions leave behind digital traces, systematically captured in the form of logs. Logs, when effectively exploited, provide high value across industry and academia, supporting critical services (e.g., recovery and security), user-centric applications (e.g., recommender systems), and quality-of-service improvements (e.g., performance optimization). Despite their importance, research on log usage remains fragmented across domains, and no comprehensive study currently consolidates existing efforts. This paper presents a systematic survey of log usage, focusing on Database (DB), Data Warehouse (DW), Web, and KG logs. More than 300 publications were analyzed to address three central questions: (1) do different types of logs share common structural and functional characteristics? (2) are there standard pipelines for their usage? (3) which constraints and non-functional requirements (NFRs) guide their exploitation?. The survey reveals a limited number of end-to-end approaches, the absence of standardization across log usage pipelines, and the existence of shared structural elements among different types of logs. By consolidating existing knowledge, identifying gaps, and highlighting opportunities, this survey provides researchers and practitioners with a comprehensive overview of log usage and sheds light on promising directions for future research, particularly regarding the exploitation and democratization of KG logs.


Fitting Ontologies and Constraints to Relational Structures

arXiv.org Artificial Intelligence

We study the problem of fitting ontologies and constraints to positive and negative examples that take the form of a finite relational structure. As ontology and constraint languages, we consider the description logics $\mathcal{E\mkern-2mu L}$ and $\mathcal{E\mkern-2mu LI}$ as well as several classes of tuple-generating dependencies (TGDs): full, guarded, frontier-guarded, frontier-one, and unrestricted TGDs as well as inclusion dependencies. We pinpoint the exact computational complexity, design algorithms, and analyze the size of fitting ontologies and TGDs. We also investigate the related problem of constructing a finite basis of concept inclusions / TGDs for a given set of finite structures. While finite bases exist for $\mathcal{E\mkern-2mu L}$, $\mathcal{E\mkern-2mu LI}$, guarded TGDs, and inclusion dependencies, they in general do not exist for full, frontier-guarded and frontier-one TGDs.


Dispositions and Roles of Generically Dependent Entities

arXiv.org Artificial Intelligence

BFO 2020 does not support functions, dispositions, and roles of generically dependent continuants (like software or datasets). In this paper, we argue that this is a severe limitation, which prevents, for example, the adequate representation of the functions of computer models or the various roles of datasets during the execution of these models. We discuss the aspects of BFO 2020 that prevent the representation of realizable entities of generically dependent continuants. Two approaches to address the issue are presented: (a) the use of defined classes and (b) a proposal of changes that allow BFO to support functions, dispositions, and roles of generically dependent continuants. The latter also addresses limitations of BFO 2020 concerning the roles and dispositions of immaterial entities, particularly boundaries and sites.



An LLM Agent-Based Complex Semantic Table Annotation Approach

arXiv.org Artificial Intelligence

The Semantic Table Annotation (STA) task, which includes Column Type Annotation (CTA) and Cell Entity Annotation (CEA), maps table contents to ontology entities and plays important roles in various semantic applications. However, complex tables often pose challenges such as semantic loss of column names or cell values, strict ontological hierarchy requirements, homonyms, spelling errors, and abbreviations, which hinder annotation accuracy. To address these issues, this paper proposes an LLM-based agent approach for CTA and CEA. We design and implement five external tools with tailored prompts based on the ReAct framework, enabling the STA agent to dynamically select suitable annotation strategies depending on table characteristics. Experiments are conducted on the Tough Tables and BiodivTab datasets from the SemTab challenge, which contain the aforementioned challenges. Our method outperforms existing approaches across various metrics. Furthermore, by leveraging Levenshtein distance to reduce redundant annotations, we achieve a 70% reduction in time costs and a 60% reduction in LLM token usage, providing an efficient and cost-effective solution for STA.


Ontology-Guided Query Expansion for Biomedical Document Retrieval using Large Language Models

arXiv.org Artificial Intelligence

Effective Question Answering (QA) on large biomedical document collections requires effective document retrieval techniques. The latter remains a challenging task due to the domain-specific vocabulary and semantic ambiguity in user queries. We propose BMQExpander, a novel ontology-aware query expansion pipeline that combines medical knowledge - definitions and relationships - from the UMLS Metathesaurus with the generative capabilities of large language models (LLMs) to enhance retrieval effectiveness. We implemented several state-of-the-art baselines, including sparse and dense retrievers, query expansion methods, and biomedical-specific solutions. We show that BMQExpander has superior retrieval performance on three popular biomedical Information Retrieval (IR) benchmarks: NFCorpus, TREC-COVID, and SciFact - with improvements of up to 22.1% in NDCG@10 over sparse baselines and up to 6.5% over the strongest baseline. Further, BMQExpander generalizes robustly under query perturbation settings, in contrast to supervised baselines, achieving up to 15.7% improvement over the strongest baseline. As a side contribution, we publish our paraphrased benchmarks. Finally, our qualitative analysis shows that BMQExpander has fewer hallucinations compared to other LLM-based query expansion baselines.