 cypher


Structured Interfaces for Automated Reasoning with 3D Scene Graphs

Ray, Aaron, Arkin, Jacob, Biggie, Harel, Fan, Chuchu, Carlone, Luca, Roy, Nicholas

arXiv.org Artificial Intelligence

In order to provide a robot with the ability to understand and react to a user's natural language inputs, the natural language must be connected to the robot's underlying representations of the world. Recently, large language models (LLMs) and 3D scene graphs (3DSGs) have become a popular choice for grounding natural language and representing the world. In this work, we address the challenge of using LLMs with 3DSGs to ground natural language. Existing methods encode the scene graph as serialized text within the LLM's context window, but this encoding does not scale to large or rich 3DSGs. Instead, we propose to use a form of Retrieval Augmented Generation to select a subset of the 3DSG relevant to the task. We encode a 3DSG in a graph database and provide a query language interface (Cypher) as a tool to the LLM with which it can retrieve relevant data for language grounding. We evaluate our approach on instruction following and scene question-answering tasks and compare against baseline context window and code generation methods. Our results show that using Cypher as an interface to 3D scene graphs scales significantly better to large, rich graphs on both local and cloud-based models. This leads to large performance improvements in grounded language tasks while also substantially reducing the token count of the scene graph content. A video supplement is available at https://www.youtube.com/watch?v=zY_YI9giZSA.


STRuCT-LLM: Unifying Tabular and Graph Reasoning with Reinforcement Learning for Semantic Parsing

Stoisser, Josefa Lia, Martell, Marc Boubnovski, Phillips, Lawrence, Hansen, Casper, Fauqueur, Julien

arXiv.org Artificial Intelligence

We propose STRuCT-LLM, a unified framework for training large language models (LLMs) to perform structured reasoning over both relational and graph-structured data. Our approach jointly optimizes Text-to-SQL and Text-to-Cypher tasks using reinforcement learning (RL) combined with Chain-of-Thought (CoT) supervision. To support fine-grained optimization in graph-based parsing, we introduce a topology-aware reward function based on graph edit distance. Unlike prior work that treats relational and graph formalisms in isolation, STRuCT-LLM leverages shared abstractions between SQL and Cypher to induce cross-formalism transfer, enabling SQL training to improve Cypher performance and vice versa--even without shared schemas. Our largest model (QwQ-32B) achieves substantial relative improvements across tasks: on semantic parsing, Spider improves by 13.5% and Text2Cypher by 73.1%. The model also demonstrates strong zero-shot generalization, improving performance on downstream tabular QA (TableBench: 8.5%) and knowledge graph QA (CR-LT-KGQA: 1.7%) without any QA-specific supervision. These results demonstrate both the effectiveness of executable queries as scaffolds for structured reasoning and the synergistic benefits of jointly training on SQL and Cypher (code available at https://github.com/bouv/).

1 Introduction

Large language models (LLMs) demonstrate impressive fluency in open-domain generation but often falter on structured reasoning tasks involving tables and graphs [12, 6]. Structured reasoning requires models to ground entities, compose symbolic constraints, and follow logical paths--skills crucial for interacting with real-world data systems such as relational databases and knowledge graphs (KGs) [16, 24]. We view executable semantic parsing--specifically, Text-to-SQL and Text-to-Cypher--as a gateway to this broader capability [32, 23]. While Text-to-SQL is well-studied, Text-to-Cypher remains underexplored, offering a valuable testbed for graph reasoning.
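
To make the idea of a reward based on graph edit distance concrete, here is a minimal sketch. It is not the reward function used by STRuCT-LLM, whose exact definition is not given in this excerpt. It assumes a simplified setting in which nodes are matched by name, so the edit distance reduces to counting the nodes and edges that would have to be inserted or deleted. The names `edit_distance` and `topology_reward` and the normalization are illustrative assumptions.

```python
from typing import Set, Tuple

# An edge is (source node, relation type, target node). This layout is an
# assumption made for illustration only.
Edge = Tuple[str, str, str]


def edit_distance(
    nodes_a: Set[str], edges_a: Set[Edge], nodes_b: Set[str], edges_b: Set[Edge]
) -> int:
    """Count the node and edge insertions/deletions that turn graph A into B.

    Simplification: nodes with the same name are treated as the same node,
    so no search over node correspondences is needed.
    """
    return len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)


def topology_reward(
    pred_nodes: Set[str],
    pred_edges: Set[Edge],
    gold_nodes: Set[str],
    gold_edges: Set[Edge],
) -> float:
    """Map the edit distance to a reward in [0, 1]; 1.0 means identical graphs."""
    distance = edit_distance(pred_nodes, pred_edges, gold_nodes, gold_edges)
    # The distance can never exceed the combined size of both graphs.
    size = len(pred_nodes | gold_nodes) + len(pred_edges | gold_edges)
    if size == 0:
        return 1.0  # two empty graphs are identical
    return 1.0 - distance / size
```

Unlike an exact-match reward, a graded signal of this kind gives partial credit to a predicted query pattern that is missing only one node or relationship.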


Managing FAIR Knowledge Graphs as Polyglot Data End Points: A Benchmark based on the rdf2pg Framework and Plant Biology Data

Brandizi, Marco, Bobed, Carlos, Garulli, Luca, de Klerk, Arné, Hassani-Pak, Keywan

arXiv.org Artificial Intelligence

Linked data and labelled property graphs (LPG) are two data management approaches with complementary strengths and weaknesses, making their integration beneficial for sharing datasets and supporting software ecosystems. In this paper, we introduce rdf2pg, an extensible framework for mapping RDF data to semantically equivalent LPG formats and databases. Utilising this framework, we perform a comparative analysis of three popular graph databases - Virtuoso, Neo4j, and ArcadeDB - and the well-known graph query languages SPARQL, Cypher, and Gremlin. Our qualitative and quantitative assessments underline the strengths and limitations of these graph database technologies. Additionally, we highlight the potential of rdf2pg as a versatile tool for enabling polyglot access to knowledge graphs, aligning with established standards of linked data and the semantic web.
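
As an illustration of what mapping RDF data to a labelled property graph can involve, the sketch below applies one common convention: `rdf:type` statements become node labels, literal objects become node properties, and all other statements become edges. This is an assumption chosen for illustration, not the rdf2pg mapping mechanism, which this abstract does not describe. The `Literal` and `Node` classes and the `triples_to_lpg` function are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple, Union

RDF_TYPE = "rdf:type"


@dataclass(frozen=True)
class Literal:
    """Marks an RDF object as a literal value rather than a resource."""
    value: Union[str, int, float]


Triple = Tuple[str, str, Union[str, Literal]]
Edge = Tuple[str, str, str]  # (source id, relationship type, target id)


@dataclass
class Node:
    labels: Set[str] = field(default_factory=set)
    properties: Dict[str, object] = field(default_factory=dict)


def triples_to_lpg(triples: List[Triple]) -> Tuple[Dict[str, Node], List[Edge]]:
    """Convert RDF triples to property-graph nodes and edges."""
    nodes: Dict[str, Node] = {}
    edges: List[Edge] = []
    for subject, predicate, obj in triples:
        node = nodes.setdefault(subject, Node())
        if predicate == RDF_TYPE and isinstance(obj, str):
            node.labels.add(obj)  # type statement -> node label
        elif isinstance(obj, Literal):
            node.properties[predicate] = obj.value  # literal -> property
        else:
            nodes.setdefault(obj, Node())  # make sure the target node exists
            edges.append((subject, predicate, obj))  # resource -> edge
    return nodes, edges
```

Under this convention a repeated literal predicate on the same subject overwrites the earlier value, so multi-valued properties would need a list-valued representation.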


CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era

Feng, Yanlin, Papicchio, Simone, Rahman, Sajjadur

arXiv.org Artificial Intelligence

Retrieval from graph data is crucial for augmenting large language models (LLMs) with both open-domain knowledge and private enterprise data, and it is also a key component in the recent GraphRAG system (Edge et al., 2024). Despite decades of research on knowledge graphs and knowledge base question answering, leading LLM frameworks (e.g., LangChain and LlamaIndex) have only minimal support for retrieval from modern encyclopedic knowledge graphs like Wikidata. In this paper, we analyze the root cause and suggest that modern RDF knowledge graphs (e.g., Wikidata, Freebase) are less efficient for LLMs due to overly large schemas that far exceed the typical LLM context window, use of resource identifiers, overlapping relation types and lack of normalization. As a solution, we propose property graph views on top of the underlying RDF graph that can be efficiently queried by LLMs using Cypher. We instantiated this idea on Wikidata and introduced CypherBench, the first benchmark with 11 large-scale, multi-domain property graphs with 7.8 million entities and over 10,000 questions. To achieve this, we tackled several key challenges, including developing an RDF-to-property graph conversion engine, creating a systematic pipeline for text-to-Cypher task generation, and designing new evaluation metrics.


CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research

Huang, Sian-Yao, Yang, Cheng-Lin, Lin, Che-Yu, Huang, Chun-Ying

arXiv.org Artificial Intelligence

This research addresses command-line embedding in cybersecurity, a field obstructed by the lack of comprehensive datasets due to privacy and regulation concerns. We propose the first dataset of similar command lines, named CyPHER, for training and unbiased evaluation. The training set is generated using a set of large language models (LLMs) comprising 28,520 similar command-line pairs. Our testing dataset consists of 2,807 similar command-line pairs sourced from authentic command-line data. In addition, we propose a command-line embedding model named CmdCaliper, enabling the computation of semantic similarity with command lines. Performance evaluations demonstrate that the smallest version of CmdCaliper (30 million parameters) surpasses state-of-the-art (SOTA) sentence embedding models with ten times more parameters across various tasks (e.g., malicious command-line detection and similar command-line retrieval). Our study explores the feasibility of data generation using LLMs in the cybersecurity domain. Furthermore, we release our proposed command-line dataset, embedding models' weights and all program code to the public. This advancement paves the way for more effective command-line embedding for future researchers.


SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task

Zhong, Ziije, Zhong, Linqing, Sun, Zhaoze, Jin, Qingyun, Qin, Zengchang, Zhang, Xiaofan

arXiv.org Artificial Intelligence

Integrating Large Language Models (LLMs) with existing Knowledge Graph (KG) databases presents a promising avenue for enhancing LLMs' efficacy and mitigating their "hallucinations". Given that most KGs reside in graph databases accessible solely through specialized query languages (e.g., Cypher), there exists a critical need to bridge the divide between LLMs and KG databases by automating the translation of natural language into Cypher queries (commonly termed the "Text2Cypher" task). Prior efforts tried to bolster LLMs' proficiency in Cypher generation through Supervised Fine-Tuning. However, these explorations are hindered by the lack of annotated datasets of Query-Cypher pairs, resulting from the labor-intensive and domain-specific nature of annotating such datasets. In this study, we propose SyntheT2C, a methodology for constructing a synthetic Query-Cypher pair dataset, comprising two distinct pipelines: (1) LLM-based prompting and (2) template-filling. SyntheT2C facilitates the generation of extensive Query-Cypher pairs with values sampled from an underlying Neo4j graph database. Subsequently, SyntheT2C is applied to two medical databases, culminating in the creation of a synthetic dataset, MedT2C. Comprehensive experiments demonstrate that the MedT2C dataset effectively enhances the performance of backbone LLMs on the Text2Cypher task. Both the SyntheT2C codebase and the MedT2C dataset will be released soon.
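
The template-filling idea can be sketched as follows. This is a minimal illustration, not the SyntheT2C pipeline. The graph schema (`Drug`, `TREATS`, `Disease`), the two templates, and the helper names are hypothetical, and the values are passed in as a plain list standing in for values sampled from a Neo4j database.

```python
from string import Template
from typing import Dict, List

# Hypothetical paired templates: one natural-language question and the
# Cypher query that answers it, sharing the placeholder $disease.
QUESTION_TEMPLATE = Template("Which drugs treat $disease?")
CYPHER_TEMPLATE = Template(
    "MATCH (d:Drug)-[:TREATS]->(:Disease {name: '$disease'}) RETURN d.name"
)


def escape_cypher_string(value: str) -> str:
    """Escape backslashes and single quotes for a single-quoted Cypher literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def fill_templates(values: List[str]) -> List[Dict[str, str]]:
    """Produce one question/Cypher pair per sampled value."""
    pairs = []
    for value in values:
        pairs.append(
            {
                "question": QUESTION_TEMPLATE.substitute(disease=value),
                "cypher": CYPHER_TEMPLATE.substitute(
                    disease=escape_cypher_string(value)
                ),
            }
        )
    return pairs
```

Because both templates are filled from the same value, every generated question is paired with a query that is correct by construction, which is what makes this route cheaper than manual annotation.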


CyPhERS: A Cyber-Physical Event Reasoning System providing real-time situational awareness for attack and fault response

Müller, Nils, Bao, Kaibin, Matthes, Jörg, Heussen, Kai

arXiv.org Machine Learning

Cyber-physical systems (CPSs) constitute the backbone of critical infrastructures such as power grids or water distribution networks. Operating failures in these systems can cause serious risks for society. To avoid or minimize downtime, operators require real-time awareness about critical incidents. However, online event identification in CPSs is challenged by the complex interdependency of numerous physical and digital components, which requires taking cyber attacks and physical failures equally into account. The online event identification problem is further complicated by the lack of historical observations of critical but rare events, and the continuous evolution of cyber attack strategies. This work introduces and demonstrates CyPhERS, a Cyber-Physical Event Reasoning System. CyPhERS provides real-time information pertaining to the occurrence, location, physical impact, and root cause of potentially critical events in CPSs, without the need for historical event observations. The key novelty of CyPhERS is the capability to generate informative and interpretable event signatures of known and unknown types of both cyber attacks and physical failures. The concept is evaluated and benchmarked on a demonstration case that comprises a multitude of attack and fault events targeting various components of a CPS. The results demonstrate that the event signatures provide relevant and inferable information on both known and unknown event types.


Graph Analytics: Part 1

#artificialintelligence

In my past 3 years as a Data Science professional, I have worked extensively with both RDBMS (Postgres) & Cassandra (NoSQL) but didn't get a chance to explore graph databases. So, it's time to jump into graph databases & how they can be integrated into different data science solutions. Consider this: Observe Google Maps for any city. A graph is basically a collection of nodes (the landmarks) & edges (the roads). Nodes are connected to each other by the edges (or may not be connected at all). Neo4j is the most popular database for analyzing graph data.
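
The landmarks-and-roads picture can be sketched in plain Python as an adjacency structure, with a breadth-first search to find a route with the fewest roads. This is only an illustration of the node/edge idea and does not use Neo4j. The landmark names and the function names are made up for the example.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_graph(
    nodes: Iterable[str], edges: Iterable[Tuple[str, str]]
) -> Dict[str, Set[str]]:
    """Store each landmark with the set of landmarks it shares a road with."""
    graph: Dict[str, Set[str]] = {node: set() for node in nodes}
    for a, b in edges:
        # Roads are treated as two-way, so the edge is added in both directions.
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_route(
    graph: Dict[str, Set[str]], start: str, goal: str
) -> Optional[List[str]]:
    """Breadth-first search; returns None when no route exists."""
    if start not in graph or goal not in graph:
        return None
    previous: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk back through the recorded predecessors to rebuild the route.
            route: List[str] = []
            step: Optional[str] = current
            while step is not None:
                route.append(step)
                step = previous[step]
            return route[::-1]
        for neighbor in sorted(graph[current]):
            if neighbor not in previous:
                previous[neighbor] = current
                queue.append(neighbor)
    return None


# Hypothetical city: the lighthouse has no road, so it is an isolated node.
city = build_graph(
    ["station", "museum", "park", "lighthouse"],
    [("station", "museum"), ("museum", "park")],
)
```

Calling `shortest_route(city, "station", "park")` walks the two roads through the museum, while any route to the lighthouse comes back as `None` because that node has no edges.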


Must-attend AI and ML conferences of 2022

#artificialintelligence

As time goes on, artificial intelligence's penetration into our work and lives will only increase. It is therefore becoming essential to understand how best to use the technology for universal betterment. Tech conferences are the right places to pursue this vision. With the effects of the pandemic slowly subsiding, in-person conferences are kicking off again (with most providing virtual access as well) for techies, enthusiasts and domain experts to come together in one place and exchange ideas and thoughts. Let us look at some of the top AI and ML conferences one must not miss in 2022.


A Novel Approach for Generating SPARQL Queries from RDF Graphs

Jabri, Emna

arXiv.org Artificial Intelligence

This work was done as part of a research master's thesis project. The goal is to generate SPARQL queries based on user-supplied keywords to query RDF graphs. To do this, we first transformed the input ontology into an RDF graph that reflects the semantics represented in the ontology. Subsequently, we stored this RDF graph in the Neo4j graph database to ensure efficient and persistent management of RDF data. At query time, we studied the different possible and desired interpretations of the request originally made by the user. We have also proposed a transformation between the two query languages SPARQL and Cypher, the latter being specific to Neo4j. This allows us to implement the architecture of our system over a wide variety of RDF databases that provide their own query languages, without changing any of the other components of the system. Finally, we tested and evaluated our tool using different test bases, and it turned out that our tool is comprehensive, effective, and powerful enough.