wikidata
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Czechia > Prague (0.04)
- (18 more...)
- Overview (0.68)
- Research Report (0.46)
- Information Technology > Databases (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Communications (1.00)
- (4 more...)
- Europe > United Kingdom > England (0.04)
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- (13 more...)
- Leisure & Entertainment (1.00)
- Information Technology > Services (0.68)
Subnational Geocoding of Global Disasters Using Large Language Models
Ronco, Michele, Delforge, Damien, Jäger, Wiebke S., Corbane, Christina
Subnational location data of disaster events are critical for risk assessment and disaster risk reduction. Disaster databases such as EM-DAT often report locations in unstructured textual form, with inconsistent granularity or spelling, that make it difficult to integrate with spatial datasets. We present a fully automated LLM-assisted workflow that processes and cleans textual location information using GPT-4o, and assigns geometries by cross-checking three independent geoinformation repositories: GADM, OpenStreetMap and Wikidata. Based on the agreement and availability of these sources, we assign a reliability score to each location while generating subnational geometries. Applied to the EM-DAT dataset from 2000 to 2024, the workflow geocodes 14,215 events across 17,948 unique locations. Unlike previous methods, our approach requires no manual intervention, covers all disaster types, enables cross-verification across multiple sources, and allows flexible remapping to preferred frameworks. Beyond the dataset, we demonstrate the potential of LLMs to extract and structure geographic information from unstructured text, offering a scalable and reliable method for related analyses.
- Europe (1.00)
- North America > United States (0.46)
- Asia > India (0.28)
- Workflow (0.88)
- Research Report (0.64)
Gradient-Based Join Ordering
Join ordering is the NP-hard problem of selecting the most efficient sequence in which to evaluate joins (conjunctive, binary operators) in a database query. As the performance of query execution critically depends on this choice, join ordering lies at the core of query optimization. Traditional approaches cast this problem as a discrete combinatorial search over binary trees guided by a cost model, but they often suffer from high computational complexity and limited scalability. We show that, when the cost model is differentiable, the query plans can be continuously relaxed into a soft adjacency matrix representing a superposition of plans. This continuous relaxation, together with a Gumbel-Softmax parameterization of the adjacency matrix and differentiable constraints enforcing plan validity, enables gradient-based search for plans within this relaxed space. Using a learned Graph Neural Network as the cost model, we demonstrate that this gradient-based approach can find comparable and even lower-cost plans compared to traditional discrete local search methods on two different graph datasets. Furthermore, we empirically show that the runtime of this approach scales linearly with query size, in contrast to quadratic or exponential runtimes of classical approaches. We believe this first step towards gradient-based join ordering can lead to more effective and efficient query optimizers in the future.
- Europe (1.00)
- North America > United States (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning
Zhou, Hongkuan, Halilaj, Lavdim, Monka, Sebastian, Schmid, Stefan, Zhu, Yuqicheng, Wu, Jingcheng, Nazer, Nadeem, Staab, Steffen
Open-domain visual entity recognition aims to identify and link entities depicted in images to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and exhibit long-tail distributions. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. We propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines both images and text descriptions into a shared semantic space grounded by structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that using visual, textual, and structured knowledge greatly improves accuracy, especially for rare and unseen entities. Our smallest model improves the accuracy on unseen entities by 10.5% compared to the state-of-the-art, despite being 35 times smaller.
- Transportation > Passenger (1.00)
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks > Manufacturer (1.00)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.82)
DELICATE: Diachronic Entity LInking using Classes And Temporal Evidence
Santini, Cristian, Barzaghi, Sebastian, Sernani, Paolo, Frontoni, Emanuele, Alam, Mehwish
In spite of the remarkable advancements in the field of Natural Language Processing, the task of Entity Linking (EL) remains challenging in the field of humanities due to complex document typologies, lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). The goal of this paper is to address these issues with two main contributions. The first contribution is DELICATE, a novel neuro-symbolic method for EL on historical Italian which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second contribution is ENEIDE, a multi-domain EL corpus in historical Italian semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show how DELICATE outperforms other EL models in historical Italian even if compared with larger architectures with billions of parameters. Moreover, further analyses reveal how DELICATE confidence scores and features sensitivity provide results which are more explainable and interpretable than purely neural methods.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Italy > Emilia-Romagna > Metropolitan City of Bologna > Bologna (0.04)
- (5 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Why Isn't Relational Learning Taking Over the World?
Artificial intelligence seems to be taking over the world with systems that model pixels, words, and phonemes. The world is arguably made up, not of pixels, words, and phonemes but of entities (objects, things, including events) with properties and relations among them. Surely we should model these, not the perception or description of them. You might suspect that concentrating on modeling words and pixels is because all of the (valuable) data in the world is in terms of text and images. If you look into almost any company you will find their most valuable data is in spreadsheets, databases and other relational formats. These are not the form that are studied in introductory machine learning, but are full of product numbers, student numbers, transaction numbers and other identifiers that can't be interpreted naively as numbers. The field that studies this sort of data has various names including relational learning, statistical relational AI, and many others. This paper explains why relational learning is not taking over the world -- except in a few cases with restricted relations -- and what needs to be done to bring it to it's rightful prominence.
- North America > Canada (0.29)
- North America > United States (0.28)
- Leisure & Entertainment > Sports > Soccer (1.00)
- Health & Medicine (1.00)
- Media > Film (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
Zhang, Liangliang, Jiang, Zhuorui, Chi, Hongliang, Chen, Haoyang, Elkoumy, Mohammed, Wang, Fali, Wu, Qiong, Zhou, Zhengyi, Pan, Shirui, Wang, Suhang, Ma, Yao
Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57 %. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
- Asia > Uzbekistan (0.28)
- North America > United States > Alabama (0.28)
- Europe > Austria (0.28)
MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations
Lavrinovics, Ernests, Biswas, Russa, Hose, Katja, Bjerva, Johannes
Large Language Models (LLMs) have inherent limitations of faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks have been developed that provide a test bed for factuality evaluation within the context of English-centric datasets, while relying on supplementary informative context like web links or text passages but ignoring the available structured factual resources. To this end, Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent the facts about entities and their relations with minimal linguistic overhead. We bridge the lack of KG paths and multilinguality for factual language modeling within the existing hallucination evaluation benchmarks and propose a KG-based multilingual, multihop benchmark called MultiHal framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG-paths from open-domain KGs, from which we pruned noisy KG-paths, curating a high-quality subset of 25.9k. Our baseline evaluation shows an absolute scale improvement by approximately 0.12 to 0.36 points for the semantic similarity score, 0.16 to 0.36 for NLI entailment and 0.29 to 0.42 for hallucination detection in KG-RAG over vanilla QA across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate MultiHal will foster future research towards several graph-based hallucination mitigation and fact-checking tasks.
- Europe > Austria > Vienna (0.14)
- Europe > France (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (15 more...)
- Media (0.47)
- Leisure & Entertainment (0.46)
- Aerospace & Defense (0.46)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Czechia > Prague (0.04)
- (18 more...)
- Overview (0.68)
- Research Report (0.46)
- Information Technology > Databases (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Communications (1.00)
- (3 more...)