AITopics | semi-structured data

Direct Natural Language Querying to Massive Heterogeneous Semi Structured Data

Neural Information Processing Systems | Jun-17-2026, 20:21:17 GMT

Searching over semi-structured data with natural language (NL) queries has attracted sustained attention, enabling broader audiences to access information easily. As more applications, such as LLM agents and RAG systems, emerge to search and interact with semi-structured data, two major challenges have become evident: (1) the increasing diversity of domains and schema variations, which makes domain-customized solutions prohibitively costly; (2) the growing complexity of NL queries, which combine exact field-matching conditions with fuzzy semantic requirements, often involving multiple fields and implicit reasoning. These challenges make formal language querying and keyword-based search insufficient. In this work, we explore neural retrievers as a unified non-formal querying solution that directly indexes semi-structured collections and understands NL queries. We employ LLM-based automatic evaluation and build a large-scale semi-structured retrieval benchmark (SSRB) using LLM generation and filtering, containing 14M semi-structured objects from 99 different schemas across 6 domains, along with 8,485 test queries that combine both exact and fuzzy matching conditions. Our systematic evaluation of popular retrievers shows that current state-of-the-art models can achieve acceptable performance, yet they still lack a precise understanding of matching constraints. In-domain training of dense retrievers, however, improves performance significantly. We believe that SSRB could serve as a valuable resource for future research in this area, and we hope to inspire further exploration of semi-structured retrieval with complex queries.

artificial intelligence, large language model, natural language, (20 more...)

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Research Report > Promising Solution (0.87)

Industry:

Information Technology (0.67)
Media > Photography (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Knowledge Extraction on Semi-Structured Content: Does It Remain Relevant for Question Answering in the Era of LLMs?

Sun, Kai, Huang, Yin, Mehra, Srishti, Kachuee, Mohammad, Chen, Xilun, Tao, Renjie, Lin, Zhaojiang, Jessee, Andrea, Shah, Nirav, Betty, Alex, Liu, Yue, Kumar, Anuj, Yih, Wen-tau, Dong, Xin Luna

arXiv.org Artificial Intelligence | Sep-30-2025

The advent of Large Language Models (LLMs) has significantly advanced web-based Question Answering (QA) systems over semi-structured content, raising questions about the continued utility of knowledge extraction for question answering. This paper investigates the value of triple extraction in this new paradigm by extending an existing benchmark with knowledge extraction annotations and evaluating commercial and open-source LLMs of varying sizes. Our results show that web-scale knowledge extraction remains a challenging task for LLMs. Despite achieving high QA accuracy, LLMs can still benefit from knowledge extraction, through augmentation with extracted triples and multi-task learning. These findings provide insights into the evolving role of knowledge triple extraction in web-based QA and highlight strategies for maximizing LLM effectiveness across different model sizes and resource settings.

large language model, machine learning, natural language, (19 more...)

arXiv:2509.25107

Country:

Asia (1.00)
North America > United States > New York (0.28)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

HyST: LLM-Powered Hybrid Retrieval over Semi-Structured Tabular Data

Myung, Jiyoon, Park, Jihyeon, Han, Joohyung

arXiv.org Artificial Intelligence | Aug-26-2025

User queries in real-world recommendation systems often combine structured constraints (e.g., category, attributes) with unstructured preferences (e.g., product descriptions or reviews). We introduce HyST (Hybrid retrieval over Semi-structured Tabular data), a hybrid retrieval framework that combines LLM-powered structured filtering with semantic embedding search to support complex information needs over semi-structured tabular data. HyST extracts attribute-level constraints from natural language using large language models (LLMs) and applies them as metadata filters, while processing the remaining unstructured query components via embedding-based retrieval. Experiments on a semi-structured benchmark show that HyST consistently outperforms traditional baselines, highlighting the importance of structured filtering for retrieval precision and offering a scalable and accurate solution for real-world user queries.
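
The filter-then-rank idea described in this abstract can be illustrated with a minimal sketch. This is not the HyST implementation: the constraint dictionary is assumed to have already been extracted from the query (the abstract attributes that step to an LLM), `embed` is a toy bag-of-words stand-in for a real embedding model, and `hybrid_search`, `ITEMS`, and all field names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(items, constraints, free_text, k=3):
    """Apply exact metadata filters first, then rank survivors semantically."""
    # Step 1: keep only rows that satisfy every structured constraint.
    candidates = [
        item for item in items
        if all(item.get(field) == value for field, value in constraints.items())
    ]
    # Step 2: rank the remaining rows by similarity to the unstructured part.
    query_vec = embed(free_text)
    ranked = sorted(
        candidates,
        key=lambda item: cosine(query_vec, embed(item["description"])),
        reverse=True,
    )
    return ranked[:k]


# Made-up rows for illustration only.
ITEMS = [
    {"id": 1, "category": "laptop", "description": "lightweight laptop with long battery life"},
    {"id": 2, "category": "laptop", "description": "heavy gaming laptop with loud fans"},
    {"id": 3, "category": "phone", "description": "lightweight phone with long battery life"},
]

# Constraints assumed to come from an upstream extraction step.
results = hybrid_search(ITEMS, {"category": "laptop"}, "lightweight long battery")
print([r["id"] for r in results])
```

Filtering before ranking means the phone row can never outrank a laptop, however similar its description is to the free-text part of the query.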

artificial intelligence, large language model, natural language, (15 more...)

arXiv:2508.18048

Country: North America > United States > New York (0.15)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

PASemiQA: Plan-Assisted Agent for Question Answering on Semi-Structured Data with Text and Relational Information

Yang, Hansi, Zhang, Qi, Jiang, Wei, Li, Jianguo

arXiv.org Artificial Intelligence | Feb-28-2025

Large language models (LLMs) have shown impressive abilities in answering questions across various domains, but they often hallucinate on questions that require professional and up-to-date knowledge. To address this limitation, retrieval-augmented generation (RAG) techniques have been proposed, which retrieve relevant information from external sources to inform the model's responses. However, existing RAG methods typically focus on a single type of external data, such as a vectorized text database or a knowledge graph, and do not handle real-world questions well when the underlying semi-structured data contains both text and relational information. To bridge this gap, we introduce PASemiQA, a novel approach that jointly leverages text and relational information in semi-structured data to answer questions. PASemiQA first generates a plan that identifies the text and relational information in the semi-structured data relevant to the question, and then uses an LLM agent to traverse the semi-structured data and extract the necessary information. Our empirical results demonstrate the effectiveness of PASemiQA across different semi-structured datasets from various domains, showcasing its potential to improve the accuracy and reliability of question answering systems on semi-structured data.

information, node, semi-structured data, (15 more...)

arXiv:2502.21087

Country:

North America > United States > District of Columbia > Washington (0.05)
Asia > China > Beijing > Beijing (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(4 more...)

Genre:

Research Report > Promising Solution (0.34)
Research Report > New Finding (0.34)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

GraphRank Pro+: Advancing Talent Analytics Through Knowledge Graphs and Sentiment-Enhanced Skill Profiling

Velampalli, Sirisha, Muniyappa, Chandrashekar

arXiv.org Artificial Intelligence | Feb-25-2025

The extraction of information from semi-structured text, such as resumes, has long been a challenge due to the diverse formatting styles and subjective content organization. Conventional solutions rely on specialized logic tailored for specific use cases. However, we propose a revolutionary approach leveraging structured Graphs, Natural Language Processing (NLP), and Deep Learning. By abstracting intricate logic into Graph structures, we transform raw data into a comprehensive Knowledge Graph. This innovative framework enables precise information extraction and sophisticated querying. We systematically construct dictionaries assigning skill weights, paving the way for nuanced talent analysis. Our system not only benefits job recruiters and curriculum designers but also empowers job seekers with targeted query-based filtering and ranking capabilities.

extraction, jobseeker, keyword, (14 more...)

doi: 10.1007/978-3-031-62269-4_21

arXiv:2502.18315

Country:

North America > United States > North Dakota > Grand Forks County > Grand Forks (0.14)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > Canada (0.04)
(2 more...)

Genre:

Research Report > Promising Solution (0.48)
Overview > Innovation (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.97)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

ORIGAMI: A generative transformer architecture for predictions from semi-structured data

Rückstieß, Thomas, Huang, Alana, Vujanic, Robin

arXiv.org Artificial Intelligence | Dec-23-2024

Despite the popularity and widespread use of semi-structured data formats such as JSON, end-to-end supervised learning applied directly to such data remains underexplored. We present ORIGAMI (Object RepresentatIon via Generative Autoregressive ModellIng), a transformer-based architecture that directly processes nested key/value pairs while preserving their hierarchical semantics. Our key technical contributions include: (1) a structure-preserving tokenizer, (2) a novel key/value position encoding scheme, and (3) a grammar-constrained training and inference framework that ensures valid outputs and accelerates training convergence. These enhancements enable efficient end-to-end modeling of semi-structured data. By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks without architectural modifications. Empirical evaluation across diverse domains demonstrates ORIGAMI's effectiveness: On standard tabular benchmarks converted to JSON, ORIGAMI remains competitive with classical and state-of-the-art approaches. On native JSON datasets, we outperform baselines on multi-label classification and specialized models such as convolutional and graph neural networks on a code classification task. Through extensive ablation studies, we validate the impact of each architectural component and establish ORIGAMI as a robust framework for end-to-end learning on semi-structured data.
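
To make the idea of a structure-preserving tokenizer concrete, here is a minimal sketch that turns a nested JSON-like object into a flat token sequence while keeping keys, values, and nesting boundaries distinguishable. It is an illustration under assumed conventions, not ORIGAMI's tokenizer: the marker tokens (`<obj>`, `<arr>`) and the `key:`/`val:` prefixes are hypothetical choices.

```python
def tokenize(obj):
    """Flatten a nested object into tokens that still encode its structure."""
    tokens = []
    if isinstance(obj, dict):
        tokens.append("<obj>")
        for key, value in obj.items():
            # Keys and values get separate token types so a model can tell them apart.
            tokens.append(f"key:{key}")
            tokens.extend(tokenize(value))
        tokens.append("</obj>")
    elif isinstance(obj, list):
        tokens.append("<arr>")
        for element in obj:
            tokens.extend(tokenize(element))
        tokens.append("</arr>")
    else:
        # Leaf value: repr() keeps the string "1" distinct from the number 1.
        tokens.append(f"val:{obj!r}")
    return tokens


# Made-up document for illustration only.
doc = {"name": "a", "tags": ["x", "y"], "geo": {"lat": 1}}
tokens = tokenize(doc)
print(tokens)
```

Because every object and array is bracketed by its own markers, the original nesting can be recovered from the token sequence, which a plain string dump split on whitespace would not guarantee.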

large language model, machine learning, natural language, (21 more...)

arXiv:2412.17348

Country: Oceania > Australia (0.28)

Genre: Research Report > Promising Solution (0.48)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Generate, Transform, Answer: Question Specific Tool Synthesis for Tabular Data

Gemmell, Carlos, Dalton, Jeffrey

arXiv.org Artificial Intelligence | Mar-17-2023

Tabular question answering (TQA) presents a challenging setting for neural systems by requiring joint reasoning over natural language and large amounts of semi-structured data. Unlike humans, who use programmatic tools like filters to transform data before processing, language models in TQA process tables directly, resulting in information loss as table size increases. In this paper we propose ToolWriter to generate query-specific programs and detect when to apply them to transform tables and align them with the TQA model's capabilities. Focusing ToolWriter on generating row-filtering tools improves the state-of-the-art for WikiTableQuestions and WikiSQL, with the largest performance gains on long tables. By investigating headroom, our work highlights the broader potential for programmatic tools combined with neural components to manipulate large amounts of structured data.
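
A row-filtering tool of the kind this abstract describes can be sketched as a small function that shrinks a table before it is handed to a downstream model. This is a hand-written illustration, not ToolWriter's generated code: `make_row_filter`, the table layout, and the sample values are all hypothetical, and in the paper's setting the filter program would be generated per query rather than written manually.

```python
def make_row_filter(column, predicate):
    """Build a tool that keeps only the rows whose `column` value passes `predicate`."""
    def tool(table):
        col_index = table["header"].index(column)
        kept = [row for row in table["rows"] if predicate(row[col_index])]
        # Return a new table so the original stays available as a fallback.
        return {"header": table["header"], "rows": kept}
    return tool


# Made-up table for illustration only; cells are strings, as in scraped tables.
TABLE = {
    "header": ["year", "team", "points"],
    "rows": [
        ["2008", "red", "10"],
        ["2012", "blue", "14"],
        ["2016", "green", "21"],
    ],
}

# Hypothetical tool for a question such as "which teams scored after 2010?"
after_2010 = make_row_filter("year", lambda cell: int(cell) > 2010)
filtered = after_2010(TABLE)
print(filtered["rows"])
```

The filtered table has fewer rows to encode, which is the point of the approach: the downstream model sees only the part of a long table that the question depends on.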

large language model, machine learning, natural language, (19 more...)

arXiv:2303.10138

Country:

Europe > France (0.04)
North America > United States (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

A new feature has been added to Google's review pages: Pros and Cons structured data

#artificialintelligence | Oct-5-2022, 14:20:23 GMT

Google now supports pros and cons structured data for editorial review pages in search results, using newly validated structured data markup. Google also prioritizes structured data supplied by your content over the pros and cons data it would otherwise extract for display in search results. It is therefore likely to add value to your blog post if you tell Google which data you want shown in Google Search rather than leaving Google to guess. If you mark up the pros and cons of your blog post, Google might show them as rich results in your snippets. To implement pros and cons structured data, you should be aware of how data is categorized (structured, unstructured, and semi-structured) and how each category works.

data markup, google, markup, (15 more...)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.45)

Can we Use Unstructured Data in Intelligent Automation Activities?

#artificialintelligence | Feb-12-2022, 10:55:25 GMT

Structured and semi-structured data can be automated using Robotic Process Automation, whereas unstructured data can be automated using Artificial Intelligence. Artificial Intelligence can augment Robotic Process Automation as it can process unstructured data in intelligent automation and it can also learn/improve its performance. AI can read unstructured data with the help of OCR, Natural Language Processing, and Machine Learning or Deep Learning. With the help of Artificial Intelligence, we can increase the use of automation in corporations and streamline various processes to reduce turnaround time and increase efficiency with reduced cost. AI can be trained through various methods such as Supervised Learning and Continuous Learning.

artificial intelligence, process automation, robotic process automation, (11 more...)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Schema Extraction on Semi-structured Data

Li, Panpan, Gong, Yikun, Wang, Chen

arXiv.org Artificial Intelligence | Dec-15-2020

With the continuous development of NoSQL databases, more and more developers choose semi-structured data for development and data management, which creates requirements for schema management of semi-structured data stored in NoSQL databases. Schema extraction plays an important role in understanding schemas, optimizing queries, and validating data consistency. In this survey, we therefore investigate two families of schema extraction methods: structural methods based on trees and graphs, and statistical methods based on distributed architectures and machine learning. The schemas obtained by the structural methods are more interpretable, while the statistical methods have better applicability and generalization ability. We also investigate tools and systems for schema extraction. Schema extraction tools are mainly used with Spark or NoSQL databases and are suitable for small datasets or simple application environments. The systems mainly focus on the extraction and management of schemas in large datasets and complex application scenarios. Finally, we compare these techniques to help data managers choose among them.
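
As a small illustration of what schema extraction from schemaless documents involves, the sketch below walks a collection of JSON-like documents as trees and records every field path together with the value types observed at that path. It is a simplified structural approach written for this example and is not taken from the survey; `infer_schema`, the dotted path notation, and the sample documents are assumptions.

```python
from collections import defaultdict


def infer_schema(docs):
    """Map each dotted field path to the sorted list of type names seen there."""
    observed = defaultdict(set)

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            # Leaf (lists are treated as leaves in this sketch).
            observed[path].add(type(value).__name__)

    for doc in docs:
        walk(doc, "")
    return {path: sorted(types) for path, types in sorted(observed.items())}


# Made-up documents for illustration only.
DOCS = [
    {"id": 1, "user": {"name": "a"}},
    {"id": "2", "user": {"name": "b", "age": 30}},
]

schema = infer_schema(DOCS)
print(schema)
```

On these two documents the result shows `id` holding both `int` and `str` values and `user.age` appearing in only one document, the kind of inconsistency that schema extraction is meant to surface for validation.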

database, schema, semi-structured data, (13 more...)

arXiv:2012.08105

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
(11 more...)

Genre:

Overview (0.54)
Research Report (0.50)
Workflow (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)
