semi-structured data
Knowledge Extraction on Semi-Structured Content: Does It Remain Relevant for Question Answering in the Era of LLMs?
Sun, Kai, Huang, Yin, Mehra, Srishti, Kachuee, Mohammad, Chen, Xilun, Tao, Renjie, Lin, Zhaojiang, Jessee, Andrea, Shah, Nirav, Betty, Alex, Liu, Yue, Kumar, Anuj, Yih, Wen-tau, Dong, Xin Luna
The advent of Large Language Models (LLMs) has significantly advanced web-based Question Answering (QA) systems over semi-structured content, raising questions about the continued utility of knowledge extraction for question answering. This paper investigates the value of triple extraction in this new paradigm by extending an existing benchmark with knowledge extraction annotations and evaluating commercial and open-source LLMs of varying sizes. Our results show that web-scale knowledge extraction remains a challenging task for LLMs. Despite achieving high QA accuracy, LLMs can still benefit from knowledge extraction, through augmentation with extracted triples and multi-task learning. These findings provide insights into the evolving role of knowledge triple extraction in web-based QA and highlight strategies for maximizing LLM effectiveness across different model sizes and resource settings.
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Oceania > Australia > Queensland (0.04)
HyST: LLM-Powered Hybrid Retrieval over Semi-Structured Tabular Data
Myung, Jiyoon, Park, Jihyeon, Han, Joohyung
User queries in real-world recommendation systems often combine structured constraints (e.g., category, attributes) with unstructured preferences (e.g., product descriptions or reviews). We introduce HyST (Hybrid retrieval over Semi-structured Tabular data), a hybrid retrieval framework that combines LLM-powered structured filtering with semantic embedding search to support complex information needs over semi-structured tabular data. HyST extracts attribute-level constraints from natural language using large language models (LLMs) and applies them as metadata filters, while processing the remaining unstructured query components via embedding-based retrieval. Experiments on a semi-structured benchmark show that HyST consistently outperforms traditional baselines, highlighting the importance of structured filtering in improving retrieval precision and offering a scalable and accurate solution for real-world user queries.
- Europe > Czechia > Prague (0.05)
- Asia > South Korea > Seoul > Seoul (0.05)
- North America > United States > New York > New York County > New York City (0.04)
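The two-stage idea in the HyST abstract (filter on structured attributes, then rank the survivors by semantic similarity) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the constraint dictionary is written by hand where HyST would obtain it from an LLM, the bag-of-words `embed` function is a toy stand-in for a real embedding model, and all item names and fields are made up.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hybrid_search(items, constraints, free_text, top_k=3):
    # Stage 1: structured filtering on exact attribute matches.
    candidates = [it for it in items if all(it.get(k) == v for k, v in constraints.items())]
    # Stage 2: rank the remaining items by similarity to the unstructured part of the query.
    query_vec = embed(free_text)
    ranked = sorted(candidates, key=lambda it: cosine(query_vec, embed(it["description"])), reverse=True)
    return ranked[:top_k]

# Hypothetical catalogue, for illustration only.
ITEMS = [
    {"name": "TrailRunner", "category": "shoes", "description": "lightweight shoes for trail running"},
    {"name": "CityWalk", "category": "shoes", "description": "comfortable shoes for office wear"},
    {"name": "PeakPack", "category": "bags", "description": "lightweight pack for trail running"},
]

# Assumed output of the constraint-extraction step for the query
# "lightweight trail running shoes": one metadata filter plus leftover free text.
results = hybrid_search(ITEMS, {"category": "shoes"}, "lightweight trail running")
print([it["name"] for it in results])
```

Filtering first keeps the bag, whose description closely matches the free text, out of the results, which is the precision benefit the abstract attributes to structured filtering.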
PASemiQA: Plan-Assisted Agent for Question Answering on Semi-Structured Data with Text and Relational Information
Yang, Hansi, Zhang, Qi, Jiang, Wei, Li, Jianguo
Large language models (LLMs) have shown impressive abilities in answering questions across various domains, but they often hallucinate on questions that require professional and up-to-date knowledge. To address this limitation, retrieval-augmented generation (RAG) techniques have been proposed, which retrieve relevant information from external sources to inform the model's responses. However, existing RAG methods typically focus on a single type of external data, such as vectorized text databases or knowledge graphs, and cannot handle real-world questions on semi-structured data containing both text and relational information well. To bridge this gap, we introduce PASemiQA, a novel approach that jointly leverages text and relational information in semi-structured data to answer questions. PASemiQA first generates a plan to identify the text and relational information in the semi-structured data that is relevant to the question, and then uses an LLM agent to traverse the semi-structured data and extract the necessary information. Our empirical results demonstrate the effectiveness of PASemiQA across different semi-structured datasets from various domains, showcasing its potential to improve the accuracy and reliability of question answering systems on semi-structured data.
- North America > United States (0.30)
- Asia > China (0.28)
- North America > Mexico > Mexico City (0.14)
- Asia > Thailand (0.14)
- Research Report > Promising Solution (0.34)
- Research Report > New Finding (0.34)
GraphRank Pro+: Advancing Talent Analytics Through Knowledge Graphs and Sentiment-Enhanced Skill Profiling
Velampalli, Sirisha, Muniyappa, Chandrashekar
The extraction of information from semi-structured text, such as resumes, has long been a challenge due to the diverse formatting styles and subjective content organization. Conventional solutions rely on specialized logic tailored for specific use cases. However, we propose a revolutionary approach leveraging structured Graphs, Natural Language Processing (NLP), and Deep Learning. By abstracting intricate logic into Graph structures, we transform raw data into a comprehensive Knowledge Graph. This innovative framework enables precise information extraction and sophisticated querying. We systematically construct dictionaries assigning skill weights, paving the way for nuanced talent analysis. Our system not only benefits job recruiters and curriculum designers but also empowers job seekers with targeted query-based filtering and ranking capabilities.
- Asia > India (0.28)
- North America > United States > North Dakota > Grand Forks County > Grand Forks (0.14)
- Research Report > Promising Solution (0.48)
- Overview > Innovation (0.48)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.97)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
ORIGAMI: A generative transformer architecture for predictions from semi-structured data
Rückstieß, Thomas, Huang, Alana, Vujanic, Robin
Despite the popularity and widespread use of semi-structured data formats such as JSON, end-to-end supervised learning applied directly to such data remains underexplored. We present ORIGAMI (Object RepresentatIon via Generative Autoregressive ModellIng), a transformer-based architecture that directly processes nested key/value pairs while preserving their hierarchical semantics. Our key technical contributions include: (1) a structure-preserving tokenizer, (2) a novel key/value position encoding scheme, and (3) a grammar-constrained training and inference framework that ensures valid outputs and accelerates training convergence. These enhancements enable efficient end-to-end modeling of semi-structured data. By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks without architectural modifications. Empirical evaluation across diverse domains demonstrates ORIGAMI's effectiveness: On standard tabular benchmarks converted to JSON, ORIGAMI remains competitive with classical and state-of-the-art approaches. On native JSON datasets, we outperform baselines on multi-label classification and specialized models such as convolutional and graph neural networks on a code classification task. Through extensive ablation studies, we validate the impact of each architectural component and establish ORIGAMI as a robust framework for end-to-end learning on semi-structured data.
- North America > United States (0.14)
- Oceania > Australia > New South Wales > Sydney (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Asia > China > Beijing > Beijing (0.04)
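To make "nested key/value pairs" concrete, the sketch below linearizes a JSON document into (key path, value) pairs, so that each leaf keeps the hierarchy it came from. It is only an illustration of why structure matters when turning JSON into a sequence; it is not ORIGAMI's tokenizer or position encoding, and the function name `linearize` and the sample document are assumptions.

```python
import json

def linearize(obj, path=()):
    """Yield (key_path, value) for every scalar leaf in a parsed JSON value.

    Empty objects and arrays contain no leaves, so they produce no pairs.
    """
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from linearize(value, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from linearize(value, path + (index,))
    else:
        yield path, obj

# Made-up document, for illustration only.
doc = json.loads('{"user": {"name": "Ada", "tags": ["a", "b"]}, "active": true}')
pairs = list(linearize(doc))
for key_path, value in pairs:
    print(".".join(map(str, key_path)), "=", value)
```

A flat bag of tokens would lose the fact that `name` and `tags` belong to `user`; carrying the path alongside each value is one simple way to preserve that hierarchy.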
Generate, Transform, Answer: Question Specific Tool Synthesis for Tabular Data
Gemmell, Carlos, Dalton, Jeffrey
Tabular question answering (TQA) presents a challenging setting for neural systems by requiring joint reasoning over natural language and large amounts of semi-structured data. Unlike humans, who use programmatic tools like filters to transform data before processing, language models in TQA process tables directly, resulting in information loss as table size increases. In this paper we propose ToolWriter to generate query-specific programs and detect when to apply them to transform tables and align them with the TQA model's capabilities. Focusing ToolWriter on generating row-filtering tools improves the state-of-the-art for WikiTableQuestions and WikiSQL, with the largest performance gains on long tables. By investigating headroom, our work highlights the broader potential for programmatic tools combined with neural components to manipulate large amounts of structured data.
- Europe > France (0.04)
- North America > United States (0.04)
- Asia > Middle East > Jordan (0.04)
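The row-filtering step described in the ToolWriter abstract can be pictured with a small sketch: a question-specific predicate shrinks the table before it is handed to a QA model. Everything here is hypothetical, including the helper `apply_row_filter`, the hand-written predicate (which ToolWriter would generate), and the table contents.

```python
def apply_row_filter(table, predicate):
    """Return the header plus only the rows for which predicate(row) is true.

    `table` is a list of lists whose first entry is the header; each row is
    passed to `predicate` as a dict keyed by column name.
    """
    header, rows = table[0], table[1:]
    kept = [row for row in rows if predicate(dict(zip(header, row)))]
    return [header] + kept

# Made-up table, for illustration only.
TABLE = [
    ["player", "team", "points"],
    ["Avery", "Red", 12],
    ["Blake", "Blue", 30],
    ["Casey", "Red", 21],
    ["Devon", "Blue", 7],
]

# Hypothetical tool a generator might emit for the question
# "Which Red player scored the most points?"
def red_team_only(row):
    return row["team"] == "Red"

filtered = apply_row_filter(TABLE, red_team_only)
for row in filtered:
    print(row)
```

The downstream model then sees two data rows instead of four; on long tables, this kind of reduction is where the abstract reports the largest gains.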
A new feature has been added to Google's review pages: Pros and Cons structured data
Google now supports pros and cons structured data for editorial review pages in search results, using newly validated structured data markup. Google prioritizes the structured data you supply over the pros and cons it extracts from your content on its own. It is therefore worthwhile to tell Google which data you want shown in Google Search rather than leaving Google to guess. Once you mark up the pros and cons in your blog post, Google might show them as a rich result in your snippet. To implement pros and cons structured data, you should understand the categories of data involved (structured data, unstructured data, and semi-structured data) and how each of them works.
Can we Use Unstructured Data in Intelligent Automation Activities?
Processes that handle structured and semi-structured data can be automated using Robotic Process Automation, whereas processes that handle unstructured data can be automated using Artificial Intelligence. Artificial Intelligence can augment Robotic Process Automation because it can process unstructured data in intelligent automation and can also learn and improve its performance. AI can read unstructured data with the help of OCR, Natural Language Processing, and Machine Learning or Deep Learning. With the help of Artificial Intelligence, we can increase the use of automation in corporations and streamline various processes to reduce turnaround time and increase efficiency at reduced cost. AI can be trained through various methods such as Supervised Learning and Continuous Learning.
Schema Extraction on Semi-structured Data
Li, Panpan, Gong, Yikun, Wang, Chen
With the continuous development of NoSQL databases, more and more developers choose to use semi-structured data for development and data management, which creates requirements for schema management of semi-structured data stored in NoSQL databases. Schema extraction plays an important role in understanding schemas, optimizing queries, and validating data consistency. Therefore, in this survey we investigate structural methods based on trees and graphs, and statistical methods based on distributed architectures and machine learning, for extracting schemas. The schemas obtained by the structural methods are more interpretable, while the statistical methods have better applicability and generalization ability. Moreover, we also investigate tools and systems for schema extraction. Schema extraction tools are mainly used for Spark or NoSQL databases, and are suitable for small datasets or simple application environments. The systems mainly focus on the extraction and management of schemas in large datasets and complex application scenarios. Furthermore, we also compare these techniques to help data managers choose among them.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Overview (0.54)
- Research Report (0.50)
- Workflow (0.46)
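As a rough picture of what schema extraction produces, the sketch below scans a collection of JSON-like documents and records, for each top-level field, the value types observed and whether the field appears in every document. It is a deliberately simple illustration and not one of the surveyed methods; the function name `infer_schema`, the output layout, and the sample documents are assumptions, and nested fields are not descended into.

```python
from collections import defaultdict

def infer_schema(documents):
    """Summarize the top-level fields of a list of dict documents.

    For each field, report the sorted names of the Python types seen and
    whether the field is present in every document ("required").
    """
    types = defaultdict(set)
    counts = defaultdict(int)
    for doc in documents:
        for key, value in doc.items():
            types[key].add(type(value).__name__)
            counts[key] += 1
    return {
        key: {"types": sorted(types[key]), "required": counts[key] == len(documents)}
        for key in types
    }

# Made-up documents, for illustration only.
DOCS = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "email": "b@example.com"},
    {"id": "3", "name": "c"},
]

schema = infer_schema(DOCS)
for field, info in schema.items():
    print(field, info)
```

Even this small summary surfaces the two things the abstract says schemas are used for: the optional `email` field informs query planning, and the mixed `int`/`str` types on `id` flag a data consistency problem.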
A Graph Representation of Semi-structured Data for Web Question Answering
Zhang, Xingyao, Shou, Linjun, Pei, Jian, Gong, Ming, Wen, Lijie, Jiang, Daxin
The abundant semi-structured data on the Web, such as HTML-based tables and lists, provide commercial search engines with a rich information source for question answering (QA). Unlike plain text passages in Web documents, Web tables and lists have inherent structures, which carry semantic correlations among their various elements. Many existing studies treat tables and lists as flat documents made up of pieces of text and do not make good use of the semantic information hidden in their structures. In this paper, we propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations. We also develop pre-training and reasoning techniques on the graph model for the QA task. Extensive experiments on several real datasets collected from a commercial engine verify the effectiveness of our approach. Our method improves the F1 score by 3.90 points over the state-of-the-art baselines.
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- North America > United States > New York (0.04)
- Europe > Middle East > Cyprus > Nicosia > Nicosia (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)