AITopics | Information Extraction

Collaborating Authors

Information Extraction

News Overviews Instructional Materials AI-Alerts Classics

Analyzing German Parliamentary Speeches: A Machine Learning Approach for Topic and Sentiment Classification

Pätz, Lukas, Beyer, Moritz, Späth, Jannik, Bohlen, Lasse, Zschech, Patrick, Kraus, Mathias, Rosenberger, Julian

arXiv.org Artificial IntelligenceAug-6-2025

This study investigates political discourse in the German parliament, the Bundestag, by analyzing approximately 28,000 parliamentary speeches from the last five years. Two machine learning models for topic and sentiment classification were developed and trained on a manually labeled dataset. The models showed strong classification performance, achieving an area under the receiver operating characteristic curve (AUROC) of 0.94 for topic classification (average across topics) and 0.89 for sentiment classification. Both models were applied to assess topic trends and sentiment distributions across political parties and over time. The analysis reveals remarkable relationships between parties and their role in parliament. In particular, a change in style can be observed for parties moving from government to opposition. While ideological positions matter, governing responsibilities also shape discourse. The analysis directly addresses key questions about the evolution of topics, sentiment dynamics, and party-specific discourse strategies in the Bundestag.

machine learning, natural language, text classification, (21 more...)

arXiv.org Artificial Intelligence

2508.03181

Country: Europe > Germany > Saxony (0.14)

Genre: Research Report > New Finding (0.94)

Industry: Government > Regional Government > Europe Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)

Add feedback

Cross-Domain Web Information Extraction at Pinterest

Farag, Michael, Halina, Patrick, Zaytsev, Andrey, Munagala, Alekhya, Ahmed, Imtihan, Wang, Junhao

arXiv.org Artificial IntelligenceAug-5-2025

The internet offers a massive repository of unstructured information, but it's a significant challenge to convert this into a structured format. At Pinterest, the ability to accurately extract structured product data from e-commerce websites is essential to enhance user experiences and improve content distribution. In this paper, we present Pinterest's system for attribute extraction, which achieves remarkable accuracy and scalability at a manageable cost. Our approach leverages a novel webpage representation that combines structural, visual, and text modalities into a compact form, optimizing it for small model learning. This representation captures each visible HTML node with its text, style and layout information. We show how this allows simple models such as eXtreme Gradient Boosting (XGBoost) to extract attributes more accurately than much more complex Large Language Models (LLMs) such as Generative Pre-trained Transformer (GPT). Our results demonstrate a system that is highly scalable, processing over 1,000 URLs per second, while being 1000 times more cost-effective than the cheapest GPT alternatives.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3711896.3737207

2508.01096

Country: North America > United States > California (0.28)

Genre: Research Report > New Finding (0.54)

Industry: Information Technology > Services > e-Commerce Services (0.54)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Efficacy of AI RAG Tools for Complex Information Extraction and Data Annotation Tasks: A Case Study Using Banks Public Disclosures

Botti, Nicholas, Haberkorn, Flora, Hoopes, Charlotte, Khan, Shaun

arXiv.org Artificial IntelligenceJul-30-2025

We utilize a within-subjects design with randomized task assignments to understand the effectiveness of using an AI retrieval augmented generation (RAG) tool to assist analysts with an information extraction and data annotation task. We replicate an existing, challenging real-world annotation task with complex multi-part criteria on a set of thousands of pages of public disclosure documents from global systemically important banks (GSIBs) with heterogeneous and incomplete information content. We test two treatment conditions. First, a "naive" AI use condition in which annotators use only the tool and must accept the first answer they are given. And second, an "interactive" AI treatment condition where annotators use the tool interactively, and use their judgement to follow-up with additional information if necessary. Compared to the human-only baseline, the use of the AI tool accelerated task execution by up to a factor of 10 and enhanced task accuracy, particularly in the interactive condition. We find that when extrapolated to the full task, these methods could save up to 268 hours compared to the human-only approach. Additionally, our findings suggest that annotator skill, not just with the subject matter domain, but also with AI tools, is a factor in both the accuracy and speed of task performance.

data mining, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2507.2136

Country: North America > United States (1.00)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Banking & Finance (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.85)
Information Technology > Data Science > Data Mining > Text Mining (0.71)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Add feedback

FinDPO: Financial Sentiment Analysis for Algorithmic Trading through Preference Optimization of LLMs

Iacovides, Giorgos, Zhou, Wuyang, Mandic, Danilo

arXiv.org Artificial IntelligenceJul-25-2025

Opinions expressed in online finance-related textual data are having an increasingly profound impact on trading decisions and market movements. This trend highlights the vital role of sentiment analysis as a tool for quantifying the nature and strength of such opinions. With the rapid development of Generative AI (GenAI), supervised fine-tuned (SFT) large language models (LLMs) have become the de facto standard for financial sentiment analysis. However, the SFT paradigm can lead to memorization of the training data and often fails to generalize to unseen samples. This is a critical limitation in financial domains, where models must adapt to previously unobserved events and the nuanced, domain-specific language of finance. To this end, we introduce FinDPO, the first finance-specific LLM framework based on post-training human preference alignment via Direct Preference Optimization (DPO). The proposed FinDPO achieves state-of-the-art performance on standard sentiment classification benchmarks, outperforming existing supervised fine-tuned models by 11% on the average. Uniquely, the FinDPO framework enables the integration of a fine-tuned causal LLM into realistic portfolio strategies through a novel 'logit-to-score' conversion, which transforms discrete sentiment predictions into continuous, rankable sentiment scores (probabilities). In this way, simulations demonstrate that FinDPO is the first sentiment-based approach to maintain substantial positive returns of 67% annually and strong risk-adjusted performance, as indicated by a Sharpe ratio of 2.0, even under realistic transaction costs of 5 basis points (bps).

large language model, machine learning, sentiment analysis, (15 more...)

arXiv.org Artificial Intelligence

2507.18417

Genre: Research Report (1.00)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Mapping Technological Futures: Anticipatory Discourse Through Text Mining

Skorski, Maciej, Landowska, Alina, Rajda, Krzysztof

arXiv.org Artificial IntelligenceJul-25-2025

The volatility and unpredictability of emerging technologies, such as artificial intelligence (AI), generate significant uncertainty, which is widely discussed on social media. This study examines anticipatory discourse surrounding technological futures by analysing 1.5 million posts from 400 key opinion leaders (KOLs) published on the X platform (from 2021 to 2023). Using advanced text mining techniques, including BERTopic modelling, sentiment, emotion, and attitude analyses, the research identifies 100 distinct topics reflecting anticipated tech-driven futures. Our findings emphasize the dual role of KOLs in framing \textit{present futures} -- optimistic visions of transformative technologies like AI and IoT -- and influencing \textit{future presents}, where these projections shape contemporary societal and geopolitical debates. Positive emotions such as Hope dominate, outweighing Anxiety, particularly in topics like ``Machine Learning, Data Science, and Deep Learning,'' while discussions around ``Climate Change'' and ``War, Ukraine, and Trump People'' elicit \textit{Anxiety}. By framing technologies as solutions to societal challenges, KOLs act as mediators of societal narratives, bridging imagined futures and current realities. These insights underscore their pivotal role in directing public attention with emerging technologies during periods of heightened uncertainty, advancing our understanding of anticipatory discourse in technology-mediated contexts.

data mining, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2504.02853

Country:

North America > United States (1.00)
Europe (1.00)
Asia > Middle East > UAE (0.28)

Genre:

Research Report > Experimental Study (1.00)
Overview (1.00)

Industry:

Government (1.00)
Banking & Finance (1.00)
Information Technology > Security & Privacy (0.93)
(4 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.68)
Information Technology > Data Science > Data Mining > Text Mining (0.61)
(3 more...)

Add feedback

GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface

Zaratiana, Urchade, Pasternak, Gil, Boyd, Oliver, Hurn-Maloney, George, Lewis, Ash

arXiv.org Artificial IntelligenceJul-25-2025

Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built pretrained transformer encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across extraction and classification tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source pip-installable library with pre-trained models and documentation at https://github.com/fastino-ai/GLiNER2.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2507.18546

Country:

North America > United States > Virginia (0.29)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.40)

Industry: Information Technology > Security & Privacy (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Add feedback

CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis

He, Xiaoqiang

arXiv.org Artificial IntelligenceJul-24-2025

Multimodal aspect-based sentiment analysis(MABSA) seeks to identify aspect terms within paired image-text data and determine their fine grained sentiment polarities, representing a fundamental task for improving the effectiveness of applications such as product review systems and public opinion monitoring. Existing methods face challenges such as cross modal alignment noise and insufficient consistency in fine-grained representations. While global modality alignment methods often overlook the connection between aspect terms and their corresponding local visual regions, bridging the representation gap between text and images remains a challenge. To address these limitations, this paper introduces an end to end Contrastive Learning framework with Adaptive Multi-loss and Progressive Attention Fusion(CLAMP). The framework is composed of three novel modules: Progressive Attention Fusion network, Multi-task Contrastive Learning, and Adaptive Multi-loss Aggregation. The Progressive Attention Fusion network enhances fine-grained alignment between textual features and image regions via hierarchical, multi-stage cross modal interactions, effectively suppressing irrelevant visual noise. Secondly, multi-task contrastive learning combines global modal contrast and local granularity alignment to enhance cross modal representation consistency. Adaptive Multi-loss Aggregation employs a dynamic uncertainty based weighting mechanism to calibrate loss contributions according to each task's uncertainty, thereby mitigating gradient interference. Evaluation on standard public benchmarks demonstrates that CLAMP consistently outperforms the vast majority of existing state of the art methods.

machine learning, natural language, sentiment analysis, (23 more...)

arXiv.org Artificial Intelligence

2507.16854

Country: Asia (0.28)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
(3 more...)

Add feedback

PET: An Annotated Dataset for Process Extraction from Natural Language Text

Bellan, Patrizio, van der Aa, Han, Dragoni, Mauro, Ghidini, Chiara, Ponzetto, Simone Paolo

arXiv.org Artificial IntelligenceJul-22-2025

Process extraction from text is an important task of process discovery, for which various approaches have been developed in recent years. However, in contrast to other information extraction tasks, there is a lack of gold-standard corpora of business process descriptions that are carefully annotated with all the entities and relationships of interest. Due to this, it is currently hard to compare the results obtained by extraction approaches in an objective manner, whereas the lack of annotated texts also prevents the application of data-driven information extraction methodologies, typical of the natural language processing field. Therefore, to bridge this gap, we present the PET dataset, a first corpus of business process descriptions annotated with activities, gateways, actors, and flow information.

artificial intelligence, natural language, relation, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-031-25383-6_23

2203.0486

Country: Europe > Italy (0.14)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)

Add feedback

AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis

Rafiuddin, S M, Kamal, Sadia, Rakib, Mohammed, Bagavathi, Arunkumar, Sen, Atriya

arXiv.org Artificial IntelligenceJul-18-2025

We introduce AdaptiSent, a new framework for Multimodal Aspect-Based Sentiment Analysis (MABSA) that uses adaptive cross-modal attention mechanisms to improve sentiment classification and aspect term extraction from both text and images. Our model integrates dynamic modality weighting and context-adaptive attention, enhancing the extraction of sentiment and aspect-related information by focusing on how textual cues and visual context interact. We tested our approach against several baselines, including traditional text-based models and other multimodal methods. Results from standard Twitter datasets show that AdaptiSent surpasses existing models in precision, recall, and F1 score, and is particularly effective in identifying nuanced inter-modal relationships that are crucial for accurate sentiment and aspect term extraction. This effectiveness comes from the model's ability to adjust its focus dynamically based on the context's relevance, improving the depth and accuracy of sentiment analysis across various multimodal data sets. AdaptiSent sets a new standard for MABSA, significantly outperforming current methods, especially in understanding complex multimodal information.

arXiv.org Artificial Intelligence

2507.12695

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.73)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.73)

Add feedback

Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker

Saxena, Rachna, Kumar, Abhijeet, Shanmugam, Suresh

arXiv.org Artificial IntelligenceJul-17-2025

Traditional information extraction systems face challenges with text only language models as it does not consider infographics (visual elements of information) such as tables, charts, images etc. often used to convey complex information to readers. Multimodal LLM (MLLM) face challenges of finding needle in the haystack problem i.e., either longer context length or substantial number of documents as search space. Late interaction mechanism over visual language models has shown state of the art performance in retrieval-based vision augmented Q&A tasks. There are yet few challenges using it for RAG based multi-modal Q&A. Firstly, many popular and widely adopted vector databases do not support native multi-vector retrieval. Secondly, late interaction requires computation which inflates space footprint and can hinder enterprise adoption. Lastly, the current state of late interaction mechanism does not leverage the approximate neighbor search indexing methods for large speed ups in retrieval process. This paper explores a pragmatic approach to make vision retrieval process scalable and efficient without compromising on performance quality. We propose multi-step custom implementation utilizing widely adopted hybrid search (metadata & embedding) and state of the art late interaction re-ranker to retrieve best matching pages. Finally, MLLM are prompted as reader to generate answers from contextualized best matching pages. Through experiments, we observe that the proposed design is scalable (significant speed up) and stable (without degrading performance quality), hence can be used as production systems at enterprises.

large language model, question answering, vector database, (16 more...)

arXiv.org Artificial Intelligence

2507.12378

Country: Asia > India > Karnataka > Bengaluru (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.64)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.54)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.47)

Add feedback