Goto

Collaborating Authors

 Information Extraction


Automatic detection of relevant information, predictions and forecasts in financial news through topic modelling with Latent Dirichlet Allocation

arXiv.org Artificial Intelligence

Financial news items are unstructured sources of information that can be mined to extract knowledge for market screening applications. Manual extraction of relevant information from the continuous stream of finance-related news is cumbersome and beyond the skills of many investors, who, at most, can follow a few sources and authors. Accordingly, we focus on the analysis of financial news to identify relevant text and, within that text, forecasts and predictions. We propose a novel Natural Language Processing (NLP) system to assist investors in the detection of relevant financial events in unstructured textual sources by considering both relevance and temporality at the discursive level. Firstly, we segment the text to group together closely related text. Secondly, we apply co-reference resolution to discover internal dependencies within segments. Finally, we perform relevant topic modelling with Latent Dirichlet Allocation (LDA) to separate relevant from less relevant text and then analyse the relevant text using a Machine Learning-oriented temporal approach to identify predictions and speculative statements. We created an experimental data set composed of 2,158 financial news items that were manually labelled by NLP researchers to evaluate our solution. The ROUGE-L values for the identification of relevant text and predictions/forecasts were 0.662 and 0.982, respectively. To our knowledge, this is the first work to jointly consider relevance and temporality at the discursive level. It contributes to the transfer of human associative discourse capabilities to expert systems through the combination of multi-paragraph topic segmentation and co-reference resolution to separate author expression patterns, topic modelling with LDA to detect relevant text, and discursive temporality analysis to identify forecasts and predictions within this text.


Detection of financial opportunities in micro-blogging data with a stacked classification system

arXiv.org Artificial Intelligence

Micro-blogging sources such as the Twitter social network provide valuable real-time data for market prediction models. Investors' opinions in this network follow the fluctuations of the stock markets and often include educated speculations on market opportunities that may have impact on the actions of other investors. In view of this, we propose a novel system to detect positive predictions in tweets, a type of financial emotions which we term "opportunities" that are akin to "anticipation" in Plutchik's theory. Specifically, we seek a high detection precision to present a financial operator a substantial amount of such tweets while differentiating them from the rest of financial emotions in our system. We achieve it with a three-layer stacked Machine Learning classification system with sophisticated features that result from applying Natural Language Processing techniques to extract valuable linguistic information. Experimental results on a dataset that has been manually annotated with financial emotion and ticker occurrence tags demonstrate that our system yields satisfactory and competitive performance in financial opportunity detection, with precision values up to 83%. This promising outcome endorses the usability of our system to support investors' decision making.


RealKIE: Five Novel Datasets for Enterprise Key Information Extraction

arXiv.org Artificial Intelligence

We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets include a diverse range of documents including SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction tasks like investment analysis and legal data processing. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data and OCR outputs are available to download at https://indicodatasolutions.github.io/RealKIE/ code to reproduce the baselines will be available shortly.


Emotion Detection with Transformers: A Comparative Study

arXiv.org Artificial Intelligence

Analyzing the sentiment and emotion behind social media text data can help us understand the attitudes, preferences, and feelings of the users (Isah, Trundle, & Neagu, 2014). Unveiling these emotions goes beyond simply gauging sentiment, as it provides deeper insights into user motivations and psychological states. For instance, capturing the nuances of human emotion expressed through text remains a complex task due to the limitations of language itself. Sentiment analysis, while valuable, only provides a surface-level understanding. Identifying specific emotions expressed in text offers deeper insights into user behavior and motivations. Sentiment analysis is a natural language processing task that aims to classify the polarity of a text as positive, negative, or neutral (Khurana, Koli, Khatter, & Singh, 2023; Min et al., 2023). Emotion classification is a related task that aims to identify the specific emotion expressed in a text, such as sadness, joy, love, anger, fear, or surprise. Both tasks might be challenging due to the complexity and variability of natural language. Transfer learning is a technique that allows a model trained on one task to be used for another related task (Weiss, Khoshgoftaar, & Wang, 2016).


A Hybrid Approach To Aspect Based Sentiment Analysis Using Transfer Learning

arXiv.org Artificial Intelligence

Aspect-Based Sentiment Analysis (ABSA) aims to identify terms or multiword expressions (MWEs) on which sentiments are expressed and the sentiment polarities associated with them. The development of supervised models has been at the forefront of research in this area. However, training these models requires the availability of manually annotated datasets which is both expensive and time-consuming. Furthermore, the available annotated datasets are tailored to a specific domain, language, and text type. In this work, we address this notable challenge in current state-of-the-art ABSA research. We propose a hybrid approach for Aspect Based Sentiment Analysis using transfer learning. The approach focuses on generating weakly-supervised annotations by exploiting the strengths of both large language models (LLM) and traditional syntactic dependencies. We utilise syntactic dependency structures of sentences to complement the annotations generated by LLMs, as they may overlook domain-specific aspect terms. Extensive experimentation on multiple datasets is performed to demonstrate the efficacy of our hybrid method for the tasks of aspect term extraction and aspect sentiment classification.


Efficient Information Extraction in Few-Shot Relation Classification through Contrastive Representation Learning

arXiv.org Artificial Intelligence

Differentiating relationships between entity pairs with limited labeled instances poses a significant challenge in few-shot relation classification. Representations of textual data extract rich information spanning the domain, entities, and relations. In this paper, we introduce a novel approach to enhance information extraction combining multiple sentence representations and contrastive learning. While representations in relation classification are commonly extracted using entity marker tokens, we argue that substantial information within the internal model representations remains untapped. To address this, we propose aligning multiple sentence representations, such as the [CLS] token, the [MASK] token used in prompting, and entity marker tokens. Our method employs contrastive learning to extract complementary discriminative information from these individual representations. This is particularly relevant in low-resource settings where information is scarce. Leveraging multiple sentence representations is especially effective in distilling discriminative information for relation classification when additional information, like relation descriptions, are not available. We validate the adaptability of our approach, maintaining robust performance in scenarios that include relation descriptions, and showcasing its flexibility to adapt to different resource constraints.


Is There a One-Model-Fits-All Approach to Information Extraction? Revisiting Task Definition Biases

arXiv.org Artificial Intelligence

Definition bias is a negative phenomenon that can mislead models. Definition bias in information extraction appears not only across datasets from different domains but also within datasets sharing the same domain. We identify two types of definition bias in IE: bias among information extraction datasets and bias between information extraction datasets and instruction tuning datasets. To systematically investigate definition bias, we conduct three probing experiments to quantitatively analyze it and discover the limitations of unified information extraction and large language models in solving definition bias. To mitigate definition bias in information extraction, we propose a multi-stage framework consisting of definition bias measurement, bias-aware fine-tuning, and task-specific bias mitigation. Experimental results demonstrate the effectiveness of our framework in addressing definition bias. Resources of this paper can be found at https://github.com/EZ-hwh/definition-bias


Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding

arXiv.org Artificial Intelligence

Deep multimodal semantic understanding that goes beyond the mere superficial content relation mining has received increasing attention in the realm of artificial intelligence. The challenges of collecting and annotating high-quality multi-modal data have underscored the significance of few-shot learning. In this paper, we focus on two critical tasks under this context: few-shot multi-modal sarcasm detection (MSD) and multi-modal sentiment analysis (MSA). To address them, we propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on the unified vision-language model (VLM). Specifically, we design three experts of soft prompts: a text prompt and an image prompt that extract modality-specific features to enrich the single-modal representation, and a unified prompt to assist multi-modal interaction. Additionally, we reorganize Transformer layers into several blocks and introduce cross-modal prompt attention between adjacent blocks, which smoothens the transition from single-modal representation to multi-modal fusion. On both MSD and MSA datasets in few-shot setting, our proposed model not only surpasses the 8.2B model InstructBLIP with merely 2% parameters (150M), but also significantly outperforms other widely-used prompt methods on VLMs or task-specific methods.


Extracting Emotion Phrases from Tweets using BART

arXiv.org Artificial Intelligence

Sentiment analysis is a natural language processing task that aims to identify and extract the emotional aspects of a text. However, many existing sentiment analysis methods primarily classify the overall polarity of a text, overlooking the specific phrases that convey sentiment. In this paper, we applied an approach to sentiment analysis based on a question-answering framework. Our approach leverages the power of Bidirectional Autoregressive Transformer (BART), a pre-trained sequence-to-sequence model, to extract a phrase from a given text that amplifies a given sentiment polarity. We create a natural language question that identifies the specific emotion to extract and then guide BART to pay attention to the relevant emotional cues in the text. We use a classifier within BART to predict the start and end positions of the answer span within the text, which helps to identify the precise boundaries of the extracted emotion phrase. Our approach offers several advantages over most sentiment analysis studies, including capturing the complete context and meaning of the text and extracting precise token spans that highlight the intended sentiment. We achieved an end loss of 87% and Jaccard score of 0.61.


Visualization of Unstructured Sports Data -- An Example of Cricket Short Text Commentary

arXiv.org Artificial Intelligence

Sports visualization focuses on the use of structured data, such as box-score data and tracking data. Unstructured data sources pertaining to sports are available in various places such as blogs, social media posts, and online news articles. Sports visualization methods either not fully exploited the information present in these sources or the proposed visualizations through the use of these sources did not augment to the body of sports visualization methods. We propose the use of unstructured data, namely cricket short text commentary for visualization. The short text commentary data is used for constructing individual player's strength rules and weakness rules. A computationally feasible definition for player's strength rule and weakness rule is proposed. A visualization method for the constructed rules is presented. In addition, players having similar strength rules or weakness rules is computed and visualized. We demonstrate the usefulness of short text commentary in visualization by analyzing the strengths and weaknesses of cricket players using more than one million text commentaries. We validate the constructed rules through two validation methods. The collected data, source code, and obtained results on more than 500 players are made publicly available.