Apostol, Elena-Simona
GETAE: Graph Information Enhanced Deep Neural NeTwork Ensemble ArchitecturE for Fake News Detection
Truică, Ciprian-Octavian, Apostol, Elena-Simona, Marogel, Marius, Paschke, Adrian
In today's digital age, fake news has become a major problem with serious consequences, ranging from social unrest to political upheaval. To address this issue, new methods for detecting and mitigating fake news are required. In this work, we propose to incorporate contextual and network-aware features into the detection process. This involves analyzing not only the content of a news article but also the context in which it was shared and the network of users who shared it, i.e., the information diffusion. Thus, we propose GETAE, \underline{G}raph Information \underline{E}nhanced Deep Neural Ne\underline{t}work Ensemble \underline{A}rchitectur\underline{E} for Fake News Detection, a novel ensemble architecture that uses textual content together with social interactions to improve fake news detection. GETAE contains two branches: the Text Branch and the Propagation Branch. The Text Branch uses Word and Transformer Embeddings together with a Deep Neural Network based on feed-forward and bidirectional Recurrent Neural Networks (\textsc{[Bi]RNN}) to learn contextual features and create a novel Text Content Embedding. The Propagation Branch considers the information propagation within the graph network and employs a Deep Learning architecture based on Node Embeddings to create a novel Propagation Embedding. The GETAE Ensemble combines the two novel embeddings, i.e., the Text Content Embedding and the Propagation Embedding, into a novel \textit{Propagation-Enhanced Content Embedding}, which is afterward used for classification. The experimental results obtained on two real-world publicly available datasets, i.e., Twitter15 and Twitter16, show that this approach improves fake news detection and outperforms state-of-the-art models.
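To make the fusion step concrete, the following minimal sketch (an assumption for illustration, not the authors' released code) shows how a Text Content Embedding produced by a bidirectional RNN and a Propagation Embedding produced from node embeddings could be concatenated into a Propagation-Enhanced Content Embedding and classified; all layer sizes and module names are hypothetical.

import torch
import torch.nn as nn

class GetaeStyleEnsemble(nn.Module):
    def __init__(self, word_dim=300, node_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        # Text Branch: bidirectional RNN over word/transformer embeddings.
        self.text_rnn = nn.GRU(word_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Propagation Branch: feed-forward layer over node embeddings of the diffusion graph.
        self.prop_ff = nn.Sequential(nn.Linear(node_dim, hidden_dim), nn.ReLU())
        # Ensemble head over the concatenated (propagation-enhanced) embedding.
        self.classifier = nn.Linear(2 * hidden_dim + hidden_dim, num_classes)

    def forward(self, word_embeddings, node_embedding):
        _, h = self.text_rnn(word_embeddings)             # h: (2, batch, hidden)
        text_content = torch.cat([h[0], h[1]], dim=-1)    # Text Content Embedding
        propagation = self.prop_ff(node_embedding)        # Propagation Embedding
        fused = torch.cat([text_content, propagation], dim=-1)  # Propagation-Enhanced Content Embedding
        return self.classifier(fused)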
StopHC: A Harmful Content Detection and Mitigation Architecture for Social Media Platforms
Truică, Ciprian-Octavian, Constantinescu, Ana-Teodora, Apostol, Elena-Simona
The mental health of social media users is increasingly put at risk by harmful, hateful, and offensive content. In this paper, we propose \textsc{StopHC}, a harmful content detection and mitigation architecture for social media platforms. Our aim with \textsc{StopHC} is to create more secure online environments. Our solution contains two modules: one that employs a deep neural network architecture for harmful content detection, and one that uses a network immunization algorithm to block toxic nodes and stop the spread of harmful content. The efficacy of our solution is demonstrated by experiments conducted on two real-world datasets.
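As a rough illustration of the mitigation module, the sketch below (an assumption, not the \textsc{StopHC} implementation) blocks the top-k toxic nodes ranked by a simple spread proxy, here out-degree centrality, and removes them from the diffusion graph; the function name and the centrality choice are hypothetical.

import networkx as nx

def immunize_top_spreaders(graph: nx.DiGraph, toxic_nodes: set, k: int = 10) -> set:
    # Score toxic nodes by out-degree centrality as a simple proxy for spreading power.
    centrality = nx.out_degree_centrality(graph)
    ranked = sorted(toxic_nodes, key=lambda n: centrality.get(n, 0.0), reverse=True)
    blocked = set(ranked[:k])
    # Blocking a node = removing it from the diffusion graph so it can no longer propagate content.
    graph.remove_nodes_from(blocked)
    return blocked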
Advancements in eHealth Data Analytics through Natural Language Processing and Deep Learning
Apostol, Elena-Simona, Truică, Ciprian-Octavian
The healthcare environment is commonly described as "information-rich" yet "knowledge-poor". Healthcare systems collect huge amounts of data from various sources: lab reports, medical letters, logs of medical tools or programs, medical prescriptions, etc. These massive datasets can provide knowledge and information that improve medical services and the healthcare domain overall, e.g., disease prediction by analyzing a patient's symptoms, or disease prevention by facilitating the discovery of behavioral risk factors. Unfortunately, only a relatively small volume of textual eHealth data is processed and interpreted, an important factor being the difficulty of efficiently performing Big Data operations. In the medical field, detecting domain-specific multi-word terms is a crucial task, as they can define an entire concept with a few words. A term can be defined as a linguistic structure or a concept, and it is composed of one or more words with a specific meaning to a domain. All the terms of a domain create its terminology. This chapter offers a critical study of the current best-performing solutions for analyzing unstructured (image and textual) eHealth data. The study also compares current Natural Language Processing and Deep Learning techniques in the eHealth context. Finally, we examine and discuss some of the open issues and define a set of research directions in this area.
Efficient Machine Learning Ensemble Methods for Detecting Gravitational Wave Glitches in LIGO Time Series
Apostol, Elena-Simona, Truică, Ciprian-Octavian
Gravitational Wave (GW) analysis has grown in popularity as technology has advanced and the process of observing gravitational waves has become more precise. Although the sensitivity and the frequency of observation of GW signals are constantly improving, the possibility of noise in the collected GW data remains. In this paper, we propose two new Machine and Deep Learning ensemble approaches (i.e., ShallowWaves and DeepWaves Ensembles) for detecting different types of noise and patterns in datasets from GW observatories. Our research also investigates various Machine and Deep Learning techniques for multi-class classification and provides a comprehensive benchmark, emphasizing the best results in terms of three commonly used performance metrics (i.e., accuracy, precision, and recall). We train and test our models on a dataset consisting of annotated time series from real-world data collected by the Advanced Laser Interferometer GW Observatory (LIGO). We empirically show that the best overall accuracy is obtained by the proposed DeepWaves Ensemble, followed closely by the ShallowWaves Ensemble.
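In the spirit of the ShallowWaves Ensemble, a soft-voting combination of classic classifiers could look like the sketch below; the chosen estimators and hyperparameters are illustrative assumptions, not the paper's configuration.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Combine several shallow learners by averaging their class probabilities (soft voting)
# for multi-class glitch labels.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),  # probability=True is required for soft voting
    ],
    voting="soft",
)
# Usage: ensemble.fit(X_train, y_train); predictions = ensemble.predict(X_test)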
Semantic Change Detection for the Romanian Language
Truică, Ciprian-Octavian, Tudose, Victor, Apostol, Elena-Simona
Language is in a continuous process of change, and language change is the phenomenon that drives language evolution as a process of adaptation to the environment and to the ways other speakers use the language [3, 2]. The various instances of language change are classified into different categories, such as regular phonetic changes, changes in word usage, and changes in the way words appear together, i.e., syntactic changes. Although it is usually a continuous process that follows regular patterns, very abrupt changes in the meanings of words can still occur, usually motivated by a change in the context a community lives in [15, 9, 7]. Semantic change, as a phenomenon permanently present in language evolution, is an important aspect that should be taken into account when working with historical data [1]. Historical linguists, lexical typologists, and other humanities and social science experts have studied the meaning of words and how it changes over time.
ATESA-B{\AE}RT: A Heterogeneous Ensemble Learning Model for Aspect-Based Sentiment Analysis
Apostol, Elena-Simona, Pisică, Alin-Georgian, Truică, Ciprian-Octavian
The increasing volume of online reviews has enabled the development of sentiment analysis models for determining the opinion of customers regarding different products and services. Until now, sentiment analysis has proven to be an effective tool for determining the overall polarity of reviews. To improve the granularity at the aspect level for a better understanding of the service or product, the task of aspect-based sentiment analysis aims to first identify aspects and then determine the user's opinion about them. The complexity of this task lies in the fact that the same review can present multiple aspects, each with its own polarity. Current solutions have poor performance on such data. We address this problem by proposing ATESA-B{\AE}RT, a heterogeneous ensemble learning model for Aspect-Based Sentiment Analysis. Firstly, we divide the problem into two sub-tasks, i.e., Aspect Term Extraction and Aspect Term Sentiment Analysis. Secondly, we use \textit{argmax}-based multi-class classification on six transformer-based learners for each sub-task. Initial experiments on two datasets show that ATESA-B{\AE}RT outperforms current state-of-the-art solutions while addressing the multiple-aspects problem.
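One plausible reading of the \textit{argmax} ensemble step is sketched below: average the class-probability outputs of the per-sub-task learners and take the argmax as the final label. The function name and array shapes are illustrative assumptions, not the paper's exact setup.

import numpy as np

def ensemble_argmax(per_model_probs):
    # per_model_probs: list of (num_samples, num_classes) probability matrices,
    # one per fine-tuned transformer learner for the current sub-task.
    mean_probs = np.stack(per_model_probs, axis=0).mean(axis=0)
    return mean_probs.argmax(axis=1)  # final class index per sample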
A Distributed Automatic Domain-Specific Multi-Word Term Recognition Architecture using Spark Ecosystem
Truică, Ciprian-Octavian, Istrate, Neculai-Ovidiu, Apostol, Elena-Simona
Automatic Term Recognition (ATR) is used to extract domain-specific terms that create the terminology of a domain. A term can be defined as a linguistic structure or a concept, and it is composed of one or more words with a specific meaning to a domain. With the exponential growth of technical and scientific articles, new domain-specific terms appear daily as named entities (e.g., Apache Spark), idioms (e.g., Big Data), multi-word expressions (e.g., recurrent neural networks), or through semantic change and shifts (e.g., local neighborhood). Methods that can automatically recognize and extract these domain-specific terms are useful for both scientists and professionals to improve existing systems (e.g., WordNet [4], OntoLex-FRaC [3]) that deal with linguistics, terminology, and machine-readable technologies. ATR methods [6, 7, 5, 8, 9, 11] consist of two main phases. The first phase extracts a list of candidate terms that are later ranked by scoring metrics according to their importance to a given domain. To extract this list, words are tagged with their part of speech (PoS), and candidate multi-word terms are extracted using language-dependent linguistic filters [10]. The second phase is specific to each method and involves computing a domain-relevance score using different term statistics, e.g., frequency, context, number of similar terms, etc. When employing ATR methods, users often need to process large volumes of textual data; the extraction and recognition of domain-specific terms can therefore be improved by developing these methods on top of distributed ecosystems such as Apache Hadoop and Apache Spark.
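The two phases can be sketched as follows; this is a simplified assumption, with a minimal adjective/noun linguistic filter standing in for phase one and plain frequency standing in for the domain-relevance metrics of phase two.

from collections import Counter

def extract_candidates(tagged_sentences):
    # Phase 1: keep maximal adjective/noun runs that end in a noun (a common linguistic filter).
    # tagged_sentences: list of sentences, each a list of (word, pos) pairs with coarse PoS tags.
    candidates = []
    for sentence in tagged_sentences:
        run = []
        for word, pos in sentence + [("", "X")]:   # sentinel flushes the last run
            if pos in ("ADJ", "NOUN"):
                run.append((word, pos))
            else:
                while run and run[-1][1] != "NOUN":
                    run.pop()                       # trim trailing adjectives
                if len(run) >= 2:                   # multi-word terms only
                    candidates.append(" ".join(w for w, _ in run))
                run = []
    return candidates

def rank_by_frequency(candidates):
    # Phase 2 placeholder: plain frequency; real ATR metrics also use context,
    # nested-term statistics, and term length.
    return Counter(candidates).most_common()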
MisRoB{\AE}RTa: Transformers versus Misinformation
Truică, Ciprian-Octavian, Apostol, Elena-Simona
Misinformation is considered a threat to our democratic values and principles. The spread of such content on social media polarizes society and undermines public discourse by distorting public perceptions and generating social unrest while lacking the rigor of traditional journalism. Transformers and transfer learning have proven to be state-of-the-art methods for multiple well-known natural language processing tasks. In this paper, we propose MisRoB{\AE}RTa, a novel transformer-based deep neural ensemble architecture for misinformation detection. MisRoB{\AE}RTa takes advantage of two transformers (BART \& RoBERTa) to improve the classification performance. We also benchmarked and evaluated the performance of multiple transformers on the task of misinformation detection. For training and testing, we used a large real-world news articles dataset labeled with 10 classes, addressing two shortcomings in the current research: increasing the size of the dataset from small to large, and moving the focus of fake news detection from binary to multi-class classification. For this dataset, we manually verified the content of the news articles to ensure that they were correctly labeled. The experimental results show that the accuracy of transformers on the misinformation detection problem is significantly influenced by the method employed to learn the context, the dataset size, and the vocabulary dimension. We observe empirically that the best accuracy among the classification models that use only one transformer is obtained by BART, while DistilRoBERTa obtains the best accuracy in the least amount of time required for fine-tuning and training. The proposed MisRoB{\AE}RTa outperforms the other transformer models on the task of misinformation detection. To arrive at this conclusion, we performed extensive ablation and sensitivity testing with MisRoB{\AE}RTa on two datasets.
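As a hedged sketch of the core idea, document representations obtained from the two transformers could be fused and classified as below; the fusion by concatenation, the layer sizes, and the class count are assumptions for illustration, not the exact MisRoB{\AE}RTa design.

import torch
import torch.nn as nn

class DualTransformerEnsembleHead(nn.Module):
    def __init__(self, bart_dim=1024, roberta_dim=768, num_classes=10):
        super().__init__()
        # Classify the concatenation of the two document representations.
        self.classifier = nn.Sequential(
            nn.Linear(bart_dim + roberta_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, bart_doc_embedding, roberta_doc_embedding):
        fused = torch.cat([bart_doc_embedding, roberta_doc_embedding], dim=-1)
        return self.classifier(fused)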
It's All in the Embedding! Fake News Detection Using Document Embeddings
Truică, Ciprian-Octavian, Apostol, Elena-Simona
With the current shift in the mass media landscape away from journalistic rigor and toward social media, personalized content is becoming the new norm. Although the digitalization of the media brings many advantages, it also increases the risk of spreading disinformation, misinformation, and malinformation through the use of fake news. The emergence of this harmful phenomenon has managed to polarize society and manipulate public opinion on particular topics, e.g., elections and vaccinations. Such information propagated on social media can distort public perceptions and generate social unrest while lacking the rigor of traditional journalism. Natural Language Processing and Machine Learning techniques are essential for developing efficient tools that can detect fake news. Models that use the context of textual data are essential for resolving the fake news detection problem, as they manage to encode linguistic features within the vector representation of words. In this paper, we propose a new approach that uses document embeddings to build multiple models that accurately label news articles as reliable or fake. We also present a benchmark on different architectures that detect fake news using binary or multi-labeled classification. We evaluated the models on five large news corpora using accuracy, precision, and recall. We obtained better results than more complex state-of-the-art Deep Neural Network models. We observe that the most important factor for obtaining high accuracy is the document encoding, not the classification model's complexity.
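The "document encoding first, simple classifier second" recipe can be sketched as follows; this is an illustrative assumption using Doc2Vec and logistic regression, and the hyperparameters are not those from the paper.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

def train_fake_news_classifier(train_texts, train_labels):
    # Encode each article with a document embedding.
    tagged = [TaggedDocument(words=t.lower().split(), tags=[i]) for i, t in enumerate(train_texts)]
    doc2vec = Doc2Vec(tagged, vector_size=100, epochs=20, min_count=2)
    # Feed the document vectors to a plain classifier.
    X = [doc2vec.infer_vector(t.lower().split()) for t in train_texts]
    clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
    return doc2vec, clf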
SimpLex: a lexical text simplification architecture
Truică, Ciprian-Octavian, Stan, Andrei-Ionut, Apostol, Elena-Simona
Text simplification (TS) is the process of generating easy-to-understand sentences from a given sentence or piece of text. The aim of TS is to reduce both the lexical complexity (which refers to vocabulary and meaning) and the syntactic complexity (which refers to sentence structure) of a given text or sentence without loss of meaning or nuance. In this paper, we present \textsc{SimpLex}, a novel simplification architecture for generating simplified English sentences. To generate a simplified sentence, the proposed architecture uses either word embeddings (i.e., Word2Vec) and perplexity, or sentence transformers (i.e., BERT, RoBERTa, and GPT2) and cosine similarity. The solution is incorporated into user-friendly and simple-to-use software. We evaluate our system using two metrics, i.e., SARI and Perplexity Decrease. Experimentally, we observe that the transformer models outperform the other models in terms of the SARI score; however, in terms of perplexity, the word-embedding-based models achieve the largest decrease. Thus, the main contributions of this paper are: (1) We propose a new Word Embedding and Transformer based algorithm for text simplification; (2) We design \textsc{SimpLex} -- a novel, modular text simplification system -- that can provide a baseline for further research; and (3) We perform an in-depth analysis of our solution and compare our results with two state-of-the-art models, i.e., LightLS [19] and NTS-w2v [44]. We also make the code publicly available online.
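The word-embedding-plus-perplexity route can be sketched as below; this is an assumption for illustration, where candidate substitutes come from embedding neighbours and the substitute yielding the lowest sentence perplexity wins, with the perplexity scorer passed in as a callable because the concrete language model is not fixed here.

from gensim.models import KeyedVectors

def simplify_word(sentence_tokens, complex_idx, embeddings: KeyedVectors, perplexity, top_n=10):
    # Propose substitutes for the complex word from its word-embedding neighbours.
    complex_word = sentence_tokens[complex_idx]
    candidates = [w for w, _ in embeddings.most_similar(complex_word, topn=top_n)]
    # Keep the substitute whose resulting sentence has the lowest perplexity.
    best, best_ppl = complex_word, perplexity(" ".join(sentence_tokens))
    for candidate in candidates:
        new_tokens = sentence_tokens[:complex_idx] + [candidate] + sentence_tokens[complex_idx + 1:]
        ppl = perplexity(" ".join(new_tokens))
        if ppl < best_ppl:
            best, best_ppl = candidate, ppl
    return best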