Information Extraction
Analyzing the Impact of Sentiments of Scientific Articles on COVID-19 Vaccination Rates
Chua, Sean Eugene G., Sison, Kevin Anthony S.
At the peak of the COVID-19 pandemic, numerous countries worldwide sought to mobilize vaccination campaigns in an attempt to curb the spread of the virus and the number of deaths it caused. One avenue through which information regarding COVID vaccinations is propagated is scientific articles, which lend that information a certain level of credibility. Hence, people who view these articles are more likely to get vaccinated if the articles convey a positive message on vaccinations, and less likely to do so if the articles convey a negative message. This study therefore investigates the correlation between article sentiments and the corresponding increase or decrease in vaccinations in the United States. To do this, a lexicon-based sentiment analysis was performed in two steps: first, article content was scraped with the Python library BeautifulSoup, and second, VADER was used to obtain a sentiment score for each article based on the scraped text content. Results suggest that there was a relatively weak correlation between the average sentiment score of articles and the corresponding increase or decrease in COVID vaccination rates in the US.
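The last step, relating average sentiment to changes in vaccination, could be sketched as below. This is a minimal illustration that assumes Pearson's r as the correlation measure, which the abstract does not specify. The function name `pearson_r` and all input numbers are hypothetical placeholders, not data or code from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values only (NOT data from the study): one entry per time
# period, pairing an average article sentiment score with the change in
# vaccinations over the same period.
avg_sentiment = [0.10, 0.25, -0.05, 0.30, 0.15]
vaccination_change = [1.2, 0.8, 1.0, 1.5, 0.9]

r = pearson_r(avg_sentiment, vaccination_change)
print(f"Pearson r = {r:.3f}")
```

A value of r near 0 would correspond to the weak relationship the abstract reports, while values near 1 or -1 would indicate a strong linear relationship.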
NL2GDPR: Automatically Develop GDPR Compliant Android Application Features from Natural Language
Shezan, Faysal Hossain, Lao, Yingjie, Peng, Minlong, Wang, Xin, Sun, Mingming, Li, Ping
Recent privacy leakage incidents and stricter policy regulations demand a much higher standard of compliance from companies and mobile apps. However, such obligations also impose significant challenges on app developers, who must comply with regulations that span various perspectives, activities, and roles. This is especially true for small companies and for developers who are less experienced in this matter or have limited resources. To address these hurdles, we develop an automatic tool, NL2GDPR, which can generate policies from the developer's natural language descriptions while also ensuring the app's functionalities are compliant with the General Data Protection Regulation (GDPR). NL2GDPR is developed by leveraging an information extraction tool, OIA (Open Information Annotation), developed by Sun et al. (2020) and Wang et al. (2022b) from Baidu Cognitive Computing Lab. At its core, NL2GDPR is a privacy-centric information extraction model, appended with a GDPR policy finder and a policy generator. We perform a comprehensive study to grasp the challenges in extracting privacy-centric information and generating privacy policies, while exploiting optimizations for this specific task. With NL2GDPR, we can achieve 92.9%, 95.2%, and 98.4% accuracy in correctly identifying GDPR policies related to personal data storage, process, and share types, respectively. To the best of our knowledge, NL2GDPR is the first tool that allows a developer to automatically generate GDPR-compliant policies, requiring only a natural language description of the app features. Note that other non-GDPR-related features might be integrated with the generated features to build a complex app.
A Spanish dataset for Targeted Sentiment Analysis of political headlines
Salgueiro, Tomás Alves, Zapata, Emilio Recart, Furman, Damián, Pérez, Juan Manuel, Larrosa, Pablo Nicolás Fernández
Subjective texts have received particular attention in several works, as they can induce certain behaviours in their readers. Most work focuses on user-generated texts in social networks, but other texts also contain opinions on certain topics and could influence judgement criteria during political decisions. In this work, we address the task of Targeted Sentiment Analysis for the domain of news headlines published by the main outlets during the 2019 Argentinean Presidential Elections. For this purpose, we present a dataset of 1,976 headlines mentioning candidates in the 2019 elections, with polarity annotated at the target level. Preliminary experiments with state-of-the-art classification algorithms based on pre-trained language models suggest that target information is helpful for this task. We make our data and pre-trained models publicly available.
Label-Efficient Self-Training for Attribute Extraction from Semi-Structured Web Documents
Sarkhel, Ritesh, Huang, Binxuan, Lockard, Colin, Shiralkar, Prashant
Extracting structured information from HTML documents is a long-studied problem with a broad range of applications, including knowledge base construction, faceted search, and personalized recommendation. Prior work relies on a few human-labeled web pages from each target website, or on thousands of human-labeled web pages from some seed websites, to train a transferable extraction model that generalizes to unseen target websites. Noisy content, low site-level consistency, and lack of inter-annotator agreement make labeling web pages a time-consuming and expensive ordeal. We develop LEAST, a Label-Efficient Self-Training method for Semi-Structured Web Documents, to overcome these limitations. LEAST utilizes a few human-labeled pages to pseudo-annotate a large number of unlabeled web pages from the target vertical. It trains a transferable web-extraction model on both human-labeled and pseudo-labeled samples using self-training. To mitigate error propagation due to noisy training samples, LEAST re-weights each training sample based on its estimated label accuracy and incorporates the weight in training. To the best of our knowledge, this is the first work to propose end-to-end training for transferable web extraction models utilizing only a few human-labeled pages. Experiments on a large-scale public dataset show that, using fewer than ten human-labeled pages from each seed website for training, a LEAST-trained model outperforms the previous state of the art by more than 26 average F1 points on unseen websites, reducing the number of human-labeled pages needed to achieve similar performance by more than 10x.
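The re-weighting idea can be illustrated with a weighted cross-entropy loss, in which each sample contributes in proportion to how much its label is trusted. This is only a sketch under stated assumptions: the abstract does not give the loss LEAST uses or how label accuracy is estimated, so the function `weighted_cross_entropy` and the example weights are hypothetical.

```python
import math


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean cross-entropy over a batch.

    probs:   per-sample lists of predicted class probabilities
    labels:  integer class index per sample (human or pseudo label)
    weights: per-sample trust in the label, e.g. an estimated label
             accuracy in [0, 1] (assumed form, not taken from the paper)
    """
    total = 0.0
    weight_sum = 0.0
    for p, y, w in zip(probs, labels, weights):
        # Clamp to avoid log(0) on a confidently wrong prediction.
        total += -w * math.log(max(p[y], 1e-12))
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0


# Hypothetical batch: the first sample is human-labeled (weight 1.0),
# the second is pseudo-labeled with lower estimated accuracy (weight 0.4).
probs = [[0.9, 0.1], [0.3, 0.7]]
labels = [0, 1]
weights = [1.0, 0.4]
print(weighted_cross_entropy(probs, labels, weights))
```

With this form, a pseudo-label judged unreliable (weight near 0) has almost no influence on the gradient, which is one simple way to limit error propagation from noisy samples.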
Court grants Elon Musk access to a small but important set of Twitter data
The judge presiding over Twitter's lawsuit against Elon Musk has mostly rejected the multi-company executive's request to access an "absurdly broad" amount of data. She did, however, agree that additional data from Twitter is warranted and has ordered the social network to produce a subset of what Musk's camp had requested. To be exact, Judge Kathaleen McCormick has ordered Twitter to hand over data from the 9,000 accounts it reviewed in the fourth quarter of 2021 to determine the number of spam accounts on the platform. Further, it must produce the documents showing how those accounts, which Twitter calls "historical snapshot," were selected for review. Twitter, if you'll recall, is suing Elon Musk to force him to complete his $44 billion acquisition of the website. Musk offered to buy Twitter for $54.20 per share back in April, and Twitter had quickly agreed.
Human-in-the-loop Text Extraction System
In this article, we will talk in depth about an interactive, human-in-the-loop tool called SEER. SEER helps users who work with text datasets extract relevant data from them. A SEER user highlights examples of text: positive examples are texts they wish to extract, and negative examples are texts they do not wish to extract.
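The role of positive and negative examples can be sketched as a consistency check on a candidate extraction rule. The snippet does not describe SEER's rule language, so this sketch uses plain regular expressions as a stand-in; the function `is_consistent` and the sample text are hypothetical.

```python
import re


def is_consistent(pattern, text, positives, negatives):
    """Return True if the rule extracts every positive and no negative example."""
    extracted = {m.group(0) for m in re.finditer(pattern, text)}
    covers_positives = all(p in extracted for p in positives)
    avoids_negatives = not any(n in extracted for n in negatives)
    return covers_positives and avoids_negatives


# Hypothetical document and user highlights.
text = "Invoice 2021-03-04 total 120 USD, ref 4471"
positives = ["120"]   # text the user wants extracted
negatives = ["4471"]  # text the user does not want extracted

print(is_consistent(r"\d+", text, positives, negatives))           # too broad
print(is_consistent(r"\d+(?= USD)", text, positives, negatives))   # amounts only
```

The first rule matches every number, including the negative example, so it is rejected; the second rule keeps only numbers followed by " USD" and satisfies both sets of examples.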
Cross-Modality Gated Attention Fusion for Multimodal Sentiment Analysis
Multimodal sentiment analysis (MSA) is an important research task that predicts a sentiment score from the different modalities of a specific opinion video. Many previous studies have shown the significance of utilizing the shared and unique information across different modalities. However, the high-order combined signals from multimodal data would also help extract satisfactory representations. In this paper, we propose CMGA, a Cross-Modality Gated Attention fusion model for MSA that aims to enable adequate interaction across different modality pairs. CMGA also adds a forget gate to filter the noisy and redundant signals introduced in the interaction procedure. We experiment on two MSA benchmark datasets, MOSI and MOSEI, illustrating the performance of CMGA over several baseline models. We also conduct an ablation study to demonstrate the function of different components inside CMGA.
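The effect of a forget gate on a cross-modal interaction vector can be sketched as an element-wise sigmoid gate. This is a generic gating sketch and not CMGA's actual architecture, which the abstract does not specify; the function `forget_gate_fusion`, the choice of gate inputs, and the weights are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forget_gate_fusion(interaction, context, gate_weights, gate_bias):
    """Scale each interaction feature by a learned gate in (0, 1).

    gate_i   = sigmoid(W[i] . concat(interaction, context) + b[i])
    output_i = gate_i * interaction_i

    interaction: cross-modal interaction features (e.g. attention output)
    context:     features the gate conditions on (assumed: one modality's input)
    """
    concat = list(interaction) + list(context)
    fused = []
    for i, value in enumerate(interaction):
        z = sum(w * c for w, c in zip(gate_weights[i], concat)) + gate_bias[i]
        fused.append(sigmoid(z) * value)
    return fused


# Hypothetical 2-dimensional example with hand-picked gate parameters.
interaction = [2.0, -1.0]
context = [0.5, 0.5]
gate_weights = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
gate_bias = [0.0, -50.0]  # second feature is almost fully "forgotten"
print(forget_gate_fusion(interaction, context, gate_weights, gate_bias))
```

A gate value near 1 passes a feature through unchanged, while a value near 0 suppresses it, which is how noisy or redundant interaction signals can be filtered.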
A Hierarchical Interactive Network for Joint Span-based Aspect-Sentiment Analysis
Chen, Wei, Du, Jinglong, Zhang, Zhao, Zhuang, Fuzhen, He, Zhongshi
Recently, some span-based methods have achieved encouraging performance for joint aspect-sentiment analysis. These methods first extract aspects (aspect extraction) by detecting aspect boundaries and then classify the span-level sentiments (sentiment classification). However, most existing approaches either sequentially extract task-specific features, leading to insufficient feature interactions, or encode aspect features and sentiment features in parallel, implying that the feature representations of the two tasks are largely independent of each other except for input sharing. Both ignore the internal correlations between aspect extraction and sentiment classification. To solve this problem, we propose a hierarchical interactive network (HI-ASA) to model two-way interactions between the two tasks appropriately, where the hierarchical interactions involve two steps: shallow-level interaction and deep-level interaction. First, we utilize a cross-stitch mechanism to selectively combine the different task-specific features as the input to ensure proper two-way interactions. Second, the mutual information technique is applied to mutually constrain learning between the two tasks in the output layer, so that the aspect input and the sentiment input are capable of encoding features of the other task via backpropagation. Extensive experiments on three real-world datasets demonstrate HI-ASA's superiority over baselines.
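A cross-stitch unit, in its standard form, mixes the features of two tasks through a small learnable matrix. The sketch below shows that standard linear combination; how HI-ASA places or parameterizes the unit is not stated in the abstract, so the function `cross_stitch` and the example coefficients are illustrative assumptions.

```python
def cross_stitch(aspect_feats, sentiment_feats, alpha):
    """Mix two task-specific feature vectors with a 2x2 coefficient matrix.

    alpha = [[a_aa, a_as],
             [a_sa, a_ss]]
    new_aspect    = a_aa * aspect + a_as * sentiment
    new_sentiment = a_sa * aspect + a_ss * sentiment
    """
    (a_aa, a_as), (a_sa, a_ss) = alpha
    pairs = list(zip(aspect_feats, sentiment_feats))
    new_aspect = [a_aa * a + a_as * s for a, s in pairs]
    new_sentiment = [a_sa * a + a_ss * s for a, s in pairs]
    return new_aspect, new_sentiment


# Hypothetical features and coefficients; in training, alpha is learned.
aspect = [1.0, 2.0]
sentiment = [3.0, 4.0]
alpha = [[0.9, 0.1],
         [0.2, 0.8]]
print(cross_stitch(aspect, sentiment, alpha))
```

With an identity matrix the two tasks stay fully separate, and larger off-diagonal coefficients share more information between them, so the learned matrix controls how selective the combination is.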
Emotion Analysis using Multi-Layered Networks for Graphical Representation of Tweets
Nguyen, Anna, Longa, Antonio, Luca, Massimiliano, Kaul, Joe, Lopez, Gabriel
Anticipating audience reaction towards a certain piece of text is integral to several facets of society, ranging from politics and research to commercial industries. Sentiment analysis (SA) is a useful natural language processing (NLP) technique that utilizes both lexical/statistical and deep learning methods to determine whether texts of different sizes exhibit a positive, negative, or neutral emotion. However, there is currently a lack of tools that can be used to analyse groups of independent texts and extract the primary emotion from the whole set. Therefore, the current paper proposes a novel algorithm referred to as the Multi-Layered Tweet Analyzer (MLTA) that graphically models social media text using multi-layered networks (MLNs) in order to better encode relationships across independent sets of tweets. Compared to other representation methods, graph structures are better able to capture meaningful relationships in complex ecosystems. State-of-the-art Graph Neural Networks (GNNs) are used to extract information from the Tweet-MLN and make predictions based on the extracted graph features. Results show that the MLTA not only predicts from a larger set of possible emotions, delivering a more accurate sentiment than the standard positive, negative, or neutral labels, but also allows for accurate group-level predictions of Twitter data.
CMSBERT-CLR: Context-driven Modality Shifting BERT with Contrastive Learning for linguistic, visual, acoustic Representations
Multimodal sentiment analysis has become an increasingly popular research area as the demand for multimodal online content grows. In multimodal sentiment analysis, words can have different meanings depending on the linguistic context and non-verbal information, so it is crucial to understand the meaning of the words accordingly. In addition, word meanings should be interpreted within the whole utterance context, which includes non-verbal information. In this paper, we present a Context-driven Modality Shifting BERT with Contrastive Learning for linguistic, visual, acoustic Representations (CMSBERT-CLR), which incorporates the whole context's non-verbal and verbal information and aligns modalities more effectively through contrastive learning. First, we introduce Context-driven Modality Shifting (CMS) to incorporate the non-verbal and verbal information within the whole context of the sentence utterance. Then, to improve the alignment of different modalities within a common embedding space, we apply contrastive learning. Furthermore, we use an exponential moving average parameter and label smoothing as optimization strategies, which can make the convergence of the network more stable and increase the flexibility of the alignment. In our experiments, we demonstrate that our approach achieves state-of-the-art results.
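The two optimization strategies named above have well-known generic forms, sketched below. The abstract does not say which parameters are averaged, which targets are smoothed, or what hyperparameters are used, so the functions `ema_update` and `smooth_labels` and the values of `decay` and `epsilon` are assumptions for illustration.

```python
def ema_update(shadow, params, decay=0.999):
    """Exponential moving average of parameters.

    shadow <- decay * shadow + (1 - decay) * params
    """
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: move epsilon of the probability mass to a uniform prior."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


# Hypothetical values: the averaged copy moves slowly towards the
# current parameters, which damps step-to-step fluctuations.
shadow = [1.0, 1.0]
current = [0.0, 2.0]
print(ema_update(shadow, current, decay=0.9))

# A hard target over four classes becomes a softer distribution.
print(smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1))
```

In a contrastive setting, smoothed targets no longer demand that a matched pair receive all of the probability mass, which is one plausible reading of the added flexibility in alignment.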