Information Extraction
Twitter data leak exposes over 5.4 million accounts
Earlier this year, Twitter confirmed that the private user data for 5.4 million users was stolen due to an API vulnerability, but the company said it had "no evidence" that it was exploited. Now, all of those accounts have been exposed on a hacker forum, BleepingComputer has reported. On top of that, an additional 1.4 million Twitter profiles for suspended users were reportedly shared privately, and an even larger data dump with the data of "tens of millions" of other users may have come from the same vulnerability. The owner of a hacking forum called Breached told BleepingComputer that it was responsible for exploiting the weakness (originally obtained from another hacker called "Devil") and dumping the user records. It said that it also obtained 1.4 million Twitter profiles for suspended accounts via another API, but only shared those privately among a few individuals.
AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
Dossou, Bonaventure F. P., Tonja, Atnafu Lambebo, Yousuf, Oreen, Osei, Salomey, Oppong, Abigail, Shode, Iyanuoluwa, Awoyomi, Oluwabusayo Olufunke, Emezue, Chris Chinenye
In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing (NLP) tasks. However, pre-training these large multilingual language models requires a lot of training data, which is not available for African Languages. Active learning is a semi-supervised learning algorithm, in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially multilingual language models pretraining, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that AfroLM is able to generalize well across various domains. We release the source code and the datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.
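To make the idea of "identifying the most beneficial samples" concrete, here is a minimal sketch of uncertainty sampling, one common active-learning selection rule. It is not the self-active learning framework described in the abstract, whose details are not given here; the function names `entropy` and `select_most_uncertain` and the toy probabilities are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs: List[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k pool samples with the highest predictive entropy.

    Assumption: pool_probs[i] is the model's predicted class distribution
    for unlabeled sample i. Higher entropy means the model is less certain,
    so the sample is treated as more useful to train on next.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy predictions for three unlabeled samples (hypothetical values).
    pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
    print(select_most_uncertain(pool, 2))
```

In a full loop, the selected samples would be added to the training set, the model retrained, and the selection repeated on the remaining pool.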
Incorporating Dynamic Semantics into Pre-Trained Language Model for Aspect-based Sentiment Analysis
Zhang, Kai, Zhang, Kun, Zhang, Mengdi, Zhao, Hongke, Liu, Qi, Wu, Wei, Chen, Enhong
Aspect-based sentiment analysis (ABSA) predicts sentiment polarity towards a specific aspect in the given sentence. While pre-trained language models such as BERT have achieved great success, incorporating dynamic semantic changes into ABSA remains challenging. To this end, in this paper, we propose to address this problem by Dynamic Re-weighting BERT (DR-BERT), a novel method designed to learn dynamic aspect-oriented semantics for ABSA. Specifically, we first take the Stack-BERT layers as a primary encoder to grasp the overall semantic of the sentence and then fine-tune it by incorporating a lightweight Dynamic Re-weighting Adapter (DRA). Note that the DRA can pay close attention to a small region of the sentences at each step and re-weigh the vitally important words for better aspect-aware sentiment understanding. Finally, experimental results on three benchmark datasets demonstrate the effectiveness and the rationality of our proposed model and provide good interpretable insights for future semantic modeling.
EU confirms multiple ongoing investigations into TikTok data practices
The president of the European Commission, the executive branch of the European Union, has confirmed there are multiple ongoing investigations into TikTok. The probes concern the transfer of EU citizens' data to China and targeted advertising aimed at minors. Investigators are seeking to ensure that TikTok meets General Data Protection Regulation (GDPR) requirements. "The data practices of TikTok, including with respect to international data transfers, are the object of several ongoing proceedings," Ursula von der Leyen wrote in a letter shared by Federal Communications Commissioner Brendan Carr. "This includes an investigation by the Irish [Data Protection Commission] about TikTok's compliance with several GDPR requirements, including as regards data transfers to China and the processing of data of minors, and litigation before the Dutch courts (in particular concerning targeted advertising regarding minors and data transfers to China)."
Automatic extraction of materials and properties from superconductors scientific literature
Foppiano, Luca, de Castro, Pedro Baptista, Suarez, Pedro Ortiz, Terashima, Kensei, Takano, Yoshihiko, Ishii, Masashi
The automatic extraction of materials and related properties from the scientific literature is gaining attention in data-driven materials science (Materials Informatics). In this paper, we discuss Grobid-superconductors, our solution for automatically extracting superconductor material names and respective properties from text. Built as a Grobid module, it combines machine learning and heuristic approaches in a multi-step architecture that supports input data as raw text or PDF documents. Using Grobid-superconductors, we built SuperCon2, a database of 40324 materials and properties records from 37700 papers. The material (or sample) information is represented by name, chemical formula, and material class, and is characterized by shape, doping, substitution variables for components, and substrate as adjoined information. The properties include the Tc superconducting critical temperature and, when available, applied pressure with the Tc measurement method.
Smart Agriculture : A Novel Multilevel Approach for Agricultural Risk Assessment over Unstructured Data
Najmi, Hasna, Mikram, Mounia, Rhanoui, Maryem, Yousfi, Siham
Detecting opportunities and threats from massive text data is a challenging task for most. Traditionally, companies would rely mainly on structured data to detect and predict risks, losing a huge amount of information that could be extracted from unstructured text data. Fortunately, artificial intelligence came to remedy this issue by innovating in data extraction and processing techniques, allowing us to understand and make use of Natural Language data and turning it into structures that a machine can process and extract insight from. Uncertainty refers to a state of not knowing what will happen in the future. This paper aims to leverage natural language processing and machine learning techniques to model uncertainties and evaluate the risk level in each uncertainty cluster using massive text data.
Twitter turmoil and staff exodus aggravate security concerns
Washington - Twitter's owner Elon Musk has pledged the platform will not become a "hellscape," but experts fear a staff exodus following mass layoffs may have devastated its ability to combat misinformation, impersonation and data theft. Twitter devolved into what campaigners described as a cesspit of falsehoods and hate speech after recent layoffs cut half the company's 7,500 staff and fake accounts proliferated following its botched rollout of a paid verification system.
Unsupervised extraction, labelling and clustering of segments from clinical notes
Zelina, Petr, Halámková, Jana, Nováček, Vít
This work is motivated by the scarcity of tools for accurate, unsupervised information extraction from unstructured clinical notes in computationally underrepresented languages, such as Czech. We introduce a stepping stone to a broad array of downstream tasks such as summarisation or integration of individual patient records, extraction of structured information for national cancer registry reporting or building of semi-structured semantic patient representations for computing patient embeddings. More specifically, we present a method for unsupervised extraction of semantically-labelled textual segments from clinical notes and test it out on a dataset of Czech breast cancer patients, provided by Masaryk Memorial Cancer Institute (the largest Czech hospital specialising in oncology). Our goal was to extract, classify (i.e. label) and cluster segments of the free-text notes that correspond to specific clinical features (e.g., family background, comorbidities or toxicities). The presented results demonstrate the practical relevance of the proposed approach for building more sophisticated extraction and analytical pipelines deployed on Czech clinical notes.
UniMSE: Towards Unified Multimodal Sentiment Analysis and Emotion Recognition
Hu, Guimin, Lin, Ting-En, Zhao, Yi, Lu, Guangming, Wu, Yuchuan, Li, Yongbin
Multimodal sentiment analysis (MSA) and emotion recognition in conversation (ERC) are key research topics for computers to understand human behaviors. From a psychological perspective, emotions are the expression of affect or feelings during a short period, while sentiments are formed and held for a longer period. However, most existing works study sentiment and emotion separately and do not fully exploit the complementary knowledge behind the two. In this paper, we propose a multimodal sentiment knowledge-sharing framework (UniMSE) that unifies MSA and ERC tasks from features, labels, and models. We perform modality fusion at the syntactic and semantic levels and introduce contrastive learning between modalities and samples to better capture the difference and consistency between sentiments and emotions. Experiments on four public benchmark datasets, MOSI, MOSEI, MELD, and IEMOCAP, demonstrate the effectiveness of the proposed method and achieve consistent improvements compared with state-of-the-art methods.
Can an AI recognize my opinion from tweets?
To make a long story short: in principle, yes. And if my colleagues at the University of Edinburgh are to be believed, it even works in cases where an opinion is not explicitly expressed. In fact, the terms "sentiment analysis" and "opinion mining" are nothing new to people who deal with language technology. However, the labels are often a marketing ploy: what sounds like opinion analysis is usually nothing more than a polarity analysis of the feelings conveyed by a text. In other words, it analyzes whether a social media post has positive or negative vibes.
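As a rough illustration of what such a polarity analysis amounts to, here is a minimal lexicon-based sketch. The word lists and the function name `polarity` are hypothetical and far too small for real use; production systems typically rely on much larger lexicons or trained classifiers.

```python
import re

# Hypothetical toy lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}


def polarity(text: str) -> str:
    """Label a text as positive, negative, or neutral by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(polarity("I love this great phone"))
    print(polarity("What a terrible, awful day"))
```

Note that a counter like this only captures the emotional tone of the wording. It says nothing about which opinion the author holds on a given topic, which is the distinction the text draws.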