Information Extraction
Classifying COVID-19 Related Tweets for Fake News Detection and Sentiment Analysis with BERT-based Models
Bounaama, Rabia, Abderrahim, Mohammed El Amine
This paper describes the participation of our team "techno" in the CERIST'22 shared tasks. We used the available dataset "task1.c", which relates to the COVID-19 pandemic. It comprises 4128 tweets for the sentiment analysis task and 8661 tweets for the fake news detection task. We combined natural language processing tools with BERT (Bidirectional Encoder Representations from Transformers), one of the most renowned pre-trained language models. The results show the efficacy of pre-trained language models: we attained an accuracy of 0.93 for the sentiment analysis task and 0.90 for the fake news detection task.
An Information Extraction Study: Take In Mind the Tokenization!
Theodoropoulos, Christos, Moens, Marie-Francine
Current research on the advantages and trade-offs of using characters, instead of tokenized text, as input for deep learning models, has evolved substantially. New token-free models remove the traditional tokenization step; however, their efficiency remains unclear. Moreover, the effect of tokenization is relatively unexplored in sequence tagging tasks. To this end, we investigate the impact of tokenization when extracting information from documents and present a comparative study and analysis of subword-based and character-based models. Specifically, we study Information Extraction (IE) from biomedical texts. The main outcome is twofold: tokenization patterns can introduce inductive bias that results in state-of-the-art performance, and the character-based models produce promising results; thus, transitioning to token-free IE models is feasible.
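To make the comparison in the abstract above concrete, the following minimal sketch shows how the same entity annotation turns into different tag sequences depending on the input unit. It is an illustration only, not the authors' method: `char_tokens`, `word_tokens`, `bio_tags`, the example sentence, and the `PROTEIN` label are all hypothetical, and the regular-expression splitter is a simple stand-in for a real subword tokenizer.

```python
import re

def char_tokens(text):
    """Token-free view: every character is its own input unit."""
    return [(ch, i, i + 1) for i, ch in enumerate(text)]

def word_tokens(text):
    """Stand-in for a tokenizer: runs of word characters or single punctuation marks."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def bio_tags(tokens, spans):
    """Project character-level entity spans (start, end, label) onto tokens as BIO tags."""
    tags = []
    for _, start, end in tokens:
        tag = "O"
        for span_start, span_end, label in spans:
            # A token is tagged only if it lies fully inside the entity span.
            if start >= span_start and end <= span_end:
                tag = ("B-" if start == span_start else "I-") + label
                break
        tags.append(tag)
    return tags

# Hypothetical biomedical example: the entity "IL-2" covers characters 0..4.
text = "IL-2 activates T cells"
spans = [(0, 4, "PROTEIN")]

word_level = bio_tags(word_tokens(text), spans)
char_level = bio_tags(char_tokens(text), spans)

print(list(zip([t for t, _, _ in word_tokens(text)], word_level)))
print(len(char_level), "character-level tags")
```

In this sketch the tokenizer splits "IL-2" into three pieces, so the model must learn a B/I/I pattern that depends on the tokenization, while the character-level view needs one tag per character and produces much longer sequences.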
Energy-hungry TikTok data centre harming our Ukraine ammunition production plans, CEO says
One of Europe's largest ammunition manufacturers has said efforts to meet surging demand from the war in Ukraine have been stymied by a new TikTok data centre that is monopolising electricity in the region close to its biggest factory. The chief executive of Nammo, which is co-owned by the Norwegian government, said a planned expansion of its largest factory in central Norway hit a roadblock due to a lack of surplus energy, with the construction of TikTok's new data centre using up electricity in the local area. "We are concerned because we see our future growth is challenged by the storage of cat videos," Morten Brandtzæg told the Financial Times. Demand for artillery rounds is 15 times higher than normal and Europe's munitions industry needs to invest €2bn in new factories to keep up with Ukraine's needs, according to Brandtzæg. By some estimates, Ukraine is firing 6,000 to 7,000 artillery shells a day and is facing ammunition shortages after more than a year of war.
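Sejarah dan Perkembangan Teknik Natural Language Processing (NLP) Bahasa Indonesia: Tinjauan tentang sejarah, perkembangan teknologi, dan aplikasi NLP dalam bahasa Indonesia [History and Development of Natural Language Processing (NLP) Techniques for Indonesian: A review of the history, technological development, and applications of NLP in the Indonesian language]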
This study provides an overview of the history of Natural Language Processing (NLP) for the Indonesian language, with a focus on the basic technologies, methods, and practical applications that have been developed. This review covers developments in basic NLP technologies such as stemming, part-of-speech tagging, and related methods; practical applications in cross-language information retrieval systems, information extraction, and sentiment analysis; and methods and techniques used in Indonesian language NLP research, such as machine learning, statistics-based machine translation, and conflict-based approaches. This study also explores the application of NLP in Indonesian language industry and research and identifies challenges and opportunities in Indonesian language NLP research and development. Recommendations for future Indonesian language NLP research and development include developing more efficient methods and technologies, expanding NLP applications, increasing sustainability, further research into the potential of NLP, and promoting interdisciplinary collaboration. It is hoped that this review will help researchers, practitioners, and the government to understand the development of Indonesian language NLP and identify opportunities for further research and development.
Evaluating the Role of Target Arguments in Rumour Stance Classification
Considering a conversation thread, stance classification aims to identify the opinion (e.g. agree or disagree) of replies towards a given target. The target of the stance is expected to be an essential component in this task, being one of the main factors that make it different from sentiment analysis. However, a recent study shows that a target-oblivious model outperforms target-aware models, suggesting that targets are not useful when predicting stance. This paper re-examines this phenomenon for rumour stance classification (RSC) on social media, where a target is a rumour story implied by the source tweet in the conversation. We propose adversarial attacks in the test data, aiming to assess the models' robustness and evaluate the role of the data in the models' performance. Results show that state-of-the-art models, including approaches that use the entire conversation thread, overly rely on superficial signals. Our hypothesis is that the naturally high occurrence of target-independent direct replies in RSC (e.g. "this is fake" or just "fake") results in the impressive performance of target-oblivious models, highlighting the risk of target instances being treated as noise during training.
Sentiment Analysis With BigQuery ML - Liwaiwai
We recently announced BigQuery support for sparse features, which helps users store and process sparse features efficiently. That functionality enables users to represent sparse tensors and train machine learning models directly in the BigQuery environment. Being able to represent sparse tensors is useful because they are used extensively in encoding schemes like TF-IDF as part of data pre-processing in NLP applications, and for pre-processing images with a lot of dark pixels in computer vision applications. There are numerous applications of sparse features, such as text generation and sentiment analysis. In this blog, we'll demonstrate how to perform sentiment analysis with the sparse features in BigQuery ML by training machine learning models and running inference on a public dataset.
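As a rough illustration of why TF-IDF encodings are naturally sparse, the sketch below stores each document as a short list of (feature index, weight) pairs instead of a dense vector over the whole vocabulary. It is plain Python and does not use or reproduce any BigQuery API; `build_vocab`, `tfidf_sparse`, the whitespace tokenization, the unsmoothed `log(n / df)` weighting, and the toy reviews are all assumptions made for the example.

```python
import math
from collections import Counter

def build_vocab(docs):
    """Map each distinct lowercase term to an integer feature index."""
    vocab = {}
    for doc in docs:
        for term in doc.lower().split():
            vocab.setdefault(term, len(vocab))
    return vocab

def tfidf_sparse(docs):
    """Return (vocab, rows), where each row holds only the non-zero (index, weight) pairs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = build_vocab(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = sorted(
            (vocab[term], (count / len(tokens)) * math.log(n_docs / doc_freq[term]))
            for term, count in counts.items()
            if doc_freq[term] < n_docs  # idf is 0 for terms found in every document
        )
        rows.append(row)
    return vocab, rows

# Hypothetical toy reviews.
docs = ["great movie", "terrible movie", "great acting"]
vocab, rows = tfidf_sparse(docs)
for doc, row in zip(docs, rows):
    print(doc, "->", row)
```

Each row here has two entries even though the vocabulary has four terms; with a realistic vocabulary of tens of thousands of terms, the gap between stored entries and dense vector length is what makes a sparse representation worthwhile.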
Tribe or Not? Critical Inspection of Group Differences Using TribalGram
Ahn, Yongsu, Yan, Muheng, Lin, Yu-Ru, Chung, Wen-Ting, Hwa, Rebecca
With the rise of big data, artificial intelligence (AI), and data mining techniques, group analysis has increasingly become a powerful tool in many applications, ranging from policy-making and direct marketing to education and healthcare. For example, an important analysis strategy is group profiling, which extracts and describes the characteristics of groups of people [40]; it has been commonly used for customized recommendations to overcome sparse and missing personal data [25]. The same strategy is also used for mining social media, educational, and healthcare data to understand the shared characteristics of online communities or student/patient cohorts [15, 51, 100]. While it may help to support public and private services or product creations that are better tailored to different communities, group profiles resulting from mathematical inference are typically not valid for every individual regarded as a member of the group (this is known as non-distributive group profiles) [40]. The shared group characteristics extracted from data can have social ramifications such as stereotyping and stigmatization, or lead to pernicious consequences in decision making, because individuals might be judged by group characteristics they do not possess [24, 56, 58].
Tollywood Emotions: Annotation of Valence-Arousal in Telugu Song Lyrics
Shanker, R Guru Ravi, Gupta, B Manikanta, Koushik, BV, Alluri, Vinoo
Emotion recognition from a given music track has heavily relied on acoustic features, social tags, and metadata, but has seldom focused on lyrics. There are no datasets of Indian language songs that contain both valence and arousal manual ratings of lyrics. We present a new manually annotated dataset of Telugu songs' lyrics collected from Spotify with valence and arousal annotated on a discrete scale. A fairly high inter-annotator agreement was observed for both valence and arousal. Subsequently, we create two music emotion recognition models by using two classification techniques to identify valence, arousal and respective emotion quadrant from lyrics. A support vector machine (SVM) with term frequency-inverse document frequency (TF-IDF) features and a fine-tuned pre-trained XLMRoBERTa (XLM-R) model were used for the valence, arousal and quadrant classification tasks. Fine-tuned XLMRoBERTa performs better than the SVM, improving macro-averaged F1-scores from 54.69%, 67.61% and 34.13% to 77.90%, 80.71% and 58.33% for valence, arousal and quadrant classification, respectively, on 10-fold cross-validation. In addition, we compare our lyrics annotations with Spotify's annotations of valence and energy (same as arousal), which are based on entire music tracks. The implications of our findings are discussed. Finally, we make the dataset publicly available with lyrics, annotations and Spotify IDs.
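The quadrant task and the macro-averaged F1 metric mentioned above can be sketched in a few lines. This is a hedged illustration, not the authors' code: the `quadrant` mapping assumes the common circumplex convention (Q1 = high valence and high arousal, numbered counter-clockwise), which the abstract does not spell out, and `macro_f1` and the toy labels are hypothetical.

```python
def quadrant(valence_high, arousal_high):
    """Map binary valence/arousal to a quadrant (assumed convention, see lead-in)."""
    if valence_high and arousal_high:
        return "Q1"  # e.g. happy, excited
    if not valence_high and arousal_high:
        return "Q2"  # e.g. angry, tense
    if not valence_high and not arousal_high:
        return "Q3"  # e.g. sad, depressed
    return "Q4"      # e.g. calm, relaxed

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold and predicted quadrant labels for four songs.
gold = ["Q1", "Q1", "Q2", "Q3"]
pred = ["Q1", "Q2", "Q2", "Q3"]
print(quadrant(True, False), round(macro_f1(gold, pred), 4))
```

Because every class contributes equally to the average regardless of how many songs it contains, macro-averaging penalizes a model that ignores rare quadrants, which is one reason it is a common choice for imbalanced emotion labels.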
What is sentiment analysis? Using NLP and ML to extract meaning
Sentiment analysis is an analytical technique that uses statistics, natural language processing, and machine learning to determine the emotional meaning of communications. Companies use sentiment analysis to evaluate customer messages, call center interactions, online reviews, social media posts, and other content. Sentiment analysis can track changes in attitudes towards companies, products, or services, or individual features of those products or services. One of the most prominent examples of sentiment analysis on the Web today is the Hedonometer, a project of the University of Vermont's Computational Story Lab. The group analyzes more than 50 million English-language tweets every single day, about a tenth of Twitter's total traffic, to calculate a daily happiness score.
Cross-domain Sentiment Classification in Spanish
Estienne, Lautaro, Vera, Matias, Vega, Leonardo Rey
Sentiment Classification is a fundamental task in the field of Natural Language Processing, and has very important academic and commercial applications. It aims to automatically predict the degree of sentiment present in a text that contains opinions and subjectivity at some level, like product and movie reviews, or tweets. This can be really difficult to accomplish, in part because different domains of text contain different words and expressions. In addition, the difficulty increases when text is written in a non-English language, due to the lack of databases and resources. As a consequence, several cross-domain and cross-language techniques are often applied to this task in order to improve the results. In this work we study the ability of a classification system trained with a large database of product reviews to generalize to different Spanish domains. Reviews were collected from the MercadoLibre website from seven Latin American countries, allowing the creation of a large and balanced dataset. Results suggest that generalization across domains is feasible, though very challenging, when trained with these product reviews, and can be improved by pre-training and fine-tuning the classification model.