chronological split
A Framework for Monitoring and Retraining Language Models in Real-World Applications
Kasundra, Jaykumar, Schulz, Claudia, Mirsafian, Melicaalsadat, Skylaki, Stavroula
The typical model development lifecycle consists of four phases: 1) problem scoping, 2) data definition and collection, 3) model training and iterative improvement through error analysis, and 4) model deployment in production and implementation of continuous monitoring and retraining [1]. While the first three phases are typically performed in an offline setting, model deployment represents the critical step where the ML model becomes available in a production environment, a live application, where it needs to process live data and ideally sustain performance over time to keep delivering value. Model monitoring refers to the process of evaluating the quality of the production data and the performance of the model according to relevant metrics over time. When either data quality or model performance does not meet predefined criteria, a monitoring warning can be triggered to alert the model owners. Defining an effective model monitoring and retraining strategy is key to successful ML model deployment since it can safeguard model quality over prolonged periods of time.
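The warning rule described in this abstract (a metric falling short of a predefined criterion triggers an alert) can be sketched roughly as follows. This is only an illustration, not the framework from the paper: the function name `check_monitoring`, the metric names and the threshold values are all hypothetical.

```python
def check_monitoring(metrics, thresholds):
    """Return one warning message per metric that misses its predefined minimum.

    Both arguments map a metric name to a number. The names and minimum
    values used below are illustrative assumptions, not taken from the paper.
    """
    warnings = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A metric that was never reported is itself worth flagging.
            warnings.append(f"{name}: no value reported")
        elif value < minimum:
            warnings.append(f"{name}: {value:.2f} is below the minimum {minimum:.2f}")
    return warnings


# Hypothetical snapshot of one monitoring window.
thresholds = {"macro_f1": 0.80, "non_empty_input_rate": 0.95}
metrics = {"macro_f1": 0.71, "non_empty_input_rate": 0.99}

alerts = check_monitoring(metrics, thresholds)
for message in alerts:
    print("MONITORING WARNING:", message)
```

In practice the alert would be routed to the model owners rather than printed, and a retraining decision could be tied to repeated warnings; both choices are left open here.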
Examining Temporal Bias in Abusive Language Detection
Jin, Mali, Mu, Yida, Maynard, Diana, Bontcheva, Kalina
In recent years, researchers have developed a huge variety of machine learning models that can automatically detect abusive language (Mishra et al. 2019; Aurpa, Sadik, and Ahmed 2022; Das and Mukherjee 2023; Alrashidi, Jamal, and Alkhathlan 2023). However, these models may be subject to temporal bias, which can lead to a decrease in the accuracy of abusive language detection models, potentially allowing abusive language to be undetected or falsely detected. Temporal bias arises from differences in populations and ... Previous work identified temporal bias in an Italian hate speech data set associated with immigrants (Florio et al. 2020). However, they have yet to explore temporal factors affecting predictive performance from a multilingual perspective. In this paper, we explore temporal bias in 5 different abusive data sets that span varying time periods, in 4 languages (English, Spanish, Italian, and Chinese). Specifically, we investigate the following core research questions: RQ1: How does the magnitude of temporal bias vary across different data sets such as language, time span and collection methods?
Examining Temporalities on Stance Detection towards COVID-19 Vaccination
Mu, Yida, Jin, Mali, Bontcheva, Kalina, Song, Xingyi
Previous studies have highlighted the importance of vaccination as an effective strategy to control the transmission of the COVID-19 virus. It is crucial for policymakers to have a comprehensive understanding of the public's stance towards vaccination on a large scale. However, attitudes towards COVID-19 vaccination, such as pro-vaccine or vaccine hesitancy, have evolved over time on social media. Thus, it is necessary to account for possible temporal shifts when analysing these stances. This study aims to examine the impact of temporal concept drift on stance detection towards COVID-19 vaccination on Twitter. To this end, we evaluate a range of transformer-based models using chronological splits (the training, validation and testing sets are divided in order of time) and random splits (the three sets are divided at random) of social media data. Our findings demonstrate significant discrepancies in model performance when comparing random and chronological splits across all monolingual and multilingual datasets. Chronological splits significantly reduce the accuracy of stance classification. Therefore, real-world stance detection approaches need to be further refined to incorporate temporal factors as a key consideration.
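The two splitting schemes contrasted in this abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the 70/10/20 proportions, the record layout with a `date` field, the toy data and all function names are hypothetical.

```python
import random
from datetime import date, timedelta


def _cut(items, train_frac, val_frac):
    """Cut an already ordered list into train, validation and test parts."""
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def chronological_split(records, train_frac=0.7, val_frac=0.1):
    """Sort by time first: oldest records train, the most recent ones test."""
    ordered = sorted(records, key=lambda r: r["date"])
    return _cut(ordered, train_frac, val_frac)


def random_split(records, train_frac=0.7, val_frac=0.1, seed=0):
    """Shuffle first, so each part can mix records from all time periods."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return _cut(shuffled, train_frac, val_frac)


# Hypothetical toy data: one post per day over ten days.
records = [
    {"date": date(2021, 1, 1) + timedelta(days=i), "text": f"post {i}"}
    for i in range(10)
]

train, val, test = chronological_split(records)
rand_train, rand_val, rand_test = random_split(records)

# With a chronological split, every training post predates every test post.
print(max(r["date"] for r in train) < min(r["date"] for r in test))
```

Evaluating the same model on both splits and comparing the scores is the kind of comparison the abstract reports; the random split tends to place posts from the same period in both training and test sets.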
It's about Time: Rethinking Evaluation on Rumor Detection Benchmarks using Chronological Splits
Mu, Yida, Bontcheva, Kalina, Aletras, Nikolaos
New events emerge over time influencing the topics of rumors in social media. Current rumor detection benchmarks use random splits as training, development and test sets which typically results in topical overlaps. Consequently, models trained on random splits may not perform well on rumor classification on previously unseen topics due to the temporal concept drift. In this paper, we provide a re-evaluation of classification models on four popular rumor detection benchmarks considering chronological instead of random splits. Our experimental results show that the use of random splits can significantly overestimate predictive performance across all datasets and models. Therefore, we suggest that rumor detection models should always be evaluated using chronological splits for minimizing topical overlaps.
Are Learned Molecular Representations Ready For Prime Time?
Yang, Kevin, Swanson, Kyle, Jin, Wengong, Coley, Connor, Eiden, Philipp, Gao, Hua, Guzman-Perez, Angel, Hopper, Timothy, Kelley, Brian, Mathea, Miriam, Palmer, Andrew, Settels, Volker, Jaakkola, Tommi, Jensen, Klavs, Barzilay, Regina
Advancements in neural machinery have led to a wide range of algorithmic solutions for molecular property prediction. Two classes of models in particular have yielded promising results: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and graph convolutional neural networks that construct a learned molecular representation by operating on the graph structure of the molecule. However, recent literature has yet to clearly determine which of these two methods is superior when generalizing to new chemical space. Furthermore, prior research has rarely examined these new models in industry research settings in comparison to existing employed models. In this paper, we benchmark models extensively on 19 public and 15 proprietary industrial datasets spanning a wide variety of chemical endpoints. In addition, we introduce a graph convolutional model that consistently outperforms models using fixed molecular descriptors as well as previous graph neural architectures on both public and proprietary datasets. Our empirical findings indicate that while approaches based on these representations have yet to reach the level of experimental reproducibility, our proposed model nevertheless offers significant improvements over models currently used in industrial workflows.