Mejova, Yelena
Language-Agnostic Modeling of Source Reliability on Wikipedia
D'Ignazi, Jacopo, Kaltenbrunner, Andreas, Mejova, Yelena, Tizzani, Michele, Kalimeri, Kyriaki, Beiró, Mariano, Aragón, Pablo
Over the last few years, content verification through reliable sources has become a fundamental need to combat disinformation. Here, we present a language-agnostic model designed to assess the reliability of sources across multiple language editions of Wikipedia. Utilizing editorial activity data, the model evaluates source reliability within different articles of varying controversiality such as Climate Change, COVID-19, History, Media, and Biology topics. Crafting features that express domain usage across articles, the model effectively predicts source reliability, achieving an F1 Macro score of approximately 0.80 for English and other high-resource languages. For mid-resource languages, we achieve 0.65 while the performance of low-resource languages varies; in all cases, the time the domain remains present in the articles (which we dub as permanence) is one of the most predictive features. We highlight the challenge of maintaining consistent model performance across languages of varying resource levels and demonstrate that adapting models from higher-resource languages can improve performance. This work contributes not only to Wikipedia's efforts in ensuring content verifiability but in ensuring reliability across diverse user-generated content in various language communities.
Leave no Place Behind: Improved Geolocation in Humanitarian Documents
Belliardo, Enrico M., Kalimeri, Kyriaki, Mejova, Yelena
Geographical location is a crucial element of humanitarian response, outlining vulnerable populations, ongoing events, and available resources. Latest developments in Natural Language Processing may help in extracting vital information from the deluge of reports and documents produced by the humanitarian sector. However, the performance and biases of existing state-of-the-art information extraction tools are unknown. In this work, we develop annotated resources to fine-tune the popular Named Entity Recognition (NER) tools Spacy and roBERTa to perform geotagging of humanitarian texts. We then propose a geocoding method FeatureRank which links the candidate locations to the GeoNames database. We find that not only does the humanitarian-domain data improves the performance of the classifiers (up to F1 = 0.92), but it also alleviates some of the bias of the existing tools, which erroneously favor locations in the Western countries. Thus, we conclude that more resources from non-Western documents are necessary to ensure that off-the-shelf NER systems are suitable for the deployment in the humanitarian sector.
Comfort Foods and Community Connectedness: Investigating Diet Change during COVID-19 Using YouTube Videos on Twitter
Mejova, Yelena, Manikonda, Lydia
Unprecedented lockdowns at the start of the COVID-19 pandemic have drastically changed the routines of millions of people, potentially impacting important health-related behaviors. In this study, we use YouTube videos embedded in tweets about diet, exercise and fitness posted before and during COVID-19 to investigate the influence of the pandemic lockdowns on diet and nutrition. In particular, we examine the nutritional profile of the foods mentioned in the transcript, description and title of each video in terms of six macronutrients (protein, energy, fat, sodium, sugar, and saturated fat). These macronutrient values were further linked to demographics to assess if there are specific effects on those potentially having insufficient access to healthy sources of food. Interrupted time series analysis revealed a considerable shift in the aggregated macronutrient scores before and during COVID-19. In particular, whereas areas with lower incomes showed decrease in energy, fat, and saturated fat, those with higher percentage of African Americans showed an elevation in sodium. Word2Vec word similarities and odds ratio analysis suggested a shift from popular diets and lifestyle bloggers before the lockdowns to the interest in a variety of healthy foods, communal sharing of quick and easy recipes, as well as a new emphasis on comfort foods. To the best of our knowledge, this work is novel in terms of linking attention signals in tweets, content of videos, their nutrients profile, and aggregate demographics of the users. The insights made possible by this combination of resources are important for monitoring the secondary health effects of social distancing, and informing social programs designed to alleviate these effects.
Reports of the Workshops Held at the 2018 International AAAI Conference on Web and Social Media
Editor, Managing (AAAI) | An, Jisun (Qatar Computing Research Institute) | Chunara, Rumi (New York University) | Crandall, David J. (Indiana University) | Frajberg, Darian (Politecnico di Milano) | French, Megan (Stanford University) | Jansen, Bernard J. (Qatar Computing Research Institute) | Kulshrestha, Juhi (GESIS - Leibniz Institute for the Social Sciences) | Mejova, Yelena (Qatar Computing Research Institute) | Romero, Daniel M. (University of Michigan) | Salminen, Joni (Qatar Computing Research Institute) | Sharma, Amit (Microsoft Research India) | Sheth, Amit (Wright State University) | Tan, Chenhao (University of Colorado Boulder) | Taylor, Samuel Hardman (Cornell University) | Wijeratne, Sanjaya (Wright State University)
The Workshop Program of the Association for the Advancement of Artificial Intelligence’s 12th International Conference on Web and Social Media (AAAI-18) was held at Stanford University, Stanford, California USA, on Monday, June 25, 2018. There were fourteen workshops in the program: Algorithmic Personalization and News: Risks and Opportunities; Beyond Online Data: Tackling Challenging Social Science Questions; Bridging the Gaps: Social Media, Use and Well-Being; Chatbot; Data-Driven Personas and Human-Driven Analytics: Automating Customer Insights in the Era of Social Media; Designed Data for Bridging the Lab and the Field: Tools, Methods, and Challenges in Social Media Experiments; Emoji Understanding and Applications in Social Media; Event Analytics Using Social Media Data; Exploring Ethical Trade-Offs in Social Media Research; Making Sense of Online Data for Population Research; News and Public Opinion; Social Media and Health: A Focus on Methods for Linking Online and Offline Data; Social Web for Environmental and Ecological Monitoring and The ICWSM Science Slam. Workshops were held on the first day of the conference. Workshop participants met and discussed issues with a selected focus — providing an informal setting for active exchange among researchers, developers, and users on topics of current interest. Organizers from nine of the workshops submitted reports, which are reproduced in this report. Brief summaries of the other five workshops have been reproduced from their website descriptions.
Crossing Media Streams with Sentiment: Domain Adaptation in Blogs, Reviews and Twitter
Mejova, Yelena (The University of Iowa) | Srinivasan, Padmini (The University of Iowa)
Most sentiment analysis studies address classification of a single source of data such as reviews or blog posts. However, the multitude of social media sources available for text analysis lends itself naturally to domain adaptation. In this study, we create a dataset spanning three social media sources -- blogs, reviews, and Twitter -- and a set of 37 common topics. We first examine sentiments expressed in these three sources while controlling for the change in topic. Then using this multi-dimensional data we show that when classifying documents in one source (a target source), models trained on other sources of data can be as good as or even better than those trained on the target data. That is, we show that models trained on some social media sources are generalizable to others. All source adaptation models we implement show reviews and Twitter to be the best sources of training data. It is especially useful to know that models trained on Twitter data are generalizable, since, unlike reviews, Twitter is more topically diverse.
Exploring Feature Definition and Selection for Sentiment Classifiers
Mejova, Yelena (University of Iowa) | Srinivasan, Padmini (University of Iowa)
In this paper, we systematically explore feature definition and selection strategies for sentiment polarity classification. We begin by exploring basic questions, such as whether to use stemming, term frequency versus binary weighting, negation-enriched features, n-grams or phrases. We then move onto more complex aspects including feature selection using frequency-based vocabulary trimming, part-of-speech and lexicon selection (three types of lexicons), as well as using expected Mutual Information (MI). Using three product and movie review datasets of various sizes, we show, for example, that some techniques are more beneficial for larger datasets than the smaller. A classifier trained on only few features ranked high by MI outperformed one trained on all features in large datasets, yet in small dataset this did not prove to be true. Finally, we perform a space and computation cost analysis to further understand the merits of various feature types.