You may think that social media is all about posting cute pictures of your kiddos, sharing the latest meme, or documenting your latest adventure. But I'm here to tell you that it's actually part of a giant, intricate web of data that you and I contribute to daily with everything we do online. We don't always like to look at charts and graphs, the stuff people call data. We'd rather look at the cute picture of a puppy, right? While data may not be as sexy or viral a topic as a puppy, it's a big deal.
My colleagues, friends, and students are at the point of rolling their eyes when I say it: context. But it is so important, and it's the main reason your Twitter analysis sucks. What do I mean by context? If you are going to datafy something – turn a tweet, a representation of a thought, emotion, or idea, into data – then you need to think about the context of a) the user and why they tweeted, b) the dataset you are looking at, c) the problem you are trying to solve by datafying that tweet in the first place, and d) the tools you are using. BECAUSE NATURAL LANGUAGE PROCESSING NEEDS TO BE BESPOKE AND YOUR PRECONCEIVED ASSUMPTIONS WILL TRIP YOU UP.
Sentiment analysis research has predominantly been carried out on English texts. Thus, many sentiment resources exist for English, but fewer for other languages. Approaches to improving sentiment analysis in a resource-poor focus language include: (a) translating the focus-language text into a resource-rich language such as English and applying a powerful English sentiment analysis system to it, and (b) translating resources such as sentiment-labeled corpora and sentiment lexicons from English into the focus language and using them as additional resources in the focus-language sentiment analysis system. In this paper we systematically examine both options. We use Arabic social media posts as a stand-in for the focus-language text. We show that sentiment analysis of English translations of Arabic texts produces results competitive with direct Arabic sentiment analysis. We show that Arabic sentiment analysis systems benefit from the use of automatically translated English sentiment lexicons. We also conduct manual annotation studies to examine why the sentiment of a translation differs from the sentiment of the source word or text; this is especially relevant for building better automatic translation systems. In the process, we create a state-of-the-art Arabic sentiment analysis system, a new dialectal Arabic sentiment lexicon, and the first Arabic–English parallel corpus independently annotated for sentiment by Arabic and English speakers.
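To make option (b) concrete, here is a minimal sketch of lexicon-based sentiment scoring with a lexicon translated into the focus language. The tiny Arabic lexicon and the scoring rule below are my own illustrative assumptions, not the paper's actual resources or method; a real translated lexicon would have thousands of entries and the system would handle tokenization, dialect, and negation.

```python
# Illustrative sketch only: score focus-language (Arabic) text using a
# tiny hand-made sentiment lexicon standing in for one machine-translated
# from English. Entries and weights are made up for the example.

toy_lexicon = {
    "جميل": 1.0,    # "beautiful" -> positive
    "رائع": 0.8,    # "wonderful" -> positive
    "سيء": -0.9,    # "bad"       -> negative
    "حزين": -0.7,   # "sad"       -> negative
}

def lexicon_sentiment(text, lexicon):
    """Sum the lexicon scores of whitespace tokens and map the total
    to a coarse sentiment label. Unknown tokens contribute 0."""
    score = sum(lexicon.get(tok, 0.0) for tok in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("يوم جميل رائع", toy_lexicon))   # two positive words
print(lexicon_sentiment("خبر سيء حزين", toy_lexicon))    # two negative words
```

The appeal of this approach is that the expensive part (building the lexicon) is done once by machine translation, while scoring stays in the focus language; the cost, as the paper's annotation studies probe, is that translated entries can shift sentiment.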
In the summer of 2013, Brazil experienced a period of conflict triggered by a series of protests. While the popular press covered the events, little empirical work has investigated how first-hand reporting of the protests occurred and evolved over social media, and how such exposure in turn impacted the demonstrations themselves. In this study we examine over 42 million tweets shared during the three months of conflict in order to uncover patterns in online and offline protest-related activity, and to explore relationships between language use in tweets and the emotions and underlying motivations of protesters. Our findings show that peaks in Twitter activity coincide with days on which heavy protesting took place, that the words in tweets reflect emotional characteristics of protest-related events, and, less expectedly, that these emotions convey both positive and negative sentiment.
David Robinson is a data scientist at Stack Overflow. Parts of his article were re-posted in the Washington Post, here. This is a short version that summarizes his analysis; the details and source code can be found on David's website, here. In short, David found that Donald Trump's tweets are authored by two different people: someone on his campaign staff tweets from an iPhone, while the billionaire himself tweets from his Android.
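The core move in David's analysis is simple: split the tweets by the client recorded in the Twitter API's "source" field and compare stylistic signals across the two groups. A rough sketch of that split is below; the tweets and the hashtag-rate statistic are made-up illustrations (David's actual analysis, done in R, compared many signals, including hashtags, links, images, and word sentiment).

```python
# Hypothetical sketch: group tweets by client device and compare a
# simple stylistic statistic per group. The tweet texts are invented.
from collections import defaultdict

tweets = [
    {"source": "Twitter for Android", "text": "Totally unfair coverage. Sad!"},
    {"source": "Twitter for iPhone",  "text": "Join us tomorrow in Ohio! #MakeAmericaGreatAgain"},
    {"source": "Twitter for Android", "text": "Why would anyone listen to them?"},
]

# Bucket tweet texts by the client that posted them.
by_device = defaultdict(list)
for t in tweets:
    by_device[t["source"]].append(t["text"])

# One illustrative signal: average number of hashtags per tweet.
hashtag_rate = {
    device: sum(txt.count("#") for txt in texts) / len(texts)
    for device, texts in by_device.items()
}
print(hashtag_rate)
```

With real data, systematic differences in signals like this one across the two devices are what support the two-authors conclusion.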