Mislove, Alan (Northeastern University) | Lehmann, Sune (Technical University of Denmark) | Ahn, Yong-Yeol (Northeastern University) | Onnela, Jukka-Pekka (Harvard University) | Rosenquist, J. Niels (Harvard University)
Every second, the thoughts and feelings of millions of people across the world are recorded in the form of 140-character tweets using Twitter. However, despite the enormous potential presented by this remarkable data source, we still do not have an understanding of the Twitter population itself: Who are the Twitter users? How representative of the overall population are they? In this paper, we take the first steps towards answering these questions by analyzing data on a set of Twitter users representing over 1% of the U.S. population. We develop techniques that allow us to compare the Twitter population to the U.S. population along three axes (geography, gender, and race/ethnicity), and find that the Twitter population is a highly non-uniform sample of the population.
This post consolidates information on tree algorithms and their implementations in scikit-learn and Spark. In particular, it was written to clarify how feature importance is calculated. Many excellent resources online already explain how decision trees and random forests are built, and this post is not intended to duplicate them. Although it includes short definitions for context, it assumes the reader has a grasp of these concepts and wants to know how the algorithms are implemented in scikit-learn and Spark. Decision trees learn to split the dataset into smaller and smaller subsets that best predict the target value.
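As a minimal sketch of the idea above, the snippet below fits a scikit-learn `DecisionTreeClassifier` on the bundled Iris dataset and reads off its impurity-based `feature_importances_`; the dataset and hyperparameters are illustrative choices, not anything from the post itself.

```python
# Fit a shallow decision tree and inspect impurity-based feature importances.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X, y = data.data, data.target

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# In scikit-learn, feature_importances_ is the total impurity decrease
# attributed to each feature across all splits, normalized to sum to 1.
for name, imp in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Because the importances are normalized, they always sum to 1 for a fitted tree with at least one split.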
As the amount of data continues to grow at an almost incomprehensible rate, the ability to understand and process it is becoming a key differentiator for IT professionals and data scientists. The scikit-learn library is one of the most popular platforms for everyday machine learning and data science because it is built on Python, a fully featured programming language. This comprehensive 3-in-1 course is your one-stop solution to mastering machine learning algorithms and their implementation. Develop pipelines and process data through manipulation, extraction, and data-cleansing techniques. Learn clean coding techniques applicable to any scalable machine learning project.
Understanding the demographics of users of online social networks has important applications for health, marketing, and public messaging. In this paper, we predict the demographics of Twitter users based on whom they follow. Whereas most prior approaches rely on a supervised learning approach, in which individual users are labeled with demographics, we instead create a distantly labeled dataset by collecting audience measurement data for 1,500 websites (e.g., 50% of visitors to gizmodo.com are estimated to have a bachelor's degree). We then fit a regression model to predict these demographics using information about the followers of each website on Twitter. The resulting average held-out correlation is .77 across six different variables (gender, age, ethnicity, education, income, and child status). We additionally validate the model on a smaller set of Twitter users labeled individually for ethnicity and gender, finding performance that is surprisingly competitive with a fully supervised approach.
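The distant-supervision setup described in this abstract can be sketched roughly as regressing site-level demographic estimates on follower features. Everything below is a hypothetical illustration: the synthetic data, the number of features, and the choice of ridge regression are assumptions for the sketch, not the paper's actual dataset or model.

```python
# Hypothetical sketch: regress site-level demographics on follower features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Rows: 60 hypothetical websites. Columns: fraction of each site's Twitter
# followers who also follow each of 20 hypothetical accounts.
X = rng.random((60, 20))

# Target: a synthetic site-level demographic estimate (e.g. % of visitors
# with a bachelor's degree), generated as a noisy linear function of X.
y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=60)

model = Ridge(alpha=1.0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean held-out R^2: {scores.mean():.2f}")
```

On this synthetic linear data the held-out fit is strong by construction; the paper's reported correlation of .77 comes from its real audience-measurement data, not from anything like this toy example.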
The features we use to train our machine learning models have a huge influence on the performance we can achieve. Irrelevant or only partially relevant features can degrade model performance. Feature selection is the process of automatically selecting those features in our data that contribute most to the prediction variable or output of interest. Irrelevant features can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression. We can learn more about feature selection with scikit-learn in the article Feature selection.
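A minimal sketch of this process, assuming a classification task on the bundled Iris dataset, is scikit-learn's univariate `SelectKBest` with the ANOVA F-test; the dataset and `k=2` are illustrative choices.

```python
# Univariate feature selection: keep the k features with the highest
# ANOVA F-statistic between feature and class label.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_selected = selector.transform(X)

print(X_selected.shape)       # (150, 2)
print(selector.get_support()) # boolean mask over the original 4 features
```

For regression targets, `f_regression` or `mutual_info_regression` can be swapped in as the `score_func`.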