Palguna

AAAI Conferences 

Twitter carries around 500 million Tweets per day, and the impact of this data on applications such as public safety, opinion mining, and news broadcasting is growing steadily. Analyzing such large volumes of Tweets requires techniques that scale well with the number of Tweets. In this work we develop a theoretical formulation for sampling Twitter data. We introduce novel statistical metrics that quantify the representativeness of a Tweet sample, and we derive sufficient conditions on the number of samples needed to obtain highly representative Tweet samples. The metrics measure the goodness of a sample in two respects: how well it identifies frequent keywords, and how well it recovers the public sentiment associated with those keywords.
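To make the idea of frequent keyword identification concrete, the following is a minimal sketch, not the formulation developed in this work. It assumes whitespace tokenization, uniform sampling without replacement, and a hypothetical frequency threshold `theta`; the function names and the error measure are illustrative choices and are not the metrics introduced here.

```python
import random
from collections import Counter


def keyword_frequencies(tweets):
    """Fraction of tweets containing each keyword (counted once per tweet)."""
    counts = Counter()
    for tweet in tweets:
        # Assumption: lowercase whitespace tokens stand in for keywords.
        counts.update(set(tweet.lower().split()))
    n = len(tweets)
    return {word: c / n for word, c in counts.items()}


def frequent_keywords(tweets, theta):
    """Keywords that appear in at least a fraction theta of the tweets."""
    return {w for w, f in keyword_frequencies(tweets).items() if f >= theta}


def uniform_sample(tweets, k, seed=0):
    """Draw k tweets uniformly at random, without replacement."""
    return random.Random(seed).sample(tweets, k)


def max_frequency_error(tweets, sample):
    """Largest absolute gap between population and sample keyword frequencies."""
    pop = keyword_frequencies(tweets)
    smp = keyword_frequencies(sample)
    return max(abs(pop[w] - smp.get(w, 0.0)) for w in pop)
```

For example, with a toy population of 60 copies of "goal scored today" and 40 copies of "rain delay today", `frequent_keywords(tweets, 0.5)` returns `{"goal", "scored", "today"}`. Comparing that set with `frequent_keywords(uniform_sample(tweets, k), 0.5)` for increasing `k` shows informally how sample size affects keyword recovery.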