Natural language question understanding has been one of the most important challenges in artificial intelligence. Indeed, eminent AI benchmarks such as the Turing test require an AI system to understand natural language questions of varying topic and complexity, and then respond appropriately. During the past few years, we have witnessed rapid progress in question answering technology, with virtual assistants like Siri, Google Now, and Cortana answering daily life questions, and IBM Watson defeating human champions on Jeopardy!. Many questions the systems encounter are simple lookup questions (e.g., "Where is Chichen Itza?" or "Who's the manager of Man Utd?"). The answers can be found by searching the surface forms.
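To make the idea of a lookup question concrete, here is a minimal sketch of answering by surface-form matching. It is illustrative only: the knowledge base `TOY_KB`, the pattern list `PATTERNS`, and the function `answer_lookup_question` are hypothetical names invented for this example, and real systems search far larger knowledge sources with much more robust matching.

```python
import re

# Hypothetical toy knowledge base: (entity surface form, relation) -> answer.
TOY_KB = {
    ("chichen itza", "location"): "Yucatán, Mexico",
    ("eiffel tower", "location"): "Paris, France",
}

# Hypothetical question templates: each maps a surface pattern to a relation.
PATTERNS = [
    (re.compile(r"^where is (?:the )?(.+?)\??$", re.IGNORECASE), "location"),
]


def answer_lookup_question(question):
    """Return an answer found by surface-form matching, or None."""
    text = question.strip()
    for pattern, relation in PATTERNS:
        match = pattern.match(text)
        if match:
            # Normalize the entity mention so it matches the KB keys.
            entity = match.group(1).strip().lower()
            return TOY_KB.get((entity, relation))
    return None


print(answer_lookup_question("Where is Chichen Itza?"))
```

A question that fits no template, or that mentions an entity missing from the toy knowledge base, returns `None`. That is exactly where pure surface matching stops being enough.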
First, a couple of pointers to keep in mind when searching for datasets. Kaggle: A data science site that hosts a variety of interesting, externally contributed datasets. You can find all kinds of niche datasets in its master list, from ramen ratings to basketball data and even Seattle pet licenses. Although the datasets are user-contributed, and thus have varying levels of cleanliness, the vast majority are clean. VisualData: Discover computer vision datasets by category; it also supports searchable queries.
Where's the best place to look for free online datasets for NLP? We combed the web to create the ultimate cheat sheet, broken down into datasets for text, audio and speech, and sentiment analysis. Sentiment140: A popular dataset of 1.6 million tweets with emoticons pre-removed. Twitter US Airline Sentiment: Twitter data on US airlines from February 2015, with tweets classified as positive, negative, or neutral. Yelp Reviews: An open dataset released by Yelp that contains more than 5 million reviews.
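As a starting point for working with a tweet sentiment dataset such as Sentiment140, the sketch below reads CSV rows and counts the labels. It rests on assumptions about the file layout that you should verify against your own download: no header row, six columns in the order given in `FIELDS`, and polarity codes 0, 2, and 4. The sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Assumed column order (the file is assumed to have no header row).
FIELDS = ["polarity", "tweet_id", "date", "query", "user", "text"]

# Assumed label encoding: 0 = negative, 2 = neutral, 4 = positive.
LABELS = {"0": "negative", "2": "neutral", "4": "positive"}

# Made-up rows in the assumed layout, for demonstration only.
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a",'
    '"my flight was delayed again, so tired"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b",'
    '"great service today, thanks"\n'
)


def read_tweets(lines):
    """Yield (label, text) pairs from an iterable of CSV lines."""
    for row in csv.DictReader(lines, fieldnames=FIELDS):
        yield LABELS.get(row["polarity"], "unknown"), row["text"]


def label_counts(lines):
    """Count how many tweets carry each sentiment label."""
    return Counter(label for label, _ in read_tweets(lines))


print(label_counts(io.StringIO(SAMPLE)))
```

To run this on a real download, pass an open file instead of the in-memory sample, for example `open(path, encoding="latin-1", newline="")`, where `path` is wherever you saved the CSV. The encoding here is also an assumption to check.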