This is the second part in a series where we analyze thousands of articles from tech news sites to uncover insights and trends about startups. Last time around, we scraped all the articles ever published in TechCrunch, VentureBeat and Recode using Scrapy. We then filtered out all the articles that weren't about startups, so we now have only the publications relevant to our analysis. Finally, we'll combine these classifiers to be ready to analyze all of our data. For the first part of this analysis, it'd be great to know what "event" each piece of startup news describes.
In this new post series, we will analyze hundreds of thousands of articles from TechCrunch, VentureBeat and Recode to discover cool trends and insights about startups. These are the types of questions we aim to answer with this analysis. In this first post, we will cover how Scrapy can be used to get all the articles ever published on these tech news sites and how MonkeyLearn can be used to filter the crawled articles by whether or not they are about startups. We want to create a dataset of startup news articles that can be used for studying trends later on. In the second post, we will create text classifiers that analyze the actual content of the startup articles. Is it news about an acquisition?
Hey, I'm a fairly inexperienced student currently working on a genre classifier and have encountered a bit of a problem. The data set I'm working with to train the classifier is unbalanced. I have various songs from all kinds of genre categories, but some categories have more sample data in them than others. For instance, "rock", "jazz", and "raphiphop" have hundreds of training samples, but another category called "funksoulrnb" only has a couple of dozen. I didn't think this would be a massive problem but I found that my classifier tends to classify songs as "rock", "jazz", or "raphiphop" more often than as other categories.
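To make the imbalance described above concrete, here is a minimal sketch that counts samples per genre and derives inverse-frequency class weights. The sample counts (300 and 24) are hypothetical stand-ins for "hundreds" and "a couple of dozen", and `inverse_frequency_weights` is an illustrative helper, not something from the original post. The formula, n_samples / (n_classes * class_count), is the same one scikit-learn uses for `class_weight="balanced"`. Whether weighting actually helps this particular classifier is an assumption that would need testing.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes get weights below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        genre: n_samples / (n_classes * count)
        for genre, count in counts.items()
    }


# Hypothetical label list mirroring the imbalance described in the post.
labels = (
    ["rock"] * 300
    + ["jazz"] * 300
    + ["raphiphop"] * 300
    + ["funksoulrnb"] * 24
)

weights = inverse_frequency_weights(labels)
for genre, weight in sorted(weights.items()):
    print(f"{genre}: {weight:.3f}")
```

Weights like these can be passed to classifiers that accept per-class weights, so that mistakes on the small "funksoulrnb" class cost more during training than mistakes on the large classes.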
This is the final part in a series where we use machine learning and natural language processing to analyze articles published in tech news sites in order to gain insights about the state of the startup industry. In the first post, we collected all the articles published in TechCrunch, VentureBeat and Recode since 2007. We also filtered out all the ones that aren't about startups. In the second post, we trained machine learning models that can tell what event is described in a piece of news (product launch, acquisition, fundraising, etc.) and which industries the startup in the article belongs to (Fintech, Machine Learning, and so on). Now we are finally ready to conduct our analysis on more than 270,000 articles. Let's go over the results!
In image classification, visual separability between different object categories is highly uneven, and some categories are more difficult to distinguish than others. Such difficult categories demand more dedicated classifiers. However, existing deep convolutional neural networks (CNN) are trained as flat N-way classifiers, and few efforts have been made to leverage the hierarchical structure of categories. In this paper, we introduce hierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a category hierarchy. An HD-CNN separates easy classes using a coarse category classifier while distinguishing difficult classes using fine category classifiers. During HD-CNN training, component-wise pretraining is followed by global finetuning with a multinomial logistic loss regularized by a coarse category consistency term. In addition, conditional executions of fine category classifiers and layer parameter compression make HD-CNNs scalable for large-scale visual recognition. We achieve state-of-the-art results on both CIFAR100 and large-scale ImageNet 1000-class benchmark datasets. In our experiments, we build up three different HD-CNNs and they lower the top-1 error of the standard CNNs by 2.65%, 3.1% and 1.1%, respectively.
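The coarse-then-fine idea above can be illustrated with a small sketch. The abstract does not spell out how the component outputs are merged, so this assumes one plausible scheme: each fine classifier's distribution is weighted by the coarse classifier's probability for its category and the results are summed. The function name `combine_coarse_fine` and all probabilities below are hypothetical, and no CNN is involved; the lists stand in for network outputs.

```python
def combine_coarse_fine(coarse_probs, fine_probs):
    """Weight each fine distribution by its coarse probability and sum.

    coarse_probs[k] is the probability of coarse category k.
    fine_probs[k] is the distribution over all fine classes produced by
    the fine classifier dedicated to coarse category k.
    """
    n_fine = len(fine_probs[0])
    combined = [0.0] * n_fine
    for weight, dist in zip(coarse_probs, fine_probs):
        for i, p in enumerate(dist):
            combined[i] += weight * p
    return combined


# Hypothetical outputs: two coarse categories, four fine classes.
coarse_probs = [0.8, 0.2]
fine_probs = [
    [0.7, 0.3, 0.0, 0.0],  # fine classifier for coarse category 0
    [0.0, 0.0, 0.5, 0.5],  # fine classifier for coarse category 1
]

final = combine_coarse_fine(coarse_probs, fine_probs)
predicted = max(range(len(final)), key=final.__getitem__)
print(final, predicted)
```

Under this assumed scheme, a fine classifier whose coarse probability is near zero contributes almost nothing to the sum, which suggests why skipping such classifiers (the conditional execution mentioned in the abstract) could save computation at little cost.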