All the news – Andrew Thompson – Medium

I recently curated this dataset to explore an algorithmic approximation of the categories that make up our news, something I have at different times both read and created. If you had tens of thousands of articles from a spread of outlets that seem more or less representative of our national news landscape, turned them into structured data, put a gun to that data's head and coerced it into groups, what would those groups be? I decided the best balance of simplicity and efficacy would be to use unsupervised clustering and let the data sort itself, however crudely (and categories, no matter what algorithm they're derived from, will almost always be crude, as there's no reason the media can't be infinitesimally taxonomized). For a variety of reasons (local memory constraints, ability, recommendations from those more learned), I chose to run a bag-of-words representation through KMeans. In other words, if every word becomes its own dimension and each article a single datapoint, what clusters of articles will form? If those words already bore you and you're itching to skip to the "so what," or you don't care about code, scroll down until you see bold letters telling you to stop. The code is here if anyone wants to peer-review this and tell me where I screwed up or give me suggestions.
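To make the "every word is a dimension, every article a datapoint" idea concrete, here is a minimal sketch of bag-of-words plus k-means. It is not the code linked above. It uses only the Python standard library, and the four toy headlines, the function names, and the deterministic farthest-first starting centroids are all assumptions made for illustration. A real run over tens of thousands of articles would want a proper library and sparse vectors.

```python
from collections import Counter


def bag_of_words(docs):
    """Turn each document into a word-count vector over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for tokens in tokenized:
        vec = [0.0] * len(vocab)  # one dimension per distinct word
        for word, count in Counter(tokens).items():
            vec[index[word]] = float(count)
        vectors.append(vec)
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(vectors, k):
    """Deterministic starting centroids (an assumption, chosen for repeatability):
    begin with the first vector, then keep adding the vector farthest from
    the centroids picked so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        nxt = max(vectors, key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(vectors, k, max_iterations=50):
    """Plain k-means: assign each point to its nearest centroid, recompute
    each centroid as the mean of its members, stop when labels settle."""
    centroids = farthest_first(vectors, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels, centroids


# Hypothetical toy "articles", not taken from the dataset.
docs = [
    "senate votes on budget bill",
    "house votes on tax bill",
    "team wins playoff game",
    "team loses final game",
]

vocab, vectors = bag_of_words(docs)
labels, centroids = kmeans(vectors, k=2)
print(labels)  # [0, 0, 1, 1]
```

The two politics headlines share "votes", "on" and "bill", and the two sports headlines share "team" and "game", so each pair sits closer together in word-count space and lands in the same cluster. Nothing in the algorithm knows what politics or sports are, which is the sense in which the data sorts itself.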
