Clean your data with unsupervised machine learning – Towards Data Science

#artificialintelligence

In this example we are faced with thousands of text articles scraped from both HTML and PDF files. The quality of the text returned depends heavily on the scraping process. From sample-checking some of the results, we know there are issues ranging from bad links and unreadable PDFs to items which have been read in successfully but whose content is complete garbage. The articles are company Modern Slavery returns from the database at https://www.modernslaveryregistry.org/ and now reside in a pandas DataFrame with metadata on each item, such as the company name and year of publication, alongside the text scraped from the return. The Python Missingno package is super-useful.
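As a rough illustration of that first sample-check, the sketch below counts missing values per column and flags rows whose text looks unusable. It is not the article's unsupervised method, only a simple heuristic. The column names (`company`, `year`, `text`), the sample rows, the `letter_ratio` helper and the 0.8 threshold are all assumptions made for the example. The Missingno package can visualise the same per-column gaps (for instance with its `matrix` plot), but only pandas is used here.

```python
import pandas as pd

# Hypothetical stand-in for the scraped returns; column names and rows are assumptions.
df = pd.DataFrame({
    "company": ["Acme Ltd", "Bolt plc", "Cargo Co", "Delta Inc"],
    "year": [2017, 2018, None, 2018],
    "text": [
        "This statement sets out the steps taken to prevent modern slavery.",
        None,                                  # bad link: nothing was scraped
        "%PDF-1.4 ##@@!! 0 0 obj << >>",       # residue from an unreadable PDF
        "",                                    # page read in, but empty
    ],
})

# Missing values per column: the same information a missingness plot would show.
missing_per_column = df.isna().sum()


def letter_ratio(text):
    """Share of characters that are letters or whitespace; 0.0 for missing or empty text."""
    if not isinstance(text, str) or not text:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)


# Flag rows whose text is missing, empty, or mostly non-letter characters.
# The 0.8 cut-off is an arbitrary assumption, not a recommended value.
df["letter_ratio"] = df["text"].apply(letter_ratio)
df["suspect"] = df["letter_ratio"] < 0.8

print(missing_per_column)
print(df[["company", "letter_ratio", "suspect"]])
```

Rows flagged as `suspect` are candidates for manual review or re-scraping rather than automatic deletion, since a crude character ratio can misjudge short or heavily numeric statements.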


The Best Public Datasets for Machine Learning

#artificialintelligence

First, a couple of pointers to keep in mind when searching for datasets. Kaggle: A data science site that contains a variety of interesting, externally contributed datasets. You can find all kinds of niche datasets in its master list, from ramen ratings to basketball data and even Seattle pet licenses. Although the datasets are user-contributed, and thus have varying levels of cleanliness, the vast majority are clean. VisualData: Discover computer vision datasets by category; it allows searchable queries.


Data is the New Oil – Hacker Noon

#artificialintelligence

Deep Learning is a revolutionary field, but for it to work as intended, it requires data. The area related to these big datasets is known as Big Data, which stands for the abundance of digital data. Data is as important for Deep Learning algorithms as the architecture of the network itself, i.e., the software. Acquiring and cleaning the data is one of the most valuable aspects of the work. Without data, the neural networks cannot learn.


The heroic Data Engineer - Lending a Helping Hand to Data Drowned Scientists - insideBIGDATA

@machinelearnbot

A recent Forbes article on the 10 Predictions for AI, Big Data, and Analytics in 2018 states that Data engineer will become the hot new job title, displacing its sibling role of Data Scientist. Gil Press goes on to write that Indeed.com


Web Scraping for Dataset Curation, Part 2: Tidying Craft Beer Data

@machinelearnbot

Editor's note: This post is the second in a two-part series. In the scraping part, I didn't bother to clean up the data. There are a few reasons for this. First, pandas is my tool of choice for manipulating the data. Second, I wanted to separate the concerns: scraping and cleaning.