Machine Learning Datasets in R (10 datasets you can use right now) - Machine Learning Mastery


You need standard datasets to practice machine learning. In this short post you will discover how to load standard classification and regression datasets in R, covering 3 R libraries and 10 specific datasets that you can use for machine learning. Loading standard datasets in R is invaluable for testing, practicing and experimenting with machine learning techniques and for improving your skill with the platform. There are hundreds of standard test datasets that you can use to practice and get better at machine learning, most of them hosted for free on the UCI Machine Learning Repository.

Machine Learning Datasets: 250+ ML Repository Of Speech Datasets


While open data or public data sets are convenient, we offer an extensive catalog of 'off-the-shelf' licensable datasets: 250 datasets covering 80 languages and multiple dialects for a variety of common AI use cases. We are excited to announce 30 new datasets for 2020 that deliver immediate value to our customers. Among our offerings you will find datasets for speech recognition and training datasets for machine learning algorithms, all created with the most advanced data science available. Whether you are working on a text-to-speech system, a voice recognition system or another solution that relies on natural language, high-quality licensed speech and language datasets allow you to go to market faster and reach more potential customers. Should You Build or Buy a Data Annotation Tool?

Train a model on fashion dataset


Fashion MNIST is a direct drop-in replacement for the original MNIST dataset. It consists of 60,000 training examples and 10,000 test examples, where each example is a 28×28 grayscale image of an article of clothing. The Fashion MNIST dataset is more difficult than the original MNIST, and thus serves as a more challenging benchmark. The model being trained is a CNN with three convolutional layers followed by two dense layers. The job will run for 30 epochs, with a batch size of 128.
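A minimal Keras sketch of the setup described above: a CNN with three convolutional layers and two dense layers, trained on Fashion MNIST for 30 epochs with a batch size of 128. The article gives only the layer counts and training schedule, so the filter counts, kernel sizes, and optimizer here are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model():
    """CNN with three convolutional layers followed by two dense layers.
    Filter counts and kernel sizes are assumptions; the article only
    specifies the number of layers."""
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),         # 28x28 grayscale input
        layers.Conv2D(32, 3, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),    # first dense layer
        layers.Dense(10, activation="softmax"),  # 10 clothing classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def train(model):
    """Run the job described in the article: 30 epochs, batch size 128."""
    (x_train, y_train), (x_test, y_test) = \
        tf.keras.datasets.fashion_mnist.load_data()
    x_train = x_train[..., np.newaxis] / 255.0   # scale pixels, add channel dim
    x_test = x_test[..., np.newaxis] / 255.0
    model.fit(x_train, y_train, epochs=30, batch_size=128,
              validation_data=(x_test, y_test))
```

Calling `train(build_model())` downloads Fashion MNIST via Keras and runs the full 30-epoch job.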

Label-less supervised learning? Enter self-supervised learning.


High-capacity networks are solving many different machine learning tasks, ranging from large-scale image classification, segmentation and image generation to natural speech understanding and realistic text-to-speech, arguably passing some formulations of a Turing Test. A few general trends are easily identified in academia and industry: deeper networks show increasingly better results, as long as they are fed ever bigger amounts of data, and labelled data in particular. Computational and economic costs increase linearly with the size of the dataset, and for this reason a number of unsupervised approaches aiming to exploit unlabelled data have been growing in popularity since around 2015. The intuition behind many of these techniques is to emulate the ability of human brains to self-determine the goal of a task and improve towards it. Advancements in algorithms able to exploit labels inherently contained within an unlabelled dataset gave rise to what is now referred to as self-supervised learning.
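The core idea, labels derived from the unlabelled data itself, can be sketched with one classic pretext task: rotation prediction, where each image is rotated and the network must predict the rotation. The choice of this particular task is an illustrative assumption, not something the article describes.

```python
import numpy as np

def make_rotation_task(images):
    """Turn unlabelled images into a self-supervised dataset: each image
    is rotated by 0/90/180/270 degrees, and the rotation index becomes
    the 'free' label a network can be trained to predict."""
    xs, ys = [], []
    for img in images:
        for k in range(4):              # k quarter-turns
            xs.append(np.rot90(img, k))
            ys.append(k)                # label comes from the data itself
    return np.stack(xs), np.array(ys)
```

No human annotation is needed: the transformation that generated each example doubles as its label, which is exactly the pattern self-supervised methods exploit.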

A Cartel of Influential Datasets Is Dominating Machine Learning Research, New Study Suggests


A new paper from the University of California and Google Research has found that a small number of 'benchmark' machine learning datasets, largely from influential western institutions, and frequently from government organizations, are increasingly dominating the AI research sector. The researchers conclude that this tendency to 'default' to highly popular open source datasets, such as ImageNet, raises a number of practical, ethical and even political concerns. Among their findings – based on core data from the Facebook-led community project Papers With Code (PWC) – the authors contend that 'widely-used datasets are introduced by only a handful of elite institutions', and that this 'consolidation' has increased to 80% in recent years. '[We] find that there is increasing inequality in dataset usage globally, and that more than 50% of all dataset usages in our sample of 43,140 corresponded to datasets introduced by twelve elite, primarily Western, institutions.' The criterion for inclusion is whether the institution or company accounts for more than 50% of known usages.