Working with NLP datasets in Python

#artificialintelligence

In the field of deep learning, datasets are an essential part of every project. To train a neural network that can handle new situations, you have to use a dataset that represents the scenarios the model will encounter in the real world. An image classification model trained on animal images will not perform well on a car classification task. Beyond training the best models, researchers use public datasets as benchmarks for model performance. I personally think that easy-to-use public benchmarks are among the most useful tools for facilitating the research process.
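To make this concrete, here is a minimal sketch of loading a public NLP benchmark in Python. It assumes the Hugging Face datasets library, which the snippet above does not name, so treat it as one illustrative option rather than the article's own code.

# Load a standard NLP benchmark with the Hugging Face `datasets` library
# (pip install datasets). IMDB is a common text-classification benchmark.
from datasets import load_dataset

imdb = load_dataset("imdb")

print(imdb["train"][0]["text"][:200])   # first 200 characters of one review
print(imdb["train"].features["label"])  # class labels: neg / pos

Because many groups evaluate on the same splits, results reported against such a dataset are directly comparable across papers, which is exactly the benchmarking role the article describes.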


YuruGAN: Yuru-Chara Mascot Generator Using Generative Adversarial Networks With Clustering Small Dataset

arXiv.org Machine Learning

A yuru-chara is a mascot character created by local governments and companies to publicize information about their regions and products. Because creating a yuru-chara incurs various costs, machine learning techniques such as generative adversarial networks (GANs) are expected to help. In recent years, it has been reported that using class conditions in a GAN's training dataset stabilizes learning and improves the quality of the generated images. However, it is difficult to apply class-conditional GANs when the amount of original data is small and no clear classes are given, as with yuru-chara images. In this paper, we propose a class-conditional GAN based on clustering and data augmentation. Specifically, we first performed K-means++ clustering on the yuru-chara image dataset to convert it into a class-conditional dataset. Next, we applied data augmentation to the class-conditional dataset, increasing the amount of data fivefold. In addition, we built a model that incorporates ResBlock and self-attention into a class-conditional GAN and trained it on the class-conditional yuru-chara dataset. Evaluating the generated images confirmed that the choice of clustering method affects the generated results.
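The labeling step in the abstract can be illustrated with a short sketch: cluster unlabeled images with k-means++ to obtain pseudo-class labels for a class-conditional GAN. This is an illustration using scikit-learn, not the authors' code; the placeholder data, flattened-pixel features, and cluster count are all assumptions.

# Derive pseudo-class labels for an unlabeled image set via k-means++,
# so a class-conditional GAN can be trained on (image, label) pairs.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
images = rng.random((500, 64, 64, 3))       # placeholder image dataset
features = images.reshape(len(images), -1)  # naive features: flattened pixels

# scikit-learn's KMeans uses the k-means++ initialization by default.
kmeans = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(features)

# Each (image, pseudo_label) pair can now feed a class-conditional GAN;
# the paper additionally augments the dataset fivefold before training.
print(np.bincount(pseudo_labels))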


Machine Learning Datasets in R (10 datasets you can use right now) - Machine Learning Mastery

#artificialintelligence

You need standard datasets to practice machine learning. In this short post, you will discover how to load standard classification and regression datasets in R. It covers 3 R libraries that you can use to load standard datasets, and 10 specific datasets that you can use for machine learning in R. Being able to load standard datasets in R is invaluable: it lets you test, practice, and experiment with machine learning techniques while improving your skill with the platform. There are hundreds of standard test datasets that you can use to practice and get better at machine learning. Most of them are hosted for free on the UCI Machine Learning Repository.
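The post itself covers R libraries, but the same workflow exists in Python: scikit-learn ships small standard classification and regression datasets for exactly this kind of practice. The sketch below is a loose Python analogue, not taken from the post.

# scikit-learn bundles classic practice datasets alongside its models.
from sklearn.datasets import load_diabetes, load_iris

iris = load_iris()          # classic classification benchmark
diabetes = load_diabetes()  # classic regression benchmark

print(iris.data.shape, iris.target_names)
print(diabetes.data.shape)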


Datasets for Machine Learning and Deep Learning

#artificialintelligence

Last month, I shared a short list of dataset repositories that I planned to recommend to students as inspiration for their class projects. Thanks to all the great suggestions via the Twitter thread above, this list has grown quite a bit! Now, with the semester in full swing, I recently shared this set of dataset repositories with my deep learning class. Beyond using the list to find inspiration for interesting student class projects, these are also good places to look for additional benchmark datasets for your models, so I am putting it out here in the hope that you find it useful! It is hard to sort by priority or to pick favorites, so the following list is sorted alphabetically.


Introducing TensorFlow Datasets

#artificialintelligence

Public datasets fuel the machine learning research rocket (h/t Andrew Ng), but it's still too difficult to simply get those datasets into your machine learning pipeline. Every researcher goes through the pain of writing one-off scripts to download and prepare every dataset they work with, each with a different source format and its own complexities. TensorFlow Datasets (tfds) does all the grungy work of fetching the source data and preparing it into a common format on disk, and it uses the tf.data API to build high-performance input pipelines. We're launching with 29 popular research datasets such as MNIST, Street View House Numbers, the 1 Billion Word Language Model Benchmark, and the Large Movie Review Dataset, and will add more in the months to come; we hope that you join in and add a dataset yourself. Try tfds out in a Colab notebook.
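A minimal usage sketch for the tfds API described above, assuming tensorflow and tensorflow-datasets are installed; the dataset choice and pipeline parameters here are illustrative.

# tfds.load fetches, prepares, and caches the data on disk,
# returning a tf.data.Dataset ready for a training loop.
import tensorflow as tf
import tensorflow_datasets as tfds

ds = tfds.load("mnist", split="train", as_supervised=True)

# Standard tf.data input pipeline: shuffle, batch, prefetch.
ds = ds.shuffle(1024).batch(32).prefetch(tf.data.AUTOTUNE)

for images, labels in ds.take(1):
    print(images.shape, labels.shape)  # (32, 28, 28, 1) (32,)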