Collaborating Authors

Meet MS MARCO: A Dataset for the AI Research Community


Predictions 2017: Infosys "We have made enormous leaps forward"

15 Open Datasets for Healthcare


Machine learning is exploding into the world of healthcare. When we talk about the ways ML will revolutionize certain fields, healthcare is always one of the top areas seeing huge strides, thanks to the processing and learning power of machines. There's a good chance you either are or will soon be employed in the healthcare field. A while back, I wrote a list of 25 excellent open datasets for ML, which included the MIMIC Critical Care Database. Here are 15 more excellent datasets specifically for healthcare.

STARDATA: A StarCraft AI Research Dataset

AAAI Conferences

We release a dataset of 65646 StarCraft replays that contains 1535 million frames and 496 million player actions. We provide full game state data along with the original replays, which can be viewed in StarCraft. The game state data was recorded every 3 frames, which ensures suitability for a wide variety of machine learning tasks such as strategy classification, inverse reinforcement learning, imitation learning, forward modeling, partial information extraction, and others. We use TorchCraft to extract and store the data, which standardizes the data format for both reading from replays and reading directly from the game. Furthermore, the data can be used on different operating systems and platforms. The dataset contains only valid, non-corrupted replays, and its quality and diversity were ensured by a number of heuristics. We illustrate the diversity of the data with various statistics and provide examples of tasks that benefit from the dataset.
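
To get a feel for the scale described in this abstract, the quoted totals can be turned into rough per-replay averages. The sketch below uses only the figures stated above; the function name `dataset_summary` and the dictionary keys are hypothetical, and the averages say nothing about how replay lengths are actually distributed.

```python
# Figures quoted in the STARDATA abstract.
NUM_REPLAYS = 65_646
TOTAL_FRAMES = 1_535_000_000   # "1535 million frames"
TOTAL_ACTIONS = 496_000_000    # "496 million player actions"
RECORD_EVERY = 3               # game state recorded every 3 frames


def dataset_summary() -> dict:
    """Derive rough averages from the published totals (illustrative only)."""
    frames_per_replay = TOTAL_FRAMES / NUM_REPLAYS
    return {
        "frames_per_replay": frames_per_replay,
        "actions_per_replay": TOTAL_ACTIONS / NUM_REPLAYS,
        # One stored game-state snapshot for every RECORD_EVERY frames.
        "snapshots_per_replay": frames_per_replay / RECORD_EVERY,
        "frames_per_action": TOTAL_FRAMES / TOTAL_ACTIONS,
    }


if __name__ == "__main__":
    for name, value in dataset_summary().items():
        print(f"{name}: {value:,.1f}")
```

Under these totals, an average replay spans a little over 23,000 frames, and there is on average about one player action for every three frames.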

The life of a dataset in machine learning research – interview with Bernard Koch


Bernard Koch, Emily Denton, Alex Hanna and Jacob Foster won a best paper award in the datasets and benchmarks track at NeurIPS 2021 for Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research. Here, Bernard tells us about the advantages and disadvantages of benchmarking, the findings of their paper, and plans for future work. Machine learning is a rather unusual science, partly because it straddles the space between science and engineering. The main way that progress is evaluated is through state-of-the-art benchmarking. The scientific community agrees on a shared problem and picks a dataset that it considers representative of the data you might see when trying to solve that problem in the real world. Researchers then compare their algorithms by the score each achieves on that dataset.
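
The benchmarking loop described here can be sketched in a few lines. This is a toy illustration, not anything taken from the paper: the dataset `TOY_BENCHMARK`, the two models, the choice of accuracy as the score, and the function names are all made up for the example.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[float, int]          # (input feature, true label)
Model = Callable[[float], int]       # maps a feature to a predicted label


def accuracy(model: Model, dataset: List[Example]) -> float:
    """Fraction of examples the model labels correctly."""
    correct = sum(1 for x, y in dataset if model(x) == y)
    return correct / len(dataset)


def rank_on_benchmark(
    models: Dict[str, Model], dataset: List[Example]
) -> List[Tuple[str, float]]:
    """Score every model on the same shared dataset and sort best-first."""
    scores = {name: accuracy(model, dataset) for name, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical shared benchmark and competing "algorithms".
TOY_BENCHMARK: List[Example] = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.45, 1)]
MODELS: Dict[str, Model] = {
    "threshold_0.5": lambda x: int(x > 0.5),
    "always_one": lambda x: 1,
}

if __name__ == "__main__":
    for name, score in rank_on_benchmark(MODELS, TOY_BENCHMARK):
        print(f"{name}: {score:.2f}")
```

The key property is that every model is scored on the same fixed data with the same metric, which makes the results comparable but also ties the measure of progress to that one dataset.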