datas-frame – Modern Pandas (Part 8): Scaling

#artificialintelligence

We can answer questions like "Which employer's employees donated the most?" or "What is the average amount donated per occupation?" Since Dask is lazy, we haven't actually computed anything yet.
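A minimal sketch of what those lazy queries might look like, assuming a Dask DataFrame with hypothetical employer, occupation, and amount columns (the column names and file pattern here are illustrative, not from the original post):

```python
import dask.dataframe as dd

# Lazily point at the donation records; nothing is loaded yet.
df = dd.read_csv("donations-*.csv")  # hypothetical file pattern

# Both lines only build task graphs -- no computation happens here.
by_employer = df.groupby("employer")["amount"].sum()
by_occupation = df.groupby("occupation")["amount"].mean()

# Only .compute() triggers actual work.
top_employers = by_employer.nlargest(10).compute()
avg_by_occupation = by_occupation.compute()
```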


Ultimate guide to handle Big Datasets for Machine Learning using Dask (in Python)

#artificialintelligence

We will now look at some simple cases of creating arrays with Dask. As you can see here, the array has 11 values and I set the chunk size to 5. This splits the array into three chunks, where the first and second blocks have 5 values each and the third has 1. Dask arrays support most NumPy functions. For instance, you can use .sum().
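A runnable version of that example: an 11-element array chunked in blocks of 5, on which NumPy-style reductions work unchanged.

```python
import numpy as np
import dask.array as da

# 11 values with a chunk size of 5 -> blocks of 5, 5, and 1.
x = da.from_array(np.arange(11), chunks=5)
print(x.chunks)         # ((5, 5, 1),)

# Reductions mirror NumPy but stay lazy until .compute() is called.
total = x.sum()
print(total.compute())  # 55, same as np.arange(11).sum()
```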


Fast GeoSpatial Analysis in Python

#artificialintelligence

This work is supported by Anaconda Inc., the Data Driven Discovery Initiative from the Moore Foundation, and NASA SBIR NNX16CG43P. It is a collaboration with Joris Van den Bossche, and this blogpost builds on Joris's EuroSciPy talk (slides) on the same topic; see also his blogpost on the subject. Python's geospatial stack is slow; Dask gives an additional 3-4x speedup on a multi-core laptop.
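To illustrate the kind of multi-core parallelism involved (a sketch with toy data, not the cythonized GeoPandas code from the post), one can chunk a GeoSeries and let Dask schedule the chunks across cores with dask.delayed:

```python
import dask
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Toy data standing in for a real geospatial dataset.
points = gpd.GeoSeries([Point(i, i) for i in range(10_000)])

# Split into chunks and buffer each chunk lazily.
chunks = [points[i:i + 2_500] for i in range(0, len(points), 2_500)]
lazy = [dask.delayed(lambda s: s.buffer(1.0))(c) for c in chunks]

# The process scheduler sidesteps the GIL, which plain (pre-Cython)
# GeoPandas operations do not release.
results = dask.compute(*lazy, scheduler="processes")
buffered = gpd.GeoSeries(pd.concat(results))
```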


XGBoost: Implementing the Winningest Kaggle Algorithm in Spark and Flink

@machinelearnbot

XGBoost is a library designed and optimized for tree boosting. The gradient boosted trees model was originally proposed by Friedman et al. By embracing multi-threading and introducing regularization, XGBoost delivers higher computational power and more accurate predictions. More than half of the winning solutions in machine learning challenges hosted on Kaggle adopt XGBoost (incomplete list). XGBoost provides native interfaces for C, R, Python, Julia, and Java users.
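In the native Python interface, the multi-threading and regularization knobs the excerpt mentions look roughly like this (the data and parameter values are illustrative):

```python
import numpy as np
import xgboost as xgb

# Toy binary classification data; in practice X, y come from your dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=0.1, size=1000) > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "max_depth": 4,
    "eta": 0.1,
    "lambda": 1.0,  # L2 regularization on leaf weights
    "alpha": 0.0,   # L1 regularization
    "nthread": 4,   # multi-threaded tree construction
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```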


datas-frame – Scalable Machine Learning (Part 1)

#artificialintelligence

This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation. Anaconda is interested in scaling the scientific Python ecosystem. My current focus is on out-of-core, parallel, and distributed machine learning. This series of posts will introduce those concepts, explore what we have available today, and track the community's efforts to push the boundaries. I am (or was, anyway) an economist, and economists like to think in terms of constraints.