mmlspark
Microsoft Updates New Machine Learning Platform for Apache Spark -- Pure AI
This week Microsoft announced that it has released version 0.16 of its new deep learning and data science tool for Spark, Microsoft Machine Learning for Apache Spark (MMLSpark), on GitHub. MMLSpark requires Scala, Spark and Python, and works with Microsoft Cognitive Services and Azure Databricks. It was originally released two years ago; the most recent version before this one was 0.15. New features and improvements in version 0.16 include support for Spark deep learning pipelines, a new "ranking train validation splitter," better integration with Azure Search, support for the named entity recognition cognitive service on Spark (for analytical text extraction), improved boosting capabilities with LightGBM, the gradient boosting tool for tree-based algorithms, as well as many other changes. More information on MMLSpark can be found on the Microsoft product page.
Microsoft revamps machine learning tools for Apache Spark
Microsoft has revamped its MMLSpark open source project, the better to integrate "many deep learning and data science tools to the Spark ecosystem," according to the notes on the project repository. MMLSpark, originally released last year, is a collection of projects intended to make Spark more useful in many contexts--mainly machine learning, but also in some general-purpose ways. Some of MMLSpark's features integrate Spark with Microsoft machine learning offerings such as the Microsoft Cognitive Toolkit (CNTK) and LightGBM, as well as with third-party projects such as OpenCV. Others are about turning Spark into a service or client--for example, allowing Spark computations (including machine learning predictions) to be easily served via the web, or allowing Spark to interact with other web services via HTTP. One function, LIME on Spark, provides annotated results for the predictions served by a given image classifier, an at-a-glance way to determine if the classifier is working right.
MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales
Hamilton, Mark, Raghunathan, Sudarshan, Matiach, Ilya, Schonhoffer, Andrew, Raman, Anand, Barzilay, Eli, Thigpen, Minsoo, Rajendran, Karthik, Mahajan, Janhavi Suresh, Cochrane, Courtney, Eswaran, Abhiram, Green, Ari
We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ecosystem of enhancements that expand the Apache Spark distributed computing library to tackle problems in Deep Learning, Micro-Service Orchestration, Gradient Boosting, Model Interpretability, and other areas of modern computation. Furthermore, we present a novel system called Spark Serving that allows users to run any Apache Spark program as a distributed, sub-millisecond latency web service backed by their existing Spark Cluster. All MMLSpark contributions have the same API to enable simple composition across frameworks and usage across batch, streaming, and RESTful web serving scenarios on static, elastic, or serverless clusters. We showcase MMLSpark by creating a method for deep object detection capable of learning without human labeled data and demonstrate its effectiveness for Snow Leopard conservation.
Azure/mmlspark
MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly scalable predictive and analytical models for large image and text datasets. MMLSpark requires Scala 2.11, Spark 2.1, and either Python 2.7 or Python 3.5. See our notebooks for all examples. Below is an excerpt from a simple example of using a pre-trained CNN to classify images in the CIFAR-10 dataset. See other sample notebooks as well as the MMLSpark documentation for Scala and PySpark.
Microsoft Upgrades Windows-Based Data Science Virtual Machine
Data Science Virtual Machine (DSVM), Microsoft's cloud-based offering for big data analytics, is now available in a new preview version based on Windows Server 2016 Datacenter Edition. Previously, the Windows version of DSVM only ran on a Windows Server 2012 image. Microsoft also makes DSVM available in Ubuntu and CentOS Linux flavors. In upgrading to Windows Server 2016, DSVM users now have access to additional tools and functionality, including Docker container support, noted Microsoft software engineer Udayan Kumar in a June 6 announcement. The new virtual machine also comes bundled with Office ProPlus and includes an upgrade to Microsoft R Server 9.1, which now features sentiment analysis and other cognitive models.
Announcing Microsoft Machine Learning Library for Apache Spark
This post is authored by Roope Astala, Senior Program Manager, and Sudarshan Raghunathan, Principal Software Engineering Manager, at Microsoft. We're excited to announce the Microsoft Machine Learning library for Apache Spark -- a library designed to make data scientists more productive on Spark, increase the rate of experimentation, and leverage cutting-edge machine learning techniques -- including deep learning -- on very large datasets. We've learned a lot by working with customers using SparkML, both internal and external to Microsoft. Customers have found Spark to be a powerful platform for building scalable ML models. However, they struggle with low-level APIs, for example to index strings, assemble feature vectors and coerce data into a layout expected by machine learning algorithms.