The art and science of training neural networks on large data sets to make predictions or classifications has undergone a major transition over the past several years. With growing interest from scientists and engineers, this field of data analysis has come to be called deep learning. Put succinctly, deep learning is the ability of machine learning algorithms to learn hierarchies of features from data and to represent those features in the multiple non-linear layers that make up a neural network. Two years ago, questions were mainly about what deep learning is and how it might be applied to problems in science, engineering, and finance. Over the past year, however, interest has shifted from curiosity about what deep learning is to a focus on acquiring the hardware and software needed to apply deep learning frameworks to specific problems across a wide range of disciplines.
Meng, Xiangrui; Bradley, Joseph; Yavuz, Burak; Sparks, Evan; Venkataraman, Shivaram; Liu, Davies; Freeman, Jeremy; Tsai, DB; Amde, Manish; Owen, Sean; Xin, Doris; Xin, Reynold; Franklin, Michael J.; Zadeh, Reza; Zaharia, Matei; Talwalkar, Ameet
Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has grown rapidly thanks to its vibrant open-source community of over 140 contributors, and it includes extensive documentation to support further growth and to let users quickly get up to speed.
There are many algorithms with implementations that scale to large data sets (this list includes matrix factorization, SVM, logistic regression, LASSO, and many others). In fact, machine learning experts are fond of pointing out: if you can pose your problem as a simple optimization problem then you're almost done. Of course, in practice, most machine learning projects can't be reduced to simple optimization problems. Data scientists have to manage and maintain complex data projects, and the analytic problems they need to tackle usually involve specialized machine learning pipelines.
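To make the distinction between a "simple optimization problem" and a pipeline concrete, here is a minimal sketch in plain Python. It is an illustration only: every function name and the toy data are hypothetical, and nothing here is the MLlib or KeystoneML API. Two preprocessing stages (tokenizing and word counting) feed a final stage that is the optimization problem proper, a logistic regression fit by batch gradient descent.

```python
import math
from typing import Callable, List, Sequence, Tuple

# NOTE: all names below are hypothetical and chosen for illustration;
# this is not the MLlib or KeystoneML API.


def tokenize(text: str) -> List[str]:
    """Stage 1: lowercase the text and split it on whitespace."""
    return text.lower().split()


def make_vectorizer(vocab: Sequence[str]) -> Callable[[List[str]], List[float]]:
    """Stage 2: build a function that counts occurrences of each vocabulary word."""
    index = {word: i for i, word in enumerate(vocab)}

    def vectorize(tokens: List[str]) -> List[float]:
        vec = [0.0] * len(index)
        for token in tokens:
            if token in index:  # words outside the vocabulary are ignored
                vec[index[token]] += 1.0
        return vec

    return vectorize


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X: List[List[float]], y: List[int],
                 lr: float = 0.5, epochs: int = 200) -> Tuple[List[float], float]:
    """Stage 3: the optimization problem itself.

    Minimizes the average log loss with batch gradient descent.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fit_pipeline(texts: List[str], labels: List[int]) -> Callable[[str], int]:
    """Chain the stages: tokenize -> vectorize -> fit, and return a predictor."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    vectorize = make_vectorizer(vocab)
    X = [vectorize(tokenize(text)) for text in texts]
    w, b = fit_logistic(X, labels)

    def predict(text: str) -> int:
        x = vectorize(tokenize(text))
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        return 1 if sigmoid(score) >= 0.5 else 0

    return predict


# Toy data, invented for this sketch (1 = positive, 0 = negative).
texts = ["fast reliable service", "slow broken service",
         "fast friendly staff", "broken slow checkout"]
labels = [1, 0, 1, 0]
predict = fit_pipeline(texts, labels)
```

Only `fit_logistic` corresponds to the optimization step; the rest is the surrounding plumbing that has to be kept consistent between training and prediction, which is the part that pipeline frameworks aim to manage.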
In this episode I spoke with Michael Franklin, co-director of UC Berkeley's AMPLab and chair of the Department of Computer Science at the University of Chicago. AMPLab is well-known in the data community for having originated Apache Spark, Alluxio (formerly Tachyon) and many other open source tools. Today marks the start of a two-day symposium commemorating the end of AMPLab, and we took the opportunity to reflect on its impressive accomplishments.
As we put the finishing touches on what promises to be another outstanding Hardcore Data Science Day at Strata Hadoop World in New York, I sat down with my co-organizer Ben Recht for the latest episode of the O'Reilly Data Show Podcast. Recht is a UC Berkeley faculty member and member of AMPLab, and his research spans many areas of interest to data scientists, including optimization, compressed sensing, statistics, and machine learning. At the 2014 Strata Hadoop World in NYC, Recht gave an overview of a nascent AMPLab research initiative into machine learning pipelines. The research team behind the project recently released an alpha version of a new software framework called KeystoneML, which gives developers a chance to test out some of the ideas that Recht outlined in his talk last year. We devoted a portion of this Data Show episode to machine learning pipelines in general, and a discussion of KeystoneML in particular.