Data lakes offer a number of advantages for machine learning, but it takes an experienced partner to unlock their full benefit. AI News caught up with Brian Flüg, Solutions Architect at Qubole, to find out how the company is helping data scientists with their workloads. What are the advantages of using a data lake for machine learning? The advantages of using a secure and open data lake for machine learning are numerous. Such a data lake is simple to deploy, and companies can reduce risk while cutting costs.
A newly announced partnership between Qubole, the cloud big data-as-a-service company, and Snowflake Computing, the only data warehouse built for the cloud, will enable organisations to use Apache Spark in Qubole with data stored in Snowflake. This integration between the two cloud services allows data teams to build, train and put into production powerful machine learning (ML) and artificial intelligence (AI) models in Spark using information stored in Snowflake. It also enables data engineers to use Qubole to read and write data in Snowflake for advanced data preparation, such as data wrangling, data augmentation and advanced ETL, to refine existing Snowflake data sets.
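As a rough illustration of what reading and writing Snowflake data from Spark can look like, here is a minimal sketch. It assumes the open-source Snowflake Connector for Spark (data source name `net.snowflake.spark.snowflake`) is available on the cluster; Qubole's own integration may expose a different interface. The helper names, account URL, credentials and table names are all hypothetical placeholders.

```python
# Sketch only: assumes the Snowflake Connector for Spark is installed on the cluster.
# Helper names and all connection values are hypothetical placeholders.

SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"


def snowflake_options(account_url, user, password, database, schema, warehouse):
    """Collect the connection options the Snowflake connector expects."""
    return {
        "sfURL": account_url,
        "sfUser": user,
        "sfPassword": password,
        "sfDatabase": database,
        "sfSchema": schema,
        "sfWarehouse": warehouse,
    }


def read_table(spark, options, table):
    """Load a Snowflake table into a Spark DataFrame."""
    return (
        spark.read.format(SNOWFLAKE_SOURCE)
        .options(**options)
        .option("dbtable", table)
        .load()
    )


def write_table(df, options, table, mode="overwrite"):
    """Write a Spark DataFrame back to a Snowflake table."""
    (
        df.write.format(SNOWFLAKE_SOURCE)
        .options(**options)
        .option("dbtable", table)
        .mode(mode)
        .save()
    )
```

With an active `SparkSession`, a data preparation job might call `read_table(spark, opts, "RAW_EVENTS")`, refine the result with ordinary DataFrame operations, and hand it to `write_table` to store the prepared data set back in Snowflake.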
An integrated set of data management, analytics, and insight application development and management components, offered as a platform the enterprise does not own or control, may sound scary or cryptic. The scary part has to do with the lack of ownership and control. Many enterprises would be put off, as the need to exercise precisely that is engraved in their DNA. The move to the cloud, however, much debated in its early years, is pretty much a given now. Ownership and control have been central issues there, and yet somehow the pro-cloud arguments have prevailed, and the majority of enterprises are now in that camp.
Data science tools are evolving. Becoming a data scientist is hard. In any hard task, focus is critical. As a data scientist, Python should probably be the first tool you master. Kaggle, the community for data science competitions, publishes surveys of data scientists, such as its "2017 the State of Data Science" report.
Last week, during the Deep Learning Summit at AWS re:Invent 2017, Terrence Sejnowski (a pioneer of deep learning) succinctly said, "Whoever has more data wins". He was echoing a premise that has been repeated many times in many ways by many people: machine learning requires big data to work. Without large, well-maintained training sets, machine learning algorithms, especially deep learning algorithms, fall far short of their potential. That's why here at Qubole we believe that enabling data scientists starts with giving them a platform to quickly select, clean, and aggregate datasets on a massive scale.