There is no workload in the datacenter that can't, in theory and in practice, be delivered as a service from a public cloud. Big Data as a Service (BDaaS) is an emerging category of services that provides cloud-based data processing for analytics, and it is getting a lot of buzz these days – for good reason. BDaaS products vary in features, functions, and target use cases, but all address the same basic problem: big data and data warehousing in the cloud are deceptively challenging, and customers want that complexity abstracted away. Cloud data analytics is especially tough for companies with extensive datacenter investments hoping to build a hybrid architecture. Enterprises rarely contemplate wholesale migration of their IT infrastructure to the cloud, regardless of what Amazon Web Services would have you believe.
To be a true "full-stack" data scientist, or what many bloggers and employers call a "unicorn," you have to master every step of the data science process -- from storing your data all the way to putting your finished product (typically a predictive model) into production. Yet the bulk of data science training focuses on machine and deep learning techniques, while data management is often treated as an afterthought. Students usually learn modeling skills on data that has already been processed and cleaned into text files on their laptops, never seeing how the data sausage is made. They often don't realize that in industry settings, getting raw data from various sources ready for modeling is usually 80% of the work. And because enterprise projects typically involve far more data than a local machine can handle, the entire modeling process often takes place in the cloud, with most applications and databases hosted on servers in data centers elsewhere. Even after students land jobs as data scientists, data management often becomes something a separate data engineering team takes care of. As a result, too many data scientists know too little about data storage and infrastructure, often to the detriment of their ability to make the right decisions on the job. The goal of this article is to provide a roadmap of what a data scientist in 2019 should know about data management -- from types of databases, to where and how data is stored and processed, to the current commercial options -- so that aspiring "unicorns" can dive deeper on their own, or at least learn enough to sound like one at interviews and cocktail parties.
In 2016, many organizations began storing, processing, and extracting value from data of all forms and sizes. Going forward, systems that support large volumes of both structured and unstructured data will continue to proliferate. In 2017, the market will demand platforms that help data custodians govern and secure big data while empowering end users to analyze that data. These systems will mature to operate well within enterprise IT systems and standards. In addition, the convergence of IoT, cloud, and big data will create new opportunities for self-service analytics.
Employees up and down the value chain are eager to dive into big data, hunting for golden nuggets of intelligence to help them make smarter decisions, grow customer relationships, and improve business efficiency. To do this, they've faced a dizzying array of technologies – from open source projects to commercial software products – as they try to wrestle big data to the ground. Today, much of the headline attention and momentum centers on some combination of Hadoop, Spark, and Redshift – all of which can be springboards for big data work. It's important to step back, though, and look at where we are in big data's evolution. In many ways, big data is in the midst of a transition.
This post is by Joseph Sirosh, Corporate Vice President of the Data Group at Microsoft. This week I'm joining thousands of people attending Strata Hadoop World in San Jose to explore the technology and business of big data and data science. We are pleased to announce R Server for Azure HDInsight, built on 100% open source R. It lets you train and run machine learning models on larger datasets than previously possible and make more accurate predictions. It also reduces the time to move ideas into production by eliminating time-consuming installation, setup, and procurement cycles for new hardware.