As the global business landscape is increasingly digitalised, and new technologies like 5G drive the rapid expansion of the Internet of Things (IoT), the amount of data created each day is growing exponentially. Business intelligence and research firm Raconteur found this year that, on an average day, 500mn tweets, 65bn WhatsApp messages and 294bn emails are sent, while four petabytes of data are created on Facebook and 5bn searches are made online. By 2025, it's estimated that 463 exabytes of data will be created each day globally – the equivalent of 212,765,957 DVDs per day. To keep pace and stay afloat, modern businesses need to gather, store, analyse and draw insights from a mind-bending amount of raw data. Determining what information is valuable, how to extract it and where to keep it are challenges that every business in the current landscape must overcome.
Should you strive to centralize your data, or leave it scattered about? It seems like it should be a simple question, but it's actually a tough one to answer, because it has so many ramifications for how data systems are architected, especially with the rise of cloud data lakes. In the old days, data was a relatively scarce commodity, and so it made sense to invest the time and money to centralize it. Companies paid millions of dollars to ensure their data warehouses were filled with the cleanest and freshest data possible, for historical reporting and analytics use cases. As the big data boom unfolded, companies gained more options.
Data Warehouses are designed to support the decision-making process through data collection, consolidation, analytics, and research. They can be used to analyze a specific subject area, such as "sales," and are an important part of modern Business Intelligence. The architecture for Data Warehouses was developed in the 1980s to assist in transforming data from operational systems into decision-support systems. Normally, a Data Warehouse is hosted on a business's mainframe server or in the Cloud. In a Data Warehouse, data from many different sources is brought to a single location and then translated into a format the Data Warehouse can process and store.
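This consolidation step is essentially an extract-transform-load (ETL) pattern. Below is a minimal sketch of the idea using an in-memory SQLite database as a stand-in for the warehouse; the source systems, schemas, and values are all illustrative assumptions, not from the original text.

```python
import sqlite3

# Hypothetical records "extracted" from two operational systems with
# inconsistent formats (names and values are illustrative assumptions).
crm_rows = [("2024-01-05", "EMEA", "1,200.50"), ("2024-01-06", "APAC", "980.00")]
pos_rows = [("05/01/2024", "EMEA", 310.25)]

def to_iso(date_str):
    """Normalise either YYYY-MM-DD or DD/MM/YYYY into ISO format."""
    if "/" in date_str:
        day, month, year = date_str.split("/")
        return f"{year}-{month}-{day}"
    return date_str

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)")

# Transform: coerce each source's formats into the single warehouse schema
for date_str, region, amount in crm_rows + pos_rows:
    conn.execute(
        "INSERT INTO sales VALUES (?, ?, ?)",
        (to_iso(date_str), region, float(str(amount).replace(",", ""))),
    )

# Once loaded, the consolidated table supports subject-area queries
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
print(totals)
```

The point of the sketch is the transform step: each source arrives with its own date and number formats, and the warehouse only becomes useful for analysis once everything conforms to one schema.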
Big Data is the ocean of information we swim in every day – vast zettabytes of data flowing from our computers, mobile devices, and machine sensors. This data is used by organizations to drive decisions, improve processes and policies, and create customer-centric products, services, and experiences. Big Data is defined as "big" not just because of its volume, but also due to the variety and complexity of its nature. Typically, it exceeds the capacity of traditional databases to capture, manage, and process it. And Big Data can come from anywhere or anything on earth that we're able to monitor digitally.
To be a real "full-stack" data scientist, or what many bloggers and employers call a "unicorn," you have to master every step of the data science process -- all the way from storing your data to putting your finished product (typically a predictive model) into production. But the bulk of data science training focuses on machine/deep learning techniques; data management knowledge is often treated as an afterthought. Data science students usually learn modeling skills with processed and cleaned data in text files stored on their laptops, ignoring how the data sausage is made. Students often don't realize that in industry settings, getting the raw data from various sources ready for modeling is usually 80% of the work. And because enterprise projects usually involve a massive amount of data that a local machine is not equipped to handle, the entire modeling process often takes place in the cloud, with most of the applications and databases hosted on servers in data centers elsewhere. Even after the student lands a job as a data scientist, data management often becomes something that a separate data engineering team takes care of. As a result, too many data scientists know too little about data storage and infrastructure, often to the detriment of their ability to make the right decisions at their jobs. The goal of this article is to provide a roadmap of what a data scientist in 2019 should know about data management -- from types of databases, where and how data is stored and processed, to the current commercial options -- so the aspiring "unicorns" can dive deeper on their own, or at least learn enough to sound like one at interviews and cocktail parties.
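To make the "80% of the work" claim concrete, here is a small sketch of the kind of cleanup raw extracts typically need before modeling can begin. The column names and values are hypothetical assumptions for illustration, using pandas since it is the common tool for this step.

```python
import pandas as pd

# Illustrative raw extract (columns and values are assumptions): duplicate
# rows, a missing key, and revenue stored as inconsistent strings -- the
# usual state of data before it is "ready for modeling".
raw = pd.DataFrame({
    "user_id": [1, 1, 2, 3, None],
    "revenue": ["100", "100", "n/a", "250.5", "80"],
})

clean = (
    raw.drop_duplicates()                 # remove exact duplicate rows
       .dropna(subset=["user_id"])        # drop rows missing the key
       .assign(revenue=lambda df: pd.to_numeric(df["revenue"],
                                                errors="coerce"))
       .dropna(subset=["revenue"])        # drop values that failed to parse
       .astype({"user_id": int})
)
print(clean)
```

Each line encodes a judgment call (is a duplicate an error? is an unparseable value droppable?), which is why this stage consumes so much of a project's effort compared with fitting the model itself.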