Everything a Data Scientist Should Know About Data Management - KDnuggets
To be a real "full-stack" data scientist, or what many bloggers and employers call a "unicorn," you have to master every step of the data science process -- all the way from storing your data, to putting your finished product (typically a predictive model) in production. But the bulk of data science training focuses on machine/deep learning techniques; data management knowledge is often treated as an afterthought. Data science students usually learn modeling skills with processed and cleaned data in text files stored on their laptop, ignoring how the data sausage is made. Students often don't realize that in industry settings, getting the raw data from various sources to be ready for modeling is usually 80% of the work. And because enterprise projects usually involve a massive amount of data that their local machine is not equipped to handle, the entire modeling process often takes place in the cloud, with most of the applications and databases hosted on servers in data centers elsewhere. Even after the student landed a job as a data scientist, data management often becomes something that a separate data engineering team takes care of. As a result, too many data scientists know too little about data storage and infrastructure, often to the detriment of their ability to make the right decisions at their jobs. The goal of this article is to provide a roadmap of what a data scientist in 2019 should know about data management -- from types of databases, where and how data is stored and processed, to the current commercial options -- so the aspiring "unicorns" could dive deeper on their own, or at least learn enough to sound like one at interviews and cocktail parties.
Oct-28-2019, 06:14:00 GMT
- Country:
- North America > United States > New York (0.04)
- Industry:
- Banking & Finance (1.00)
- Information Technology > Services (1.00)
- Technology: