Big Data File Formats Demystified

#artificialintelligence

"But if you want to do row-by-row, you have to fetch millions of rows and do the operation on each of the rows," Shahdadpuri says. "In those situations where you're trying to do a projection on very few columns from your entire data set, columnar is much, much better as opposed to a row-based format." Which file format you use all depends on the specific use case. While column-oriented stores like Parquet and ORC excel in some cases, in others a row-based storage mechanism like Avro might be the better choice. "Let's say you're presenting available flights to a user on a Web page," Williams says. "That just screams out for using use row-based storage because you're going to want to get a lot of information about each particular entry and you're probably going to want to get a lot of contiguous entries, like say all the entries from 9 o'clock this morning to 1 o'clock this afternoon.


Data Pipeline Automation: The Next Step Forward in DataOps

#artificialintelligence

The industry has largely settled on the notion of a data pipeline as a means of encapsulating the engineering work that goes into collecting, transforming, and preparing data for downstream advanced analytics and machine learning workloads. Now the next step forward is to automate that pipeline work, which is a cause that several DataOps vendors are rallying around. Data engineers are some of the most in-demand people in organizations that are leveraging big data. While data scientists (or machine learning engineers, as many of them are calling themselves nowadays) get most of the glory, it's the data engineers who do much of the hands-on-keyboard work that makes the magic of data science possible. Just as data science platforms have emerged to automate the most common data science tasks, we are also seeing new software tools emerging to handle much of the repetitive data pipeline work that is typically handled by data engineers.
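
As a rough illustration of the collect, transform, and prepare work the article describes, here is a minimal Python sketch of that pipeline shape; the source URL, column names, and output path are hypothetical, and real DataOps tooling wraps steps like these in scheduling, testing, and monitoring that the sketch omits.

```python
# Minimal sketch of the collect -> transform -> prepare shape of a data pipeline.
# All names (ORDERS_URL, order_id, orders_clean.parquet) are illustrative, not any specific tool's API.
import pandas as pd

ORDERS_URL = "https://example.com/orders.csv"  # hypothetical upstream source

def extract() -> pd.DataFrame:
    # Collect: pull raw records from an upstream system.
    return pd.read_csv(ORDERS_URL)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean and reshape for downstream analytics.
    cleaned = raw.dropna(subset=["order_id"])
    cleaned["order_date"] = pd.to_datetime(cleaned["order_date"])
    return cleaned

def load(prepared: pd.DataFrame) -> None:
    # Prepare: write to a format analytics and ML workloads can consume.
    prepared.to_parquet("orders_clean.parquet")

if __name__ == "__main__":
    load(transform(extract()))
```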


DataOps and the Problem with 'Ops' Terminology - The New Stack

#artificialintelligence

Almost every trend regarding how IT operations are handled gets an "Ops" moniker: DevOps, DevSecOps, AIOps, MLOps, GitOps, NoOps, FinOps, etc. We wholeheartedly believe that many of these terms describe real phenomena. However, as is the case with DataOps, the rush to rebrand existing products obfuscates the degree to which these trends are getting traction. Although there are differences, at its core DataOps is DevOps processes applied to data operations. Automation of data pipelines and collaboration between teams are two of the key characteristics of "modern" DataOps.


2017: The Year of DataOps – data-ops – Medium

@machinelearnbot

Data analytics has become increasingly important over the past several years as organizations find that data is the key to creating and sustaining a competitive advantage. The single most important innovation in data analytics this past year was DataOps.