Strata Hadoopworld Fall 2016 postmortem: Cloud and streaming drive real differentiation


The fall Strata conference is when Big Data makes it to Broadway, and the week was very much a blur. We used to come away from Strata with the memory of one or two overriding themes: last year it was machine learning and the new infatuation with Spark; before that, it was Hadoop opening up the opportunity for exploratory analytics and disappearing behind a veneer of familiar SQL. It's easy to get excited by the idealism around the shiny new thing. But let's set something straight: Spark ain't going to replace Hadoop.

Real-time streaming predictions using Google Cloud Dataflow and Google Cloud Machine Learning


Google Cloud Dataflow is probably already embedded somewhere in your daily life, and it enables companies to process huge amounts of data in real time. But imagine that you could combine this, also in real time, with the predictive power of neural networks. This is exactly what we will talk about in our latest blog post! It all started with some fiddling around with Apache Beam, an incubating Apache project that provides a single programming model for both batch and stream processing jobs. We wanted to test its streaming capabilities by running a pipeline on Google Cloud Dataflow, a Google-managed service for running such pipelines.
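To make the "one model for batch and stream" idea concrete, here is a minimal plain-Python sketch. It does not use the Apache Beam API; it only illustrates how the same windowed-count logic can run over a bounded list and over a lazily produced sequence. The event shape, the function names, and the assumption that events arrive in timestamp order are all ours, not the post's.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, key); a hypothetical event shape


def windowed_counts(
    events: Iterable[Event], window_size: int
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_start, counts) per fixed window, assuming in-order timestamps."""
    current_start = None
    counts: Counter = Counter()
    for timestamp, key in events:
        start = timestamp - (timestamp % window_size)
        if current_start is not None and start != current_start:
            # An event from a later window arrived: emit the finished window.
            yield current_start, dict(counts)
            counts = Counter()
        current_start = start
        counts[key] += 1
    if current_start is not None:
        # The input ended (bounded case): flush the last open window.
        yield current_start, dict(counts)


def event_stream() -> Iterator[Event]:
    """Stand-in for an unbounded source; it simply yields a few events lazily."""
    yield from [(0, "a"), (3, "b"), (12, "a"), (14, "a"), (25, "b")]


# Batch: the whole data set is available up front.
batch = [(0, "a"), (3, "b"), (12, "a"), (14, "a"), (25, "b")]
batch_result = list(windowed_counts(batch, window_size=10))

# Stream: events are pulled one at a time, results are emitted as windows close.
stream_result = list(windowed_counts(event_stream(), window_size=10))
```

The pipeline logic is written once and does not care whether its input is a finished list or a generator. A real streaming engine also has to deal with late and out-of-order data, which this sketch deliberately ignores.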

What is Data Engineering?


This is the first in a series of posts on Data Engineering. From helping cars drive themselves to helping Facebook tag you in photos, data science has attracted a lot of buzz recently. Data scientists have become extremely sought after, and for good reason – a skilled data scientist can add incredible value to a business.

Azure.Source - Volume 68


Scale out read-heavy workloads on Azure Database for PostgreSQL with read replicas, which enable continuous, asynchronous replication of data from one Azure Database for PostgreSQL master server to up to five Azure Database for PostgreSQL read replica servers in the same region. Replica servers are read-only; the only writes they receive are data changes replicated from the master. Stopping replication to a replica server turns it into a standalone server that accepts both reads and writes. Replicas are new servers that can be managed in much the same way as normal standalone Azure Database for PostgreSQL servers. For each read replica, you are billed for the provisioned compute in vCores and the provisioned storage in GB/month.
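On the application side, using read replicas usually means deciding which server each statement should go to. The following is a minimal sketch of that routing decision, not an Azure API: the class name, the host names, and the crude "starts with SELECT" check are all assumptions made for illustration. Because replication is asynchronous, a replica can lag behind the master, so the sketch lets callers force a read onto the master when they need the latest data.

```python
import itertools
from typing import List

MAX_REPLICAS = 5  # the limit stated above: up to five read replicas per master


class ReplicaRouter:
    """Hypothetical helper that picks a host for each SQL statement."""

    def __init__(self, master: str, replicas: List[str]) -> None:
        if len(replicas) > MAX_REPLICAS:
            raise ValueError(f"at most {MAX_REPLICAS} read replicas are supported")
        self.master = master
        self.replicas = list(replicas)
        # Round-robin over replicas to spread the read load.
        self._next_replica = itertools.cycle(self.replicas)

    def host_for(self, sql: str, require_fresh: bool = False) -> str:
        """Return the host that should run this statement."""
        # Crude read detection for the sketch; real code needs something sturdier
        # (for example, SELECT ... FOR UPDATE must not go to a replica).
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self.replicas and not require_fresh:
            return next(self._next_replica)
        # Writes, and reads that cannot tolerate replication lag, go to the master.
        return self.master


# Placeholder host names, not real servers.
router = ReplicaRouter(
    master="master.example.internal",
    replicas=["replica1.example.internal", "replica2.example.internal"],
)
read_host = router.host_for("SELECT * FROM orders")
write_host = router.host_for("INSERT INTO orders (id) VALUES (1)")
```

Here `read_host` is the first replica and `write_host` is the master. The returned host would then be handed to whatever PostgreSQL client the application already uses.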

A MongoDB Secret Weapon: Aggregation Pipeline


MongoDB is best known for creating a document database that Web and mobile developers love to use. But developers and analysts alike may be interested in a little-known MongoDB feature called the aggregation pipeline. What's more, the aggregation pipeline just got easier to use with MongoDB 4.0. The aggregation pipeline presents a powerful abstraction for working with and analyzing data stored in the MongoDB database. According to MongoDB CTO and co-founder Eliot Horowitz, the composability of the aggregation pipeline is one of the keys to its power.
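To show what composability means here, the sketch below writes a pipeline as a list of stage documents using the MongoDB operators `$match`, `$group` (with `$sum`), and `$sort`, then runs it through a tiny pure-Python interpreter. The interpreter is our own illustration and supports only a small subset of each operator (equality matches, grouping by one field, summing one field, sorting by one key); it is not MongoDB's implementation. The `orders` data and field names are made up.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Stage = Callable[[List[Doc]], List[Doc]]


def match_stage(spec: Doc) -> Stage:
    """Keep documents whose fields equal the given values (equality only)."""
    return lambda docs: [d for d in docs if all(d.get(k) == v for k, v in spec.items())]


def group_stage(spec: Doc) -> Stage:
    """Group by a "$field" reference; only {"$sum": "$field"} accumulators are handled."""
    def run(docs: List[Doc]) -> List[Doc]:
        key_field = spec["_id"][1:]  # strip the leading "$"
        groups: Dict[Any, Doc] = {}
        for d in docs:
            out = groups.setdefault(d[key_field], {"_id": d[key_field]})
            for name, accumulator in spec.items():
                if name == "_id":
                    continue
                source = accumulator["$sum"][1:]
                out[name] = out.get(name, 0) + d[source]
        return list(groups.values())
    return run


def sort_stage(spec: Doc) -> Stage:
    """Sort by a single key; 1 is ascending, -1 is descending."""
    (field, direction), = spec.items()
    return lambda docs: sorted(docs, key=lambda d: d[field], reverse=(direction == -1))


STAGES = {"$match": match_stage, "$group": group_stage, "$sort": sort_stage}


def aggregate(docs: List[Doc], pipeline: List[Doc]) -> List[Doc]:
    """Feed the output of each stage into the next one."""
    for stage in pipeline:
        (operator, spec), = stage.items()
        docs = STAGES[operator](spec)(docs)
    return list(docs)


orders = [
    {"customer": "ann", "status": "shipped", "total": 30},
    {"customer": "bob", "status": "shipped", "total": 50},
    {"customer": "ann", "status": "pending", "total": 99},
    {"customer": "ann", "status": "shipped", "total": 45},
]

pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer", "spent": {"$sum": "$total"}}},
    {"$sort": {"spent": -1}},
]

result = aggregate(orders, pipeline)
# result == [{"_id": "ann", "spent": 75}, {"_id": "bob", "spent": 50}]
```

Each stage only sees the documents produced by the stage before it, so stages can be added, removed, or reordered without touching the others. Against a real database, a list like `pipeline` would be passed to the driver instead, for example `collection.aggregate(pipeline)` in PyMongo.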