The future of the future: Spark, big data insights, streaming and deep learning in the cloud


You probably did not hear it here first. Spark has been making waves in big data for a while now, and 2017 has not disappointed anyone who has bet on its meteoric rise. That was a pretty safe bet actually, as interpreting market signals, speaking with pundits and monitoring data all pointed in the same direction. Its community is growing, and all major big data platforms make a point of interoperating with Spark. If you look at its core contributors and project management committee (PMC) you will see Hadoop heavyweights Cloudera and Hortonworks, and all-round powerhouses such as IBM, Facebook and Microsoft.

Databricks claims its new product Delta is the missing link to enterprise AI


You can't really give a conference keynote in 2017 without staking some sort of claim on AI, so Databricks was smart to keep its credibility, and let its geeky co-founder and CTO Matei Zaharia officially open the Spark Summit Europe 2017 yesterday. Instead, Zaharia spoke about streaming data and deep learning -- the engineers and developers in the room ate it up. The conference, organized by Databricks, creators of Apache Spark, brought more than 1,200 enthusiasts to Dublin, Ireland this week to learn about what new features and functions will be added to the open source project. The short answer, according to Zaharia, is cost-based optimization, Python and R improvements, Kubernetes support and more. But that's not what the C-Suite executives whose companies leverage Databricks' enterprise-ready, feature-filled version of Spark wanted to know about.

Spark Summit 17: Databricks launches Delta as purified data lake


Databricks, the inventor and commercial distributor of the Apache Spark processing platform, has announced a system called Delta, which it believes will appeal to CIOs as a data lake, a data warehouse and a "streaming ingest system". It is said to eliminate the need for extract, transform and load (ETL) processes. Every conference this year contains a dead human genius reincarnated as a software system or a robot. Yes, there is a lot of hype, but there is real worth in AI and Machine Learning. Read our counseling on how to avoid adopting a "black box" approach.

Coding up a Neural Network classifier from scratch – Towards Data Science – Medium


High-level deep learning libraries such as TensorFlow, Keras, and PyTorch do a wonderful job in making the life of a deep learning practitioner easier by hiding many of the tedious inner-working details of neural networks. As great as this is for deep learning, it comes with the minor downside of leaving many newcomers with less foundational understanding to be learned elsewhere. Our goal here is to simply provide a 1 hidden-layer fully-connected neural network classifier written from scratch (no deep learning libraries) to help chip away at that mysterious black-box feeling you might have with neural networks. The provided neural network classifies a dataset describing geometrical properties of kernels belonging to three classes of wheat (you can easily replace this with your own custom dataset). An L2-loss function is assumed, and a sigmoid transfer function is used on every node in the hidden and output layers.
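A rough sketch of the kind of classifier described above: one hidden layer, sigmoid on every hidden and output node, and an L2 loss trained by plain stochastic gradient descent. This is not the article's code. The synthetic three-blob dataset, the layer sizes, the learning rate and every function name here are assumptions chosen for illustration; the blobs merely stand in for the three-class wheat data.

```python
import math
import random

def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def make_blobs(n_per_class, rng):
    """Hypothetical stand-in data: three 2-D Gaussian blobs, one per class."""
    centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
    data = []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            data.append(([rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)], label))
    rng.shuffle(data)
    return data

def init_layer(n_in, n_out, rng):
    # One row of weights per node; the last entry of each row is the bias.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def layer_forward(layer, inputs):
    return [sigmoid(sum(w * x for w, x in zip(node[:-1], inputs)) + node[-1])
            for node in layer]

def forward(net, x):
    hidden = layer_forward(net[0], x)
    output = layer_forward(net[1], hidden)
    return hidden, output

def train(net, data, n_classes, lr=0.5, epochs=200):
    for _ in range(epochs):
        for x, label in data:
            hidden, output = forward(net, x)
            target = [1.0 if k == label else 0.0 for k in range(n_classes)]
            # L2 loss 0.5 * sum((o - t)^2); the sigmoid derivative is o * (1 - o).
            delta_out = [(o - t) * o * (1.0 - o) for o, t in zip(output, target)]
            # Backpropagate through the output weights before they are updated.
            delta_hid = [h * (1.0 - h) * sum(d * node[j] for d, node in zip(delta_out, net[1]))
                         for j, h in enumerate(hidden)]
            for node, d in zip(net[1], delta_out):
                for j, h in enumerate(hidden):
                    node[j] -= lr * d * h
                node[-1] -= lr * d
            for node, d in zip(net[0], delta_hid):
                for j, xj in enumerate(x):
                    node[j] -= lr * d * xj
                node[-1] -= lr * d

def predict(net, x):
    _, output = forward(net, x)
    return output.index(max(output))

def accuracy(net, data):
    return sum(predict(net, x) == label for x, label in data) / len(data)

rng = random.Random(0)
train_set = make_blobs(40, rng)
test_set = make_blobs(20, rng)
net = [init_layer(2, 5, rng), init_layer(5, 3, rng)]  # 2 inputs, 5 hidden, 3 outputs
train(net, train_set, n_classes=3)
test_accuracy = accuracy(net, test_set)
print(f"test accuracy: {test_accuracy:.2f}")
```

To try it on real data, replace `make_blobs` with a loader that returns `(features, label)` pairs and adjust the input size passed to `init_layer`; scaling the features to a similar range first usually helps sigmoid networks train.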

A Tour of Gotchas When Implementing Deep Q Networks with Keras and OpenAI Gym


Starting with the Google DeepMind paper, there has been a lot of new attention around training models to play video games. You, the data scientist/engineer/enthusiast, may not work in reinforcement learning but probably are interested in teaching neural networks to play video games. With that in mind, here's a list of nuances that should jumpstart your own implementation. The lessons below were gleaned from working on my own implementation of the Nature paper. The lessons are aimed at people who work with data but may run into some issues with some of the non-standard approaches used in the reinforcement learning community when compared with typical supervised learning use cases.

Young startups go full throttle

MIT News

From June to August each year, MIT delta v, hosted in the Martin Trust Center for MIT Entrepreneurship, provides a cohort of startups with the wherewithal to launch: office and lab space for prototyping, mentorship from veteran entrepreneurs, $20,000 in funding, and $2,000 in living expenses. The diverse range of ideas included robots that analyze sewage to track opioid consumption in populations, portable weight-lifting equipment that adjusts resistance in real time, a "Netflix" service for autonomous-vehicle data, augmented reality for recording and sharing knowledge of frontline workers in hospitals and care facilities, an online market that helps indigenous people digitize and sell their art, a battery for soldiers that recharges with fuel, cooking classes that donate meals to the needy, and advanced filtration systems that better remove heavy metals from drinking water. These milestones include partnerships and agreements with big-name companies, pilot programs, working prototypes or early product iterations, launched websites or apps, earned revenue, and -- perhaps most importantly -- customers. This year also saw the launch of a pilot program, the MIT NYC Summer Startup Studio, in New York City, where seven additional startups were offered the same perks that delta v provides.

The airports of the future are here


One reason airports tend to look and function remarkably alike is that they're designed to accommodate air travel infrastructure--security, passenger ticketing, baggage, ground transport--with the primary concerns being safety and minimal overhead for their tenant airlines. "It's like having a Super Bowl worth of people every single day." At Changi, concession revenues rose 5 percent last year to a record S$2.16 billion ($1.6 billion), while the world's busiest airport, Atlanta's Hartsfield-Jackson International, topped $1 billion in concession sales in 2016, also a record.

The Fundamental Statistics Theorem Revisited


It turned out that putting more weight on close neighbors, and increasingly lower weight on far away neighbors (with weights slowly decaying to zero based on the distance to the neighbor in question) was the solution to the problem. For those interested in the theory, the fact that cases 1, 2 and 3 yield convergence to the Gaussian distribution is a consequence of the Central Limit Theorem under the Liapounov condition. More specifically, and because the samples produced here come from uniformly bounded distributions (we use a random number generator to simulate uniform deviates), all that is needed for convergence to the Gaussian distribution is that the sum of the squares of the weights -- and thus Stdev(S) as n tends to infinity -- must be infinite. More generally, we can work with more complex auto-regressive processes with a covariance matrix as general as possible, then compute S as a weighted sum of the X(k)'s, and find a relationship between the weights and the covariance matrix, to eventually identify conditions on the covariance matrix that guarantee convergence to the Gaussian distribution.
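The condition on the weights can be checked with a small simulation. The sketch below is an illustration under assumed choices, not the article's cases 1, 2 and 3: it compares slowly decaying weights 1/sqrt(k), whose squares sum to the divergent harmonic series, with geometric weights 2^-k, whose squares sum to a finite value. The X(k) are independent uniform deviates on (-1, 1), and excess kurtosis (zero for a Gaussian) serves as a rough measure of how far S is from Gaussian. The weight choices, sample sizes and function names are all hypothetical.

```python
import random

def weighted_sum_samples(weights, n_samples, rng):
    """Draw n_samples values of S = sum_k w_k * X_k with X_k ~ Uniform(-1, 1)."""
    return [sum(w * rng.uniform(-1.0, 1.0) for w in weights)
            for _ in range(n_samples)]

def excess_kurtosis(values):
    """Sample excess kurtosis: fourth central moment / variance^2, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0

rng = random.Random(42)
n_terms = 200

# Assumed example weights: sum of squares diverges (harmonic series).
slow_weights = [k ** -0.5 for k in range(1, n_terms + 1)]
# Assumed example weights: sum of squares converges (geometric series).
fast_weights = [2.0 ** -k for k in range(1, n_terms + 1)]

kurt_slow = excess_kurtosis(weighted_sum_samples(slow_weights, 5000, rng))
kurt_fast = excess_kurtosis(weighted_sum_samples(fast_weights, 5000, rng))

print(f"excess kurtosis, weights 1/sqrt(k): {kurt_slow:.3f}")
print(f"excess kurtosis, weights 2^-k:      {kurt_fast:.3f}")
```

With the slowly decaying weights the excess kurtosis should sit close to zero, while the geometric weights leave S dominated by its first few terms and visibly flatter than a Gaussian. Convergence for 1/sqrt(k) is slow because the sum of squared weights grows only logarithmically in n.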

The Death of the Statistical Tests of Hypotheses


It is part of a data science framework (see section 2 in this article), in which many statistical procedures have been revisited to make them simple, scalable, accurate enough without aiming for perfection but instead for speed, and usable by engineers, machine learning practitioners, computer scientists, software engineers, AI and IoT experts, big data practitioners, business analysts, lawyers, doctors, journalists, even in some cases by the layman, and even by machines and APIs (as in machine-to-machine communications). Over the years, I have designed a new, unified statistical framework for big data, data science, machine learning, and related disciplines. I have also written quite a bit on time series (detection of accidental high correlations in big data, change point detection, multiple periodicities), correlation and causation, clustering for big data, random numbers, simulation, ridge regression (approximate solutions) and synthetic metrics (new variances, bumpiness coefficient, robust correlation metric and robust R-squared not sensitive to outliers). Vincent also manages his own self-funded research lab, focusing on simplifying, unifying, modernizing, automating, scaling, and dramatically optimizing statistical techniques.

What's wrong with this pic?

FOX News

The Atlanta-based airline has recently teamed up with Tinder to transform the exterior of a Brooklyn building into a "dating wall" covered in worldly murals depicting nine different Delta destinations. According to a press release, the idea is for Brooklynites to snap photos near the murals, upload them to their dating profiles, and trick unsuspecting Tinder dates into thinking they're more well-traveled than they actually are. "So this summer, Delta and Tinder are offering New York singles an opportunity to snap profile pictures that will make you look like a jet-setter via a series of painted walls on display on Wythe Avenue in Williamsburg, Brooklyn." The airline has also placed another large mural -- the second in its Painted Wall Series -- a few blocks away at the site of Brooklyn's weekly Smorgasburg food festival.