

Mark Cuban: Here's how to give your kids 'an edge'


The way to set your children up for success today is to ensure they learn about artificial intelligence, according to billionaire tech entrepreneur Mark Cuban. "Give your kids an edge, have them sign up [and] learn the basics of Artificial Intelligence," Cuban tweeted on Monday. Cuban, a star of the hit ABC show "Shark Tank" and the owner of the NBA's Dallas Mavericks, was promoting a free, one-hour virtual introduction to artificial intelligence that his foundation is teaching in collaboration with A.I. For Anyone, a nonprofit organization that aims to improve artificial intelligence literacy. "Parents, want your kids to learn about artificial intelligence while you're stuck in quarantine," Cuban says on his LinkedIn account.

Machine Learning Tutorial with Python, Jupyter, KSQL and TensorFlow


When Michelangelo started, the most urgent and highest impact use cases were some very high scale problems, which led us to build around Apache Spark (for large-scale data processing and model training) and Java (for low latency, high throughput online serving). This structure worked well for production training and deployment of many models but left a lot to be desired in terms of overhead, flexibility, and ease of use, especially during early prototyping and experimentation [where Notebooks and Python shine]. Uber expanded Michelangelo "to serve any kind of Python model from any source to support other Machine Learning and Deep Learning frameworks like PyTorch and TensorFlow [instead of just using Spark for everything]." So why did Uber (and many other tech companies) build its own platform- and framework-independent machine learning infrastructure? The posts How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka and Using Apache Kafka to Drive Cutting-Edge Machine Learning describe the benefits of leveraging the Apache Kafka ecosystem as a central, scalable, and mission-critical nervous system. It allows real-time data ingestion, processing, model deployment, and monitoring in a reliable and scalable way. This post focuses on how the Kafka ecosystem can help solve the impedance mismatch between data scientists, data engineers, and production engineers. By leveraging it to build your own scalable machine learning infrastructure while also keeping your data scientists happy, you can solve the same problems for which Uber built its own ML platform, Michelangelo.

Streaming Machine Learning with Tiered Storage


Both approaches have their pros and cons. The blog post Machine Learning and Real-Time Analytics in Apache Kafka Applications and the Kafka Summit presentation Event-Driven Model Serving: Stream Processing vs. RPC with Kafka and TensorFlow discuss this in detail. There are more and more applications where the analytic model is directly embedded into the event streaming application, making it robust, decoupled, and optimized for performance and latency. The model can be loaded into the application when starting it up (e.g., using the TensorFlow Java API). Model management (including versioning) depends on your build pipeline and DevOps strategy. For example, new models can be embedded into a new Kubernetes pod which simply replaces the old pod. Another commonly used option is to send newly trained models (or just the updated weights or hyperparameters) as a Kafka message to a Kafka topic.
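The last option, shipping a new model to the running application as a message, can be illustrated with a minimal sketch. It makes several assumptions that are not in the original post: an in-memory list stands in for the Kafka topics, the topic names `events` and `model-updates` are hypothetical, and a toy linear model replaces a real TensorFlow model. A real application would read both topics through Kafka consumers.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical message shape: (topic, payload). An in-memory list stands in
# for the Kafka topics that a real consumer would poll.
Message = Tuple[str, dict]


@dataclass
class LinearModel:
    """Toy stand-in for an analytic model embedded in the streaming app."""
    version: int
    weight: float
    bias: float

    def predict(self, x: float) -> float:
        return self.weight * x + self.bias


def run_stream(messages: Iterable[Message], model: LinearModel) -> List[Tuple[int, float]]:
    """Score events in-process and swap the model when an update arrives."""
    results = []
    for topic, payload in messages:
        if topic == "model-updates":
            # Replace the embedded model without restarting the application.
            model = LinearModel(payload["version"], payload["weight"], payload["bias"])
        elif topic == "events":
            # Local inference: no remote model-server call is involved.
            results.append((model.version, model.predict(payload["x"])))
    return results


stream = [
    ("events", {"x": 2.0}),
    ("model-updates", {"version": 2, "weight": 3.0, "bias": 1.0}),
    ("events", {"x": 2.0}),
]
scored = run_stream(stream, LinearModel(version=1, weight=1.0, bias=0.0))
print(scored)  # [(1, 2.0), (2, 7.0)]
```

Tagging each prediction with the model version that produced it is one simple way to keep results traceable across a swap.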

Investigating the interaction between gradient-only line searches and different activation functions

Gradient-only line searches (GOLS) adaptively determine step sizes along search directions for discontinuous loss functions resulting from dynamic mini-batch sub-sampling in neural network training. Step sizes in GOLS are determined by localizing Stochastic Non-Negative Associated Gradient Projection Points (SNN-GPPs) along descent directions. These are identified by a sign change in the directional derivative from negative to positive along a descent direction. Activation functions are a significant component of neural network architectures as they introduce non-linearities essential for complex function approximations. The smoothness and continuity characteristics of the activation functions directly affect the gradient characteristics of the loss function to be optimized. Therefore, it is of interest to investigate the relationship between activation functions and different neural network architectures in the context of GOLS. We find that GOLS are robust for a range of activation functions, but sensitive to the Rectified Linear Unit (ReLU) activation function in standard feedforward architectures. The zero-derivative in ReLU's negative input domain can lead to the gradient-vector becoming sparse, which severely affects training. We show that implementing architectural features such as batch normalization and skip connections can alleviate these difficulties and benefit training with GOLS for all activation functions considered.
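The idea of choosing a step size from a sign change in the directional derivative can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: it uses a deterministic gradient rather than mini-batch sub-sampled gradients, and it locates the sign change by growing a bracket and then bisecting. The function names, the growth factor, and the tolerances are all hypothetical choices, and NumPy is assumed to be available.

```python
import numpy as np


def directional_derivative(grad_fn, x, d, alpha):
    """Directional derivative along d, evaluated at x + alpha * d."""
    return float(np.dot(grad_fn(x + alpha * d), d))


def gradient_only_line_search(grad_fn, x, d, alpha0=1e-3, growth=2.0,
                              max_iter=50, tol=1e-8):
    """Find a step size where the directional derivative along the descent
    direction d changes sign from negative to positive, using gradient
    information only (no function values)."""
    lo, hi = 0.0, alpha0
    # Grow the bracket until the directional derivative is non-negative.
    for _ in range(max_iter):
        if directional_derivative(grad_fn, x, d, hi) >= 0.0:
            break
        lo, hi = hi, hi * growth
    else:
        return hi  # no sign change found within the iteration budget
    # Bisect inside [lo, hi] to localize the sign change.
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        if directional_derivative(grad_fn, x, d, mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy problem: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
def grad(x):
    return x


x0 = np.array([2.0, -1.0])
direction = -grad(x0)  # steepest-descent direction
step = gradient_only_line_search(grad, x0, direction)
print(step)
```

On this quadratic the directional derivative along the steepest-descent direction is negative for step sizes below 1 and positive above it, so the search should return a value close to 1.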

Top 9 Libraries You Can Use In Large-Scale AI Projects


Using machine learning to solve hard problems and build profitable businesses is almost mainstream now. This rise was accompanied by the introduction of several toolkits, frameworks and libraries, which made developers' jobs easier. In the first case, there are tools and approaches, often tedious, to scrape and gather data. However, in the latter case, a data surge will bring its own set of problems. These problems can range from feature engineering to storage to computational overkill.

Forecasting air quality with Dremio, Python and Kafka


Forecasting air quality is a worthwhile investment on many levels, for individuals as well as communities. Knowing what the air quality will be at a certain point in time allows people to plan ahead, which reduces health effects and the associated costs. Predicting air quality means taking a large number of variables into account, so a machine learning model that uses all of these variables to predict future air quality from current readings brings a lot of value. In this tutorial, we will create a machine learning model from historical air quality data stored in Amazon S3 buckets and in ADLS. We will use Dremio to link both data sources and curate the data before creating the model in Python. We will then use the resulting model along with Kafka to predict air quality values for a data stream. In Amazon S3, data is stored inside buckets. To create a bucket (we will call it dremioairbucket), go to the AWS portal and select S3 from the list of services.

Apache Spark Streaming Tutorial for Beginners


In a world where we generate data at an extremely fast rate, analyzing that data correctly and delivering useful, meaningful results at the right time can benefit many domains that deal with data products, from health care and finance to media, retail, and travel services. Solid examples include Netflix providing personalized recommendations in real time, Amazon tracking your interactions with different products on its platform and immediately suggesting related products, or any business that needs to stream a large amount of data in real time and run different analyses on it. One framework that can handle big data in real time and perform different analyses is Apache Spark. In this blog, we are going to use Spark Streaming to process high-velocity data at scale. Apache Spark is a cluster computing technology designed for fast computation.

StreamSets: Where DevOps Meets Data Integration


Apache Kafka is a scalable and fault-tolerant messaging system common in publish and subscribe (pub/sub) architectures. It is used for a range of use cases including message bus modernization, microservices architectures, and ETL over streaming data. High throughput -- Each server is capable of handling hundreds of MB/sec of data. High availability -- Data can be stored redundantly on multiple servers and can survive individual server failure. High scalability -- New servers can be added over time to scale out the system.

What machine learning engineers need to know


In this episode of the Data Show, I spoke with Jesse Anderson, managing director of the Big Data Institute, and my colleague Paco Nathan, who recently became co-chair of JupyterCon. This conversation grew out of a recent email thread the three of us had on machine learning engineers, a new job role that LinkedIn recently pegged as the fastest growing job in the U.S. In our email discussion, there was some disagreement on whether such a specialized job role or title was needed in the first place. As Eric Colson pointed out in his beautiful keynote at Strata Data San Jose, when done too soon, creating specialized roles can slow down your data team.