Do you really need a Feature Store?
The term "feature store" has been around for a few years. There are open-source solutions (such as Feast and Hopsworks) and commercial offerings (such as Tecton, Hopsworks, and Databricks Feature Store). Many articles and blog posts have been published on what a feature store is and why it is valuable, and some organizations have already adopted one as part of their ML applications. However, it is worth pointing out that a feature store is another component added to your overall ML infrastructure, one that requires extra investment and effort to build and operate. It is therefore necessary to ask: is a feature store really necessary for every organization?
Hopsworks 3.0: The Python-Centric Feature Store
Feature stores began in the world of Big Data, with Spark being the feature engineering platform for Michelangelo (the first feature store) and Hopsworks (the first open-source feature store). Nowadays, the modern data stack has assumed the role of Spark for feature stores - feature engineering code can be written that seamlessly scales to large data volumes in Snowflake, BigQuery, or Redshift. However, Python developers know that feature engineering is much more than the aggregations and data validation you can do in SQL and dbt. Dimensionality reduction, whether using PCA or embeddings, and transformations are fundamental steps in feature engineering that are not available in SQL today, even with UDFs (user-defined functions). Over the last few years, we have had an increasing number of customers who prefer working with Python for feature engineering.
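To make the point concrete, here is a minimal sketch of the kind of feature-engineering step that is awkward or impossible in SQL: PCA-based dimensionality reduction. This is a generic illustration using only NumPy, not Hopsworks-specific code; the function name `pca_reduce` is our own.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Center each feature column, as PCA requires
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Project onto the leading axes to get the reduced feature matrix
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # toy raw feature matrix
X_reduced = pca_reduce(X, n_components=3)
print(X_reduced.shape)                  # (100, 3)
```

A step like this sits naturally in a Python feature pipeline, where the reduced columns would then be written to the feature store alongside simpler aggregations.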
How Optimizing MLOps can Revolutionize Enterprise AI
Machine learning has now entered its business heyday. Almost half of CIOs were predicted to have implemented AI by 2020, a number that is expected to grow significantly in the next five years. But creating a machine learning model and putting it into operation in an enterprise environment are two very different things. The biggest challenge for companies looking to use AI is operationalizing machine learning, the same way DevOps operationalized software development in the 2000s. Simplifying the data science workflow by providing the necessary architecture, and automating feature serving with feature stores, are two of the most important ways to make machine learning easy, accurate, and fast at scale.
Introducing the Hopsworks 1.x series! - Logical Clocks
The Hopsworks 1.x series brings many new features and improvements, ranging from services such as the Feature Store and Experiments, to enhanced support for distributed stream processing and analytics with Apache Flink and Apache Beam, to building Deep Learning pipelines with TensorFlow Extended (TFX), to code versioning support for Jupyter notebooks with Git, to all-new provenance/lineage of data across all steps of data engineering and data science pipelines. We are also excited that Hopsworks 1.x is the backbone of the all-new managed Hopsworks platform for AWS, Hopsworks.ai. Hopsworks 1.x brings significant Feature Store improvements, ranging from updated UI components to connectivity with external systems and feature discovery. Users of Hopsworks Enterprise can now easily connect to the Feature Store from their Databricks notebooks and Amazon SageMaker. Documentation for connecting with these two platforms can be found at hopsworks.readthedocs.io
Guide to File Formats for Machine Learning: Columnar, Training, Inferencing, and the Feature Store
The most feature-complete, language-independent, and scalable of the file formats for deep learning training data is Petastorm. Not only does it support high-dimensional data and have native readers in TensorFlow and PyTorch, but it also scales to parallel workers, supports push-down index scans (reading only the requested columns from disk, and even skipping files whose stored value ranges fall outside the requested range), and scales to store many TBs of data. For model serving, we cannot really find any file format superior to the others. The easiest model serving solution to deploy and operate is protocol buffers with the TensorFlow Serving server. While both ONNX and TorchScript have potential, the open-source model serving servers are not there yet for them.