Feature Engineering: Data Scientist's Secret Sauce!


It is very tempting for data science practitioners to opt for the best-known algorithms for a given problem. However, it is not the algorithm alone that provides the best solution; a model built on carefully engineered and selected features can provide far better results. "Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius -- and a lot of courage -- to move in the opposite direction." Complex models are harder to interpret and tougher to tune. Simpler algorithms with better features or more data can perform far better than a complex model built on weak assumptions.


AAAI Conferences

Potential heuristics for state-space search are defined as weighted sums over simple state features. Atomic features consider the value of a single state variable in a factored state representation, while binary features consider joint assignments to two state variables. Previous work showed that the set of all admissible and consistent potential heuristics using atomic features can be characterized by a compact set of linear constraints. We generalize this result to binary features and prove a hardness result for features of higher dimension. Furthermore, we prove a tractability result based on the treewidth of a new graphical structure we call the context-dependency graph. Finally, we study the relationship of potential heuristics to transition cost partitioning. Experimental results show that binary potential heuristics are significantly more informative than the previously considered atomic ones.
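
To make "weighted sums over simple state features" concrete, here is a minimal sketch of how such a heuristic could be evaluated. The function name, the toy task, and all weights are hypothetical; the weights are chosen by hand for illustration and are not claimed to be admissible or consistent, which in the paper is what the linear constraints guarantee.

```python
from itertools import combinations

def potential_heuristic(state, atomic_weights, binary_weights=None):
    """Evaluate a potential heuristic as a weighted sum of active features.

    state: dict mapping variable name -> value (a factored state).
    atomic_weights: dict mapping (var, value) -> weight.
    binary_weights: dict mapping frozenset of two (var, value) facts -> weight.
    """
    binary_weights = binary_weights or {}
    facts = sorted(state.items())
    # Atomic features: each looks at the value of a single state variable.
    total = sum(atomic_weights.get(fact, 0.0) for fact in facts)
    # Binary features: each looks at a joint assignment to two variables.
    for pair in combinations(facts, 2):
        total += binary_weights.get(frozenset(pair), 0.0)
    return total

# Hypothetical toy task: a truck and a package, each at location A or B.
state = {"truck": "A", "package": "B"}
atomic = {("truck", "A"): 1.0, ("package", "B"): 2.0, ("package", "A"): 0.0}
binary = {frozenset({("truck", "A"), ("package", "B")}): 0.5}

h_atomic = potential_heuristic(state, atomic)          # atomic features only
h_binary = potential_heuristic(state, atomic, binary)  # atomic + binary features
print(h_atomic, h_binary)
```

The binary weight lets the heuristic express information about the combination "truck at A and package at B" that no sum of single-variable weights can capture on its own.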

Predicting real-time availability of 200 million grocery items in North American stores


Ever wished there was a way to know if your favorite Ben and Jerry's ice cream flavor is currently available in a grocery store near you? Instacart's machine learning team has built tools to figure that out! Our marketplace's scale lets us build sophisticated prediction models. Our community of over 70,000 personal shoppers scans millions of items per day across 15,000 physical stores and delivers them to customers. These stores belong to our grocery retail partners like Aldi, Costco, Kroger, Safeway, and Wegmans.

Hands-on with Feature Engineering Techniques: Common Issues in Datasets


This article is part of a series about feature engineering techniques. In machine learning, a dataset can have many features, and many issues can appear among them, ranging from missing data to outliers. These issues need to be addressed before feeding features into a machine learning model. In this post, we'll explore feature characteristics that need to be addressed before building any machine learning model. We'll also learn how to identify these characteristics and how they can significantly impact machine learning models. Missing data occurs when no value is stored for a particular observation in a variable. In other words, missing data is just what it sounds like: the absence of data.
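
As a rough illustration of spotting the two issues named above, the sketch below counts missing values per feature and flags outliers with the common 1.5 x IQR rule. The dataset, the function names, and the choice of the IQR rule are assumptions made for this example; the article itself may use different tools and thresholds.

```python
import statistics

def missing_fraction(column):
    """Fraction of observations in a column with no stored value (None)."""
    return sum(1 for v in column if v is None) / len(column)

def iqr_outliers(column, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing entries."""
    present = [v for v in column if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]

# Hypothetical dataset: two features with gaps and one extreme income.
data = {
    "age": [25, None, 31, 28, 29, 30, 27, None, 33, 26],
    "income": [38_000, 39_000, 40_000, None, 40_000,
               41_000, 41_000, 42_000, 43_000, 400_000],
}

missing = {name: missing_fraction(col) for name, col in data.items()}
outliers = iqr_outliers(data["income"])
print(missing)
print(outliers)
```

Note that the IQR rule needs a reasonable number of observations to work; with only a handful of values, a single extreme point can stretch the quartiles enough to hide itself.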

K Means Clustering - Effect of random seed


When the k-means clustering algorithm runs, it uses a randomly generated seed to determine the starting centroids of the clusters. However, if the data is evenly distributed, we might end up with different cluster members depending on the initial random seed. An example of such behavior is shown. R is used for the experiment. The code to load the data and the contents of the data are as follows.
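
Separately from the author's R experiment, the seed effect can be sketched in a few lines of Python. This is a plain Lloyd's-style k-means written for illustration, not the R `kmeans` function; the grid data, the helper names, and the choice of k = 3 are all assumptions for this example.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, seed, iterations=20):
    """Plain k-means; the starting centroids are k points drawn using `seed`."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels, centroids

def partition(labels):
    """Cluster membership as a set of index sets, ignoring label numbering."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(i)
    return {frozenset(g) for g in groups.values()}

# Evenly spaced 6x6 grid: the data has no natural cluster structure.
points = [(float(x), float(y)) for x in range(6) for y in range(6)]

partitions = {seed: partition(kmeans(points, k=3, seed=seed)[0])
              for seed in range(5)}
distinct = {frozenset(p) for p in partitions.values()}
print(f"{len(distinct)} distinct clusterings from {len(partitions)} seeds")
```

Comparing partitions rather than raw labels matters here: two runs can produce the same clusters under different label numbers, and that should not count as a different result. Rerunning with a fixed seed always reproduces the same clustering, which is why fixing the seed is the usual way to make such experiments repeatable.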