Deploying Large-Scale Classification Algorithms for Attribute Prediction

#artificialintelligence

In our last post, we talked about automated product attribute classification: using text-based machine learning techniques on given product features such as title and description to predict attribute values from a defined set of values. As discussed, as the catalogue size and number of suppliers keep growing, the problem of maintaining the catalogue accurately grows exponentially, with thousands of attribute values and millions of products to classify each day. In this post, we highlight some of the key steps we used to deploy machine learning algorithms that classify thousands of attributes on dataX, CrowdANALYTIX's proprietary big data curation and veracity optimization platform. As shown in the figure below, the client product catalog is extracted and curated, and a list of products (new products that need classification, or refreshes of old products) is sent to dataX. The dataX ecosystem is designed to onboard millions of products each day and make high-precision predictions.
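To make the setup concrete, here is a minimal sketch of a text-based attribute classifier of the kind described above, written in Python with scikit-learn. The sample rows, the "color" attribute, and the model choice are hypothetical illustrations; the production dataX pipeline is proprietary and far more involved.

# Minimal sketch: predicting a single product attribute (here a
# hypothetical "color") from title + description text. The sample data,
# attribute, and model are illustrative, not the dataX implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy catalogue rows: concatenated title/description text, with the
# attribute value drawn from a predefined set of allowed values.
products = [
    "red cotton t-shirt crew neck short sleeve",
    "navy blue denim jeans slim fit",
    "red leather handbag with zipper",
    "blue ceramic coffee mug 12 oz",
]
colors = ["red", "blue", "red", "blue"]

# TF-IDF features over word unigrams and bigrams feed a linear
# classifier; one such model would be trained per attribute.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(products, colors)

print(clf.predict(["crimson red wool sweater"]))  # likely ['red']

In practice one such model per attribute, trained on far larger labelled catalogues, is what lets a platform score millions of incoming products per day.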


From Data Analysis to Machine Learning

#artificialintelligence

This article was originally posted here, by Mubashir Qasim. In my last article, I stated that for practitioners (as opposed to theorists), the real prerequisite for machine learning is data analysis, not math. One of the main reasons for making this statement is that data scientists spend an inordinate amount of time on data analysis. The traditional claim is that data scientists "spend 80% of their time on data preparation." While I think that statement is essentially correct, a more precise version is that you'll spend 80% of your time getting data, cleaning data, aggregating data, reshaping data, and exploring data using exploratory data analysis and data visualization.
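As a quick illustration of those preparation steps, here is a small pandas sketch in Python; the file name and columns are hypothetical and exist only to show the workflow of getting, cleaning, aggregating, reshaping, and exploring data.

# Sketch of the data-preparation steps named above, using pandas.
# The file and columns are hypothetical.
import pandas as pd

# Getting data
df = pd.read_csv("sales.csv")  # hypothetical region/product/revenue file

# Cleaning data: drop exact duplicates, fill missing revenue with 0
df = df.drop_duplicates()
df["revenue"] = df["revenue"].fillna(0)

# Aggregating data: total revenue per region and product
agg = df.groupby(["region", "product"], as_index=False)["revenue"].sum()

# Reshaping data: pivot so regions are rows and products are columns
wide = agg.pivot(index="region", columns="product", values="revenue")

# Exploring data: summary statistics as a first look
print(wide.describe())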


Making data science accessible – Data Munging

@machinelearnbot

By Data Munging we mean the process of taking raw data, understanding it, cleaning it, and preparing it for analysis or modelling. It is by no means the glamorous part of data science; however, done well, it plays a bigger role in getting to powerful models and insights than the choice of algorithm. So, you've been given a new dataset and are looking to model some behaviors in the data. It is tempting to jump straight in and start running regressions or machine learning, but this is a mistake. The first step is to really understand the data, starting from a univariate view and slowly building out, as sketched below.
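Here is one way that univariate first pass might look in Python with pandas; the dataset and column names are hypothetical.

# A minimal univariate first pass in pandas; the dataset and column
# names are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical new dataset

# Shape, types, and missingness before any modelling
print(df.shape)
print(df.dtypes)
print(df.isna().mean())  # share of missing values per column

# One column at a time
print(df["age"].describe())          # numeric: distribution summary
print(df["segment"].value_counts())  # categorical: frequency table

# Only then build out, e.g. a first bivariate view
print(df.groupby("segment")["age"].mean())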


Microsoft aims to take the work out of data wrangling with coming 'Pendleton' tool

ZDNet

With its growing emphasis on all things AI -- coupled with its history as a tool vendor -- it's not surprising that Microsoft is working on tools not just for traditional programmers, but also for data scientists. According to a Microsoft Research presentation from earlier this year, data scientists currently spend 80 percent of their time extracting and cleaning data -- a.k.a. "data wrangling." Microsoft wants to fix this. A year ago, I first heard from a contact of mine about a new machine-learning-related tool under development at Microsoft, codenamed "Pendleton." But it wasn't until The Walking Cat (@h0x0d on Twitter) unearthed some more information and documents that I had enough material to write about Pendleton.