In a data science project, the most time-consuming and messy part is almost always data gathering and cleaning. Everyone likes to build a cool deep neural network (or XGBoost) model or two and show off their skills with slick 3D interactive plots. But models need raw data to start with, and data rarely arrives easy or clean. Why gather data or build models at all? The fundamental motivation is to answer a business, scientific, or social question.
In our last post we discussed automated product attribute classification: applying advanced text-based machine learning techniques to given product features such as title and description, and predicting product attribute values from a defined set of values. As discussed, as the catalogue size and the number of suppliers keep growing, the problem of maintaining the catalogue accurately grows exponentially; there are thousands of attribute values and millions of products to classify per day. In this post, we highlight some of the key steps we took to deploy machine learning algorithms that classify thousands of attributes on dataX, CrowdANALYTIX's proprietary big data curation and veracity optimization platform. As shown in the figure below, the client product catalog is extracted and curated, and a list of products (new products that need classification, or refreshes of old products) is sent to dataX. The dataX ecosystem is designed to onboard millions of products each day and make high-precision predictions.
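To make the idea concrete, here is a minimal sketch of text-based attribute classification in the style described above, using a TF-IDF representation of title and description text and a linear classifier. This is an illustrative toy with made-up product data, not the proprietary models running inside dataX.

```python
# Minimal sketch of text-based product attribute classification.
# Assumes a scikit-learn-style pipeline; the toy catalog below is invented
# for illustration and is not from the dataX platform.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy catalog: product title + description text, labeled with a "color" attribute.
texts = [
    "red cotton t-shirt crew neck",
    "crimson red hoodie fleece pullover",
    "navy blue denim jeans slim fit",
    "blue chambray shirt long sleeve",
]
labels = ["red", "red", "blue", "blue"]

# TF-IDF features over word unigrams and bigrams, then a linear classifier.
# In production one such model would exist per attribute (color, size, ...).
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

# Classify an unseen product description.
print(model.predict(["red linen summer dress"])[0])
```

In a real deployment each attribute gets its own model and defined value set, and predictions below a confidence threshold are routed for human curation.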
The global influence of Big Data is not only growing but seemingly endless. The trend is toward knowledge that can be attained easily and quickly from massive pools of Big Data. Today we are living in the technological world that Dr. Usama Fayyad and his fellow researchers predicted nearly two decades ago in their introductory work on Knowledge Discovery in Databases (KDD). Indeed, they were precise in their outlook on Big Data analytics. The continued improvement and interoperability of machine learning, statistics, and database building and querying fused to create this increasingly popular science: Data Mining and Knowledge Discovery. The next generation of computational theories is geared toward extracting insightful knowledge from ever larger volumes of data at higher speeds. As the trend grows in popularity, a highly adaptive solution for knowledge discovery will be necessary. In this research paper, we introduce the investigation and development of 23 bit-questions for a Metaknowledge template for Big Data processing and clustering. This research aims to demonstrate the construction of this methodology and to establish its validity and the benefits it brings to Knowledge Discovery from Big Data.
As a research scientist at the German online retail giant Zalando, Dr. Alan Akbik is an expert in Natural Language Processing and data extraction. In his work for the company, which at any given moment is handling massive numbers of online transactions in multiple languages, Akbik helps unveil unique insights into the very structure of human language by observing and analyzing huge sets of multilingual text data. Here's what he had to say about the possibilities that NLP is opening up, both for business and for the study of language.