Microsoft, Machine Learning, And "Data Wrangling": ML Leverages Business Intelligence For B2B

#artificialintelligence

"Data wrangling" was an interesting phrase to hear in the machine learning (ML) presentations at Microsoft Ignite. Interesting because data wrangling is from business intelligence (BI), not from artificial intelligence (AI). Microsoft understands ML incorporates concepts from both disciplines. Further discussions point to another key point: Microsoft understands that business-to-business (B2B) is just as fertile for ML as business-to-consumer (B2C). ML applications with the most press are voice, augmented reality and autonomous vehicles.


Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study

arXiv.org Artificial Intelligence

Neural sequence-to-sequence (seq2seq) approaches have proven to be successful in grammatical error correction (GEC). Based on the seq2seq framework, we propose a novel fluency boost learning and inference mechanism. Fluency boost learning generates diverse error-corrected sentence pairs during training, enabling the error correction model to learn how to improve a sentence's fluency from more instances, while fluency boost inference allows the model to correct a sentence incrementally over multiple inference steps. Combining fluency boost learning and inference with convolutional seq2seq models, our approach achieves state-of-the-art performance: 75.72 (F_{0.5}) on the CoNLL-2014 10-annotation dataset and 62.42 (GLEU) on the JFLEG test set, becoming the first GEC system to reach human-level performance (72.58 for CoNLL and 62.37 for JFLEG) on both benchmarks.
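
As a rough illustration of the multi-step inference idea, the toy Python sketch below repeatedly re-corrects a sentence while each pass still improves a fluency score. The tiny rule table and the fluency proxy are hypothetical stand-ins for the paper's seq2seq model and its language-model-based fluency measure.

```python
# Minimal sketch of "fluency boost" inference: keep re-correcting while fluency improves.
# TOY_CORRECTIONS and the fluency proxy are illustrative only, not the paper's components.

TOY_CORRECTIONS = {"a apple": "an apple", "go to home": "go home"}  # hypothetical rules

def fluency(sentence: str) -> float:
    # Illustrative proxy: fewer known error patterns -> higher fluency.
    return -sum(sentence.count(err) for err in TOY_CORRECTIONS)

def correct_once(sentence: str) -> str:
    # Stand-in for a single pass of an error-correction model.
    for err, fix in TOY_CORRECTIONS.items():
        if err in sentence:
            return sentence.replace(err, fix, 1)
    return sentence

def boost_inference(sentence: str, max_steps: int = 5) -> str:
    """Re-correct the sentence repeatedly while each pass still boosts fluency."""
    best = sentence
    for _ in range(max_steps):
        candidate = correct_once(best)
        if candidate == best or fluency(candidate) <= fluency(best):
            break  # stop once another pass no longer improves fluency
        best = candidate
    return best

print(boost_inference("I eat a apple then go to home"))
# -> "I eat an apple then go home" after two inference steps on this toy data
```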


Deploying Large Scale Classification Algorithms for Attribute Prediction

#artificialintelligence

In our last post we talked about automated product attribute classification using text-based machine learning techniques: given product features such as title and description, we predict product attribute values from a defined set of values. As the catalogue size and the number of suppliers keep growing, the problem of maintaining the catalogue accurately grows exponentially, with thousands of attribute values and millions of products to classify each day. In this post, we highlight some of the key steps we used to deploy machine learning algorithms that classify thousands of attributes on dataX, CrowdANALYTIX's proprietary big data curation and veracity optimization platform. As shown in the figure below, the client product catalog is extracted and curated, and a list of products (new products that need classification or refreshes of old products) is sent to dataX. The dataX ecosystem is designed to onboard millions of products each day and make high-precision predictions.
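
To make the classification step concrete, here is a minimal, hypothetical sketch (not CrowdANALYTIX's actual pipeline) of text-based attribute prediction with scikit-learn: a classifier maps product title and description text to one attribute value from a defined set. In practice one such model would typically be trained per attribute.

```python
# Illustrative only: predict a single attribute (here "color") from product text.
# The tiny dataset and attribute values are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: product text -> attribute value from a defined set.
texts = [
    "Red cotton t-shirt, crew neck, short sleeve",
    "Navy blue denim jeans, slim fit",
    "Classic red hoodie with front pocket",
    "Blue linen shirt, button down collar",
]
labels = ["red", "blue", "red", "blue"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["Bright red wool sweater"]))  # -> ['red'] on this toy data
```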


Co-Occurrence-Based Error Correction Approach to Word Segmentation

AAAI Conferences

To overcome the problems in Thai word segmentation, a number of word segmentation approaches have been proposed over the years. We propose a novel Thai word segmentation approach called Co-occurrence-Based Error Correction (CBEC). CBEC generates all possible segmentation candidates using the classical maximal matching algorithm and then selects the most accurate segmentation based on co-occurrence statistics and an error correction algorithm. CBEC was trained and evaluated on the BEST 2009 corpus.
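
As an illustration of the candidate-generation step, the following toy Python sketch enumerates every dictionary-consistent segmentation of an unspaced string. The tiny English lexicon stands in for a Thai dictionary, and the co-occurrence re-ranking and error correction that CBEC applies afterwards are not shown.

```python
# Illustrative candidate generation: list all ways to split an unspaced string
# into dictionary words. The dictionary and input are hypothetical toy data.

DICTIONARY = {"he", "her", "here", "in", "rein"}

def segmentations(text, dictionary=DICTIONARY):
    """Return every segmentation of `text` into words found in `dictionary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in dictionary:
            for rest in segmentations(text[end:], dictionary):
                results.append([prefix] + rest)
    return results

for candidate in segmentations("herein"):
    print(candidate)
# -> ['he', 'rein'] and ['here', 'in']; a CBEC-style re-ranker would then pick one
```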


Data cleaning in Python: some examples from cleaning Airbnb data

#artificialintelligence

I previously worked for a year and a half at an Airbnb property management company, as head of the team responsible for pricing, revenue and analysis. One thing I find particularly interesting is how to figure out what price to charge for a listing on the site. Although 'it's a two bedroom in Manchester' will get you reasonably far, there are actually a huge number of factors that can influence a listing's price. As part of a bigger project on using deep learning to predict Airbnb prices, I found myself thrown back into the murky world of property data. Geospatial data can be very complex and messy -- and user-entered geospatial data doubly so.
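
As a flavor of the kind of cleaning involved, here is a small, hypothetical pandas example (not taken from the article) that drops listings whose user-entered coordinates are missing or outside plausible latitude/longitude ranges; the column names and values are made up for illustration.

```python
# Illustrative cleaning of user-entered geospatial data with pandas.

import pandas as pd

listings = pd.DataFrame({
    "latitude": [53.48, 91.2, None, 53.47],      # 91.2 is out of range, None is missing
    "longitude": [-2.24, -2.25, -2.23, -200.0],  # -200.0 is out of range
    "price": [80, 75, 90, 60],
})

# Keep rows with coordinates present and within valid ranges.
valid = (
    listings["latitude"].between(-90, 90)
    & listings["longitude"].between(-180, 180)
    & listings[["latitude", "longitude"]].notna().all(axis=1)
)
clean = listings[valid]
print(clean)  # only the first row survives on this toy data
```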