Statistical Learning


A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors

#artificialintelligence

This paper introduces a la carte embedding, a simple and general alternative to the usual word2vec-based approaches for building such representations that is based upon recent theoretical results for GloVe-like embeddings. Our method relies mainly on a linear transformation that is efficiently learnable using pretrained word vectors and linear regression. This transform is applicable on the fly in the future when a new text feature or rare word is encountered, even if only a single usage example is available. We introduce a new dataset showing how the a la carte method requires fewer examples of words in context to learn high-quality embeddings, and we obtain state-of-the-art results on a nonce task and some unsupervised document classification tasks.
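
The idea of learning a linear transformation by regression can be sketched as follows. This is only a toy illustration under stated assumptions, not the paper's implementation: the data is synthetic and noise-free, and the names `context_avg`, `word_vecs` and `induce_embedding` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 "words" with 5-dimensional pretrained vectors.
n_words, dim = 200, 5
true_A = rng.normal(size=(dim, dim))           # transform we hope to recover
context_avg = rng.normal(size=(n_words, dim))  # mean context vector per word
word_vecs = context_avg @ true_A.T             # pretrained vectors (noise-free toy case)

# Linear regression: find A such that word_vecs ≈ context_avg @ A.T
A_T, *_ = np.linalg.lstsq(context_avg, word_vecs, rcond=None)
A = A_T.T

def induce_embedding(context_vectors, A):
    """Embed an unseen word from the vectors of the words around it."""
    return A @ np.mean(context_vectors, axis=0)

# A single usage example: a made-up context of four word vectors.
new_context = rng.normal(size=(4, dim))
new_vec = induce_embedding(new_context, A)
```

Once `A` is learned, embedding a new word needs only the average of its context vectors and one matrix product, which is what makes the approach cheap to apply on the fly.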


XGBoost and Random Forest with Bayesian Optimisation

#artificialintelligence

Instead of only comparing XGBoost and Random Forest, in this post we will try to explain how to use those two very popular approaches with Bayesian Optimisation and what those models' main pros and cons are. XGBoost (XGB) and Random Forest (RF) are both ensemble learning methods that predict (classification or regression) by combining the outputs of individual decision trees (we assume tree-based XGB or RF). XGBoost builds decision trees one at a time. Each new tree corrects the errors made by the previously trained trees. At Addepto we use XGBoost models to solve anomaly detection problems, e.g. in a supervised learning approach.
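
The "each new tree corrects earlier errors" idea can be illustrated with a minimal sketch. This is plain gradient boosting for squared error with one-split trees (stumps), not XGBoost itself, which adds regularisation and other refinements. All function names here are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Pick the threshold on x whose two-leaf fit to the residual has least squared error."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the max so the right leaf is never empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, left_val, right_val = best
    return t, left_val, right_val

def predict_stump(stump, x):
    t, left_val, right_val = stump
    return np.where(x <= t, left_val, right_val)

def boost(x, y, n_trees=50, learning_rate=0.3):
    """Fit stumps one at a time, each on the residual left by the previous ones."""
    pred = np.full_like(y, y.mean(), dtype=float)
    stumps, losses = [], []
    for _ in range(n_trees):
        residual = y - pred                      # errors of the ensemble so far
        stump = fit_stump(x, residual)
        pred += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        losses.append(float(np.mean((y - pred) ** 2)))
    return stumps, losses

# Toy regression problem (synthetic data, for illustration only).
x = np.linspace(0, 6, 80)
y = np.sin(x)
stumps, losses = boost(x, y)
```

The training loss recorded in `losses` shrinks as trees are added, because every stump is fitted to what the current ensemble still gets wrong.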


Top Machine Learning and Data Science Methods Used at Work

#artificialintelligence

The practice of data science requires the use of algorithms and data science methods to help data professionals extract insights and value from data. A recent survey by Kaggle revealed that data professionals used data visualization, logistic regression, cross-validation and decision trees more than other data science methods in 2017. Looking ahead to 2018, data professionals are most interested in learning deep learning (41%). Kaggle conducted a survey in August 2017 of over 16,000 data professionals (2017 State of Data Science and Machine Learning). Their survey included a variety of questions about data science, machine learning, education and more.


Scaling tree-based automated machine learning to biomedical big data with a feature set selector

#artificialintelligence

Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming (GP) to recommend an optimized analysis pipeline for the data scientist's prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data. We introduce two new features implemented in TPOT that help increase the system's scalability: Feature Set Selector (FSS) and Template. FSS provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT's efficiency in application on big data by slicing the entire dataset into smaller sets of features and allowing GP to select the best subset in the final pipeline. Template enforces type constraints with strongly typed GP and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results.


Having Fun with Self-Organizing Maps

#artificialintelligence

The Self-Organizing Map (SOM), or Kohonen Network ([1]), is an unsupervised learning method that can be applied to a wide range of problems such as data visualization, dimensionality reduction or clustering. It was introduced in the 1980s by computer scientist Teuvo Kohonen as a type of neural network ([Kohonen 82],[Kohonen 90]). In this post we are going to present the basics of the SOM model and build a minimal Python implementation based on NumPy. There is a huge literature on SOMs (see [2]), both theoretical and applied; this post only aims at having fun with this model through a tiny implementation. The approach is very much inspired by this post ([3]).
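
A tiny NumPy SOM might look like the sketch below. It is not the post's own code: the grid size, the linear decay schedules and the function names are assumptions chosen for brevity.

```python
import numpy as np

def train_som(data, grid_h=5, grid_w=5, n_iter=500, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small SOM; returns weights of shape (grid_h, grid_w, n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((grid_h, grid_w, data.shape[1]))
    # Grid coordinates of every unit, shape (grid_h, grid_w, 2).
    coords = np.stack(
        np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1
    )
    for step in range(n_iter):
        frac = step / n_iter
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        sigma = sigma0 * (1 - frac) + 1e-3       # neighbourhood radius shrinks
        x = data[rng.integers(len(data))]        # pick one random sample
        bmu = best_matching_unit(weights, x)
        # Gaussian neighbourhood around the best matching unit, measured on the grid.
        grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
        h = np.exp(-grid_d2 / (2 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights

def best_matching_unit(weights, x):
    """Grid position of the unit whose weight vector is closest to x."""
    dists = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dists), dists.shape)

# Toy data: 100 random RGB colours (synthetic, for illustration only).
colors = np.random.default_rng(1).random((100, 3))
som = train_som(colors)
```

After training, `best_matching_unit(som, colors[0])` gives the grid cell a sample maps to, which is the basis for using the map for visualization or clustering.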



What's wrong with the approach to Data Science?

#artificialintelligence

Data science is the application of statistics, programming and domain knowledge to generate insights into a problem that needs to be solved. The Harvard Business Review said Data Scientist is the sexiest job of the 21st century. How often has that article been referenced to convince people? The job of 'Data Scientist' has been around for decades; it was just not called "Data Scientist". Statisticians have applied machine learning techniques such as Logistic Regression and Random Forest for prediction and insights for decades.


Ten Machine Learning Algorithms You Should Know to Become a Data Scientist

#artificialintelligence

Let's say I am given an Excel sheet with data about various fruits and I have to tell which look like apples. What I will do is ask a question, "Which fruits are red and round?", and split the fruits by whether the answer is yes or no. Now, not all red and round fruits are apples, and not all apples are red and round. So I will ask "Which fruits have red or yellow colour hints on them?" of the red and round fruits, and "Which fruits are green and round?" of the fruits that are not red and round. Based on these questions I can tell with considerable accuracy which are apples. This cascade of questions is what a decision tree is. However, this is a decision tree based on my intuition.
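
Written as code, the hand-made cascade is just nested conditions. The sketch below assumes that a "yes" to the second question means "apple", which the text implies but does not state. The field names and sample fruits are hypothetical.

```python
def looks_like_apple(fruit):
    """Hand-written decision tree following the cascade of questions above."""
    if fruit["color"] == "red" and fruit["round"]:
        # Branch 1: red and round, so ask about red or yellow colour hints.
        return fruit["red_or_yellow_hints"]
    # Branch 2: not red and round, so ask whether it is green and round.
    return fruit["color"] == "green" and fruit["round"]

# Made-up rows standing in for the Excel sheet.
fruits = [
    {"name": "red apple", "color": "red", "round": True, "red_or_yellow_hints": True},
    {"name": "tomato", "color": "red", "round": True, "red_or_yellow_hints": False},
    {"name": "green apple", "color": "green", "round": True, "red_or_yellow_hints": False},
    {"name": "banana", "color": "yellow", "round": False, "red_or_yellow_hints": True},
]
apples = [f["name"] for f in fruits if looks_like_apple(f)]
```

A learned decision tree has the same shape, but the questions and their order are chosen from the data rather than by intuition.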


R-Squared Explained for Indian Grandma - Reskilling IT

#artificialintelligence

In this post, you will learn about the concept of R-Squared as a way to assess the performance of a multilinear regression machine learning model, with the help of some real-world examples explained in a simple manner. Once we have built a multilinear regression model, the next step is to measure the model's performance. This can be done by calculating the value of the Residual Standard Error (RSE) or the value of R-Squared. The Residual Standard Error estimates the standard deviation of the residuals, that is, the typical distance between the observed values and the values predicted by the model. In this article, we will learn the technique of evaluating the model performance using the value of R-Squared.
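
R-Squared is commonly computed as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, with made-up numbers and a hypothetical function name:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS for paired observed and predicted values."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # unexplained variation
    tss = sum((yt - mean_y) ** 2 for yt in y_true)               # total variation
    return 1 - rss / tss

# Illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
score = r_squared(y_true, y_pred)
print(round(score, 3))  # 0.975
```

A model that always predicts the mean of `y_true` scores 0, and a perfect model scores 1.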


Learning Predictive Analytics with Python - Programmer Books

#artificialintelligence

Social Media and the Internet of Things have resulted in an avalanche of data. Data is powerful, but not in its raw form: it needs to be processed and modeled, and Python is one of the most robust tools out there to do so. It has an array of packages for predictive modeling and a suite of IDEs to choose from. Learning to predict who would win, lose, buy, lie, or die with Python is an indispensable skill set to have in this data age. This book is your guide to getting started with Predictive Analytics using Python.