 Performance Analysis


40 Interview Questions asked at Startups in Machine Learning / Data Science

@machinelearnbot

This article was posted by Manish Saraswat on Analytics Vidhya. Manish, who works in marketing and Data Science at Analytics Vidhya, believes that education can change this world. R, Data Science and Machine Learning keep him busy. Machine learning and data science are being looked at as the drivers of the next industrial revolution happening in the world today. This also means that there are numerous exciting startups looking for data scientists.


Learning Feature Nonlinearities with Non-Convex Regularized Binned Regression

arXiv.org Machine Learning

Recently, substantial progress has been made on the problem of high-dimensional sparse linear models [22]. In particular, Lasso has been shown to be remarkably successful, and is statistically well-behaved and generates interpretable solutions. However, in the presence of non-linearity (i.e., the relation between the covariates and response is nonlinear), boosted decision trees, deep learning models, and kernel methods are regarded as the most effective models that deliver substantial performance boost over linear models; however, their interpretability is limited. As a result, there is a significant gap between the statistical performance and the interpretability, and it is often desirable to have computationally efficient algorithms that learn interpretable models without sacrificing statistical guarantees. This raises a natural question that we aim to tackle: Is there any algorithm which has similar statistical performance to complex models, while still retaining much of the interpretability of Lasso? In this paper, we answer the above question affirmatively and propose a novel way of learning the feature non-linearities with provable statistical and computational guarantees.


CDS Rate Construction Methods by Machine Learning Techniques

arXiv.org Machine Learning

Regulators require financial institutions to estimate counterparty default risks from liquid CDS quotes for the valuation and risk management of OTC derivatives. However, the vast majority of counterparties do not have liquid CDS quotes and need proxy CDS rates. Existing methods cannot account for counterparty-specific default risks; we propose to construct proxy CDS rates by associating each illiquid counterparty with a liquid CDS proxy using machine learning techniques. After testing 156 classifiers from the 8 most popular classifier families, we found that some classifiers achieve highly satisfactory accuracy rates. Furthermore, we have rank-ordered the performances and investigated performance variations amongst and within the 8 classifier families. This paper is, to the best of our knowledge, the first systematic study of CDS proxy construction by machine learning techniques, and the first systematic classifier comparison study based entirely on financial market data. Its findings both confirm and contrast with the existing classifier performance literature. Given the typically highly correlated nature of financial data, we investigated the impact of correlation on classifier performance. The techniques used in this paper should be of interest to financial institutions seeking a CDS proxy method, and can serve for proxy construction for other financial variables. Some directions for future research are indicated.


CardiacNET: Segmentation of Left Atrium and Proximal Pulmonary Veins from MRI Using Multi-View CNN

arXiv.org Machine Learning

Anatomical and biophysical modeling of left atrium (LA) and proximal pulmonary veins (PPVs) is important for clinical management of several cardiac diseases. Magnetic resonance imaging (MRI) allows qualitative assessment of LA and PPVs through visualization. However, there is a strong need for an advanced image segmentation method to be applied to cardiac MRI for quantitative analysis of LA and PPVs. In this study, we address this unmet clinical need by exploring a new deep learning-based segmentation strategy for quantification of LA and PPVs with high accuracy and heightened efficiency. Our approach is based on a multi-view convolutional neural network (CNN) with an adaptive fusion strategy and a new loss function that allows fast and more accurate convergence of the backpropagation-based optimization. After training our network from scratch using more than 60K 2D MRI images (slices), we evaluated our segmentation strategy on the STACOM 2013 cardiac segmentation challenge benchmark. Qualitative and quantitative evaluations, obtained from the segmentation challenge, indicate that the proposed method achieved state-of-the-art sensitivity (90%), specificity (99%), precision (94%), and efficiency levels (10 seconds on GPU, and 7.5 minutes on CPU).


Data Science Dictionary

@machinelearnbot

The idea of cross-validation is to split the data into N subsets, to put one subset aside, to estimate the parameters of the model from the remaining N-1 subsets, and to use the retained subset to estimate the error of the model. This process is repeated N times, with each of the N subsets being used once as the validation set. The error values obtained in these N steps are then combined to provide the final estimate of the model error. Cross-validation is used in various classification and prediction procedures, such as regression analysis, discriminant analysis, neural networks, and classification and regression trees (CART). The goal is to improve the quality of the decision that is made from the outcome of the study on the basis of statistical methods, and to ensure that maximum information is obtained from scarce experimental data.
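The N-subset procedure described in this entry can be sketched in plain Python. This is a minimal illustration, not code from the dictionary entry: the helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `predict_mean`), the use of mean squared error, and the trivial "predict the training mean" model are all assumptions chosen to keep the example self-contained.

```python
def k_fold_indices(n_samples, n_folds):
    """Split indices 0..n_samples-1 into n_folds contiguous, near-equal subsets."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, n_folds, fit, predict):
    """Return (mean error, per-fold errors) using mean squared error."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), n_folds):
        held = set(held_out)
        # Estimate the model from the remaining N-1 subsets.
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        # Use the subset that was put aside to estimate the error.
        squared = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        fold_errors.append(sum(squared) / len(squared))
    # Combine the N error values into the final estimate.
    return sum(fold_errors) / len(fold_errors), fold_errors


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


if __name__ == "__main__":
    xs = list(range(6))
    ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    mean_error, per_fold = cross_validate(xs, ys, 3, fit_mean, predict_mean)
    print(mean_error, per_fold)
```

Contiguous folds are used here only for simplicity; in practice the data is usually shuffled before splitting so that each subset is representative.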


Email Spam Filtering: An Implementation with Python and Scikit-learn

@machinelearnbot

Text mining (deriving information from text) is a wide field which has gained popularity with the huge amount of text data being generated. A number of applications, such as sentiment analysis, document classification, topic classification, text summarization, and machine translation, have been automated using machine learning models. Spam filtering is a beginner's example of a document classification task, which involves classifying an email as spam or non-spam. The Spam box in your Gmail account is the best example of this. So let's get started building a spam filter on a publicly available mail corpus.


WWE Backlash 2017: Predictions, Match Card For 'SmackDown Live' PPV

International Business Times

For the first time in more than three months, a "SmackDown Live" pay-per-view is on the schedule. WWE Backlash 2017 is set for Sunday night in Rosemont, Illinois at the Allstate Arena with a few new faces set to compete in some of the card's biggest matches. Below are WWE Backlash predictions for every match on the card. Eight matches are scheduled, and three championships will be on the line. It was pretty surprising when Mahal became the No.1 contender for the top belt on "SmackDown Live," and it would be even more shocking to see him win the title.


Imbalanced Datasets

@machinelearnbot

Imagine you are a medical professional who is training a classifier to detect whether an individual has an extremely rare disease. You train your classifier, and it yields 99.9% accuracy on your test set. You're overcome with joy by these results, but when you check the labels output by the classifier, you see it always output "No Disease," regardless of the patient data. Because the disease is extremely rare, there were only a handful of patients with the disease in your dataset compared to the thousands of patients without the disease. Because over 99.9% of the patients in your dataset don't have the disease, any classifier can achieve an impressively high accuracy simply by returning "No Disease" for every new patient.
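The scenario can be reproduced with a few lines of Python. The counts below (5 patients with the disease out of 5,000) are hypothetical numbers chosen to match the 99.9% figure, and the helper names `accuracy` and `recall` are illustrative, not taken from the article. The sketch shows that a classifier that always answers "No Disease" scores high on accuracy while detecting none of the actual cases.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive cases that were predicted positive."""
    n_positive = sum(1 for t in y_true if t == positive)
    if n_positive == 0:
        return 0.0  # no positive cases to find
    found = sum(1 for t, p in zip(y_true, y_pred)
                if t == positive and p == positive)
    return found / n_positive


# Hypothetical dataset: 1 = disease, 0 = no disease.
y_true = [1] * 5 + [0] * 4995

# A "classifier" that returns "No Disease" for every patient.
y_pred = [0] * len(y_true)

if __name__ == "__main__":
    print("accuracy:", accuracy(y_true, y_pred))
    print("recall:  ", recall(y_true, y_pred))
```

Recall on the disease class is one of several metrics that expose the problem; precision and the confusion matrix are common alternatives.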


How to create text classifiers with Machine Learning

@machinelearnbot

Building a quality machine learning model for text classification can be a challenging process. You need to build a training dataset, test different parameters for your model, and fix the confusions, among other things. In this post, we will describe how you can successfully train text classifiers with machine learning using MonkeyLearn. What are the categories or tags that you want to assign to your texts? This is the first question you need to answer when you start working on your text classifier.


On ROC Curve Analysis of Artificial Neural Network Classifiers

AAAI Conferences

Receiver operating characteristic (ROC) curves are of great interest in evaluating many security systems such as biometric authentication. They visualize the trade-off between the number of security breaches and the level of convenience. In earlier work, ROC curves and their decision boundaries were studied for various classifiers. Here, further studies are conducted to identify problems of ROC curve analysis when artificial neural network (ANN) classifiers' net values are used. Graphical decision boundaries and experimental results on the IRIS biometric authentication system reveal over-fitting in the ROC curve analysis. These graphical decision boundaries suggest that ANN classifiers with two output units are more desirable than those with a single output unit for two-class classification problems.
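As a rough illustration of how an ROC curve is obtained from a classifier's scores, the sketch below sweeps a decision threshold and records the false positive rate and true positive rate at each step. It is a generic construction, not the paper's method. The function names and the toy scores are assumptions, and reading false positives as "security breaches" (impostors accepted) and true positives as "convenience" (genuine users accepted) is an interpretation of the abstract.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    One point per distinct threshold, from the strictest threshold
    (accept nobody) to the most lenient (accept everybody).
    labels: 1 = genuine user, 0 = impostor. Both classes must be present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        area += (x2 - x1) * (y1 + y2) / 2.0
    return area


if __name__ == "__main__":
    # Hypothetical network outputs for four authentication attempts.
    scores = [0.9, 0.8, 0.3, 0.1]
    labels = [1, 1, 0, 0]
    curve = roc_points(scores, labels)
    print(curve)
    print(area_under_curve(curve))
```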