 Accuracy


RubixML/RubixML

#artificialintelligence

A high-level machine learning library that allows you to build programs that learn from data using the PHP language. Machine learning is the process by which a computer program progressively improves its performance on a certain task through training and data, without being explicitly programmed. Rubix supports two types of machine learning out of the box: supervised and unsupervised. Machine learning projects typically begin with a question. For example, you might want to answer the question "which of my friends are most likely to stay married to their spouse?" One way to answer this question with machine learning would be to ask a number of happily married and divorced couples the same set of questions about their partner, and then use that data to build a model of what a successful marriage looks like. Later, you can use that model to make predictions based on the answers you get from your friends. Specifically, the answers you collect are ...


An Adaptive Weighted Deep Forest Classifier

arXiv.org Machine Learning

A modification of the confidence screening mechanism, based on adaptive weighting of every training instance at each cascade level of the Deep Forest, is proposed. The idea underlying the modification is very simple and stems from the confidence screening mechanism proposed by Pang et al., which simplifies the Deep Forest classifier by updating the training set at each level in accordance with the classification accuracy of every training instance. However, whereas the confidence screening mechanism simply removes instances from the training and testing processes, the proposed modification is more flexible and assigns weights that take the classification accuracy into account. The modification is similar to AdaBoost to some extent. Numerical experiments illustrate good performance of the proposed modification in comparison with the original Deep Forest proposed by Zhou and Feng.


Fast Multi-Class Probabilistic Classifier by Sparse Non-parametric Density Estimation

arXiv.org Machine Learning

Model interpretation is essential in many application scenarios, and building a classification model that is easy to interpret may provide useful information for further studies and improvement. It is common to encounter a lengthy set of variables in modern data analysis, especially when data are collected in some automatic way. Such datasets may not have been collected with a specific analysis target in mind and usually contain redundant features that contribute nothing to the current analysis task of interest. Variable selection is a common way to improve model interpretability and is widely used with some parametric classification models. There is a lack of studies on variable selection in nonparametric classification models, such as density estimation-based methods, and this is especially the case for multiple-class classification. In this study we address multiple-class classification problems using the idea of sparse non-parametric density estimation and propose a method for identifying high-impact variables for each class. We present the asymptotic properties and the computation procedure for the proposed method, together with some suggested sample sizes. We also report numerical results using both synthesized and some real data sets.


Big Tech Deploys AI to Combat Hackers

#artificialintelligence

Last year, Microsoft Corp.'s Azure security team detected suspicious activity in the cloud computing usage of a large retailer: One of the company's administrators, who usually logs on from New York, was trying to gain entry from Romania. A hacker had broken in. Microsoft quickly alerted its customer, and the attack was foiled before the intruder got too far. Inc. and various startups are moving away from solely using older "rules-based" technology designed to respond to specific kinds of intrusion and deploying machine-learning algorithms that crunch massive amounts of data on logins, behavior and previous attacks to ferret out and stop hackers. "Machine learning is a very powerful technique for security--it's dynamic, while rules-based systems are very rigid," says Dawn Song, a professor at the University of California at Berkeley's Artificial Intelligence Research Lab. "It's a very manual intensive process to change them, whereas machine learning is automated, dynamic and you can retrain it easily."


A Model for Learned Bloom Filters, and Optimizing by Sandwiching

arXiv.org Machine Learning

Recent work has suggested enhancing Bloom filters by using a pre-filter, based on applying machine learning to determine a function that models the data set the Bloom filter is meant to represent. Here we model such learned Bloom filters, with the following outcomes: (1) we clarify what guarantees can and cannot be associated with such a structure; (2) we show how to estimate what size the learning function must obtain in order to obtain improved performance; (3) we provide a simple method, sandwiching, for optimizing learned Bloom filters; and (4) we propose a design and analysis approach for a learned Bloomier filter, based on our modeling approach.
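The sandwiching idea in outcome (3) can be illustrated with a minimal sketch. It assumes the usual reading of the structure: an initial Bloom filter screens queries, a learned function then accepts likely keys, and a backup Bloom filter holds the keys the learned function misses. The class names, the SHA-256 based hashing, the filter sizes, and the toy `model` predicate are all hypothetical choices made for illustration and do not come from the paper.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter; one byte per bit keeps the sketch simple."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class SandwichedLearnedBloomFilter:
    """Initial filter -> learned predicate -> backup filter."""

    def __init__(self, keys, model, initial_bits, backup_bits, num_hashes=3):
        self.model = model
        self.initial = BloomFilter(initial_bits, num_hashes)
        self.backup = BloomFilter(backup_bits, num_hashes)
        for key in keys:
            self.initial.add(key)
            # Keys the model rejects would become false negatives,
            # so they are stored in the backup filter.
            if not model(key):
                self.backup.add(key)

    def __contains__(self, item):
        if item not in self.initial:
            return False          # screened out before the model runs
        if self.model(item):
            return True           # may be a false positive of the model
        return item in self.backup


# Hypothetical stand-in for a trained classifier.
def model(item):
    return item.endswith(".example")


keys = ["a.example", "b.example", "c.test"]
sandwich = SandwichedLearnedBloomFilter(keys, model, initial_bits=64, backup_bits=32)
print(all(key in sandwich for key in keys))
```

Every inserted key is reported as present, because a key either passes the learned predicate or was added to the backup filter. Non-keys can still be false positives, which is where the sizing questions raised in the abstract come in.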


Sparse Learning in reproducing kernel Hilbert space

arXiv.org Machine Learning

Sparse learning aims to learn the sparse structure of the true target function from the collected data, which plays a crucial role in high-dimensional data analysis. This article proposes a unified and universal method for learning sparsity of M-estimators within a rich family of loss functions in a reproducing kernel Hilbert space (RKHS). The family of loss functions considered is very rich and includes those most commonly used in the literature. More importantly, the proposed method is motivated by some nice properties of the induced RKHS, is computationally efficient for large-scale data, and can be further improved through parallel computing. The asymptotic estimation and selection consistencies of the proposed method are established for a general loss function under mild conditions. The method works for a general loss function, admits a general dependence structure, allows for efficient computation, and comes with theoretical guarantees. The superior performance of the proposed method is also supported by a variety of simulated examples and a real application in the human breast cancer study (GSE20194).


Top 100 Data science interview questions

#artificialintelligence

Data science, also known as data-driven decision making, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge from data in various forms, and to make decisions based on this knowledge. A data scientist should be evaluated not only on his/her knowledge of machine learning; he/she should also have good expertise in statistics. I will try to start from the very basics of data science and then slowly move to expert level. Supervised machine learning requires labeled training data. Unsupervised machine learning doesn't require labeled data. "Bias is error introduced in your model due to over simplification of machine learning algorithm." It can lead to underfitting.


Model evaluation, model selection, and algorithm selection in machine learning

#artificialintelligence

A single-PDF version of Model Evaluation parts 1-4 is available on arXiv: https://arxiv.org/abs/1811.12808 This final article in the series Model evaluation, model selection, and algorithm selection in machine learning presents overviews of several statistical hypothesis testing approaches, with applications to machine learning model and algorithm comparisons. This includes statistical tests based on target predictions for independent test sets (the downsides of using a single test set for model comparisons were discussed in previous articles) as well as methods for algorithm comparisons by fitting and evaluating models via cross-validation. Lastly, this article will introduce nested cross-validation, which has become a common and recommended method of choice for algorithm comparisons on small to moderately-sized datasets. Then, at the end of this article, I provide a list of my personal suggestions concerning model evaluation, selection, and algorithm selection, summarizing the several techniques covered in this series of articles. There are several different statistical hypothesis testing frameworks that are being used in practice to compare the performance of classification models, including conventional methods such as the difference of two proportions (here, the proportions are the estimated generalization accuracies from a test set), for which we can construct 95% confidence intervals based on the concept of the Normal Approximation to the Binomial that was covered in Part I. Performing a z-score test for two population proportions is inarguably the most straight-forward way to compare two models (but certainly not the best!): In a nutshell, if the 95% confidence intervals of the accuracies of two models do not overlap, we can reject the null hypothesis that the performance of both classifiers is equal at a significance level of α = 0.05 (or 5% probability).
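As a rough illustration of the two-proportion comparison described above, the sketch below computes a pooled z statistic, a two-sided p-value, and normal-approximation 95% intervals using only the standard library. The function names, the accuracies (0.90 and 0.80), and the test-set sizes (100 each) are made-up assumptions for the example and are not taken from the article.

```python
import math


def two_proportion_z_test(acc1, acc2, n1, n2):
    """Return (z, two-sided p-value) for the difference of two accuracies."""
    # Pooled accuracy under the null hypothesis of equal performance.
    pooled = (acc1 * n1 + acc2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (acc1 - acc2) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


def normal_approx_interval(acc, n, z_crit=1.96):
    """95% interval via the normal approximation to the binomial."""
    half_width = z_crit * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width


# Hypothetical test-set results for two classifiers.
z, p_value = two_proportion_z_test(0.90, 0.80, n1=100, n2=100)
ci_1 = normal_approx_interval(0.90, 100)
ci_2 = normal_approx_interval(0.80, 100)
intervals_overlap = ci_1[0] <= ci_2[1] and ci_2[0] <= ci_1[1]

print(f"z = {z:.3f}, p = {p_value:.4f}")
print(f"95% CI model 1: {ci_1}, model 2: {ci_2}, overlap: {intervals_overlap}")
```

With these particular numbers the two intervals overlap even though the p-value falls slightly below 0.05. This suggests that the interval-overlap check is the more conservative of the two criteria.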


AI and ML choices can dramatically improve data security

#artificialintelligence

As networks have advanced in complexity, so have the tools and tactics of cybercriminals. Organizations increase their cybersecurity budgets and teams, yet breaches keep occurring. In the fight for stronger security, vendors are offering up AI and machine learning as a Holy Grail. But do these technologies actually deliver? Frequent headlines make it clear that cybercriminals are currently winning battles regularly.


An Automatic Interaction Detection Hybrid Model for Bankcard Response Classification

arXiv.org Machine Learning

In this paper, we propose a hybrid bankcard response model, which integrates decision tree based chi-square automatic interaction detection (CHAID) into logistic regression. In the first stage of the hybrid model, CHAID analysis is used to detect potential variable interactions. Then, in the second stage, these potential interactions serve as additional input variables in logistic regression. The motivation for the proposed hybrid model is that adding variable interactions may improve the performance of logistic regression. To demonstrate its effectiveness, the proposed hybrid model is evaluated on a real credit customer response data set. As the results reveal, by identifying potential interactions among independent variables, the proposed hybrid approach outperforms logistic regression without an interaction search in terms of classification accuracy, the area under the receiver operating characteristic (ROC) curve, and Kolmogorov-Smirnov (KS) statistics. Furthermore, CHAID analysis for interaction detection is much more computationally efficient than the stepwise search mentioned above, and some identified interactions are shown to have statistically significant predictive power on the target variable. Last but not least, the customer profile created from the CHAID tree provides a reasonable interpretation of the interactions, which is required by regulations of the credit industry. Hence, this study provides an alternative for handling bankcard classification tasks.