AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.79)

@machinelearnbot, Jul-14-2017, 02:05:07 GMT

[R] Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting • r/MachineLearning

I'm a little bewildered here. "Note that the softmax is not included in the table for the very simple reason that it gave miserable results on this NN configuration." Softmax with cross-entropy is the de facto output activation in FCNs. They don't specify whether that test used CE or MSE error, but even if it used MSE (as a later experiment does), that just speaks to the incredibly poorly designed network they used (392-50-10 neurons is truly weird). The idea bears some resemblance to momentum, where we gradually speed things up when the error gradients are consistent.

artificial intelligence, machine learning, social media, (9 more...)

Industry: Media > News (0.40)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.40)

Liu, C. H. Bryan, Chamberlain, Benjamin Paul, Little, Duncan A., Cardoso, Angelo

Generalising Random Forest Parameter Optimisation to Include Stability and Cost

arXiv.org Machine Learning, Jul-12-2017, 19:00:00 GMT

Random forests are among the most popular classification and regression methods used in industrial applications. To be effective, the parameters of random forests must be carefully tuned. This is usually done by choosing values that minimize the prediction error on a held-out dataset. We argue that error reduction is only one of several metrics that must be considered when optimizing random forest parameters for commercial applications. We propose a novel metric that captures the stability of random forest predictions, which we argue is key for scenarios that require successive predictions. We motivate the need for multi-criteria optimization by showing that in practical applications, simply choosing the parameters that lead to the lowest error can introduce unnecessary costs and produce predictions that are not stable across independent runs. To optimize this multi-criteria trade-off, we present a new framework that efficiently finds a principled balance between these three considerations using Bayesian optimisation. The pitfalls of optimising forest parameters purely for error reduction are demonstrated using two publicly available real-world datasets. We show that our framework leads to parameter settings that are markedly different from the values discovered by error reduction metrics.
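The abstract speaks of predictions that are "not stable across independent runs". One plausible way to put a number on that, sketched below, is the average squared difference between the predictions of every pair of runs on the same inputs. The function name and the sample data are illustrative assumptions, not taken from the paper, and the paper's own metric may be defined differently.

```python
from itertools import combinations


def mean_squared_prediction_delta(runs):
    """Average squared difference between predictions of every pair of runs.

    `runs` is a list of prediction lists, one per independently trained
    model, all scored on the same inputs in the same order.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure stability")
    n_points = len(runs[0])
    if n_points == 0 or any(len(r) != n_points for r in runs):
        raise ValueError("all runs must score the same non-empty set of inputs")

    total, n_pairs = 0.0, 0
    for a, b in combinations(runs, 2):
        # Mean squared difference between this pair of runs.
        total += sum((pa - pb) ** 2 for pa, pb in zip(a, b)) / n_points
        n_pairs += 1
    return total / n_pairs


# Hypothetical predicted probabilities from three training runs on four inputs.
runs = [
    [0.10, 0.80, 0.55, 0.30],
    [0.12, 0.78, 0.45, 0.30],
    [0.10, 0.82, 0.50, 0.28],
]
instability = mean_squared_prediction_delta(runs)
```

A value of zero means every run produced identical predictions; larger values mean less stable predictions. Such a score could be combined with error and training cost in a multi-criteria objective of the kind the abstract describes.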

artificial intelligence, machine learning, random forest, (18 more...)

doi: 10.1007/978-3-319-71273-4_9

1706.09865

Country: Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.34)

Wager, Stefan, Athey, Susan

Estimation and Inference of Heterogeneous Treatment Effects using Random Forests

arXiv.org Machine Learning, Jul-9-2017

Many scientific and engineering challenges -- ranging from personalized medicine to customized marketing recommendations -- require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.
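For readers unfamiliar with the setting, the quantities the abstract refers to can be written out in standard potential-outcomes notation. The symbols below (outcome Y, treatment indicator W, covariates X, standard-error estimate \hat{\sigma}) are assumptions chosen for illustration; the paper's own notation may differ.

```latex
% Heterogeneous (conditional average) treatment effect at covariate value x
\tau(x) = \mathbb{E}\left[ Y_i^{(1)} - Y_i^{(0)} \,\middle|\, X_i = x \right]

% Unconfoundedness: treatment is as good as random given the covariates
\left\{ Y_i^{(0)},\, Y_i^{(1)} \right\} \perp W_i \mid X_i

% Asymptotically Gaussian and centered sampling distribution
\frac{\hat{\tau}(x) - \tau(x)}{\sqrt{\operatorname{Var}\left[\hat{\tau}(x)\right]}}
  \;\Rightarrow\; \mathcal{N}(0, 1)

% Resulting asymptotic confidence interval centered at the forest estimate
\hat{\tau}(x) \pm z_{1-\alpha/2}\, \hat{\sigma}(x)
```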

artificial intelligence, machine learning, random forest, (20 more...)

1510.04342

Country:

North America > United States > New York (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > Strength High (0.93)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Athey, Susan, Tibshirani, Julie, Wager, Stefan

Generalized Random Forests

arXiv.org Machine Learning, Jul-5-2017

We propose generalized random forests, a method for non-parametric statistical estimation based on random forests (Breiman, 2001) that can be used to fit any quantity of interest identified as the solution to a set of local moment equations. Following the literature on local maximum likelihood estimation, our method operates at a particular point in covariate space by considering a weighted set of nearby training examples; however, instead of using classical kernel weighting functions that are prone to a strong curse of dimensionality, we use an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest. We propose a flexible, computationally efficient algorithm for growing generalized random forests, develop a large sample theory for our method showing that our estimates are consistent and asymptotically Gaussian, and provide an estimator for their asymptotic variance that enables valid confidence intervals. We use our approach to develop new methods for three statistical tasks: non-parametric quantile regression, conditional average partial effect estimation, and heterogeneous treatment effect estimation via instrumental variables. A software implementation, grf for R and C++, is available from CRAN.
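To make "the solution to a set of local moment equations" concrete, the sketch below solves one such equation for a conditional quantile: it finds the value at which the weighted sum of (1[y_i <= theta] - q) changes sign. The weights are assumed to be given. In the method described above they would come from a forest, whereas here they are hypothetical numbers, and the function name is illustrative rather than part of grf.

```python
def local_quantile(ys, weights, q):
    """Solve sum_i w_i * (1[y_i <= theta] - q) = 0 for theta.

    This is a weighted quantile: `weights` say how relevant each training
    outcome in `ys` is to the query point.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must be in (0, 1)")
    if len(ys) != len(weights) or not ys:
        raise ValueError("ys and weights must be non-empty and the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")

    cumulative = 0.0
    for y, w in sorted(zip(ys, weights)):
        cumulative += w / total
        # The weighted moment changes sign at the first y where the
        # cumulative normalized weight reaches q.
        if cumulative >= q:
            return y
    return max(ys)  # guards against rounding leaving cumulative just below q


# Hypothetical outcomes and similarity weights for one query point.
ys = [1.0, 2.0, 3.0, 4.0, 10.0]
weights = [0.1, 0.1, 0.5, 0.2, 0.1]
median_near_x = local_quantile(ys, weights, 0.5)
```

Swapping in a different moment function (for example a residual for a local mean) changes the quantity being estimated while the weighting scheme stays the same, which is the flexibility the abstract points to.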

artificial intelligence, machine learning, random forest, (18 more...)

1610.01271

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Vietnam (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.54)

arXiv.org Machine Learning, Jul-5-2017

A Data Science Approach to Understanding Residential Water Contamination in Flint

Chojnacki, Alex, Dai, Chengyu, Farahi, Arya, Shi, Guangsha, Webb, Jared, Zhang, Daniel T., Abernethy, Jacob, Schwartz, Eric

When the residents of Flint learned that lead had contaminated their water system, the local government made water-testing kits available to them free of charge. The city government published the results of these tests, creating a valuable dataset that is key to understanding the causes and extent of the lead contamination event in Flint. This is the nation's largest dataset on lead in a municipal water system. In this paper, we predict the lead contamination for each household's water supply, and we study several related aspects of Flint's water troubles, many of which generalize well beyond this one city. For example, we show that elevated lead risks can be (weakly) predicted from observable home attributes. Then we explore the factors associated with elevated lead. These risk assessments were developed in part via a crowdsourced prediction challenge at the University of Michigan. To inform Flint residents of these assessments, they have been incorporated into a web and mobile application funded by Google.org. We also explore questions of self-selection in the residential testing program, examining which factors are linked to when and how frequently residents voluntarily sample their water.

data mining, flint, machine learning, (20 more...)

doi: 10.1145/3097983.3098078

1707.01591

Country: North America > United States > Michigan > Genesee County > Flint (0.14)

Genre: Research Report (0.64)

Industry:

Water & Waste Management > Water Management > Water Supplies & Services (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

#artificialintelligence, Jun-30-2017, 17:15:19 GMT

A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning - Machine Learning Mastery

The idea of boosting grew out of the question of whether a weak learner can be modified to become better.
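A minimal sketch of how weak learners can be combined into a stronger one: gradient boosting for regression under squared loss, using one-split decision stumps on a single feature. This is a teaching example written for this summary, not code from the linked article and not how XGBoost is implemented; the function names and defaults are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature by squared error.

    Requires at least two distinct values in `xs`.
    """
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost regression stumps under squared loss; returns a predictor."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared loss the negative gradient is simply the residual,
        # so each new stump is fitted to what the ensemble still gets wrong.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Hypothetical toy data with a step between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
model = gradient_boost(xs, ys)
```

Each stump alone is a poor model, but because every round fits the current residuals and adds only a small, shrunken correction, the ensemble's training error falls steadily.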

artificial intelligence, machine learning, weak learner, (11 more...)


Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

@machinelearnbot, Jun-28-2017, 14:50:07 GMT

Dask and Pandas and XGBoost: Playing nicely between distributed systems

Editor's note: For an introduction to Dask, consider reading Introducing Dask for Parallel Programming: An Interview with Project Lead Developer. To read more about the most recent release, see Dask Release 0.14.1. This post talks about distributing Pandas Dataframes with Dask and then handing them over to distributed XGBoost for training. More generally it discusses the value of launching multiple distributed systems in the same shared-memory processes and smoothly handing data back and forth between them. XGBoost is a well-loved library for a popular class of machine learning algorithms, gradient boosted trees.

artificial intelligence, machine learning, xgboost, (16 more...)

Country:

North America > United States > New York > Kings County > New York City (0.05)
North America > United States > Illinois > Cook County > Chicago (0.05)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Pinto, Fábio, Cerqueira, Vítor, Soares, Carlos, Mendes-Moreira, João

autoBagging: Learning to Rank Bagging Workflows with Metalearning

arXiv.org Machine Learning, Jun-28-2017

Machine Learning (ML) has been successfully applied to a wide range of domains and applications. One of the techniques behind most of these successful applications is Ensemble Learning (EL), the field of ML that gave birth to methods such as Random Forests or Boosting. The complexity of applying these techniques, together with the scarcity of ML experts on the market, has created the need for systems that enable a fast and easy drop-in replacement for ML libraries. Automated machine learning (autoML) is the field of ML that attempts to answer these needs. Typically, these systems rely on optimization techniques such as Bayesian optimization to lead the search for the best model. Our approach differs from these systems by making use of the most recent advances in metalearning and a learning to rank approach to learn from metadata. We propose autoBagging, an autoML system that automatically ranks 63 bagging workflows by exploiting past performance and dataset characterization. Results on 140 classification datasets from the OpenML platform show that autoBagging can yield better performance than the Average Rank method and achieve results that are not statistically different from an ideal model that systematically selects the best workflow for each dataset. For the purpose of reproducibility and generalizability, autoBagging is publicly available as an R package on CRAN.
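The Average Rank baseline mentioned in the abstract can be sketched as follows, assuming its usual meaning in the metalearning literature: rank the workflows on each past dataset, average those ranks, and recommend workflows in order of average rank regardless of the new dataset's characteristics. The workflow names and scores are hypothetical, and ties are broken arbitrarily here.

```python
def average_ranks(scores_by_dataset):
    """Rank workflows on each dataset (1 = best), then average per workflow.

    Assumes every workflow has a score on every dataset and that a
    higher score is better.
    """
    totals = {}
    for scores in scores_by_dataset.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, workflow in enumerate(ordered, start=1):
            totals[workflow] = totals.get(workflow, 0) + rank
    n_datasets = len(scores_by_dataset)
    return {workflow: total / n_datasets for workflow, total in totals.items()}


# Hypothetical accuracies of three bagging workflows on three past datasets.
past_results = {
    "dataset_a": {"bagging_small": 0.81, "bagging_pruned": 0.85, "bagging_large": 0.83},
    "dataset_b": {"bagging_small": 0.70, "bagging_pruned": 0.72, "bagging_large": 0.74},
    "dataset_c": {"bagging_small": 0.90, "bagging_pruned": 0.93, "bagging_large": 0.91},
}
ranks = average_ranks(past_results)
recommended = sorted(ranks, key=ranks.get)  # best average rank first
```

Because this baseline returns the same ranking for every new dataset, a metalearning system only adds value when dataset characterization lets it beat that fixed ordering.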

artificial intelligence, machine learning, workflow, (14 more...)

1706.09367

Country:

North America > United States (0.04)
Europe > Portugal > Porto > Porto (0.04)
Asia (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)

@machinelearnbot, Jun-27-2017, 23:15:06 GMT

How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part 2

In the first part of this series, I introduced the Outbrain Click Prediction machine learning competition. That post described some preliminary and important data science tasks like exploratory data analysis and feature engineering performed for the competition, using a Spark cluster deployed on Google Dataproc. In this post, I describe the competition evaluation, the design of my cross-validation strategy and my baseline models using statistics and tree ensembles. In that competition, Kagglers were required to rank recommended ads by decreasing predicted likelihood of being clicked. Sponsored search advertising, contextual advertising, display advertising and real-time bidding auctions have all relied heavily on the ability of learned models to predict ad click-through rates (CTRs) accurately, quickly and reliably.
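Ranking ads by decreasing predicted click likelihood is typically scored with a ranking metric. The sketch below assumes mean average precision at a cutoff k with exactly one clicked ad per display, in which case each display's score is the reciprocal of the clicked ad's position. The excerpt above does not state the metric, so the default k = 12, the function names and the sample data are all assumptions for illustration.

```python
def average_precision_at_k(ranked_ads, clicked_ad, k=12):
    """Reciprocal rank of the clicked ad if it is in the top k, else 0."""
    for position, ad in enumerate(ranked_ads[:k], start=1):
        if ad == clicked_ad:
            return 1.0 / position
    return 0.0


def mean_average_precision(displays, k=12):
    """Average the per-display score over all displays.

    Each display is a pair (predicted_ctr_by_ad, clicked_ad), where
    predicted_ctr_by_ad maps an ad id to its predicted click probability.
    """
    scores = []
    for predicted_ctr, clicked_ad in displays:
        # Rank the display's ads by decreasing predicted likelihood of a click.
        ranked = sorted(predicted_ctr, key=predicted_ctr.get, reverse=True)
        scores.append(average_precision_at_k(ranked, clicked_ad, k))
    return sum(scores) / len(scores)


# Hypothetical predictions for two displays; the clicked ad is "a" in both.
displays = [
    ({"a": 0.9, "b": 0.1}, "a"),
    ({"a": 0.2, "b": 0.7, "c": 0.1}, "a"),
]
score = mean_average_precision(displays)
```

A metric of this shape rewards only the relative order of ads within a display, so a cross-validation split should keep all ads from one display in the same fold.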

categorical value, competition, validation, (16 more...)

Genre: Contests & Prizes (0.40)

Industry:

Marketing (1.00)
Information Technology > Services (0.54)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.50)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.49)