AITopics

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)

arXiv.org Artificial Intelligence Dec-17-2019

srlearn: A Python Library for Gradient-Boosted Statistical Relational Models

Hayes, Alexander L.

We present srlearn, a Python library for boosted statistical relational models. We adapt the scikit-learn interface to this setting and provide examples of how it can be used to express learning and inference problems.

hyperparameter, learning, srlearn, (11 more...)

1912.08198

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > Indiana (0.05)
North America > United States > Texas (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.50)

Lu, Benjamin, Hardin, Johanna

A Unified Framework for Random Forest Prediction Error Estimation

arXiv.org Machine Learning Dec-16-2019

We introduce a unified framework for random forest prediction error estimation based on a novel estimator of the conditional prediction error distribution function. Our framework enables immediate estimation of key parameters often of interest, including conditional mean squared prediction errors, conditional biases, and conditional quantiles, by a straightforward plugin routine. Our approach is particularly well-adapted for prediction interval estimation, which has received less attention in the random forest literature despite its practical utility; we show via simulations that our proposed prediction intervals are competitive with, and in some settings outperform, existing methods. To establish theoretical grounding for our framework, we prove pointwise uniform consistency of a more stringent version of our estimator of the conditional prediction error distribution. In addition to providing a suite of measures of prediction uncertainty, our general framework is applicable to many variants of the random forest algorithm. The estimators introduced here are implemented in the R package forestError.

estimator, prediction, random forest, (14 more...)

1912.07435

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
North America > United States > New York (0.04)
North America > United States > Colorado (0.04)
North America > United States > California > Los Angeles County > Claremont (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

arXiv.org Machine Learning Dec-13-2019

A Gap Analysis of Low-Cost Outdoor Air Quality Sensor In-Field Calibration

Concas, Francesco, Mineraud, Julien, Lagerspetz, Eemil, Varjonen, Samu, Puolamäki, Kai, Nurmi, Petteri, Tarkoma, Sasu

In recent years, interest in monitoring air quality has been growing. Traditional environmental monitoring stations are very expensive, both to acquire and to maintain, so their deployment is generally very sparse. This is a problem when trying to generate air quality maps with a fine spatial resolution. Given the general interest in air quality monitoring, low-cost air quality sensors have become an active area of research and development. Low-cost air quality sensors can be deployed at a finer level of granularity than traditional monitoring stations. Furthermore, they can be portable and mobile. Low-cost air quality sensors, however, present some challenges: they suffer from cross-sensitivities between different ambient pollutants; they can be affected by external factors such as traffic, weather changes, and human behavior; and their accuracy degrades over time. Some promising machine learning approaches can help us obtain highly accurate measurements with low-cost air quality sensors. In this article, we present low-cost sensor technologies, and we survey and assess machine learning-based techniques for their calibration. We conclude by presenting open questions and directions for future research.

calibration, pollutant, sensor, (15 more...)

1912.06384

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Finland > Uusimaa > Helsinki (0.05)
Asia > China > Beijing > Beijing (0.04)
(16 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.45)

Industry:

Materials > Chemicals (1.00)
Energy (1.00)
Law (0.67)
(2 more...)

Technology:

Information Technology > Communications > Networks > Sensor Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.46)

#artificialintelligence Dec-12-2019, 14:27:53 GMT

Productionizing Distributed XGBoost to Train Deep Tree Models with Large Data Sets at Uber

Zero values in SparseVectors are treated by XGBoost on Apache Spark as missing values (defaults to Float.NaN), whereas zeros in DenseVectors are simply treated as zeros. Vector storage in Apache Spark ML is implicitly optimized, so a vector array is stored as a SparseVector or DenseVector based on space efficiency. If an ML practitioner tries to feed a DenseVector at inference time to a model that was trained on SparseVectors, or vice versa, XGBoost does not provide any warning and the prediction input will likely go into unexpected branches due to the way zeros are stored, resulting in inconsistent predictions. Hence, it is critical that the storage structure of the input remains consistent between serving and training times.
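The mismatch described above can be illustrated with a toy example. The sketch below is not XGBoost or Spark code: it assumes a hypothetical one-split tree (`predict_stump`) with a default branch for missing values, and two hypothetical helpers (`from_dense`, `from_sparse`) that mimic how each storage format presents a zero to the model. The threshold and leaf values are made up for illustration.

```python
import math

# Hypothetical one-split tree: go left if the feature is below the threshold,
# otherwise go right. Missing values (NaN) follow a fixed default direction.
THRESHOLD = -0.5
LEFT_LEAF, RIGHT_LEAF = -1.0, 1.0
DEFAULT_LEAF_FOR_MISSING = LEFT_LEAF


def predict_stump(value):
    if math.isnan(value):
        return DEFAULT_LEAF_FOR_MISSING
    return LEFT_LEAF if value < THRESHOLD else RIGHT_LEAF


def from_dense(values, index):
    # Dense storage: every entry is present, so a zero stays a zero.
    return values[index]


def from_sparse(entries, index):
    # Sparse storage: only non-zero entries are kept; an absent entry is
    # reported as missing (NaN), mirroring the behaviour described above.
    return entries.get(index, float("nan"))


dense_row = [0.0, 3.2]
sparse_row = {1: 3.2}  # same logical row; the zero at index 0 is not stored

dense_pred = predict_stump(from_dense(dense_row, 0))
sparse_pred = predict_stump(from_sparse(sparse_row, 0))
print(dense_pred, sparse_pred)  # the same logical row lands in different leaves
```

In this sketch the same logical row reaches different leaves depending only on its storage format, which is why keeping one format at both training and serving time matters.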

productionizing, train deep tree model, xgboost, (4 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.97)

Tagliaferri, Giovanna, Scacciatelli, Daria, Di Loro, Pierfrancesco Alaimo

VAT tax gap prediction: a 2-steps Gradient Boosting approach

arXiv.org Machine Learning Dec-8-2019

Tax evasion is the illegal non-payment of taxes by individuals, corporations, and trusts. It results in a loss of state revenue that can undermine the effectiveness of government policies. One measure of tax evasion is the so-called tax gap: the difference between the income that should be reported to the tax authorities and the amount actually reported. However, economists lack a robust method for estimating the tax gap through a bottom-up approach based on fiscal audits. This is difficult because the declared tax base is available for the whole population but the income reported to the tax authorities is generally available only for a small, non-random sample of audited units. This induces a selection bias which invalidates standard statistical methods. Here, we use machine learning based on a 2-steps Gradient Boosting model to correct for the selection bias without requiring any strong assumption on the distribution. We use our method to estimate the Italian VAT Gap related to individual firms based on information gathered from administrative sources. Our algorithm estimates the potential VAT turnover of Italian individual firms for the fiscal year 2011 and suggests that the tax gap is about 30% of the total potential tax base. Comparisons with other methods show our technique offers a significant improvement in predictive performance.

algorithm, probability, tax base, (16 more...)

1912.03781

Country:

Europe > Italy > Lazio > Rome (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry: Government > Tax (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.46)

#artificialintelligence Dec-6-2019, 20:41:41 GMT

Terrible performance using XGBoost H2O

I am training an XGBoost model using 5-fold cross-validation on a very imbalanced binary classification problem. The dataset has 1200 columns (multi-document word2vec document embeddings). The reported performance on train data was extremely high (probably overfitting!!!). I know H2O cross-validation generates an extra model using the whole data available, and different performances are expected. But could that be the cause of such bad performance in the resulting model?

terrible performance, validation, xgboost model

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.97)

#artificialintelligence Dec-6-2019, 20:40:28 GMT

Random Forest Algorithm - Random Forest Explained | Random Forest in Machine Learning | Simplilearn

This Random Forest Algorithm tutorial explains how the Random Forest algorithm works in Machine Learning. By the end of this video, you will be able to understand what Machine Learning is, what a Classification problem is, the applications of Random Forest, why we need Random Forest, how it works with simple examples, and how to implement the Random Forest algorithm in Python. You can also go through the Slides here: https://goo.gl/K8T4tW Machine Learning Articles: https://www.simplilearn.com/what-is-a... To gain in-depth knowledge of Machine Learning, check our Machine Learning certification training course: https://www.simplilearn.com/big-data-... About Simplilearn Machine Learning course: A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people's digital interactions.

learning, machine learning, random forest, (8 more...)

Genre: Instructional Material (0.63)

Industry: Education (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)

Ramosaj, Burim, Pauly, Markus

Asymptotic Unbiasedness of the Permutation Importance Measure in Random Forest Models

arXiv.org Machine Learning Dec-5-2019

Variable selection in sparse regression models is an important task, as applications ranging from biomedical research to econometrics have shown. Especially for higher dimensional regression problems, for which the link function between response and covariates cannot be directly detected, the selection of informative variables is challenging. Under these circumstances, the Random Forest method is a helpful tool to predict new outcomes while delivering measures for variable selection. One common approach is the use of the permutation importance. Due to its intuitive idea and flexible usage, it is important to explore the circumstances under which the permutation importance based on Random Forest correctly indicates informative covariates. Regarding the latter, we deliver theoretical guarantees for the validity of the permutation importance measure under specific assumptions and prove its (asymptotic) unbiasedness. An extensive simulation study verifies our findings.
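For readers unfamiliar with the permutation importance mentioned above, the idea can be sketched in a few lines: shuffle one covariate at a time and record how much the prediction error grows. The sketch below is a simplified illustration, not the paper's estimator. It assumes a hypothetical fixed predictor (`predict`) standing in for a fitted Random Forest, evaluates on the full data set rather than on out-of-bag samples, and uses made-up data; `permutation_importance` and `mse` are illustrative helper names.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only; all other columns stay in place.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted forest: uses feature 0, ignores feature 1.
def predict(row):
    return 2.0 * row[0]


data_rng = random.Random(1)
X = [[float(i), data_rng.random()] for i in range(50)]
y = [2.0 * row[0] for row in X]

importances = permutation_importance(predict, X, y)
print(importances)  # feature 0 gets a positive score, feature 1 scores zero
```

Because the stand-in predictor never reads feature 1, shuffling that column leaves the error unchanged, while shuffling the informative feature 0 increases it.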

permutation importance, sample size, signal-to-noise ratio, (16 more...)

1912.03306

Country: North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)

#artificialintelligence Dec-4-2019, 00:03:56 GMT

Scikit-Optimize: Bayesian Hyperparameter Optimization in Python

There are four optimization algorithms to try. You can run a simple random search over the parameters. Nothing fancy here, but it is useful to have this option within the same API to compare if needed. Both of those methods, as well as the one in the next section, are examples of Bayesian Hyperparameter Optimization, also known as Sequential Model-Based Optimization (SMBO). The idea behind this approach is to estimate the user-defined objective function with a random forest, extra trees, or gradient boosted trees regressor.
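As a point of comparison, the random search baseline mentioned above can be sketched with the standard library alone. This is not the Scikit-Optimize API: `random_search`, the toy `objective`, and the parameter names and bounds are all hypothetical, chosen only to show the shape of the loop. A sequential model-based method would replace the uniform sampling step with proposals guided by a regressor fitted to the points evaluated so far.

```python
import random


def random_search(objective, bounds, n_calls=200, seed=0):
    """Sample points uniformly within the bounds and keep the best one."""
    rng = random.Random(seed)
    best_x, best_value = None, float("inf")
    for _ in range(n_calls):
        x = [rng.uniform(low, high) for low, high in bounds]
        value = objective(x)
        if value < best_value:
            best_x, best_value = x, value
    return best_x, best_value


# Hypothetical objective standing in for a validation loss to be minimized.
def objective(params):
    learning_rate, subsample = params
    return (learning_rate - 0.1) ** 2 + (subsample - 0.8) ** 2


# Hypothetical search space: (low, high) for learning_rate and subsample.
bounds = [(0.01, 1.0), (0.5, 1.0)]
best_x, best_value = random_search(objective, bounds)
print(best_x, best_value)
```

Keeping the baseline behind the same call shape as a model-based optimizer (objective, bounds, number of calls) makes it easy to swap one for the other when comparing results.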

acquisition function, bayesian hyperparameter optimization, objective function, (7 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.59)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.39)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.39)