Boston housing dataset
The Morgan-Pitman Test of Equality of Variances and its Application to Machine Learning Model Evaluation and Selection
Arratia, Argimiro, Cabaña, Alejandra, Mordecki, Ernesto, Rovira-Parra, Gerard
Model selection in non-linear models often prioritizes performance metrics over statistical tests, limiting the ability to account for sampling variability. We propose the use of a statistical test to assess the equality of variances in forecasting errors. The test builds upon the classic Morgan-Pitman approach, incorporating enhancements to ensure robustness against data with heavy-tailed distributions or outliers with high variance, plus a strategy to make residuals from machine learning models statistically independent. Through a series of simulations and real-world data applications, we demonstrate the test's effectiveness and practical utility, offering a reliable tool for model evaluation and selection in diverse contexts.
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Spain (0.05)
- South America > Uruguay > Montevideo > Montevideo (0.04)
- Europe > Hungary > Budapest > Budapest (0.04)
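The classic Morgan-Pitman test the paper builds on rests on a simple identity: for paired samples, the variances are equal exactly when the pairwise sums and differences are uncorrelated. A minimal sketch of that classic form (plain Pearson correlation on synthetic model errors, without the paper's robustness enhancements or its residual-decorrelation strategy; the function name and data are illustrative only):

```python
import numpy as np
from scipy.stats import pearsonr

def morgan_pitman(x, y):
    """Classic Morgan-Pitman test: Var(x) == Var(y) for paired samples
    iff corr(x + y, x - y) == 0, so test that correlation."""
    x, y = np.asarray(x), np.asarray(y)
    return pearsonr(x + y, x - y)  # (statistic, p-value)

# Toy stand-in for forecasting errors of two competing models
rng = np.random.default_rng(0)
e1 = rng.normal(0, 1.0, 500)  # model A: error std 1
e2 = rng.normal(0, 2.0, 500)  # model B: error std 2 (larger variance)
r, p = morgan_pitman(e1, e2)
print(f"r = {r:.3f}, p = {p:.4f}")  # small p: reject equal variances
```

For independent samples with variances 1 and 4, the expected correlation is (1 − 4)/(1 + 4) = −0.6, so the test rejects decisively here; the paper's contribution is making this reliable under heavy tails and dependent ML residuals.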
Robust Uncertainty Quantification using Conformalised Monte Carlo Prediction
Bethell, Daniel, Gerasimou, Simos, Calinescu, Radu
Deploying deep learning models in safety-critical applications remains a very challenging task, mandating the provision of assurances for the dependable operation of these models. Uncertainty quantification (UQ) methods estimate the model's confidence per prediction, informing decision-making by considering the effect of randomness and model misspecification. Despite advances in state-of-the-art UQ methods, they are computationally expensive or produce conservative prediction sets/intervals. We introduce MC-CP, a novel hybrid UQ method that combines a new adaptive Monte Carlo (MC) dropout method with conformal prediction (CP). MC-CP adaptively modulates the traditional MC dropout at runtime to save memory and computation resources, enabling predictions to be consumed by CP, yielding robust prediction sets/intervals. Through comprehensive experiments, we show that MC-CP delivers significant improvements over advanced UQ methods, like MC dropout, RAPS and CQR, both in classification and regression benchmarks. MC-CP can be easily added to existing models, making its deployment simple.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > United Kingdom > England > North Yorkshire > York (0.04)
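MC-CP itself is not reproduced here, but the conformal-prediction half of the method can be illustrated with plain split conformal regression: fit a model, calibrate a residual quantile on held-out data, and use it as a symmetric interval width with a finite-sample coverage guarantee. A sketch on synthetic data (the model and data are stand-ins, not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, 1000)

# Split: proper training set / calibration set / test set
Xtr, Xcal, Xte = X[:600], X[600:900], X[900:]
ytr, ycal, yte = y[:600], y[600:900], y[900:]

model = LinearRegression().fit(Xtr, ytr)

# Calibrate: absolute-residual quantile at level 1 - alpha,
# with the standard finite-sample correction (n + 1 in the numerator)
alpha = 0.1
scores = np.abs(ycal - model.predict(Xcal))
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

# Symmetric prediction intervals on the test set
pred = model.predict(Xte)
lo, hi = pred - q, pred + q
coverage = np.mean((yte >= lo) & (yte <= hi))
print(f"empirical coverage: {coverage:.2f}")  # typically close to 0.90
```

MC-CP replaces the single point prediction with adaptive MC-dropout samples before the CP step; the calibration logic above is the part CP contributes.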
Linear Regression on Boston Housing Dataset
In my previous blog, I covered the basics of linear regression and gradient descent. To get hands-on with linear regression, we will take an original dataset and apply the concepts we have learned. We will use the Housing dataset, which contains information about different houses in Boston. This data was originally part of the UCI Machine Learning Repository and has since been removed. We can also access this data from the scikit-learn library.
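Note that the `load_boston` loader such tutorials relied on was itself removed from scikit-learn in version 1.2 (the raw CSV remains archived at the CMU StatLib site). A hedged sketch of the same train/fit/evaluate workflow, using synthetic data of the same shape (506 samples, 13 features) so it runs anywhere:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the Boston dataset's shape: 506 rows, 13 features
rng = np.random.default_rng(42)
X = rng.normal(size=(506, 13))
y = X @ rng.normal(size=13) + rng.normal(0, 0.5, 506)  # linear signal + noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"test RMSE: {rmse:.3f}")  # close to the noise std of 0.5
```

Swapping in the real dataset only changes how `X` and `y` are loaded; the fit/evaluate steps are identical.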
Analyzing Boston housing dataset
I hope you are all safe and healthy. It's also been a while since I've gotten my hands dirty writing scripts and analyzing data. So, I'll start with something kinda light for me -- analyzing one of the go-to datasets for projects and demos, the Boston housing dataset. The Boston housing dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It has 506 samples and 14 variables.
Best Public Datasets for Machine Learning and Data Science
This resource is continuously updated. If you know of any other suitable and open datasets, please let us know by emailing us at pub@towardsai.net or by dropping a comment below. Google Dataset Search: Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they are hosted, whether it's a publisher's site, a digital library, or an author's web page. It's a phenomenal dataset finder, and it contains over 25 million datasets. Kaggle: Kaggle provides a vast collection of datasets, suitable for everyone from the enthusiast to the expert.
- North America > United States > Massachusetts > Suffolk County > Boston (0.05)
- North America > United States > New York (0.05)
- North America > United States > California > San Diego County > San Diego (0.05)
- Health & Medicine (0.74)
- Transportation > Ground > Road (0.51)
- Transportation > Passenger (0.51)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.30)
Ensembles of Random SHAPs
Utkin, Lev V., Konstantinov, Andrei V.
Ensemble-based modifications of the well-known SHapley Additive exPlanations (SHAP) method for the local explanation of a black-box model are proposed. The modifications aim to simplify SHAP, which is computationally expensive when there is a large number of features. The main idea behind the proposed modifications is to approximate SHAP by an ensemble of SHAPs with a smaller number of features. According to the first modification, called ER-SHAP, several features are randomly selected many times from the feature set, and Shapley values for the features are computed by means of "small" SHAPs. The explanation results are averaged to get the final Shapley values. According to the second modification, called ERW-SHAP, several points are generated around the explained instance for diversity purposes, and the results of their explanation are combined with weights depending on the distances between the points and the explained instance. The third modification, called ER-SHAP-RF, uses a random forest for a preliminary explanation of instances and for determining a feature probability distribution, which is applied to the selection of features in the ensemble-based procedure of ER-SHAP. Many numerical experiments illustrating the proposed modifications demonstrate their efficiency and properties for local explanation.
- Asia > Russia (0.28)
- Europe > Italy > Marche > Ancona Province > Ancona (0.04)
- North America > United States > Wisconsin (0.04)
- (2 more...)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
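The ER-SHAP idea — average many "small" SHAPs computed on random feature subsets — can be sketched with a permutation-based Shapley estimator. Everything here (function names, the subset size `k`, the ensemble count) is illustrative, not the authors' implementation; non-selected features are simply held at a background value:

```python
import numpy as np

def small_shap(f, x, background, features, rng, n_perm=50):
    """Permutation estimate of Shapley values for x, restricted to the
    given feature subset; features outside it stay at the background."""
    features = list(features)
    phi = np.zeros(len(features))
    for _ in range(n_perm):
        order = rng.permutation(features)
        z = background.copy()
        prev = f(z)
        for j in order:
            z[j] = x[j]              # switch feature j on in this order
            cur = f(z)
            phi[features.index(j)] += cur - prev
            prev = cur
    return phi / n_perm

def er_shap(f, x, background, n_features, k=2, n_ensembles=30, seed=0):
    """ER-SHAP sketch: average many 'small' SHAPs over random subsets."""
    rng = np.random.default_rng(seed)
    phi, counts = np.zeros(n_features), np.zeros(n_features)
    for _ in range(n_ensembles):
        subset = rng.choice(n_features, size=k, replace=False)
        phi[subset] += small_shap(f, x, background, subset, rng)
        counts[subset] += 1
    return phi / np.maximum(counts, 1)  # average per times-selected

f = lambda z: 3 * z[0] + 1 * z[1]  # toy additive "black box"
x, bg = np.array([1.0, 1.0, 1.0]), np.zeros(3)
print(er_shap(f, x, bg, n_features=3))  # approx [3, 1, 0] for additive f
```

For an additive function the estimates are exact regardless of permutation order, which makes the toy output easy to check; the value of the ensemble trick shows up on genuinely interacting black boxes, where each "small" SHAP is cheap.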
Interpretable Machine Learning with an Ensemble of Gradient Boosting Machines
Konstantinov, Andrei V., Utkin, Lev V.
A method for the local and global interpretation of a black-box model on the basis of the well-known generalized additive models is proposed. It can be viewed as an extension or a modification of the algorithm using the neural additive model. The method is based on using an ensemble of gradient boosting machines (GBMs) such that each GBM is learned on a single feature and produces a shape function of the feature. The ensemble is composed as a weighted sum of separate GBMs, resulting in a weighted sum of shape functions which form the generalized additive model. GBMs are built in parallel using randomized decision trees of depth 1, which provide a very simple architecture. Weights of GBMs as well as features are computed in each iteration of boosting by using the Lasso method and then updated by means of a specific smoothing procedure. In contrast to the neural additive model, the method provides weights of features in explicit form, and it is simple to train. Numerous numerical experiments with an algorithm implementing the proposed method on synthetic and real datasets demonstrate its efficiency and properties for local and global interpretation.
- Asia > Russia (0.14)
- North America > United States > Wisconsin (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Decision Tree Learning (0.90)
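The architecture described above — one depth-1 GBM per feature, producing shape functions combined by Lasso — can be approximated in a few lines of scikit-learn. This is a static sketch that omits the paper's per-iteration reweighting and smoothing procedure; all hyperparameters here are placeholders:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 4))
# Only features 0 and 1 matter; 2 and 3 are pure noise
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 400)

# One depth-1 GBM per feature: each learns a shape function of that feature
gbms = [GradientBoostingRegressor(max_depth=1, n_estimators=100,
                                  random_state=0).fit(X[:, [j]], y)
        for j in range(X.shape[1])]
shape = np.column_stack([g.predict(X[:, [j]]) for j, g in enumerate(gbms)])

# Lasso weights the shape functions into a generalized additive model,
# giving explicit (and sparse) feature weights
lasso = Lasso(alpha=0.01).fit(shape, y)
print("feature weights:", np.round(lasso.coef_, 2))
```

The explicit `lasso.coef_` vector is what makes the model directly interpretable: the informative features receive substantial weights, while the noise features are shrunk toward zero.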
Neural Networks in Python
In this tutorial, we will implement a multi-layer perceptron (a type of feed-forward neural network) in Python using three different libraries. We'll start off with the most basic example possible, moving to more complex and flexible frameworks with the aim of deepening our understanding of how to implement neural networks in Python. Quoting from the scikit-learn documentation [1], "A Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a function f: Rᵐ → Rᵒ by training on a dataset, where m is the number of dimensions for input and o is the number of dimensions for output. Given a set of features X = x¹, x², …, xᵐ and a target y, it can learn a non-linear function approximator for either classification or regression. It is different from logistic regression, in that between the input and the output layer, there can be one or more non-linear layers, called hidden layers".
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.05)
- North America > United States > Massachusetts > Suffolk County > Boston (0.05)