Statistical Hypothesis Testing for Information Value (IV)

Rojas, Helder, Alvarez, Cirilo, Rojas, Nilton

arXiv.org Machine Learning

Information value (IV) is a popular technique for feature selection before the modeling phase. Practical criteria based on fixed IV thresholds are commonly used to decide whether a predictor has enough predictive power to be considered in the modeling phase, but these thresholds are somewhat mysterious and lack theoretical justification. Moreover, the mathematical development and statistical inference methods for this technique are almost nonexistent in the literature. In this paper we present a theoretical framework for IV and propose a non-parametric hypothesis test to evaluate the predictive power of the features in a data set. Because of its relationship with divergence measures developed in information theory, we call our proposal the J-Divergence test. We show how to compute our test statistic efficiently and study its performance on simulated data. In various scenarios, particularly on unbalanced data sets, we show its superiority over conventional criteria based on fixed thresholds. Furthermore, we apply our test to fraud identification data and provide an open-source Python library, called "statistical-iv" (https://pypi.org/project/statistical-iv/), where we implement our main results.
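As background for the abstract above, the classical IV quantity it builds on can be sketched in a few lines of Python. This is an illustrative computation of the textbook formula IV = Σ (pᵢ − qᵢ) ln(pᵢ/qᵢ) over bins of a feature, not the paper's statistical-iv library or its hypothesis test; the function name and example counts are invented for illustration.

```python
import math

def information_value(event_counts, nonevent_counts):
    """Classical Information Value of a binned feature:
    IV = sum_i (p_i - q_i) * ln(p_i / q_i), where p_i and q_i are the
    shares of events and non-events that fall in bin i."""
    total_events = sum(event_counts)
    total_nonevents = sum(nonevent_counts)
    iv = 0.0
    for e, n in zip(event_counts, nonevent_counts):
        p = e / total_events
        q = n / total_nonevents
        if p > 0 and q > 0:  # skip empty bins to avoid log(0)
            iv += (p - q) * math.log(p / q)
    return iv

# Example: a feature binned into three groups (hypothetical counts)
print(round(information_value([10, 30, 60], [40, 35, 25]), 4))
```

The "fixed threshold" criteria the paper criticizes amount to comparing this number against cut-offs such as 0.1 or 0.3, regardless of sample size or class balance.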


Benchmarking Non-Parametric Statistical Tests

Neural Information Processing Systems

Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests under various conditions. More precisely, using a very large dataset to estimate the whole "population", we analyzed the behavior of several statistical tests while varying the class imbalance, the compared models, the performance measure, and the sample size. The main result is that, provided the evaluation sets are big enough, non-parametric tests are relatively reliable in all conditions.


Parametric vs. Non-parametric tests, and when to use them

#artificialintelligence

The fundamentals of data science include computer science, statistics, and math. It's very easy to get caught up in the latest and greatest, most powerful algorithms -- convolutional neural nets, reinforcement learning, etc. As an ML/health researcher and algorithm developer, I often employ these techniques. However, something I have seen run rife in the data science community, after having trained for 10 years as an electrical engineer, is that if all you have is a hammer, everything looks like a nail. Suffice it to say that while many of these exciting algorithms have immense applicability, too often the statistical underpinnings of the data science community are overlooked.


A Non-Parametric Test to Detect Data-Copying in Generative Models

Meehan, Casey, Chaudhuri, Kamalika, Dasgupta, Sanjoy

arXiv.org Machine Learning

Detecting overfitting in generative models is an important challenge in machine learning. In this work, we formalize a form of overfitting that we call {\em{data-copying}} -- where the generative model memorizes and outputs training samples or small variations thereof. We provide a three sample non-parametric test for detecting data-copying that uses the training set, a separate sample from the target distribution, and a generated sample from the model, and study the performance of our test on several canonical models and datasets. For code & examples, visit https://github.com/casey-meehan/data-copying


Learn R for Applied Statistics - Programmer Books

#artificialintelligence

Gain the fundamentals of the R programming language for doing the applied statistics useful for data exploration and analysis in data science and data mining. This book covers topics ranging from R syntax basics, descriptive statistics, and data visualizations to inferential statistics and regressions. After learning R's syntax, you will work through data visualizations such as histograms and boxplot charting, descriptive statistics, and inferential statistics such as the t-test, chi-square test, ANOVA, non-parametric tests, and linear regressions. Learn R for Applied Statistics is a timely skills-migration book that equips you with the R programming fundamentals and introduces you to applied statistics for data exploration. It is aimed at those who are interested in data science, in particular data exploration using applied statistics, and the use of R programming for data visualizations.


Comparing Machine Learning Models: Statistical vs. Practical Significance

#artificialintelligence

A lot of work has been done on building and tuning ML models, but a natural question that eventually comes up after all that hard work is -- how do we actually compare the models we've built? If we're facing a choice between models A and B, which one is the winner and why? Could the models be combined together so that optimal performance is achieved? A very shallow approach would be to compare the overall accuracy on the test set, say, model A's accuracy is 94% vs. model B's accuracy is 95%, and blindly conclude that B won the race. In fact, there is so much more than the overall accuracy to investigate and more facts to consider.
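One standard way to go beyond comparing raw accuracies, as the article suggests, is to test whether the two models' disagreements on the same test set are statistically significant. A minimal sketch of McNemar's test (a common choice for paired classifier comparison; the function name and counts here are illustrative, not from the article):

```python
def mcnemar_statistic(b, c):
    """McNemar's chi-square statistic with continuity correction.
    b: test cases model A got right and model B got wrong;
    c: test cases model B got right and model A got wrong.
    Cases where both models agree do not enter the statistic."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# Example: the two models disagree on 65 test cases
stat = mcnemar_statistic(b=40, c=25)
print(round(stat, 3))  # compare against the chi-square(1) critical value 3.841
```

If the statistic stays below the critical value, the 94% vs. 95% gap from the example above may well be noise rather than a real difference.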


Detecting stationarity in time series data

#artificialintelligence

Stationarity is an important concept in time series analysis. For a concise (but thorough) introduction to the topic, and the reasons that make it important, take a look at my previous blog post on the topic. As such, the ability to determine if a time series is stationary is important. Rather than deciding between two strict options, this usually means being able to ascertain, with high probability, that a series is generated by a stationary process. In this brief post, I will cover several ways to do just that.
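In practice, "ascertaining with high probability" that a process is stationary is usually done with a formal unit-root test such as the Augmented Dickey-Fuller test (available as `adfuller` in statsmodels). As a purely illustrative stand-in, here is a crude drift heuristic in plain Python: it compares the means of the two halves of a series relative to the overall spread. The function name and tolerance are invented for this sketch and are no substitute for a proper test.

```python
import statistics

def rolling_stats_check(series, tol=0.5):
    """Crude stationarity heuristic (illustrative only, not a formal
    unit-root test): compare the means of the first and second halves
    of the series; a large shift relative to the overall standard
    deviation suggests a drifting (non-stationary) mean."""
    half = len(series) // 2
    first, second = series[:half], series[half:]
    mean_shift = abs(statistics.mean(second) - statistics.mean(first))
    scale = statistics.stdev(series) or 1.0  # guard against zero spread
    return mean_shift / scale <= tol  # True -> no obvious mean drift

# A flat oscillating series vs. one with a clear upward trend
print(rolling_stats_check([1, 2, 1, 2, 1, 2, 1, 2]))  # no drift
print(rolling_stats_check([1, 2, 3, 4, 5, 6, 7, 8]))  # strong drift
```

A real workflow would replace this with ADF or KPSS tests, which come with actual p-values.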


A Guide To Conduct Analysis Using Non-Parametric Statistical Tests

@machinelearnbot

The average salary package of an economics honors graduate at Hansraj College at the end of the 1980s was around INR 1,000,000 p.a. That number is significantly higher than for people graduating in the early '80s or early '90s. What could be the reason for such a high average? Well, one of the highest-paid Indian celebrities, Shahrukh Khan, graduated from Hansraj College in 1988, where he was pursuing economics honors. This, and many such examples, tell us that the average is not a good indicator of the center of the data: it can be heavily influenced by outliers. In such cases, looking at the median is a better choice.
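The mean-versus-median contrast in the anecdote above is easy to demonstrate with a tiny example (the numbers here are hypothetical, not the actual Hansraj data):

```python
from statistics import mean, median

# Salaries (in INR lakh p.a.) of a small graduating class,
# with one celebrity outlier at the end
salaries = [3, 4, 4, 5, 5, 6, 500]

print(mean(salaries))    # dragged far upward by the single outlier
print(median(salaries))  # still reflects a typical graduate
```

The mean lands far above every non-outlier value, while the median stays at the middle of the ordinary salaries, which is exactly why non-parametric, rank-based statistics are preferred for skewed data.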