This is the conference, and here's my talk (I'll give it over Google Hangout, just as with my recent talks in Bern, Strasbourg, etc.): Through a series of examples, we consider problems with classical hypothesis testing, whether performed using classical p-values or confidence intervals, Bayes factors, or Bayesian inference using noninformative priors. We locate the problem not in the use of any particular statistical method but rather with larger problems of deterministic thinking and a misguided version of Popperianism in which the rejection of a straw-man null hypothesis is taken as confirmation of a preferred alternative. We suggest solutions involving multilevel modeling and informative Bayesian inference. The post Hypothesis Testing is a Bad Idea (my talk at Warwick, England, 2:30pm Thurs 15 Sept) appeared first on Statistical Modeling, Causal Inference, and Social Science.

Thanks to my CS7641 class at Georgia Tech in my MS Analytics program, where I discovered this concept and was inspired to write about it. It is somewhat surprising that among all the high-flying buzzwords of machine learning, we don't hear much about the one phrase that fuses some of the core concepts of statistical learning, information theory, and natural philosophy into a single three-word combo. Moreover, it is not just an obscure and pedantic phrase meant for machine learning (ML) Ph.D.s and theoreticians. It has a precise and easily accessible meaning for anyone interested in exploring it, and a practical pay-off for practitioners of ML and data science. I am talking about Minimum Description Length.

Question: "Why do you square the error in a regression machine learning task?" Ans: "Why, of course, it turns all the errors (residuals) into positive quantities!" Question: "OK, why not use a simpler absolute value function |x| to make all the errors positive?" Ans: "Aha, you are trying to trick me. The absolute value function is not differentiable everywhere!" Question: "That should not matter much for numerical algorithms. LASSO regression uses a term with absolute value and it can be handled.
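The exchange above can be made concrete with a small sketch. It is not taken from the original post: the data, the grid search, and the function names (`squared_loss`, `absolute_loss`, `absolute_subgradient`) are all illustrative assumptions. It fits a single constant to a toy sample under each loss and shows that the kink of |x| at zero is handled by picking a subgradient there, which is one common way numerical methods deal with non-differentiability.

```python
from statistics import mean, median

def squared_loss(ys, c):
    """Mean squared error of predicting the constant c for every y."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

def absolute_loss(ys, c):
    """Mean absolute error of predicting the constant c for every y."""
    return sum(abs(y - c) for y in ys) / len(ys)

def absolute_subgradient(r):
    """A valid subgradient of |r|: the sign of r, with 0 chosen at the kink r == 0."""
    return (r > 0) - (r < 0)

# Toy data with one outlier (illustrative values only).
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

# Brute-force search over candidate constants 0.00, 0.01, ..., 100.00.
grid = [i / 100 for i in range(10001)]
best_sq = min(grid, key=lambda c: squared_loss(ys, c))
best_abs = min(grid, key=lambda c: absolute_loss(ys, c))

print("squared-loss minimizer:", best_sq, "| sample mean:", mean(ys))
print("absolute-loss minimizer:", best_abs, "| sample median:", median(ys))
print("subgradients at -2, 0, 3:", [absolute_subgradient(r) for r in (-2.0, 0.0, 3.0)])
```

The squared loss is minimized at the sample mean and is pulled toward the outlier, while the absolute loss is minimized at the sample median. Both are optimizable; they simply answer slightly different questions about the data.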

This work constructs a hypothesis test for detecting whether a data-generating function $h: \mathbb{R}^p \rightarrow \mathbb{R}$ belongs to a specific reproducing kernel Hilbert space $\mathcal{H}_0$, where the structure of $\mathcal{H}_0$ is only partially known. Utilizing the theory of reproducing kernels, we reduce this hypothesis to a simple one-sided score test for a scalar parameter, develop a testing procedure that is robust against the mis-specification of kernel functions, and also propose an ensemble-based estimator for the null model to guarantee test performance in small samples. To demonstrate the utility of the proposed method, we apply our test to the problem of detecting nonlinear interaction between groups of continuous features. We evaluate the finite-sample performance of our test under different data-generating functions and estimation strategies for the null model. Our results revealed an interesting connection between notions in machine learning (model underfit/overfit) and those in statistical inference (i.e.

Chen, Songjian (Sun Yat-sen University) | Xu, Yabo (Sun Yat-sen University) | Chang, Huiyou (Sun Yat-sen University)

In this paper, we propose a new unsupervised approach for word segmentation. The core idea of our approach is a novel word induction criterion called WordRank, which estimates the goodness of word hypotheses (character or phoneme sequences). We devise a method to derive exterior word boundary information from the link structures of adjacent word hypotheses and incorporate interior word boundary information to complete the model. In light of WordRank, word segmentation can be modeled as an optimization problem. A Viterbi-style algorithm is developed to search for the optimal segmentation. Extensive experiments conducted on phonetic transcripts as well as standard Chinese and Japanese data sets demonstrate the effectiveness of our approach. On the standard Brent version of the Bernstein-Ratner corpus, our approach outperforms the state-of-the-art Bayesian models by more than 3%. In addition, our approach is simpler and more efficient than the Bayesian methods. Consequently, our approach is more suitable for real-world applications.
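To illustrate what a Viterbi-style search over segmentations looks like, here is a minimal sketch. It does not implement WordRank: the scoring function `toy_score`, its hand-picked values, and the `max_len` limit are hypothetical stand-ins for whatever goodness measure is assigned to word hypotheses. The dynamic program only assumes that the score of a segmentation is the sum of the scores of its words.

```python
def segment(text, score, max_len=4):
    """Return the segmentation of text that maximizes the summed word scores.

    best[i] holds the best total score for text[:i]; back[i] records where
    the last word of that best segmentation starts.
    """
    n = len(text)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        # Try every candidate last word text[start:end] of length <= max_len.
        for start in range(max(0, end - max_len), end):
            candidate = best[start] + score(text[start:end])
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical log-style scores; a real system would estimate these from data.
KNOWN = {"the": -1.0, "dog": -1.0, "runs": -1.5}

def toy_score(word):
    """Score known words from the table; penalize unknown strings per character."""
    return KNOWN.get(word, -5.0 * len(word))

print(segment("thedogruns", toy_score))
```

Because each position only looks back at most `max_len` characters, the search runs in time proportional to the text length times `max_len`, rather than enumerating every possible segmentation.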