Regression
Cross-validation Confidence Intervals for Test Error
Bayle, Pierre, Bayle, Alexandre, Janson, Lucas, Mackey, Lester
This work develops central limit theorems for cross-validation and consistent estimators of its asymptotic variance under weak stability conditions on the learning algorithm. Together, these results provide practical, asymptotically-exact confidence intervals for $k$-fold test error and valid, powerful hypothesis tests of whether one learning algorithm has smaller $k$-fold test error than another. These results are also the first of their kind for the popular choice of leave-one-out cross-validation. In our real-data experiments with diverse learning algorithms, the resulting intervals and tests outperform the most popular alternative methods from the literature.
Aspectuality Across Genre: A Distributional Semantics Approach
Kober, Thomas, Alikhani, Malihe, Stone, Matthew, Steedman, Mark
The interpretation of the lexical aspect of verbs in English plays a crucial role for recognizing textual entailment and learning discourse-level inferences. We show that two elementary dimensions of aspectual class, states vs. events, and telic vs. atelic events, can be modelled effectively with distributional semantics. We find that a verb's local context is most indicative of its aspectual class, and demonstrate that closed class words tend to be stronger discriminating contexts than content words. Our approach outperforms previous work on three datasets. Lastly, we contribute a dataset of human--human conversations annotated with lexical aspect and present experiments that show the correlation of telicity with genre and discourse goals.
Measure Inducing Classification and Regression Trees for Functional Data
Belli, Edoardo, Vantini, Simone
We propose a tree-based algorithm for classification and regression problems in the context of functional data analysis, which allows to leverage representation learning and multiple splitting rules at the node level, reducing generalization error while retaining the interpretability of a tree. This is achieved by learning a weighted functional $L^{2}$ space by means of constrained convex optimization, which is then used to extract multiple weighted integral features from the input functions, in order to determine the binary split for each internal node of the tree. The approach is designed to manage multiple functional inputs and/or outputs, by defining suitable splitting rules and loss functions that can depend on the specific problem and can also be combined with scalar and categorical data, as the tree is grown with the original greedy CART algorithm. We focus on the case of scalar-valued functional inputs defined on unidimensional domains and illustrate the effectiveness of our method in both classification and regression tasks, through a simulation study and four real world applications.
Post-selection inference with HSIC-Lasso
Freidling, Tobias, Poignard, Benjamin, Climente-Gonzรกlez, Hรฉctor, Yamada, Makoto
Detecting influential features in complex (non-linear and/or high-dimensional) datasets is key for extracting the relevant information. Most of the popular selection procedures, however, require assumptions on the underlying data - such as distributional ones -, which barely agree with empirical observations. Therefore, feature selection based on nonlinear methods, such as the model-free HSIC-Lasso, is a more relevant approach. In order to ensure valid inference among the chosen features, the selection procedure must be accounted for. In this paper, we propose selective inference with HSIC-Lasso using the framework of truncated Gaussians together with the polyhedral lemma. Based on these theoretical foundations, we develop an algorithm allowing for low computational costs and the treatment of the hyper-parameter selection issue. The relevance of our method is illustrated using artificial and real-world datasets. In particular, our empirical findings emphasise that type-I error control at the considered level can be achieved.
DeSMOG: Detecting Stance in Media On Global Warming
Luo, Yiwei, Card, Dallas, Jurafsky, Dan
Citing opinions is a powerful yet understudied strategy in argumentation. For example, an environmental activist might say, "Leading scientists agree that global warming is a serious concern," framing a clause which affirms their own stance ("that global warming is serious") as an opinion endorsed ("[scientists] agree") by a reputable source ("leading"). In contrast, a global warming denier might frame the same clause as the opinion of an untrustworthy source with a predicate connoting doubt: "Mistaken scientists claim [...]." Our work studies opinion-framing in the global warming (GW) debate, an increasingly partisan issue that has received little attention in NLP. We introduce DeSMOG, a dataset of stance-labeled GW sentences, and train a BERT classifier to study novel aspects of argumentation in how different sides of a debate represent their own and each other's opinions. From 56K news articles, we find that similar linguistic devices for self-affirming and opponent-doubting discourse are used across GW-accepting and skeptic media, though GW-skeptical media shows more opponent-doubt. We also find that authors often characterize sources as hypocritical, by ascribing opinions expressing the author's own view to source entities known to publicly endorse the opposing view. We release our stance dataset, model, and lexicons of framing devices for future work on opinion-framing and the automatic detection of GW stance.
Introduction to logistic regression
July 29, 2020 For random field theory based multiple comparison corrections In brain imaging, it is often necessary to compute the distribution of the supremum of a random field. Unfortunately, computing the distribution of the supremum of the random field is not easy and requires satisfying many distributional assumptions that may not be true in real data. Thus, there is a need to come up with a different framework that does not use the traditional statistical hypothesis testing paradigm that requires to compute p-values. With this as a motivation, we can use a different approach called the logistic regression that does not require computing the p-value and still be able to localize the regions of brain network differences (Flury 1997, Hastie et al. 2003, Chung et al. 2008). Unlike other discriminant and classification techniques that tried to classify preselected feature vectors, the method here does not require any preselected feature vectors and performs the classification at each edge level (Higdon et al. 2004, Shen et al. 2004, Thomaz et al. 2006).
Manifold structure in graph embeddings
The hypothesis that high-dimensional data tend to live near a manifold of low dimension is an important theme of modern statistics and machine learning, often held to explain why highdimensional learning is realistically possible [61, 5, 15, 13, 18, 44]. The object of this paper is show that, for a theoretically tractable but rich class of random graph models, such a phenomenon occurs in the spectral embedding of a graph. Manifold structure is shown to arise when the graph follows a latent position model [28], wherein connections are posited to occur as a function of the nodes' underlying positions in space. Because of their intuitive appeal, such models have been employed in a great diversity of disciplines, including social science [35, 42, 21], neuroscience [17, 52], statistical mechanics [34], information technology [69], biology [53] and ecology [19]. In many more endeavours latent position models are used -- at least according to Definition 1 (to follow) -- but are known by a different name; examples include the standard [29], mixed [2] and degree-corrected [32] stochastic block models, random geometric graphs [50], and the graphon model [39], which encompasses them.
Mastering Python for Data Science - Programmer Books
Data science is a relatively new knowledge domain which is used by various organizations to make data driven decisions. Data scientists have to wear various hats to work with data and to derive value from it. The Python programming language, beyond having conquered the scientific community in the last decade, is now an indispensable tool for the data science practitioner and a must-know tool for every aspiring data scientist. Using Python will offer you a fast, reliable, cross-platform, and mature environment for data analysis, machine learning, and algorithmic problem solving. This comprehensive guide helps you move beyond the hype and transcend the theory by providing you with a hands-on, advanced study of data science.
Logistic Regression in Python
Logistic Regression is used for classification problems in machine learning. It is used to deal with binary classification and multiclass classification. In logistic regression, the target variable/dependent variable should be a discrete value or categorical value. Binary classification problems with two class values like male/female, yes/no, True/False, 0/1, pass/fail. Let's learn about logistic regression for binary classification in this story.