Regression
Offensive Language and Hate Speech Detection for Danish
Sigurbergsson, Gudbjartur Ingi, Derczynski, Leon
The presence of offensive language on social media platforms and the implications this poses is becoming a major concern in modern society. Given the enormous amount of content created every day, automatic methods are required to detect and deal with this type of content. Until now, most of the research has focused on solving the problem for the English language, while the problem is multilingual. We construct a Danish dataset containing user-generated comments from \textit{Reddit} and \textit{Facebook}; to our knowledge, it is the first of its kind. Our dataset is annotated to capture various types and targets of offensive language. We develop four automatic classification systems, each designed to work for both the English and the Danish language. In the detection of offensive language in English, the best performing system achieves a macro averaged F1-score of $0.74$, and the best performing system for Danish achieves a macro averaged F1-score of $0.70$. In the detection of whether or not an offensive post is targeted, the best performing system for English achieves a macro averaged F1-score of $0.62$, while the best performing system for Danish achieves a macro averaged F1-score of $0.73$. Finally, in the detection of the target type in a targeted offensive post, the best performing system for English achieves a macro averaged F1-score of $0.56$, and the best performing system for Danish achieves a macro averaged F1-score of $0.63$. Our work for both the English and the Danish language captures the type and targets of offensive language, and presents automatic methods for detecting different kinds of offensive language such as hate speech and cyberbullying.
Logistic Regression Equivalence: A Framework for Comparing Logistic Regression Models Across Populations
Ashiri-Prossner, Guy, Benjamini, Yuval
In this paper we discuss how to evaluate the differences between fitted logistic regression models across sub-populations. Our motivating example is in studying computerized diagnosis for learning disabilities, where sub-populations based on gender may or may not require separate models. In this context, significance tests for hypotheses of no difference between populations may provide perverse incentives, as larger variances and smaller samples increase the probability of not rejecting the null. We argue that equivalence testing for a prespecified tolerance level on population differences incentivizes accuracy in the inference. We develop a cascading set of equivalence tests, in which each test addresses a different aspect of the model: the way the phenomenon is coded in the regression coefficients, the individual predictions in the per-example log odds ratio, and the overall accuracy in the mean square prediction error. For each equivalence test, we propose a strategy for setting the equivalence thresholds. The large-sample approximations are validated using simulations. For diagnosis data, we show examples of equivalent and non-equivalent models.
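To make the contrast between significance testing and equivalence testing concrete, here is a minimal sketch of a generic two one-sided tests (TOST) procedure for a single coefficient compared across two sub-populations. It is not the authors' cascading procedure: the function name, the independence of the two samples, the large-sample normal approximation, and the tolerance `delta` are all assumptions made for illustration.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def tost_equivalence(beta_a, se_a, beta_b, se_b, delta, alpha=0.05):
    """Generic TOST for H0: |beta_a - beta_b| >= delta (hypothetical helper).

    Assumes the two estimates come from independent samples and are
    approximately normal. Returns (p_value, equivalent_at_alpha).
    """
    diff = beta_a - beta_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    # One-sided test against H0: diff <= -delta
    p_lower = 1.0 - normal_cdf((diff + delta) / se)
    # One-sided test against H0: diff >= +delta
    p_upper = normal_cdf((diff - delta) / se)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Precise estimates: equivalence can be declared.
p_precise, eq_precise = tost_equivalence(1.00, 0.05, 1.02, 0.05, delta=0.3)
# Same point estimates, noisier fits: equivalence is NOT declared.
p_noisy, eq_noisy = tost_equivalence(1.00, 0.50, 1.02, 0.50, delta=0.3)
```

With identical point estimates, the noisier fit fails to establish equivalence, whereas a no-difference significance test would "pass" it more easily. This is the incentive issue the abstract describes.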
Retire: Robust Expectile Regression in High Dimensions
Man, Rebeka, Tan, Kean Ming, Wang, Zian, Zhou, Wen-Xin
High-dimensional data can often display heterogeneity due to heteroscedastic variance or inhomogeneous covariate effects. Penalized quantile and expectile regression methods offer useful tools to detect heteroscedasticity in high-dimensional data. The former is computationally challenging due to the non-smooth nature of the check loss, and the latter is sensitive to heavy-tailed error distributions. In this paper, we propose and study (penalized) robust expectile regression (retire), with a focus on iteratively reweighted $\ell_1$-penalization which reduces the estimation bias from $\ell_1$-penalization and leads to oracle properties. Theoretically, we establish the statistical properties of the retire estimator under two regimes: (i) low-dimensional regime in which $d \ll n$; (ii) high-dimensional regime in which $s\ll n\ll d$ with $s$ denoting the number of significant predictors. In the high-dimensional setting, we carefully characterize the solution path of the iteratively reweighted $\ell_1$-penalized retire estimation, adapted from the local linear approximation algorithm for folded-concave regularization. Under a mild minimum signal strength condition, we show that after as many as $\log(\log d)$ iterations the final iterate enjoys the oracle convergence rate. At each iteration, the weighted $\ell_1$-penalized convex program can be efficiently solved by a semismooth Newton coordinate descent algorithm. Numerical studies demonstrate the competitive performance of the proposed procedure compared with either non-robust or quantile regression based alternatives.
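As background for the expectile idea, the sketch below computes a plain sample expectile by iteratively reweighted averaging (asymmetric least squares). It deliberately omits the robustification and the $\ell_1$ penalty that the abstract describes, so it should be read as an illustration of the underlying loss only; the function name and the stopping rule are assumptions.

```python
import numpy as np


def sample_expectile(y, tau=0.5, n_iter=100, tol=1e-10):
    """Sample tau-expectile via asymmetric least squares (illustrative).

    Minimizes sum_i |tau - 1(y_i <= e)| * (y_i - e)^2 over e by repeatedly
    taking a weighted mean with weights tau (above e) and 1 - tau (otherwise).
    """
    y = np.asarray(y, dtype=float)
    e = y.mean()
    for _ in range(n_iter):
        w = np.where(y > e, tau, 1.0 - tau)
        e_new = np.sum(w * y) / np.sum(w)
        if abs(e_new - e) < tol:
            return float(e_new)
        e = e_new
    return float(e)


data = [1.0, 2.0, 3.0, 10.0]
center = sample_expectile(data, tau=0.5)  # coincides with the mean
upper = sample_expectile(data, tau=0.9)   # pulled toward the large value
```

At `tau=0.5` the expectile reduces to the ordinary mean, which is why the squared-error form is sensitive to heavy tails and motivates a robust variant.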
Auto-Encoder Neural Network Incorporating X-Ray Fluorescence Fundamental Parameters with Machine Learning
We consider energy-dispersive X-ray Fluorescence (EDXRF) applications where the fundamental parameters method is impractical such as when instrument parameters are unavailable. For example, on a mining shovel or conveyor belt, rocks are constantly moving (leading to varying angles of incidence and distances) and there may be other factors not accounted for (like dust). Neural networks do not require instrument and fundamental parameters but training neural networks requires XRF spectra labelled with elemental composition, which is often limited because of its expense. We develop a neural network model that learns from limited labelled data and also benefits from domain knowledge by learning to invert a forward model. The forward model uses transition energies and probabilities of all elements and parameterized distributions to approximate other fundamental and instrument parameters. We evaluate the model and baseline models on a rock dataset from a lithium mineral exploration project. Our model works particularly well for some low-Z elements (Li, Mg, Al, and K) as well as some high-Z elements (Sn and Pb) despite these elements being outside the suitable range for common spectrometers to directly measure, likely owing to the ability of neural networks to learn correlations and non-linear relationships.
Semantic Latent Space Regression of Diffusion Autoencoders for Vertebral Fracture Grading
Keicher, Matthias, Atad, Matan, Schinz, David, Gersing, Alexandra S., Foreman, Sarah C., Goller, Sophia S., Weissinger, Juergen, Rischewski, Jon, Dietrich, Anna-Sophia, Wiestler, Benedikt, Kirschke, Jan S., Navab, Nassir
Vertebral fractures are a consequence of osteoporosis, with significant health implications for affected patients. Unfortunately, grading their severity using CT exams is hard and subjective, motivating automated grading methods. However, current approaches are hindered by imbalance and scarcity of data and a lack of interpretability. To address these challenges, this paper proposes a novel approach that leverages unlabelled data to train a generative Diffusion Autoencoder (DAE) model as an unsupervised feature extractor. We model fracture grading as a continuous regression, which is more reflective of the smooth progression of fractures. Specifically, we use a binary, supervised fracture classifier to construct a hyperplane in the DAE's latent space. We then regress the severity of the fracture as a function of the distance to this hyperplane, calibrating the results to the Genant scale. Importantly, the generative nature of our method allows us to visualize different grades of a given vertebra, providing interpretability and insight into the features that contribute to automated grading.
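The geometric step described above, measuring how far a latent code lies from a classifier's decision hyperplane, can be sketched as follows. This only shows the signed-distance computation for a linear boundary $w^\top z + b = 0$; the names `w`, `b`, and `signed_distance` are hypothetical, and the calibration to the Genant scale is not reproduced here.

```python
import numpy as np


def signed_distance(z, w, b):
    """Signed Euclidean distance of latent codes z to the hyperplane w.z + b = 0.

    z: array of shape (n, d) or (d,); w: array of shape (d,); b: scalar.
    Positive values lie on the side the normal vector w points to.
    """
    z = np.atleast_2d(np.asarray(z, dtype=float))
    w = np.asarray(w, dtype=float)
    return (z @ w + b) / np.linalg.norm(w)


# Toy 2-D latent space with the boundary x = 1.
latents = np.array([[3.0, 4.0], [1.0, -2.0], [0.0, 0.0]])
dist = signed_distance(latents, w=[1.0, 0.0], b=-1.0)
```

A continuous severity score could then be any monotone function of `dist`; which function to use is a modelling choice that the abstract does not specify.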
Valid Inference after Causal Discovery
Gradu, Paula, Zrnic, Tijana, Wang, Yixin, Jordan, Michael I.
Causal discovery and causal estimation are fundamental tasks in causal reasoning and decision-making. Causal discovery aims to identify the underlying structure of the causal problem, often in the form of a graphical representation which makes explicit which variables causally influence which other variables, while causal estimation aims to quantify the magnitude of the effect of one variable on another. These two goals frequently go hand in hand: quantifying causal effects requires adjustments that rely on either assuming or discovering the underlying graphical structure. Methodologies for causal discovery and causal estimation have mostly been developed separately, and the statistical challenges that arise when solving these problems jointly have largely been overlooked. Indeed, a naive black-box combination of causal discovery algorithms and standard inference methods for causal effects suffers from "double dipping." That is, classical confidence intervals, such as those used for linear regression coefficients, need no longer cover the target estimand if the causal structure is not fixed a priori but is estimated on the same data used to compute the intervals.
Verifiable and Provably Secure Machine Unlearning
Eisenhofer, Thorsten, Riepel, Doreen, Chandrasekaran, Varun, Ghosh, Esha, Ohrimenko, Olga, Papernot, Nicolas
Machine unlearning aims to remove points from the training dataset of a machine learning model after training; for example when a user requests their data to be deleted. While many machine unlearning methods have been proposed, none of them enable users to audit the procedure. Furthermore, recent work shows a user is unable to verify if their data was unlearnt from an inspection of the model alone. Rather than reasoning about model parameters, we propose to view verifiable unlearning as a security problem. To this end, we present the first cryptographic definition of verifiable unlearning to formally capture the guarantees of a machine unlearning system. In this framework, the server first computes a proof that the model was trained on a dataset $D$. Given a user data point $d$ requested to be deleted, the server updates the model using an unlearning algorithm. It then provides a proof of the correct execution of unlearning and that $d \notin D'$, where $D'$ is the new training dataset. Our framework is generally applicable to different unlearning techniques that we abstract as admissible functions. We instantiate the framework, based on cryptographic assumptions, using SNARKs and hash chains. Finally, we implement the protocol for three different unlearning techniques (retraining-based, amnesiac, and optimization-based) to validate its feasibility for linear regression, logistic regression, and neural networks.
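For readers unfamiliar with hash chains, the following minimal sketch shows the basic construction using SHA-256 from Python's standard library. It is a generic illustration rather than the paper's protocol: the SNARK proofs are omitted entirely, and the function names, the all-zero genesis value, and the byte-string records are assumptions.

```python
import hashlib


def extend_chain(prev_digest: bytes, record: bytes) -> bytes:
    """Append one record: the new digest commits to the previous digest and the record."""
    return hashlib.sha256(prev_digest + record).digest()


def chain_digest(records, genesis: bytes = b"\x00" * 32) -> bytes:
    """Fold a sequence of byte-string records into a single chained digest."""
    digest = genesis
    for record in records:
        digest = extend_chain(digest, record)
    return digest


# Hypothetical log of deletion requests.
head = chain_digest([b"delete:user-17", b"delete:user-42"])
reordered = chain_digest([b"delete:user-42", b"delete:user-17"])
```

Because each digest depends on everything before it, altering or reordering an earlier record changes the final head, which is what makes such a chain usable as a tamper-evident log.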
An ADMM approach for multi-response regression with overlapping groups and interaction effects
Asenso, Theophilus Quachie, Zucknick, Manuela
The constraints ensure that the interaction term can be nonzero only if the corresponding main term is nonzero. Even though the idea is still young, it has been applied in different areas, for example to multinomial logistic regression (Asenso et al., 2022b), Cox's proportional hazards model (Du and Tibshirani, 2018) and support vector machines (Asenso et al., 2022a). However, all of the above studies used a block-wise coordinate descent procedure to solve the problem, which includes overlapping groups. The algorithm involves multiple "if" statements and a generalized gradient at the final stage. This implies that extending the model to a multi-response case would require rigorous computations, as in Li et al. (2015), which might be difficult to handle. In this paper, we introduce the alternating direction method of multipliers (ADMM) to handle this problem and extend the results from the single-response model to a multi-response problem. We provide a publicly available software package, MADMMplasso (Asenso and Zucknick, 2022), implemented in R. We present a brief review of the ADMM algorithm in what follows.
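As a point of reference for the ADMM iteration, here is a minimal sketch for the ordinary lasso, $\min_x \tfrac12\|Ax-b\|_2^2 + \lambda\|x\|_1$, split as $x = z$. It is not the authors' multi-response model with overlapping groups and interactions, and it is written in Python rather than R; the function names, the fixed penalty parameter `rho`, and the fixed iteration count are assumptions.

```python
import numpy as np


def soft_threshold(v, k):
    """Elementwise soft-thresholding, the proximal operator of k * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)


def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """Scaled-form ADMM for the plain lasso (illustrative sketch)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    p = A.shape[1]
    x, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    M = A.T @ A + rho * np.eye(p)   # system matrix, reused every iteration
    Atb = A.T @ b
    for _ in range(n_iter):
        x = np.linalg.solve(M, Atb + rho * (z - u))  # smooth least-squares step
        z = soft_threshold(x + u, lam / rho)         # sparsity-inducing step
        u = u + x - z                                # scaled dual update
    return z


# With A = I the lasso solution is soft-thresholding of b at lam.
coef = admm_lasso(np.eye(3), [3.0, -0.5, 1.5], lam=1.0)
```

The appeal of the splitting is that each sub-step is a closed-form update, avoiding the case-by-case branching that block-wise coordinate descent requires.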
10 Powerful Machine Learning Models for Predictive Analytics - CinexTech
In today's data-driven world, predictive analytics has become integral to how businesses anticipate future trends and gain a competitive advantage. Machine learning models have made it easier to analyze and interpret data and make informed decisions. This article discusses 10 powerful machine learning models for predictive analytics that businesses can use to improve their operations. Predictive analytics is the process of analyzing historical data to make predictions about future events. Machine learning models make these predictions more accurate by analyzing large volumes of data.
Skeleton Regression: A Graph-Based Approach to Estimation with Manifold Structure
We introduce a new regression framework designed to deal with large-scale, complex data that lies around a low-dimensional manifold. Our approach first constructs a graph representation, referred to as the skeleton, to capture the underlying geometric structure. We then define metrics on the skeleton graph and apply nonparametric regression techniques, along with feature transformations based on the graph, to estimate the regression function. In addition to the included nonparametric methods, we also discuss the limitations of some nonparametric regressors with respect to general metric spaces such as the skeleton graph. The proposed regression framework allows us to bypass the curse of dimensionality and offers additional advantages: it can handle the union of multiple manifolds and is robust to additive noise and noisy observations. We provide statistical guarantees for the proposed method and demonstrate its effectiveness through simulations and real data examples.