Accuracy
Label Bias Identification in ML model using Python code
Machine learning is evolving continuously, and its demand and importance have risen significantly as a result. Machine learning models are now applied everywhere in day-to-day life, from movie recommendations on Netflix to product recommendations on Amazon. Decisions ranging from hiring a new employee to approving financial products are now made automatically by machine learning models. The assumption is that large amounts of data, analyzed by improved machine learning algorithms, can guide better decisions and smarter actions in real time without human intervention. However, this widespread use of machine learning models carries a risk: the risk of bias.
EMG Pattern Recognition via Bayesian Inference with Scale Mixture-Based Stochastic Generative Models
Furui, Akira, Igaue, Takuya, Tsuji, Toshio
Electromyogram (EMG) has been utilized to interface signals for prosthetic hands and information devices owing to its ability to reflect human motion intentions. Although various EMG classification methods have been introduced into EMG-based control systems, they do not fully consider the stochastic characteristics of EMG signals. This paper proposes an EMG pattern classification method incorporating a scale mixture-based generative model. A scale mixture model is a stochastic EMG model in which the EMG variance is considered as a random variable, enabling the representation of uncertainty in the variance. This model is extended in this study and utilized for EMG pattern classification. The proposed method is trained by variational Bayesian learning, thereby allowing the automatic determination of the model complexity. Furthermore, to optimize the hyperparameters of the proposed method with a partial discriminative approach, a mutual information-based determination method is introduced. Simulation and EMG analysis experiments demonstrated the relationship between the hyperparameters and classification accuracy of the proposed method as well as the validity of the proposed method. The comparison using public EMG datasets revealed that the proposed method outperformed the various conventional classifiers. These results indicated the validity of the proposed method and its applicability to EMG-based control systems. In EMG pattern recognition, a classifier based on a generative model that reflects the stochastic characteristics of EMG signals can outperform the conventional general-purpose classifier.
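The scale mixture idea described above, a zero-mean signal whose variance is itself a random variable, can be illustrated with a small simulation. This is a minimal sketch, not the authors' model: the inverse-gamma distribution on the variance, the `shape` and `rate` values, and the function names are all assumptions made for illustration.

```python
import numpy as np

def sample_scale_mixture(n, shape=5.0, rate=4.0, seed=0):
    """Draw n zero-mean samples whose variance is itself random.

    Assumption: the variance follows an inverse-gamma distribution,
    a common choice for scale mixtures; shape and rate are illustrative.
    """
    rng = np.random.default_rng(seed)
    # If 1/variance ~ Gamma(shape, rate), then variance ~ InvGamma(shape, rate).
    precision = rng.gamma(shape, 1.0 / rate, size=n)
    variance = 1.0 / precision
    return rng.normal(0.0, np.sqrt(variance))

def kurtosis(x):
    """Sample kurtosis (equals 3 for a Gaussian)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4))

mixture = sample_scale_mixture(200_000)
# Gaussian reference with the same overall standard deviation.
gaussian = np.random.default_rng(1).normal(0.0, mixture.std(), size=200_000)
print("kurtosis, scale mixture:", kurtosis(mixture))
print("kurtosis, Gaussian:     ", kurtosis(gaussian))
```

Letting the variance fluctuate produces heavier tails than a fixed-variance Gaussian, which is one way to represent uncertainty in the signal's amplitude.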
A Review of Generative Adversarial Networks in Cancer Imaging: New Applications, New Solutions
Osuala, Richard, Kushibar, Kaisar, Garrucho, Lidia, Linardos, Akis, Szafranowska, Zuzanna, Klein, Stefan, Glocker, Ben, Diaz, Oliver, Lekadir, Karim
Despite technological and medical advances, the detection, interpretation, and treatment of cancer based on imaging data continue to pose significant challenges. These include high inter-observer variability, difficulty of small-sized lesion detection, nodule interpretation and malignancy determination, inter- and intra-tumour heterogeneity, class imbalance, segmentation inaccuracies, and treatment effect uncertainty. The recent advancements in Generative Adversarial Networks (GANs) in computer vision as well as in medical imaging may provide a basis for enhanced capabilities in cancer detection and analysis. In this review, we assess the potential of GANs to address a number of key challenges of cancer imaging, including data scarcity and imbalance, domain and dataset shifts, data access and privacy, data annotation and quantification, as well as cancer detection, tumour profiling and treatment planning. We provide a critical appraisal of the existing literature of GANs applied to cancer imagery, together with suggestions on future research directions to address these challenges. We analyse and discuss 163 papers that apply adversarial training techniques in the context of cancer imaging and elaborate their methodologies, advantages and limitations. With this work, we strive to bridge the gap between the needs of the clinical cancer imaging community and the current and prospective research on GANs in the artificial intelligence community.
Canonical Polyadic Decomposition and Deep Learning for Machine Fault Detection
Frusque, Gaetan, Michau, Gabriel, Fink, Olga
Acoustic monitoring for machine fault detection is a recent and expanding research path that has already provided promising results for industries. However, it is impossible to collect enough data to learn all types of faults from a machine. Thus, new algorithms, trained using data from healthy conditions only, were developed to perform unsupervised anomaly detection. A key issue in the development of these algorithms is the noise in the signals, as it impacts the anomaly detection performance. In this work, we propose a powerful data-driven and quasi non-parametric denoising strategy for spectral data based on a tensor decomposition: the Non-negative Canonical Polyadic (CP) decomposition. This method is particularly well suited to machines emitting stationary sound. We demonstrate in a case study, the Malfunctioning Industrial Machine Investigation and Inspection (MIMII) baseline, how the use of our denoising strategy leads to a noticeable improvement in unsupervised anomaly detection. Such approaches can make sound-based monitoring of industrial processes more reliable.
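To make the denoising idea concrete, the sketch below fits a low-rank non-negative CP model to a noisy three-way tensor and uses the reconstruction as the denoised signal. It is a generic illustration under stated assumptions, not the paper's procedure: the multiplicative-update solver, the rank, the tensor sizes, and the synthetic data are all choices made here for brevity.

```python
import numpy as np

def nonneg_cp(X, rank, n_iter=300, eps=1e-9, seed=0):
    """Rank-`rank` non-negative CP model of a 3-way tensor X.

    Uses plain multiplicative updates (an assumption made for brevity;
    the paper's solver may differ). Returns factors A, B, C.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.random((I, rank))
    B = rng.random((J, rank))
    C = rng.random((K, rank))
    for _ in range(n_iter):
        # Each update multiplies a factor by (data term) / (model term),
        # which keeps all entries non-negative.
        A *= np.einsum("ijk,jr,kr->ir", X, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= np.einsum("ijk,ir,kr->jr", X, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= np.einsum("ijk,ir,jr->kr", X, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C

def reconstruct(A, B, C):
    """Rebuild the low-rank tensor from its CP factors."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

# Synthetic non-negative tensor standing in for spectral data;
# the mode sizes and noise level are arbitrary.
rng = np.random.default_rng(42)
true_factors = [rng.random((n, 2)) for n in (20, 15, 10)]
clean = reconstruct(*true_factors)
noisy = np.clip(clean + 0.1 * rng.standard_normal(clean.shape), 0.0, None)

denoised = reconstruct(*nonneg_cp(noisy, rank=2))
err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
print(f"error before: {err_noisy:.3f}, after: {err_denoised:.3f}")
```

The low-rank constraint is what removes noise here: a rank-2 model has far fewer free parameters than the tensor has entries, so it cannot reproduce unstructured fluctuations.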
Robust Variable Selection and Estimation Via Adaptive Elastic Net S-Estimators for Linear Regression
Heavy-tailed error distributions and predictors with anomalous values are ubiquitous in high-dimensional regression problems and can seriously jeopardize the validity of statistical analyses if not properly addressed. For more reliable estimation under these adverse conditions, we propose a new robust regularized estimator for simultaneous variable selection and coefficient estimation. This estimator, called adaptive PENSE, possesses the oracle property without prior knowledge of the scale of the residuals and without any moment conditions on the error distribution. The proposed estimator gives reliable results even under very heavy-tailed error distributions and aberrant contamination in the predictors or residuals. Importantly, even in these challenging settings variable selection by adaptive PENSE remains stable. Numerical studies on simulated and real data sets highlight superior finite-sample performance in a vast range of settings compared to other robust regularized estimators in the case of contaminated samples and competitiveness compared to classical regularized estimators in clean samples.
What is Machine Learning? A Primer for the Epidemiologist
Machine learning is a branch of computer science that has the potential to transform epidemiologic sciences. Amid a growing focus on "Big Data," it offers epidemiologists new tools to tackle problems for which classical methods are not well-suited. In order to critically evaluate the value of integrating machine learning algorithms and existing methods, however, it is essential to address language and technical barriers between the two fields that can make it difficult for epidemiologists to read and assess machine learning studies. Here, we provide an overview of the concepts and terminology used in machine learning literature, which encompasses a diverse set of tools with goals ranging from prediction to classification to clustering. We provide a brief introduction to 5 common machine learning algorithms and 4 ensemble-based approaches. We then summarize epidemiologic applications of machine learning techniques in the published literature. We recommend approaches to incorporate machine learning in epidemiologic research and discuss opportunities and challenges for integrating machine learning and existing epidemiologic research methods. Machine learning is a branch of computer science that broadly aims to enable computers to "learn" without being directly programmed (1). It has origins in the artificial intelligence movement of the 1950s and emphasizes practical objectives and applications, particularly prediction and optimization. Computers "learn" in machine learning by improving their performance at tasks through "experience" (2, p. xv). In practice, "experience" usually means fitting to data; hence, there is not a clear boundary between machine learning and statistical approaches. 
Indeed, whether a given methodology is considered "machine learning" or "statistical" often reflects its history as much as genuine differences, and many algorithms (e.g., least absolute shrinkage and selection operator (LASSO), stepwise regression) may or may not be considered machine learning depending on who you ask. Still, despite methodological similarities, machine learning is philosophically and practically distinguishable. At the risk of (considerable) oversimplification, machine learning generally emphasizes predictive accuracy over hypothesis-driven inference, usually focusing on large, high-dimensional (i.e., having many covariates) data sets (3, 4). Regardless of the precise distinction between approaches, in practice, machine learning offers epidemiologists important tools. In particular, a growing focus on "Big Data" emphasizes problems and data sets for which machine learning algorithms excel while more commonly used statistical approaches struggle. This primer provides a basic introduction to machine learning with the aim of providing readers a foundation for critically reading studies based on these methods and a jumping-off point for those interested in using machine learning techniques in epidemiologic research.
Convolutional module for heart localization and segmentation in MRI
Lima, Daniel, Graves, Catharine, Gutierrez, Marco, Brandoli, Bruno, Rodrigues-Jr, Jose
Magnetic resonance imaging (MRI) is a medical imaging technique used to capture volumetric image sequences of internal soft tissues, such as cardiac muscles. In comparison to X-Ray imaging (XR) and Computed Tomography (CT), MRI provides images with improved structural details via finer spatial resolutions. Cardiac MRI (CMR) focuses on the heart, allowing trained cardiologists to measure heart parameters, for example the mass of the cardiac muscle (myocardium mass), the volumes of blood cavities (atrial and ventricular volumes) and the fraction of ventricular blood pumped per heartbeat (ejection fraction) [Peng et al., 2016]. Those parameters are used to assess how healthy the heart is, by recognizing early conditions and signs before the onset of infarcts and other complications. Due to the size and complexity of CMR sequences, complex techniques are required to produce detailed analyses; one of these techniques is deep learning (DL). Many of the tasks and goals related to the cardiac functional analysis - for example segmentation of structures [Bernard et al., 2018], estimation of heart parameters [Xue et al., 2018], and detection of diseases [Khened et al., 2017] - have benefited from DL methods. For even better results, research in DL has pointed out that models based on convolutional neural networks (CNN) have had a higher efficacy when provided with regions-of-interest (ROI) either explicitly or implicitly [Xue et al., 2018]. The detection of ROIs, usually named ROI proposal, is a preprocessing step whose goal is to identify the most prominent regions of an image (frame) for discovering clinically-relevant artifacts. The explicit ROI proposal approaches usually follow a combination of methods, for example: (a) pipelining a segmentation and a regression network; (b) preprocessing the input with a region proposal algorithm [He et al., 2015] or with a CNN [Wu et al., 2020]; or (c) using manual cropping [Xue et al., 2017].
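As a small worked example of the heart parameters mentioned above, the ejection fraction is conventionally derived from the end-diastolic volume (EDV) and end-systolic volume (ESV) of a ventricle. The sketch below shows only that arithmetic; the function names and the input volumes are hypothetical and do not come from the paper.

```python
def stroke_volume(edv_ml: float, esv_ml: float) -> float:
    """Blood volume ejected per beat, in mL."""
    return edv_ml - esv_ml

def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Fraction of the end-diastolic volume ejected per beat (0 to 1)."""
    if edv_ml <= 0 or not 0 <= esv_ml <= edv_ml:
        raise ValueError("expected 0 <= ESV <= EDV and EDV > 0")
    return stroke_volume(edv_ml, esv_ml) / edv_ml

# Hypothetical volumes, chosen only to illustrate the arithmetic.
ef = ejection_fraction(edv_ml=120.0, esv_ml=50.0)
print(f"stroke volume: {stroke_volume(120.0, 50.0):.0f} mL, EF: {ef:.1%}")
```

In a segmentation pipeline, the two volumes would typically be estimated from the segmented ventricle at the corresponding cardiac phases before this step.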
Directions in Abusive Language Training Data: Garbage In, Garbage Out
Vidgen, Bertie, Derczynski, Leon
Data-driven analysis and detection of abusive online content covers many different tasks, phenomena, contexts, and methodologies. This paper systematically reviews abusive language dataset creation and content in conjunction with an open website for cataloguing abusive language data. This collection of knowledge leads to a synthesis providing evidence-based recommendations for practitioners working with this complex and highly diverse data.
Can we globally optimize cross-validation loss? Quasiconvexity in ridge regression
Stephenson, William T., Frangella, Zachary, Udell, Madeleine, Broderick, Tamara
Models like LASSO and ridge regression are extensively used in practice due to their interpretability, ease of use, and strong theoretical guarantees. Cross-validation (CV) is widely used for hyperparameter tuning in these models, but do practical optimization methods minimize the true out-of-sample loss? A recent line of research promises to show that the optimum of the CV loss matches the optimum of the out-of-sample loss (possibly after simple corrections). It remains to show how tractable it is to minimize the CV loss. In the present paper, we show that, in the case of ridge regression, the CV loss may fail to be quasiconvex and thus may have multiple local optima. We can guarantee that the CV loss is quasiconvex in at least one case: when the spectrum of the covariate matrix is nearly flat and the noise in the observed responses is not too high. More generally, we show that quasiconvexity status is independent of many properties of the observed data (response norm, covariate-matrix right singular vectors and singular-value scaling) and has a complex dependence on the few that remain. We empirically confirm our theory using simulated experiments.
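One way to explore the question empirically is to evaluate the leave-one-out CV loss of ridge regression on a grid of regularization strengths and count how many local minima appear. The sketch below is an illustration under assumptions, not the paper's experiment: it uses ridge without an intercept, a penalty of the form lam * ||w||^2, synthetic Gaussian data, and an arbitrary grid. A grid scan can reveal several minima, but finding only one does not prove quasiconvexity.

```python
import numpy as np

def ridge_loocv_loss(X, y, lam):
    """Mean squared leave-one-out error of ridge regression (no intercept).

    Uses the exact shortcut e_i / (1 - H_ii), where H is the ridge hat matrix.
    """
    n, d = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    loo_residuals = (y - H @ y) / (1.0 - np.diag(H))
    return float(np.mean(loo_residuals ** 2))

def ridge_loocv_brute(X, y, lam):
    """Same quantity by refitting n times; used only as a sanity check."""
    n, d = X.shape
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        w = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(d), X[keep].T @ y[keep])
        errs.append((y[i] - X[i] @ w) ** 2)
    return float(np.mean(errs))

def count_local_minima(values):
    """Number of strict interior local minima in a 1-D sequence."""
    v = np.asarray(values)
    return int(np.sum((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])))

# Synthetic data; sizes, noise level, and the lambda grid are arbitrary choices.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
y = X @ rng.standard_normal(5) + 0.5 * rng.standard_normal(30)
lambdas = np.logspace(-3, 3, 200)
losses = [ridge_loocv_loss(X, y, lam) for lam in lambdas]
print("interior local minima on the grid:", count_local_minima(losses))
```

The closed-form shortcut makes the scan cheap, since each grid point needs one linear solve instead of n refits.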
Detection of Double Compression in MPEG-4 Videos Using Refined Features-based CNN
Nam, Seung-Hun, Ahn, Wonhyuk, Kwon, Myung-Joon, Yu, In-Jae
Double compression is accompanied by various types of video manipulation and its traces can be exploited to determine whether a video is a forgery. This Letter presents a convolutional neural network for detecting double compression in MPEG-4 videos. Through analysis of the intra-coding process, we utilize two refined features for capturing the subtle artifacts caused by double compression. The discrete cosine transform (DCT) histogram feature effectively detects the change of statistical characteristics in DCT coefficients and the parameter-based feature is utilized as auxiliary information to help the network learn double compression artifacts. When compared with state-of-the-art networks and a forensic method, the results show that the proposed approach achieves higher performance.
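A DCT histogram of the general kind referred to above can be sketched as follows: split a frame into 8 x 8 blocks, apply a 2-D DCT to each block, and histogram one quantized coefficient across all blocks. This is a generic illustration, not the Letter's refined feature; the block size, the chosen frequency position, the quantization `step`, and the bin range are all assumptions.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

def block_dct(frame, n=8):
    """2-D DCT of every non-overlapping n x n block of a grayscale frame."""
    h, w = (frame.shape[0] // n) * n, (frame.shape[1] // n) * n
    blocks = frame[:h, :w].reshape(h // n, n, w // n, n).transpose(0, 2, 1, 3)
    D = dct_matrix(n)
    return D @ blocks @ D.T  # shape: (block rows, block cols, n, n)

def dct_histogram(frame, freq=(0, 1), step=4.0, max_bin=10):
    """Histogram of one quantized DCT coefficient across all blocks.

    `step` and `max_bin` are illustrative; out-of-range values are clipped
    into the end bins so the counts always sum to the number of blocks.
    """
    coeffs = block_dct(frame)[:, :, freq[0], freq[1]].ravel()
    q = np.clip(np.round(coeffs / step), -max_bin, max_bin).astype(int)
    return np.bincount(q + max_bin, minlength=2 * max_bin + 1)

# Random frame standing in for a decoded luminance frame.
frame = np.random.default_rng(0).integers(0, 256, size=(64, 64)).astype(float)
hist = dct_histogram(frame)
print(hist)
```

Recompressing with a different quantization step tends to leave periodic patterns in such histograms, which is why they are a common starting point for double compression analysis.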