Performance Analysis
Logistic Regression for Beginners - A Complete Guide - Let's Discuss Stuff
Logistic Regression is the most widely used classification algorithm in machine learning. It is used in many real-world scenarios like spam detected, cancer detection, IRIS dataset, etc. Mostly it is used in binary classification problems. But it can also be used in multiclass classification. Logistic Regression predicts the probability that the given data point belongs to a certain class or not. In this article, I will be using the famous heart disease dataset from Kaggle. In this dataset, the main goal is to predict whether the given person has heart disease or not.
Coupled regularized sample covariance matrix estimator for multiple classes
The estimation of covariance matrices of multiple classes with limited training data is a difficult problem. The sample covariance matrix (SCM) is known to perform poorly when the number of variables is large compared to the available number of samples. In order to reduce the mean squared error (MSE) of the SCM, regularized (shrinkage) SCM estimators are often used. In this work, we consider regularized SCM (RSCM) estimators for multiclass problems that couple together two different target matrices for regularization: the pooled (average) SCM of the classes and the scaled identity matrix. Regularization toward the pooled SCM is beneficial when the population covariances are similar, whereas regularization toward the identity matrix guarantees that the estimators are positive definite. We derive the MSE optimal tuning parameters for the estimators as well as propose a method for their estimation under the assumption that the class populations follow (unspecified) elliptical distributions with finite fourth-order moments. The MSE performance of the proposed coupled RSCMs are evaluated with simulations and in a regularized discriminant analysis (RDA) classification set-up on real data. The results based on three different real data sets indicate comparable performance to cross-validation but with a significant speed-up in computation time.
Sparse Longitudinal Representations of Electronic Health Record Data for the Early Detection of Chronic Kidney Disease in Diabetic Patients
Zhang, Jinghe, Kowsari, Kamran, Boukhechba, Mehdi, Harrison, James, Lobo, Jennifer, Barnes, Laura
Chronic kidney disease (CKD) is a gradual loss of renal function over time, and it increases the risk of mortality, decreased quality of life, as well as serious complications. The prevalence of CKD has been increasing in the last couple of decades, which is partly due to the increased prevalence of diabetes and hypertension. To accurately detect CKD in diabetic patients, we propose a novel framework to learn sparse longitudinal representations of patients' medical records. The proposed method is also compared with widely used baselines such as Aggregated Frequency Vector and Bag-of-Pattern in Sequences on real EHR data, and the experimental results indicate that the proposed model achieves higher predictive performance. Additionally, the learned representations are interpreted and visualized to bring clinical insights.
Distance-Based Anomaly Detection for Industrial Surfaces Using Triplet Networks
Tayeh, Tareq, Aburakhia, Sulaiman, Myers, Ryan, Shami, Abdallah
Surface anomaly detection plays an important quality control role in many manufacturing industries to reduce scrap production. Machine-based visual inspections have been utilized in recent years to conduct this task instead of human experts. In particular, deep learning Convolutional Neural Networks (CNNs) have been at the forefront of these image processing-based solutions due to their predictive accuracy and efficiency. Training a CNN on a classification objective requires a sufficiently large amount of defective data, which is often not available. In this paper, we address that challenge by training the CNN on surface texture patches with a distance-based anomaly detection objective instead. A deep residual-based triplet network model is utilized, and defective training samples are synthesized exclusively from non-defective samples via random erasing techniques to directly learn a similarity metric between the same-class samples and out-of-class samples. Evaluation results demonstrate the approach's strength in detecting different types of anomalies, such as bent, broken, or cracked surfaces, for known surfaces that are part of the training data and unseen novel surfaces.
Six Ethical Quandaries of Predictive Policing - KDnuggets
Nowhere could the application of machine learning prove more important -- nor more risky -- than in law enforcement and national security. In this article, I'll review this area and then cover six perplexing and pressing ethical quandaries that arise. Predictive policing introduces a scientific element to law enforcement decisions, such as whether to investigate or detain, how long to sentence, and whether to parole. In making such decisions, judges and officers take into consideration the probability a suspect or defendant will be convicted for a crime in the future -- which is commonly the dependent variable for a predictive policing model. These independent variables may include prior convictions, income level, employment status, family background, neighborhood, education level, and the behavior of family and friends.
The Macroeconomy as a Random Forest
I develop Macroeconomic Random Forest (MRF), an algorithm adapting the canonical Machine Learning (ML) tool to flexibly model evolving parameters in a linear macro equation. Its main output, Generalized Time-Varying Parameters (GTVPs), is a versatile device nesting many popular nonlinearities (threshold/switching, smooth transition, structural breaks/change) and allowing for sophisticated new ones. The approach delivers clear forecasting gains over numerous alternatives, predicts the 2008 drastic rise in unemployment, and performs well for inflation. Unlike most ML-based methods, MRF is directly interpretable -- via its GTVPs. For instance, the successful unemployment forecast is due to the influence of forward-looking variables (e.g., term spreads, housing starts) nearly doubling before every recession. Interestingly, the Phillips curve has indeed flattened, and its might is highly cyclical.
Predictive Analysis of Diabetic Retinopathy with Transfer Learning
Labhsetwar, Shreyas Rajesh, Salvi, Raj Sunil, Kolte, Piyush Arvind, venkatesh, Veerasai Subramaniam, Baretto, Alistair Michael
With the prevalence of Diabetes, the Diabetes Mellitus Retinopathy (DR) is becoming a major health problem across the world. The long-term medical complications arising due to DR have a significant impact on the patient as well as the society, as the disease mostly affects individuals in their most productive years. Early detection and treatment can help reduce the extent of damage to the patients. The rise of Convolutional Neural Networks for predictive analysis in the medical field paves the way for a robust solution to DR detection. This paper studies the performance of several highly efficient and scalable CNN architectures for Diabetic Retinopathy Classification with the help of Transfer Learning. The research focuses on VGG16, Resnet50 V2 and EfficientNet B0 models. The classification performance is analyzed using several performance metrics including True Positive Rate, False Positive Rate, Accuracy, etc. Also, several performance graphs are plotted for visualizing the architecture performance including Confusion Matrix, ROC Curve, etc. The results indicate that Transfer Learning with ImageNet weights using VGG 16 model demonstrates the best classification performance with the best Accuracy of 95%. It is closely followed by ResNet50 V2 architecture with the best Accuracy of 93%. This paper shows that predictive analysis of DR from retinal images is achieved with Transfer Learning on Convolutional Neural Networks.
Stereo Frustums: A Siamese Pipeline for 3D Object Detection
Mo, Xi, Sajid, Usman, Wang, Guanghui
The paper proposes a light-weighted stereo frustums matching module for 3D objection detection. The proposed framework takes advantage of a high-performance 2D detector and a point cloud segmentation network to regress 3D bounding boxes for autonomous driving vehicles. Instead of performing traditional stereo matching to compute disparities, the module directly takes the 2D proposals from both the left and the right views as input. Based on the epipolar constraints recovered from the well-calibrated stereo cameras, we propose four matching algorithms to search for the best match for each proposal between the stereo image pairs. Each matching pair proposes a segmentation of the scene which is then fed into a 3D bounding box regression network. Results of extensive experiments on KITTI dataset demonstrate that the proposed Siamese pipeline outperforms the state-of-the-art stereo-based 3D bounding box regression methods.
On the Privacy Risks of Algorithmic Fairness
Algorithmic fairness and privacy are essential elements of trustworthy machine learning for critical decision making processes. Fair machine learning algorithms are developed to minimize discrimination against protected groups in machine learning. This is achieved, for example, by imposing a constraint on the model to equalize its behavior across different groups. This can significantly increase the influence of some training data points on the fair model. We study how this change in influence can change the information leakage of the model about its training data. We analyze the privacy risks of statistical notions of fairness (i.e., equalized odds) through the lens of membership inference attacks: inferring whether a data point was used for training a model. We show that fairness comes at the cost of privacy. However, this privacy cost is not distributed equally: the information leakage of fair models increases significantly on the unprivileged subgroups, which suffer from the discrimination in regular models. Furthermore, the more biased the underlying data is, the higher the privacy cost of achieving fairness for the unprivileged subgroups is. We demonstrate this effect on multiple datasets and explain how fairness-aware learning impacts privacy.