Performance Analysis
Machine learning for dynamically predicting the onset of renal replacement therapy in chronic kidney disease patients using claims data
Lopez-Martinez, Daniel, Chen, Christina, Chen, Ming-Jun
Chronic kidney disease (CKD) represents a slowly progressive disorder that can eventually require renal replacement therapy (RRT) including dialysis or renal transplantation. Early identification of patients who will require RRT (as much as 1 year in advance) improves patient outcomes, for example by allowing higher-quality vascular access for dialysis. Therefore, early recognition of the need for RRT by care teams is key to successfully managing the disease. Unfortunately, there is currently no commonly used predictive tool for RRT initiation. In this work, we present a machine learning model that dynamically identifies CKD patients at risk of requiring RRT up to one year in advance using only claims data. To evaluate the model, we studied approximately 3 million Medicare beneficiaries for which we made over 8 million predictions. We showed that the model can identify at risk patients with over 90% sensitivity and specificity. Although additional work is required before this approach is ready for clinical use, this study provides a basis for a screening tool to identify patients at risk within a time window that enables early proactive interventions intended to improve RRT outcomes.
Phishing URL Detection: A Network-based Approach Robust to Evasion
Kim, Taeri, Park, Noseong, Hong, Jiwon, Kim, Sang-Wook
Many cyberattacks start with disseminating phishing URLs. When clicking these phishing URLs, the victim's private information is leaked to the attacker. There have been proposed several machine learning methods to detect phishing URLs. However, it still remains under-explored to detect phishing URLs with evasion, i.e., phishing URLs that pretend to be benign by manipulating patterns. In many cases, the attacker i) reuses prepared phishing web pages because making a completely brand-new set costs non-trivial expenses, ii) prefers hosting companies that do not require private information and are cheaper than others, iii) prefers shared hosting for cost efficiency, and iv) sometimes uses benign domains, IP addresses, and URL string patterns to evade existing detection methods. Inspired by those behavioral characteristics, we present a network-based inference method to accurately detect phishing URLs camouflaged with legitimate patterns, i.e., robust to evasion. In the network approach, a phishing URL will be still identified as phishy even after evasion unless a majority of its neighbors in the network are evaded at the same time. Our method consistently shows better detection performance throughout various experimental tests than state-of-the-art methods, e.g., F-1 of 0.89 for our method vs. 0.84 for the best feature-based method.
Real or Not? Disaster Tweets classification with RoBERTa
This article was published as a part of the Data Science Blogathon. Today we live in a world of active social networking where every kind of information is shared among users worldwide. This is greatly facilitated by the ubiquitousness of smartphones and other handheld communication devices. Some popular sites are Facebook, Whatsapp, LinkedIn, etc.; however, Twitter is a viral microblogging site used worldwide for open information exchange. On Twitter, various types of information are exchanged in the form of short messages that include information regarding any mishaps or accidents happening worldwide.
ML interview preparation-- popular topics
Hello, happy to see you here. Today let's dive into some popular topics which are often discussed in machine learning interviews. Without further ado, presenting to you the next cheat sheet. But what exactly do we do when we already have data? We will stop on each task separately in detail in next articles.
Exploiting Fairness to Enhance Sensitive Attributes Reconstruction
Ferry, Julien, Aïvodji, Ulrich, Gambs, Sébastien, Huguet, Marie-José, Siala, Mohamed
In recent years, a growing body of work has emerged on how to learn machine learning models under fairness constraints, often expressed with respect to some sensitive attributes. In this work, we consider the setting in which an adversary has black-box access to a target model and show that information about this model's fairness can be exploited by the adversary to enhance his reconstruction of the sensitive attributes of the training data. More precisely, we propose a generic reconstruction correction method, which takes as input an initial guess made by the adversary and corrects it to comply with some user-defined constraints (such as the fairness information) while minimizing the changes in the adversary's guess. The proposed method is agnostic to the type of target model, the fairness-aware learning method as well as the auxiliary knowledge of the adversary. To assess the applicability of our approach, we have conducted a thorough experimental evaluation on two state-of-the-art fair learning methods, using four different fairness metrics with a wide range of tolerances and with three datasets of diverse sizes and sensitive attributes. The experimental results demonstrate the effectiveness of the proposed approach to improve the reconstruction of the sensitive attributes of the training set.
Semi-WTC: A Practical Semi-supervised Framework for Attack Categorization through Weight-Task Consistency
Li, Zihan, Chen, Wentao, Wei, Zhiqing, Luo, Xingqi, Su, Bing
Supervised learning has been widely used for attack categorization, requiring high-quality data and labels. However, the data is often imbalanced and it is difficult to obtain sufficient annotations. Moreover, supervised models are subject to real-world deployment issues, such as defending against unseen artificial attacks. To tackle the challenges, we propose a semi-supervised fine-grained attack categorization framework consisting of an encoder and a two-branch structure and this framework can be generalized to different supervised models. The multilayer perceptron with residual connection is used as the encoder to extract features and reduce the complexity. The Recurrent Prototype Module (RPM) is proposed to train the encoder effectively in a semi-supervised manner. To alleviate the data imbalance problem, we introduce the Weight-Task Consistency (WTC) into the iterative process of RPM by assigning larger weights to classes with fewer samples in the loss function. In addition, to cope with new attacks in real-world deployment, we propose an Active Adaption Resampling (AAR) method, which can better discover the distribution of unseen sample data and adapt the parameters of encoder. Experimental results show that our model outperforms the state-of-the-art semi-supervised attack detection methods with a 3% improvement in classification accuracy and a 90% reduction in training time.
Label Structure Preserving Contrastive Embedding for Multi-Label Learning with Missing Labels
Ma, Zhongchen, Li, Lisha, Mao, Qirong, Chen, Songcan
Contrastive learning (CL) has shown impressive advances in image representation learning in whichever supervised multi-class classification or unsupervised learning. However, these CL methods fail to be directly adapted to multi-label image classification due to the difficulty in defining the positive and negative instances to contrast a given anchor image in multi-label scenario, let the label missing one alone, implying that borrowing a commonly-used way from contrastive multi-class learning to define them will incur a lot of false negative instances unfavorable for learning. In this paper, with the introduction of a label correction mechanism to identify missing labels, we first elegantly generate positives and negatives for individual semantic labels of an anchor image, then define a unique contrastive loss for multi-label image classification with missing labels (CLML), the loss is able to accurately bring images close to their true positive images and false negative images, far away from their true negative images. Different from existing multi-label CL losses, CLML also preserves low-rank global and local label dependencies in the latent representation space where such dependencies have been shown to be helpful in dealing with missing labels. To the best of our knowledge, this is the first general multi-label CL loss in the missing-label scenario and thus can seamlessly be paired with those losses of any existing multi-label learning methods just via a single hyperparameter. The proposed strategy has been shown to improve the classification performance of the Resnet101 model by margins of 1.2%, 1.6%, and 1.3% respectively on three standard datasets, MSCOCO, VOC, and NUS-WIDE. Code is available at https://github.com/chuangua/ContrastiveLossMLML.
On Effectively Predicting Autism Spectrum Disorder Using an Ensemble of Classifiers
Twala, Bhekisipho, Molloy, Eamon
An ensemble of classifiers combines several single classifiers to deliver a final prediction or classification decision. An increasingly provoking question is whether such systems can outperform the single best classifier. If so, what form of an ensemble of classifiers (also known as multiple classifier learning systems or multiple classifiers) yields the most significant benefits in the size or diversity of the ensemble itself? Given that the tests used to detect autism traits are time-consuming and costly, developing a system that will provide the best outcome and measurement of autism spectrum disorder (ASD) has never been critical. In this paper, several single and later multiple classifiers learning systems are evaluated in terms of their ability to predict and identify factors that influence or contribute to ASD for early screening purposes. A dataset of behavioural data and robot-enhanced therapy of 3,000 sessions and 300 hours, recorded from 61 children are utilised for this task. Simulation results show the superior predictive performance of multiple classifier learning systems (especially those with three classifiers per ensemble) compared to individual classifiers, with bagging and boosting achieving excellent results. It also appears that social communication gestures remain the critical contributing factor to the ASD problem among children.
Are Attribute Inference Attacks Just Imputation?
Jayaraman, Bargav, Evans, David
Models can expose sensitive information about their training data. In an attribute inference attack, an adversary has partial knowledge of some training records and access to a model trained on those records, and infers the unknown values of a sensitive feature of those records. We study a fine-grained variant of attribute inference we call \emph{sensitive value inference}, where the adversary's goal is to identify with high confidence some records from a candidate set where the unknown attribute has a particular sensitive value. We explicitly compare attribute inference with data imputation that captures the training distribution statistics, under various assumptions about the training data available to the adversary. Our main conclusions are: (1) previous attribute inference methods do not reveal more about the training data from the model than can be inferred by an adversary without access to the trained model, but with the same knowledge of the underlying distribution as needed to train the attribute inference attack; (2) black-box attribute inference attacks rarely learn anything that cannot be learned without the model; but (3) white-box attacks, which we introduce and evaluate in the paper, can reliably identify some records with the sensitive value attribute that would not be predicted without having access to the model. Furthermore, we show that proposed defenses such as differentially private training and removing vulnerable records from training do not mitigate this privacy risk. The code for our experiments is available at \url{https://github.com/bargavj/EvaluatingDPML}.
Improving debris flow evacuation alerts in Taiwan using machine learning
Tsai, Yi-Lin, Irvin, Jeremy, Chundi, Suhas, Ng, Andrew Y., Field, Christopher B., Kitanidis, Peter K.
Taiwan has the highest susceptibility to and fatalities from debris flows worldwide. The existing debris flow warning system in Taiwan, which uses a time-weighted measure of rainfall, leads to alerts when the measure exceeds a predefined threshold. However, this system generates many false alarms and misses a substantial fraction of the actual debris flows. Towards improving this system, we implemented five machine learning models that input historical rainfall data and predict whether a debris flow will occur within a selected time. We found that a random forest model performed the best among the five models and outperformed the existing system in Taiwan. Furthermore, we identified the rainfall trajectories strongly related to debris flow occurrences and explored trade-offs between the risks of missing debris flows versus frequent false alerts. These results suggest the potential for machine learning models trained on hourly rainfall data alone to save lives while reducing false alerts.