Accuracy
Inspection-L: Self-Supervised GNN Node Embeddings for Money Laundering Detection in Bitcoin
Lo, Wai Weng, Kulatilleke, Gayan K., Sarhan, Mohanad, Layeghy, Siamak, Portmann, Marius
Criminals have become increasingly experienced in using cryptocurrencies, such as Bitcoin, for money laundering. The use of cryptocurrencies can hide criminal identities and transfer hundreds of millions of dollars of dirty funds through their criminal digital wallets. However, this is considered a paradox because cryptocurrencies are goldmines for open-source intelligence, giving law enforcement agencies more power when conducting forensic analyses. This paper proposed Inspection-L, a graph neural network (GNN) framework based on a self-supervised Deep Graph Infomax (DGI) and Graph Isomorphism Network (GIN), with supervised learning algorithms, namely Random Forest (RF), to detect illicit transactions for anti-money laundering (AML). To the best of our knowledge, our proposal is the first to apply self-supervised GNNs to the problem of AML in Bitcoin. The proposed method was evaluated on the Elliptic dataset and shows that our approach outperforms the state-of-the-art in terms of key classification metrics, which demonstrates the potential of self-supervised GNN in the detection of illicit cryptocurrency transactions.
FDA Publishes Updated List With 521 Authorized AI/ML Enabled Devices
Since 1995, the FDA has authorized more than 500 AI/ML-enabled medical devices via 510(k) clearance, granted De Novo request, or approved PMA. This week the FDA published an updated list with 178 new devices that were authorized through July 2022. According to the FDA, their list is based on publicly available information and is not a comprehensive resource of FDA approved AI/ML-enabled medical devices. In today's DeepTech newsletter I'm sharing a high level analysis of the 521 devices on the list, charts to visualize the data, and a summary of milestones. Note: According to the FDA their list is based on publicly available information and is not a comprehensive resource of approved AI/ML-enabled medical devices.
Modeling Dependent Structure for Utterances in ASR Evaluation
The bootstrap resampling method has been popular for performing significance analysis on word error rate (WER) in automatic speech recognition (ASR) evaluation. To deal with dependent speech data, the blockwise bootstrap approach is also introduced. By dividing utterances into uncorrelated blocks, this approach resamples these blocks instead of original data. However, it is typically nontrivial to uncover the dependent structure among utterances and identify the blocks, which might lead to subjective conclusions in statistical testing. In this paper, we present graphical lasso based methods to explicitly model such dependency and estimate uncorrelated blocks of utterances in a rigorous way, after which blockwise bootstrap is applied on top of the inferred blocks. We show the resulting variance estimator of WER in ASR evaluation is statistically consistent under mild conditions. We also demonstrate the validity of proposed approach on LibriSpeech dataset.
Performances of Symmetric Loss for Private Data from Exponential Mechanism
Bi, Jing, Suppakitpaisarn, Vorapong
This study explores the robustness of learning by symmetric loss on private data. Specifically, we leverage exponential mechanism (EM) on private labels. First, we theoretically re-discussed properties of EM when it is used for private learning with symmetric loss. Then, we propose numerical guidance of privacy budgets corresponding to different data scales and utility guarantees. Further, we conducted experiments on the CIFAR-10 dataset to present the traits of symmetric loss. Since EM is a more generic differential privacy (DP) technique, it being robust has the potential for it to be generalized, and to make other DP techniques more robust.
Detecting Label Errors in Token Classification Data
Wang, Wei-Chen, Mueller, Jonas
Mislabeled examples are a common issue in real-world data, particularly for tasks like token classification where many labels must be chosen on a fine-grained basis. Here we consider the task of finding sentences that contain label errors in token classification datasets. We study 11 different straightforward methods that score tokens/sentences based on the predicted class probabilities output by a (any) token classification model (trained via any procedure). In precision-recall evaluations based on real-world label errors in entity recognition data from CoNLL-2003, we identify a simple and effective method that consistently detects those sentences containing label errors when applied with different token classification models.
The fight against money laundering: Machine learning is a game changer
The volume of money laundering and other financial crimes is growing worldwide--and the techniques used to evade their detection are becoming ever more sophisticated. This has elicited a vigorous response from banks, which, collectively, are investing billions each year to improve their defenses against financial crime (in 2020, institutions spent an estimated $214 billion on financial-crime compliance). 1 1. What's more, the resulting regulatory fines related to compliance are surging year over year as regulator's impose tougher penalties. But banks' traditional rule- and scenario-based approaches to fighting financial crimes has always seemed a step behind the bad guys, making the fight against money laundering an ongoing challenge for compliance, monitoring, and risk organizations. Now, there is an opportunity for banks to get out in front.
ROC and AUC for Model Evaluation
ROC or Receiver Operating Characteristic Curve is the most frequently used tool for evaluating the binary or multi-class classification model. Unlike other metrics, it is calculated on prediction scores like Precision-Recall Curve instead of prediction class. In my previous post, the importance of the precision-recall curve is highlighted as how to plot for multi-class classification. To understand ROC Curve, let's quickly refresh our memory on the possible outcomes in a binary classification problem by referring to the Confusion Matrix. ROC Curve is a plot of True Positive Rate(TPR) plotted against False Positive Rate(FPR) at various threshold values. It helps to visualize how threshold affects classifier performance.
Constructing Prediction Intervals with Neural Networks: An Empirical Evaluation of Bootstrapping and Conformal Inference Methods
Contarino, Alex, Kabban, Christine Schubert, Johnstone, Chancellor, Mohd-Zaid, Fairul
Artificial neural networks (ANNs) are popular tools for accomplishing many machine learning tasks, including predicting continuous outcomes. However, the general lack of confidence measures provided with ANN predictions limit their applicability. Supplementing point predictions with prediction intervals (PIs) is common for other learning algorithms, but the complex structure and training of ANNs renders constructing PIs difficult. This work provides the network design choices and inferential methods for creating better performing PIs with ANNs. A two-step experiment is executed across 11 data sets, including an imaged-based data set. Two distribution-free methods for constructing PIs, bootstrapping and conformal inference, are considered. The results of the first experimental step reveal that the choices inherent to building an ANN affect PI performance. Guidance is provided for optimizing PI performance with respect to each network feature and PI method. In the second step, 20 algorithms for constructing PIs, each using the principles of bootstrapping or conformal inference, are implemented to determine which provides the best performance while maintaining reasonable computational burden. In general, this trade-off is optimized when implementing the cross-conformal method, which maintained interval coverage and efficiency with decreased computational burden.
A Keyword Based Approach to Understanding the Overpenalization of Marginalized Groups by English Marginal Abuse Models on Twitter
Yee, Kyra, Sebag, Alice Schoenauer, Redfield, Olivia, Sheng, Emily, Eck, Matthias, Belli, Luca
Harmful content detection models tend to have higher false positive rates for content from marginalized groups. In the context of marginal abuse modeling on Twitter, such disproportionate penalization poses the risk of reduced visibility, where marginalized communities lose the opportunity to voice their opinion on the platform. Current approaches to algorithmic harm mitigation, and bias detection for NLP models are often very ad hoc and subject to human bias. We make two main contributions in this paper. First, we design a novel methodology, which provides a principled approach to detecting and measuring the severity of potential harms associated with a text-based model. Second, we apply our methodology to audit Twitter's English marginal abuse model, which is used for removing amplification eligibility of marginally abusive content. Without utilizing demographic labels or dialect classifiers, we are still able to detect and measure the severity of issues related to the over-penalization of the speech of marginalized communities, such as the use of reclaimed speech, counterspeech, and identity related terms. In order to mitigate the associated harms, we experiment with adding additional true negative examples and find that doing so provides improvements to our fairness metrics without large degradations in model performance.
Evaluating the Performance of StyleGAN2-ADA on Medical Images
Woodland, McKell, Wood, John, Anderson, Brian M., Kundu, Suprateek, Lin, Ethan, Koay, Eugene, Odisio, Bruno, Chung, Caroline, Kang, Hyunseon Christine, Venkatesan, Aradhana M., Yedururi, Sireesha, De, Brian, Lin, Yuan-Mao, Patel, Ankit B., Brock, Kristy K.
Although generative adversarial networks (GANs) have shown promise in medical imaging, they have four main limitations that impeded their utility: computational cost, data requirements, reliable evaluation measures, and training complexity. Our work investigates each of these obstacles in a novel application of StyleGAN2-ADA to high-resolution medical imaging datasets. Our dataset is comprised of liver-containing axial slices from non-contrast and contrast-enhanced computed tomography (CT) scans. Additionally, we utilized four public datasets composed of various imaging modalities. We trained a StyleGAN2 network with transfer learning (from the Flickr-Faces-HQ dataset) and data augmentation (horizontal flipping and adaptive discriminator augmentation). The network's generative quality was measured quantitatively with the Fr\'echet Inception Distance (FID) and qualitatively with a visual Turing test given to seven radiologists and radiation oncologists. The StyleGAN2-ADA network achieved a FID of 5.22 ($\pm$ 0.17) on our liver CT dataset. It also set new record FIDs of 10.78, 3.52, 21.17, and 5.39 on the publicly available SLIVER07, ChestX-ray14, ACDC, and Medical Segmentation Decathlon (brain tumors) datasets. In the visual Turing test, the clinicians rated generated images as real 42% of the time, approaching random guessing. Our computational ablation study revealed that transfer learning and data augmentation stabilize training and improve the perceptual quality of the generated images. We observed the FID to be consistent with human perceptual evaluation of medical images. Finally, our work found that StyleGAN2-ADA consistently produces high-quality results without hyperparameter searches or retraining.