Accuracy
Metric Elicitation; Moving from Theory to Practice
Ali, Safinah, Upadhyay, Sohini, Hiranandani, Gaurush, Glassman, Elena L., Koyejo, Oluwasanmi
Metric Elicitation (ME) is a framework for eliciting classification metrics that better align with implicit user preferences based on the task and context. The existing ME strategy is based on the assumption that users can most easily provide preference feedback over classifier statistics such as confusion matrices. This work examines ME by providing the first implementation of the ME strategy. Specifically, we create a web-based ME interface and conduct a user study that elicits users' preferred metrics in a binary classification setting. We discuss the study findings and present guidelines for future research in this direction.
Out-of-Distribution Detection with Deep Nearest Neighbors
Sun, Yiyou, Ming, Yifei, Zhu, Xiaojin, Li, Yixuan
Out-of-distribution (OOD) detection is a critical task for deploying machine learning models in the open world. Distance-based methods have demonstrated promise, where testing samples are detected as OOD if they are relatively far away from in-distribution (ID) data. However, prior methods impose a strong distributional assumption on the underlying feature space, which may not always hold. In this paper, we explore the efficacy of non-parametric nearest-neighbor distance for OOD detection, which has been largely overlooked in the literature. Unlike prior works, our method does not impose any distributional assumption, hence providing stronger flexibility and generality. We demonstrate the effectiveness of nearest-neighbor-based OOD detection on several benchmarks and establish superior performance. Under the same model trained on ImageNet-1k, our method substantially reduces the false positive rate (FPR@TPR95) by 24.77% compared to a strong baseline, SSD+, which uses a parametric Mahalanobis distance for detection. Code is available: https://github.com/deeplearning-wisc/knn-ood.
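The distance-based idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their code is at the URL above); the L2 normalization step, the toy feature vectors, the value of k, and the threshold are all assumptions chosen for illustration. A query is scored by its distance to the k-th nearest in-distribution feature and flagged as OOD when that distance exceeds a threshold.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def kth_nn_distance(query, bank, k):
    """Distance from the query to its k-th nearest neighbor in the ID feature bank."""
    q = l2_normalize(query)
    dists = sorted(math.dist(q, l2_normalize(b)) for b in bank)
    return dists[k - 1]

def is_ood(query, bank, k, threshold):
    """Flag the query as OOD when its k-th neighbor is farther than the threshold."""
    return kth_nn_distance(query, bank, k) > threshold

# Hypothetical 2-D features standing in for penultimate-layer embeddings.
id_features = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.8, 0.1]]
near_query = [0.95, 0.1]   # resembles the ID data
far_query = [-0.2, 1.0]    # points in a very different direction

print(is_ood(near_query, id_features, k=2, threshold=0.5))
print(is_ood(far_query, id_features, k=2, threshold=0.5))
```

In practice the threshold would be chosen on held-out ID data, for example so that 95% of ID samples are accepted, which matches the FPR@TPR95 metric mentioned in the abstract. No distribution is fitted to the features at any point, which is what makes the score non-parametric.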
On Probability versus Likelihood. A discussion about two terms that are…
From the perspective of machine learning and data science, probabilities and likelihoods are used to quantify uncertainty, or perhaps how probable it is that an observation belongs to one class or another. They crop up when looking at confusion matrices; and indeed, algorithms like Naive Bayes classification are pretty much probabilistic models. The reality is that data scientists cannot escape these concepts. In everyday language, though, we tend to use the terms probability and likelihood almost interchangeably. Indeed, it's not uncommon to hear things like 'how likely is it to rain today?' or 'what are the chances of this or that happening?'
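One common way to separate the two terms, sketched here with a coin-flip example that is not taken from the article: a probability fixes the parameter and varies the outcome, while a likelihood fixes the observed outcome and varies the parameter. The binomial model, the 10 flips, and the 7 observed heads are assumptions for illustration.

```python
from math import comb

def binomial_pmf(heads, flips, p):
    """Binomial formula: chance of `heads` successes in `flips` trials at success rate p."""
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)

# Probability: the parameter is fixed (p = 0.5) and the outcome varies.
# Summed over every possible outcome, the values add up to 1.
probabilities = [binomial_pmf(k, 10, 0.5) for k in range(11)]

# Likelihood: the data are fixed (7 heads in 10 flips) and the parameter varies.
# These values rank candidate parameters and need not add up to 1.
grid = [i / 100 for i in range(1, 100)]
likelihoods = [binomial_pmf(7, 10, p) for p in grid]
best_p = grid[likelihoods.index(max(likelihoods))]

print(sum(probabilities))
print(best_p)
```

The same formula is used in both cases; only the quantity treated as variable changes. The grid search recovers the familiar maximum-likelihood estimate, the observed fraction of heads.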
Financial Risk Management on a Neutral Atom Quantum Processor
Leclerc, Lucas, Ortiz-Guitierrez, Luis, Grijalva, Sebastian, Albrecht, Boris, Cline, Julia R. K., Elfving, Vincent E., Signoles, Adrien, Henriet, Loïc, Del Bimbo, Gianni, Sheikh, Usman Ayub, Shah, Maitree, Andrea, Luc, Ishtiaq, Faysal, Duarte, Andoni, Mugel, Samuel, Caceres, Irene, Kurek, Michel, Orus, Roman, Seddik, Achraf, Hammammi, Oumaima, Isselnane, Hacene, M'tamon, Didier
Machine Learning models capable of handling the large datasets collected in the financial world can often become black boxes that are expensive to run. The quantum computing paradigm suggests new optimization techniques that, combined with classical algorithms, may deliver competitive, faster and more interpretable models. In this work we propose a quantum-enhanced machine learning solution for the prediction of credit rating downgrades, also known as fallen-angels forecasting in the financial risk management field. We implement this solution on a neutral atom Quantum Processing Unit with up to 60 qubits on a real-life dataset. We report competitive performances against the state-of-the-art Random Forest benchmark whilst our model achieves better interpretability and comparable training times. We examine how to improve performance in the near-term, validating our ideas with Tensor Networks-based numerical simulations.
Ergo, SMIRK is Safe: A Safety Case for a Machine Learning Component in a Pedestrian Automatic Emergency Brake System
Borg, Markus, Henriksson, Jens, Socha, Kasper, Lennartsson, Olof, Lönegren, Elias Sonnsjö, Bui, Thanh, Tomaszewski, Piotr, Sathyamoorthy, Sankar Raman, Brink, Sebastian, Moghadam, Mahshid Helali
Machine Learning (ML) is increasingly used in critical applications, e.g., supervised learning using Deep Neural Networks (DNN) to support automotive perception. Software systems developed for safety-critical applications must undergo assessments to demonstrate compliance with functional safety standards. However, as the conventional safety standards are not fully applicable for ML-enabled systems (Salay et al, 2018; Tambon et al, 2022), several domain-specific initiatives aim to complement them, e.g., organized by the EU Aviation Safety Agency, the ITU-WHO Focus Group on AI for Health, and the International Organization for Standardization. In the automotive industry, several standardization initiatives are ongoing to allow safe use of ML in road vehicles. It is evident that the established functional safety as defined in ISO 26262 Functional Safety (FuSa) is no longer sufficient for the next generation of Advanced Driver-Assistance Systems (ADAS) and Autonomous Driving (AD). One complementary standard under development is ISO 21448 Safety of the Intended Functionality (SOTIF). SOTIF aims for absence of unreasonable risk due to hazards resulting from functional insufficiencies, incl.
Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS
Tom, Gary, Hickman, Riley J., Zinzuwadia, Aniket, Mohajeri, Afshan, Sanchez-Lengeling, Benjamin, Aspuru-Guzik, Alan
Deep learning models that leverage large datasets are often the state of the art for modelling molecular properties. When the datasets are smaller (< 2000 molecules), it is not clear that deep learning approaches are the right modelling tool. In this work we perform an extensive study of the calibration and generalizability of probabilistic machine learning models on small chemical datasets. Using different molecular representations and models, we analyse the quality of their predictions and uncertainties in a variety of tasks (binary, regression) and datasets. We also introduce two simulated experiments that evaluate their performance: (1) Bayesian optimization guided molecular design, (2) inference on out-of-distribution data via ablated cluster splits. We offer practical insights into model and feature choice for modelling small chemical datasets, a common scenario in new chemical experiments. We have packaged our analysis into the DIONYSUS repository, which is open sourced to aid in reproducibility and extension to new datasets.
Can we integrate spatial verification methods into neural-network loss functions for atmospheric science?
Lagerquist, Ryan, Ebert-Uphoff, Imme
In the last decade, much work in atmospheric science has focused on spatial verification (SV) methods for gridded prediction, which overcome serious disadvantages of pixelwise verification. However, neural networks (NN) in atmospheric science are almost always trained to optimize pixelwise loss functions, even when ultimately assessed with SV methods. This establishes a disconnect between model verification during vs. after training. To address this issue, we develop spatially enhanced loss functions (SELF) and demonstrate their use for a real-world problem: predicting the occurrence of thunderstorms (henceforth, "convection") with NNs. In each SELF we use either a neighbourhood filter, which highlights convection at scales larger than a threshold, or a spectral filter (employing Fourier or wavelet decomposition), which is more flexible and highlights convection at scales between two thresholds. We use these filters to spatially enhance common verification scores, such as the Brier score. We train each NN with a different SELF and compare their performance at many scales of convection, from discrete storm cells to tropical cyclones. Among our many findings are that (a) for a low (high) risk threshold, the ideal SELF focuses on small (large) scales; (b) models trained with a pixelwise loss function perform surprisingly well; (c) however, models trained with a spectral filter produce much better-calibrated probabilities than a pixelwise model. We provide a general guide to using SELFs, including technical challenges and the final Python code, as well as demonstrating their use for the convection problem. To our knowledge this is the most in-depth guide to SELFs in the geosciences.
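The idea of spatially enhancing a verification score can be sketched as follows. This is not the authors' code: the square mean filter used as the neighbourhood filter, the 4x4 grids, and the function names are all assumptions for illustration. Both the forecast and the observation are smoothed before the Brier score is computed, so a storm forecast one pixel away from where it occurred is penalized less than under the pixelwise score.

```python
def box_filter(grid, half_width):
    """Replace each cell with the mean of its square neighbourhood (assumed filter)."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [
                grid[a][b]
                for a in range(max(0, i - half_width), min(rows, i + half_width + 1))
                for b in range(max(0, j - half_width), min(cols, j + half_width + 1))
            ]
            out[i][j] = sum(window) / len(window)
    return out

def brier_score(forecast, observed):
    """Mean squared difference between forecast probabilities and observations."""
    cells = [(f, o) for f_row, o_row in zip(forecast, observed)
             for f, o in zip(f_row, o_row)]
    return sum((f - o) ** 2 for f, o in cells) / len(cells)

def neighbourhood_brier(forecast, observed, half_width):
    """Brier score computed on neighbourhood-filtered fields."""
    return brier_score(box_filter(forecast, half_width),
                       box_filter(observed, half_width))

# Hypothetical case: the forecast places a storm one grid cell east of the observation.
observed = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
forecast = [[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

print(brier_score(forecast, observed))
print(neighbourhood_brier(forecast, observed, half_width=1))
```

To serve as a training loss, the same computation would need to be written with differentiable tensor operations in the chosen deep-learning framework; the pure-Python version only shows the scoring logic.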
Evaluate Language Understanding of AI Models
The GLUE benchmark contains datasets and measures to evaluate general NLP models. With many general-purpose language models available today, it is important to know how they perform across different tasks and not just a specific one. There is also a leaderboard that shows the ranking of these general-purpose models on different datasets. We discuss each task briefly, followed by an example. Understanding some basic metrics, such as accuracy and F1-score, would be helpful to grasp how these models are evaluated.
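As a quick refresher on the two metrics named above, here is a minimal sketch for binary labels. The label lists are made up for illustration and are not drawn from any GLUE task.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical imbalanced example: two positives, one of them missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(accuracy(y_true, y_pred))   # 0.9
print(f1_score(y_true, y_pred))
```

On imbalanced data the two metrics can diverge: here accuracy is 0.9 while F1 is about 0.67, because F1 ignores true negatives and is pulled down by the missed positive.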
Point – Counterpoint on Why Organizations Suck at AI - DataScienceCentral.com
I love this infographic recently floating around LinkedIn. Sorry, don't know to whom to give credit, but it does provide an interesting depiction of how senior management thinks AI works and the realities of what's required to make AI work (Figure 1). Intent is an understanding and clarification of the intended need or objective defined at the beginning of the process. Intent is the why we are on this journey. Understanding Intent requires a detailed articulation of what you are trying to accomplish (e.g., objectives, need, purpose), what are the KPIs and metrics against which you will measure progress and success, who are the different stakeholders and constituents who will be involved in the scoping and execution of business objectives, what are the key decisions that these stakeholders need to make in support of the objectives and what are the KPIs and metrics against which they will measure progress and success, what are the Desired Outcomes, what are the potential costs associated with making the wrong decisions (critical for understanding the ramifications of False Positives and False Negatives), what are the ramifications of objective or need failure, what are the potential unintended consequences…should I keep going (Figure 2)?
Which products activate a product? An explainable machine learning approach
Fessina, Massimiliano, Albora, Giambattista, Tacchella, Andrea, Zaccaria, Andrea
Tree-based machine learning algorithms provide the most precise assessment of the feasibility for a country to export a target product given its export basket. However, the high number of parameters involved prevents a straightforward interpretation of the results and, in turn, the explainability of policy indications. In this paper, we propose a procedure to statistically validate the importance of the products used in the feasibility assessment. In this way, we are able to identify which products, called explainers, significantly increase the probability to export a target product in the near future. The explainers naturally identify a low dimensional representation, the Feature Importance Product Space, that enhances the interpretability of the recommendations and provides out-of-sample forecasts of the export baskets of countries. Interestingly, we detect a positive correlation between the complexity of a product and the complexity of its explainers.