Provo
The Generalized Proximity Forest
Shaw, Ben, Rustad, Adam, Maia, Sofia Pelagalli, Rhodes, Jake S., Moon, Kevin R.
Abstract--Recent work has demonstrated the utility of Random Forest (RF) proximities for various supervised machine learning tasks, including outlier detection, missing data imputation, and visualization. However, the utility of the RF proximities depends upon the success of the RF model, which itself is not the ideal model in all contexts. RF proximities have recently been extended to time series by means of the distance-based Proximity Forest (PF) model, among others, affording time series analysis with the benefits of RF proximities. In this work, we introduce the generalized PF model, thereby extending RF proximities to all contexts in which supervised distance-based machine learning can occur . Additionally, we introduce a variant of the PF model for regression tasks. We also introduce the notion of using the generalized PF model as a meta-learning framework, extending supervised imputation capability to any pre-trained classifier . We experimentally demonstrate the unique advantages of the generalized PF model compared with both the RF model and the k-nearest neighbors model.
Localized Uncertainty Quantification in Random Forests via Proximities
Rhodes, Jake S., Brown, Scott D., Wilkinson, J. Riley
Abstract--In machine learning, uncertainty quantification helps assess the reliability of model predictions, which is important in high-stakes scenarios. Traditional approaches often emphasize predictive accuracy, but there is a growing focus on incorporating uncertainty measures. While current methods often rely on quantile regression or Monte Carlo techniques, we propose a new approach using naturally occurring test sets and similarity measures (proximities) typically viewed as byproducts of random forests. Specifically, we form localized distributions of OOB errors around nearby points, defined using the proximities, to create prediction intervals for regression and trust scores for classification. By varying the number of nearby points, our intervals can be adjusted to achieve the desired coverage while retaining the flexibility that reflects the certainty of individual predictions. For classification, excluding points identified as unclassifiable by our method generally enhances the accuracy of the model and provides higher accuracy-rejection AUC scores than competing methods. Although traditional machine learning models usually provide point estimates, there is growing recognition of the need to incorporate uncertainty to support more informed decisions [1]. By quantifying uncertainty, users can assess the reliability of model outputs and better interpret results, especially for out-of-distribution samples through calibrated confidence estimates.
Label-Guided Imputation via Forest-Based Proximities for Improved Time Series Classification
Rhodes, Jake S., Rustad, Adam G., Maia, Sofia Pelagalli, Thacker, Evan, Choi, Hyunmi, Gutierrez, Jose, Rundek, Tatjana, Shaw, Ben
Missing data is a common problem in time series data. Most methods for imputation ignore label information pertaining to the time series even if that information exists. In this paper, we provide a framework for missing data imputation in the context of time series classification, where each time series is associated with a categorical label. We define a means of imputing missing values conditional upon labels, the method being guided by powerful, existing supervised models designed for high accuracy in this task. From each model, we extract a tree-based proximity measure from which imputation can be applied. We show that imputation using this method generally provides richer information leading to higher classification accuracies, despite the imputed values differing from the true values.
Guided Manifold Alignment with Geometry-Regularized Twin Autoencoders
Rhodes, Jake S., Rustad, Adam G., Nielsen, Marshall S., McClellan, Morgan Chase, Gardner, Dallan, Hedges, Dawson
Abstract--Manifold alignment (MA) involves a set of techniques for learning shared representations across domains, yet many traditional MA methods are incapable of performing out-of-sample extension, limiting their real-world applicability. We propose a guided representation learning framework leveraging a geometry-regularized twin autoencoder (AE) architecture to enhance MA while enabling generalization to unseen data. Our method enforces structured cross-modal mappings to maintain geometric fidelity in learned embeddings. By incorporating a pre-trained alignment model and a multitask learning formulation, we improve cross-domain generalization and representation robustness while maintaining alignment fidelity. We evaluate our approach using several MA methods, showing improvements in embedding consistency, information preservation, and cross-domain transfer . Additionally, we apply our framework to Alzheimer's disease diagnosis, demonstrating its ability to integrate multi-modal patient data and enhance predictive accuracy in cases limited to a single domain by leveraging insights from the multi-modal problem. Manifold learning encompasses a set of methods used to create a lower-dimensional representation, or an embedding, of higher-dimensional data. Such representations can form a key role in data visualization [1]-[5], dimensionality reduction as a preprocessing step for subsequent machine-learning or analytical tasks [6], or serve as a denoising mechanism [4]. In the context of multi-domain problems, where multiple types of data are considered, manifold learning becomes more challenging as data distributions across different domains or modalities may exhibit domain-specific variations while still sharing a common geometric structure. Manifold alignment (MA) seeks to address this problem. In some contexts, a common, shared representation of multi-modal data can be viewed as a natural extension of manifold learning. For example, cell samples of the same type but collected at a different time or using different methodologies should still share features in common, but differences in the measured features may occur due to batch effects [7], obscuring the similarities.