Goto

Collaborating Authors

 forest model


Federated Random Forest for Partially Overlapping Clinical Data

arXiv.org Artificial Intelligence

In the healthcare sector, a consciousness surrounding data privacy and corresponding data protection regulations, as well as heterogeneous and non-harmonized data, pose huge challenges to large-scale data analysis. Moreover, clinical data often involves partially overlapping features, as some observations may be missing due to various reasons, such as differences in procedures, diagnostic tests, or other recorded patient history information across hospitals or institutes. To address the challenges posed by partially overlapping features and incomplete data in clinical datasets, a comprehensive approach is required. Particularly in the domain of medical data, promising outcomes are achieved by federated random forests whenever features align. However, for most standard algorithms, like random forest, it is essential that all data sets have identical parameters. Therefore, in this work the concept of federated random forest is adapted to a setting with partially overlapping features. Moreover, our research assesses the effectiveness of the newly developed federated random forest models for partially overlapping clinical data. For aggregating the federated, globally optimized model, only features available locally at each site can be used. We tackled two issues in federation: (i) the quantity of involved parties, (ii) the varying overlap of features. This evaluation was conducted across three clinical datasets. The federated random forest model even in cases where only a subset of features overlaps consistently demonstrates superior performance compared to its local counterpart. This holds true across various scenarios, including datasets with imbalanced classes. Consequently, federated random forests for partially overlapped data offer a promising solution to transcend barriers in collaborative research and corporate cooperation.


Respiratory Anomaly Detection using Reflected Infrared Light-wave Signals

arXiv.org Artificial Intelligence

In this study, we present a non-contact respiratory anomaly detection method using incoherent light-wave signals reflected from the chest of a mechanical robot that can breathe like human beings. In comparison to existing radar and camera-based sensing systems for vitals monitoring, this technology uses only a low-cost ubiquitous light source (e.g., infrared light emitting diode) and sensor (e.g., photodetector). This light-wave sensing (LWS) system recognizes different breathing anomalies from the variations of light intensity reflected from the chest of the robot within a 0.5m-1.5m range. The anomaly detection model demonstrates up to 96.6% average accuracy in classifying 7 different types of breathing data using machine learning. The model can also detect faulty data collected by the system that does not contain breathing information. The developed system can be utilized at home or healthcare facilities as a smart, non-contact and discreet respiration monitoring method.


Generalized Autoregressive Score Trees and Forests

arXiv.org Machine Learning

We propose methods to improve the forecasts from generalized autoregressive score (GAS) models (Creal et. al, 2013; Harvey, 2013) by localizing their parameters using decision trees and random forests. These methods avoid the curse of dimensionality faced by kernel-based approaches, and allow one to draw on information from multiple state variables simultaneously. We apply the new models to four distinct empirical analyses, and in all applications the proposed new methods significantly outperform the baseline GAS model. In our applications to stock return volatility and density prediction, the optimal GAS tree model reveals a leverage effect and a variance risk premium effect. Our study of stock-bond dependence finds evidence of a flight-to-quality effect in the optimal GAS forest forecasts, while our analysis of high-frequency trade durations uncovers a volume-volatility effect.


A Comparison of Decision Forest Inference Platforms from A Database Perspective

arXiv.org Artificial Intelligence

Decision forest, including RandomForest, XGBoost, and LightGBM, is one of the most popular machine learning techniques used in many industrial scenarios, such as credit card fraud detection, ranking, and business intelligence. Because the inference process is usually performance-critical, a number of frameworks were developed and dedicated for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. However, these frameworks are all decoupled with data management frameworks. It is unclear whether in-database inference will improve the overall performance. In addition, these frameworks used different algorithms, optimization techniques, and parallelism models. It is unclear how these implementations will affect the overall performance and how to make design decisions for an in-database inference framework. In this work, we investigated the above questions by comprehensively comparing the end-to-end performance of the aforementioned inference frameworks and netsDB, an in-database inference framework we implemented. Through this study, we identified that netsDB is best suited for handling small-scale models on large-scale datasets and all-scale models on small-scale datasets, for which it achieved up to hundreds of times of speedup. In addition, the relation-centric representation we proposed significantly improved netsDB's performance in handling large-scale models, while the model reuse optimization we proposed further improved netsDB's performance in handling small-scale datasets.


Prediction of superconducting properties of materials based on machine learning models

arXiv.org Artificial Intelligence

The application of superconducting materials is becoming more and more widespread. Traditionally, the discovery of new superconducting materials relies on the experience of experts and a large number of "trial and error" experiments, which not only increases the cost of experiments but also prolongs the period of discovering new superconducting materials. In recent years, machine learning has been increasingly applied to materials science. Based on this, this manuscript proposes the use of XGBoost model to identify superconductors; the first application of deep forest model to predict the critical temperature of superconductors; the first application of deep forest to predict the band gap of materials; and application of a new sub-network model to predict the Fermi energy level of materials. Compared with our known similar literature, all the above algorithms reach state-of-the-art. Finally, this manuscript uses the above models to search the COD public dataset and identify 50 candidate superconducting materials with possible critical temperature greater than 90 K.


Inter and Intra-Annual Spatio-Temporal Variability of Habitat Suitability for Asian Elephants in India: A Random Forest Model-based Analysis

arXiv.org Artificial Intelligence

We develop a Random Forest model to estimate the species distribution of Asian elephants in India and study the inter and intra-annual spatiotemporal variability of habitats suitable for them. Climatic, topographic variables and satellite-derived Land Use/Land Cover (LULC), Net Primary Productivity (NPP), Leaf Area Index (LAI), and Normalized Difference Vegetation Index (NDVI) are used as predictors, and the species sighting data of Asian elephants from Global Biodiversity Information Reserve is used to develop the Random Forest model. A careful hyper-parameter tuning and training-validation-testing cycle are completed to identify the significant predictors and develop a final model that gives precision and recall of 0.78 and 0.77. The model is applied to estimate the spatial and temporal variability of suitable habitats. We observe that seasonal reduction in the suitable habitat may explain the migration patterns of Asian elephants and the increasing human-elephant conflict. Further, the total available suitable habitat area is observed to have reduced, which exacerbates the problem. This machine learning model is intended to serve as an input to the Agent-Based Model that we are building as part of our Artificial Intelligence-driven decision support tool to reduce human-wildlife conflict.


Introducing TensorFlow Decision Forests

#artificialintelligence

We are happy to open source TensorFlow Decision Forests (TF-DF). TF-DF is a collection of production-ready state-of-the-art algorithms for training, serving and interpreting decision forest models (including random forests and gradient boosted trees). You can now use these models for classification, regression and ranking tasks - with the flexibility and composability of the TensorFlow and Keras. Decision forests are a family of machine learning algorithms with quality and speed competitive with (and often favorable to) neural networks, especially when you're working with tabular data. They're built from many decision trees, which makes them easy to use and understand - and you can take advantage of a plethora of interpretability tools and techniques that already exist today.


Growing Deep Forests Efficiently with Soft Routing and Learned Connectivity

arXiv.org Machine Learning

Despite the latest prevailing success of deep neural networks (DNNs), several concerns have been raised against their usage, including the lack of intepretability the gap between DNNs and other well-established machine learning models, and the growingly expensive computational costs. A number of recent works [1], [2], [3] explored the alternative to sequentially stacking decision tree/random forest building blocks in a purely feed-forward way, with no need of back propagation. Since decision trees enjoy inherent reasoning transparency, such deep forest models can also facilitate the understanding of the internaldecision making process. This paper further extends the deep forest idea in several important aspects. Firstly, we employ a probabilistic tree whose nodes make probabilistic routing decisions, a.k.a., soft routing, rather than hard binary decisions.Besides enhancing the flexibility, it also enables non-greedy optimization for each tree. Second, we propose an innovative topology learning strategy: every node in the ree now maintains a new learnable hyperparameter indicating the probability that it will be a leaf node. In that way, the tree will jointly optimize both its parameters and the tree topology during training. Experiments on the MNIST dataset demonstrate that our empowered deep forests can achieve better or comparable performance than [1],[3] , with dramatically reduced model complexity. For example,our model with only 1 layer of 15 trees can perform comparably with the model in [3] with 2 layers of 2000 trees each.


JigSaw: A tool for discovering explanatory high-order interactions from random forests

arXiv.org Machine Learning

Machine learning is revolutionizing biology by facilitating the prediction of outcomes from complex patterns found in massive data sets. Large biological data sets, like those generated by transcriptome or microbiome studies,measure many relevant components that interact in vivo with one another in modular ways.Identifying the high-order interactions that machine learning models use to make predictions would facilitate the development of hypotheses linking combinations of measured components to outcome. By using the structure of random forests, a new algorithmic approach, termed JigSaw,was developed to aid in the discovery of patterns that could explain predictions made by the forest. By examining the patterns of individual decision trees JigSaw identifies high-order interactions between measured features that are strongly associated with a particular outcome and identifies the relevant decision thresholds. JigSaw's effectiveness was tested in simulation studies where it was able to recover multiple ground truth patterns;even in the presence of significant noise. It was then used to find patterns associated with outcomes in two real world data sets.It was first used to identify patterns clinical measurements associated with heart disease. It was then used to find patterns associated with breast cancer using metabolites measured in the blood. In heart disease, JigSaw identified several three-way interactions that combine to explain most of the heart disease records (66%) with high precision (93%). In breast cancer, three two-way interactions were recovered that can be combined to explain almost all records (92%) with good precision (79%). JigSaw is an efficient method for exploring high-dimensional feature spaces for rules that explain statistical associations with a given outcome and can inspire the generation of testable hypotheses.


Estimating heterogeneous treatment effects with right-censored data via causal survival forests

arXiv.org Machine Learning

There is fast-growing literature on estimating heterogeneous treatment effects via random forests in observational studies. However, there are few approaches available for right-censored survival data. In clinical trials, right-censored survival data are frequently encountered. Quantifying the causal relationship between a treatment and the survival outcome is of great interest. Random forests provide a robust, nonparametric approach to statistical estimation. In addition, recent developments allow forest-based methods to quantify the uncertainty of the estimated heterogeneous treatment effects. We propose causal survival forests that directly target on estimating the treatment effect from an observational study. We establish consistency and asymptotic normality of the proposed estimators and provide an estimator of the asymptotic variance that enables valid confidence intervals of the estimated treatment effect. The performance of our approach is demonstrated via extensive simulations and data from an HIV study.