data problem
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (2 more...)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.46)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Minimum Wasserstein distance estimator under covariate shift: closed-form, super-efficiency and irregularity
Lang, Junjun, Zhang, Qiong, Liu, Yukun
Covariate shift arises when covariate distributions differ between source and target populations while the conditional distribution of the response remains invariant, and it underlies problems in missing data and causal inference. We propose a minimum Wasserstein distance estimation framework for inference under covariate shift that avoids explicit modeling of outcome regressions or importance weights. The resulting W-estimator admits a closed-form expression and is numerically equivalent to the classical 1-nearest neighbor estimator, yielding a new optimal transport interpretation of nearest neighbor methods. We establish root-$n$ asymptotic normality and show that the estimator is not asymptotically linear, leading to super-efficiency relative to the semiparametric efficient estimator under covariate shift in certain regimes, and uniformly in missing data problems. Numerical simulations, along with an analysis of a rainfall dataset, underscore the exceptional performance of our W-estimator.
- Asia > Bangladesh (0.04)
- North America > United States > New York (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- (3 more...)
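To fix ideas, the equivalence claimed in the abstract above can be illustrated with a short sketch of the classical 1-nearest-neighbor estimator, which the paper proves coincides numerically with the closed-form W-estimator. The function and synthetic data below are illustrative assumptions, not the authors' code: each target covariate is matched to its nearest source point, and the matched responses are averaged.

```python
# A minimal sketch, assuming synthetic data, of the 1-nearest-neighbor
# estimator that the paper shows is numerically equivalent to the
# closed-form W-estimator; not the authors' code.
import numpy as np

def one_nn_estimator(x_source, y_source, x_target):
    """Estimate the target-population mean of Y by imputing each target
    covariate with the response of its nearest source neighbor."""
    # Pairwise squared Euclidean distances, shape (n_target, n_source).
    d2 = ((x_target[:, None, :] - x_source[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)        # nearest source index per target point
    return y_source[nearest].mean()    # average of imputed responses

# Toy covariate shift: same conditional law of Y given X, shifted X.
rng = np.random.default_rng(0)
x_s = rng.normal(0.0, 1.0, size=(500, 1))     # source covariates
y_s = 2 * x_s[:, 0] + rng.normal(size=500)    # invariant regression of Y on X
x_t = rng.normal(0.5, 1.0, size=(500, 1))     # shifted target covariates
print(one_nn_estimator(x_s, y_s, x_t))        # close to E[2X] = 1.0 on target
```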
MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models
State-of-the-art causal discovery methods usually assume that the observational data is complete. However, the missing data problem is pervasive in many practical scenarios such as clinical trials, economics, and biology. One straightforward way to address the missing data problem is first to impute the data using off-the-shelf imputation methods and then apply existing causal discovery methods. However, such a two-step method may be suboptimal, as the imputation algorithm may introduce bias when modeling the underlying data distribution. In this paper, we develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. Focusing mainly on the assumptions of ignorable missingness and identifiable additive noise models (ANMs), MissDAG maximizes the expected likelihood of the visible part of the observations under the expectation-maximization (EM) framework. In the E-step, when the posterior distributions of the parameters cannot be computed in closed form, Monte Carlo EM is leveraged to approximate the likelihood. In the M-step, MissDAG leverages the density transformation to model the noise distributions with simpler, specific formulations by virtue of the ANMs, and uses a likelihood-based causal discovery algorithm with a directed acyclic graph constraint. We demonstrate the flexibility of MissDAG in incorporating various causal discovery algorithms, and its efficacy through extensive simulations and real data experiments.
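A high-level sketch of the EM loop described above may help; it rests on simplifying assumptions of this sketch alone: a linear-Gaussian E-step via conditional-mean imputation, and a hypothetical `causal_discovery` callback standing in for any likelihood-based DAG learner with an acyclicity constraint.

```python
# A rough EM skeleton in the spirit of MissDAG; `causal_discovery` is a
# hypothetical placeholder for a likelihood-based DAG learner, and the
# Gaussian conditional-mean E-step is a simplification of this sketch.
import numpy as np

def em_causal_discovery(X, causal_discovery, n_iter=20):
    """X: (n, d) array with np.nan marking (ignorably) missing entries."""
    obs = ~np.isnan(X)
    X_imp = np.where(obs, X, np.nanmean(X, axis=0))  # initial mean imputation
    for _ in range(n_iter):
        # M-step: refit the DAG model on the currently completed data.
        model = causal_discovery(X_imp)
        # E-step (Gaussian approximation): replace each unit's missing
        # block by its conditional expectation given the observed block.
        mu = X_imp.mean(axis=0)
        cov = np.cov(X_imp, rowvar=False)
        for i in range(X.shape[0]):
            m, o = ~obs[i], obs[i]
            if m.any() and o.any():
                coef = cov[np.ix_(m, o)] @ np.linalg.pinv(cov[np.ix_(o, o)])
                X_imp[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
    return model
```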
Phase Transitions in the Pooled Data Problem
Scarlett, Jonathan, Cevher, Volkan
In this paper, we study the pooled data problem of identifying the labels associated with a large collection of items, based on a sequence of pooled tests revealing the counts of each label within the pool. In the noiseless setting, we identify an exact asymptotic threshold on the required number of tests with optimal decoding, and prove a phase transition between complete success and complete failure. In addition, we present a novel noisy variation of the problem, and provide an information-theoretic framework for characterizing the required number of tests for general random noise models. Our results reveal that noise can make the problem considerably more difficult, with strict increases in the scaling laws even at low noise levels. Finally, we demonstrate similar behavior in an approximate recovery setting, where a given number of errors is allowed in the decoded labels.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (4 more...)
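The pooled-data measurement model studied above is easy to simulate; the sizes and setup below are illustrative assumptions, not taken from the paper. Each test draws a random pool of items and reveals only the histogram of hidden labels within it, and a decoder must recover every label from these counts alone.

```python
# A minimal simulation of the noiseless pooled-data measurement model;
# all problem sizes here are assumed for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_items, n_labels, n_tests, pool_size = 1000, 3, 200, 100

labels = rng.integers(n_labels, size=n_items)  # hidden ground-truth labels

tests = []
for _ in range(n_tests):
    pool = rng.choice(n_items, size=pool_size, replace=False)
    counts = np.bincount(labels[pool], minlength=n_labels)
    tests.append((pool, counts))  # observed: who was pooled + label counts

# A decoder must recover `labels` from `tests` alone; the paper pins down
# the number of tests at which this flips from impossible to possible.
print(tests[0][1])  # counts of each label within the first pool
```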
Multiply Robust Conformal Risk Control with Coarsened Data
Paul, Manit, Kuchibhotla, Arun Kumar, Tchetgen Tchetgen, Eric J.
Conformal Prediction (CP) has recently received a tremendous amount of interest, leading to a wide range of new theoretical and methodological results for predictive inference with formal theoretical guarantees. However, the vast majority of CP methods assume that all units in the training data have fully observed data on both the outcome and covariates of primary interest, an assumption that rarely holds in practice. In reality, training data are often missing the outcome, a subset of covariates, or both on some units. In addition, time-to-event outcomes in the training set may be censored due to dropout or administrative end of follow-up. Accurately accounting for such coarsened data in the training sample while fulfilling the primary objective of well-calibrated conformal predictive inference requires robustness and efficiency considerations. In this paper, we consider the general problem of obtaining distribution-free valid prediction regions for an outcome given coarsened training data. Leveraging modern semiparametric theory, we achieve our goal by deriving the efficient influence function of the quantile of the outcome we aim to predict, under a given semiparametric model for the coarsened data, and carefully combining it with a novel conformal risk control procedure. Our principled use of semiparametric theory has the key advantage of facilitating flexible machine learning methods such as random forests to learn the underlying nuisance functions of the semiparametric model. A straightforward application of the proposed general framework produces prediction intervals with stronger coverage properties under covariate shift, as well as the construction of multiply robust prediction sets in monotone missingness scenarios. We further illustrate the performance of our methods through various simulation studies.
- North America > United States > Pennsylvania (0.04)
- Europe > Finland > Uusimaa > Helsinki (0.04)
- Health & Medicine > Therapeutic Area > Rheumatology (0.92)
- Health & Medicine > Therapeutic Area > Musculoskeletal (0.67)
- Health & Medicine > Therapeutic Area > Immunology (0.67)
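As background for the abstract above, here is the standard split-conformal baseline that the paper generalizes: with fully observed, exchangeable data it yields finite-sample marginal coverage. The function name and toy data are assumptions of this sketch, not the paper's multiply robust procedure.

```python
# A minimal split-conformal baseline; a sketch of standard CP, not the
# paper's coarsened-data method. Names and toy data are assumed.
import numpy as np

def split_conformal_interval(y_cal_pred, y_cal, y_test_pred, alpha=0.1):
    """Symmetric prediction interval from calibration residuals."""
    scores = np.abs(y_cal - y_cal_pred)              # conformity scores
    n = len(scores)
    level = min(1.0, np.ceil((1 - alpha) * (n + 1)) / n)
    q = np.quantile(scores, level, method="higher")  # finite-sample quantile
    return y_test_pred - q, y_test_pred + q

rng = np.random.default_rng(2)
y_cal = rng.normal(size=200)              # calibration outcomes
y_cal_pred = np.zeros(200)                # a deliberately crude predictor
lo, hi = split_conformal_interval(y_cal_pred, y_cal, y_test_pred=0.0)
print(lo, hi)  # roughly (-1.65, 1.65): ~90% coverage for N(0, 1) residuals
```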
Not Every AI Problem is a Data Problem: We Should Be Intentional About Data Scaling
Rodchenko, Tanya, Noy, Natasha, Scherrer, Nino, Prendki, Jennifer
For example, translation between languages exhibits regular and persistent patterns at different scales (across sentences, paragraphs, and documents). In general, language patterns are stable over time. We know what type of data we need to expand to new languages. And while it may be challenging to acquire data for rare or spoken-only languages, it is easy to judge whether newly acquired data is what we need. In contrast, use cases where the data lacks strong, persistent topological features, or where its structure is highly fragmented or unstable over time, may be less well suited to data scaling approaches.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
HBIC: A Biclustering Algorithm for Heterogeneous Datasets
José-García, Adán, Jacques, Julie, Chauvet, Clément, Sobanski, Vincent, Dhaenens, Clarisse
Biclustering is an unsupervised machine-learning approach that aims to cluster the rows and columns of a data matrix simultaneously. Several biclustering algorithms have been proposed for handling numeric datasets. However, real-world data mining problems often involve heterogeneous datasets with mixed attributes. To address this challenge, we introduce a biclustering approach called HBIC, capable of discovering meaningful biclusters in complex heterogeneous data, including numeric, binary, and categorical data. The approach comprises two stages: bicluster generation and bicluster model selection. In the first stage, candidate biclusters are generated iteratively by adding and removing rows and columns based on the frequency of values in the original matrix. In the second stage, we introduce two approaches for selecting the most suitable biclusters by considering their size and homogeneity. Through a series of experiments, we investigate the suitability of our approach on a synthetic benchmark and in a biomedical application involving clinical data of systemic sclerosis patients. An evaluation comparing our method to existing approaches demonstrates its ability to discover high-quality biclusters from heterogeneous data. Our biclustering approach is a starting point for heterogeneous bicluster discovery and a step toward a better understanding of complex underlying data structures.
- Europe > France > Hauts-de-France > Nord > Lille (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.87)
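To illustrate the kind of homogeneity criterion that HBIC's selection stage weighs against bicluster size, here is a simple mixed-type score; the binning choice and the function itself are assumptions of this sketch, not the authors' implementation.

```python
# A sketch of a homogeneity score for a mixed-type bicluster, in the
# spirit of HBIC's selection stage; binning and naming are assumed.
import numpy as np
import pandas as pd

def bicluster_homogeneity(df: pd.DataFrame, rows, cols) -> float:
    """Mean per-column modal frequency inside the (rows, cols) submatrix;
    numeric columns are coarsely binned so one score covers mixed types."""
    sub = df.loc[rows, cols]
    scores = []
    for c in sub.columns:
        col = sub[c]
        if pd.api.types.is_numeric_dtype(col):
            col = pd.cut(col, bins=3, labels=False)  # discretize numerics
        freq = col.value_counts(normalize=True)
        scores.append(float(freq.iloc[0]) if len(freq) else 0.0)
    return float(np.mean(scores))

df = pd.DataFrame({"age": [34, 34, 35, 70], "smoker": ["y", "y", "y", "n"]})
print(bicluster_homogeneity(df, rows=[0, 1, 2], cols=["age", "smoker"]))
# high for the first three rows; drops if the outlying row 3 is included
```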