data problem
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (2 more...)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.46)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Minimum Wasserstein distance estimator under covariate shift: closed-form, super-efficiency and irregularity
Lang, Junjun, Zhang, Qiong, Liu, Yukun
Covariate shift arises when covariate distributions differ between source and target populations while the conditional distribution of the response remains invariant, and it underlies problems in missing data and causal inference. We propose a minimum Wasserstein distance estimation framework for inference under covariate shift that avoids explicit modeling of outcome regressions or importance weights. The resulting W-estimator admits a closed-form expression and is numerically equivalent to the classical 1-nearest neighbor estimator, yielding a new optimal transport interpretation of nearest neighbor methods. We establish root-$n$ asymptotic normality and show that the estimator is not asymptotically linear, leading to super-efficiency relative to the semiparametric efficient estimator under covariate shift in certain regimes, and uniformly in missing data problems. Numerical simulations, along with an analysis of a rainfall dataset, underscore the exceptional performance of our W-estimator.
- Asia > Bangladesh (0.04)
- North America > United States > New York (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- (3 more...)
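To fix ideas, the equivalence claimed in the abstract above can be illustrated with a short sketch of the classical 1-nearest-neighbor estimator, which the paper proves coincides numerically with the closed-form W-estimator. The function and synthetic data below are illustrative assumptions, not the authors' code: each target covariate is matched to its nearest source point, and the matched responses are averaged.

```python
# A minimal sketch, assuming synthetic data, of the 1-nearest-neighbor
# estimator that the paper shows is numerically equivalent to the
# closed-form W-estimator; not the authors' code.
import numpy as np

def one_nn_estimator(x_source, y_source, x_target):
    """Estimate the target-population mean of Y by imputing each target
    covariate with the response of its nearest source neighbor."""
    # Pairwise squared Euclidean distances, shape (n_target, n_source).
    d2 = ((x_target[:, None, :] - x_source[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)        # nearest source index per target point
    return y_source[nearest].mean()    # average of imputed responses

# Toy covariate shift: same conditional law of Y given X, shifted X.
rng = np.random.default_rng(0)
x_s = rng.normal(0.0, 1.0, size=(500, 1))     # source covariates
y_s = 2 * x_s[:, 0] + rng.normal(size=500)    # invariant regression of Y on X
x_t = rng.normal(0.5, 1.0, size=(500, 1))     # shifted target covariates
print(one_nn_estimator(x_s, y_s, x_t))        # close to E[2X] = 1.0 on target
```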
MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models
State-of-the-art causal discovery methods usually assume that the observational data is complete. However, the missing data problem is pervasive in many practical scenarios such as clinical trials, economics, and biology. One straightforward way to address the missing data problem is first to impute the data using off-the-shelf imputation methods and then apply existing causal discovery methods. However, such a two-step method may be suboptimal, as the imputation algorithm may introduce bias when modeling the underlying data distribution. In this paper, we develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. Focusing mainly on the assumptions of ignorable missingness and identifiable additive noise models (ANMs), MissDAG maximizes the expected likelihood of the visible part of the observations under the expectation-maximization (EM) framework. In the E-step, when the posterior distributions of the parameters cannot be computed in closed form, Monte Carlo EM is leveraged to approximate the likelihood. In the M-step, MissDAG leverages the density transformation to model the noise distributions with simpler, specific formulations by virtue of the ANMs, and uses a likelihood-based causal discovery algorithm with a directed acyclic graph constraint. We demonstrate the flexibility of MissDAG in incorporating various causal discovery algorithms, and its efficacy through extensive simulations and real data experiments.
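A high-level sketch of the EM loop described above may help; it rests on simplifying assumptions of this sketch alone: a linear-Gaussian E-step via conditional-mean imputation, and a hypothetical `causal_discovery` callback standing in for any likelihood-based DAG learner with an acyclicity constraint.

```python
# A rough EM skeleton in the spirit of MissDAG; `causal_discovery` is a
# hypothetical placeholder for a likelihood-based DAG learner, and the
# Gaussian conditional-mean E-step is a simplification of this sketch.
import numpy as np

def em_causal_discovery(X, causal_discovery, n_iter=20):
    """X: (n, d) array with np.nan marking (ignorably) missing entries."""
    obs = ~np.isnan(X)
    X_imp = np.where(obs, X, np.nanmean(X, axis=0))  # initial mean imputation
    for _ in range(n_iter):
        # M-step: refit the DAG model on the currently completed data.
        model = causal_discovery(X_imp)
        # E-step (Gaussian approximation): replace each unit's missing
        # block by its conditional expectation given the observed block.
        mu = X_imp.mean(axis=0)
        cov = np.cov(X_imp, rowvar=False)
        for i in range(X.shape[0]):
            m, o = ~obs[i], obs[i]
            if m.any() and o.any():
                coef = cov[np.ix_(m, o)] @ np.linalg.pinv(cov[np.ix_(o, o)])
                X_imp[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
    return model
```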
Phase Transitions in the Pooled Data Problem
Scarlett, Jonathan, Cevher, Volkan
In this paper, we study the pooled data problem of identifying the labels associated with a large collection of items, based on a sequence of pooled tests revealing the counts of each label within the pool. In the noiseless setting, we identify an exact asymptotic threshold on the required number of tests with optimal decoding, and prove a phase transition between complete success and complete failure. In addition, we present a novel noisy variation of the problem, and provide an information-theoretic framework for characterizing the required number of tests for general random noise models. Our results reveal that noise can make the problem considerably more difficult, with strict increases in the scaling laws even at low noise levels. Finally, we demonstrate similar behavior in an approximate recovery setting, where a given number of errors is allowed in the decoded labels.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (4 more...)
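The pooled-data measurement model studied above is easy to simulate; the sizes and setup below are illustrative assumptions, not taken from the paper. Each test draws a random pool of items and reveals only the histogram of hidden labels within it, and a decoder must recover every label from these counts alone.

```python
# A minimal simulation of the noiseless pooled-data measurement model;
# all problem sizes here are assumed for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_items, n_labels, n_tests, pool_size = 1000, 3, 200, 100

labels = rng.integers(n_labels, size=n_items)  # hidden ground-truth labels

tests = []
for _ in range(n_tests):
    pool = rng.choice(n_items, size=pool_size, replace=False)
    counts = np.bincount(labels[pool], minlength=n_labels)
    tests.append((pool, counts))  # observed: who was pooled + label counts

# A decoder must recover `labels` from `tests` alone; the paper pins down
# the number of tests at which this flips from impossible to possible.
print(tests[0][1])  # counts of each label within the first pool
```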
Multiply Robust Conformal Risk Control with Coarsened Data
Paul, Manit, Kuchibhotla, Arun Kumar, Tchetgen Tchetgen, Eric J.
Conformal Prediction (CP) has recently received a tremendous amount of interest, leading to a wide range of new theoretical and methodological results for predictive inference with formal theoretical guarantees. However, the vast majority of CP methods assume that all units in the training data have fully observed data on both the outcome and covariates of primary interest, an assumption that rarely holds in practice. In reality, training data are often missing the outcome, a subset of covariates, or both on some units. In addition, time-to-event outcomes in the training set may be censored due to dropout or administrative end of follow-up. Accurately accounting for such coarsened data in the training sample while fulfilling the primary objective of well-calibrated conformal predictive inference requires robustness and efficiency considerations. In this paper, we consider the general problem of obtaining distribution-free valid prediction regions for an outcome given coarsened training data. Leveraging modern semiparametric theory, we achieve our goal by deriving the efficient influence function of the quantile of the outcome we aim to predict, under a given semiparametric model for the coarsened data, and carefully combining it with a novel conformal risk control procedure. Our principled use of semiparametric theory has the key advantage of facilitating flexible machine learning methods such as random forests to learn the underlying nuisance functions of the semiparametric model. A straightforward application of the proposed general framework produces prediction intervals with stronger coverage properties under covariate shift, as well as the construction of multiply robust prediction sets in monotone missingness scenarios. We further illustrate the performance of our methods through various simulation studies.
- North America > United States > Pennsylvania (0.04)
- Europe > Finland > Uusimaa > Helsinki (0.04)
- Health & Medicine > Therapeutic Area > Rheumatology (0.92)
- Health & Medicine > Therapeutic Area > Musculoskeletal (0.67)
- Health & Medicine > Therapeutic Area > Immunology (0.67)
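As background for the abstract above, here is the standard split-conformal baseline that the paper generalizes: with fully observed, exchangeable data it yields finite-sample marginal coverage. The function name and toy data are assumptions of this sketch, not the paper's multiply robust procedure.

```python
# A minimal split-conformal baseline; a sketch of standard CP, not the
# paper's coarsened-data method. Names and toy data are assumed.
import numpy as np

def split_conformal_interval(y_cal_pred, y_cal, y_test_pred, alpha=0.1):
    """Symmetric prediction interval from calibration residuals."""
    scores = np.abs(y_cal - y_cal_pred)              # conformity scores
    n = len(scores)
    level = min(1.0, np.ceil((1 - alpha) * (n + 1)) / n)
    q = np.quantile(scores, level, method="higher")  # finite-sample quantile
    return y_test_pred - q, y_test_pred + q

rng = np.random.default_rng(2)
y_cal = rng.normal(size=200)              # calibration outcomes
y_cal_pred = np.zeros(200)                # a deliberately crude predictor
lo, hi = split_conformal_interval(y_cal_pred, y_cal, y_test_pred=0.0)
print(lo, hi)  # roughly (-1.65, 1.65): ~90% coverage for N(0, 1) residuals
```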
Not Every AI Problem is a Data Problem: We Should Be Intentional About Data Scaling
Rodchenko, Tanya, Noy, Natasha, Scherrer, Nino, Prendki, Jennifer
For example, translation between languages exhibits regular and persistent patterns at different scales (across sentences, paragraphs, and documents). In general, language patterns are stable over time. We know what type of data we need to expand to new languages. And while it may be challenging to acquire data for rare or spoken-only languages, it is easy to judge whether newly acquired data is what we need. In contrast, use cases where the data lacks strong, persistent topological features, or where its structure is highly fragmented or unstable over time, may be less well suited to data scaling approaches.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
HBIC: A Biclustering Algorithm for Heterogeneous Datasets
José-García, Adán, Jacques, Julie, Chauvet, Clément, Sobanski, Vincent, Dhaenens, Clarisse
Biclustering is an unsupervised machine-learning approach that aims to cluster the rows and columns of a data matrix simultaneously. Several biclustering algorithms have been proposed for handling numeric datasets. However, real-world data mining problems often involve heterogeneous datasets with mixed attributes. To address this challenge, we introduce a biclustering approach called HBIC, capable of discovering meaningful biclusters in complex heterogeneous data, including numeric, binary, and categorical data. The approach comprises two stages: bicluster generation and bicluster model selection. In the first stage, candidate biclusters are generated iteratively by adding and removing rows and columns based on the frequency of values in the original matrix. In the second stage, we introduce two approaches for selecting the most suitable biclusters by considering their size and homogeneity. Through a series of experiments, we investigate the suitability of our approach on a synthetic benchmark and in a biomedical application involving clinical data of systemic sclerosis patients. An evaluation comparing our method to existing approaches demonstrates its ability to discover high-quality biclusters from heterogeneous data. Our biclustering approach is a starting point for heterogeneous bicluster discovery and a step toward a better understanding of complex underlying data structures.
- Europe > France > Hauts-de-France > Nord > Lille (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.87)
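To illustrate the kind of homogeneity criterion that HBIC's selection stage weighs against bicluster size, here is a simple mixed-type score; the binning choice and the function itself are assumptions of this sketch, not the authors' implementation.

```python
# A sketch of a homogeneity score for a mixed-type bicluster, in the
# spirit of HBIC's selection stage; binning and naming are assumed.
import numpy as np
import pandas as pd

def bicluster_homogeneity(df: pd.DataFrame, rows, cols) -> float:
    """Mean per-column modal frequency inside the (rows, cols) submatrix;
    numeric columns are coarsely binned so one score covers mixed types."""
    sub = df.loc[rows, cols]
    scores = []
    for c in sub.columns:
        col = sub[c]
        if pd.api.types.is_numeric_dtype(col):
            col = pd.cut(col, bins=3, labels=False)  # discretize numerics
        freq = col.value_counts(normalize=True)
        scores.append(float(freq.iloc[0]) if len(freq) else 0.0)
    return float(np.mean(scores))

df = pd.DataFrame({"age": [34, 34, 35, 70], "smoker": ["y", "y", "y", "n"]})
print(bicluster_homogeneity(df, rows=[0, 1, 2], cols=["age", "smoker"]))
# high for the first three rows; drops if the outlying row 3 is included
```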