AITopics

2507.12935

Country: Europe (0.28)

Genre:

Research Report (0.82)
Workflow (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.96)

arXiv.org Artificial IntelligenceJul-18-2025

Data Transformation Strategies to Remove Heterogeneity

Yoo, Sangbong, Lee, Jaeyoung, Yoon, Chanyoung, Son, Geonyeong, Hong, Hyein, Seo, Seongbum, Yim, Soobin, Jung, Chanyoung, Park, Jungsoo, Kim, Misuk, Jang, Yun

Data heterogeneity is a prevalent issue, stemming from various conflicting factors, making its utilization complex. This uncertainty, particularly resulting from disparities in data formats, frequently necessitates the involvement of experts to find resolutions. Current methodologies primarily address conflicts related to data structures and schemas, often overlooking the pivotal role played by data transformation. As the utilization of artificial intelligence (AI) continues to expand, there is a growing demand for a more streamlined data preparation process, and data transformation becomes paramount. It customizes training data to enhance AI learning efficiency and adapts input formats to suit diverse AI models. Selecting an appropriate transformation technique is paramount in preserving crucial data details. Despite the widespread integration of AI across various industries, comprehensive reviews concerning contemporary data transformation approaches are scarce. This survey explores the intricacies of data heterogeneity and its underlying sources. It systematically categorizes and presents strategies to address heterogeneity stemming from differences in data formats, shedding light on the inherent challenges associated with each strategy.

data mining, large language model, machine learning, (25 more...)

2507.12677

Country:

North America > United States > California (1.00)
Europe > Germany (0.68)
North America > Canada (0.68)
(3 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Leisure & Entertainment (0.92)
Information Technology (0.67)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(7 more...)

Zhu, Chenrui, Bounia, Louenas, Nguyen, Vu Linh, Destercke, Sébastien, Hoarau, Arthur

Robust Explanations Through Uncertainty Decomposition: A Path to Trustworthier AI

arXiv.org Artificial IntelligenceJul-18-2025

Recent advancements in machine learning have emphasized the need for transparency in model predictions, particularly as interpretability diminishes when using increasingly complex architectures. In this paper, we propose leveraging prediction uncertainty as a complementary approach to classical explainability methods. Specifically, we distinguish between aleatoric (data-related) and epistemic (model-related) uncertainty to guide the selection of appropriate explanations. Epistemic uncertainty serves as a rejection criterion for unreliable explanations and, in itself, provides insight into insufficient training (a new form of explanation). Aleatoric uncertainty informs the choice between feature-importance explanations and counterfactual explanations. This leverages a framework of explainability methods driven by uncertainty quantification and disentanglement. Our experiments demonstrate the impact of this uncertainty-aware approach on the robustness and attainability of explanations in both traditional machine learning and deep learning scenarios.

explanation, machine learning, natural language, (16 more...)

2507.12913

Country: Europe > France (0.14)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.70)
(2 more...)

Chlon, Leon, Rashidi, Sarah, Khamis, Zein, Awada, MarcAntonio M.

LLMs are Bayesian, in Expectation, not in Realization

Large language models demonstrate remarkable in-context learning capabilities, adapting to new tasks without parameter updates. While this phenomenon has been successfully modeled as implicit Bayesian inference, recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property, a cornerstone requirement of Bayesian updating on exchangeable data. This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications. Our theoretical analysis establishes four key results: (1) positional encodings induce martingale violations of order $Θ(\log n / n)$; (2) transformers achieve information-theoretic optimality with excess risk $O(n^{-1/2})$ in expectation over orderings; (3) the implicit posterior representation converges to the true Bayesian posterior in the space of sufficient statistics; and (4) we derive the optimal chain-of-thought length as $k^* = Θ(\sqrt{n}\log(1/\varepsilon))$ with explicit constants, providing a principled approach to reduce inference costs while maintaining performance. Empirical validation on GPT-3 confirms predictions (1)-(3), with transformers reaching 99\% of theoretical entropy limits within 20 examples. Our framework provides practical methods for extracting calibrated uncertainty estimates from position-aware architectures and optimizing computational efficiency in deployment.

large language model, machine learning, natural language, (19 more...)

2507.11768

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)

Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update

Zhang, Yu-Jie, Xu, Sheng-An, Zhao, Peng, Sugiyama, Masashi

We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function, thereby modeling a broad class of reward distributions such as Bernoulli and Poisson. While GLBs are widely applicable to real-world scenarios, their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. Existing methods typically trade off between two objectives, either incurring high per-round costs for optimal regret guarantees or compromising statistical efficiency to enable constant-time updates. In this paper, we propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round. The core of our method is a tight confidence set for the online mirror descent (OMD) estimator, which is derived through a novel analysis that leverages the notion of mix loss from online prediction. The analysis shows that our OMD estimator, even with its one-pass updates, achieves statistical efficiency comparable to maximum likelihood estimation, thereby leading to a jointly efficient optimistic method.

bandit, data mining, machine learning, (22 more...)

2507.11847

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Middle East > Jordan (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.88)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.54)

From Observational Data to Clinical Recommendations: A Causal Framework for Estimating Patient-level Treatment Effects and Learning Policies

Gutman, Rom, Sheiba, Shimon, Klein, Omer Noy, Bird, Naama Dekel, Gruber, Amit, Aronson, Doron, Caspi, Oren, Shalit, Uri

We propose a framework for building patient-specific treatment recommendation models, building on the large recent literature on learning patient-level causal models and inspired by the target trial paradigm of Hernan and Robins. We focus on safety and validity, including the crucial issue of causal identification when using observational data. We do not provide a specific model, but rather a way to integrate existing methods and know-how into a practical pipeline. We further provide a real world use-case of treatment optimization for patients with heart failure who develop acute kidney injury during hospitalization. The results suggest our pipeline can improve patient outcomes over the current treatment regime.

data mining, machine learning, policy value, (21 more...)

2507.11381

Country:

Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
(5 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Research Report > Strength High (0.92)

Industry:

Health & Medicine > Therapeutic Area > Nephrology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)
(2 more...)

h-calibration: Rethinking Classifier Recalibration with Probabilistic Error-Bounded Objective

Huang, Wenjian, Cao, Guiping, Xia, Jiahao, Chen, Jingkun, Wang, Hao, Zhang, Jianguo

Deep neural networks have demonstrated remarkable performance across numerous learning tasks but often suffer from miscalibration, resulting in unreliable probability outputs. This has inspired many recent works on mitigating miscalibration, particularly through post-hoc recalibration methods that aim to obtain calibrated probabilities without sacrificing the classification performance of pre-trained models. In this study, we summarize and categorize previous works into three general strategies: intuitively designed methods, binning-based methods, and methods based on formulations of ideal calibration. Through theoretical and practical analysis, we highlight ten common limitations in previous approaches. To address these limitations, we propose a probabilistic learning framework for calibration called h-calibration, which theoretically constructs an equivalent learning formulation for canonical calibration with boundedness. On this basis, we design a simple yet effective post-hoc calibration algorithm. Our method not only overcomes the ten identified limitations but also achieves markedly better performance than traditional methods, as validated by extensive experiments. We further analyze, both theoretically and experimentally, the relationship and advantages of our learning objective compared to traditional proper scoring rule. In summary, our probabilistic framework derives an approximately equivalent differentiable objective for learning error-bounded calibrated probabilities, elucidating the correspondence and convergence properties of computational statistics with respect to theoretical bounds in canonical calibration. The theoretical effectiveness is verified on standard post-hoc calibration benchmarks by achieving state-of-the-art performance. This research offers valuable reference for learning reliable likelihood in related fields.

artificial intelligence, calibration, machine learning, (18 more...)

doi: 10.1109/TPAMI.2025.3582796

2506.17968

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.13)
Asia > China > Guangdong Province > Shenzhen (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry: Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.92)

Romanus, Ogonnaya Michael, Boulaguiem, Younes, Molinari, Roberto

Fiducial Matching: Differentially Private Inference for Categorical Data

The task of statistical inference, which includes the building of confidence intervals and tests for parameters and effects of interest to a researcher, is still an open area of investigation in a differentially private (DP) setting. Indeed, in addition to the randomness due to data sampling, DP delivers another source of randomness consisting of the noise added to protect an individual's data from being disclosed to a potential attacker. As a result of this convolution of noises, in many cases it is too complicated to determine the stochastic behavior of the statistics and parameters resulting from a DP procedure. In this work, we contribute to this line of investigation by employing a simulation-based matching approach, solved through tools from the fiducial framework, which aims to replicate the data generation pipeline (including the DP step) and retrieve an approximate distribution of the estimates resulting from this pipeline. For this purpose, we focus on the analysis of categorical (nominal) data that is common in national surveys, for which sensitivity is naturally defined, and on additive privacy mechanisms. We prove the validity of the proposed approach in terms of coverage and highlight its good computational and statistical performance for different inferential tasks in simulated and applied data settings.

artificial intelligence, bayesian inference, machine learning, (18 more...)

2507.11762

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Alabama > Lee County > Auburn (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.96)
Health & Medicine > Therapeutic Area > Immunology > HIV (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.68)

Song, Jiafang, Datta, Abhirup

Fast Variational Bayes for Large Spatial Data

Recent variational Bayes methods for geospatial regression, proposed as an alternative to computationally expensive Markov chain Monte Carlo (MCMC) sampling, have leveraged Nearest Neighbor Gaussian processes (NNGP) to achieve scalability. Yet, these variational methods remain inferior in accuracy and speed compared to spNNGP, the state-of-the-art MCMC-based software for NNGP. We introduce spVarBayes, a suite of fast variational Bayesian approaches for large-scale geospatial data analysis using NNGP. Our contributions are primarily computational. We replace auto-differentiation with a combination of calculus of variations, closed-form gradient updates, and linear response corrections for improved variance estimation. We also accommodate covariates (fixed effects) in the model and offer inference on the variance parameters. Simulation experiments demonstrate that we achieve comparable accuracy to spNNGP but with reduced computational costs, and considerably outperform existing variational inference methods in terms of both accuracy and speed. Analysis of a large forest canopy height dataset illustrates the practical implementation of proposed methods and shows that the inference results are consistent with those obtained from the MCMC approach. The proposed methods are implemented in publicly available Github R-package spVarBayes.

artificial intelligence, machine learning, variational distribution, (16 more...)

2507.12251

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
North America > United States > Alaska (0.04)

Genre: Research Report > New Finding (0.92)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.87)

arXiv.org Artificial IntelligenceJul-17-2025

Synthetic Tabular Data Generation: A Comparative Survey for Modern Techniques

Challagundla, Raju, Dorodchi, Mohsen, Wang, Pu, Lee, Minwoo

As privacy regulations become more stringent and access to real-world data becomes increasingly constrained, synthetic data generation has emerged as a vital solution, especially for tabular datasets, which are central to domains like finance, healthcare and the social sciences. This survey presents a comprehensive and focused review of recent advances in synthetic tabular data generation, emphasizing methods that preserve complex feature relationships, maintain statistical fidelity, and satisfy privacy requirements. A key contribution of this work is the introduction of a novel taxonomy based on practical generation objectives, including intended downstream applications, privacy guarantees, and data utility, directly informing methodological design and evaluation strategies. Therefore, this review prioritizes the actionable goals that drive synthetic data creation, including conditional generation and risk-sensitive modeling. Additionally, the survey proposes a benchmark framework to align technical innovation with real-world demands. By bridging theoretical foundations with practical deployment, this work serves as both a roadmap for future research and a guide for implementing synthetic tabular data in privacy-critical environments.

data mining, machine learning, natural language, (20 more...)

2507.1159

Country: North America > United States > North Carolina (0.14)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.46)
Government > Regional Government > North America Government > United States Government (0.45)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications (1.00)
(11 more...)