AITopics

Neural Information Processing Systems, Feb-13-2026, 00:15:30 GMT

Representation Learning of Compositional Data

Marta Avalos, Richard Nock, Cheng Soon Ong, Julien Rouar, Ke Sun

Instead, compositional data must first be transformed before analysis.

artificial intelligence, divergence, machine learning, (18 more...)

Country:

Europe > Finland > Paijanne Tavastia > Lahti (0.05)
Oceania > Australia (0.04)
North America > United States (0.04)
North America > Canada > Quebec > Montreal (0.04)

Industry: Health & Medicine (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems, Dec-24-2025, 15:03:31 GMT

Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome

Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in microbiology, geochemistry, and other applications. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality.

compositional data, data augmentation, predictive model, (6 more...)

Industry: Health & Medicine > Therapeutic Area (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.98)

Neural Information Processing Systems, Nov-20-2025, 22:21:35 GMT

Representation Learning of Compositional Data

We consider the problem of learning a low-dimensional representation for compositional data. Compositional data consist of nonnegative parts that sum to a constant value. Since the parts are statistically dependent, many standard tools cannot be directly applied. Instead, compositional data must first be transformed before analysis. Focusing on principal component analysis (PCA), we propose an approach that allows low-dimensional representation learning directly from the original data.

compositional data, name change, representation learning, (6 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.99)
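
As a rough illustration of the "transform first" step this abstract refers to, the sketch below normalizes nonnegative parts to a constant sum (closure) and applies the centered log-ratio (CLR) transform, a common preprocessing step before PCA. It is not the method proposed in the paper; the function names `closure` and `clr` and the example counts are assumptions made for illustration.

```python
import math

def closure(parts, total=1.0):
    """Rescale nonnegative parts so that they sum to `total`."""
    s = sum(parts)
    if s <= 0:
        raise ValueError("parts must have a positive sum")
    return [total * p / s for p in parts]

def clr(composition):
    """Centered log-ratio transform; requires strictly positive parts."""
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical raw counts for three parts.
counts = [10.0, 30.0, 60.0]
x = closure(counts)   # composition on the unit simplex
z = clr(x)            # unconstrained coordinates whose entries sum to zero
```

The CLR coordinates depend only on the ratios between parts, so rescaling the raw counts leaves `z` unchanged. Note that the logarithm fails on zero parts, which is one reason transformation-free approaches are of interest.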

Teixeira, Joaquim Valerio, Reznik, Ed, Banerjee, Sudipto, Tansey, Wesley

A Hierarchical Variational Graph Fused Lasso for Recovering Relative Rates in Spatial Compositional Data

arXiv.org Machine Learning, Sep-26-2025

The analysis of spatial data from biological imaging technology, such as imaging mass spectrometry (IMS) or imaging mass cytometry (IMC), is challenging because of a competitive sampling process which convolves signals from molecules in a single pixel. To address this, we develop a scalable Bayesian framework that leverages natural sparsity in spatial signal patterns to recover relative rates for each molecule across the entire image. Our method relies on the use of a heavy-tailed variant of the graphical lasso prior and a novel hierarchical variational family, enabling efficient inference via automatic differentiation variational inference. Simulation results show that our approach outperforms state-of-the-practice point estimate methodologies in IMS, and achieves better posterior coverage than mean-field variational inference techniques. Results on real IMS data demonstrate that our approach better recovers the true anatomical structure of known tissue, removes artifacts, and detects active regions missed by the standard analysis approach.

inference, molecule, relative rate, (13 more...)

2509.20636

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report (0.84)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.88)

Zhou, Yidong, Iao, Su I, Müller, Hans-Georg

Fréchet Geodesic Boosting

arXiv.org Machine Learning, Sep-23-2025

Gradient boosting has become a cornerstone of machine learning, enabling base learners such as decision trees to achieve exceptional predictive performance. While existing algorithms primarily handle scalar or Euclidean outputs, increasingly prevalent complex-structured data, such as distributions, networks, and manifold-valued outputs, present challenges for traditional methods. Such non-Euclidean data lack algebraic structures such as addition, subtraction, or scalar multiplication required by standard gradient boosting frameworks. To address these challenges, we introduce Fréchet geodesic boosting (FGBoost), a novel approach tailored for outputs residing in geodesic metric spaces. FGBoost leverages geodesics as proxies for residuals and constructs ensembles in a way that respects the intrinsic geometry of the output space. Through theoretical analysis, extensive simulations, and real-world applications, we demonstrate the strong performance and adaptability of FGBoost, showcasing its potential for modeling complex data.

Fréchet regression, fgboost, regression, (16 more...)

2509.18013

Country:

North America > United States > California > Yolo County > Davis (0.14)
North America > United States > New Jersey (0.04)
Oceania > New Zealand (0.04)
(2 more...)

Genre: Research Report > Promising Solution (0.48)

Industry:

Health & Medicine (1.00)
Banking & Finance > Economy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Park, Junyoung, Park, Cheolwoo, Ahn, Jeongyoun

Interpretable dimension reduction for compositional data

arXiv.org Machine Learning, Sep-9-2025

High-dimensional compositional data, such as those from human microbiome studies, pose unique statistical challenges due to the simplex constraint and excess zeros. While dimension reduction is indispensable for analyzing such data, conventional approaches often rely on log-ratio transformations that compromise interpretability and distort the data through ad hoc zero replacements. We introduce a novel framework for interpretable dimension reduction of compositional data that avoids extra transformations and zero imputations. Our approach generalizes the concept of amalgamation by softening its operation, mapping high-dimensional compositions directly to a lower-dimensional simplex, which can be visualized in ternary plots. The framework further provides joint visualization of the reduction matrix, enabling intuitive, at-a-glance interpretation. To achieve optimal reduction within our framework, we incorporate sufficient dimension reduction, which defines a new identifiable objective: the central compositional subspace. For estimation, we propose a compositional kernel dimension reduction (CKDR) method. The estimator is provably consistent, exhibits sparsity that reveals underlying amalgamation structures, and comes with an intrinsic predictive model for downstream analyses. Applications to real microbiome datasets demonstrate that our approach provides a powerful graphical exploration tool for uncovering meaningful biological patterns, opening a new pathway for analyzing high-dimensional compositional data.

dimension reduction, matrix, subspace, (13 more...)

2509.05563

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Michigan (0.04)

Genre: Research Report > New Finding (0.92)

Industry:

Health & Medicine > Therapeutic Area (0.67)
Health & Medicine > Pharmaceuticals & Biotechnology (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning in High Dimensional Spaces (1.00)
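
To make the idea of amalgamation in the abstract above concrete, here is a minimal sketch of mapping a composition to a lower-dimensional simplex with a reduction matrix. It is an illustration under stated assumptions, not the CKDR estimator: the function name `amalgamate`, the row-stochastic weight convention, and the example matrices are all hypothetical. A 0/1 matrix gives classical (hard) amalgamation, while fractional rows give one possible softened version.

```python
def amalgamate(x, weights):
    """Map a composition x (length p) to a composition of length q.

    weights[j][k] is the share of part j assigned to group k.
    Each row must be nonnegative and sum to 1, so the output sums to sum(x).
    """
    if len(weights) != len(x):
        raise ValueError("need one weight row per part of x")
    q = len(weights[0])
    for row in weights:
        if len(row) != q or any(w < 0 for w in row) or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row must be nonnegative, length q, and sum to 1")
    return [sum(x[j] * weights[j][k] for j in range(len(x))) for k in range(q)]

x = [0.1, 0.2, 0.3, 0.4]

# Hard amalgamation: parts 0 and 1 form group 0, parts 2 and 3 form group 1.
hard = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Softened version: parts 0 and 2 are split between the two groups.
soft = [[0.9, 0.1], [1.0, 0.0], [0.2, 0.8], [0.0, 1.0]]

y_hard = amalgamate(x, hard)
y_soft = amalgamate(x, soft)
```

Because no logarithm is taken, zero parts in `x` need no replacement, and the rows of the weight matrix can be read directly as how each original part contributes to each reduced part.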

Neural Information Processing Systems, Jan-16-2025, 01:46:02 GMT

Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome

Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in microbiology, geochemistry, and other applications. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease.

compositional data, data augmentation, predictive model, (3 more...)

Industry:

Health & Medicine > Therapeutic Area > Oncology (0.64)
Health & Medicine > Therapeutic Area > Gastroenterology (0.64)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
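
The two ingredients named in the abstract above, the Aitchison geometry of the simplex and subcompositions, can be sketched with a few standard operations. The code below is only an illustration of those building blocks, not the augmentation strategies defined in the paper; the helper names (`perturb`, `power`, `aitchison_mix`, `subcomposition`) and the example vectors are assumptions.

```python
def closure(v):
    """Rescale positive values so that they sum to 1."""
    s = sum(v)
    return [a / s for a in v]

def perturb(x, y):
    """Aitchison perturbation: the simplex analogue of vector addition."""
    return closure([a * b for a, b in zip(x, y)])

def power(x, alpha):
    """Aitchison powering: the simplex analogue of scalar multiplication."""
    return closure([a ** alpha for a in x])

def aitchison_mix(x, y, lam):
    """Point on the Aitchison line between y (lam=0) and x (lam=1)."""
    return perturb(power(x, lam), power(y, 1.0 - lam))

def subcomposition(x, indices):
    """Keep only the selected parts and re-close them."""
    return closure([x[i] for i in indices])

x = [0.2, 0.3, 0.5]
y = [0.6, 0.3, 0.1]

mixed = aitchison_mix(x, y, 0.5)   # a new composition between x and y
sub = subcomposition(x, [0, 1])    # composition restricted to parts 0 and 1
```

Mixing two samples along an Aitchison line, or restricting a sample to a subset of its parts, always yields another valid composition, which is what makes these operations natural candidates for generating augmented samples.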

Pal, Samyajoy, Heumann, Christian

Variational Approach for Efficient KL Divergence Estimation in Dirichlet Mixture Models

arXiv.org Machine Learning, Mar-18-2024

This study tackles the efficient estimation of Kullback-Leibler (KL) Divergence in Dirichlet Mixture Models (DMM), crucial for clustering compositional data. Despite the significance of DMMs, obtaining an analytically tractable solution for KL Divergence has proven elusive. Past approaches relied on computationally demanding Monte Carlo methods, motivating our introduction of a novel variational approach. Our method offers a closed-form solution, significantly enhancing computational efficiency for swift model comparisons and robust estimation evaluations. Validation using real and simulated data showcases its superior efficiency and accuracy over traditional Monte Carlo-based methods, opening new avenues for rapid exploration of diverse DMM models and advancing statistical analyses of compositional data.

dirichlet mixture model, divergence, kl divergence, (11 more...)

2403.12158

Country: Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
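
For context on the Monte Carlo baseline that the abstract above contrasts with, here is a minimal sketch of a sampling-based estimate of the KL divergence between two Dirichlet mixtures. It does not implement the paper's variational closed form; the function names, the `(weights, alphas)` representation of a mixture, and the example parameters are assumptions chosen for illustration.

```python
import math
import random

def dirichlet_logpdf(x, alpha):
    """Log-density of a Dirichlet distribution at a point x on the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))

def mixture_logpdf(x, weights, alphas):
    """Log-density of a Dirichlet mixture, computed with log-sum-exp."""
    terms = [math.log(w) + dirichlet_logpdf(x, a) for w, a in zip(weights, alphas)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def sample_mixture(rng, weights, alphas):
    """Pick a component, then draw a Dirichlet sample via normalized gammas."""
    alpha = rng.choices(alphas, weights=weights, k=1)[0]
    g = [max(rng.gammavariate(a, 1.0), 1e-300) for a in alpha]  # guard against log(0)
    total = sum(g)
    return [v / total for v in g]

def mc_kl(p, q, n_samples=2000, seed=0):
    """Monte Carlo estimate of KL(p || q): average of log p(x) - log q(x), x ~ p."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        x = sample_mixture(rng, *p)
        acc += mixture_logpdf(x, *p) - mixture_logpdf(x, *q)
    return acc / n_samples

# Hypothetical two-component mixtures over three parts: (weights, alphas).
p = ([0.5, 0.5], [[5.0, 1.0, 1.0], [1.0, 5.0, 1.0]])
q = ([0.5, 0.5], [[1.0, 1.0, 5.0], [2.0, 2.0, 2.0]])

kl_pq = mc_kl(p, q)
```

The estimate needs a fresh batch of samples for every pair of models and carries sampling noise, which is the computational cost a closed-form approximation aims to avoid.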

Lundborg, Anton Rask, Pfister, Niklas

Perturbation-based Analysis of Compositional Data

arXiv.org Machine Learning, Nov-30-2023

Existing statistical methods for compositional data analysis are inadequate for many modern applications for two reasons. First, modern compositional datasets, for example in microbiome research, display traits such as high-dimensionality and sparsity that are poorly modelled with traditional approaches. Second, assessing -- in an unbiased way -- how summary statistics of a composition (e.g., racial diversity) affect a response variable is not straightforward. In this work, we propose a framework based on hypothetical data perturbations that addresses both issues. Unlike existing methods for compositional data, we do not transform the data and instead use perturbations to define interpretable statistical functionals on the compositions themselves, which we call average perturbation effects. These average perturbation effects, which can be employed in many applications, naturally account for confounding that biases frequently used marginal dependence analyses. We show how average perturbation effects can be estimated efficiently by deriving a perturbation-dependent reparametrization and applying semiparametric estimation techniques. We analyze the proposed estimators empirically on simulated data and demonstrate advantages over existing techniques on US census and microbiome data. For all proposed estimators, we provide confidence intervals with uniform asymptotic coverage guarantees.

artificial intelligence, machine learning, perturbation, (19 more...)

2311.18501

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > Finland > Paijanne Tavastia > Lahti (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
Education > Educational Setting (0.45)

Technology:

Information Technology > Geographic Information Systems (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)