

 Tasmania



World's smallest possum may be hiding in South Australia

Popular Science

Weighing less than one pound, the little pygmy possum is one of the smallest mammals in Australia. These minuscule mammals feed on nectar, pollen, and insects, and differ from opossums. Opossums live in the United States and parts of Canada and have a bare tail instead of a furry tail.




Test-Time Collective Prediction

Neural Information Processing Systems

An increasingly common setting in machine learning involves multiple parties, each with their own data, who want to jointly make predictions on future test points. Agents wish to benefit from the collective expertise of the full set of agents to make better predictions than they would individually, but may not be willing to release labeled data or model parameters.


World's oldest-known rock art found in Indonesian cave

Popular Science

A drawing of a claw-like hand on the wall of a cave in Sulawesi, Indonesia, is now the oldest known rock art in the world. The roughly 67,800-year-old art predates the previous record holder in the same region of Southeast Asia by 15,000 years or more. The drawing is detailed in a study published today and helps fill in the archaeological timeline of how and when Australia was first settled.


Variational Approximations for Robust Bayesian Inference via Rho-Posteriors

Khribch, EL Mahdi, Alquier, Pierre

arXiv.org Machine Learning

The $ρ$-posterior framework provides universal Bayesian estimation with explicit contamination rates and optimal convergence guarantees, but has remained computationally difficult due to an optimization over reference distributions that precludes tractable posterior computation. We develop a PAC-Bayesian framework that recovers these theoretical guarantees through temperature-dependent Gibbs posteriors, deriving finite-sample oracle inequalities with explicit rates and introducing tractable variational approximations that inherit the robustness properties of exact $ρ$-posteriors. Numerical experiments demonstrate that this approach achieves theoretical contamination rates while remaining computationally feasible, providing the first practical implementation of $ρ$-posterior inference with rigorous finite-sample guarantees.
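
For readers unfamiliar with the terminology, the generic textbook form of a temperature-dependent Gibbs posterior and its variational approximation can be written as below. This is a sketch under assumptions, not the paper's construction: the prior $\pi$, the empirical risk $R_n$, the inverse temperature $\lambda > 0$, and the variational family $\mathcal{F}$ are placeholder symbols, and the abstract does not state which loss the authors use in place of $R_n$.

```latex
\hat{\pi}_{\lambda}(\mathrm{d}\theta) \;\propto\; \exp\{-\lambda R_n(\theta)\}\,\pi(\mathrm{d}\theta),
\qquad
\tilde{\pi}_{\lambda} \;=\; \operatorname*{arg\,min}_{q \in \mathcal{F}}
\Big\{ \lambda\, \mathbb{E}_{\theta \sim q}\big[R_n(\theta)\big] + \mathrm{KL}(q \,\|\, \pi) \Big\}.
```

When $\mathcal{F}$ contains all probability distributions, the minimizer on the right coincides with the Gibbs posterior on the left; restricting $\mathcal{F}$ (for example to Gaussians) is what makes the computation tractable.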


Covariance-Driven Regression Trees: Reducing Overfitting in CART

Zhang, Likun, Ma, Wei

arXiv.org Machine Learning

Decision trees are powerful machine learning algorithms, widely used in fields such as economics and medicine for their simplicity and interpretability. However, decision trees such as CART are prone to overfitting, especially when grown deep or when the sample size is small. Conventional methods to reduce overfitting include pre-pruning and post-pruning, which constrain the growth of uninformative branches. In this paper, we propose a complementary approach by introducing a covariance-driven splitting criterion for regression trees (CovRT). This method is more robust to overfitting than the empirical risk minimization criterion used in CART, as it produces more balanced and stable splits and more effectively identifies covariates with true signals. We establish an oracle inequality for CovRT and prove that its predictive accuracy is comparable to that of CART in high-dimensional settings. We find that CovRT achieves superior prediction accuracy compared to CART in both simulations and real-world tasks.
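
As background for the comparison above, the empirical risk minimization criterion that CART uses for regression can be sketched as follows. This is a minimal illustration of the standard squared-error split search on a single feature, not the CovRT criterion, which the abstract does not specify. The function names `sse` and `best_cart_split` are hypothetical and chosen only for this example.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_cart_split(x, y):
    """Find the threshold on one feature that minimizes total squared error.

    Returns (threshold, total_sse), or (None, sse(y)) if no split helps.
    """
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    # Start from the error of the unsplit node.
    best_threshold, best_sse = None, sse(ys)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold separates equal feature values
        # Empirical risk of the split: squared error in each child node.
        total = sse(ys[:i]) + sse(ys[i:])
        if total < best_sse:
            best_threshold = (xs[i - 1] + xs[i]) / 2
            best_sse = total
    return best_threshold, best_sse


threshold, error = best_cart_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 5, 5, 5])
```

Because the search accepts any split that lowers the training error, it will also accept splits that isolate a single noisy observation, which is one way deep trees end up overfitting.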


Harmonizing Community Science Datasets to Model Highly Pathogenic Avian Influenza (HPAI) in Birds in the Subantarctic

Littauer, Richard, Bubendorfer, Kris

arXiv.org Artificial Intelligence

Community science observational datasets are useful in epidemiology and ecology for modeling species distributions, but the heterogeneous nature of the data presents significant challenges for standardization, data quality assurance and control, and workflow management. In this paper, we present a data workflow for cleaning and harmonizing multiple community science datasets, which we implement in a case study using eBird, iNaturalist, GBIF, and other datasets to model the impact of highly pathogenic avian influenza in populations of birds in the subantarctic. We predict population sizes for several species where the demographics are not known, and we present novel estimates for potential mortality rates from HPAI for those species, based on a novel aggregated dataset of mortality rates in the subantarctic.


Robust Detection of Synthetic Tabular Data under Schema Variability

Kindji, G. Charbel N., Fromont, Elisa, Rojas-Barahona, Lina Maria, Urvoy, Tanguy

arXiv.org Artificial Intelligence

The rise of powerful generative models has sparked concerns over data authenticity. While detection methods have been extensively developed for images and text, the case of tabular data, despite its ubiquity, has been largely overlooked. Yet, detecting synthetic tabular data is especially challenging due to its heterogeneous structure and unseen formats at test time. We address the underexplored task of detecting synthetic tabular data "in the wild", i.e., when the detector is deployed on tables with variable and previously unseen schemas. We introduce a novel datum-wise transformer architecture that significantly outperforms the only previously published baseline, improving both AUC and accuracy by 7 points. By incorporating a table-adaptation component, our model gains an additional 7 accuracy points, demonstrating enhanced robustness. This work provides the first strong evidence that detecting synthetic tabular data in real-world conditions is feasible, and demonstrates substantial improvements over previous approaches. Following acceptance of the paper, we are finalizing the administrative and licensing procedures necessary for releasing the source code. This extended version will be updated as soon as the release is complete.