AITopics

2605.29152

Country: North America > United States (0.45)

Genre: Research Report > New Finding (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

He, Tianxiao, Williams, Alex H., Harvey, Sarah E.

How Data Augmentation Shapes Neural Representations

arXiv.org Machine LearningMay-18-2026

Data augmentation is widely recognized for improving generalization in deep networks, yet its impact on the geometry of learned representations remains poorly understood. In this work, we characterize how different data augmentation strategies reshape internal representations in neural networks. Using tools from shape analysis, we embed network hidden representations into a metric space where distance is invariant to scaling, translation, rotation and reflection. We show that increasing augmentation strength leads to well-behaved trajectories in this space, and that different augmentation types steer representations in distinct directions. Moreover, we investigate how neural representation shapes are distorted along data augmentation trajectories, and show that insights from neural geometry can predict which representations provide the most improvement when ensembling models. Our results reveal shared geometric patterns across architectures and seeds, and suggest that analyzing shape-space trajectories offers a principled tool for understanding and comparing data augmentation methods.

artificial intelligence, machine learning, representation, (18 more...)

2605.15306

Country: North America > United States > New York (0.14)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)

arXiv.org Machine LearningMay-12-2026

Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities

Xu, Yichen

Synthetic tabular data are often evaluated by distributional similarity, privacy distance, or train-on-synthetic-test-on-real predictive performance, but these criteria do not ensure validity for causal inference. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can preserve predictive utility while distorting average treatment effect (ATE) estimates. The failure is structural: ATE preservation requires both a realistic covariate law and an accurate treatment-effect contrast, whereas prediction loss penalizes treatment-effect error only through an overlap-weighted term. We formalize this mismatch through sensitivity and loss-decomposition results, and identify an analogous decomposition in block-level next-token prediction under log loss. Motivated by the tabular causal analysis, we propose a hybrid synthetic-data framework that generates covariates while modeling treatment and outcome mechanisms separately, allowing causal-purpose treatment assignment such as randomized synthetic assignment. We evaluate this framework in three settings: ATE preservation under fully generative versus hybrid synthesis, targeted augmentation for practical positivity problems, and synthetic simulation engines for comparing OR, IPW, AIPW, and TMLE before real-data analysis. Across synthetic and ACTG experiments, hybrid synthesis improves causal fidelity relative to fully generative baselines; LLM-based hybrid synthesis is often more faithful than CTGAN for ATE preservation and finite-sample estimator benchmarking.

large language model, machine learning, natural language, (19 more...)

2604.23904

Country:

Asia (0.46)
North America > United States (0.28)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.68)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

Morisset, Lucas, Durmus, Alain, Hardy, Adrien

Characterizing the Generalization Error of Random Feature Regression with Arbitrary Data-Augmentation

arXiv.org Machine LearningMay-12-2026

Data augmentation (DA) is now a standard ingredient in modern machine learning pipelines, with extensive empirical evidence reporting improvements in generalization across modalities and tasks Mumuni and Mumuni (2022); Wang et al. (2025). It is often used to encode task-relevant symmetries directly into the training procedure, for instance by encouraging invariance to image rotations or other transformations of the input Shorten and Khoshgoftaar (2019); Chen et al. (2020). It has also been identified as one of the most effective regularization techniques across both supervised learning settings Bishop (1995); Cubuk et al. (2019); Mumuni and Mumuni (2022); Wang et al. (2025) and self-supervised/unsupervised learning Feng et al. (2021); Van Assel et al. (2025). Domain-specific augmentation pipelines have been central to progress in computer vision Shorten and Khoshgoftaar (2019); Kumar et al. (2024), natural language processing Feng et al. (2021); Shorten et al. (2021); Bayer et al. (2022), and time-series or audio applications Wen et al. (2020); Iwana and Uchida (2021); Iglesias et al. (2023). Despite these empirical successes, the benefits of DA remain highly task-and data-dependent, and augmentation schemes are often engineered in an ad hoc manner Fawzi et al. (2016); Cubuk et al. (2019); Lim et al. (2019); Hataya et al. (2020). In contrast with this rich empirical literature, comprehensive theoretical analyses of DA remain relatively scarce. Two classical starting points are, first, the interpretation of additive Gaussian noise as a form of explicit (ridge-like) regularization Bishop (1995); Lin et al. (2024), and second, the idea that leveraging distributional invariances and group structure in the learning objective helps decrease the variance of the model without increasing its bias Chen et al. (2020). Yet, when applied to modern and complex augmentation schemes, these works either provide only upper bounds on the generalization error Lin et al. (2024), or require very strong assumptions on the data distribution (e.g.

machine learning, natural language, random feature regression, (18 more...)

2605.1029

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.45)

Neural Information Processing SystemsMay-1-2026, 04:56:44 GMT

df88b275bef31ac96c85f0c4013734fc-Paper-Conference.pdf

data mining, large language model, machine learning, (19 more...)

Country:

Europe (0.46)
North America (0.46)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine > Health Care Technology > Medical Record (0.95)
Health & Medicine > Diagnostic Medicine (0.68)
Consumer Products & Services (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Data Science > Data Mining (0.93)
(2 more...)

Neural Information Processing SystemsMay-1-2026, 01:40:05 GMT

09d22e4155aa4fdadf3dac8c6bd940fe-Supplemental-Conference.pdf

artificial intelligence, deep learning, machine learning, (19 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsApr-30-2026, 10:24:21 GMT

Provable Benefit of Cutout and CutMix for Feature Learning

Patch-level data augmentation techniques such as Cutout and CutMix have demonstrated significant efficacy in enhancing the performance of vision tasks. However, a comprehensive theoretical understanding of these methods remains elusive. In this paper, we study two-layer neural networks trained using three distinct methods: vanilla training without augmentation, Cutout training, and CutMix training. Our analysis focuses on a feature-noise data model, which consists of several label-dependent features of varying rarity and label-independent noises of differing strengths. Our theorems demonstrate that Cutout training can learn low-frequency features that vanilla training cannot, while CutMix training can learn even rarer features that Cutout cannot capture. From this, we establish that CutMix yields the highest test accuracy among the three. Our novel analysis reveals that CutMix training makes the network learn all features and noise vectors evenly regardless of the rarity and strength, which provides an interesting insight into understanding patch-level augmentation.

artificial intelligence, machine learning, proceedings, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.58)

Neural Information Processing SystemsApr-30-2026, 09:39:40 GMT

fba4a59c7a569fce120eea9aa9227052-Paper-Conference.pdf

data mining, machine learning, node, (15 more...)

Country:

Europe (1.00)
North America > United States > California (0.92)
Asia (0.68)

Genre: Research Report (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.93)

Neural Information Processing SystemsApr-30-2026, 09:07:36 GMT

Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation

Many fine-grained classification tasks, like rare animal identification, have limited training data and consequently classifiers trained on these datasets often fail to generalize to variations in the domain like changes in weather or location. As such, we explore how natural language descriptions of the domains seen in training data can be used with large vision models trained on diverse pretraining datasets to generate useful variations of the training data. We introduce ALIA (Automated Language-guided Image Augmentation), a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains and augment the training data via language-guided image editing. To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information. The resulting dataset is visually consistent with the original training data and offers significantly enhanced diversity. We show that ALIA is able to surpasses traditional data augmentation and text-to-image generated data on fine-grained classification tasks, including cases of domain generalization and contextual bias. Code is available at https://github.com/lisadunlap/ALIA.

large language model, machine learning, natural language, (21 more...)

Country: Asia > Middle East (0.28)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Neural Information Processing SystemsApr-30-2026, 06:41:05 GMT

+39+26+56+67+20+15+22Coarse-grainedobjectFine-grainedobjectTexturePathologyUltrasounddatasetexpansionAuto-createddatawithnewinformationSmalldatasetExpandeddatasetcatdog

The power of DNNs relies heavily on the quantity and quality of training data. However, collecting and annotating data on a large scale is often expensive and timeconsuming. To address this issue, we explore a new task, termed dataset expansion, aimed at expanding a ready-to-use small dataset by automatically creating new labeled samples. To this end, we present a Guided Imagination Framework (GIF) that leverages cutting-edge generative models like DALL-E2 and Stable Diffusion (SD) to "imagine" and create informative new data from the input seed data. Specifically, GIF conducts data imagination by optimizing the latent features of the seed data in the semantically meaningful space of the prior model, resulting in the creation of photo-realistic images with new content. To guide the imagination towards creating informative samples for model training, we introduce two key criteria, i.e., class-maintained information boosting and sample diversity promotion. These criteria are verified to be essential for effective dataset expansion: GIF-SD obtains 13.5% higher model accuracy on natural image datasets than unguided expansion with SD. With these essential criteria, GIF successfully expands small datasets in various scenarios, boosting model accuracy by 36.9% on average over six natural image datasets and by 13.5% on average over three medical datasets.

artificial intelligence, machine learning, natural language, (19 more...)

Country: Asia (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Diagnostic Medicine (0.47)
Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)