AITopics | counterfactual data augmentation

Appendix A Proofs of Formal Claims

Neural Information Processing SystemsFeb-17-2026, 13:38:25 GMT

By pre-training the model on domain-specific data, PubMED BERT is expected to have a better understanding of biomedical concepts, terminology, and language patterns compared to general domain models like BERT -base and BERT -large [ 95 ]. The main advantage of using PubMED BERT for biomedical text mining tasks is its domain-specific knowledge, which can lead to improved performance and more accurate results when fine-tuned on various downstream tasks, such as named entity recognition, relation extraction, document classification, and question answering. Since PubMED BERT is pre-trained on a large corpus of biomedical text, it is better suited to capturing the unique language patterns, complex terminology, and the relationships between entities in the biomedical domain.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Health Care Providers & Services (0.94)
Health & Medicine > Therapeutic Area (0.71)
Health & Medicine > Diagnostic Medicine > Imaging (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

Add feedback

df88b275bef31ac96c85f0c4013734fc-Paper-Conference.pdf

Neural Information Processing SystemsFeb-17-2026, 13:38:22 GMT

data mining, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Pennsylvania (0.04)
North America > United States > New Jersey (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Industry:

Health & Medicine > Health Care Technology > Medical Record (0.95)
Health & Medicine > Diagnostic Medicine (0.68)
Consumer Products & Services (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)
(2 more...)

Add feedback

f0f800c92d191d736c4411f3b3f8ef4a-Paper.pdf

Neural Information Processing SystemsFeb-11-2026, 20:46:42 GMT

dataset, mutual information, rationale selector, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > North Carolina (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(8 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

294e09f267683c7ddc6cc5134a7e68a8-Paper.pdf

Neural Information Processing SystemsFeb-7-2026, 21:45:22 GMT

arxiv preprint arxiv, coda, learning, (11 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.72)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
(2 more...)

Add feedback

Data Augmentations for Improved (Large) Language Model Generalization

Neural Information Processing SystemsDec-26-2025, 23:36:45 GMT

The reliance of text classifiers on spurious correlations can lead to poor generalization at deployment, raising concerns about their use in safety-critical domains such as healthcare. In this work, we propose to use counterfactual data augmentation, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features and to learn more robust text classifiers. We show that this strategy is appropriate in prediction problems where the label is spuriously correlated with an attribute. Under the assumptions of such problems, we discuss the favorable sample complexity of counterfactual data augmentation, compared to importance re-weighting. Pragmatically, we match examples using auxiliary data, based on diff-in-diff methodology, and use a large language model (LLM) to represent a conditional probability of text. Through extensive experimentation on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data, we demonstrate that our method for simulating interventions improves out-of-distribution (OOD) accuracy compared to baseline invariant learning algorithms.

data augmentation, language model generalization, name change, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.64)

Add feedback

Counterfactual Data Augmentation using Locally Factored Dynamics

Neural Information Processing SystemsDec-23-2025, 21:33:29 GMT

Many dynamic processes, including common scenarios in robotic control and reinforcement learning (RL), involve a set of interacting subprocesses. Though the subprocesses are not independent, their interactions are often sparse, and the dynamics at any given time step can often be decomposed into locally independent} causal mechanisms. Such local causal structures can be leveraged to improve the sample efficiency of sequence prediction and off-policy reinforcement learning. We formalize this by introducing local causal models (LCMs), which are induced from a global causal model by conditioning on a subset of the state space. We propose an approach to inferring these structures given an object-oriented state representation, as well as a novel algorithm for Counterfactual Data Augmentation (CoDA). CoDA uses local structures and an experience replay to generate counterfactual experiences that are causally valid in the global model. We find that CoDA significantly improves the performance of RL agents in locally factored tasks, including the batch-constrained and goal-conditioned settings.

counterfactual data augmentation, factored dynamic, name change, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.84)

Add feedback

CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples

Jin, Kyohoon, Choi, Juhwan, Yun, Jungmin, Lee, Junho, Jang, Soojin, Kim, Youngbin

arXiv.org Artificial IntelligenceNov-21-2025

Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.

artificial intelligence, machine learning, proceedings, (19 more...)

arXiv.org Artificial Intelligence

2508.21083

Genre: Research Report > New Finding (1.00)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Bot Meets Shortcut: How Can LLMs Aid in Handling Unknown Invariance OOD Scenarios?

Zheng, Shiyan, Wan, Herun, Luo, Minnan, Huang, Junhang

arXiv.org Artificial IntelligenceNov-17-2025

While existing social bot detectors perform well on benchmarks, their robustness across diverse real-world scenarios remains limited due to unclear ground truth and varied misleading cues. In particular, the impact of shortcut learning, where models rely on spurious correlations instead of capturing causal task-relevant features, has received limited attention. To address this gap, we conduct an in-depth study to assess how detectors are influenced by potential shortcuts based on textual features, which are most susceptible to manipulation by social bots. We design a series of shortcut scenarios by constructing spurious associations between user labels and superficial textual cues to evaluate model robustness. Results show that shifts in irrelevant feature distributions significantly degrade social bot detector performance, with an average relative accuracy drop of 32\% in the baseline models. To tackle this challenge, we propose mitigation strategies based on large language models, leveraging counterfactual data augmentation. These methods mitigate the problem from data and model perspectives across three levels, including data distribution at both the individual user text and overall dataset levels, as well as the model's ability to extract causal information. Our strategies achieve an average relative performance improvement of 56\% under shortcut scenarios.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.08455

Country:

North America > United States (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Asia > China > Shaanxi Province > Xi'an (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Services (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Appendix A Proofs of Formal Claims

Neural Information Processing SystemsOct-9-2025, 09:38:21 GMT

By pre-training the model on domain-specific data, PubMED BERT is expected to have a better understanding of biomedical concepts, terminology, and language patterns compared to general domain models like BERT -base and BERT -large [ 95 ]. The main advantage of using PubMED BERT for biomedical text mining tasks is its domain-specific knowledge, which can lead to improved performance and more accurate results when fine-tuned on various downstream tasks, such as named entity recognition, relation extraction, document classification, and question answering. Since PubMED BERT is pre-trained on a large corpus of biomedical text, it is better suited to capturing the unique language patterns, complex terminology, and the relationships between entities in the biomedical domain.

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country: