On the Unreasonable Effectiveness of Last-layer Retraining

Hill, John C., LaBonte, Tyler, Zhang, Xinchen, Muthukumar, Vidya

arXiv.org Artificial Intelligence

Last-layer retraining (LLR) methods -- wherein the last layer of a neural network is reinitialized and retrained on a held-out set following ERM training -- have garnered interest as an efficient approach to rectify dependence on spurious correlations and improve performance on minority groups. Surprisingly, LLR has been found to improve worst-group accuracy even when the held-out set is an imbalanced subset of the training set. We initially hypothesize that this "unreasonable effectiveness" of LLR is explained by its ability to mitigate neural collapse through the held-out set, resulting in the implicit bias of gradient descent benefiting robustness. Our empirical investigation does not support this hypothesis. Instead, we present strong evidence for an alternative hypothesis: that the success of LLR is primarily due to better group balance in the held-out set. We conclude by showing how the recent algorithms CB-LLR and AFR perform implicit group-balancing to elicit a robustness improvement.
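
For concreteness, here is a minimal PyTorch sketch of the LLR recipe the abstract describes: freeze the ERM-trained feature extractor, reinitialize the final linear layer, and retrain only that layer on a held-out set. The attribute name model.fc and the held-out loader are illustrative assumptions, not the authors' released code.

# Minimal last-layer retraining (LLR) sketch in PyTorch.
# Assumptions: `model` is an ERM-trained network whose final linear layer
# is `model.fc` (as in torchvision ResNets), and `heldout_loader` yields
# (input, label) batches from the held-out set.
import torch
import torch.nn as nn

def last_layer_retrain(model, heldout_loader, num_classes, epochs=10, lr=1e-3):
    # Freeze the feature extractor; only the last layer will be trained.
    for p in model.parameters():
        p.requires_grad = False
    # Reinitialize the last layer, as in LLR.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    opt = torch.optim.SGD(model.fc.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in heldout_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model

Per the abstract's conclusion, the composition of heldout_loader is the crucial ingredient: sampling it so that groups are better balanced is what drives the worst-group accuracy gains.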




Does Training on Synthetic Data Make Models Less Robust?

Zhang, Lingze, Pavlick, Ellie

arXiv.org Artificial Intelligence

An increasingly common practice is to train large language models (LLMs) using synthetic data. Often this synthetic data is produced by the same or similar LLMs as those it is being used to train. This raises the question of whether the synthetic data might in fact exacerbate certain "blindspots" by reinforcing heuristics that the LLM already encodes. In this paper, we conduct simulated experiments on the natural language inference (NLI) task with Llama-2-7B-hf models. We use MultiNLI as the general task and HANS, a targeted evaluation set designed to measure the presence of specific heuristic strategies for NLI, as our "blindspot" task. Our goal is to determine whether performance disparities between the general and blindspot tasks emerge. Our results indicate that synthetic data does not reinforce blindspots in the way we expected. Specifically, we see that, while fine-tuning with synthetic data doesn't necessarily reduce the model's use of the heuristic, neither does it exacerbate it as we had hypothesized.
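
As a hedged sketch of the self-labeling step implied by this setup (not the authors' released code), the model that will later be fine-tuned first labels unlabeled premise/hypothesis pairs with its own predictions, yielding "synthetic" training data:

# Sketch of producing synthetic NLI labels with the model itself.
# Assumptions: a 3-way classification head on Llama-2-7B-hf (the abstract's
# model); the label mapping below is illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # per the abstract; any NLI-capable model works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

@torch.no_grad()
def synthetic_label(premise, hypothesis):
    # Label an unlabeled pair with the model's own prediction.
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))  # 0=entailment, 1=neutral, 2=contradiction

# Pairs labeled this way are then used to fine-tune the same (or a similar)
# model, and the gap between MultiNLI and HANS accuracy is tracked to see
# whether the heuristic "blindspot" widens.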


Inference and Verbalization Functions During In-Context Learning

Tao, Junyi, Chen, Xiaoyin, Liu, Nelson F.

arXiv.org Artificial Intelligence

Language models (LMs) are capable of in-context learning (ICL) from a few demonstrations (example-label pairs) to solve new tasks during inference. Despite the intuitive importance of high-quality demonstrations, previous work has observed that, in some settings, ICL performance is minimally affected by irrelevant labels (Min et al., 2022). We hypothesize that LMs perform ICL with irrelevant labels via two sequential processes: an inference function that solves the task, followed by a verbalization function that maps the inferred answer to the label space. Importantly, we hypothesize that the inference function is invariant to remappings of the label space (e.g., "true"/"false" to "cat"/"dog"), enabling LMs to share the same inference function across settings with different label words. We empirically validate this hypothesis with controlled layer-wise interchange intervention experiments. Our findings confirm the hypotheses on multiple datasets and tasks (natural language inference, sentiment analysis, and topic classification) and further suggest that the two functions can be localized in specific layers across various open-source models, including GEMMA-7B, MISTRAL-7B-V0.3, GEMMA-2-27B, and LLAMA-3.1-70B.
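
For readers unfamiliar with interchange interventions, the sketch below shows the basic mechanic on a small stand-in model: hidden states at one layer from a "source" run are patched into a "base" run, and one checks whether the output changes. The model choice and layer attribute path are illustrative assumptions, not the paper's exact setup.

# Layer-wise interchange intervention (activation patching) sketch.
# Uses GPT-2 as a small stand-in; the paper works with GEMMA/MISTRAL/LLAMA
# models, whose layer paths differ. Prompts are assumed to tokenize to the
# same length so the patched hidden states are shape-compatible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()
layer = model.transformer.h[6]  # layer to intervene on (GPT-2 attribute path)

cache = {}
def save_hook(mod, inp, out):
    cache["h"] = out[0].detach()          # cache the layer's hidden states

def patch_hook(mod, inp, out):
    return (cache["h"],) + out[1:]        # swap in the cached source states

@torch.no_grad()
def interchange(base_prompt, source_prompt):
    # 1) Run the source prompt and cache the layer's hidden states.
    h = layer.register_forward_hook(save_hook)
    model(**tok(source_prompt, return_tensors="pt"))
    h.remove()
    # 2) Re-run the base prompt with the cached states patched in.
    h = layer.register_forward_hook(patch_hook)
    logits = model(**tok(base_prompt, return_tensors="pt")).logits
    h.remove()
    return logits[0, -1]  # next-token distribution under the intervention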


Reviews: e-SNLI: Natural Language Inference with Natural Language Explanations

Neural Information Processing Systems

I think the idea of explicable models is worth pursuing, and this is a decent contribution to showing how one might do that. It is unfortunate that this work shows a huge tradeoff between models that perform at high levels and those that explain well: from 4.1 it seems we can get good performance but then rarely generate correct explanations, and from 4.2 we can generate correct explanations more often at the expense of good performance. It is also disappointing that the BLEU scores in the PREDICT setting are already so close to the inter-annotator agreement even though the generated explanations are often not correct; this suggests that we really do need to rely on the percent-correct figure from human evaluation and that the BLEU scores are not very meaningful. This seems like a bottleneck for this resource being widely adopted. Nonetheless, these findings are a solid contribution, and so is the data if others are willing to do human evaluation or to work on a new automatic metric for a task like this.
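
To make the BLEU concern concrete, here is a small illustrative sketch (invented strings, not from the paper or the review) showing how an explanation supporting the wrong label can still score high n-gram overlap against a reference:

# Why BLEU can look strong while explanations are wrong: surface overlap
# scores well even when the stated rationale is incorrect.
import sacrebleu

references = [["A dog running through a field implies an animal is outside."]]
generated = ["A dog running through a field implies an animal is inside."]

print(sacrebleu.corpus_bleu(generated, references).score)
# High n-gram overlap despite the explanation supporting the wrong label --
# which is why human correctness judgments remain necessary.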


How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics

Cosma, Adrian, Ruseti, Stefan, Dascalu, Mihai, Caragea, Cornelia

arXiv.org Artificial Intelligence

Natural Language Inference (NLI) evaluation is crucial for assessing language understanding models; however, popular datasets suffer from systematic spurious correlations that artificially inflate measured model performance. To address this, we propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples. We categorize the test sets of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics. This categorization significantly reduces spurious correlation measures: examples labeled as most difficult show markedly decreased performance and encompass more realistic and diverse linguistic phenomena. When our characterization method is applied to the training set, models trained on only a fraction of the data achieve performance comparable to those trained on the full dataset, surpassing other dataset characterization techniques. Our research addresses limitations in NLI dataset construction, providing a more authentic evaluation of model performance with implications for diverse NLU applications.
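
A minimal sketch of the general idea, assuming a data-maps-style statistic (the paper's exact training-dynamics signals and thresholds may differ): track the probability the model assigns to the gold label across training epochs, then bucket examples by mean confidence, with low-confidence examples treated as hard.

# Difficulty characterization from training dynamics (illustrative).
import numpy as np

def characterize(per_epoch_gold_probs):
    """per_epoch_gold_probs: array of shape (epochs, num_examples) holding
    the probability assigned to the gold label at each epoch."""
    confidence = per_epoch_gold_probs.mean(axis=0)   # mean over epochs
    variability = per_epoch_gold_probs.std(axis=0)   # spread over epochs
    # Tertile split on confidence: low-confidence examples are "hard".
    lo, hi = np.quantile(confidence, [1 / 3, 2 / 3])
    difficulty = np.where(confidence < lo, "hard",
                          np.where(confidence < hi, "medium", "easy"))
    return confidence, variability, difficulty

# Example with random dynamics for 5 epochs and 10 examples:
probs = np.random.rand(5, 10)
_, _, levels = characterize(probs)
print(levels)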


The Group Robustness is in the Details: Revisiting Finetuning under Spurious Correlations

LaBonte, Tyler, Hill, John C., Zhang, Xinchen, Muthukumar, Vidya, Kumar, Abhishek

arXiv.org Artificial Intelligence

Modern machine learning models are prone to over-reliance on spurious correlations, which can often lead to poor performance on minority groups. In this paper, we identify surprising and nuanced behavior of finetuned models on worst-group accuracy via comprehensive experiments on four well-established benchmarks across vision and language tasks. We first show that the commonly used class-balancing techniques of mini-batch upsampling and loss upweighting can induce a decrease in worst-group accuracy (WGA) over training epochs, leading to performance no better than without class-balancing. While removing data to create a class-balanced subset is more effective in some scenarios, we show that this depends on the group structure and propose a mixture method which can outperform both techniques. Next, we show that scaling pretrained models is generally beneficial for worst-group accuracy, but only in conjunction with appropriate class-balancing. Finally, we identify spectral imbalance in finetuning features as a potential source of group disparities: minority-group covariance matrices incur a larger spectral norm than majority groups once conditioned on the classes. Our results show more nuanced interactions of modern finetuned models with group robustness than was previously known. Our code is available at https://github.com/tmlabonte/revisiting-finetuning.
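
The spectral-imbalance diagnostic can be sketched directly from penultimate-layer features; the function below is an illustration under that assumption, not the authors' code from the linked repository.

# Compare the spectral norm (largest singular value) of class-conditioned
# feature covariance matrices across groups.
import numpy as np

def group_spectral_norms(features, labels, groups):
    """features: (n, d) penultimate-layer features; labels/groups: (n,) ints."""
    norms = {}
    for c in np.unique(labels):
        for g in np.unique(groups):
            mask = (labels == c) & (groups == g)
            if mask.sum() < 2:
                continue  # need at least 2 samples for a covariance estimate
            cov = np.cov(features[mask], rowvar=False)
            norms[(int(c), int(g))] = np.linalg.norm(cov, 2)  # spectral norm
    return norms

# Per the paper's observation, within a class the minority group's covariance
# tends to have a larger spectral norm than the majority group's.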