AITopics


arXiv.org Artificial Intelligence, Oct-13-2025

Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

Tao, Yongding, Wang, Tian, Dong, Yihong, Liu, Huanyu, Zhang, Kechi, Hu, Xiaolong, Li, Ge

Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data contamination detection in the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
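The entropy observation in this abstract can be made concrete with a small sketch. The code below is not the Self-Critique method and is not taken from the paper; it only computes the Shannon entropy of next-token distributions and shows that a sharply peaked ("collapsed") policy yields a lower average entropy than a diffuse one. The function names and the toy distributions are hypothetical and chosen for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_sequence_entropy(step_distributions):
    """Average per-token entropy over the decoding steps of one sequence."""
    if not step_distributions:
        raise ValueError("need at least one decoding step")
    total = sum(token_entropy(dist) for dist in step_distributions)
    return total / len(step_distributions)


# Hypothetical toy distributions over a 4-token vocabulary (assumed data).
diffuse = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
collapsed = [[0.97, 0.01, 0.01, 0.01], [0.99, 0.005, 0.003, 0.002]]

diffuse_entropy = mean_sequence_entropy(diffuse)
collapsed_entropy = mean_sequence_entropy(collapsed)

# A collapsed policy concentrates mass on one token, so its entropy is lower.
print(f"diffuse: {diffuse_entropy:.3f} nats, collapsed: {collapsed_entropy:.3f} nats")
```

In practice the per-step distributions would come from a model's softmax outputs; how the paper turns this signal into a contamination score is not described in the abstract and is not reproduced here.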

arxiv preprint arxiv, large language model, machine learning

2510.09259

Country: Asia (0.67)

Genre: Research Report > New Finding (0.67)

Industry:

Information Technology > Security & Privacy (0.46)
Education > Educational Setting > Online (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems, Jan-24-2025, 03:57:57 GMT

Reviews: Learning to Learn By Self-Critique

Summary: This paper considers few-shot classification and seeks to make use of the unlabeled query data during few-shot classification by training on it with a meta-learned critic loss. The algorithm builds on top of MAML, and has two stages. In the first stage, the model is adapted via gradient descent on the labeled support set. In the second stage, the model is further adapted via a meta-learned critic loss that is a function of a featurization of the model parameters and the unlabeled query set. Originality: The proposed approach strikes me as quite similar to One-Shot Imitation Learning by Domain-Adaptive Meta-Learning (Yu et al. 2018).
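The two-stage adaptation described in the summary can be sketched on a toy problem. This is an illustrative simplification under stated assumptions, not the paper's algorithm: the model is a one-parameter logistic classifier, gradients are taken by finite differences, and the critic is a fixed hand-written stand-in (mean prediction entropy on the unlabeled queries) rather than a meta-learned network. All names and data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def support_loss(w, support):
    """Cross-entropy of a one-parameter logistic model on labeled (x, y) pairs."""
    total = 0.0
    for x, y in support:
        p = sigmoid(w * x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(support)


def critic_loss(w, queries):
    """Stand-in critic (assumption): mean prediction entropy on unlabeled inputs."""
    total = 0.0
    for x in queries:
        p = sigmoid(w * x)
        total -= p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
    return total / len(queries)


def grad(f, w, eps=1e-5):
    """Central finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


def adapt(w0, support, queries, lr=0.5, steps=5):
    w = w0
    # Stage 1: gradient descent on the supervised loss over the support set.
    for _ in range(steps):
        w -= lr * grad(lambda v: support_loss(v, support), w)
    w_stage1 = w
    # Stage 2: further gradient descent on the label-free critic loss.
    for _ in range(steps):
        w -= lr * grad(lambda v: critic_loss(v, queries), w)
    return w_stage1, w


# Hypothetical toy task: the label is 1 when x is positive.
support = [(1.0, 1), (-1.0, 0), (2.0, 1), (-2.0, 0)]
queries = [1.5, -0.5, 0.8, -1.2]
w_stage1, w_stage2 = adapt(0.1, support, queries)
```

In the reviewed paper the critic is itself learned in an outer loop and also conditions on a featurization of the model parameters; the sketch keeps only the inner-loop structure of "labeled step, then unlabeled step".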

critic loss, learning, self-critique

Technology: Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.40)

Neural Information Processing Systems, Jan-24-2025, 03:57:46 GMT

Reviews: Learning to Learn By Self-Critique

The reviewers agreed that this submission makes an interesting and novel contribution to NeurIPS. I strongly encourage the authors to address the remaining comments of reviewers 3 and 6, to clarify the test batch size, improve figure 2, and add citations.

learning, self-critique

Technology: Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.40)

Neural Information Processing Systems, Oct-10-2024, 03:48:04 GMT

Learning to Learn By Self-Critique

In few-shot learning, a machine learning system is required to learn from a small set of labelled examples of a specific task, such that it can achieve strong generalization on new unlabelled examples of the same task. Given the limited availability of labelled examples in such tasks, we need to make use of all the information we can. For this reason we propose the use of transductive meta-learning for few-shot settings to obtain state-of-the-art few-shot learning. Usually a model learns task-specific information from a small training set (the support-set) and subsequently produces predictions on a small unlabelled validation set (the target-set). The target-set contains additional task-specific information which is not utilized by existing few-shot learning methods.

information, loss function, self-critique

Technology: Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.40)

arXiv.org Artificial Intelligence, Jul-14-2024

Merging Improves Self-Critique Against Jailbreak Attacks

Gallego, Victor

The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. This is done with the addition of an external critic model that can be merged with the original, thus bolstering self-critique capabilities and improving the robustness of the LLM's response to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can significantly reduce the attack success rate of adversaries, thus offering a promising defense mechanism against jailbreak attacks. Code, data, and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks .
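The abstract does not say which merging technique is used, so the following is only a minimal sketch of one common option, linear interpolation of parameters, assuming both models share the same architecture and parameter names. The function name, the `alpha` parameter, and the toy weights are hypothetical.

```python
def merge_weights(base, critic, alpha=0.5):
    """Linearly interpolate two parameter sets with identical names and shapes.

    base, critic: dicts mapping parameter names to flat lists of floats.
    alpha: weight given to the critic model (0.0 keeps base, 1.0 keeps critic).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != critic.keys():
        raise ValueError("models must have the same parameter names")
    merged = {}
    for name, base_values in base.items():
        critic_values = critic[name]
        if len(base_values) != len(critic_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * c
            for b, c in zip(base_values, critic_values)
        ]
    return merged


# Hypothetical toy parameters standing in for real model weights.
base_model = {"layer.weight": [1.0, 2.0]}
critic_model = {"layer.weight": [3.0, 6.0]}
merged_model = merge_weights(base_model, critic_model, alpha=0.5)
```

With real checkpoints the same interpolation would be applied tensor by tensor; the linked repository is the authoritative source for the method the paper actually uses.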

jailbreak attack, self-critique

2406.07188

Genre: Research Report > New Finding (0.53)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

arXiv.org Artificial Intelligence, Dec-4-2023

Distilled Self-Critique of LLMs with Synthetic Data: a Bayesian Perspective

Gallego, Victor

Review: No Country for Old Men is an extraordinary movie that seamlessly blends elements of crime, drama, and psychological suspense into a cohesive and awe-inspiring work of art. From the opening scene to the final heart-stopping moments, director Joel Cohen has crafted a visually stunning vision that both challenges and captivates the viewer. The cinematography is unparalleled in its ability to convey emotion and character without resorting to cheap tricks or manipulation. The cast members all deliver impressive performances that allow us to empathize with their characters while simultaneously questioning their motives. From Javier Bardem's chilling portrayal of the villain to Tommy Lee Jones' nuanced exploration of a man faced with an impossible moral dilemma. Despite its lengthy runtime, No Country for Old Men maintains an intense narrative that keeps audiences engaged until the very end.

information, love note, tiny paper

2312.01957

Country:

North America > United States > Texas (0.14)
North America > United States > California > San Francisco County > San Francisco (0.04)
North America > United States > California > Marin County (0.04)

Genre: Research Report (0.64)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.86)

Hebenstreit, Konstantin, Praas, Robert, Kiesewetter, Louis P, Samwald, Matthias

An automatically discovered chain-of-thought prompt generalizes to novel models and datasets

arXiv.org Artificial Intelligence, Aug-3-2023

Emergent chain-of-thought (CoT) reasoning capabilities promise to improve performance and explainability of large language models (LLMs). However, uncertainties remain about how reasoning strategies formulated for previous model generations generalize to new model generations and different datasets. In this small-scale study, we compare different reasoning strategies induced by zero-shot prompting across six recently released LLMs (davinci-002, davinci-003, GPT-3.5-turbo, GPT-4, Flan-T5-xxl and Cohere command-xlarge) on a mixture of six question-answering datasets, including datasets from scientific and medical domains. Our findings demonstrate that while some variations in effectiveness occur, gains from CoT reasoning strategies remain robust across different models and datasets. GPT-4 benefits the most from current state-of-the-art reasoning strategies and exhibits the best performance when applying a prompt previously discovered through automated discovery.

dataset, krippendorff, language model

2305.02897

Country:

Europe > Austria > Vienna (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Antoniou, Antreas, Storkey, Amos J.

Learning to Learn By Self-Critique

Neural Information Processing Systems, Mar-19-2020, 00:33:03 GMT

In few-shot learning, a machine learning system is required to learn from a small set of labelled examples of a specific task, such that it can achieve strong generalization on new unlabelled examples of the same task. Given the limited availability of labelled examples in such tasks, we need to make use of all the information we can. For this reason we propose the use of transductive meta-learning for few-shot settings to obtain state-of-the-art few-shot learning. Usually a model learns task-specific information from a small training set (the support-set) and subsequently produces predictions on a small unlabelled validation set (the target-set). The target-set contains additional task-specific information which is not utilized by existing few-shot learning methods.

information, loss function, self-critique

Technology: Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.40)