prompting
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Ohio (0.04)
- North America > United States > Virginia (0.04)
- (8 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Appendix A Additional results This appendix section shows additional results and corresponding plots to support the insights
Section A.2 shows results using a chat-style verbalized numeric Section A.3 shows results on four extra benchmark tasks made available with Finally, Section A.5 presents and discusses results on feature In this section, we evaluate risk score calibration on the income prediction task across different subpopulations, such as typically done as part of a fairness audit. Figures A1-A2 show group-conditional calibration curves for all models on the ACSIncome task, evaluated on three subgroups specified by the race attribute in the ACS data. We show the three race categories with largest representation. The'Mixtral 8x22B' and'Yi 34B' models shown are the worst offenders, where samples belonging to the'Black' population see consistently lower scores for the same positive label probability when compared to the'Asian' or'White' populations. On average, the'Mixtral 8x22B (it)' model classifies a Black individual with a In fact, this score bias can be reversed for some base models, overestimating scores from Black individuals compared with other subgroups.
- Oceania > New Zealand (0.04)
- North America > United States > California (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- North America > United States > California (0.04)
- (6 more...)
- Research Report > New Finding (0.92)
- Questionnaire & Opinion Survey (0.68)
- Government (0.92)
- Education (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.98)
- (2 more...)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Hawaii (0.04)
- North America > United States > Arizona (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
- North America > United States > Texas (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Fine-tuninglanguagemodelstofindagreementamong humanswithdiversepreferences Appendix
We refer to Table S2 for example questions from each a subset of clusters. Each participant first read the task instructions (see Figure S2), and completed a short comprehension test. The comprehension check was designed to test the participants' knowledge and understanding of key aspectsoftheexperiment. Once all players had joined, the group started the main experiment. In practice, data was collected in batches of around 20 groups (100 participants) in parallel.
- Asia > China > Jiangsu Province > Nanjing (0.04)
- North America > United States > California (0.04)
- Asia > China > Tianjin Province > Tianjin (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Research Report > Promising Solution (0.46)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?
This paper investigates an under-explored challenge in large language models (LLMs): chain-of-thought prompting with noisy rationales, which include irrelevant or inaccurate reasoning thoughts within examples used for in-context learning. We construct NoRa dataset that is tailored to evaluate the robustness of reasoning in the presence of noisy rationales. Our findings on NoRa dataset reveal a prevalent vulnerability to such noise among current LLMs, with existing robust methods like self-correction and self-consistency showing limited efficacy. Notably, compared to prompting with clean rationales, base LLM drops by 1.4%-19.8% in accuracy with irrelevant thoughts and more drastically by 2.2%-40.4% with inaccurate thoughts.Addressing this challenge necessitates external supervision that should be accessible in practice. Here, we propose the method of contrastive denoising with noisy chain-of-thought (CD-CoT). It enhances LLMs' denoising-reasoning capabilities by contrasting noisy rationales with only one clean rationale, which can be the minimal requirement for denoising-purpose prompting. This method follows a principle of exploration and exploitation: (1) rephrasing and selecting rationales in the input space to achieve explicit denoising and (2) exploring diverse reasoning paths and voting on answers in the output space. Empirically, CD-CoT demonstrates an average improvement of 17.8% in accuracy over the base model and shows significantly stronger denoising capabilities than baseline methods.