Appendix A Additional results
This appendix section shows additional results and corresponding plots to support the insights ...

Feb-17-2026, 12:24:05 GMT–Neural Information Processing Systems

Section A.2 shows results using a chat-style verbalized numeric ... Section A.3 shows results on four extra benchmark tasks made available with ... Finally, Section A.5 presents and discusses results on feature ...

In this section, we evaluate risk score calibration on the income prediction task across different subpopulations, as is typically done as part of a fairness audit. Figures A1-A2 show group-conditional calibration curves for all models on the ACSIncome task, evaluated on three subgroups specified by the race attribute in the ACS data. We show the three race categories with the largest representation. The 'Mixtral 8x22B' and 'Yi 34B' models shown are the worst offenders: samples belonging to the 'Black' population see consistently lower scores for the same positive label probability when compared to the 'Asian' or 'White' populations. On average, the 'Mixtral 8x22B (it)' model classifies a Black individual with a ... In fact, this score bias can be reversed for some base models, overestimating scores for Black individuals compared with other subgroups.
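A group-conditional calibration curve of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `group_calibration_curves`, the equal-width binning, and the default of 10 bins are all assumptions made here. For each subgroup, it bins the risk scores and compares the mean score in each bin with the observed fraction of positive labels.

```python
from collections import defaultdict


def group_calibration_curves(scores, labels, groups, n_bins=10):
    """Hypothetical helper: per-group calibration curves.

    Returns {group: [(mean_score, positive_rate, count), ...]} using
    equal-width score bins on [0, 1]; empty bins are omitted.
    """
    # Per-group, per-bin accumulators: [sum of scores, sum of labels, count].
    acc = defaultdict(lambda: [[0.0, 0, 0] for _ in range(n_bins)])
    for s, y, g in zip(scores, labels, groups):
        if not 0.0 <= s <= 1.0:
            raise ValueError("risk scores must lie in [0, 1]")
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        cell = acc[g][b]
        cell[0] += s
        cell[1] += y
        cell[2] += 1

    curves = {}
    for g, bins in acc.items():
        curves[g] = [(s_sum / n, y_sum / n, n) for s_sum, y_sum, n in bins if n > 0]
    return curves


if __name__ == "__main__":
    # Toy data (made up for illustration): both groups receive the same
    # scores, but group "B" has a higher positive rate at the low score.
    scores = [0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.9, 0.9]
    labels = [0, 0, 1, 1, 0, 1, 1, 1]
    groups = ["A"] * 4 + ["B"] * 4
    for g, curve in group_calibration_curves(scores, labels, groups, n_bins=2).items():
        print(g, curve)
```

Under this sketch, a model is calibrated for a group when each bin's mean score is close to its positive rate. A group whose positive rate sits above its mean score in a bin is being under-scored there, which is the pattern the text reports for the 'Black' subgroup under some models.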


