Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness

Ashim Gupta, Rishanth Rajendhran, Nathan Stringham, Vivek Srikumar, Ana Marasović

Nov-16-2023, arXiv.org Artificial Intelligence

Are the longstanding robustness issues in NLP resolved by today's larger and more performant models? To address this question, we conduct a thorough investigation using 19 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) OOD and challenge test sets, (b) CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all OOD tests provide further insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them sufficiently robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.

Country:
- South America > Colombia
  - Bogotá D.C. > Bogotá (0.04)
- Oceania
  - Australia (0.04)
  - Nauru (0.04)
- North America
  - Panama (0.04)
  - Dominican Republic (0.04)
  - United States
    - Alaska (0.04)
    - Utah (0.04)
    - Texas (0.04)
    - Virginia (0.04)
    - Massachusetts (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - New Jersey > Essex County
      - South Orange (0.04)
      - Orange (0.04)
    - Indiana > Marion County
      - Indianapolis (0.04)
    - Nevada > Clark County
      - Las Vegas (0.04)
    - Illinois > Cook County
      - Chicago (0.04)
    - New Mexico > Santa Fe County
      - Santa Fe (0.04)
    - Washington > King County
      - Seattle (0.14)
    - California
      - San Francisco County > San Francisco (0.14)
      - Santa Clara County > Palo Alto (0.04)
      - San Diego County > San Diego (0.04)
      - Los Angeles County > Los Angeles (0.04)
    - New York > New York County
      - New York City (0.04)
  - Canada
    - Ontario > Toronto (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Europe
  - Poland (0.04)
  - France (0.04)
  - Lithuania (0.04)
  - Austria (0.04)
  - Russia > Central Federal District
    - Moscow Oblast > Moscow (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
  - United Kingdom > England
    - Bath and North East Somerset (0.04)
  - Middle East
    - Cyprus (0.04)
    - Republic of Türkiye > Istanbul Province
      - Istanbul (0.04)
  - Finland > Uusimaa
    - Helsinki (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Ukraine > Volyn Oblast
    - Lutsk (0.04)
- Asia
  - Russia (0.68)
  - Afghanistan (0.04)
  - Indonesia > Bali (0.04)
  - Thailand (0.04)
  - Laos (0.04)
  - Pakistan (0.04)
  - Vietnam > Long An Province (0.04)
  - China > Beijing
    - Beijing (0.04)
  - Middle East
    - Israel (0.04)
    - Iran (0.04)
    - Jordan (0.04)
    - Yemen (0.04)
    - Saudi Arabia (0.04)
    - Palestine > Gaza Strip (0.04)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.04)
    - Republic of Türkiye > Istanbul Province
      - Istanbul (0.04)
  - India > West Bengal
    - Kolkata (0.04)
- Africa
  - Sudan (0.04)
  - Tanzania (0.04)
  - Nigeria (0.04)
  - Kenya (0.04)
  - Rwanda > Kigali
    - Kigali (0.04)

Genre:
- Personal (0.93)
- Research Report > New Finding (0.45)

Industry:
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Law (1.00)
- Media (1.00)
- Transportation > Air (0.92)
- Banking & Finance (0.92)
- Government
  - Military (1.00)
  - Foreign Policy (0.67)
  - Regional Government
    - North America Government > United States Government (1.00)
    - Asia Government (0.67)
    - Europe Government (0.67)
- Leisure & Entertainment > Sports
  - Baseball (0.92)

Technology:
- Information Technology
  - Security & Privacy (1.00)
  - Communications > Social Media (1.00)
  - Artificial Intelligence
    - Natural Language > Large Language Model (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.94)