Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness
Ashim Gupta, Rishanth Rajendhran, Nathan Stringham, Vivek Srikumar, Ana Marasović
arXiv.org Artificial Intelligence
Are the longstanding robustness issues in NLP resolved by today's larger and more performant models? To address this question, we conduct a thorough investigation using 19 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) OOD and challenge test sets, (b) CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all OOD tests provide further insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them sufficiently robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.
Nov-16-2023