Red Teaming Deep Neural Networks with Feature Synthesis Tools

Dec-27-2025, 07:29:44 GMT–Neural Information Processing Systems

Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution (OOD) contexts. Despite the attention this area receives, there are comparatively few cases where these tools have identified previously unknown bugs in models. We argue that this is due, in part, to a feature common to many interpretability methods: they analyze model behavior using a particular dataset. This restricts the analysis to features that the user can sample in advance. To address this, a growing body of research interprets models using feature synthesis methods that do not depend on a dataset.

feature synthesis tool, name change, red teaming deep neural network, (7 more...)
