Adversarial Circuit Evaluation

Niels uit de Bos, Adrià Garriga-Alonso

Jul-21-2024–arXiv.org Artificial Intelligence

Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model. Concretely, we measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs. Our results show that the circuits for the IOI and docstring tasks fail to behave similarly to the full model even on completely benign inputs from the original task, indicating that more robust circuits are needed for safety-critical applications.
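The evaluation described in the abstract can be illustrated with a minimal sketch: compute the KL divergence between the full model's output distribution and the circuit's output distribution for each input, then rank inputs by their worst score. This is not the authors' code. The function names, the toy logits, the direction KL(full model || circuit), and the choice to take the maximum over several resampled corrupted inputs are all assumptions made for illustration. In practice, the circuit logits would come from running the model with every non-circuit component resample-ablated, that is, with its activations replaced by those from a corrupted input.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def rank_worst_inputs(full_logits, circuit_logits):
    """Rank inputs by their worst-case divergence from the full model.

    full_logits[i]    : logits of the full model on clean input i.
    circuit_logits[i] : list of logit vectors for input i, one per resampled
                        corrupted input used for the ablation (hypothetical
                        layout, chosen for this sketch).
    Returns (input index, max KL) pairs, worst first.
    """
    scores = []
    for i, (full, variants) in enumerate(zip(full_logits, circuit_logits)):
        p = softmax(full)
        worst = max(kl_divergence(p, softmax(c)) for c in variants)
        scores.append((i, worst))
    return sorted(scores, key=lambda s: s[1], reverse=True)


def percentile(values, q):
    """Nearest-rank percentile of a list of values, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Toy data (made up): two inputs, three output tokens, two corruptions each.
full_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
circuit_logits = [
    [[2.0, 0.0, 0.0], [1.9, 0.1, 0.0]],  # circuit stays close to the model
    [[0.0, 2.0, 0.0], [2.0, 0.0, 0.0]],  # one corruption flips the prediction
]

ranking = rank_worst_inputs(full_logits, circuit_logits)
worst_scores = [score for _, score in ranking]
print("worst input:", ranking[0])
print("max KL across inputs:", percentile(worst_scores, 100))
```

Ranking by the maximum rather than the mean is what makes the evaluation adversarial: a circuit can look faithful on average while still diverging sharply on a handful of benign inputs, and only the tail of the distribution exposes that.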

corrupted input, KL divergence, percentile
