Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
