Choosing the right basis for interpretability: Psychophysical comparison between neuron-based and dictionary-based representations

Authors: Colin, J., Goetschalckx, L., Fel, T., Boutin, V., Serre, T., Oliver, N.

External link: https://arxiv.org/abs/2411.03993
Publication: arXiv:2411.03993, 2024
DOI: https://doi.org/10.48550/arXiv.2411.03993

Interpretability research often adopts a neuron-centric lens, treating individual neurons as the fundamental units of explanation. However, neuron-level explanations can be undermined by superposition, where single units respond to mixtures of unrelated patterns. Dictionary learning methods, such as sparse autoencoders and non-negative matrix factorization, offer a promising alternative by learning a new basis over layer activations. Despite this promise, direct human evaluations comparing neuron-based and dictionary-based representations remain limited.
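To make "learning a new basis over layer activations" concrete, here is a minimal sketch of non-negative matrix factorization using standard multiplicative updates. It is an illustration under stated assumptions, not the authors' pipeline: the activation matrix is synthetic, and the names `nmf`, `top_k_inputs`, `A`, `U`, and `D` are hypothetical. Each row of `D` is a dictionary direction in neuron space, and each column of `U` plays the role that a single neuron's activation plays in the neuron basis.

```python
import numpy as np

def nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate non-negative A (n_inputs x n_neurons) as U @ D, with U, D >= 0.

    Uses multiplicative updates for the squared Frobenius reconstruction loss.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    U = rng.random((n, k)) + eps   # per-input codes over the dictionary
    D = rng.random((k, d)) + eps   # dictionary: k directions in neuron space
    for _ in range(n_iter):
        U *= (A @ D.T) / (U @ D @ D.T + eps)
        D *= (U.T @ A) / (U.T @ U @ D + eps)
    return U, D

def top_k_inputs(codes, unit, k=5):
    """Indices of the k inputs that most strongly activate one unit (column)."""
    return np.argsort(codes[:, unit])[::-1][:k]

# Synthetic stand-in for post-ReLU layer activations (assumption: real
# activations would be collected from a network layer instead).
rng = np.random.default_rng(0)
true_codes = rng.random((200, 4)) * (rng.random((200, 4)) < 0.3)
true_dict = rng.random((4, 16))
A = true_codes @ true_dict

U, D = nmf(A, k=4)
rel_error = np.linalg.norm(A - U @ D) / np.linalg.norm(A)

# Maximally activating inputs under each basis.
neuron_top = top_k_inputs(A, unit=0)    # neuron basis: a column of A
concept_top = top_k_inputs(U, unit=0)   # dictionary basis: a column of U
```

In this framing, the top-activating inputs for a neuron and for a dictionary direction are selected in the same way, and only the basis differs.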

We conducted three large-scale online psychophysics experiments (N=481) comparing explanations derived from neuron-based and dictionary-based representations in two convolutional neural networks (ResNet50, VGG16). We operationalize interpretability via visual coherence: a basis is more interpretable if humans can reliably recognize a common visual pattern in its maximally activating images and generalize that pattern to new images. Across experiments, dictionary-based representations were consistently more interpretable than neuron-based representations, with the advantage increasing in deeper layers.

Critically, models differ in how neuron-aligned their representations are, with ResNet50 exhibiting greater superposition. As a result, neuron-based evaluations can mask cross-model differences: ResNet50's higher interpretability emerges only under dictionary-based comparisons.

These results provide psychophysical evidence that dictionary-based representations offer a stronger foundation for interpretability and caution against model comparisons based solely on neuron-level analyses.