Capability ≠ Interpretability: Human Interpretability of Vision Foundation Models

Authors: Colin, J., Goetschalckx, L., Oliver, N., Serre, T.

Publication: Under review, 2026
PDF: Click here for the PDF

How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability (can an observer predict where a feature fires on a novel image?) and (2) nameability (can an observer accurately describe what the feature represents?). Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers (two supervised ViTs and four foundation models: DINOv2, DINOv3, CLIP, SigLIP), we collect more than 15,000 behavioral responses, analyzing the 13,400 responses from the 377 participants who passed our pre-specified quality checks. Foundation models are consistently less interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans: models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality, and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.
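To illustrate the idea of chance-anchored scoring, here is a minimal sketch. The paper's exact scoring function is not reproduced here; this assumes the common normalization that maps raw observer accuracy onto a scale where 0 corresponds to random guessing and 1 to perfect performance, which lets protocols with different chance rates share one axis.

```python
def chance_anchored_score(accuracy: float, chance: float) -> float:
    """Normalize raw accuracy so that 0 = chance-level guessing, 1 = perfect.

    Hypothetical illustration, not the paper's exact formula.
    `accuracy`: observer hit rate on a protocol trial (localizability or nameability).
    `chance`: expected hit rate under random guessing
              (e.g. 0.5 for a two-alternative forced choice).
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must lie in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


# Example: 80% accuracy on a 2-AFC task (chance = 0.5)
print(chance_anchored_score(0.8, 0.5))  # 0.6 on the common scale
```

With this kind of anchoring, scores from tasks with different numbers of response options become directly comparable, since each is expressed as a fraction of the achievable headroom above chance.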