Does human-alignment benefit interpretability?

Authors: Colin, J., Oliver, N., Serre, T.

Publication: Workshop on Representational Alignment (Re^4-Align), ICLR 2026

Aligning deep neural networks with human perception has been shown to benefit visual representations across a wide range of settings, from depth estimation to generalization, motivating the hope that injecting human knowledge into models may also improve their interpretability. However, whether and how alignment with human perception improves the interpretability of visual features remains unclear. In this work, we introduce an experimental protocol to quantify the interpretability of visual features in deep neural networks and use it to probe the relationship between alignment and interpretability by comparing aligned and non-aligned models of the same family. In particular, we compare the interpretability of three transformer-based models: DINOv2, a variant of DINOv2 that has been aligned with human similarity judgments, and DINOv3. Our results reveal a consistent trend: alignment benefits the interpretability of learned representations, with the aligned model being significantly more interpretable than both of its non-aligned counterparts. Visual inspection further highlights a profound qualitative shift in the learned features: aligned features tend to be more spatially localized, yet less visually rich. Together, these findings provide the first concrete evidence that alignment with human perception can enhance interpretability, while also underscoring that the mechanisms by which alignment reshapes visual representations remain only partially understood.