Vision Foundation Models Rank Lower in Human Interpretability Than Supervised ViTs
Why It Matters
The study reveals a 'transparency gap': higher model performance does not by itself make a model easier for humans to understand, which poses risks for high-stakes deployments. It suggests that safety and interpretability must be engineered intentionally rather than expected as a byproduct of scale.
Key Points
- A new framework using localizability and nameability protocols establishes a standardized scale for measuring AI feature interpretability.
- Foundation models including DINOv3 and CLIP are significantly less interpretable than older supervised Vision Transformers.
- The study found no correlation between a model's performance on benchmarks and how understandable its internal features are to humans.
- Interpretability is driven by spatial locality and coarse categorical alignment rather than fine-grained perceptual details.
- Sparse autoencoders were used to extract and test features across six different vision transformer architectures.
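For readers unfamiliar with the technique named in the last point, the following is a minimal sketch of a sparse autoencoder forward pass and loss. It is not the study's implementation: the ReLU encoder, the L1 penalty, the toy dimensions, and every name (affine, sae_forward, sae_loss, l1_coeff) are assumptions chosen for illustration, and training is omitted. A practical version would use an array library rather than plain lists.

```python
import random

def affine(weights, bias, vec):
    """Return weights @ vec + bias, with one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into non-negative feature activations, then reconstruct it."""
    z = [max(0.0, a) for a in affine(w_enc, b_enc, x)]  # ReLU keeps codes >= 0
    x_hat = affine(w_dec, b_dec, z)
    return z, x_hat

def sae_loss(x, z, x_hat, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_coeff * sum(z)  # z is non-negative, so sum(z) is its L1 norm

# Toy dimensions (hypothetical): an 8-dim embedding and 32 dictionary features.
random.seed(0)
d_model, n_features = 8, 32
x = [random.gauss(0, 1) for _ in range(d_model)]
w_enc = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_features)]
w_dec = [[random.gauss(0, 0.1) for _ in range(n_features)] for _ in range(d_model)]
b_enc, b_dec = [0.0] * n_features, [0.0] * d_model

z, x_hat = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, z, x_hat)
```

Each entry of z is one candidate "feature". In a study of this kind, it is features like these, once the autoencoder has been trained on a model's internal activations, that would be shown to human participants.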
Researchers have introduced a new framework to measure the human interpretability of vision models, revealing that modern foundation models consistently underperform their supervised predecessors. The study used sparse autoencoders and two psychophysics protocols, localizability and nameability, to analyze over 13,000 behavioral responses from human participants. Performance on downstream tasks did not correlate with interpretability, which challenges the idea of an inherent capability-interpretability trade-off. Instead, models like DINOv2, DINOv3, CLIP, and SigLIP produced features that were harder for humans to predict or describe than those in supervised Vision Transformers (ViTs). The paper identifies 'locality' of activations and coarse-grained semantic alignment as the primary drivers of interpretability, rather than fine-grained perceptual accuracy. These results suggest that recent gains in capability have come with more opaque internal features, so new approaches are needed to keep AI systems understandable to human operators.
Scientists found that our newest, most powerful AI 'vision' models (the ones that identify objects in photos) are actually harder for humans to understand than the older, simpler versions. By asking hundreds of people to predict where an AI's 'neurons' would fire or to describe what the AI was seeing, they discovered that modern models like CLIP are surprisingly confusing. Just because an AI is smarter doesn't mean it's clearer; it's like a genius who can give the right answer but can't explain their work. The researchers say we need to focus on making AI activations more 'local' and aligned with human categories to fix this.
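To make the notion of 'locality' concrete, here is one possible way to score how spatially concentrated a feature's activations are over an image's patch grid. The summary does not say how the paper measures locality, so the top-k mass measure and the function name top_k_mass are assumptions used purely for illustration.

```python
def top_k_mass(activation_map, k=1):
    """Fraction of total (non-negative) activation carried by the k strongest patches.

    Values near 1.0 mean the feature fires in a few places (local);
    values near k / number_of_patches mean it is spread evenly (diffuse).
    """
    flat = [max(0.0, a) for row in activation_map for a in row]
    total = sum(flat)
    if total == 0.0:
        return 0.0  # feature is silent on this image
    return sum(sorted(flat, reverse=True)[:k]) / total

# Two toy 3x3 patch grids: one peaked, one uniform.
peaked = [[0, 0, 0],
          [0, 9, 0],
          [0, 0, 1]]
diffuse = [[1, 1, 1],
           [1, 1, 1],
           [1, 1, 1]]

peaked_score = top_k_mass(peaked)    # most of the mass sits in one patch
diffuse_score = top_k_mass(diffuse)  # mass is shared equally by all nine patches
```

Under this assumed measure, the peaked map scores far higher than the diffuse one, matching the intuition that a feature firing in one place is easier for a person to point to.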
Sides
Critics
No critics identified
Defenders
Have historically prioritized scale and downstream capability over internal feature interpretability.
Neutral
Argue that interpretability is a measurable dimension of representation quality that is currently declining in foundation models.
Forecast
Researchers and labs are likely to shift focus toward 'locality-constrained' training methods to improve transparency without sacrificing performance. We should expect future vision foundation models to include interpretability scores as a standard metric alongside accuracy benchmarks.
Based on current signals. Events may develop differently.
Timeline
Research Paper Published
The study 'Capability ≠ Interpretability' is released on arXiv, documenting the interpretability gap in vision models.