Date
May 3, 2024, 3:00 pm – 4:00 pm

Details

Event Description

Title: Interpretability and Decomposition

Abstract: Understanding the latent representations of models can help us to anticipate and correct model failures, and to elicit capabilities that are not already present in the outputs. However, to unlock this potential, we need scalable methods for interpreting the representations of billion-parameter models. In this talk, I'll describe an approach for interpreting complex models by mathematically decomposing them into simpler components, focusing on the CLIP image encoder. We decompose CLIP's image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.
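As a rough sketch of the decomposition described in the abstract (the notation below is illustrative and not taken from the talk): writing M(I) for CLIP's image representation of an image I, the idea is that it can be expressed as a sum of direct contributions indexed by layer, attention head, and image patch, with each summand living in the joint image-text space so that it can be compared against CLIP text-encoder directions.

```latex
% Illustrative notation (assumed, not from the talk): M(I) is CLIP's image
% representation of image I; c_{l,h,p}(I) is the direct contribution of
% attention head h in layer l acting on image patch (token) p; e(I) collects
% any remaining terms; L, H, P are the numbers of layers, heads, and patches.
\[
  M(I) \;\approx\; \sum_{l=1}^{L} \sum_{h=1}^{H} \sum_{p=1}^{P} c_{l,h,p}(I) \;+\; e(I)
\]
```

Under this view, interpreting an individual head amounts to finding a small set of text representations whose span approximates that head's output space, as the abstract describes.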

Speaker: Jacob Steinhardt, Assistant Professor, UC Berkeley Department of Statistics.