Transformers have revolutionized deep learning research across many disciplines, starting from NLP and expanding to vision, speech, and more. In my talk, I will explore several milestones toward interpreting all families of Transformers, including unimodal, bi-modal, and encoder-decoder Transformers. I will present working examples and results that cover some of the most prominent models, including CLIP, ViT, and LXMERT.
I will then present our recent explainability-driven fine-tuning technique that significantly improves the robustness of Vision Transformers (ViTs). The loss we employ ensures that the model bases its prediction on the relevant parts of the input rather than supportive cues (e.g., background).