Peeling Back the Layers: Interpreting the Storytelling of ViT

MM 2024. Layer-by-layer decoding of ViT: revealing how an image is understood.

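The paper uses InstructBLIP as its base model, which consists of a 40-layer image encoder (EVA-CLIP-ViT) and a large language model as the text decoder, and analyzes the internal structure of the ViT layer by layer and head by head. Here we borrow that approach to analyze ViT-B/16.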
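To make the scope of a layer-by-layer, head-by-head analysis concrete for ViT-B/16, the sketch below enumerates the units to inspect and the tensor shapes involved. It is not the paper's procedure, only a bookkeeping aid. The 12 layers, 12 heads, 768-dimensional hidden size, and 16-pixel patches are the standard ViT-B/16 configuration. The 224x224 input resolution, the concatenated-head channel layout, and the names `ViTConfig`, `head_slice`, and `analysis_units` are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ViTConfig:
    """Standard ViT-B/16 hyperparameters (hypothetical helper class)."""
    image_size: int = 224   # assumption: the common 224x224 input resolution
    patch_size: int = 16
    num_layers: int = 12
    num_heads: int = 12
    hidden_dim: int = 768

    @property
    def num_patches(self) -> int:
        # Non-overlapping square patches tile the image.
        return (self.image_size // self.patch_size) ** 2

    @property
    def seq_len(self) -> int:
        # Patch tokens plus one [CLS] token.
        return self.num_patches + 1

    @property
    def head_dim(self) -> int:
        return self.hidden_dim // self.num_heads


def head_slice(cfg: ViTConfig, head: int) -> slice:
    """Channel range of one head, assuming heads are concatenated
    along the hidden dimension in index order."""
    start = head * cfg.head_dim
    return slice(start, start + cfg.head_dim)


def analysis_units(cfg: ViTConfig) -> list[tuple[int, int]]:
    """Every (layer, head) pair a head-wise analysis would visit."""
    return list(product(range(cfg.num_layers), range(cfg.num_heads)))


cfg = ViTConfig()
units = analysis_units(cfg)
print(cfg.num_patches, cfg.seq_len, cfg.head_dim, len(units))
# prints: 196 197 64 144
```

Under these assumptions, each of the 144 (layer, head) units produces a 197x197 attention map, and `head_slice` gives the 64 channels to mask or read out when isolating a single head's contribution.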