Vision Transformer (ViT) Implementation: Continuing from Positional Encoding
Since their introduction by Dosovitskiy et al. [reference] in 2020, Vision Transformers (ViT) have become a dominant architecture in computer vision, achieving state-of-the-art performance first in image classification and later in other tasks as well. However, they can be a bit harder to grasp than other architectures, particularly if you are not already familiar with the Transformer model…