
Vision-Transformer-CIFAR10

Implementation of the Vision Transformer (ViT) model on CIFAR-10

Model Architecture

[figure: ViT model architecture]

Project Description

Aim

  • Explore Transformer-based architectures for computer-vision tasks.
  • Transformers have been the de facto standard for NLP tasks, while CNN/ResNet-like architectures have been the state of the art for computer vision.
  • To date, researchers have applied attention to vision, but mostly in conjunction with CNNs.
  • This project highlights the strength and versatility of Vision Transformers, showing that they can be used for image recognition and can even beat state-of-the-art CNNs.

Methodology

[figure: methodology overview]
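The first step of the pipeline shown above is turning an image into a token sequence. As a minimal sketch (dimensions are illustrative, not necessarily this repo's settings): each fixed-size patch is flattened and linearly projected, then a learnable [CLS] token and position embeddings are added.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative ViT patch embedding: patchify, project, add [CLS] + positions."""

    def __init__(self, img_size=32, patch_size=4, in_ch=3, dim=64):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with stride == kernel size does patchify + linear projection in one step.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls.expand(x.size(0), -1, -1)     # one [CLS] token per image
        return torch.cat([cls, x], dim=1) + self.pos

emb = PatchEmbedding()
out = emb(torch.randn(2, 3, 32, 32))  # a batch of two CIFAR-sized images
print(out.shape)  # torch.Size([2, 65, 64]) -> 64 patches + 1 [CLS] token
```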

Transformer Encoder

[figure: Transformer encoder block]
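The encoder block in the figure can be sketched as follows: pre-LayerNorm multi-head self-attention and an MLP, each wrapped in a residual connection (hyperparameters here are illustrative, not the repo's exact configuration).

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm ViT encoder block: x + MHA(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, dim=64, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        return x + self.mlp(self.norm2(x))                 # residual around MLP

blk = EncoderBlock()
out = blk(torch.randn(2, 65, 64))  # token sequence in, same shape out
print(out.shape)  # torch.Size([2, 65, 64])
```

Because the block preserves the sequence shape, stacking more of them (the "number of layers" discussed below) is just repeated application.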

Why do we need an attention mechanism?

[figure: motivation for the attention mechanism]

Multi-Head Attention

[figure: multi-head attention]
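The core of the figure is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, computed in parallel over several heads. A bare-bones sketch (the learned Q/K/V and output projections of full multi-head attention are omitted for brevity):

```python
import torch

def attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def multi_head(q, k, v, heads):
    """Split the embedding dim into heads, attend per head, concatenate."""
    B, N, D = q.shape
    split = lambda t: t.view(B, N, heads, D // heads).transpose(1, 2)  # (B, H, N, D/H)
    out = attention(split(q), split(k), split(v))
    return out.transpose(1, 2).reshape(B, N, D)  # back to (B, N, D)

x = torch.randn(2, 65, 64)            # 65 tokens of dim 64, as above
out = multi_head(x, x, x, heads=4)    # self-attention: Q = K = V = x
print(out.shape)  # torch.Size([2, 65, 64])
```

Each head attends over the full token sequence, which is why every patch can exchange information with every other patch in a single layer.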

Datasets

Because powerful compute was not available on Google Colab, we chose to train and test on these two datasets –

Inference from Results

  • The patch size in the Vision Transformer determines the length of the input sequence.
  • Increasing the number of layers in the Vision Transformer should ideally lead to better results.
  • The Hybrid Vision Transformer performs better than ViT on small datasets: the initial ResNet features capture low-level, local structure thanks to the locality of convolutions, which a plain ViT cannot learn from the limited training data available.
  • The pretrained ViT performs much better than the other methods because it was trained on huge datasets and has learned better representations than even ResNet, since self-attention can access global information right from the very first layer, unlike a CNN.
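The first bullet in arithmetic: for a square image of side H and patch size P, the sequence length is N = (H/P)², so halving the patch size quadruples the number of tokens (and the cost of attention, which is quadratic in N).

```python
def seq_len(img_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    return (img_size // patch_size) ** 2

# CIFAR-10 images are 32x32:
for p in (2, 4, 8, 16):
    print(f"patch {p:2d} -> {seq_len(32, p):3d} tokens")
# patch  2 -> 256 tokens
# patch  4 ->  64 tokens
# patch  8 ->  16 tokens
# patch 16 ->   4 tokens
```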

Future Scope

  • Because better computing resources were not available, the model could not be trained on large datasets, which is the first and foremost requirement of this architecture for reaching very high accuracies. Due to this limitation, our from-scratch implementation could not reproduce the published accuracies.
  • Different attention mechanisms that take the 2D structure of images into account could be explored.
