
Vision-Transformer-CIFAR10

Implementation of the Vision Transformer (ViT) model on CIFAR-10

Model Architecture

[figure: ViT model architecture]

Project Description

Aim

  • Explore Transformer-based architectures for computer-vision tasks.
  • Transformers have been the de facto standard for NLP tasks, while CNN/ResNet-like architectures have been the state of the art for computer vision.
  • To date, researchers have applied attention to vision, but mostly in conjunction with CNNs.
  • This project highlights the strength and versatility of Vision Transformers, showing that they can be used for image recognition and can even beat state-of-the-art CNNs.

Methodology

[figure: methodology overview]
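The first step of the pipeline shown above is turning an image into a token sequence. As a minimal sketch (dimensions are illustrative, not necessarily this repo's settings): each fixed-size patch is flattened and linearly projected, then a learnable [CLS] token and position embeddings are added.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Illustrative ViT patch embedding: patchify, project, add [CLS] + positions."""

    def __init__(self, img_size=32, patch_size=4, in_ch=3, dim=64):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with stride == kernel size does patchify + linear projection in one step.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls.expand(x.size(0), -1, -1)     # one [CLS] token per image
        return torch.cat([cls, x], dim=1) + self.pos

emb = PatchEmbedding()
out = emb(torch.randn(2, 3, 32, 32))  # a batch of two CIFAR-sized images
print(out.shape)  # torch.Size([2, 65, 64]) -> 64 patches + 1 [CLS] token
```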

Transformer Encoder

[figure: Transformer encoder block]
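The encoder block in the figure can be sketched as follows: pre-LayerNorm multi-head self-attention and an MLP, each wrapped in a residual connection (hyperparameters here are illustrative, not the repo's exact configuration).

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm ViT encoder block: x + MHA(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, dim=64, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        return x + self.mlp(self.norm2(x))                 # residual around MLP

blk = EncoderBlock()
out = blk(torch.randn(2, 65, 64))  # token sequence in, same shape out
print(out.shape)  # torch.Size([2, 65, 64])
```

Because the block preserves the sequence shape, stacking more of them (the "number of layers" discussed below) is just repeated application.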

Why do we need an attention mechanism?

[figure: motivation for the attention mechanism]

Multi-Head Attention

[figure: multi-head attention]
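The core of the figure is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, computed in parallel over several heads. A bare-bones sketch (the learned Q/K/V and output projections of full multi-head attention are omitted for brevity):

```python
import torch

def attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def multi_head(q, k, v, heads):
    """Split the embedding dim into heads, attend per head, concatenate."""
    B, N, D = q.shape
    split = lambda t: t.view(B, N, heads, D // heads).transpose(1, 2)  # (B, H, N, D/H)
    out = attention(split(q), split(k), split(v))
    return out.transpose(1, 2).reshape(B, N, D)  # back to (B, N, D)

x = torch.randn(2, 65, 64)            # 65 tokens of dim 64, as above
out = multi_head(x, x, x, heads=4)    # self-attention: Q = K = V = x
print(out.shape)  # torch.Size([2, 65, 64])
```

Each head attends over the full token sequence, which is why every patch can exchange information with every other patch in a single layer.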

Datasets

Because powerful compute was not available on Google Colab, we chose to train and test on these two datasets –

Inference from Results

  • The patch size in the Vision Transformer determines the length of the input sequence.
  • Increasing the number of layers in the Vision Transformer should ideally lead to better results.
  • The Hybrid Vision Transformer performs better than ViT on small datasets: the initial ResNet features capture low-level, local structure thanks to the locality of convolutions, which a plain ViT cannot learn from the limited training data available.
  • The pretrained ViT performs much better than the other methods because it was trained on huge datasets and has learned better representations than even ResNet, since self-attention can access global information right from the very first layer, unlike a CNN.
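The first bullet in arithmetic: for a square image of side H and patch size P, the sequence length is N = (H/P)², so halving the patch size quadruples the number of tokens (and the cost of attention, which is quadratic in N).

```python
def seq_len(img_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    return (img_size // patch_size) ** 2

# CIFAR-10 images are 32x32:
for p in (2, 4, 8, 16):
    print(f"patch {p:2d} -> {seq_len(32, p):3d} tokens")
# patch  2 -> 256 tokens
# patch  4 ->  64 tokens
# patch  8 ->  16 tokens
# patch 16 ->   4 tokens
```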

Future Scope

  • Because better computing resources were not available, the model could not be trained on large datasets, which is the first and foremost requirement of this architecture for reaching very high accuracies. Due to this limitation, our from-scratch implementation could not reproduce the published accuracies.
  • Different attention mechanisms that take the 2D structure of images into account could be explored.
