Image recognition using multiple Vision Transformers in parallel having different patch sizes

Authors

  • A. M. Hafiz

DOI:

https://doi.org/10.17762/msea.v71i4.615

Abstract

With the advent of Transformers, which are attention-based mechanisms, many research directions have emerged. Their prowess in natural language processing tasks is well known, and their extension to computer vision is natural. Recently, Vision Transformers (ViTs) have achieved very good results on popular image recognition datasets. However, training Transformers is difficult due to the large computational resources required. Parallel processing is a well-known phenomenon in Nature's most efficient data processors. Inspired by this, I use a novel technique in which multiple ViTs with different patch sizes are run in parallel, and their probability vectors are averaged for final classification. Using medium-sized ViTs, I show that state-of-the-art results are achieved on popular datasets without resorting to huge model scales.
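The following is a minimal sketch of the parallel-ViT averaging scheme described above, assuming PyTorch and the timm library; the model names, input resolution, and pretrained weights are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
import timm

# Two ViTs that differ only in patch size (16 vs. 32); these timm model
# names are assumptions standing in for the paper's actual variants.
model_names = ["vit_base_patch16_224", "vit_base_patch32_224"]
models = [timm.create_model(n, pretrained=True).eval() for n in model_names]

@torch.no_grad()
def ensemble_predict(images: torch.Tensor) -> torch.Tensor:
    """Run all ViTs in parallel on the same batch and average their
    softmax probability vectors.

    images: (N, 3, 224, 224) batch, already normalized.
    Returns: (N, num_classes) averaged probabilities.
    """
    probs = [F.softmax(m(images), dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)

# Usage: the final label is the argmax of the averaged probabilities.
# batch = ...  # preprocessed image batch
# preds = ensemble_predict(batch).argmax(dim=-1)
```

Averaging the probability vectors (rather than logits) matches the combination rule stated in the abstract; each member model can be trained independently, which keeps the per-model compute at medium scale.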

Published

2022-08-29

How to Cite

A. M. Hafiz. (2022). Image recognition using multiple Vision Transformers in parallel having different patch sizes. Mathematical Statistician and Engineering Applications, 71(4), 1183–1194. https://doi.org/10.17762/msea.v71i4.615
