University of Leicester

File(s) under embargo: 7 month(s) and 4 day(s) until file(s) become available.

Vision Transformer: to discover the “Four Secrets” of image patches

Journal contribution, posted on 2024-01-17, 15:47, authored by T Zhou, Y Niu, H Lu, C Peng, Y Guo, Huiyu Zhou

Vision Transformer (ViT) is widely used in computer vision. A ViT pipeline has four main steps, its "four secrets": patch division, token selection, position encoding addition, and attention calculation, and existing research on transformers in computer vision focuses mainly on these four steps. The questions "how to divide patches?", "how to select tokens?", "how to add position encoding?", and "how to calculate attention?" are therefore crucial to improving ViT performance. So far, however, most review literature has been organised from the perspective of applications, and no survey comprehensively summarises these four steps from a technical perspective, which restricts the further development of ViT to some degree. To address this gap, this paper summarises the four major mechanisms and five applications of ViT. The main contributions are as follows. First, the basic principle and model structure of ViT are elaborated. Second, for "how to divide patches?", five key techniques of the patch division mechanism are summarised: from single-size to multi-size division, from a fixed to an adaptive number of patches, from non-overlapping to overlapping division, from semantic-segmentation to semantic-aggregation division, and from dividing the original image to dividing feature maps. Third, for "how to select tokens?", three key techniques of the token selection mechanism are summarised: score-based selection, merge-based selection, and selection based on convolution and pooling. Fourth, for "how to add position encoding?", five key techniques of the position encoding mechanism are summarised: absolute, relative, conditional, locally-enhanced, and zero-padding position encoding. Fifth, for "how to calculate attention?", 18 attention mechanisms are summarised along a timeline. Sixth, models that combine the Transformer with U-Net, GAN, YOLO, ResNet, and DenseNet are discussed in the medical image processing field. Finally, around the four questions posed above, we look ahead to future directions for the patch division, token selection, position encoding, and attention mechanisms, which will play an important role in the further development of ViT.
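To make the four steps concrete, below is a minimal PyTorch sketch of one ViT block, not the paper's method: the class name TinyViTBlock and all hyperparameters (patch_size=16, dim=64, heads=4, keep_ratio=0.5) are illustrative choices. It uses single-size non-overlapping patch division, a crude norm-based score for token selection, a learnable absolute position encoding, and standard multi-head scaled dot-product attention; each stage is the simplest representative of the families the survey catalogues.

```python
# Toy illustration of the four ViT steps: patch division, token selection,
# position encoding, attention. Sizes and the selection heuristic are
# illustrative assumptions, not taken from the surveyed paper.
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=64, heads=4, keep_ratio=0.5):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Step 1 - patch division: a strided convolution splits the image into
        # fixed-size, non-overlapping patches and embeds each one.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Step 3 - position encoding: a learnable absolute encoding, one per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.heads, self.dim, self.keep_ratio = heads, dim, keep_ratio
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Step 1: (B, 3, H, W) -> (B, N, dim) token sequence.
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        # Step 3: add the absolute position encoding before attention.
        tokens = tokens + self.pos_embed
        # Step 2 - token selection (score-based): keep the top-k tokens ranked
        # by L2 norm, a crude stand-in for learned scoring methods.
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        idx = tokens.norm(dim=-1).topk(k, dim=1).indices            # (B, k)
        tokens = torch.gather(tokens, 1,
                              idx.unsqueeze(-1).expand(-1, -1, self.dim))
        # Step 4 - attention: multi-head scaled dot-product attention.
        B, N, D = tokens.shape
        qkv = self.qkv(self.norm(tokens)).reshape(B, N, 3, self.heads, D // self.heads)
        q, k_, v = qkv.permute(2, 0, 3, 1, 4)                       # each (B, heads, N, d)
        attn = (q @ k_.transpose(-2, -1)) / (D // self.heads) ** 0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return tokens + self.proj(out)                              # residual connection

x = torch.randn(2, 3, 224, 224)
print(TinyViTBlock()(x).shape)  # torch.Size([2, 98, 64]) with keep_ratio=0.5
```

Each of the four stages above is exactly where the surveyed variants plug in: multi-size or overlapping divisions replace the strided convolution, merge- or pooling-based selection replaces the top-k step, relative or conditional encodings replace the learnable table, and the 18 surveyed attention mechanisms replace the vanilla scaled dot-product.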

History

Author affiliation

School of Computing and Mathematical Sciences, University of Leicester

Version

  • AM (Accepted Manuscript)

Published in

Information Fusion

Volume

105

Publisher

Elsevier

ISSN

1566-2535

Copyright date

2024

Available date

2025-07-11

Language

en
