Adaptive Locality Guidance: Using Locality Guidance to Initialize the Learning of Vision Transformers on Tiny Datasets

Jules Rostand, Chen Chien James Hsu, Cheng Kai Lu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Despite ongoing efforts to bring the benefits of vision transformers (VTs) to small datasets, convolutional neural networks (CNNs) remain the preferred choice when extensive training data is unavailable. Because studies show that a lack of sufficient data leads VTs to learn mainly global information from the input, the recently proposed locality guidance (LG) approach uses a lightweight CNN pretrained on the same dataset to guide the VT into learning local features as well. Under a dual learning framework, LG significantly boosts the accuracy of different VTs on multiple tiny datasets, at the cost of only a slight increase in training time. However, we also find that LG prevents the models from learning global aspects to their full ability, sometimes leading to worse performance than the original baselines. To overcome this limitation, we propose adaptive LG (ALG), an improved version that uses LG as an initialization tool and, after a certain number of epochs, lets the VT learn by itself in a supervised fashion. Specifically, we estimate the required duration of LG using a threshold on the evolution of the distance between the features of the VT and those of the lightweight CNN used for guidance. Since our improved method can be used in a plug-and-play fashion, we successfully apply it across ten different VTs and five different datasets. Experimental results show that the proposed ALG significantly reduces the computational cost that LG adds to training (by 37%–64%) and further increases the validation accuracy by up to 6.71%.
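The abstract does not spell out the exact stopping rule, so the following is only a minimal sketch of one way a threshold on the evolution of the feature distance could be turned into a guidance schedule. Everything named here is a hypothetical choice for illustration, not the authors' implementation: the function names `guidance_stop_epoch` and `total_loss`, the relative-change criterion, the `patience` parameter, and the default values. The per-epoch distances are assumed to be precomputed averages of the distance between VT features and CNN features.

```python
from typing import Optional, Sequence


def guidance_stop_epoch(
    distances: Sequence[float],
    threshold: float = 0.05,  # assumed value, not taken from the paper
    patience: int = 1,        # assumed extra safeguard, not taken from the paper
) -> Optional[int]:
    """Return the first 0-based epoch at which guidance could be dropped.

    Assumed criterion: the relative change of the VT-to-CNN feature
    distance stays below `threshold` for `patience` consecutive epochs.
    Returns None if the criterion is never met.
    """
    calm_epochs = 0
    for epoch in range(1, len(distances)):
        prev, curr = distances[epoch - 1], distances[epoch]
        rel_change = abs(prev - curr) / max(abs(prev), 1e-12)
        calm_epochs = calm_epochs + 1 if rel_change < threshold else 0
        if calm_epochs >= patience:
            return epoch
    return None


def total_loss(
    supervised_loss: float,
    feature_distance: float,
    epoch: int,
    stop_epoch: Optional[int],
    guidance_weight: float = 1.0,  # assumed weighting of the guidance term
) -> float:
    """Combine the losses: guidance term up to `stop_epoch`, supervised only after."""
    guided = stop_epoch is None or epoch <= stop_epoch
    if guided:
        return supervised_loss + guidance_weight * feature_distance
    return supervised_loss
```

In a training loop, one would record the mean feature distance at the end of each epoch, call `guidance_stop_epoch` on the history so far, and stop running the CNN forward pass once a stop epoch is found, which is where the reduction in added training cost would come from under this reading.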

Original language: English
Journal: IEEE Transactions on Neural Networks and Learning Systems
Publication status: Accepted/In press - 2025

Keywords

  • Convolutional neural network (CNN)
  • locality guidance (LG)
  • supervised learning
  • vision transformer (VT)

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Artificial Intelligence
