TY - JOUR
T1 - Adaptive Locality Guidance: Using Locality Guidance to Initialize the Learning of Vision Transformers on Tiny Datasets
T2 - IEEE Transactions on Neural Networks and Learning Systems
AU - Rostand, Jules
AU - Hsu, Chen-Chien James
AU - Lu, Cheng-Kai
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - While research continues toward leveraging the benefits of vision transformers (VTs) on small datasets, convolutional neural networks (CNNs) remain the preferred choice when extensive training data is unavailable. Since studies show that a lack of sufficient data leads VTs to learn mainly global information from the input, the recently proposed locality guidance (LG) approach uses a lightweight CNN pretrained on the same dataset to guide the VT toward learning local features as well. Under a dual-learning framework, LG significantly boosts the accuracy of different VTs on multiple tiny datasets at the cost of only a slight increase in training time. However, we also find that LG prevents the models from fully learning global aspects, sometimes leading to worse performance than the original baselines. To overcome this limitation, we propose adaptive LG (ALG), an improved version that uses LG only as an initialization tool and, after a certain number of epochs, lets the VT learn on its own in a supervised fashion. Specifically, we estimate the required duration of the LG phase by thresholding the evolution of the distance between the features of the VT and those of the lightweight CNN used for guidance. Since our improved method can be used in a plug-and-play fashion, we successfully apply it across ten different VTs and five different datasets. Experimental results show that the proposed ALG significantly reduces the computational cost added to training by LG (by 37%–64%) and further increases validation accuracy by up to 6.71%.
AB - While research continues toward leveraging the benefits of vision transformers (VTs) on small datasets, convolutional neural networks (CNNs) remain the preferred choice when extensive training data is unavailable. Since studies show that a lack of sufficient data leads VTs to learn mainly global information from the input, the recently proposed locality guidance (LG) approach uses a lightweight CNN pretrained on the same dataset to guide the VT toward learning local features as well. Under a dual-learning framework, LG significantly boosts the accuracy of different VTs on multiple tiny datasets at the cost of only a slight increase in training time. However, we also find that LG prevents the models from fully learning global aspects, sometimes leading to worse performance than the original baselines. To overcome this limitation, we propose adaptive LG (ALG), an improved version that uses LG only as an initialization tool and, after a certain number of epochs, lets the VT learn on its own in a supervised fashion. Specifically, we estimate the required duration of the LG phase by thresholding the evolution of the distance between the features of the VT and those of the lightweight CNN used for guidance. Since our improved method can be used in a plug-and-play fashion, we successfully apply it across ten different VTs and five different datasets. Experimental results show that the proposed ALG significantly reduces the computational cost added to training by LG (by 37%–64%) and further increases validation accuracy by up to 6.71%.
KW - Convolutional neural network (CNN)
KW - locality guidance (LG)
KW - supervised learning
KW - vision transformer (VT)
UR - http://www.scopus.com/inward/record.url?scp=85216415631&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85216415631&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2024.3515076
DO - 10.1109/TNNLS.2024.3515076
M3 - Article
AN - SCOPUS:85216415631
SN - 2162-237X
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
ER -