TY - GEN
T1 - Adaptive Locality Guidance
T2 - 2025 IEEE International Conference on Consumer Electronics, ICCE 2025
AU - Rostand, Jules
AU - Hsu, Chen Chien
AU - Lu, Cheng Kai
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - As studies show that a lack of sufficient data leads Vision Transformers (VTs) to mainly learn global information from the input, the recently proposed Locality Guidance (LG) approach [1] uses a lightweight Convolutional Neural Network (CNN) pretrained on the same dataset to guide the VT into learning local features as well. Under a dual learning framework, the use of the LG significantly boosts the accuracy of different VTs on multiple tiny datasets, at the mere cost of a slight increase in training time. However, we also find that the use of the LG prevents the models from learning global aspects to their full ability. To remedy this limitation, we propose the Adaptive Locality Guidance (ALG), an improved version which uses the LG as an initialization tool and, after a certain number of epochs, lets the VT learn by itself in a supervised fashion. Specifically, we estimate the needed duration of the LG based on a threshold set on the evolution of the distance between the features of the VT and those of the lightweight CNN used for guidance. As our improved method can be used in a plug-and-play fashion, we successfully apply it across four different VTs on the CIFAR-100 dataset. Experimental results show that the proposed ALG significantly reduces the computational cost added in training by the LG and further increases the validation accuracy by up to 5.49%, thereby achieving new State-Of-The-Art (SOTA) results among tiny VTs.
AB - As studies show that a lack of sufficient data leads Vision Transformers (VTs) to mainly learn global information from the input, the recently proposed Locality Guidance (LG) approach [1] uses a lightweight Convolutional Neural Network (CNN) pretrained on the same dataset to guide the VT into learning local features as well. Under a dual learning framework, the use of the LG significantly boosts the accuracy of different VTs on multiple tiny datasets, at the mere cost of a slight increase in training time. However, we also find that the use of the LG prevents the models from learning global aspects to their full ability. To remedy this limitation, we propose the Adaptive Locality Guidance (ALG), an improved version which uses the LG as an initialization tool and, after a certain number of epochs, lets the VT learn by itself in a supervised fashion. Specifically, we estimate the needed duration of the LG based on a threshold set on the evolution of the distance between the features of the VT and those of the lightweight CNN used for guidance. As our improved method can be used in a plug-and-play fashion, we successfully apply it across four different VTs on the CIFAR-100 dataset. Experimental results show that the proposed ALG significantly reduces the computational cost added in training by the LG and further increases the validation accuracy by up to 5.49%, thereby achieving new State-Of-The-Art (SOTA) results among tiny VTs.
KW - Convolutional Neural Network
KW - Locality Guidance
KW - Supervised Learning
KW - Vision Transformer
UR - http://www.scopus.com/inward/record.url?scp=105006602504&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105006602504&partnerID=8YFLogxK
U2 - 10.1109/ICCE63647.2025.10929945
DO - 10.1109/ICCE63647.2025.10929945
M3 - Conference contribution
AN - SCOPUS:105006602504
T3 - Digest of Technical Papers - IEEE International Conference on Consumer Electronics
BT - 2025 IEEE International Conference on Consumer Electronics, ICCE 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 11 January 2025 through 14 January 2025
ER -