TY - GEN
T1 - Exploring a CLIP-Enhanced Automated Approach for Video Description Generation
AU - Zhang, Siang Ling
AU - Cheng, Huai Hsun
AU - Chen, Yen Hsin
AU - Yeh, Mei Chen
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Visual storytelling is a learned ability that humans have developed over the course of their evolution. In contrast to written records and descriptions, people now share glimpses of their lives through short videos. Human conversation relies largely on visual and auditory inputs, followed by corresponding feedback. For machines, the conversion of images to text serves as a bridge between visual and linguistic information, enabling more natural machine-human interaction. Video captioning, which involves automatically generating textual descriptions from videos, is one of the core technologies that enable such applications. In this work, we present CLIP-CAP, an automatic method for transforming visual content into concise textual descriptions. We investigate the pretrained CLIP model and its potential for this task. Through experiments on the ActivityNet Captions dataset, we show that the proposed CLIP-CAP model outperforms existing video captioning methods on several metrics.
AB - Visual storytelling is a learned ability that humans have developed over the course of their evolution. In contrast to written records and descriptions, people now share glimpses of their lives through short videos. Human conversation relies largely on visual and auditory inputs, followed by corresponding feedback. For machines, the conversion of images to text serves as a bridge between visual and linguistic information, enabling more natural machine-human interaction. Video captioning, which involves automatically generating textual descriptions from videos, is one of the core technologies that enable such applications. In this work, we present CLIP-CAP, an automatic method for transforming visual content into concise textual descriptions. We investigate the pretrained CLIP model and its potential for this task. Through experiments on the ActivityNet Captions dataset, we show that the proposed CLIP-CAP model outperforms existing video captioning methods on several metrics.
UR - http://www.scopus.com/inward/record.url?scp=85180010373&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85180010373&partnerID=8YFLogxK
U2 - 10.1109/APSIPAASC58517.2023.10317231
DO - 10.1109/APSIPAASC58517.2023.10317231
M3 - Conference contribution
AN - SCOPUS:85180010373
T3 - 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
SP - 1506
EP - 1511
BT - 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
Y2 - 31 October 2023 through 3 November 2023
ER -