Exploring a CLIP-Enhanced Automated Approach for Video Description Generation

Siang Ling Zhang, Huai Hsun Cheng, Yen Hsin Chen, Mei Chen Yeh

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Visual storytelling is a learned ability that humans have developed over the course of their evolution. Rather than relying on written records and descriptions, people now share glimpses of their lives through short videos. Human conversations largely rely on visual and auditory inputs, followed by corresponding feedback. For machines, the conversion of images to text serves as a bridge between visual and linguistic information, enabling more natural machine-human interaction. Video captioning, the automatic generation of textual descriptions from videos, is one of the core technologies that enable such applications. In this work, we present CLIP-CAP, an automatic method for transforming visual content into concise textual descriptions. We investigate the pretrained CLIP model and its potential for this task. Through experiments on the ActivityNet Captions dataset, we show that the proposed CLIP-CAP model outperforms existing video captioning methods on several metrics.

Original language: English
Title of host publication: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1506-1511
Number of pages: 6
ISBN (Electronic): 9798350300673
Publication status: Published - 2023
Event: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023 - Taipei, Taiwan
Duration: 2023 Oct 31 → 2023 Nov 3

Publication series

Name: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023

Conference

Conference: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
Country/Territory: Taiwan
City: Taipei
Period: 2023/10/31 → 2023/11/03

ASJC Scopus subject areas

  • Hardware and Architecture
  • Signal Processing
  • Artificial Intelligence
  • Computer Science Applications
