Exploring a CLIP-Enhanced Automated Approach for Video Description Generation

Siang Ling Zhang, Huai Hsun Cheng, Yen Hsin Chen, Mei Chen Yeh

Research output: Chapter in book/report/conference proceeding › Conference contribution

Abstract

Visual storytelling is a learned ability that humans have developed over the course of their evolution. In contrast to written records and descriptions, people now share glimpses of their lives through short videos. Human conversation relies largely on visual and auditory inputs, followed by corresponding feedback. For machines, converting images to text serves as a bridge between visual and linguistic information, enabling more natural machine-human interaction. Video captioning, the automatic generation of textual descriptions from videos, is one of the core technologies that enable such applications. In this work, we present CLIP-CAP, an automatic method for transforming visual content into concise textual descriptions. We investigate the pretrained CLIP model and its potential for this task. Through experiments on the ActivityNet Captions dataset, we show that the proposed CLIP-CAP model outperforms existing video captioning methods on several different metrics.
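The abstract's core idea, that CLIP bridges visual and linguistic information, rests on CLIP scoring image-text pairs by the cosine similarity of their embeddings in a shared space. A minimal sketch of that matching principle, using toy NumPy vectors as hypothetical stand-ins for CLIP's image and text encoder outputs (this is not the paper's CLIP-CAP method, only an illustration of the underlying similarity scoring):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for CLIP encoder outputs (hypothetical values).
image_emb = np.array([0.2, 0.9, 0.1])
caption_embs = {
    "a person cooking":   np.array([0.25, 0.85, 0.05]),
    "a dog running":      np.array([0.90, 0.10, 0.40]),
}

# Rank candidate captions by similarity to the image embedding.
scores = {c: cosine_similarity(image_emb, e) for c, e in caption_embs.items()}
best_caption = max(scores, key=scores.get)
print(best_caption)  # the caption whose embedding is closest to the image's
```

In a real pipeline, the vectors would come from CLIP's encoders applied to sampled video frames and candidate text; the ranking step is the same.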

Original language: English
Title of host publication: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1506-1511
Number of pages: 6
ISBN (electronic): 9798350300673
DOIs
Publication status: Published - 2023
Event: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023 - Taipei, Taiwan
Duration: 31 Oct 2023 - 3 Nov 2023

Publication series

Name: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023

Conference

Conference: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2023
Country/Territory: Taiwan
City: Taipei
Period: 2023/10/31 - 2023/11/03

ASJC Scopus subject areas

  • Hardware and Architecture
  • Signal Processing
  • Artificial Intelligence
  • Computer Science Applications

