Multimodal fusion using learned text concepts for image categorization

Qiang Zhu, Mei Chen Yeh, Kwang Ting Cheng

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

32 Citations (Scopus)

Abstract

Conventional image categorization techniques rely primarily on low-level visual cues. In this paper, we describe a multimodal fusion scheme that improves image classification accuracy by incorporating information derived from the embedded text detected in the image under classification. For each image category, a text concept is first learned from a set of labeled texts in images of that category using Multiple Instance Learning [1]. For an image under classification that contains multiple detected text lines, we calculate a weighted Euclidean distance between each text line and the learned text concept of the target category. The minimum distance and the low-level visual cues are then jointly used as features for SVM-based classification. Experiments on a challenging image database demonstrate that the proposed fusion framework achieves higher accuracy than state-of-the-art methods for image classification.
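
As a rough illustration of the fusion step described in the abstract, the following Python sketch (using NumPy and scikit-learn, not code from the paper) models the learned text concept as a concept point with per-dimension weights, computes the weighted Euclidean distance from each detected text line to that concept, takes the minimum over the text lines, and concatenates it with the low-level visual cues before training an SVM. All variable names, feature dimensions, and the toy data are illustrative assumptions.

# Minimal sketch of the described fusion scheme (assumptions, not the authors' code):
# the learned text concept is modeled as a point `t` with per-dimension weights `s`,
# in the style of Diverse Density multiple instance learning.
import numpy as np
from sklearn.svm import SVC

def weighted_euclidean(x, t, s):
    # Weighted Euclidean distance between a text-line feature x and concept point t.
    return np.sqrt(np.sum(s * (x - t) ** 2))

def min_text_distance(text_lines, t, s):
    # Minimum weighted distance over all detected text lines;
    # a large constant stands in for "no text detected" (hypothetical choice).
    if not text_lines:
        return 1e3
    return min(weighted_euclidean(x, t, s) for x in text_lines)

def fuse_features(visual_feat, text_lines, t, s):
    # Concatenate low-level visual cues with the text-concept distance feature.
    return np.append(visual_feat, min_text_distance(text_lines, t, s))

# Toy usage with random placeholders for real visual and text-line features.
rng = np.random.default_rng(0)
t, s = rng.random(8), rng.random(8)          # learned concept point and weights
X = np.stack([
    fuse_features(rng.random(64), [rng.random(8) for _ in range(3)], t, s)
    for _ in range(20)
])
y = rng.integers(0, 2, size=20)              # binary labels: target category or not
clf = SVC(kernel="rbf").fit(X, y)            # SVM-based classification on fused features
print(clf.predict(X[:3]))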

Original language: English
Title of host publication: Proceedings of the 14th Annual ACM International Conference on Multimedia, MM 2006
Pages: 211-220
Number of pages: 10
ISBN (Print): 1595934472, 9781595934475
DOIs: https://doi.org/10.1145/1180639.1180698
Publication status: Published - 2006 Dec 1
Externally published: Yes
Event: 14th Annual ACM International Conference on Multimedia, MM 2006 - Santa Barbara, CA, United States
Duration: 2006 Oct 23 - 2006 Oct 27

Other

Other: 14th Annual ACM International Conference on Multimedia, MM 2006
Country: United States
City: Santa Barbara, CA
Period: 06/10/23 - 06/10/27

Keywords

  • Image annotation
  • Image categorization
  • Multimodal fusion
  • Multiple instance learning
  • Text detection

ASJC Scopus subject areas

  • Computer Science (all)
  • Media Technology

Cite this

Zhu, Q., Yeh, M. C., & Cheng, K. T. (2006). Multimodal fusion using learned text concepts for image categorization. In Proceedings of the 14th Annual ACM International Conference on Multimedia, MM 2006 (pp. 211-220). https://doi.org/10.1145/1180639.1180698
