Spoken document retrieval leveraging unsupervised and supervised topic modeling techniques

Kuan Yu Chen, Hsin Min Wang, Berlin Chen

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

This paper applies two categories of topic modeling techniques, the document topic model (DTM) and the word topic model (WTM), to the problem of spoken document retrieval (SDR). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, envisioning a scenario in which user query logs, along with click-through information on relevant documents, can be used to build an SDR system. This approach has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving retrieval quality over the baseline system. We also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, to lessen the SDR performance degradation caused by imperfect speech recognition, we investigate different levels of index features for topic modeling, including words, syllable-level units, and their combination. We report a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods derived from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.
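
The abstract does not give the scoring formula, so the following is only an illustrative sketch of how a document topic model might rank documents for a query. It assumes the common mixture form log P(Q|D) = sum over query words w of log sum over topics k of P(w|T_k) P(T_k|D). All names and probabilities are hypothetical toy values and are not taken from the paper.

```python
import math

# Hypothetical toy model (values invented for illustration only).
# P(w | T_k): one word distribution per latent topic.
P_WORD_GIVEN_TOPIC = [
    {"election": 0.5, "vote": 0.5},   # topic 0
    {"typhoon": 0.5, "rain": 0.5},    # topic 1
]
# P(T_k | D): one topic mixture per document.
P_TOPIC_GIVEN_DOC = {
    "doc_politics": [0.9, 0.1],
    "doc_weather": [0.1, 0.9],
}

def query_log_likelihood(query, topic_weights, word_given_topic=P_WORD_GIVEN_TOPIC):
    """Return log P(Q|D) under the assumed topic-mixture form."""
    score = 0.0
    for word in query:
        # Marginalize over topics: sum_k P(w|T_k) * P(T_k|D).
        p = sum(dist.get(word, 0.0) * weight
                for dist, weight in zip(word_given_topic, topic_weights))
        score += math.log(p + 1e-12)  # small floor avoids log(0)
    return score

def rank_documents(query, topic_given_doc=P_TOPIC_GIVEN_DOC):
    """Sort documents by descending query log-likelihood."""
    scored = [(doc, query_log_likelihood(query, weights))
              for doc, weights in topic_given_doc.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_documents(["vote"])
```

Because the score depends on a document only through its topic mixture, a document can rank highly for a query word it never contains, which is the kind of query-document association the abstract describes.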

Original language: English
Pages (from-to): 1195-1205
Number of pages: 11
Journal: IEICE Transactions on Information and Systems
Volume: E95-D
Issue number: 5
DOI: 10.1587/transinf.E95.D.1195
Publication status: Published - May 2012

Fingerprint

Information retrieval systems
Speech recognition
Feedback
Degradation
Experiments

Keywords

  • Pseudo-supervised training
  • Spoken document retrieval
  • Subword-level indexing
  • Supervised training
  • Topic model

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
  • Artificial Intelligence

Cite this

Spoken document retrieval leveraging unsupervised and supervised topic modeling techniques. / Chen, Kuan Yu; Wang, Hsin Min; Chen, Berlin.

In: IEICE Transactions on Information and Systems, Vol. E95-D, No. 5, 05.2012, p. 1195-1205.

@article{9775750e79364f299a16a9799d0643b8,
title = "Spoken document retrieval leveraging unsupervised and supervised topic modeling techniques",
abstract = "This paper describes the application of two attractive categories of topic modeling techniques to the problem of spoken document retrieval (SDR), viz. document topic model (DTM) and word topic model (WTM). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, imagining a scenario that user query logs along with click-through information of relevant documents can be utilized to build an SDR system. This attempt has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving on retrieval quality over the baseline system. Likewise, we also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, in order to lessen SDR performance degradation caused by imperfect speech recognition, we investigate leveraging different levels of index features for topic modeling, including words, syllable-level units, and their combination. We provide a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods deduced from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.",
keywords = "Pseudo-supervised training, Spoken document retrieval, Subword-level indexing, Supervised training, Topic model",
author = "Chen, {Kuan Yu} and Wang, {Hsin Min} and Chen, Berlin",
year = "2012",
month = "5",
doi = "10.1587/transinf.E95.D.1195",
language = "English",
volume = "E95-D",
pages = "1195--1205",
journal = "IEICE Transactions on Information and Systems",
issn = "0916-8532",
publisher = "Maruzen Co., Ltd/Maruzen Kabushikikaisha",
number = "5",

}

TY - JOUR

T1 - Spoken document retrieval leveraging unsupervised and supervised topic modeling techniques

AU - Chen, Kuan Yu

AU - Wang, Hsin Min

AU - Chen, Berlin

PY - 2012/5

Y1 - 2012/5

AB - This paper describes the application of two attractive categories of topic modeling techniques to the problem of spoken document retrieval (SDR), viz. document topic model (DTM) and word topic model (WTM). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, imagining a scenario that user query logs along with click-through information of relevant documents can be utilized to build an SDR system. This attempt has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving on retrieval quality over the baseline system. Likewise, we also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, in order to lessen SDR performance degradation caused by imperfect speech recognition, we investigate leveraging different levels of index features for topic modeling, including words, syllable-level units, and their combination. We provide a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods deduced from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.

KW - Pseudo-supervised training

KW - Spoken document retrieval

KW - Subword-level indexing

KW - Supervised training

KW - Topic model

UR - http://www.scopus.com/inward/record.url?scp=84860624661&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84860624661&partnerID=8YFLogxK

U2 - 10.1587/transinf.E95.D.1195

DO - 10.1587/transinf.E95.D.1195

M3 - Article

AN - SCOPUS:84860624661

VL - E95-D

SP - 1195

EP - 1205

JO - IEICE Transactions on Information and Systems

JF - IEICE Transactions on Information and Systems

SN - 0916-8532

IS - 5

ER -