Spoken document retrieval leveraging unsupervised and supervised topic modeling techniques

Kuan Yu Chen, Hsin Min Wang, Berlin Chen

Research output: Journal article

5 Citations (Scopus)

Abstract

This paper describes the application of two attractive categories of topic modeling techniques to the problem of spoken document retrieval (SDR), viz. document topic model (DTM) and word topic model (WTM). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, imagining a scenario that user query logs along with click-through information of relevant documents can be utilized to build an SDR system. This attempt has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving on retrieval quality over the baseline system. Likewise, we also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, in order to lessen SDR performance degradation caused by imperfect speech recognition, we investigate leveraging different levels of index features for topic modeling, including words, syllable-level units, and their combination. We provide a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods deduced from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.
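As a rough illustration of how a document topic model can match a query to a document that shares none of its words, the sketch below scores documents with the common topic-mixture query likelihood, P(q|d) = Π_w Σ_k P(w|T_k) P(T_k|d). This is a minimal sketch under stated assumptions: the function names, the two toy topics, and all probabilities are hypothetical and are not taken from the paper, whose training procedures (unsupervised, supervised, pseudo-supervised) are not reproduced here.

```python
import math

# Hypothetical toy parameters (not from the paper): P(word | topic) for two topics.
P_WORD_GIVEN_TOPIC = [
    {"election": 0.5, "vote": 0.4, "typhoon": 0.1},    # topic 0: politics-like
    {"election": 0.05, "vote": 0.05, "typhoon": 0.9},  # topic 1: weather-like
]

# Hypothetical per-document topic mixtures P(topic | doc).
DOC_TOPICS = {
    "doc_politics": [0.9, 0.1],
    "doc_weather": [0.1, 0.9],
}


def dtm_query_log_likelihood(query, p_word_given_topic, p_topic_given_doc):
    """Return log P(query | doc) under a topic-mixture document model."""
    log_likelihood = 0.0
    for word in query:
        # Marginalize over topics: sum_k P(word | T_k) * P(T_k | doc).
        p_word = sum(
            p_word_given_topic[k].get(word, 0.0) * p_topic_given_doc[k]
            for k in range(len(p_topic_given_doc))
        )
        # Floor the probability so unseen words do not produce log(0).
        log_likelihood += math.log(max(p_word, 1e-12))
    return log_likelihood


def rank(query):
    """Rank document ids by descending query log-likelihood."""
    scores = {
        doc_id: dtm_query_log_likelihood(query, P_WORD_GIVEN_TOPIC, mixture)
        for doc_id, mixture in DOC_TOPICS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(rank(["election"]))
    print(rank(["typhoon"]))
```

Because the score depends only on the document's topic mixture, a document dominated by the politics-like topic can rank first for the query "election" even if its transcript contains only "vote".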

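Original language: English
Pages (from-to): 1195-1205
Number of pages: 11
Journal: IEICE Transactions on Information and Systems
Volume: E95-D
Issue number: 5
DOI: 10.1587/transinf.E95.D.1195
Publication status: Published - May 2012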

Fingerprint

Information retrieval systems
Speech recognition
Feedback
Degradation
Experiments

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
  • Artificial Intelligence

Cite this

Spoken document retrieval leveraging unsupervised and supervised topic modeling techniques. / Chen, Kuan Yu; Wang, Hsin Min; Chen, Berlin.

In: IEICE Transactions on Information and Systems, Vol. E95-D, No. 5, 05.2012, p. 1195-1205.

@article{9775750e79364f299a16a9799d0643b8,
title = "Spoken document retrieval leveraging unsupervised and supervised topic modeling techniques",
abstract = "This paper describes the application of two attractive categories of topic modeling techniques to the problem of spoken document retrieval (SDR), viz. document topic model (DTM) and word topic model (WTM). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, imagining a scenario that user query logs along with click-through information of relevant documents can be utilized to build an SDR system. This attempt has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving on retrieval quality over the baseline system. Likewise, we also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, in order to lessen SDR performance degradation caused by imperfect speech recognition, we investigate leveraging different levels of index features for topic modeling, including words, syllable-level units, and their combination. We provide a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods deduced from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.",
keywords = "Pseudo-supervised training, Spoken document retrieval, Subword-level indexing, Supervised training, Topic model",
author = "Chen, {Kuan Yu} and Wang, {Hsin Min} and Chen, Berlin",
year = "2012",
month = "5",
doi = "10.1587/transinf.E95.D.1195",
language = "English",
volume = "E95-D",
pages = "1195--1205",
journal = "IEICE Transactions on Information and Systems",
issn = "0916-8532",
publisher = "Maruzen Co., Ltd/Maruzen Kabushikikaisha",
number = "5",
}

TY - JOUR
T1 - Spoken document retrieval leveraging unsupervised and supervised topic modeling techniques
AU - Chen, Kuan Yu
AU - Wang, Hsin Min
AU - Chen, Berlin
PY - 2012/5
Y1 - 2012/5
AB - This paper describes the application of two attractive categories of topic modeling techniques to the problem of spoken document retrieval (SDR), viz. document topic model (DTM) and word topic model (WTM). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, imagining a scenario that user query logs along with click-through information of relevant documents can be utilized to build an SDR system. This attempt has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving on retrieval quality over the baseline system. Likewise, we also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, in order to lessen SDR performance degradation caused by imperfect speech recognition, we investigate leveraging different levels of index features for topic modeling, including words, syllable-level units, and their combination. We provide a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods deduced from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.

KW - Pseudo-supervised training
KW - Spoken document retrieval
KW - Subword-level indexing
KW - Supervised training
KW - Topic model
UR - http://www.scopus.com/inward/record.url?scp=84860624661&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84860624661&partnerID=8YFLogxK
U2 - 10.1587/transinf.E95.D.1195
DO - 10.1587/transinf.E95.D.1195
M3 - Article
AN - SCOPUS:84860624661
VL - E95-D
SP - 1195
EP - 1205
JO - IEICE Transactions on Information and Systems
JF - IEICE Transactions on Information and Systems
SN - 0916-8532
IS - 5
ER -