Improved Chinese spoken document retrieval with hybrid modeling and data-driven indexing features

Chun Jen Wang, Berlin Chen, Lin Shan Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Citations (Scopus)

Abstract

Different models retrieve documents based on different approaches to extracting the underlying content. Different levels of indexing features also offer different functionalities and discriminabilities when retrieving documents. In this paper, we present results for Chinese spoken document retrieval with hybrid models that integrate the knowledge obtainable from three basic retrieval models, namely, the standard vector space model (VSM), the hidden Markov model (HMM), and the latent semantic indexing (LSI) model. The characteristics of retrieval performance using both word-level and syllable-level indexing features were extensively explored. In addition, a data-driven approach to deriving variable-length indexing features is presented. Very satisfactory performance can be achieved with these data-driven features while retaining a very compact feature set. Experiments showed that this approach has the potential to identify domain-specific terminology or newly generated phrases. It is therefore very useful not only in Chinese document retrieval, but also in detecting out-of-vocabulary (OOV) words in Chinese. Very encouraging results were also obtained when the hybrid models were used with the data-driven indexing features.
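The abstract does not say how the three models are combined or how the indexing features are built, so the following is only a minimal sketch under stated assumptions: overlapping syllable n-grams stand in for syllable-level indexing features, a TF-IDF cosine score stands in for the VSM, and the hybrid is assumed to be a weighted sum of min-max normalized per-model scores. All function names, the toy syllable sequences, and the weights are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def syllable_ngrams(syllables, n=2):
    """Overlapping syllable n-grams used as indexing terms (assumed feature type)."""
    return [" ".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def tfidf_vectors(docs_terms):
    """Return (TF-IDF vectors, IDF table) for a list of per-document term lists."""
    n_docs = len(docs_terms)
    df = Counter(t for terms in docs_terms for t in set(terms))
    # Smoothed IDF; the exact weighting scheme is an assumption.
    idf = {t: math.log((n_docs + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(terms).items()} for terms in docs_terms]
    return vecs, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def vsm_scores(query_terms, doc_vecs, idf):
    """Score every document against the query with a vector space model."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_terms).items()}
    return [cosine(q, d) for d in doc_vecs]


def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(score_lists, weights):
    """Weighted sum of normalized per-model scores (assumed combination rule)."""
    normalized = [min_max(s) for s in score_lists]
    n_docs = len(score_lists[0])
    return [sum(w * norm[i] for w, norm in zip(weights, normalized))
            for i in range(n_docs)]


# Toy syllable transcripts (hypothetical data, romanized for readability).
docs = [
    ["tai", "bei", "shi", "zhang"],
    ["gu", "piao", "shi", "chang"],
]
query = ["tai", "bei", "shi"]

# Model 1: VSM over syllable bigrams. Model 2: VSM over single syllables.
bigram_vecs, bigram_idf = tfidf_vectors([syllable_ngrams(d, 2) for d in docs])
unigram_vecs, unigram_idf = tfidf_vectors(docs)

bigram_scores = vsm_scores(syllable_ngrams(query, 2), bigram_vecs, bigram_idf)
unigram_scores = vsm_scores(query, unigram_vecs, unigram_idf)

combined = hybrid_scores([bigram_scores, unigram_scores], weights=[0.5, 0.5])
ranking = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
print(ranking)
```

In this sketch the second score list could equally come from an HMM or LSI scorer; normalizing before mixing keeps one model's score scale from dominating the sum.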

Original language: English
Title of host publication: 7th International Conference on Spoken Language Processing, ICSLP 2002
Publisher: International Speech Communication Association
Pages: 1985-1988
Number of pages: 4
Publication status: Published - 2002 Jan 1
Event: 7th International Conference on Spoken Language Processing, ICSLP 2002 - Denver, United States
Duration: 2002 Sep 16 to 2002 Sep 20

Fingerprint

indexing
Modeling
Data-driven
Indexing
technical language
functionality
performance
vocabulary
semantics
experiment
Hybrid Model

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Wang, C. J., Chen, B., & Lee, L. S. (2002). Improved Chinese spoken document retrieval with hybrid modeling and data-driven indexing features. In 7th International Conference on Spoken Language Processing, ICSLP 2002 (pp. 1985-1988). International Speech Communication Association.

@inproceedings{eb4c417ded50405ca3fafd82c307e609,
title = "Improved Chinese spoken document retrieval with hybrid modeling and data-driven indexing features",
abstract = "Different models retrieve documents based on different approaches to extracting the underlying content. Different levels of indexing features also offer different functionalities and discriminabilities when retrieving documents. In this paper, we present results for Chinese spoken document retrieval with hybrid models that integrate the knowledge obtainable from three basic retrieval models, namely, the standard vector space model (VSM), the hidden Markov model (HMM), and the latent semantic indexing (LSI) model. The characteristics of retrieval performance using both word-level and syllable-level indexing features were extensively explored. In addition, a data-driven approach to deriving variable-length indexing features is presented. Very satisfactory performance can be achieved with these data-driven features while retaining a very compact feature set. Experiments showed that this approach has the potential to identify domain-specific terminology or newly generated phrases. It is therefore very useful not only in Chinese document retrieval, but also in detecting out-of-vocabulary (OOV) words in Chinese. Very encouraging results were also obtained when the hybrid models were used with the data-driven indexing features.",
author = "Wang, {Chun Jen} and Berlin Chen and Lee, {Lin Shan}",
year = "2002",
month = "1",
day = "1",
language = "English",
pages = "1985--1988",
booktitle = "7th International Conference on Spoken Language Processing, ICSLP 2002",
publisher = "International Speech Communication Association",

}

TY - GEN

T1 - Improved Chinese spoken document retrieval with hybrid modeling and data-driven indexing features

AU - Wang, Chun Jen

AU - Chen, Berlin

AU - Lee, Lin Shan

PY - 2002/1/1

Y1 - 2002/1/1

AB - Different models retrieve documents based on different approaches to extracting the underlying content. Different levels of indexing features also offer different functionalities and discriminabilities when retrieving documents. In this paper, we present results for Chinese spoken document retrieval with hybrid models that integrate the knowledge obtainable from three basic retrieval models, namely, the standard vector space model (VSM), the hidden Markov model (HMM), and the latent semantic indexing (LSI) model. The characteristics of retrieval performance using both word-level and syllable-level indexing features were extensively explored. In addition, a data-driven approach to deriving variable-length indexing features is presented. Very satisfactory performance can be achieved with these data-driven features while retaining a very compact feature set. Experiments showed that this approach has the potential to identify domain-specific terminology or newly generated phrases. It is therefore very useful not only in Chinese document retrieval, but also in detecting out-of-vocabulary (OOV) words in Chinese. Very encouraging results were also obtained when the hybrid models were used with the data-driven indexing features.

UR - http://www.scopus.com/inward/record.url?scp=85009285470&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85009285470&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85009285470

SP - 1985

EP - 1988

BT - 7th International Conference on Spoken Language Processing, ICSLP 2002

PB - International Speech Communication Association

ER -