TY - JOUR
T1 - Mandarin-English Information (MEI)
T2 - Investigating translingual speech retrieval
AU - Meng, Helen M.
AU - Chen, Berlin
AU - Khudanpur, Sanjeev
AU - Levow, Gina Anne
AU - Lo, Wai Kit
AU - Oard, Douglas
AU - Schone, Patrick
AU - Tang, Karen
AU - Wang, Hsin Min
AU - Wang, Jianqiang
N1 - Funding Information:
The MEI project was conducted during the Johns Hopkins University Summer Workshop 2000 (an NSF Workshop). We thank Erika Grams from Advanced Analytic Tools for her active participation. This work is supported by the NSF Grant No. IIS-00712125, Gina’s work was supported by the DARPA cooperative agreement N660010028910, and Berlin’s participation was supported by Academia Sinica (Taiwan), as well as the research grant (88-S-0128) from Professor Lin-Shan Lee of National Taiwan University. Continuation work beyond year 2000 was partially supported by a grant from the Research Grants Council of the Hong Kong SAR, China (Project No. 4223/01E). We thank the Linguistic Data Consortium for providing the TDT Corpora. We also thank Charles Wayne, George Doddington, James Allan, John Garofolo, Hsin-Hsi Chen, Richard Schwartz and Ralph Weischedel for their help. We are grateful to Fred Jelinek and his staff at CLSP for organizing the workshop.
PY - 2004/4
Y1 - 2004/4
N2 - This paper describes the Mandarin-English Information (MEI) project, where we investigated the problem of cross-language spoken document retrieval (CL-SDR), and developed one of the first English-Chinese CL-SDR systems. Our system accepts an entire English news story (text) as query, and retrieves relevant Chinese broadcast news stories (audio) from the document collection. Hence, this is a cross-language and cross-media retrieval task. We applied a multi-scale approach to our problem, which unifies the use of phrases, words and subwords in retrieval. The English queries are translated into Chinese by means of a dictionary-based approach, where we have integrated phrase-based translation with word-by-word translation. Untranslatable named entities are transliterated by a novel subword translation technique. The multi-scale approach can be divided into three subtasks - multi-scale query formulation, multi-scale audio indexing (by speech recognition) and multi-scale retrieval. Experimental results demonstrate that the use of phrase-based translation and subword translation gave performance gains, and multi-scale retrieval outperforms word-based retrieval.
AB - This paper describes the Mandarin-English Information (MEI) project, where we investigated the problem of cross-language spoken document retrieval (CL-SDR), and developed one of the first English-Chinese CL-SDR systems. Our system accepts an entire English news story (text) as query, and retrieves relevant Chinese broadcast news stories (audio) from the document collection. Hence, this is a cross-language and cross-media retrieval task. We applied a multi-scale approach to our problem, which unifies the use of phrases, words and subwords in retrieval. The English queries are translated into Chinese by means of a dictionary-based approach, where we have integrated phrase-based translation with word-by-word translation. Untranslatable named entities are transliterated by a novel subword translation technique. The multi-scale approach can be divided into three subtasks - multi-scale query formulation, multi-scale audio indexing (by speech recognition) and multi-scale retrieval. Experimental results demonstrate that the use of phrase-based translation and subword translation gave performance gains, and multi-scale retrieval outperforms word-based retrieval.
KW - English-Chinese cross-language spoken document retrieval
KW - Multi-scale spoken document retrieval
UR - http://www.scopus.com/inward/record.url?scp=12144286470&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=12144286470&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2003.09.003
DO - 10.1016/j.csl.2003.09.003
M3 - Article
AN - SCOPUS:12144286470
SN - 0885-2308
VL - 18
SP - 163
EP - 179
JO - Computer Speech and Language
JF - Computer Speech and Language
IS - 2
ER -