TY - GEN
T1 - Weighted matrix factorization for spoken document retrieval
AU - Chen, Kuan Yu
AU - Wang, Hsin Min
AU - Chen, Berlin
AU - Chen, Hsin Hsi
PY - 2013/10/18
Y1 - 2013/10/18
AB - As more and more multimedia data associated with spoken documents have become publicly available, spoken document retrieval (SDR) has become an important research subject over the past two decades. Recently, topic models have been used successfully in SDR as well as in general information retrieval (IR). These models fall into two categories: probabilistic topic models (PTM) and non-probabilistic topic models (NPTM). One major difference between them is that a PTM takes into account only the words occurring in a document, whereas an NPTM, such as latent semantic analysis (LSA), explicitly models all the words in the vocabulary (both occurring and non-occurring). We believe that the non-occurring words can provide additional information that is also useful for SDR. However, to the best of our knowledge, little work has investigated the effectiveness of non-occurring words for SDR and IR. To make effective use of the non-occurring words of documents for semantic analysis, we propose a weighted matrix factorization (WMF) framework in which the impact of the non-occurring words on the semantic analysis can be modulated properly. The results of SDR experiments conducted on the TDT-2 (Topic Detection and Tracking) collection highlight the performance merits of the proposed framework compared with several existing topic models.
KW - Spoken document retrieval
KW - non-occurring words
KW - non-probabilistic
KW - topic model
UR - http://www.scopus.com/inward/record.url?scp=84890530840&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84890530840&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2013.6639330
DO - 10.1109/ICASSP.2013.6639330
M3 - Conference contribution
AN - SCOPUS:84890530840
SN - 9781479903566
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 8530
EP - 8534
BT - 2013 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013 - Proceedings
T2 - 2013 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013
Y2 - 26 May 2013 through 31 May 2013
ER -