TY - JOUR
T1 - Enhanced Language Modeling with Proximity and Sentence Relatedness Information for Extractive Broadcast News Summarization
AU - Liu, Shih-Hung
AU - Chen, Kuan-Yu
AU - Chen, Berlin
N1 - Funding Information:
This research is supported in part by the Ministry of Science and Technology (MOST), Taiwan, under Grant Number MOST 107-2634-F-008-004 through Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan, and Grant Numbers MOST 105-2221-E-003-018-MY3, MOST 107-2221-E-003-013-MY2, and MOST 108-2221-E-003-005-MY3. Authors' addresses: S.-H. Liu and B. Chen (corresponding author), No. 88, Sec. 4, Tingzhou Rd., Wenshan Dist., Taipei City 11677, Taiwan (R.O.C.); emails: journey0621@gmail.com, berlin@csie.ntnu.edu.tw; K.-Y. Chen, No. 43, Sec. 4, Keelung Rd., Da-An Dist., Taipei City 106, Taiwan (R.O.C.); email: kychen@mail.ntust.edu.tw.
Publisher Copyright:
© 2020 ACM.
PY - 2020/2/7
Y1 - 2020/2/7
N2 - The primary task of extractive summarization is to automatically select a set of representative sentences from a text or spoken document that concisely convey the most important theme of the original document. Recently, language modeling (LM) has proven to be a promising framework for performing this task in an unsupervised manner. However, three fundamental challenges still face existing LM-based methods, which we set out to tackle in this article. The first is how to construct a more accurate sentence model in this framework without resorting to external sources of information. The second is how to take into account sentence-level structural relationships, in addition to word-level information within a document, for important sentence selection. The third is how to exploit the proximity cues inherent in sentences to obtain a more accurate estimation of the respective sentence models. For the first and second challenges, we explore a novel, principled approach that generates overlapped clusters to extract sentence relatedness information from the document to be summarized; this information can be used not only to enhance the estimation of various sentence models but also to capture sentence-level structural relationships within the document, leading to better summarization effectiveness. For the third challenge, we investigate several formulations of proximity cues for use in the sentence modeling involved in the LM-based summarization framework, free of the strict bag-of-words assumption. Furthermore, we present various ensemble methods that seamlessly integrate proximity and sentence relatedness information into sentence modeling. Extensive experiments on a Mandarin broadcast news summarization task show that such integration of proximity and sentence relatedness information is indeed beneficial for speech summarization. Our proposed summarization methods significantly boost the performance of a strong LM-based baseline (e.g., a relative ROUGE-2 improvement of up to 26.7%) and also outperform several state-of-the-art unsupervised methods compared in this article.
AB - The primary task of extractive summarization is to automatically select a set of representative sentences from a text or spoken document that concisely convey the most important theme of the original document. Recently, language modeling (LM) has proven to be a promising framework for performing this task in an unsupervised manner. However, three fundamental challenges still face existing LM-based methods, which we set out to tackle in this article. The first is how to construct a more accurate sentence model in this framework without resorting to external sources of information. The second is how to take into account sentence-level structural relationships, in addition to word-level information within a document, for important sentence selection. The third is how to exploit the proximity cues inherent in sentences to obtain a more accurate estimation of the respective sentence models. For the first and second challenges, we explore a novel, principled approach that generates overlapped clusters to extract sentence relatedness information from the document to be summarized; this information can be used not only to enhance the estimation of various sentence models but also to capture sentence-level structural relationships within the document, leading to better summarization effectiveness. For the third challenge, we investigate several formulations of proximity cues for use in the sentence modeling involved in the LM-based summarization framework, free of the strict bag-of-words assumption. Furthermore, we present various ensemble methods that seamlessly integrate proximity and sentence relatedness information into sentence modeling. Extensive experiments on a Mandarin broadcast news summarization task show that such integration of proximity and sentence relatedness information is indeed beneficial for speech summarization. Our proposed summarization methods significantly boost the performance of a strong LM-based baseline (e.g., a relative ROUGE-2 improvement of up to 26.7%) and also outperform several state-of-the-art unsupervised methods compared in this article.
KW - Extractive summarization
KW - language modeling
KW - overlapped clustering
KW - proximity information
UR - http://www.scopus.com/inward/record.url?scp=85083318082&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083318082&partnerID=8YFLogxK
U2 - 10.1145/3377407
DO - 10.1145/3377407
M3 - Article
AN - SCOPUS:85083318082
SN - 2375-4699
VL - 19
JO - ACM Transactions on Asian and Low-Resource Language Information Processing
JF - ACM Transactions on Asian and Low-Resource Language Information Processing
IS - 3
M1 - 3377407
ER -