An Information Distillation Framework for Extractive Summarization

Kuan Yu Chen*, Shih Hung Liu, Berlin Chen, Hsin Min Wang


Research output: Contribution to journal › Article › peer-review

18 Citations (Scopus)


In the context of natural language processing, representation learning has emerged as an active research subject because of its excellent performance in many applications. Learning representations of words was a pioneering study in this line of research. However, paragraph (or sentence and document) embedding learning is more suitable for realistic tasks such as document summarization. Nevertheless, classic paragraph embedding methods infer the representation of a given paragraph from all of the words occurring in it. Consequently, frequently occurring stop words and function words may mislead the embedding learning process and produce a blurred paragraph representation. Motivated by these observations, the major contributions of this paper are threefold. First, we propose a novel unsupervised paragraph embedding method, named the essence vector (EV) model, which aims not only to distill the most representative information from a paragraph but also to exclude the general background information, producing a more informative low-dimensional vector representation for the paragraph of interest. Second, in view of the growing importance of spoken content processing, we propose an extension of the EV model, named the denoising essence vector (D-EV) model. The D-EV model inherits the advantages of the EV model and, in addition, infers a representation for a given spoken paragraph that is more robust to imperfect speech recognition. Third, we introduce a new summarization framework that takes both relevance and redundancy information into account simultaneously. We evaluate the proposed embedding methods (i.e., EV and D-EV) and the summarization framework on two benchmark summarization corpora. The experimental results demonstrate the effectiveness and applicability of the proposed framework relative to several well-practiced and state-of-the-art summarization methods.
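The two ideas in the abstract — removing general background information from a paragraph embedding, and selecting sentences by balancing relevance against redundancy — can be illustrated with a deliberately simplified sketch. This is not the EV/D-EV model or the authors' framework: the code below substitutes IDF-weighted word-vector averaging with mean (common-component) subtraction for the embedding step, and a greedy MMR-style selection for the summarizer. All function names and parameters (`embed`, `mmr_select`, `lam`) are hypothetical.

```python
import numpy as np

def embed(paragraphs, word_vecs, idf):
    """Embed each paragraph (a list of tokens) as an IDF-weighted average of
    word vectors, then subtract the corpus-wide mean as a crude stand-in for
    'general background information'. Illustrative only, not the EV model."""
    dim = len(next(iter(word_vecs.values())))
    mat = []
    for words in paragraphs:
        weighted = [idf.get(w, 0.0) * word_vecs[w] for w in words if w in word_vecs]
        mat.append(np.mean(weighted, axis=0) if weighted else np.zeros(dim))
    mat = np.vstack(mat)
    return mat - mat.mean(axis=0)  # remove the shared background component

def mmr_select(doc_vec, sent_vecs, k=3, lam=0.7):
    """Greedily pick k sentence indices, trading off relevance to the
    document (first term) against redundancy with sentences already
    chosen (second term) -- a standard MMR-style criterion."""
    def cos(a, b):
        n = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / n) if n else 0.0
    chosen, candidates = [], list(range(len(sent_vecs)))
    while candidates and len(chosen) < k:
        best = max(
            candidates,
            key=lambda i: lam * cos(sent_vecs[i], doc_vec)
            - (1 - lam) * max((cos(sent_vecs[i], sent_vecs[j]) for j in chosen),
                              default=0.0),
        )
        chosen.append(best)
        candidates.remove(best)
    return chosen

# Tiny demo: sentence 1 is most relevant; sentence 0 is nearly a duplicate
# of it, so with lam=0.5 the second pick is the dissimilar sentence 2.
sents = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
doc = np.array([1.0, 0.2])
print(mmr_select(doc, sents, k=2, lam=0.5))  # [1, 2]
```

The `lam` knob makes the relevance/redundancy trade-off explicit: `lam=1.0` ranks purely by similarity to the document, while smaller values penalize picking sentences that repeat what the summary already covers.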

Pages (from - to): 161-170
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Publication status: Published - Jan. 2018

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
