An Information Distillation Framework for Extractive Summarization

Kuan Yu Chen, Shih Hung Liu, Berlin Chen, Hsin Min Wang

Research output: Contribution to journal › Article

Abstract

In the context of natural language processing, representation learning has emerged as an active research area because of its excellent performance in many applications. Learning representations of words was a pioneering effort in this line of research; however, learning paragraph (i.e., sentence or document) embeddings is more suitable for realistic tasks such as document summarization. Classic paragraph embedding methods infer the representation of a given paragraph from all of the words occurring in it, so frequently occurring stop and function words can mislead the learning process and yield a blurred paragraph representation. Motivated by these observations, our major contributions in this paper are threefold. First, we propose a novel unsupervised paragraph embedding method, named the essence vector (EV) model, which aims not only to distill the most representative information from a paragraph but also to exclude its general background information, producing a more informative low-dimensional vector representation for the paragraph of interest. Second, in view of the increasing importance of spoken content processing, we propose an extension of the EV model, named the denoising essence vector (D-EV) model, which inherits the advantages of the EV model and can also infer a representation for a spoken paragraph that is robust to imperfect speech recognition. Third, we introduce a new summarization framework that takes both relevance and redundancy information into account simultaneously. We evaluate the proposed embedding methods (i.e., EV and D-EV) and the summarization framework on two benchmark summarization corpora. The experimental results demonstrate the effectiveness and applicability of the proposed framework in comparison with several widely used and state-of-the-art summarization methods.
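To make the two core ideas in the abstract more concrete, the short Python sketch below illustrates (i) stripping a shared background component out of paragraph vectors and (ii) greedily selecting summary sentences by weighing relevance against redundancy. This is a minimal, hypothetical illustration only: the mean-subtraction step and the MMR-style greedy loop are well-known stand-ins, not the paper's actual EV/D-EV models or summarization framework, and the function names and parameters (remove_background, select_summary, lam) are invented for this sketch.

import numpy as np

def remove_background(vectors, background=None):
    # Essence-style cleanup: subtract a shared "background" component from
    # each sentence/paragraph vector. The published EV model learns this
    # decomposition; plain mean subtraction is only a crude stand-in.
    if background is None:
        background = vectors.mean(axis=0)
    return vectors - background

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_summary(sent_vecs, doc_vec, k=3, lam=0.7):
    # Greedy extractive selection: score each candidate sentence by its
    # relevance to the document minus its redundancy with the sentences
    # already chosen (an MMR-style stand-in for the relevance/redundancy
    # trade-off described in the abstract; the paper's framework may differ).
    selected, candidates = [], list(range(len(sent_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(sent_vecs[i], doc_vec)
            redundancy = max((cosine(sent_vecs[i], sent_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(0)
raw = rng.normal(size=(10, 50))      # stand-in sentence embeddings
sent_vecs = remove_background(raw)   # background-removed, "essence"-like
doc_vec = raw.mean(axis=0)           # stand-in document embedding
print(select_summary(sent_vecs, doc_vec, k=3))

In practice, sent_vecs and doc_vec would come from a trained paragraph embedding model; in the paper, the learned EV and D-EV representations play that role.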

Language: English
Pages: 161-170
Number of pages: 10
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 26
Issue number: 1
DOI: 10.1109/TASLP.2017.2764545
Publication status: Published - 2018 Jan 1

Keywords

  • distilling
  • paragraph embedding
  • representation learning
  • summarization
  • unsupervised

ASJC Scopus subject areas

  • Signal Processing
  • Media Technology
  • Instrumentation
  • Acoustics and Ultrasonics
  • Linguistics and Language
  • Electrical and Electronic Engineering
  • Speech and Hearing

Cite this

An Information Distillation Framework for Extractive Summarization. / Chen, Kuan Yu; Liu, Shih Hung; Chen, Berlin; Wang, Hsin Min.

In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 1, 01.01.2018, p. 161-170.

Research output: Contribution to journal › Article

@article{f94ba294ec224a60bdcf83ff1f03c28b,
title = "An Information Distillation Framework for Extractive Summarization",
abstract = "In the context of natural language processing, representation learning has emerged as an active research area because of its excellent performance in many applications. Learning representations of words was a pioneering effort in this line of research; however, learning paragraph (i.e., sentence or document) embeddings is more suitable for realistic tasks such as document summarization. Classic paragraph embedding methods infer the representation of a given paragraph from all of the words occurring in it, so frequently occurring stop and function words can mislead the learning process and yield a blurred paragraph representation. Motivated by these observations, our major contributions in this paper are threefold. First, we propose a novel unsupervised paragraph embedding method, named the essence vector (EV) model, which aims not only to distill the most representative information from a paragraph but also to exclude its general background information, producing a more informative low-dimensional vector representation for the paragraph of interest. Second, in view of the increasing importance of spoken content processing, we propose an extension of the EV model, named the denoising essence vector (D-EV) model, which inherits the advantages of the EV model and can also infer a representation for a spoken paragraph that is robust to imperfect speech recognition. Third, we introduce a new summarization framework that takes both relevance and redundancy information into account simultaneously. We evaluate the proposed embedding methods (i.e., EV and D-EV) and the summarization framework on two benchmark summarization corpora. The experimental results demonstrate the effectiveness and applicability of the proposed framework in comparison with several widely used and state-of-the-art summarization methods.",
keywords = "distilling, paragraph embedding, representation learning, summarization, unsupervised",
author = "Chen, {Kuan Yu} and Liu, {Shih Hung} and Chen, {Berlin} and Wang, {Hsin Min}",
year = "2018",
month = "1",
day = "1",
doi = "10.1109/TASLP.2017.2764545",
language = "English",
volume = "26",
pages = "161--170",
journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
issn = "2329-9290",
publisher = "IEEE",
number = "1",

}

TY - JOUR

T1 - An Information Distillation Framework for Extractive Summarization

AU - Chen, Kuan Yu

AU - Liu, Shih Hung

AU - Chen, Berlin

AU - Wang, Hsin Min

PY - 2018/1/1

Y1 - 2018/1/1

N2 - In the context of natural language processing, representation learning has emerged as an active research area because of its excellent performance in many applications. Learning representations of words was a pioneering effort in this line of research; however, learning paragraph (i.e., sentence or document) embeddings is more suitable for realistic tasks such as document summarization. Classic paragraph embedding methods infer the representation of a given paragraph from all of the words occurring in it, so frequently occurring stop and function words can mislead the learning process and yield a blurred paragraph representation. Motivated by these observations, our major contributions in this paper are threefold. First, we propose a novel unsupervised paragraph embedding method, named the essence vector (EV) model, which aims not only to distill the most representative information from a paragraph but also to exclude its general background information, producing a more informative low-dimensional vector representation for the paragraph of interest. Second, in view of the increasing importance of spoken content processing, we propose an extension of the EV model, named the denoising essence vector (D-EV) model, which inherits the advantages of the EV model and can also infer a representation for a spoken paragraph that is robust to imperfect speech recognition. Third, we introduce a new summarization framework that takes both relevance and redundancy information into account simultaneously. We evaluate the proposed embedding methods (i.e., EV and D-EV) and the summarization framework on two benchmark summarization corpora. The experimental results demonstrate the effectiveness and applicability of the proposed framework in comparison with several widely used and state-of-the-art summarization methods.

AB - In the context of natural language processing, representation learning has emerged as an active research area because of its excellent performance in many applications. Learning representations of words was a pioneering effort in this line of research; however, learning paragraph (i.e., sentence or document) embeddings is more suitable for realistic tasks such as document summarization. Classic paragraph embedding methods infer the representation of a given paragraph from all of the words occurring in it, so frequently occurring stop and function words can mislead the learning process and yield a blurred paragraph representation. Motivated by these observations, our major contributions in this paper are threefold. First, we propose a novel unsupervised paragraph embedding method, named the essence vector (EV) model, which aims not only to distill the most representative information from a paragraph but also to exclude its general background information, producing a more informative low-dimensional vector representation for the paragraph of interest. Second, in view of the increasing importance of spoken content processing, we propose an extension of the EV model, named the denoising essence vector (D-EV) model, which inherits the advantages of the EV model and can also infer a representation for a spoken paragraph that is robust to imperfect speech recognition. Third, we introduce a new summarization framework that takes both relevance and redundancy information into account simultaneously. We evaluate the proposed embedding methods (i.e., EV and D-EV) and the summarization framework on two benchmark summarization corpora. The experimental results demonstrate the effectiveness and applicability of the proposed framework in comparison with several widely used and state-of-the-art summarization methods.

KW - distilling

KW - paragraph embedding

KW - representation learning

KW - summarization

KW - unsupervised

UR - http://www.scopus.com/inward/record.url?scp=85032284759&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85032284759&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2017.2764545

DO - 10.1109/TASLP.2017.2764545

M3 - Article

VL - 26

SP - 161

EP - 170

JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing

T2 - IEEE/ACM Transactions on Audio, Speech, and Language Processing

JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing

SN - 2329-9290

IS - 1

ER -