Learning to distill: The essence vector modeling framework

Kuan Yu Chen, Shih Hung Liu, Berlin Chen, Hsin Min Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

In natural language processing, representation learning has emerged as an active research subject because of its excellent performance in many applications. Learning representations of words was a pioneering study in this line of research. However, paragraph (or sentence and document) embedding learning is more suitable for some tasks, such as sentiment classification and document summarization. Nevertheless, as far as we are aware, relatively little work has focused on developing unsupervised paragraph embedding methods. Classic paragraph embedding methods infer the representation of a given paragraph by considering all of the words occurring in it. Consequently, frequently occurring stop or function words may mislead the embedding learning process and produce a blurred paragraph representation. Motivated by these observations, our contributions in this paper are twofold. First, we propose a novel unsupervised paragraph embedding method, the essence vector (EV) model, which aims not only to distill the most representative information from a paragraph but also to exclude the general background information, producing a more informative low-dimensional vector representation of the paragraph. We evaluate the proposed EV model on benchmark sentiment classification and multi-document summarization tasks; the experimental results demonstrate the effectiveness and applicability of the proposed embedding method. Second, in view of the increasing importance of spoken content processing, we propose an extension of the EV model, the denoising essence vector (D-EV) model. The D-EV model inherits the advantages of the EV model and can also infer a representation for a given spoken paragraph that is robust to imperfect speech recognition. The utility of the D-EV model is evaluated on a spoken document summarization task, confirming the practical merits of the proposed embedding method in relation to several well-practiced and state-of-the-art summarization methods.
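The EV model's central idea — keep the information specific to a paragraph while discarding what is shared with the whole corpus — can be illustrated with a minimal sketch. Everything below (the autoencoder-style architecture, the weight shapes, the `alpha` trade-off, the `essence_objective` name) is an illustrative assumption for exposition, not the authors' exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bag-of-words word distributions for a handful of paragraphs.
V, D, K = 50, 8, 5                      # vocab size, paragraphs, embedding dim
P = rng.random((D, V))
P /= P.sum(axis=1, keepdims=True)       # each row is a word distribution
bg = P.mean(axis=0)                     # corpus-wide "background" distribution

# Illustrative encoder/decoder weights (random init; training loop omitted).
W_enc = rng.normal(scale=0.1, size=(V, K))
W_dec = rng.normal(scale=0.1, size=(K, V))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def essence_objective(P, bg, alpha=0.5):
    """Reconstruct each paragraph's word distribution from its low-dimensional
    'essence' vector, while penalizing the embedding for also being able to
    reconstruct the shared background distribution (hypothetical objective)."""
    Z = P @ W_enc                        # essence vectors, shape (D, K)
    Q = softmax(Z @ W_dec)               # reconstructed word distributions
    recon = -(P * np.log(Q + 1e-12)).sum(axis=1).mean()
    bg_q = softmax((bg @ W_enc) @ W_dec)
    bg_recon = -(bg * np.log(bg_q + 1e-12)).sum()
    # Good essence vectors reconstruct their paragraph well (low recon)
    # but carry little background information (high bg_recon is rewarded).
    return recon - alpha * bg_recon

loss = essence_objective(P, bg)
```

A training loop would then minimize this objective over `W_enc` and `W_dec`; the D-EV extension would additionally train on noisy (ASR-corrupted) inputs so the inferred essence vector stays stable under recognition errors.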

Original language: English
Title of host publication: COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016
Subtitle of host publication: Technical Papers
Publisher: Association for Computational Linguistics, ACL Anthology
Pages: 358-368
Number of pages: 11
ISBN (Print): 9784879747020
Publication status: Published - 2016 Jan 1
Event: 26th International Conference on Computational Linguistics, COLING 2016 - Osaka, Japan
Duration: 2016 Dec 11 - 2016 Dec 16

Publication series

Name: COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016: Technical Papers

Conference

Conference: 26th International Conference on Computational Linguistics, COLING 2016
Country: Japan
City: Osaka
Period: 16/12/11 - 16/12/16


ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Language and Linguistics
  • Linguistics and Language

Cite this

Chen, K. Y., Liu, S. H., Chen, B., & Wang, H. M. (2016). Learning to distill: The essence vector modeling framework. In COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016: Technical Papers (pp. 358-368). (COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016: Technical Papers). Association for Computational Linguistics, ACL Anthology.

