Idiap Research Institute
N-gram-Based Low-Dimensional Representation for Document Classification
Type of publication: Conference paper
Citation: Lebret_ICLR_2015
Publication status: Published
Booktitle: International Conference on Learning Representations
Year: 2015
Month: April
URL: https://arxiv.org/abs/1412.6277
Abstract: The bag-of-words (BOW) model is the common approach to classifying documents, where words are used as features for training a classifier. This generally involves a huge number of features. Some techniques, such as Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA), have been designed to summarize documents in a lower dimension with the least semantic information loss. Some semantic information is nevertheless always lost, since only words are considered. Instead, we aim to use information coming from n-grams to overcome this limitation, while remaining in a low-dimensional space. Many approaches, such as the Skip-gram model, provide good word vector representations very quickly. We propose to average these representations to obtain representations of n-grams. All n-grams are thus embedded in the same semantic space. K-means clustering can then group them into semantic concepts. The number of features is therefore dramatically reduced, and documents can be represented as bags of semantic concepts. We show that this model outperforms LSA and LDA on a sentiment classification task, and yields similar results to a traditional BOW model with far fewer features.
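The pipeline described in the abstract (average word vectors into n-gram vectors, cluster the n-grams with K-means, then represent each document as a histogram over clusters) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy vocabulary, random 4-dimensional vectors standing in for Skip-gram embeddings, and the tiny K-means routine are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy word vectors standing in for trained Skip-gram
# embeddings (vocabulary, dimension, and values are illustrative).
rng = np.random.default_rng(0)
vocab = ["good", "great", "bad", "awful", "movie", "plot"]
word_vec = {w: rng.normal(size=4) for w in vocab}

def ngram_vector(ngram):
    """An n-gram is represented as the average of its word vectors,
    so n-grams live in the same semantic space as single words."""
    return np.mean([word_vec[w] for w in ngram], axis=0)

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: returns a cluster label for each point."""
    r = np.random.default_rng(seed)
    centroids = points[r.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

# All bigrams of a toy document, embedded via averaging.
doc = ["good", "movie", "great", "plot"]
bigrams = list(zip(doc, doc[1:]))
points = np.stack([ngram_vector(b) for b in bigrams])

# Group n-grams into k "semantic concepts"; the document then becomes
# a bag of semantic concepts: a k-dimensional histogram of assignments,
# replacing a BOW vector whose size would be the n-gram vocabulary.
k = 2
labels = kmeans(points, k)
bag_of_concepts = np.bincount(labels, minlength=k)
print(bag_of_concepts)
```

The feature count for a classifier drops from the number of distinct n-grams (potentially millions) to k, the number of clusters.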
Projects Idiap
Authors Lebret, Rémi
Collobert, Ronan
  • Lebret_ICLR_2015.pdf