CONF
Mai_ICLR2019_2019/IDIAP
CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model
Mai, Florian
Galke, Lukas
Scherp, Ansgar
Efficient training scheme
Sentence embedding
Text representation learning
word2vec
https://publications.idiap.ch/index.php/publications/showcite/Mai_Idiap-RR-06-2019
Related documents
International Conference on Learning Representations
New Orleans, Louisiana, USA
2019
https://openreview.net/forum?id=H1MgjoR9tQ
URL
Continuous Bag of Words (CBOW) is a powerful text embedding method. Due to its strong capabilities to encode word content, CBOW embeddings perform well on a wide range of downstream tasks while being efficient to compute. However, CBOW is not capable of capturing the word order. The reason is that the computation of CBOW's word embeddings is commutative, i.e., embeddings of XYZ and ZYX are the same. In order to address this shortcoming, we propose a learning algorithm for the Continuous Matrix Space Model, which we call Continual Multiplication of Words (CMOW). Our algorithm is an adaptation of word2vec, so that it can be trained on large quantities of unlabeled text. We empirically show that CMOW better captures linguistic properties, but it is inferior to CBOW in memorizing word content. Motivated by these findings, we propose a hybrid model that combines the strengths of CBOW and CMOW. Our results show that the hybrid CBOW-CMOW-model retains CBOW's strong ability to memorize word content while at the same time substantially improving its ability to encode other linguistic information by 8%. As a result, the hybrid also performs better on 8 out of 11 supervised downstream tasks with an average improvement of 1.2%.
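The order-invariance argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy embeddings in `VECTORS` and `MATRICES` and the helper names `cbow_encode`, `matmul`, and `cmow_encode` are hypothetical, chosen by hand for illustration. It assumes CBOW composes a sequence by summing word vectors (averaging would be equally order-invariant) and that CMOW composes it by multiplying per-word matrices in sequence order.

```python
from functools import reduce

# Hypothetical toy embeddings, hand-picked for illustration only.
VECTORS = {"X": [1.0, 0.0], "Y": [0.0, 1.0], "Z": [2.0, 3.0]}
MATRICES = {
    "X": [[1.0, 1.0], [0.0, 1.0]],
    "Y": [[1.0, 0.0], [1.0, 1.0]],
    "Z": [[2.0, 0.0], [0.0, 1.0]],
}


def cbow_encode(words):
    """Sum the word vectors element-wise; addition is commutative."""
    dim = len(next(iter(VECTORS.values())))
    total = [0.0] * dim
    for word in words:
        total = [t + v for t, v in zip(total, VECTORS[word])]
    return total


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def cmow_encode(words):
    """Multiply the word matrices left to right; the product depends on order."""
    return reduce(matmul, (MATRICES[word] for word in words))


forward, backward = ["X", "Y", "Z"], ["Z", "Y", "X"]
print(cbow_encode(forward) == cbow_encode(backward))  # same vector for both orders
print(cmow_encode(forward) == cmow_encode(backward))  # different matrices
```

With these toy values the summed vectors for XYZ and ZYX coincide, while the matrix products differ, which is the property the abstract relies on to motivate CMOW.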
REPORT
Mai_Idiap-RR-06-2019/IDIAP
CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model
Mai, Florian
Galke, Lukas
Scherp, Ansgar
Efficient training scheme
Sentence embedding
Text representation learning
word2vec
EXTERNAL
https://publications.idiap.ch/attachments/reports/2019/Mai_Idiap-RR-06-2019.pdf
PUBLIC
Idiap-RR-06-2019
2019
Idiap
July 2019
To appear at ICLR 2019
Continuous Bag of Words (CBOW) is a powerful text embedding method. Due to its strong capabilities to encode word content, CBOW embeddings perform well on a wide range of downstream tasks while being efficient to compute. However, CBOW is not capable of capturing the word order. The reason is that the computation of CBOW's word embeddings is commutative, i.e., embeddings of XYZ and ZYX are the same. In order to address this shortcoming, we propose a learning algorithm for the Continuous Matrix Space Model, which we call Continual Multiplication of Words (CMOW). Our algorithm is an adaptation of word2vec, so that it can be trained on large quantities of unlabeled text. We empirically show that CMOW better captures linguistic properties, but it is inferior to CBOW in memorizing word content. Motivated by these findings, we propose a hybrid model that combines the strengths of CBOW and CMOW. Our results show that the hybrid CBOW-CMOW-model retains CBOW's strong ability to memorize word content while at the same time substantially improving its ability to encode other linguistic information by 8%. As a result, the hybrid also performs better on 8 out of 11 supervised downstream tasks with an average improvement of 1.2%.
https://openreview.net/forum?id=H1MgjoR9tQ
URL