DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
Description:
We present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations, a self-supervised method for learning universal sentence embeddings that transfer to a wide variety of natural language processing (NLP) tasks. Our objective leverages recent advances in deep metric learning (DML) and has the advantage of being conceptually simple and easy to implement, requiring no specialized architectures or labelled training data. We demonstrate that our objective can be used to pretrain transformers to state-of-the-art performance on SentEval, a popular benchmark for evaluating universal sentence embeddings, outperforming existing supervised, semi-supervised and unsupervised methods. We perform extensive ablations to determine which factors contribute to the quality of the learned embeddings. Our code is publicly available and can be easily adapted to new datasets or used to embed unseen text.
Results on SentEval are presented below, alongside existing state-of-the-art methods. Scores are averages over the downstream and probing task test sets.
Model | Requires labelled data? | Parameters | Embed. dim. | Downstream (-SNLI) | Probing | Δ |
---|---|---|---|---|---|---|
InferSent V2 | Yes | 38M | 4096 | 76.00 | 72.58 | -3.06 |
Universal Sentence Encoder | Yes | 147M | 512 | 78.89 | 66.70 | -0.17 |
Sentence Transformers ("roberta-base-nli-mean-tokens") | Yes | 125M | 768 | 77.19 | 63.22 | -1.87 |
Transformer-small (DistilRoBERTa-base) | No | 82M | 768 | 72.58 | 74.57 | -6.48 |
Transformer-base (RoBERTa-base) | No | 125M | 768 | 72.70 | 74.19 | -6.36 |
DeCLUTR-small (DistilRoBERTa-base) | No | 82M | 768 | 77.41 | 74.71 | -1.65 |
DeCLUTR-base (RoBERTa-base) | No | 125M | 768 | 79.06 | 74.65 | -- |
Transformer-small and Transformer-base have the same underlying architecture and pretrained weights as DeCLUTR-small and DeCLUTR-base, respectively, before continued pretraining with our contrastive objective. Both the Transformer and DeCLUTR models use mean pooling on their token-level embeddings to produce a fixed-length sentence representation. Downstream scores are computed without considering performance on SNLI (denoted "Downstream (-SNLI)") because InferSent, Universal Sentence Encoder (USE) and Sentence Transformers all train on SNLI. Δ: difference from the DeCLUTR-base downstream score.
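To illustrate the mean pooling mentioned above, here is a minimal sketch in plain Python. It is not taken from the DeCLUTR codebase: the function name `mean_pool`, the use of a 0/1 attention mask to skip padding tokens, and the toy 2-dimensional embeddings are all assumptions made for this example.

```python
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the embeddings of unmasked tokens into one fixed-length vector.

    Hypothetical helper for illustration; masking out padding tokens is an
    assumption, not something the description above specifies.
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("token_embeddings and attention_mask must have equal length")

    # Keep only the vectors whose mask entry is non-zero (i.e. real tokens).
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one token must be unmasked")

    dim = len(kept[0])
    # Average each embedding dimension over the kept tokens.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: three tokens with 2-dimensional embeddings; the last is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

The output length equals the embedding dimension regardless of how many tokens the input has, which is what makes the representation fixed-length (768 for the models in the table).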