Paper Summary: Evaluation of sentence embeddings in downstream and linguistic probing tasks

Breaking down a paper that broke down some of the pros and cons of different sentence embeddings.

What’s in a Sentence Embedding?

A little while ago, I wrote an article on word embeddings as an introduction to why we use word embeddings and some of the different types out there. We can think of a sentence embedding as just the next level of abstraction: a numerical representation of a whole sentence!
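
To make that concrete, here is a minimal sketch of the simplest possible sentence embedding: just average the word vectors. The toy vectors below are made up for illustration; in practice they would come from something like word2vec, GloVe, or FastText.

```python
import numpy as np

# Toy pretrained word vectors (stand-ins for word2vec/GloVe/FastText vectors).
word_vectors = {
    "the": np.array([0.1, 0.3, -0.2]),
    "cat": np.array([0.7, -0.1, 0.4]),
    "sat": np.array([-0.3, 0.5, 0.2]),
}

def average_embedding(sentence, vectors):
    """Simplest sentence embedding: average the word vectors.

    Note: averaging throws away word order, tense, everything sequential.
    """
    tokens = [t for t in sentence.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0)

print(average_embedding("The cat sat", word_vectors))
```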

But just like we don’t want to use a simple numerical representation for words, we want our sentence representations to encapsulate rich meaning as well. We want them to be responsive to changes in word order, tense, and meaning.

This is a hard task! There are a few ways to go about it, and the recent paper “Evaluation of sentence embeddings in downstream and linguistic probing tasks” takes a stab at unravelling the question of “what is in a sentence embedding?” In it, the authors look at how different sentence representations perform not only on downstream tasks that benefit from sentence-level representations but also on purely linguistic tasks (tasks that test whether the embedding intelligently captures linguistic features and rules).

This paper is what we dissect in this post!

Different Sentence Representations

So what representations were evaluated?

The models’ training and dimensionality are summarized here:

[Table: sentence embedding models, their training, and dimensionality]

Downstream Tasks

The downstream tasks in this paper are taken from the SentEval package, which features five groups of tasks identified as ones a good sentence embedding should help with: binary and multi-class classification, entailment and semantic relatedness, semantic textual similarity, paraphrase detection, and caption-image retrieval.

The categories give insight into the kinds of tasks we are attempting to use these embeddings for (and if you’re curious, you should check out the package). Between them, they cover all the classic tasks: sentiment analysis, question type classification, sentence inference, and more.
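
To make the setup concrete, here is a rough sketch of how SentEval drives an evaluation. The random-vector batcher is a stand-in for a real encoder, and the data path is hypothetical:

```python
import numpy as np
import senteval  # https://github.com/facebookresearch/SentEval

def prepare(params, samples):
    # Called once per task; a hook for building vocab or loading a model.
    return

def batcher(params, batch):
    # batch is a list of tokenized sentences; return one vector per sentence.
    # Stand-in: random 300-d vectors. A real run would call your encoder here.
    return np.random.rand(len(batch), 300)

params = {'task_path': 'data/senteval', 'usepytorch': False, 'kfold': 10}
se = senteval.engine.SE(params, batcher, prepare)

# A few of the classification and relatedness tasks mentioned above.
print(se.eval(['MR', 'TREC', 'SICKEntailment', 'SICKRelatedness']))
```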

The full list of classification tasks with examples can be seen here:

[Table: classification tasks with examples]

And semantic relatedness tasks can be seen here:

[Table: semantic relatedness tasks with examples]

To evaluate the benefits of these embeddings, simple models were used on top of them: multi-layer perceptrons with 50 neurons, logistic regression, and other very basic models. No fancy CNNs here. No RNNs. Just basic models, to see how the representations themselves fare.
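
In that spirit, here is a minimal sketch of the setup with scikit-learn: a frozen embedding function feeds a plain logistic regression classifier. The embed function, sentences, and labels are all placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def embed(sentences):
    # Placeholder: in a real experiment this would be ELMo, USE,
    # InferSent, p-mean, etc. The embeddings stay frozen; only the
    # classifier on top is trained.
    return np.random.rand(len(sentences), 512)

sentences = ["great movie", "terrible plot", "loved it", "fell asleep"]
labels = np.array([1, 0, 1, 0])  # toy sentiment labels

X = embed(sentences)
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, labels, cv=2))
```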

Linguistic Tasks

Again taken from SentEval, 10 probing tasks were conducted to evaluate different linguistic properties of the sentence embeddings. These are pretty cool. They were:

- SentLen: predict the length of the sentence
- WC (word content): recover which words appear in the sentence
- TreeDepth: predict the depth of the sentence’s syntactic parse tree
- TopConst: predict the top constituents of the parse tree
- BShift (bigram shift): detect whether two adjacent words have been swapped
- Tense: predict the tense of the main verb
- SubjNum: predict whether the subject of the main clause is singular or plural
- ObjNum: the same, for the direct object of the main clause
- SOMO (semantic odd man out): detect whether a word has been replaced with an unrelated one
- CoordInv (coordination inversion): detect whether two coordinate clauses have been swapped

The idea here was that a sentence embedding shouldn’t just perform well on downstream tasks; it should also encode these key linguistic properties, since they help yield a smart and explainable embedding!
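
As an illustrative sketch of how such a probe works (my toy reconstruction of the bigram-shift idea, not the paper’s code): corrupt some sentences by swapping adjacent words, then check whether a simple classifier can separate originals from corrupted ones using the frozen embeddings alone:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(sentences):
    # Placeholder frozen encoder; swap in ELMo, USE, InferSent, etc.
    rng = np.random.default_rng(0)
    return rng.random((len(sentences), 128))

def swap_first_bigram(sentence):
    # Corrupt a sentence by swapping its first two words (toy BShift).
    words = sentence.split()
    words[0], words[1] = words[1], words[0]
    return " ".join(words)

originals = ["the cat sat down", "she reads every night"]
corrupted = [swap_first_bigram(s) for s in originals]

X = embed(originals + corrupted)
y = np.array([0] * len(originals) + [1] * len(corrupted))

# If the probe beats chance on held-out data, the embedding
# must be encoding something about word order.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))
```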

This is all summarized in this table:

[Table: the 10 linguistic probing tasks]

And the Results???

Well… There was no clear winner!

ELMo performed best on 5 of the 9 tasks. USE did well on the product review and question classification tasks. InferSent did well on the paraphrase detection and entailment tasks.

Although p-mean didn’t surpass the top performers, it did surpass all the baseline word embeddings: word2vec, GloVe, and FastText.
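
Since p-mean keeps coming up, here is a minimal sketch of the idea behind it: generalize plain averaging by concatenating several power means of the word vectors. I use p = 1, +∞, and −∞ (mean, max, min) for simplicity; the paper’s exact set of p values may differ:

```python
import numpy as np

def power_mean_embedding(word_vectors):
    """Concatenate several power means of a sentence's word vectors.

    p = 1 is the ordinary mean, p -> +inf the element-wise max,
    p -> -inf the element-wise min. Concatenating them preserves
    more information than the mean alone.
    """
    W = np.asarray(word_vectors)           # shape: (num_words, dim)
    mean = W.mean(axis=0)                  # p = 1
    maximum = W.max(axis=0)                # p -> +inf
    minimum = W.min(axis=0)                # p -> -inf
    return np.concatenate([mean, maximum, minimum])  # shape: (3 * dim,)

# Toy 3-word sentence with 4-dimensional word vectors.
rng = np.random.default_rng(0)
print(power_mean_embedding(rng.random((3, 4))).shape)  # (12,)
```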

Here are the downstream results on classification:

[Table: downstream classification results]

And on semantic relatedness:

[Table: semantic relatedness results]

Information retrieval was another big test (scored by how often the correct result is retrieved among the top n candidates, for some small integer n). InferSent actually performed the best, though ELMo and p-mean were close contenders:

[Table: caption-image retrieval results]
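
That top-n scoring is simple enough to sketch. Here is a generic recall-at-k implementation over cosine similarities (not the paper’s evaluation script; the caption/image vectors are random stand-ins):

```python
import numpy as np

def recall_at_k(query_vecs, target_vecs, k=5):
    """Fraction of queries whose true match ranks in the top k by cosine similarity.

    Assumes query i's correct target is target i (e.g., caption i <-> image i).
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = q @ t.T                              # (num_queries, num_targets)
    top_k = np.argsort(-sims, axis=1)[:, :k]    # indices of the k best targets
    hits = [i in row for i, row in enumerate(top_k)]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
captions, images = rng.random((100, 64)), rng.random((100, 64))
print(recall_at_k(captions, images, k=5))
```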

As for linguistic probing, ELMo again fared well, though the other representations were not far behind. The results on all the linguistic probing tasks can be seen here:

[Table: linguistic probing results]

So what does this mean? Well, as the authors state, it means we aren’t yet at a place where we truly have a solid universal sentence encoder. No sentence embedding performs best on every task, and there’s still a lot we can learn through linguistic probing and testing.

To me, this means that the field is still ripe for exploring! We haven’t yet achieved a fantastic sentence embedding, though these are all solid models. Now it’s on us to get out there and try out our own sentence embedding ideas!

If you enjoyed this article or found it helpful in any way, why not pass along a dollar or two to help fund my machine learning education and research? Every dollar helps me get a little closer, and I’m forever grateful.