Paper Summary: Evaluation of sentence embeddings in downstream and linguistic probing tasks

What's in a Sentence Embedding?

A little while ago, I wrote an article on word embeddings as an introduction to why we use them and some of the different types out there. We can think of a sentence embedding as just the next level of abstraction: a numerical representation of a whole sentence!

But just as we don't want a simplistic numerical representation for words, we want our sentence representations to encapsulate rich meaning as well. We want them to be responsive to changes in word order, tense, and meaning.

This is a hard task! There are several ways to go about it, but recently the paper "Evaluation of sentence embeddings in downstream and linguistic probing tasks" took a stab at unravelling the question of "what is in a sentence embedding?" In it, the authors look at how different sentence representations perform not only on downstream tasks that would benefit from a sentence representation, but also on purely linguistic tasks (tasks that show whether the representation captures linguistic features and rules).

This paper is what we dissect in this post!

Different Sentence Representations

So what representations were evaluated?

  • ELMo (BoW, all layers, 5.5B): From AllenNLP, this is the pre-trained ELMo embedding. This was the English representation that was trained on the 5.5B word corpus (a combination of Wikipedia and the monolingual news crawl). ELMo has two biLSTM layers on top of a character-based token layer, and this representation concatenates all three 1024-dimensional layer outputs for a total of 3072 dimensions. ELMo is just a word embedding though, so the sentence representation was created by averaging the vectors of all words in the sentence.
  • ELMo (BoW, all layers, original): Another variation of ELMo, but just trained on the news crawl. Again, it has a dimensionality of 3072.
  • ELMo (BoW, top layer, original): Another variation of ELMo with just the final output. This embedding only has 1024 dimensions.
  • FastText (BoW, Common Crawl): From Facebook, we get FastText. My previous word embedding article talks about why FastText is pretty awesome, but since it's just a word embedding, it is turned into a sentence embedding by averaging all the word vectors. It has a dimensionality of 300.
  • GloVe (BoW, Common Crawl): GloVe, averaged together like other word embeddings. It has a dimensionality of 300.
  • Word2Vec (BoW, Google News): Word2Vec, averaged together. It has a dimensionality of 300.
  • p-mean (monolingual): A different way of averaging words together, p-mean is available through TensorFlow Hub. The dimensionality is huge though, reaching 3600.
  • Skip-Thought: The first actual sentence embedding we look at. Skip-Thought borrows the word2vec idea of predicting context, but at the sentence level: it predicts the surrounding sentences from the current sentence. It does this through an encoder-decoder architecture. This is our biggest representation, with 4800 dimensions.
  • InferSent (AllNLI): Another set of embeddings trained by Facebook, InferSent is trained on the task of natural language inference. The training data consists of sentence pairs, and a model needs to infer whether each pair is a contradiction, a neutral pairing, or an entailment. The output is an embedding of 4096 dimensions.
  • USE (DAN): Google's basic Universal Sentence Encoder (USE), built on a Deep Averaging Network (DAN), is available through TensorFlow Hub. It outputs vectors of 512 dimensions.
  • USE (Transformer): Finally, Google's heavy duty USE, based on the Transformer network. USE outputs vectors of 512 dimensions.

The models' training data and dimensionality are summarized here:

[Image: Sentence embedding list]
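
To make the "BoW" part of those names concrete, here is a minimal sketch of turning word vectors into a sentence vector by averaging. The toy 4-dimensional vectors and both function names are made up for illustration. The second function shows the general idea behind power means, under the assumption that we use the min, the ordinary mean, and the max (p = -inf, 1, +inf); it is not the exact p-mean configuration from the paper.

```python
import numpy as np

# Toy 4-dimensional word vectors (invented for illustration; real ones have 300+ dims).
WORD_VECTORS = {
    "the": np.array([0.1, 0.0, 0.2, 0.1]),
    "cat": np.array([0.9, 0.3, 0.0, 0.5]),
    "sat": np.array([0.2, 0.8, 0.4, 0.0]),
}

def bow_embedding(tokens, vectors):
    """Average the vectors of all known tokens (bag-of-words sentence embedding)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words: fall back to a zero vector of the right size.
        return np.zeros(len(next(iter(vectors.values()))))
    return np.mean(known, axis=0)

def power_mean_embedding(tokens, vectors):
    """Concatenate element-wise min, mean, and max over words (p = -inf, 1, +inf)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens in sentence")
    stacked = np.stack(known)
    return np.concatenate([stacked.min(axis=0), stacked.mean(axis=0), stacked.max(axis=0)])

sentence = "the cat sat".split()
avg = bow_embedding(sentence, WORD_VECTORS)
pm = power_mean_embedding(sentence, WORD_VECTORS)
print(avg.shape, pm.shape)  # (4,) (12,)
```

Two things to notice: shuffling the words gives exactly the same average, so a plain BoW embedding cannot react to word order, and concatenating several statistics multiplies the dimensionality, which is how a representation built from 300-dimensional vectors can grow into the thousands.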

Downstream Tasks

The downstream tasks in this paper are taken from the SentEval package. They fall into five groups of tasks where a good sentence embedding should be useful: binary and multi-class classification, entailment and semantic relatedness, semantic textual similarity, paraphrase detection, and caption-image retrieval.

The categories give insight into what kinds of tasks we are attempting to use these embeddings for (and if you're curious, you should check out the package). They cover all the classic tasks like sentiment analysis, question type analysis, sentence inference, and more.

The full list of classification tasks with examples can be seen here:

[Image: Classification tasks]

And semantic relatedness tasks can be seen here:

[Image: Semantic relatedness tasks]
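
For the semantic textual similarity group, a common way to score an embedding is to compare the cosine similarity of each sentence pair against human similarity ratings. Below is a rough sketch of that idea. The 2-dimensional "embeddings" and the human scores are invented for illustration, and I use Pearson correlation as the agreement measure.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sentence-pair embeddings, from most to least similar.
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.1])),
    (np.array([1.0, 0.0]), np.array([0.5, 0.5])),
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),
]
# Hypothetical human ratings for the same pairs (0 = unrelated, 5 = equivalent).
human_scores = np.array([4.8, 2.5, 0.3])

predicted = np.array([cosine_similarity(u, v) for u, v in pairs])
# Pearson correlation between model similarities and human ratings.
pearson = np.corrcoef(predicted, human_scores)[0, 1]
print(predicted.round(3), round(float(pearson), 3))
```

The closer the correlation is to 1, the better the embedding's notion of similarity lines up with human judgment. Note that no model is trained here: the embeddings are scored as they are.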

To evaluate the benefits of these embeddings, simple models were used. That means simple multi-layer perceptrons with 50 neurons, logistic regression, or other very basic models. No fancy CNNs here. No RNNs. Just basic models to see how these representations fare.
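
Here is a minimal sketch of what "a simple model on top of frozen embeddings" looks like. This is not the SentEval code: the "embeddings" are synthetic Gaussian clusters standing in for precomputed sentence vectors, and the logistic regression is a bare-bones NumPy version with hyperparameters I picked arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for precomputed sentence embeddings: two Gaussian clusters in 16 dims.
n, dim = 200, 16
labels = rng.integers(0, 2, size=n)
embeddings = rng.normal(size=(n, dim)) + 2.0 * labels[:, None]

def train_logistic_regression(x, y, lr=0.1, steps=500):
    """Fit weights and bias with plain gradient descent on the log loss."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = probs - y  # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# The embeddings stay fixed; only the small classifier is trained.
train_x, test_x = embeddings[:150], embeddings[150:]
train_y, test_y = labels[:150], labels[150:]
w, b = train_logistic_regression(train_x, train_y)
accuracy = float(((test_x @ w + b > 0).astype(int) == test_y).mean())
print(f"held-out accuracy: {accuracy:.2f}")
```

Keeping the classifier this simple is the point: if the held-out accuracy is high, the credit goes to the embedding rather than to a powerful model on top of it.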

Linguistic Tasks

Again taken from SentEval, 10 probing tasks were used to evaluate different linguistic properties of the sentence embeddings. These are pretty cool. They were:

  • Bigram Shift: Were two adjacent words in the sentence inverted?
  • Coordinate Inversion: Given two coordinate clauses, are they inverted?
  • Object Number: Is the object singular or plural?
  • Sentence Length
  • Semantic Odd Man Out: A random noun or verb may have been replaced. Detect whether it has been.
  • Subject Number: Is the subject singular or plural?
  • Past Tense: Is the main verb in the past or present tense?
  • Top-Constituents: Which class of top-level constituent sequence does the sentence's syntax tree have?
  • Depth of Syntactic Tree: How deep is the parsed syntax tree?
  • Word Content: Which one of one thousand words is encoded in the sentence?

The idea here was that a sentence embedding shouldn't just perform well on downstream tasks; it should also encode some of these key linguistic properties, as they will help yield a smart and explainable embedding!

This is all summarized in this table:

[Image: Linguistic probing tasks]
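
To give a feel for how a probing dataset is built, here is a small sketch that generates Bigram Shift examples. The function name and the 50/50 labeling scheme are my own assumptions, not the SentEval implementation. A probing classifier would then be trained to predict the label from the sentence embedding alone.

```python
import random

def make_bigram_shift_example(tokens, rng):
    """Return (tokens, label): label is 1 if two adjacent words were swapped, else 0."""
    tokens = list(tokens)
    # Only positions where the swap actually changes the sentence are usable.
    candidates = [i for i in range(len(tokens) - 1) if tokens[i] != tokens[i + 1]]
    if not candidates or rng.random() < 0.5:
        return tokens, 0  # leave the sentence intact
    i = rng.choice(candidates)
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens, 1

rng = random.Random(0)
original = "the quick brown fox jumps over the lazy dog".split()
for _ in range(3):
    shifted, label = make_bigram_shift_example(original, rng)
    print(label, " ".join(shifted))
```

An embedding that simply averages word vectors assigns the same vector to both versions of a sentence, so it cannot do better than chance on this task. That is exactly the kind of blind spot probing is meant to expose.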

And the Results???

Well... There was no clear winner!

ELMo performed best on 5 of the 9 tasks. USE did well on product review and question classification tasks. InferSent did well on paraphrase detection and entailment tasks.

Although p-mean didn't surpass top performers, it did surpass all baseline word embeddings like word2vec, GloVe, and FastText.

Here are the downstream results on classification:

[Image: Classification results]

And on semantic relatedness:

[Image: Semantic relatedness results]

Information retrieval was another big test. It is scored by how often the correct result shows up among the top n retrieved items, for some integer n. InferSent actually performed the best, though ELMo and p-mean were close contenders:

[Image: IR results]
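
The "correct result in the top n" score can be sketched in a few lines. This is a simplified illustration with made-up 2-dimensional vectors, assuming each query has exactly one correct candidate (the one at the same row index) and that candidates are ranked by cosine similarity. The function name `recall_at_n` is my own.

```python
import numpy as np

def recall_at_n(query_vecs, candidate_vecs, n):
    """Fraction of queries whose matching candidate (same row index) ranks in the top n."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = q @ c.T  # cosine similarity of every query to every candidate
    top_n = np.argsort(-sims, axis=1)[:, :n]  # indices of the n most similar candidates
    correct = np.arange(len(q))[:, None]
    return float((top_n == correct).any(axis=1).mean())

# Hypothetical embeddings: row i of `candidates` is the right answer for row i of `queries`.
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
candidates = np.array([[0.9, 0.1], [0.8, 0.6], [0.2, 1.0]])

for n in (1, 2):
    print(n, round(recall_at_n(queries, candidates, n), 3))
```

With these toy vectors only the first query ranks its own candidate first, so the score at n = 1 is one third, while every query finds its candidate within the top 2.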

As for linguistic probing, ELMo again fared well, though other representations were not that far behind. The results on all linguistic probing tasks can be seen here:

[Image: Linguistic probing results]

So what does this mean? Well, as the authors state, it means we aren't at a place where we truly have a solid universal sentence encoder. There is no sentence embedding that performs best on every task and there's still a lot we can learn through linguistic probing and testing.

To me, this means that the field is still ripe for exploring! We haven't yet achieved a fantastic sentence embedding, though these are all solid models. Now it's on us to get out there and try out our own sentence embedding ideas!

