QuAC: Question Answering in Context

If you're in the field of natural language processing and you're not excited, you're about to be! This week, I happened to browse over to Arxiv Sanity Preserver as I normally do, only instead of being greeted by a barrage of GAN-this or CNN-that, my eyes fell upon two datasets that made me very happy. They were CoQA and QuAC, and in this blog we are going to talk about why QuAC is so exciting.

I also talk about CoQA here if you're interested in more question answering!

The Motivation behind QuAC

Why was QuAC created and what is it going to do?

The idea behind QuAC stems from the idea of a student asking questions of a teacher. A teacher has the knowledge of interest and the student must have a dialog with the teacher to get the information that they want to know.

The issue? Questions in this scenario are context-dependent, can be abstract, and might not even have an answer. It's up to the teacher to use the information they know to sift through all of this and give the best answer they can give.

This is what QuAC wishes to measure, placing machines in the role of teacher so that we may develop intelligent dialog agents.

It does so by compiling 100,000+ questions across 14,000+ dialogs.

QuAC Statistics

The Creation Process

How was QuAC created?

With the help of Amazon Mechanical Turk, a marketplace offering jobs that require human intelligence, such as labelling a dataset.

Two workers were assigned to a dialog. The teacher would be able to see a Wikipedia passage. The student would see the title of the article, the introduction paragraph, and the particular sub-heading. From there, the two would begin a dialog, going back and forth so that the student could learn about the article that the teacher has access to. The hope was that by not being able to see the article, the student's questions would naturally be lexically different from what the passage contained.

The teacher's answers must be straight from the passage, however. This results in a dataset that is extractive. It is easier to score, for one! It also makes evaluation less subjective and more reliable.
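Extractive answers are typically scored with token-overlap F1, the metric SQuAD popularized. A minimal sketch of the idea (not the official evaluation script, which also normalizes punctuation and articles):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted span and a gold span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the quick brown fox", "the brown fox"))  # 6/7 ≈ 0.857
```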

Dialogs continued until 12 questions were answered, one of the workers ended the dialog manually, or 2 unanswerable questions were asked in a row.

A QuAC Conversation

What Wikipedia Articles Were Used?

It requires less background knowledge to ask questions about people, so QuAC consists of only Wikipedia articles about people. These people cover a wide range of domains, but in the end, they are all people. The authors also made sure each article's subject had at least 100 incoming links.

Comparison to Other Datasets

QuAC is a unique dataset, for sure. It contains a wide variety of question types. It is context-dependent. It has unanswerable questions. It has longer answers. Luckily, QuAC summarizes how it compares to a lot of other datasets in a little chart for us:

QuAC Comparison

What's more, QuAC is 54% non-factoid. It is also 86% contextual, requiring some sort of co-reference resolution (figuring out which expressions refer to the same thing). This is largely an unsolved problem, and is difficult.

QuAC Distribution of Questions

The difficulty is compounded even further by the fact that some co-references refer to things in the article, while others refer to past dialog. And more than a tenth refer to a very abstract question of "what else?"

These are tough questions to answer, even for a human!

Co-Reference Resolution in QuAC

Baseline Performance on QuAC

QuAC is very much unsolved. Human performance is at an F1 of 81.1%, but the top performing baseline is far behind. It is a BiDAF++ with context, something the authors of QuAC designed, and it only achieves an F1 of 60.2%. That's a huge margin to improve upon. The rest of the tested baselines did a lot worse:

QuAC Performance

A Human-Based Evaluation

QuAC also comes with 2 human-based evaluations. If, for a given question, a machine gets an F1 equivalent to or higher than a human's, we say it reaches human equivalence (HEQ). Each question where it does so earns a HEQ-Q point; if it manages that for an entire dialog of questions, it earns a HEQ-D point. From these we can calculate a model's HEQ-Q and HEQ-D percentages.

By definition, human performance has a HEQ-Q and HEQ-D of 100%. Our top performing model currently has a HEQ-Q of 55.1% and a HEQ-D of 5.20%. We've got a long way to go! But when we reach and exceed that, our dialog agents will have gotten to a point where they have human-level performance in settings like this.
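The metric is simple enough to sketch in a few lines. Assuming we already have per-question machine and human F1 scores grouped by dialog (the pairing below is illustrative, not QuAC's official scorer):

```python
def heq_scores(dialogs):
    """dialogs: list of dialogs; each dialog is a list of
    (machine_f1, human_f1) pairs, one per question."""
    total_q = heq_q = heq_d = 0
    for dialog in dialogs:
        all_passed = True
        for machine_f1, human_f1 in dialog:
            total_q += 1
            if machine_f1 >= human_f1:  # matched or beat the human on this question
                heq_q += 1
            else:
                all_passed = False
        heq_d += all_passed             # whole dialog at human level or better
    return 100 * heq_q / total_q, 100 * heq_d / len(dialogs)

# One dialog fully matched, one dialog with a single miss:
print(heq_scores([[(0.9, 0.8), (1.0, 1.0)], [(0.5, 0.9), (0.7, 0.6)]]))  # (75.0, 50.0)
```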

And Now for a Challenge!

Will QuAC help us do to question answering what ImageNet did to image recognition? Only time will tell, but things are exciting for sure!

So now for a challenge for all of us (just like I posed with CoQA)! We all can push to try and tackle this challenge and take conversational agents to the next evolution. QuAC has a leaderboard which as of now is empty! It is our duty to go out there and try and work on this challenge, try and solve this problem, push knowledge forward, and maybe land on the leaderboard for a little bit of fame and a lot a bit of pride ;)

I'll see you all there!

CoQA: A Conversational Question Answering Challenge

This week has been a great week for natural language processing and I'm really excited (and you should be too)! This week, I happened to browse over to Arxiv Sanity Preserver as I normally do, only instead of being greeted by a barrage of GAN-this or CNN-that, my eyes fell upon two datasets that made me very happy. They were CoQA and QuAC, and in this blog we are going to talk about why CoQA is so exciting.

I also talk about QuAC here if you're interested in more question answering!

A Conversational Question Answering Challenge

What does a "conversational question answering challenge" even entail? Well, it requires not only reading a passage to uncover answers, but also having a conversation about that information. It requires using context clues in a dialog to uncover what is being asked for. And it requires that information to be rephrased in a way that is abstractive instead of extractive.


This dataset is unlike anything we've seen yet in question answering!

The Motivation

So why do we need a dataset like CoQA?

Well, simply because our current datasets just aren't doing it for us. Humans learn by interacting with one another, asking questions, understanding context, and reasoning about information. Chatbots... Well, chatbots don't do that. At least not now. And as a result, virtual agents are lacking and chatbots seem stupid (most of the time).

CoQA was created as a dataset to help us measure an algorithm's ability to participate in a question-answering dialog. Essentially, to measure how conversational a machine can be.

In their paper, the creation of CoQA came with three goals:

  • To create a dataset that features question-answer pairs that depend on a conversation history
      ◦ CoQA is the first dataset to come out to do this at a large scale!
      ◦ After the first question in a dialog, every question is dependent on the history of what was said
      ◦ There are 127,000 questions that span 8,000 dialogs!
  • To create a dataset that seeks natural answers
      ◦ Many question-answering datasets are extractive, searching for lexically similar responses and pulling them directly out of the text
      ◦ CoQA expects rationale to be extracted from the passage, but the answer itself to be re-phrased and abstracted
  • To ensure that question-answering systems perform well in a diverse set of domains
      ◦ CoQA has a train set that features 5 domains and a test set that features 7 domains (2 never seen in training!)

And while human performance has an F1 score of 88.8%, the top performing model only achieves an F1 of 65.1%. That is a big margin and a lot of catching up that machines have to do!

Creating CoQA

How was a dataset like this created?

With the help of Amazon Mechanical Turk, a marketplace offering jobs that require human intelligence, such as labelling a dataset. Two AMT workers would be paired together, one asking questions and the other giving answers. The answerer would be asked to highlight the portion of text in the passage that gives the answer, and then respond using different words than the passage used (thus becoming abstractive).

All this was done within a dialog between two workers. See the following example highlighted in the paper:

An example of a CoQA conversation

What Are the Domains of the Passages?

The passages come from 7 domains, 5 in the training set and 2 reserved for the test set. They are:

CoQA Domains

They also went out of their way to collect multiple answers to each question. Since answers are re-phrased, having several references gives dialog agents more opportunities to hit a correct answer and earn a fair score.
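Scoring against multiple references usually means taking the best match. A simplified sketch (the official CoQA scorer is more involved, averaging leave-one-out maxima; the `exact` scorer below is just a toy stand-in for token-level F1):

```python
def multi_ref_f1(prediction, references, f1):
    """Score a prediction against several reference answers by
    taking the best match (a simplification of CoQA's scheme)."""
    return max(f1(prediction, ref) for ref in references)

# Toy exact-match scorer standing in for a real token-level F1 function:
exact = lambda p, r: 1.0 if p == r else 0.0
print(multi_ref_f1("in the park", ["at the park", "in the park"], exact))  # 1.0
```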

Comparing to Other Datasets

What did we have before this?

Well... We had SQuAD, the Stanford Question Answering Dataset. Version 1.0 gave us 100,000+ questions on Wikipedia that required models to extract the correct portions of articles in response to questions. When it was decided that that wasn't enough, we got Version 2.0. This added 50,000+ unanswerable questions to SQuAD, now requiring algorithms not only to find correct answers but also to reason about whether an answer even exists in a passage.

SQuAD and CoQA Sizes

This task is still unsolved.

But it also doesn't solve the problems of abstracting information and understanding what is being asked for in a dialog. That's why CoQA was created, as outlined in its first goal!

More Analysis

The creators of CoQA did an analysis of SQuAD versus CoQA and found that while about half of SQuAD questions are "what" questions, CoQA has a much wider, more balanced distribution of types of questions. They gave us this cool graphic to depict that:

Question Distribution in SQuAD and CoQA

What makes CoQA even more difficult is that sometimes it just features one-word questions like "who?" or "where?" or even "why?"

These are totally context dependent! We as humans can't even begin to try to answer these without knowing the context that they were asked in!

SQuAD and CoQA Answer Types

A Linguistic Challenge

In case you haven't put it together yet, CoQA is very, very difficult. It is filled with things called co-references, which can be something as simple as a pronoun but, in general, occur when two or more expressions refer to the same thing (thus, co-references!).

This is something that is still an open problem in NLP (co-reference resolution, pronoun resolution, etc.), so to incorporate this step in a question-answering dataset certainly increases the difficulty some more.

About half the questions in CoQA contain an explicit co-reference (some indicator like him, it, her, that). Close to a fifth of the questions contain an implicit co-reference. This is when you ask a question like "where?" We are asking for something to be resolved, but it's implied. This is very hard for machines. Hell, it can even be hard for people sometimes!

Coreferences in CoQA

And Now for a Challenge!

How do current models hold up to this challenge? The answer is not well.

See for yourself:

Baseline CoQA Scores

Will CoQA help us do to question answering what ImageNet did to image recognition? Only time will tell, but things are exciting for sure!

So now for a challenge for all of us (just like I posed with QuAC)! We all can push to try and tackle this challenge and take conversational agents to the next evolution. CoQA has a leaderboard which as of now is empty! It is our duty to go out there and try and work on this challenge, try and solve this problem, push knowledge forward, and maybe land on the leaderboard for a little bit of fame and a lot a bit of pride ;)

I'll see you all there!

GAN Objective Functions: GANs and Their Variations

If you haven't already, you should definitely read my previous post about what a GAN is (especially if you don't know what I mean when I say GAN!). That post should give you a starting point to dive into the world of GANs and how they work. It's a solid primer for any article on GANs, not to mention this one, where we will discuss the objective functions of GANs and the variations of GANs currently out there that twist how they define their objectives to get different results.

Defining an Objective

In our introductory post, we talked about generative models. We discussed how the goal of a generative model is to come up with a way of matching their generated distribution to a real data distribution. Minimizing the distance between the two distributions is critical for creating a system that generates content that looks good, new, and like it is from the original data distribution.

But how do we measure the difference between our generated data distribution and our original data distribution? That's what we call an objective function and it is the focus of this article today! We are going to look at some variations of GANs to understand how we can alter the measure of the divergence between our generated data distribution and the actual distribution and the effect that that will have.

The Original GAN

The objective function of our original GAN is essentially the minimization of something called the Jensen Shannon Divergence (JSD). Specifically it is:

Jensen Shannon Divergence
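In symbols, one standard way to write it, with M the even mixture of the real distribution P_r and the generated distribution P_g:

```latex
JSD(P_r \,\|\, P_g) = \frac{1}{2} KL\left(P_r \,\|\, M\right) + \frac{1}{2} KL\left(P_g \,\|\, M\right),
\qquad M = \frac{1}{2}\left(P_r + P_g\right)
```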

The JSD is derived from the Kullback-Leibler Divergence (KLD) that we mentioned in the previous post.

We are already familiar with our friend, the original GAN. Instead of discussing it any further, let's just admire its performance in all its glory:

Original GAN Results

Wasserstein GAN

The Wasserstein GAN (WGAN) is a GAN you may have heard about, since it got a lot of attention. It got that attention for practical reasons (in general, the loss values returned when training a GAN don't mean anything, but with WGAN they do). So what made WGAN different?

WGAN doesn't use the JSD to measure divergence; instead, it uses something called the Earth-Mover (EM) distance (AKA the Wasserstein distance). The EM distance is defined as:

Wasserstein Metric

What does this mean?

EM Distance

Let's try and understand the intuition behind the EM distance. A probability distribution is essentially a collection of mass, with the distribution measuring the amount of mass at a given point. We give EM distance two distributions. Since the cost to move a mass a certain distance is equivalent to the product of the mass and the distance, the EM distance basically calculates the minimal cost of transforming one probability distribution into the other. This can be seen as the minimal effort needed.
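In one dimension this "minimal effort" has a simple closed form: sort both samples, pair them up, and average how far each unit of mass has to move. A minimal numpy sketch, assuming two equal-sized samples of equal weight:

```python
import numpy as np

def em_distance_1d(a, b):
    """1-D Earth-Mover distance between two equal-sized samples:
    the optimal transport plan pairs sorted values, so we just
    average the distance each unit of mass moves."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

print(em_distance_1d(np.array([0.0, 1.0]), np.array([2.0, 3.0])))  # 2.0
print(em_distance_1d(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```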

But why do we care? Well, we care about EM distance because it oftentimes measures a distance of a straight line for transforming one distribution to the other. This is helpful with gradients in optimization. Not to mention, there are also a set of functions that do not converge when distance is measured with something like KLD or JSD that do actually converge for the EM distance.

This is because EM distance has guarantees of continuity and differentiability, something that distance functions like KLD and JSD lack. We want these guarantees for a loss function, making EM distance better suited to our needs. More than that, everything that would converge under JSD or KLD also converges under EM distance. It's just that the EM distance encompasses that much more.

How is This Used?

Stepping away from all these thoughts about math and into the practical application of such things, how do we use this new distance when we can't directly calculate it? Well, we first take a critic function that is parameterized and train it to approximate the EM distance between our data distribution and our generated distribution. When we have achieved that, we have a good approximator for the EM distance. From there, we then optimize our generator function to reduce this EM distance.

In order to guarantee that our function lies in a compact space (this helps ensure that we meet the theoretical guarantees needed to do our calculations), we clip the weights that parametrize our critic function f.
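A toy sketch of one critic update with weight clipping, using a deliberately tiny linear critic f(x) = w·x on 1-D data (a real WGAN critic is a deep network trained in a framework; everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
w, c, lr = 0.0, 0.01, 0.05           # critic weight, clip bound, step size

for _ in range(100):
    real = rng.normal(2.0, 1.0, 64)  # samples from the data distribution
    fake = rng.normal(0.0, 1.0, 64)  # samples from the generator
    # Gradient ascent on E[f(real)] - E[f(fake)] for the linear critic f(x) = w*x
    grad = real.mean() - fake.mean()
    w = np.clip(w + lr * grad, -c, c)  # weight clipping keeps f Lipschitz

print(w)  # 0.01 — the weight saturates at the clip bound
```

Note how the weight immediately pins itself to the clip boundary; this is exactly the "pushed to the extremes" behavior that Improved WGAN (discussed below) criticizes.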

Just a side note: our critic function f is called a critic because it is not an explicit discriminator. A discriminator classifies its inputs as real or fake. The critic doesn't do that; it just approximates a distance score. However, it plays the discriminator's role in the traditional GAN framework, so it's worth highlighting how it is similar and how it is different.

Wasserstein Results

Key Take-Aways

  • Meaningful loss function
  • Easier debugging
  • Easier hyperparameter searching
  • Improved stability
  • Less mode collapse (when a generator just generates one thing over and over again... More on this later)
  • Theoretical optimization guarantees

Improved WGAN

With all those good things proposed with WGAN, what still needs to be improved? Well, Improved Training of Wasserstein GANs highlights just that.

WGAN got a lot of attention, people started using it, and the benefits were there. But people began to notice that despite all the things WGAN brought to the table, it still can fail to converge or produce pretty bad generated samples. The reasoning that Improved WGAN gives is that weight clipping is an issue. It does more harm than good in some situations. We noted that the reason why we weight clip has to do with maintaining the theoretical guarantees of the critic function. But in practice, what clipping actually does is encourage very simple critic functions that are pushed to the extremes of their boundaries. This is not good.

What Improved WGAN proposes instead is that you don't clip weights but rather add a penalization term to the norm of the gradient of the critic function. They found that this produces better results and, when plugged into a bunch of different GAN architectures, produces stable training.
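The penalized critic loss from the paper takes roughly this form, with λ the penalty coefficient and x̂ sampled along straight lines between real and generated points:

```latex
L = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
  - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
  + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\big)^2\Big]
```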

Key Take-Aways

  • Exactly WGAN, except no weight clipping
  • Weight regularization term to encourage theoretical guarantees

Least Squares GAN

LSGAN has a setup similar to WGAN. However, instead of learning a critic function, LSGAN learns a loss function. The loss for real samples should be lower than the loss for fake samples. This allows the LSGAN to put a high focus on fake samples that have a really high margin.

Like WGAN, LSGAN tries to restrict the domain of its function, but it takes a different approach instead of clipping. It introduces regularization in the form of weight decay, encouraging the weights of the function to lie within a bounded area that guarantees the theoretical needs.

Another point to note is that the loss function is set up more like the original GAN's, but where the original GAN uses a log loss, the LSGAN uses an L2 loss (which equates to minimizing the Pearson χ² divergence). The reason has to do with the fact that a log loss basically only cares about whether a sample is labeled correctly. It does not penalize a sample based on its distance from the correct classification; if a label is correct, it doesn't worry about it further. In contrast, L2 loss does care about distance: data far away from where it should be is penalized proportionally. What LSGAN argues is that this produces more informative gradients.
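To see the "proportional penalization" point concretely, compare the two losses on a sample the discriminator already scores confidently as real, but whose score has drifted far from the L2 target of 1 (a toy numerical sketch, not LSGAN's full objective):

```python
import numpy as np

def log_loss(score):
    """Sigmoid cross-entropy for a sample labeled 'real'."""
    return -np.log(1 / (1 + np.exp(-score)))

def l2_loss(score, target=1.0):
    """Least-squares loss against the 'real' target."""
    return (score - target) ** 2

# A confidently-correct sample far from the target of 1:
print(round(log_loss(5.0), 4))  # ≈ 0.0067: log loss is satisfied, almost no signal
print(l2_loss(5.0))             # 16.0: L2 still penalizes in proportion to distance
```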

LSGAN Results

Key Take-Aways

  • Loss function instead of a critic
  • Weight decay regularization to bound loss function
  • L2 loss instead of log loss for proportional penalization

Relaxed Wasserstein GAN

Relaxed Wasserstein GAN, or RWGAN for short, is another variation on the WGAN paper. The authors describe RWGAN as a happy medium between WGAN and Improved WGAN (WGAN-GP, as they cite it in the paper). Instead of symmetric clamping of weights (like in WGAN) or a gradient penalty (like proposed for Improved WGAN), RWGAN utilizes an asymmetric clamping strategy.

Beyond the specific GAN architecture they put forth, they also go on to describe a statistical class of divergences (dubbed Relaxed Wasserstein divergences, or RW divergences). RW divergences take the Wasserstein divergence from the WGAN paper and generalize it, outlining some key probabilistic properties needed to keep some of the theoretical guarantees of our GANs.

They specifically show that RWGAN parameterized with KL divergence is extremely competitive against other state-of-the-art GANs, but with better convergence properties than even the regular WGAN. They also open their framework up to defining new loss functions and thus new cost functions for designing a GAN scheme.

Relaxed WGAN Results

Key Take-Aways

  • Asymmetric clamping of weights
  • General RW divergence framework, excellent for designing new GAN schema, costs, and loss functions


Mean and Covariance Feature Matching GAN

The Mean and Covariance Feature Matching GAN (McGAN) is part of the same family of GANs that WGAN is. This family is dubbed the Integral Probability Metric (IPM) family. These GANs are the ones that use a critic architecture instead of an explicit discriminator.

The critic function for McGAN measures the mean or the covariance features of the generated data distribution and the target data distribution, which seems pretty straightforward looking at the name. The paper defines two ways of creating a critic function, one for the mean and one for the covariance, and demonstrates how to actually use them. Like WGAN, McGAN also uses clipping on its model, which ends up restricting the capacity of the model. No super eventful conclusions were drawn from this paper.

McGAN Results

Key Take-Aways

  • Mean and covariance measure of distance for a critic function

Generative Moment Matching Networks

Generative Moment Matching Networks (GMMN) focus on minimizing something called the maximum mean discrepancy (MMD). MMD is essentially the distance between the mean embeddings of two distributions, and we are trying to minimize that difference here. We can use something called the kernel trick, which lets us cheat and use a Gaussian kernel to calculate this distance.

They argue that this allows for a simple objective that can easily be trained with backpropagation, and produces competitive results with a standard GAN. They also showed how you could add an auto-encoder into the architecture of this GAN to ease the amount of training needed to accurately estimate the MMD.
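A hedged numpy sketch of the (biased) sample estimate of squared MMD with a Gaussian kernel; the function names and the σ=1 bandwidth are illustrative choices, not the paper's exact setup:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel matrix between rows of x and rows of y."""
    d = x[:, None, :] - y[None, :, :]
    return np.exp(-np.sum(d ** 2, axis=-1) / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased sample estimate of squared MMD between samples x and y."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean())

same = np.zeros((4, 2))
print(mmd2(same, same))                 # 0.0 — identical samples, no discrepancy
print(mmd2(same, np.ones((4, 2))) > 0)  # True — different samples, positive MMD
```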

An additional note: Though they claim competitive results, from what I've read elsewhere, it seems that their empirical results are often lacking. What's more, this model is fairly computationally heavy, so the computational resource and performance trade-off doesn't really seem to be there in my opinion.

GMMN Results

Key Take-Aways

  • Uses maximum mean discrepancy (MMD) as distance/objective function
  • No discriminator, just measures the distance between samples
  • Adds in an auto-encoder to help measure the MMD


Maximum Mean Discrepancy GAN

Maximum Mean Discrepancy GAN, or MMD GAN, is, you guessed it, an improvement on GMMN. Its major contribution is to not use static Gaussian kernels to calculate the MMD, instead using adversarial techniques to learn the kernels. It combines ideas from the original GAN and GMMN papers into a hybrid of the two. The benefits it claims are increases in performance and run time.

MMD GAN Results

Key Take-Aways

  • Iteration on GMMN: Adversarial learned kernels for estimating MMD

Cramer GAN

Cramer GAN starts by outlining an issue with the popular WGAN. It claims that there are three properties that a probability divergence should satisfy:

  • Sum invariance
  • Scale sensitivity
  • Unbiased sample gradients

Of these properties, they argue that the Wasserstein distance lacks the final property, unlike KLD or JSD which both have it. They demonstrate that this is actually an issue in practice, and propose a new distance: the Cramer distance.

The Cramer Distance

Now if we look at the Cramer distance, we can actually see it looks somewhat similar to the EM distance. However, due to its mathematical differences, it doesn't suffer from the biased sample gradients that the EM distance does. This is proven in the paper, if you really wish to dig into the mathematics of it.
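In one dimension the resemblance is easy to state: writing F_P and F_Q for the CDFs of the two distributions, the (squared) Cramer distance and the EM distance differ only in how they measure the gap between the CDFs (this is a simplified 1-D view, not the paper's full multivariate treatment):

```latex
\ell_2^2(P, Q) = \int_{-\infty}^{\infty} \big(F_P(x) - F_Q(x)\big)^2 \, dx,
\qquad
W_1(P, Q) = \int_{-\infty}^{\infty} \big|F_P(x) - F_Q(x)\big| \, dx
```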

Cramer GAN

Key Take-Aways

  • Cramer distance instead of EM distance
  • Improvement over WGAN: unbiased sample gradients

Fisher GAN

The Fisher GAN is yet another iteration on IPM GANs, claiming to surpass McGAN, WGAN, and Improved WGAN in a number of aspects. What it does is set up its objective function to have a critic with a data-dependent constraint on its second-order moment (AKA its variance).

Because of this objective the Fisher GAN boasts the following:

  • Training stability
  • Unconstrained capacity
  • Efficient computation

What makes Fisher GAN's distance different? It essentially measures what is called the Mahalanobis distance, which in simple terms is the distance between two points with correlated variables, relative to a centroid believed to be the mean of the multivariate data distribution. This actually assures that the generator and critic will be bounded as we desire. As the parameterized critic approaches infinite capacity, it estimates the Chi-square distance.

Fisher GAN

Key Take-Aways

  • Improvement above WGAN and other IPM GANs
  • Boasts training stability, unconstrained capacity, and efficient computation time
  • Chi-square distance objective

Energy Based GAN

Energy-Based GAN (EBGAN) is an interesting one in our collection of GANs here today. Instead of using a discriminator the way the original GAN does, it uses an autoencoder to estimate reconstruction loss. The steps to set this up:

  • Train an autoencoder on the original data
  • Now run generated images through this autoencoder
  • Poorly generated images will have awful reconstruction loss, and thus this now becomes a good measure
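The steps above can be sketched in a few lines. Everything here is a toy: `toy_ae` is a stand-in for a real autoencoder trained on the data, chosen just so that samples "near the data" reconstruct perfectly and samples far from it don't:

```python
import numpy as np

def reconstruction_energy(x, autoencoder):
    """Energy of a sample: mean squared reconstruction error.
    Low energy means the sample looks like the training data."""
    return np.mean((x - autoencoder(x)) ** 2)

# Toy "autoencoder": reconstructs anything in [-1, 1] perfectly,
# garbles everything outside it (stand-in for a trained model).
toy_ae = lambda x: np.clip(x, -1.0, 1.0)

in_distribution = np.array([0.2, -0.5])   # reconstructed well -> low energy
off_distribution = np.array([3.0, -4.0])  # reconstructed badly -> high energy
print(reconstruction_energy(in_distribution, toy_ae))   # 0.0
print(reconstruction_energy(off_distribution, toy_ae))  # 6.5
```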

This is a really cool approach to setting up the GAN, and with the right regularization to prevent mode collapse (the generator just producing the same sample over and over again), it seems to be fairly decent.

So why even do this? Well, what was empirically shown is that using the autoencoder in this fashion actually produces a GAN that is fast, stable, and robust to parameter changes. What's more, there isn't a need to try and pull a bunch of tricks to balance the training of the discriminator and the generator.

Key Take-Aways

  • Autoencoder as the discriminator
  • Reconstruction loss used as cost, setup similar to original GAN cost
  • Fast, stable, and robust

Boundary Equilibrium GAN

Boundary Equilibrium GAN (BEGAN) is an iteration on EBGAN. It instead uses the autoencoder reconstruction loss in a way that is similar to WGAN's loss function.

In order to do this, a parameter needs to be introduced to balance the training of the discriminator and generator. This parameter is weighted as a running mean over the samples, dancing at the boundary between improving the two halves (thus where it gets its name: "boundary equilibrium").


Key Take-Aways

  • Iteration of the EBGAN
  • Superficial resemblance of cost function to WGAN

Margin Adaptation GAN

Margin Adaptation GAN (MAGAN) is the last on our list. It is another variation of EBGAN. EBGAN has a margin as a part of its loss function to produce a hinge loss. What MAGAN does is reduce that margin monotonically over time, instead of keeping it constant. The result of this is that the discriminator will autoencode real samples better.

The result that we care about: better samples and more stability in training.


Key Take-Aways

  • Iteration on EBGAN
  • Adaptive margin in the hinge loss
  • More stability, better quality

Text Classification in Keras (Part 1)

The Tutorial Video

The Notebook

import keras
from keras.datasets import reuters


Using TensorFlow backend.
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)
word_index = reuters.get_word_index(path="reuters_word_index.json")
print('# of Training Samples: {}'.format(len(x_train)))
print('# of Test Samples: {}'.format(len(x_test)))
num_classes = max(y_train) + 1
print('# of Classes: {}'.format(num_classes))


# of Training Samples: 8982
# of Test Samples: 2246
# of Classes: 46
index_to_word = {}
for key, value in word_index.items():
    index_to_word[value] = key
print(' '.join([index_to_word[x] for x in x_train[0]]))


the wattie nondiscriminatory mln loss for plc said at only ended said commonwealth could 1 traders now april 0 a after said from 1985 and from foreign 000 april 0 prices its account year a but in this mln home an states earlier and rise and revs vs 000 its 16 vs 000 a but 3 psbr oils several and shareholders and dividend vs 000 its all 4 vs 000 1 mln agreed largely april 0 are 2 states will billion total and against 000 pct dlrs
from keras.preprocessing.text import Tokenizer
max_words = 10000
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(x_train[0])
print(y_train[0])


[0. 1. 0. ... 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.metrics_names)


['loss', 'acc']
batch_size = 32
epochs = 3
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.1)
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])


Train on 8083 samples, validate on 899 samples
Epoch 1/3
8083/8083 [==============================] - 13s 2ms/step - loss: 1.3051 - acc: 0.7192 - val_loss: 0.9643 - val_acc: 0.7931
Epoch 2/3
8083/8083 [==============================] - 12s 2ms/step - loss: 0.5136 - acc: 0.8841 - val_loss: 0.8800 - val_acc: 0.8165
Epoch 3/3
8083/8083 [==============================] - 13s 2ms/step - loss: 0.2873 - acc: 0.9344 - val_loss: 0.9045 - val_acc: 0.8065
2246/2246 [==============================] - 0s 175us/step
Test loss: 0.8878143835789586
Test accuracy: 0.7983081033478224

What is a Generative Adversarial Network?

What’s in a Generative Model?

Before we even think about starting to talk about Generative Adversarial Networks (GANs), it is worth asking the question of what’s in a generative model? Why do we even want to have such a thing? What is the goal? These questions can help seed our thought process to better engage with GANs.

So why do we want a generative model? Well, it’s in the name! We wish to generate something. But what do we wish to generate? Typically, we wish to generate data (I know, not very specific). More than that though, it is likely that we wish to generate data that is never before seen, yet still fits into some data distribution (i.e. some pre-defined data set that we have already set aside).

And the goal of such a generative model? To get so good at coming up with new generated content that we (or any system that is observing the samples) can no longer tell the difference between what is original and what is generated. Once we have a system that can do that much, we are free to begin generating up new samples that we haven’t even seen before, yet still are believably real data.

To step into things a little deeper, we want our generative model to be able to accurately estimate the probability distribution of our real data. Say our model is parameterized by W: we wish to find the W that maximizes the likelihood of the real samples. When we train our generative model, we find this ideal parameter W, minimizing the distance between our estimate of the data distribution and the actual data distribution.

A good measure of distance between distributions is the Kullback-Leibler divergence, and it can be shown that maximizing the log likelihood is equivalent to minimizing this distance. Taking our parameterized generative model and minimizing the distance between it and the actual data distribution is how we create a good generative model. It also brings us to a branching of two types of generative models.
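Written out (a standard derivation, with p_W our model's distribution and p_data the real one):

```latex
D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_W)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{data}}(x)\big]
  - \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_W(x)\big]
```

The first term does not depend on W at all, so minimizing the divergence over W is exactly the same as maximizing the second term, the expected log likelihood of real samples under our model.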

Explicit Distribution Generative Models

An explicit distribution generative model comes up with an explicitly defined generative model distribution. It then refines this explicitly defined, parameterized estimation through training on data samples. An example of an explicit distribution generative model is a Variational Auto-Encoder (VAE). VAEs require an explicitly assumed prior distribution and likelihood distribution to be given to them. They use these two components to come up with a “variational approximation” with which to evaluate how they are performing. Because of these requirements, VAEs fall into the explicit distribution camp.

Implicit Distribution Generative Models

Much like you may have already put together, implicitly distributed generative models do not require an explicit definition for their model distribution. Instead, these models train themselves by indirectly sampling data from their parameterized distribution. And as you may have also already guessed, this is what a GAN does.

Well, how exactly does it do that? Let’s dive a little bit into GANs and then we’ll start to paint that picture.


High-Level GAN Understanding

Generative Adversarial Networks have three components to their name. We’ve touched on the generative aspect and the network aspect is pretty self-explanatory. But what about the adversarial portion?

Well, GANs have two components to their network: a generator (G) and a discriminator (D). These two components come together in the network and work as adversaries, pushing the performance of one another.


The Generator

The generator is responsible for producing fake examples of data. It takes as input some latent variable (which we will refer to as z) and outputs data that is of the same form as data in the original data set.

Latent variables are hidden variables. When talking about GANs, we have this notion of a “latent space” that we can sample from. We can continuously slide through this latent space which, when you have a well-trained GAN, will have substantial (and oftentimes somewhat interpretable) effects on the output.

If our latent variable is z and our target variable is x, we can think of the generator network as learning a function that maps from z (the latent space) to x (hopefully, the real data distribution).
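As a toy sketch of that mapping (the sizes and the linear form are made up purely for illustration; a real generator is a deep network trained adversarially), z → x is just a function from latent vectors to data-shaped outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: a 2-D latent space, a 4-D "data" space.
latent_dim, data_dim = 2, 4

# A deliberately tiny, untrained linear "generator": G(z) = z @ W + b.
W = rng.normal(size=(latent_dim, data_dim))
b = rng.normal(size=data_dim)

def generator(z):
    # Maps a batch of latent vectors into data space.
    return z @ W + b

# Sample a batch of latent vectors and push them through.
z = rng.normal(size=(3, latent_dim))
fake_samples = generator(z)
print(fake_samples.shape)  # (3, 4)
```

Training would adjust W and b (or, realistically, the weights of a deep network) so that these outputs become indistinguishable from real data.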

The Discriminator

The discriminator’s role is to discriminate. It is responsible for taking in a list of samples and coming up with a prediction for whether a given sample is real or fake. The discriminator will output a higher probability if it believes a sample is real.

We can think of our discriminator as a “bullshit detector” of sorts.


Adversarial Competition

These two components come together and battle it out. The generator and discriminator oppose one another, trying to maximize opposite goals: The generator wants to push to create samples that look more and more real and the discriminator wishes to always correctly classify where a sample comes from.

The fact that these goals are directly opposite one another is where GANs get the adversarial portion of their name.


Painting an Elaborate Metaphor

Who doesn’t love a good metaphor to learn to understand a concept?

Art Forgery

My favorite metaphor from when I was first learning about GANs was the forger versus critic metaphor. In this metaphor, our generator is a criminal trying to forge art, whereas our discriminator is an art critic who is supposed to be able to correctly identify whether a piece is forged or authentic.

The two go back and forth, directly in opposition to one another, each trying to one-up the other, because their jobs depend on it.


False Money

What if, instead of an art forgery task, we had a criminal trying to make fake money and a bank intern trying to make sure the bank does not accept any of it?

Maybe in the beginning the criminal is very bad. They come in and try to hand the intern a piece of paper with a dollar bill drawn on it in crayon. This is obviously a fake dollar. But maybe the intern is really bad at their job as well and struggles to figure out whether it is actually fake. Both will learn a lot from their first interaction. Come the next day, when the criminal comes in, their fake money will be a bit harder to identify.


Day in and day out of this activity, the two go back and forth and become really good at their jobs. However, at a certain point, there may come a day when the two reach a sort of equilibrium. From there, the criminal’s fake dollars become so realistic that not even a seasoned expert could begin to tell whether they are fake or real.


That is the day the bank intern gets fired.

It’s also the day that we can utilize this criminal of ours and get very rich!


The previous two examples have been very visually focused. But what about an example that’s a little different?

Let’s say our generator is our pet parrot and our discriminator is our younger brother. Each day, we sit behind a curtain and our parrot sits behind another. Our parrot is going to try and mimic our voice to fool our younger brother. If he’s successful, we give him a treat. If our brother correctly guesses which curtain we are behind, we give our brother a treat instead (hopefully a different one than we give to our parrot).

Maybe in the beginning, the parrot is really bad at mimicking our voice. But day after day of practice, our parrot may be able to develop the skills to perfectly mirror our voice. At that point, we’ve trained our parrot to talk exactly like us and we can become internet famous.



The Math Behind the Monster

Before we wrap up this introduction to GANs, it is worth exploring the mathematics behind a GAN in a little bit of detail. GANs have a goal of finding equilibrium between the two halves of their network by solving the following minimax equation:
min_G max_D V(D, G) = E_{x ~ p_data(x)}[ log D(x) ] + E_{z ~ p_z(z)}[ log(1 - D(G(z))) ]


We call this equation our minimax equation because we are trying to jointly optimize two parameterized networks, G and D, to find an equilibrium between the two. We wish to maximize the confusion of D while minimizing the failures of G. When solved, our parameterized, implicit, generative data distribution should match the underlying original data distribution fairly well.

To break down the portions of our equation even more, let’s analyze it a bit. From the side of D, it wants to maximize this equation: when a real sample comes in, it wants to maximize its output, and when a fake sample comes in, it wants to minimize it. That is essentially where the two halves of the equation come from. On the flip side, G is trying to trick D into maximizing its output when it is handed a fake sample. That is why D tries to maximize the equation while G tries to minimize it.

This minimizing and maximizing is where we get the term minimax.

Now, assuming that G and D are well parameterized and thus have enough capacity to learn, this minimax equation can help us reach the Nash equilibrium between the two. This is ideal.

How Do We Achieve This?

Simple: We just iterate back and forth.

Just kidding. It’s not really simple. But we can outline it fairly simply.

To start, we will first train D to be an optimal classifier on a fixed version of G. From there, we fix D and train G to best fool a fixed D. By iterating back and forth, we can optimize our minimax equation to the point where D can no longer differentiate between real and fake samples because our generative data distribution is more or less indistinguishable from the actual data distribution. At this point, D will output a 50% probability for every sample it encounters.
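We can sanity-check that last claim numerically. For a fixed G, the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)); the toy 1-D Gaussian example below (my own illustration, not from the derivation above) shows D* collapsing to 0.5 once the generated distribution matches the real one:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of a 1-D Gaussian, evaluated pointwise.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-3, 3, 7)

# While the generator is still bad (wrong mean), D* is confident
# about which samples are real away from the overlap region.
p_data = gaussian_pdf(x, 0.0, 1.0)
p_g_bad = gaussian_pdf(x, 2.0, 1.0)
d_star = p_data / (p_data + p_g_bad)
print(np.round(d_star, 2))

# Once p_g matches p_data exactly, D* is 0.5 everywhere:
# the discriminator can do no better than a coin flip.
p_g_good = gaussian_pdf(x, 0.0, 1.0)
d_star_eq = p_data / (p_data + p_g_good)
print(np.round(d_star_eq, 2))  # every entry is 0.5
```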

Natural Language Processing: Count Vectorization with scikit-learn

Count Vectorization (AKA One-Hot Encoding)

If you haven’t already, check out my previous blog post on word embeddings: Introduction to Word Embeddings

In that blog post, we talk about a lot of the different ways we can represent words to use in machine learning. It’s a high level overview that we will expand upon here and check out how we can actually use count vectorization on some real text data.

A Recap of Count Vectorization


Today, we will be looking at one of the most basic ways we can represent text data numerically: one-hot encoding (or count vectorization). The idea is very simple.

We will be creating vectors that have a dimensionality equal to the size of our vocabulary, and if the text data features that vocab word, we will put a one in that dimension. Every time we encounter that word again, we will increase the count, leaving 0s everywhere we did not find the word even once.

The result will be very large vectors if we use them on real text data; however, we will get very accurate counts of the word content of our text data. Unfortunately, this won’t provide us with any semantic or relational information, but that’s okay, since that’s not the point of using this technique.

Today, we will be using the CountVectorizer class from scikit-learn.

A Basic Example

Here is a basic example of using count vectorization to get vectors:

from sklearn.feature_extraction.text import CountVectorizer
# To create a Count Vectorizer, we simply need to instantiate one.
# There are special parameters we can set here when making the vectorizer, but
# for the most basic example, it is not needed.
vectorizer = CountVectorizer()
# For our text, we are going to take some text from our previous blog post
# about count vectorization
sample_text = ["One of the most basic ways we can numerically represent words "
               "is through the one-hot encoding method (also sometimes called "
               "count vectorizing)."]
# To actually learn the vocabulary, we simply need to call fit on the text
# data that we wish to fit to
vectorizer.fit(sample_text)
# Now, we can inspect how our vectorizer vectorized the text
# This will print out a list of words used, and their index in the vectors
print('Vocabulary: ')
print(vectorizer.vocabulary_)
# If we would like to actually create a vector, we can do so by passing the
# text into the vectorizer to get back counts
vector = vectorizer.transform(sample_text)
# Our final vector:
print('Full vector: ')
print(vector.toarray())
# Or if we wanted to get the vector for one word:
print('Hot vector: ')
print(vectorizer.transform(['hot']).toarray())
# Or if we wanted to get multiple vectors at once to build matrices
print('Hot and one: ')
print(vectorizer.transform(['hot', 'one']).toarray())
# We could also do the whole thing at once with the fit_transform method:
print('One swoop:')
new_text = ['Today is the day that I do the thing today, today']
new_vectorizer = CountVectorizer()
print(new_vectorizer.fit_transform(new_text).toarray())

Our output:

{'one': 12, 'of': 11, 'the': 15, 'most': 9, 'basic': 1, 'ways': 18, 'we': 19,
'can': 3, 'numerically': 10, 'represent': 13, 'words': 20, 'is': 7,
'through': 16, 'hot': 6, 'encoding': 5, 'method': 8, 'also': 0,
'sometimes': 14, 'called': 2, 'count': 4, 'vectorizing': 17}
Full vector:
[[1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1]]
Hot vector:
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Hot and one:
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]]
One swoop:
[[1 1 1 1 2 1 3]]

Using It on Real Data

So let’s use it on some real data! We will check out the 20 News Group dataset that comes with scikit-learn.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
# Create our vectorizer
vectorizer = CountVectorizer()
# Let's fetch all the possible text data
newsgroups_data = fetch_20newsgroups()
# Why not inspect a sample of the text data?
print('Sample 0: ')
print(newsgroups_data.data[0])
# Fit the vectorizer to the text data
vectorizer.fit(newsgroups_data.data)
# Let's look at the vocabulary:
print('Vocabulary: ')
print(vectorizer.vocabulary_)
# Converting our first sample into a vector
v0 = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print('Sample 0 (vectorized): ')
print(v0)
# It's too big to even see...
# What's the length?
print('Sample 0 (vectorized) length: ')
print(len(v0))
# How many words does it have?
print('Sample 0 (vectorized) sum: ')
print(np.sum(v0))
# What if we wanted to go back to the source?
print('To the source:')
print(vectorizer.inverse_transform([v0]))
# So all this data has a lot of extra garbage... Why not strip it away?
newsgroups_data = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
# Why not inspect a sample of the text data?
print('Sample 0: ')
print(newsgroups_data.data[0])
# Refit the vectorizer on the cleaned text
vectorizer = CountVectorizer()
vectorizer.fit(newsgroups_data.data)
# Let's look at the vocabulary:
print('Vocabulary: ')
print(vectorizer.vocabulary_)
# Converting our first sample into a vector
v0 = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print('Sample 0 (vectorized): ')
print(v0)
# It's too big to even see...
# What's the length?
print('Sample 0 (vectorized) length: ')
print(len(v0))
# How many words does it have?
print('Sample 0 (vectorized) sum: ')
print(np.sum(v0))
# What if we wanted to go back to the source?
print('To the source:')
print(vectorizer.inverse_transform([v0]))

Our output:

   Sample 0:
   From: lerxst@wam.umd.edu (where's my thing)
   Subject: WHAT car is this!?
   Nntp-Posting-Host: rac3.wam.umd.edu
   Organization: University of Maryland, College Park
   Lines: 15

   I was wondering if anyone out there could enlighten me on this car I saw
   the other day. It was a 2-door sports car, looked to be from the late 60s/
   early 70s. It was called a Bricklin. The doors were really small. In addition,
   the front bumper was separate from the rest of the body. This is
   all I know. If anyone can tellme a model name, engine specs, years
   of production, where this car is made, history, or whatever info you
   have on this funky looking car, please e-mail.

   - IL
   ---- brought to you by your neighborhood Lerxst ----

   {'from': 56979, 'lerxst': 75358, 'wam': 123162, 'umd': 118280, 'edu': 50527,
   'where': 124031, 'my': 85354, 'thing': 114688, 'subject': 111322,
   'what': 123984, 'car': 37780, 'is': 68532, 'this': 114731, 'nntp': 87620,
   'posting': 95162, 'host': 64095, 'rac3': 98949, 'organization': 90379,
   'university': 118983, 'of': 89362, 'maryland': 79666,
   'college': 40998, ... } (Abbreviated...)

   Sample 0 (vectorized):
   [0 0 0 ... 0 0 0]

   Sample 0 (vectorized) length:

   Sample 0 (vectorized) sum:

   To the source:
   [array(['15', '60s', '70s', 'addition', 'all', 'anyone', 'be', 'body',
      'bricklin', 'brought', 'bumper', 'by', 'called', 'can', 'car',
      'college', 'could', 'day', 'door', 'doors', 'early', 'edu',
      'engine', 'enlighten', 'from', 'front', 'funky', 'have', 'history',
      'host', 'if', 'il', 'in', 'info', 'is', 'it', 'know', 'late',
      'lerxst', 'lines', 'looked', 'looking', 'made', 'mail', 'maryland',
      'me', 'model', 'my', 'name', 'neighborhood', 'nntp', 'of', 'on',
      'or', 'organization', 'other', 'out', 'park', 'please', 'posting',
      'production', 'rac3', 'really', 'rest', 'saw', 'separate', 'small',
      'specs', 'sports', 'subject', 'tellme', 'thanks', 'the', 'there',
      'thing', 'this', 'to', 'umd', 'university', 'wam', 'was', 'were',
      'what', 'whatever', 'where', 'wondering', 'years', 'you', 'your'],

   Sample 0:
   I was wondering if anyone out there could enlighten me on this car I saw
   the other day. It was a 2-door sports car, looked to be from the late 60s/
   early 70s. It was called a Bricklin. The doors were really small. In addition,
   the front bumper was separate from the rest of the body. This is
   all I know. If anyone can tellme a model name, engine specs, years
   of production, where this car is made, history, or whatever info you
   have on this funky looking car, please e-mail.

   {'was': 95844, 'wondering': 97181, 'if': 48754, 'anyone': 18915, 'out': 68847,
   'there': 88638, 'could': 30074, 'enlighten': 37335, 'me': 60560, 'on': 68080,
   'this': 88767, 'car': 25775, 'saw': 80623, 'the': 88532, 'other': 68781,
   'day': 31990, 'it': 51326, 'door': 34809, 'sports': 84538, 'looked': 57390,
   'to': 89360, 'be': 21987, 'from': 41715, 'late': 55746, '60s': 9843,
   'early': 35974, '70s': 11174, 'called': 25492, 'bricklin': 24160, 'doors': 34810,
   'were': 96247, 'really': 76471, ... } (Abbreviated...)

   Sample 0 (vectorized):
   [0 0 0 ... 0 0 0]

   Sample 0 (vectorized) length:

   Sample 0 (vectorized) sum:

   To the source:
   [array(['60s', '70s', 'addition', 'all', 'anyone', 'be', 'body',
      'bricklin', 'bumper', 'called', 'can', 'car', 'could', 'day',
      'door', 'doors', 'early', 'engine', 'enlighten', 'from', 'front',
      'funky', 'have', 'history', 'if', 'in', 'info', 'is', 'it', 'know',
      'late', 'looked', 'looking', 'made', 'mail', 'me', 'model', 'name',
      'of', 'on', 'or', 'other', 'out', 'please', 'production', 'really',
      'rest', 'saw', 'separate', 'small', 'specs', 'sports', 'tellme',
      'the', 'there', 'this', 'to', 'was', 'were', 'whatever', 'where',
      'wondering', 'years', 'you'], dtype='<U81')]

Now What?

So, you may be wondering: what now? We know how to vectorize these things based on counts, but what can we actually do with any of this information?

Well, for one, we could do a bunch of analysis. We could look at term frequency, we could remove stop words, we could visualize things, and we could try to cluster. Now that we have these numeric representations of this textual data, there is so much we can do that we couldn’t do before!

But let’s make this more concrete. We’ve been using this text data from the 20 News Group dataset. Why not use it on a task?

The 20 News Group dataset is a dataset of posts on a board, split up into 20 different categories. Why not use our vectorization to try and categorize this data?

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# Create our vectorizer
vectorizer = CountVectorizer()
# All data
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'))
# Get the training vectors
vectors = vectorizer.fit_transform(newsgroups_train.data)
# Build the classifier
clf = MultinomialNB(alpha=.01)
#  Train the classifier
clf.fit(vectors, newsgroups_train.target)
# Get the test vectors
vectors_test = vectorizer.transform(newsgroups_test.data)
# Predict and score the vectors
pred = clf.predict(vectors_test)
acc_score = metrics.accuracy_score(newsgroups_test.target, pred)
f1_score = metrics.f1_score(newsgroups_test.target, pred, average='macro')
print('Total accuracy classification score: {}'.format(acc_score))
print('Total F1 classification score: {}'.format(f1_score))

Our output:

Total accuracy classification score: 0.6460435475305364
Total F1 classification score: 0.6203806145034193

Hmmm… So not super fantastic, but we are just using count vectors! A richer representation would do wonders for our scores!
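As a rough illustration of that point (using a tiny made-up corpus, not the 20 News Group experiment above), swapping CountVectorizer for a richer representation like TfidfVectorizer is a one-line change in the same kind of pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny, made-up corpus just to show the mechanics; swap in the
# 20 News Group data from above for the real comparison.
train_texts = ['the car engine roared', 'the engine and bumper were new',
               'the team scored a goal', 'the goal keeper saved the shot']
train_labels = [0, 0, 1, 1]  # 0 = cars, 1 = sports

for vec in (CountVectorizer(), TfidfVectorizer()):
    # Identical pipeline; only the text representation changes.
    model = make_pipeline(vec, MultinomialNB(alpha=0.01))
    model.fit(train_texts, train_labels)
    print(type(vec).__name__, model.predict(['a roaring engine', 'a late goal']))
```

On a corpus this small both representations agree; the payoff of TF-IDF (and denser representations beyond it) shows up on real data like the newsgroups above.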

Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning

Imagine a world where every computer system is customized specifically to your own personality. It learns the nuances of how you communicate and how you wish to be communicated with. Interacting with a computer system becomes more intuitive than ever, and technological literacy skyrockets. These are the potential outcomes you could see in a future where reinforcement learning is the norm.

In this article that I wrote for my team at SAP Conversational AI, we break down reinforcement learning and dissect some of the components that come together to make up a reinforcement learning system.

Head on over to their blog and give it a read! If you have questions, drop a comment there or here.

Introduction to Word Embeddings

What is a word embedding?

A very basic definition of a word embedding is a real-valued vector representation of a word. Typically, these days, words with similar meaning will have vector representations that are close together in the embedding space (though this hasn’t always been the case).

When constructing a word embedding space, typically the goal is to capture some sort of relationship in that space, be it meaning, morphology, context, or some other kind of relationship.

By encoding word embeddings in a densely populated space, we can represent words numerically in a way that captures them in vectors that have tens or hundreds of dimensions instead of millions (like one-hot encoded vectors).

A lot of word embeddings are created based on the notion introduced by Zellig Harris’ “distributional hypothesis” which boils down to a simple idea that words that are used close to one another typically have the same meaning.

The beauty is that different word embeddings are created either in different ways or using different text corpora to map this distributional relationship, so the end result is a set of word embeddings that help us on different downstream tasks in the world of NLP.

Why do we use word embeddings?

Words aren’t things that computers naturally understand. By encoding them in a numeric form, we can apply mathematical rules and matrix operations to them. This makes them especially powerful in the world of machine learning.

Take deep learning for example. By encoding words in a numerical form, we can take many deep learning architectures and apply them to words. Convolutional neural networks have been applied to NLP tasks using word embeddings and have set the state-of-the-art performance for many tasks.

Even better, what we have found is that we can actually pre-train word embeddings that are applicable to many tasks. That’s the focus of many of the types we will address in this article. So one doesn’t have to learn a new set of embeddings per task, per corpus. Instead, we can learn general representations which can then be used across tasks.

Specific examples of word embeddings

So now with that brief introduction out of the way, let’s take a brief look into some of the different ways we can numerically represent words (and at a later time, I’ll put together a more complex analysis of each and how to actually use them in a down-stream task).

One-Hot Encoding (Count Vectorizing)

One of the most basic ways we can numerically represent words is through the one-hot encoding method (also sometimes called count vectorizing).

The idea is super simple. Create a vector that has as many dimensions as your corpora has unique words. Each unique word has a unique dimension and will be represented by a 1 in that dimension with 0s everywhere else.

The result of this? Really huge and sparse vectors that capture absolutely no relational information. It could be useful if you have no other option. But we do have other options, if we need that semantic relationship information.


I also have an post now about how to use count vectorization on real text data! If you’re interested, check it out here: Natural Language Processing: Count Vectorization with scikit-learn

TF-IDF Transform

TF-IDF vectors are related to one-hot encoded vectors. However, instead of just featuring a count, they feature numerical representations where words aren’t just there or not there. Instead, words are represented by their term frequency multiplied by their inverse document frequency.

In simpler terms, words that occur a lot but everywhere should be given very little weighting or significance. We can think of this as words like the or and in the English language. They don’t provide a large amount of value.

However, if a word appears rarely, or appears frequently but only in one or two places, then it is probably a more important word and should be weighted as such.

Again, this suffers from the downside of very high dimensional representations that don’t capture semantic relatedness.
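A quick sketch of that weighting in action (toy sentences of my own, using scikit-learn's TfidfVectorizer): a word like the, which appears in every document, is pulled down relative to a word like rainbow, which appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat on a mat',
        'the dog sat on a log',
        'the bird saw a rainbow']

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_

# In the third document, 'the' occurs in every document (low IDF),
# while 'rainbow' occurs only here (high IDF), so 'rainbow' wins.
print('the     :', round(weights[2][vocab['the']], 3))
print('rainbow :', round(weights[2][vocab['rainbow']], 3))
```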


Co-Occurrence Matrix

A co-occurrence matrix is exactly what it sounds like: a giant matrix that is as long and as wide as the vocabulary size. If words occur together, they are marked with a positive entry. Otherwise, they have a 0. It boils down to a numeric representation that simply asks the question, “Do words occur together? If yes, then count this.”

And what can we already see becoming a big problem? Super large representation! If we thought that one-hot encoding was high dimensional, then co-occurrence is high dimensional squared. That’s a lot of data to store in memory.
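A minimal sketch of building such a matrix (toy corpus and a ±1-word window chosen purely for illustration; real implementations make the window size a parameter):

```python
import numpy as np

corpus = ['i like deep learning', 'i like nlp', 'i enjoy flying']
vocab = sorted({w for doc in corpus for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a +/-1 word window.
window = 1
M = np.zeros((len(vocab), len(vocab)), dtype=int)
for doc in corpus:
    words = doc.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                M[index[w], index[words[j]]] += 1

print(vocab)
print(M)
# 'i' and 'like' are adjacent in two sentences, so that entry is 2.
print(M[index['i'], index['like']])  # 2
```

Even on this 8-word vocabulary the matrix is 8x8; at a realistic vocabulary size the quadratic blow-up described above becomes painful fast.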


Neural Probabilistic Model

Now, we can start to get into some neural networks. A neural probabilistic model learns an embedding by achieving some task like modeling or classification and is what the rest of these embeddings are more or less based on.

Typically, you clean your text and create one-hot encoded vectors. Then, you define your representation size (300 dimensional might be good). From there, we initialize the embedding to random values. It’s the entry point into the network, and back-propagation is utilized to modify the embedding based on whatever goal task we have.

This typically takes a lot of data and can be very slow. The trade-off here is that it learns an embedding that is good for the text data that the network was trained on as well as the NLP task that was jointly learned during training.
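A stripped-down sketch of that entry point (sizes are arbitrary, and the table here is random; training would adjust its rows via back-propagation from the downstream task's loss): multiplying a one-hot vector into the embedding table is the same as selecting one row, which is why frameworks implement embeddings as a table lookup.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 5, 3  # tiny illustrative sizes

# Randomly initialized embedding table, one row per vocabulary word.
E = rng.normal(size=(vocab_size, embed_dim))

# A one-hot vector for word index 2...
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0

# ...multiplied into the table just selects row 2.
assert np.allclose(one_hot @ E, E[2])
print(E[2])
```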



Word2Vec

Word2Vec is a better successor to the neural probabilistic model. We still use a statistical computation method to learn from a text corpus; however, its method of training is more efficient than simple embedding training. It is more or less the standard method for training embeddings these days.

It is also the first method that demonstrated classic vector arithmetic to create analogies:


There are two major learning approaches.

Continuous Bag-of-Words (CBOW)

This method learns an embedding by predicting the current words based on the context. The context is determined by the surrounding words.

Continuous Skip-Gram

This method learns an embedding by predicting the surrounding words given the context. The context is the current word.


Both of these learning methods use local word usage context (with a defined window of neighboring words). The larger the window, the more topical the similarities learned by the embedding. Forcing a smaller window results in more semantic, syntactic, and functional similarities being learned.
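To make the window idea concrete, here is a sketch of how skip-gram style (center, context) training pairs fall out of a window (a simplified illustration, not word2vec's actual implementation):

```python
def skipgram_pairs(words, window):
    """Generate the (center, context) pairs the skip-gram objective trains on."""
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                pairs.append((center, words[j]))
    return pairs

sentence = 'the quick brown fox jumps'.split()

# A window of 1 pairs each word only with its immediate neighbours;
# widening the window pulls in more distant, topical context.
print(skipgram_pairs(sentence, window=1))
print(len(skipgram_pairs(sentence, window=2)))
```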

So, what are the benefits? Well, high-quality embeddings can be learned quite efficiently, especially compared against neural probabilistic models: low space and low time complexity to generate a rich representation. More than that, the larger the dimensionality, the more features our representation can capture, yet we can still keep the dimensionality much lower than in some other methods. It also allows us to efficiently train on something like a billion-word corpus while capturing plenty of generality and keeping the dimensionality small.


GloVe

GloVe is a modification of word2vec, and a much better one at that. There are classical vector models used for natural language processing that are good at capturing global statistics of a corpus, like LSA (matrix factorization). They’re very good at global information, but they don’t capture meaning so well and definitely don’t have the cool analogy features built in.

GloVe’s contribution was the addition of global statistics to the language modeling task to generate the embedding. There is no window feature for local context. Instead, there is a word-context/word co-occurrence matrix that learns statistics across the entire corpus.

The result? A much better embedding being learned than simple word2vec.



FastText

Now, with FastText we enter the world of really cool recent word embeddings. What FastText did was incorporate sub-word information. It did so by splitting all words into a bag of character n-grams (typically of size 3-6). It then adds these sub-words together to create a whole word as a final feature. The thing that makes this really powerful is that it allows FastText to naturally support out-of-vocabulary words!

This is huge because in other approaches, if the system encounters a word that it doesn’t recognize, it just has to map it to the unknown word. With FastText, we can give meaning to words like circumnavigate if we only know the word navigate, because our semantic knowledge of the word navigate can help us provide at least a bit more semantic information about circumnavigate, even if it is not a word our system learned during training.
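A rough sketch of that sub-word splitting (simplified; real FastText also keeps the whole word as its own feature and hashes n-grams into buckets), showing how circumnavigate and navigate end up sharing many units:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with the < > boundary markers FastText adds."""
    marked = '<' + word + '>'
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

# The shared n-grams are how an out-of-vocabulary word can
# inherit some meaning from related in-vocabulary words.
shared = char_ngrams('circumnavigate') & char_ngrams('navigate')
print(sorted(shared))
```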

Beyond that, FastText uses the skip-gram objective with negative sampling. All sub-words are positive examples, and then random samples from a dictionary of words in the corpora are used as negative examples. These are the major things that FastText included in its training.

Another really cool thing is that Facebook, in developing FastText, has published pre-trained FastText vectors in 294 different languages. This is something extremely awesome, in my opinion, because it allows developers to jump into making projects in languages that typically don’t have pre-trained word vectors at a very low cost (since training their own word embeddings takes a lot of computational resources).

If you want to see all the languages that FastText supports, check it out here.


ELMo

ELMo is a personal favorite of mine. Its embeddings are state-of-the-art contextual word vectors: the representations are generated as a function of the entire sentence to create word-level representations. The embeddings are generated at the character level, so they can capitalize on sub-word units like FastText does and do not suffer from the issue of out-of-vocabulary words.

ELMo is trained as a bi-directional, two layer LSTM language model. A really interesting side-effect is that its final output is actually a combination of its inner layer outputs. What has been found is that the lowest layer is good for things like POS tagging and other more syntactic and functional tasks, whereas the higher layer is good for things like word-sense disambiguation and other higher-level, more abstract tasks. When we combine these layers, we find that we actually get incredibly high performance on downstream tasks out of the box.

The only questions on my mind? How can we reduce the dimensionality, and how can we extend training to less popular languages, like we have for FastText?


Probabilistic FastText

Probabilistic FastText is a recent paper that came out that tried to better handle the issue of words that have different meanings, but are spelled the same. Take for example the word rock. It can mean:

  • Rock music

  • A stone

  • The action of moving back and forth

How do we know what we are talking about when we encounter this word? Typically, we don’t. When learning an embedding, we just smush all the meanings together and hope for the best. That’s why things like ELMo, which use the entire sentence as a context, tend to perform better when needing to distinguish the different meanings.

That’s also what Probabilistic FastText does really well. Instead of representing words as vectors, words are represented as Gaussian mixture models. Now, I still don’t understand the math really well, but a lot of the training schema is still similar to FastText, just instead of learning a vector, we learn something probabilistic.

I’m very curious to see if this stream yields future research, as I think moving away from vectors is a very interesting direction to go.