If you haven't already, you should definitely read my previous post
about what a GAN is (especially if you don't know
what I mean when I say GAN!). That post should give you a starting point for diving into the world of GANs and how they
work. It's a solid primer for any article on GANs, including this one,
where we will discuss the objective functions of GANs and the variations
currently out there that tweak their objectives to get different results.
Defining an Objective
In our introductory post, we talked about generative models. We discussed how the goal of a generative model is to come
up with a way of matching their generated distribution to a real data distribution. Minimizing the distance between the
two distributions is critical for creating a system that generates content that looks good, new, and like it is from
the original data distribution.
But how do we measure the difference between our generated data distribution and our original data distribution? That's
the job of the objective function,
and it is the focus of this article today! We are going to look at some variations
of GANs to understand how we can alter the measure of the divergence between our generated data distribution and the
actual distribution, and what effect doing so has.
The Original GAN
The objective function of our original GAN is essentially the minimization of something called the Jensen Shannon Divergence (JSD). Specifically it is:
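For reference, this objective is usually written as follows. The notation is assumed from the original GAN paper rather than defined in this post: D is the discriminator, G the generator, p_data the real data distribution, and p_z the noise prior that feeds the generator.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

When the discriminator is at its optimum for a fixed generator, this value works out to 2 JSD(p_data || p_g) - log 4, where p_g is the generated distribution. That is the sense in which training the generator minimizes the JSD.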
The JSD is derived from the Kullback-Leibler Divergence
(KLD) that we mentioned in the previous post.
We are already familiar with our friend, the original GAN. Instead of discussing it any further, let's just admire
its performance in all its glory:
Wasserstein GAN
The Wasserstein GAN (WGAN) is a GAN you may have heard about, since it got a
lot of attention. It did so for a lot of practical reasons (in general, the loss values returned while training a GAN
don't tell you much about sample quality, whereas with WGAN they can), but what made WGAN different?
WGAN doesn't use the JSD to measure divergence. Instead, it uses something called the Earth-Mover (EM) distance
(AKA Wasserstein distance). EM distance is defined as:
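In the notation of the WGAN paper (assumed here, since the symbols are not defined in this post), P_r is the real data distribution, P_g is the generated distribution, and Π(P_r, P_g) is the set of all joint distributions whose marginals are P_r and P_g:

```latex
W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)}
  \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]
```

Each joint distribution γ can be read as a transport plan: γ(x, y) says how much mass is moved from x to y, and the infimum picks the cheapest plan.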
What does this mean?
Let's try and understand the intuition behind the EM distance. A probability distribution is essentially a collection of
mass, with the distribution measuring the amount of mass at a given point. We give EM distance two distributions. Since
the cost to move a mass a certain distance is equivalent to the product of the mass and the distance, the EM distance
basically calculates the minimal cost of transforming one probability distribution into the other. This can be seen as
the minimal effort needed.
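To make the "mass times distance" idea concrete, here is a minimal sketch for the simplest case: two discrete distributions on the same sorted one-dimensional grid. In one dimension the cheapest plan can be read off the running difference of the two distributions, which is what the code does. The function name em_distance_1d and the toy inputs are made up for this illustration.

```python
import numpy as np

def em_distance_1d(p, q, positions):
    """EM distance between two discrete 1-D distributions on the same sorted grid."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    positions = np.asarray(positions, dtype=float)
    # Net mass that has to cross each gap between neighbouring grid points.
    mass_to_move = np.cumsum(p - q)[:-1]
    gap_widths = np.diff(positions)
    # Cost = (mass moved) * (distance moved), summed over all gaps.
    return float(np.sum(np.abs(mass_to_move) * gap_widths))

# The same shape shifted one step to the right: all mass moves a distance of 1.
shifted_by_one = em_distance_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0, 1, 2])
# A single pile moved two steps to the right: all mass moves a distance of 2.
shifted_by_two = em_distance_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0, 1, 2])
print(shifted_by_one, shifted_by_two)
```

The first call gives a cost of 1 and the second a cost of 2, matching the intuition that moving the same mass twice as far costs twice as much.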
But why do we care? Well, we care about EM distance because it changes smoothly as one distribution is moved toward
the other, even when the two do not overlap at all. This is helpful with gradients in optimization. Not to mention,
there are also sequences of distributions that do not converge when distance is measured with something like KLD or
JSD but that do converge under the EM distance.
This is because, under mild assumptions, EM distance is continuous everywhere and differentiable almost everywhere,
guarantees that distance functions like KLD and JSD lack. We want these guarantees for a loss function, which makes EM
distance better suited to our needs. More than that, everything that converges under JSD or KLD also converges under
EM distance. It's just that EM distance encompasses that much more.
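A small sketch of why this matters for gradients, using a toy setup of my own: two point masses on a unit-spaced grid that never overlap. The JSD is stuck at log 2 no matter how far apart they are, so it gives no signal about which direction is "closer", while the EM distance grows with the gap. The helper names here are made up for the illustration.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        support = a > 0  # terms with zero mass contribute nothing
        return np.sum(a[support] * np.log(a[support] / b[support]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def em_distance(p, q):
    """EM distance on a unit-spaced 1-D grid."""
    return float(np.sum(np.abs(np.cumsum(p - q)[:-1])))

def point_mass(index, size=10):
    dist = np.zeros(size)
    dist[index] = 1.0
    return dist

shifts = (1, 3, 6)
jsd_values = [jsd(point_mass(0), point_mass(s)) for s in shifts]
em_values = [em_distance(point_mass(0), point_mass(s)) for s in shifts]
print(jsd_values)  # the same value (log 2) for every shift
print(em_values)   # grows with the shift
```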
How is This Used?
Stepping away from all these thoughts about math and into the practical application of such things, how do we use this
new distance when we can't directly calculate it? Well, we first take a critic function that is parameterized and train
it to approximate the EM distance between our data distribution and our generated distribution. When we have achieved
that, we have a good approximator for the EM distance. From there, we then optimize our generator function to reduce
this EM distance.
In order to guarantee that our function lies in a compact space (this helps ensure that we meet the theoretical
guarantees needed to do our calculations), we clip the weights that parametrize our critic function f.
Just a side note: Our critic function f is called a critic because it is not an explicit discriminator. A discriminator
will classify its inputs as real or fake. The critic doesn't do that. The critic function just approximates a distance
score. However, it plays the discriminator role in the traditional GAN framework, so it's worth highlighting how it is
similar and how it is different.
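The critic update described above can be sketched in a few lines. This is a deliberately tiny stand-in, not the paper's setup: the critic is a linear function f(x) = x · w so that its gradient can be written by hand, plain gradient ascent replaces the optimizer, and the learning rate and clip value are illustrative choices.

```python
import numpy as np

def critic_scores(x, w):
    """Toy linear critic f(x) = x @ w (an assumption made for this sketch)."""
    return x @ w

def critic_step(w, real, fake, lr=0.05, clip=0.01):
    """One critic update: raise E[f(real)] - E[f(fake)], then clip the weights."""
    # For a linear critic, the gradient of the objective w.r.t. w is the
    # difference between the mean real sample and the mean fake sample.
    grad = real.mean(axis=0) - fake.mean(axis=0)
    w = w + lr * grad              # gradient ascent on the critic objective
    return np.clip(w, -clip, clip)  # weight clipping keeps w in a compact set

def em_estimate(w, real, fake):
    """Critic-based estimate of the EM distance, up to a scale set by the clip."""
    return float(critic_scores(real, w).mean() - critic_scores(fake, w).mean())

rng = np.random.default_rng(0)
real = rng.normal(loc=2.0, size=(256, 1))  # stand-in for real data
fake = rng.normal(loc=0.0, size=(256, 1))  # stand-in for generator output
w = np.zeros(1)
for _ in range(5):
    w = critic_step(w, real, fake)
estimate = em_estimate(w, real, fake)
print(w, estimate)
```

In a full WGAN the generator would then be updated to lower this estimate, and the two updates would alternate.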
- Meaningful loss function
- Easier debugging
- Easier hyperparameter searching
- Improved stability
- Less mode collapse (when a generator just generates one thing over and over again... More on this later)
- Theoretical optimization guarantees
Improved WGAN
With all those good things proposed with WGAN, what still needs to be improved? Well,
Improved Training of Wasserstein GANs highlights just that.
WGAN got a lot of attention, people started using it, and the benefits were there. But people began to notice that
despite all the things WGAN brought to the table, it still can fail to converge or produce pretty bad generated
samples. The reasoning that Improved WGAN gives is that weight clipping is an issue. It does more harm than good in
some situations. We noted that the reason why we weight clip has to do with maintaining the theoretical guarantees
of the critic function. But in practice, what clipping actually does is encourage very simple critic functions that are
pushed to the extremes of their boundaries. This is not good.
What Improved WGAN proposes instead is that you don't clip weights but rather add a penalty term on the norm of the
critic's gradient with respect to its input. They found that this produces better results and, when plugged into a
bunch of different GAN architectures, produces stable training.
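Here is a minimal sketch of such a gradient penalty. It assumes a tiny critic f(x) = tanh(x W1) · w2, chosen only so the input gradient can be written by hand (a real implementation would use automatic differentiation). The penalty is evaluated on random interpolations between real and fake samples and pushes the gradient norm toward 1. The weight lam=10 is used here as an illustrative default, and all function names are made up.

```python
import numpy as np

def critic(x, W1, w2):
    """Toy critic f(x) = tanh(x @ W1) @ w2 (an assumption for this sketch)."""
    return np.tanh(x @ W1) @ w2

def critic_input_grad(x, W1, w2):
    """Analytic gradient of the toy critic with respect to each input row."""
    h = np.tanh(x @ W1)                    # shape (n, hidden)
    return ((1.0 - h ** 2) * w2) @ W1.T    # shape (n, dim)

def gradient_penalty(real, fake, W1, w2, rng, lam=10.0):
    """Penalize the critic when its input-gradient norm strays from 1."""
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake  # points between real and fake
    norms = np.linalg.norm(critic_input_grad(x_hat, W1, w2), axis=1)
    return float(lam * np.mean((norms - 1.0) ** 2))

rng = np.random.default_rng(0)
real = rng.normal(size=(8, 2))
fake = rng.normal(size=(8, 2))
W1 = rng.normal(size=(2, 3))
w2 = rng.normal(size=3)
penalty = gradient_penalty(real, fake, W1, w2, rng)
print(penalty)
```

This term is added to the critic loss in place of the clipping step, so the critic's weights are free to take any value.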
- Exactly WGAN, except no weight clipping
- Gradient penalty term to encourage theoretical guarantees
Least Squares GAN
LSGAN has a setup similar to WGAN. However, instead of learning a critic
function, LSGAN learns a loss function. The loss for real samples should be lower than the loss for fake samples. This
allows the LSGAN to put a high focus on fake samples that have a really high margin.
Like WGAN, LSGAN tries to restrict the domain of its function, but it takes a different approach from clipping.
It introduces regularization in the form of weight decay, encouraging the weights of the function to lie within a
bounded area that guarantees the theoretical needs.
Another point to note is that the loss function is set up more similarly to the original GAN, but where the original
GAN uses a log loss, the LSGAN uses an L2 loss
(which equates to minimizing the Pearson χ² divergence). The reason for
this is that a log loss basically only cares about whether or not a sample is labeled correctly.
It will not heavily penalize based on
the distance of said sample from correct classification. If a label is correct, it doesn't worry about it further. In
contrast, L2 loss does care about distance. Data far away from where it should be will be penalized in proportion to
that distance. What LSGAN argues is that this produces more informative gradients.
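The difference between the two losses shows up directly in their gradients. The sketch below is a simplification of my own: it looks at a single sample with target label 1 and a raw score s, compares the gradient of the sigmoid log loss with the gradient of 0.5 * (s - 1)^2, and uses made-up function names.

```python
import numpy as np

def log_loss_grad(score, label):
    """Gradient of the sigmoid cross-entropy loss with respect to the raw score."""
    return 1.0 / (1.0 + np.exp(-score)) - label

def l2_loss_grad(score, label):
    """Gradient of 0.5 * (score - label)**2 with respect to the raw score."""
    return score - label

# Samples already on the "correct" side (label 1), at growing distances.
scores = np.array([2.0, 5.0, 10.0])
log_grads = np.abs(log_loss_grad(scores, 1.0))
l2_grads = np.abs(l2_loss_grad(scores, 1.0))
print(log_grads)  # shrinks toward zero as the score grows
print(l2_grads)   # grows with the distance from the target
```

Once a sample is confidently on the right side, the log loss stops pushing it, while the L2 loss keeps pulling it back toward the target value.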
- Loss function instead of a critic
- Weight decay regularization to bound loss function
- L2 loss instead of log loss for proportional penalization
Relaxed Wasserstein GAN
RWGAN, for short, is another variation on WGAN. The authors describe their
RWGAN as the happy medium between WGAN and Improved WGAN (WGAN-GP as they cite it in the paper). Instead of symmetric
clamping of weights (like in WGAN) or a gradient penalty (like proposed for Improved WGAN), RWGAN utilizes an
asymmetric clamping strategy.
Beyond the specific GAN architecture they put forth, they also go on to describe what they call a statistical class of
divergences (dubbed Relaxed Wasserstein divergences or RW divergences). RW divergences take the Wasserstein divergence
from the WGAN paper and make it more general, outlining some key probabilistic properties that are needed in order to
hold some of the theoretical guarantees of our GANs.
They specifically show that RWGAN parameterized with KL divergence is extremely competitive against other
state-of-the-art GANs, but with better convergence properties than even the regular WGAN. They also open their framework
up to defining new loss functions and thus new cost functions for designing a GAN scheme.
- Asymmetric clamping of weights
- General RW divergence framework, excellent for designing new GAN schema, costs, and loss functions
Mean and Covariance Feature Matching GAN
The Mean and Covariance Feature Matching GAN (McGAN) is part of the same family
of GANs that WGAN is. This family is dubbed the Integral Probability Metric (IPM) family. These GANs are the ones that
use a critic architecture instead of an explicit discriminator.
The critic function for McGAN has to do with measuring the mean or the covariance features of the generated data
distribution and the target data distribution. This seems pretty straightforward when looking at the name too.
They define two different ways of creating a critic function, one for the mean and one for the covariance and
demonstrate how to actually use them. Like WGAN, they also use clipping on their model, which ends up restricting the
capacity of the model. No super eventful conclusions were drawn from this paper.
- Mean and covariance measure of distance for a critic function
Generative Moment Matching Networks
Generative Moment Matching Networks (GMMN) focus on minimizing something called the maximum mean discrepancy (MMD).
MMD is essentially the distance between the means of two distributions after they have been mapped into an embedding
space, and we are trying to minimize that difference here. We can use something called the kernel trick, which allows
us to cheat and use a Gaussian kernel to calculate this distance without ever computing the embeddings explicitly.
They argue that this allows for a simple objective that can easily be trained with backpropagation, and produces
competitive results with a standard GAN. They also showed how you could add an auto-encoder into the architecture of this
GAN to ease the amount of training needed to accurately estimate the MMD.
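As a rough sketch of what "MMD with a Gaussian kernel" looks like in practice, the code below computes the simple biased estimate of the squared MMD between two sample sets. The bandwidth sigma=1.0, the sample sizes, and the function names are illustrative assumptions, not values taken from the GMMN paper.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between sample sets x and y."""
    return float(
        gaussian_kernel(x, x, sigma).mean()
        + gaussian_kernel(y, y, sigma).mean()
        - 2.0 * gaussian_kernel(x, y, sigma).mean()
    )

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
same_dist = rng.normal(size=(200, 2))          # drawn like the data
far_away = rng.normal(loc=3.0, size=(200, 2))  # clearly different
mmd_same = mmd_squared(data, same_dist)
mmd_far = mmd_squared(data, far_away)
print(mmd_same, mmd_far)
```

Because every step is differentiable in the generated samples, a generator can be trained by backpropagating through this quantity directly, with no discriminator involved.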
An additional note: Though they claim competitive results, from what I've read elsewhere, it seems that their empirical
results are often lacking. What's more, this model is fairly computationally heavy, so the computational resource and
performance trade-off doesn't really seem to be there in my opinion.
- Uses maximum mean discrepancy (MMD) as distance/objective function
- No discriminator, just measures the distance between samples
- Adds in an auto-encoder to help measure the MMD
MMD GAN
Maximum Mean Discrepancy GAN, or MMD GAN, is, you guessed it, an improvement on
GMMN. Its major contribution is that, instead of using static Gaussian kernels to calculate the MMD, it
uses adversarial techniques to learn the kernels. It combines ideas from the original GAN and GMMN papers to create a
hybrid of the two. The benefits it claims are improved performance and run time.
- Iteration on GMMN: Adversarial learned kernels for estimating MMD
Cramer GAN
Cramer GAN starts by outlining an issue with the popular WGAN. It claims that
there are three properties that a probability divergence should satisfy:
- Sum invariance
- Scale sensitivity
- Unbiased sample gradients
Of these properties, they argue that the Wasserstein distance lacks the final property, unlike KLD or JSD, which both
have it. They demonstrate that this is actually an issue in practice, and propose a new distance: the Cramer distance.
The Cramer Distance
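For one-dimensional distributions P and Q with cumulative distribution functions F_P and F_Q (notation assumed here), the squared Cramer distance and the EM distance can be written side by side:

```latex
l_2^2(P, Q) = \int_{-\infty}^{\infty} \left(F_P(x) - F_Q(x)\right)^2 \, dx
\qquad
W_1(P, Q) = \int_{-\infty}^{\infty} \left|F_P(x) - F_Q(x)\right| \, dx
```

The only change is that the gap between the two CDFs is squared instead of taken in absolute value.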
Now if we look at the Cramer distance, we can actually see it looks somewhat similar to the EM distance. However, due
to its mathematical differences, it doesn't suffer from the biased sample gradients that EM distance does.
This is proven in the paper, if you really wish to dig into the mathematics of it.
- Cramer distance instead of EM distance
- Improvement over WGAN: unbiased sample gradients
Fisher GAN
The Fisher GAN is yet another iteration on IPM GANs, claiming to surpass McGAN,
WGAN, and Improved WGAN in a number of aspects. It sets up its objective function so that the critic
has a data-dependent constraint on its second-order moment (AKA its variance).
Because of this objective the Fisher GAN boasts the following:
- Training stability
- Unconstrained capacity
- Efficient computation
What makes Fisher GAN's distance different? It essentially measures what is called
the Mahalanobis distance, which in simple terms is the
distance between two points that have correlated variables, relative to a centroid that is believed to be the mean of
the distribution of the multivariate data. This assures that the generator and critic will be bounded like
we desire. As the parameterized critic approaches infinite capacity, it estimates the Chi-square distance.
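One way to see the data-dependent constraint is to write the quantity the critic is trying to maximize as a ratio: the gap between the mean critic scores on real and fake samples, divided by the square root of the averaged second moment of those scores. The sketch below evaluates that ratio for a given set of critic outputs. The function name and the small eps added for numerical safety are my own choices.

```python
import numpy as np

def fisher_ipm_ratio(f_real, f_fake, eps=1e-8):
    """Mean critic gap, normalized by the critic's averaged second moment."""
    mean_gap = f_real.mean() - f_fake.mean()
    second_moment = 0.5 * np.mean(f_real ** 2) + 0.5 * np.mean(f_fake ** 2)
    return float(mean_gap / np.sqrt(second_moment + eps))

f_real = np.array([1.0, 1.0])    # critic outputs on real samples
f_fake = np.array([-1.0, -1.0])  # critic outputs on generated samples
ratio = fisher_ipm_ratio(f_real, f_fake)
scaled_ratio = fisher_ipm_ratio(100.0 * f_real, 100.0 * f_fake)
print(ratio, scaled_ratio)
```

Scaling the critic's outputs by 100 leaves the ratio essentially unchanged, which is why the critic cannot win just by making its outputs larger and why no weight clipping is needed to keep it bounded.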
- Improvement above WGAN and other IPM GANs
- Boasts training stability, unconstrained capacity, and efficient computation time
- Chi-square distance objective
Energy Based GAN
Energy Based GAN (EBGAN) is an interesting one in our collection of GANs here
today. Instead of using a discriminator like how the original GAN does, it uses an autoencoder to estimate
reconstruction loss. The steps to setting this up:
- Train an autoencoder on the original data
- Now run generated images through this autoencoder
- Poorly generated images will have awful reconstruction loss, and thus this now becomes a good measure
This is a really cool approach to setting up the GAN, and with the right regularization to prevent mode collapse
(the generator just producing the same sample over and over again), it seems to be fairly decent.
So why even do this? Well, what was empirically shown is that using the autoencoder in this fashion actually produces
a GAN that is fast, stable, and robust to parameter changes. What's more, there isn't a need to try and pull a bunch
of tricks to balance the training of the discriminator and the generator.
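The idea of using reconstruction error as the score can be sketched as follows. The "autoencoder" here is a stand-in of my own: a fixed linear projection onto a subspace, so points on the subspace reconstruct perfectly and points off it do not. The loss functions follow the margin-based form commonly given for EBGAN, with an illustrative margin of 1.0.

```python
import numpy as np

def reconstruction_energy(x, basis):
    """Squared reconstruction error of a toy linear autoencoder.

    basis has orthonormal columns; encoding is x @ basis, decoding is code @ basis.T.
    """
    recon = (x @ basis) @ basis.T
    return np.sum((x - recon) ** 2, axis=1)

def ebgan_losses(real_energy, fake_energy, margin=1.0):
    """Discriminator and generator losses built from reconstruction energies."""
    # Low energy on real data, and energy at least `margin` on fakes.
    d_loss = real_energy.mean() + np.maximum(0.0, margin - fake_energy).mean()
    # The generator wants its samples to reconstruct well (low energy).
    g_loss = fake_energy.mean()
    return float(d_loss), float(g_loss)

basis = np.array([[1.0], [0.0]])  # the "data manifold" is the x-axis
real = np.array([[3.0, 0.0]])     # lies on the manifold: energy 0
fake = np.array([[0.0, 0.5]])     # lies off the manifold: energy 0.25
d_loss, g_loss = ebgan_losses(
    reconstruction_energy(real, basis),
    reconstruction_energy(fake, basis),
)
print(d_loss, g_loss)
```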
- Autoencoder as the discriminator
- Reconstruction loss used as cost, set up similarly to the original GAN cost
- Fast, stable, and robust
Boundary Equilibrium GAN
Boundary Equilibrium GAN (BEGAN) is an iteration on EBGAN. It instead uses
the autoencoder reconstruction loss in a way that is similar to WGAN's loss function.
In order to do this, a parameter
needs to be introduced to balance the training of the discriminator and generator. This parameter is weighted as a
running mean over the samples, dancing at the boundary between improving the two halves (which is where it gets its name).
- Iteration of the EBGAN
- Superficial resemblance of cost function to WGAN
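The balancing act can be sketched as a small update rule. The names k, gamma, and lambda_k, and their default values, are assumptions following the usual presentation of BEGAN. real_loss and fake_loss stand for the autoencoder's reconstruction losses on real and generated samples.

```python
def began_step(real_loss, fake_loss, k, gamma=0.5, lambda_k=0.001):
    """One BEGAN-style update of the losses and the balancing term k."""
    # The discriminator lowers the real loss and, weighted by k, raises the fake loss.
    d_loss = real_loss - k * fake_loss
    # The generator simply wants its samples to reconstruct well.
    g_loss = fake_loss
    # Nudge k so that fake_loss tracks gamma * real_loss, keeping k in [0, 1].
    k_next = k + lambda_k * (gamma * real_loss - fake_loss)
    k_next = min(max(k_next, 0.0), 1.0)
    return d_loss, g_loss, k_next

d_loss, g_loss, k_next = began_step(real_loss=1.0, fake_loss=0.2, k=0.0)
print(d_loss, g_loss, k_next)
```

When the generator is doing well (fake_loss below gamma * real_loss), k rises and the discriminator pays more attention to fakes. When the generator falls behind, k drops back toward zero.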
Margin Adaptation GAN
Margin Adaptation GAN (MAGAN) is the last on our list. It is another variation
of EBGAN. EBGAN has a margin as a part of its loss function to produce a hinge loss. What MAGAN does is reduce that
margin monotonically over time, instead of keeping it constant. The result of this is that the discriminator will
autoencode real samples better.
The result that we care about: better samples and more stability in training.
- Iteration on EBGAN
- Adaptive margin in the hinge loss
- More stability, better quality