If you haven’t already, you should definitely read my previous post about what a GAN is (especially if you don’t know what I mean when I say GAN!). That post should give you a starting point to dive into the world of GANs and how they work. It’s a solid primer for any article on GANs, not to mention this one where we will be discussing objective functions of GANs and other variations of GANs currently out there that use twists on defining their objectives for different results.

Don’t have time to read the whole thing? Here’s the TL;DR

| GAN Type | Key Take-Away |
| --- | --- |
| GAN | The original (JSD divergence) |
| WGAN | EM distance objective |
| Improved WGAN | No weight clipping on WGAN |
| LSGAN | L2 loss objective |
| RWGAN | Relaxed WGAN framework |
| McGAN | Mean/covariance minimization objective |
| GMMN | Maximum mean discrepancy objective |
| MMD GAN | Adversarial kernel to GMMN |
| Cramer GAN | Cramer distance |
| Fisher GAN | Chi-square objective |
| EBGAN | Autoencoder instead of discriminator |
| BEGAN | WGAN and EBGAN merged objectives |
| MAGAN | Dynamic margin on hinge loss from EBGAN |

Defining an Objective

In our introductory post, we talked about generative models. We discussed how the goal of a generative model is to come up with a way of matching their generated distribution to a real data distribution. Minimizing the distance between the two distributions is critical for creating a system that generates content that looks good, new, and like it is from the original data distribution.

But how do we measure the difference between our generated data distribution and our original data distribution? That’s what we call an objective function and it is the focus of this article today! We are going to look at some variations of GANs to understand how we can alter the measure of the divergence between our generated data distribution and the actual distribution and the effect that that will have.

The Original GAN

The objective function of our original GAN is essentially the minimization of something called the Jensen-Shannon divergence (JSD). Specifically, it is: $$ \text{JSD}(P, Q) = \frac{1}{2} \text{KL}(P || M) + \frac{1}{2} \text{KL}(Q || M) $$ where M = (P + Q) / 2 is the average of P and Q, and KL is the Kullback-Leibler divergence (KLD) that we mentioned in our previous post.
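To make the formula concrete, here is a minimal sketch of the JSD for discrete distributions stored as NumPy arrays. The function names `kl_divergence` and `js_divergence` are my own for illustration, and the sketch assumes both inputs are valid probability vectors over the same outcomes.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p == 0 count as 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the average of p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Two distributions with no overlap at all
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])

print(js_divergence(p, p))  # identical distributions give 0
print(js_divergence(p, q))  # disjoint distributions give log(2), the maximum
```

Note that the JSD tops out at log(2) once the two distributions stop overlapping, no matter how far apart they are.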

Wasserstein GAN

The Wasserstein GAN (WGAN) is a GAN you may have heard about, since it got a lot of attention. It did so for a lot of practical reasons (in general, the loss values returned while training a GAN don’t tell you much, but with WGAN they can), but what made WGAN different?

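WGAN doesn’t use the JSD to measure divergence. Instead, it uses something called the Earth-Mover (EM) distance (AKA the Wasserstein distance). The Wasserstein distance is defined as:

$$ W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{M \times M} d(x, y)^p \, \mathrm{d}\gamma(x, y) \right)^{1/p} $$

Here Γ(μ, ν) is the set of all joint distributions whose marginals are μ and ν, and d is a distance on the underlying space M (not to be confused with the average M from the JSD above). WGAN uses the p = 1 case, which is the EM distance.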

What does this mean?

EM Distance

Let’s try and understand the intuition behind the EM distance. A probability distribution is essentially a collection of mass, with the distribution measuring the amount of mass at a given point. We give EM distance two distributions. Since the cost to move a mass a certain distance is equivalent to the product of the mass and the distance, the EM distance basically calculates the minimal cost of transforming one probability distribution into the other. This can be seen as the minimal effort needed.

But why do we care? Well, we care about EM distance because it gives a smooth, meaningful measure of how far apart two distributions are, even when they barely overlap. This is helpful with gradients in optimization. Not to mention, there are also sequences of distributions that do not converge when distance is measured with something like KLD or JSD but that do converge under the EM distance.

This is because, under mild assumptions, the EM distance is continuous and differentiable almost everywhere with respect to the generator’s parameters, something that KLD and JSD cannot promise. We want these guarantees for a loss function, making EM distance better suited to our needs. More than that, everything that would converge under JSD or KLD also converges under EM distance. It’s just that EM distance encompasses that much more.
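A small experiment makes the contrast with JSD easier to see. The sketch below assumes one-dimensional discrete distributions on a shared grid, where the EM distance reduces to the area between the two CDFs. The helper names (`em_distance_1d`, `js_divergence`, `point_mass`) are mine, not from the WGAN paper.

```python
import numpy as np

def em_distance_1d(p, q, grid):
    """W_1 on a sorted 1-D grid: integrate |CDF_p - CDF_q| between grid points."""
    cdf_gap = np.cumsum(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))[:-1]
    return float(np.sum(np.abs(cdf_gap) * np.diff(grid)))

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

grid = np.arange(6.0)

def point_mass(k):
    """All probability mass sitting on grid[k]."""
    d = np.zeros(len(grid))
    d[k] = 1.0
    return d

real = point_mass(0)
for shift in (1, 2, 3):
    fake = point_mass(shift)
    # EM distance grows with the shift; JSD stays stuck at log(2)
    print(shift, em_distance_1d(real, fake, grid), js_divergence(real, fake))
```

The EM distance tells the generator how far away it is, while the JSD only says "not overlapping" with the same value every time, which leaves nothing useful to take a gradient of.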

How is This Used?

Stepping away from all these thoughts about math and into the practical application of such things, how do we use this new distance when we can’t directly calculate it? Well, we first take a critic function that is parameterized and train it to approximate the EM distance between our data distribution and our generated distribution. When we have achieved that, we have a good approximation for the EM distance. From there, we then optimize our generator function to reduce this EM distance.

In order to guarantee that our function lies in a compact space (this keeps the critic Lipschitz-bounded, which the theory behind the approximation requires), we clip the weights that parametrize our critic function f.

Just a side note: Our critic function f is called a critic because it is not an explicit discriminator. A discriminator will classify its inputs as real or fake. The critic doesn’t do that. The critic function just approximates a distance score. However, it plays the discriminator role in the traditional GAN framework, so it’s worth highlighting how it is similar and how it is different.
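Here is a rough sketch of the critic side of that loop. To keep it short I assume a linear critic f(x) = x · w and plain gradient ascent, which is a simplification: a real WGAN critic is a neural network trained with a proper optimizer. The names `critic_step` and `em_estimate` and the default values are illustrative.

```python
import numpy as np

def critic_step(w, real, fake, lr=0.05, clip=0.01):
    """One ascent step on E[f(real)] - E[f(fake)] for f(x) = x @ w, then clip the weights."""
    grad = real.mean(axis=0) - fake.mean(axis=0)  # gradient of the objective w.r.t. w
    w = w + lr * grad
    return np.clip(w, -clip, clip)                # keep the weights in a compact box

def em_estimate(w, real, fake):
    """The critic's current (scaled) estimate of the EM distance."""
    return float((real @ w).mean() - (fake @ w).mean())

rng = np.random.default_rng(0)
real = rng.normal(loc=2.0, size=(256, 2))  # stand-in for real data
fake = rng.normal(loc=0.0, size=(256, 2))  # stand-in for generator output

w = np.zeros(2)
for _ in range(5):                         # several critic steps per generator step
    w = critic_step(w, real, fake)

print(w, em_estimate(w, real, fake))
```

The generator would then be updated to push `em_estimate` down. Unlike a discriminator, nothing here outputs a probability of "real" or "fake", just a score.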

Figure: Wasserstein GAN results, taken from the WGAN paper.

Key Take-Aways

  • Meaningful loss function
    • Easier debugging
    • Easier hyperparameter searching
  • Improved stability
    • Less mode collapse (when a generator just generates one thing over and over again… More on this later)
  • Theoretical optimization guarantees

Improved WGAN

With all those good things proposed with WGAN, what still needs to be improved? Well, Improved Training of Wasserstein GANs highlights just that.

WGAN got a lot of attention, people started using it, and the benefits were there. But people began to notice that despite all the things WGAN brought to the table, it still can fail to converge or produce pretty bad generated samples. The reasoning that Improved WGAN gives is that weight clipping is an issue. It does more harm than good in some situations. We noted that the reason why we weight clip has to do with maintaining the theoretical guarantees of the critic function. But in practice, what clipping actually does is encourage very simple critic functions that are pushed to the extremes of their boundaries. This is not good.

What Improved WGAN proposes instead is that you don’t clip weights but rather add a penalty term on the norm of the gradient of the critic function with respect to its input, pushing that norm towards 1. They found that this produces better results and, when plugged into a bunch of different GAN architectures, produces stable training.
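As a sketch of what that penalty looks like, the code below uses a tiny critic f(x) = v · tanh(W x) whose input gradient can be written by hand, so no autodiff library is needed. The critic architecture, the function names, and the toy data are all assumptions for illustration. The penalty is evaluated at random points between real and fake samples, and `lam` defaults to 10, a commonly used coefficient for this penalty.

```python
import numpy as np

def critic(x, W, v):
    """f(x) = v . tanh(W x), applied row-wise to a batch x of shape (n, d)."""
    return np.tanh(x @ W.T) @ v

def critic_input_grad(x, W, v):
    """Analytic gradient of f with respect to each input row, shape (n, d)."""
    h = np.tanh(x @ W.T)             # hidden activations, shape (n, k)
    return ((1.0 - h ** 2) * v) @ W  # chain rule through tanh

def gradient_penalty(real, fake, W, v, rng, lam=10.0):
    """lam * E[(||grad_x f(x_hat)|| - 1)^2] at random interpolates x_hat."""
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    grad_norm = np.linalg.norm(critic_input_grad(x_hat, W, v), axis=1)
    return float(lam * np.mean((grad_norm - 1.0) ** 2))

def critic_loss(real, fake, W, v, rng, lam=10.0):
    """Critic objective to minimize: Wasserstein term plus the gradient penalty."""
    wasserstein_term = critic(fake, W, v).mean() - critic(real, W, v).mean()
    return float(wasserstein_term + gradient_penalty(real, fake, W, v, rng, lam))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
v = rng.normal(size=4)
real = rng.normal(loc=1.0, size=(8, 3))
fake = rng.normal(loc=-1.0, size=(8, 3))
print(critic_loss(real, fake, W, v, rng))
```

In a real implementation the input gradient would come from the framework’s automatic differentiation rather than a hand-derived formula.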

Key Take-Aways

  • Exactly WGAN, except no weight clipping
    • Gradient penalty term to encourage theoretical guarantees

Least Squares GAN

LSGAN has a setup similar to WGAN. However, instead of learning a critic function, LSGAN learns a loss function. The loss for real samples should be lower than the loss for fake samples. This allows the LSGAN to put a high focus on fake samples that have a really high margin.

Like WGAN, LSGAN tries to restrict the domain of its function, but it takes a different approach from clipping. It introduces regularization in the form of weight decay, encouraging the weights of the function to lie within a bounded area that guarantees the theoretical needs.

Another point to note is that the loss function is set up more similarly to the original GAN, but where the original GAN uses a log loss, the LSGAN uses an L2 loss (which equates to minimizing the Pearson χ² divergence). The reason for this has to do with the fact that a log loss will basically only care about whether or not a sample is labeled correctly. It will not heavily penalize based on the distance of said sample from correct classification. If a label is correct, it doesn’t worry about it further. In contrast, L2 loss does care about distance. Data far away from where it should be will be penalized proportionally. What LSGAN argues is that this produces more informative gradients.
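The difference in gradients can be checked directly. In this sketch, `a` is the raw score the discriminator gives a generated sample. I compare the derivative of the log loss -log(sigmoid(a)) with the derivative of an L2 loss 0.5 * (a - target)^2, where the target value of 1 is an assumption for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_loss_grad(a):
    """d/da of -log(sigmoid(a)): the log loss for a generated sample scored a."""
    return -(1.0 - sigmoid(a))

def l2_loss_grad(a, target=1.0):
    """d/da of 0.5 * (a - target)**2: grows linearly with distance from the target."""
    return a - target

scores = np.array([1.0, 5.0, 10.0])
for a in scores:
    # The log-loss gradient shrinks towards 0 as the score grows,
    # while the L2 gradient keeps growing with the distance from the target.
    print(a, log_loss_grad(a), l2_loss_grad(a))
```

Once the log loss considers a sample correctly labeled, its gradient all but vanishes, even if the sample’s score is far from where real data sits. The L2 loss keeps pulling in proportion to that distance.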

Figure: LSGAN results, taken from the LSGAN paper.

Key Take-Aways

  • Loss function instead of a critic
  • Weight decay regularization to bound loss function
  • L2 loss instead of log loss for proportional penalization

Relaxed Wasserstein GAN

Relaxed Wasserstein GAN, or RWGAN for short, is another variation on the WGAN paper. The authors describe their RWGAN as the happy medium between WGAN and Improved WGAN (WGAN-GP as they cite it in the paper). Instead of symmetric clamping of weights (like in WGAN) or a gradient penalty (like proposed for Improved WGAN), RWGAN utilizes an asymmetric clamping strategy.

Beyond the specific GAN architecture they put forth, they also go on to describe what they call a statistical class of divergences (dubbed Relaxed Wasserstein divergences or RW divergences). RW divergences take the Wasserstein divergence from the WGAN paper and make it more general, outlining some key probabilistic properties that are needed in order to hold some of the theoretical guarantees of our GANs.

They specifically show that RWGAN parameterized with KL divergence is extremely competitive against other state-of-the-art GANs, but with better convergence properties than even the regular WGAN. They also open their framework up to defining new loss functions and thus new cost functions for designing a GAN scheme.

Key Take-Aways

  • Asymmetric clamping of weights
  • General RW divergence framework, excellent for designing new GAN schema, costs, and loss functions


Mean and Covariance Feature Matching GAN

The Mean and Covariance Feature Matching GAN (McGAN) is part of the same family of GANs that WGAN is. This family is dubbed the Integral Probability Metric (IPM) family. These GANs are the ones that use a critic architecture instead of an explicit discriminator.

The critic function for McGAN has to do with measuring the mean or the covariance features of the generated data distribution and the target data distribution. This seems pretty straightforward when looking at the name too. They define two different ways of creating a critic function, one for the mean and one for the covariance, and demonstrate how to actually use them. Like WGAN, they also use clipping on their model, which ends up restricting the capacity of the model. No super eventful conclusions were drawn from this paper.
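To give a feel for the mean-matching half of this, here is a sketch under some loud assumptions: the feature map is a fixed random tanh layer (McGAN learns it), and the critic is a unit-norm vector v acting on those features. Under that setup, the best critic simply points along the gap between the two feature means, so the distance is the length of that gap. The covariance version is not covered here.

```python
import numpy as np

def features(x, A):
    """Hypothetical fixed feature map Phi(x) = tanh(x A); a learned map in practice."""
    return np.tanh(x @ A)

def mean_matching_distance(real, fake, A):
    """Largest E[v . Phi(real)] - E[v . Phi(fake)] over unit-norm v: the L2 norm of the mean gap."""
    gap = features(real, A).mean(axis=0) - features(fake, A).mean(axis=0)
    return float(np.linalg.norm(gap))

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 8))                  # random projection to 8 features
real = rng.normal(loc=1.0, size=(500, 2))
close = rng.normal(loc=1.0, size=(500, 2))   # same distribution as real
far = rng.normal(loc=-1.0, size=(500, 2))    # clearly different distribution

print(mean_matching_distance(real, close, A))  # small, only sampling noise
print(mean_matching_distance(real, far, A))    # much larger
```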

Key Take-Aways

  • Mean and covariance measure of distance for a critic function

Generative Moment Matching Networks

Generative Moment Matching Networks (GMMN) focus on minimizing something called the maximum mean discrepancy (MMD). MMD is essentially the distance between the means of two distributions after both have been mapped into an embedding (feature) space, and we are trying to minimize that distance here. We can use something called the kernel trick, which allows us to cheat and use a Gaussian kernel to calculate this distance without ever computing the embedding explicitly.
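A minimal sketch of that calculation, assuming a single Gaussian kernel with a hand-picked bandwidth and the simple (biased) sample estimate of the squared MMD. The names `gaussian_kernel` and `mmd2` and the bandwidth value are illustrative choices.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Matrix of exp(-||a_i - b_j||^2 / (2 * bandwidth^2)) for all pairs of rows."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD: E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
good_samples = rng.normal(size=(200, 2))          # same distribution as data
bad_samples = rng.normal(loc=3.0, size=(200, 2))  # shifted distribution

print(mmd2(data, good_samples))  # close to 0
print(mmd2(data, bad_samples))   # clearly larger
```

Every step here is differentiable in the generated samples, which is why this objective can be trained with plain backpropagation and no discriminator.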

They argue that this allows for a simple objective that can easily be trained with backpropagation, and produces competitive results with a standard GAN. They also showed how you could add an auto-encoder into the architecture of this GAN to ease the amount of training needed to accurately estimate the MMD.

An additional note: Though they claim competitive results, from what I’ve read elsewhere, it seems that their empirical results are often lacking. What’s more, this model is fairly computationally heavy, so the computational resource and performance trade-off doesn’t really seem to be there in my opinion.

Key Take-Aways

  • Uses maximum mean discrepancy (MMD) as distance/objective function
  • No discriminator, just measures the distance between samples
  • Adds in an auto-encoder to help measure the MMD


Maximum Mean Discrepancy GAN

Maximum Mean Discrepancy GAN, or MMD GAN, is, you guessed it, an improvement on GMMN. Its major contribution is that it does not use static Gaussian kernels to calculate the MMD, and instead uses adversarial techniques to learn the kernels. It combines ideas from the original GAN and GMMN papers to create a hybrid of the two. The benefits it claims are improvements in both performance and run time.

Key Take-Aways

  • Iteration on GMMN: Adversarial learned kernels for estimating MMD

Cramer GAN

Cramer GAN starts by outlining an issue with the popular WGAN. It claims that there are three properties that a probability divergence should satisfy:

  • Sum invariance
  • Scale sensitivity
  • Unbiased sample gradients

Of these properties, they argue that the Wasserstein distance lacks the final property, unlike KLD or JSD which both have it. They demonstrate that this is actually an issue in practice, and propose a new distance: the Cramer distance.

The Cramer Distance

Now if we look at the Cramer distance, we can actually see it looks somewhat similar to the EM distance. However, due to its mathematical differences, it actually doesn’t suffer from the biased sample gradients that EM distance will. This is proven in the paper, if you really wish to dig into the mathematics of it.
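For one-dimensional distributions, the family resemblance is easy to show: both distances can be written in terms of the gap between the two CDFs. The EM distance integrates the absolute gap, while the Cramer distance integrates the squared gap and takes a square root. The sketch below assumes discrete distributions on a shared grid, and the function names are mine. It only illustrates how the two formulas relate. It does not demonstrate the biased-gradient result.

```python
import numpy as np

def cdf_gap(p, q):
    """CDF_p - CDF_q at every grid point except the last (where both CDFs equal 1)."""
    return np.cumsum(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))[:-1]

def em_distance_1d(p, q, grid):
    """Integral of |CDF_p - CDF_q| over the grid."""
    return float(np.sum(np.abs(cdf_gap(p, q)) * np.diff(grid)))

def cramer_distance_1d(p, q, grid):
    """Square root of the integral of (CDF_p - CDF_q)^2 over the grid."""
    return float(np.sqrt(np.sum(cdf_gap(p, q) ** 2 * np.diff(grid))))

grid = np.arange(5.0)
p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # all mass at 0
q = np.array([0.0, 0.0, 0.0, 0.0, 1.0])  # all mass at 4

print(em_distance_1d(p, q, grid))      # 4.0
print(cramer_distance_1d(p, q, grid))  # 2.0
```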

Key Take-Aways

  • Cramer distance instead of EM distance
  • Improvement over WGAN: unbiased sample gradients

Fisher GAN

The Fisher GAN is yet another iteration on IPM GANs, claiming to surpass McGAN, WGAN, and Improved WGAN in a number of aspects. It sets up its objective function to have a critic with a data-dependent constraint on its second-order moment (its variance, when the critic’s output is centered).

Because of this objective the Fisher GAN boasts the following:

  • Training stability
  • Unconstrained capacity
  • Efficient computation

What makes Fisher GAN’s distance different? It essentially measures what is called the Mahalanobis distance, which in simple terms is the distance between two points that have correlated variables, relative to a centroid that is believed to be the mean of the distribution of the multivariate data. This actually assures that the generator and critic will be bounded like we desire. As the parameterized critic approaches infinite capacity, it actually estimates the Chi-square distance.
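One way to picture the constraint is as a ratio: the gap between the critic’s mean score on real and generated data, divided by the square root of the critic’s average second moment over both. The sketch below computes that ratio for a fixed set of critic scores. The function name `fisher_ipm` and the small `eps` stabilizer are my own additions, and the real method optimizes the critic rather than just evaluating it.

```python
import numpy as np

def fisher_ipm(f_real, f_fake, eps=1e-12):
    """(E[f(real)] - E[f(fake)]) / sqrt(0.5 * E[f(real)^2] + 0.5 * E[f(fake)^2])."""
    mean_gap = f_real.mean() - f_fake.mean()
    second_moment = 0.5 * np.mean(f_real ** 2) + 0.5 * np.mean(f_fake ** 2)
    return float(mean_gap / np.sqrt(second_moment + eps))

rng = np.random.default_rng(0)
scores_real = rng.normal(loc=1.0, size=1000)   # critic scores on real data
scores_fake = rng.normal(loc=-1.0, size=1000)  # critic scores on generated data

value = fisher_ipm(scores_real, scores_fake)
scaled = fisher_ipm(10.0 * scores_real, 10.0 * scores_fake)

# Rescaling the critic does not change the ratio, so the critic cannot
# inflate the distance just by making its outputs larger.
print(value, scaled)
```

Because the ratio is normalized by the critic’s own second moment, its absolute value can never exceed 2, which is the kind of boundedness the text describes, without any weight clipping.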

Key Take-Aways

  • Improvement above WGAN and other IPM GANs
  • Boasts training stability, unconstrained capacity, and efficient computation time
  • Chi-square distance objective

Energy Based GAN

Energy Based GAN (EBGAN) is an interesting one in our collection of GANs here today. Instead of using a discriminator like the original GAN does, it uses an autoencoder to estimate reconstruction loss. The steps to setting this up:

  • Train an autoencoder on the original data
  • Now run generated images through this autoencoder
  • Poorly generated images will have awful reconstruction loss, and thus this now becomes a good measure
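The steps above can be sketched with a toy setup. I use PCA as a stand-in linear autoencoder fit once on the real data, which is an assumption for brevity, since a real EBGAN uses a neural autoencoder. The reconstruction error plays the role of the "energy", and the two loss functions show the margin-based hinge that EBGAN builds on top of it. All names and the toy data are illustrative.

```python
import numpy as np

def fit_linear_autoencoder(data, n_components=1):
    """PCA as a stand-in autoencoder: returns (mean, principal components)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def energy(x, mean, components):
    """Per-sample reconstruction error ||decode(encode(x)) - x||."""
    code = (x - mean) @ components.T   # encode
    recon = mean + code @ components   # decode
    return np.linalg.norm(recon - x, axis=1)

def discriminator_loss(real_energy, fake_energy, margin):
    """Low energy for real data, and fake energy pushed up until it reaches the margin."""
    return float(real_energy.mean() + np.maximum(0.0, margin - fake_energy).mean())

def generator_loss(fake_energy):
    """The generator wants its samples to reconstruct well (low energy)."""
    return float(fake_energy.mean())

rng = np.random.default_rng(0)
# "Real" data lies near the line y = x; "generated" data lies on y = -x
real = rng.normal(size=(200, 1)) * np.array([1.0, 1.0]) + 0.01 * rng.normal(size=(200, 2))
fake = rng.normal(size=(200, 1)) * np.array([1.0, -1.0])

mean, comps = fit_linear_autoencoder(real)
real_energy = energy(real, mean, comps)
fake_energy = energy(fake, mean, comps)
print(real_energy.mean(), fake_energy.mean())  # generated samples reconstruct far worse
print(discriminator_loss(real_energy, fake_energy, margin=1.0), generator_loss(fake_energy))
```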

This is a really cool approach to setting up the GAN, and with the right regularization to prevent mode collapse (the generator just producing the same sample over and over again), it seems to be fairly decent.

So why even do this? Well, what was empirically shown is that using the autoencoder in this fashion actually produces a GAN that is fast, stable, and robust to parameter changes. What’s more, there isn’t a need to try and pull a bunch of tricks to balance the training of the discriminator and the generator.

Key Take-Aways

  • Autoencoder as the discriminator
  • Reconstruction loss used as cost, setup similar to original GAN cost
  • Fast, stable, and robust

Boundary Equilibrium GAN

Boundary Equilibrium GAN (BEGAN) is an iteration on EBGAN. It instead uses the autoencoder reconstruction loss in a way that is similar to WGAN’s loss function.

In order to do this, a parameter needs to be introduced to balance the training of the discriminator and generator. This parameter is weighted as a running mean over the samples, dancing at the boundary between improving the two halves (thus where it gets its name: “boundary equilibrium”).
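As a sketch of how such a balancing parameter can work, the function below takes the autoencoder losses on real and generated samples and updates a balance term `k`. The names `gamma` and `lambda_k`, their default values, and the clipping of `k` to [0, 1] follow the BEGAN formulation as I understand it, so treat them as assumptions rather than a reference implementation.

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    """One bookkeeping step: losses for both players plus the updated balance term k."""
    d_loss = loss_real - k * loss_fake   # discriminator: reconstruct real well, fake badly
    g_loss = loss_fake                   # generator: make fakes reconstruct well
    balance = gamma * loss_real - loss_fake
    k_next = min(max(k + lambda_k * balance, 0.0), 1.0)  # nudge k, keep it in [0, 1]
    convergence = loss_real + abs(balance)               # lower means closer to equilibrium
    return d_loss, g_loss, k_next, convergence

k = 0.0
for loss_real, loss_fake in [(1.0, 0.2), (0.9, 0.3), (0.8, 0.5)]:
    d_loss, g_loss, k, convergence = began_step(loss_real, loss_fake, k)
    print(d_loss, g_loss, k, convergence)
```

When the generator is winning (its loss is small relative to `gamma * loss_real`), `k` rises and the discriminator pays more attention to generated samples. When the generator is losing, `k` falls back, which is the balancing act the name refers to.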

Key Take-Aways

  • Iteration of the EBGAN
  • Superficial resemblance of cost function to WGAN

Margin Adaptation GAN

Margin Adaptation GAN (MAGAN) is the last on our list. It is another variation of EBGAN. EBGAN has a margin as a part of its loss function to produce a hinge loss. What MAGAN does is reduce that margin monotonically over time, instead of keeping it constant. The result of this is that the discriminator will auto-encode real samples better.

The result that we care about: better samples and more stability in training.
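To illustrate the idea of a margin that only ever shrinks, here is a deliberately simplified sketch. The rule used here, lowering the margin to the mean energy of real samples whenever that is smaller, is my own simplification, and the actual MAGAN update has additional conditions that this sketch omits. The function names are hypothetical.

```python
def update_margin(margin, mean_real_energy):
    """Simplified rule: lower the margin to the mean real energy, never raise it."""
    return min(margin, mean_real_energy)

def hinge_discriminator_loss(real_energy, fake_energy, margin):
    """EBGAN-style hinge: real energy plus any shortfall of fake energy below the margin."""
    return real_energy + max(0.0, margin - fake_energy)

margin = 1.0
history = []
for epoch_real_energy in (0.8, 0.9, 0.5):  # made-up mean real energies per epoch
    margin = update_margin(margin, epoch_real_energy)
    history.append(margin)

print(history)  # [0.8, 0.8, 0.5], the margin never increases
```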

Key Take-Aways

  • Iteration on EBGAN
  • Adaptive margin in the hinge loss
  • More stability, better quality

Wrapping Up

That was a lot of different GANs! And a lot of content! I think it’s worth a summary in a table just to keep us organized:

| GAN Type | Key Take-Away |
| --- | --- |
| GAN | The original (JSD divergence) |
| WGAN | EM distance objective |
| Improved WGAN | No weight clipping on WGAN |
| LSGAN | L2 loss objective |
| RWGAN | Relaxed WGAN framework |
| McGAN | Mean/covariance minimization objective |
| GMMN | Maximum mean discrepancy objective |
| MMD GAN | Adversarial kernel to GMMN |
| Cramer GAN | Cramer distance |
| Fisher GAN | Chi-square objective |
| EBGAN | Autoencoder instead of discriminator |
| BEGAN | WGAN and EBGAN merged objectives |
| MAGAN | Dynamic margin on hinge loss from EBGAN |

Whew… Pat yourself on the back, that was a lot of GAN content.

If I missed something or misinterpreted something, please correct me!