What’s in a Generative Model?

Before we even start talking about Generative Adversarial Networks (GANs), it is worth asking: what’s in a generative model? Why do we even want to have such a thing? What is the goal? These questions can help seed our thought process to better engage with GANs.

So why do we want a generative model? Well, it’s in the name! We wish to generate something. But what do we wish to generate? Typically, we wish to generate data (I know, not very specific). More than that though, it is likely that we wish to generate data that has never been seen before, yet still fits into some data distribution (i.e. the distribution underlying some pre-defined data set that we have already set aside).

And the goal of such a generative model? To get so good at coming up with new generated content that we (or any system that is observing the samples) can no longer tell the difference between what is original and what is generated. Once we have a system that can do that much, we are free to begin generating new samples that we haven’t seen before, yet are still believable as real data.

To step into things a little deeper, we want our generative model to be able to accurately estimate the probability distribution of our real data. Say our model has a parameter W. We wish to find the value of W that maximizes the likelihood of the real samples. When we train our generative model, we search for this ideal W so as to minimize the distance between our estimate of the data distribution and the actual data distribution.

A common measure of the distance between distributions is the Kullback-Leibler (KL) divergence, and it can be shown that maximizing the log likelihood is equivalent to minimizing this divergence. Taking our parameterized generative model and minimizing the distance between it and the actual data distribution is how we create a good generative model. It also brings us to a branching into two types of generative models.
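The equivalence mentioned above can be sketched in two lines. Here $p_{\text{data}}$ is the real data distribution and $p_W$ is the model’s distribution under parameter $W$ (this subscript notation is my own shorthand for this sketch). Writing out the definition of the divergence:

$$ D_{KL}(p_{\text{data}} \,\|\, p_W) = E_{x \sim p_{\text{data}}(x)}[\log p_{\text{data}}(x)] - E_{x \sim p_{\text{data}}(x)}[\log p_W(x)] $$

The first term does not depend on $W$ at all, so minimizing the divergence over $W$ only touches the second term:

$$ \arg\min_{W} D_{KL}(p_{\text{data}} \,\|\, p_W) = \arg\max_{W} E_{x \sim p_{\text{data}}(x)}[\log p_W(x)] $$

In practice we cannot compute that expectation exactly, so it gets replaced by an average over the training samples, which is the (average) log likelihood.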

Explicit Distribution Generative Models

An explicit distribution generative model comes up with an explicitly defined generative model distribution. It then refines this explicitly defined, parameterized estimation through training on data samples. An example of an explicit distribution generative model is a Variational Auto-Encoder (VAE). VAEs require an explicitly assumed prior distribution and likelihood distribution to be given to them. They use these two components to come up with a “variational approximation” with which to evaluate how they are performing. Because of these needs and this component, VAEs have to define their distribution explicitly.

Implicit Distribution Generative Models

As you may have already put together, implicit distribution generative models do not require an explicit definition for their model distribution. Instead, these models train themselves by indirectly sampling data from their parameterized distribution. And as you may have also already guessed, this is what a GAN does.

Well, how exactly does it do that? Let’s dive into GANs, and then we’ll start to paint that picture.

Taxonomy of Deep Generative Models: Based on Figure 9 of NeurIPS 2016 tutorial: Generative adversarial networks (source, license)

High-Level GAN Understanding

Generative Adversarial Networks have three components to their name. We’ve touched on the generative aspect and the network aspect is pretty self-explanatory. But what about the adversarial portion?

Well, GANs have two components to their network, a generator (G) and a discriminator (D). These two components come together in the network and work as adversaries, pushing the performance of one another.

Data flow through a GAN: The generator takes random noise as input and produces a sample, and the discriminator takes a sample as input and produces a probability of whether the sample is real or fake. (source, license)

The Generator

The generator is responsible for producing fake examples of data. It takes as input some latent variable (which we will refer to as z) and outputs data that is of the same form as data in the original data set.

Latent variables are hidden variables. When talking about GANs we have this notion of a “latent space” that we can sample from. We can continuously slide through this latent space which, when you have a well-trained GAN, will have substantial (and oftentimes somewhat understandable) effects on the output.

If our latent variable is z and our target variable is x, we can think of the generator network as learning a function that maps from z (the latent space) to x (hopefully, the real data distribution).
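To make the idea of a function from z to x concrete, here is a minimal sketch of a toy generator. Everything in it is an assumption for illustration: the names `sample_latent`, `generator`, and `interpolate` are hypothetical, the generator is a single untrained linear layer rather than a real neural network, and the latent prior is assumed to be a standard normal.

```python
import random

def sample_latent(dim, rng):
    """Draw one latent vector z; a standard normal is an assumed choice."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def generator(z, weights, bias):
    """Toy generator: one linear layer mapping latent z to a data point x."""
    return [sum(w * zi for w, zi in zip(row, z)) + b
            for row, b in zip(weights, bias)]

def interpolate(z_a, z_b, t):
    """Slide through latent space: t=0 gives z_a, t=1 gives z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

rng = random.Random(0)
latent_dim, data_dim = 2, 3
# Hypothetical, untrained parameters; training is what would tune these.
weights = [[rng.uniform(-1.0, 1.0) for _ in range(latent_dim)]
           for _ in range(data_dim)]
bias = [0.0] * data_dim

z_a = sample_latent(latent_dim, rng)
z_b = sample_latent(latent_dim, rng)
for t in (0.0, 0.5, 1.0):
    # Each step along the path in latent space yields a different output x.
    x_fake = generator(interpolate(z_a, z_b, t), weights, bias)
    print(t, [round(v, 3) for v in x_fake])
```

The loop at the end is the “sliding through latent space” idea in miniature: small moves in z produce small, continuous changes in the generated x.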

The Discriminator

The discriminator’s role is to discriminate. It is responsible for taking in samples and predicting whether each one is real or fake. The discriminator will output a higher probability if it believes a sample is real.
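As a rough sketch of what that looks like, here is a toy discriminator. I am assuming the simplest possible form, a logistic regression over the sample’s features; the name `discriminator` and the parameter values are hypothetical and untrained, and a real GAN would use a deeper network.

```python
import math

def discriminator(x, d_weights, d_bias):
    """Toy discriminator: returns the probability that sample x is real."""
    score = sum(w * xi for w, xi in zip(d_weights, x)) + d_bias
    return 1.0 / (1.0 + math.exp(-score))  # squash the score into (0, 1)

# Hypothetical parameters, chosen by hand purely for illustration.
d_weights, d_bias = [0.5, -0.25, 1.0], 0.0

batch = [[1.0, 2.0, 0.5], [-3.0, 0.0, -1.0]]
probs = [discriminator(x, d_weights, d_bias) for x in batch]
for x, p in zip(batch, probs):
    verdict = "real" if p > 0.5 else "fake"
    print(x, round(p, 3), verdict)
```

The only contract that matters here is the output: one number between 0 and 1 per sample, with higher values meaning “I think this is real.”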

We can think of our discriminator as a “bullshit detector” of sorts.

Adversarial Competition

These two components come together and battle it out. The generator and discriminator oppose one another, trying to maximize opposite goals: The generator wants to push to create samples that look more and more real and the discriminator wishes to always correctly classify where a sample comes from.

The fact that these goals are directly opposite one another is where GANs get the adversarial portion of their name.

Summary of the GAN training process: The generator and discriminator are trained in an adversarial manner. The generator tries to produce samples that are indistinguishable from real data, while the discriminator tries to distinguish between real and fake data. (source, license)

Painting an Elaborate Metaphor

Who doesn’t love a good metaphor for understanding a concept?

Art Forgery

My favorite metaphor from when I was first learning about GANs was the forger versus critic metaphor. In this metaphor, our generator is a criminal who is trying to forge art, whereas our discriminator is an art critic who is supposed to be able to correctly identify whether a piece is forged or authentic.

The two go back and forth, directly in opposition to one another, each trying to one-up the other because their jobs depend on it.

False Money

What if, instead of an art forgery task, we had a criminal who was trying to make fake money and an intern at the bank trying to make sure that the bank does not accept any fake money?

Maybe in the beginning the criminal is very bad. They come in and try to hand the intern a piece of paper with a dollar bill drawn in crayon. This is obviously a fake dollar. But maybe the intern is really bad at their job as well and struggles to figure out if it is actually fake. Both of them will learn a lot from their first interaction. Come the next day, when the criminal comes in, their fake money is going to be a bit harder to identify as fake.

Day in and day out, the two go back and forth and become really good at their jobs. However, at a certain point, there may come a day when the two reach a sort of equilibrium. From there, the criminal’s fake dollars become so realistic that not even a seasoned expert could begin to tell if they are fake or real.

That is the day the bank intern gets fired.

It’s also the day that we can utilize this criminal of ours and get very rich!

Parroting

The previous two examples have been very visually focused. But what about an example that’s a little different?

Let’s say our generator is our pet parrot and our discriminator is our younger brother. Each day, we sit behind a curtain and our parrot sits behind another. Our parrot is going to try and mimic our voice to fool our younger brother. If he’s successful, we give him a treat. If our brother correctly guesses which curtain we are behind, we give our brother a treat instead (hopefully a different one than we give to our parrot).

Maybe in the beginning, the parrot is bad at mimicking our voice. But day after day of practice, our parrot may be able to develop the skills to perfectly mirror our voice. At that point, we’ve trained our parrot to talk exactly like us, and we can become internet famous.

Score!

The Math Behind the Monster

Before we wrap up this introduction to GANs, it is worth exploring the mathematics behind a GAN in a little bit of detail. GANs have a goal of finding equilibrium between the two halves of their network by solving the following minimax equation:

$$ \min_{G} \max_{D} V(D, G) = E_{x \sim p_{\text{data}}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))] $$

We call this equation our minimax equation because we are trying to jointly optimize two parameterized networks, $G$ and $D$, to find an equilibrium between the two. We wish to maximize the confusion of $D$ and minimize the confusion of $G$. When solved, our parameterized, implicit, generative data distribution should match the underlying original data distribution fairly well.

To break the equation down further, let’s look at it from each side. $D$ wants to maximize this equation: when a real sample comes in, it wants its output to be high (making $\log D(x)$ large), and when a fake sample comes in, it wants its output to be low (making $\log(1 - D(G(z)))$ large). That’s essentially where the two terms on the right-hand side of the equation come from. On the flip side, $G$ is trying to trick $D$ into producing a high output when it is handed a fake sample, which drives the second term down. That’s why $D$ is trying to maximize while $G$ is trying to minimize.

This simultaneous minimizing and maximizing is where we get the term minimax.
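To see how the two terms behave, here is a small sketch that estimates $V(D, G)$ from a handful of discriminator outputs. The function name `value_function` and the example probabilities are made up for illustration; the expectations in the equation are simply approximated by averages over the samples.

```python
import math

def value_function(d_real, d_fake):
    """Sample-average estimate of V(D, G).

    d_real: D's outputs on real samples, i.e. D(x)
    d_fake: D's outputs on generated samples, i.e. D(G(z))
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# D is confident and correct: high on real, low on fake. V is close to 0.
confident_d = value_function([0.9, 0.95], [0.1, 0.05])

# G fools D: D also outputs high values on fakes, which drags V down.
fooled_d = value_function([0.9, 0.95], [0.8, 0.9])

# D cannot tell the difference and says 0.5 everywhere: V = -log(4).
undecided_d = value_function([0.5, 0.5], [0.5, 0.5])

print(confident_d, fooled_d, undecided_d)
```

Notice that $G$ can only influence the second argument, while $D$ influences both. A $D$ that is right about everything pushes the value up toward 0, and a $G$ that fools $D$ pulls it back down.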

Now, assuming that $G$ and $D$ are well parameterized and thus have enough capacity to learn, this minimax equation can help us reach the Nash equilibrium between the two. This is ideal.

How Do We Achieve This?

Simple: We just iterate back and forth.

Just kidding. It’s not really simple. But we can outline it fairly simply.

To start, we will first train $D$ to be an optimal classifier on a fixed version of $G$. From there, we fix $D$ and train $G$ to best fool the fixed $D$. By iterating back and forth, we can optimize our minimax equation to the point where $D$ can no longer differentiate between real and fake samples because our generative data distribution is more or less indistinguishable from the actual data distribution. At this point, $D$ will output a 50% probability for every sample it encounters.
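The back-and-forth can be sketched on a deliberately tiny problem. Everything here is an assumption made for illustration: the real data is a one-dimensional normal distribution centered at 4, the generator is just $G(z) = z + b$ (it only learns a shift), the discriminator is a logistic regression $D(x) = \sigma(wx + c)$, and the gradients are written out by hand. Instead of training $D$ all the way to optimality each round, the sketch takes a few gradient steps for $D$ per step of $G$, which is a common practical shortcut rather than the exact procedure described above.

```python
import math
import random

def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(rounds=1000, d_steps=5, batch=64, lr=0.02, seed=0):
    rng = random.Random(seed)
    real_mean = 4.0      # assumed toy data distribution: N(4, 1)
    b = 0.0              # generator parameter: G(z) = z + b
    w, c = 0.0, 0.0      # discriminator parameters: D(x) = sigmoid(w*x + c)

    for _round in range(rounds):
        # 1) Hold G fixed; take a few gradient ASCENT steps on V for D.
        for _step in range(d_steps):
            real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
            fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
            grad_w = grad_c = 0.0
            for x in real:   # d/ds log(sigmoid(s)) = 1 - sigmoid(s)
                g = 1.0 - sigmoid(w * x + c)
                grad_w += g * x
                grad_c += g
            for x in fake:   # d/ds log(1 - sigmoid(s)) = -sigmoid(s)
                g = -sigmoid(w * x + c)
                grad_w += g * x
                grad_c += g
            w += lr * grad_w / batch
            c += lr * grad_c / batch

        # 2) Hold D fixed; take one gradient DESCENT step on V for G.
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        # d/db log(1 - D(z + b)) = -D(z + b) * w
        grad_b = sum(-sigmoid(w * x + c) * w for x in fake) / batch
        b -= lr * grad_b

    return b, w, c

b, w, c = train_toy_gan()
d_at_real_mean = sigmoid(w * 4.0 + c)
print(f"generator shift b = {b:.2f} (real mean is 4.0)")
print(f"D(4.0) = {d_at_real_mean:.2f}")
```

If things go well, the shift `b` should drift toward the real mean and $D$’s output should hover somewhere around 0.5, but I would not read too much into the exact numbers. Even on a toy problem like this, GAN training is noisy and can wobble around the equilibrium instead of settling neatly on it.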


P.S. Many thanks to the authors of How Generative Adversarial Networks and Its Variants Work: An Overview of GAN, who helped to inspire me and give me more insight into how GANs work. Without their paper, this series would not have been fully possible.