**Generative adversarial networks**, popularly known as **GANs**, are generative models that learn a target probability distribution through a generator, *G*. The generator plays a zero-sum minimax game against a discriminator, *D*, and both evolve over the course of training until a Nash equilibrium is reached. The generator converts samples *z*, drawn from a noise distribution *P(z)*, into samples *G(z)* that should resemble those drawn from the data distribution *P(x)*. The discriminator learns to label the generated samples *G(z)* as fake and the samples *x* drawn from *P(x)* as real. At the equilibrium of the minimax game, the generator has learned to produce samples that follow the original distribution *P(x)*, so that the following is true:

$$G(z) \sim P(x), \qquad z \sim P(z)$$

The following diagram illustrates a GAN learning the probability distribution of the MNIST digits:

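The cost function minimized by the discriminator is the binary cross-entropy for distinguishing the real data points, which belong to the probability distribution *P(x)*, from the fake ones produced by the generator (that is, *G(z)*):

$$C(G, D) = -\mathbb{E}_{x \sim P(x)}\big[\log D(x)\big] - \mathbb{E}_{z \sim P(z)}\big[\log\big(1 - D(G(z))\big)\big] \tag{1}$$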
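
As a rough illustration of this cost, the sketch below estimates the binary cross-entropy from a handful of discriminator outputs. The function name `discriminator_cost` and the sample values are hypothetical and chosen only for this example. The outputs are assumed to be probabilities strictly between 0 and 1, with real samples labeled 1 and generated samples labeled 0.

```python
import math


def discriminator_cost(d_real, d_fake):
    """Monte Carlo estimate of the discriminator's binary cross-entropy.

    d_real: outputs D(x) on samples drawn from the data distribution P(x)
    d_fake: outputs D(G(z)) on samples produced by the generator
    All values are assumed to lie strictly between 0 and 1.
    """
    real_term = -sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake samples well has a low cost.
confident = discriminator_cost(d_real=[0.9, 0.95, 0.85], d_fake=[0.1, 0.05, 0.2])

# A discriminator that outputs 0.5 everywhere cannot tell them apart;
# its cost is -log(0.5) - log(0.5) = log(4).
undecided = discriminator_cost(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print(confident, undecided, math.log(4))
```

The discriminator lowers this value by pushing *D(x)* toward 1 on real samples and *D(G(z))* toward 0 on generated ones, while the generator benefits when the value rises.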

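The generator tries to maximize the same cost function given by *(1)*. The optimization problem can therefore be formulated as a minimax game with the utility function *U(G,D)*, as shown here:

$$\max_{G} \min_{D} U(G, D) = \max_{G} \min_{D} \Big( -\mathbb{E}_{x \sim P(x)}\big[\log D(x)\big] - \mathbb{E}_{z \sim P(z)}\big[\log\big(1 - D(G(z))\big)\big] \Big) \tag{2}$$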

Generally, to measure how closely one probability distribution matches another, divergence measures are used, such as the **Kullback–Leibler** (**KL**) divergence and the Jensen-Shannon divergence (both *f*-divergences), and the closely related Bhattacharyya distance. For example, the KL divergence between two probability distributions, *P* and *Q*, is given by the following, where the expectation is with respect to the distribution *P*:

$$\mathrm{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$$

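Similarly, the Jensen-Shannon divergence between *P* and *Q* is given as follows:

$$\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2}\,\mathrm{KL}\left(P \,\Big\|\, \frac{P + Q}{2}\right) + \frac{1}{2}\,\mathrm{KL}\left(Q \,\Big\|\, \frac{P + Q}{2}\right)$$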
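
Both divergences are easy to evaluate for discrete distributions, which can make their behavior more concrete. The following is a minimal sketch using only the standard library. The helper names `kl_divergence` and `js_divergence` and the example distributions are hypothetical. The KL helper assumes that *Q* is nonzero wherever *P* is nonzero.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as lists of probabilities.

    Assumes q[i] > 0 wherever p[i] > 0. Terms with p[i] == 0 contribute
    nothing, following the convention 0 * log(0) = 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of P and Q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]

# KL is not symmetric in its arguments, whereas Jensen-Shannon is.
print(kl_divergence(p, q), kl_divergence(q, p))
print(js_divergence(p, q), js_divergence(q, p))
```

With natural logarithms, the Jensen-Shannon divergence is bounded between 0 and log 2, while the KL divergence has no upper bound.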

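Now, coming back to *(2)*, the utility function can be written as follows:

$$U(G, D) = -\mathbb{E}_{x \sim P(x)}\big[\log D(x)\big] - \mathbb{E}_{x \sim G(x)}\big[\log\big(1 - D(x)\big)\big] \tag{3}$$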

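Here, *G(x)* is the probability distribution of the samples produced by the generator. Expanding the expectations into their integral form, we get the following:

$$U(G, D) = -\int_{x} P(x) \log D(x)\, dx - \int_{x} G(x) \log\big(1 - D(x)\big)\, dx \tag{4}$$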

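For a fixed generator distribution, *G(x)*, the utility function is at a minimum with respect to the discriminator when the integrand is minimized at every *x*. Setting the derivative of the integrand with respect to *D(x)* to zero gives the following:

$$D^{*}(x) = \frac{P(x)}{P(x) + G(x)} \tag{5}$$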
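
The optimal discriminator can be checked numerically at a single point *x*. The sketch below assumes the utility takes the negative log-likelihood form used in this section, so the contribution at that point is *-P(x) log D(x) - G(x) log(1 - D(x))*, and the minimizer should be *P(x) / (P(x) + G(x))*. The function names and the density values 0.6 and 0.2 are hypothetical, picked only for illustration.

```python
import math


def utility_integrand(d, p_x, g_x):
    """Pointwise contribution to the utility for a discriminator output d."""
    return -p_x * math.log(d) - g_x * math.log(1.0 - d)


def optimal_discriminator(p_x, g_x):
    """Closed-form minimizer of the integrand with respect to d."""
    return p_x / (p_x + g_x)


# Illustrative density values at one point x.
p_x, g_x = 0.6, 0.2

d_star = optimal_discriminator(p_x, g_x)

# Brute-force search over a fine grid of candidate outputs in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = min(grid, key=lambda d: utility_integrand(d, p_x, g_x))

print(d_star, best_on_grid)
```

The grid search and the closed form should agree to within the grid resolution, which is a quick sanity check rather than a proof.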

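Substituting *D(x)* from *(5)* in *(3)*, we get the following:

$$U(G, D^{*}) = -\mathbb{E}_{x \sim P(x)}\left[\log \frac{P(x)}{P(x) + G(x)}\right] - \mathbb{E}_{x \sim G(x)}\left[\log \frac{G(x)}{P(x) + G(x)}\right] \tag{6}$$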

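Now, the task of the generator is to maximize the utility, $U(G, D^{*})$, or equivalently to minimize $-U(G, D^{*})$. The expression for $-U(G, D^{*})$ can be rearranged as follows:

$$
\begin{aligned}
-U(G, D^{*}) &= \mathbb{E}_{x \sim P(x)}\left[\log \frac{P(x)}{P(x) + G(x)}\right] + \mathbb{E}_{x \sim G(x)}\left[\log \frac{G(x)}{P(x) + G(x)}\right] \\
&= -\log 4 + \mathbb{E}_{x \sim P(x)}\left[\log \frac{P(x)}{\frac{P(x) + G(x)}{2}}\right] + \mathbb{E}_{x \sim G(x)}\left[\log \frac{G(x)}{\frac{P(x) + G(x)}{2}}\right] \\
&= -\log 4 + \mathrm{KL}\left(P \,\Big\|\, \frac{P + G}{2}\right) + \mathrm{KL}\left(G \,\Big\|\, \frac{P + G}{2}\right) \\
&= -\log 4 + 2\,\mathrm{JSD}(P \,\|\, G)
\end{aligned}
\tag{7}
$$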

Hence, we can see that the generator minimizing $-U(G, D^{*})$ is equivalent to minimizing the Jensen-Shannon divergence between the real distribution, *P(x)*, and the distribution of the samples generated by the generator, *G* (that is, *G(x)*). This divergence is zero only when the two distributions are identical.
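
This equivalence can be verified numerically on small discrete distributions. The sketch below assumes the utility is the negative log-likelihood form used in this section, evaluated at the optimal discriminator *P(x) / (P(x) + G(x))*, and compares its negative against twice the Jensen-Shannon divergence minus log 4. All function names and the example distributions are hypothetical, and every probability is assumed to be strictly positive.

```python
import math


def kl(p, q):
    """KL(P || Q) for discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def jsd(p, g):
    """Jensen-Shannon divergence between discrete distributions P and G."""
    m = [(pi + gi) / 2 for pi, gi in zip(p, g)]
    return 0.5 * kl(p, m) + 0.5 * kl(g, m)


def negative_utility_at_optimum(p, g):
    """Negative utility evaluated at the optimal discriminator.

    Computes sum_x [ P(x) log D*(x) + G(x) log(1 - D*(x)) ]
    with D*(x) = P(x) / (P(x) + G(x)).
    """
    total = 0.0
    for px, gx in zip(p, g):
        d_star = px / (px + gx)
        total += px * math.log(d_star) + gx * math.log(1.0 - d_star)
    return total


p = [0.1, 0.4, 0.5]  # stands in for the real distribution P(x)
g = [0.3, 0.3, 0.4]  # stands in for the generator distribution G(x)

lhs = negative_utility_at_optimum(p, g)
rhs = 2 * jsd(p, g) - math.log(4)
print(lhs, rhs)

# When the generator matches the data exactly, the value drops to -log 4.
print(negative_utility_at_optimum(p, p), -math.log(4))
```

The two printed values on the first line should agree up to floating-point error, and the second line shows the lower bound reached when the generator distribution equals the real one.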

Training a GAN is not a straightforward process, and there are several technical considerations that we need to take into account while training such a network. We will be using an advanced GAN network to build a cross-domain style transfer application in Chapter 4, *Style Transfer in Fashion Industry using GANs*.