• Typical scenario: a model finds a representation $h = f(x)$ of a datapoint (image) $x$
  • Generative models can produce an image $x$ from its representation $h$

Autoencoders

  • An encoder network followed by a decoder network (see the sketch after this list)
  • Encoder compresses the data into a lower-dimensional vector
  • Given a powerful enough decoder, in theory the original datapoint could be perfectly reconstructed even from a one-dimensional latent representation
  • A standard autoencoder learns representations that form distinct clusters but lack regularity in the latent space
  • In order to generate new content we would need a way to sample meaningful latent representations
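
A minimal sketch of such an encoder-decoder pair in PyTorch, assuming flattened 28x28 inputs (e.g. MNIST) and a 2-dimensional latent space; the layer sizes and latent dimension are illustrative choices, not prescribed by the notes:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=2):
        super().__init__()
        # Encoder: compress the input into a low-dimensional latent vector
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstruct the input from the latent vector
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)           # latent representation h = f(x)
        return self.decoder(h), h     # reconstruction and latent code

# Training minimizes a reconstruction loss, e.g. mean squared error
model = Autoencoder()
x = torch.rand(16, 784)               # dummy batch of flattened images
x_hat, h = model(x)
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()
```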

Variational Autoencoder (VAE)

  • The VAE encoder produces two vectors: the means and variances of a set of latent random variables
  • The decoder's input is a sample drawn from these random variables (see the reparameterization sketch below)
  • The latent space is continuous
Figure: training a standard autoencoder on MNIST results in distinct clusters in the latent space (Hinton and Salakhutdinov).
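
A minimal sketch of the VAE-specific part under the same flattened-MNIST assumptions as above: the encoder outputs means and log-variances, a latent sample is drawn with the reparameterization trick, and the loss adds a KL term to the reconstruction error; the layer sizes and the unweighted KL term are illustrative:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, latent_dim)       # means
        self.logvar_head = nn.Linear(256, latent_dim)   # log-variances
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        feats = self.backbone(x)
        mu, logvar = self.mu_head(feats), self.logvar_head(feats)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

model = VAE()
x = torch.rand(16, 784)
x_hat, mu, logvar = model(x)
# Loss: reconstruction term plus KL divergence to the standard normal prior
recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon + kl
loss.backward()
```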

Generative Adversarial Nets (GAN)

  • Simultaneously train two models, generator $G$ and discriminator $D$
  • $G$ learns a mapping from a prior noise distribution to the data space
  • $D$ predicts the probability that a sample comes from the training data rather than from $G$ (a minimal training step is sketched after this list)
  • A conditional GAN is obtained by adding an additional input to both $G$ and $D$ (for example, image category or an input image)
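
A minimal sketch of one adversarial training step, assuming small fully connected networks, a standard normal noise prior, and illustrative dimensions; a conditional GAN would additionally concatenate the conditioning input (e.g. a class label) to the inputs of both networks:

```python
import torch
import torch.nn as nn

noise_dim, data_dim = 32, 784

# G maps prior noise to the data space; D outputs the probability that
# its input comes from the training data rather than from G
G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Sigmoid())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(16, data_dim)       # dummy batch of real data
z = torch.randn(16, noise_dim)        # noise from the prior

# Discriminator step: push D(real) toward 1 and D(fake) toward 0
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(16, 1)) + bce(D(fake), torch.zeros(16, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: try to make D label generated samples as real
g_loss = bce(D(G(z)), torch.ones(16, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```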

pix2pix

  • The difficulty in training an image-to-image network is choosing which loss to optimize
  • For example, Euclidean distance is minimized by averaging all plausible outputs, producing blurry results
  • A GAN learns the loss function automatically
  • A conditional GAN that is conditioned on the input image is suitable for image-to-image translation
  • Generator is based on the U-Net architecture
  • PatchGAN discriminator penalizes structure at the scale of local image patches only
  • The discriminator is run convolutionally across the image, averaging responses from all patches
  • An additional L1 loss encourages the output to stay close to the ground truth (see the sketch after this list)
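
A minimal sketch of the pix2pix objective with a shallow PatchGAN-style discriminator that is conditioned on the input image; the layer counts, the λ = 100 weight, and the random stand-in for the U-Net generator's output are assumptions for illustration, not the exact published configuration:

```python
import torch
import torch.nn as nn

class PatchGAN(nn.Module):
    """Convolutional discriminator that classifies local patches as real/fake."""
    def __init__(self, in_channels=6):    # conditioned: input + output image
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one logit per patch
        )

    def forward(self, inp, out):
        # Run convolutionally over the concatenated (input, output) pair;
        # the result is a grid of per-patch predictions
        return self.net(torch.cat([inp, out], dim=1))

D = PatchGAN()
bce = nn.BCEWithLogitsLoss()               # averages over all patch responses
lam = 100.0                                # weight of the L1 term

inp = torch.rand(1, 3, 256, 256)           # input image (e.g. edge map)
target = torch.rand(1, 3, 256, 256)        # ground-truth output (e.g. photo)
fake = torch.rand(1, 3, 256, 256, requires_grad=True)  # stand-in for G(inp)

# Generator objective: fool the conditional discriminator on every patch
# and stay close to the ground truth in L1
pred = D(inp, fake)
g_loss = bce(pred, torch.ones_like(pred)) + lam * nn.functional.l1_loss(fake, target)
g_loss.backward()
```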

CycleGAN

  • Two generators, $G$ and $F$, translate images between two domains
$$ \begin{align} &G: X \to Y \\ &F: Y \to X \end{align} $$
  • Generator is a CNN consisting of an encoder, a transformer (a stack of residual blocks), and a decoder
  • Two discriminators, $D_Y$ and $D_X$, try to distinguish real images from generated images
  • Discriminator is a CNN that follows the PatchGAN architecture
  • Adversarial loss for $G$ and $D_Y$: $D_Y$ tries to distinguish translated samples $G(x)$ from real samples $y$, while $G$ tries to make $D_Y(G(x)) \to 1$:
$$ \begin{align} &D_Y(y) \to 1 \\ &D_Y(G(x)) \to 0 \end{align} $$
  • Adversarial loss for $F$ and $D_X$: $D_X$ tries to distinguish translated samples $F(y)$ from real samples $x$, while $F$ tries to make $D_X(F(y)) \to 1$:
$$ \begin{align} &D_X(x) \to 1 \\ &D_X(F(y)) \to 0 \end{align} $$
  • Cycle consistency loss expresses that an image translation cycle should bring back the original image (a combined-loss sketch follows these equations):
$$ \begin{align} &F(G(x)) \to x \\ &G(F(y)) \to y \end{align} $$
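
A minimal sketch of how the adversarial and cycle consistency terms combine into the generator loss, using tiny stand-in CNNs for $G$, $F$ and the discriminators; the λ = 10 weight and the cross-entropy adversarial loss (the original paper uses a least-squares variant) are simplifications for illustration:

```python
import torch
import torch.nn as nn

def tiny_generator():
    # Stand-in for the encoder / residual-block transformer / decoder CNN
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())

def tiny_discriminator():
    # Stand-in for the PatchGAN discriminator, with sigmoid probabilities
    return nn.Sequential(nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                         nn.Conv2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

G, F = tiny_generator(), tiny_generator()        # G: X -> Y, F: Y -> X
D_X, D_Y = tiny_discriminator(), tiny_discriminator()

x = torch.rand(1, 3, 64, 64)                     # image from domain X
y = torch.rand(1, 3, 64, 64)                     # image from domain Y
bce, l1, lam = nn.BCELoss(), nn.L1Loss(), 10.0

# Generator objective: fool both discriminators ...
fake_y, fake_x = G(x), F(y)
p_y, p_x = D_Y(fake_y), D_X(fake_x)
adv = bce(p_y, torch.ones_like(p_y)) + bce(p_x, torch.ones_like(p_x))
# ... and make each translation cycle return to the original image
cycle = l1(F(fake_y), x) + l1(G(fake_x), y)
gen_loss = adv + lam * cycle
gen_loss.backward()
```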

Reversible Generative Models

  • GANs cannot encode images into the latent space
  • VAEs support only approximate inference of latent variables from an image
  • A normalizing flow is a sequence of invertible transformations (see the coupling-layer sketch after this list)
  • Reversible generative models can encode an image into a latent space, making it possible to interpolate between two images
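
A minimal sketch of one invertible building block, an affine coupling layer in the spirit of RealNVP/Glow; the split into halves and the small coupling network are illustrative. Stacking such layers (alternating which half is transformed) gives a normalizing flow, and because each layer can be inverted exactly, an image can be encoded to a latent vector and a latent vector decoded back to an image, which is what makes latent-space interpolation possible:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Invertible layer: transform one half of the vector conditioned on the other."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        # Small network predicting scale and shift for the second half
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(log_s) + t         # invertible affine transform
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-log_s)      # exact inverse of forward
        return torch.cat([y1, x2], dim=1)

layer = AffineCoupling(dim=8)
x = torch.randn(4, 8)
z = layer(x)                                    # encode: data -> latent
x_rec = layer.inverse(z)                        # decode: latent -> data
print(torch.allclose(x, x_rec, atol=1e-5))      # True: the mapping is reversible
```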