Diffusion Models have surged to the forefront of generative models in recent years, and for compelling reasons. Landmark studies in the 2020s have showcased the remarkable capabilities of Diffusion Models, which notably outperform Generative Adversarial Networks (GANs) in image synthesis[6]. The integration of Diffusion Models into OpenAI’s DALL-E 2, a groundbreaking image generation model, has further cemented their significance.
Various images generated by DALL-E 2 (source).
Given the impressive strides made by Diffusion Models, many in the machine learning community are keen to understand their underlying mechanisms. This article will delve into the theoretical underpinnings of Diffusion Models and illustrate how to generate images using a Diffusion Model in PyTorch. For a more intuitive, less technical overview, you might find our piece on how physics propelled Generative AI insightful. Let’s embark on this exploration!
Diffusion Models: An Introduction to Deep Generative Learning
Diffusion Models are a class of generative deep learning models. This means they are designed to generate data that mirrors the data they were trained on. But are Diffusion Models a kind of deep learning in the traditional sense? The answer is a resounding yes. They leverage the power of neural networks, a cornerstone of deep learning, to achieve their impressive generative feats.
At their core, Diffusion Models operate by systematically destroying training data through the gradual addition of Gaussian noise. Crucially, they then learn to reverse this process, effectively learning to denoise the data. Once trained, a Diffusion Model can generate new data by simply feeding randomly sampled noise through this learned denoising process.
Diffusion Models can be used to generate images from noise (adapted from source)
More formally, a Diffusion Model is a latent variable model that utilizes a fixed Markov chain to map data to a latent space. This chain progressively adds noise to the data, defining the approximate posterior \( q(\textbf{x}_{1:T} \mid \textbf{x}_0) \), where \( \textbf{x}_1, \dots, \textbf{x}_T \) are latent variables with the same dimensionality as the original data \( \textbf{x}_0 \). As depicted below, this Markov chain transforms image data into a state of pure Gaussian noise over time.
The training objective of a diffusion model is to learn the reverse process – specifically, to train \( p_\theta(\textbf{x}_{t-1} \mid \textbf{x}_t) \). By traversing this chain in reverse, starting from random noise, we can generate novel data samples.
Advantages of Diffusion Models in Deep Learning
The research landscape surrounding Diffusion Models has exploded recently, driven by their exceptional performance. Drawing inspiration from non-equilibrium thermodynamics[1], Diffusion Models currently achieve state-of-the-art image quality. Examples of this superior quality are shown below:
(adapted from source)
Beyond exceptional image quality, Diffusion Models offer several other advantages within the realm of deep learning. Notably, they do not require adversarial training. The complexities and instabilities associated with GAN training are well-recognized. In scenarios where non-adversarial alternatives, like Diffusion Models, offer comparable performance and training efficiency, they often represent a more practical and robust choice. Furthermore, Diffusion Models exhibit excellent scalability and parallelizability in training, making them efficient to work with on large datasets.
While the results produced by Diffusion Models may seem almost magical, they are built upon a foundation of carefully considered mathematical principles and intricate details. Best practices in this field are still actively evolving. Let’s now delve deeper into the mathematical theory that underpins Diffusion Models and clarifies their connection to deep learning architectures.
Diffusion Models: A Deeper Dive into the Deep Learning Framework
As mentioned, a Diffusion Model comprises a forward process (or diffusion process) and a reverse process (or reverse diffusion process). In the forward process, data, typically an image, is progressively corrupted with noise. In the reverse process, this noise is transformed back into a coherent sample from the target data distribution using deep learning models.
The forward process’s sampling chain transitions can be modeled as conditional Gaussians when the noise level is sufficiently low. This, combined with the Markov assumption, simplifies the parameterization of the forward process:
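\[
q(\textbf{x}_{1:T} \mid \textbf{x}_0) := \prod_{t=1}^{T} q(\textbf{x}_t \mid \textbf{x}_{t-1}), \qquad
q(\textbf{x}_t \mid \textbf{x}_{t-1}) := \mathcal{N}\!\left(\textbf{x}_t;\; \sqrt{1 - \beta_t}\,\textbf{x}_{t-1},\; \beta_t \textbf{I}\right)
\]

(This is the standard forward-process parameterization from [3]; \( \beta_t \) is the variance schedule discussed below.)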
We’ve discussed data corruption through the addition of Gaussian noise. To clarify where this addition occurs, consider the equation above. At each step in the chain, we are essentially sampling from a Gaussian distribution where the mean is the previous value (image) in the chain.
This is mathematically equivalent to saying that each value in the chain is simply the previous value with Gaussian noise added to it. The equivalence follows from the principle that the distribution of a sum of independent random variables is the convolution of their individual distributions. More information can be found on this Wikipedia page.
In essence, we’ve demonstrated that defining the distribution of a timestep conditioned on the previous one via the mean of a Gaussian distribution is the same as defining the distribution of a given timestep as that of the previous one with added Gaussian noise. For simplicity, the scalars introduced by the variance schedule are omitted and the argument is given in one dimension (a minimal sketch follows below), but it extends to multivariate Gaussians as well.
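As a minimal one-dimensional sketch of this equivalence (variance-schedule scalars omitted, writing \( \sigma^2 \) for the added noise variance):

\[
x_t = x_{t-1} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)
\quad \Longleftrightarrow \quad
q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, x_{t-1},\, \sigma^2)
\]

since, conditioned on \( x_{t-1} \), adding the independent Gaussian noise \( \epsilon \) simply shifts its density to be centered at \( x_{t-1} \).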
Here \( \beta_1, \dots, \beta_T \) is a variance schedule (which can be fixed or learned). A well-behaved schedule ensures that \( \textbf{x}_T \) is approximately an isotropic Gaussian for sufficiently large T.
Given the Markov assumption, the joint distribution of latent variables is the product of Gaussian conditional chain transitions (modified from source).
The “magic” of diffusion models, and their deep learning aspect, truly emerges in the reverse process. During training, the model, typically a neural network, learns to reverse this diffusion process to generate new data. Starting from pure Gaussian noise \( p(\textbf{x}_T) := \mathcal{N}(\textbf{x}_T; \textbf{0}, \textbf{I}) \), the model learns the joint distribution \( p_\theta(\textbf{x}_{0:T}) \) as:
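\[
p_\theta(\textbf{x}_{0:T}) := p(\textbf{x}_T) \prod_{t=1}^{T} p_\theta(\textbf{x}_{t-1} \mid \textbf{x}_t)
\]

(This is the reverse-process factorization as written in [3].)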
Here, the time-dependent parameters of the Gaussian transitions are learned through deep learning. It’s important to note that the Markov formulation dictates that a reverse diffusion transition distribution depends only on the preceding timestep:
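\[
p_\theta(\textbf{x}_{t-1} \mid \textbf{x}_t) := \mathcal{N}\!\left(\textbf{x}_{t-1};\; \pmb{\mu}_\theta(\textbf{x}_t, t),\; \pmb{\Sigma}_\theta(\textbf{x}_t, t)\right)
\]

Each learned reverse transition is thus a Gaussian whose mean and covariance are produced by the network.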
(Modified from source)
Training Diffusion Models with Deep Learning Techniques
A Diffusion Model is trained by finding the reverse Markov transitions that maximize the likelihood of the training data. In practice, training is achieved by minimizing the variational upper bound on the negative log likelihood. This optimization process is where deep learning techniques become central, as neural networks are used to parameterize and learn these reverse transitions.
Note that \( L_{vlb} \) is technically an upper bound (the negative of the ELBO), but we refer to it as \( L_{vlb} \) for consistency with the literature.
We aim to rewrite \( L_{vlb} \) in terms of Kullback-Leibler (KL) Divergences. The KL Divergence is an asymmetric measure of how much one probability distribution P differs from a reference distribution Q. Formulating \( L_{vlb} \) using KL divergences is beneficial because the transition distributions in our Markov chain are Gaussians, and the KL divergence between Gaussians has a closed-form solution, making calculations efficient and tractable.
Understanding KL Divergence in Deep Learning Context
The mathematical definition of KL divergence for continuous distributions is:
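\[
D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx
\]

where \( p \) and \( q \) denote the densities of \( P \) and \( Q \).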
The double bars indicate that the function is not symmetric with respect to its arguments. The visualization below illustrates the KL divergence of a varying distribution P (blue) from a reference distribution Q (red). The green curve represents the integrand in the KL divergence definition, and the total area under it gives the value of the KL divergence of P from Q at any given moment.
Casting \( L_{vlb} \) in Terms of KL Divergences for Deep Learning Optimization
As previously mentioned, it is possible[1] to express \( L_{vlb} \) almost entirely in terms of KL divergences, with one term for each timestep of the chain.
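The resulting decomposition, as given in [3], is:

\[
L_{vlb} = L_T + L_{T-1} + \dots + L_0
\]

where (each term taken in expectation over q)

\[
L_T = D_{KL}\big(q(\textbf{x}_T \mid \textbf{x}_0) \,\|\, p(\textbf{x}_T)\big)
\]
\[
L_t = D_{KL}\big(q(\textbf{x}_t \mid \textbf{x}_{t+1}, \textbf{x}_0) \,\|\, p_\theta(\textbf{x}_t \mid \textbf{x}_{t+1})\big) \quad \text{for } 1 \leq t \leq T-1
\]
\[
L_0 = -\log p_\theta(\textbf{x}_0 \mid \textbf{x}_1)
\]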
The variational bound expands to:
Replacing distributions with their definitions based on our Markov assumption yields:
Using log rules to transform the expression into a sum of logs and extracting the first term:
Applying Bayes’ Theorem and the Markov assumption, this becomes:
Splitting the middle term using log rules:
Isolating the second term reveals:
Substituting this back into our equation for Lvlb:
Rearranging using log rules:
Noting the equivalence for KL divergence for any two distributions:
Finally, applying this equivalence to the preceding expression:
Conditioning the forward process posterior on \( \textbf{x}_0 \) in \( L_{t-1} \) leads to a tractable form where all KL divergences are comparisons between Gaussians. This tractability allows for the exact calculation of divergences using closed-form expressions, eliminating the need for computationally intensive Monte Carlo estimations[3]. This efficient calculation is crucial for effective deep learning training of diffusion models.
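For instance, the KL divergence between two univariate Gaussians has the well-known closed form

\[
D_{KL}\big(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\big)
= \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}
\]

and an analogous expression exists for multivariate Gaussians.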
Model Choices in Deep Learning Diffusion Models
With the mathematical groundwork for our objective function laid, we now address key implementation choices for our Diffusion Model. For the forward process, defining the variance schedule is the primary decision. Typically, variance values are set to increase throughout the forward process.
For the reverse process, we must select the Gaussian distribution parameterization and, critically, the deep learning model architecture. Diffusion Models offer significant flexibility here. The only architectural constraint is that the input and output dimensions must match, allowing for a wide range of neural network designs to be employed.
We will now delve into the specifics of these choices.
Forward Process, \( L_T \), and Variance Scheduling
As mentioned, the main requirement for the forward process is defining the variance schedule. We often set these variances as time-dependent constants, although they can also be learned. For instance[3], a linear schedule from \( \beta_1 = 10^{-4} \) to \( \beta_T = 0.02 \) might be used, or a geometric series could be implemented.
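As an illustrative sketch (plain PyTorch, not part of the `denoising-diffusion-pytorch` package used later), a linear schedule and the quantities derived from it that appear throughout the math can be precomputed as follows; the choice of T = 1000 follows [3]:

```python
import torch

T = 1000  # number of diffusion steps, as in [3]

# Linear variance schedule from beta_1 = 1e-4 to beta_T = 0.02 [3]
betas = torch.linspace(1e-4, 0.02, T)

# Derived quantities used by the forward and reverse processes
alphas = 1.0 - betas                            # alpha_t = 1 - beta_t
alphas_cumprod = torch.cumprod(alphas, dim=0)   # alpha-bar_t = prod_{s<=t} alpha_s
```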
Regardless of the specific schedule, the fixed nature of the variance schedule renders \( L_T \) a constant with respect to the learnable parameters, allowing us to disregard it during training.
Reverse Process, \( L_{1:T-1} \), and Neural Network Parameterization
Now, we consider the choices in defining the reverse process, which heavily involves deep learning architectures. Recall that the reverse Markov transitions are defined as Gaussians parameterized by \( \pmb{\mu}_\theta \) and \( \pmb{\Sigma}_\theta \), as above. We must therefore choose the functional forms for \( \pmb{\mu}_\theta \) and \( \pmb{\Sigma}_\theta \), which are typically parameterized by neural networks. While more complex parameterizations for \( \pmb{\Sigma}_\theta \) exist[5], a common simplification is to set \( \pmb{\Sigma}_\theta(\textbf{x}_t, t) = \sigma_t^2 \textbf{I} \); that is, the multivariate Gaussian is taken to be a product of independent Gaussians with an identical, time-varying variance, set to match the forward process variance schedule.
With this simplification, each reverse transition becomes \( \mathcal{N}(\textbf{x}_{t-1};\; \pmb{\mu}_\theta(\textbf{x}_t, t),\; \sigma_t^2 \textbf{I}) \).
This allows us to rewrite \( L_{t-1} \), a KL divergence whose second argument now has a fixed covariance, as a weighted squared difference between two means (plus a constant), where the first term in the difference is a linear combination of \( \textbf{x}_t \) and \( \textbf{x}_0 \) that depends on the variance schedule \( \beta_t \). The precise form is detailed in [3] and sketched below.
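For reference, the transformation referred to above takes the following form in [3], where \( \tilde{\pmb{\mu}}_t \) denotes the forward-process posterior mean and C is a constant independent of \( \theta \):

\[
L_{t-1} = D_{KL}\big(q(\textbf{x}_{t-1} \mid \textbf{x}_t, \textbf{x}_0) \,\|\, p_\theta(\textbf{x}_{t-1} \mid \textbf{x}_t)\big)
= \mathbb{E}_q\!\left[ \frac{1}{2\sigma_t^2} \left\| \tilde{\pmb{\mu}}_t(\textbf{x}_t, \textbf{x}_0) - \pmb{\mu}_\theta(\textbf{x}_t, t) \right\|^2 \right] + C
\]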
Crucially, the most direct parameterization of \( \pmb{\mu}_\theta \) would simply predict this diffusion posterior mean. However, the authors of [3] found that training the model instead to predict the noise component at each timestep yields superior results.
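In the notation of [3], this noise-prediction parameterization reads:

\[
\pmb{\mu}_\theta(\textbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \textbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \pmb{\epsilon}_\theta(\textbf{x}_t, t) \right)
\]

where \( \alpha_t := 1 - \beta_t \), \( \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s \), and \( \pmb{\epsilon}_\theta \) is the neural network that predicts the noise added at step t.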
This leads to the following alternative loss function, which the authors of [3] found results in more stable training and better outcomes:
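\[
L_{simple} = \mathbb{E}_{t,\, \textbf{x}_0,\, \pmb{\epsilon}} \left[ \left\| \pmb{\epsilon} - \pmb{\epsilon}_\theta\!\left( \sqrt{\bar{\alpha}_t}\, \textbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \pmb{\epsilon},\; t \right) \right\|^2 \right]
\]

Here \( \pmb{\epsilon} \sim \mathcal{N}(\textbf{0}, \textbf{I}) \) and the timestep t is sampled uniformly; this is the \( L_{simple} \) objective of [3].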
The authors also highlight connections between this Diffusion Model formulation and score-matching generative models based on Langevin dynamics. Diffusion Models and Score-Based models appear to be closely related, analogous to the independent developments of wave-based and matrix-based quantum mechanics, both representing equivalent formulations of the same phenomena[2]. This connection further solidifies diffusion models within the broader landscape of deep generative learning techniques.
Neural Network Architecture for Diffusion Models
While our simplified loss function trains a model \( \pmb{\epsilon}_\theta \), the specific architecture of this model remains to be defined. As noted, the only constraint is matching input and output dimensionality.
Given this, it is not surprising that image Diffusion Models frequently utilize U-Net-like architectures, which are a staple in deep learning for image processing tasks.
Architecture of U-Net (source)
Reverse Process Decoder and \( L_0 \)
The reverse diffusion process involves a series of transformations under continuous conditional Gaussian distributions. Ultimately, we aim to generate an image composed of discrete integer pixel values. Therefore, we need a method to obtain discrete (log) likelihoods for each possible pixel value across all pixels.
This is achieved by setting the final transition in the reverse diffusion chain to an independent discrete decoder. To determine the likelihood of an image \( \textbf{x}_0 \) given \( \textbf{x}_1 \), we first assume independence between the data dimensions, so that the likelihood factorizes into a product over the D dimensions of the data (with the superscript i denoting a single coordinate). The goal is then to find the likelihood of each integer value for a given pixel, conditioned on the distribution of possible values for the corresponding pixel in the slightly noised image at time \( t=1 \).
Pixel distributions for \( t=1 \) are derived from a multivariate Gaussian whose diagonal covariance matrix allows the distribution to be split into a product of univariate Gaussians, one per data dimension.
Assuming images consist of integers in \( \{0, 1, \dots, 255\} \) (standard RGB images) scaled linearly to \( [-1, 1] \), we divide the real line into small “buckets.” For a scaled pixel value x, the bucket is \( [x - 1/255,\; x + 1/255] \). The probability of a pixel value x, given the univariate Gaussian distribution of the corresponding pixel in \( \textbf{x}_1 \), is the area under that univariate Gaussian distribution within the bucket centered at x.
The visualization below shows the buckets and their probabilities for a mean-0 Gaussian, which in this scaled space corresponds to an average pixel value of \( 255/2 \) (half brightness). The red curve shows the distribution of a specific pixel in the \( t=1 \) image, and the areas give the probability of the corresponding pixel value in the \( t=0 \) image.
The first and last buckets extend to \( -\infty \) and \( +\infty \) respectively, so that the total probability is preserved.
Given a \( t=0 \) pixel value for each pixel, the value of \( p_\theta(\textbf{x}_0 \mid \textbf{x}_1) \) is simply the product of these per-pixel bucket probabilities.
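Following the formulation in [3], the decoder likelihood and the bucket limits can be written as:

\[
p_\theta(\textbf{x}_0 \mid \textbf{x}_1) = \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}\!\left(x;\; \mu_\theta^i(\textbf{x}_1, 1),\; \sigma_1^2\right) dx
\]

where

\[
\delta_+(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases}
\qquad
\delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}
\]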
Using this equation for \( p_\theta(\textbf{x}_0 \mid \textbf{x}_1) \), we can calculate the final term of \( L_{vlb} \), the one not formulated as a KL Divergence.
Final Objective for Deep Learning Diffusion Models
As noted, the authors of [3] found that predicting the noise component at each timestep yields the best results; the ultimate training objective is therefore the simplified loss \( L_{simple} \) given above.
The training and sampling algorithms for our Diffusion Model are summarized in the figure below:
(source)
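To make these algorithms concrete, here is a minimal sketch of one training step (Algorithm 1 in [3]) and of sampling (Algorithm 2 in [3]) in plain PyTorch. This is an illustration only, not the API of the `denoising-diffusion-pytorch` package used below; the `model(x, t)` signature, the 4-D image batches, and the choice \( \sigma_t^2 = \beta_t \) are assumptions made for the sketch.

```python
import torch

def train_step(model, x0, betas, optimizer):
    # One step of Algorithm 1 in [3]: pick random timesteps, noise the images,
    # and regress the model's output onto the added noise (the simplified loss).
    betas = betas.to(x0.device)
    T = len(betas)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # one timestep per image
    eps = torch.randn_like(x0)                                  # the noise to predict
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)                 # assumes (B, C, H, W) batches
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps        # noised input at step t

    loss = ((eps - model(x_t, t)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

@torch.no_grad()
def sample(model, shape, betas):
    # Algorithm 2 in [3]: start from pure Gaussian noise and iteratively denoise.
    T = len(betas)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_hat = model(x, torch.full((shape[0],), t, dtype=torch.long))
        x = (x - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        x = x + betas[t].sqrt() * z  # sigma_t^2 = beta_t, one of the choices in [3]
    return x
```

In practice, packages such as the one introduced below wrap this training and sampling logic for you.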
Diffusion Model Theory Summary: Deep Learning at its Core
We’ve explored the theory of Diffusion Models in detail. To maintain a clear perspective amidst the mathematical intricacies, here are key takeaways highlighting the deep learning aspects:
- Diffusion Models are parameterized as Markov chains, where latent variables \( \textbf{x}_1, \dots, \textbf{x}_T \) depend only on adjacent timesteps.
- Transition distributions in the Markov chain are Gaussian, with the forward process using a variance schedule and the reverse process parameters learned by neural networks.
- The diffusion process ensures \( \textbf{x}_T \) becomes asymptotically Gaussian for large T.
- Variance schedules are often fixed but can be learned. Geometric progressions may outperform linear ones. Variances generally increase with time (\( \beta_i < \beta_j \) for \( i < j \)).
- Diffusion Models are highly flexible, allowing for diverse neural network architectures (e.g., U-Net) due to the input-output dimensionality constraint.
- The training objective is to maximize training data likelihood, achieved by minimizing the variational upper bound of the negative log likelihood.
- Most objective function terms are KL Divergences, efficiently calculable for Gaussians, avoiding Monte Carlo approximations.
- A simplified training objective predicting the noise component yields stable training and superior results.
- A discrete decoder obtains pixel value log likelihoods as the final reverse diffusion step.
With this overview, let’s see how to implement Diffusion Models in PyTorch.
Diffusion Models in PyTorch: Deep Learning Implementation
While Diffusion Models are not yet as widely accessible as older machine learning architectures, user-friendly implementations are available. The `denoising-diffusion-pytorch` package provides a straightforward way to use Diffusion Models in PyTorch, implementing an image diffusion model similar to the one discussed above. Install it via:

```bash
pip install denoising_diffusion_pytorch
```
Minimal Example of Deep Learning Diffusion Model in PyTorch
To train a model and generate images, import necessary packages:
```python
import torch
from denoising_diffusion_pytorch import Unet, GaussianDiffusion
```
Define the network architecture, here a U-Net. `dim` sets the number of feature maps before down-sampling, and `dim_mults` provides multipliers for down-sampling:
```python
model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
)
```
Define the Diffusion Model, passing the U-Net model and parameters like image size, timesteps, and loss type:
```python
diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,   # number of steps
    loss_type = 'l1'    # L1 or L2
)
```
To train the Diffusion Model, generate random data and train on it as you usually would:
```python
training_images = torch.randn(8, 3, 128, 128)
loss = diffusion(training_images)
loss.backward()
```
After training, generate images using `diffusion.sample()`. Here we sample 4 images, which will simply look like noise given that our training data was random:
```python
sampled_images = diffusion.sample(batch_size = 4)
```
Training on Custom Datasets with Deep Learning Diffusion Models
The `denoising-diffusion-pytorch` package also allows training a Diffusion Model on a custom dataset. Replace the 'path/to/your/images' string with the path to your dataset directory in the `Trainer()` object, and adjust `image_size` as needed. Note that PyTorch must be CUDA-enabled in order to use the `Trainer` class:
```python
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 4, 8)
).cuda()

diffusion = GaussianDiffusion(
    model,
    image_size = 128,
    timesteps = 1000,
    loss_type = 'l1'
).cuda()

trainer = Trainer(
    diffusion,
    'path/to/your/images',
    train_batch_size = 32,
    train_lr = 2e-5,
    train_num_steps = 700000,         # total training steps
    gradient_accumulate_every = 2,
    ema_decay = 0.995,
    amp = True                        # turn on mixed precision
)

trainer.train()
```
The figure below shows progressive denoising from Gaussian noise to MNIST digits, demonstrating the reverse diffusion process:
Final Words: Diffusion Models as Deep Learning Powerhouses
Diffusion Models offer a conceptually simple yet powerful approach to data generation. Their state-of-the-art performance and non-adversarial training have propelled them to prominence, and further advancements are anticipated given their ongoing development. Diffusion Models are integral to cutting-edge models like DALL-E 2. Therefore, to return to the question “are Diffusion Models a kind of deep learning?”, the answer is definitively yes. Diffusion Models are deeply rooted in deep learning principles and leverage neural networks to achieve their impressive generative capabilities.
References
[1] Deep Unsupervised Learning using Nonequilibrium Thermodynamics
[2] Generative Modeling by Estimating Gradients of the Data Distribution
[3] Denoising Diffusion Probabilistic Models
[4] Improved Techniques for Training Score-Based Generative Models
[5] Improved Denoising Diffusion Probabilistic Models
[6] Diffusion Models Beat GANs on Image Synthesis
[7] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
[8] Hierarchical Text-Conditional Image Generation with CLIP Latents