Generalization of Diffusion Models: Principles, Theory, and Implications
![<strong>Figure 1.</strong> Overview of forward and reverse processes for diffusion models. Figure adapted from [7].](/media/agxdzywa/figure1.jpg)
Diffusion models have recently emerged as a new family of generative models [2, 6, 7, 10] that demonstrate remarkable performance across diverse domains, including text-to-image generation, video content generation, speech and audio synthesis, and solving inverse problems. The training and sampling of these models involve two stages: (i) a forward diffusion process that incrementally adds Gaussian noise to a training sample at each time step, and (ii) a backward sampling process that progressively removes noise via a neural network trained to approximate the score function at all time steps (see Figure 1).
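As a rough illustration of these two stages, the following sketch implements forward noising and one reverse denoising step for a DDPM-style variance-preserving schedule. The schedule values are common defaults and `score_fn` is a placeholder for a trained score network; this is a minimal sketch rather than a faithful reproduction of any specific model.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_noising(x0, t, rng):
    """Stage (i): corrupt a clean sample x0 with Gaussian noise at time step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def reverse_step(xt, t, score_fn, rng):
    """Stage (ii): one DDPM-style denoising step that uses the learned score."""
    z = rng.standard_normal(xt.shape) if t > 0 else np.zeros_like(xt)
    mean = (xt + betas[t] * score_fn(xt, t)) / np.sqrt(alphas[t])
    return mean + np.sqrt(betas[t]) * z
```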
Despite the remarkable empirical success of diffusion models, their fundamental mechanisms are still poorly understood. In theory, the training of diffusion models is designed to learn, via neural networks, the empirical distribution of the training samples (a sum of \(\delta\) functions at the training data points); this suggests that a perfectly trained network would simply memorize the training data. In practice, however, diffusion models can creatively generate new, sensible examples that differ significantly from the training data. This surprising discrepancy raises an important question about the generalizability of diffusion models.
Model Reproducibility Provides Insight Into Generalization
Our recent work reveals consistent model reproducibility among different diffusion models, which provides deep insight into the mystery of generalizability [10]. In this intriguing, widespread phenomenon, different diffusion models produce remarkably similar outputs when initialized with the same noise and sampled via the same sampling procedure. This outcome holds true across a variety of diffusion model frameworks, architectures, and training procedures (see Figure 2).
![<strong>Figure 2.</strong> Consistent model reproducibility across different diffusion models. Visual content generated by three different diffusion models from the same initial noise (i.e., the noise in each row and column is the same) looks nearly identical. All models were trained on the CIFAR-10 dataset. <strong>2a.</strong> Result of a denoising diffusion probabilistic model (DDPM). <strong>2b.</strong> Result of consistency training (CT). <strong>2c.</strong> Result of a vision transformer (ViT)-based architecture called U-ViT. Figure courtesy of [10].](/media/zyfhs41b/figure2.jpg)
Quantitatively, we measure the reproducibility (RP) with the following probability metric:
\[\textrm{RP score} := \mathbb{P}(\mathcal{M}_{\textrm{SSCD}}(\boldsymbol{x}_1, \boldsymbol{x}_2)>0.6),\]
where \(\mathcal{M}_{\textrm{SSCD}}(\boldsymbol{x}_1,\boldsymbol{x}_2)\) denotes the cosine similarity between \(\textrm{SSCD}(\boldsymbol{x}_1)\) and \(\textrm{SSCD}(\boldsymbol{x}_2)\). Here, \((\boldsymbol{x}_1,\boldsymbol{x}_2)\) is a pair of samples generated by two distinct diffusion models from the same initial noise, and \(\textrm{SSCD}(\cdot)\) is a neural descriptor for self-supervised copy detection (SSCD).
Strong model reproducibility implies that different diffusion models learn the same score functions. However, we have not yet answered the critical question: Why are diffusion models able to generalize? To answer this question, we introduce a generalization (GL) score that quantifies the distinction between a generated sample \(\boldsymbol{x}\) and the training samples:
\[\textrm{GL score} :=1-\mathbb{P}\bigg(\underset{i\in[N]}{\max}\mathcal{M}_\textrm{SSCD}(\boldsymbol{x},\boldsymbol{y}_i)>0.6\bigg),\]
where \(\{\boldsymbol{y}_i\}^N_{i=1}\) are the training samples. A lower GL score indicates stronger memorization, while a higher score indicates stronger generalizability.
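In practice, both metrics reduce to thresholded cosine similarities between SSCD descriptors. The sketch below assumes that the SSCD embeddings have already been computed and are passed in as NumPy vectors; the function names are illustrative, and the 0.6 threshold follows the definitions above.

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rp_score(embeds_model1, embeds_model2, tau=0.6):
    """Fraction of same-noise sample pairs whose SSCD cosine similarity exceeds tau."""
    hits = [cosine_sim(e1, e2) > tau for e1, e2 in zip(embeds_model1, embeds_model2)]
    return float(np.mean(hits))

def gl_score(gen_embeds, train_embeds, tau=0.6):
    """One minus the fraction of generated samples whose most similar training
    sample (in SSCD embedding space) exceeds the threshold tau."""
    memorized = [max(cosine_sim(e, y) for y in train_embeds) > tau for e in gen_embeds]
    return 1.0 - float(np.mean(memorized))
```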
![<strong>Figure 3.</strong> Co-emergence of reproducibility <strong>(3a)</strong> and generalizability <strong>(3b)</strong> of UNet architectures with three different parameter numbers (UNet-64, UNet-128, and UNet-256) in the denoising diffusion probabilistic model, which is trained on a varying number of images from the CIFAR-10 dataset. Figure courtesy of [10].](/media/ovhfo40w/figure3.jpg)
Figure 3 depicts the RP and GL scores for diffusion models that are trained with varying data sizes and model capacities. The results imply that these models learn one of two distinct distributions, corresponding to two separate training regimes:
(i) Memorization regime: When model capacity far exceeds training data size, all diffusion models overfit by memorizing the same empirical distribution of training data, ultimately leading to strong model reproducibility without generalizability.
(ii) Generalization regime: When model capacity is insufficient for memorization, diffusion models generate new samples. Here, reproducibility and generalizability co-emerge as all models learn the score function of the true distribution instead of memorizing the training data.
Moreover, at a fixed model capacity, diffusion models transition from memorization to generalization as the number of training samples increases. Empirically, this phase transition happens with a finite number of samples (e.g., 10,000 samples from the CIFAR-10 dataset). This behavior is fundamentally different from the regime in which the empirical distribution merely interpolates the underlying distribution, where the required sample size grows exponentially with the data dimension in the worst case.
Towards a Mathematical Theory of Generalization
We argue that diffusion models avoid the curse of dimensionality because they effectively learn low-dimensional distributions embedded in a high-dimensional space, for which the required sample size for generalization scales linearly with the intrinsic dimension. While real-world image datasets are high-dimensional in terms of pixel count and data volume, extensive empirical evidence implies that they have a significantly lower intrinsic dimension and reside on a union of low-dimensional manifolds [5].
![<strong>Figure 4.</strong> Phase transition of learning the mixture of low-rank Gaussian (MoLRG) distribution. The \(x\)-axis is the number of training samples, and the \(y\)-axis is the dimension of subspaces. Darker pixels represent a lower empirical probability of success. <strong>4a.</strong> Phase transition of subspace recovery via subspace clustering. <strong>4b.</strong> Phase transition of subspace recovery via the training of diffusion models. Figure courtesy of [8].](/media/zvgf5npt/figure4.jpg)
However, the complexity of image distributions makes it difficult to rigorously verify our claim on image datasets. To mathematically understand generalization behavior, let us consider a simple example of the mixture of low-rank Gaussian (MoLRG) distribution:
\[\boldsymbol{x} \sim \sum_{k=1}^K \pi_k\mathcal{N}(\mathbf{0}, \boldsymbol{U}^\star_k\boldsymbol{U}^{\star T}_k),\]
where \(\pi_k\) is the mixing proportion of the \(k\)th component, with \(\sum^K_{k=1} \pi_k=1\), and \(\boldsymbol{U}^\star_k\) is an orthonormal basis of the \(k\)th subspace. Note that MoLRG data lie on a union of linear subspaces, which locally approximates the low-dimensional manifolds of image data. Furthermore, the MoLRG model is amenable to mathematical analysis with closed-form score functions while also preserving the low-dimensional characteristics of image distributions.
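For intuition, the sketch below draws samples from an MoLRG distribution: each sample is a Gaussian combination of the columns of one randomly chosen orthonormal basis, so its conditional covariance is \(\boldsymbol{U}^\star_k\boldsymbol{U}^{\star T}_k\). The dimensions and the two-component example are purely illustrative.

```python
import numpy as np

def sample_molrg(n_samples, bases, mixing, rng):
    """Sample x = U_k c with c ~ N(0, I_d), where component k is drawn with
    probability pi_k; `bases` is a list of n-by-d_k orthonormal matrices."""
    n = bases[0].shape[0]
    ks = rng.choice(len(bases), size=n_samples, p=mixing)
    X = np.zeros((n_samples, n))
    for i, k in enumerate(ks):
        X[i] = bases[k] @ rng.standard_normal(bases[k].shape[1])
    return X

# Example: two 5-dimensional subspaces in ambient dimension n = 50.
rng = np.random.default_rng(0)
U1, _ = np.linalg.qr(rng.standard_normal((50, 5)))
U2, _ = np.linalg.qr(rng.standard_normal((50, 5)))
X = sample_molrg(1000, [U1, U2], [0.5, 0.5], rng)
```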
Importantly, the MoLRG setting differs from the generic mixture of Gaussian (MoG) distributions that are commonly studied in diffusion models: MoLRG focuses on learning the covariance matrices (i.e., the subspaces), whereas prior studies of generic MoG models have primarily emphasized learning the means.
If we consider the denoising autoencoder (DAE) formulation for training diffusion models, then
\[\underset{\boldsymbol{\theta}}{\min} \, l(\boldsymbol{\theta}):= \frac{1}{N}\sum^N_{i=1}\int^1_0 \lambda_t\mathbb{E}_{\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\boldsymbol{I}_n)}\bigg[\bigg\|\boldsymbol{x}_{\boldsymbol{\theta}}(s_t\boldsymbol{x}^{(i)}+\gamma_t\boldsymbol{\epsilon},t)-\boldsymbol{x}^{(i)}\bigg\|^2\bigg] \textrm{d}t.\]
Here, \(s_t\) and \(\gamma_t\) are predefined hyperparameters that control the noise schedule, and \(\lambda_t\) is a weighting function for the training loss. If we parameterize the DAE \(\boldsymbol{x}_{\boldsymbol{\theta}}\) according to the form of the optimal posterior estimator of the above MoLRG distribution (with learnable subspaces \(\boldsymbol{U}_k\)), we can show that the training loss is equivalent to the canonical subspace clustering problem [8]:
\[\underset{\boldsymbol{\theta}}{\max}\frac{1}{N}\sum^K_{k=1}\sum_{i\in C_k(\boldsymbol{\theta})}\|\boldsymbol{U}^T_k\boldsymbol{x}^{(i)}\|^2 \qquad \textrm{s.t.} \quad [\boldsymbol{U}_1,...,\boldsymbol{U}_K]\in \mathcal{O}^{n\times dK},\]
where \(C_k\) is a cluster assignment of the training samples that depends on \(\boldsymbol{\theta}\) (i.e., on \([\boldsymbol{U}_1,...,\boldsymbol{U}_K]\)) and \(\mathcal{O}^{n\times d}\) denotes the set of all \(n \times d\) orthonormal matrices. In particular, the above problem reduces to principal component analysis (PCA) when \(K=1\). This connection is particularly interesting because PCA exhibits a well-known phase transition, in which the number of samples required to learn the underlying subspace scales linearly with its intrinsic dimension. Figure 4 illustrates the extension of this intuition to subspace clustering, thus providing a foundation to better understand the generalization of diffusion models and explain why they can effectively capture low-dimensional distributions without the curse of dimensionality.
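The \(K=1\) case offers a simple sanity check of this linear scaling. With noiseless data from a single \(d\)-dimensional subspace, PCA (i.e., the top-\(d\) left singular vectors of the data matrix) recovers the subspace as soon as the number of samples reaches \(d\). The toy NumPy experiment below only illustrates this sharp transition under these simplifying assumptions; it does not train an actual diffusion model.

```python
import numpy as np

def subspace_recovery_error(n, d, N, rng):
    """K = 1: the objective reduces to PCA, so the learned subspace is spanned by
    the top-d left singular vectors of the data matrix."""
    U_star, _ = np.linalg.qr(rng.standard_normal((n, d)))    # ground-truth basis
    X = U_star @ rng.standard_normal((d, N))                  # N noiseless samples
    U_hat = np.linalg.svd(X, full_matrices=False)[0][:, :d]   # PCA estimate
    # Distance between the true and estimated projection matrices.
    return np.linalg.norm(U_star @ U_star.T - U_hat @ U_hat.T, 2)

rng = np.random.default_rng(0)
for N in [2, 5, 10, 20, 50]:
    err = subspace_recovery_error(n=100, d=10, N=N, rng=rng)
    print(f"N = {N:3d}, subspace error = {err:.3f}")  # drops to ~0 once N >= d
```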
![<strong>Figure 5.</strong> Low rankness of the Jacobian and local linearity of the denoising autoencoder (DAE). <strong>5a.</strong> The rank ratio of the Jacobian against timestep \(t\). <strong>5b.</strong> The norm ratio (top) and cosine similarity (bottom) between the DAE and its first-order Taylor expansion against step size \(\lambda\) at timestep \(t=0.7\). Figure courtesy of [1].](/media/krxhepcb/figure5.jpg)
Inductive Bias in the Learned Models and Practical Implications
Due to the low dimensionality of the data, trained diffusion models often display a bias towards certain benign solutions. Empirically, we find that this inductive bias emerges only in the generalization regime and leads to two benign properties of the learned DAE [4]:
(i) Low-rank Jacobian of the DAE: The Jacobian matrix \(\boldsymbol{J}_{\boldsymbol{\theta},t}(\boldsymbol{x}_t)=\nabla_{\boldsymbol{x}_t}\boldsymbol{x}_{\boldsymbol{\theta},t}(\boldsymbol{x}_t)\) of the DAE is low rank at intermediate noise levels, and its rank exhibits a U-shaped curve across time steps (see Figure 5a) [1].
(ii) Local linearity of the DAE: The learned DAE exhibits strong linearity within a large range of noise levels (see Figure 5b), which means that it approximately equals its first-order Taylor expansion [1, 4]:
\[\boldsymbol{x}_{\boldsymbol{\theta},t} (\boldsymbol{x}_t+\lambda\Delta\boldsymbol{x})\approx \boldsymbol{x}_{\boldsymbol{\theta},t}(\boldsymbol{x}_t)+ \lambda\boldsymbol{J}_{\boldsymbol{\theta},t}(\boldsymbol{x}_t)\Delta\boldsymbol{x}.\]
Hence, if we pick \(\Delta\boldsymbol{x}\) in the low-rank (column) subspace of \(\boldsymbol{J}_{\boldsymbol{\theta},t}\), we can manipulate the image generation; in contrast, picking \(\Delta\boldsymbol{x}\) in the nullspace of \(\boldsymbol{J}_{\boldsymbol{\theta},t}\) has minimal impact on the generated image.
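Both properties can be checked exactly in the \(K=1\) MoLRG model, where the optimal DAE has the closed form \(\boldsymbol{x}_{\boldsymbol{\theta},t}(\boldsymbol{x}_t)=\frac{s_t}{s_t^2+\gamma_t^2}\boldsymbol{U}\boldsymbol{U}^T\boldsymbol{x}_t\): its Jacobian is a scaled rank-\(d\) projector and the map is exactly linear. The sketch below (with illustrative dimensions and schedule values) verifies this; a trained network satisfies these properties only approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # subspace basis
s_t, gamma_t = 0.8, 0.6                            # noise-schedule values at some t

def dae(x_t):
    """Closed-form posterior-mean denoiser for the K = 1 MoLRG model."""
    return (s_t / (s_t**2 + gamma_t**2)) * (U @ (U.T @ x_t))

# The Jacobian is constant and equal to a scaled projector onto the subspace.
J = (s_t / (s_t**2 + gamma_t**2)) * (U @ U.T)
print("rank(J) / n =", np.linalg.matrix_rank(J) / n)   # = d / n, i.e., low rank

# Perturbations in the column space of J change the output; nullspace ones do not.
x_t = rng.standard_normal(n)
dx_signal = U[:, 0]                                    # direction inside the subspace
dx_null = rng.standard_normal(n)
dx_null -= U @ (U.T @ dx_null)                         # project onto the nullspace
print(np.linalg.norm(dae(x_t + dx_signal) - dae(x_t)))  # noticeable change
print(np.linalg.norm(dae(x_t + dx_null) - dae(x_t)))    # ~ 0
```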
![<strong>Figure 6.</strong> Examples of controlled generation. <strong>6a.</strong> The proposed low-rank controllable image editing method can perform precise localized editing in the region of interest. <strong>6b-6d.</strong> The editing direction is homogeneous <strong>(6b)</strong>, composable <strong>(6c)</strong>, and linear <strong>(6d)</strong>. Figure courtesy of [1].](/media/lounypnb/figure6.jpg)
These findings are useful in many applications, a few of which we highlight below:
- Controlled generation: In contrast to generative adversarial networks, unsupervised manipulation of a diffusion model's generation is notoriously difficult. By leveraging the two benign properties via low-rank controllable image editing, we manipulate the low-dimensional semantic subspace of the Jacobian to enable precise, disentangled image editing [1]. Our method identifies editing directions with several key properties: homogeneity, transferability, composability, and linearity (see Figure 6).
- Copyright protection and safety of generative artificial intelligence (AI): Watermarking is vital for the detection and prevention of AI-generated content misuse. We propose a technique called shallow diffuse, which leverages the nullspace of the low-rank Jacobian at intermediate noise levels to embed more imperceptible watermarks [3]; a conceptual sketch of this idea appears after this list. Our approach outperforms existing methods that operate at high noise levels with nearly full-rank Jacobians.
- Improving training efficiency: Training diffusion models is computationally intensive because of their large capacity. However, the benign inductive bias indicates that the number of effective parameters varies across noise levels and is often much smaller than the full model size at most timesteps, revealing substantial parameter redundancy (as demonstrated in Figure 5a). Leveraging this fact enables the design of more efficient network architectures [9].
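To make the watermarking idea concrete, the hypothetical sketch below splits a watermark perturbation into components inside and outside the dominant singular subspace of the DAE Jacobian and keeps only the nullspace part, so that the downstream generation is minimally affected. This is only a conceptual illustration of the principle, not the actual shallow diffuse algorithm from [3].

```python
import numpy as np

def embed_in_nullspace(x_t, watermark, J, energy_keep=0.99):
    """Project the watermark onto the (numerical) nullspace of the Jacobian J
    before adding it, so the perturbation barely changes the generated content.
    Conceptual sketch only; not the actual shallow diffuse method."""
    U, S, _ = np.linalg.svd(J)
    cum = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(cum, energy_keep)) + 1   # rank of the "signal" subspace
    U_signal = U[:, :r]
    w_null = watermark - U_signal @ (U_signal.T @ watermark)
    return x_t + w_null
```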
Qing Qu delivered a minisymposium presentation on this research at the 2024 SIAM Conference on Mathematics of Data Science, which took place in Atlanta, Ga., last October.
References
[1] Chen, S., Zhang, H., Guo, M., Lu, Y., Wang, P., & Qu, Q. (2024). Exploring low-dimensional subspace in diffusion models for controllable image editing. In Advances in neural information processing systems 37 (NeurIPS 2024). Vancouver, Canada.
[2] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. In NIPS'20: Proceedings of the 34th international conference on neural information processing systems (pp. 6840-6851). Vancouver, Canada: Curran Associates Inc.
[3] Li, W., Zhang, H., & Qu, Q. (2024). Shallow diffuse: Robust and invisible watermarking through low-dimensional subspaces in diffusion models. Preprint, arXiv:2410.21088.
[4] Li, X., Dai, Y., & Qu, Q. (2024). Understanding generalizability of diffusion models requires rethinking the hidden Gaussian structure. In Advances in neural information processing systems 37 (NeurIPS 2024). Vancouver, Canada.
[5] Pope, P., Zhu, C., Abdelkader, A., Goldblum, M., & Goldstein, T. (2021). The intrinsic dimension of images and its impact on learning. In Proceedings of the ninth international conference on learning representations (ICLR 2021). OpenReview.
[6] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 2022 (CVPR 2022) (pp. 10674-10685). New Orleans, LA: IEEE Computer Vision Society.
[7] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. In Proceedings of the ninth international conference on learning representations (ICLR 2021). OpenReview.
[8] Wang, P., Zhang, H., Zhang, Z., Chen, S., Ma, Y., & Qu, Q. (2024). Diffusion models learn low-dimensional distributions via subspace clustering. Preprint, arXiv:2409.02426.
[9] Zhang, H., Lu, Y., Alkhouri, I., Ravishankar, S., Song, D., & Qu, Q. (2024). Improving training efficiency of diffusion models via multi-stage framework and tailored multi-decoder architecture. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 2024 (CVPR 2024) (pp. 7372-7381). Seattle, WA: IEEE Computer Vision Society.
[10] Zhang, H., Zhou, J., Lu, Y., Guo, M., Wang, P., Shen, L., & Qu, Q. (2024). The emergence of reproducibility and consistency in diffusion models. In Proceedings of the 41st international conference on machine learning (ICML 2024) (pp. 60558-60590) (Vol. 235). Vienna, Austria: Proceedings of Machine Learning Research.
About the Authors
Huijie Zhang
Ph.D. student, University of Michigan
Huijie Zhang is a Ph.D. student in electrical and computer engineering at the University of Michigan. His research interests lie in generative models and diffusion models, including the generalizability and low-dimensional structures in diffusion models.

Peng Wang
Postdoctoral research fellow, University of Michigan
Peng Wang is a postdoctoral research fellow in electrical and computer engineering at the University of Michigan. His research focuses on mathematical foundations of deep learning models, including supervised learning models, diffusion models, and large language models.

Siyi Chen
Ph.D. student, University of Michigan
Siyi Chen is a Ph.D. student in electrical and computer engineering at the University of Michigan. She is interested in controllable and interpretable deep learning, such as diffusion models, multimodality, and unified models.

Zekai Zhang
Ph.D. student, University of Michigan
Zekai Zhang is a Ph.D. student in electrical and computer engineering at the University of Michigan. His research focuses on deep learning theory for modern applications, including diffusion models and vision language models.

Qing Qu
Assistant professor, University of Michigan
Qing Qu is an assistant professor in the Electrical Engineering and Computer Science Department at the University of Michigan. His research interests lie at the intersection of signal processing, machine learning, and numerical optimization, with a particular focus on computational methods that learn low-complexity models from high-dimensional data.
