13.5 VAE: Advantages, Limitations, and Future Developments

Author: jshn9515

Published: 2026-03-28

Modified: 2026-04-04

In the previous few sections, we have discussed:

How an AutoEncoder learns a latent representation through compression and reconstruction;
How VAE places the latent representation in a probabilistic modeling framework;
Why the ELBO becomes the training objective of VAE;
What typical phenomena appear during VAE training, such as latent space interpolation and posterior collapse.

At this point, we should already be able to answer what VAE is and why it is designed this way. But why has VAE not become the mainstream method for image generation today? What are its advantages and limitations?

In this section, we look at the advantages and limitations of VAE, and then look at its position in the field of generative models.

13.5.1 Advantages of VAE: Why It Is Important

VAE is a classic not only because it can generate images, but also because it was among the first models to combine several important ideas in a natural way:

Latent variable modeling
Parameterizing complex distributions with neural networks
Variational inference
End-to-end training

That is to say, VAE is not simply an AutoEncoder with added noise. Instead, it unifies representation learning, probabilistic modeling, and neural network training in one framework. This is what makes it a foundational model in the history of generative modeling.

Training is relatively stable

Compared with many later generative models, one major advantage of VAE is that its training is relatively stable. Its objective function comes from a clear probabilistic derivation: training maximizes the ELBO, that is:

\[ \mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - D_{\mathrm{KL}}(q_\phi(z\mid x)\,\|\,p(z)) \]

This means it does not need two networks to play an adversarial game as GAN does. There is no tug-of-war between generator and discriminator, so VAE largely avoids the training oscillation and mode collapse that adversarial training is prone to. From the perspective of training, VAE is therefore a relatively friendly generative model, which is also why it is a good starting point for learning generative models.
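As a minimal sketch of how this objective can be evaluated for one data point, assume a diagonal Gaussian posterior, a standard normal prior, and a Bernoulli decoder. None of these choices is fixed by this section, and the function names and toy numbers below are purely illustrative.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p(x | z) for a Bernoulli decoder with per-pixel means x_hat."""
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to keep log() finite
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def elbo(x, x_hat, mu, logvar):
    """Single-sample Monte Carlo estimate of the ELBO for one data point."""
    return bernoulli_log_likelihood(x, x_hat) - gaussian_kl(mu, logvar)


# Toy numbers (hypothetical): a 4-pixel "image" and a 2-d latent code.
x = [1.0, 0.0, 1.0, 1.0]
x_hat = [0.9, 0.2, 0.8, 0.7]          # decoder output for one sampled z
mu, logvar = [0.3, -0.1], [-0.2, 0.1]  # encoder output for x

loss = -elbo(x, x_hat, mu, logvar)     # the quantity a training loop would minimize
print(f"negative ELBO = {loss:.4f}")
```

There is a single scalar loss and a single optimization direction here, which is the practical meaning of "no adversarial game".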

It has a clear probabilistic interpretation

A plain AutoEncoder can also reconstruct its input and learn a latent representation, but it lacks clear probabilistic semantics. Every part of VAE, however, can be understood in the language of probabilistic graphical models:

\(p(z)\): prior distribution
\(p_\theta(x\mid z)\): generative model
\(q_\phi(z\mid x)\): approximate posterior
ELBO: an optimizable lower bound of the log-likelihood

So, we know what assumptions VAE is making, where the loss function comes from, why sampling generation is reasonable, and what each regularization term is constraining. This is often easier to understand and analyze than the more “black-box” generation of GAN.
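To make the "lower bound" reading of the ELBO concrete, the standard decomposition of the log-likelihood can be written as follows. Here \(p_\theta(z\mid x)\) denotes the true posterior, a symbol introduced only for this identity:

\[ \log p_\theta(x) = \mathcal{L}(\theta,\phi;x) + D_{\mathrm{KL}}(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)) \geq \mathcal{L}(\theta,\phi;x) \]

Because the KL divergence is never negative, the ELBO can never exceed \(\log p_\theta(x)\), and the gap closes exactly when the approximate posterior matches the true posterior.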

The latent space is continuous, interpretable, and interpolatable

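One of the most attractive properties of VAE is that the latent space it learns is usually more regular than that of a plain AE. The KL term pulls each encoded distribution \(q_\phi(z\mid x)\) toward the prior, usually a standard normal distribution. The pull is soft: the reconstruction term keeps the posteriors of different inputs distinct, so it is the encoded distributions taken together over the data distribution \(p_{\mathrm{data}}\), rather than each one individually, that end up roughly matching the prior:

\[ \mathbb{E}_{x\sim p_{\mathrm{data}}}[q_\phi(z\mid x)] \approx \mathcal{N}(0, I) \]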

This means the latent space usually has some very useful properties:

Continuity: nearby latent codes often correspond to similar samples
Smoothness: when moving along the latent space, the generated results gradually change
Interpolatability: when doing latent interpolation between two samples, natural transitions often appear

This is also why many tutorials on generative models use VAE to show the geometry of latent space.
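The interpolation experiment can be sketched in a few lines. This is only an illustration: linear interpolation is assumed, the two latent codes are made-up values standing in for encoder outputs, and the decoding step is left out.

```python
def lerp(z_a, z_b, t):
    """Point at fraction t on the straight line from z_a to z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, steps):
    """`steps` evenly spaced latent codes from z_a to z_b (steps >= 2)."""
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Hypothetical latent codes, e.g. the encoder means of two inputs.
z_a = [-1.0, 0.5]
z_b = [1.0, -0.5]

path = interpolation_path(z_a, z_b, steps=5)
for z in path:
    print(z)  # each z would be passed to the decoder to render one frame
```

If the latent space is smooth, decoding consecutive points of `path` should give a gradual transition between the two original samples.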

13.5.2 Limitations of VAE: Why the Generated Results Are Often Blurry

Although VAE is elegant, it also has some very typical limitations. The most commonly mentioned one is:

Images generated by VAE are often relatively smooth, and even a bit blurry.

This is the first impression many beginners have when they run a VAE. The model seems to understand the rough outline, but the details are not sharp, and the edges are not as crisp as those produced by GANs or later diffusion models. This is not accidental. It is related to the objective function, probabilistic assumptions, and training method of VAE.

The reconstruction objective is more biased toward “reasonable in an average sense”

VAE is trained by maximizing the ELBO, a lower bound on the log-likelihood. In many implementations, the reconstruction term of the ELBO reduces to a pixel-wise error: MSE for a Gaussian decoder with fixed variance, or BCE for a Bernoulli decoder. When one input may correspond to multiple plausible details, a pixel-wise loss often encourages the model to output an averaged result.

For example, for a face image:

Some local textures may have multiple reasonable values
The tiny direction of hair may not be unique
Background details may have uncertainty

If the model tries to account for multiple possibilities at the pixel level at the same time, the safest choice is often to output a compromise: their average. This averaging appears visually as blur. So, the blur of VAE does not necessarily mean that it has failed to learn. More often it means that what it optimizes is probabilistic likelihood and overall structure, not the visually sharpest samples.
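A tiny numerical experiment illustrates the averaging effect. It is a sketch under a deliberately simplified assumption: one pixel whose plausible values are either black or white with equal probability, and a model that must commit to one constant prediction under MSE.

```python
def mse(prediction, targets):
    """Mean squared error of one constant prediction against many targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# A pixel that is black (0.0) in half of the plausible images and white (1.0)
# in the other half, e.g. a strand of hair that may or may not cross it.
targets = [0.0] * 50 + [1.0] * 50

candidates = [i / 100 for i in range(101)]            # 0.00, 0.01, ..., 1.00
best = min(candidates, key=lambda p: mse(p, targets))

print(best)                # the minimizer is the mean of the targets: grey
print(mse(0.0, targets))   # committing to "black" is penalized more ...
print(mse(best, targets))  # ... than the blurry compromise
```

Neither sharp answer is ever wrong by itself, yet the loss prefers the grey value that never occurs in the data. Repeated over many pixels, this preference shows up as blur.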

KL regularization compresses expressive capacity

Another key objective of VAE is to make the encoded distribution close to the prior:

\[ D_{\mathrm{KL}}(q_\phi(z\mid x)\,\|\,p(z)) \]

This term makes the latent space more regular, but it also means the model cannot encode every sample in an unrestricted and highly personalized way. In other words, the KL term is encouraging the model not to hide every sample in a very remote corner of the latent space, not to make the posterior distribution too complex, and to express the data in a more compact and more unified way as much as possible. This of course helps when sampling from the prior during generation, but the cost is that some high-frequency details, local differences, and fine textures may be sacrificed.

So, VAE is making a trade-off: it wants reconstruction to look more like the original image, and it also wants the latent space to be more regular. These two goals are not always completely consistent.
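The cost side of this trade-off can be made explicit. Assuming a diagonal Gaussian posterior \(q_\phi(z\mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\) with \(d\) latent dimensions and a standard normal prior (a common choice, though not the only possible one), the KL term has the closed form:

\[ D_{\mathrm{KL}}(q_\phi(z\mid x)\,\|\,p(z)) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\right) \]

Each summand is zero only when \(\mu_j = 0\) and \(\sigma_j = 1\). Moving a mean away from the origin or shrinking a variance, which is how the encoder stores information about \(x\), always increases the penalty.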

When the decoder is too strong, the latent variable may be ignored

This leads to a deeper problem: posterior collapse.

Posterior collapse refers to a situation that gradually appears during training:

\[ q_\phi(z\mid x) \approx p(z) \]

That is to say, no matter what the input \(x\) is, the posterior output by the encoder looks almost like the prior. This means the latent variable carries almost no information related to \(x\). If the decoder is strong enough at this time, it may rely only on its own modeling ability to reconstruct the data, and not really use \(z\) very much.

The consequence is that the KL term becomes very small, the latent variable loses information, and the model can still seem to train on the surface, but the latent representation becomes meaningless. This is especially common in sequence modeling and text modeling, because a powerful autoregressive decoder can easily bypass the latent variable.
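One simple way to look for this symptom is to average the KL term per latent dimension over a batch and count how many dimensions stay clearly above zero. The sketch below assumes a diagonal Gaussian posterior and a standard normal prior; the threshold of 0.01 and the encoder outputs are arbitrary illustrative values, not a standard.

```python
import math


def kl_per_dimension(mu, logvar):
    """Closed-form KL to N(0, 1) for each latent dimension of one sample."""
    return [0.5 * (m * m + math.exp(lv) - lv - 1.0) for m, lv in zip(mu, logvar)]


def active_dimensions(batch_mu, batch_logvar, threshold=0.01):
    """Return (indices whose batch-averaged KL exceeds threshold, averaged KLs)."""
    n = len(batch_mu)
    dims = len(batch_mu[0])
    mean_kl = [0.0] * dims
    for mu, logvar in zip(batch_mu, batch_logvar):
        for j, kl in enumerate(kl_per_dimension(mu, logvar)):
            mean_kl[j] += kl / n
    active = [j for j, kl in enumerate(mean_kl) if kl > threshold]
    return active, mean_kl


# Hypothetical encoder outputs for three inputs and a 3-d latent space.
# Dimension 0 varies with the input; dimensions 1 and 2 sit on the prior.
batch_mu = [[1.2, 0.0, 0.0], [-0.8, 0.0, 0.0], [0.3, 0.0, 0.0]]
batch_logvar = [[-1.0, 0.0, 0.0], [-1.2, 0.0, 0.0], [-0.9, 0.0, 0.0]]

active, mean_kl = active_dimensions(batch_mu, batch_logvar)
print(active)   # dimensions that still carry information about x
print(mean_kl)
```

If the list of active dimensions is empty, or shrinks toward empty during training while the reconstruction loss keeps improving, the decoder is probably ignoring \(z\).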

13.5.3 Several Common Improvements of VAE

The classic structure of VAE is already elegant, but researchers soon found that if they wanted to further improve representation quality, generation quality, or make the latent variables have more semantic structure, they needed to make various extensions. Below are several of the most common and most worth knowing directions.

\(\beta\)-VAE: putting more emphasis on latent space regularity (Higgins et al. 2017)

\(\beta\)-VAE is based on the standard VAE, but multiplies the KL term by a coefficient \(\beta\):

\[ \mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z\mid x)}[\log p_\theta(x\mid z)] - \beta D_{\mathrm{KL}}(q_\phi(z\mid x)\,\|\,p(z)) \]

When \(\beta > 1\), the model is pushed more strongly toward the prior. The latent space becomes more regular and it becomes easier to separate independent factors of variation, but reconstruction quality may decrease. Independent factors here means that different dimensions of the latent space are more likely to correspond to different semantic attributes, such as pose, scale, rotation, lighting, or color. This kind of representation is usually called a disentangled representation.
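In code, the change relative to the standard objective is a single multiplication. The sketch below takes the two terms as already-computed numbers. The function name, the default \(\beta = 4\), and the per-sample values are all hypothetical.

```python
def beta_vae_loss(recon_log_likelihood, kl, beta=4.0):
    """Negative beta-VAE objective for one sample (to be minimized)."""
    return -(recon_log_likelihood - beta * kl)


# Hypothetical per-sample values, as an ELBO computation would produce them.
recon_log_likelihood = -85.0  # E_q[log p(x | z)], estimated from one sample of z
kl = 12.0                     # KL(q(z | x) || p(z))

for beta in (1.0, 4.0):
    print(beta, beta_vae_loss(recon_log_likelihood, kl, beta))
```

With these numbers the loss rises from 97.0 at \(\beta = 1\) to 133.0 at \(\beta = 4\). The same encoder output is penalized more heavily, so the optimizer is pushed to reduce the KL term even at some cost in reconstruction.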

Conditional VAE: conditional generation (Sohn et al. 2015)

If we want the model not only to generate randomly, but to generate according to a condition, we can add the conditional variable \(y\) into the model. For example, given conditions such as class labels, attributes, or text descriptions, the model generates the corresponding image. At this point, the generation process becomes:

\[ p_\theta(x\mid z, y) \]

And the encoder can also be written as:

\[ q_\phi(z\mid x, y) \]

In this way, the model is not only learning what the overall data looks like, but learning how the data changes under a certain condition.

Conditional VAE is also a basic starting point for many conditional generation methods.
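One common way to feed \(y\) into both networks is to concatenate a one-hot label to their inputs. The sketch below shows only that wiring. Concatenation is an assumption made for illustration and is not necessarily the construction used in the cited paper, and all names and sizes are hypothetical.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list of floats."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def encoder_input(x, y, num_classes):
    """Input to q(z | x, y): the data vector followed by the one-hot label."""
    return list(x) + one_hot(y, num_classes)


def decoder_input(z, y, num_classes):
    """Input to p(x | z, y): the latent code followed by the one-hot label."""
    return list(z) + one_hot(y, num_classes)


# Hypothetical sizes: 4 pixels, 2 latent dimensions, 3 classes.
x = [0.1, 0.9, 0.8, 0.0]
z = [0.5, -1.0]

print(encoder_input(x, y=2, num_classes=3))
print(decoder_input(z, y=0, num_classes=3))
```

At generation time, sampling \(z\) from the prior and choosing \(y\) by hand lets the decoder produce a sample of the requested class.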

VQ-VAE: turning continuous latent variables into a discrete codebook (Oord et al. 2018)

The standard VAE uses continuous latent variables, usually assuming a Gaussian distribution. VQ-VAE (Vector Quantized VAE) takes another path: it discretizes the latent representation.

Put simply, it maintains a codebook. After the encoder outputs a continuous vector, the model maps it to the nearest embedding in the codebook. The latent variable then behaves more like a discrete symbol, which makes it well suited to autoregressive or token-based models. VQ-VAE has had a large influence on image modeling, speech modeling, and discrete representation learning, and the ideas behind many later visual token methods can be traced back to it.
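The nearest-embedding lookup can be sketched as follows, assuming squared Euclidean distance and a tiny made-up codebook. The sketch covers only the quantization step. The training machinery of VQ-VAE, such as passing gradients through the non-differentiable lookup and updating the codebook, is omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(z_e, codebook):
    """Return (index, embedding) of the codebook entry nearest to z_e."""
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(z_e, codebook[k]))
    return index, codebook[index]


# Hypothetical codebook with 4 entries of dimension 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs for three spatial positions.
encoder_outputs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.7]]

tokens = [quantize(z_e, codebook)[0] for z_e in encoder_outputs]
print(tokens)
```

For this toy input the script prints `[1, 2, 3]`: each continuous vector has been replaced by an integer index, which is the form a token-based model can consume directly.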

13.5.4 Summary

At this point, let us first compare several generative models together:

Table 1: Comparison of the advantages and disadvantages of generative models
Model	Core idea	Advantages	Limitations
GAN	Adversarial training between generator and discriminator	Sharp and realistic samples	Unstable training, prone to mode collapse
AutoEncoder	Compress and reconstruct the input	Simple structure, suitable for representation learning	Latent space is irregular, not suitable for direct sampling
VAE	Latent variable probabilistic modeling and variational inference	Stable training, smooth latent space, can sample	Results tend to be blurry, posterior collapse may exist
Diffusion	Generation through gradual noising and denoising	High quality, relatively stable training	Slow sampling, relatively complex system

So, VAE is not a model that pursues the strongest visual quality. Instead, it pursues a more regular, more interpretable, and more probabilistically meaningful generation framework. Therefore, if our question is how to combine probabilistic modeling and representation learning, and how to make the latent space more structured, then VAE is almost an unavoidable step.

VAE is an important milestone in the history of generative models. Before VAE appeared, people already knew that they could use an AutoEncoder to learn data representations, use probabilistic graphical models for latent variable modeling, and use variational methods to approximate complex posterior distributions. The key contribution of VAE is that it unified these ideas into one framework and formed an end-to-end trainable generative model. In the years after VAE, GANs and later diffusion models came to dominate image generation. LDM (Latent Diffusion Model) then proposed first compressing data into a more compact latent space with clearer structure and completing the generation process in that latent space, which can significantly improve generation efficiency. This idea underlies Stable Diffusion, one of the most widely used image generation models. In a sense, the latent space that Stable Diffusion relies on is a latent representation learned by a VAE-style autoencoder.

In the next chapter, we will turn to the last topic in generative models: diffusion models.

References

Higgins, Irina, Loic Matthey, Arka Pal, et al. 2017. Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. https://openreview.net/forum?id=Sy2fzU9gl.

Oord, Aaron van den, Oriol Vinyals, and Koray Kavukcuoglu. 2018. Neural Discrete Representation Learning. https://arxiv.org/abs/1711.00937.

Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. 2015. Learning Structured Output Representation Using Deep Conditional Generative Models. 28.

Reuse

CC BY-NC 4.0