Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

Shizhan Liu*, Xinran Deng*, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, Jie Tang
Zhipu AI
*Equal Contribution;   Project Leader;   Corresponding Author.

Video Generation Comparison

Visual comparison of 17×512×512 generation results using 4B diffusion models. SSVAE demonstrates fewer artifacts and better temporal consistency.

Wan 2.2 VAE vs. SSVAE (Ours)

Overview

Latent video diffusion models have advanced text-to-video generation by coupling VAE-based tokenizers with diffusion backbones. However, existing video VAEs primarily pursue reconstruction fidelity, often overlooking how the structure of VAE latents shapes downstream diffusion training dynamics. This mismatch between the training objective and the downstream target means that stronger reconstruction does not necessarily translate into better generative utility.

In this work, we bridge this gap by conducting a comprehensive statistical analysis of video VAE latent spaces. We identify two spectral properties that are essential for facilitating diffusion training:

  1. A spatio-temporal frequency spectrum biased toward low frequencies.
  2. A channel-wise eigenspectrum dominated by a few modes (Few-Mode Bias).

Teaser

Figure 1. Our SSVAE identifies and induces two critical spectral properties, achieving a 3× speedup in convergence and superior generation quality compared to open-source baselines.


To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization (LCR) and Latent Masked Reconstruction (LMR). Our resulting model, Spectral-Structured VAE (SSVAE), accelerates convergence by 3× and improves video reward by 10%, consistently outperforming state-of-the-art open-source VAEs. Notably, SSVAE paired with a 1.3B diffusion backbone achieves a higher UnifiedReward score than Wan 2.2 VAE paired with a 4B diffusion model, while using 67.5% fewer parameters.

1. Spatio-Temporal Frequency Spectrum Shaping

Frequency Analysis

SER [1] observes that a spatial spectrum biased toward low frequencies (or, equivalently, the suppression of high-frequency components) correlates with improved diffusion training. Intuitively, high-SNR low-frequency components facilitate the recovery of low-SNR high-frequency details during denoising, simplifying optimization. We find that this low-frequency biasing principle likewise applies to the spatio-temporal frequency spectrum.

We analyze the frequency characteristics of latents using the 3D DCT. Existing regularizers for image VAEs, such as SER [1] (Scale-Equivariant Regularization) and VA-VAE [2], attempt to bias the spectrum but largely leave the temporal dimension of video latents unaddressed. As shown in the figure, they suppress high-frequency components less effectively than our proposed method.
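
As a concrete illustration of this analysis, the spatio-temporal spectrum of a latent can be estimated with an orthonormal 3D DCT over its temporal and spatial axes. The snippet below is a minimal sketch under assumed conventions (the [C, T, H, W] latent layout, channel-averaged energy, and a simple radial high-frequency cutoff are our choices, not the paper's exact protocol).

```python
import numpy as np
from scipy.fft import dctn

def latent_frequency_spectrum(latent: np.ndarray) -> np.ndarray:
    """3D DCT energy spectrum of a latent tensor shaped [C, T, H, W],
    averaged over channels. Low indices correspond to low frequencies."""
    # Orthonormal type-II DCT over the temporal and spatial axes, per channel.
    coeffs = dctn(latent, type=2, norm="ortho", axes=(1, 2, 3))
    return (coeffs ** 2).mean(axis=0)

def high_frequency_ratio(energy: np.ndarray, cutoff: float = 0.5) -> float:
    """Fraction of spectral energy above a normalized frequency radius."""
    t, h, w = energy.shape
    ft, fh, fw = np.meshgrid(
        np.arange(t) / max(t - 1, 1),
        np.arange(h) / max(h - 1, 1),
        np.arange(w) / max(w - 1, 1),
        indexing="ij",
    )
    radius = np.sqrt(ft ** 2 + fh ** 2 + fw ** 2) / np.sqrt(3.0)
    return float(energy[radius > cutoff].sum() / energy.sum())

# Example: a hypothetical latent with 48 channels and 5x32x32 positions.
z = np.random.randn(48, 5, 32, 32).astype(np.float32)
print(high_frequency_ratio(latent_frequency_spectrum(z)))
```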

Solution: Local Correlation Regularization (LCR)

Motivated by the Wiener–Khinchin theorem, we observe that low-frequency energy is governed by the similarity of latent vectors at neighboring spatio-temporal positions. To efficiently induce this bias, we propose LCR.
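As a brief, simplified illustration (one-dimensional, wide-sense-stationary latents; the notation is ours, not the paper's), the Wiener–Khinchin theorem writes the power spectrum as the Fourier transform of the autocorrelation:

$$
S(\omega)=\sum_{\tau=-\infty}^{\infty} R(\tau)\,e^{-i\omega\tau},
\qquad
R(\tau)=\mathbb{E}\big[\langle z_{p},\, z_{p+\tau}\rangle\big].
$$

Since the total energy $R(0)$ is fixed by the latent scale, raising the correlation $R(\tau)$ between nearby positions shifts spectral mass toward low frequencies, which is precisely the bias LCR is designed to induce.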

LCR Method

Figure 3. LCR explicitly enhances correlation within local spatio-temporal patches, effectively biasing the spectrum toward low frequencies.
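
As a sketch of how such a regularizer might look in practice, the loss below penalizes dissimilarity between each latent vector and its immediate neighbor along every spatio-temporal axis. The cosine-similarity formulation, the weight, and all names are illustrative assumptions, not the exact LCR objective.

```python
import torch
import torch.nn.functional as F

def local_correlation_loss(z: torch.Tensor) -> torch.Tensor:
    """Encourage neighboring latent vectors to be similar (low-frequency bias).

    z: latent tensor shaped [B, C, T, H, W]. Returns 1 minus the mean cosine
    similarity between each position and its +1 neighbor along each
    spatio-temporal axis (an illustrative stand-in for LCR, not the exact loss).
    """
    loss = z.new_zeros(())
    for dim in (2, 3, 4):  # temporal, height, width
        n = z.size(dim)
        if n < 2:
            continue
        a = z.narrow(dim, 0, n - 1)   # positions 0 .. n-2
        b = z.narrow(dim, 1, n - 1)   # positions 1 .. n-1
        sim = F.cosine_similarity(a, b, dim=1)  # similarity over channels
        loss = loss + (1.0 - sim).mean()
    return loss / 3.0

# Example: add to the VAE objective with a small (assumed) weight.
z = torch.randn(2, 48, 5, 32, 32, requires_grad=True)
total_loss = 0.1 * local_correlation_loss(z)
total_loss.backward()
```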

2. Channel Eigenspectrum Shaping

We further investigate the channel-wise covariance of VAE latents. We observe a distinct difference in the cumulative explained variance between VAEs with different channel counts (e.g., 48 vs. 128 channels). High-channel VAEs tend to distribute eigenvalues evenly, whereas low-channel VAEs exhibit a Few-Mode Bias (FMB).

PCA Analysis

Figure 4. Comparative analysis of latent channel covariance. A few-mode-biased latent space is associated with lower diffusion loss scale and faster convergence.
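
For reference, a cumulative-explained-variance curve of the kind discussed above can be obtained from an ordinary eigendecomposition of the channel covariance; the sketch below uses assumed shapes and names.

```python
import numpy as np

def channel_cumulative_explained_variance(latents: np.ndarray) -> np.ndarray:
    """Cumulative explained variance of the channel-wise covariance.

    latents: array shaped [N, C], each row a latent vector sampled at one
    spatio-temporal position. Returns a length-C array whose k-th entry is the
    fraction of variance captured by the top-(k+1) eigenmodes; a curve that
    saturates quickly indicates a Few-Mode Bias.
    """
    centered = latents - latents.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(centered) - 1)   # [C, C] covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]             # descending order
    return np.cumsum(eigvals) / eigvals.sum()

# Example: 10k positions drawn from a hypothetical 48-channel latent space.
samples = np.random.randn(10_000, 48).astype(np.float32)
print(channel_cumulative_explained_variance(samples)[:8])
```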

Our experiments show that artificially penalizing the covariance to force a few-mode bias leads to lower diffusion loss and higher generation rewards. To understand why, we analyze the learning dynamics from a cross-correlation perspective. We theoretically derive that:

"Few-mode biased latent spaces can accelerate diffusion model convergence by amplifying the absolute mode strengths in the output-input cross-correlation matrix."

Essentially, when energy is concentrated in a few dominant modes, the diffusion model learns the signal-noise relationship more efficiently.
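
One simplified way to make this concrete (under our own illustrative assumptions of zero-mean latents, noise independent of the data, and an $x_0$-prediction target, not the paper's exact derivation) is to look at the cross-correlation between the regression target and the noisy input at timestep $t$:

$$
x_t=\alpha_t x_0+\sigma_t\epsilon
\;\;\Rightarrow\;\;
\mathbb{E}\!\left[x_0\,x_t^{\top}\right]=\alpha_t\,\mathbb{E}\!\left[x_0 x_0^{\top}\right]=\alpha_t\Sigma .
$$

The mode strengths of this output-input cross-correlation are $\alpha_t\lambda_i$, with $\lambda_i$ the eigenvalues of the latent covariance $\Sigma$. Holding the total variance $\mathrm{tr}(\Sigma)$ fixed, concentrating it in a few modes enlarges the leading $\alpha_t\lambda_i$, and linear learning-dynamics analyses associate larger cross-correlation mode strengths with faster fitting of those modes.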

Solution: Latent Masked Reconstruction (LMR)

To reliably promote this property while ensuring decoder robustness, we introduce Latent Masked Reconstruction (LMR).

LMR Method

LMR randomly masks latent tokens across spatio-temporal dimensions and forces the decoder to reconstruct the video. This mechanism compels the encoder to compress essential information into a few dominant modes (promoting FMB) and simultaneously trains the decoder to handle noisy/incomplete inputs, which is critical for generating high-quality videos from diffusion outputs.
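
A minimal sketch of this masking mechanism is shown below; the mask ratio, the learned mask token, and the module name are illustrative assumptions about one possible implementation rather than the paper's exact training recipe.

```python
import torch
import torch.nn as nn

class LatentMaskedReconstruction(nn.Module):
    """Illustrative latent masking: randomly drop latent tokens before decoding.

    Each spatio-temporal position of a [B, C, T, H, W] latent is replaced by a
    learned mask token with probability `mask_ratio`; the decoder must still
    reconstruct the full video from the partially masked latent.
    """

    def __init__(self, channels: int, mask_ratio: float = 0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, channels, 1, 1, 1))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return z
        # Bernoulli keep-mask over positions, shared across channels.
        b, _, t, h, w = z.shape
        keep = (torch.rand(b, 1, t, h, w, device=z.device) > self.mask_ratio).to(z.dtype)
        return keep * z + (1.0 - keep) * self.mask_token

# Example usage inside a VAE forward pass (encoder/decoder are placeholders):
#   z = encoder(video)
#   video_hat = decoder(LatentMaskedReconstruction(channels=48)(z))
#   loss = reconstruction_loss(video_hat, video)
```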

Key Results

We conduct extensive experiments across different diffusion backbones and resolutions. As shown in Table 1, our SSVAE consistently achieves the best generation performance across multiple benchmarks (VBench, MovieGenBench, and MovieValid).

Table 1 Experiments

Table 1. Text-to-Video generation comparison across various video VAEs. Best results are bolded.

Notably, for 17×512×512 video generation, SSVAE achieves a 3× convergence speedup in terms of UnifiedReward (UR) compared to the 48-channel baseline. Furthermore, SSVAE demonstrates a ~10% gain in video reward over the strong open-source Wan 2.2 VAE, validating that the induced spectral properties significantly benefit downstream diffusion training.

Reference

[1] Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders. In Forty-second International Conference on Machine Learning, 2025.
[2] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.

Cite Our Work

@misc{liu2025delvinglatentspectralbiasing,
      title={Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability}, 
      author={Shizhan Liu and Xinran Deng and Zhuoyi Yang and Jiayan Teng and Xiaotao Gu and Jie Tang},
      year={2025},
      eprint={2512.05394},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.05394}, 
}