Visual comparison of 17Ă—512Ă—512 generation results using 4B diffusion models. SSVAE demonstrates fewer artifacts and better temporal consistency.
Latent video diffusion models have advanced text-to-video generation by coupling VAE-based tokenizers with diffusion backbones. However, existing video VAEs primarily pursue reconstruction fidelity and largely overlook how the structure of VAE latents shapes downstream diffusion training dynamics. This mismatch between the VAE's training objective and its downstream use means that stronger reconstruction does not necessarily translate to better generative utility.
In this work, we bridge this gap by conducting a comprehensive statistical analysis of video VAE latent spaces. We identify two spectral properties that are essential for facilitating diffusion training: (i) a spatio-temporal frequency spectrum biased toward low frequencies, and (ii) a channel covariance whose energy is concentrated in a few dominant modes (a few-mode bias).
Figure 1. Our SSVAE identifies and induces two critical spectral properties, achieving a 3Ă— speedup in convergence and superior generation quality compared to open-source baselines.
To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization (LCR) and Latent Masked Reconstruction (LMR). Our resulting model, Spectral-Structured VAE (SSVAE), accelerates convergence by 3Ă— and improves video reward by 10%, consistently outperforming state-of-the-art open-source VAEs. Compared with the Wan 2.2 VAE, SSVAE paired with a 1.3B diffusion backbone achieves a higher UnifiedReward score than Wan 2.2 VAE paired with a 4B diffusion model, while using 67.5% fewer parameters.
SER [1] observes that a spatial spectrum biased toward low frequencies (or, equivalently, the suppression of high-frequency components) correlates with improved diffusion training. Intuitively, high-SNR low-frequency components facilitate the recovery of low-SNR high-frequency details during denoising, simplifying optimization. We find that this low-frequency biasing principle likewise applies to the spatio-temporal frequency spectrum.
We analyze the frequency characteristics of latents using the 3D DCT. Existing regularizers designed for image VAEs, such as SER (Scale-Equivariant Regularization) [1] and VA-VAE [2], attempt to bias the spectrum but do not adequately address the temporal dimension of video latents. As shown in the figure, they suppress high-frequency components far less effectively than our proposed method.
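As a rough illustration of how such a spectrum can be measured, the sketch below applies a 3D DCT over the temporal and spatial axes of a latent tensor and averages the energy within radial frequency bands; the band partition and normalization are our own assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of a 3D spectral analysis for video VAE latents.
# Assumptions (not from the paper): latents have shape (C, T, H, W),
# energy is averaged over channels, and frequencies are grouped into
# radial bands by normalized DCT index.
import numpy as np
from scipy.fft import dctn

def latent_spectrum(latent: np.ndarray, num_bands: int = 8) -> np.ndarray:
    """Return mean DCT energy per radial frequency band for a (C, T, H, W) latent."""
    C, T, H, W = latent.shape
    # 3D DCT over the spatio-temporal axes, one transform per channel.
    coeffs = dctn(latent, axes=(1, 2, 3), norm="ortho")
    energy = (coeffs ** 2).mean(axis=0)  # average over channels -> (T, H, W)

    # Normalized frequency radius in [0, 1] for each DCT coefficient.
    ft, fh, fw = np.meshgrid(
        np.arange(T) / max(T - 1, 1),
        np.arange(H) / max(H - 1, 1),
        np.arange(W) / max(W - 1, 1),
        indexing="ij",
    )
    radius = np.sqrt(ft ** 2 + fh ** 2 + fw ** 2) / np.sqrt(3.0)

    # Average energy inside each radial band; low band index = low frequency.
    bands = np.minimum((radius * num_bands).astype(int), num_bands - 1)
    return np.array([energy[bands == b].mean() for b in range(num_bands)])
```

A band-energy curve that decays quickly from the lowest band then indicates a stronger low-frequency bias.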
Motivated by the Wiener–Khinchin theorem, we observe that low-frequency energy is governed by the similarity of latent vectors at neighboring spatio-temporal positions. To efficiently induce this bias, we propose LCR.
Figure 3. LCR explicitly enhances correlation within local spatio-temporal patches, effectively biasing the spectrum toward low frequencies.
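The summary above does not spell out the exact loss, so the following is only a minimal sketch of a local correlation regularizer, assuming it rewards cosine similarity between each latent vector and its immediate neighbor along each spatio-temporal axis; the neighborhood size and weighting are placeholders.

```python
# Hypothetical sketch of a Local Correlation Regularization (LCR) term.
# Assumption: the penalty decreases as each latent vector becomes more
# similar (in cosine similarity) to its neighbors along the T/H/W axes;
# the exact formulation in the paper may differ.
import torch
import torch.nn.functional as F

def lcr_loss(z: torch.Tensor) -> torch.Tensor:
    """z: latent of shape (B, C, T, H, W). Returns a scalar penalty."""
    sims = []
    for dim in (2, 3, 4):  # temporal, height, width
        a = z.narrow(dim, 0, z.size(dim) - 1)
        b = z.narrow(dim, 1, z.size(dim) - 1)
        sims.append(F.cosine_similarity(a, b, dim=1).mean())
    # Maximizing neighbor similarity <=> minimizing (1 - similarity).
    return 1.0 - torch.stack(sims).mean()
```

In training, such a term would simply be added to the usual reconstruction and KL objectives with a small weight, e.g. `loss = recon + beta * kl + lambda_lcr * lcr_loss(z)` (weights here are hypothetical).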
We further investigate the channel-wise covariance of VAE latents. We observe a distinct difference in the cumulative explained variance between VAEs with different channel counts (e.g., 48 vs. 128 channels). High-channel VAEs tend to distribute eigenvalues evenly, whereas low-channel VAEs exhibit a Few-Mode Bias (FMB).
Figure 4. Comparative analysis of latent channel covariance. A few-mode-biased latent space is associated with lower diffusion loss scale and faster convergence.
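The covariance analysis itself is straightforward to reproduce; below is a minimal sketch under our own assumptions, treating every spatio-temporal position of a (B, C, T, H, W) latent as one C-dimensional sample.

```python
# Sketch: cumulative explained variance of the channel covariance of latents.
# Assumption: latents of shape (B, C, T, H, W) are flattened so that each
# spatio-temporal position contributes one C-dimensional sample.
import torch

def cumulative_explained_variance(z: torch.Tensor) -> torch.Tensor:
    """Returns a length-C tensor: fraction of variance explained by the
    top-k channel-covariance eigenmodes, for k = 1..C."""
    B, C = z.shape[0], z.shape[1]
    x = z.reshape(B, C, -1).permute(0, 2, 1).reshape(-1, C)  # (N, C) samples
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / (x.shape[0] - 1)                       # (C, C)
    eigvals = torch.linalg.eigvalsh(cov).flip(0)             # descending
    return torch.cumsum(eigvals, dim=0) / eigvals.sum()
```

A curve that saturates after only a few modes corresponds to the few-mode bias discussed here.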
Our experiments show that artificially penalizing the covariance to force a few-mode bias leads to lower diffusion loss and higher generation rewards. To understand why, we analyze the learning dynamics from a cross-correlation perspective. We theoretically derive that:
"Few-mode biased latent spaces can accelerate diffusion model convergence by amplifying the absolute mode strengths in the output-input cross-correlation matrix."
Essentially, when energy is concentrated in a few dominant modes, the diffusion model learns the signal-noise relationship more efficiently.
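Purely as an illustration of the probe described above (not the paper's eventual regularizer), one way to "artificially penalize the covariance" is to suppress the variance carried by all but the top-k eigenmodes of the channel covariance; the choice of k and the normalization are our assumptions.

```python
# Illustrative probe only: encourage a few-mode bias by penalizing the
# eigenvalues outside the top-k of the channel covariance. The paper's
# actual experiment may use a different penalty.
import torch

def few_mode_penalty(z: torch.Tensor, k: int = 8) -> torch.Tensor:
    """z: (B, C, T, H, W). Penalizes variance carried by non-dominant modes."""
    B, C = z.shape[0], z.shape[1]
    x = z.reshape(B, C, -1).permute(0, 2, 1).reshape(-1, C)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / (x.shape[0] - 1)
    eigvals = torch.linalg.eigvalsh(cov)            # ascending order
    tail = eigvals[:-k] if k < C else eigvals[:0]   # all but the top-k
    return tail.sum() / eigvals.sum().clamp_min(1e-8)
```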
To reliably promote this property while ensuring decoder robustness, we introduce Latent Masked Reconstruction (LMR).
LMR randomly masks latent tokens across spatio-temporal dimensions and forces the decoder to reconstruct the video. This mechanism compels the encoder to compress essential information into a few dominant modes (promoting FMB) and simultaneously trains the decoder to handle noisy/incomplete inputs, which is critical for generating high-quality videos from diffusion outputs.
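A minimal sketch of how such masking might look in training follows, assuming masking is applied per spatio-temporal latent position with a fixed ratio; the granularity, mask ratio, and reconstruction loss are our assumptions.

```python
# Hypothetical sketch of Latent Masked Reconstruction (LMR).
# Assumptions: masking zeroes all channels at randomly chosen spatio-temporal
# positions of the latent, and the decoder must reconstruct the full video
# from the masked latent.
import torch
import torch.nn.functional as F

def lmr_loss(encoder, decoder, video: torch.Tensor,
             mask_ratio: float = 0.25) -> torch.Tensor:
    """video: (B, 3, T, H, W). Returns the masked-reconstruction loss."""
    z = encoder(video)                                   # (B, C, t, h, w)
    B, C, t, h, w = z.shape
    keep = (torch.rand(B, 1, t, h, w, device=z.device) > mask_ratio).to(z.dtype)
    recon = decoder(z * keep)                            # decode masked latent
    return F.l1_loss(recon, video)
```

Because diffusion outputs are never perfect latents, a decoder trained on masked inputs of this kind is more tolerant of the residual noise it receives at sampling time.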
We conduct extensive experiments across different diffusion backbones and resolutions. As shown in Table 1, our SSVAE consistently achieves the best generation performance across multiple benchmarks (VBench, MovieGenBench, and MovieValid).
Table 1. Text-to-Video generation comparison across various video VAEs. Best results are bolded.
Notably, for 17Ă—512Ă—512 video generation, SSVAE achieves a 3Ă— convergence speedup in terms of UnifiedReward (UR) compared to the 48-channel baseline. Furthermore, SSVAE demonstrates a ~10% gain in video reward over the strong open-source Wan 2.2 VAE, validating that the induced spectral properties significantly benefit downstream diffusion training.
[1] Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders. In Forty-second International Conference on Machine Learning, 2025.
[2] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
@misc{liu2025delvinglatentspectralbiasing,
title={Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability},
author={Shizhan Liu and Xinran Deng and Zhuoyi Yang and Jiayan Teng and Xiaotao Gu and Jie Tang},
year={2025},
eprint={2512.05394},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.05394},
}