VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Longbin Ji  ·  Xiaoxiong Liu  ·  Junyuan Shang  ·  Shuohuan Wang  ·  Yu Sun  ·  Hua Wu  ·  Haifeng Wang

ERNIE Team, Baidu

Abstract

Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74—competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.
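To make the two-level factorization described above concrete, the sketch below illustrates how decoding could proceed: frames are generated causally (next-frame prediction), and within each frame token maps are generated coarse-to-fine over scales (next-scale prediction), conditioned on all previous frames and on the coarser scales of the current frame. This is a minimal illustrative sketch only; all module and function names (MultiScaleTokenizer-style `tokenizer`, `transformer`, `generate_video`, and their arguments) are hypothetical placeholders and are not the authors' released API.

```python
# Hedged sketch of a next-frame & next-scale decoding loop, as described in the abstract.
# All class/function names and signatures below are hypothetical placeholders, not VideoAR's API.
import torch

def generate_video(tokenizer, transformer, text_emb, num_frames, scales):
    """Autoregressive generation: next-frame over time, next-scale within each frame.

    tokenizer   -- hypothetical 3D multi-scale tokenizer exposing a decode() method
    transformer -- hypothetical causal transformer returning logits over the token vocabulary
    text_emb    -- conditioning embedding (e.g. from a text encoder)
    scales      -- token-map side lengths from coarse to fine, e.g. [1, 2, 4, 8, 16]
    """
    frames_tokens = []                        # per-frame lists of multi-scale token maps
    for t in range(num_frames):               # causal next-frame prediction
        frame_scales = []
        for s in scales:                      # intra-frame coarse-to-fine (VAR) prediction
            # Condition on text, all previously generated frames, and the coarser
            # scales already generated for the current frame.
            logits = transformer(
                text_emb=text_emb,
                past_frames=frames_tokens,    # temporal context (next-frame)
                past_scales=frame_scales,     # spatial context (next-scale)
                target_scale=s,
            )                                 # assumed shape: (s * s, vocab_size)
            probs = logits.softmax(dim=-1)
            tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
            frame_scales.append(tokens.view(s, s))
        frames_tokens.append(frame_scales)
    # The multi-scale tokenizer decodes token maps back to pixels (shape assumed).
    return tokenizer.decode(frames_tokens)    # -> (num_frames, H, W, 3)
```

The point of the sketch is the nesting of the two autoregressive loops: the outer loop gives temporal causality, while the inner coarse-to-fine loop replaces raster-order token generation and accounts for the reduced number of inference steps reported above.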

Tap or click a video to play it. Videos may take a moment to load.

VideoAR-Pro Gallery

VideoAR-Pro is a preliminary internal prototype built on VideoAR that explores unified video-audio autoregressive generation.

VideoAR-4B Gallery

BibTeX

@misc{ji2026videoarautoregressivevideogeneration,
      title={VideoAR: Autoregressive Video Generation via Next-Frame \& Scale Prediction},
      author={Longbin Ji and Xiaoxiong Liu and Junyuan Shang and Shuohuan Wang and Yu Sun and Hua Wu and Haifeng Wang},
      year={2026},
      eprint={2601.05966},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.05966},
}

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The source code of this website is borrowed from the Nerfies project page.