arXiv · 2026

Native Audio-Visual Alignment for Generation

A native alignment framework that achieves state-of-the-art audio-visual synchronization with only 6.3B parameters.

ERNIE Team

Baidu Inc.

720p 1min Fast Generation
🎵 Dual-Channel Audio
🎤 Precise Multi-Timbre Control
🎥 Language-Described Camera Control
📹 Multi-Resolution (Landscape / Portrait / Square)

A native space where audio and video co-evolve.

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual and acoustic content. Existing open-source methods mainly follow dual-tower designs, which generate audio and video in separate streams and rely on posterior alignment, or fully unified tri-modal designs, which mix textual context, audio, and video in a single shared space. These paradigms either weaken fine-grained audio-video co-evolution or couple semantic conditioning with low-level synchronization.

We propose NAVA, a Native Audio-Visual Alignment framework that formulates generation as context-conditioned native audio-visual alignment. NAVA first establishes audio-video correspondence in a dedicated alignment space and then applies context as external conditioning to guide the aligned representation.

We instantiate this formulation with an Align-then-Fuse MMDiT architecture, which progressively bridges modality-aware alignment and unified audio-video denoising. To support controllable speech generation, we further introduce Timbre-in-Context Conditioning, which binds reference timbre cues to corresponding speech spans through the context pathway.

Experiments on Verse-Bench and the Seed-TTS benchmark demonstrate that NAVA achieves superior audio-visual synchronization and video quality, competitive audio quality, and substantially improved reference-timbre controllability with only 6.3B parameters.

Align-then-Fuse MMDiT.

[Figure: NAVA architecture]

Figure 1. Overview of NAVA. Hierarchical Alignment Layers establish audio-video correspondence in a dedicated alignment space; Unified Fusion Layers then perform context-conditioned denoising. Timbre-in-Context Conditioning binds reference timbre cues to speech spans via the context pathway.
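To make the data flow in Figure 1 concrete, here is a toy, shape-level sketch in plain Python. It is not the NAVA implementation: the page does not specify layer internals, so the identity-projection attention, the residual updates, the layer counts, and every function name below (`alignment_layer`, `fusion_layer`, `align_then_fuse`) are assumptions made for illustration. It only shows the ordering described above: audio and video first attend to each other in separate streams, then the concatenated tokens are processed jointly, with context entering only as external conditioning.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Toy single-head dot-product attention with identity projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        out.append([sum(w * m[j] for w, m in zip(weights, memory)) for j in range(dim)])
    return out


def add_residual(tokens, update):
    """Element-wise residual connection."""
    return [[t + u for t, u in zip(tok, upd)] for tok, upd in zip(tokens, update)]


def alignment_layer(video, audio):
    """Assumed alignment step: each modality keeps its own stream
    but reads from the other one (no text context involved)."""
    new_video = add_residual(video, attend(video, audio))
    new_audio = add_residual(audio, attend(audio, video))
    return new_video, new_audio


def fusion_layer(joint, context):
    """Assumed fusion step: joint self-attention over audio-video tokens,
    followed by cross-attention to the external context."""
    joint = add_residual(joint, attend(joint, joint))
    return add_residual(joint, attend(joint, context))


def align_then_fuse(video, audio, context, n_align=2, n_fuse=2):
    """Run alignment layers first, then fusion layers (hypothetical depths)."""
    for _ in range(n_align):
        video, audio = alignment_layer(video, audio)
    joint = video + audio  # concatenate along the token axis
    for _ in range(n_fuse):
        joint = fusion_layer(joint, context)
    return joint[:len(video)], joint[len(video):]


random.seed(0)
dim = 4
video = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
audio = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(8)]
context = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
video_out, audio_out = align_then_fuse(video, audio, context)
print(len(video_out), len(audio_out), len(video_out[0]))
```

In this sketch the context tokens never enter the alignment layers, which mirrors the stated idea of separating low-level synchronization from semantic conditioning. A real MMDiT would additionally use learned projections, timestep modulation, and positional encodings, none of which are modeled here.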

State-of-the-art with 6.3B parameters.

Table 1. General Capability on Verse-Bench

NAVA achieves the best Sync-C, Sync-D, and video quality, and the second-best IB score, with the smallest parameter budget.

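Column groups: AV-Align (Sync-C, Sync-D, IB), Video Quality, Audio (WER, PQ, FD).

| Model | Params | Resolution | Sync-C ↑ | Sync-D ↓ | IB ↑ | Video Quality ↑ | WER ↓ | PQ ↑ | FD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Ovi 1.1 | 10B | 720p | 7.4839 | 7.9791 | 0.199 | 0.636 | 0.102 | 5.8432 | 0.9418 |
| MOVA | A18B (32B) | 720p | 7.2888 | 7.808 | 0.269 | 0.603 | 0.126 | 7.2331 | 0.9222 |
| Davinci | 15B | 540p | 7.1487 | 7.8158 | 0.269 | 0.600 | 0.151 | 5.9559 | 0.9307 |
| LTX 2.3 | 19B | 512p | 7.2476 | 7.6902 | 0.337 | 0.576 | 0.106 | 6.9459 | 0.8287 |
| NAVA (ours) | 6.3B | 720p | 7.7914 | 7.5655 | 0.313 | 0.659 | 0.099 | 6.8609 | 0.8328 |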

↑ higher is better; ↓ lower is better.
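The bold and underline markers of the original table do not survive in plain text, so the best and second-best entries per column can be recomputed from the numbers. The short script below does that; the values are transcribed from Table 1, while the dictionary layout and the helper name `rank_models` are illustrative choices, not part of the paper.

```python
# Table 1 values as transcribed from the page, in column order:
# Sync-C, Sync-D, IB, Video Quality, WER, PQ, FD
METRICS = ["Sync-C", "Sync-D", "IB", "Video Quality", "WER", "PQ", "FD"]

# Direction of each metric, following the arrows in the table header.
HIGHER_IS_BETTER = {
    "Sync-C": True, "Sync-D": False, "IB": True, "Video Quality": True,
    "WER": False, "PQ": True, "FD": False,
}

TABLE_1 = {
    "Ovi 1.1": [7.4839, 7.9791, 0.199, 0.636, 0.102, 5.8432, 0.9418],
    "MOVA":    [7.2888, 7.808,  0.269, 0.603, 0.126, 7.2331, 0.9222],
    "Davinci": [7.1487, 7.8158, 0.269, 0.600, 0.151, 5.9559, 0.9307],
    "LTX 2.3": [7.2476, 7.6902, 0.337, 0.576, 0.106, 6.9459, 0.8287],
    "NAVA":    [7.7914, 7.5655, 0.313, 0.659, 0.099, 6.8609, 0.8328],
}


def rank_models(metric):
    """Return model names ordered from best to worst on one metric."""
    col = METRICS.index(metric)
    return sorted(
        TABLE_1,
        key=lambda name: TABLE_1[name][col],
        reverse=HIGHER_IS_BETTER[metric],
    )


for metric in METRICS:
    best, second = rank_models(metric)[:2]
    print(f"{metric:14s} best={best:8s} second={second}")
```

On these numbers NAVA ranks first on Sync-C, Sync-D, Video Quality, and WER, and second on IB and FD (behind LTX 2.3 in both cases); MOVA leads on PQ.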

Table 2. Timbre-Control Speech Performance

Audio-only models are listed for reference only: they are dedicated speech systems and not directly comparable. Among joint audio-video models, NAVA delivers speech quality close to that of dedicated audio-only systems.

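| Category | Model | WER ↓ | Speaker Similarity ↑ |
|---|---|---|---|
| Audio-Only (reference) | CosyVoice | 4.29 | 60.9 |
| Audio-Only (reference) | CosyVoice2 | 2.57 | 65.2 |
| Audio-Only (reference) | Qwen2.5-Omni | 2.72 | 63.2 |
| Audio-Video | DreamID-Omni | 31.76 | 35.7 |
| Audio-Video | NAVA (ours) | 4.20 | 66.7 |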

Table 3. User Study

We conduct human GSB (Win / Tie / Lose) preference studies on both T2AV and TI2AV against open-source baselines (Ovi-1.1, LTX-2.3, MoVA, daVinci). NAVA achieves competitive Overall Quality across all comparisons and wins on Audio-Visual Alignment against all baselines.

[Figure: user study GSB results]
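For readers unfamiliar with GSB-style comparisons, the sketch below shows one common way to summarize Win / Tie / Lose counts for a single pairwise comparison. The net-preference formula (wins minus losses, divided by all judgments) is a widely used convention and an assumption here; the page does not state how its scores are aggregated. The counts in the usage example are hypothetical and are not results from the paper.

```python
def gsb_summary(wins, ties, losses):
    """Summarize Win / Tie / Lose counts from one pairwise comparison.

    net_preference = (wins - losses) / total is an assumed convention,
    not a formula taken from the NAVA page.
    """
    total = wins + ties + losses
    if total == 0:
        raise ValueError("at least one judgment is required")
    return {
        "win_rate": wins / total,
        "tie_rate": ties / total,
        "lose_rate": losses / total,
        "net_preference": (wins - losses) / total,
    }


# Hypothetical counts for illustration only.
summary = gsb_summary(wins=50, ties=30, losses=20)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A positive net preference means the evaluated model was preferred more often than the baseline; ties lower the magnitude without changing the sign.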

Generated samples.

All samples are generated end-to-end by NAVA. Audio and video share a single denoising trajectory, with no posterior alignment and no extra components.

Cite this work.

@misc{ji2026nava,
  title = {Native Audio-Visual Alignment for Generation},
  author = {Longbin Ji and Guan Wang and Xuan Wei and Chenye Yang and Xiangrui Liu and Zhenyu Zhang and Shuohuan Wang and Yu Sun and Jingzhou He},
  year = {2026},
  eprint = {2605.30073},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2605.30073},
}

We thank the contributors to Wan2.2-TI2V-5B, LTX-Video, ReDimNet, Qwen3, and Ovi for their open-source work, which greatly helped this project.