NAVA · Native Audio-Visual Alignment for Generation

Highlights

⚡720p 1min Fast Generation

🎵Dual-Channel Audio

🎤Precise Multi-Timbre Control

🎥Language-Described Camera Control

📹Multi-Resolution (Landscape / Portrait / Square)

Abstract

A native space where audio and video co-evolve.

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual and acoustic content. Existing open-source methods mainly follow dual-tower designs, which generate audio and video in separate streams and rely on posterior alignment, or fully unified tri-modal designs, which mix textual context, audio, and video in a single shared space. These paradigms either weaken fine-grained audio-video co-evolution or couple semantic conditioning with low-level synchronization.

We propose NAVA, a Native Audio-Visual Alignment framework that formulates generation as context-conditioned native audio-visual alignment. NAVA first establishes audio-video correspondence in a dedicated alignment space and then applies context as external conditioning to guide the aligned representation.

We instantiate this formulation with an Align-then-Fuse MMDiT architecture, which progressively bridges modality-aware alignment and unified audio-video denoising. To support controllable speech generation, we further introduce Timbre-in-Context Conditioning, which binds reference timbre cues to corresponding speech spans through the context pathway.

Experiments on Verse-Bench and the Seed-TTS benchmark demonstrate that NAVA achieves superior audio-visual synchronization and video quality, competitive audio quality, and substantially improved reference-timbre controllability with only 6.3B parameters.

Method

Align-then-Fuse MMDiT.

Figure 1. Overview of NAVA. Hierarchical Alignment Layers establish audio-video correspondence in a dedicated alignment space; Unified Fusion Layers then perform context-conditioned denoising. Timbre-in-Context Conditioning binds reference timbre cues to speech spans via the context pathway.

Quantitative Results

State-of-the-art with 6.3B parameters.

Table 1.General Capability on VerseBench

NAVA achieves the best AV synchronization (Sync-C / Sync-D / IB) and video quality with the smallest parameter budget.

Model	Params	Resolution	AV-Align			Video Quality ↑	Audio
Model	Params	Resolution	Sync-C ↑	Sync-D ↓	IB ↑	Video Quality ↑	WER ↓	PQ ↑	FD ↓
Ovi 1.1	10B	720p	7.4839	7.9791	0.199	0.636	0.102	5.8432	0.9418
MOVA	A18B (32B)	720p	7.2888	7.808	0.269	0.603	0.126	7.2331	0.9222
Davinci	15B	540p	7.1487	7.8158	0.269	0.600	0.151	5.9559	0.9307
LTX 2.3	19B	512p	7.2476	7.6902	0.337	0.576	0.106	6.9459	0.8287
NAVA (ours)	6.3B	720p	7.7914	7.5655	0.313	0.659	0.099	6.8609	0.8328

↑ higher is better↓ lower is betterBold = bestUnderline = 2nd best

Table 2.Timbre-Control Speech Performance

Audio-only models are listed as reference only — they are dedicated speech systems and not directly comparable. Among joint audio-video models, NAVA delivers speech quality close to dedicated audio-only systems.

Category	Model	WER ↓	Speaker Similarity ↑
Audio-Onlyreference	CosyVoice	4.29	60.9
	CosyVoice2	2.57	65.2
	Qwen2.5-Omni	2.72	63.2
Audio-Video	DreamID-Omni	31.76	35.7
Audio-Video	NAVA (ours)	4.20	66.7

Table 3.User Study

We conduct human GSB (Win / Tie / Lose) preference studies on both T2AV and TI2AV against open-source baselines (Ovi-1.1, LTX-2.3, MoVA, daVinci). NAVA achieves competitive Overall Quality across all comparisons and wins on Audio-Visual Alignment against all baselines.

Showcase

Generated samples.

All samples below are generated end-to-end by NAVA. Audio and video share a single denoising trajectory — No posterior alignment, No extra components.

BibTeX

Cite this work.

@misc{ji2026nava, title = {Native Audio-Visual Alignment for Generation}, author = {Longbin Ji and Guan Wang and Xuan Wei and Chenye Yang and Xiangrui Liu and Zhenyu Zhang and Shuohuan Wang and Yu Sun and Jingzhou He}, year = {2026}, eprint = {2605.30073}, archivePrefix = {arXiv}, primaryClass = {cs.CV}, url = {https://arxiv.org/abs/2605.30073}, }

Acknowledgements

We would like to thank the contributors to Wan2.2-TI2V-5B, LTX-Video, ReDimNet, Qwen3, and Ovi for their great open-source work, which is helpful to this project.