Memento — Reconstruct to Remember for Consistent Long Video Generation

Showcases

Story of Robinson Crusoe

A castaway's journey from shipwreck to survival, companionship, and rescue — consistent character identity across 12 shots, 4 scenes.

Footprints on the Moon

A lone astronaut embarks on a tranquil yet emotional mission to the Moon — consistent character identity across 11 shots, 3 scenes.

The Lost City of Eternis

An intrepid explorer ventures into the jungle and uncovers a long-lost city of forgotten relics — consistent character identity across 12 shots, 3 scenes.

Prompt Format: Each shot is conditioned on a two-part prompt: a global caption describing the character's persistent appearance (e.g., identity, clothing), and a shot caption describing the specific action and scene of the current shot.

Cut Field: cut = true indicates a scene transition (new background, camera reset); cut = false indicates temporal continuity within the same scene (smooth motion, same camera).
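The prompt structure described above can be sketched as a small data model. This is a minimal illustration, not the project's actual file format: the class names, field names, helper functions, and example captions are all hypothetical, and treating the first shot as the start of a scene regardless of its cut flag is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ShotPrompt:
    """One shot: its own caption plus the cut flag (hypothetical field names)."""
    shot_caption: str
    cut: bool  # True = scene transition, False = continues the same scene


@dataclass
class StoryPrompt:
    """A story: one global caption shared by every shot, plus the shot list."""
    global_caption: str
    shots: List[ShotPrompt]


def conditioning_pairs(story: StoryPrompt) -> List[Tuple[str, str]]:
    """Pair the shared global caption with each shot caption (two-part prompt)."""
    return [(story.global_caption, shot.shot_caption) for shot in story.shots]


def split_into_scenes(story: StoryPrompt) -> List[List[int]]:
    """Group shot indices into scenes; a new scene starts wherever cut is True.

    Assumption: the first shot always opens a scene, whatever its flag says.
    """
    scenes: List[List[int]] = []
    for index, shot in enumerate(story.shots):
        if shot.cut or not scenes:
            scenes.append([])
        scenes[-1].append(index)
    return scenes


# Illustrative captions only; they are not taken from the showcased prompts.
story = StoryPrompt(
    global_caption="A bearded castaway in a torn linen shirt.",
    shots=[
        ShotPrompt("The castaway crawls onto a beach after a shipwreck.", cut=True),
        ShotPrompt("The castaway stands and looks at the wreck.", cut=False),
        ShotPrompt("The castaway builds a shelter in the forest.", cut=True),
    ],
)

print(split_into_scenes(story))
```

Keeping the global caption separate from the per-shot captions means the character description is written once and reused for every shot, which matches the two-part conditioning described above.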

Abstract

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.

Pipeline

Overview of our framework. We employ split self-attention over overlapping local groups, allowing reconstruction–memory and memory–target interactions while avoiding full global attention. Split cross-attention injects both global story-level and local shot-level captions. Story and shot captions condition separate learnable queries via caption-to-query cross-attention. The fused query retrieves relevant candidates from the memory bank and the last shot, and updates global and local memory states for scalable long-form generation.
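To make the retrieval step more concrete, the sketch below shows one simplified way two queries could select candidates: one over the long-range memory bank and one over the frames of the last shot. It is an assumption-heavy illustration and not the authors' implementation. The queries in Memento are learnable and conditioned on captions through cross-attention, whereas here they are plain vectors scored by cosine similarity, and every function and parameter name is hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Vector, candidates: Sequence[Vector], k: int) -> List[int]:
    """Indices of the k candidates most similar to the query, best first."""
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    return ranked[:k]


def dual_query_select(
    identity_query: Vector,
    continuity_query: Vector,
    memory_bank: Sequence[Vector],
    last_shot_frames: Sequence[Vector],
    k_memory: int,
    k_recent: int,
) -> Dict[str, List[int]]:
    """Select long-range identity evidence and short-range keyframes separately.

    Assumption: similarity search stands in for the learned retrieval.
    """
    return {
        "memory": top_k(identity_query, memory_bank, k_memory),
        "recent": top_k(continuity_query, last_shot_frames, k_recent),
    }


# Toy 2-D features, chosen only to make the ranking easy to follow.
selection = dual_query_select(
    identity_query=[1.0, 0.0],
    continuity_query=[0.0, 1.0],
    memory_bank=[[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
    last_shot_frames=[[0.0, 1.0], [1.0, 1.0]],
    k_memory=2,
    k_recent=1,
)
print(selection)
```

Separating the two selections keeps identity evidence from being crowded out by recent frames, which is the motivation the page gives for disentangling long-range subject evidence from short-range cues.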

Benchmark

Quantitative comparison of video generation capabilities. We report aesthetic quality, semantic consistency at both story and shot levels, background consistency, and subject consistency across different granularities. The best results are highlighted in bold, and the second-best results are underlined.

Method	Semantic Consistency (Global)	Semantic Consistency (Shot)	Background Consistency	Subject Consistency (Inter-shot)	Subject Consistency (Intra-shot)	Subject Consistency (Inter-scene)	Aesthetic
StoryDiffusion + Wan2.2-I2V	0.2671	0.2689	0.9767	0.5525	0.8448	0.6732	0.5310
StoryMem	0.2793	0.2681	0.9732	0.6606	0.8146	0.6692	0.4937
HoloCine	0.2720	0.2854	0.9770	0.5791	0.8128	0.6594	0.4568
Ours	0.3063	0.2893	0.9805	0.7338	0.8578	0.7268	0.4977
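Because the bold and underline highlighting does not survive in plain text, the best and second-best entries can be recovered from the reported numbers. The short script below does this under the assumption that higher is better for every column; the variable and function names are illustrative.

```python
from typing import Dict, List, Tuple

COLUMNS: List[str] = [
    "Semantic (Global)",
    "Semantic (Shot)",
    "Background",
    "Subject (Inter-shot)",
    "Subject (Intra-shot)",
    "Subject (Inter-scene)",
    "Aesthetic",
]

# Values copied from the table above, in column order.
RESULTS: Dict[str, List[float]] = {
    "StoryDiffusion + Wan2.2-I2V": [0.2671, 0.2689, 0.9767, 0.5525, 0.8448, 0.6732, 0.5310],
    "StoryMem": [0.2793, 0.2681, 0.9732, 0.6606, 0.8146, 0.6692, 0.4937],
    "HoloCine": [0.2720, 0.2854, 0.9770, 0.5791, 0.8128, 0.6594, 0.4568],
    "Ours": [0.3063, 0.2893, 0.9805, 0.7338, 0.8578, 0.7268, 0.4977],
}


def best_and_second(
    results: Dict[str, List[float]], columns: List[str]
) -> Dict[str, Tuple[str, str]]:
    """For each column, return (best method, second-best method).

    Assumption: a higher score is better for every metric.
    """
    ranking: Dict[str, Tuple[str, str]] = {}
    for col_index, column in enumerate(columns):
        ordered = sorted(
            results, key=lambda method: results[method][col_index], reverse=True
        )
        ranking[column] = (ordered[0], ordered[1])
    return ranking


for column, (best, second) in best_and_second(RESULTS, COLUMNS).items():
    print(f"{column}: best = {best}, second-best = {second}")
```

Under that assumption, Ours ranks first on every column except Aesthetic, where StoryDiffusion + Wan2.2-I2V is first and Ours is second.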

Comparison

Simple PhD Life

1 scene, 9 shots

StoryMem

HoloCine

Memento (Ours)

Jane Eyre: A Quiet Strength

4 scenes, 11 shots

StoryMem

HoloCine

Memento (Ours)

The Light of Van Gogh

3 scenes, 10 shots

StoryMem

HoloCine

Memento (Ours)

Cite This Work

@misc{wei2026memento,
  title        = {Memento: Reconstruct to Remember for Consistent Long Video Generation},
  author       = {Xuan Wei and Xiangrui Liu and Longbin Ji and Guan Wang and Zhenyu Zhang and Shuohuan Wang and Yu Sun and Qingqi Hong},
  year         = {2026},
  note         = {Preprint}
}
