Recently, StoryMem, an open-source framework jointly developed by ByteDance and Nanyang Technological University, has attracted widespread attention in the field of AI video generation. The framework uses an innovative "visual memory" mechanism to turn existing single-shot video diffusion models into multi-shot long-video storytellers: it can automatically generate videos of more than one minute that contain multiple shot transitions while keeping characters and scenes highly coherent. This marks a key step for open-source AI video technology toward cinematic storytelling.
Core Innovation of StoryMem: Memory-Driven Shot-by-Shot Generation
The core of StoryMem is a "Memory-to-Video (M2V)" design inspired by human memory. The framework maintains a compact, dynamically updated memory bank that stores key frames from previously generated shots. The first shot is generated by the base text-to-video (T2V) model and seeds the memory; for each subsequent shot, an M2V LoRA injects the memorized key frames into the diffusion model, keeping character appearance, scene style, and narrative logic consistent across shots.
After each shot is generated, the framework automatically extracts semantic key frames and applies aesthetic filtering to update the memory bank. This iterative approach avoids common failure modes of long-video models, such as characters' faces changing between shots and abrupt scene jumps, while requiring only lightweight LoRA fine-tuning rather than training on large-scale long-video data.
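To make the loop concrete, the sketch below shows how such a memory-driven, shot-by-shot cycle could be wired together in Python. It is only an illustration of the idea described above; all function names, types, and the memory-size cap are hypothetical placeholders and do not reflect StoryMem's actual code or API.

```python
# Minimal sketch of memory-driven, shot-by-shot generation.
# All names here (t2v_generate, m2v_generate_with_lora, extract_keyframes,
# aesthetic_filter) are hypothetical placeholders, not StoryMem's real API.

from typing import Any, List

Frame = Any   # placeholder for a decoded key frame (e.g. an image tensor)
Video = Any   # placeholder for a generated video clip

MAX_MEMORY_FRAMES = 16  # assumed cap that keeps the memory bank compact


def generate_story(shot_prompts: List[str]) -> List[Video]:
    """Generate a multi-shot story, carrying a key-frame memory across shots."""
    shots: List[Video] = []
    memory: List[Frame] = []

    for i, prompt in enumerate(shot_prompts):
        if i == 0:
            # First shot: plain text-to-video with the base model.
            shot = t2v_generate(prompt)
        else:
            # Later shots: the M2V LoRA conditions the diffusion model on the
            # memorized key frames to keep characters and scenes consistent.
            shot = m2v_generate_with_lora(prompt, memory_frames=memory)
        shots.append(shot)

        # Memory update: extract semantically representative key frames,
        # keep only the aesthetically best ones, and trim to a compact size.
        candidates = extract_keyframes(shot)
        memory.extend(aesthetic_filter(candidates))
        memory = memory[-MAX_MEMORY_FRAMES:]

    return shots
```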

Outstanding Consistency and Cinematic Quality
Experiments show that StoryMem significantly outperforms existing methods in cross-shot consistency, with an improvement of up to 29%, and is preferred more often in human evaluations. At the same time, it retains the base model's (e.g. Wan2.2) high visual quality, prompt adherence, and shot-control capabilities, supporting natural transitions and custom story generation.
The project also releases the ST-Bench benchmark, a dataset of 300 diverse multi-shot story prompts for standardized evaluation of long-video narrative quality.
Broad Application Scenarios: A Tool for Rapid Previews and A/B Testing
StoryMem is particularly suitable for fields that require rapid iteration of visual content:
- Marketing and Advertising: Quickly generate dynamic storyboards from scripts for various A/B testing versions
- Film Pre-production: Help crews visualize storyboards and reduce early concept-development costs
- Short Videos and Independent Creation: Easily produce coherent narrative short films, raising the production quality of content
Rapid Community Response: ComfyUI Integration Emerging
Soon after the project's release, the community began exploring local deployment. Some developers have already built a preliminary ComfyUI workflow that supports local long-video generation, further lowering the barrier to entry.
AIbase's View: Long video consistency has always been a pain point in AI generation. StoryMem solves this problem in a lightweight and efficient way, greatly advancing the evolution of open-source video models into practical narrative tools. With the integration of more multimodal capabilities, its potential in advertising, film, and content creation will be further unleashed.
Project Address: https://github.com/Kevin-thu/StoryMem
