AI video generation and editing is undergoing a fundamental rethink of its underlying logic. ByteDance's commercial technology team recently open-sourced Bernini, a unified framework for video generation and video editing. The framework centers on an "understand first, then generate" mechanism, aiming to address industry pain points such as unstable imagery and frame flickering that arise when traditional models fail to accurately interpret complex text instructions.
Traditional video editing often runs into problems such as subject deformation, background drift, or disrupted motion. To address these, Bernini divides the workflow into two parts: "semantic planning" and "visual rendering." The system first uses a planner based on a multimodal large language model (MLLM-based planner) to analyze inputs such as text, video, and reference images and to predict the target semantic representation in feature space, in effect drawing a "semantic sketch" free of pixel-level constraints. A renderer based on a Diffusion Transformer (DiT-based renderer) then performs high-quality visual rendering, converting the planned semantic targets into stable, temporally continuous video.
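The plan-then-render split can be pictured as a simple two-stage dataflow. The following is only a toy sketch of that idea, not Bernini's code: `EditRequest`, `SemanticPlan`, `plan_semantics`, and `render_video` are hypothetical names, and the "planner" and "renderer" are trivial stand-ins for the MLLM and DiT models. What it illustrates is the interface: the planner sees every input, while the renderer sees only the plan.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EditRequest:
    """All user inputs (hypothetical container, not Bernini's API)."""
    instruction: str
    source_video: Optional[str] = None
    reference_images: List[str] = field(default_factory=list)


@dataclass
class SemanticPlan:
    """Stand-in for the feature-space 'semantic sketch'; no pixels involved."""
    features: List[float]
    num_inputs: int


def plan_semantics(request: EditRequest, dim: int = 4) -> SemanticPlan:
    """Stage 1 (toy planner): fold every input into one semantic target."""
    features = [0.0] * dim
    for i, ch in enumerate(request.instruction):
        features[i % dim] += ord(ch)
    total = sum(features) or 1.0  # avoid dividing by zero on empty text
    num_inputs = (
        1
        + int(request.source_video is not None)
        + len(request.reference_images)
    )
    return SemanticPlan([f / total for f in features], num_inputs)


def render_video(plan: SemanticPlan, num_frames: int = 4) -> List[List[float]]:
    """Stage 2 (toy renderer): every frame is conditioned on the same plan."""
    return [list(plan.features) for _ in range(num_frames)]


request = EditRequest(
    "make it snow", source_video="clip.mp4", reference_images=["ref.png"]
)
frames = render_video(plan_semantics(request), num_frames=3)
```

Because every frame is derived from one shared plan rather than from the raw instruction, the toy renderer cannot drift between frames, which is the intuition behind the stability claim.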

Thanks to this division of labor, Bernini shows practical value in controllable editing. With a single instruction, users can make realistic, natural-looking changes to weather, season, materials, and visual style, and can also exert precise semantic control over camera angle, focus, and subject actions. For example, the system can change an animal's actions in a video while keeping the environment and camera highly stable, bringing AI video editing closer to the precision of traditional post-production software.
Beyond text instructions, Bernini also accepts images and videos as visual references, which greatly improves consistency. In video editing scenarios, it can embed specific materials, designated subjects, or even advertisements into a target region of the footage without spilling past the region's boundaries or distorting perspective. In new video generation scenarios, the model supports single-image reference generation, multi-angle reference generation, extending key frames into continuous shots, and even combining several unrelated product images on a single character in the video.
To keep the model from confusing multiple visual segments, the team introduced SA-3D RoPE, a positional encoding mechanism that gives each visual segment a unique marker, distinguishing reference materials from output targets while preserving spatio-temporal relationships. In ByteDance's own testing, the framework ranks among the industry's top tier. The inference code and the second-stage model, Bernini-R, have reportedly been released, and the full version with the complete MLLM planner is expected to follow shortly.
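The article does not spell out how SA-3D RoPE marks segments, so the following is only an illustrative guess at the general idea, not Bernini's formulation. It assumes each token carries a (time, height, width) position and that each segment is shifted to its own range on the time axis; `segment_positions` and `assign_positions` are hypothetical helpers.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int, int]  # (t, h, w) index for one visual token


def segment_positions(
    frames: int, height: int, width: int, t_offset: int = 0
) -> List[Position]:
    """Enumerate 3D positions for one segment, shifted along the time axis."""
    return [
        (t_offset + t, h, w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


def assign_positions(
    segments: List[Tuple[str, int, int, int]]
) -> Dict[str, List[Position]]:
    """Give each (name, frames, height, width) segment a disjoint time range.

    Assumption: the per-segment time offset plays the role of the
    'unique marker'. Offsets shift whole segments, so relative positions
    inside a segment are unchanged.
    """
    positions: Dict[str, List[Position]] = {}
    t_offset = 0
    for name, frames, height, width in segments:
        positions[name] = segment_positions(frames, height, width, t_offset)
        t_offset += frames  # the next segment starts after this one
    return positions


# One reference image (a single frame) followed by a 3-frame output target.
layout = assign_positions([("reference", 1, 2, 2), ("target", 3, 2, 2)])
```

In this sketch no reference token shares a position with a target token, while the spacing between tokens inside each segment is the same as it would be without the offset. A rotary embedding applied to these indices would therefore see the segments as distinct but internally ordered.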