Bernini: Latent Semantic Planning for Video Diffusion Review

2026-06-29, 11 min read

0. Introduction

Bernini is a paper whose point is easy to miss if you read it as merely “video diffusion with an MLLM attached.” The question it asks is more structural: in video generation and editing, what should the MLLM do, and what should the diffusion renderer do?

The paper's central claim is roughly this: the MLLM should not produce pixels directly but a semantic plan, and the diffusion model should render that semantic plan into high-quality pixels. In other words, Bernini does not force multimodal understanding and visual synthesis into a single model; it proposes a division of labor with a semantic space as the interface.

One-line summary: Bernini is a unified video generation and editing framework in which an MLLM-based planner predicts the target semantic representation in ViT embedding space, and a DiT-based renderer generates or edits video conditioned on this semantic plan, text features, and source VAE features.

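This paper is worth reading now for the following reasons.

It uses the MLLM as a semantic planner rather than a prompt parser.
It places the ViT embedding space as the interface between planner and renderer.
It handles V2V, RV2V, content insertion, and R2V in a single framework.
It presents a recipe that trains the planner and renderer separately and then applies light co-training.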

The core of Bernini lies in its design perspective more than in any single results table. Recent video diffusion models keep getting stronger, but real editing tasks still require instruction understanding, reference binding, object identity preservation, and motion consistency all at once. Rather than solving this by scaling up the renderer alone, Bernini separates semantic planning out as its own role.

1. Problem Setting

1-1. Problem definition

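Video generation and editing are not just a pixel synthesis problem. In an editing setting in particular, the model must understand the source video, separate what the instruction wants changed from what must be preserved, and reflect the semantic constraints given by a reference image or reference video in the output.

For example, video-to-video editing requires all of the following at once.

The layout, motion, and identity of the source video must be preserved.
The change requested by the prompt must be reflected accurately.
In reference-guided editing, conditions such as the reference object, material, style, or weather must be carried into the output.
In content insertion, the new content must be spatially and temporally compatible with the existing scene.
In reference-to-video generation, the model must take several reference images and render them into one coherent video.

Written probabilistically, the problem looks like this.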

\[p(y_{video} | x_{text}, x_{src}, x_{ref})\]

Here $x_{src}$ is a source image or source video, and $x_{ref}$ is a reference image or reference video. The difficulty is that $x_{text}$, $x_{src}$, and $x_{ref}$ differ in modality and time structure. The model is therefore not simply turning a text prompt into a video latent; it has to align multimodal evidence into a single target semantic plan.

1-2. Why previous approaches are insufficient

Existing video diffusion pipelines usually feed the text condition and the visual condition directly into the renderer. This is simple end to end, but semantic reasoning is often weak.

Approach	Main idea	Limitation
Text-to-video diffusion	Converts a text prompt into a video latent	Weak at complex reference binding and fine-grained editing
Video editing diffusion	Denoises conditioned on source video and prompt	Instruction reasoning and object-level planning are buried inside the renderer
MLLM plus generator	MLLM produces prompts or captions as an aid	The MLLM's understanding may not be sufficiently conveyed to the pixel generation interface
Unified multimodal generator	Feeds multiple inputs directly into one model	Higher training cost and modality alignment burden

In Bernini's view, MLLMs and diffusion models already have different strengths. MLLMs are strong at understanding and reasoning over heterogeneous multimodal inputs, and diffusion models are strong at photorealistic rendering. So instead of mixing the two capabilities into one black-box model, Bernini has the MLLM produce a semantic plan and the diffusion renderer turn it into pixels.

From this perspective, Bernini's problem setting is not “a bigger video model” but “how to hand semantic understanding to the renderer through a latent interface it can actually use.”

2. Core Idea

2-1. Main contribution

Bernini's core contributions can be summarized in five points.

MLLM-based semantic planner
- It looks at text, source image, source video, reference input, and target placeholder together and predicts the target semantic representation.
- Its output is not a natural language plan but a semantic embedding in ViT embedding space.
DiT-based renderer
- It generates video in VAE latent space conditioned on the semantic plan, text features, and, for editing, source VAE features.
- The renderer performs flow-matching denoising.
Semantic interface between understanding and rendering
- The interface between planner and renderer is a ViT embedding, not pixels or captions.
- This lets the MLLM's understanding be passed directly as a renderer condition.
Segment-Aware 3D RoPE
- It distinguishes tokens coming from different visual segments.
- It is a device for handling segment identity and spatiotemporal position together when source video, reference image, and target placeholder are mixed.
Separate training plus light co-training
- The planner and renderer are each trained separately, then the interface is aligned with light co-training.
- The aim is to keep the strengths of the pretrained MLLM and the pretrained diffusion renderer as far as possible.

2-2. Design intuition

Bernini's design intuition can be summarized as follows.

First, the MLLM understands the input. This includes not only the text instruction but also the source video, source image, reference image, reference video, and target placeholder. The MLLM planner then predicts, in ViT embedding space, the high-level semantics the output video should have.

\[z_{sem} = P_{\theta}(x_{text}, x_{src}, x_{ref}, x_{placeholder})\]

The renderer then takes the semantic embedding $z_{sem}$, together with the text feature $h_{text}$ and the source VAE feature $v_{src}$, and denoises VAE latent tokens.

\[\hat{y}_{video} = R_{\phi}(z_{sem}, h_{text}, v_{src})\]

The important point here is that $z_{sem}$ is not a text caption. Converting to a caption forces the MLLM's visual understanding through a language bottleneck, where much information can be lost. Using ViT embedding space instead passes visual semantic structure to the renderer more directly.

My reading is that Bernini recasts video generation as two stages.

First, plan what to make in semantic space.
Second, render how it should look in VAE latent space.

The benefit of this split is clearest in editing tasks. Editing is not about creating a new video from nothing; it is about handling what to keep and what to change at the same time. If the planner produces the target semantic representation first, the renderer can concentrate on pixel-level detail preservation and realistic synthesis.
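
To make the two-stage interface concrete, here is a minimal shape-level sketch in Python. It is not Bernini's implementation: `toy_planner`, `toy_renderer`, `EMBED_DIM`, `LATENT_DIM`, and the projection `proj` are hypothetical stand-ins chosen only to show that the planner emits vectors in an embedding space and the renderer consumes them, along with the text feature and source latent, to produce latent-shaped output.

```python
import numpy as np

EMBED_DIM = 8   # toy stand-in for the ViT embedding width (assumption)
LATENT_DIM = 4  # toy stand-in for the VAE latent channel count (assumption)

def toy_planner(text_feat, src_feat, ref_feat, num_target_tokens):
    """Hypothetical stand-in for P_theta: one embedding-space vector per target token."""
    context = np.concatenate([text_feat, src_feat, ref_feat], axis=0).mean(axis=0)
    return np.tile(context, (num_target_tokens, 1))

def toy_renderer(z_sem, text_feat, src_latent, proj):
    """Hypothetical stand-in for R_phi: conditions in, latent-shaped array out."""
    condition = z_sem + text_feat.mean(axis=0)   # fuse semantic plan and text feature
    return src_latent + condition @ proj         # stay anchored to the source latent

rng = np.random.default_rng(0)
text_feat = rng.standard_normal((5, EMBED_DIM))
src_feat = rng.standard_normal((6, EMBED_DIM))
ref_feat = rng.standard_normal((3, EMBED_DIM))
src_latent = rng.standard_normal((6, LATENT_DIM))
proj = 0.1 * rng.standard_normal((EMBED_DIM, LATENT_DIM))

z_sem = toy_planner(text_feat, src_feat, ref_feat, num_target_tokens=6)
video_latent = toy_renderer(z_sem, text_feat, src_latent, proj)
print(z_sem.shape, video_latent.shape)
```

The only thing this sketch is meant to carry over is the contract: the planner never touches the latent space, and the renderer never sees raw instructions without the semantic plan.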

3. Architecture / Method

3-1. Overview

Item	Description
Goal	Unify video generation and video editing in one semantic planning plus rendering framework
Planner	MLLM-based semantic planner
Planner output	target semantic representation in ViT embedding space
Renderer	DiT-based renderer
Renderer space	VAE latent space
Editing condition	text feature and source VAE feature
Position design	Segment-Aware 3D RoPE
Main tasks	V2V, RV2V, content insertion, R2V

Bernini's architecture can be written simply as follows.

\[\text{multimodal input} \rightarrow \text{semantic planner} \rightarrow \text{semantic embedding} \rightarrow \text{DiT renderer} \rightarrow \text{video}\]

In this structure, the planner is responsible for visual understanding and instruction reasoning, and the renderer is responsible for visual fidelity and temporal consistency.

3-2. Module breakdown

1) Multimodal input encoding

Bernini arranges the input into a single sequence containing text, source image, source video, reference input, and target placeholder. The target placeholder matters because it tells the model where the target frames are to be generated and which temporal slots to fill.

This is especially useful for video editing. Instead of simply feeding the source video and prompt into the renderer, the MLLM planner first builds the semantic relationship between source and target.
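
As a rough illustration of a single sequence with a target placeholder, the sketch below lays out segments as token spans. The `Segment` class, the segment names, and the token counts are all hypothetical; the public material does not specify how Bernini actually tokenizes or orders its inputs.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    kind: str         # e.g. "text", "source_video", "reference_image", "target_placeholder"
    num_tokens: int

def layout_sequence(segments):
    """Assign each segment a half-open [start, end) span in one flat sequence."""
    spans, cursor = [], 0
    for seg in segments:
        spans.append((seg.kind, cursor, cursor + seg.num_tokens))
        cursor += seg.num_tokens
    return spans

# Hypothetical token counts, chosen only for illustration.
segments = [
    Segment("text", 12),
    Segment("source_video", 64),
    Segment("reference_image", 16),
    Segment("target_placeholder", 64),
]
spans = layout_sequence(segments)
for kind, start, end in spans:
    print(f"{kind:20s} tokens [{start}, {end})")
```

In this layout the placeholder span is the part of the sequence whose outputs would be read off as the target semantic embedding, which is one way to understand why the placeholder tells the model which slots to fill.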

2) MLLM-based semantic planner

The planner looks at text and visual input together and predicts the target semantic embedding. The project page describes the planner as reasoning over text, source images, source videos, and target placeholders.

The planner's output is not a human-readable plan. Bernini predicts the target semantic representation directly in ViT embedding space, and this choice is the paper's most important design decision. A natural language plan is interpretable but tends to lose visual detail, while a pixel latent puts a heavy burden on the renderer. The ViT embedding serves as a semantic interface in between.

3) DiT-based renderer

The renderer performs flow-matching denoising over VAE latent tokens. The project page describes the DiT renderer as generating video in VAE latent space.

In the editing setting the renderer also uses source VAE features, which can be seen as a device for detail preservation. With only a semantic plan, the output may match the instruction yet lose the source video's identity and local detail. Source VAE features work to reduce that loss.
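
For readers less familiar with flow matching, the following is a minimal sketch of Euler sampling along a straight noise-to-data path. It assumes the convention that t = 0 is noise and t = 1 is data, and it replaces the learned DiT with a closed-form toy velocity toward a known target. `toy_velocity` and `euler_sample` are hypothetical names, and Bernini's actual sampler, schedule, and conditioning are not described here.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Closed-form velocity of the straight path toward a known target.

    Stand-in for the learned network; a real renderer would predict this
    from the noisy latent, the timestep, and its conditions.
    """
    return (target - x) / (1.0 - t)

def euler_sample(noise, target, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 toward t = 1 with fixed Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                      # t stays below 1, so the division is safe
        x = x + dt * toy_velocity(x, t, target)
    return x

rng = np.random.default_rng(0)
target_latent = rng.standard_normal((6, 4))   # pretend clean VAE latent
noise = rng.standard_normal((6, 4))
sample = euler_sample(noise, target_latent)
print(np.abs(sample - target_latent).max())
```

With this toy velocity the last Euler step lands on the target, which is why the printed error is at floating-point level. A learned velocity field would only approximate that.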

4) Segment-Aware 3D RoPE

SA-3D RoPE is a positional design for distinguishing tokens that come from different visual segments. Where ordinary 3D RoPE encodes spatiotemporal coordinates, SA-3D RoPE additionally accounts for visual segment identity.

This is needed because Bernini's input is not a single video. When source video, reference image, reference video, and target placeholder arrive together, tokens at the same spatiotemporal coordinate may come from different segments. Without segment distinction, reference and source can get mixed, or the target placeholder can be mistaken for source tokens.
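
The collision problem described above can be shown with position indices alone. The sketch below is not the paper's SA-3D RoPE formulation, which should be checked in the paper itself. It uses one simple assumed scheme, shifting each segment along the temporal axis, purely to show how per-segment grids collide and how a segment-aware index removes the collision. `grid_positions` and `segment_aware_positions` are hypothetical helpers.

```python
import numpy as np

def grid_positions(frames, height, width):
    """(t, h, w) index triple for every token of one visual segment."""
    t, h, w = np.meshgrid(
        np.arange(frames), np.arange(height), np.arange(width), indexing="ij"
    )
    return np.stack([t, h, w], axis=-1).reshape(-1, 3)

def segment_aware_positions(segment_shapes):
    """Assumed scheme (not from the paper): shift each segment along the temporal axis."""
    positions, segment_ids, t_offset = [], [], 0
    for seg_id, (frames, height, width) in enumerate(segment_shapes):
        pos = grid_positions(frames, height, width)
        pos[:, 0] += t_offset           # later segments start after earlier ones
        positions.append(pos)
        segment_ids.append(np.full(len(pos), seg_id))
        t_offset += frames
    return np.concatenate(positions), np.concatenate(segment_ids)

# Toy shapes: source video (4 frames), reference image (1 frame), target placeholder (4 frames).
shapes = [(4, 2, 2), (1, 2, 2), (4, 2, 2)]
naive = np.concatenate([grid_positions(*s) for s in shapes])
aware, seg_ids = segment_aware_positions(shapes)
print(len(np.unique(naive, axis=0)), "unique naive positions for", len(naive), "tokens")
print(len(np.unique(aware, axis=0)), "unique segment-aware positions for", len(aware), "tokens")
```

In the naive layout the source and target tokens share identical coordinates, so rotary phases alone cannot tell them apart. Any scheme that makes the index depend on the segment fixes that, and the temporal shift here is just the simplest example.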

5) Chain-of-thought reasoning in planner

The arXiv abstract says that chain-of-thought reasoning is added to the planner to better transfer understanding to generation. Here CoT is less about generating long output text and more of an internal reasoning scaffold for turning a complex editing instruction into a semantic plan.

This signals that Bernini does not use the MLLM as a mere encoder. If the planner were only a module that compresses visual input, CoT would hardly be needed. Bernini instead treats the planner as comparing the instruction against visual evidence and composing the target semantics, so an explicit reasoning step is a natural design.

4. Training / Data / Recipe

4-1. Data

Based on the paper abstract and public pages, Bernini covers both video generation and video editing. The task types shown on the project page are as follows.

V2V: prompt-driven video editing
RV2V: reference-guided video editing
Content insertion: inserting an image or video reference into the source video
R2V: video generation from up to 5 reference images

However, the public abstract and project page alone are not enough to pin down the training data scale or the detailed dataset mixture. At the blog draft stage I therefore do not estimate data scale; it should be rechecked against the dataset section and appendix of the paper PDF.

4-2. Training strategy

Bernini's training recipe follows from the division of labor. The paper abstract says that, thanks to the semantic interface, the planner and renderer can be trained separately and then aligned with only light co-training.

The advantages of this structure are clear.

It preserves the MLLM's pretrained understanding as much as possible.
It also preserves the DiT renderer's pretrained visual synthesis ability.
It can reduce training cost and instability compared with aligning every component end to end from scratch.
Because the interface is ViT embedding space, the planner output and renderer condition can be connected fairly directly.

As a formula, the objective can be understood as follows.

\[L = L_{planner} + L_{renderer} + \lambda L_{align}\]

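Here $L_{planner}$ can be read as target semantic representation prediction, $L_{renderer}$ as VAE latent denoising, and $L_{align}$ as the co-training term that aligns the planner-renderer interface, with $\lambda$ as its weight. This formula is a simplification for understanding the structure; the exact loss term names and weights should be rechecked in the paper.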
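
To show how the three terms of this simplified objective might look numerically, here is a small Python sketch. Every choice in it is an assumption made for illustration: mean squared error for the planner term, a straight-path flow-matching velocity target for the renderer term, and an arbitrary weight `lam`. None of these are confirmed by the public material.

```python
import numpy as np

def planner_loss(pred_sem, target_sem):
    """Assumed form: regression onto target embeddings (MSE is a guess)."""
    return float(np.mean((pred_sem - target_sem) ** 2))

def renderer_loss(pred_velocity, noise, clean_latent):
    """Assumed form: flow-matching loss against the straight-path velocity."""
    target_velocity = clean_latent - noise
    return float(np.mean((pred_velocity - target_velocity) ** 2))

def total_loss(l_planner, l_renderer, l_align, lam=0.1):
    """L = L_planner + L_renderer + lambda * L_align, with a hypothetical default lambda."""
    return l_planner + l_renderer + lam * l_align

rng = np.random.default_rng(0)
target_sem = rng.standard_normal((6, 8))
pred_sem = target_sem + 0.1 * rng.standard_normal((6, 8))
noise = rng.standard_normal((6, 4))
clean = rng.standard_normal((6, 4))
pred_v = (clean - noise) + 0.1 * rng.standard_normal((6, 4))

l_p = planner_loss(pred_sem, target_sem)
l_r = renderer_loss(pred_v, noise, clean)
print(total_loss(l_p, l_r, l_align=0.0))
```

In a separate-training setup the first two terms would be optimized in different stages rather than summed, so the sum mostly matters as a description of the co-training phase.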

4-3. Engineering notes

The public code and model card are centered on the Bernini-R renderer. According to the Hugging Face model card, the inference code and model weights for the Bernini Renderer were released on 2026-06-01. The GitHub README states that 1.3B renderer weights were also released on 2026-06-09.

The runtime environment matters too. The model card and GitHub README list Python 3.11.2, CUDA 12.4, and PyTorch 2.5.1+cu124 as the reference environment and recommend FlashAttention-3 on Hopper GPUs. On other CUDA GPUs, they say it falls back to FlashAttention-2 or PyTorch SDPA.
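
The fallback order described above can be written as a small decision function. This is a sketch of the logic only, not code from the Bernini repository: `pick_attention_backend` and its arguments are hypothetical, and the returned strings are labels rather than real configuration values. The one hardware fact it relies on is that Hopper GPUs report CUDA compute capability 9.0.

```python
def pick_attention_backend(compute_capability, has_fa3, has_fa2):
    """Return a label for the attention backend, following the described fallback order.

    compute_capability: (major, minor) tuple, e.g. (9, 0) for Hopper.
    has_fa3 / has_fa2: whether the respective FlashAttention builds are installed.
    """
    major, _minor = compute_capability
    if major == 9 and has_fa3:
        return "flash_attention_3"   # recommended path on Hopper
    if has_fa2:
        return "flash_attention_2"   # general CUDA fallback
    return "torch_sdpa"              # last resort: PyTorch scaled dot-product attention

print(pick_attention_backend((9, 0), has_fa3=True, has_fa2=True))
print(pick_attention_backend((8, 0), has_fa3=False, has_fa2=True))
print(pick_attention_backend((8, 6), has_fa3=False, has_fa2=False))
```

In a real setup the capability tuple would typically come from `torch.cuda.get_device_capability()`, and the availability flags from attempting to import the installed packages.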

The actual inference recipe also differs by task.

Image tasks come with a single-GPU example.
Video tasks come with an 8-GPU torchrun example.
Ulysses sequence parallelism is used for multi-GPU video inference.
The default example is 480p at 16fps, and some content insertion examples also provide a 720p at 24fps setting.
The prompt enhancer can be used through an OpenAI-compatible endpoint and is recommended for best generation quality.

Researchers reading Bernini should keep this distinction in mind. The paper describes the full Bernini framework, but the public release mainly covers the Bernini-R renderer and inference stack. The full training pipeline including the planner, and model availability, need separate confirmation.

5. Evaluation

5-1. Main results

The arXiv abstract claims that Bernini shows strong performance across video generation and editing benchmarks. The public GitHub README includes a model performance table for the Bernini-R 1.3B and 14B renderers.

Model	EditVerse	OpenVE	OpenS2V	VBench	Bernini-v2v OS	Bernini-vr2v OS
Bernini-R 1.3B	7.74	3.65	62.18	84.69	3.15	3.21
Bernini-R 14B	7.99	3.78	62.94	84.64	3.25	3.34

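Two things are interesting in this table.

First, the README's note that the 1.3B renderer comes close to the 14B on simple tasks. The GitHub news says the 1.3B is close to the 14B variant on simple tasks such as style transfer, subtitle or watermark removal, and local editing, but falls behind on complex tasks such as human generation.

Second, on VBench the 1.3B scores 84.69 and the 14B scores 84.64, so the two are nearly tied, with the 1.3B marginally higher, while the 14B leads on every other column of the table. On the VBench number alone one could argue that scale is not everything, but benchmark difficulty and task type have to be considered together.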

5-2. What really matters in the experiments

What really deserves attention in Bernini's evaluation is task coverage rather than average scores.

V2V tests instruction following and source preservation together.
RV2V tests reference binding.
Content insertion tests object placement and temporal integration.
R2V tests the ability to turn multiple reference images into a coherent video.
The self-built arena is an attempt to compare against commercial models from a human preference standpoint.

The Hugging Face model card states that in video editing Bernini reaches the same first tier as leading closed-source commercial models, and that on its own arena platform blind human votes are aggregated into Bradley-Terry scores and a pairwise win-rate matrix.
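
For context on what that aggregation involves, below is a minimal sketch of fitting Bradley-Terry strengths from pairwise vote counts with the standard minorization-maximization update, plus the pairwise win-rate matrix. The vote counts are made up for illustration, the three rows stand for three unnamed hypothetical models, and nothing here reflects the arena's actual data or implementation.

```python
import numpy as np

def fit_bradley_terry(wins, num_iters=200):
    """MM updates for Bradley-Terry strengths; wins[i, j] = times model i beat model j."""
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T                      # comparisons played per pair
    total_wins = wins.sum(axis=1)
    strength = np.ones(len(wins))
    for _ in range(num_iters):
        denom = (games / (strength[:, None] + strength[None, :])).sum(axis=1)
        strength = total_wins / denom
        strength /= strength.sum()             # fix the scale; only ratios matter
    return strength

def win_rate_matrix(wins):
    """Fraction of pairwise comparisons won; NaN where a pair was never compared."""
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T
    return np.divide(wins, games, out=np.full_like(wins, np.nan), where=games > 0)

# Made-up blind-vote counts for three hypothetical models.
votes = np.array([[0, 7, 9],
                  [3, 0, 6],
                  [1, 4, 0]])
scores = fit_bradley_terry(votes)
rates = win_rate_matrix(votes)
print(np.round(scores, 3))
print(np.round(rates, 2))
```

Because the fitted strengths depend entirely on which pairs were compared and on which prompts, the same models can rank differently under a different vote collection protocol.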

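The self-built arena needs careful interpretation, though. The comparison targets, prompt set, rater protocol, sampling settings, and whether the prompt enhancer was used can all change the result substantially. This claim is an important signal, but it should be verified alongside the release notes rather than read as a fully independent benchmark result.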

5-3. Ablation and interpretation points

Based on the paper abstract, the axes to watch in the ablations are as follows.

Is the ViT embedding semantic interface better than a caption interface?
Is separate training plus light co-training of planner and renderer more stable than end-to-end training?
Which failures does SA-3D RoPE actually reduce with multiple visual inputs?
How much does the chain-of-thought planner contribute on complex editing?
How much do source VAE features contribute to editing detail preservation?

This draft does not include specific ablation numbers. It is safer to open the tables in the paper PDF, check the detailed numbers, and then fill them in.

6. Limitations

The full framework release and the renderer release must be distinguished.
- The public GitHub and Hugging Face releases center on Bernini-R renderer inference. Whether the full MLLM planner and training pipeline described in the paper have been released as-is needs separate confirmation.
Self-built arena results depend heavily on protocol.
- Blind votes and Bradley-Terry aggregation are a meaningful approach, but results can change with the prompt distribution, rater pool, and model sampling settings.
Prompt enhancer use complicates interpretation of the evaluation.
- The model card recommends the prompt enhancer for best quality. The contribution of base model capability and that of external prompt rewriting then have to be separated.
Video inference cost is not small.
- The video task examples use 8-GPU torchrun and Ulysses sequence parallelism. The hardware requirement is high for a researcher to reproduce locally right away.
The semantic embedding interface may have low interpretability.
- ViT embeddings may preserve more information than captions, but they are hard for a person to inspect directly. How to do failure analysis remains open.

7. My Take

7-1. Why this matters for my work

Bernini is better read as a multimodal system design paper than as a video diffusion paper. The point is not only video quality but how to divide responsibility between the MLLM and the diffusion model.

Recent multimodal generation tries to push more and more input types into one model. But solving everything with one monolithic model makes training cost, data mixture, modality alignment, and debugging all harder. Bernini shows the opposite direction: the MLLM handles semantic reasoning, the renderer handles VAE latent denoising, and the two are connected through a ViT embedding interface.

This structure is useful for agent systems as well as generative models. An agent does not need to produce pixels itself, but a structure where it plans world state as a semantic representation and a downstream module executes it can be reused quite broadly.

7-2. Reuse potential

The ideas in Bernini with the most reuse value are as follows.

Semantic interface design
- Placing an intermediate semantic space such as ViT embeddings between a natural language plan and a pixel latent.
Planner and renderer decoupling
- Not tying the reasoning model and generation model together fully end to end, but defining a clear interface and then applying light co-training.
Segment-aware position encoding
- Reflecting segment identity in the position encoding for multi-input generation where different visual segments such as source, reference, and target placeholder are mixed.
Task routing via unified cases
- An inference design that routes tasks such as t2i, i2i, t2v, v2v, rv2v, and r2v through case files.
Small renderer release as practical entry point
- The 1.3B renderer reportedly coming close to the 14B on simple editing tasks could be practical for lightweight editing pipelines.

7-3. Follow-up papers

Directions worth reading as follow-ups are below.

SemanticGen: a related line of work on video generation in semantic space.
Wan2.2: the video diffusion stack that the Bernini-R release references as a base component.
Qwen2.5-VL: the MLLM family that Bernini mentions in its acknowledgements.
Video editing benchmark papers: EditVerse, OpenVE, OpenS2V, VBench.
MLLM generation pipelines related to multimodal planner and renderer decoupling.

8. Summary

Bernini unifies video generation and editing in an MLLM planner plus DiT renderer structure.
The core interface is not a natural language caption but the target semantic representation in ViT embedding space.
SA-3D RoPE is a device for distinguishing token sources in multi-input settings where several visual segments are mixed.
The public release centers on Bernini-R renderer inference and weights, so whether the full planner is released should be considered separately.
The paper's value lies in the system design that separates semantic planning from rendering, more than in its video quality claims.
