Cosmos 3: Omnimodal World Models for Physical AI Review

2026-07-02, 12 min read

0. Introduction

Cosmos 3 is an NVIDIA technical report that tries to stretch the term “world model” from a plain video generation model to a foundation model stack for physical AI. The paper aims to process and generate language, image, video, audio, and action within a single omnimodal world model family. Reading it as just another text-to-video model therefore misses the point. The direction it actually pushes is to stop building the VLM, video generator, world simulator, and robot policy model separately, and to connect them inside one model family and one serving stack.

The central question of the paper is the following.

“For physical AI, is a model interface of text and image in, text out still enough?”

Cosmos 3's answer is no. In robotics, autonomous driving, industrial vision, and video analytics agents, a model has to understand the scene, roll out the future, and handle sound and action as well. Beyond that, it has to take observation and action together as conditions and generate a future video or an action sequence. Cosmos 3 splits these requirements into two runtime surfaces, Reasoner and Generator, while tying them together through a shared Mixture-of-Transformers (MoT) backbone, multimodal attention, and mRoPE.

One-line summary: Cosmos 3 is a large-scale omnimodal model release that combines an autoregressive reasoning path and a diffusion generation path in a Mixture-of-Transformers structure, aiming to understand and generate text, image, video, audio, and action within a single physical AI world model family.

The paper is worth reading now for the following reasons.

It frames physical AI modeling as a single omnimodal backbone problem instead of splitting it into VLM, video generator, simulator, and policy model.
The design separates Reasoner mode and Generator mode yet shares the architecture and position representation between them.
The model card, GitHub repository, HF collection, and inference recipes are released together, so the paper's claims can be checked against the actual deployment surface.
It is a good case study of how a world model targets the full stack, from agent training and synthetic data to closed-loop simulation and robot policy learning.

1. Problem Setting

1-1. Problem definition

The problem Cosmos 3 targets is “what input-output interface should a foundation model for physical AI expose?”

A typical VLM takes an image or video plus text and outputs text. A video generation model takes text or an image and produces video. A robot policy model takes an observation and an instruction and emits an action. A world simulator takes the current state and an action and predicts the future state. In prior work these four usually exist as separate models, with separate tokenizers, training recipes, and serving paths.

In physical AI, however, the capabilities that are actually needed do not split cleanly along these boundaries.

Scene understanding: the model has to read objects, states, events, relations, and physical plausibility.
Future prediction: given the current observation and control, it has to roll out future video.
Inverse dynamics: it has to infer the action trajectory from observed changes.
Policy: it has to generate action chunks from an instruction and visual context.
Synthetic data: it has to expand training data conditioned on text, image, video, audio, and action.

In other words, the problem setting of Cosmos 3 is less about raising a single task score and more about binding the entire modality graph that physical AI needs into one model family.

1-2. Why previous approaches are insufficient

Existing approaches fall into roughly four types.

Type	Main interface	Limitation
VLM	vision + text to text	Can reason, but has weak action and generation surfaces
Video generator	text or image to video	Physical reasoning and the downstream policy interface are decoupled
Robot policy	observation + instruction to action	Heavily specialized per embodiment, weak at general world modeling
Simulator	state + action to future state	Limited photorealism, language grounding, and open-world coverage

Cosmos 3 treats this separation as the bottleneck. To understand which object a robot should pick up and where to place it, the model has to reason like a VLM. To see what that behavior leads to, it has to generate the future like a world simulator. To execute the behavior, it has to output actions like a policy model. Handing each capability to a different model raises representation mismatch and system integration cost.

This is where the value of the omnimodal world model comes from. The claim is not that one model does everything perfectly, but that placing understanding, generation, simulation, and action on the same token and latent interface can make a better research platform in the long run.

2. Core Idea

2-1. Main contribution

The core contributions of Cosmos 3 can be grouped into five.

Omnimodal input-output space
- Text, image, video, audio, and action are all treated as model inputs and outputs.
- Outputs cover text, image, video, sound, and action state.
Mixture-of-Transformers architecture
- Reasoning uses an autoregressive transformer.
- Generation uses a diffusion transformer.
- Both paths operate within the same omnimodal model family.
Reasoner and Generator surfaces
- The Reasoner focuses on perception, planning, grounding, temporal understanding, and action reasoning.
- The Generator focuses on image, video, audio, action-conditioned rollout, and policy output.
Physical AI oriented model family
- Variants include Cosmos3-Nano, Cosmos3-Super, Text2Image, Image2Video, and Policy-DROID.
- Nano is described as 16B-class, and Super as 64B-class according to the model card and GitHub.
Open release stack
- Code, model checkpoints, curated synthetic datasets, and an evaluation benchmark are released under the OpenMDW1.1 license.
- Integration paths are provided for Diffusers, vLLM-Omni, vLLM, NIM, and Cosmos Framework.

2-2. Design intuition

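The design intuition of the paper is fairly clear.

Splitting what a physical AI model must do by modality can make training and serving easier. From the agent's point of view, however, perception, prediction, and action are not independent problems. If a robot has to pick up a cup, for example, it first has to find the objects and free space in the scene, then decide on a gripper trajectory, and it should also be able to predict what video future results from executing that trajectory.

Cosmos 3 tries to bind this whole process into a single interface as follows.

Observations are taken in as text, image, and video tokens.
Action or trajectory conditions are taken in additionally.
The Reasoner path describes the situation and plans.
The Generator path denoises image, video, audio, and action.
The same physical scene is shared by both reasoning and generation.

The important point here is not that every modality is generated the same way. Autoregressive decoding is the natural fit for text, and diffusion denoising is the natural fit for image, video, audio, and action. The point of Cosmos 3 is that it does not force the generation mechanisms into one, and instead places AR and diffusion side by side under the higher-level MoT structure.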

3. Architecture / Method

3-1. Overview

Item	Description
Goal	Build an omnimodal world model for physical AI
Core architecture	Mixture-of-Transformers
Reasoner path	AR transformer, causal self-attention, text-output oriented
Generator path	diffusion transformer, full attention, multimodal-denoising oriented
Shared components	transformer architecture, multimodal attention layers, 3D mRoPE
Main inputs	text, image, video, audio, action trajectory
Main outputs	text, image, video, sound, action state
Release surface	GitHub, HF model collection, Cosmos Framework, vLLM-Omni, Diffusers

3-2. Module breakdown

1) Reasoner mode

Reasoner mode is the path that takes text, image, and video and generates text. According to the GitHub README, the Reasoner is used for world understanding, grounding, physical reasoning, task planning, action forecasting, and embodied agent reasoning.

It looks like an ordinary VLM, but the Cosmos 3 Reasoner is tuned for physical AI tasks and has the following kinds of output in mind.

detailed captioning
temporal localization
spatial grounding
action chain-of-thought style reasoning
driving scene understanding
physical plausibility judgment
task planning

According to the HF model card, Reasoner input supports text, text + image, and text + video, and the context window is listed as up to 256K tokens. For video input, sampling at 4 fps is recommended. This is a fairly important interface for long video understanding and embodied reasoning.
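
To make the 4 fps recommendation concrete, the sketch below picks which frames of a source clip to keep before handing it to the Reasoner. Only the 4 fps target comes from the model card; the function name `sample_frame_indices` and the floor-based index rule are assumptions made for illustration and are not part of the Cosmos 3 tooling.

```python
def sample_frame_indices(num_frames: int, source_fps: float,
                         target_fps: float = 4.0) -> list[int]:
    """Return indices of frames to keep so the clip is sampled at about target_fps."""
    if num_frames <= 0 or source_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames, source_fps and target_fps must be positive")
    if target_fps >= source_fps:
        # Source is already at or below the target rate: keep every frame.
        return list(range(num_frames))
    step = source_fps / target_fps  # source frames per kept frame
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))  # floor to the nearest earlier frame
        k += 1
    return indices


# A 10-second clip recorded at 30 fps has 300 frames; at 4 fps, 40 are kept.
indices = sample_frame_indices(num_frames=300, source_fps=30.0)
print(len(indices), indices[:4])
```

How many tokens each kept frame costs depends on the checkpoint's visual tokenizer, so the 256K context budget still has to be checked against the real preprocessing.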

2) Generator mode

Generator mode is the diffusion path that denoises noisy image, video, audio, and action tokens. The GitHub README describes the Generator as taking text, vision, sound, and action inputs and producing vision, sound, and action outputs.

Representative workflows are as follows.

Workflow	Inputs	Outputs	Meaning
Text-to-image	text	image	Generate a physical scene image
Text-to-video	text	video	Generate industrial or robotics video
Text-to-video with sound	text	video + sound	Generate synchronized video and audio
Image-to-video	text + image	video	Generate motion from a starting image
Video-to-video	text + video	video	prompt-guided transformation
Forward dynamics	text + vision + action	video	action-conditioned future rollout
Action policy	text + vision	action + video	Generate an action trajectory and its rollout
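
The workflow table can be read as a small capability map. The sketch below encodes it as data and answers "given the inputs I have, which workflows produce the output I want?". The input and output labels are copied from the table; the dictionary `WORKFLOWS` and the helper `find_workflows` are illustrative names, not part of any Cosmos 3 API.

```python
# Workflow table from this section, encoded as (required inputs, outputs).
WORKFLOWS: dict[str, tuple[frozenset[str], frozenset[str]]] = {
    "text-to-image": (frozenset({"text"}), frozenset({"image"})),
    "text-to-video": (frozenset({"text"}), frozenset({"video"})),
    "text-to-video-with-sound": (frozenset({"text"}), frozenset({"video", "sound"})),
    "image-to-video": (frozenset({"text", "image"}), frozenset({"video"})),
    "video-to-video": (frozenset({"text", "video"}), frozenset({"video"})),
    "forward-dynamics": (frozenset({"text", "vision", "action"}), frozenset({"video"})),
    "action-policy": (frozenset({"text", "vision"}), frozenset({"action", "video"})),
}


def find_workflows(available_inputs: set[str], wanted_output: str) -> list[str]:
    """List workflows whose inputs are all available and that emit wanted_output."""
    return sorted(
        name
        for name, (inputs, outputs) in WORKFLOWS.items()
        if inputs <= available_inputs and wanted_output in outputs
    )


print(find_workflows({"text", "vision"}, "action"))
print(find_workflows({"text", "vision", "action"}, "video"))
```

The table mixes "image", "video", and "vision" as input labels, and the sketch keeps them as distinct strings rather than guessing how they map onto each other.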

This lineup matters because the video generator is not used only as a synthetic data model. Cosmos 3 also intends the Generator to serve as a world simulator, a future predictor, and a policy rollout generator.

3) Unified 3D mRoPE

Cosmos 3 says it uses a 3D multi-dimensional rotary position embedding, mRoPE, to encode spatial and temporal structure in a unified way. This matters less for multimodal generation in general than for the case where video, action, audio, and trajectory arrive together.

For a 2D image, spatial relations matter, and for video, temporal order matters. For audio, continuity along the time axis matters, and an action trajectory has to be handled as a frame-aligned control sequence. 3D mRoPE can be read as the device that places the position information of these modalities into a single representation system.
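
As a rough picture of what "one representation system" can mean, the sketch below assigns a 3-axis position (temporal, height, width) to every token in a mixed text and video sequence. It follows the multimodal RoPE indexing convention known from Qwen2-VL, where text tokens repeat one index on all three axes and the next segment starts after the largest index used so far. That convention is an assumption here: this review does not state Cosmos 3's exact layout, and the sketch covers only position indices, not the rotary rotation itself.

```python
Position = tuple[int, int, int]  # (temporal, height, width) index


def text_positions(start: int, length: int) -> list[Position]:
    """Text tokens advance all three axes together, like 1D RoPE."""
    return [(start + i, start + i, start + i) for i in range(length)]


def video_positions(start: int, frames: int, height: int, width: int) -> list[Position]:
    """Video tokens get one index per axis of the (frame, row, column) grid."""
    return [
        (start + t, start + h, start + w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


def build_positions(segments: list[tuple[str, tuple[int, ...]]]) -> list[Position]:
    """Concatenate segments; each one starts after the largest index used so far."""
    positions: list[Position] = []
    next_start = 0
    for kind, shape in segments:
        if kind == "text":
            segment = text_positions(next_start, *shape)
        elif kind == "video":
            segment = video_positions(next_start, *shape)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
        if not segment:
            continue
        positions.extend(segment)
        next_start = max(max(p) for p in segment) + 1
    return positions


# Two text tokens, a 2-frame 2x2 video grid, then one more text token.
for position in build_positions([("text", (2,)), ("video", (2, 2, 2)), ("text", (1,))]):
    print(position)
```

Audio and action tokens would need their own rule, for instance advancing only the temporal axis in step with the video frames, but that choice is not specified in the material reviewed here.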

4) Model family

The released model family can be summarized as follows.

Model	Size	Primary use
Cosmos3-Nano	16B	compact omnimodal world model
Cosmos3-Super	64B-class	frontier-scale omnimodal world model
Cosmos3-Super-Text2Image	64B-class	text-to-image generation
Cosmos3-Super-Image2Video	64B-class	image-to-video generation
Cosmos3-Nano-Policy-DROID	16B	DROID manipulation and control

One caveat: some models show up as 65B in the HF collection, while the GitHub README and the model card body describe them as 64B. Functionally the point is that Super is the large checkpoint, but the exact parameter accounting is best rechecked against the release artifacts.

5) Runtime surface

Cosmos 3 does not ship only a research demo. It comes with a fairly wide set of runtime surfaces.

Diffusers: Generator research and model development
vLLM-Omni: OpenAI-compatible Generator serving
vLLM: Reasoner serving
NIM: prebuilt optimized Reasoner endpoint
Cosmos Framework: training, inference, omni-model workflow
Cosmos Curator: data curation
Cosmos Evaluator: model evaluation

In practice this part matters. A physical AI model is hard to use as a paper model alone, because serving details such as video, audio, action, local media paths, async video jobs, guardrails, tensor parallelism, and context parallelism become the deployment bottleneck right away.

4. Training / Data / Recipe

4-1. Data

The paper abstract emphasizes the release of code, model checkpoints, curated synthetic datasets, and an evaluation benchmark. The HF model card shows the training data composition in more detail.

The dataset overview according to the HF card is as follows.

Item	Value
Total size	1.3B data points
Dataset entries	393
Collection period	2024-2026
Public examples	OpenImage, Coyo700M, YouTube Video, UMI
Private examples	Egocentric, Nexar, AgiBot, HOI
Synthetic examples	HiDream-I1 images, Qwen-Image-2512 images, Qwen3-VL captions

Sample counts per modality are listed as follows.

Modality	Reasoning data	Generation data
Text	22M	Not Applicable
Image	19M	767M
Video	1M	348M
Audio	Not Applicable	139M
Action	Not Applicable	8M

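The key takeaway from this table is that the generation side has very large visual, audio, and action coverage. At the same time, the roles of reasoning data and generation data are kept separate. Cosmos 3 is better read not as a model that simply scaled up VLM data, but as one that redesigned its data axes for world generation and action-conditioned generation.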
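
A quick arithmetic check makes the imbalance explicit. The numbers below are the per-modality counts from the table, in millions; the sketch assumes the reasoning and generation columns are disjoint, which the excerpted card does not state outright.

```python
# (reasoning, generation) sample counts in millions; None means "Not Applicable".
SAMPLES_MILLIONS = {
    "text": (22, None),
    "image": (19, 767),
    "video": (1, 348),
    "audio": (None, 139),
    "action": (None, 8),
}

reasoning_total = sum(r for r, _ in SAMPLES_MILLIONS.values() if r is not None)
generation_total = sum(g for _, g in SAMPLES_MILLIONS.values() if g is not None)
grand_total = reasoning_total + generation_total
generation_share = generation_total / grand_total

print(f"reasoning:  {reasoning_total}M")
print(f"generation: {generation_total}M")
print(f"total:      {grand_total}M")
print(f"generation share: {generation_share:.1%}")
```

The columns sum to 42M reasoning and 1,262M generation samples, 1,304M in total, which is consistent with the 1.3B data points in the overview. Under that reading, generation data is roughly 97% of the corpus by sample count.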

4-2. Training strategy

Going by the public material alone, the Cosmos 3 training strategy can be understood as the following flow.

Omnimodal pretraining
- The backbone is trained on a heterogeneous corpus that includes text, image, video, audio, and action.
- Per-modality tokenizer or encoder representations are converted into model-ready tokens or latents.
Reasoner specialization
- Text output capability is strengthened with visual reasoning, temporal event understanding, grounding, planning, and physical plausibility tasks.
- A serving interface that follows the Qwen3-VL-compatible message convention is provided.
Generator specialization
- Covers text-to-image, text-to-video, image-to-video, video-to-video, audio-visual generation, and action-conditioned generation.
- A recipe is provided that expands a short prompt into a dense structured prompt through prompt upsampling.
Action and policy post-training
- Specialized variants are built with robot-embodiment-specific action outputs, as in the DROID policy model.
- The action dimension depends on the embodiment, with camera motion, autonomous vehicle, egocentric motion, single-arm, dual-arm, and humanoid settings distinguished.

4-3. Engineering notes

1) Prompt upsampling is not optional in practice

The HF card provides a recipe that upsamples a short prompt into a JSON structure to improve text-to-video quality. The pattern is familiar from general image and video models, but it matters more for Cosmos 3 because physical scene, object detail, motion, camera, and sound are all entangled.

So what matters in deployment is not a single model call but having a prompt compiler alongside it.
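
A minimal sketch of what such a prompt compiler could look like follows. The field names (`scene`, `objects`, `motion`, `camera`, `sound`) are hypothetical, chosen from the aspects listed above; they are not the JSON schema of the HF recipe. The `upsample` argument stands in for whatever LLM call performs the expansion, and `stub_upsampler` is a hard-coded placeholder.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class StructuredPrompt:
    # Hypothetical fields; the real recipe's schema may differ.
    scene: str
    objects: list[str] = field(default_factory=list)
    motion: str = ""
    camera: str = ""
    sound: str = ""


def compile_prompt(short_prompt: str,
                   upsample: Callable[[str], StructuredPrompt]) -> str:
    """Expand a short prompt and refuse to emit JSON with empty fields."""
    structured = upsample(short_prompt)
    missing = [name for name, value in asdict(structured).items() if not value]
    if missing:
        raise ValueError(f"upsampler left fields empty: {missing}")
    return json.dumps(asdict(structured), ensure_ascii=False)


def stub_upsampler(short_prompt: str) -> StructuredPrompt:
    """Placeholder for an LLM call; returns fixed details for demonstration."""
    return StructuredPrompt(
        scene=short_prompt,
        objects=["robot arm", "red cup"],
        motion="the gripper closes on the cup and lifts it slowly",
        camera="static wide shot",
        sound="quiet motor hum",
    )


print(compile_prompt("a robot arm picks up a cup", stub_upsampler))
```

Validating the upsampler output before generation is the design point here: an expensive video job should not start from a prompt that silently dropped motion or sound.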

2) Reasoner and Generator need different serving paths

vLLM or NIM is the natural fit for the Reasoner, and Diffusers or vLLM-Omni is the natural fit for the Generator. The GitHub README draws the line by recommending vLLM-Omni for Generator production inference and the vLLM Reasoner for understanding-only text tasks.

This distinction matters from an engineering standpoint. Cosmos 3 being a single omnimodal model family does not collapse serving into a single path. If anything, each task needs a decision about which runtime surface to use.
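
The routing decision can be written down as a small function. The runtime names are the ones listed in this review; the function `pick_runtime` and its parameters are illustrative assumptions, and the rule is a simplification of the README guidance rather than an official selector.

```python
def pick_runtime(outputs: set[str], production: bool,
                 prebuilt_endpoint: bool = False) -> str:
    """Choose a runtime surface from the requested output modalities."""
    if not outputs:
        raise ValueError("at least one output modality is required")
    if outputs <= {"text"}:
        # Text-only output is a Reasoner task.
        return "NIM" if prebuilt_endpoint else "vLLM"
    # Anything that emits image, video, sound, or action is a Generator task.
    return "vLLM-Omni" if production else "Diffusers"


print(pick_runtime({"text"}, production=True))
print(pick_runtime({"video", "sound"}, production=True))
print(pick_runtime({"video"}, production=False))
```

Whether a given Generator workflow is actually supported on vLLM-Omni still has to be checked against the README's support table for the release in use.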

3) Action interface has embodiment-dependent shape

Action conditioning is not just a matter of appending a few action tokens. The HF card and GitHub README define a separate action dimension per embodiment, including camera motion, autonomous vehicle, egocentric motion, single-arm robot, dual-arm robot, and humanoid robot.

This means that using Cosmos 3 for robot policy requires designing the following alongside model quality.

action normalization
frame alignment
camera calibration
embodiment metadata
safety filter
rollout validation
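
Two items from this list, action normalization and frame alignment, can be sketched in a few lines. The `Embodiment` class, the per-dimension min/max scaling to [-1, 1], and the "latest action at or before the frame" alignment rule are all assumptions for illustration; the actual action dimensions and normalization used by Cosmos 3 are defined in its model card and README, and the toy values below are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical embodiment metadata: per-dimension action bounds."""
    name: str
    low: tuple[float, ...]
    high: tuple[float, ...]

    @property
    def action_dim(self) -> int:
        return len(self.low)


def normalize_action(action: list[float], embodiment: Embodiment) -> list[float]:
    """Clip each dimension to its bounds, then scale linearly to [-1, 1]."""
    if len(action) != embodiment.action_dim:
        raise ValueError(
            f"{embodiment.name} expects {embodiment.action_dim} dims, got {len(action)}"
        )
    normalized = []
    for value, low, high in zip(action, embodiment.low, embodiment.high):
        clipped = min(max(value, low), high)
        normalized.append(2.0 * (clipped - low) / (high - low) - 1.0)
    return normalized


def align_actions_to_frames(num_actions: int, action_hz: float,
                            video_fps: float, num_frames: int) -> list[int]:
    """For each video frame, pick the index of the latest action at or before it."""
    return [
        min(int(i * action_hz / video_fps), num_actions - 1)
        for i in range(num_frames)
    ]


toy_arm = Embodiment(name="toy-arm", low=(0.0, -1.0), high=(10.0, 1.0))
print(normalize_action([5.0, 0.0], toy_arm))
print(align_actions_to_frames(num_actions=5, action_hz=10.0, video_fps=5.0, num_frames=4))
```

The dimension check is the cheap part of a safety filter: a mismatch between logged actions and the embodiment the checkpoint expects should fail loudly before any rollout is generated.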

4) Frame count and size are artifact-dependent

The GitHub README gives a frame count of 5 to 300 with a default of 189 for generation settings. The HF card gives a video generation duration of 5 to 400 frames with a default of 189. Because the two public artifacts disagree on the upper bound, users should recheck the limit against the checkpoint and runtime they actually use.

5. Evaluation

5-1. Main results

According to the abstract, Cosmos 3 shows strong results across diverse understanding and generation tasks, and at the time the technical report was written its post-trained models ranked as the best open-source Text-to-Image and Image-to-Video models on Artificial Analysis and as the best policy model on RoboArena.

The project page summarizes the benchmark claims in two groups.

Reasoning: ranks high among open models on the average of Robotics, Smart Space, and Driving benchmarks
Generation: evaluates text-to-image, image-to-video, and robot policy on R-Bench, Artificial Analysis, RoboLab, and RoboArena

The GitHub README also says inference benchmarks are provided as a separate file. The Generator comparison covers PyTorch, vLLM-Omni, and Diffusers, and the Reasoner is measured for TTFT, request latency, and throughput at concurrency 1, 64, 128, and 256.
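
For readers less familiar with these serving metrics, the sketch below computes them from per-request timestamps. The `RequestTiming` record and the `summarize` helper are illustrative and unrelated to the benchmark scripts in the repository; the timing values are made up to show the arithmetic.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTiming:
    sent_at: float         # seconds, when the request was issued
    first_token_at: float  # seconds, when the first output token arrived
    finished_at: float     # seconds, when the last output token arrived
    output_tokens: int


def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    """Mean TTFT, mean request latency, and aggregate output-token throughput."""
    wall_time = max(t.finished_at for t in timings) - min(t.sent_at for t in timings)
    return {
        "ttft_s": mean(t.first_token_at - t.sent_at for t in timings),
        "latency_s": mean(t.finished_at - t.sent_at for t in timings),
        "throughput_tok_per_s": sum(t.output_tokens for t in timings) / wall_time,
    }


# Two concurrent requests with invented timings.
batch = [
    RequestTiming(sent_at=0.0, first_token_at=0.5, finished_at=2.0, output_tokens=100),
    RequestTiming(sent_at=0.0, first_token_at=1.5, finished_at=4.0, output_tokens=100),
]
print(summarize(batch))
```

Raising concurrency typically trades higher TTFT and per-request latency for higher aggregate throughput, which is why the three numbers are reported together at each concurrency level.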

Given this evaluation setup, Cosmos 3 reads less like a single-metric SOTA paper and more like a release report that tries to show model capability and serving feasibility together.

5-2. What really matters in the experiments

There are three experimental axes that really matter for Cosmos 3.

Capability unification
- How broadly the same model family covers reasoning, generation, and action.
- The breadth of the task interface matters more than the top score on any single benchmark.
Physical consistency
- Whether an action-conditioned rollout preserves physical dynamics matters more than whether the generated video looks good.
- In robotics and autonomous driving, temporal consistency and causality tie directly to safety.
Serving practicality
- Beyond the quality gap between the 16B Nano and the 64B-class Super, latency, memory, GPU requirements, and vLLM-Omni support status need to be checked.
- Some action modes are marked as review or planned in the GitHub README, so release maturity has to be assessed per mode.
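
For the memory side of serving practicality, a back-of-the-envelope estimate of weight memory is a useful first filter. The sketch below multiplies parameter count by bytes per parameter and divides by the tensor-parallel degree, assuming even sharding. It covers weights only; activations, KV cache, diffusion latents, and framework overhead come on top, and the helper name `weight_memory_gb` is made up for this example.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}


def weight_memory_gb(params_billions: float, dtype: str = "bf16",
                     tensor_parallel: int = 1) -> float:
    """Weight memory per GPU in GB (1 GB = 1e9 bytes), weights only."""
    if tensor_parallel < 1:
        raise ValueError("tensor_parallel must be at least 1")
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[dtype]
    return total_bytes / 1e9 / tensor_parallel


for name, size in [("Nano (16B)", 16), ("Super (64B-class)", 64)]:
    print(name, weight_memory_gb(size), "GB in bf16 on one device")
print("Super, bf16, TP=8:", weight_memory_gb(64, tensor_parallel=8), "GB per GPU")
```

Under these assumptions the 16B Nano needs about 32 GB for bf16 weights and the 64B-class Super about 128 GB, so the Super checkpoint does not fit on a single 80 GB GPU in bf16 without sharding or lower precision.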

6. Limitations

Physics fidelity is still not guaranteed
- The GitHub README states that long, high-resolution, physically complex outputs can show temporal inconsistency, unstable camera/object motion, inaccurate sound-video alignment, object morphing, inaccurate 3D structure, and implausible physical dynamics.
Open model does not mean turnkey deployment
- Even with code and checkpoints released, deploying to a safety-critical robot or autonomous system requires validation, guardrails, closed-loop testing, and system-level safety analysis.
Unified architecture does not remove runtime complexity
- Reasoner, Generator, policy, and forward dynamics each have a different runtime path and request schema.
- The choice among vLLM-Omni, Diffusers, vLLM, and NIM has to be made per task.
Benchmark aggregation can hide modality-specific failure
- Even when the Robotics, Driving, and Smart Space average is high, the model can fail on rare events, OOD actions, and long-horizon rollouts in a specific domain.
Data provenance and license review remain important
- The release is under the OpenMDW1.1 license, but commercial or downstream deployment still requires a separate review of the model card and dataset provenance.

7. My Take

7-1. Why this matters for my work

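What makes this paper interesting is the interface design more than the model architecture. In recent physical AI papers, VLM, VLA, video world model, robot policy, and synthetic data generator are converging. Cosmos 3 binds this trend into one family quite openly.

The core message of Cosmos 3 is the following.

The competitiveness of a physical AI foundation model may be decided by the breadth of its input-output schema before it is decided by benchmark scores.

For teams working on robotics or autonomous driving, model selection is unlikely to end with text answer scores from now on. They will also need to ask whether the model supports action-conditioned rollout, video with sound, and inverse dynamics, whether the policy output schema fits, and whether a serving stack exists.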

7-2. Reuse potential

From a reuse standpoint, four directions stand out.

Video analytics agent backbone
- The Reasoner can be used for long video understanding, temporal localization, dense captioning, and event extraction.
Synthetic data generation pipeline
- Text-to-video, image-to-video, video-to-video, and video with sound can be used to create rare scenario data.
Robot policy pretraining or post-training backbone
- Attaching an embodiment-specific action head, as in the DROID policy variant, is a viable approach.
Closed-loop simulator candidate
- Forward dynamics mode can produce action-conditioned future rollouts.
- In this case, however, verifying physical fidelity is mandatory.

7-3. Follow-up papers

Cosmos 1 / Cosmos 2 / Cosmos Transfer line
OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
MolmoAct2: Action Reasoning Models for Real-world Deployment
Latent Spatial Memory for Video World Models

8. Summary

Cosmos 3 is an omnimodal world model family that handles text, image, video, audio, and action.
Its core structure is a Mixture-of-Transformers that combines an AR reasoning path with a diffusion generation path.
The Reasoner is closer to VLM, grounding, planning, and physical reasoning, while the Generator handles image, video, audio, and action rollout.
The public release includes not only model checkpoints but also GitHub, the HF collection, inference recipes, and an evaluation ecosystem.
Physical fidelity, action consistency, runtime maturity, and safety validation still have to be assessed separately.
