Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments Review

2026-06-28 12 min read

0. Introduction

Qwen-VLA is a paper whose point is easy to miss if you read it as just “Qwen for robots.” The question it really asks is closer to this: can embodied decision-making problems that look unrelated, such as manipulation, navigation, and trajectory prediction, be folded into a single vision-language-action model?

Existing VLA papers usually show a model that is strong on a specific manipulation benchmark, a specific robot arm, and a specific action space. Qwen-VLA instead tries to extend Qwen's vision-language modeling stack beyond perception, understanding, and reasoning to continuous action and trajectory generation. The two key blocks are the DiT-based action decoder and embodiment-aware prompt conditioning.

One-line summary: Qwen-VLA is a VLA foundation model built on a Qwen-family VLM. It jointly pretrains on heterogeneous robot data and trajectory supervision, conditions on robot embodiment and control convention through the text prompt, and tries to unify manipulation, navigation, and trajectory prediction as a single action-and-trajectory prediction problem.

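This paper is worth reading now for the following reasons.

It reframes VLA as an embodied foundation model problem rather than a single-task manipulation policy.
Handling robot embodiment through prompt conditioning rather than separate adapters is a practical direction.
It puts manipulation, navigation, and trajectory-centric tasks under a single model evaluation.
It raises deployment-side questions by including real-world ALOHA OOD and DOMINO zero-shot dynamic manipulation.

This paper matters for more than the news that Qwen has expanded into robotics. The key is not simply bolting a VLM's reasoning prior onto an action policy. It is that the robot's body and action semantics are communicated to the model through a language interface, and that action decoding is recast as a trajectory generation problem.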

1. Problem Setting

1-1. Problem definition

The problem this paper targets is the fragmentation of embodied intelligence.

Robot models are usually split up as follows.

Manipulation deals with end-effector actions and gripper control.
Navigation deals with instruction following and map-free movement.
Trajectory prediction deals with scene dynamics or future paths.
When the robot embodiment changes, the action dimension, camera view, and control convention change with it.

This separation is convenient for research but a handicap for building a generalist embodied agent. When a model is tied to a specific task family or robot morphology, generalization is limited when it moves to a new environment or a new platform.

The problem Qwen-VLA tries to solve can be summarized by the following question.

“Can a vision-language model go beyond seeing, understanding, and describing, and also generate continuous actions for different embodiments?”

Simplified into a formula, the model must predict an action or trajectory $A$ conditioned on an observation $o$, a language instruction $u$, and an embodiment description $e$.

\[p(A | o, u, e)\]

The hard part is that the meaning of $A$ differs from robot to robot. In one environment it may be an arm pose, in another a mobile navigation trajectory, and in yet another a future trajectory label. That is why the embodiment description in Qwen-VLA is not mere metadata but a condition that determines the action semantics.

1-2. Why previous approaches are insufficient

Existing approaches have three main limitations.

First, task-specific policies are strong but not general. A manipulation-only policy may work well on manipulation benchmarks, yet it does not carry over naturally to navigation or trajectory-centric tasks.

Second, robot data is inherently heterogeneous. Camera setup, action frequency, robot body, gripper convention, and the simulator versus real-world domain all differ. Forcing this heterogeneity into a single action format loses information, while adding more task-specific adapters erodes the benefit of a unified model.

Third, VLMs are strong at perception and reasoning but need a separate interface to generate continuous control directly. Generating language tokens well and generating robot action sequences stably are different problems.

Approach	Main idea	Limitation
Task-specific robot policy	Train a policy tailored to a specific task and robot	Limited cross-task transfer
VLM plus planner	Generate a high-level plan with a VLM	Low-level continuous action still needs a separate policy
Single manipulation VLA	Generate robot actions from image and instruction	Hard to cover navigation, trajectory prediction, and embodiment transfer
Cross-embodiment adapter	Add a per-robot adapter or head	System complexity grows with the number of embodiments

Qwen-VLA does not solve this by forcing a single standardized action space. Instead, it puts a robot-specific textual description in the prompt and has the action decoder generate continuous actions and trajectories under that condition.

2. Core Idea

2-1. Main contribution

Qwen-VLA's core contributions can be summarized in the following five points.

Unified embodied foundation model
- Extends Qwen's vision-language modeling stack to action generation.
- Ties perception, understanding, reasoning, and action into a single VLA system.
DiT-based action decoder
- Does not rely on the language token generation head alone for continuous action.
- Attaches a diffusion transformer based decoder for action and trajectory generation.
Embodiment-aware prompt conditioning
- Includes a robot-specific textual description in the prompt.
- Tells the model explicitly which robot embodiment and control convention are in use.
Unified action-and-trajectory prediction
- Places manipulation, navigation, and trajectory prediction in the same prediction framework.
- Instead of splitting by task family, weaves action and trajectory supervision into a single training signal.
Large-scale joint pretraining
- Uses robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data together.

The biggest contribution of this paper is the problem reframing rather than the action decoder itself. Qwen-VLA is closer to “feed each robot's body and control convention in as a language condition and learn the rest in a shared VLA backbone” than to “attach a new head for every robot.”

2-2. Design intuition

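The design intuition behind Qwen-VLA is fairly clear.

First, embodied tasks look different on the surface but share requirements: visual grounding, spatial reasoning, language following, and temporal prediction. It is therefore more natural to put them on a shared multimodal representation than to split them into fully separate models.

Second, differences in robot embodiment are something to condition on, not something to remove. The same instruction means something different in execution on a robot arm, a mobile robot, or a bimanual setup. The model therefore has to know which body it is controlling before it generates an action.

Third, continuous action generation differs from text generation. Treating it purely as token-by-token language modeling makes action smoothness, trajectory consistency, and uncertainty modeling hard to handle. Qwen-VLA therefore uses a DiT-based action decoder to inject an inductive bias suited to trajectory generation.

Conceptually, the structure is as follows.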

\[context = f_{VLM}(observation, instruction, embodiment)\] \[a_{1:T} = f_{action}(context)\]

Here $f_{VLM}$ handles the Qwen-family multimodal understanding and reasoning, and $f_{action}$ acts as the decoder that generates continuous actions or trajectories.

In my reading, this paper matters because it moves the center of VLA from “action head design” to “how to turn task, body, and trajectory into a single conditional generation problem.”

3. Architecture / Method

3-1. Overview

Item	Description
Goal	Unify manipulation, navigation, and trajectory prediction in a single VLA model
Backbone	Qwen vision-language modeling stack
Key decoder	DiT-based action decoder
Key conditioning	embodiment-aware textual prompt
Training signal	robot trajectories, egocentric demonstrations, simulation data, VLN data, trajectory supervision, auxiliary V-L data
Output target	continuous action and trajectory generation
Main evaluation scope	manipulation, navigation, trajectory-centric benchmarks, OOD and real-world tests

3-2. Module breakdown

1) Qwen vision-language backbone

Qwen-VLA does not build a brand-new robot-only model from scratch. As the paper puts it, the core idea is to extend Qwen's vision-language modeling stack into the action domain.

This choice matters. Much of what a robot has to do is not purely a low-level control problem.

It has to understand the instruction.
It has to find the relevant object in the visual scene.
It has to interpret spatial relations.
It has to infer task progress.
It has to judge whether the next action fits the current observation and the language goal.

The VLM backbone supplies priors for the perception, understanding, and reasoning parts of this list. Qwen-VLA tries to connect those priors to continuous action generation.

2) DiT-based action decoder

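The action decoder is a key interface in Qwen-VLA. The paper uses a DiT-based action decoder to connect the multimodal context produced by Qwen's VLM stack to continuous action and trajectory generation.

DiT here can be read as an action decoder in the diffusion transformer family. That is, rather than generating the action sequence as plain categorical tokens, it treats the trajectory as the target of conditional generation.

Conceptually, one can take a noised version $A_t$ of the action sequence $A$ and have a decoder $D_{\theta}$ predict the denoised sequence, conditioned on the context $c$ and the step $t$.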

\[\hat{A}_0 = D_{\theta}(A_t, c, t)\]

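The equation above is a conceptual one written for this review. It is safer to check the exact loss and schedule against the full paper. What matters is that action decoding is not treated as an appendage of the language token head but is split off into a decoder suited to trajectory generation.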
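To make the conceptual equation more concrete, here is a minimal sampling-loop sketch. Everything in it is an assumption for illustration: `toy_denoiser` is a hypothetical stand-in for $D_{\theta}$ (a real decoder would be a trained diffusion transformer), the noise level `s` in [0, 1] plays the role of $t$, and the linear mixing rule between clean trajectory and noise is my choice, not the schedule used in the paper.

```python
import random
from typing import Callable, Dict, List

Trajectory = List[List[float]]  # T steps x action_dim


def toy_denoiser(noisy: Trajectory, context: Dict, s: float) -> Trajectory:
    """Hypothetical stand-in for D_theta(A_t, c, t): returns a clean-trajectory guess.

    It ignores the noisy input and draws a straight line toward context["goal"].
    """
    horizon = len(noisy)
    goal = context["goal"]
    return [[g * (k + 1) / horizon for g in goal] for k in range(horizon)]


def sample_actions(
    denoiser: Callable[[Trajectory, Dict, float], Trajectory],
    context: Dict,
    horizon: int,
    action_dim: int,
    num_steps: int = 10,
    seed: int = 0,
) -> Trajectory:
    rng = random.Random(seed)
    # Start from pure Gaussian noise, i.e. noise level s = 1.
    a = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    for i in range(num_steps, 0, -1):
        s, s_next = i / num_steps, (i - 1) / num_steps
        a0_hat = denoiser(a, context, s)
        # Assumed model: A_s = (1 - s) * A_0 + s * eps.
        # Recover eps from the current sample, then re-mix at the lower level s_next.
        a = [
            [(1 - s_next) * x0 + s_next * (x - (1 - s) * x0) / s
             for x, x0 in zip(row, row0)]
            for row, row0 in zip(a, a0_hat)
        ]
    return a


trajectory = sample_actions(toy_denoiser, {"goal": [1.0, 2.0]}, horizon=4, action_dim=2)
```

Because the toy denoiser always returns the same clean guess, the loop ends exactly on that guess. With a learned decoder the intermediate steps are where the context $c$ would shape the trajectory.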

3) Embodiment-aware prompt conditioning

The most practical idea in this paper is embodiment-aware prompt conditioning.

The robot-specific textual description tells the model the following.

Which embodiment the current robot is
Which sensor views it has
Which convention the actions are expressed in
Whether the task is manipulation, navigation, or trajectory prediction
Whether the control target is the end-effector, base movement, or some other trajectory variable

So the prompt carries more than the task instruction. It also describes the robot's body and action interface.

The idea is simple but powerful. A separate adapter per robot scales poorly, and encoding every robot difference as numeric tokens alone can become hard to interpret. A textual embodiment description, by contrast, uses the language channel that a VLM already handles well.

This is the most reusable design in Qwen-VLA. In real robot systems, the metadata that describes an action format is often much easier to maintain than the action format itself.
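As a sketch of what such a description could look like in practice, the snippet below renders the five pieces of information listed above into a single prompt string. The `EmbodimentSpec` fields, the sentence template, and the example values are all hypothetical. The paper's actual prompt format is not something I can confirm from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EmbodimentSpec:
    """Hypothetical schema for a robot-specific textual description."""
    name: str
    task_family: str        # "manipulation", "navigation", or "trajectory prediction"
    camera_views: List[str]
    control_target: str     # e.g. "end-effector" or "mobile base"
    action_dim: int
    action_convention: str  # free-text description of what each action means
    control_hz: float


def build_embodiment_prompt(spec: EmbodimentSpec, instruction: str) -> str:
    """Render body and action-interface metadata ahead of the task instruction."""
    views = ", ".join(spec.camera_views)
    return (
        f"Robot: {spec.name}. Task family: {spec.task_family}. "
        f"Camera views: {views}. "
        f"Control target: {spec.control_target}. "
        f"Action: {spec.action_dim}-dim, {spec.action_convention}, "
        f"at {spec.control_hz:g} Hz. "
        f"Instruction: {instruction}"
    )


# Illustrative values only, not taken from the paper.
arm = EmbodimentSpec(
    name="single-arm tabletop robot",
    task_family="manipulation",
    camera_views=["front", "wrist"],
    control_target="end-effector",
    action_dim=7,
    action_convention="delta pose plus gripper open/close",
    control_hz=10.0,
)
prompt = build_embodiment_prompt(arm, "pick up the red cup")
```

Keeping the spec as structured data and rendering it to text at the last moment means a new robot needs a new `EmbodimentSpec` instance rather than a new model head, which is the maintenance argument made above.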

4) Unified action-and-trajectory prediction

Qwen-VLA does not fully separate manipulation, navigation, and trajectory prediction into distinct objectives. The paper describes casting them into a unified action-and-trajectory prediction framework.

From this viewpoint, the three tasks below take the same form.

Task family	Input	Output
Manipulation	camera observation, instruction, robot description	continuous arm and gripper action
Navigation	visual observation, navigation instruction, platform description	movement trajectory or navigation action
Trajectory prediction	visual context, scene description, task condition	future trajectory

This unification matters because it keeps heterogeneous data usable in VLA pretraining. Robot trajectories, human egocentric video, synthetic simulation, and VLN data differ in format, but they share one thing: each predicts future behavior under some visual-language condition.
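One way to picture the shared form is a single record type that every task family is folded into. The sketch below is my own illustration of the table's Input and Output columns. `UnifiedSample`, `make_sample`, the prompt layout, and the example records are hypothetical and not the paper's data format.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class UnifiedSample:
    task_family: str           # "manipulation", "navigation", or "trajectory_prediction"
    image_paths: List[str]     # visual observation or context
    prompt: str                # description of robot/platform/scene plus instruction
    target: List[List[float]]  # T x D future behavior (actions or waypoints)


def make_sample(
    task_family: str,
    image_paths: Sequence[str],
    description: str,
    instruction: str,
    target: Sequence[Sequence[float]],
) -> UnifiedSample:
    """Fold one record from any task family into the shared (condition, target) form."""
    if not target or any(len(step) != len(target[0]) for step in target):
        raise ValueError("target must be a non-empty T x D array with a fixed D")
    prompt = f"[{task_family}] {description} Instruction: {instruction}"
    return UnifiedSample(task_family, list(image_paths), prompt, [list(s) for s in target])


# Three made-up records, one per task family.
samples = [
    make_sample("manipulation", ["wrist.png"],
                "Single arm, 7-dim end-effector delta plus gripper.",
                "pick up the cup", [[0.0] * 7, [0.1] * 7]),
    make_sample("navigation", ["front.png"],
                "Mobile base, 2-dim planar waypoints.",
                "go to the kitchen", [[0.5, 0.0], [1.0, 0.2]]),
    make_sample("trajectory_prediction", ["scene.png"],
                "Street scene, 2-dim future path.",
                "predict the next path", [[1.0, 1.0], [2.0, 2.1], [3.0, 3.3]]),
]
```

The target dimension D differs per record, which is exactly why the prompt has to carry the action semantics rather than the array shape alone.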

5) Qwen-VLA-Instruct

The arXiv abstract reports Qwen-VLA-Instruct results separately. As the name suggests, the instruct variant appears to target task instruction following and deployment-facing behavior more directly.

This review draft, however, does not assert the exact training split, instruction tuning recipe, or release status of the Qwen-VLA base model and Qwen-VLA-Instruct. The paper's tables and the official release notes need to be rechecked. Only results confirmed in the abstract are used here.

4. Training / Data / Recipe

4-1. Data

The data recipe is central to this paper. Going by the abstract, the data sources are as follows.

robotics manipulation trajectories
human egocentric demonstrations
synthetic simulation data
vision-and-language navigation data
trajectory-centric supervision
auxiliary vision-language data

This composition is quite deliberate. Manipulation trajectories alone make it hard to acquire navigation and broad spatial reasoning, and VLN data alone makes it hard to acquire continuous robot action. Human egocentric demonstrations can broaden embodied visual experience, but they are not the same thing as robot actions. Synthetic simulation widens coverage but leaves a real-world domain gap.

So Qwen-VLA's data strategy is not to find one perfect dataset. It bundles different kinds of embodied data into a single joint pretraining mixture and lets the embodiment prompt and the action decoder absorb the heterogeneity.
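A joint pretraining mixture usually comes down to drawing each training example from one of several sources with fixed probabilities. The sketch below shows that mechanism with the six source types named in the abstract. The weights are invented for illustration, and the paper's real sampling ratios are not given in the abstract.

```python
import random
from collections import Counter
from typing import Dict, List

# Hypothetical sampling weights; NOT the ratios used in the paper.
MIXTURE_WEIGHTS: Dict[str, float] = {
    "robot_manipulation": 0.35,
    "human_egocentric": 0.15,
    "synthetic_simulation": 0.20,
    "vln": 0.10,
    "trajectory_supervision": 0.10,
    "auxiliary_vl": 0.10,
}


def sample_sources(weights: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Pick the data source for each of n training examples."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


def mixture_counts(weights: Dict[str, float], n: int, seed: int = 0) -> Counter:
    """Count how often each source is drawn, to sanity-check the realized mix."""
    return Counter(sample_sources(weights, n, seed))


counts = mixture_counts(MIXTURE_WEIGHTS, n=10_000)
```

Comparing `counts` against downstream per-task metrics is the cheapest way I know to see whether one source is crowding out another.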

4-2. Training strategy

The paper emphasizes a large-scale joint pretraining recipe. For review purposes, the core can be understood in the following order.

The VLM backbone builds the visual-language understanding and reasoning context.
Heterogeneous embodied data is aligned to the action-and-trajectory prediction format.
Robot-specific textual descriptions condition the model on embodiment and control convention.
The DiT-based action decoder is trained to generate continuous actions or trajectories.
The instruction-following variant is aligned toward downstream task execution.

Simplified into a single objective, this looks as follows.

\[\mathcal{L} = \mathcal{L}_{action} + \lambda \mathcal{L}_{aux}\]

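Here $\mathcal{L}_{action}$ corresponds to action or trajectory prediction, $\mathcal{L}_{aux}$ is a conceptual stand-in for auxiliary vision-language supervision, and $\lambda$ weights the auxiliary term. The exact loss decomposition and weights used in the paper need to be rechecked against the full paper's tables.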
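The arithmetic of the combined objective can be sketched in a few lines. Mean squared error for the action term and cross-entropy for the auxiliary term are my stand-ins, chosen only because they are the simplest common choices. The paper's actual loss terms and the value of $\lambda$ are not confirmed here.

```python
import math
from typing import List, Sequence


def mse(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Stand-in for L_action: mean squared error over a T x D trajectory."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def cross_entropy(probs: List[float], label: int) -> float:
    """Stand-in for L_aux: negative log-likelihood of the correct token."""
    return -math.log(probs[label])


def total_loss(action_pred, action_target, aux_probs, aux_label, lam: float = 0.1) -> float:
    """L = L_action + lambda * L_aux, with lam as an assumed weight."""
    l_action = mse(action_pred, action_target)
    l_aux = cross_entropy(aux_probs, aux_label)
    return l_action + lam * l_aux


loss = total_loss([[0.0, 0.0]], [[1.0, 1.0]], [0.5, 0.5], 0, lam=0.1)
```

With these inputs the action term is 1.0 and the auxiliary term is $\ln 2$, so the total is $1 + 0.1 \ln 2$. Setting `lam=0` recovers a pure action objective, which is a useful ablation to keep in mind when reading the paper's tables.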

4-3. Engineering notes

To actually reuse Qwen-VLA, the following points matter.

Embodiment prompt schema
- The template used to write the robot description can have a large effect on performance.
- It should describe the action dimension and control convention, not just give the robot name.
Action normalization and decoding convention
- How the action scales and coordinate frames of different robots were aligned matters.
- The abstract alone does not settle this, so the implementation details in the paper need to be rechecked.
Data mixture balance
- Too much manipulation data can weaken navigation generalization, and the reverse can happen too.
- In joint pretraining, the per-data-type sampling ratio is likely to be critical.
Inference latency
- In robotics deployment, control frequency and latency matter as much as benchmark success.
- How fast the DiT-based action decoder runs inside a real robot loop has to be examined separately.
Safety and recovery
- A VLA that generates continuous actions needs a recovery policy and safety constraints for when it fails.
- Even with strong paper results, real robot deployment needs separate guardrails.
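Two of the notes above, action normalization and safety guardrails, are easy to prototype independently of the model. The sketch below uses per-dimension min-max scaling to [-1, 1] and a per-step change limit. Both are generic techniques I am assuming for illustration, not the normalization or safety scheme described in the paper.

```python
from typing import List, Sequence, Tuple


def fit_bounds(actions: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension min and max over a dataset of actions."""
    dims = list(zip(*actions))
    return [min(d) for d in dims], [max(d) for d in dims]


def normalize(action: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> List[float]:
    """Map each dimension to [-1, 1]; constant dimensions map to 0."""
    return [0.0 if h == l else 2.0 * (a - l) / (h - l) - 1.0
            for a, l, h in zip(action, lo, hi)]


def denormalize(norm: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> List[float]:
    """Inverse of normalize, back to the robot's own units."""
    return [l + (n + 1.0) * 0.5 * (h - l) for n, l, h in zip(norm, lo, hi)]


def clamp_step(prev: Sequence[float], proposed: Sequence[float], max_delta: float) -> List[float]:
    """Guardrail: limit how far each dimension may move in one control step."""
    out = []
    for p, q in zip(prev, proposed):
        delta = max(-max_delta, min(max_delta, q - p))
        out.append(p + delta)
    return out


lo, hi = fit_bounds([[0.0, -1.0], [2.0, 1.0]])
restored = denormalize(normalize([1.0, 0.0], lo, hi), lo, hi)
safe = clamp_step([0.0, 0.0], [1.0, -0.01], max_delta=0.05)
```

A clamp like this does not replace a recovery policy, but it is a cheap last line of defense between a generative decoder and the motors.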

5. Evaluation

5-1. Main results

Going by the arXiv abstract, the headline results for Qwen-VLA-Instruct are as follows.

Benchmark	Reported metric	Result
LIBERO	success style score	97.9%
Simpler-WidowX	success style score	73.7%
RoboTwin-Easy	success style score	86.1%
RoboTwin-Hard	success style score	87.2%
R2R	OSR	69.0%
RxR	SR	59.6%
Real-world ALOHA OOD	average OOD success	76.9%
DOMINO dynamic manipulation	zero-shot success	26.6%

When reading this table, evaluation coverage matters more than the size of any single number. LIBERO and Simpler-WidowX look at manipulation, R2R and RxR look at navigation, and ALOHA OOD and DOMINO are closer to stress tests for real-world and dynamic manipulation.

The paper says it examines OOD generalization under variation in scene layout, background, lighting, object configuration, and robot embodiment. This lines up with Qwen-VLA's central claim. Rather than simply scoring high on one benchmark, the model aims to show that a shared VLA stack holds up when the task, environment, and embodiment axes all change together.

5-2. What really matters in the experiments

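There are three points that really deserve attention in this paper's results.

1) Manipulation and navigation are evaluated together

VLA papers often stop at manipulation benchmarks. By including navigation benchmarks as well, Qwen-VLA argues that a VLA should be an embodied decision-making model rather than just a robot arm policy.

This is a good direction. A real embodied agent has to find an object, move to it, and interpret the surrounding context before it grasps anything. If manipulation and navigation are kept separate, it is hard to talk about system-level generalization.

2) Embodiment OOD is the key

Since Qwen-VLA proposes embodiment-aware prompt conditioning, the evaluation has to be meaningful under changes in robot embodiment. Going by the abstract, the inclusion of robot embodiment variation and real-world ALOHA OOD results fits this claim well.

The baselines and splits still have to be checked here, though. The interpretation changes a lot depending on which embodiments were in training and which appeared for the first time at test.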
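The split check described above is mechanical once episode metadata is available. The sketch below assumes each episode is a dict with an `"embodiment"` key. That record layout and the example names are hypothetical, since the paper's split files are not something I have inspected.

```python
from typing import Dict, Iterable, List


def embodiment_overlap(
    train_episodes: Iterable[Dict], test_episodes: Iterable[Dict]
) -> Dict[str, List[str]]:
    """Separate test embodiments into those seen in training and those held out."""
    train = {ep["embodiment"] for ep in train_episodes}
    test = {ep["embodiment"] for ep in test_episodes}
    return {
        "seen_in_test": sorted(test & train),
        "unseen_in_test": sorted(test - train),
    }


# Made-up metadata for illustration.
train_eps = [{"embodiment": "arm_a"}, {"embodiment": "arm_b"}, {"embodiment": "base_a"}]
test_eps = [{"embodiment": "arm_b"}, {"embodiment": "bimanual_a"}]
report = embodiment_overlap(train_eps, test_eps)
```

Only results on the `unseen_in_test` group speak to generalization across embodiments. Results on `seen_in_test` measure robustness to scene changes with a familiar body.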

3) DOMINO zero-shot should still be read with caution

The 26.6% zero-shot success on DOMINO dynamic manipulation is an interesting result. At the same time, it signals that dynamic manipulation remains hard.

So this result is best read not as “dynamic manipulation is solved” but as evidence that unified VLA pretraining produces some transfer signal even in a zero-shot dynamic setting.

6. Limitations

The full paper's tables need to be checked.
- The abstract gives the headline benchmark numbers, but baselines, variance, protocol, train-test split, and evaluation episodes have to be rechecked in the full paper.
Whether the robot deployment metrics are sufficient is a separate question.
- Success rate matters, but on a real robot latency, control frequency, safety stops, recovery, and calibration robustness have to be considered too.
The embodiment prompt is powerful but may be sensitive to prompt quality.
- How structured the robot-specific textual description really needs to be, and how far it generalizes to unseen robots, needs further validation.
A unified model does not always beat a specialized policy.
- Gaining multi-task generalization may come at a cost relative to a policy optimized for a specific robot and task.
Dependence on the data mixture may be large.
- Reproducibility and transfer can vary widely depending on the ratio in which manipulation, egocentric, simulation, navigation, and auxiliary V-L data are mixed.

7. My Take

7-1. Why this matters for my work

The core point of Qwen-VLA is that the problem setting for foundation models is shifting in robotics too.

Previously, a modular pipeline was the natural choice: a VLM understands the scene, a planner builds the plan, and a low-level controller executes the action. Qwen-VLA weakens these boundaries. It ties vision-language reasoning and action generation together within one model family and hands the robot-specific part to a textual embodiment description.

This approach matters for agent research as well. Just as a tool agent has its tool schema described in the prompt, a robot agent can have its body schema and control convention described in the prompt. In that sense, the embodiment prompt can be seen as the robotics version of a tool schema.

7-2. Reuse potential

The points worth taking directly into practice and research are as follows.

Embodiment description template
- Writing up each robot platform's action semantics as a text schema is highly reusable.
Action-and-trajectory unification
- Treating manipulation, navigation, and trajectory prediction as future behavior prediction rather than fully separate tasks is also useful for designing other embodied benchmarks.
VLM backbone plus action decoder separation
- Separating the perception-reasoning backbone from the action decoder makes it possible to extend to robot action without discarding existing VLM capability.
OOD evaluation axes
- Evaluating scene layout, background, lighting, object configuration, and embodiment variation separately is close to a checklist that belongs in a VLA model card.
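The OOD checklist is straightforward to turn into a per-axis report. The sketch below assumes each evaluation episode is tagged with exactly one varied axis and a boolean outcome. The axis identifiers and record layout are my own naming and the episodes are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Axis names follow the list in the paper's OOD description; identifiers are mine.
OOD_AXES = ("scene_layout", "background", "lighting", "object_configuration", "embodiment")


def success_by_axis(episodes: Iterable[Dict]) -> Dict[str, float]:
    """Success rate per OOD axis, so one strong axis cannot hide a weak one."""
    totals = defaultdict(lambda: [0, 0])  # axis -> [successes, episodes]
    for ep in episodes:
        axis = ep["axis"]
        if axis not in OOD_AXES:
            raise ValueError(f"unknown OOD axis: {axis}")
        totals[axis][0] += int(ep["success"])
        totals[axis][1] += 1
    return {axis: s / n for axis, (s, n) in totals.items()}


# Made-up episodes for illustration.
episodes = [
    {"axis": "lighting", "success": True},
    {"axis": "lighting", "success": False},
    {"axis": "embodiment", "success": True},
    {"axis": "background", "success": True},
]
rates = success_by_axis(episodes)
```

Reporting `rates` next to the averaged OOD number makes it visible when, for example, lighting robustness is carrying a weak embodiment result.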

Some things are harder to adopt as is. The paper presupposes the Qwen ecosystem and a large-scale heterogeneous data mixture. For a small team, it is more realistic to start with the embodiment prompt schema, task unification, and OOD evaluation protocol than to reproduce the whole recipe.

7-3. Follow-up papers

OpenVLA
- A good companion read as a manipulation-focused baseline in the open VLA line.
Octo
- Good for comparison with the direction of training robot policies across multiple embodiments and tasks.
RT-2
- Needed to understand the early major line of work on extending VLMs to action generation.
PaLM-E
- Gives the big picture of integrating language, vision, and robot state in an embodied multimodal model.
Qwen2.5-VL Technical Report
- Helps in understanding which VLM capabilities Qwen-VLA is built on.

8. Summary

Qwen-VLA is an embodied foundation model that tries to bring manipulation, navigation, and trajectory prediction under a single VLA framework.
The core design combines the Qwen VLM stack, a DiT-based action decoder, and embodiment-aware prompt conditioning.
Conditioning on robot embodiment and control convention through the text prompt is the most practical idea.
The evaluation is broad, covering LIBERO, Simpler-WidowX, RoboTwin, R2R, RxR, ALOHA OOD, and DOMINO.
However, the full paper's baselines, protocol, data mixture, action normalization, and inference latency have to be checked before judging real reproducibility and deployability.
