Qwen3.5-Omni Technical Report Review

2026-06-01 10 min read

0. Introduction

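Reading Qwen3.5-Omni as merely “a multimodal model that takes audio and video together” misses the point. What makes this report interesting is that it takes a product perspective on how to handle long audio-video context, streaming speech generation, and tool use within one system while keeping the Thinker-Talker split.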

In recent omni-model papers, “what the model understands better” and “how it responds in real time” are often lumped together. Qwen3.5-Omni separates the two fairly clearly. The Thinker handles multimodal understanding and reasoning, and the Talker handles low-latency speech generation. The interface between them is then redesigned around ARIA, multi-codebook speech tokens, chunk-wise streaming input, and Hybrid MoE.

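What is especially interesting is that the paper does not stop at being a performance report. It tries to extend audio, audio-visual, and speech generation while keeping text and vision performance close to its text-only counterpart. In other words, it directly addresses the common suspicion that “going omni makes everything weaker.”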

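One-line summary: Qwen3.5-Omni is an omni agent technical report that combines Hybrid MoE, the AuT audio encoder, ARIA-based text-speech alignment, and multi-codebook streaming speech generation on top of the Thinker-Talker architecture, targeting long audio-video understanding and real-time voice interaction at the same time.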

This paper is worth reading now for the following reasons.

- It treats the omni model as a real-time system design problem, not as “input modality expansion.”
- It lets you compare text, vision, audio, audio-visual, speech generation, and tool use within a single report.
- It extends to multilingual speech generation including Korean, cross-lingual voice cloning, and audio-video understanding, so the implications for production services are significant.

The core message of the paper is simple. The essence of Qwen3.5-Omni is not that it “accepts every modality,” but that it re-splits the reasoning path and the speech path in order to build an omni agent interface capable of real-time interaction.

1. Problem Setting

1-1. Problem definition

The problem this paper targets is how to build an end-to-end omni model that understands text, image, audio, and video together, responds in text or speech, and calls tools when needed.
The difficulty is not simply the number of modalities. The model has to accept long audio-video sequences, generate speech with low latency, and avoid losing much text-only or vision-only performance.
Speech generation in particular differs from ordinary text generation. Text tokens and speech tokens arrive at different rates, responses must be streamed, and turn-taking, emotion, and voice control all have to be considered.
In the end, the problem setting is closer to a real-time omni system that ties together understanding, reasoning, generation, and action than to a “multimodal understanding model.”

1-2. Why previous approaches are insufficient

Existing omni models are often strong at perception but weak at real-time interaction or agentic behavior.
Keeping text and speech on separate tracks is intuitive, but in actual streaming generation the tokenization rate mismatch tends to cause skipped words, number pronunciation errors, and synchronization overhead.
The longer the audio-video input, the larger the KV-cache I/O and latency problems become, so simply scaling a transformer does not get you to product-grade serving.
A multimodal model can lose text-only and vision-only performance in exchange for accepting audio or video. That makes “how much existing performance is lost by going omni” an important question as well.
So the limitation of prior approaches is not that one module was lacking, but that long-context multimodal understanding, streaming speech generation, and agentic action were never organized into a single interface.

2. Core Idea

2-1. Main contribution

The core contribution of Qwen3.5-Omni is that it keeps the Thinker-Talker frame of Qwen2.5-Omni and Qwen3-Omni while strengthening length scaling, speech stability, multilingual speech, audio-visual grounding, and tool use at the same time.
First, it applies Hybrid Attention Mixture-of-Experts to both the Thinker and the Talker. In particular, it adds Gated Delta Net to reduce the KV-cache I/O burden on long audio-video sequences.
Second, it uses AuT as the audio encoder. This encoder is trained on 40 million hours of audio-text pairs and produces general-purpose audio representations at a 6.25 Hz token rate.
Third, on the speech generation side, it uses an RVQ-based multi-codebook speech representation and MTP to support streaming speech generation at single-frame granularity.
Fourth, the Talker input structure is redesigned around ARIA rather than the earlier dual-track layout. The key idea is to interleave text tokens and speech tokens at an adaptive rate to improve the stability and naturalness of streaming speech.
Fifth, the capability layer is also extended. It covers controllable audio-visual captioning, real-time interaction, voice cloning, WebSearch, complex FunctionCall, and even “Audio-Visual Vibe Coding.”

2-2. Design intuition

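The design intuition of the paper is quite clear: leave understanding and reasoning to the Thinker, leave low-latency speech generation to the Talker, and make the interface between them precise.
In other words, it holds that separating the reasoning path from the speech path is better for real-time interaction than having one giant decoder process every modality at once.
ARIA serves the same purpose. The core problem in speech generation is not “a better speech model” but the system instability that the mismatch between text token rate and speech token rate creates in streaming interaction.
AuT is likewise more than an encoder swap. It compresses audio into 6.25 Hz tokens so that the Thinker can handle long audio, and it inserts explicit timestamps to strengthen audio-video alignment.
The design philosophy is less “merge every modality into one” and more “reflect in the architecture the fact that modalities do not move at the same speed or the same granularity.”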

3. Architecture / Method

3-1. Overview

Item	Description
Goal	Build an omni model that supports long audio-video understanding and low-latency speech interaction at the same time
Backbone	Thinker-Talker architecture + Hybrid MoE
Audio path	AuT audio encoder + timestamp-aware multimodal alignment
Speech path	RVQ multi-codebook speech tokens + MTP + Code2Wav + ARIA
Difference from prior work	Addresses text-speech interleaving and streaming stability more directly than dual-track speech generation

3-2. Module breakdown

1) Thinker

The Thinker handles text generation and multimodal reasoning.
The vision encoder is SigLIP2, and audio is converted into a token sequence by AuT.
Text, audio, image, and silent video inputs are merged into a single unified representation sequence.
To handle long audio-video input, it uses explicit timestamps and contiguous position numbering. The goal is to reduce positional conflicts as modalities are added and to preserve long-range temporal modeling.
In terms of length, the report states support for up to 256k context, more than 10 hours of audio, and 400 seconds of 720P video at 1 FPS.
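To make the idea of timestamp-based ordering with contiguous position numbering concrete, here is a minimal sketch. It is not the report's actual positional scheme: the `Token` class, the `merge_by_timestamp` helper, and the one-token-per-frame simplification are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    modality: str   # "audio" or "video" (illustrative labels)
    t_start: float  # start time in seconds
    payload: str    # placeholder standing in for an embedding


def merge_by_timestamp(audio, video):
    """Order tokens by start time, then assign contiguous position ids."""
    merged = sorted(audio + video, key=lambda tok: (tok.t_start, tok.modality))
    return list(enumerate(merged))


# Assumption: one audio token every 0.16 s (6.25 Hz) and one token per video
# frame at 1 FPS; a real vision encoder emits many tokens per frame.
audio = [Token("audio", i * 0.16, f"a{i}") for i in range(4)]
video = [Token("video", float(i), f"v{i}") for i in range(2)]

sequence = merge_by_timestamp(audio, video)
order = [tok.payload for _, tok in sequence]
positions = [pos for pos, _ in sequence]
```

In this toy sequence the position ids have no gaps and no duplicates across modalities, which is the property that contiguous numbering is meant to guarantee.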

2) AuT audio encoder

AuT is a transformer-based audio encoder.
It is trained on 40 million hours of audio-text data, downsamples with four Conv2D blocks, and then produces audio tokens with self-attention layers.
The output token rate is 6.25 Hz. This design is what makes long audio manageable for the Thinker.
The share of multilingual data is also greatly increased, and dynamic attention window size training is introduced to balance real-time prefill caching against offline audio understanding performance.
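A quick back-of-the-envelope calculation shows why the 6.25 Hz rate matters for the 10-hour audio claim. The sketch below is plain arithmetic on the numbers quoted in this review. It assumes “256k” is read as 256,000 tokens (the report may mean 262,144) and ignores text, video, and special tokens that would share the same context.

```python
AUDIO_TOKEN_RATE_HZ = 6.25   # AuT output rate quoted above
CONTEXT_LIMIT = 256_000      # assumption: "256k" read as 256,000 tokens


def audio_tokens(hours):
    """Number of audio tokens produced for the given duration."""
    return int(hours * 3600 * AUDIO_TOKEN_RATE_HZ)


seconds_per_token = 1 / AUDIO_TOKEN_RATE_HZ      # 0.16 s of audio per token
tokens_10h = audio_tokens(10)                    # 225000
context_share = tokens_10h / CONTEXT_LIMIT       # about 0.88
```

Ten hours of audio comes to 225,000 tokens, which fits inside the assumed context limit with limited headroom. At a higher token rate, the same audio would not fit.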

3) Talker

The Talker takes the multimodal input and the Thinker's text output and generates speech.
The speech representation uses RVQ-based multi-codebook tokens, and the MTP module predicts the residual codebooks at each step.
The generated tokens are converted to a waveform by Code2Wav, a causal ConvNet codec decoder.
An interesting detail is that the Talker has its own dedicated system prompt. This is meant to give finer control over zero-shot voice cloning and controllable speech generation.
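To show why one speech frame carries several codebook indices, here is a toy residual vector quantization (RVQ) sketch. It is a generic illustration of the technique, not the codec used in the report: it quantizes a single scalar instead of a vector, and the three codebooks are made-up values.

```python
def nearest(codebook, value):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def rvq_encode(value, codebooks):
    """Quantize in stages; each stage encodes the residual left by the previous one."""
    codes, residual = [], value
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual -= codebook[idx]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every codebook."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))


# Hypothetical codebooks, coarse to fine.
codebooks = [
    [-1.0, 0.0, 1.0],
    [-0.25, 0.0, 0.25],
    [-0.0625, 0.0, 0.0625],
]

codes = rvq_encode(0.8, codebooks)     # one index per codebook
approx = rvq_decode(codes, codebooks)  # reconstruction from all stages
```

The first index gives a coarse value and each later index refines it, so a model that emits the first codebook autoregressively and predicts the remaining residual indices in the same step can produce a full frame per step.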

4) ARIA

ARIA is one of the most important designs in this paper.
Instead of the earlier dual-channel generation, it treats text and speech tokens as a single interleaved stream.
The main goal is to reduce the synchronization overhead caused by the tokenization efficiency mismatch and to mitigate problems such as skipped words, incorrect pronunciation, and number rendering ambiguity.
In short, ARIA is less a quality improvement to the speech model than a redesign of the interface for decoding text and speech together in streaming interaction.
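The sketch below illustrates only the general idea of a single interleaved stream with a variable number of speech tokens per text token. It is not ARIA itself, whose alignment rule is not described in this review: the `interleave` helper and the per-token frame counts are hypothetical.

```python
def interleave(text_tokens, speech_frames_per_token):
    """Build one stream: each text token is followed by its own number of speech slots."""
    if len(text_tokens) != len(speech_frames_per_token):
        raise ValueError("need one frame count per text token")
    stream = []
    for tok, n_frames in zip(text_tokens, speech_frames_per_token):
        stream.append(("text", tok))
        stream.extend(("speech", f"{tok}#{k}") for k in range(n_frames))
    return stream


# Made-up frame counts: a number such as "1250" takes far longer to say
# than a short word, so it gets more speech slots.
stream = interleave(["It", "costs", "1250"], [1, 2, 5])
speech_slots = [label for kind, label in stream if kind == "speech"]
```

A fixed text-to-speech ratio would give “1250” the same budget as “It”, which is one way to picture how skipped words and number errors can arise when the rate cannot adapt.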

5) Long-context serving path

Qwen3.5-Omni keeps chunked prefilling and applies Hybrid MoE to both the Thinker and the Talker.
On top of that, it adds Gated Delta Net to reduce KV-cache I/O overhead on long audio-video sequences.
Per Table 1, first-packet latency is 235 ms (Flash) and 435 ms (Plus) for audio input, and 426 ms (Flash) and 651 ms (Plus) for video input.
So the paper designs for “when the first response arrives” at the architecture level, not only for benchmark scores.

4. Training / Data / Recipe

4-1. Data

Qwen3.5-Omni performs native omnimodal pretraining on text-vision pairs together with more than 100 million hours of audio-visual data.
Speech input is extended to 113 languages and dialects, and speech output to 36 languages and dialects.
The AuT encoder itself is trained on 40 million hours of audio-text pairs generated with Qwen3-ASR.
The training data includes image-text, video-text, audio-text, video-audio, video-audio-text, and pure text.
The important point is that unimodal and cross-modal data are mixed from the early pretraining stage so that cross-modality alignment forms early.

4-2. Training strategy

This report does not disclose the full recipe, so details such as the optimizer, step budget, and batch size need to be checked against the original paper.
The big picture is clear, though.
- Both the Thinker and the Talker are scaled with Hybrid MoE
- AuT is trained from scratch
- The text tokenizer is expanded to a 250k vocabulary
- Audio and video are aligned by timestamp
- The Talker is designed around RVQ multi-codebook speech tokens and ARIA
In other words, the paper is better read as a combination of pretraining data scale + multimodal alignment + a streaming generation interface than as “one new loss.”

4-3. Engineering notes

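Paper capability and product serving limits need to be read separately.

The paper talks about more than 10 hours of audio and 400 seconds of 720P video, but the input limits and serving policy in the actual product documentation may be more conservative.
So the capability claims in the paper should not be read as the same numbers as the current API product limits.

First-packet latency needs to be read alongside the scores.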

Setting	Qwen3.5-Omni-Flash	Qwen3.5-Omni-Plus
Audio input first-packet latency	235 ms	435 ms
Video input first-packet latency	426 ms	651 ms

For an omni model, high scores alone are not enough. When the first response actually arrives matters.
This table shows clearly that the Qwen team treats product latency as a primary goal.
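For teams that want to track the same metric in their own stack, first-packet latency can be measured as the time until a streaming response yields its first chunk. The sketch below assumes the response is exposed as a Python iterator. `fake_speech_stream` is a hypothetical stand-in, not a real Qwen API.

```python
import time


def first_packet_latency(stream):
    """Return (seconds until the first chunk arrives, that chunk)."""
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first


def fake_speech_stream(startup_s, n_chunks):
    """Hypothetical stand-in for a streaming speech endpoint."""
    time.sleep(startup_s)  # simulates prefill plus the first decoding step
    for i in range(n_chunks):
        yield f"chunk-{i}"


latency_s, chunk = first_packet_latency(fake_speech_stream(0.05, 3))
print(f"first packet after {latency_s * 1000:.0f} ms: {chunk}")
```

A number measured this way includes network and client overhead, so it is not directly comparable to the figures in the table unless the measurement conditions match.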

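The comparison with the text-only counterpart matters.

The paper does not emphasize only the audio and speech numbers.
Instead, it keeps showing how well text and vision hold up against Qwen3.5-Plus-Instruct.
I think this matters. Saying an omni model has become stronger does not mean one modality has become stronger. It is tied to how small the multi-task compromise is.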

5. Evaluation

5-1. Main results

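The most important message of the paper is not simply that “audio is strong.” It is that the model extends to audio, audio-visual, and speech generation without giving up much on text and vision.

The table below collects only the representative numbers I consider important.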

Slice	Baseline	Qwen3.5-Omni-Plus	Read
MMLU-Pro	Qwen3.5-Plus-Instruct 86.8	85.9	nearly on par with text-only
IFEval	Qwen3.5-Plus-Instruct 89.7	89.7	instruction following preserved
MLVU	Qwen3.5-Plus-Instruct 85.1	86.8	video understanding improved
LVBench	Qwen3.5-Plus-Instruct 68.6	71.2	stronger long-video understanding
MMAU	Gemini-3.1 Pro 81.1	82.2	ahead on audio understanding
VoiceBench	Gemini-3.1 Pro 88.9	93.1	ahead on end-to-end speech dialogue
Qualcomm IVD	Gemini-3.1 Pro 66.2	68.5	ahead on audio-query interactive scenarios
Omni-Cloze	Gemini-3.1 Pro 57.2	64.8	strong on captioning
OmniGAIA	Gemini-3.1 Pro 68.9	57.2	tool use still shows a gap

The table alone makes the reading clear.

- Text degradation is limited.
- Video understanding actually gets stronger.
- Audio and speech dialogue are clearly strong.
- Tool use, however, is still weaker than Gemini-3.1 Pro.
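The same reading can be reproduced by computing the per-benchmark gap from the table. The snippet below only re-enters the numbers listed above (baseline first, Qwen3.5-Omni-Plus second). The variable names are illustrative.

```python
# (baseline score, Qwen3.5-Omni-Plus score), copied from the table above.
results = {
    "MMLU-Pro": (86.8, 85.9),
    "IFEval": (89.7, 89.7),
    "MLVU": (85.1, 86.8),
    "LVBench": (68.6, 71.2),
    "MMAU": (81.1, 82.2),
    "VoiceBench": (88.9, 93.1),
    "Qualcomm IVD": (66.2, 68.5),
    "Omni-Cloze": (57.2, 64.8),
    "OmniGAIA": (68.9, 57.2),
}

# Positive delta means Qwen3.5-Omni-Plus is ahead of the listed baseline.
deltas = {name: round(omni - base, 1) for name, (base, omni) in results.items()}
regressions = [name for name, delta in deltas.items() if delta < 0]

for name, delta in deltas.items():
    print(f"{name:<13} {delta:+.1f}")
```

Only two rows come out negative: a small gap on MMLU-Pro and a much larger one on OmniGAIA, which is why the tool-use result deserves separate attention.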

There are interesting results on the speech generation side as well.

- On zero-shot TTS, it records a WER of 0.99 on SEED, improving on the 1.07 of Qwen3-Omni-30B-A3B.
- On multilingual speech generation, it reports the lowest WER in 22 of the 29 evaluated languages.
- On cross-lingual speech generation, it records the best performance in 10 of 12 directions.
- In particular, it reports reducing the zh-to-ko error rate from 14.4 to 4.03.
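To put the two improvements on the same scale, the relative error reduction can be computed directly from the numbers quoted above. This is arithmetic on the reported figures, not an additional result from the report.

```latex
\[
\frac{14.4 - 4.03}{14.4} \approx 0.720
\quad \text{(zh-to-ko: about a 72\% relative reduction)}
\]
\[
\frac{1.07 - 0.99}{1.07} \approx 0.075
\quad \text{(SEED WER: about a 7.5\% relative reduction)}
\]
```

The cross-lingual gain is roughly an order of magnitude larger in relative terms than the zero-shot TTS gain.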

5-2. What really matters in the experiments

1) The key point of the paper is that it is omni and still does not lose much text or vision

Many omni models show a trade-off in which text-only reasoning or visual QA weakens in exchange for adding audio or speech.
Qwen3.5-Omni-Plus, however, stays close to its text-only counterpart, for example MMLU-Pro 85.9 vs 86.8 and IFEval 89.7 vs 89.7.
This matters more than a simple performance claim, because it shows how little the omni extension damaged the existing foundation capability.

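2) Audio and speech should be read as interaction, not just ASR

The paper does not show only audio benchmarks.
It looks at VoiceBench, URO-Bench, multilingual TTS, and cross-lingual voice cloning together.
So the central goal is how stably the model can sustain real-time voice interaction, rather than “listening to audio and answering well in text.”
This is exactly where ARIA and the multi-codebook Talker become meaningful.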

3) Video understanding matters more than expected

Qwen3.5-Omni-Plus is stronger than Qwen3.5-Plus-Instruct on video tasks such as VideoMME, MLVU, LVBench, and MME-VideoOCR.
This suggests that audio-video joint training may actually have strengthened dynamic visual perception.
It is more accurate to read this paper as a video model with audio attached than as a “speech model.”

4) Tool use should still be read with caution

On OmniGAIA, Qwen3.5-Omni-Plus scores 57.2 and Gemini-3.1 Pro scores 68.9.
Moreover, this result comes from judge-based evaluation.
So the “native omnimodal agentic behavior” the paper describes is interesting, but it is too early to call it frontier on production-grade agent benchmarks.
The current strength is understanding + interaction + speech interface, and the agent/tool side still needs more evidence.

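5) From a Korean-language perspective, the cross-lingual speech results matter more

What Korean users will notice directly is ko-target speech generation and cross-lingual cloning, rather than English-centric metrics such as VoiceBench.
The drop from 14.4 to 4.03 on zh-to-ko is a fairly strong signal.
So if you read this paper from a Korean service perspective, it is better to focus on multilingual speech transfer quality than on the general “omni agent” claim.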

6. Limitations

This report changes too many variables at once.

Hybrid MoE, AuT, tokenizer expansion, ARIA, multilingual data expansion, and interaction-aligned RL are all introduced together.
As a result, component-level ablations showing what contributed how much are relatively thin.

Tool-use and agent performance should still be read conservatively.

On OmniGAIA it scores below Gemini-3.1 Pro.
Judge-based evaluation is also involved, so the “agentic” claim should not be overgeneralized.

Performance is not fully identical to the text-only counterpart.

It holds up well overall, but on some items such as MMLU-Pro, GPQA, and LiveCodeBench it is slightly below Qwen3.5-Plus-Instruct.
So going omni is not free.

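Paper claims and product limits need to be separated.

The capabilities in the paper may differ from the actual API serving limits, preview policy, and input quota.
For practical adoption, check the current service conditions separately rather than relying on the maximum capability stated in the technical report.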

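It is hard to say it achieves absolute SOTA in speech generation on every axis.

For example, on individual zero-shot TTS benchmarks there are items where a strong commercial or dedicated TTS baseline does better.
The real strength of Qwen3.5-Omni is not a single TTS score but the way speech is attached naturally inside omni interaction.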

Public access is currently API-centric.

That constrains teams that want to do open-weight reproduction, low-level serving reimplementation, or on-prem deployment right away.
This paper is closer to a frontier product report than to an open recipe.

7. My Take

7-1. Why this matters for my work

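The reason I consider this paper important is that it treats the omni model as an interface design problem rather than a collection of benchmarks.
What matters in practice is less “what can it accept as input” and more “with what latency, connecting which modalities in what way, and producing what actions.”
Qwen3.5-Omni answers exactly that question. It splits the Thinker and the Talker and treats speech generation as an independent serving path rather than an appendix to text generation.
It also has clear implications for teams thinking about Korean voice interaction or audio-first assistants. The cross-lingual speech and ko-target results in particular are fairly direct hints.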

7-2. Reuse potential

The design that separates the reasoning path from the generation path has high reuse value. Not every multimodal model has to be a monolithic decoder.
Treating token rate mismatch as an architecture problem is also important. In systems that handle text and speech together, redesigning the decode interface can matter more than alignment heuristics.
Timestamp-aware multimodal alignment is an idea that will keep being reused in long audio-video understanding systems.
Evaluation centered on first-packet latency is also practically useful, because for a multimodal assistant the latency users feel matters more than offline benchmarks.
On the other hand, some parts are hard to adopt as-is.
- The 40-million-hour scale of AuT
- Audio-visual pretraining on more than 100 million hours
- API-only availability
- Thinker-Talker full-stack serving infrastructure

7-3. Follow-up papers

Qwen2.5-Omni
Qwen3-Omni
Qwen3 Technical Report
MiniMax-Speech
CosyVoice3

8. Summary

Qwen3.5-Omni is an omni technical report that combines Hybrid MoE, AuT, ARIA, and multi-codebook speech generation on top of the Thinker-Talker architecture.
The core is the design of an omni interface capable of real-time interaction, rather than the addition of modalities.
It strengthens audio, speech dialogue, and audio-visual captioning while keeping text and vision performance close to its text-only counterpart.
Multilingual speech and cross-lingual voice cloning in particular are important results from a Korean service perspective.
Tool use, full agent performance, component-level attribution, and the interpretation of product limits should nonetheless be read conservatively.
