Qwen3-TTS Technical Report Review

2026-05-31 8 분 소요

0. Introduction

한 줄 요약: Qwen3-TTS는 tokenizer level에서 quality-latency trade-off를 명시적으로 드러낸 뒤, dual-track autoregressive architecture와 staged post-training을 통해 text, speaker control, speech generation을 다시 하나로 엮는 LM-native TTS system이다.

“Qwen3-TTS”는 또 하나의 open TTS release 이상으로 읽을 가치가 있다. 이 논문이 유용한 이유는 시스템 trade-off를 하나의 “headline metric” 뒤에 숨기지 않기 때문이다. 저자들은 speech representation 자체를 두 개의 “operating point”로 나눈다. richer reconstruction을 위해 더 expressive한 semantic-acoustic 정보를 유지하는 “25Hz” path와, real-time streaming이 가능할 정도로 latency를 강하게 낮추는 “12Hz” path다.

이 때문에 이 report는 content faithfulness, controllability, deployment latency의 균형을 맞춰야 하는 voice agent, omni model, speech product를 만드는 사람에게 흥미롭다.

이 논문을 지금 읽을 가치가 있는 이유는 다음과 같다.

TTS를 고립된 acoustic stack이 아니라 “LM-native audio system”과 같은 설계 공간의 일부로 다룬다.
representation choice를 직접 드러내기 때문에 25Hz vs 12Hz trade-off가 숨겨지지 않고 이해 가능하다.
tokenizer design, LM architecture, post-training, streaming deployment를 한 곳에서 다루기 때문에 system design 관점에서 유용하다.

1. Problem Setting

1-1. Problem definition

이 논문은 하나의 model family 안에서 multilingual, controllable, robust, streaming text-to-speech를 겨냥한다.
원하는 system은 3-second voice cloning, description-based voice design, instruction-following style control, real-time streaming을 모두 지원해야 한다.
또한 model은 LM-style generation stack과 자연스럽게 결합되어야 하므로, speech representation이 autoregressive modeling에 맞게 tractable해야 한다.

1-2. Why previous approaches are insufficient

순수 semantic tokenizer는 language modeling에는 효율적이지만 expressive한 acoustic detail을 잃을 수 있다.
순수 acoustic tokenizer는 detail은 잘 보존하지만 LM 안에 너무 많은 low-level signal을 주입해 long-horizon modeling을 더 어렵게 만든다.
chunked diffusion이나 look-ahead decoding에 의존하는 streaming system은 first-packet latency penalty를 자주 치른다.
많은 TTS system이 cloning이나 naturalness에는 강하지만, fine-grained instruction control이나 multilingual transfer에는 상대적으로 약하다.
실제 deployment team이 필요한 것은 하나의 최고 benchmark score가 아니라, quality, latency, control 사이에서 명확한 operating point다.

2. Core Idea

2-1. Main contribution

Qwen3-TTS는 하나의 representation으로 모든 deployment target을 해결하려 하지 않고, 서로 다른 operating point를 갖는 두 speech tokenizer를 제안한다.
learnable speaker encoder를 함께 둔 “dual-track LM”을 사용해 text token과 speech token을 하나의 autoregressive loop 안에 유지한다.
staged pre-training과 post-training을 적용해 먼저 기본 text-to-speech mapping을 학습하고, 그 다음 quality, long-context handling, preference alignment, controllability, speaker adaptation을 순차적으로 다룬다.
모든 data를 ChatML로 포맷하고 speech control을 instruction-following problem으로 다루기 때문에 cloning, voice design, style editing을 하나의 framework 안에서 통합한다.

2-2. Design intuition

핵심 설계 직관은 speech representation이 진짜 bottleneck이라는 점이다. representation이 너무 semantic하면 speech는 안정적이지만 bland해지고, 너무 acoustic하면 LM이 local detail에 과부하된다.
25Hz tokenizer는 semantic tractability를 유지하면서도 strong waveform reconstruction에 필요한 acoustic information을 충분히 넣으려 한다.
12Hz tokenizer는 semantic stream과 acoustic stream을 여러 codebook으로 분리해 low bitrate, low latency, good content consistency를 동시에 노린다.
그 다음 LM 쪽은 control interface가 된다. text, speaker state, acoustic generation이 더 이상 따로 떨어진 module이 아니다.
이 논문의 진짜 기여는 한 model checkpoint가 아니라, 두 개의 명확한 “deployment regime”을 드러내고 나머지 system이 그 위에 적응하도록 설계한 데 있다.

3. Architecture / Method

3-1. Overview

Item	Description
Goal	cloning, voice design, instruction control, streaming을 지원하는 open multilingual TTS family 구축
Text backbone	standard Qwen tokenizer를 쓰는 Qwen3 LM family
Speech representation	Qwen-TTS-Tokenizer-25Hz 또는 Qwen-TTS-Tokenizer-12Hz
Identity control	backbone과 jointly trained되는 learnable speaker encoder
Streaming design	textual token과 acoustic token을 channel axis에서 concatenate하는 dual-track representation
Main trade-off	25Hz는 richer reconstruction과 long-form stability를, 12Hz는 lower latency와 strong content robustness를 더 중시

3-2. Module breakdown

1) Qwen-TTS-Tokenizer-25Hz

25Hz tokenizer는 Qwen2-Audio 위에 구축된 single-codebook tokenizer다.
Stage 1에서는 저자들이 ASR에 대해 Qwen2-Audio pretraining을 이어가고, resampling layer를 추가한 뒤, intermediate position에 vector quantization layer를 넣는다.
Stage 2에서는 convolution-based mel-spectrogram decoder를 더해 audio token이 더 많은 acoustic detail을 흡수하도록 만든다.
streaming waveform generation을 위해 25Hz path는 Flow Matching으로 학습된 chunk-wise DiT와, 수정된 BigVGAN vocoder를 사용한다.
streaming detokenizer는 3-block lookback과 1-block lookahead를 갖는 sliding-window block attention을 쓴다.

2) Qwen-TTS-Tokenizer-12Hz

12Hz tokenizer는 semantic stream과 acoustic stream을 분리한 12.5Hz multi-codebook tokenizer다.
semantic path는 WavLM teacher의 guidance를 받아 첫 번째 codebook이 semantic alignment를 유지하도록 한다.
acoustic path는 semantic codebook이 잡아내지 못한 detail을 보완하기 위해 15-layer RVQ stack을 사용한다.
학습은 multi-scale mel-spectrogram reconstruction loss를 포함한 GAN-based framework로 진행된다.
encoder와 decoder가 fully causal이기 때문에 token을 emit하고 decode할 때 look-ahead가 필요 없다.

3) Dual-track LM and hierarchical acoustic prediction

Qwen3-TTS는 text에는 standard Qwen tokenizer를, speech에는 Qwen-TTS tokenizer를 사용한다.
stable identity control을 위해 backbone과 함께 learnable speaker encoder를 jointly train한다.
dual-track design은 textual token과 acoustic token을 channel axis에서 concatenate하므로, model이 text에 즉시 반응하면서 streaming mode에서 speech packet을 emit할 수 있다.
25Hz variant에서는 backbone이 현재 speech token을 예측하고, 이를 chunk-wise Code2Wav path로 넘긴다.
12Hz variant에서는 backbone이 먼저 zeroth codebook을 예측하고, MTP module이 residual codebook을 생성한다. 12Hz가 detail을 크게 포기하지 않으면서도 low latency를 유지할 수 있는 핵심 이유가 여기에 있다.

4) Control interface

모든 data는 ChatML로 포맷되므로 model이 speech control을 structured generation task처럼 다룰 수 있다.
fine-grained control signal은 input sequence 앞쪽에 prepend된다.
cloning의 경우, Qwen3-TTS는 speaker embedding을 통해 reference audio를 사용할 수 있고, prosody preservation이 더 중요할 때는 text-speech pair를 in-context로 사용할 수 있다.
voice design의 경우, model은 Qwen3 text backbone과, training 중 probabilistically activated되는 thinking pattern을 활용해 detailed description에 대한 instruction following을 강화한다.

4. Training / Data / Recipe

4-1. Data

Qwen3-TTS는 10개 언어에 걸친 500만 시간 이상의 speech data로 학습된다.
공개 release page에는 Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian이 나열되어 있다.
논문은 이 scale을 multilingual mapping, cloning, controllability, long-form robustness의 기반으로 제시한다.
모든 TTS training data는 ChatML로 포맷되므로, 같은 system이 plain TTS, cloning, instruction-guided control을 함께 다룰 수 있다.

4-2. Training strategy

main model training은 pre-training과 post-training 두 단계로 나뉜다.
pre-training은 세 단계다.
- S1 general stage: 기본 multilingual text-to-speech mapping 학습
- S2 high-quality stage: stratified higher-quality data를 사용해 hallucination을 줄이고 output quality를 개선
- S3 long-context stage: maximum token length를 8,192에서 32,768로 늘리고 long speech를 upsample
post-training도 세 단계다.
- human feedback으로 만든 multilingual preference pair에 대한 DPO
- capability와 stability improvement를 위한 GSPO 기반 rule-based reward
- target voice adoption, naturalness, controllability 향상을 위한 lightweight speaker fine-tuning
tokenizer training recipe도 staged다. 25Hz path는 ASR continuation과 acoustic reconstruction을 통해 학습되고, 12Hz path는 causal streaming support를 가진 semantic-acoustic codec으로 학습된다.

4-3. Engineering notes

이 논문은 꽤 실용적인 latency decomposition을 제시한다.

\[\text{FirstPacketLatency} = \text{LM TTFP} + \text{Tokenizer Decode TPP}\]

엔지니어링적으로 중요한 detail은 몇 가지가 있다.

보고된 latency는 end-to-end 기준이며, tokenizer decoding에는 torch.compile과 CUDA Graph acceleration이 적용된 internal vLLM V0 backend에서 측정된다.
25Hz path에서는 chunk size가 8이고, DiT가 첫 번째 mel chunk를 합성하기 전에 LM이 16 token을 생성해야 한다. 그래서 25Hz tokenizer는 look-ahead cost를 치른다.
논문은 또한 25Hz path에서 BigVGAN이 추가적인 right-context look-ahead를 유발한다고 적는다.
12Hz path에서는 decoder가 pure left-context이므로, 필요한 token만 준비되면 waveform emission을 바로 시작할 수 있다.
scheduling overhead를 과도하게 늘리지 않기 위해 12Hz setup은 1 packet을 4 token으로 정의하고, 이는 packet당 320ms speech에 해당한다.

5. Evaluation

5-1. Main results

Setting	What the paper shows
Speech tokenizer	Qwen-TTS-Tokenizer-12Hz는 LibriSpeech test-clean에서 3.21 PESQ_WB, 3.68 PESQ_NB, 0.96 STOI, 4.16 UTMOS, 0.95 SIM에 도달한다
Zero-shot cloning	Qwen3-TTS-12Hz-1.7B-Base는 test-zh에서 0.77 WER, test-en에서 1.24 WER을 기록한다
Multilingual generation	Qwen3-TTS는 MiniMax-Speech와 ElevenLabs Multilingual v2 대비 10개 언어 중 6개에서 lowest WER, 10개 전부에서 best speaker similarity를 기록한다고 보고한다
Cross-lingual transfer	zh-to-ko에서 Qwen3-TTS-12Hz-1.7B-Base는 4.82 mixed error rate를 기록하고, CosyVoice3는 14.4, CosyVoice2는 24.8이다
Voice design	Qwen3TTS-12Hz-1.7B-VD는 InstructTTSEval-EN에서 82.4 DSD와 68.4 RP를 기록하며, 이 두 metric에서는 Hume보다 앞선다
Target-speaker editing	Qwen3-TTS는 target-speaker editing에서 보고된 모든 metric에서 GPT-4o-mini-tts를 앞서고, speaker-fine-tuned system은 10개 언어 중 7개에서 GPT-4o-Audio-Preview보다 낫다
Long-form speech	10분이 넘는 sample로 구성된 internal long-speech test에서는 25Hz-1.7B-CustomVoice variant가 가장 강한 Qwen3-TTS setting으로 제시된다

5-2. What really matters in the experiments

중요한 교훈은 단지 12Hz가 더 빠르다는 것이 아니다. 논문은 더 coarse한 temporal resolution이 autoregressive model이 더 긴 dependency를 포착하게 돕고, 이것이 zero-shot speech generation의 WER 개선으로 이어진다고 주장한다.
동시에 25Hz variant도 여전히 중요하다. long-form experiment를 보면 richer semantic token이 very long utterance에서 stability에 도움이 될 수 있어서, 12Hz가 universal replacement는 아니다.
multilingual result도 유용하다. open baseline뿐 아니라 commercial system과 비교하면서, 10개 언어 중 6개에서 lowest WER, 10개 모두에서 top speaker similarity를 주장한다.
controllable speech result 역시 중요하다. Qwen3-TTS는 cloning model만이 아니라 speech instruction-following model이기도 하다.
이 실험의 핵심 takeaway는 representation choice가 LM이 실제로 무엇을 model할 수 있는지를 바꾼다는 점이다. 먼저 modeling decision이고, 그 다음이 deployment decision이다.

6. Limitations

논문이 다루는 family 전체와 현재 공개된 checkpoint 사이에는 차이가 있다. GitHub release page는 현재 12Hz tokenizer와 12Hz model variant에 더 초점을 맞추고 있고, report에 언급된 다른 model은 이후 공개로 적혀 있다.
현재 system은 10개 언어를 지원하며, 저자들도 더 넓은 multilingual coverage와 더 세밀한 style control을 future work로 명시한다.
최적 operating point는 task에 따라 다르다. 12Hz branch는 first-packet latency와 많은 content metric에서 유리하지만, 25Hz branch는 very long-form stability에서는 여전히 더 강해 보인다.
중요한 evidence 일부, 특히 long-form evaluation은 standard public benchmark가 아니라 internal dataset에 의존한다.
long-speech result 주변의 narrative text와 tabulated value가 HTML과 release artifact에서 완전히 정렬되어 보이지 않으므로, 최종 발행 전 original PDF table을 다시 확인하는 편이 좋다.

7. My Take

7-1. Why this matters for my work

내가 가장 유용하게 본 부분은 Qwen3-TTS가 audio output을 LM-native token problem으로 다룬다는 점이다.
이건 TTS를 넘어 omni model, speech agent, streaming voice interface까지 연결된다. text와 speech가 어떻게 하나의 generation loop를 공유할 수 있는지에 대한 구체적인 design pattern을 준다.
또한 이 논문은 하나의 model variant가 모든 product need를 지배해야 한다고 가장하지 않는다. operating point를 명확히 드러낸다.

7-2. Reuse potential

25Hz vs 12Hz split은 product tiering template로 재사용하기 좋다. 하나는 richer quality를, 다른 하나는 stricter latency를 겨냥하는 구조다.
ChatML과 prepended control instruction은 controllable voice system에서 재사용하기 좋은 interface idea다.
evaluation setup도 자체로 유용하다. zero-shot cloning, multilingual transfer, cross-lingual transfer, instruction following, target-speaker transfer, long-form stability를 각각 따로 체크해야 한다는 점을 보여준다.
공개 release도 실용적이다. model card와 GitHub repo가 Apache 2.0 checkpoint, qwen-tts package, 그리고 바로 실험 가능한 0.6B와 1.7B 12Hz variant를 제공한다.

7-3. Follow-up papers

Qwen3.5-Omni Technical Report
CosyVoice 3: Towards In-the-Wild Speech Generation via Scaling-up and Post-Training

8. Summary

Qwen3-TTS는 benchmark report라기보다 LM-native speech generation에 대한 systems paper로 읽는 편이 좋다.
핵심 설계는 richer reconstruction을 위한 25Hz와 ultra-low-latency streaming을 위한 12Hz라는 두 speech tokenization regime을 분리한 데 있다.
dual-track LM, speaker encoder, ChatML control interface가 cloning, voice design, instruction-following speech generation을 하나로 묶는다.
strongest result는 하나의 universal setting을 좇기보다 representation을 deployment goal에 맞춘 데서 나온다.
지금 speech agent stack을 만든다면, tokenizer choice, streaming architecture, evaluation coverage 측면에서 이 논문을 design reference로 볼 가치가 있다.

Twitter Facebook LinkedIn