Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation Review

2026-04-15 15 분 소요

0. Introduction

Nemotron-Cascade 2는 얼핏 “30B open reasoning model이 잘 나왔다”는 기술 보고서처럼 볼 수도 있다. 하지만 실제로 읽어보면 더 중요한 질문은 따로 있다. math, code, instruction following, tool use, long-context, SWE 같은 서로 다른 RL workload를 한 policy 안에 어떻게 넣을 것인가? 이 논문은 그 문제를 새로운 단일 objective 하나로 푸는 대신, stage ordering + domain isolation + intermediate distillation로 푼다.

내가 이 논문을 좋게 본 이유도 바로 여기에 있다. Cascade 1이 “domain-wise RL을 어떻게 순차적으로 쌓을 수 있는가”를 보여줬다면, Cascade 2는 그 위에 더 넓은 multi-domain RL과 MOPD(Multi-domain On-Policy Distillation)를 얹어서, specialized stage를 거치면서 생기는 성능 흔들림을 어떻게 다시 균형 잡을지까지 다룬다. 즉 이 글은 단순한 leaderboard 해설보다, reasoning post-training system design 관점에서 읽는 편이 훨씬 재밌다.

또 하나 흥미로운 점은 이 논문이 “모든 걸 joint RL로 섞는 것”을 정답으로 보지 않는다는 점이다. 저자들은 stage 순서가 모델의 현재 행동 특성, domain 간 간섭, verifier 비용, response length에 따라 달라져야 한다고 본다. 그래서 이 논문은 RL recipe를 정적인 cookbook이 아니라 behavior-dependent training pipeline으로 다룬다.

한 줄 요약: Nemotron-Cascade 2는 30B MoE(3B active) 기반 open model 위에 SFT -> IF-RL -> Multi-domain RL -> MOPD -> RLHF -> Long-context RL -> Code RL -> SWE RL 순서의 post-training pipeline을 적용하고, intermediate teacher들로부터의 on-policy distillation을 통해 domain specialization 중 생기는 capability drift를 다시 균형화하려는 논문이다.

이 논문을 지금 볼 가치가 있는 이유는 다음과 같음.

reasoning RL을 “새 loss 하나”보다 도메인 간 간섭을 다루는 시스템 설계 문제로 다시 읽게 만든다.
MOPD가 단순 보조 기법이 아니라, specialized checkpoint들을 다시 하나의 정책으로 합치는 stabilization stage로 제시된다는 점이 흥미롭다.
SFT 데이터, RL 데이터, model checkpoint, HF serving artifact까지 열려 있어서 recipe 단위로 역추적하기 좋다.

이 논문은 단순히 “Nemotron-Cascade 2와 Qwen3.5의 비교”보다, general-purpose reasoning model을 만들 때 specialization과 rebalancing을 어떻게 번갈아 넣을지를 보여주는 문서로 읽는 편이 더 유익할 것이다.

1. Problem Setting

1-1. Problem definition

이 논문이 겨냥하는 핵심 문제는 general-purpose reasoning model을 RL로 키울 때 생기는 heterogeneous domain problem이다.
math, code, instruction following, long-context QA, agentic SWE는 각각
- 필요한 response length가 다르고
- verifier latency가 다르고
- reward density가 다르고
- training instability가 발생하는 방식도 다르다.
예를 들어 strict instruction-following verifier는 형식 제약 준수에 집중하지만, RLHF는 helpfulness와 preference alignment를 본다. 이 둘은 항상 같은 방향이 아니다.
long-context RL도 마찬가지다. 논문은 long-context stage에서 다른 domain을 함께 섞으면 unrelated benchmark 성능이 오히려 나빠질 수 있다고 보고한다.
결국 이 논문의 목표는 단순히 특정 benchmark를 올리는 것이 아니라,
1. domain별로 다른 RL workload를 안정적으로 확장하고
2. 그 과정에서 생기는 interference와 benchmark regression을 관리하며
3. 최종적으로 하나의 broad policy model로 다시 정리하는 것이다.

1-2. Why previous approaches are insufficient

저자들은 stage ordering이 보편 상수(universal constant)가 아니라고 못 박는다. 모델의 초기 능력, SFT 품질, 환경 복잡도에 따라 어떤 stage를 먼저 둘지가 달라진다는 것이다.
Cascade 1이 이미 domain-wise RL의 방향을 보여줬지만, 논문은 그것만으로는 capability drift를 완전히 없애지 못한다고 본다. specialized RL stage가 늘어날수록 benchmark fluctuation은 여전히 남는다.
joint RL도 무조건 나쁘다고 하지는 않는다. 대신 response format이 비슷하고 verifier 비용이 비슷하며, 서로 성능을 깎아먹지 않는 domain만 선택적으로 함께 묶어야 한다고 본다.
즉 문제는 “joint RL vs sequential RL”의 이분법이 아니다. 더 정확히는 어떤 domain은 묶고, 어떤 domain은 분리하고, 어느 시점에서 다시 하나로 합칠 것인가의 문제다.

내가 보기엔 이게 이 논문의 첫 번째 핵심이다. 이전 방식의 한계는 benchmark 숫자가 낮았다는 게 아니라, 서로 다른 RL workload를 한 objective와 한 runtime assumptions 안에 너무 쉽게 집어넣으려 했다는 데 있다.

2. Core Idea

2-1. Main contribution

Broader SFT curation: math, code reasoning, science, long-context, general chat, instruction following, safety, conversational tool use, SWE agent, terminal agent까지 포괄하는 SFT corpus를 구성한다.
Reordered Cascade RL: IF-RL -> Multi-domain RL -> MOPD -> RLHF -> Long-context RL -> Code RL -> SWE RL 순서로 cascade를 설계한다.
Selective multi-domain integration: STEM MCQA, tool calling, structured output처럼 상호 간섭이 적고 runtime 특성이 비슷한 영역은 하나의 stage로 묶는다.
MOPD: strongest intermediate teacher checkpoint들을 domain별로 뽑아 token-level distillation으로 student policy를 재균형화한다.
Simplified thinking / instruct interface: 기존 /think, /no_think 대신 <think> block 기반 인터페이스와 tool-call tags를 사용해 non-thinking / thinking / tool usage를 정리한다.

2-2. Design intuition

이 논문의 설계 직관은 꽤 명확하다.

1) 먼저 instruction adherence를 고정한다

IF-RL을 첫 stage에 두는 이유는, strict constraint following이 이후 여러 downstream task의 기초 priors 역할을 하기 때문이다.
동시에 early IF-RL checkpoint 자체가 나중 MOPD의 teacher로도 쓰인다.

2) 함께 학습해도 되는 domain만 먼저 묶는다

STEM MCQA, tool calling, structured output은 response length와 verification cost가 비슷하고, 논문 기준으로 성능 저하 없이 함께 오를 수 있는 조합이다.
그래서 이 논문은 “모든 걸 multi-domain으로 묶는다”가 아니라 limited multi-domain stage를 둔다.

3) specialization 뒤에는 rebalancing이 필요하다

specialized RL을 계속 쌓다 보면 한 domain은 오르지만 다른 domain은 흔들린다.
그래서 MOPD를 중간에 넣어, math / RLHF / multi-domain teacher의 장점을 다시 student 하나로 묶는다.
여기서 중요한 점은 MOPD가 단순 distillation 부록이 아니라 cascade 전체를 안정화하는 checkpoint라는 것이다.

4) 인간 선호와 style은 그다음에 맞춘다

RLHF는 MOPD 이후에 들어간다.
논문은 RLHF가 creative writing, non-verifiable problem solving, human preference alignment를 끌어올리되 다른 domain을 크게 해치지 않는다고 설명한다.

내 해석으로는, 이 구조의 핵심은 “general -> specialization -> rebalance -> preference alignment -> extreme domains”다. 즉 이 논문은 RL recipe를 한 줄짜리 수식이 아니라 staged control system으로 다룬다.

3. Architecture / Method

3-1. Overview

Item	Description
Goal	heterogeneous reasoning / agentic domains를 RL로 확장하면서 capability drift를 관리하는 것
Base model	Nemotron-3-Nano-30B-A3B-Base에서 post-train한 open 30B MoE, 3B active
Core pipeline	SFT -> IF-RL -> Multi-domain RL -> MOPD -> RLHF -> Long-context RL -> Code RL -> SWE RL
New element vs Cascade 1	broader RL coverage, selective multi-domain RL, MOPD insertion, changed stage ordering
Policy unification strategy	strongest intermediate teachers를 같은 tokenizer/vocab space에서 distill
Practical interface	`<think></think>` 기반 non-thinking mode, tool-call tags, HF/vLLM serving artifact 공개

3-2. Module breakdown

1) SFT data curation and chat interface

SFT는 단순한 instruction-tuning이 아니다. math, code, science, long-context, conversational tool use, SWE, terminal task까지 꽤 넓게 덮는다.
math 쪽은 competition math와 proof data를 분리해서 다루고, code 쪽은 competitive programming reasoning trace와 scientific coding을 같이 모은다.
SWE 쪽은 agentic 데이터와 agentless 데이터를 둘 다 넣는데, 논문은 이 조합이 OpenHands 성능을 높인다고 보고한다.
terminal agent data도 별도 축으로 다룬다. 기존 static task를 interactive terminal format으로 바꾸고, seed prompt와 terminal skill taxonomy로 synthetic task를 추가한다.
채팅 인터페이스도 바뀐다. Cascade 1의 /think, /no_think 대신, empty <think></think> block이 non-thinking mode를 의미하도록 단순화한다.
tool calling은 system prompt 안의 <tools> 정보와 <tool_call> 태그를 명시적으로 사용한다.

2) Cascade RL ordering

Figure 2를 보면 pipeline은 아래 순서를 따른다.

SFT
IF-RL
Multi-domain RL
MOPD
RLHF
Long-context RL
Code RL
SWE RL

이 순서는 임의가 아니다.

IF-RL은 foundational instruction adherence를 만든다.
Multi-domain RL은 tool calling, STEM reasoning, structured output을 같이 밀어준다.
MOPD는 그 시점까지 쌓인 specialized checkpoint의 장점을 student 하나로 재정렬한다.
RLHF는 preference alignment와 creative writing, non-verifiable reasoning 쪽을 보강한다.
Long-context / Code / SWE는 각각 별도 엔지니어링이 필요한 heavy domain이라 후반 specialized stage로 남겨 둔다.

3) Multi-domain On-Policy Distillation (MOPD)

이 논문의 가장 인상적인 부분은 역시 MOPD다.

teacher는 외부 model family에서 끌어오지 않는다.
같은 SFT initialization에서 출발한 intermediate checkpoint 중에서 도메인별 strongest teacher를 뽑는다.
논문 본문 기준 main setting에서는 math, RLHF, multi-domain teacher 세 개를 사용한다.
objective는 student가 직접 sample한 token에 대해 teacher log-prob와 current policy log-prob의 차이를 token-level dense advantage로 쓰는 reverse-KL 형태다.
train/inference mismatch는 truncated importance weighting으로 다룬다.

중요한 건 여기서 distillation이 sequence-level sparse reward의 대체물처럼 작동한다는 점이다. 저자들은 이 dense token-level signal이 GRPO보다 훨씬 sample-efficient하고 step-efficient하다고 주장한다.

4) RL training configuration

전체 cascade RL은 strict on-policy GRPO를 따른다.
매 iteration마다 current policy로 rollout을 만들고, 바로 single gradient update를 한다.
저자들은 이 설정이 importance sampling ratio를 1로 유지해 entropy collapse를 줄이고 안정성을 높인다고 본다.
다시 말해, Nemotron-Cascade 2의 핵심은 새로운 RL 알고리즘 발명보다 기존 on-policy RL을 domain-aware하게 배치하는 방식에 있다.

4. Training / Data / Recipe

4-1. Data

이 논문은 데이터 구성도 꽤 공격적이다. 내가 중요하게 본 포인트만 뽑으면 아래와 같다.

Math SFT
- competition math: 1.8M python tool-calling + 2.6M non-tool samples
- proof: 98K problems에서 proof generation / verification을 합쳐 총 816K samples
Code SFT
- 약 165K unique coding prompt에서 dedup과 correctness filtering을 거쳐
- 1.9M Python reasoning trace, 1.0M C++14 trace, 1.3M Python tool-calling trace 구성
- scientific coding 1.1M samples 추가
Conversational tool use
- multi-turn conversational tool-use 822K samples
SWE SFT
- 125K agentic + 389K agentless samples를 조합
Terminal agent SFT
- Terminal-Task-Gen 기반 490K samples

데이터를 이렇게 넓게 깔아두는 이유는 분명하다. Cascade RL이 아무리 좋아도, 그 앞단의 SFT가 좁으면 teacher pool 자체가 약해진다. MOPD가 paper에서 먹히는 것도 결국 teacher checkpoint들이 이미 서로 다른 강점을 충분히 갖고 있기 때문이다.

4-2. Training strategy

전체 stage를 한 번에 보면 아래 표가 가장 이해가 쉽다.

Stage	Primary purpose	Notable recipe
IF-RL	strict instruction adherence	dynamic filtering, thinking mode only, 49K max response
Multi-domain RL	STEM MCQA + tool calling + structured output	55% MCQA / 30% tool calling / 15% structured output
MOPD	capability rebalance	reverse-KL token-level distillation from domain teachers
RLHF	human preference / creative writing / non-verifiable tasks	pairwise comparisons, GenRM, conciseness bonus
Long-context RL	long-context reasoning	long-context-only data, 32K input / 49K max sequence
Code RL	competitive coding	3.5K hard prompts, 118K max response, strict binary reward
SWE RL	code repair + agentic SWE	agentless RL + execution-based OpenHands RL

1) SFT

SFT는 packed sequence length 256K로 single-stage training을 수행한다.
본문 기준으로는 약 1.5 epoch 부근에서 optimal performance에 도달한다고 한다.
appendix hyperparameter table 기준 global batch size는 64, optimizer는 AdamW, max learning rate는 5e-5다.

2) IF-RL

IF-RL은 첫 RL stage다.
objective-verifiable instruction-following dataset을 사용하고, all-correct / all-wrong sample을 배치에서 빼는 dynamic filtering을 적용한다.
extended IF-RL이 불필요한 token usage를 키울 수 있어서 overlong penalty도 둔다.
이 stage는 thinking mode only로 학습된다.
appendix table 기준 49K max response, batch size 128, rollout size 16, 약 180 steps다.

3) Multi-domain RL

IF-RL 다음에는 MCQA, agentic tool calling, structured output을 하나의 blended stage로 묶는다.
비율은 대략 55% MCQA, 30% agentic tool calling, 15% structured output이다.
저자들이 이 조합을 택한 이유는 두 가지다.
1. benchmark degradation이 관찰되지 않았고
2. response length / verification time이 비슷해서 runtime inefficiency가 작았기 때문이다.
appendix table 기준 49K max response, rollout 16, 약 70 steps다.

4) MOPD

MOPD는 논문에서 가장 독특한 stage다.
main setting은 rollout size 4, 128 prompts per update, effective batch size 512 responses다.
warm-up은 첫 30 step에 걸쳐 적용하고, 전체는 대체로 40~50 steps 안에 수렴한다고 설명한다.
teacher는 math, RLHF, multi-domain 세 축에서 뽑는다.
- math teacher는 initial SFT checkpoint
- RLHF teacher는 initial SFT에서 따로 RLHF로 최적화한 checkpoint
- multi-domain teacher는 IF-RL + multi-domain RL 이후의 checkpoint

즉 MOPD는 “teacher 한 명에게 배우는 distillation”이 아니라, specialized intermediate models를 다시 하나의 broad student로 압축하는 stage다.

5) RLHF

RLHF는 NVIDIA Nano-v3 계열 dataset과 GenRM recipe를 재사용한다.
pair-wise comparison을 모든 rollout pair에 적용하고, length-normalized reward adjustment와 quality-gated conciseness bonus를 쓴다.
중요한 디테일은 RLHF를 thinking mode only로 학습한다는 점이다.
저자들은 thinking + non-thinking을 같이 쓰면 convergence가 좋아질 수는 있어도, instruction-following 성능이 크게 떨어졌다고 보고한다.
appendix table 기준 RLHF는 16K max response, rollout 16, learning rate 3e-6, step 수는 25로 적혀 있다. 다만 본문은 “around 30 steps”라고 적어 두어서, 최종 게시 전에 이 부분은 다시 맞춰보는 것이 좋다.

6) Long-context RL

RLHF 뒤에는 long-context-only RL stage가 온다.
논문은 long-context RL에서 다른 domain을 섞으면 unrelated benchmark가 나빠질 수 있다고 말한다.
그래서 NVIDIA Nano-v3 RL blend 중에서도 long-context dataset만 쓴다.
Nemo-Gym 환경과 Qwen3-235B-A22B-Instruct-2507 judge를 사용한다.
input은 32K로 제한하고, max sequence length는 49K다.
이 stage도 약 30 step 정도로 짧게 돈다.

7) Code RL

Code RL training set은 Nemotron-Cascade coding corpus에서 오지만, 너무 쉬운 prompt는 과감히 잘라낸다.
GPT-OSS-120B가 8/8 rollouts 모두 정답을 맞히는 샘플을 제거해 최종적으로 3.5K high-difficulty set만 남긴다.
max response length는 118K까지 늘린다.
reward는 strict binary reward다. reward hacking 가능성을 줄이기 위해 fully on-policy로 유지한다.
verification throughput을 위해 384 CPU core 위에서 asynchronous reward verification server를 돌리고, batch당 처리 시간을 427.2초로 보고한다.

8) SWE RL

SWE는 사실상 두 단계다.

(a) Agentless RL

대부분의 instance에 실행 가능한 Docker 환경이 없기 때문에 GPT-OSS-120B를 reward model처럼 사용한다.
golden localization과 retrieved localization을 함께 prompt에 넣고, none-of-rollouts-reward>0.5인 difficult prompt는 loss masking으로 처리한다.
agentless RL은 보통 40~50 step에서 수렴한다.
이 stage만으로도 SWE-bench Verified에서 OpenHands avg@4가 49.8% -> 50.8%, pass@4가 64.2% -> 65.0%로 오른다.

(b) Execution-based agentic SWE RL

이후에는 OpenHands scaffold 안에서 직접 RLVR를 수행한다.
batch size 1024는 16 prompts × 64 rollouts 구조다.
max context 256K, max turn 200으로 long-horizon agent workflow를 학습한다.
100% accuracy인 너무 쉬운 instance는 제거하고, 0% accuracy인 너무 어려운 instance는 90%를 버린다.

내가 보기엔 이 SWE 설계는 꽤 현실적이다. executable environment가 없는 영역은 agentless로 먼저 밀고, 실행 가능한 영역은 scaffold 안에서 end-to-end RL로 다시 미는 식이다.

4-3. Engineering notes

paper는 1M-context benchmark를 보고하지만, HF quick-start 문서는 vLLM 예시를 262,144 max-model-len 기준으로 제공한다. 즉 evaluation claim과 default serving example은 분리해서 읽는 편이 안전하다.
HF model card 기준 이 모델은 thinking / instruct dual mode를 지원하고, practical agentic coding / SWE 쪽은 현재 OpenCode보다 OpenHands 중심으로 안내된다.
multi-turn에서 이전 assistant turn이 thinking mode였다면, conversation history에는 final summary만 넣도록 가이드한다. reasoning trace 누적으로 context를 불필요하게 잡아먹지 않게 하려는 선택으로 보인다.
sampling recommendation은 temperature 1.0, top-p 0.95다.

5. Evaluation

5-1. Main results

1) Headline result는 “intelligence density”다

이 논문이 가장 강하게 미는 결과는 competition setting이다.

IMO 2025: 35/42, Gold
IOI 2025: 439.28/600, Gold
ICPC World Finals 2025: 10/12, Gold

30B MoE / 3B active라는 점을 감안하면, paper가 말하는 “high intelligence density”라는 framing 자체는 납득할 만하다.

2) Math / code / instruction following이 가장 강한 축이다

Table 1에서 눈에 띄는 숫자는 아래쪽이다.

AIME 2025: 92.4, tool-integrated reasoning 포함 시 98.6
LiveCodeBench v6: 87.2, TIR 포함 시 88.4
ArenaHard v2 average: 83.5
IFBench (prompt): 82.9
NIAH@1M (RULER subset): 99.0

즉 이 모델의 strongest story는 broad “everything model”이라기보다, math / code reasoning + instruction adherence가 핵심이다.

3) 그런데 broad capability profile은 비대칭적이다

Table 1을 자세히 보면, Nemotron-Cascade 2가 모든 축에서 최고는 아니다.

long-context에서는 NIAH@1M은 매우 강하지만, AA-LCR 39.1과 LongBench v2 40.3은 larger baseline보다 약하다.
agentic에서도 BFCL v4 52.9, τ²-Bench 58.9, Terminal Bench 2.0 21.1, SWE Verified(OpenHands) 50.2로 영역별 편차가 크다.
multilingual 역시 MMLU-ProX 72.5, WMT24++ 84.1로 아주 강한 축은 아니다.

즉 paper headline만 보면 “general-purpose reasoning across the board”처럼 읽히기 쉽지만, 실제 main table은 카테고리별 강약이 꽤 분명한 프로파일을 보여준다.

5-2. What really matters in the experiments

1) Figure 2가 이 논문의 핵심 그림이다

솔직히 이 논문은 final benchmark table보다 Figure 2가 더 중요하다. 그 그림 하나에 pipeline의 철학이 다 들어 있다.

specialized stage를 순차적으로 쌓고
중간에 MOPD로 rebalance하고
후반에는 long-context / code / SWE처럼 runtime-heavy domain을 따로 밀어준다.

즉 이 논문이 전달하는 가장 중요한 메시지는 “우리는 이 loss를 썼다”가 아니라 “우리는 이 순서로 pipeline을 조직했다”다.

2) MOPD의 가치가 final score보다 더 중요하다

Figure 3와 Table 3에서 paper가 보여주는 메시지는 꽤 선명하다.

AIME25 math-only training에서 GRPO는 25 step 후 89.9 -> 91.0까지 오르는데,
MOPD는 30 step 안에 92.0에 도달하며 teacher-level performance를 회복한다고 한다.
ArenaHard v2에서는 MOPD가 52 step에서 Hard Prompt 85.5, Creative Writing 71.0에 도달하는 반면, RLHF는 160 step에서 Hard Prompt 80.7, Creative Writing 71.2 수준이다.

즉 MOPD는 “약간 더 좋다”보다 훨씬 빨리 균형을 회복한다는 쪽이 더 중요하다.

3) selective multi-domain RL의 기준이 꽤 실무적이다

이 논문이 마음에 드는 이유 중 하나는 multi-domain RL을 이상론으로 말하지 않는다는 점이다.

같이 묶을 domain은 성능 간섭이 적어야 하고
response length / verification time도 비슷해야 한다.

이건 실제 RL infra를 돌려본 사람에게는 꽤 현실적인 기준이다. 잘못 묶으면 GPU는 놀고 verifier만 기다리거나, 반대로 쉬운 domain이 어려운 domain을 덮어버리게 된다.

4) agentless RL이 agentic SWE에도 일부 전이된다

Table 4의 improvement는 크지 않지만 의미는 있다.

agentless RL만 했는데도 OpenHands 쪽 avg@4와 pass@4가 같이 오른다.

내 해석으로는 이 결과가 “scaffold-specific trajectory imitation만으로는 부족하고, underlying code repair skill 자체를 올리는 것”도 중요하다는 증거다.

5) long-context claim은 좁고 넓은 평가를 같이 봐야 한다

NIAH@1M은 99.0으로 매우 강하다.
하지만 AA-LCR, LongBench v2는 그렇게 압도적이지 않다.

따라서 이 모델을 “1M context model”이라고 한 줄로 정리해 버리면 정확하지 않다. needle retrieval / synthetic-style long-context strong point와 broader long-context reasoning profile은 분리해서 읽어야 한다.

6. Limitations

Benchmark profile이 생각보다 비대칭적이다.
math, code, ArenaHard, IFBench는 강하지만, long-context / agentic / multilingual 전반이 모두 최고는 아니다.
이 recipe는 teacher-heavy하다.
MOPD는 same-init intermediate teachers를 활용하고, SFT 데이터 자체도 강한 외부 teacher generation에 많이 의존한다. 그래서 이 결과를 pure RL의 승리로만 읽으면 과하다.
Stage ordering은 transferable law가 아니다.
논문 스스로도 ordering은 universal constant가 아니라 model behavior의 함수라고 말한다. 즉 다른 base model이나 다른 domain set에 그대로 복사해도 먹힌다는 보장은 없다.
1M-context eval과 실제 서빙 설정은 구분이 필요하다.
paper의 long-context headline과 HF quick-start의 262K serving example은 같은 층위의 주장이 아니다.
Agentic SWE는 여전히 scaffold 의존성이 크다.
execution-based RL이 OpenHands 안에서 잘 돌아간다는 것과, 다른 scaffold에서도 같은 이득이 유지된다는 것은 별개 문제다.
RLHF step 수 등 일부 디테일은 본문과 appendix를 다시 맞춰볼 필요가 있다.
이는 큰 문제라기보다 final post 검수 포인트에 가깝다.

7. My Take

7-1. Why this matters for my work

내 기준에서 이 논문의 가장 큰 가치는 reasoning post-training을 scheduling problem으로 다시 보게 만든다는 점이다.

보통 reasoning model 논문은

어떤 reward를 썼는지
어떤 verifier를 썼는지
어떤 benchmark를 얼마나 올렸는지

에 시선이 몰린다. 그런데 실제 시스템 관점에서는 그보다 먼저,

어떤 domain을 먼저 학습할지
어떤 domain을 묶을지
언제 specialization을 멈추고 rebalance할지
어떤 stage에서 human preference를 다시 맞출지

가 훨씬 중요할 때가 많다.

Nemotron-Cascade 2는 이 질문들에 꽤 실무적인 답을 준다. 특히 specialization과 rebalancing을 번갈아 배치하는 방식은 앞으로 다른 reasoning / agentic model recipe를 읽을 때도 기준점으로 쓸 만하다.

7-2. Reuse potential

실제로 재사용 가치가 커 보이는 포인트는 아래와 같다.

domain 간 verifier 비용과 response length를 기준으로 stage를 분리하는 방식
completely joint RL 대신 selective multi-domain RL을 넣는 기준
same-tokenizer intermediate teacher pool로 MOPD를 구성하는 방식
long-context RL을 별도 domain stage로 떼는 결정
hard-only code RL data filtering
agentless RL -> execution-based agentic RL로 이어지는 SWE training ladder
multi-turn thinking trace를 full history에 누적하지 않고 final summary만 남기는 serving choice

이 중에서 특히 MOPD는 “논문 아이디어”라기보다 현실적인 capability rebalance tool로 보인다. 수많은 RL stage를 거친 뒤 성능이 요동칠 때, 마지막에 모든 걸 다시 GRPO로 한 번 더 미는 것보다 더 깔끔한 선택지가 될 수 있다.

7-3. Follow-up papers

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models
AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
A Survey of On-Policy Distillation for Large Language Models

내가 보기엔 다음 읽기 순서는 꽤 분명하다. 먼저 Cascade 1로 backbone idea를 확인하고, 그다음 AceReason-Nemotron 1.1로 math/code recipe를 분해하고, 마지막으로 on-policy distillation survey 계열로 MOPD를 더 넓은 문맥에 놓아보는 식이 자연스럽다.

8. Summary

Nemotron-Cascade 2의 핵심은 더 강한 30B model 자체보다, domain-aware post-training pipeline이다.
IF-RL -> Multi-domain RL -> MOPD -> RLHF -> Long-context RL -> Code RL -> SWE RL 순서가 이 논문의 진짜 메시지다.
MOPD는 specialized intermediate teacher들을 다시 하나의 broad policy로 묶는 stabilization stage로 읽는 것이 맞다.
math / code / instruction following은 강하지만, long-context / agentic / multilingual은 카테고리별 편차가 있다.
그래서 이 논문은 “universal SOTA report”보다, reasoning RL systems paper로 읽는 편이 더 정확하다.

Twitter Facebook LinkedIn