Co-Evolving Policy Distillation Review

2026-05-10 · 15 min read

0. Introduction

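Co-Evolving Policy Distillation (CoPD) is a paper whose point is easy to miss if you read it as “yet another distillation recipe for merging several experts into one model.” The problem it actually targets is when and how to interleave specialization and consolidation during post-training.

In recent reasoning post-training, RLVR (reinforcement learning with verifiable rewards) and OPD (on-policy distillation) are used almost as default ingredients. RLVR can push a policy hard in domains that have a verifiable reward, and OPD can deliver the teacher policy's dense token-level signal on top of student rollouts. Trouble starts when several capabilities have to go into one model. Putting capabilities as different as text reasoning, image reasoning, and video reasoning into a single policy at the same time causes gradient conflict. Conversely, training each expert to completion and then combining them with OPD leaves the behavioral patterns of teacher and student so far apart that transfer weakens.

CoPD's answer is simple but important: instead of merging experts after they are fully trained, have the experts distill into each other at intervals while they are still training. Each branch grows its specialized capability with RLVR on its own domain, and mutual OPD phases inserted in between keep its behavioral pattern from drifting too far from the other branches. At the end, the co-evolved branches are merged into an all-in-one policy.

One-line summary: CoPD is a post-training framework that, instead of completing each capability expert independently and distilling afterwards, has each expert branch keep using the other branches as teachers through mutual OPD while it builds expertise with RLVR, so that behavioral overlap is preserved while text, image, and video reasoning capabilities are integrated into a single policy.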

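This paper is worth reading now for the following reasons.

Rather than treating RLVR and OPD separately, it ties capability exploration and capability consolidation together in a single training loop.
It explains, in terms of behavioral overlap, why the intuition “train the experts first, then distill” can break down.
It reframes multi-domain post-training as a model-parallel co-evolution problem rather than a data-mixing problem.
It fits the recent line of work arguing that what matters in OPD is teacher-student state overlap more than the teacher's score.
It offers practical diagnostics for a multimodal post-training setup that puts text, image, and video reasoning into one policy.

In this paper the training paradigm matters more than individual benchmark scores. The core message is neither “mix several domains and run RL” nor “collect strong experts and distill them.” It is that experts must be periodically re-aligned with each other before they drift apart.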

1. Problem Setting

1-1. Problem definition

The problem this paper targets is integrating multiple expert capabilities into one unified policy.

Suppose we have a single base VLM and want it to be strong at text reasoning, image reasoning, and video reasoning alike. The simplest approach is to mix all the data and train with one RLVR objective. But when reward, response pattern, visual evidence usage, reasoning length, and verifier behavior differ across domains, a single update may not push every capability in the same direction.

Conversely, training each domain expert separately reduces capability conflict. The text branch can get good at text reasoning, the image branch at image reasoning, and the video branch at video reasoning. A different problem appears when these experts are later distilled into one student with OPD. If the experts have already moved far into their own behavior patterns, student rollouts may not overlap well with the teacher's high-probability region. In that case the teacher is strong, but the student struggles to use its signal as a local gradient.

The problem can therefore be summarized as follows.

Mixed RLVR can inject several capabilities at once, but it can create capability conflict.
Static OPD avoids conflict between experts, but the behavior gap between teacher and student can lower absorption efficiency.
CoPD does not separate specialization from consolidation; it alternates between them periodically.

The paper also explains this from a utility perspective. Mixed RLVR appears to use the full optimization signal, but it pays a capability divergence cost.

\[U_{mix} \approx X(D_1, D_2) - \Phi(D_1, D_2)\]

Here $X(D_1, D_2)$ is the total optimization signal carried by the two capability datasets, and $\Phi(D_1, D_2)$ is the divergence cost created by the differing capabilities.

Static OPD avoids gradient conflict, but absorption efficiency drops when teacher-student overlap is low.

\[U_{static} \approx \eta(O_{low}) X(D_1, D_2)\]

Here $\eta(O)$ is the absorption efficiency, which depends on the behavioral overlap $O$. The key point is that when $O_{low}$ is too low, the student cannot absorb enough even from a strong teacher.

CoPD aims for an intermediate operating point between the two.

\[U_{CoPD} \approx \eta(O_{mod}) X(D_1, D_2)\]

That is, the expert branches are not identical, but CoPD tries to keep them in an $O_{mod}$ regime where they remain close enough for OPD to work.
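
Taking the three approximations above at face value, a short comparison shows when the intermediate operating point pays off. This is only a reading of the utility sketch, assuming $X(D_1, D_2) > 0$ and that $\eta$ takes values in $[0, 1]$ and is non-decreasing in $O$; these assumptions are not confirmed by the paper.

\[U_{CoPD} > U_{static} \iff \eta(O_{mod}) > \eta(O_{low})\]

\[U_{CoPD} > U_{mix} \iff \Phi(D_1, D_2) > \left(1 - \eta(O_{mod})\right) X(D_1, D_2)\]

In words, under these assumptions CoPD beats static OPD whenever the maintained overlap yields higher absorption efficiency, and it beats mixed RLVR whenever the divergence cost exceeds the fraction of signal that is lost to imperfect absorption.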

1-2. Why previous approaches are insufficient

Existing approaches have two main limitations.

First, mixed RLVR combines capabilities too directly. Feeding data and rewards from several domains into one policy update is easy to implement. But when the reward landscapes differ, the direction that improves one capability can erode another. In multimodal reasoning in particular, text-only reasoning, image evidence grounding, and the use of temporal video evidence can each demand a different response pattern.

Second, static OPD distills too late. Using an expert as a teacher after it is fully trained gives a strong capability, but its behavior pattern may already be far from the student's. OPD provides teacher log-probs on the states the student visits, so the teacher being strong is not sufficient. Student rollouts have to pass near the teacher's useful support for the dense token signal to act as an actual gradient.

This point connects to recent papers analyzing OPD. The success of OPD may be more sensitive to teacher-student compatibility than to the teacher's benchmark score. Instead of restoring that compatibility after the fact, CoPD tries to maintain it throughout training.

2. Core Idea

2-1. Main contribution

CoPD's core contributions can be seen as three.

Unified view of mixed RLVR and static OPD

The paper argues that, in multiple capability integration, mixed RLVR and static OPD lose capability in different ways. Mixed RLVR pays an inter-capability divergence cost, and static OPD loses absorption efficiency because of the teacher-student behavioral gap.

Co-evolving expert branches

Several branches start from one base model, and each branch runs RLVR on its own capability dataset. The branches are not trained independently to the end, however. Mutual OPD phases are inserted at intervals so that the branches act as both teacher and student for each other.

Behavioral overlap as a design target

CoPD turns the hypothesis that OPD needs sufficient teacher-student behavioral overlap into a training recipe. The paper tracks behavioral proximity between branches with top-k token overlap and symmetric KL divergence.

The key is to maintain “expert diversity” and “behavioral consistency” at the same time. The RLVR phase pushes each branch toward its own capability frontier, and the OPD phase pulls the branches back so they do not drift too far apart.

2-2. Design intuition

The design intuition behind CoPD is as follows.

RLVR is exploration. Each branch searches for a better policy under its own reward and data. Without this phase the branches do not differ enough, and mutual distillation has little new to teach.

OPD is consolidation. It passes the new capability a branch has acquired to the other branches as a dense token-level signal. Without this phase the branches drift too far apart and become hard to merge later.

CoPD therefore alternates between the two forces.

RLVR creates a knowledge gap.
OPD reduces the behavior gap.
The repeated cycle maintains a balance between learnable difference and absorbable similarity.

The idea looks like a curriculum, but more precisely it is a parallel policy curriculum rather than a single-model curriculum. Several policy copies move at the same time, and each policy uses the intermediate checkpoints of the others as teachers.

3. Architecture / Method

3-1. Overview

Item	Description
Goal	Integrate several capability experts into one unified policy
Base setup	Initialize several branches from the same base model
Capability unit	Each branch corresponds to a different dataset or reasoning domain
Exploration phase	Branch-specific RLVR, mainly a GRPO-style objective
Consolidation phase	Mutual OPD, with branches acting as teacher and student for each other
Diagnostics	top-k token overlap, symmetric KL divergence
Final step	Parameter-merge the co-evolved branches into a unified model
Extension	A hub-and-spoke topology can be used as the number of branches grows
Difference from prior work	Instead of distilling after experts are fully trained, bidirectional OPD is repeated during expert training

3-2. Module breakdown

1) Parallel branch initialization

CoPD creates several branches from one base policy.

\[\pi_{\theta_0} \rightarrow \{\pi_{\theta_1}, \pi_{\theta_2}, ..., \pi_{\theta_K}\}\]

Each branch is tied to a specific capability dataset $D_k$. In the two-branch setting these can be, for example, a text branch and an image branch; in the three-branch setting, text, image, and video branches.

The important point is that all branches start from the same initialization. Because the starting point is shared, the initial behavioral overlap is high. CoPD inserts OPD periodically before this overlap collapses.

2) Branch-specific RLVR phase

Each branch runs RLVR on its own data. The paper uses a GRPO-style objective. Conceptually, the objective for branch $k$ can be written as follows.

\[L_{RLVR}^{(k)}(\theta_k) = \mathbb{E}_{x \sim D_k} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min \left( \rho_{i,t}^{(k)} A_i^{RL}, \mathrm{clip}(\rho_{i,t}^{(k)}, 1-\epsilon, 1+\epsilon) A_i^{RL} \right) \right]\]

Here $G$ is the group size, $A_i^{RL}$ is the group-level advantage, and $\rho_{i,t}^{(k)}$ is the importance ratio.

The role of this phase is to make each branch stronger in its own domain. The image branch becomes more specialized on the image reasoning signal, and the video branch on the video reasoning signal.

Running only this phase, however, makes the branches' behavior drift apart. So CoPD runs the RLVR phase for a set number of steps and then switches to the mutual OPD phase.
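
The clipped objective above can be sketched in a few lines of framework-free Python. This is a minimal illustration rather than the paper's implementation: the function names `group_advantages` and `clipped_surrogate`, the mean/std normalization of rewards within a group, and the default `eps=0.2` are assumptions made for the example.

```python
import math


def group_advantages(rewards):
    """Group-relative advantages for one prompt: (r - mean) / std.

    The mean/std normalization is an assumption (a common GRPO choice).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-6) for r in rewards]


def clipped_surrogate(new_logps, old_logps, advantages, eps=0.2):
    """Clipped surrogate averaged over tokens, then over the group.

    new_logps / old_logps: one list of per-token log-probs per response.
    advantages: one scalar advantage per response.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        per_response = 0.0
        for n, o in zip(new_lp, old_lp):
            ratio = math.exp(n - o)  # importance ratio rho_{i,t}
            clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
            per_response += min(ratio * adv, clipped * adv)
        total += per_response / len(new_lp)  # 1/|y_i| normalization
    return total / len(new_logps)  # 1/G normalization
```

The value returned is the quantity to maximize for one prompt; in an autodiff framework the same arithmetic would be applied to tensors and negated to form a loss.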

3) Mutual OPD phase

In the mutual OPD phase, the branches act as teachers for each other. For example, when branch $k$ generates rollouts on branch $j$'s data, branch $j$ provides a token-level log-prob signal on those rollout prefixes.

The teacher signal can be written as follows.

\[\delta_{i,t}^{(k \leftarrow j)} = \log \pi_{\theta_j}(y_{i,t}^{(k)} | x', y_{i,<t}^{(k)}) - \log \pi_{\theta_k}(y_{i,t}^{(k)} | x', y_{i,<t}^{(k)})\]

This value expresses how much more the teacher branch prefers the student branch's sampled token than the student itself does. It is converted into an OPD advantage and used to update the student branch.

\[A_{i,t}^{(k)} = \beta_k \delta_{i,t}^{(k \leftarrow j)}\]

Here $\beta_k$ is the coefficient that controls the strength of the OPD signal.

The important point is that this process is bidirectional. Branch $k$ learns from branch $j$, and branch $j$ also learns from branch $k$. CoPD is therefore closer to peer-to-peer distillation than to a teacher-student hierarchy.
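
The per-token signal and its bidirectional use can be sketched as follows. This is an illustrative sketch assuming the log-probs have already been computed by each branch; the function names and the `logp_a_on_b` argument naming (log-probs that branch `a` assigns to tokens sampled by branch `b`) are hypothetical.

```python
def opd_advantages(teacher_logps, student_logps, beta=1.0):
    """Per-token OPD advantage on student-sampled tokens.

    Computes beta * (log pi_teacher - log pi_student) for each token.
    """
    if len(teacher_logps) != len(student_logps):
        raise ValueError("teacher and student must score the same tokens")
    return [beta * (t - s) for t, s in zip(teacher_logps, student_logps)]


def mutual_opd_advantages(logp_k_on_k, logp_j_on_k,
                          logp_j_on_j, logp_k_on_j,
                          beta_k=1.0, beta_j=1.0):
    """Both directions of one mutual OPD exchange between branches k and j."""
    # Branch k is the student on its own rollout; branch j is the teacher.
    adv_k = opd_advantages(logp_j_on_k, logp_k_on_k, beta_k)
    # Roles swap on branch j's rollout.
    adv_j = opd_advantages(logp_k_on_j, logp_j_on_j, beta_j)
    return adv_k, adv_j
```

A positive entry means the teacher assigns the sampled token more probability than the student did, so the update raises that token's likelihood; a negative entry pushes it down.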

4) Alternating cycles

CoPD does not run RLVR and OPD just once each. The whole training run consists of several cycles.

Phase	Role
Branch-specific RLVR	Each branch explores its own capability more deeply
Mutual OPD	Branches pass on what they have learned and reduce the behavior gap
Repeat	Repeat specialization and consolidation
Merge	Integrate the co-evolved branches into a unified model

The paper treats the ratio between RLVR exploration steps and OPD consolidation steps as important. According to the publicly available summary, $S_{RL}:S_{OPD}=1.5:1$ is reported as the best setting. Rather than taking this number as a universal recipe, it is better read as evidence of a trade-off: if exploration runs too long the branches drift apart, and if consolidation runs too long specialization weakens.
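
The alternating structure can be laid out as an explicit step schedule. This is a minimal sketch: `build_schedule` and the concrete step counts are hypothetical, with 3 RLVR steps and 2 OPD steps per cycle chosen only because they match the 1.5:1 ratio.

```python
def build_schedule(num_cycles, s_rl, s_opd):
    """Return a flat list of (phase, cycle_index) entries.

    Each cycle runs s_rl branch-specific RLVR steps followed by
    s_opd mutual OPD steps.
    """
    schedule = []
    for cycle in range(num_cycles):
        schedule.extend(("rlvr", cycle) for _ in range(s_rl))
        schedule.extend(("opd", cycle) for _ in range(s_opd))
    return schedule


# Two cycles at a 3:2 (= 1.5:1) RLVR-to-OPD step ratio.
for phase, cycle in build_schedule(num_cycles=2, s_rl=3, s_opd=2):
    print(cycle, phase)
```

A trainer would dispatch on the phase label, running every branch's RLVR update in an `rlvr` step and the pairwise distillation updates in an `opd` step.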

5) Behavioral overlap diagnostic

CoPD measures teacher-student compatibility with top-k token overlap. Conceptually, it measures how much the top-k token sets of the student and the teacher coincide on the states the student visits.

\[O_k(\pi_{\theta}, \pi_T) = \mathbb{E}_{x, y_{<t} \sim \mu_{\theta}} \left[ \frac{ |\mathrm{TopK}(\pi_{\theta}(\cdot | x, y_{<t})) \cap \mathrm{TopK}(\pi_T(\cdot | x, y_{<t}))| }{k} \right]\]

Here $\mu_{\theta}$ is the state visitation distribution induced by the student policy.

This metric is not a plain output similarity. It looks at how much the high-probability supports of teacher and student overlap on the prefix states where OPD is actually applied. In its behavioral analysis, the paper reports that CoPD keeps top-k overlap high and symmetric KL divergence low.
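
Both diagnostics are easy to compute once next-token distributions are available for a batch of student-visited prefixes. The sketch below uses plain Python lists as distributions; the function names, the `eps` smoothing, and the choice to define symmetric KL as the unweighted sum of the two directions are assumptions, since the exact definition used in the paper is not confirmed here.

```python
import math


def top_k_overlap(p, q, k):
    """Fraction of shared token ids among the top-k of p and of q."""
    top_p = set(sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k])
    top_q = set(sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k])
    return len(top_p & top_q) / k


def symmetric_kl(p, q, eps=1e-12):
    """KL(p || q) + KL(q || p), smoothed by eps to avoid log(0)."""
    kl_pq = sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
    kl_qp = sum(b * math.log((b + eps) / (a + eps)) for a, b in zip(p, q))
    return kl_pq + kl_qp


def mean_overlap(student_dists, teacher_dists, k):
    """Average top-k overlap over prefixes sampled from student rollouts."""
    pairs = list(zip(student_dists, teacher_dists))
    return sum(top_k_overlap(p, q, k) for p, q in pairs) / len(pairs)
```

The averaging in `mean_overlap` stands in for the expectation over $\mu_{\theta}$: the prefixes must come from the student's own rollouts, not from a fixed reference corpus.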

6) Final merge

Finally, the co-evolved branches are merged into one policy. The paper uses a simple parameter merge as the final consolidation step.

This is interesting. If the branches had drifted far apart, a parameter merge could be very risky. But because CoPD keeps aligning branch behavior with mutual OPD during training, the argument is that the final merge can work more stably.

In other words, the merge is less a core contribution than a final test of whether the result of the CoPD cycles is actually well aligned.
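
One common form of simple parameter merge is a weighted average of matching parameters. The sketch below assumes that form; the paper's exact merge rule is not confirmed here, and `merge_branches` with its dict-of-lists parameter layout is a hypothetical stand-in for real model state.

```python
def merge_branches(branch_params, weights=None):
    """Weighted average of per-branch parameters.

    branch_params: list of dicts mapping parameter name -> list of floats.
    weights: optional per-branch weights; defaults to a uniform average.
    """
    n = len(branch_params)
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("one weight per branch is required")
    merged = {}
    for name in branch_params[0]:
        length = len(branch_params[0][name])
        merged[name] = [
            sum(w * params[name][i] for w, params in zip(weights, branch_params))
            for i in range(length)
        ]
    return merged


text_branch = {"w": [1.0, 3.0]}
image_branch = {"w": [3.0, 5.0]}
print(merge_branches([text_branch, image_branch]))
```

Averaging only makes sense when all branches share the same architecture and parameter names, which holds here because every branch starts from the same base model.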

4. Training / Data / Recipe

4-1. Data

The core setup that can be publicly confirmed from the paper is the integration of text, image, and video reasoning capabilities. According to the publicly available summary, the experiments are based on Qwen3-VL-4B-Instruct.

The exact dataset names, sample counts, filtering rules, and reward function details are hard to confirm fully from the accessible summary alone. The data table in the original PDF should therefore be rechecked before publishing. This draft focuses on the paper's structural claims.

Item	Description
Base model	Reported as Qwen3-VL-4B-Instruct
Capability domains	text reasoning, image reasoning, video reasoning
Branch setup	Two-branch (text + image) and three-branch (text + image + video) experiments are reported
Reward type	RLVR setup with verifiable rewards
Distillation signal	Token-level teacher signal on on-policy rollouts between branches

What matters more than the data in this paper is how the data is fed in. Mixed RLVR puts all capability data in one place. Static OPD trains each expert separately and then distills. CoPD keeps each data stream per branch but repeatedly inserts OPD connections between branches.

4-2. Training strategy

The CoPD training recipe can be summarized in the following order.

Initialize the branches from a shared base model.
Update each branch with RLVR on its own capability data.
After a set number of steps, run mutual OPD between branches.
Repeat the RLVR and OPD cycle.
Monitor behavioral overlap and symmetric KL.
At the end, merge the co-evolved branches.
Evaluate the unified model on text, image, and video reasoning benchmarks.

The three most important hyperparameters are the following.

Hyperparameter	Meaning
$S_{RL}$	Number of steps in the branch-specific RLVR phase
$S_{OPD}$	Number of steps in the mutual OPD phase
$\beta_k$	OPD teacher signal strength

According to the publicly available summary, $S_{RL}:S_{OPD}=1.5:1$ gave good results. This ratio is likely to change, however, with the model, domain, reward density, and number of branches. In practice, rather than using a fixed ratio as-is, it should be tuned while watching the overlap curve and validation score together.

4-3. Engineering notes

When actually implementing CoPD, what matters is orchestration more than the objective.

Branch rollout and teacher scoring must be synchronized

In mutual OPD, rollouts produced by branch $k$ have to be scored by branch $j$. That means multiple policy copies, the rollout buffer, log-prob computation, and reward computation have to interlock at cycle granularity.

Inserting OPD too late makes it equivalent to static OPD

The core of CoPD is inserting distillation before the experts fully converge. If RLVR runs so long that the branches drift too far apart, behavioral overlap drops and CoPD's advantage weakens.

Inserting OPD too often can weaken expert specialization

If consolidation is too strong, the branches become similar to each other. The knowledge gap each branch could newly learn from then shrinks, and the information content of mutual distillation shrinks with it.

Check the overlap diagnostic before the final merge

A parameter merge presupposes that the branches are sufficiently aligned. Merging without checking top-k overlap, symmetric KL, and validation curves together can cause destructive interference.

Topology matters for K > 2 branches

The paper proposes a hub-and-spoke topology as a direction for extension. Fully connecting every branch pair becomes costly. Designating a hub branch that exchanges knowledge with the spoke branches can be more practical as the number of branches grows.
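
The cost difference between the two topologies is visible just from counting directed (student, teacher) pairs per OPD phase. This is an illustrative sketch with hypothetical function names; it assumes bidirectional exchange on every connected pair and says nothing about how the paper selects the hub.

```python
def full_mesh_pairs(num_branches):
    """Every ordered (student, teacher) pair: K * (K - 1) of them."""
    return [
        (student, teacher)
        for student in range(num_branches)
        for teacher in range(num_branches)
        if student != teacher
    ]


def hub_and_spoke_pairs(num_branches, hub=0):
    """Only hub <-> spoke exchanges, in both directions: 2 * (K - 1) pairs."""
    pairs = []
    for spoke in range(num_branches):
        if spoke == hub:
            continue
        pairs.append((spoke, hub))  # spoke learns from hub
        pairs.append((hub, spoke))  # hub learns from spoke
    return pairs


for k in (2, 3, 5, 8):
    print(k, len(full_mesh_pairs(k)), len(hub_and_spoke_pairs(k)))
```

With two branches the topologies coincide; the gap grows quadratically versus linearly after that, which is why the choice only starts to matter for larger K.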

5. Evaluation

5-1. Main results

The experimental messages that can be publicly confirmed are as follows.

In the setting that integrates text, image, and video reasoning capabilities all-in-one, CoPD outperforms mixed RLVR and static OPD/MOPD-style baselines.
CoPD is reported to be strong in the three-branch setting as well as the two-branch setting.
On some tasks, the unified CoPD model is reported to outperform the domain-specific expert.
The ablations are said to show that bidirectional mutual OPD and continuous co-evolution are important.
In the behavioral analysis, CoPD keeps top-k token overlap between branches high and symmetric KL divergence low.
According to the publicly available summary, top-k overlap is reported to stay at 0.90 or above.
The ratio $S_{RL}:S_{OPD}=1.5:1$ is reported as a good setting.

This draft does not reconstruct the numbers in the quantitative benchmark tables. They need to be rechecked against the tables in the original PDF. In particular, the benchmark names and metrics for text, image, and video reasoning, and the exact scores of the expert, mixed RLVR, and MOPD baselines are best re-verified before publishing.

5-2. What really matters in the experiments

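1) The pilot study is the key

The most important experiment in this paper is not the final benchmark table but the relationship between teacher-student behavioral overlap and distillation gain. CoPD's entire design rests on the hypothesis that “OPD works well when teacher and student are close enough.”

If OPD worked well even at low overlap, CoPD's motivation would weaken. Conversely, if higher overlap gives larger OPD gains and independent RLVR lowers overlap, then CoPD's alternating design is a natural solution.

2) The mixed RLVR baseline is a counterexample, not just a baseline

Mixed RLVR puts all data into one policy, so on the surface it is the most direct way to integrate. But the CoPD paper argues that this approach pays an inter-capability divergence cost. It thus serves as a counterexample to the intuition that “adding lots of multi-domain data produces unified capability.”

This message also matters in practice. Multi-domain RLVR is not just a matter of getting data balance and reward scale right. One also has to look at which direction the policy behavior is being pushed in each domain.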

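3) The static OPD baseline shows the limits of teacher strength

Static OPD builds a strong expert teacher and transfers it to the student afterwards. Traditionally this is the natural approach. From the CoPD perspective, however, absorption actually becomes harder once the teacher has moved too far away.

So the questions that matter more than whether the teacher is strong are these.

Can the teacher give a meaningful signal on the student's rollout states?
Do the top-k supports of teacher and student overlap enough?
Is the teacher's behavior pattern in a region the student can follow?

These questions carry over directly to OPD teacher selection.

4) CoPD's success should be judged by curves rather than scores

What makes CoPD good shows up not in a single final score but in the behavior curves during training. One has to check whether overlap is maintained while per-branch capability scores rise, whether symmetric KL stays bounded, and whether divergence between branches shrinks after each OPD phase.

So copying the benchmark table is not enough when reproducing or adapting this paper. The training dynamics plots are the key artifact.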

5) Ablations should examine mutuality and timing

Three ablations matter for CoPD.

one-way distillation vs mutual distillation
static one-shot OPD vs cyclic OPD
branch score before merge vs unified score after merge

According to the publicly available summary, both mutual OPD and merging are reported to be important. It is especially interesting that CoPD branches show broad capability even without the merge. This can be read as the branches already becoming a partially unified policy as they keep distilling each other.

6. Limitations

The exact data recipe and benchmark scores need to be rechecked against the paper's tables

The accessible public summary alone is not enough to confirm the dataset names, sample counts, filtering rules, reward function, and evaluation protocol. This draft focuses on structure and interpretation, and the tables and appendix of the original PDF need to be checked before publishing.

The results are likely centered on Qwen3-VL-4B-Instruct

According to the publicly available summary, the experimental base is Qwen3-VL-4B-Instruct. Whether the same ratio and overlap behavior hold for larger models, text-only LLMs, MoE models, and agentic tool-use models needs further verification.

Branch parallelism increases compute cost

CoPD keeps multiple expert branches alive at the same time. Orchestration, memory, and rollout cost can all be higher than for a single mixed RLVR run. In particular, running OPD on every branch pair becomes expensive as K grows.

The parameter merge is still a fragile step

CoPD maintains branch alignment with mutual OPD, but there is no guarantee that the final parameter merge is always stable. Merge behavior can vary with the architecture, optimizer state, whether LoRA is used, and the magnitude of branch updates.

Top-k overlap is useful but not a sufficient metric

Top-k token overlap is a good proxy for behavioral proximity. But it does not account for reasoning trace quality, visual grounding, temporal evidence usage, or answer format consistency.

Overly strong OPD can reduce diversity

Mutual OPD reduces the behavior gap between branches. But if consolidation is excessive, branch specialization weakens and a homogenization similar to mixed training can result.

Video reasoning needs a separate stress test

Compared with text and image reasoning, video reasoning has larger issues with temporal context, frame sampling, and evidence localization. It matters that CoPD includes a video branch, but whether the approach generalizes the same way to long video or streaming video settings needs further confirmation.

7. My Take

7-1. Why this matters for my work

This paper is better read as a multi-domain post-training systems paper than as a distillation paper. What I found interesting is that CoPD turns the problem of combining several capabilities from “how should the data be mixed?” into “when should the policies use each other as teachers?”

Recent post-training pipelines keep getting more stages. SFT, RLVR, DPO, OPD, RLHF, long-context tuning, tool-use training, and multimodal RL all go in. Stacking the stages serially makes catastrophic forgetting and capability drift likely, while mixing all the data into one objective causes gradient conflict.

CoPD offers a third option in between. Several branches are grown in parallel, with mutual distillation at intervals so that they do not drift too far apart. This also connects to staged post-training pipelines such as Nemotron-Cascade 2. Where Nemotron-Cascade emphasizes stage ordering and MOPD, CoPD emphasizes repeated consolidation between branches that evolve at the same time.

CoPD has three kinds of practical value.

It makes you consider branch-level training before mixing multi-domain RLVR indiscriminately.
It makes you look at behavioral overlap before benchmark score when choosing an OPD teacher.
It lets you use the overlap/KL curves between branches as a diagnostic before building the final unified model.

7-2. Reuse potential

The following points look worth reusing.

Overlap-first OPD diagnostic

Before running OPD, measure the teacher-student top-k overlap, symmetric KL, and entropy gap on a small validation set. Even for a high-scoring teacher, if overlap is low, do not run OPD right away; consider a cold-start or an intermediate teacher instead.

Alternating exploration and consolidation

In multi-domain training, rather than pushing one domain for a long time and merging later, insert a consolidation phase every set number of steps. This pattern can apply to math, code, and tool-use combinations as well as to text, image, and video.

Branch-level curriculum

A data-level curriculum adjusts the order of samples. CoPD provides a branch-level curriculum: each policy copy explores a different capability, and the copies periodically exchange knowledge.

Hub-and-spoke policy distillation

In multi-domain settings with large K, use a hub branch instead of distilling across every pair. A structure where the hub gathers knowledge from several spokes and redistributes it is more realistic in terms of compute.

Merge readiness test

Before the final merge, the top-k overlap and symmetric KL between branches can be used like thresholds. The idea is to check the alignment state before the merge first, rather than looking only at the score after the merge.
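
A merge readiness gate of this kind can be as small as the check below. It is only a sketch: the function name `merge_ready` and the default thresholds `min_overlap=0.90` and `max_sym_kl=0.10` are hypothetical placeholders to be calibrated per model and domain, not values recommended by the paper.

```python
def merge_ready(pairwise_overlap, pairwise_sym_kl,
                min_overlap=0.90, max_sym_kl=0.10):
    """Check branch alignment before a parameter merge.

    Both inputs map a (branch_a, branch_b) pair to a validation-set average.
    Returns (ok, failures), where failures lists every violated threshold.
    """
    failures = []
    for pair, overlap in pairwise_overlap.items():
        if overlap < min_overlap:
            failures.append((pair, "overlap", overlap))
    for pair, kl in pairwise_sym_kl.items():
        if kl > max_sym_kl:
            failures.append((pair, "sym_kl", kl))
    return len(failures) == 0, failures


ok, failures = merge_ready(
    {("text", "image"): 0.93, ("text", "video"): 0.84},
    {("text", "image"): 0.05, ("text", "video"): 0.07},
)
print(ok, failures)
```

Returning the list of failing pairs, rather than a bare boolean, points directly at which branch pair needs another consolidation phase before merging.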

7-3. Follow-up papers

Rethinking On-Policy Distillation of Large Language Models
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
DeepSeek-R1 Technical Report
Group Relative Policy Optimization
On-Policy Context Distillation for Language Models
Learning Beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

8. Summary

CoPD argues that mixed RLVR and static OPD each lose capability in a different way during multiple capability integration.
Mixed RLVR pays an inter-capability divergence cost, and static OPD can suffer low absorption efficiency because of the teacher-student behavioral gap.
CoPD trains several branches with RLVR in parallel and runs mutual OPD at intervals, repeating specialization and consolidation.
The key diagnostics are teacher-student top-k token overlap and symmetric KL divergence.
The paper makes you see post-training not as a single-objective problem but as a model-parallel training problem in which several policy branches evolve together.
