FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization Review

2026-05-24 15 minute read

0. Introduction

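FIPO is a paper that revisits token-level credit assignment in RLVR training for reasoning models. Its concern is not simply that GRPO or DAPO are weak. Value-free RL that relies only on an outcome reward tends to spread the final correctness signal almost uniformly over the whole trajectory. As a result, a pivotal token that actually changed the solution and a filler token that merely continued the text receive an advantage in the same direction.

The paper argues that this “coarse-grained credit assignment” causes the length stagnation seen in long reasoning. The DAPO baseline increases response length early on, but it plateaus at an average of about 4K tokens and does not move on to deeper self-verification. FIPO adds a dense signal called “Future-KL”. It accumulates, with discounting, how much a single current token shifts the policy distribution over the rest of the trajectory, and uses that value as an advantage weight.

One-line summary: FIPO is a value-free RL recipe that mitigates the problem of GRPO/DAPO-style outcome-only rewards giving every token the same credit, using advantage re-weighting based on discounted Future-KL, and in doing so raises both long-CoT length and AIME performance.

The paper is worth reading now for the following reasons.

- It makes a clear case that the performance bottleneck in RLVR may lie in token credit assignment rather than in the reward model.
- It shows that a dense advantage can be built without a critic, which suggests a direction for extending GRPO-style recipes.
- It tries to separate experimentally whether longer responses reflect mere verbosity or genuine reasoning depth.
- It aligns the comparison on DAPO, verl, and Qwen2.5-32B-Base and releases code, which makes it valuable from a reproducibility standpoint.

The core of the paper is not a “trick” for making the model talk longer, but a redesign of the credit path in the policy update so that the model keeps exploring long reasoning. FIPO is less a paper about making the final-answer reward more precise than a paper about how that reward should propagate through the token sequence.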

1. Problem Setting

1-1. Problem definition

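The problem the paper targets is that, when training reasoning models with RLVR, long CoT does not keep growing past a certain length.

For tasks with a verifier, such as math problems, the reward can be assigned fairly unambiguously: positive if the answer is correct, negative if it is wrong. This is also why GRPO-style algorithms drew attention after DeepSeek-R1. They can elicit reasoning ability from group-level relative rewards alone, without a value model.

However, the outcome reward arrives only at the end of the trajectory. Within a solution that is thousands of tokens long, it is hard to tell which token was the actual branching point toward the correct answer and which tokens were generated out of inertia. In the standard GRPO/DAPO-style formulation, this global advantage is shared broadly across tokens.

This is where FIPO sees the bottleneck.

- The final reward is sparse.
- The advantage is not dense enough at the token level.
- Critical reasoning pivots and trivial tokens therefore receive the same credit.
- The model tends to stay with short, safe solution patterns rather than explore long “self-reflection”.
- As a result, response length and performance plateau together.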

1-2. Why previous approaches are insufficient

Prior approaches fall broadly into three types.

Type	Main idea	Limitation
GRPO-style value-free RL	Trains without a value model using group-relative rewards	The outcome reward propagates coarsely over the whole trajectory
DAPO-style recipe	Decoupled clipping, dynamic sampling, token-level policy gradient	Stabilizes the recipe but does not directly distinguish each token's downstream influence
PPO with critic	Estimates a more granular advantage with a value function	Adds critic training cost and stability problems

FIPO's point is that dense credit assignment is possible without going back to a PPO critic. The authors build on a prior observation that, during RL, most token distributions change little and the policy shift concentrates on a few sparse but critical tokens. What matters, then, is not a value model but measuring after which tokens the policy has actually come to prefer the future trajectory.

From this viewpoint, KL is not used only as a regularization penalty. FIPO reads the signed log-probability shift between the current policy and the old policy as a directional behavioral signal. The key assumption is that accumulating this signal over the future trajectory after the current token reveals what kind of reasoning future that token opened up.

2. Core Idea

2-1. Main contribution

FIPO's main contribution is to re-weight the standard advantage per token using “Future-KL”.

First, consider the log-probability difference between the current policy and the old policy at token $t$.

\[\Delta \log p_t = \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{old}(y_t \mid x, y_{<t})\]

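A positive value means the current policy is reinforcing that token, and a negative value means it is suppressing it. This instantaneous shift alone is not enough, though. Reasoning is a sequential process, so whether a token matters depends not on that token alone but on how the trajectory after it changes.

FIPO therefore accumulates the signed probability shift from the current position onward, with discounting.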

\[FK_t = \sum_{k=t}^{T} M_k \gamma^{k-t} \Delta \log p_k\]

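Here $M_k$ is a mask that filters out extreme importance ratios, and $\gamma$ is the decay factor for the future signal. The paper calls this quantity Future-KL. Strictly speaking, it is a log-likelihood ratio on the sampled trajectory restricted to the future horizon; functionally, it indicates how much the trajectory after the current token has been reinforced or suppressed relative to the old policy.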
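The sum above also admits a backward recursion, which is a convenient way to reason about it. A short derivation, assuming the boundary convention $FK_{T+1} = 0$ (introduced here for illustration, not taken from the paper):

\[FK_t = M_t \Delta \log p_t + \sum_{k=t+1}^{T} M_k \gamma^{k-t} \Delta \log p_k = M_t \Delta \log p_t + \gamma \sum_{k=t+1}^{T} M_k \gamma^{k-(t+1)} \Delta \log p_k = M_t \Delta \log p_t + \gamma FK_{t+1}\]

Read this way, each token keeps its own masked shift and inherits a $\gamma$-discounted copy of the next token's Future-KL.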

Using this value directly can inflate variance, so it is passed through an exponential mapping and clipping to produce an influence weight.

\[f_t = \operatorname{clip}(\exp(FK_t), 1 - \epsilon_{f,low}, 1 + \epsilon_{f,high})\]

The group relative advantage is then re-weighted as follows.

\[\tilde{A}_t = \hat{A}_t f_t\]

Finally, $\tilde{A}_t$ takes the place of the advantage in the DAPO-style token-level clipped policy objective. The existing policy-gradient scaffold is kept, but the advantage is no longer only a value handed down uniformly from the final outcome.

2-2. Design intuition

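The design intuition is fairly clear. A good reasoning trajectory does not come from a single answer token; it is built as intermediate choices reshape the search space that follows. The reward should therefore reflect not only the final correctness but also what future behavior the current token opened up.

FIPO treats this future behavior as a policy shift. If the current policy reinforces the whole subsequent trajectory relative to the old policy, the earlier tokens of that trajectory are likely to be stable anchors. Conversely, if the subsequent trajectory is being suppressed, the earlier tokens may have opened a wrong branch, or at least a branch that the current update prefers less.

An important point is that FIPO does not build a new reward model. The verifier reward stays outcome-level. What changes is how that reward is converted into token updates. The paper thus keeps the simplicity of value-free RL while drawing part of the dense credit that a PPO critic was meant to supply from the policy-shift signal.

Another important intuition is the local future. Treating tokens in the very distant future as consequences of the current token carries high variance, so FIPO uses a soft decay window. The immediate future counts more, and the distant future is shrunk by exponential decay. A soft window is used instead of hard truncation in order to reflect the local coherence of the reasoning chain without boundary artifacts.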

3. Architecture / Method

3-1. Overview

Item	Description
Goal	Make credit assignment for long reasoning denser in outcome-only RLVR
Base scaffold	DAPO-style token-level policy gradient on verl
Base model	Qwen2.5-32B-Base
Key signal	Signed log-probability shift between the current policy and the old policy
Core module	Future-KL accumulation, soft decay window, extreme-ratio filtering, influence weight clipping
Difference from PPO critic	Builds a future-aware advantage weight from the policy shift, without a separate value model
Main target	Break the response-length plateau and induce self-verifying long CoT

The FIPO pipeline is easiest to understand in the following order.

1. Sample multiple responses per prompt.
2. Compute the outcome reward with the verifier.
3. Build the group relative advantage $\hat{A}_t$.
4. Compute the token-level log-probability shift between the current policy and the old policy.
5. Accumulate the discounted future shift from each token onward to obtain $FK_t$.
6. Exclude tokens with extreme IS ratios via the mask.
7. Clip $\exp(FK_t)$ to obtain the influence weight $f_t$.
8. Re-weight $\hat{A}_t$ by $f_t$ and plug it into the DAPO-style clipped objective.
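The group relative advantage step in the pipeline above can be sketched as follows. This is a minimal sketch of a GRPO-style normalization within one prompt's group; the function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not details taken from the FIPO code.

```python
from statistics import mean, pstdev

def group_relative_advantage(rewards, eps=1e-6):
    """Normalize outcome rewards within one prompt's group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    # Each response gets one scalar; every token in that response shares it.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 responses: two correct (+1), two wrong (-1).
advantages = group_relative_advantage([1.0, 1.0, -1.0, -1.0])
```

Because every token of a response shares the same scalar, this is exactly the uniform credit that the later re-weighting steps are meant to differentiate.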

3-2. Module breakdown

1) Probability shift

FIPO's atomic signal is $\Delta \log p_t$. Unlike an ordinary KL penalty, it is not a cost to be kept small but a signed signal that tells which direction the policy moved.

- positive shift: the current policy prefers the token more than the old policy did.
- negative shift: the current policy prefers the token less than the old policy did.
- magnitude: how strongly the policy update intervened at that token.

The authors' background observation is that important tokens appear sparsely. At most tokens the generation distribution is almost unchanged, and the policy changes the reasoning path at a few critical points. The per-token signed shift is therefore a good starting point for credit assignment.

2) Future-KL estimation

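A single-token shift is only a local signal. FIPO extends it to the whole future trajectory. Accumulating the discounted shifts from the current token onward shows whether the future that the current token started is being reinforced or suppressed.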

\[FK_t = \sum_{k=t}^{T} M_k \gamma^{k-t} \Delta \log p_k\]

A positive value means the trajectory from token $t$ onward is being reinforced overall under the current policy. A negative value means that future branch is being suppressed. This is what FIPO calls “future-aware credit”.
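Since the sum satisfies $FK_t = M_t \Delta \log p_t + \gamma FK_{t+1}$ (with nothing accumulated past the last token), it can be computed in a single backward pass. The snippet below is a minimal sketch under that reading of the formula; the function name and the plain-list inputs are assumptions for illustration.

```python
def future_kl(delta_logp, mask, gamma):
    """Discounted sum of masked log-prob shifts from each position onward."""
    fk = [0.0] * len(delta_logp)
    running = 0.0  # Future-KL of the position to the right (0 past the end)
    for t in range(len(delta_logp) - 1, -1, -1):
        running = mask[t] * delta_logp[t] + gamma * running
        fk[t] = running
    return fk

# Hypothetical 3-token response; the middle token is masked out.
fk_all = future_kl([0.1, -0.2, 0.4], [1.0, 1.0, 1.0], gamma=0.5)
fk_masked = future_kl([0.1, -0.2, 0.4], [1.0, 0.0, 1.0], gamma=0.5)
```

A masked token contributes nothing of its own but still passes the discounted future through to earlier positions.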

3) Soft decay window

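Accumulating every future token with equal weight amplifies noise. The causal dependency between the current token and the very next token is strong, but an outcome thousands of tokens later is heavily affected by intermediate stochastic choices.

FIPO therefore introduces a decay factor.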

\[\gamma = 2^{-1 / \tau}\]

Here $\tau$ corresponds to the effective horizon, or half-life. In the paper's 32B main setting, the Future-KL decay rate is set to 32.0. In other words, FIPO looks at the local future that the current token opens without holding it strongly responsible for the far future.
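A quick worked example, assuming the reported decay rate of 32.0 plays the role of $\tau$ in the formula above (that mapping is an assumption made here):

\[\gamma = 2^{-1/32} \approx 0.9786, \qquad \gamma^{32} = 2^{-1} = 0.5, \qquad \gamma^{64} = 0.25, \qquad \sum_{j=0}^{\infty} \gamma^{j} = \frac{1}{1-\gamma} \approx 46.7\]

Under that assumption, a shift 32 tokens ahead counts half as much as the current token's own shift, and the total weight of the window is roughly that of 47 undiscounted tokens.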

4) Extreme-ratio filtering

Future-KL is sensitive to logit shifts. If a token with an excessively large importance ratio enters the accumulation, the influence weight can swing sharply. The paper shows an instability case for vanilla Future-KL in which a low-clip fraction spike, a policy KL surge, a gradient norm explosion, and a response length collapse occur together around step 70.

To prevent this, FIPO excludes tokens that exceed the Dual-Clip threshold from the Future-KL accumulation. If the clipped objective is already limiting the gradient of a harmful action, feeding that token's extreme ratio back into the future accumulation would only add variance.
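One way to picture the mask $M_k$ is as a simple threshold on the per-token importance ratio. The sketch below assumes a one-sided rule (drop the token when the ratio exceeds the threshold) and borrows 10.0 from the "Safety threshold" row of the recipe table as a default; the paper's exact condition, for example whether it also depends on the sign of the advantage, may differ.

```python
import math

def extreme_ratio_mask(delta_logp, threshold=10.0):
    """Return M_k per token: 1.0 keeps the token, 0.0 drops it from the sum."""
    # exp(delta log p) is the importance ratio pi_theta / pi_old for that token.
    # Assumption: only ratios above the threshold are filtered.
    return [1.0 if math.exp(d) <= threshold else 0.0 for d in delta_logp]

# Hypothetical shifts: unchanged, ratio of about 20 (filtered), suppressed (kept).
mask = extreme_ratio_mask([0.0, math.log(20.0), -3.0])
```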

5) Influence weight clipping

Future-KL is an accumulated value in log space, so an exponential mapping turns it into a multiplicative weight.

\[f_t = \operatorname{clip}(\exp(FK_t), 1 - \epsilon_{f,low}, 1 + \epsilon_{f,high})\]

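In the 32B setting, the Future-KL clip ratio is set to $[1.0, 1.2]$. Since $f_t \ge 1$ under this range, the weight can only enlarge the magnitude of the advantage, so the setting leans toward pushing tokens on positive trajectories harder and correcting tokens on negative trajectories harder. In the 7B appendix, $[0.8, 1.2]$ is used as the more stable setting. The difference suggests that exploration capacity may vary with scale.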
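The effect of the two clip ranges is easy to check numerically. This is a minimal sketch in which `low` and `high` are the clip bounds themselves (that is, $1 - \epsilon_{f,low}$ and $1 + \epsilon_{f,high}$); the function names are illustrative.

```python
import math

def influence_weight(fk, low=1.0, high=1.2):
    """Map a Future-KL value to a clipped multiplicative weight f_t."""
    return min(max(math.exp(fk), low), high)

def reweighted_advantage(adv, fk, low=1.0, high=1.2):
    """Scale the group relative advantage by the influence weight."""
    return adv * influence_weight(fk, low, high)

# Compare the 32B range [1.0, 1.2] with the 7B range [0.8, 1.2].
for fk in (-1.0, 0.1, 1.0):
    print(fk, influence_weight(fk), influence_weight(fk, low=0.8))
```

With the 32B range, a strongly negative Future-KL leaves the advantage unchanged rather than shrinking it, whereas the 7B range allows down-weighting to 0.8.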

6) Target loss

FIPO does not build an entirely new policy objective. It keeps the form of DAPO's token-level clipped loss and replaces only the advantage with the Future-KL-weighted one.

\[r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{old}(y_t \mid x, y_{<t})}\]

\[L_t^{FIPO} = \min(r_t \tilde{A}_t, \operatorname{clip}(r_t, 1 - \epsilon_{low}, 1 + \epsilon_{high})\tilde{A}_t)\]

What matters here is not $r_t$ or the clipping scaffold but $\tilde{A}_t$. Even under the same outcome reward, tokens whose future is reinforced and tokens whose future is suppressed receive different update magnitudes.
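For a single token, the clipped term can be evaluated as below. This is a minimal sketch of the per-token surrogate written as a quantity to maximize; the defaults 0.2 and 0.28 come from the policy clip ratio in the recipe table, the function name is illustrative, and the additional Dual-Clip handling is not included.

```python
def fipo_token_objective(ratio, adv_tilde, eps_low=0.2, eps_high=0.28):
    """min(r * A~, clip(r, 1 - eps_low, 1 + eps_high) * A~) for one token."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * adv_tilde, clipped_ratio * adv_tilde)

# Hypothetical tokens: ratio above the upper clip with a positive advantage,
# and ratio below the lower clip with a negative advantage.
up = fipo_token_objective(1.5, 1.0)
down = fipo_token_objective(0.5, -1.0)
```

In the first case the clipped branch caps the gain; in the second the `min` again selects the clipped branch, so pushing the ratio further below the lower bound yields no extra gain.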

4. Training / Data / Recipe

4-1. Data

FIPO's main experiment focuses on mathematical reasoning. The training data is the DAPO-17K dataset released by DAPO. This choice isolates the algorithmic contribution.

An important point is the use of Qwen2.5-32B-Base. The paper describes it as a cleaner base model with no prior exposure to long-CoT synthetic data. What FIPO aims to show is therefore not a slight improvement of a model that already has a long-reasoning prior, but whether pure RL alone can elicit long reasoning behavior.

The appendix also provides a pilot experiment on Qwen2.5-7B-Math. The main claim, however, rests more heavily on the 32B setting. In the 7B setting, response length stays around 1,200 tokens, and the 10K+ length scaling seen at 32B is not reproduced. The paper attributes this to the 7B model's capacity, its 4K-context pretraining prior, and a bias toward concise code-based reasoning.

4-2. Training strategy

The main recipe for the 32B main setting is as follows.

Item	Setting
Base model	Qwen2.5-32B-Base
Dataset	DAPO-17K
Framework	verl + DAPO codebase
Global batch size	512 prompts
Group size	16 responses per prompt
Total sampled responses per iteration	8,192
Learning rate	1e-6
LR scheduler	constant with 10 warmup steps
Weight decay	0.1
Gradient clipping	1.0
Max prompt length	2,048
Max response length	20,480
Overlong buffer	4,096
Sampling temp / top-p during training	1.0 / 1.0
Policy clip ratio	[0.2, 0.28] asymmetric
KL penalty coefficient	0.0
Future-KL decay rate	32.0
Future-KL clip ratio	[1.0, 1.2]
Safety threshold	10.0

The DAPO baseline uses 512 samples per mini-batch (32 prompts), while FIPO uses 1,024 samples per mini-batch (64 prompts) for stability. As a result, the standard DAPO setting performs 16 gradient updates per training iteration and the FIPO setting performs 8.
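The update counts follow directly from the batch sizes in the table above:

\[512 \times 16 = 8192, \qquad \frac{8192}{32 \times 16} = \frac{8192}{512} = 16, \qquad \frac{8192}{64 \times 16} = \frac{8192}{1024} = 8\]

Here 512 prompts times 16 responses gives the sampled responses per iteration, and dividing by the mini-batch size in samples gives the gradient updates per iteration.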

This choice is closer to a stability recipe than a throughput adjustment. In Appendix E, the authors explain that mini-batch 32 occasionally yields a good peak but is poorly reproducible and frequently ends in a length-stagnation failure. Mini-batch 64 reduces IS variance and token clipping, which lets the Future-KL signal flow more stably through long sequences.

4-3. Engineering notes

Three engineering points matter when implementing FIPO in practice.

First, a naive Future-KL computation needs a dense temporal decay matrix. With response length $L$, memory can grow as $O(L^2)$. To avoid this, the paper uses a chunk-based matrix multiplication implementation. It preserves the analytical result while bounding peak memory by the chunk size, and it uses GPU matmul to reduce overhead.
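The chunking idea can be illustrated in plain Python. This is only a sketch of one possible chunk layout, not the paper's implementation: lists stand in for GPU tensors, the inner sums stand in for a small matmul, and the function names and `chunk` parameter are assumptions. The naive version is included as a reference to compare against.

```python
def future_kl_naive(delta_logp, mask, gamma):
    """Reference: direct double sum over all (t, k) pairs."""
    n = len(delta_logp)
    return [sum(mask[k] * gamma ** (k - t) * delta_logp[k] for k in range(t, n))
            for t in range(n)]

def future_kl_chunked(delta_logp, mask, gamma, chunk=4):
    """Same values, but only a chunk x chunk decay block is materialized."""
    n = len(delta_logp)
    # decay[i][j] = gamma^(j - i) for j >= i, reused for every chunk.
    decay = [[gamma ** (j - i) if j >= i else 0.0 for j in range(chunk)]
             for i in range(chunk)]
    fk = [0.0] * n
    carry = 0.0  # Future-KL at the first position after the current chunk
    for end in range(n, 0, -chunk):  # walk chunks from the tail to the head
        start = max(0, end - chunk)
        size = end - start
        x = [mask[k] * delta_logp[k] for k in range(start, end)]
        for i in range(size):
            # Within-chunk contribution (one row of the small matmul).
            local = sum(decay[i][j] * x[j] for j in range(i, size))
            # Everything beyond the chunk arrives through the discounted carry.
            fk[start + i] = local + gamma ** (size - i) * carry
        carry = fk[start]
    return fk
```

The design point is that the decay block depends only on the chunk size, so peak memory no longer scales with the square of the response length.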

Second, filtering and clipping are not optional tricks. Because Future-KL depends on the log-probability shift, it is sensitive to IS ratio volatility. Without extreme-ratio filtering, the influence weight grows too large and many tokens are pushed outside the PPO trust region, which hurts stability.

Third, the response length penalty and length scaling have to be considered together. FIPO produces longer reasoning even with the overlong penalty in place, so it differs from simply removing the penalty to lengthen outputs. Appendix E reports that removing the overlong penalty alone did not resolve length stagnation.

5. Evaluation

5-1. Main results

The main evaluation uses AIME 2024 and AIME 2025. Following the DAPO protocol, the paper repeats the evaluation 32 times and reports Pass@1 averaged over 32 samples as Avg@32. The inference hyperparameters are temperature 1.0 and top-p 0.7.
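For a single problem, the three metric families used in the results can be sketched as below. Avg@k follows the description above; reading Cons@k as majority-vote accuracy and Pass@k as "at least one of the k samples is correct" follows common usage and is an assumption about the paper's exact definitions. The function names and string answers are illustrative.

```python
from collections import Counter

def avg_at_k(answers, gold):
    """Fraction of the k sampled answers that are correct (mean Pass@1)."""
    return sum(a == gold for a in answers) / len(answers)

def cons_at_k(answers, gold):
    """1.0 if the most frequent sampled answer is correct, else 0.0."""
    majority = Counter(answers).most_common(1)[0][0]
    return float(majority == gold)

def pass_at_k(answers, gold):
    """1.0 if any of the k sampled answers is correct, else 0.0."""
    return float(any(a == gold for a in answers))

# Hypothetical problem with 4 samples instead of 32.
samples = ["42", "42", "7", "42"]
scores = (avg_at_k(samples, "42"), cons_at_k(samples, "42"), pass_at_k(samples, "42"))
```

Benchmark-level numbers are then the mean of each per-problem score over all problems.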

The headline results are as follows.

Method	AIME 2024 Avg@32	AIME 2024 Cons@32	AIME 2024 Pass@32	AIME 2025 Avg@32	AIME 2025 Cons@32	AIME 2025 Pass@32
DAPO baseline	50.0	60.0	80.0	38.0	47.0	63.0
FIPO	56.0	73.0	83.0	43.0	50.0	67.0

The abstract and conclusion state that AIME 2024 Pass@1 rises from 50.0% for DAPO to a peak of 58.0% and finally converges at about 56.0%. Table 1 reports 56.0% as the rounded final result. When citing this paper, it is therefore best to distinguish the peak of 58.0 from the converged/table value of 56.0.

In terms of response length, the DAPO baseline stalls at an average of about 4K tokens, whereas FIPO moves into the 10K+ token regime. The paper also says the median token count rises from about 200 initially to 10K+. The key message of Figure 3 is that the whole distribution (Min, Q25, Median, Q75, and so on) shifts upward, not just a few outlier responses.

The 7B results in the appendix are also worth noting.

Method	AIME 2024 Pass@1	AIME 2025 Pass@1
GRPO 7B	22.0	18.0
DAPO 7B	36.0	18.0
FIPO 7B	40.0	19.0

At 7B, however, response length does not keep growing as it does at 32B. This suggests that FIPO does not work the same way at every scale; rather, the exploration regime depends on model capacity and training prior.

5-2. What really matters in the experiments

The important experimental points in this paper are the training dynamics rather than the final AIME numbers.

1) Is length scaling real reasoning depth?

FIPO increases response length to 10K+. But longer output does not always mean better reasoning, so the paper presents Figure 3 together with the case study in Appendix D.

The authors' interpretation is as follows.

- The initial model does superficial planning. It produces an outline, but the actual derivation is lacking.
- Converged DAPO and early FIPO remain at the linear-execution stage. They stop as soon as an answer is found, with little self-verification.
- In the intermediate FIPO stage, a verification phase appears after the initial answer.
- In late FIPO, symbolic re-derivation, arithmetic verification, and multi-pass checking increase.

In other words, the paper reads the length increase as the emergence of self-reflection behavior, not verbosity. The case study is qualitative evidence, of course, so generalizing to all tasks requires further validation.

2) Raw reward alone can be misleading

In Figure 4, DAPO shows a higher mean training reward than FIPO. The paper does not take this to mean that DAPO learns better. Because the reward includes the overlong penalty, FIPO, which generates longer reasoning, incurs more penalty in the raw reward.

The more informative metric is the response-length-weighted mean relative advantage. For DAPO this value declines, which is interpreted as long positive samples not receiving enough advantage relative to long negative samples. For FIPO it rises. Long, valid reasoning chains become increasingly tied to positive advantage, which creates an incentive to keep extending length.
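As a rough illustration of why this metric carries information that the raw reward does not, here is one plausible way to compute a length-weighted mean advantage for a batch. The formula (sum of length times advantage, divided by total length) is an assumption made for this sketch; the paper's exact definition is not reproduced here.

```python
def length_weighted_mean_advantage(lengths, advantages):
    """Mean advantage in which each response counts in proportion to its length."""
    total = sum(lengths)
    return sum(n * a for n, a in zip(lengths, advantages)) / total

# Hypothetical pair: a short correct response and a long incorrect one,
# then the reverse situation.
short_wins = length_weighted_mean_advantage([100, 300], [1.0, -1.0])
long_wins = length_weighted_mean_advantage([300, 100], [1.0, -1.0])
```

Under this definition the value is negative when the long responses are the ones with negative advantage and positive when long responses tend to be the successful ones, which matches the direction of the interpretation above.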

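3) The stability mechanism is part of the performance

FIPO's algorithmic novelty is Future-KL, but empirically the stability mechanism matters just as much. Vanilla Future-KL can collapse because of extreme negative signals. The paper shows an instability case in which a low-clip fraction spike, a policy KL surge, a gradient norm explosion, and a response length collapse appear together.

FIPO should therefore not be understood as simply “accumulate the future shift”. The actual recipe is the bundle below.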

- Future-KL accumulation
- soft decay horizon
- extreme-ratio filtering
- influence weight clipping
- stable mini-batch setting
- chunked computation

If any one of these is missing, the behavior may differ from the paper's results.

4) AIME 2025 and Pass@32 deserve a more conservative reading

FIPO raises Avg@32 by 6.0 points on AIME 2024 and 5.0 points on AIME 2025. The gain in Pass@32 is smaller: from 80.0 to 83.0 on AIME 2024 and from 63.0 to 67.0 on AIME 2025.

The paper interprets this to mean that RL alone can hardly widen the model's absolute problem-solving boundary by much. That is, FIPO is strong at making the model solve problems within its latent capacity more reliably, and may be limited in expanding into problems it could not solve at all.

6. Limitations

Cost and efficiency

FIPO elicits 10K+ token reasoning. This raises inference cost as well as training cost. The paper itself says that the next step is to first elicit long reasoning and then compress it into concise reasoning. From a serving perspective, how to distill or budget-control the long reasoning behavior that FIPO produces may matter more than FIPO itself.

Math-benchmark-centered evaluation

The main evaluation concentrates on AIME 2024 and AIME 2025. Math has a clear verifiable reward and suits deep-reasoning benchmarks, but whether the same credit mechanism works for open-ended writing, tool use, code agents, and multi-turn planning needs separate validation.

Dataset scope

The training data is limited to DAPO-17K. This is good for a controlled comparison, but how FIPO scales on larger and more diverse datasets remains open. As data quality improves, the Future-KL signal could get stronger or, conversely, noisier.

Scale sensitivity

The 7B appendix results are an important caution. At 32B, longer CoT and entropy growth go along with performance gains, but at 7B response length stays around 1,200 tokens and lower-entropy reasoning traces are interpreted as a better fit. FIPO hyperparameters therefore need to be re-tuned for each model scale.

Future-KL volatility

Because Future-KL uses the policy shift as its signal, it is sensitive to IS ratio volatility. Filtering, clipping, mini-batch size, and decay horizon all matter. In a reproduction, changing only the loss function may lead to collapse or a return to near-baseline behavior, unlike the paper's results.

Length as a proxy

The paper offers qualitative evidence that the length increase is tied to self-reflection, but length is still a proxy. In real deployments, long CoT may not be exposed as is, and longer hidden reasoning does not always improve user-visible answer quality.

7. My Take

7-1. Why this matters for my work

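From my perspective, FIPO's most important message is that it does not attribute the bottleneck of RLVR recipes to reward quality alone. Even with an accurate verifier reward, if the way that reward is passed into token updates is too coarse, the model cannot sufficiently explore deep reasoning branches.

This is a fairly practical problem in LLM post-training. In practice, deciding how to distribute an existing outcome reward well within the sequence may be a more efficient lever than making the reward function more elaborate. FIPO finds this lever in the policy dynamics, without a value model.

An especially interesting point is the changed role of KL. In many RLHF/RLVR recipes, KL is a regularizer that keeps the policy from drifting too far. In FIPO, a KL-type signal becomes not a regularization cost but a signal for reading the preference shift over the future trajectory. This change of role looks reusable in other post-training algorithms.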

7-2. Reuse potential

The following points seem worth reusing.

Adding it to a DAPO-style codebase as a loss module

FIPO does not substantially change the actor, rollout, verifier, or group-advantage scaffold. So if a GRPO/DAPO training stack already exists, it can be tried by adding a Future-KL loss mode.

Designing the training monitor

A useful part of this paper is its metric set. Rather than watching AIME accuracy alone, one should track response length percentiles, length-weighted advantage, policy KL, entropy, gradient norm, clip fraction, and sampled batch count together. For long-reasoning RL, such monitors are close to mandatory.

Extending to tool-use reasoning

The core of Future-KL is looking at how the current token changes future behavior. This structure could also apply to sequential decisions such as tool call selection, search query reformulation, and code editing steps. It will not transfer as is, though, because the verifier is noisier and the action space is more structured than discrete tokens.

Combining with long-to-short distillation

FIPO seems to fit a two-stage pipeline well: use it as the stage that first elicits long reasoning, then distill into concise answers or efficient reasoning. The paper also leaves efficiency optimization as future work. In practice, inference cost can become a burden without this follow-up stage.

7-3. Follow-up papers

DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- This is the recipe FIPO uses directly as its baseline. Understanding decoupled clipping, dynamic sampling, and token-level policy gradient first makes FIPO's differences clearer.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Good to read alongside as the starting point for GRPO and RLVR-based reasoning models.
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
- Useful for comparison with the PPO-based granular-advantage direction that FIPO mentions.
What is behind PPO collapse in long-CoT? Value optimization holds the secret
- Provides a comparison axis on the stability problems that critic/value-based approaches have in long-CoT.
Learning to Reason without External Rewards
- Can be read in contrast, as a direction that induces reasoning behavior without an external verifier reward.

8. Summary

- FIPO mitigates the uniform token credit assignment problem of GRPO/DAPO-style value-free RL with a dense advantage based on Future-KL.
- The key signal is the signed log-probability shift between the current policy and the old policy, accumulated with discounting over the future trajectory.
- In the 32B main setting, AIME 2024 Avg@32 rises from 50.0 for the DAPO baseline to 56.0, with a reported peak of 58.0.
- Response length extends beyond DAPO's plateau of about 4K into the 10K+ regime, which the paper links to the emergence of self-reflection behavior.
- However, because of cost, math-only evaluation, scale sensitivity, and Future-KL volatility, actual reproduction and serving need to consider the stability recipe and a distillation stage together.
