Stream-T1: Test-Time Scaling for Streaming Video Generation Review

2026-06-21 17 min read

0. Introduction

Stream-T1 is easy to misread as a paper that simply says "spend more test-time compute on video generation," and that reading misses the point. The problem it actually tackles is not brute-force test-time scaling that samples many candidates. It is that in streaming video generation, the unit of test-time scaling itself can be changed.

Existing video TTS methods mostly generate many noise candidates, run denoising to completion, and pick a good candidate with a verifier or reward. The problem is that a single candidate is expensive in video generation. Generating many full video trajectories makes cost grow quickly, and if candidate selection cannot directly handle frame-level temporal dependency, consistency tends to break as the video gets longer.

Stream-T1 recasts this problem from the streaming video generation perspective. A streaming generator produces a video chunk by chunk rather than all at once. Each chunk is relatively short, needs few denoising steps, and can pass the previous chunk's latent noise, reward history, and KV cache context to the next chunk. This structure is well suited to turning test-time scaling from plain candidate search into chunk-level trajectory control.

One-line summary: Stream-T1 is an inference-time framework that exploits the chunk-level synthesis of streaming video generation to split test-time scaling into noise propagation, reward pruning, and memory sinking, updating not only the candidate selection for the current chunk but also the latent noise and context memory of the next chunk.

This paper is worth reading now for the following reasons.

It reframes video generation TTS as a stateful streaming optimization problem rather than simply "sampling more."
It shows fairly clearly why the chunk structure of a streaming generator fits test-time scaling well.
It uses reward-based pruning not only to evaluate the quality of the current chunk but also to feed back into the noise and memory update for the next chunk.
By reporting both 5-second and 30-second benchmarks, it shows that long-video temporal coherence is a more important evaluation axis than short-video quality.
It can be viewed as a wrapper on top of a streaming video generator such as LongLive, so the inference recipe can potentially be reused without retraining the backbone.

The core of Stream-T1 is not "video generators also improve when they think longer." The more accurate message is that a streaming generator already carries temporal state in the form of chunks, history, and cache, so test-time compute should be spent on improving that state.

1. Problem Setting

1-1. Problem definition

The problem this paper targets is how to make test-time scaling practical for video generation. For LLMs, spending more test-time compute is relatively natural: generate several reasoning paths, select with a verifier, and repeat self-refinement when needed. Video generation is different.

Producing a single candidate with a video diffusion model is expensive. For long videos in particular, one candidate contains many frames and many denoising steps, so simply increasing the number of candidates leads directly to an inference cost problem.

The other problem is temporal guidance. In text-to-video generation, sampling more candidates does not necessarily improve temporal consistency. Even if a reward evaluates the entire final video clip, drift appears in long videos unless the method directly controls which noise the next chunk starts from, which visual context it should remember, and how to preserve past information that has been evicted from the cache.

Stream-T1's problem definition can therefore be stated as follows.

Can the effect of test-time scaling be obtained without simply sampling more candidates?
Can the chunk boundaries of streaming generation serve as natural decision points for TTS?
Can the reward used for candidate selection be connected to the noise and memory update of the next chunk?
Can temporal coherence be maintained in 30-second generation, not just short-video quality?

From this perspective, Stream-T1 is closer to an inference-time control algorithm than to a training algorithm.

1-2. Why previous approaches are insufficient

Existing test-time scaling for video generation is mostly framed as a search problem: generate several noise trajectories and pick the better video with a verifier or reward. Some approaches, such as Tree-of-Frames in Video-T1, use autoregressive expansion and pruning. This line of work showed that test-time compute can raise generation quality, but from Stream-T1's point of view three bottlenecks remain.

First, candidate exploration is costly. A single video candidate is far more expensive than a single image candidate. Generating the full clip several times and then selecting inflates the compute budget and adds latency in real use. Beyond 30 seconds in particular, increasing the number of candidates alone rarely yields a practical operating point.

Second, temporal state is underused. A streaming video generator has a structure in which the previous, current, and next chunks are explicitly linked. If test-time selection stops at picking the current candidate, the result is not sufficiently reflected in the initial noise or memory state of the next chunk.

Third, in long videos the KV cache is not a mere implementation detail. A streaming model keeps past visual context in the cache, but when context length is limited some of it is evicted. How the evicted context is summarized and carried into the next chunk determines long-horizon consistency. Existing TTS methods often do not connect this cache management problem directly to reward feedback.

Stream-T1 therefore moves TTS from "generate many and select" to "update noise, reward, and memory together at every chunk."

2. Core Idea

2-1. Main contribution

The main contribution of Stream-T1 is three stream-scaled modules tailored to streaming video generation.

Stream-Scaled Noise Propagation
- Propagates verified high-quality noise information from the previous chunk into the initial latent noise of the current chunk.
- Strengthens temporal dependency by ensuring the current chunk does not start from fully independent noise.
- Intuitively, it uses the Gaussian prior of a good previous chunk as a guide for sampling the next chunk.
Stream-Scaled Reward Pruning
- Generates several candidate chunks, then evaluates local spatial aesthetics and global temporal coherence together.
- Prunes candidates by mixing an immediate short-term reward with a sliding-window long-term reward.
- The goal is to pick candidates that fit the streaming history, not just candidates with good frame quality.
Stream-Scaled Memory Sinking
- Routes context evicted from the KV cache to different update pathways depending on reward feedback.
- Tries to preserve evicted visual information as a memory signal that anchors later chunks instead of simply discarding it.
- Helps reduce background, identity, and layout drift in long videos.

These three modules are not just parallel features. At each chunk, Stream-T1 operates in the following order.

In the pre-generation stage, it propagates noise.
In the post-generation stage, it produces candidate chunks.
In the post-pruning stage, it selects a candidate by reward and reflects the selection in the memory update.

In other words, the reward does more than final selection: it becomes a feedback signal that changes the generation state for the next chunk.

2-2. Design intuition

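The design intuition behind Stream-T1 is to split the temporal axis of video generation into smaller pieces. Treating the full video as a single trajectory makes TTS very expensive. In a streaming generator, however, the video is divided into a sequence of chunks.

Conceptually, this can be written as follows.

\[X = [x_1, x_2, \ldots, x_T]\]

Here $x_t$ is a chunk rather than an individual video frame. At each chunk, Stream-T1 generates several candidates $x_t^k$ and uses the history $H_{t-1}$ and a reward $R$ to pick the best one.

\[x_t^* = \arg\max_{x_t^k} R(x_t^k, H_{t-1})\]

This expression alone does not fully describe Stream-T1. What matters more is that the selected chunk changes the next state.

\[H_t = \mathrm{update}(H_{t-1}, x_t^*, r_t)\]

Here $r_t = R(x_t^*, H_{t-1})$ is the reward of the selected chunk, and $H_t$ is not just the generated frames: it includes the noise history, reward history, and memory context that influence the next chunk. That is why Stream-T1 is closer to a streaming control loop than to one-shot best-of-N sampling.

This design matters for three reasons.

First, chunk-level candidate generation is cheaper than full-video candidate generation. Each candidate is short and uses few denoising steps, which makes it easier to lower TTS cost.

Second, streaming history can enter the reward. A current chunk that looks good but connects awkwardly to the previous chunk can be penalized by the reward.

Third, reward feedback improves the next generation state. The information is not discarded once candidate selection ends; it is reused in noise propagation and memory sinking for the next chunk.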

3. Architecture / Method

3-1. Overview

Item	Description
Goal	Use inference-time compute in streaming video generation to improve temporal consistency and visual quality
Base setting	chunk-level streaming video generation
Core modules	Stream-Scaled Noise Propagation, Stream-Scaled Reward Pruning, Stream-Scaled Memory Sinking
Optimization unit	The chunk currently being generated, not the full video
Reward role	candidate selection plus next-state update feedback
Main evaluation	5-second and 30-second generation at 832x480
Main baselines	CausVid, Self-Forcing, LongLive
Difference from prior TTS	Emphasizes streaming state update over full-trajectory candidate search

3-2. Module breakdown

1) Chunk-level test-time scaling

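The starting point of Stream-T1 is the chunk structure of the streaming generator. In the usual TTS view, several candidate videos are generated for one prompt and a good one is picked. Stream-T1 moves this to the chunk level.

At each chunk the model can produce several candidates. These candidates share the same prompt and the same preceding history but may differ in initial noise or intermediate sampling path. Reward pruning picks one of them. The important point is that candidate selection is repeated at every chunk.

The advantage of this structure is finer-grained use of the TTS budget. Instead of generating the whole video and evaluating it after the fact, Stream-T1 prunes chunks that head in the wrong direction early and passes the noise and memory of good candidates to the next chunk.

In short, Stream-T1 views video generation as follows.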

Full video is expensive to search.
Chunk is cheap enough to search.
History makes chunk search temporally grounded.
Reward should update future generation state.

These four statements can be seen as the core of the paper's method.

2) Stream-Scaled Noise Propagation

Noise Propagation does not draw the initial latent noise of the current chunk entirely from scratch; it reuses the noise prior that produced good quality in the previous chunk.

In diffusion generation, the initial noise strongly affects the final sample trajectory. Ordinary candidate search draws several noises and compares the results. In the streaming setting, however, the previous chunk already provides information about which noise trajectory helped temporal continuity and visual quality.

Stream-T1 uses this information to refine the initial noise of the current chunk. The main intents are as follows.

Reduce the stochastic discontinuity between the previous chunk and the current chunk.
Use the latent pattern of a good previous candidate as the sampling prior for the next chunk.
Make the temporal transition smooth from the start of generation rather than fixing it only with a post-hoc reward.

This part is quite important. For temporal consistency in video generation, picking with a verifier after generation may not be enough. Once a noise trajectory has drifted too far, the only option is to discard the candidate. Noise Propagation makes candidates start near the temporal prior in the first place.
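To make the idea of starting "near the temporal prior" concrete, here is a minimal sketch of one possible way to correlate a new chunk's initial noise with the previous chunk's noise. This is an assumption for illustration, not the paper's actual propagation rule: the function name `propagate_noise`, the `strength` parameter, and the variance-preserving blend are all hypothetical.

```python
import math
import random


def propagate_noise(prev_noise, strength, rng):
    """Blend previous-chunk noise with fresh Gaussian noise.

    Hypothetical rule (not taken from the paper): the weights satisfy
    strength**2 + fresh_scale**2 == 1, so if prev_noise is unit-variance
    Gaussian, the output is also unit-variance and has correlation
    `strength` with prev_noise.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    fresh_scale = math.sqrt(1.0 - strength ** 2)
    return [strength * p + fresh_scale * rng.gauss(0.0, 1.0) for p in prev_noise]


rng = random.Random(0)
prev = [rng.gauss(0.0, 1.0) for _ in range(8)]
# strength=0.0 gives fully independent noise; strength=1.0 copies prev exactly.
current = propagate_noise(prev, strength=0.6, rng=rng)
```

The point of the variance-preserving weights is that the blended noise still looks like a standard Gaussian sample to the denoiser, while staying statistically close to the noise that worked for the previous chunk.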

3) Stream-Scaled Reward Pruning

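Reward Pruning is the stage that evaluates and prunes the generated candidates. Unlike ordinary best-of-N selection, the reward in Stream-T1 looks at two axes together.

Local visual quality of the current chunk itself
Temporal coherence with the streaming history

According to the paper, Stream-T1 combines an immediate short-term assessment with a sliding-window-based long-term evaluation. This is because short and long video generation call for different evaluation criteria.

Looking at the current chunk alone, a sharp and aesthetic candidate may seem best. In long videos, however, the following questions can matter more.

Is the background the same as in the previous chunk?
Is subject identity preserved?
Does the camera motion avoid sudden jumps?
Does the semantic state continue from the previous chunk?
Is motion smoothness sustained over a long duration?

The reward in Stream-T1 can therefore be read roughly as follows.

\[R = w_l R_{\mathrm{local}} + w_h R_{\mathrm{history}}\]

Here $R_{\mathrm{local}}$ denotes chunk-level visual quality, $R_{\mathrm{history}}$ denotes history-aware temporal coherence, and $w_l$ and $w_h$ are their weights; all of this is conceptual notation. It does not mean the paper defines the reward by this simple expression; it is a summary for the purpose of this post.

The key to this module is that the reward does not merely evaluate final clip quality; it serves as a pruning signal that trims the search path in the middle of the streaming process.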
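The conceptual reward above can be sketched in a few lines. This is a toy under stated assumptions, not the paper's reward: chunks are represented as small feature vectors, the local score is supplied by the caller, and the history term is a hypothetical mean cosine similarity over a sliding window of previously selected chunks. The names `history_reward` and `prune` are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def history_reward(candidate, history, window=3):
    """Hypothetical R_history: mean similarity to the last `window` kept chunks."""
    recent = history[-window:]
    if not recent:
        return 0.0
    return sum(cosine(candidate, h) for h in recent) / len(recent)


def prune(candidates, local_scores, history, w_local=0.5, w_history=0.5, window=3):
    """Return (index of best candidate, list of combined rewards)."""
    totals = [
        w_local * local + w_history * history_reward(cand, history, window)
        for cand, local in zip(candidates, local_scores)
    ]
    best = max(range(len(candidates)), key=totals.__getitem__)
    return best, totals


history = [[1.0, 0.0]]                     # features of previously kept chunks
candidates = [[0.0, 1.0], [1.0, 0.0]]      # candidate 0 looks better locally
best, totals = prune(candidates, [0.6, 0.5], history)
```

In this toy setup the candidate with the lower local score wins because it matches the history, which is the behavior the two-axis reward is meant to capture.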

4) Stream-Scaled Memory Sinking

Memory Sinking is the most system-like module in Stream-T1. In streaming video generation, context keeps accumulating. The KV cache cannot grow without bound, so some context is evicted. Usually the evicted context simply disappears or is compressed.

Stream-T1 ties this evicted context to reward feedback. Context that produced good candidates is not treated the same as context that did not; the memory update pathway is chosen according to the reward signal. The goal is for earlier visual information to keep anchoring later chunks.

This module targets the following problems.

Drift in which the background changes little by little over a long video
Identity drift in which subject appearance varies subtly from scene to scene
Camera or object motion that jumps because the previous motion state was forgotten
Weakening semantic continuity as past visual clues are evicted from the cache

Memory Sinking is not a simple cache trick; it is an attempt to combine memory management with a quality reward in video generation. Just as long-context memory shapes inference quality in LLMs, the choice of which visual context to keep and which to sink can shape long-horizon quality in a streaming video generator.
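As a rough illustration of what "reward-dependent pathways for evicted context" could look like, the sketch below keeps a fixed-size window and, on eviction, either merges the evicted entry into a running-mean sink or drops it, depending on its reward. Everything here is an assumption for illustration: the class `SinkingCache`, the threshold rule, and the running-mean sink are hypothetical and not the paper's mechanism, which needs to be checked in the original method section.

```python
from collections import deque


class SinkingCache:
    """Toy sliding-window cache with a reward-gated sink (hypothetical design)."""

    def __init__(self, capacity, threshold):
        self.capacity = capacity      # max entries kept in the live window
        self.threshold = threshold    # minimum reward for an evicted entry to be sunk
        self.window = deque()         # (feature, reward) pairs, oldest first
        self.sink = None              # running mean of sunk features
        self.sink_count = 0

    def add(self, feature, reward):
        self.window.append((feature, reward))
        if len(self.window) > self.capacity:
            old_feature, old_reward = self.window.popleft()
            # Pathway 1: well-rewarded context is folded into the sink.
            # Pathway 2: poorly rewarded context is simply discarded.
            if old_reward >= self.threshold:
                self._merge(old_feature)

    def _merge(self, feature):
        self.sink_count += 1
        if self.sink is None:
            self.sink = list(feature)
        else:
            n = self.sink_count
            self.sink = [s + (f - s) / n for s, f in zip(self.sink, feature)]


cache = SinkingCache(capacity=2, threshold=0.5)
for feature, reward in [([1.0], 0.9), ([2.0], 0.1), ([3.0], 0.8), ([4.0], 0.7), ([5.0], 0.2)]:
    cache.add(feature, reward)
# Evicted so far: [1.0] (sunk), [2.0] (dropped), [3.0] (sunk) -> sink is their mean.
```

The design choice worth noting is that eviction order is still oldest-first; only the fate of the evicted entry depends on reward.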

5) Stream-T1 as a feedback loop

Putting the three modules together, Stream-T1 reads as the following loop.

Retrieve useful noise and memory from the previous history.
Generate candidates for the current chunk.
Evaluate the candidates on local quality and temporal coherence.
Select a good candidate.
Reflect the selection in the noise prior and memory state of the next chunk.

This is not simple reranking. Reranking ends once an output is picked. Stream-T1 feeds the reranking result back into the generator state of the next step. The paper can therefore be seen as turning video generation TTS into closed-loop streaming control.
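The loop above can be sketched as a toy control loop. This is only a structural illustration under strong simplifications: a "chunk" is a single float, the generator and reward are stand-ins, and `StreamState`, `generate_candidate`, `score`, and `run_stream` are hypothetical names that do not come from the paper or any codebase.

```python
import random
from dataclasses import dataclass, field


@dataclass
class StreamState:
    """Hypothetical state carried across chunks (the H_t of the conceptual notation)."""
    noise_prior: float = 0.0
    rewards: list = field(default_factory=list)
    memory: list = field(default_factory=list)


def generate_candidate(state, rng):
    # Toy generator: sample a chunk near the propagated noise prior.
    return state.noise_prior + rng.gauss(0.0, 1.0)


def score(chunk, state):
    # Toy reward: prefer chunks close to the most recently kept chunk.
    anchor = state.memory[-1] if state.memory else 0.0
    return -abs(chunk - anchor)


def run_stream(num_chunks=5, num_candidates=4, seed=0):
    rng = random.Random(seed)
    state = StreamState()
    video = []
    for _ in range(num_chunks):
        candidates = [generate_candidate(state, rng) for _ in range(num_candidates)]
        best = max(candidates, key=lambda c: score(c, state))   # prune to one candidate
        reward = score(best, state)                             # scored against old state
        video.append(best)
        # Closed loop: the selection updates the state used by the next chunk.
        state.noise_prior = best
        state.rewards.append(reward)
        state.memory.append(best)
    return video, state


video, state = run_stream()
```

What distinguishes this from plain reranking is the last three lines of the loop body: without them, each chunk would be a fresh best-of-N problem with no shared state.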

4. Training / Data / Recipe

4-1. Data

Stream-T1 does not propose a new pretraining dataset. Its core is an inference-time scaling recipe rather than a training-time data recipe.

According to the paper and project page, the evaluation uses both 5-second and 30-second video generation settings. The resolution is given as 832x480. The comparison covers CausVid, Self-Forcing, LongLive, and a configuration with Stream-T1 on top of LongLive.

The evaluation metrics fall into two groups.

Metric source	Metrics
VBench / VBench Long	Subject Consistency, Background Consistency, Motion Smoothness, Imaging Quality, Aesthetic Quality
VideoAlign	VQ, MQ, TA

The VBench family measures frame-level quality and temporal consistency in a relatively standardized way. VideoAlign can be read as closer to human-aligned video quality, motion quality, and text alignment. The metric definitions and scoring protocol still need to be checked against the original paper.

4-2. Training strategy

The paper is less a recipe for new post-training than a recipe for test-time scaling on top of an existing streaming video generator. The training strategy section is therefore best read as follows.

Stage	Role
Base generator preparation	Use a streaming video generator such as LongLive
Chunk candidate generation	Generate several candidates for the current chunk
Reward evaluation	Evaluate local visual quality and history-aware temporal coherence
Pruning	Select the candidate with the best reward
State update	Reflect the selection in noise propagation and memory sinking

In this setup, the following hyperparameters are likely to matter more than learning rate or epochs.

chunk size
candidate count
reward window size
weighting between local reward and history reward
cache eviction policy
memory sinking update policy
noise propagation strength

These concrete values cannot be sufficiently confirmed from the project page summary alone. They should be rechecked against the method section and appendix of the paper PDF.

4-3. Engineering notes

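If Stream-T1 were implemented in practice, there would be three key engineering issues.

1) TTS budget allocation

Because candidates are generated per chunk instead of as many full videos, where to spend the budget matters. Whether using the same number of candidates for every chunk is optimal, or whether chunks with higher uncertainty should get more candidates, is a point for further experiments.

In practice, the following strategies seem feasible.

Allocate more candidates to chunks with high prompt complexity.
Allocate more candidates to chunks with high reward variance.
Allocate more candidates around scene transitions.
Use fewer candidates in stable scenes to cut cost.

Whether the paper covers this kind of adaptive budget in depth needs to be checked in the original paper.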
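As one possible reading of the reward-variance strategy, the sketch below splits a fixed candidate budget across chunks in proportion to an estimated reward variance, with a floor of one candidate per chunk. This is a speculative illustration of the reviewer's suggestion, not something the paper is known to do; `allocate_candidates` and its parameters are hypothetical, and the variance estimates are assumed to come from elsewhere.

```python
def allocate_candidates(variances, total_budget, min_per_chunk=1):
    """Split `total_budget` candidates across chunks proportionally to variance.

    Uses largest-remainder rounding so the allocation sums exactly to the budget.
    """
    n = len(variances)
    if total_budget < n * min_per_chunk:
        raise ValueError("budget too small for the per-chunk minimum")
    remaining = total_budget - n * min_per_chunk
    total_var = sum(variances)
    if total_var <= 0:
        shares = [remaining / n] * n              # no signal: split evenly
    else:
        shares = [remaining * v / total_var for v in variances]
    extra = [int(s) for s in shares]              # floor of each share
    leftover = remaining - sum(extra)
    # Hand out the leftover units to the largest fractional remainders.
    order = sorted(range(n), key=lambda i: shares[i] - extra[i], reverse=True)
    for i in order[:leftover]:
        extra[i] += 1
    return [min_per_chunk + e for e in extra]


# Three chunks; the third has the noisiest rewards and gets the most candidates.
plan = allocate_candidates([1.0, 3.0, 6.0], total_budget=13)
```

A uniform plan with the same budget would give every chunk four or five candidates; the variance-weighted plan shifts candidates away from stable chunks instead.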

2) Reward computation cost

Reward Pruning raises inference quality, but evaluating the reward model also costs compute. If a VideoAlign-style reward or temporal coherence metric is heavy, the compute savings of TTS can shrink.

In practical deployment it is therefore natural to split the reward into two stages.

cheap filter: blur, aesthetics, obvious motion failure
expensive verifier: temporal coherence, text alignment, identity consistency

The core idea of Stream-T1 is that such a reward stack can be connected to the streaming state update. The problem is not solved just by having a good reward model; the reward output has to flow into the noise and memory update as well.
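The two-stage reward described above is essentially a cascade: a cheap score shortlists candidates, and the expensive verifier runs only on the shortlist. The sketch below shows that pattern in generic form. It is the reviewer's deployment suggestion rather than a documented part of Stream-T1, and `cascade_select`, `cheap_score`, and `expensive_score` are hypothetical names with toy scoring functions.

```python
def cascade_select(candidates, cheap_score, expensive_score, keep=2):
    """Shortlist by a cheap score, then pick the winner with an expensive score.

    The expensive scorer is called at most `keep` times instead of once per candidate.
    """
    ranked = sorted(candidates, key=cheap_score, reverse=True)
    survivors = ranked[:max(1, keep)]
    return max(survivors, key=expensive_score)


expensive_calls = []


def cheap_score(c):
    # Stand-in for a fast filter such as a blur or aesthetics check.
    return c["sharpness"]


def expensive_score(c):
    # Stand-in for a heavy verifier such as temporal coherence or text alignment.
    expensive_calls.append(c["name"])
    return c["coherence"]


candidates = [
    {"name": "a", "sharpness": 0.2, "coherence": 0.9},
    {"name": "b", "sharpness": 0.8, "coherence": 0.4},
    {"name": "c", "sharpness": 0.7, "coherence": 0.6},
    {"name": "d", "sharpness": 0.1, "coherence": 0.3},
]
winner = cascade_select(candidates, cheap_score, expensive_score, keep=2)
```

Note the failure mode visible in this toy data: candidate "a" has the best coherence but is filtered out by the cheap stage, so the shortlist size is a real quality versus cost knob.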

3) Memory state as first-class object

Memory Sinking treats the cache as a first-class inference-time object in a video generation system. This matters especially for long video generation. If cache eviction is seen purely as capacity management, temporal information disappears silently. With reward-aware memory sinking, by contrast, the visual context that produced good chunks can be kept longer.

This view resembles LLM long-context engineering. When not all context can be kept, what matters is not only a larger cache but which context to keep and in what form.

5. Evaluation

5-1. Main results

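The paper and project page compare Stream-T1 in the 5-second and 30-second settings. The tables below summarize only the key numbers. All numbers are taken from the tables in the original source and should be rechecked against the final PDF before publishing.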

5-second video, 832x480

Method	Subj.	Bg.	Motion	Image	Aes.	VQ	MQ	TA
CausVid	96.33	95.56	98.66	69.69	62.90	0.433	0.550	1.020
Self-Forcing	95.26	95.67	98.67	71.61	63.97	0.099	0.088	1.193
LongLive	97.00	96.78	99.12	71.28	65.28	0.285	0.350	1.193
Stream-T1	97.25	97.05	99.15	71.42	65.98	0.426	0.629	1.305

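In the 5-second setting, Stream-T1 is highest on Subject Consistency, Background Consistency, Motion Smoothness, Aesthetic Quality, MQ, and TA. It is second on Imaging Quality (71.42 vs. 71.61 for Self-Forcing) and on VQ (0.426 vs. 0.433 for CausVid), close to the best baseline in both cases.

An important point when interpreting this result is that the absolute differences on some VBench metrics are small. Motion Smoothness, for example, rises only from 99.12 for LongLive to 99.15 for Stream-T1. VideoAlign MQ and TA, by contrast, improve more clearly over LongLive (MQ from 0.350 to 0.629, TA from 1.193 to 1.305). Since the baselines are already strong on short videos, the value of Stream-T1 should be judged metric by metric.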

30-second video, 832x480

Method	Subj.	Bg.	Motion	Image	Aes.	VQ	MQ	TA
CausVid	97.91	96.74	98.15	66.32	59.71	-0.144	0.328	0.501
Self-Forcing	97.18	96.37	98.35	68.35	59.19	-0.461	-0.216	0.656
LongLive	97.90	96.82	98.78	68.99	61.56	-0.169	-0.002	1.073
Stream-T1	98.43	97.18	99.03	69.10	62.11	-0.073	0.226	1.170

I consider the 30-second setting the more important experiment in this paper. Stream-T1 is highest on Subject Consistency, Background Consistency, Motion Smoothness, Imaging Quality, Aesthetic Quality, VQ, and TA, while CausVid is highest on MQ (0.328 vs. 0.226 for Stream-T1).

A point of caution here is how to interpret relative gain. The project page reports large relative gains on some VideoAlign metrics, but relative percentages are very unstable when the baseline value is near 0 or negative. For example, with LongLive's MQ at -0.002 and Stream-T1's at 0.226, the relative gain can look numerically huge, so it is safer to interpret the result through the absolute score difference.
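The instability is easy to see with the numbers from the 30-second table. The short calculation below compares absolute and relative gain for MQ and TA, using LongLive as the baseline. The helper names are illustrative, and dividing by the absolute value of the baseline is one common convention assumed here, not necessarily the one used on the project page.

```python
def absolute_gain(baseline, new):
    return new - baseline


def relative_gain(baseline, new):
    """Gain as a fraction of |baseline|; undefined when the baseline is exactly 0."""
    if baseline == 0:
        raise ZeroDivisionError("relative gain is undefined for a zero baseline")
    return (new - baseline) / abs(baseline)


# 30-second setting, LongLive -> Stream-T1 (values from the table above).
mq_abs = absolute_gain(-0.002, 0.226)   # about 0.228
mq_rel = relative_gain(-0.002, 0.226)   # about 114, i.e. roughly 11,400%
ta_abs = absolute_gain(1.073, 1.170)    # about 0.097
ta_rel = relative_gain(1.073, 1.170)    # about 0.09, i.e. roughly 9%

print(f"MQ: absolute {mq_abs:+.3f}, relative {mq_rel:.0%}")
print(f"TA: absolute {ta_abs:+.3f}, relative {ta_rel:.0%}")
```

The absolute MQ gain is only a bit more than twice the absolute TA gain, yet the relative figures differ by three orders of magnitude, purely because the MQ baseline sits next to zero.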

The key point of this result is that even at 30 seconds Stream-T1 raises the background, subject, and motion metrics at the same time. This suggests that Memory Sinking and Noise Propagation may be working to reduce long-horizon drift.

5-2. Ablation

The 30-second ablation on the project page shows whether all three modules are needed.

Method	Subj.	Bg.	Motion	Image	Aes.	VQ	MQ	TA
w/o Memory Sinking	98.30	97.04	98.92	69.51	61.90	-0.083	0.188	1.146
w/o Noise Propagation	98.35	97.14	98.98	69.07	61.99	-0.094	0.176	1.164
w/o Reward Pruning	98.04	96.88	98.87	69.17	61.22	-0.173	0.014	1.035
Ours	98.43	97.18	99.03	69.10	62.11	-0.073	0.226	1.170

This ablation tells us several things.

First, removing Reward Pruning has a fairly large effect. VideoAlign MQ and TA in particular drop sharply (MQ from 0.226 to 0.014, TA from 1.170 to 1.035). This means the search component that picks good candidates is still central.

Second, Noise Propagation affects VQ and MQ. The qualitative description on the project page interprets its removal as causing local structural artifacts. This is consistent with the claim that the initial noise prior affects chunk-level structural continuity.

Third, Memory Sinking is more directly tied to background stability and long-horizon consistency. Numerically the gap in Background Consistency is small (97.04 without Memory Sinking vs. 97.18 for the full method), but the qualitative ablation mentions degraded background stability. For this module, visual inspection may matter more than the numbers.

Fourth, Imaging Quality is higher without Memory Sinking (69.51) than with the full method (69.10). This shows that the method does not simply raise every metric at once and that there may be a trade-off between temporal consistency and frame-level image metrics. When reading Stream-T1, it is safer to look at per-metric trade-offs than at aggregate claims.
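To read the trade-offs per metric rather than in aggregate, the ablation table can be turned into deltas (full method minus ablation). The snippet below does this with the numbers copied from the table above; the helper names `deltas` and `regressions` are illustrative.

```python
METRICS = ["Subj.", "Bg.", "Motion", "Image", "Aes.", "VQ", "MQ", "TA"]

# Full method ("Ours") and ablation rows, copied from the 30-second ablation table.
FULL = [98.43, 97.18, 99.03, 69.10, 62.11, -0.073, 0.226, 1.170]
ABLATIONS = {
    "w/o Memory Sinking":    [98.30, 97.04, 98.92, 69.51, 61.90, -0.083, 0.188, 1.146],
    "w/o Noise Propagation": [98.35, 97.14, 98.98, 69.07, 61.99, -0.094, 0.176, 1.164],
    "w/o Reward Pruning":    [98.04, 96.88, 98.87, 69.17, 61.22, -0.173, 0.014, 1.035],
}


def deltas(row):
    """Per-metric gain of the full method over one ablation row."""
    return {m: round(f - a, 3) for m, f, a in zip(METRICS, FULL, row)}


def regressions(row):
    """Metrics where the full method scores lower than the ablation."""
    return [m for m, d in deltas(row).items() if d < 0]


for name, row in ABLATIONS.items():
    print(name, deltas(row), "full method lower on:", regressions(row))
```

Run on these rows, the only metric where the full method is lower than an ablation is Imaging Quality, both without Memory Sinking (69.51) and without Reward Pruning (69.17), while removing Reward Pruning produces the largest MQ drop (0.212).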

5-3. What really matters in the experiments

Three experimental points really matter in this paper.

1) The 30-second result matters more than the short-video result

On 5-second videos the baselines are already fairly stable. Stream-T1 improves several metrics, but the differences on some VBench metrics are small. At 30 seconds, temporal drift and memory decay become much more visible. The more important signal is that Stream-T1 raises subject, background, motion, aesthetics, and TA together in the 30-second setting.

2) Reward Pruning is a state update trigger, not just a selector

Removing Reward Pruning stands out most in the ablation. But the module's role does not end at picking the best candidate. Because reward feedback influences Noise Propagation and Memory Sinking, pruning quality also changes the generation state of the next chunk.

In Stream-T1, the reward is closer to a controller than to a score.

3) Look at absolute scores and qualitative cases rather than relative gain

For metrics such as VideoAlign, whose values can be near 0 or negative, relative gain can look exaggerated. When reviewing this paper, the following questions matter more than "the relative gain is large."

Is the temporal transition actually natural for the same seed?
Are background and subject identity maintained over a long duration?
Is the motion smoothness obtained through frame-level blur or staticness?
Does the reward raise visual quality while preserving prompt alignment?
Are the qualitative cases consistent with the metrics?

6. Limitations

Base generator dependence
- Stream-T1 is best read as an inference framework layered on top of a streaming video generator such as LongLive.
- Whether it works equally well on other streaming backbones or on non-streaming diffusion backbones requires separate validation.
Test-time compute is reduced but not free
- Chunk-level TTS can be more efficient than full-video candidate search, but candidate generation and reward evaluation are still extra cost.
- For real-time or interactive video generation, budget allocation and reward cost become important.
Reward model bias and metric dependency
- Reward Pruning reinforces whatever quality the reward measures.
- If the reward over-prefers aesthetics or short-term coherence, other semantic or creative dimensions can suffer.
Ultra-long video beyond 30 seconds needs further validation
- The paper and project page emphasize the 5-second and 30-second benchmarks.
- How stable memory sinking is for videos of 1 minute or longer, or for long-form video with complex scene transitions, requires a closer look at the paper and follow-up experiments.
There are metric trade-offs
- In the ablation, some metrics are higher for a specific ablation than for the full method.
- It is therefore safer to read Stream-T1 as an operating point centered on temporal coherence than as a dominant improvement on every metric.
The internal mechanism of Memory Sinking needs to be rechecked in the paper
- From the project page description alone, it is hard to know in detail which representation the evicted KV context is stored in and through which pathway it is updated.
- Actual reproduction requires checking the paper's method section and the code.

7. My Take

7-1. Why this matters for my work

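Stream-T1 is interesting because it does not view TTS for video generation only as LLM-style best-of-N sampling. For LLMs, generating several answers and picking one with a verifier is fairly natural. In video generation, however, the output is a long temporal signal, each candidate is expensive, and state across chunks matters.

Stream-T1 exploits this difference well. A streaming generator already has the following information.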

chunk boundary
previous latent noise
reward history
KV cache context
evicted memory
sliding-window temporal context

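The core of the paper is to treat this information not as an implementation artifact but as the optimization state of test-time scaling.

This direction is likely to grow in importance. Video generation will increasingly move toward long-form, interactive, and agentic settings. What will matter then is less producing a perfect video in one shot and more a loop that evaluates quality mid-generation, tidies memory, and gives the next chunk a better start.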

7-2. Reuse potential

There are four points I would like to reuse.

Chunk-level TTS
- Spends the budget per chunk instead of searching a long output all at once.
- Could be applied similarly to long-form audio, simulation trajectories, and UI sequence generation, not only video.
Reward as state update signal
- Uses the reward not only for final reranking but as a signal that updates the next generation state.
- Could extend to agents and generative planning by carrying the latent state of useful trajectories into the next step.
Noise propagation for continuity
- Reflects the stochastic state of a previous high-quality sample in the initialization of the next sample.
- Can be seen as a way to handle temporal continuity in diffusion/flow models upstream rather than through post-hoc smoothing.
Reward-aware memory sinking
- Does not treat cache eviction as simply removing old information.
- Can be read as a memory policy that keeps well-rewarded visual context longer and reduces long-horizon drift.

7-3. Follow-up papers

Video-T1: Test-Time Scaling for Video Generation
LongLive: paper on the long video generation baseline
CausVid
Self-Forcing
VBench
VideoAlign
VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
A Simple Baseline for Streaming Video Understanding

8. Summary

Stream-T1 recasts video generation TTS as a streaming chunk-level control problem rather than full-video candidate search.
The core modules are Stream-Scaled Noise Propagation, Stream-Scaled Reward Pruning, and Stream-Scaled Memory Sinking.
The reward does not stop at picking candidates; it feeds back into the noise prior and memory update of the next chunk.
The 30-second evaluation matters more than the 5-second one, and the key is to confirm improved subject, background, and motion consistency on long videos.
That said, test-time compute cost, reward bias, backbone dependence, and metric trade-offs must always be considered alongside the gains.
