Geometric Context Transformer for Streaming 3D Reconstruction Review

2026-05-17 14 min read

0. Introduction

Geometric Context Transformer for Streaming 3D Reconstruction recasts 3D reconstruction: instead of an offline multi-view batch problem, it treats it as an online geometry problem that must be solved over a continuously arriving video stream. The system proposed in the paper is called LingBot-Map; its core architecture is the Geometric Context Transformer (GCT), and its core attention mechanism is Geometric Context Attention (GCA).

What makes this paper interesting is not simply that it handles longer videos. The real point is that it structurally designs the form in which streaming state should be remembered. Existing feed-forward 3D foundation models are strong at looking at an entire image set at once and predicting camera pose, depth, and point maps. In a streaming setting, however, there are no future frames, and storing every past frame makes the KV cache and attention cost keep growing. Conversely, aggressive compression breaks scale, local geometry, and long-range trajectory consistency.

LingBot-Map reads this problem from a SLAM perspective. SLAM systems usually separate roles such as a reference frame, a local tracking window, and a global map. Rather than implementing that separation with hand-crafted keyframe selection or bundle adjustment, the paper moves it into learned attention state: anchor context, pose-reference window, and trajectory memory.

One-line summary: instead of remembering every past frame, GCT decomposes the context of streaming 3D reconstruction into an anchor for coordinate grounding, a pose-reference window for dense local geometry, and a trajectory memory for long-range drift correction, which makes real-time feed-forward reconstruction possible.

This paper is worth reading now for the following reasons.

It clearly shows 3D foundation models moving from offline reconstruction toward streaming spatial intelligence.
It makes us revisit the long-context Transformer problem from the viewpoint of visual geometry and SLAM rather than language.
It offers a good case study of how the persistent spatial memory needed in robotics, AR, and embodied AI can be designed at the model-architecture level.

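In my view, the core message of the paper is this. The bottleneck in streaming 3D reconstruction is not simply a bigger ViT or a faster depth head. The bottleneck is deciding which past observations to keep as dense tokens, which to compress into compact memory, and which to pin as the coordinate reference.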

1. Problem Setting

1-1. Problem definition

The problem this paper targets is causally predicting camera pose, depth maps, and a point cloud from a continuous video stream.
The model can see only the current and past frames; it cannot see future frames or the whole sequence at once.
The goal is to satisfy three things at once.
- accurate local geometry
- long-range temporal consistency
- real-time bounded-memory inference
An offline 3D foundation model can process all views with global attention, but a streaming model has to update its state at every frame.
The core question is therefore how to design the state update:

\[S_t = f(S_{t-1}, I_t)\]

Here $I_t$ is the newly arrived frame and $S_t$ is the geometric state accumulated so far. The key point is that if $S_t$ grows too large, streaming becomes infeasible, and if it is compressed too much, drift and geometry collapse appear.

1-2. Why previous approaches are insufficient

Offline feed-forward methods process the entire image set simultaneously. Models such as VGGT and Depth Anything 3 are powerful, but they need full-sequence access and therefore do not fit online streaming.
Causal-attention streaming methods are right in not looking at future frames, but if they carry the history almost unchanged, memory and compute keep growing with sequence length.
Recurrent-state methods can keep the state compact, but aggressive compression can make them forget older geometric priors or accumulate long-range drift.
Hybrid SLAM methods maintain structured memory through keyframes, pose graphs, and bundle adjustment, but they tend to rely on hand-crafted heuristics and iterative optimization.
Test-time-training long-sequence methods can reinforce global consistency, but they require parameter updates during inference, which makes them different in nature from a real-time feed-forward system.

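In the end, the limits of prior approaches line up along a single axis: remember everything and the system is slow; remember too little and the trajectory drifts. GCT treats this trade-off not as a simple cache-eviction problem but as a problem of decomposing memory by geometric role.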

2. Core Idea

2-1. Main contribution

GCT's core contributions can be summarized in three points.

Geometric Context Attention
- Splits the streaming state into anchor context, a local pose-reference window, and trajectory memory.
- Each context has a distinct role: coordinate grounding, dense local registration, or long-range drift correction.
Progressive long-sequence training recipe
- First trains an offline base model on short multi-view data, then replaces global attention with GCA to convert it into a streaming model.
- Then gradually increases the view count from 24 to 320 and uses context parallelism to make long-sequence training feasible.
Efficient streaming inference path
- Uses a FlashInfer-based paged KV cache, a sparse KV layout, and adaptive keyframe selection.
- The paper reports about 20 FPS at 518 x 378 resolution and long-sequence inference beyond 10,000 frames.

2-2. Design intuition

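The design intuition starts from classical SLAM. A robust SLAM system does not store every observation in the same way. Some frames serve as the reference that fixes the coordinate system, some serve as local evidence for current tracking, and some summaries act as a global map that supports loop handling and reduces drift.

GCT reasons in a similar way.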

Context type	Role	Stored information
Anchor context	Coordinate and scale grounding	Initial frames with full tokens and anchor token
Pose-reference window	Local dense geometry	Recent frames with full image tokens
Trajectory memory	Long-range consistency	Compact camera, anchor, register tokens for evicted frames

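This decomposition matters because it changes how attention cost grows. Causal attention that carries every past image token gets heavier as the history gets longer. GCA, in contrast, keeps full image tokens only for frames inside the local window and retains only compact trajectory tokens for frames outside it.

Intuitively, the state can be written as follows.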

\[S_t = A \cup W_t \cup M_t\]

Here $A$ is the anchor context, $W_t$ is the recent window, and $M_t$ is the trajectory memory. The key is that $W_t$ is dense but fixed in length, while $M_t$ can grow but stores only 6 tokens per frame. The model therefore avoids an explosion of full image tokens without discarding long-range context entirely.
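The bookkeeping behind this decomposition can be sketched in a few lines of Python. This is only a toy illustration that tracks frame ids rather than tensors; the class name `GeometricContext` and the parameters `num_anchor` and `window_size` are hypothetical, and only the figure of 6 compact tokens per evicted frame comes from the review above.

```python
from collections import deque
from dataclasses import dataclass, field

TOKENS_PER_EVICTED_FRAME = 6  # from the review: compact tokens kept per evicted frame


@dataclass
class GeometricContext:
    """Toy bookkeeping for S_t = A, W_t, M_t (frame ids only, no tensors)."""
    num_anchor: int   # hypothetical: how many initial frames act as anchors
    window_size: int  # hypothetical: length of the dense recent window
    anchors: list = field(default_factory=list)
    window: deque = field(default_factory=deque)
    memory: list = field(default_factory=list)

    def push(self, frame_id: int) -> None:
        """Route a new frame to the anchors first, then to the sliding window."""
        if len(self.anchors) < self.num_anchor:
            self.anchors.append(frame_id)
            return
        self.window.append(frame_id)
        if len(self.window) > self.window_size:
            # The oldest dense frame is evicted and survives only as compact tokens.
            self.memory.append(self.window.popleft())

    def num_tokens(self, tokens_per_frame: int) -> int:
        """Context size: dense tokens for A and W_t, compact tokens for M_t."""
        dense = (len(self.anchors) + len(self.window)) * tokens_per_frame
        compact = len(self.memory) * TOKENS_PER_EVICTED_FRAME
        return dense + compact


ctx = GeometricContext(num_anchor=2, window_size=4)
for t in range(10):
    ctx.push(t)
```

After ten frames with these toy settings, the anchors and the window stay at a fixed size while only the memory list grows, which is the growth pattern the paragraph above describes.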

3. Architecture / Method

3-1. Overview

Item	Description
Model name	LingBot-Map
Core architecture	Geometric Context Transformer
Core attention	Geometric Context Attention
Input	Continuous RGB image stream
Output	Camera pose and dense depth map per frame
Key idea	Structure the streaming state as anchor, local window, and trajectory memory
Backbone	DINOv2 initialized ViT backbone
Runtime focus	Paged KV cache and sparse KV attention for streaming inference

LingBot-Map encodes each frame with the ViT backbone and then alternates frame-wise attention and GCA layers. It then predicts the absolute camera pose from the camera token and the depth map from the image tokens. The whole pipeline is feed-forward, but the attention context carries the streaming state.

3-2. Module breakdown

1) Anchor Context

Anchor context is the mechanism that fixes the scale and the coordinate system.

Monocular reconstruction is inherently scale-ambiguous. An offline model can normalize against the entire point cloud, but a streaming model cannot see future frames. LingBot-Map therefore designates the first few frames as anchor frames and applies full attention among them. It also adds a learnable anchor token so that later frames can explicitly tell the anchors apart.

During training, ground-truth depth and camera translation are normalized by a canonical scale computed from the anchor frames. As a result, later streaming frames are registered to the coordinate frame defined by the anchors rather than to a constantly shifting local coordinate.
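A minimal sketch of this normalization, under a stated assumption: the review does not say how the canonical scale is defined, so the code below simply uses the mean valid depth over the anchor frames. The function names and the mask argument are hypothetical.

```python
import numpy as np


def canonical_scale(anchor_depths, anchor_masks):
    """Mean valid depth over the anchor frames (an ASSUMED definition of the scale)."""
    return float(anchor_depths[anchor_masks].mean())


def normalize_sequence(depths, translations, masks, num_anchor):
    """Divide every depth map and camera translation by the anchor-derived scale.

    depths: (T, H, W) ground-truth depth, translations: (T, 3), masks: (T, H, W) bool.
    Only the first num_anchor frames decide the scale, so no future frame is needed.
    """
    s = canonical_scale(depths[:num_anchor], masks[:num_anchor])
    return depths / s, translations / s, s
```

The point of the sketch is the causality: the scale depends only on the first frames, so a frame arriving later is expressed in units that were already fixed.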

2) Local Pose-Reference Window

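The pose-reference window is the dense local evidence used to register the current frame accurately.

Anchors alone may have weak visual overlap with the current frame. With a moving camera in particular, a new frame has to be compared densely against the frames just before it for relative pose and depth to be stable. GCA therefore keeps a window of recent frames as full image tokens.

This window corresponds to local tracking. The model does not need to see the entire distant past, but the most recent frames must remain at full token resolution. The paper applies a relative pose loss to frame pairs inside this window to train local consistency directly.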

3) Trajectory Memory

Trajectory memory is a compact global record for reducing long-range drift.

With only anchors and a local window, the intermediate history disappears, leaving too few cues to correct the pose error accumulated between old frames and the current one. For frames pushed out of the window, GCT discards the full image tokens but keeps compact context tokens made up of the camera token, anchor token, and register tokens. According to the paper, 6 context tokens are kept per evicted frame.

Video RoPE is also applied to these trajectory tokens to give them temporal ordering. The reasoning is that a summary of past frames is not enough by itself; when each frame was observed has to enter the attention computation for the long-range trajectory to be interpretable.
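To see why a rotary encoding injects ordering into attention, here is a small sketch of standard 1D RoPE applied along a frame index. This is not the paper's Video RoPE, whose exact layout the review does not describe; it only illustrates the temporal axis, and the function name `rope_1d` is hypothetical.

```python
import numpy as np


def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (..., d) by angles proportional to pos."""
    d = x.shape[-1]
    assert d % 2 == 0, "feature dimension must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)       # one frequency per feature pair
    angles = np.asarray(pos)[..., None] * freqs     # (..., d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# The attention logit depends only on the frame offset (here 7), not on absolute time.
score_early = rope_1d(q, 3) @ rope_1d(k, 10)
score_late = rope_1d(q, 103) @ rope_1d(k, 110)
```

Because the logit is a function of the offset between the query frame and the memory frame, a compact token from 500 frames ago and one from 5 frames ago are no longer interchangeable members of a set.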

4) Attention Mask Design

GCA aims for a middle ground between global attention, causal attention, and sliding-window attention.

Attention pattern	Advantage	Problem
Global attention	Strong global consistency	Not streaming
Causal attention	Streaming-compatible	KV cache grows with history
Sliding window	Bounded compute	Long-range context loss
GCA	Structured long-range memory	More architecture-specific state design

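The core of GCA is that it does not view the whole past at the same granularity. Anchors are a full reference, the recent window is dense local evidence, and old frames are compact trajectory tokens. As a result, the state stays nearly bounded even on long sequences: the only part that grows is the handful of compact tokens per evicted frame.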
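A rough token count makes the difference between these patterns concrete. The sketch below rests on assumptions that are not in the review: a patch size of 14 (the DINOv2 ViT default) at the reported 518 x 378 resolution, special tokens ignored, and a single anchor frame. Only the window size of 64 and the 6 compact tokens per evicted frame come from the text.

```python
# Assumption: 14-pixel patches at 518 x 378, ignoring camera/register tokens.
TOKENS_PER_FRAME = (518 // 14) * (378 // 14)


def context_tokens(pattern, t, tokens_per_frame, num_anchor=0, window=0, compact=6):
    """Number of past-context tokens visible when processing frame t (0-indexed)."""
    if pattern == "causal":
        return t * tokens_per_frame
    if pattern == "sliding":
        return min(t, window) * tokens_per_frame
    if pattern == "gca":
        anchors = min(t, num_anchor)
        recent = min(t - anchors, window)
        evicted = t - anchors - recent
        return (anchors + recent) * tokens_per_frame + evicted * compact
    raise ValueError(f"unknown pattern: {pattern}")


for name, kwargs in [
    ("causal", {}),
    ("sliding", {"window": 64}),
    ("gca", {"num_anchor": 1, "window": 64}),  # num_anchor=1 is a hypothetical choice
]:
    print(name, context_tokens(name, 1000, TOKENS_PER_FRAME, **kwargs))
```

Under these assumptions, at frame 1000 GCA carries roughly a tenth more context than a pure sliding window, while full causal attention carries over ten times more. The numbers are illustrative, not measurements from the paper.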

5) Loss Function

The paper's loss is a composite objective of depth, absolute pose, and relative pose terms.

\[L = L_{depth} + L_{abs\_pose} + L_{rel\_pose}\]

The depth loss and absolute pose loss follow the VGGT-style definitions. The difference is that the relative pose loss is applied to frame pairs inside the local pose-reference window. Compared with fitting absolute pose alone, this loss directly constrains the relative motion between nearby frames.

In my view this part matters empirically. In streaming reconstruction, small rotation errors accumulate over a long sequence into large trajectory drift. The relative pose loss narrows the path by which these small local errors grow.
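A pairwise relative pose loss over one window could look like the sketch below. The exact form used in the paper is not given in the review, so the L1 translation error, the geodesic rotation error, the weight `w_rot`, and the camera-to-world convention are all assumptions made for illustration.

```python
import numpy as np


def relative_pose(T_i, T_j):
    """Pose of frame j expressed in frame i, for 4x4 camera-to-world matrices."""
    return np.linalg.inv(T_i) @ T_j


def rotation_angle(R):
    """Geodesic angle in radians of a 3x3 rotation matrix."""
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.arccos(cos))


def window_relative_pose_loss(pred, gt, w_rot=1.0):
    """Average relative-pose error over all frame pairs in one window."""
    total, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            dp = relative_pose(pred[i], pred[j])
            dg = relative_pose(gt[i], gt[j])
            trans_err = np.abs(dp[:3, 3] - dg[:3, 3]).sum()       # L1 on translation
            rot_err = rotation_angle(dp[:3, :3].T @ dg[:3, :3])    # angle between rotations
            total += trans_err + w_rot * rot_err
            pairs += 1
    return total / max(pairs, 1)
```

One property worth noticing is that this loss is unchanged if every predicted pose is moved by the same rigid transform, so it penalizes errors in frame-to-frame motion rather than errors in the global frame. That is why it complements the absolute pose loss instead of duplicating it.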

6) Inference Modes

LingBot-Map provides two inference modes.

Mode	Use case	Trade-off
Direct Output	When the sequence length is near the training range	More accurate because there is no inter-window alignment error
Visual Odometry mode	Very long sequences, such as tens of thousands of frames	Extends length through window resets and Sim(3) alignment, but boundary drift can occur

Direct Output mode keeps the full three-level context and outputs each frame's absolute pose and depth directly. VO mode splits the sequence into overlapping local windows and stitches them together with Sim(3) alignment. This is an engineering choice for handling arbitrarily long sequences, but alignment error can accumulate at every window boundary.
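The review does not say how the Sim(3) alignment between windows is estimated. One standard closed-form option is the Umeyama method, sketched below as an assumption rather than as the paper's procedure. Here `src` could be the camera centers of the overlapping frames as predicted in the new window and `dst` the same frames in the already-built global trajectory.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (N >= 3, not all collinear).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance between dst and src
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid returning a reflection
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)         # variance of the source points
    s = float((D * np.diag(S)).sum() / var_s)
    t = mu_d - s * R @ mu_s
    return s, R, t


def apply_sim3(points, s, R, t):
    """Map (N, 3) points from the new window into the global frame."""
    return s * points @ R.T + t
```

Each boundary contributes its own estimated (s, R, t), and any residual in one estimate is inherited by every later window, which is one way to picture the boundary drift mentioned above.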

4. Training / Data / Recipe

4-1. Data

The paper assembles a fairly large training corpus. It uses 29 datasets covering indoor, outdoor, object-centric, synthetic, and real-world scenarios. The data also comes in two formats: unordered multi-view collections and temporally ordered video sequences.

Stage 1 uses multi-view and video data together to build a broad geometric prior. Stage 2 raises the sampling weight of long-trajectory video data to fit streaming reconstruction. TartanAir, TartanAirV2, TartanAirGround, MidAir, MatrixCity, Waymo, VirtualKITTI, KITTI-360, ScanNet++, ScanNet, and an internal game dataset play an important role.

What matters here is not merely the number of datasets. Stage 1 builds the geometric prior and Stage 2 learns temporal trajectories, so the data mixture itself is tied to the architecture transition.

4-2. Training strategy

Training has two main stages.

Stage 1: Offline base model training

The ViT backbone is initialized from DINOv2.
The architecture uses VGGT-style frame attention and cross-frame attention.
This stage uses global attention, not GCA.
The number of input views is randomly sampled between 2 and 24 per sample.
Training runs for 160K iterations.
Images are resized to a maximum dimension of 518 pixels.
Photometric augmentation, co-jittering, and per-frame transforms are used to reduce appearance shortcuts.
The paper reports that stage 1 required about 21,500 GPU hours.

The purpose of Stage 1 is not streaming itself but a robust geometric prior. Depth, pose, and cross-view matching ability are first built in an offline setting where all views are visible, and then carried over to the streaming architecture.

Stage 2: Streaming model training

Training is initialized from the Stage 1 checkpoint.
Global attention is replaced with GCA.
GCA's QKV projections share the same parameterization as global attention, so the pretrained weights transfer directly.
Training again runs for 160K iterations.
The view curriculum increases linearly from 24 to 320 views.
The local pose-reference window size is randomly sampled between 16 and 64.
Context parallelism uses a Ulysses-style strategy with a parallelism dimension of 16.
The implementation is built on TorchTitan and Magi Attention.
The paper reports that stage 2 required about 15,360 GPU hours.

The key to this stage is that long sequences are not fed in from the start. Local geometry is stabilized on short sequences first, and global consistency is learned as the view count grows.
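The Stage 2 schedule described above can be written down as a small sketch. The endpoints (24 to 320 views, window size 16 to 64, 160K iterations) come from the review; that the linear ramp spans the whole of Stage 2, and that the window size is drawn uniformly over integers, are assumptions.

```python
import random


def views_at(step, total_steps=160_000, start=24, end=320):
    """Linearly interpolated view count; assumes the ramp covers all of Stage 2."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return round(start + frac * (end - start))


def sample_window_size(rng, low=16, high=64):
    """Pose-reference window size for one training sample (uniform is an assumption)."""
    return rng.randint(low, high)


rng = random.Random(0)
schedule = [(step, views_at(step), sample_window_size(rng))
            for step in (0, 40_000, 80_000, 120_000, 160_000)]
```

Randomizing the window size during training is presumably what lets a single checkpoint be run with different window lengths at inference time, though the review does not state this explicitly.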

4-3. Engineering notes

The runtime is implemented on top of FlashInfer.
FlashInfer provides paged KV cache management and attention kernels for sparse KV layouts.
The paper reports that with 1000 frames and a local window of 64, the FlashInfer implementation reached 20 FPS, while the equivalent PyTorch contiguous-KV baseline reached 10.5 FPS.
Very long sequences use adaptive keyframe selection.
Depth and pose are predicted first for each new frame, and if the optical flow magnitude relative to the most recent keyframe exceeds a threshold, the frame is added to the KV cache as a new keyframe.
When the input exceeds the maximum number of training views, only keyframes are kept to control KV cache growth.

This part matters in practice. The paper's title is about a Transformer architecture, but the actual streaming performance comes from system design that includes the paged KV cache, sparse KV kernels, and keyframe selection.
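The keyframe test can be sketched as follows. The review does not say how the optical flow is obtained; since depth and pose are predicted first, this sketch assumes the flow is the rigid flow induced by reprojecting the last keyframe's depth into the new frame. The function names, the pinhole intrinsics `K`, and the threshold are hypothetical.

```python
import numpy as np


def induced_flow_magnitude(depth, K, T_rel):
    """Mean pixel displacement when reprojecting the keyframe into the new frame.

    depth: (H, W) keyframe depth, K: 3x3 pinhole intrinsics,
    T_rel: 4x4 transform from keyframe camera coordinates to new-frame coordinates.
    Points that end up behind the new camera are not handled in this sketch.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)   # back-project to 3D
    pts_new = T_rel[:3, :3] @ pts + T_rel[:3, 3:4]            # move into the new camera
    proj = K @ pts_new
    uv_new = (proj[:2] / proj[2:3]).T                         # perspective division
    return float(np.linalg.norm(uv_new - pix[:, :2], axis=1).mean())


def is_new_keyframe(depth, K, T_rel, threshold_px):
    """Admit the frame to the KV cache only if the induced flow is large enough."""
    return induced_flow_magnitude(depth, K, T_rel) > threshold_px
```

A threshold on flow rather than on elapsed frames means a stationary camera adds almost nothing to the cache, while fast motion adds keyframes densely.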

5. Evaluation

5-1. Main results

Evaluation is split into camera pose estimation and dense 3D reconstruction. The benchmarks include Oxford Spires, ETH3D, 7-Scenes, Tanks and Temples, and NRGBD.

Oxford Spires sparse setting

Oxford Spires is a challenging benchmark that mixes indoor-outdoor transitions, revisits, and scale variation. The sparse setting samples 320 frames at a 12-frame stride to test single-pass ability within the training range.

Method type	Method	AUC@15	AUC@30	ATE
Offline	VGGT	23.84	35.09	24.78
Offline	DA3	49.84	56.68	12.87
Optimization	VIPE	45.35	51.88	10.52
Online	CUT3R	5.98	14.95	18.16
Online	Wint3R	11.61	23.42	21.10
Online	LingBot-Map	61.64	75.16	6.42

This is a strong result. LingBot-Map is a streaming online method, yet it shows higher AUC@15 and AUC@30 and lower ATE than both the offline DA3 and the optimization-based VIPE. Following the paper's interpretation, on a benchmark like Oxford Spires with large scene transitions and scale variation, the small-view training prior of offline models fits poorly and GCA's streaming context works to its advantage.

Oxford Spires dense setting

The dense setting is a stress test of long-sequence drift on the full 3,840 frames.

Method	Sparse ATE	Dense ATE	FPS
CUT3R	18.16	32.47	29.21
TTT3R	19.35	25.05	28.97
Wint3R	21.10	32.90	3.88
InfiniteVGGT	30.49	31.75	7.78
Stream3R-w	33.03	33.73	13.66
LingBot-Map	6.42	7.11	20.29

The most important number is not the dense ATE itself but the degradation from sparse to dense. LingBot-Map's ATE rises by only 0.69, from 6.42 to 7.11. This is strong evidence that the anchor, local window, and trajectory memory actually reduce long-range drift.
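The same subtraction can be applied to every row of the table. The snippet below only recomputes numbers already listed above; nothing in it is new data.

```python
# (sparse ATE, dense ATE) copied from the Oxford Spires table above.
ate = {
    "CUT3R": (18.16, 32.47),
    "TTT3R": (19.35, 25.05),
    "Wint3R": (21.10, 32.90),
    "InfiniteVGGT": (30.49, 31.75),
    "Stream3R-w": (33.03, 33.73),
    "LingBot-Map": (6.42, 7.11),
}

# Degradation when going from 320 sampled frames to the full 3,840 frames.
degradation = {name: round(dense - sparse, 2) for name, (sparse, dense) in ate.items()}

for name, delta in sorted(degradation.items(), key=lambda kv: kv[1]):
    print(f"{name:14s} +{delta:.2f}")
```

Read this way, InfiniteVGGT and Stream3R-w also degrade little, but they start from a sparse ATE above 30. Small degradation is therefore meaningful mainly in combination with a low absolute ATE, which is the case only for LingBot-Map in this table.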

Cross-dataset pose estimation

LingBot-Map also outperforms the streaming baselines on ETH3D, 7-Scenes, and Tanks and Temples.

Dataset	Metric	Best baseline	LingBot-Map
ETH3D	AUC@30	64.76	86.20
ETH3D	ATE	0.86	0.22
7-Scenes	AUC@30	73.70	78.59
7-Scenes	ATE	0.10	0.08
Tanks and Temples	AUC@30	81.33	92.80
Tanks and Temples	ATE	0.66	0.20

On 7-Scenes the gap is relatively small. A plausible reading is that these are room-scale indoor scenes where local pose estimation matters more than long-range drift, so many baselines hold up reasonably well. On ETH3D and Tanks and Temples, by contrast, long-range consistency and scale handling matter more and the gap widens.

Point cloud reconstruction

Dense reconstruction quality can be summarized by F1 score.

Dataset	Best baseline F1	LingBot-Map F1
ETH3D	77.28	98.98
7-Scenes	78.81	80.39
NRGBD	56.96	64.26

One caveat is that the reconstruction score is strongly tied to pose accuracy. When trajectory drift is large, the same surface is projected to several locations and the point cloud becomes blurred or fragmented. LingBot-Map's 3D reconstruction gain is better read as the effect of GCA's reduced pose drift carrying over into reconstruction fidelity than as a gain from the depth head alone.

5-2. What really matters in the experiments

The experiments that really matter in this paper are the ablation and the comparison with full causal attention, more than the final benchmark ranking.

Ablation

Configuration	AUC@3	AUC@30	ATE	RPE-trans	RPE-rot
Rel. Loss only	9.80	65.84	8.59	1.62	2.57
Rel. Loss + Anchor Init.	13.63	68.71	7.88	1.60	2.90
Anchor Init. + Context Tokens	13.91	68.25	8.25	1.67	5.35
Rel. Loss + Anchor Init. + Context Tokens	15.75	69.92	7.46	1.48	2.26
Full GCA components	16.39	71.87	5.98	1.33	1.93

The main takeaways from this table are as follows.

Anchor initialization contributes directly to coordinate grounding.
Context tokens keep a compact record of the intermediate history and reduce long-range drift.
Relative pose loss strengthens consistency between local frame pairs.
Video RoPE lets the trajectory memory understand temporal order.

The improvement in ATE from 7.46 to 5.98 when Video RoPE is added (the last two rows of the table) is especially important. It means that compact memory tokens are not enough by themselves; the tokens have to carry temporal order.

Pose-reference window vs full causal attention

The paper also compares the bounded pose-reference window of size 64 against full causal attention.

Setting	ATE	RPE-trans	RPE-rot	FPS	Memory GB
Window size 64	5.98	1.33	1.93	20.29	13.28
Full causal attention	6.60	1.50	1.71	11.87	36.06

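This result is quite interesting. Full causal attention sees more historical image tokens, yet the bounded window is better on ATE and RPE-trans; only RPE-rot favors full causal attention (1.71 vs 1.93). The paper's interpretation is that distant, less relevant historical tokens can act as attention noise. In other words, longer context is not always better; what matters more is which context to keep and at what granularity.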
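The efficiency columns of the same table can be put as ratios. This is plain arithmetic on the reported FPS and memory numbers, with the speedup on the left and the memory ratio of full causal attention over the bounded window on the right:

\[\frac{20.29}{11.87} \approx 1.71, \qquad \frac{36.06}{13.28} \approx 2.72\]

So the bounded window runs roughly 1.7 times faster with roughly 2.7 times less memory, while also being more accurate on ATE and RPE-trans in this comparison.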

6. Limitations

As the paper states, there is still no explicit loop-closure detection. Reducing drift further in revisited regions would require building loop closure into the attention mechanism or attaching it as a separate module.
Trajectory memory keeps only compact tokens per evicted frame, so fine-grained geometric detail can be lost at the scale of tens of thousands of frames.
As a feed-forward method, it performs no post-hoc refinement through test-time optimization or bundle adjustment. In difficult scenes an optimization backend may still help.
VO mode is useful for very long sequences, but Sim(3) alignment error can accumulate at every window boundary.
The experiments are strong, but implementation choices such as sky masking, keyframe interval, and camera head refinement iterations can affect reproduced performance.
Dynamic scenes, moving objects, and LiDAR and IMU integration remain future directions.
The 3D reconstruction score entangles pose quality and depth quality, so it should not be over-interpreted as an independent gain in depth estimation.

7. My Take

7-1. Why this matters for my work

This is a 3D reconstruction paper, but more broadly it can be read as a streaming memory architecture paper. LLM long-context and agent memory face a similar problem: carrying all the context is expensive, and summarizing it loses information. GCT addresses this not with plain summarization but with role-based memory decomposition.

Three points stood out to me most.

The anchor is a persistent reference that keeps the model from losing its coordinate frame.
The local window is the dense evidence needed for the current decision.
Trajectory memory is a global trace that compactly preserves old history.

This structure is a way of thinking that can be reused not only in robotics and embodied AI but also in video understanding, long-horizon planning, and agent memory. Rather than splitting memory only into recent context and a summary, it seems more important to first decompose which kind of information serves which downstream role.

7-2. Reuse potential

The points I would like to reuse are as follows.

Role-separated memory state
- Clearly separate recent context, anchor context, and compressed trajectory.
Compact token for evicted history
- Discard old raw tokens but keep task-specific summary tokens.
Temporal positional encoding for memory
- Make memory tokens an ordered trajectory rather than a plain set.
Progressive sequence curriculum
- Do not train on long sequences from the start; stabilize by growing the view count.
Full context is not always better
- Relevance and granularity can matter more than the amount of information fed to attention.

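The fifth point, that full context is not always better, matters most. In LLMs and multimodal models too, rather than stuffing in as much long context as possible, it may be necessary to design a context contract by role, as with anchor, local, and memory.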

7-3. Follow-up papers

VGGT: Visual Geometry Grounded Transformer
DUSt3R: Geometric 3D Vision Made Easy
CUT3R: Continuous 3D Perception Model with Persistent State
StreamVGGT: Streaming Visual Geometry Transformer
Wint3R: Window-Based Streaming 3D Reconstruction
TTT3R: 3D Reconstruction as Test-Time Training
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

8. Summary

GCT does not store every past frame in streaming 3D reconstruction; it decomposes the context into an anchor, a pose-reference window, and a trajectory memory.
The anchor handles scale and coordinate grounding, the local window handles dense local registration, and the trajectory memory handles long-range drift correction.
Training first builds an offline base model, then swaps in GCA and uses a progressive curriculum that grows the view count from 24 to 320.
In the Oxford Spires dense setting, LingBot-Map keeps ATE degradation small even at 3,840 frames and reports about 20 FPS.
Explicit loop closure, fine-grained memory loss, VO-mode boundary drift, and dynamic scene handling remain follow-up work.
