GLM-5: from Vibe Coding to Agentic Engineering Review

2026-04-26 13 분 소요

0. Introduction

GLM-5는 말 그대로 알짜 technical report다. 이 보고서의 포인트는 bigger MoE 하나가 아니라, efficient attention 선택, long-context mid-training, asynchronous agent RL infrastructure, verifiable environment scaling, internal real-world evaluation harness를 한 묶음으로 제시한다는 데 있다. 제목의 “from vibe coding to agentic engineering”도 단순한 카피가 아니라, single-turn code generation에서 multi-step plan -> search -> edit -> run -> debug로 넘어가는 문제 설정 자체를 바꾸겠다는 선언에 가깝다.

요즘 coding/agent 논문을 보면 backbone, RL, evaluator, harness가 따로따로 서술되는 경우가 많다. 하지만 실제로 긴 문맥의 coding agent를 서비스에 붙여 보면 병목은 대부분 그 사이 인터페이스에서 터진다. attention cost, rollout idle time, off-policy drift, repo exploration, UI validation, build reproducibility 같은 것들이다. GLM-5는 바로 그 interface layer를 정면으로 다룬다.

한 줄 요약: GLM-5는 744B total / 40B active MoE backbone 위에 DSA, 200K mid-training, interleaved/preserved/turn-level thinking, fully asynchronous agent RL, on-policy cross-stage distillation, and judge-driven real-world evaluation을 결합해 open-weight coding model을 “vibe coding”에서 “agentic engineering”으로 밀어 올리려는 full-stack technical report다.

이 논문을 지금 볼 가치가 있는 이유는 다음과 같음.

backbone만이 아니라 실제 agent system recipe를 같이 보여준다.
GDN, SWA, SimpleGDN 같은 efficient attention 대안을 비교해 놓고도 왜 최종적으로 DSA를 택했는지 reasoning이 비교적 잘 드러난다.
RL algorithm 하나보다 rollout infrastructure, environment construction, evaluation protocol이 더 중요해지는 현재 coding-agent 흐름을 정직하게 보여준다.

내가 보기엔 이 논문의 핵심은 “GLM이 더 커졌다”가 아니다. 오히려 open coding agent의 성능은 모델 아키텍처, RL 스케줄, context management, evaluation harness를 같이 설계해야 나온다는 점을 가장 노골적으로 보여주는 문서에 가깝다.

1. Problem Setting

1-1. Problem definition

이 논문이 겨냥하는 대상은 단순한 코드 생성기가 아니라, 긴 시간 동안 계획하고 검색하고 수정하고 검증하는 long-horizon coding/search agent다.
저자들이 말하는 vibe coding은 “첫 시도에서 그럴듯한 코드를 내는 모델”에 더 가깝고, agentic engineering은 대형 codebase를 탐색하고, 여러 step에 걸쳐 tool을 호출하고, 환경 피드백을 받아 자기 전략을 수정하는 모델에 가깝다.
이 setting에서는 다음 네 가지가 동시에 중요해진다.
1. 긴 문맥과 큰 코드베이스를 버틸 것
2. reasoning / coding / tool use / search를 한 모델 안에 묶을 것
3. 긴 rollout 동안 GPU idle time과 tail latency를 줄일 것
4. static benchmark를 넘어 실제 engineering workflow에 가까운 평가를 할 것
따라서 문제는 “더 좋은 코드 completion을 만들자”가 아니라, 실행 가능한 agent system으로서의 LLM을 어떻게 설계할 것인가다.

1-2. Why previous approaches are insufficient

dense attention은 길이가 길어질수록 비용이 빠르게 커진다. 특히 128K 이상 문맥에서는 long-horizon coding agent의 tool trace와 repo context가 누적되면서 attention cost가 바로 병목이 된다.
그렇다고 아무 efficient attention이나 붙이면 되는 것도 아니다. 논문은 SWA, GDN, SimpleGDN 같은 대안을 비교하면서, retrieval-heavy long-context task에서는 accuracy gap이 꽤 쉽게 생긴다고 보여준다.
RL도 마찬가지다. naive synchronous RL은 긴 agent rollout에서 심각한 GPU idle time을 만들고, asynchrony를 도입하면 다시 off-policy drift와 token mismatch 문제가 생긴다.
benchmark도 충분하지 않다. SWE-bench 같은 static single-commit setting은 useful하지만, repo exploration, frontend validation, multi-step chained task 같은 실제 개발 workflow의 누적 오류를 충분히 보지 못한다.
결국 이전 방식의 한계는 단순히 “모델이 약하다”가 아니다. 더 정확하게는 context efficiency, rollout systems, evaluation realism이 함께 부족하다는 점이다.

2. Core Idea

2-1. Main contribution

DSA adoption with explicit efficient-attention comparison: 단순히 sparse attention을 가져다 쓴 것이 아니라, MLA / SWA / GDN / SimpleGDN을 비교한 뒤 long-context fidelity를 보존하는 방향으로 DSA를 채택한다.
Long-context base-model recipe: 32K -> 128K -> 200K로 mid-training을 확장하고, software-engineering / long-context / agentic data를 별도 비중으로 강화한다.
Multi-stage post-training: SFT -> Reasoning RL -> Agentic RL -> General RL -> On-Policy Cross-Stage Distillation이라는 stage ordering으로, specialized capability와 broad assistant behavior를 순차적으로 맞춘다.
Asynchronous agent RL infrastructure: slime 기반의 decoupled rollout/training, Multi-Task Rollout Orchestrator, TITO gateway, double-sided importance sampling으로 long-horizon rollout 효율을 높인다.
Real-world agentic engineering evaluation: CC-Bench-V2, Agent-as-a-Judge, repo exploration, chained task evaluation 등으로 static benchmark 바깥의 성능을 측정하려고 한다.

2-2. Design intuition

이 논문의 설계 직관은 네 줄로 요약할 수 있다.

효율적 attention은 FLOPs보다 retrieval fidelity가 먼저다.
coding agent에서 긴 문맥은 대부분 “뭘 잊어도 되는가”보다 “뭘 절대 놓치면 안 되는가”가 더 중요하다. 그래서 저자들은 linear / windowed attention의 속도 이득보다, long-context retrieval gap을 더 위험하게 본다.
agentic RL은 optimization problem이면서 동시에 scheduling problem이다.
rollout이 길고 task 편차가 크면, 좋은 objective가 있어도 synchronous pipeline에서는 GPU가 놀게 된다. 따라서 이 논문은 RL을 수식만이 아니라 orchestration 문제로 다룬다.
specialization만으로는 generalist가 되지 않는다.
reasoning RL, agentic RL, general RL을 순서대로 밀면 capability regression이 누적될 수 있으므로, 마지막에 on-policy cross-stage distillation으로 다시 균형을 맞춘다.
real-world coding은 static patch accuracy보다 넓다.
frontend는 buildable해야 하고, UI interaction이 되어야 하며, backend는 unit test를 통과해야 하고, long-horizon task는 이전 step의 실수가 다음 step에 누적되지 않아야 한다.

내가 보기엔 GLM-5의 핵심은 “대형 open model 성능표”보다도, agentic engineering을 위해 무엇을 함께 최적화해야 하는지 체크리스트를 제공한다는 점이다.

3. Architecture / Method

3-1. Overview

Item	Description
Goal	long-context + long-horizon coding/search agent를 open-weight로 구현하면서 비용과 fidelity를 같이 관리
Base model	744B total / 40B active MoE, 256 experts, 80 layers
Key module	MLA + DSA + shared-parameter MTP + asynchronous slime RL + on-policy cross-stage distillation
Difference from GLM-4.5	355B->744B, 23T->28.5T, 128K->200K mid-training, 더 큰 coding/agent data와 RL infra 추가
Practical angle	verifiable SWE/terminal/search environments, context management, Agent-as-a-Judge, BF16/FP8 release artifact

3-2. Module breakdown

1) Backbone and attention efficiency

모델은 256 experts / 80 layers / 744B total / 40B active로 스케일된다.
attention 쪽은 MLA를 기본 축으로 가져가되, pre-training 과정에서 Muon Split을 도입해 GQA-8과의 성능 격차를 줄인다.
decoding cost를 줄이기 위해 MLA의 head dimension을 192에서 256으로 늘리고 head 수를 1/3 줄인 MLA-256 변형을 사용한다. training FLOPs와 parameter 수는 유지하면서 decoding 계산을 낮추는 쪽이다.
MTP는 draft model 역할도 하지만, 논문은 여기서 한 단계 더 나아가 3개의 MTP layer를 parameter sharing으로 처리해 memory cost를 늘리지 않고 acceptance length를 끌어올린다. private prompt set 기준 accept length는 DeepSeek-V3.2의 2.55보다 긴 2.76으로 보고된다.
중요한 점은, 이들이 efficient attention을 무비판적으로 고르지 않았다는 것이다. SWA interleave는 길이 일반화에서 급격한 성능 하락을 보이고, SWA pattern은 그보다 낫지만 여전히 retrieval gap이 남는다. GDN / SimpleGDN도 흥미롭지만, fine-grained retrieval task에서는 성능 하락이 남는다. 저자들은 이런 이유로 lossless by construction인 DSA를 최종 선택한다.

이 부분이 GLM-5에서 의외로 중요하다. 많은 사람은 GLM-5를 RL report로 읽겠지만, 실제로는 어떤 efficient attention을 채택할지 고르는 기준이 꽤 잘 적혀 있는 문서이기도 하다.

2) Continued pre-training and mid-training for long context

DSA는 dense base model에서 바로 처음부터 학습한 것이 아니라, continued pre-training으로 붙인다.
논문은 DSA adaptation이 dense warm-up + sparse adaptation의 두 단계로 이루어진다고 설명한다. warm-up 뒤 sparse adaptation은 20B tokens로 진행되며, long-context benchmark에서 기존 MLA와 거의 비슷한 성능을 회복한다고 보고한다.
또한 DSA는 long sequence에서 attention computation을 roughly 1.5-2x 줄이고, 128K context를 roughly half GPU cost로 다룰 수 있다고 설명한다.
mid-training은 context length를 32K (1T tokens) -> 128K (500B) -> 200K (50B)로 단계적으로 늘린다.
software-engineering 데이터는 repo-level code, commit diff, GitHub issue, pull request, relevant source file을 unified sequence로 엮는 방식이다. issue-PR pair는 약 1천만 개까지 넓히고, filtering 후 issue-PR portion이 약 160B unique tokens라고 적는다.
long-context 데이터는 자연 문서 + synthetic data를 같이 쓰며, interleaved packing, MRCR-like data, multi-file codebase 처리를 통해 lost-in-the-middle과 긴 대화 회상 문제를 줄이려 한다.

즉, GLM-5의 긴 문맥 능력은 “200K로 늘렸다”가 아니라, 어떤 데이터로 그 길이를 실제 coding/search workflow에 맞게 채웠는가를 같이 봐야 한다.

3) Post-training stack

post-training은 Figure 5 기준으로 Overall SFT -> Reasoning RL -> Agentic RL -> General RL -> On-Policy Cross-Stage Distillation 구조다.
SFT는 세 범주를 크게 다룬다.
- General Chat
- Reasoning
- Coding & Agent
SFT 단계에서 context length는 202,752 tokens까지 확장되며, 세 가지 thinking characteristic을 도입한다.
1. Interleaved Thinking: 응답과 tool call 전에 thinking
2. Preserved Thinking: coding agent 시나리오에서 thinking block 보존
3. Turn-level Thinking: turn별 reasoning on/off 제어
Reasoning RL은 GRPO 계열에 IcePop 아이디어를 접목하되 KL regularization은 제거한 형태다. 수학, 과학, 코드, tool-integrated reasoning(TIR)을 mixed RL로 학습한다.
DSA RL에서는 indexer의 top-k selection이 RL 안정성에 큰 영향을 주므로, non-deterministic operator 대신 deterministic torch.topk를 쓰고 indexer parameter는 기본적으로 freeze한다고 밝힌다.
Agentic RL은 fully asynchronous + decoupled 구조로 설계된다. 핵심은 아래 세 가지다.
1. Multi-Task Rollout Orchestrator
2. Token-in-Token-out (TITO) gateway
3. Direct double-sided importance sampling
General RL은 단순 helpfulness RLHF 서사가 아니라, foundational correctness / emotional intelligence / task-specific quality의 세 차원으로 objective를 나누고, rule-based reward + ORM + GRM을 혼합한다.
마지막의 On-Policy Cross-Stage Distillation은 earlier SFT/RL stage에서 얻은 capability를 final policy로 되가져오는 recovery step이다.

이 스택은 Nemotron-Cascade 계열과도 닮은 점이 있다. specialization stage를 계속 쌓되, 마지막에 broad policy로 다시 정리하려는 발상이기 때문이다.

4) Environment scaling and inference tricks

agentic RL을 위해 저자들은 over 10K verifiable SWE environments를 구축했다고 밝힌다. 수천 개 repo, 9개 프로그래밍 언어를 커버한다.
terminal task는 Harbor format 기반의 construction/refine pipeline으로 만들고, Docker construction accuracy는 90% 초과라고 적는다.
search task는 early-stage search agent가 본 URL을 수집해 2M+ high-information pages로 Web Knowledge Graph를 만들고, multi-hop question synthesis에 활용한다.
BrowseComp inference 쪽에서는 단순히 모델만 바꾸는 것이 아니라, keep-recent-k + discard-all 기반의 Hierarchical Context Management(HCM)를 넣는다. keep-recent-k 하나만으로도 GLM-5의 BrowseComp score를 55.3%에서 62.0%로 올리고, 최종 HCM 세팅에서는 75.9까지 간다고 적는다.
rollouts의 tail latency를 줄이기 위해 multi-node inference, FP8 rollout inference, MTP, PD disaggregation도 함께 사용한다.

여기까지 보면, GLM-5는 사실상 모델 하나보다 agent training platform에 더 가까운 그림을 보여준다.

4. Training / Data / Recipe

4-1. Data

base-model 전체 학습 budget은 28.5T tokens다.
본문은 base training이 27T corpus에서 시작했다고 설명하고, Figure 5는 pre-training을 18T general pre-training corpus + 9T code/reasoning corpus로 시각화한다.
web / code / math & science corpus 모두 filtering pipeline을 강화했고, pre-training 데이터 쪽에서는 synthetic / AI-generated / template-based data를 엄격히 피하려 했다고 적는다.
mid-training data는 크게 두 축이다.
1. long code & reasoning data
2. long context & agent data
software-engineering 데이터는 repo-level concatenation을 계속 유지하되, relevant file retrieval과 issue-level filtering을 강화해 더 풍부한 engineering context를 만든다.
search task synthesis는 2M+ web pages 규모의 WKG를 기반으로 high-difficulty multi-hop QA를 만든다.
agentic RL용 environment는 SWE / terminal / search / slide generation처럼 서로 다른 task family를 아우른다.

4-2. Training strategy

이 논문의 recipe는 아래 순서로 읽는 편이 깔끔하다.

Base pre-training / mid-training
코드와 reasoning 비중을 높인 대규모 pre-training 후, 32K -> 128K -> 200K로 context를 확장한다.
DSA sparse adaptation
dense base에서 sparse attention을 continued pre-training으로 붙인다.
Overall SFT
general chat + reasoning + coding/agent를 한 번에 넣고, thinking mode를 도입한다.
Reasoning RL
math / science / code / TIR mixed RL. training-inference mismatch를 줄이는 IcePop 계열 아이디어를 쓴다.
Agentic RL
coding/search agent에 대해 fully asynchronous rollout-training 구조를 사용한다.
General RL
broad assistant behavior와 human-style response quality를 정렬한다.
On-Policy Cross-Stage Distillation
앞선 stage checkpoint들을 teacher로 다시 사용해 capability regression을 완화한다.

중요한 건 이 순서가 임의로 붙은 것이 아니라는 점이다. reasoning sharpen -> agentic tool-use 강화 -> broad assistant behavior 회복 -> cross-stage rebalance 라는 흐름이 있다.

4-3. Engineering notes

1) DSA 선택은 “효율”이 아니라 “회수 가능한 성능” 관점이다

SWA, GDN, SimpleGDN을 비교한 뒤에도 최종적으로 DSA를 택한 이유는, retrieval-heavy task에서의 남는 gap이 coding agent에는 너무 치명적이라고 봤기 때문이다.
이 관점은 실전적이다. asymptotic complexity보다도, repo search / long QA / file retrieval에서 무엇이 얼마나 망가지는지가 더 중요하기 때문이다.

2) TITO는 사소한 구현 디테일이 아니다

asynchronous RL에서는 text를 다시 tokenize하는 순간 action-reward alignment가 깨질 수 있다.
논문이 token-in-token-out gateway를 별도 구성한 것은, tool-heavy streaming rollout에서 이런 mismatch가 실제로 충분히 심각하다는 뜻으로 읽힌다.

3) DSA RL 안정성은 deterministic operator와 indexer freezing에 기대고 있다

RL 중 DSA indexer는 생각보다 민감하다.
논문은 non-deterministic top-k operator가 RL quality를 급격히 망가뜨릴 수 있다고 보고하고, deterministic torch.topk와 indexer freezing을 사실상의 default로 둔다.

4) search agent에서는 context management가 그냥 잘 먹힌다

keep-recent-k와 discard-all을 조합한 HCM은 모델 자체를 다시 학습하지 않고도 BrowseComp를 크게 올리는 cheap trick에 가깝다.
이런 류의 inference-time policy가 model weight 못지않게 중요하다는 점을 이 paper가 잘 보여준다.

5) release artifact가 꽤 실전적이다

공식 GitHub는 GLM-5 / GLM-5-FP8 다운로드 링크를 제공하고, vLLM / SGLang 기반 local serving 예시도 같이 둔다.
즉, 이 paper는 논문으로 끝나지 않고 BF16/FP8 artifact + serving entry point까지 이어진다.

5. Evaluation

5-1. Main results

Setting	What the paper shows
Reasoning & General	HLE 30.5, HLE with tools 50.4, LongBench v2 64.5. Artificial Analysis Intelligence Index v4.0에서는 50점으로 open-weights leader라고 주장한다.
Coding	SWE-bench Verified 77.8, SWE-bench Multilingual 73.3, Terminal-Bench 2.0은 Terminus-2/Claude Code 양쪽 harness에서 56.2점대(verified 기준 60.7 / 61.1), CyberGym 43.2를 보고한다.
Agentic	BrowseComp 62.0, context management 포함 시 75.9, BrowseComp-ZH 72.7, τ²-Bench 89.7, MCP-Atlas public set 67.8, Vending-Bench 2는 $4,432다.
CC-Bench-V2	Frontend build success는 매우 높고, backend engineering Pass@1은 25.8, repo exploration Pass@1은 65.6, chained tasks Pass@1은 52.3으로 보고된다.
Evaluation harness credibility	Agent-as-a-Judge는 point-wise human agreement 94%, model ranking Spearman 85.7%를 제시한다.

이 표만 보면 GLM-5는 “다 잘한다”는 느낌이지만, 실제로는 strength가 꽤 비대칭적이다. repo exploration이나 build reliability는 강한데, chained task와 SWE-rebench 쪽에서는 아직 frontier closed model과 차이가 남는다.

5-2. What really matters in the experiments

1) 이 paper는 “큰 open model”보다 “큰 engineering recipe”에 가깝다

Figure 5는 사실상 이 논문의 핵심 그림이다.
대부분의 headline은 Table 7에 있지만, 성능 상승이 어디서 왔는지 이해하려면 DSA + mid-training + staged post-training + cross-stage distillation이 한꺼번에 묶인 Figure 5를 먼저 봐야 한다.
즉, GLM-5는 backbone contribution 하나로 설명하기 어렵다. 좋은 점이면서 동시에 attribution을 어렵게 만드는 지점이기도 하다.

2) efficient attention ablation이 의외로 중요하다

이 논문은 GDN과 SimpleGDN까지 명시적으로 비교한다.
내 해석으로는, 저자들이 DSA를 택한 이유는 “더 세련된 sparse attention이라서”보다 repo retrieval과 long-context reasoning에서 accuracy gap을 덜 남기기 위해서다.
특히 GDN/SimpleGDN 계열이 효율적이어도 fine-grained retrieval에서 128K 기준 gap을 남긴다는 관찰은, coding/search agent에서는 꽤 중요하다.

3) real-world coding 능력은 한 방향으로만 강해진 것이 아니다

frontend에서는 build success rate와 check-item 수준의 correctness가 강하지만, instance success rate에서는 Claude Opus 4.5보다 낮다.
backend engineering은 Claude와 거의 비슷한데, multi-step chained tasks는 아직 gap이 있다.
repo exploration에서는 오히려 Claude Opus 4.5보다 높게 나온다.

이 비대칭성은 중요한 힌트다. GLM-5는 파일을 찾고 맥락을 좁혀 가는 agentic search에는 강하지만, 긴 작업 체인 전체를 완결성 있게 마무리하는 능력은 아직 frontier closed model보다 약하다.

4) evaluation harness와 inference policy를 같이 봐야 한다

BrowseComp는 official OpenAI evaluation prompt와 o3-mini judge로 표준화했고, HCM이 크게 기여한다.
Terminal-Bench는 verified version을 따로 보고하며, MCP-Atlas는 public 500-task set + 10분 timeout으로 재평가한다.
CC-Bench-V2는 internal benchmark이고 Agent-as-a-Judge도 Claude 계열 도구를 쓴다.

이건 문제라기보다, coding-agent 평가는 harness-sensitive하다는 현실을 드러낸다. 따라서 GLM-5의 점수는 모델만이 아니라 prompt, judge, timeout, context policy, agent framework까지 포함해 읽는 편이 맞다.

5) long-horizon self-correction은 아직 숙제다

논문은 chained tasks에서 GLM-5가 GLM-4.7보다 확실히 좋아졌지만 Claude Opus 4.5와는 차이가 남는다고 솔직히 적는다.
저자들은 그 이유를 error compounding, long-context consistency, long-horizon self-correction 부족으로 해석한다.

이 부분은 오히려 좋다. “agentic engineering”을 말하면서도 아직 안 되는 부분을 숨기지 않기 때문이다.

6. Limitations

기여 attribution이 흐리다.
DSA, MTP, mid-training, SFT thinking mode, reasoning RL, async agent RL, general RL, cross-stage distillation, context management가 모두 함께 바뀐다. 그래서 어떤 성능 상승이 정확히 어디서 왔는지 분리하기 어렵다.
평가가 harness-sensitive하고 일부는 internal이다.
BrowseComp judge, verified Terminal-Bench, CC-Bench-V2, Agent-as-a-Judge, SWE-rebench 등은 실전성은 높지만, 타 논문과 완전히 apples-to-apples로 비교하기에는 주의가 필요하다.
frontier closed model과의 gap이 남아 있다.
frontend ISR, chained tasks, SWE-rebench 같은 end-to-end closure에 가까운 항목에서는 아직 Claude 계열과 차이가 남는다. 논문도 long-context consistency와 self-correction이 남은 과제라고 인정한다.
재현 비용이 매우 높다.
744B-A40B scale, DSA adaptation, async rollout infra, 대규모 verifiable environment construction은 대부분의 팀이 그대로 재현하기 어렵다. 따라서 이 paper는 “그대로 따라 하라”기보다 “어떤 병목을 먼저 풀어야 하는가”의 지도로 읽는 편이 낫다.

7. My Take

7-1. Why this matters for my work

이 논문은 agentic coding 모델을 리뷰할 때 architecture / RL / infra / eval을 분리해서 보면 안 된다는 점을 다시 확인시켜 준다.
특히 efficient attention을 고르는 기준, asynchronous RL의 실제 병목, repo exploration과 chained-task를 분리해 보는 평가 관점은 실무적으로 재사용 가치가 높다.
또 “좋은 coding model”과 “좋은 coding agent” 사이의 차이가 정확히 어디에서 발생하는지도 비교적 선명하게 보여준다.

7-2. Reuse potential

efficient attention selection playbook: long-context coding/search workload에서는 retrieval fidelity를 먼저 보라는 점.
TITO / DP-aware routing / PD disaggregation: long-horizon rollout infra를 설계할 때 바로 참고할 수 있다.
keep-recent-k + discard-all: 모델 재학습 없이 적용 가능한 cheap inference-time context policy다.
Agent-as-a-Judge + chained-task benchmark: internal evaluation harness를 설계할 때 꽤 좋은 출발점이다.
cross-stage distillation: 여러 specialized stage를 거친 뒤 final broad policy를 복원하는 패턴은 다른 reasoning/agent pipeline에도 이식 가능하다.

7-3. Follow-up papers

GLM-4.5: GLM-5가 무엇을 어떻게 확장했는지 보기 위한 직전 기준점
Jet-Nemotron: search-based layer selection과 efficient attention adaptation 맥락에서 연결됨
PivotRL: agentic RL을 full rollout이 아닌 selective/local update로 바꾸는 또 다른 방향
Nemotron-Cascade 2: specialization stage와 rebalancing stage를 어떻게 섞을지 보는 관점에서 좋은 비교 대상

8. Summary

GLM-5는 744B open model 소개라기보다 agentic engineering full-stack report에 가깝다.
논문의 핵심은 DSA, long-context mid-training, async RL infra, cross-stage distillation, real-world eval harness를 같이 묶었다는 데 있다.
efficient attention 비교에서 GDN/SimpleGDN까지 검토하고도 DSA를 택했다는 점은 long-context coding agent 관점에서 특히 흥미롭다.
repo exploration과 build reliability는 강하지만, chained tasks와 SWE-rebench에서는 아직 frontier closed model과 gap이 남는다.
따라서 이 paper는 “SOTA 점수표”보다 deployable coding agent recipe로 읽는 편이 훨씬 유익하다.

Twitter Facebook LinkedIn