ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration Review

2026-06-20 16 min read

0. Introduction

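It is easy to miss the point of ARIS if you read it as a paper about "AI that writes papers for you." The problem it actually targets is not automation itself but how far automated research results can be trusted. Recent agent systems have begun to handle increasingly long workflows: idea generation, literature survey, coding, experiment running, and paper writing. But the risk of a long-horizon agent does not always show up as a visible failure. The more dangerous form is one where the agent produces plausible figures and plausible claims while the link between the actual evidence and the claims is empty.

ARIS calls this failure mode plausible unsupported success. This phrase is the core of the paper. When a research agent fails outright, the failure at least shows up quickly: if the code does not run, the experiments are missing, or the paper draft looks wrong, a person can stop it. The problem is the case where a long-running agent has produced experiment logs, figures, and a manuscript, and the artifact as a whole looks quite convincing even though some claims are not sufficiently supported by the raw evidence.

So ARIS does not trust single-agent self-refinement by default. Instead, it places the executor model and the reviewer model in different model families and makes cross-model adversarial collaboration the default setting, in which the reviewer keeps critiquing intermediate artifacts and demanding revisions. More importantly, this structure is not a simple multi-agent chat. It is implemented as a research harness that includes Markdown skills, a persistent wiki, workflow contracts, evidence-to-claim audit, scientific editing, and PDF visual inspection.

One-line summary: ARIS is an open-source research harness that defines the core failure of long-horizon autonomous research as plausible unsupported success and tries to raise the reliability of research artifacts through adversarial collaboration between an executor and a reviewer from different model families, Markdown-skill-based workflows, and an evidence-to-claim assurance stack.

This paper is worth reading now for the following reasons.

It treats the bottleneck of autonomous research agents as a harness engineering problem, not as a matter of "a smarter model."
It makes cross-family adversarial review, rather than self-reflection or self-review, the basic unit.
It splits the research workflow into reusable workflows such as idea discovery, experiment bridge, auto-review loop, paper writing, and rebuttal, and places an artifact contract between stages.
It emphasizes an assurance stack that links experiment results to paper claims through a separate ledger.
It openly shows the limitation that its evaluation is still centered on observational deployment, which makes it a good paper for reading about agent research automation without hype.

At its core, ARIS is closer to an AI research operating system than to an "AI researcher." It does not claim that a single model does research well. It is an attempt to fix, at the system level, what information gets stored during research, which artifacts get verified, and which claims must be linked to evidence.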

1. Problem Setting

1-1. Problem definition

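The problem this paper targets is the reliability failure that arises in autonomous ML research workflows. The paper holds that agent performance is not determined by model weights alone. Even with the same model, the trustworthiness of the result varies greatly depending on what the harness stores, what it retrieves, which evidence it puts into context, and which artifacts it shows the reviewer.

The long-horizon research workflow that ARIS considers includes roughly the following stages.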

literature survey
idea generation
experiment planning
code implementation
experiment execution
result analysis
paper drafting
iterative review and revision
rebuttal writing

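If any one of these stages goes wrong, the final manuscript can be wrong. But the problem is not simple error propagation. A long-horizon agent keeps folding intermediate artifacts into its own narrative as it moves to the next step. An early framing error can then quietly seep into manuscript claims much later.

For example, consider the following situations.

The experiment log shows success on only some datasets, but the manuscript says the method is broadly robust.
The figure is correct, but the figure caption overstates the actual metric definition.
The ablation result is weak, but the discussion explains it as if it were a causal mechanism.
It is unclear whether a code change was also applied to the baseline, but the final claim reads purely as a method gain.
The reviewer judges only from the executor summary and does not see what is missing from the raw artifacts.

These problems are hard to see from the final output alone. So the problem definition in ARIS is more specific than "can an agent do end-to-end research?" The key question is whether the claims the agent makes are linked to raw evidence in a verifiable way.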

1-2. Why previous approaches are insufficient

The limitations of existing autonomous research agents and coding agents fall into four broad groups.

First, same-model self-refinement has difficulty avoiding correlated errors. If the same model writes the draft and the same model reviews it, the model's blind spots can survive intact. Self-reflection is better than nothing, but it is hard to call it sufficiently independent assurance.

Second, a multi-agent committee can incur high cost and coordination overhead. More agents can bring more perspectives, but in a long-horizon workflow message passing, role confusion, duplicated context, and decision latency all grow. ARIS sees the adversarial loop of an executor-reviewer pair as a more practical operating point than a large committee.

Third, existing agent workflows are weak at modular resumption. When a long research workflow fails, it often has to restart from the beginning, or a person has to reorganize the context. ARIS tries to secure checkpoint-based recovery and auditability through skills and artifact contracts.

Fourth, many systems evaluate final paper quality but do not manage evidence-to-claim linkage as a separate object. Checking whether a paper draft has improved is different from checking which result, which log, and which figure support each claim. ARIS separates this gap out into an assurance layer.

ARIS therefore redefines autonomous research as a reliable workflow design problem rather than a model capability benchmark.

2. Core Idea

2-1. Main contribution

The core contributions of ARIS can be summarized in three points.

Cross-model adversarial collaboration
- The executor model is responsible for forward progress.
- The reviewer model is chosen from a different model family and critiques intermediate artifacts.
- The reviewer is designed not to take the executor's narrative at face value but to critique based on file paths and raw artifacts.
- The goal is to reduce the correlated blind spots that same-model self-review misses.
Three-layer research harness
- The execution layer provides Markdown-defined skills, MCP model integration, a persistent research wiki, and deterministic figure generation.
- The orchestration layer ties the end-to-end research workflow together as a skill chain.
- The assurance layer handles experiment integrity, result-to-claim mapping, claim audit, scientific editing, proof checks, and rendered PDF visual inspection.
Evidence-to-claim assurance
- ARIS does not treat a claim only as final manuscript text.
- A claim must be linked to verified results and raw evidence.
- The claim ledger is used to cross-check manuscript statements against evidence sources.

Together, these three points yield the core message of ARIS.

What matters in autonomous research is not that the agent performs many steps, but that the artifact of each step remains in a state where it can be verified to legitimately support the next claim.

2-2. Design intuition

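The design intuition behind ARIS is quite practical. The most dangerous thing in research automation is not an agent that cannot do anything. The greater danger is an agent that proceeds so plausibly that a person only looks at the end result and moves on.

Preventing this requires three conditions.

First, role separation is needed. The executor that wrote the code and ran the experiments can have a framing bias toward interpreting its own results favorably. The reviewer should come from a different model family and should look at raw artifacts rather than the executor summary.

Second, state preservation is needed. Research is not a single chat completion. The experiment plan, code path, run log, figure, draft, review comment, and revision decision all build on each other. This information has to be kept in a persistent wiki and in artifact contracts so that the next step can reuse and verify it.

Third, claim grounding is needed. It is not enough for the paper text to look good. Each claim has to be traced to the evidence that supports it. This is why ARIS separates manuscript polishing from evidence audit.

In my reading, the most important shift that ARIS proposes is the following.

Existing agent: a model that finishes the task
ARIS: a harness that leaves task artifacts in a verifiable state

This difference is large. In research automation, a system that looks only at the final output can make a good demo, but in an actual research process it is hard to know which claims can be trusted. ARIS tries to solve this problem through workflow and audit design.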

3. Architecture / Method

3-1. Overview

Item	Description
Goal	Reduce unsupported claims and correlated self-review errors in autonomous ML research workflows
Core mechanism	Executor-reviewer cross-model adversarial collaboration
Execution layer	Markdown skill, MCP integration, persistent research wiki, deterministic figure generation
Orchestration layer	Idea discovery, experiment bridge, auto-review loop, paper writing, rebuttal workflow
Assurance layer	Integrity verification, result-to-claim mapping, claim audit, scientific editing, proof check, PDF visual inspection
Key assumption	Long-horizon single-agent execution is unreliable by default
Difference from self-refinement	Uses a reviewer from a different model family instead of self-critique by the same model
Difference from many-agent committee	Focuses on an executor-reviewer adversarial loop and artifact audit rather than on many agents

3-2. Module breakdown

1) Execution layer: Markdown skill as research primitive

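The execution layer of ARIS is built around reusable skills. According to the paper and its public summary, there are more than 65 core skills, and each skill is defined as a plain-text Markdown file, SKILL.md. This structure matters because a skill is not tightly bound to a particular framework or a particular host.

Each skill works like a contract that carries roughly the following information.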

input requirement
output artifact
procedural step
quality gate
failure handling
reviewer instruction

The key point is that a skill is not a simple prompt but a reviewable procedure object. A person can read it, an LLM can read it, and it can be put under version control. This property matters for securing reproducibility and auditability in a research workflow.
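As a minimal sketch of what "reviewable procedure object" could mean in practice, the check below verifies that a skill file declares all six contract fields listed above. It assumes each field appears as a level-2 Markdown heading; the heading names, `REQUIRED_SECTIONS`, and `missing_sections` are hypothetical and are not taken from the actual ARIS SKILL.md format.

```python
import re

# Hypothetical section names mirroring the six contract fields listed above.
# The real SKILL.md layout in ARIS may differ.
REQUIRED_SECTIONS = [
    "Input requirement",
    "Output artifact",
    "Procedural step",
    "Quality gate",
    "Failure handling",
    "Reviewer instruction",
]


def missing_sections(skill_markdown: str) -> list[str]:
    """Return the contract sections that the Markdown text does not declare."""
    # Collect the text of every level-2 heading ("## Title").
    headings = {
        m.group(1).strip().lower()
        for m in re.finditer(r"^##\s+(.+)$", skill_markdown, flags=re.MULTILINE)
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


example_skill = """# run-ablation

## Input requirement
experiment_plan.md

## Output artifact
runs/ablation.log

## Procedural step
1. Load the plan. 2. Run each variant.

## Quality gate
All variants finish without error.
"""

print(missing_sections(example_skill))
```

Because the skill is plain text, a lint step like this can run in version control before any model ever reads the file.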

The execution layer also includes MCP-based model integration. The aim is to let the same skill run on a variety of agent hosts such as Claude Code, Codex CLI, and Cursor. This is the reason ARIS presents itself as a research harness rather than a specific model product.

Another important element is the persistent research wiki. The research wiki stores previous ideas, experiments, findings, claims, and failure modes to enable iterative reuse. This resembles memory systems in the style of GenericAgent. In ARIS, however, the purpose of memory is closer to scientific artifact reuse and audit than to general task reuse.

2) Orchestration layer: chaining skills into workflows

The orchestration layer of ARIS ties individual skills into end-to-end workflows. According to the public summary, the main workflows are as follows.

Workflow	Role
Idea discovery	Literature survey, novelty check, experiment planning
Experiment bridge	Turns the experiment plan into runnable code and an execution log
Auto-review loop	Iteratively improves the paper and results based on reviewer critique
Paper writing	Turns a narrative report into a structured manuscript and PDF
Rebuttal	Analyzes reviews and composes a response draft

These workflows map onto research phases such as Discovery, Experimentation, Manuscript, and Post-Submission. The important point is that a workflow is not a linear script but has an adjustable effort setting and reviewer routing. A user can run it in a quick sanity-check mode or in a stronger assurance mode.

The core of the orchestration layer is making the artifact boundary between research steps explicit. For example, the output of idea discovery should be an experiment plan, not just a chat transcript. The output of the experiment bridge should be runnable code plus an experiment log, not just code. The auto-review loop should leave behind a reviewer critique, an action recommendation, and a revision artifact, not just a rewritten draft.

Later audit is possible only when these boundaries exist.

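3) Cross-model adversarial review: why a different model family matters

In ARIS, the executor and the reviewer use different model families wherever possible. For example, if the executor is from the Claude family, the reviewer uses another family such as GPT, Gemini, or GLM. The paper makes this the default policy for reducing self-review blind spots.

What matters is that the reviewer does not simply read the executor summary. The public description emphasizes handing the reviewer file paths and having it look at raw artifacts directly. This choice looks small, but it is very important. When the executor summarizes its own results, the summary itself becomes framing. Independence arises only when the reviewer looks directly at the raw code, run log, figure, and manuscript section.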
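A rough sketch of the routing policy described here: pick a reviewer family that differs from the executor's, and build a review request from raw file paths rather than from a summary. `ReviewRequest`, `pick_reviewer`, and `build_request` are hypothetical names, and the family strings and paths are placeholders, not ARIS identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRequest:
    reviewer_family: str
    artifact_paths: tuple[str, ...]  # raw files, not an executor-written summary
    rubric: str


def pick_reviewer(executor_family: str, available: list[str]) -> str:
    """Pick the first available family that differs from the executor's."""
    for family in available:
        if family.lower() != executor_family.lower():
            return family
    raise ValueError("no cross-family reviewer available")


def build_request(executor_family: str, available: list[str],
                  artifact_paths: list[str], rubric: str) -> ReviewRequest:
    """Refuse to build a review request that carries no raw artifact."""
    if not artifact_paths:
        raise ValueError("reviewer must receive at least one raw artifact path")
    return ReviewRequest(pick_reviewer(executor_family, available),
                         tuple(artifact_paths), rubric)


req = build_request("claude", ["claude", "gpt", "gemini"],
                    ["runs/exp1.log", "paper/sec4.tex"],
                    "check claims against logs")
print(req.reviewer_family)
```

The request type deliberately has no field for an executor summary, so the framing problem described above cannot enter through the interface.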

The review loop can be viewed in the following form.

The executor generates an artifact.
The reviewer critiques the artifact against a rubric.
The critique is converted into action items.
The executor carries out the action items.
This repeats until a quality threshold or a maximum number of rounds is reached.
If needed, a third model provides a rescue diagnosis.

This structure is adversarial but not fully open-ended. Review, action extraction, revision, and the convergence check are all fixed inside the workflow. The multi-agent collaboration in ARIS is therefore closer to an artifact-centered adversarial control loop than to a free-form debate.
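The control loop above can be sketched in a few lines. This is a simplified illustration, not the ARIS implementation: the `Reviewer` and `Executor` signatures, the numeric score, and the default threshold are all assumptions, the model calls are replaced by toy functions, and the third-model rescue step is left out.

```python
from typing import Callable

# Both callables are stand-ins for model calls; their signatures are assumptions.
Reviewer = Callable[[str], tuple[float, list[str]]]  # artifact -> (score, action items)
Executor = Callable[[str, list[str]], str]           # (artifact, actions) -> revised artifact


def review_loop(artifact: str, execute: Executor, review: Reviewer,
                threshold: float = 0.8, max_rounds: int = 3) -> tuple[str, float, int]:
    """Iterate critique -> action items -> revision until the score passes or rounds run out."""
    score, actions = review(artifact)
    rounds = 0
    while score < threshold and rounds < max_rounds:
        artifact = execute(artifact, actions)
        score, actions = review(artifact)
        rounds += 1
    return artifact, score, rounds


# Toy stand-ins: the reviewer penalizes the word "broadly", the executor removes it.
def toy_review(text: str) -> tuple[float, list[str]]:
    if "broadly" in text:
        return 0.4, ["remove the unsupported word 'broadly'"]
    return 0.9, []


def toy_execute(text: str, actions: list[str]) -> str:
    return text.replace("broadly ", "") if actions else text


final, score, rounds = review_loop("The method is broadly robust.", toy_execute, toy_review)
print(final, score, rounds)
```

The `max_rounds` bound is what keeps the loop from becoming an open-ended debate: the loop always terminates, even when the reviewer is never satisfied.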

4) Assurance layer: evidence-to-claim audit cascade

The most important part of ARIS is the assurance layer. The paper sets up a three-stage audit process to check whether experimental claims are supported by evidence.

Stage	Question	Output
Integrity verification	Is the experiment artifact itself trustworthy?	Verified or suspicious result
Result-to-claim mapping	Which result supports which claim?	Claim ledger
Claim auditing	Does the manuscript claim match the ledger and the raw evidence?	Claim-level revision request

More concretely, a claim should have the following structure.

\[ \text{claim}_i \rightarrow \left(\text{result}_j,\ \text{artifact}_k,\ \text{audit\_status}_i\right) \]

The important point is that a claim is not a text span but a traceable object. If the manuscript says "our method improves robustness," that claim has to be linked to the table, run log, and baseline condition that support it.

This structure matters a great deal in actual paper writing. Even when people write papers, the most common mistake is to state a claim slightly more strongly than the evidence allows. An agent can make this mistake much faster and much more plausibly. ARIS tries to narrow this gap through the claim ledger and a zero-context reviewer audit.
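To make the "claim as traceable object" idea concrete, here is a minimal sketch of a ledger entry and an audit pass. The field names, status strings, and the rule "supported only if it cites a result and every cited artifact is verified" are assumptions for illustration; the paper's actual ledger schema is not reproduced here, and the claims and paths are made up.

```python
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    claim: str
    results: list[str] = field(default_factory=list)    # e.g. table or metric identifiers
    artifacts: list[str] = field(default_factory=list)  # e.g. run log or figure paths
    audit_status: str = "unaudited"


def audit(entries: list[LedgerEntry], verified_artifacts: set[str]) -> list[LedgerEntry]:
    """Mark a claim supported only if it cites a result and every cited artifact is verified."""
    for e in entries:
        linked = bool(e.results) and bool(e.artifacts)
        verified = all(a in verified_artifacts for a in e.artifacts)
        e.audit_status = "supported" if linked and verified else "unsupported"
    return entries


ledger = [
    LedgerEntry("Method improves accuracy on dataset A", ["table2/row3"], ["runs/a.log"]),
    LedgerEntry("Method is broadly robust", ["table2/row3"], ["runs/b.log"]),
    LedgerEntry("Gains come from the new loss term"),
]
for entry in audit(ledger, verified_artifacts={"runs/a.log"}):
    print(entry.audit_status, "|", entry.claim)
```

The second and third entries come back unsupported for different reasons: one cites an artifact that failed integrity verification, the other cites nothing at all. Both would become claim-level revision requests.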

5) Scientific editing, proof checks, PDF visual inspection

ARIS does not cover evidence audit alone. The paper writing stage also includes a scientific editing pipeline. According to the public summary, five-pass scientific editing, mathematical proof checks, and rendered PDF visual inspection are part of the assurance layer.

This part is interesting because paper quality is not just text quality.

Is the logical flow sound?
Are the derivations correct?
Do the figures and captions match?
Do the table values match the claims in the text?
Does the layout hold up in the rendered PDF?
Do the citations actually support the claims?

An LLM can produce a draft that looks good as text, but the final PDF can still have problems such as figure overflow, broken references, and unreadable tables. Including rendered PDF visual inspection is a practical choice, because the research artifact that finally gets read is the PDF, not the Markdown or LaTeX source.

6) Self-improvement loop: the harness is also a target for improvement

ARIS also mentions a prototype self-improvement loop. This loop records research traces and proposes harness improvements. There is an important constraint, however: a proposed improvement is adopted only after reviewer approval.

This design is very important in agent systems. If an agent freely rewrites its own skills or workflows, it can adapt quickly, but it can also create unsafe shortcuts or unverified rules. ARIS allows self-improvement but makes it subject to adversarial review as well.

This is the most promising long-term aspect of ARIS. A good research harness is not a fixed workflow. It should discover failure modes over repeated runs and improve its skills. But those improvements must be auditable too.

4. Training / Data / Recipe

4-1. Data

ARIS is not a model training paper. Reading it for the usual data section of an ML paper, with a pretraining dataset, SFT data, or RL reward data, therefore misses the point. The data that matters here is the research workflow artifact.

The data objects that ARIS handles are roughly as follows.

skill definition
research wiki entry
idea report
experiment plan
code artifact
run log
figure artifact
result summary
claim ledger
reviewer critique
revision patch
manuscript draft
rendered PDF
rebuttal draft

In other words, the data in ARIS is not a training corpus but a versioned artifact collection produced by the research process. Claim audit and reproducibility checks are possible later only if this data is kept in a structured form.

The paper presents the skill inventory as a framework baseline. From the public summary alone, however, it is hard to confirm the exact counts per skill category, the version, the release snapshot, and the coverage table. It is worth rechecking Table 5 in the original paper before publishing this post.

4-2. Training strategy

ARIS has no gradient training strategy in the usual sense. It has a workflow execution strategy instead, which is better read as an operational recipe than as a training recipe.

The overall operating flow can be viewed as follows.

A research direction is given as input.
The idea discovery workflow reviews the literature and novelty and produces an experiment plan.
The experiment bridge workflow turns the plan into runnable code and a run log.
The auto-review loop iteratively improves the draft and results based on reviewer critique.
The paper writing workflow produces the manuscript and PDF.
The assurance layer performs the evidence-to-claim audit and the editing passes.
The rebuttal workflow produces the review response.

The important point is that each workflow is connected to the next by an artifact contract. For example, if the experiment bridge generates code without an experiment plan, auditability drops. If manuscript claims are written without a run log, result-to-claim mapping becomes weak. ARIS reduces this problem by declaring each stage's output explicitly as the next stage's input.

4-3. Engineering notes

From the viewpoint of designing real agent workflows, ARIS offers many engineering notes worth reusing.

Show the reviewer artifacts, not a summary
- The executor summary can cause framing contamination.
- The reviewer should look directly at the raw code, log, figure, table, and draft.
Keep skills as Markdown contracts, not code
- Markdown skills are easy for people to read and edit.
- LLMs can read them as-is, which gives high host portability.
- They also work well with version control and community contribution.
Keep the claim ledger as a separate artifact
- Claims buried inside the manuscript text are hard to audit.
- Claim, result, artifact path, and audit status should be managed separately.
Figure generation should be deterministic
- A paper figure is an evidence artifact, not narrative decoration.
- The figure script, data source, and output path should be reproducible.
PDF visual inspection is not an optional luxury
- The final artifact is the PDF.
- Even if the source is correct, a broken rendered output undermines the reviewer experience and reproducibility.
Self-improvement should also be subject to review
- An agent improving its own workflow is useful but risky.
- A proposed harness improvement should go through reviewer approval.

5. Evaluation

5-1. Main results

The evaluation of ARIS differs from that of a typical benchmark paper. According to the public summary, it is centered on observational deployment tracking and a single overnight operational run. The paper does not lead with a controlled evaluation of the form "the agent scored a few percent higher on a given benchmark."

The main evaluation axes are as follows.

Evaluation axis	What it checks
Cross-family review	Whether the reviewer is from a different model family than the executor
Adversarial review	Whether artifact critique and revision requests are part of the workflow
Composability	Whether skills and workflows compose modularly
E2E research workflow	Whether the pipeline connects from idea to paper or rebuttal
Assurance stack	Whether evidence, claims, citations, and PDF quality are audited
Cross-platform portability	Whether it can be used on hosts such as Claude Code, Codex CLI, and Cursor

The paper emphasizes that ARIS integrates composable skills, end-to-end workflows, an assurance stack, and cross-platform portability. But this result is closer to a feature coverage comparison. Reading it as "ARIS causally improved research quality" would be an overstatement.

The main result of this paper is operational feasibility, not a performance score. ARIS shows that an autonomous research workflow can actually be split into multiple stages and run with adversarial review and assurance gates in place.

5-2. What really matters in the experiments

There are three evaluation points that matter most in this paper.

1) Is unsupported claim pruning actually possible?

If the core failure mode of ARIS is plausible unsupported success, the evaluation should look at claim pruning. The observation that a single overnight run pruned unsupported claims and iteratively refined the manuscript is interesting. But since this is not a controlled benchmark, it still needs to be checked which kinds of claims were reduced and by how much, and what the precision and recall are relative to a human expert.
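The precision and recall question can be made concrete with a short sketch. It assumes a set of expert labels marking which claims are truly unsupported, which the paper does not report; `pruning_precision_recall` and the claim IDs are hypothetical and used for illustration only.

```python
def pruning_precision_recall(pruned: set[str],
                             truly_unsupported: set[str]) -> tuple[float, float]:
    """Score a pruning pass against expert labels.

    precision = share of pruned claims that experts also judged unsupported
    recall    = share of expert-labeled unsupported claims that were pruned
    """
    hits = len(pruned & truly_unsupported)
    precision = hits / len(pruned) if pruned else 0.0
    recall = hits / len(truly_unsupported) if truly_unsupported else 0.0
    return precision, recall


# Made-up claim IDs, for illustration only.
p, r = pruning_precision_recall(pruned={"c1", "c2", "c5"},
                                truly_unsupported={"c1", "c2", "c3", "c4"})
print(round(p, 3), round(r, 3))
```

Low precision would mean the reviewer is pruning claims that were actually fine, and low recall would mean plausible unsupported claims are still getting through.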

2) Is a cross-family reviewer actually better than a same-model reviewer?

The core design of ARIS is cross-family adversarial review. But judging from the public summary, there is not enough of a reviewer family ablation. For example, the following comparison would provide stronger evidence.

same-model self-review
same-family reviewer
cross-family reviewer
multi-reviewer committee
human expert reviewer

Only with this comparison can the causal value of the cross-family design be stated more precisely. At this stage, it is safer to read ARIS as presenting a design philosophy and a practical harness.

3) The trade-off between assurance overhead and research throughput

Strong assurance has a cost. Reviewer calls, artifact audit, PDF inspection, and claim ledger construction all consume time and tokens. For ARIS to be useful in a real research environment, how much assurance slows research down has to be weighed together with how much it reduces unsupported claims.

This paper is best read as a reliability-first system, so evaluating it by a throughput benchmark alone misses the point. Conversely, practical adoption will be difficult if the reliability gain cannot be quantified. A reliability-cost Pareto curve will become important in follow-up work.
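As a sketch of what a reliability-cost Pareto analysis might look like, the function below keeps only the configurations that no other configuration beats on both cost and unsupported-claim rate. The configuration names and numbers are invented for illustration and are not measurements from the paper.

```python
def pareto_front(configs: list[tuple[str, float, float]]) -> list[str]:
    """Return names of configurations not dominated on (cost, unsupported-claim rate).

    Each tuple is (name, cost, unsupported_rate); lower is better on both axes.
    """
    front = []
    for name, cost, rate in configs:
        # A configuration is dominated if another is no worse on both axes
        # and strictly better on at least one.
        dominated = any(
            c <= cost and r <= rate and (c < cost or r < rate)
            for _, c, r in configs
        )
        if not dominated:
            front.append(name)
    return front


# Made-up numbers, not measurements from the paper.
runs = [
    ("no_review", 1.0, 0.30),
    ("self_review", 2.0, 0.25),
    ("cross_family", 2.5, 0.10),
    ("committee", 6.0, 0.12),
]
print(pareto_front(runs))
```

In this toy data the committee setting drops out because the cross-family setting is both cheaper and more reliable. Whether that holds in practice is exactly what a controlled study would need to show.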

6. Limitations

The evaluation is still observational
- According to the public summary, the ARIS evaluation is centered on deployment tracking and a single overnight run.
- This is meaningful as feasibility evidence, but it is not enough to establish the causal contribution of the cross-model reviewer or the assurance layer.
The effect of cross-family review is not sufficiently isolated
- The core of ARIS is using a reviewer from a different model family.
- But more controlled ablations against same-model self-review, same-family review, and human review are needed.
It is hard to evaluate the novelty of the science the agent produces
- A workflow that can improve a manuscript is different from one that can produce a genuinely new scientific contribution.
- Novelty, correctness, reproducibility, and usefulness should be assessed separately.
The assurance stack can add substantial token and time cost
- Evidence audit, the claim ledger, and the reviewer loop all have a cost.
- In an actual research team, the assurance level should be tuned to the task risk.
Markdown skills are portable, but execution safety is a separate issue
- Defining a skill in Markdown makes it easy to read, but actual execution touches the shell, code, GPU, and file system.
- Sandboxing, permission boundaries, secret handling, and data governance are required.
The reviewer can hallucinate too
- A reviewer from a different model family is not always right.
- The reviewer can misread raw artifacts or give overly conservative critique.
- The reviewer verdict should therefore be subject to audit as well.
Claim ledger granularity is hard to get right
- If claims are too coarse, unsupported subclaims get buried.
- If claims are too fine-grained, overhead grows.
- The appropriate claim unit may differ by domain.

7. My Take

7-1. Why this matters for my work

This paper is better read as a research workflow reliability paper than as an agent paper. Recent agent systems offer plenty of capability demos, but once you try to put them into an actual research process, the following questions become more important.

Can this result be trusted?
Which artifact supports this claim?
How were intermediate experiment failures reflected in the manuscript narrative?
Is the reviewer simply accepting the executor's framing?
Do the figures, tables, claims, and citations agree with each other?

ARIS gives a fairly clear system-level answer to these questions. The answer is not one bigger model but a harness with artifact boundaries, reviewer independence, a claim ledger, and audit gates.

What I liked in particular is that ARIS does not separate automation from assurance. Automation systems usually emphasize end-to-end speed. But in research, claim discipline matters as much as speed. ARIS makes it central to leave the research artifacts an agent produces in a reviewable structure.

7-2. Reuse potential

The points I would like to reuse are as follows.

Executor-reviewer family split
- Attaching a reviewer from a different family can be a more realistic way to reduce blind spots than same-model self-review.
Reviewer sees raw artifacts
- Give the reviewer file paths and raw output, not a summary.
- This applies directly to coding agents, document AI, and data analysis agents.
Claim ledger
- It can be used not only for papers but also for reports, benchmark analyses, and product metric reports.
- Every claim should be linked to an evidence artifact.
Markdown skill contract
- A procedure format that is both LLM-readable and human-editable is practical.
- Complex workflows can be managed as text contracts rather than through a code framework.
Rendered artifact inspection
- For final artifacts such as PDFs, slides, dashboards, and charts, inspect the rendered output, not the source.
Assurance level as config
- Not every task needs a beast mode audit.
- It is practical to apply a strong audit to high-risk tasks and a quick review to low-risk ones.
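The "assurance level as config" idea could look something like the sketch below. The preset names, fields, and values are hypothetical choices for illustration and are not ARIS defaults or its actual effort settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssuranceConfig:
    max_review_rounds: int
    claim_audit: bool
    pdf_inspection: bool


# Hypothetical presets; the names and values are illustrative, not ARIS defaults.
PRESETS = {
    "low": AssuranceConfig(max_review_rounds=1, claim_audit=False, pdf_inspection=False),
    "medium": AssuranceConfig(max_review_rounds=2, claim_audit=True, pdf_inspection=False),
    "high": AssuranceConfig(max_review_rounds=4, claim_audit=True, pdf_inspection=True),
}


def config_for(task_risk: str) -> AssuranceConfig:
    """Map a task risk label to an assurance preset, failing loudly on unknown labels."""
    try:
        return PRESETS[task_risk]
    except KeyError:
        raise ValueError(f"unknown risk level: {task_risk!r}") from None


print(config_for("high"))
```

Failing on an unknown label, rather than falling back to the cheapest preset, avoids silently running a high-risk task with a weak audit.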

7-3. Follow-up papers

The AI Scientist
Agent Laboratory
AutoResearch
GenericAgent
SWE-agent
OpenHands
Voyager
Reflexion
Self-Refine
MetaGPT

8. Summary

ARIS defines the core failure of autonomous research as plausible unsupported success.
Its solution is not single-agent self-refinement but cross-model adversarial collaboration with a reviewer from a different model family.
The system consists of three layers (execution, orchestration, and assurance) and runs on Markdown skills and artifact contracts.
The most important design element is the evidence-to-claim audit. What matters is whether each claim is linked to raw evidence, more than whether the manuscript reads well.
The evaluation is still largely observational, so the causal effect of cross-family review and the assurance stack needs follow-up controlled benchmarks.
