Eywa: Heterogeneous Scientific Foundation Model Collaboration Review

2026-06-17 14 min read

0. Introduction

Heterogeneous Scientific Foundation Model Collaboration is a paper whose point is easy to miss if you read it as “yet another multi-agent framework for science.” The problem it actually targets is not a shortage of agents. Rather, it confronts the fact that the LLM-agent habit of serializing everything into natural language is structurally weak in scientific domains.

Scientific problems do not consist only of natural-language questions. Time series, tabular features, molecular structures, material properties, clinical signals, and economic indicators each have a representation that a model is meant to operate on. In an LLM agent pipeline, however, such data is usually converted to text, or a domain-specific model is called as a plain tool. The domain model then takes no part in the high-level reasoning loop, and the LLM reasons after the structured signal has already been lost.

Eywa's message is simple. For scientific agents, making LLMs talk to each other more is not enough. What is needed is a contract that connects the language model's reasoning interface to the native computation of a domain-specific foundation model. The paper calls this contract the Tsaheylu bond and extends it to single-agent, multi-agent, and dynamic orchestration settings.

One-line summary: Eywa is a heterogeneous agentic framework in which, instead of having the LLM agent flatten all scientific data into text, domain-specific foundation models are connected through a query compiler and a response adapter so that they collaborate directly inside the scientific reasoning loop.

The paper is worth reading now for the following reasons.

It reframes the bottleneck of scientific AI agents as a language-only interface bottleneck rather than the lack of “a stronger LLM.”
It brings domain-specific models into the agentic system as specialized actors rather than passive tools.
Through EywaAgent, EywaMAS, and EywaOrchestra, it organizes single agent, fixed multi-agent, and dynamic routing into one design space.
Through EywaBench, it measures utility together with token cost on scientific tasks that mix natural language, time series, and tabular data.
In practical terms, it makes you think about how to attach schemas, compilers, adapters, and planners to a scientific workflow.

Rather than proposing “one more scientific foundation model,” this is closer to a systems paper about which interface should be used to bring existing specialized scientific models into an LLM agent system.

1. Problem Setting

1-1. Problem definition

The problem this paper targets is that LLM-centric agents struggle with structured, non-linguistic data in scientific domains.

A typical LLM agent converts the problem into a natural-language context and then iterates over reasoning, planning, and tool calls. This works well for language-centric tasks such as document QA, code assistants, and web browsing. In scientific domains, however, the original structure of the data often matters more.

In time series forecasting, for example, temporal resolution, covariate relations, normalization, and missing-value patterns matter. In tabular prediction, feature distributions, categorical interactions, and target masking matter. In materials or bio domains, graphs, coordinates, sequences, and physical constraints become important. Converting all of this into text may make it easier for an LLM to read, but it weakens the inductive bias that the domain model relied on.

Eywa's problem setting therefore comes down to the following questions.

Can the LLM handle reasoning and orchestration while the domain-specific foundation model handles native computation?
Can the domain model's output be turned into a language-compatible representation that the LLM can use again in its reasoning?
Can this heterogeneous collaboration be given a consistent interface across single-agent, multi-agent, and dynamic routing settings?

In other words, the point is not “let the LLM solve science better.” More precisely, it is to let the LLM and the scientific FM collaborate while each keeps its own representation space.

1-2. Why previous approaches are insufficient

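Existing approaches fall short in four main ways.

First, the serialization bottleneck. Converting structured data into a text prompt lets the LLM read it, but the original domain structure is weakened. Listing a long table as text or feeding a time series as a sequence of numbers is expensive in tokens and does a poor job of preserving relations between features.

Second, the passive tool problem. A domain-specific model can be attached as a tool, but a tool usually remains a passive component that receives a query chosen by the LLM and returns an output. In that structure the domain model does not take part directly in planning or decision making.

Third, the limits of homogeneous multi-agent systems. Combining several LLM agents enables debate and reflection. But if every agent performs language-only reasoning, the inductive bias of models that are good at specialized computation, such as time series or tabular prediction, still goes unused.

Fourth, the inefficiency of a fixed topology. Some tasks need only a single domain FM, some are fine with LLM-only reasoning, and some require several agents to collaborate. Fixing one multi-agent topology up front means some tasks pay for collaboration they do not need while others get less than they need, so the cost-utility trade-off is rarely right for a given task.

Eywa treats this as a system interface problem. Instead of keeping the structure in which the LLM agent consumes everything as text and only improving the prompt, it explicitly designs the boundary between the LLM and the scientific FM.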

2. Core Idea

2-1. Main contribution

Eywa's core contributions can be grouped into three.

Tsaheylu bond
- A bidirectional interface between the LLM and a domain-specific foundation model.
- The query compiler turns the LLM's task state into a structured invocation that the FM can accept.
- The response adapter turns the FM output into a representation that can be fed back into the LLM reasoning loop.
Three-level agentic design
- EywaAgent connects the LLM and a scientific FM inside a single-agent pipeline.
- EywaMAS plugs EywaAgent into an existing multi-agent system to create heterogeneous collaboration.
- EywaOrchestra has a planner dynamically select the agent configuration, model, and topology for each task.
EywaBench
- A scientific benchmark that mixes natural language, time series, and tabular data.
- It spans physical, life, and social science domains, covering 9 sub-domains and 27 domain-modality cells.
- It reports token consumption and inference time alongside utility.

An important point of the paper is that Eywa is a design space rather than a single recipe. It can be used as a simple tool-use agent, as a way to swap out some of the agents in a multi-agent system, or all the way up to dynamic orchestration with a planner.

2-2. Design intuition

Eywa's design intuition is very practical.

LLMs are strong at general reasoning, task decomposition, and instruction following. Domain-specific FMs, in contrast, are strong on a particular data modality and prediction task. In scientific tasks the goal is not to choose one of the two but to have them play different roles.

The Tsaheylu interface divides the roles as follows.

LLM: interprets the task state and decides which specialized computation is needed.
Query compiler: converts the LLM state into the FM input schema.
Scientific FM: performs prediction or analysis on the native representation.
Response adapter: converts the FM output into structured text or a structured message that the LLM can use again.
Agent controller: decides whether the next step should invoke the FM again or proceed with language-only reasoning.

Written in its simplest form, this is:

\[u_k = \phi_k(s_t)\] \[o_k = F_k(x, u_k)\] \[z_k = \psi_k(o_k)\]

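Here $s_t$ is the current task state, $F_k$ is the domain-specific foundation model, $u_k$ is the FM invocation, $o_k$ is the FM output, and $z_k$ is the adapted response that the LLM can read again. $\phi_k$ is the query compiler, $\psi_k$ is the response adapter, and $x$ is, presumably, the task's native structured input.

What matters in this structure is that the LLM does not just “try calling” the FM. The input/output space that the FM handles is pinned down as a schema. The agent communication interface is thus made to resemble a typed contract, rather than leaving everything to prompt engineering.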
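The three equations can be read as a small typed contract. The following is a minimal sketch under stated assumptions: the names `TsaheyluBond` and `run_bond` are hypothetical and do not come from the paper, and the "model" is a toy mean predictor standing in for a real foundation model.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # task state s_t held by the LLM
U = TypeVar("U")  # structured invocation u_k
X = TypeVar("X")  # native structured input x
O = TypeVar("O")  # raw FM output o_k
Z = TypeVar("Z")  # adapted response z_k


@dataclass(frozen=True)
class TsaheyluBond(Generic[S, U, X, O, Z]):
    """Hypothetical container for one compiler / model / adapter triple."""
    compile_query: Callable[[S], U]   # phi_k: state -> invocation
    model: Callable[[X, U], O]        # F_k: native input + invocation -> output
    adapt_response: Callable[[O], Z]  # psi_k: output -> LLM-readable response


def run_bond(bond: TsaheyluBond[S, U, X, O, Z], state: S, x: X) -> Z:
    """Apply u_k = phi_k(s_t), o_k = F_k(x, u_k), z_k = psi_k(o_k) in order."""
    u = bond.compile_query(state)
    o = bond.model(x, u)
    return bond.adapt_response(o)


# Toy instantiation: the "FM" just repeats the historical mean (an assumption).
bond = TsaheyluBond(
    compile_query=lambda state: {"horizon": state["horizon"]},
    model=lambda x, u: [sum(x) / len(x)] * u["horizon"],
    adapt_response=lambda o: f"forecast for {len(o)} steps: {o}",
)
message = run_bond(bond, {"horizon": 2}, [1.0, 2.0, 3.0])
print(message)
```

The raw series never enters the message that goes back to the LLM; only the adapted response does. That is the separation the contract is meant to enforce.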

3. Architecture / Method

3-1. Overview

Item	Description
Goal	Extend LLM-centric agents into a heterogeneous agent system that can collaborate with structured scientific FMs
Core interface	Tsaheylu bond: query compiler plus response adapter
Single-agent variant	EywaAgent
Multi-agent variant	EywaMAS
Dynamic variant	EywaOrchestra
Benchmark	EywaBench, natural language / time series / tabular scientific tasks
Main claim	Cross-modality heterogeneity is more useful on scientific tasks than LLM-only agent collaboration
Practical angle	Connect scientific FMs to the agent loop through schemas, adapters, and planners

3-2. Module breakdown

1) Tsaheylu interface

The Tsaheylu interface is Eywa's smallest building block. The name is borrowed from the bonding concept in Avatar, but from a systems point of view it is a typed communication contract between the FM and the LLM.

It consists of two functions.

query compiler $\phi_k$
- Converts the LLM's current state $s_t$ into a structured input $u_k$ that the domain-specific FM can understand.
- In a time series task, for example, it has to produce settings such as target variable, forecast horizon, covariate selection, and history window.
- In a tabular task it has to line up the feature schema, target type, train-test split, metric type, and so on.
response adapter $\psi_k$
- Converts the FM output $o_k$ into a representation $z_k$ that the LLM can use again.
- Rather than passing only the raw prediction, it can also include confidence, an error summary, the predicted trend, a class decision, explanation hints, and similar fields.

The purpose of this interface is to let the LLM use the FM's native inference without fine-tuning the FM. Eywa does not distill the scientific FM into the language model. It keeps the FM as an external actor and builds a communication layer instead.
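To make the response-adapter role concrete, here is a sketch that turns quantile forecasts into a compact evidence record. The input layout (`q10`, `q50`, `q90` lists) and the output field names are assumptions made for this example; they are not the output format of Chronos or of the paper's adapter.

```python
from typing import Dict, List


def adapt_forecast(quantiles: Dict[str, List[float]]) -> Dict[str, object]:
    """Summarize a quantile forecast as evidence an LLM can reason over."""
    median = quantiles["q50"]
    # Width of the 10%-90% band per step, used as a crude uncertainty signal.
    widths = [hi - lo for lo, hi in zip(quantiles["q10"], quantiles["q90"])]
    delta = median[-1] - median[0]
    trend = "up" if delta > 0 else "down" if delta < 0 else "flat"
    return {
        "point_forecast": median,
        "trend": trend,
        "mean_interval_width": sum(widths) / len(widths),
    }


evidence = adapt_forecast(
    {
        "q10": [9.0, 9.5, 10.0],
        "q50": [10.0, 11.0, 12.0],
        "q90": [11.0, 12.5, 14.0],
    }
)
print(evidence)
```

The record carries a point forecast, a direction, and an uncertainty width, which is the kind of structured evidence the second bullet of the adapter description has in mind.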

2) EywaAgent

EywaAgent is the simplest form. It replaces an existing single-agent LLM pipeline.

A plain single agent operates as follows, with the LLM $M$ mapping the state directly to a response.

\[z_t = M(s_t)\]

In EywaAgent, a control policy $C$ decides first.

\[a_t = C(s_t)\]

Here $a_t$ is either “invoke” or “skip.” On skip, the LLM reasons as usual. On invoke, the query compiler builds the FM call, and the response adapter converts the FM output back into reasoning state.

This structure matters because the FM is not called for every task. Some tasks are fine with language reasoning alone, and others have to be delegated to a numerical model. EywaAgent puts this choice inside the agent loop.

In practice this is very important. A domain FM is never free: a server has to be running, the input schema has to match, and inference adds latency. So the core of EywaAgent is less “we attached an FM” and more “we decide when the FM should be called.”
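The invoke/skip decision can be sketched as a single agent step. Everything here is a hypothetical stand-in: `toy_controller` is a one-line rule rather than a learned or LLM-based policy, and `toy_llm` and `toy_fm` are plain functions, not real model calls.

```python
from typing import Any, Callable, Dict, List, Optional

State = Dict[str, Any]


def agent_step(
    state: State,
    controller: Callable[[State], str],
    llm: Callable[[State, Optional[str]], str],
    call_fm: Callable[[State], str],
) -> str:
    """One step: a_t = C(s_t), then either plain reasoning or FM-backed reasoning."""
    action = controller(state)
    if action == "skip":
        return llm(state, None)
    if action == "invoke":
        evidence = call_fm(state)  # compiler -> FM -> adapter, collapsed into one call
        return llm(state, evidence)
    raise ValueError(f"unknown action: {action!r}")


fm_calls: List[str] = []  # records which questions actually reached the FM


def toy_controller(state: State) -> str:
    return "invoke" if "series" in state else "skip"


def toy_fm(state: State) -> str:
    fm_calls.append(state["question"])
    series = state["series"]
    return f"mean={sum(series) / len(series):.1f}"


def toy_llm(state: State, evidence: Optional[str]) -> str:
    return f"{state['question']} | evidence: {evidence}"


text_only = agent_step({"question": "Define entropy"}, toy_controller, toy_llm, toy_fm)
with_data = agent_step(
    {"question": "Forecast load", "series": [1.0, 3.0]}, toy_controller, toy_llm, toy_fm
)
print(text_only)
print(with_data)
```

Only the second request pays the FM cost, which is the behavior the paragraph above describes.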

3) EywaMAS

EywaMAS extends EywaAgent to a multi-agent system. In existing multi-agent frameworks, several LLM agents exchange messages with each other. In EywaMAS, the agent set can contain plain LLM agents and EywaAgents side by side.

The advantage of this structure is that it is plug-and-play. A particular agent can be swapped for a scientific-FM-backed agent without changing the existing multi-agent protocol much. For example, one agent handles language reasoning, another is backed by a time series forecasting FM, and a third is backed by a tabular FM.

The important reading here is that heterogeneous does not simply mean “different LLM backbones.” In Eywa, heterogeneity includes differences in modality and computation type. Mixing an LLM agent with a tabular-FM-backed agent and a time-series-FM-backed agent can therefore be a more fundamental kind of heterogeneity than mixing different LLM agents.
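One way to picture the plug-and-play claim: if every agent exposes the same message-in, message-out signature, the protocol does not need to know which agents are FM-backed. The sketch below assumes a simple sequential round; the `Agent` type, the `kind` label, and the canned replies are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, NamedTuple


class Agent(NamedTuple):
    name: str
    kind: str  # "llm" or "fm-backed"; informational only, the protocol ignores it
    respond: Callable[[str, List[str]], str]  # (task, history) -> message


def run_round(agents: List[Agent], task: str) -> List[str]:
    """Each agent speaks once, seeing the messages produced before it."""
    transcript: List[str] = []
    for agent in agents:
        reply = agent.respond(task, list(transcript))
        transcript.append(f"{agent.name}: {reply}")
    return transcript


agents = [
    # Stand-in for an agent backed by a time series FM.
    Agent("forecaster", "fm-backed", lambda task, history: "median forecast 12.0"),
    # Stand-in for a plain LLM agent that reasons over earlier messages.
    Agent(
        "reasoner",
        "llm",
        lambda task, history: f"given {len(history)} prior message(s), answer uses the forecast",
    ),
]
transcript = run_round(agents, "Will demand exceed 10 next week?")
for line in transcript:
    print(line)
```

Swapping the `forecaster` for another LLM agent would leave `run_round` untouched, which is the sense in which the protocol stays the same while the computation behind an agent changes.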

4) EywaOrchestra

EywaOrchestra is the dynamic orchestration layer. A fixed multi-agent system uses the same topology and the same agent set even when the task changes. Real scientific workflows, however, need different models and collaboration patterns from task to task.

In EywaOrchestra, a planner (or conductor) selects a suitable system from a candidate configuration space. The choices include agent roles, LLM backbone, attached FMs, and communication topology.

Conceptually, the problem looks like this.

\[c^\star = \arg\min_c \, \mathrm{cost}(\mathrm{task}, c)\]

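In a real system the true value of this objective is not known in advance, so the conductor looks at the task description and the available experts and performs approximate routing.

This design is quite important from a production standpoint. Using the most expensive multi-agent configuration for every request drives up cost. Always using a single agent misses the complex tasks. Orchestra is an attempt to find a task-adaptive operating point between the two.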
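As a rough illustration of approximate routing, the sketch below picks the cheapest candidate configuration that covers the modalities a task needs. This is a rule-based stand-in, not the paper's conductor: the `Config` type, the configuration names, and the cost numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Config:
    name: str
    modalities: FrozenSet[str]  # modalities this configuration can handle natively
    est_cost: float             # hypothetical relative cost estimate


def select_config(required: FrozenSet[str], candidates: List[Config]) -> Config:
    """Arg-min of estimated cost over configurations that cover the task."""
    feasible = [c for c in candidates if required <= c.modalities]
    if not feasible:
        raise ValueError("no candidate covers the required modalities")
    return min(feasible, key=lambda c: c.est_cost)


candidates = [
    Config("llm_only", frozenset({"text"}), 1.0),
    Config("ts_agent", frozenset({"text", "time_series"}), 1.5),
    Config("full_mas", frozenset({"text", "time_series", "tabular"}), 4.0),
]

text_choice = select_config(frozenset({"text"}), candidates)
ts_choice = select_config(frozenset({"time_series"}), candidates)
tab_choice = select_config(frozenset({"text", "tabular"}), candidates)
print(text_choice.name, ts_choice.name, tab_choice.name)
```

A real conductor would have to infer the required modalities and the cost estimates from the task description, which is exactly where routing errors can enter.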

5) EywaBench

EywaBench is the benchmark used to evaluate this framework. According to the project page, the released EywaBench-V1 contains 200 tasks and covers nine scientific sub-domains: material, energy, space, biology, clinic, drug, economy, business, and infrastructure. Combined with the natural language, time series, and tabular modalities, this yields 27 domain-modality cells.

EywaBench matters because it is not a plain QA benchmark. It aims to include tasks that cannot be solved by good language reasoning alone and that also require structured prediction and domain-specific computation.

According to public summaries, EywaBench is built from sources in the DeepPrinciple, MMLU-Pro, fev-bench, and TabArena families. This should be re-checked against the tables in the paper, but the intent of the benchmark design is clear: it puts language-only, time-series, and tabular tasks into one framework to directly evaluate whether a heterogeneous system is needed.

4. Training / Data / Recipe

4-1. Data

This paper does not pretrain a new foundation model, so the evaluation tasks and system components matter more than training data.

EywaBench-V1 consists of 200 curated tasks. The project page describes this split as a representative slice sampled from a larger benchmark construction pipeline. The main axes are as follows.

Axis	Description
Parent domain	Physical science, life science, social science
Sub-domain	Material, energy, space, biology, clinic, drug, economy, business, infrastructure
Modality	Natural language, time series, tabular data
Cell coverage	27 domain-modality combinations
Evaluation focus	Utility, token consumption, inference time

The domain-specific FMs mentioned for the experiments are Chronos on the time series side and TabPFN on the tabular side. This is a realistic choice. Chronos and TabPFN each provide a strong inductive bias on structured prediction problems that an LLM has trouble handling directly as text.

A limitation is visible here as well. “Scientific FM” can be read broadly enough to include protein structure, molecular dynamics, crystal graphs, PDE surrogates, climate models, and remote sensing models. The paper's empirical setting does not cover all of that. It is safer to read it as a first demonstration, on time series and tabular domains, that the heterogeneous framework is viable.

4-2. Training strategy

Eywa itself is an agentic inference framework rather than a training recipe that learns new model weights. The core recipe therefore has three parts.

FM service wrapping
- Expose the domain-specific FM as a service with a structured input/output schema.
- Summaries of the paper describe this as an MCP-based remote service.
Compiler and adapter design
- The query compiler turns the task state into an FM invocation.
- The response adapter turns the FM output into an LLM-readable message.
- The quality of these two modules largely determines actual performance.
Agent and topology selection
- EywaAgent is used as a single-agent replacement.
- EywaMAS composes heterogeneous agents in a fixed multi-agent setup.
- EywaOrchestra has a conductor select the configuration per task.

Eywa's “training strategy” is thus closer to a system composition strategy than to gradient updates. What matters is which model is wrapped in which schema, in which state it is invoked, and which output is fed back into the reasoning context.

4-3. Engineering notes

From a practical standpoint, the important engineering points when reading Eywa are the following.

The schema is the performance
- If the FM input/output schema is ambiguous, the LLM builds the wrong query.
- Scientific models have stricter parameter and data constraints than general tools, so schema design matters.
The response adapter is not just a summarizer
- It has to go beyond explaining the FM output to the LLM and produce structured evidence that the next reasoning step can use.
- Passing prediction value, confidence, uncertainty, feature importance, and failure conditions together can be more useful.
Token saving is not only cost saving
- Not serializing structured data into long text reduces token cost.
- At the same time, fewer irrelevant numeric dumps in the LLM context can raise the density of the reasoning context.
Dynamic routing is tied to planner quality
- For Orchestra to find a good operating point, the task classifier or conductor has to be accurate enough.
- Wrong routing leads either to using an expensive system unnecessarily or to not calling an FM that was needed.
Tool reliability is a major issue for scientific FMs
- Input validity, calibration, and extrapolation limits differ from FM to FM.
- If the agent trusts FM output unconditionally, a wrong prediction can contaminate downstream reasoning.
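The schema and reliability points above suggest validating an invocation before it reaches the FM. Below is a minimal sketch of such a validation layer for a forecasting request. The `ForecastRequest` type, its fields, and the `max_horizon` limit of 64 are hypothetical and do not reflect the constraints of any particular model.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ForecastRequest:
    history: List[float]
    horizon: int
    max_horizon: int = 64  # hypothetical limit, not taken from a real FM

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means the request is usable."""
        problems: List[str] = []
        if len(self.history) < 2:
            problems.append("history needs at least 2 points")
        if any(math.isnan(v) or math.isinf(v) for v in self.history):
            problems.append("history contains NaN or inf")
        if not 1 <= self.horizon <= self.max_horizon:
            problems.append(f"horizon must be in [1, {self.max_horizon}]")
        return problems


ok_problems = ForecastRequest([1.0, 2.0, 3.0], horizon=8).validate()
bad_problems = ForecastRequest([float("nan")], horizon=0).validate()
print(ok_problems)
print(bad_problems)
```

Returning the problems as text, rather than raising, lets the agent feed them back to the LLM so it can repair the query instead of silently passing a bad input to the model.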

5. Evaluation

5-1. Main results

According to public summaries and the project page, the experiments compare single-agent, multi-agent, and dynamic orchestration configurations on EywaBench. The systems compared can be grouped as LLM-only single agent, homogeneous LLM multi-agent, heterogeneous LLM multi-agent, EywaAgent, EywaMAS, and EywaOrchestra.

The headline results can be summarized as follows.

System	Main observation
EywaAgent	Reported to raise utility and reduce token consumption relative to a single-LLM baseline
EywaMAS	Shows that cross-modality collaboration is more effective than homogeneous LLM-only multi-agent systems
EywaOrchestra	Shows dynamic routing that reduces cost while keeping utility close to a fixed expert-designed system

In numbers, public summaries describe EywaAgent as achieving about a 7% utility improvement and about a 30% token reduction over the single-LLM baseline. EywaOrchestra is summarized as cutting inference cost by about 11% while keeping utility similar to a fixed EywaMAS. These figures should be re-checked against the tables in the paper, but the direction is clear.

The important point is that Eywa does not simply claim that “using more agents improves performance.” The paper's message is rather that cross-modality heterogeneity can matter more than LLM-only heterogeneity.

5-2. What really matters in the experiments

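1) Utility and cost have to be read together

An agent system should not be judged on accuracy alone. Multi-agent debate and reflection can raise utility, but they also increase token cost and latency substantially. EywaBench tries to look at utility together with token consumption and inference time.

Seen this way, EywaAgent's token reduction is quite important. Instead of feeding the LLM an entire long table or time series as text, calling a specialized FM and receiving only a condensed response can make the context much smaller.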
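The size gap between a serialized series and a condensed response is easy to see on synthetic data. The sketch below uses character count as a crude proxy for tokens (an assumption, since real tokenizers behave differently), and the summary fields are hypothetical rather than the paper's adapter output.

```python
import json
import random
import statistics

random.seed(0)
# Synthetic series of 1000 points; not data from the paper.
series = [round(random.gauss(100.0, 5.0), 3) for _ in range(1000)]

# Option A: dump every value into the prompt as text.
serialized = ", ".join(f"{v:.3f}" for v in series)

# Option B: pass only a condensed, structured summary.
summary = {
    "n": len(series),
    "mean": round(statistics.fmean(series), 3),
    "min": min(series),
    "max": max(series),
    "last": series[-1],
}
condensed = json.dumps(summary)

ratio = len(serialized) / len(condensed)
print(len(serialized), len(condensed), round(ratio, 1))
```

A real adapter would return model output (forecasts, intervals) rather than descriptive statistics, but the context-size argument is the same.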

2) Heterogeneous modality matters more than heterogeneous LLMs

Mixing several LLM backbones can be useful. But if all of them perform language-only reasoning, the inductive bias toward structured scientific data does not change much. Eywa changes this point. It defines heterogeneity as a difference in representation and computation rather than a difference in model provider.

The paper's most important experimental message is here. On scientific tasks, connecting an expert that handles the required data modality natively can yield more direct gains than agent persona diversity.

3) Dynamic orchestration is production-friendly

Always using the largest multi-agent system is expensive. Always using a single agent can be weak on difficult structured tasks. EywaOrchestra selects the system configuration per task in an attempt to find a utility-cost Pareto point.

This fits real service design well. Some requests are fine with a language-only answer, some need a tabular predictor such as TabPFN, and some need a time series forecast together with a language explanation. Dynamic orchestration moves this routing problem inside the agent framework.
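A utility-cost Pareto point can be made concrete with a small dominance check. The configuration names and the (utility, cost) numbers below are invented purely for illustration and are not results from the paper.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    utility: float  # higher is better
    cost: float     # lower is better


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep results that no other result beats on both axes (strictly on at least one)."""
    front = []
    for r in results:
        dominated = any(
            o.utility >= r.utility
            and o.cost <= r.cost
            and (o.utility > r.utility or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            front.append(r)
    return front


# Hypothetical measurements for five candidate configurations.
results = [
    Result("a", 0.60, 1.0),
    Result("b", 0.66, 0.7),
    Result("c", 0.64, 3.0),
    Result("d", 0.72, 2.5),
    Result("e", 0.71, 2.0),
]
front_names = [r.name for r in pareto_front(results)]
print(front_names)
```

Configurations off the front are never the right choice under these two metrics. Choosing among the ones on the front is the trade-off a router has to make per request.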

4) Benchmark scope still calls for caution

EywaBench is interesting, but the released V1 has 200 tasks. Balanced cell coverage is a strength, but the depth of each sub-domain may be limited. It is also hard to argue that experiments centered on Chronos and TabPFN represent scientific FM collaboration as a whole.

So rather than reading this paper as “the scientific agent problem is solved,” it is more accurate to read it as a baseline framework for moving scientific agents from a language-only pipeline to a heterogeneous model collaboration pipeline.

6. Limitations

The range of domain-specific FMs used in the experiments is limited.
- According to public summaries, the central FMs are Chronos and TabPFN.
- Whether the same interface is sufficient for more complex scientific FMs, such as protein, molecule, material graph, PDE, climate, or robotics simulation models, needs further validation.
EywaBench-V1 has a small number of tasks.
- 200 tasks are suitable for balanced evaluation, but a larger split is needed to assess statistical robustness per domain.
- In particular, covering all 27 domain-modality cells necessarily limits the number of samples per cell.
The query compiler and response adapter are the key bottleneck.
- Without a good compiler, the FM invocation goes wrong.
- Without a good adapter, the FM output is not properly integrated into LLM reasoning.
- The framework makes these modules central, but automatically learning an optimal compiler/adapter remains an open problem.
Dynamic orchestration depends on the conductor's judgment.
- If it misjudges which configuration is appropriate, both cost and quality can get worse.
- As the configuration space grows, the cost of evaluating the planner can grow as well.
The utility score is not easy to interpret.
- Merging natural language, time series, and tabular tasks into one utility score makes comparison easier.
- But modality-specific failure modes can be hidden.
- In scientific workflows, domain-specific error and calibration often matter more than average utility.
Real-world scientific workflows are longer and more stateful.
- Actual research workflows iterate over dataset cleaning, model selection, uncertainty analysis, experiment planning, and hypothesis revision.
- Eywa is a good interface baseline, but it is too early to say that it covers the whole long-horizon scientific discovery loop.

7. My Take

7-1. Why this matters for my work

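This paper offers a fairly important criterion for designing agent systems. Recent agent papers spend a lot of time on tool use, multi-agent debate, planners, and memory. In scientific AI and enterprise AI, however, there is a more fundamental problem: many of the important signals are not natural language.

The same holds for practical problems such as Document AI, OCR, tabular analytics, time series monitoring, medical signals, financial risk, and manufacturing sensors. LLMs are strong at orchestration and explanation, but making them reason over every raw signal converted to text can hurt both cost and accuracy.

What makes Eywa good is that it does not stop at “LLMs can't do this.” Instead, it separates the reasoning that LLMs do well from the native prediction that domain models do well, and places a clear interface between them in the form of the query compiler and response adapter.

The key lesson of the paper is this.

A good scientific agent is not one that explains everything in words, but one that coordinates in words and computes with native models.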

7-2. Reuse potential

There are five points I would like to reuse.

Compiler-adapter pair
- Do not treat a tool call as a plain JSON call. Split it into a compiler that turns task state into expert model input and an adapter that turns expert output into reasoning evidence.
Schema-first scientific tool design
- Scientific FMs have strict input constraints.
- The schema and validation layer should come before the prompt.
Cost-aware orchestration
- Do not use multi-agent for every request.
- Choose among single agent, FM-backed agent, and multi-agent according to task complexity and modality requirements.
Cross-modality heterogeneity
- Different computation modules can make a bigger difference than different LLM personas.
- Time series, tables, graphs, images, and documents in particular each need a matching expert.
Utility-cost Pareto evaluation
- Accuracy alone is not enough to judge agent performance.
- Token consumption, latency, tool cost, and failure recovery have to be considered as well.

7-3. Follow-up papers

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
Chronos: Learning the Language of Time Series
Model Context Protocol documentation
ChemCrow: Augmenting Large-Language Models with Chemistry Tools
ScienceAgentBench
Toolformer and tool-use learning papers

8. Summary

Eywa reframes the limitation of LLM agents flattening structured scientific data into text as a heterogeneous model collaboration problem.
The core interface is the Tsaheylu bond, which consists of a query compiler and a response adapter.
EywaAgent, EywaMAS, and EywaOrchestra cover single-agent, fixed multi-agent, and dynamic orchestration respectively.
EywaBench is a scientific benchmark mixing natural language, time series, and tabular modalities that looks at utility and cost together.
The practical message of the paper is clear. A scientific agent should not be a language-only agent but a system that connects LLM reasoning and domain FM computation through a schema.
