SkillOpt: Executive Strategy for Self-Evolving Agent Skills Review

2026-05-31 12 분 소요

0. Introduction

SkillOpt는 agent skill을 prompt 부속물이 아니라 학습 가능한 외부 상태로 보는 논문이다. 여기서 skill은 특정 benchmark에 넣는 짧은 system prompt가 아니라, agent가 evidence를 어떻게 모으고, tool을 어떻게 쓰고, 실패를 어떻게 피하고, output format을 어떻게 맞출지 담은 compact한 natural-language procedure다.

이 논문의 출발점은 꽤 직접적이다. Agent를 실제로 쓰다 보면 model weight를 바꾸기 어렵고, 매번 사람이 prompt를 고치기도 어렵고, one-shot skill generation은 시작점보다 항상 좋아진다는 보장이 없다. 그렇다면 skill 문서 자체를 weight처럼 trainable state로 두고, rollout, reflection, learning rate, validation gate를 갖춘 optimizer로 업데이트할 수 있지 않을까?

SkillOpt의 답은 “skill을 학습한다”는 것이다. Frozen target model은 그대로 두고, 별도의 optimizer model이 scored trajectory를 읽어 add/delete/replace edit를 제안한다. 그 edit는 바로 반영되지 않는다. Candidate skill이 held-out selection split에서 현재 skill보다 엄격하게 좋아질 때만 accept된다. Reject된 edit도 버리지 않고 negative feedback으로 저장한다.

한 줄 요약: SkillOpt는 frozen agent의 compact skill document를 trainable external state로 두고, rollout evidence, minibatch reflection, bounded edit, validation gate, rejected-edit buffer, slow/meta update로 skill을 안정적으로 최적화하는 text-space optimizer다.

이 논문을 지금 볼 가치가 있는 이유는 다음과 같음.

Agent post-training을 model weight가 아니라 reusable skill artifact 관점에서 다시 정의한다.
Prompt optimization, skill evolution, agent memory 사이의 경계를 꽤 깔끔하게 정리한다.
Validation gate와 textual learning rate를 넣어 self-revision을 controlled optimization loop로 바꾼다.
Direct chat, Codex, Claude Code 같은 서로 다른 harness에서 같은 best_skill.md artifact를 쓰는 방향을 보여준다.
결과가 inference-time cost 증가가 아니라 training-time skill search 비용으로만 발생한다는 점이 실무적으로 중요하다.

이 논문에서 제일 중요한 포인트는 skill을 agent stack의 배포 가능한 학습 산출물로 본다는 점이다.

기존 prompt optimization은 흔히 prompt text 하나를 잘 고르는 문제로 보인다. SkillOpt는 조금 다르다. Agent가 반복적으로 실패하는 절차를 관찰하고, 그 절차를 문서로 고치고, held-out validation으로 통과한 rule만 남긴다. 그래서 이 논문은 prompt engineering 자동화라기보다 agent procedure training에 가깝다.

1. Problem Setting

1-1. Problem definition

이 논문이 겨냥하는 문제는 frozen LLM agent를 어떻게 domain-adapt할 것인가다.

일반적인 선택지는 세 가지다.

Model weight를 fine-tuning한다.
사람이 prompt나 skill을 직접 작성한다.
LLM이 trajectory를 보고 skill을 한 번 생성하거나 반복 수정한다.

하지만 실서비스 agent에서는 셋 다 제한이 있다. Closed frontier model은 weight update가 어렵다. Open model fine-tuning도 task별 비용과 infra가 필요하다. 사람이 skill을 계속 관리하면 scale이 안 나온다. LLM self-revision은 plausible한 edit를 많이 내지만, 실제 target model과 harness에서 성능이 올라가는지 보장하기 어렵다.

그래서 SkillOpt는 다음 질문을 던진다.

Skill 문서를 model weight처럼 trainable state로 볼 수 있는가.
Rollout evidence를 batch 단위로 모으고, failure/success pattern을 minibatch reflection으로 분석할 수 있는가.
Text edit에도 learning rate, scheduler, validation gate 같은 optimization control을 넣을 수 있는가.
Offline optimization 후에는 target agent가 final best_skill.md 하나만 읽게 만들 수 있는가.

논문에서 skill은 agent context에 삽입되는 natural-language policy다. Direct chat에서는 system 또는 developer instruction 앞부분에 들어가고, tool-use harness에서는 persistent procedural memory처럼 작동한다.

1-2. Why previous approaches are insufficient

기존 방식의 한계는 skill update가 대체로 uncontrolled text rewrite라는 점이다.

Approach	Main idea	Limitation
Human skill	사람이 domain rule 작성	scale이 어렵고 benchmark별 maintenance cost가 큼
One-shot LLM skill	task description에서 skill 생성	feedback을 반영하지 않아 starting point 이상을 보장하기 어려움
Trace2Skill	trajectory에서 reusable lesson distillation	iterative validation-gated training loop는 아님
TextGrad / GEPA	textual feedback으로 prompt evolution	reusable skill document와 harness transfer가 중심은 아님
EvoSkill	failure analysis로 skill folder evolution	harness-backed setting에서 비교되지만 single portable best_skill.md와 다름

SkillOpt가 보는 핵심 gap은 “편집의 discipline”이다. 좋은 skill edit는 보기에는 그럴듯해야 하는 것이 아니라, held-out split에서 target model과 target harness를 실제로 개선해야 한다. 이 차이가 크다.

SkillOpt는 자연어 edit를 optimizer step으로 다루되, accept 여부는 반드시 validation score로 결정한다.

이 점 때문에 SkillOpt는 일반적인 self-refine loop와 다르다. Reflection이 edit proposal을 만들 수는 있지만, proposal을 믿지는 않는다. Proposal은 검증 대상일 뿐이고, selection split이 통과시킨 edit만 deployed artifact에 남는다.

2. Core Idea

2-1. Main contribution

SkillOpt의 핵심 기여는 agent skill optimization을 아래와 같은 training loop로 구성한 것이다.

Frozen target model이 current skill로 task rollout을 수행한다.
Harness가 messages, tool calls, observations, command outputs, final score를 기록한다.
Optimizer model이 실패 trajectory와 성공 trajectory를 따로 minibatch reflection한다.
Reflection 결과를 add/delete/replace style patch 후보로 만든다.
Candidate edit pool을 merge하고 ranking한 뒤 textual learning rate budget만큼만 적용한다.
Candidate skill을 held-out selection split에서 평가한다.
현재 selection score보다 strictly improve할 때만 accept한다.
Best accepted skill만 best_skill.md로 export한다.

이를 간단히 쓰면 아래와 같다.

\[rollout_t = H(M, task, skill_t)\] \[edits_t = Optimizer(reflect(rollout_t), buffer_t, meta_t)\] \[candidate_t = Apply(skill_t, TopK(edits_t, L_t))\] \[skill_{t+1} = Gate(candidate_t, skill_t, Val)\]

여기서 $M$은 frozen target model, $H$는 harness, $L_t$는 textual learning rate budget, $Val$은 selection split evaluator다. 중요한 점은 optimizer model이 target model의 weight를 바꾸지 않는다는 것이다. Optimizer가 바꾸는 것은 skill document뿐이다.

2-2. Design intuition

이 논문의 설계 직관은 다음 문장으로 정리할 수 있다.

“Agent가 반복적으로 실패하는 절차가 있다면, 그 절차는 model 내부가 아니라 외부 skill 문서에도 학습될 수 있다.”

예를 들어 SpreadsheetBench에서는 agent가 workbook 구조를 제대로 보지 않고 preview에 의존하거나, formula를 써놓고 grader가 읽는 static value를 저장하지 않을 수 있다. 이런 실패는 model weight를 업데이트하지 않아도 skill rule로 고칠 수 있다. DocVQA에서는 table, form, chart, legend에서 question을 정확한 visual region에 먼저 bind하라는 rule이 필요할 수 있다. ALFWorld에서는 object identity, visited location, search frontier, progress lock 같은 procedural state discipline이 중요하다.

이런 rule은 task instance 하나를 외우는 것이 아니다. 여러 trajectory에서 반복되는 failure pattern을 압축한 operating procedure다. SkillOpt는 이 operating procedure를 bounded edit와 held-out gate로 축적한다.

핵심은 text-space search를 무제한 rewrite가 아니라 conservative optimization으로 만든다는 점이다.

3. Architecture / Method

3-1. Overview

Item	Description
Goal	frozen agent를 skill document만으로 domain-adapt
Trainable state	single natural-language skill document
Target model	task를 실행하는 frozen model
Optimizer model	rollout evidence를 읽고 skill edit를 제안하는 별도 model
Update type	add/delete/replace 또는 rewrite suggestion
Step-size control	textual learning rate, edit budget
Model selection	held-out selection split gate
Negative feedback	rejected-edit buffer
Long-horizon memory	epoch-wise slow update and optimizer-side meta skill
Deployment artifact	best_skill.md

SkillOpt를 한 문장으로 보면, language agent를 위한 outer-loop optimizer다. Inner loop는 target model이 task를 수행하는 execution이고, outer loop는 optimizer model이 skill을 수정하는 training이다.

3-2. Module breakdown

1) Rollout evidence

Rollout은 단순 final answer log가 아니다. 논문은 task metadata, messages, tool calls, observations, command outputs, final answer, verifier feedback, spreadsheet preview, document reference, compact execution trace 같은 정보를 evidence로 모은다.

이 설계는 중요하다. Skill edit가 final answer만 보고 만들어지면 실패 원인을 잘못 짚을 수 있다. 특히 code/tool agent에서는 결과가 틀린 이유가 reasoning이 아니라 file operation, formula handling, output formatting, verifier interaction일 수 있다.

2) Minibatch reflection

Optimizer model은 failure와 success를 분리해서 본다.

Failure minibatch는 missing rule이나 corrective rule을 찾는다. Success minibatch는 이미 잘 작동하는 behavior를 보존하거나 강화한다. 이후 failure proposal과 success proposal을 따로 merge하고, final merge에서는 failure correction을 우선한다.

이 구조는 단순히 실패를 고치는 것보다 중요하다. 성공 pattern을 보존하지 않으면 skill update가 regression을 만들 수 있다.

3) Bounded text updates

SkillOpt의 textual learning rate는 한 step에서 적용할 edit 수의 cap이다. 예를 들어 default setting에서는 textual learning rate 4와 cosine decay를 사용한다. 이 budget이 없으면 optimizer model이 skill을 크게 rewrite하면서 이전에 잘 되던 procedure를 지울 수 있다.

Bounded update는 아래 같은 의미를 갖는다.

\[|edit_t| <= L_t\]

여기서 $L_t$는 step별 edit budget이다. Weight-space learning rate가 parameter update magnitude를 제한하듯, text-space learning rate는 skill 문서가 한 번에 움직이는 범위를 제한한다.

4) Validation gate and rejected buffer

Candidate skill은 selection split에서 평가된다. 현재 selection score보다 strictly greater일 때만 current skill이 된다. Tie도 accept하지 않는다.

이 gate가 없으면 reflection loop는 자기 확신에 빠지기 쉽다. 자연어 diagnosis는 설득력 있어 보여도 실제 target model과 evaluator에서는 성능을 낮출 수 있다. SkillOpt는 이 위험을 selection score로 막는다.

Reject된 edit도 버리지 않는다. Rejected edit buffer는 어떤 edit 방향이 harmful했는지 저장하고, 다음 reflection과 merge 단계에 negative feedback으로 들어간다.

5) Slow/meta update

Slow update는 epoch 끝에서 adjacent epoch를 비교한다. 같은 sampled tasks를 previous epoch skill과 current epoch skill로 실행하고, improvements, regressions, persistent failures, stable successes로 나눈다. 그 결과를 protected slow-update field에 넣는다.

Meta skill은 deployed target model이 보지 않는다. Optimizer-side memory로만 쓰인다. 어떤 edit pattern이 좋았고, 어떤 reject가 반복되었고, 어떤 failure가 persistent했는지를 future optimizer prompt에 전달한다.

Slow update는 deployed skill을 길게 만드는 장치가 아니라, training loop가 장기 방향성을 잃지 않게 하는 장치다.

3-3. Why this is not just prompt optimization

SkillOpt는 prompt optimization과 닮았지만 동일하지 않다.

Prompt optimization은 보통 prompt candidate의 score를 높이는 데 집중한다. SkillOpt는 agent procedure를 reusable artifact로 만든다. Output formatting, tool policy, evidence gathering, verifier usage, search frontier, workbook inspection 같은 절차를 skill에 축적한다.

따라서 이 논문은 prompt optimization paper라기보다 agent adaptation systems paper에 가깝다. Prompt text를 바꾸는 방법이 아니라, agent가 학습한 procedure를 audit 가능한 파일로 export하는 방법을 제안하기 때문이다.

4. Training / Data / Recipe

4-1. Data and benchmarks

논문은 여섯 개 benchmark를 사용한다.

Benchmark	Role
SearchQA	extractive QA
SpreadsheetBench	spreadsheet code and tool use
OfficeQA	local-document and tool-augmented QA
DocVQA	multimodal document QA
LiveMathematicianBench	mathematical multiple-choice reasoning
ALFWorld	embodied sequential decision making

Target model은 GPT family와 Qwen family를 포함한 7개 model로 구성된다. Execution mode는 direct chat, Codex harness, Claude Code harness다. GitHub README 기준으로 repo는 SearchQA, ALFWorld, DocVQA, LiveMathematicianBench, SpreadsheetBench, OfficeQA config를 제공한다.

중요한 점은 train/selection/test split의 역할 분리다. Training split은 rollout evidence를 만들고, selection split은 candidate skill accept/reject에만 쓰고, test split은 final report에만 쓴다. 이 구조가 없으면 skill edit가 validation overfitting인지 실제 generalization인지 구분하기 어렵다.

4-2. Default optimization recipe

논문 appendix의 default recipe는 아래와 같다.

Component	Default
Epochs	4
Rollout batch size	40
Reflection minibatch size	8
Textual learning rate	4
Scheduler	cosine decay
Minimum learning rate	2
Slow update samples	20
Candidate selection	held-out validation gate
Edit mode	patch mode
Optimizer memory	meta skill enabled
Rejected feedback	rejected-edit buffer enabled

이 recipe에서 특히 중요하게 보는 부분은 batch size보다 gate다. Table 2를 보면 rollout batch, reflection minibatch, scheduler 선택에 따라 어느 정도 변동은 있지만, 논문이 강조하는 안정성은 결국 bounded update, rejected buffer, slow/meta update, validation gate 조합에서 나온다.

즉 SkillOpt의 recipe는 큰 prompt rewrite가 아니라 작은 edit를 많이 제안하고, 소수만 통과시키는 방향이다.

4-3. Engineering notes

GitHub README 기준 구현은 Python 3.10 이상을 요구하고, pip install -e . 방식으로 설치한다. Benchmark data는 repo에 포함되지 않고, 사용자가 train/val/test split directory를 준비해야 한다.

Output directory에는 config, history, runtime state, best_skill.md, skill snapshot, per-step artifacts, slow_update log, meta_skill log가 남는다. 이 구조는 연구용으로도 중요하다. SkillOpt는 자연어 edit를 사용하므로, 어떤 edit가 통과했고 어떤 edit가 reject되었는지 추적 가능해야 한다.

실무적으로는 세 가지를 확인해야 한다.

Evaluator가 충분히 reliable한가.
Selection split이 production distribution을 대표하는가.
Optimized skill이 다른 harness나 model에서 안전하게 transfer되는가.

여기서 하나라도 약하면 SkillOpt는 성능을 올리는 것처럼 보이지만, 실제로는 selection split의 shortcut을 학습할 수 있다.

5. Evaluation

5-1. Main results

논문은 SkillOpt가 6개 benchmark, 7개 target model, 3개 harness 조합에서 측정된 52개 cell 모두에서 best 또는 tied-best라고 보고한다. 비교군은 no skill, human skill, LLM skill, Trace2Skill, TextGrad, GEPA, EvoSkill이다.

GPT-5.5 direct chat 기준으로는 평균 no-skill 대비 +23.5 point를 보고한다. Benchmark별로 보면 SearchQA +9.6, SpreadsheetBench +38.9, OfficeQA +39.0, DocVQA +12.4, LiveMath +29.3, ALFWorld +11.9다.

Codex harness와 Claude Code harness에서도 같은 skill optimization interface가 작동한다. Abstract 기준으로 GPT-5.5에서 Codex agentic loop는 +24.8, Claude Code는 +19.1 point average gain을 보고한다. Table row의 개별 gain 평균과도 이 값이 잘 맞는다.

Setting	Reported signal
Direct chat	GPT-5.5 average +23.5 over no skill
Codex harness	GPT-5.5 average +24.8 over no skill
Claude Code harness	GPT-5.5 average +19.1 over no skill
Overall matrix	best or tied-best on 52 of 52 cells

이 결과를 읽을 때 중요한 점은 SkillOpt가 model weight를 바꾸지 않았다는 것이다. 개선은 optimizer model의 training-time call과 skill document edit에서 나온다. Deployment 시에는 target model이 best_skill.md만 읽는다.

5-2. Ablation

Table 3의 component ablation은 이 논문의 설계 주장을 뒷받침한다.

Component	Default result	Removed or changed result	Interpretation
Learning-rate form	87.1 / 77.5 / 61.3	without lr: 84.6 / 75.7 / 57.3	bounded edit가 destructive rewrite를 줄임
Rejected buffer	87.1 / 77.5 / 61.3	without buffer: 85.5 / 72.9 / 58.9	rejected edit가 negative feedback으로 작동
Slow/meta update	87.1 / 77.5 / 61.3	without both: 86.3 / 55.0 / 59.7	long-horizon guidance가 중요

특히 SpreadsheetBench에서 slow/meta update를 빼면 77.5에서 55.0으로 크게 떨어진다. 이는 spreadsheet task가 단순 answer pattern이 아니라 workbook inspection, formula handling, static value materialization 같은 절차적 rule의 누적을 요구하기 때문으로 읽을 수 있다.

5-3. Transfer

논문은 optimized skill을 세 방향으로 transfer한다.

Cross-model transfer
Cross-harness transfer
Cross-benchmark transfer

Cross-harness transfer가 특히 실용적이다. Codex에서 훈련한 SpreadsheetBench skill을 Claude Code에 넣어도 positive gain이 나오고, 반대 방향도 positive다. 이는 skill이 특정 CLI command recipe만 외운 것이 아니라, workbook-level procedure를 일부 학습했을 가능성을 보여준다.

이 부분이 SkillOpt의 실무적 가치다. Skill이 한번 최적화된 뒤 다른 model이나 harness에서 재사용된다면, training-time search cost를 amortize할 수 있다.

5-4. What really matters in the experiments

이 논문의 결과를 볼 때 세 가지를 분리해야 한다.

첫째, SkillOpt가 모든 task에서 압도적인 algorithm인지보다, skill artifact가 held-out gate를 통해 실제로 개선되는지를 봐야 한다.

둘째, optimizer model의 강함은 deployment cost가 아니라 training cost다. 강한 optimizer를 쓰면 training-time 비용은 늘지만, final best_skill.md 사용 시 inference path에는 추가 model call이 없다.

셋째, learned rule이 instance-specific인지 procedural인지 봐야 한다. 논문 Figure 4의 representative rules는 SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMath, ALFWorld에서 특정 instance 이름을 외우기보다 procedure를 담는다. 이 점이 transfer claim과 연결된다.

6. Limitations

SkillOpt는 reliable feedback signal이 필요하다. Automatic verifier, exact match, executable check, held-out score가 있어야 validation gate가 의미를 갖는다. Open-ended task에서는 human evaluator나 model-based evaluator의 품질이 병목이 된다.
Deployment artifact는 compact하지만 training loop는 공짜가 아니다. Rollout computation, optimizer model call, validation evaluation이 필요하다. One-off task라면 비용 대비 이득이 작을 수 있다.
논문은 single portable skill을 최적화한다. Highly heterogeneous domain에서는 하나의 skill로 여러 disjoint procedure를 모두 담기 어려울 수 있고, skill library나 router가 필요할 수 있다.
Optimized skill은 training distribution의 heuristic을 encode할 수 있다. 다른 model, 다른 harness, 다른 task setting으로 옮길 때는 반드시 held-out 재평가가 필요하다.
Benchmark data가 GitHub repo에 포함되지 않는다. 구현은 공개되어도 동일 split과 동일 scorer를 재현하려면 별도 data preparation과 evaluator alignment가 필요하다.

7. My Take

7-1. Why this matters for my work

SkillOpt는 agent engineering에서 꽤 중요한 방향을 잡는다. Agent 성능 개선을 항상 weight update, bigger model, longer context로만 풀 필요는 없다. 실패가 절차적이라면 외부 skill 문서에 누적하는 것이 더 싸고, 더 audit 가능하고, 더 배포하기 쉬울 수 있다.

특히 enterprise workflow, document QA, spreadsheet automation, coding agent처럼 tool interaction이 많은 영역에서는 model이 틀리는 이유가 pure reasoning failure가 아닐 때가 많다. Evidence binding, file inspection, output schema, verifier usage, retry policy 같은 절차가 부족해서 실패한다. 이런 것은 skill artifact로 고치기 좋다.

SkillOpt는 agent skill을 prompt text가 아니라 versioned training artifact로 다루는 쪽에 가깝다.

7-2. Reuse potential

실무 적용 관점에서는 아래처럼 쓸 수 있을 것 같다.

Agent evaluation suite를 먼저 만든다.
Domain별 initial skill을 사람이 최소한으로 작성한다.
Training split에서 rollout evidence를 모은다.
Optimizer model이 failure/success pattern을 skill edit로 제안한다.
Selection split gate를 통과한 edit만 main skill에 반영한다.
Production 배포 전 별도 canary split으로 재검증한다.

특히 coding agent나 spreadsheet agent에서는 edit_apply_report.json, skill snapshot, rejected buffer log가 중요하다. Natural-language skill은 사람이 읽을 수 있으므로, regression이 생기면 어떤 rule이 문제였는지 추적할 수 있다.

7-3. Follow-up papers

같이 보면 좋은 논문과 시스템은 아래와 같다.

GEPA: reflective prompt evolution can outperform reinforcement learning.
TextGrad: automatic differentiation via text.
Trace2Skill: distill trajectory-local lessons into transferable agent skills.
EvoSkill: automated skill discovery for multi-agent systems.
SkillLens: model-generated agent skills를 분석하는 companion project.
Voyager: LLM agent에서 skill library와 open-ended exploration을 연결한 대표 시스템.

8. Summary

SkillOpt는 frozen agent의 skill document를 trainable external state로 보고 text-space optimization loop를 만든다.
핵심 장치는 rollout evidence, minibatch reflection, bounded edit, validation gate, rejected-edit buffer, slow/meta update다.
Deployment 시에는 optimizer model을 부르지 않고 best_skill.md만 target agent에 넣는다.
6개 benchmark, 7개 target model, 3개 harness에서 broad evaluation을 수행했고, 52개 measured cell 모두 best 또는 tied-best라고 보고한다.
가장 중요한 한계는 reliable evaluator와 held-out selection split이 필요하다는 점이다.

Twitter Facebook LinkedIn