From Context to Skills: Can Language Models Learn from Context Skillfully? Review

2026-06-18

0. Introduction

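From Context to Skills is a paper that makes the claim "feed in a long context and the model will learn on its own" rather uncomfortable. The question it asks is not simply whether a long-context LLM reads documents well. More precisely, it asks whether a model can look at a new context and extract the rules, procedures, exceptions, and decision criteria inside it as reusable skills.

Context learning problems come up often when LLMs are attached to real work. We hand the model information outside its parametric knowledge, such as technical documents, internal policies, papers, product manuals, experimental protocols, or codebase conventions, and ask it to solve problems within that material. Plain RAG or long-context prompting works to a degree, but it is inefficient and unstable because the model has to reread the whole document and reason from scratch every time.

This paper takes the middle ground. Instead of feeding the context straight into the answer, it first builds context-specific natural-language skills. These skills are then attached to later inference prompts so that the model reasons more reliably within the same context domain. The important point is that no human writes these skills. Ctx2Skill generates tasks and rubrics through a multi-agent self-play loop involving a Challenger, a Reasoner, and a Judge, updates skills by analyzing failures, and finally uses Cross-Time Replay to pick the skill set that generalizes best.

One-line summary: Ctx2Skill is an inference-time skill augmentation framework that automatically discovers, refines, and selects context-specific natural-language skills from complex contexts through multi-agent self-play and Cross-Time Replay, without human skill annotation.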

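This paper is worth reading now for the following reasons.

The next bottleneck for long context and RAG is shifting from plain retrieval to the question of what reusable procedure a context should be turned into.
It deals fairly directly with the process of "reading documents and deriving working rules" that agent memory, SOP learning, document AI, and codebase assistants all need.
It shows how to produce self-play feedback when there is no external verifier, and how to keep that process from breaking down through over-specialization.
The generated skills are text artifacts rather than model weight updates, so they are interpretable and fit workflows that carry them over to other models.
The core of Ctx2Skill can also be read less as "use a bigger model" and more as amortized augmentation, in which skills produced by a strong model are attached to a weaker model.

This is a context learning paper, but it also reads like a systems paper on how LLM agents should externalize experience. Instead of rereading the context every time, the model builds a working rulebook from the context and reuses it.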

1. Problem Setting

1-1. Problem definition

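The problem this paper targets is context learning. Here, context learning means the ability to read the knowledge and rules newly presented in a given context and apply them to problem solving, rather than answering from knowledge already acquired during pretraining.

For example, consider tasks like these.

Read a newly provided technical manual and apply a specific operation rule.
Find an exception clause in a domain document and make a decision.
Look at the procedure in an experiment report and identify missing steps.
Read a rule system and map a new case to the correct action.
Infer an implied constraint from a paper or design document and answer a downstream question.

These tasks differ from ordinary QA. The answer may not exist as a simple span in the context, and rules often have to be applied by combining several parts of the context. So, for the model, turning the procedures and decision criteria in the context into skills matters more than treating the context as a simple collection of evidence.

The paper frames this as an inference-time skill augmentation problem: extract rules, procedures, exceptions, and reasoning strategies from the context as natural-language skills, then add those skills to the prompt when solving tasks.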

1-2. Why previous approaches are insufficient

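The limitations of existing approaches fall into three groups.

First, manual skill annotation is expensive. Having a person write a rulebook for every long technical context can yield high quality, but it does not scale. Especially where contexts keep changing, as with internal documents, domain-specific reports, and product manuals, writing skills by hand each time becomes the bottleneck.

Second, automated skill construction lacks external feedback. In math or code, a verifier can grade a generated solution. In dense document understanding, however, no external environment automatically tells you whether a skill actually helps. Tasks, rubrics, and response evaluation all have to be produced from the context alone.

Third, long-context prompting leaves no reusable knowledge behind. If the model rereads the entire context for every query, it keeps re-deriving the same rules. This costs tokens and can weaken answer consistency. For a context learning system to be practical, the lessons learned from reading a context once have to be preserved as a compressed artifact.

So the core question of the paper can be stated as follows.

With no external labels, no human-written skills, and no task-specific verifier, can an LLM build the skills it will use from the context alone?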

2. Core Idea

2-1. Main contribution

Ctx2Skill's core contribution is to recast context-specific skill generation as a self-play problem.

The overall structure has three pillars.

Self-play task generation and judging
- The Challenger creates probing tasks and rubrics from the context.
- The Reasoner solves the tasks using the current skill set.
- The Judge gives binary feedback on whether the answer satisfies the rubric.
Failure-driven skill evolution
- The Proposer analyzes the causes of failures and successes.
- The Generator turns that analysis into natural-language skill updates.
- Both the Challenger and Reasoner skill sets evolve.
Cross-Time Replay
- Prevents self-play from collapsing into extreme tasks and over-specialized skills as it goes on.
- Re-evaluates each iteration's Reasoner skill candidate on hard probes and easy probes.
- Instead of simply taking the skill set from the final iteration, it picks the one from the iteration with the best hard/easy balance.

What matters in this structure is that the update target is skill text, not model parameters. Ctx2Skill is not fine-tuning. It reads a context, turns the procedures valid in that context into a text artifact, and makes that artifact attachable to other LLMs at inference time.

2-2. Design intuition

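The intuition behind Ctx2Skill is fairly clear. Good skills require good questions, and good questions require knowing what the current Reasoner cannot do. That is why the Challenger and Reasoner evolve together.

If the Reasoner solves tasks too easily, the Challenger has to produce sharper probes.
If the Reasoner fails, the Reasoner skills have to gain a rule that accounts for that failure.
If the Challenger only produces shallow tasks, the skills remain superficial summaries.
If the Challenger only produces extreme tasks, the skills lose generality.

This balance is hard. Self-play is powerful, but because it circulates only within tasks and rubrics it generated itself, pathologies arise easily. This is exactly what the paper calls adversarial collapse: the Challenger produces increasingly idiosyncratic and difficult tasks, and the Reasoner skills grow longer and overfit to cover only those idiosyncratic cases.

Cross-Time Replay is a validation mechanism for this problem. It does not assume that the last iteration is always the best. Instead, it uses the easy and hard probes collected during self-play to re-evaluate the skill sets from multiple points in time and selects the skill set from the best-balanced point.

Conceptually, it can be viewed as follows.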

\[i^* = \arg\max_i \, \rho_h(i) \cdot \rho_e(i), \qquad S_R^* = S_R^{i^*}\]

Here $S_R^i$ is the Reasoner skill set at iteration $i$, and $\rho_h(i)$ and $\rho_e(i)$ are the solving rates on hard probes and easy probes, respectively. The product is used to penalize skill sets that do well on only one side. A skill set should neither solve hard tasks while forgetting easy ones, nor hold on to easy tasks while failing hard ones.
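As a rough illustration of the product-based selection described above, here is a minimal Python sketch. The function name `select_best_iteration`, the tie-breaking behavior (earliest iteration wins), and all the numbers are my own assumptions for illustration, not taken from the paper.

```python
from typing import Sequence


def select_best_iteration(rho_hard: Sequence[float], rho_easy: Sequence[float]) -> int:
    """Return the iteration index i that maximizes rho_hard[i] * rho_easy[i]."""
    if len(rho_hard) != len(rho_easy) or not rho_hard:
        raise ValueError("need two non-empty sequences of equal length")
    scores = [h * e for h, e in zip(rho_hard, rho_easy)]
    # max() keeps the first maximal index, so ties go to the earlier iteration
    # (a choice made for this sketch only).
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical solving rates for four self-play iterations (illustrative only).
rho_hard = [0.10, 0.40, 0.60, 0.70]
rho_easy = [0.90, 0.85, 0.80, 0.40]
best = select_best_iteration(rho_hard, rho_easy)
print(best)
```

In this toy run the last iteration has the highest hard-probe rate, but its easy-probe rate has dropped so far that the product favors the third iteration (index 2).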

3. Architecture / Method

3-1. Overview

Item	Description
Goal	Automatically discover and select reusable natural-language skills from complex contexts
Output	Context-specific Reasoner skill set
Learning signal	Challenger-generated task, rubric, Judge binary feedback
Key agents	Challenger, Reasoner, Judge, Proposer, Generator
Core mechanism	Multi-agent self-play plus failure-driven skill update
Stability module	Cross-Time Replay
Update target	Text skill artifact, not model weights
Difference from prior work	Uses only feedback generated from within the context, with no human skill annotation or external verifier

3-2. Module breakdown

1) Context-specific skill set

What Ctx2Skill produces is not a generic instruction prompt. It is a skill set tied to a specific context. A skill here plays the following roles.

Compresses the important rules in the context.
Spells out the order in which rules are applied.
Records exceptions and edge cases separately.
Preserves the rubric-like constraints a response must satisfy.
Tells the model which parts to check first in a downstream task.

It matters that these skills are written in human-readable natural language. The skills themselves can be inspected, edited by a person if needed, or moved into another model's prompt. In my view, a Ctx2Skill skill set is close to a domain-specific mini SOP.

2) Challenger

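The Challenger reads the context and creates tasks and rubrics. It should not just produce easy fact questions; it has to produce tasks that can probe the Reasoner's understanding of the context.

A good Challenger does the following.

Does not ask only about surface details of the context.
Creates tasks that require combining several rules.
Creates a rubric alongside the task so that the answer can be judged.
Targets the Reasoner's weak spots rather than what it already does well.

The paper treats the Challenger as an agent with skills too. Challenger skills are meta-skills about what kinds of tasks best expose the Reasoner's weak points. This matters because a weak task generator makes the Reasoner skills evolve weakly as well.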

3) Reasoner

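The Reasoner solves Challenger tasks using the current context and the Reasoner skill set. The Reasoner's answers here are not downstream user answers; they are the source of the training signal for skill evolution.

When the Reasoner fails, the failure does not end as a lost point. The failed case is passed to the Proposer and Generator and becomes a candidate skill update. In other words, within the self-play loop the Reasoner is the probe target that exposes the current limits of the skills.

The important point is that Reasoner skills keep accumulating. Rules learned in earlier iterations feed into reasoning in the next iteration, and the next failures change the skills again. The skills are therefore the result of iterative refinement, not a context summary.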

4) Judge

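The Judge compares the Reasoner's answer against the Challenger's rubric and gives binary feedback. According to the public summary, all-or-nothing scoring is also used for the evaluation tasks: a task counts as solved only if every rubric is satisfied.

This strict scoring is reasonable for context learning. In real rule application tasks, an answer can satisfy some rubrics and still be wrong if it misses a key exception. The structure does depend heavily on Judge quality, though. If a rubric is wrong or the Judge misapplies it, skill updates can head in the wrong direction.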
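To make the all-or-nothing rule concrete, here is a small Python sketch. The per-rubric verdicts are assumed to be already available as booleans (in the paper they come from an LLM judge); the function names, rubric labels, and the handling of an empty rubric are assumptions of this sketch.

```python
from typing import Mapping, Sequence


def is_solved(rubric_results: Mapping[str, bool]) -> bool:
    """All-or-nothing: a task is solved only if every rubric item passed."""
    # An empty rubric is treated as unsolved (an assumption of this sketch).
    return bool(rubric_results) and all(rubric_results.values())


def solving_rate(tasks: Sequence[Mapping[str, bool]]) -> float:
    """Fraction of tasks whose rubric items all passed."""
    if not tasks:
        return 0.0
    return sum(is_solved(t) for t in tasks) / len(tasks)


# Hypothetical per-rubric verdicts for two tasks.
tasks = [
    {"cites_correct_rule": True, "handles_exception": True},
    {"cites_correct_rule": True, "handles_exception": False},
]
rate = solving_rate(tasks)
print(rate)
```

The second task satisfies half of its rubric items but still counts as unsolved, which is the point of the strict scoring: partial credit would hide the missed exception.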

5) Proposer and Generator

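An interesting design choice in Ctx2Skill is the separation of diagnosis and generation. The Proposer analyzes what needs fixing, and the Generator turns that analysis into actual skill text.

This separation matters quite a bit in agent system design. Converting failure logs directly into skill sentences tends to make skills long or turn them into localized patches. Abstracting the failure pattern first makes it more likely that the skill update settles into a more general rule.

In practice, the structure can be viewed as the following workflow.

The Judge marks a Reasoner failure.
The Proposer diagnoses the cause of the failure.
The Generator converts the diagnosis into a skill update.
The updated skill set is used in the next self-play iteration.

This process is textual learning that happens without any model weight update.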

6) Cross-Time Replay

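Cross-Time Replay is the paper's stabilizer. Self-play usually tempts you to use the last policy or the last skill. In Ctx2Skill, however, there is no guarantee that the last skill set is the best one.

The paper collects representative probes during self-play.

hard probe: a failed task that was difficult during an iteration
easy probe: a successful task that was solved easily during an iteration

It then re-evaluates the Reasoner skill candidates from past iterations on this probe set. The final skill set is the one that performs well on both hard and easy probes in a balanced way.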
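The replay step can be sketched in Python as below. This is only a toy under stated assumptions: `solves` stands in for the full Reasoner-plus-Judge call, skill sets and probes are plain strings, and a probe counts as solved when its keyword appears in the skill text. None of these names or simplifications come from the paper.

```python
from typing import Callable, Sequence

# Hypothetical signature: solves(skill_set, probe) -> True if the probe is fully solved.
Solver = Callable[[str, str], bool]


def replay_rates(
    candidates: Sequence[str],
    hard_probes: Sequence[str],
    easy_probes: Sequence[str],
    solves: Solver,
) -> list[tuple[float, float]]:
    """Re-evaluate every historical skill candidate on the shared probe sets."""
    rates = []
    for skills in candidates:
        rho_h = sum(solves(skills, p) for p in hard_probes) / len(hard_probes)
        rho_e = sum(solves(skills, p) for p in easy_probes) / len(easy_probes)
        rates.append((rho_h, rho_e))
    return rates


def toy_solves(skills: str, probe: str) -> bool:
    # Toy stand-in for Reasoner + Judge: keyword match on the skill text.
    return probe in skills.split()


# Skill snapshots from three iterations; the last one has dropped the basic rule.
candidates = ["basic", "basic exception", "exception override"]
rates = replay_rates(candidates, ["exception", "override"], ["basic"], toy_solves)
products = [h * e for h, e in rates]
print(rates, products)
```

Here the final candidate solves every hard probe but fails the easy one, so its product is zero and the middle candidate would be selected.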

The implication of this design is clear. If skill evolution is just a process of adding more rules, skills can become long and brittle over time. Cross-Time Replay selects skills by "how well balanced is it on representative cases" rather than "how recent is it".

This mechanism is what separates Ctx2Skill from a plain self-improvement prompt loop. Many self-evolving agents assume that they improve as experience accumulates, but in practice over-specialized memory can hurt performance. Ctx2Skill addresses that risk explicitly.

4. Training / Data / Recipe

4-1. Data

For this paper, the evaluation benchmark and context construction matter more than model training data. According to the public summary, Ctx2Skill is evaluated on CL-bench.

Item	Description
Benchmark	CL-bench
Contexts	500 complex contexts
Tasks	1,899 tasks
Rubrics	More than 31,000 verification rubrics
Categories	Domain Knowledge Reasoning, Rule System Application, Procedural Task Execution, Empirical Discovery and Simulation
Scoring	All-or-nothing scoring: a task counts as solved only if every rubric is satisfied
Judge	GPT-5.1 judge, with reported human agreement above 90%

This benchmark choice matters because Ctx2Skill is not aimed at a plain QA benchmark. CL-bench contains tasks that apply new knowledge and rules found in the context. For skill augmentation to be meaningful, it has to show gains in rule application and procedural reasoning, not just answer extraction.

4-2. Training strategy

Ctx2Skill involves no conventional gradient-based training. Instead, it runs inference-time self-play for each context.

The overall flow can be viewed as follows.

Take the context as input.
Create the initial Challenger skills and Reasoner skills.
The Challenger generates probing tasks and rubrics.
The Reasoner solves the tasks with the current skill set.
The Judge evaluates the answers against the rubrics.
The Proposer and Generator turn failure/success patterns into skill updates.
Store the skill candidates from multiple iterations.
Select the final Reasoner skill set with Cross-Time Replay.
Attach the final skills to the downstream task-solving prompt.

If this is called training, it is closer to skill artifact training than parameter training. The model weights stay fixed, and what is learned is the natural-language skill set that goes into the prompt.
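The flow above can be sketched as a plain Python loop. This is a skeleton under heavy assumptions: each agent is passed in as a callable, the toy agents below are deterministic string functions rather than LLM calls, one task is generated per iteration, and the Cross-Time Replay selection step is left out (the loop only stores the per-iteration candidates). All names and signatures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    question: str
    rubric: list[str]  # every item must be satisfied (all-or-nothing)


@dataclass
class SelfPlayState:
    challenger_skills: list[str] = field(default_factory=list)
    reasoner_skills: list[str] = field(default_factory=list)
    candidates: list[list[str]] = field(default_factory=list)  # Reasoner snapshot per iteration


def run_self_play(
    context: str,
    n_iters: int,
    challenger: Callable[[str, list[str]], Task],
    reasoner: Callable[[str, list[str], Task], str],
    judge: Callable[[Task, str], bool],
    proposer: Callable[[Task, str, bool], str],
    generator: Callable[[str], tuple[list[str], list[str]]],
) -> SelfPlayState:
    state = SelfPlayState()
    for _ in range(n_iters):
        task = challenger(context, state.challenger_skills)
        answer = reasoner(context, state.reasoner_skills, task)
        passed = judge(task, answer)
        diagnosis = proposer(task, answer, passed)
        challenger_update, reasoner_update = generator(diagnosis)
        state.challenger_skills.extend(challenger_update)
        state.reasoner_skills.extend(reasoner_update)
        # Keep a copy so later updates do not mutate earlier candidates.
        state.candidates.append(list(state.reasoner_skills))
    return state


# Toy stand-ins for the LLM agents (deterministic, for illustration only).
def toy_challenger(context: str, skills: list[str]) -> Task:
    rules = context.splitlines()
    rule = rules[len(skills) % len(rules)]
    return Task(question=f"Apply: {rule}", rubric=[rule])


def toy_reasoner(context: str, skills: list[str], task: Task) -> str:
    return "; ".join(skills)


def toy_judge(task: Task, answer: str) -> bool:
    return all(item in answer for item in task.rubric)


def toy_proposer(task: Task, answer: str, passed: bool) -> str:
    return "ok" if passed else "missing: " + task.rubric[0]


def toy_generator(diagnosis: str) -> tuple[list[str], list[str]]:
    if diagnosis.startswith("missing: "):
        rule = diagnosis[len("missing: "):]
        return ([f"probed {rule}"], [rule])
    return (["probe harder"], [])


context = "refund within 30 days\nno refund on sale items"
state = run_self_play(
    context, 3, toy_challenger, toy_reasoner, toy_judge, toy_proposer, toy_generator
)
print(state.reasoner_skills)
```

The point of the skeleton is the data flow: only the two skill lists change between iterations, and a snapshot of the Reasoner skills is kept each round so that a later selection step has candidates to choose from.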

4-3. Engineering notes

From a practical standpoint, the engineering notes worth taking from this paper are as follows.

Skill generation can be amortized offline
- If the same context will be queried many times, the upfront cost of building skills can be recovered later.
- This fits contexts that are used repeatedly, such as internal manuals, policy documents, product specs, and codebase conventions.
Task generator quality determines skill quality
- If the Challenger only produces shallow questions, the Reasoner skills also become shallow summaries.
- Good self-play starts from a good adversarial probe generator.
The rubric is the feedback interface
- With no external verifier, the Challenger rubric and the Judge are the core of the learning signal.
- If rubric design is inaccurate, the skills can become inaccurate too.
Always using the last skill set is risky
- Self-play can drift into over-specialization.
- A validation step that re-evaluates historical candidates, such as Cross-Time Replay, is needed.
Skill text has to be inspectable
- In a real product, rather than using generated skills as-is, people should be able to review them quickly in important domains.
- Being a natural-language artifact is what makes this review possible.

5. Evaluation

5-1. Main results

According to the public summary, Ctx2Skill improves the solving rate across several backbone models on the four context learning categories of CL-bench.

The representative reported figures are as follows.

Backbone	Without skill	With Ctx2Skill	Comment
GPT-4.1	11.1	16.5	Reported to exceed Gemini 3 Pro without skill (15.8) after skill augmentation
GPT-5.1	21.2	25.8	Gain holds on a stronger model
GPT-5.2	18.2	21.4	Improvement holds when the backbone changes

These figures suggest two things.

First, Ctx2Skill is not just a device that nudges weak models a little. Context-specific skills can add gains on stronger models as well.

Second, skill augmentation can partly narrow the gap between model generations. GPT-4.1 + Ctx2Skill being reported above Gemini 3 Pro without skill carries the message that, in context learning, what matters is not only base model capability but also how well context-specific procedures are supplied.

That said, these results must be rechecked against the detailed settings in Table 1 of the original paper. How the Judge model, per-category breakdown, baseline prompt, skill generation model, and inference model were set up matters for interpretation.
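For a quick sense of scale, the absolute and relative gains implied by the table above can be computed with a few lines of Python. The numbers are copied from the table as summarized here, not re-verified against the paper, and the relative-gain framing is my own addition.

```python
# Solving rates copied from the table above: (without skill, with Ctx2Skill).
reported = {
    "GPT-4.1": (11.1, 16.5),
    "GPT-5.1": (21.2, 25.8),
    "GPT-5.2": (18.2, 21.4),
}

gains = {}
for model, (base, with_skill) in reported.items():
    absolute = round(with_skill - base, 1)                  # gain in the table's units
    relative = round(100 * (with_skill - base) / base, 1)   # gain relative to the baseline, in percent
    gains[model] = (absolute, relative)
    print(f"{model}: +{absolute} ({relative}% relative to baseline)")
```

On these figures the absolute gains are 5.4, 4.6, and 3.2, while the gain relative to baseline is largest for GPT-4.1 because it starts from the lowest solving rate.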

5-2. What really matters in the experiments

1) Ctx2Skill's gains show up across all task categories

The paper reports consistent gains across the four CL-bench categories. This matters. Had the improvement appeared only on a particular rule task, Ctx2Skill could look like narrow prompt engineering. But when improvements span domain knowledge reasoning, rule system application, procedural task execution, and empirical discovery and simulation, it supports the reading that skills help several forms of context learning.

2) Challenger evolution matters

According to the ablation summary, removing Challenger skill evolution causes the largest drop. This is a fairly important result. It may seem that building good Reasoner skills is enough, but without a Challenger that produces good failure cases, the skills do not evolve properly.

In skill learning, the teacher role is played not by external labels but by the agent that produces good probes. The core of Ctx2Skill is therefore the Challenger-Reasoner pair, not the Reasoner alone.

3) Cross-Time Replay guards against overfitting

Cross-Time Replay is reported as the second most important component. The summary notes that a fixed-iteration baseline can show adversarial collapse, with performance dropping after the early rounds.

This result carries an important message for self-evolving agent systems in general. Accumulating experience does not always make things better. In particular, when self-generated tasks and self-generated skills form a closed loop, the system can gradually adapt to a strange internal distribution. Historical validation is essential.

4) Skill quality is evaluated separately too

According to the public summary, the paper evaluates generated skills with a GPT-4.1 judge on five dimensions (conciseness, faithfulness, clarity, effectiveness, reusability) and reports that they beat baselines such as AutoSkill4Doc.

This evaluation matters because the task solving rate alone cannot tell you whether a skill is good. Skills that are too long, unfaithful to the context, or hard to reuse are difficult to use in a product. Since Ctx2Skill turns skills into text artifacts, artifact quality has to be examined separately.

5) A stronger-skill-generator to weaker-inference-model workflow looks feasible

The AlphaXiv summary explains that skills produced by a stronger model transfer well to a weaker model, while the reverse direction was less effective. This is important in practice: instead of serving an expensive model every time, you can generate context-specific skills offline with the expensive model and then attach them to a small model for inference.

The transferability results, however, must be checked against the original experimental conditions. The cost interpretation changes substantially depending on which model was the skill generator, which downstream model the skills were attached to, and whether skills were regenerated per context.

6. Limitations

The feedback loop depends heavily on the LLM judge and rubrics
- This is partly unavoidable, since the paper addresses settings with no external verifier.
- But if the Challenger rubric is wrong or the Judge misinterprets it, skill updates can be reinforced in the wrong direction.
Self-play costs are incurred per context
- Ctx2Skill is much heavier than simple prompt augmentation.
- The cost can be amortized where the same context is reused many times, but for one-off queries the gain may be small relative to the cost.
Because skills are text artifacts, length and conflict problems can arise
- If the skill set keeps growing, it eats into the prompt budget and may come to contain mutually contradictory rules.
- Cross-Time Replay reduces overfitting, but skill compression and conflict resolution remain separate problems.
Generalizing CL-bench-centered results directly to real enterprise documents is risky
- Real documents come with problems such as OCR errors, table structure, layout, access control, policy versioning, and prompt injection.
- It is hard to conclude that skills that work well on a context learning benchmark are immediately safe in a production document workflow.
Adversarial collapse has not fully gone away
- The paper mitigates it with Cross-Time Replay, but how representative the self-generated task distribution is remains an open question.
- Hard/easy probes alone cannot catch every generalization failure.
The security perspective matters
- If skills are generated from an untrusted context, malicious instructions or hidden policies in that context can be compressed into skills.
- Real systems need source trust checks, instruction filtering, human review, and sandboxing before skill extraction.
Skill freshness and versioning remain open
- When the context is updated, existing skills can go stale.
- For product manuals and policy documents, when to invalidate the skill cache is important.
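One simple way to handle the freshness problem is to key cached skills by a content hash of the context, so that any edit to the document invalidates the entry. The sketch below is my own illustration, not something the paper proposes; the `SkillCache` class and its methods are hypothetical, and a real system would likely also track document versions and review status.

```python
import hashlib
from typing import Optional


class SkillCache:
    """In-memory cache of skill sets, invalidated when the context text changes."""

    def __init__(self) -> None:
        # doc_id -> (fingerprint of the context the skills were built from, skills)
        self._store: dict[str, tuple[str, list[str]]] = {}

    @staticmethod
    def fingerprint(context_text: str) -> str:
        return hashlib.sha256(context_text.encode("utf-8")).hexdigest()

    def put(self, doc_id: str, context_text: str, skills: list[str]) -> None:
        self._store[doc_id] = (self.fingerprint(context_text), list(skills))

    def get(self, doc_id: str, context_text: str) -> Optional[list[str]]:
        entry = self._store.get(doc_id)
        if entry is None:
            return None
        cached_fp, skills = entry
        if cached_fp != self.fingerprint(context_text):
            # The document changed since the skills were generated: drop the stale entry.
            del self._store[doc_id]
            return None
        return skills


cache = SkillCache()
cache.put("refund-policy", "Refunds within 30 days.", ["Check the purchase date first."])
fresh = cache.get("refund-policy", "Refunds within 30 days.")
stale = cache.get("refund-policy", "Refunds within 14 days.")
print(fresh, stale)
```

A `None` result is the signal to rerun skill generation for that context. Hashing the full text is deliberately conservative: even a whitespace edit triggers regeneration, which may or may not be the right trade-off for a given workflow.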

7. My Take

7-1. Why this matters for my work

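This paper is interesting because it reframes context learning from "the problem of reading long documents well" to "the problem of building a workable rulebook from documents".

In practice, when an LLM is deployed as document AI or an internal assistant, a setup where the model rereads long documents every time quickly becomes expensive and unstable. Contexts that are referenced repeatedly, such as OCR'd contracts, policy documents, operations manuals, API specs, and codebase conventions, are not read once and discarded; they are reused across many tasks.

Here a Ctx2Skill-style approach is quite natural.

In the first stage, build context-specific skills with a strong model.
Have a person review those skills quickly.
At serving time, attach the skills to a cheaper model to solve tasks.
When the context changes, invalidate the skill cache and rebuild.

In short, Ctx2Skill is less a replacement for RAG than a context-to-procedure compiler used alongside RAG. Retrieval brings in the evidence, and skills tell the model how to read and apply that evidence.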
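As a sketch of how skills and retrieved evidence might sit together in one serving prompt, consider the Python function below. The prompt layout, section headings, and the `build_prompt` name are assumptions of mine; the paper's actual prompt format should be checked in the original.

```python
def build_prompt(skills: list[str], evidence: list[str], question: str) -> str:
    """Assemble a serving prompt: skills first, then retrieved evidence, then the question."""
    skill_block = "\n".join(f"- {s}" for s in skills)
    evidence_block = "\n".join(f"[{i}] {e}" for i, e in enumerate(evidence, start=1))
    return (
        "Skills for this context:\n" + skill_block + "\n\n"
        "Retrieved evidence:\n" + evidence_block + "\n\n"
        "Question: " + question
    )


# Hypothetical skills and evidence for a refund-policy document.
prompt = build_prompt(
    skills=["Check the exception clauses before applying the general rule."],
    evidence=["Section 4.2: Sale items are not refundable."],
    question="Can a sale item bought 10 days ago be refunded?",
)
print(prompt)
```

Putting the skills ahead of the evidence reflects the division of labor described above: the skills say how to read, the evidence says what to read. Whether that ordering actually helps a given model is an empirical question.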

7-2. Reuse potential

The points worth trying to reuse are as follows.

Document-specific SOP generation
- Build verifiable skill sets from internal manuals or policy documents.
- Make them rule application checklists, not plain summaries.
Challenger-style evaluation generation
- Automatically generate tasks and rubrics for a new document domain to build an evaluation set.
- This is useful not only for skill generation but also for document AI benchmark generation.
Cross-Time Replay for agent memory
- When an agent keeps updating its memory or skills, do not trust the final memory unconditionally.
- Re-evaluate past memory candidates on representative probes.
Strong-to-small skill transfer
- Build context-specific skills with a large model and have a small model use them.
- This reduces serving cost and lets people inspect the skill artifact.
Skill quality rubric
- The idea of evaluating skills separately on conciseness, faithfulness, clarity, effectiveness, and reusability can be reused as-is.
- Faithfulness and reusability matter especially in document AI.

7-3. Follow-up papers

CL-bench: A Benchmark for Context Learning
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
GenericAgent: Contextual Information Density Maximization
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
AutoSkill4Doc
Voyager: An Open-Ended Embodied Agent with Large Language Models
Self-Evolving GPT

8. Summary

Ctx2Skill is a framework that turns context into reusable natural-language skills rather than plain prompt evidence.
Its core is Challenger, Reasoner, Judge self-play plus Proposer/Generator-based failure-driven skill updates.
Cross-Time Replay re-evaluates historical skill candidates on hard/easy probes to keep self-play from collapsing into over-specialized skills.
On CL-bench, solving rate improvements are reported across several backbones and the four context learning categories.
In practice, it can be read as a post-RAG context-to-procedure compiler, a document-specific SOP generator, and a strong-to-small skill transfer workflow.
