CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence Review

2026-06-22 13 min read

0. Introduction

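CiteVQA is a paper that does not stop at grading answers in a Document VQA benchmark; it also evaluates where in the actual document the evidence behind an answer is located. Existing Doc-VQA benchmarks mostly check only whether the final answer is correct. But in settings where evidence tracing matters, such as legal, financial, medical, and audit documents, an answer that is correct but drawn from the wrong table, paragraph, or image is close to a failure in practice.

What makes this paper interesting is that it does not frame the problem as just OCR accuracy or long-context QA accuracy. CiteVQA requires the model to return element-level bounding box citations along with its answer. In other words, the model output is not a single answer text; it must also include the source PDF, page, bbox, and evidence element. This setup treats a document intelligence system not as an “answer generator” but as a “verifiable evidence attribution system”.

One-line summary: CiteVQA is a document VQA benchmark of 711 PDFs, 1897 questions, 7 domains, and 2 languages, and its SAA metric, which scores answer accuracy and element-level evidence citation together, exposes attribution hallucination in MLLMs.

The paper is worth reading now for the following reasons.

It shows a benchmark design that moves Doc-VQA evaluation from answer-only to answer + evidence grounding.
It covers long PDFs, multi-document QA, bilingual documents, and table/image/equation evidence together, so it is close to the failure modes of real Document AI.
It shows quantitatively that current MLLMs get the answer right but the citation wrong, which the paper calls “Attribution Hallucination”.

The core of this paper is less the benchmark itself than the shift in evaluation philosophy.

In document QA, getting only the answer right is not enough. Especially in RAG, OCR agents, enterprise search, legal QA, and medical chart QA, a person must be able to verify immediately where the answer came from. CiteVQA enforces this requirement through its metric.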

1. Problem Setting

1-1. Problem definition

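The problem this paper targets is that answer correctness and evidence faithfulness are decoupled in MLLM-based document intelligence.

The input is a long-form PDF document or several PDF documents.
Questions cover factual retrieval, complex synthesis, multimodal parsing, and quantitative reasoning.
The model must not return only a textual answer; it must also return, as citations, the source document regions that support the answer.
Citations are required at the level of element bounding boxes, not pages.
The evaluation metric checks both whether the answer is correct and whether the citation is correct.

Under existing Doc-VQA benchmarks, a model can score highly even if it behaves as follows.

It looks at a different location in the document and outputs a value that happens to match the gold answer.
It guesses the answer from pretraining knowledge or common sense.
It gets the answer right but attaches the citation to an unrelated page or table.
When several documents are given, it finds evidence in a similar document rather than in the gold document.

CiteVQA treats these cases as failures. Even if the answer is correct, a sample with wrong evidence attribution earns no SAA credit.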

1-2. Why previous approaches are insufficient

The limitations of existing document VQA evaluation fall into four groups.

Answer-only scoring
- Comparing only the final answer does not reveal which evidence the model used.
- Even when the answer is correct, the reasoning path may rest on wrong evidence.
Coarse citation
- Some benchmarks provide page-level evidence, but they do not check whether the model found the actual evidence element within the page.
- In a long PDF a single page holds several paragraphs, tables, and figures, so a page-level hit alone is not enough.
Real-world document complexity
- Real PDFs contain long context, complex layout, cross-page evidence, tables, images, equations, and bilingual content.
- Single-page OCR QA cannot adequately evaluate this kind of structure.
Trustworthy AI requirement
- In domains such as legal, finance, and medicine, the audit trail can matter more than the answer.
- For a person to verify an answer, the source region must be directly traceable.

So the problem CiteVQA poses is closer to “does the model read the document while pinpointing its evidence accurately” than to “does the model read the document well”.

2. Core Idea

2-1. Main contribution

CiteVQA's core contributions can be summarized in three points.

First, it evaluates the answer and the citation jointly.

The model must return evidence bboxes along with the answer. The evaluation likewise looks at answer correctness, evidence recall, evidence relevance, and SAA together.

Second, it builds a benchmark on long, realistic PDFs.

It creates 1897 questions from 711 PDFs, with an average of 40.6 pages per document. The documents span 7 macro-domains and 30 sub-domains. Both English and Chinese are included, and both single-document and multi-document settings are covered.

Third, it proposes an automated evidence annotation pipeline.

Manual bbox annotation is expensive and prone to inconsistency. CiteVQA combines document parsing, an MLLM agent, QA template distillation, answerability verification, masking ablation, and expert validation to produce crucial evidence.

2-2. Design intuition

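The design intuition of this paper is clear: a trustworthy document AI has to be good at source verification as well as at answering.

Answer generation in document QA can be split into two steps.

Evidence selection
- Find which document, which page, and which element contain the evidence the question needs.
Answer synthesis
- Produce the final answer based on that evidence.

Existing benchmarks mostly score only the second step, answer synthesis. CiteVQA explicitly evaluates the first step, evidence selection, and counts a sample as a success only when both steps succeed.

From this angle, SAA is not just an extra metric. SAA is the criterion that separates a document QA system that is an answer generator from one that is an evidence-grounded assistant.

An important design point is that evidence is defined at the element level, not the page level.

A page citation leaves the reader with a high cost of re-locating the evidence. Especially in a PDF of 40 pages or more, handing over a single page is not enough for a real audit. CiteVQA requires bbox-level citations so that the answer is tied directly to a specific paragraph, table cell, figure, or equation.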

3. Architecture / Method

3-1. Overview

Item	Description
Goal	A document VQA benchmark that evaluates answer correctness and evidence attribution together
Input	A single PDF or multiple PDFs, page screenshots, parsed document elements
Output	answer plus element-level bbox citations
Core metric	Strict Attributed Accuracy, SAA
Dataset scale	711 PDFs, 1897 questions, 7 macro-domains, 30 sub-domains
Average length	40.6 pages per document
Key phenomenon	Attribution Hallucination
Difference from prior work	Joint answer + citation scoring rather than answer-only Doc-VQA

The CiteVQA pipeline consists of the following stages.

Document collection
Multi-document linking
Evidence package extraction
QA construction
Quality control
Evidence ablation
Expert evaluation and auxiliary validation

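What matters in this structure is that benchmark construction is evidence chain synthesis rather than plain QA synthesis. Instead of starting from a question and then finding its answer, the pipeline builds each question-answer pair around a specific evidence package in the documents.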

3-2. Module breakdown

1) Document collection

CiteVQA starts from a raw corpus of 100M+ PDFs. The paper names Common Crawl as the main source. From there, stratified sampling selects about 250k candidate documents, which then go through a two-stage MLLM-based annotation.

Coarse-grained stage: identifies the primary domain and language.
Fine-grained stage: classifies the sub-category within the domain.
In the end, 711 PDFs are selected.

The final dataset covers 7 macro-domains and 30 sub-domains. The mean page count is 40.6 and the median is 30.0. These numbers reflect a deliberate choice to make the benchmark closer to long-form document QA than to a single-page OCR benchmark.

2) Multi-document linking

CiteVQA does not cover only single-document QA. It also includes questions in a multi-document setting that require linking evidence across several PDFs.

Multi-document linking can be understood as two steps.

Dense retrieval
- Find related candidate documents by vector similarity.
LLM-based fine-grained alignment
- An LLM looks at the document section hierarchy and finds a logical bridge between the anchor section and a candidate section.

This process is not just grouping similar documents. It is a search for sections that are logically connected, so that they can form an evidence chain usable for question generation.
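The dense retrieval step can be sketched as a plain cosine-similarity ranking. This is only an illustration under stated assumptions: the review does not say which embedding model or index the paper uses, so the toy 3-dimensional vectors, the document names, and the helper `top_k_candidates` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_candidates(anchor_id, embeddings, k=2):
    """Rank every other document by similarity to the anchor and keep the best k."""
    anchor = embeddings[anchor_id]
    scored = [(cosine(anchor, vec), doc_id)
              for doc_id, vec in embeddings.items() if doc_id != anchor_id]
    # Highest similarity first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy embeddings; real ones would come from an embedding model.
embeddings = {
    "contract": [0.9, 0.1, 0.0],
    "amendment": [0.8, 0.2, 0.1],
    "invoice": [0.1, 0.9, 0.2],
    "newsletter": [0.0, 0.1, 0.9],
}
candidates = top_k_candidates("contract", embeddings, k=2)
print(candidates)
```

The shortlist produced here would only be the input to the second step; deciding whether two sections actually form a logical bridge is left to the LLM-based alignment.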

3) Evidence package extraction

The paper parses PDFs with MinerU2.5 and obtains fine-grained elements that carry a document ID, page number, bbox coordinates, and OCR content. A high-performance MLLM agent then explores the parsed bbox space and bundles supporting facts scattered across several pages or several documents.

This bundle is the Evidence Package.

The Evidence Package is the most important intermediate artifact in CiteVQA. QA pairs are generated from it, and the crucial evidence used by the metrics also comes from it.

4) QA construction

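CiteVQA distills real-world questions from open-source datasets into templates. The examples in the paper's appendix fall into the following four types.

Question type	Meaning
Factual Retrieval	Questions that look up a value for a specific entity, metric, or time period
Complex Synthesis	Questions that combine several pieces of evidence into an outlook or conclusion
Quantitative Reasoning	Questions that need a difference between two values, a calculation, or a comparison
Multimodal Parsing	Questions about visual style or the location of a table, paragraph, or figure

During QA generation, the MLLM picks a logical template that fits the evidence package and writes the question and answer within that template's constraints. This is closer to controlled benchmark construction than to random question generation.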

5) Quality control

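CiteVQA does not use the QA produced by the automatic pipeline as is. The main filters are as follows.

Answerability verification
- Check that the question can be answered from the evidence screenshots alone.
Paraphrasing
- Reduce template smell and diversify question style.
Relevance filtering
- Remove questions that can be solved from common knowledge alone, using a zero-document self-test.
Crucial evidence identification
- Mask each bbox element of the evidence package one at a time.
- If the model can no longer produce the correct answer after masking, label that element as crucial evidence.

Masking ablation is the key step here. Rather than having the annotation model simply assert that “this part is important”, the pipeline checks whether answerability breaks when the evidence element is removed. This is an attempt to make the ground truth of the Recall metric more empirical.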
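The leave-one-out logic of crucial evidence identification can be sketched as follows. This is a minimal illustration, not the paper's implementation: in the real pipeline the answerability check is an MLLM call on masked screenshots, whereas here it is replaced by a hypothetical `can_answer` callable, and the element fields `id` and `fact` are invented for the example.

```python
def find_crucial_evidence(package, can_answer):
    """Return ids of elements whose removal makes the question unanswerable."""
    crucial = []
    for element in package:
        # Mask exactly one element and keep the rest of the package visible.
        masked = [e for e in package if e["id"] != element["id"]]
        if not can_answer(masked):
            crucial.append(element["id"])
    return crucial

def toy_can_answer(visible):
    """Stand-in for the model: answerable only if both revenue figures are visible."""
    facts = {e["fact"] for e in visible}
    return {"revenue_2023", "revenue_2024"} <= facts

# Hypothetical evidence package for a year-over-year revenue question.
package = [
    {"id": "e1", "fact": "revenue_2023"},
    {"id": "e2", "fact": "revenue_2024"},
    {"id": "e3", "fact": "company_address"},
]
crucial = find_crucial_evidence(package, toy_can_answer)
print(crucial)
```

One property of single-element masking is visible in this sketch: if the same fact appeared in two elements, removing either one alone would leave the question answerable, so neither would be labeled crucial.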

6) Evaluation metrics

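CiteVQA's metrics look at the answer and the attribution separately and then at joint success.

Metric	Meaning
Rec.	How much of the crucial evidence is covered by predicted evidence, matched at IoU@0.5
Rel.	Whether the predicted evidence semantically supports the answer
Ans.	Whether the predicted answer semantically matches the gold answer
SAA	Sample-level metric that gives credit only when both the answer and the evidence attribution are correct
Page.	Page-level recall: whether the correct page was found
Prec. / F1	Precision of the predicted bboxes and the overall localization balance

Intuitively, SAA can be read as follows, where each factor can be read as a 0/1 indicator for the sample.

\[\mathrm{SAA} = \mathrm{AnswerCorrect} \times \mathrm{EvidenceGrounded}\]

And evidence recall is, intuitively:

\[\mathrm{Rec} = \frac{\text{matched\_crucial\_evidence}}{\text{total\_crucial\_evidence}}\]

In the paper itself, Rec. is computed at IoU@0.5, and Rel. and Ans. use an LLM judge. Table 3 converts the 0-5 Rel. and Ans. scores to a 100-point scale so that they can be compared alongside the other metrics.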
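The intuitive formulas above can be turned into a small sketch. Only the IoU@0.5 matching and the 0-5 to 100-point conversion come from the text; the exact rule the paper uses to decide that evidence is grounded is not given in this review, so `strict_attributed` below assumes "all crucial boxes matched", and the dictionary keys `doc`, `page`, and `bbox` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def evidence_recall(predicted, crucial, threshold=0.5):
    """Fraction of crucial boxes matched by a predicted box on the same doc and page."""
    if not crucial:
        return 0.0
    matched = sum(
        1 for g in crucial
        if any(p["doc"] == g["doc"] and p["page"] == g["page"]
               and iou(p["bbox"], g["bbox"]) >= threshold for p in predicted)
    )
    return matched / len(crucial)

def to_100(score_0_5):
    """Convert a 0-5 judge score to a 100-point scale."""
    return score_0_5 * 20.0

def strict_attributed(answer_ok, rec, rec_threshold=1.0):
    """ASSUMED rule: credit only if the answer is correct and recall reaches the threshold."""
    return int(bool(answer_ok) and rec >= rec_threshold)

crucial = [
    {"doc": 0, "page": 4, "bbox": (0, 0, 10, 10)},
    {"doc": 0, "page": 4, "bbox": (20, 20, 30, 30)},
]
predicted = [
    {"doc": 0, "page": 4, "bbox": (0, 0, 10, 8)},      # IoU 0.8 with the first box
    {"doc": 0, "page": 4, "bbox": (26, 26, 36, 36)},   # overlaps the second box too little
]
rec = evidence_recall(predicted, crucial)
saa = strict_attributed(True, rec)
print(rec, saa)
```

In this toy sample the answer is correct but only one of two crucial elements is located, so the sample earns no credit under the assumed rule, which is the attribution-hallucination pattern the metric is meant to catch.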

4. Training / Data / Recipe

4-1. Data

The CiteVQA dataset is composed as follows.

Statistic	Value
Documents	711
Macro / Micro domains	7 / 30
Avg. / Median pages	40.6 / 30.0
Document language EN / ZH	451 / 260
Total questions	1897
Single-Doc	987
Multi (1-Gold)	487
Multi (N-Gold)	423
Avg. / Max evidence elements	2.57 / 10

The question type distribution is as follows.

Question type	Count	Ratio (%)
Complex Synthesis	839	44.23
Factual Retrieval	499	26.30
Multimodal Parsing	352	18.56
Quantitative Reasoning	207	10.91

The evidence source distribution also matters.

Evidence source	Count	Ratio (%)
Text	2082	70.12
Table	653	21.99
Image	209	7.04
Equation	25	0.84

This composition matters because it does not reduce document QA to pure text retrieval. About 30 percent of the evidence comes from tables, images, and equations. A model therefore has to handle layout, visual regions, table structure, and equation position, not just OCR text.
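The "about 30 percent" figure can be checked directly from the evidence source counts in the table. The short calculation below uses only those four counts.

```python
# Evidence element counts taken from the evidence source table.
counts = {"Text": 2082, "Table": 653, "Image": 209, "Equation": 25}

total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
non_text_share = round(100 * (total - counts["Text"]) / total, 2)

print(total)           # total number of evidence elements
print(shares)          # per-source percentage
print(non_text_share)  # share of table + image + equation evidence
```

The non-text share comes out just under 30 percent, matching the rounded claim in the text.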

4-2. Training strategy

CiteVQA itself is a benchmark paper, not a model training paper. It does, however, include dataset construction and an auxiliary training validation.

The main recipe is as follows.

PDF candidate filtering
- Build 250k candidates from 100M+ raw PDFs and classify domain and language with MLLM annotation.
Document parsing
- Extract pages, bboxes, OCR content, and visual elements with MinerU2.5.
Evidence package generation
- An MLLM agent explores the parsed element space and bundles scattered supporting facts.
QA synthesis
- Apply templates distilled from open-source datasets to each evidence package.
Quality filtering
- Run answerability, relevance, paraphrasing, and common knowledge filtering.
Crucial evidence ablation
- Mask bbox elements to identify the ones that are essential for producing the correct answer.
Expert and training validation
- Run an expert audit on 200 samples and confirm the usefulness of the pipeline through auxiliary SFT training validation.

4-3. Engineering notes

From an engineering point of view, the evaluation harness may matter more than the benchmark construction.

First, PDF rendering has to be controlled. The paper uses the native File API for the Gemini series and feeds 150 DPI page screenshots sequentially to the other models. For models with a different context limit, the resolution is adjusted.

Second, the output format has to be strict. The model must return the answer together with evidence bboxes, and the evaluation script must be able to read the document index, page index, and bbox coordinates.
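A strict output contract of this kind can be enforced with a small validating parser. The JSON layout below is an assumption made for illustration: the review does not give the paper's actual response schema, so the field names `answer`, `citations`, `doc`, `page`, and `bbox` and the function `parse_prediction` are hypothetical.

```python
import json

REQUIRED_KEYS = ("doc", "page", "bbox")

def parse_prediction(raw):
    """Parse a model response into (answer, citations); raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("missing string field 'answer'")
    raw_citations = data.get("citations")
    if not isinstance(raw_citations, list):
        raise ValueError("missing list field 'citations'")
    citations = []
    for item in raw_citations:
        if not isinstance(item, dict) or any(k not in item for k in REQUIRED_KEYS):
            raise ValueError(f"citation needs keys {REQUIRED_KEYS}: {item!r}")
        bbox = item["bbox"]
        if not (isinstance(bbox, list) and len(bbox) == 4
                and all(isinstance(v, (int, float)) for v in bbox)):
            raise ValueError(f"bbox must be four numbers: {bbox!r}")
        x0, y0, x1, y1 = bbox
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"bbox has non-positive size: {bbox!r}")
        citations.append({"doc": int(item["doc"]), "page": int(item["page"]),
                          "bbox": (float(x0), float(y0), float(x1), float(y1))})
    return data["answer"], citations

good = '{"answer": "12.4%", "citations": [{"doc": 0, "page": 17, "bbox": [72, 310, 540, 355]}]}'
answer, citations = parse_prediction(good)
print(answer, citations)
```

Rejecting malformed responses up front keeps a formatting failure from being silently scored as a localization failure.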

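Third, judge selection matters. The paper uses Qwen3-VL-235B-A22B as the primary judge for automated evaluation and validates judge agreement with a human study. On 200 samples it runs a Friedman test on the differences among the Rel. and Ans. scores from Gemini-3-Flash, Qwen3-VL-235B, and human experts, and reports a p-value above 0.05, i.e. no statistically significant difference among the three raters at that level.

Fourth, the resolution trade-off is large. The paper's appendix explains that evidence attribution metrics are more sensitive to a drop in resolution than answer accuracy is. So in a document QA system, screenshot resolution is not just a visual quality issue; it is a variable that directly changes the attribution metrics.

When reproducing the benchmark, the first thing to pin down is the document input contract, not the model.

The DPI used to render the PDF, the page order, the coordinate system used for bbox coordinates, and how bboxes are restored after downscaling all affect the metrics.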
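The coordinate bookkeeping can be sketched in a few lines. The only fixed fact used here is that the default PDF unit is 1/72 inch, so a 150 DPI render scales coordinates by 150/72. Everything else is an assumption for illustration: the function names are hypothetical, and both boxes are assumed to use a top-left origin (a box in native PDF coordinates, which have a bottom-left origin, would also need a vertical flip).

```python
def points_to_pixels(bbox_pt, dpi=150):
    """Convert a box from PDF points (1/72 inch) to pixels at the given render DPI."""
    return tuple(v * dpi / 72.0 for v in bbox_pt)

def restore_bbox(bbox_seen, seen_size, rendered_size):
    """Map a box predicted on a downscaled image back to the full render's pixel grid."""
    sx = rendered_size[0] / seen_size[0]
    sy = rendered_size[1] / seen_size[1]
    x0, y0, x1, y1 = bbox_seen
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

# A US Letter page is 612 x 792 points, i.e. 1275 x 1650 pixels at 150 DPI.
page_px = points_to_pixels((0, 0, 612, 792), dpi=150)

# Gold box stored in points, converted to the 150 DPI pixel grid.
gold_px = points_to_pixels((72, 72, 144, 108), dpi=150)

# Hypothetical case: the model saw the page downscaled to 850 x 1100 pixels.
pred_px = restore_bbox((100, 100, 200, 150), seen_size=(850, 1100),
                       rendered_size=(1275, 1650))
print(page_px, gold_px, pred_px)
```

If the restore step were skipped in this example, the predicted box would be compared at two-thirds of its true scale and would fail an IoU@0.5 match even though the model pointed at the right region.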

5. Evaluation

5-1. Main results

The paper evaluates 20 MLLMs. The main results are as follows.

Model	Rec.	Rel.	Ans.	SAA
Gemini-3.1-Pro-Preview	66.0	83.6	86.1	76.0
Gemini-3-Flash-Preview	45.4	75.7	84.5	65.4
GPT-5.4	31.0	67.5	87.1	59.0
Gemini-2.5-Pro	27.4	59.8	82.2	47.0
Seed2.0-Pro	28.5	54.9	81.3	44.1
Qwen3-VL-235B-A22B	11.3	35.3	72.3	22.5
Gemma-4-31B	11.6	35.0	69.8	20.2
Qwen3-VL-8B	1.0	14.7	61.2	7.5

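The most important result is that GPT-5.4 scores 87.1 on Ans. but only 59.0 on SAA. Judged on answer correctness alone it is very strong, but once evidence attribution is included its performance drops sharply. Gemini-3.1-Pro-Preview is the strongest system, yet its SAA still stops at 76.0. Qwen3-VL-235B-A22B, reported as the strongest open-source MLLM, has an SAA of 22.5.

This result is not simply a story about open-source models being weaker than closed-source ones. The more important message is that answer accuracy and attribution accuracy are different capabilities.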
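The gap between Ans. and SAA can be computed straight from the main results table. The snippet below uses only the numbers in that table; the name `gaps` is just a label for the Ans. minus SAA difference.

```python
# (Ans., SAA) pairs copied from the main results table.
results = {
    "Gemini-3.1-Pro-Preview": (86.1, 76.0),
    "Gemini-3-Flash-Preview": (84.5, 65.4),
    "GPT-5.4": (87.1, 59.0),
    "Gemini-2.5-Pro": (82.2, 47.0),
    "Seed2.0-Pro": (81.3, 44.1),
    "Qwen3-VL-235B-A22B": (72.3, 22.5),
    "Gemma-4-31B": (69.8, 20.2),
    "Qwen3-VL-8B": (61.2, 7.5),
}

# Points lost when evidence attribution is required in addition to a correct answer.
gaps = {model: round(ans - saa, 1) for model, (ans, saa) in results.items()}

best_by_ans = max(results, key=lambda m: results[m][0])
best_by_saa = max(results, key=lambda m: results[m][1])

for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model}: Ans. - SAA = {gap}")
print(best_by_ans, best_by_saa)
```

The ranking changes depending on the metric: GPT-5.4 leads this table on Ans., while Gemini-3.1-Pro-Preview leads on SAA, which is the sense in which the two scores measure different capabilities.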

5-2. What really matters in the experiments

1) Attribution hallucination is quantified

The paper calls the phenomenon where the answer is right but the citation is wrong “Attribution Hallucination”.

This failure mode is especially dangerous in document AI. A user may trust the answer at first sight, but if clicking the citation opens a completely different table or paragraph, the system fails from an audit standpoint.

Table 3 of CiteVQA puts numbers on this. GPT-5.4 has Ans. 87.1 but SAA 59.0, and Qwen3.6-Plus has Ans. 85.9 but SAA 17.5. This means the gap between answer generation and evidence localization is very large.

2) Page-level recall is not good enough yet either

It is easy to expect bbox localization to be hard. But the appendix results show that some models already struggle badly at the stage of finding the correct page.

Examples of overall page-level recall are as follows.

Model	Overall Page.	Overall Rec.	Overall F1
Gemini-3.1-Pro-Preview	87.9	66.0	58.9
GPT-5.4	83.4	31.0	25.4
Qwen3-VL-235B-A22B	57.8	11.3	10.0
Qwen3-VL-8B	31.1	1.0	0.8

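As the table shows, there is also a large difference between a page-level hit and a bbox-level hit. GPT-5.4 in particular has high page-level recall but low bbox-level Rec. and F1. This shows that even when a model can find the rough location in a document, pinpointing the exact evidence element is a separate problem.

3) The multi-document setting makes attribution harder

In the Multi (N-Gold) setting, evidence has to be drawn from several gold documents. The model has to find the relevant documents, find the pages, find the bboxes, and then synthesize the answer.

The paper's appendix reports that GPT-5.4's page-level recall falls from 88.5 in Single-Doc to 75.4 in Multi (N-Gold), and its F1 falls from 29.6 to 20.6. Qwen3-VL-235B-A22B's page-level recall likewise falls from 64.4 to 50.5.

This connects directly to enterprise RAG. In real work, it is more common to receive a contract, side agreements, policy documents, invoices, and email attachments together than to receive a single PDF. If multi-document attribution is weak, answer-only QA may look plausible while the audit trail collapses.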

4) Difficulty varies by document type

The paper reports that in the Academic Tech domain Gemini-3.1-Pro-Preview reaches an SAA of 85.0, while in the Publishing and Media domain the highest SAA is only 63.3.

This difference is related to layout complexity. Academic papers have a relatively standardized structure. Newspaper and magazine style documents, by contrast, have complex typography, non-linear layout, and image-text interleaving. So for a document intelligence benchmark, the average across domains is not enough; scores should be examined separately by document layout type.

5) Dataset construction quality also has to be examined

Because CiteVQA relies on automated benchmark construction, annotation quality is critical. The paper runs a PhD-level expert audit on 200 randomly selected samples.

In Table 7, the human expert scores for question difficulty, answer quality, and evidence quality are reported as 2.97, 4.43, and 4.91 respectively. The Gemini-3-Flash and Qwen3-VL-235B judge results come out similar. This is a sign that the evidence quality of the automatic pipeline is fairly high, but it should not be read as equivalent to a fully manual gold standard.

6. Limitations

Domain-specific evidence nuance remains.
- The benchmark covers 7 domains, but in specific verticals within legal, medical, and finance, the definition of authoritative evidence can be more complex.
The automatic pipeline depends on strong MLLMs.
- Dataset construction relies heavily on a state-of-the-art MLLM and document parser, so large-scale reproduction faces a cost barrier.
Evaluation cost is high.
- Unlike standard VQA, which compares only answer strings, the evaluation looks at bbox coordinates, relevance, and answer correctness together. Compute cost and judge cost both go up.
BBox citations do not represent every kind of document evidence.
- Some evidence is spread across rows, long paragraphs, implicit calculations, or multiple pages. Element-level bboxes are powerful, but they do not perfectly express every reasoning trace.
There is a risk of metric overfitting.
- The paper itself notes that over-optimizing for CiteVQA's metrics and document distribution may hurt generalization to the variety of real document structures.
Judge model bias is hard to remove completely.
- The human study validates judge reliability, but Rel. and Ans. still depend on an LLM judge. Judge selection can matter especially for near-miss answers and domain-specific reasoning.
The PDF sources and copyright handling of the dataset need care.
- The project README states academic research and non-commercial use, and notes that rights holder issues may exist. For practical reuse, the license and document sources should be checked separately.

7. My Take

7-1. Why this matters for my work

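CiteVQA shows well what kind of contract should be required when evaluating Document AI.

Many RAG and Doc-VQA systems show a source link alongside the answer. But they rarely verify whether the source link actually supports the answer. CiteVQA moves this question to the center of the benchmark.

Going forward, what matters in enterprise document QA is citation quality more than answer quality.

Answer quality can keep rising as LLMs improve. But if the citation is wrong, the user cannot verify the answer. Especially for contracts, regulations, medical records, and investment reports, “which part of the document the answer is based on” is as important as the answer itself.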

7-2. Reuse potential

There are four points I would like to reuse.

SAA-style joint metric
- Define sample success by tying answer correctness and evidence correctness together like a product.
- In RAG evaluation, this can be adapted into a metric that checks answer faithfulness and passage localization at the same time.
Masking ablation for crucial evidence
- Remove evidence candidates one at a time and check whether answerability breaks.
- Finding crucial evidence from model behavior, instead of having people mark citation labels, can also be used in other dataset construction efforts.
Multi-document attribution setting
- Real work is closer to a multi-document bundle than to a single PDF.
- This extends naturally to a benchmark that requires an evidence chain across a document bundle such as contract + appendix + invoice + email.
Page-level and bbox-level diagnostics
- Separate the problem of not finding the correct page from the problem of not finding the correct element.
- When improving a system, this makes it easier to see which of the retriever, visual parser, reasoning model, or answer generator is the bottleneck.
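The page-level versus bbox-level split can be sketched as a per-element failure classifier. This is my own illustration rather than anything from the paper: the labels `doc_miss`, `page_miss`, and `element_miss`, the function `diagnose`, and the dictionary keys are hypothetical, and the 0.5 threshold simply mirrors the IoU@0.5 criterion mentioned for Rec.

```python
from collections import Counter

def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def diagnose(gold, predicted, threshold=0.5):
    """Label one gold element as 'hit', 'element_miss', 'page_miss', or 'doc_miss'."""
    same_doc = [p for p in predicted if p["doc"] == gold["doc"]]
    if not same_doc:
        return "doc_miss"        # no citation points at the right document
    same_page = [p for p in same_doc if p["page"] == gold["page"]]
    if not same_page:
        return "page_miss"       # right document, wrong page
    if any(iou(p["bbox"], gold["bbox"]) >= threshold for p in same_page):
        return "hit"
    return "element_miss"        # right page, wrong region

box = (0, 0, 10, 10)
gold_elements = [
    {"doc": 0, "page": 3, "bbox": box},
    {"doc": 0, "page": 7, "bbox": box},
    {"doc": 1, "page": 2, "bbox": box},
    {"doc": 2, "page": 1, "bbox": box},
]
predicted = [
    {"doc": 0, "page": 3, "bbox": (0, 0, 10, 9)},
    {"doc": 0, "page": 5, "bbox": box},
    {"doc": 1, "page": 2, "bbox": (50, 50, 60, 60)},
]
labels = [diagnose(g, predicted) for g in gold_elements]
print(Counter(labels))
```

Aggregating these labels over a test set gives a rough pointer to the bottleneck: many `doc_miss` or `page_miss` cases suggest a retrieval problem, while many `element_miss` cases suggest a localization problem.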

7-3. Follow-up papers

DocVQA: A Dataset for VQA on Document Images
MP-DocVQA: Multi-page Document Visual Question Answering
MMLongBench-Doc: Benchmarking Long-context Document Understanding
SlideVQA: A Dataset for Document Visual Question Answering on Slides
ViDoRe: Visual Document Retrieval Benchmark
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
AgenticOCR: A Document Agent Framework for OCR and Evidence Grounding

8. Summary

CiteVQA extends Doc-VQA from answer-only evaluation to answer + evidence attribution evaluation.
The dataset consists of 711 PDFs and 1897 questions, with an average of 40.6 pages, 7 domains, and 2 languages.
The core metric, SAA, counts a sample as a success only when both the answer and the cited region are correct.
In the experiments, even models with high answer scores such as GPT-5.4 show attribution hallucination, with SAA dropping sharply.
In practice, the main message is that citation quality should be tracked as a separate metric in RAG, OCR agents, legal QA, and finance document QA.
