QuAC
Basic Information
Full Name: Question Answering in Context
Domain: Natural Language Processing
Year: 2018
Dataset Scale
Conversations: 14,000+
Questions: 100,000+
Data Points: 100,000+ question-answer pairs
Research Context
Authors: Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, Luke Zettlemoyer
Institutions: University of Washington, University of Maryland, University of Massachusetts Amherst, Microsoft Research, Allen Institute for AI, Stanford University
Conference: EMNLP 2018
Technical Details
Format: JSON files with passages, dialogues, and extractive answers
Observables: Question answering in conversational context, coreference resolution, dialogue state tracking
Subsets: 1 subset
Publication: EMNLP 2018
Dataset: Official Website & Leaderboard

Dataset Summary

QuAC (Question Answering in Context) is a large-scale dataset for conversational question answering built around a student-teacher dialogue framework. Unlike traditional QA datasets that treat questions independently, QuAC requires models to understand dialogue context, resolve references across conversation turns, and handle the natural ambiguity of information-seeking conversations.

Key Features

  • Student-Teacher Framework: Asymmetric information access modeling real-world learning scenarios
  • Heavy Coreference Resolution: 86% of questions require resolving references across dialogue context
  • Context-Dependent Questions: Questions build upon previous exchanges and dialogue history
  • Extractive Answering: Answers extracted directly from Wikipedia passages for objective evaluation
  • Large Scale: 100,000+ questions across 14,000+ conversations on biographical content
  • Challenging Benchmark: 20+ F1 point gap between human and machine performance

Dataset Structure

QuAC conversations follow a structured student-teacher interaction:

  • Teacher: Has access to the complete Wikipedia biographical passage
  • Student: Sees only the title, introduction paragraph, and section heading
  • Dialogue: The student asks questions to learn about the hidden content while the teacher answers with spans extracted from the passage
  • Termination: The dialogue ends after 12 questions, manual termination, or two consecutive unanswerable questions
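As a minimal illustration of this structure, one turn and one dialogue could be modeled as below. Field names are assumptions made for the sketch, not the official release schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    """One student question and the teacher's extractive answer."""
    question: str          # student question, often with pronouns ("What did she do next?")
    answer_text: str       # span copied verbatim from the hidden Wikipedia passage
    answer_start: int      # character offset of the span in the passage
    unanswerable: bool     # True when the passage does not contain the answer

@dataclass
class Dialogue:
    title: str             # article title, visible to the student
    section: str           # section heading, visible to the student
    passage: str           # full section text, visible only to the teacher
    turns: List[Turn]

def is_terminated(dialogue: Dialogue, max_turns: int = 12) -> bool:
    """Check the termination conditions described above: 12 questions asked,
    or two consecutive unanswerable questions (manual termination is not
    modeled here)."""
    if len(dialogue.turns) >= max_turns:
        return True
    last_two = dialogue.turns[-2:]
    return len(last_two) == 2 and all(t.unanswerable for t in last_two)
```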

Content Selection

QuAC Content Distribution
  • Name: Wikipedia Biographical Articles
  • Description: Subjects with 100+ incoming links for quality control
  • Count: 100% of content

Question Analysis

QuAC Question Characteristics
  • Context-dependent questions (86%): Questions requiring coreference resolution
  • Non-factoid questions (54%): Questions beyond simple fact retrieval
  • Unanswerable questions: Realistic scenarios where the information isn't available in the passage
  • Extended answers: Longer responses than in other extractive datasets

Coreference Resolution Challenges

QuAC’s primary complexity comes from extensive coreference resolution requirements:

Reference Types

  • Passage References: Pronouns and references to entities mentioned in the source text
  • Dialogue References: References to previously discussed topics in the conversation
  • Abstract References: Challenging cases like “what else?” requiring scope inference

The prevalence of these reference types makes QuAC particularly challenging since coreference resolution remains an active research problem in NLP.

Use Cases

Primary Applications

  • Conversational AI systems requiring context understanding
  • Educational AI systems for information-seeking dialogues
  • Virtual assistants handling multi-turn questioning
  • Reading comprehension with dialogue state tracking

Research Applications

  • Coreference resolution in conversational contexts
  • Student-teacher interaction modeling
  • Context-dependent question answering
  • Dialogue state tracking and memory systems

Quality & Limitations

Strengths

  • Realistic Framework: Student-teacher setup mirrors real-world information-seeking scenarios
  • Challenging Benchmark: Substantial performance gap indicates research opportunities
  • Objective Evaluation: Extractive answers reduce annotation complexity while maintaining evaluation rigor
  • Context Complexity: Heavy coreference requirements test sophisticated NLP capabilities

Limitations

  • Domain Restriction: Limited to biographical content, which may not generalize to all conversational contexts
  • Extractive Constraint: Answers must come from passage text, limiting natural conversational responses
  • Performance Gap: Large human-machine gap (20+ F1) indicates dataset difficulty
  • English Only: Limited to English language conversations

Performance Benchmarks

QuAC establishes challenging benchmarks with novel evaluation metrics:

Traditional Metrics

  • Human Performance: 81.1% F1 score
  • Best Baseline (2018): BiDAF++ with context at 60.2% F1
  • Performance Gap: 20+ F1 point difference

Human Equivalence Metrics

  • HEQ-Q (Question-level): Percentage of questions achieving human-level performance
    • Human: 100%, Best Model: 55.1%
  • HEQ-D (Dialogue-level): Percentage of complete dialogues matching human performance
    • Human: 100%, Best Model: 5.2%

These metrics reveal that while models may perform reasonably on individual questions, maintaining consistency across entire conversations remains challenging.
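The sketch below illustrates how these metrics relate, assuming per-question system and human F1 scores are already available. The simplified token F1 and the field names are illustrative; the official evaluation script (which also normalizes text and handles multiple references) is the authoritative implementation.

```python
from collections import Counter
from typing import Dict, List

def token_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a predicted span and a reference span,
    in the spirit of SQuAD/QuAC-style evaluation (simplified)."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def heq_scores(dialogues: List[List[Dict]]) -> Dict[str, float]:
    """HEQ over dialogues, where each question dict carries a system F1 and a
    human F1 (field names are assumptions for this sketch).

    HEQ-Q: fraction of questions where the system F1 >= human F1.
    HEQ-D: fraction of dialogues where that holds for every question.
    """
    q_total = q_ok = d_ok = 0
    for dialogue in dialogues:
        all_ok = True
        for q in dialogue:
            q_total += 1
            if q["system_f1"] >= q["human_f1"]:
                q_ok += 1
            else:
                all_ok = False
        d_ok += all_ok
    return {"HEQ-Q": q_ok / q_total, "HEQ-D": d_ok / len(dialogues)}
```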

Getting Started

  1. Access: Download the dataset from the official QuAC website
  2. Format: JSON files containing Wikipedia passages, student-teacher dialogues, and extractive answers
  3. Evaluation: Use official evaluation scripts and HEQ metrics for comprehensive assessment
  4. Baseline Models: Start with reading comprehension models extended for dialogue context
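As a rough starting point, loading and iterating the dataset in Python might look like the sketch below. The field names follow the SQuAD-style layout the QuAC release uses, and the file name is an assumption; verify both against the files you actually download.

```python
import json

# Load one split of the dataset ("train_v0.2.json" is an assumed file name).
with open("train_v0.2.json") as f:
    quac = json.load(f)

for article in quac["data"]:
    for paragraph in article["paragraphs"]:
        passage = paragraph["context"]       # full section text (teacher-visible)
        for qa in paragraph["qas"]:          # questions in dialogue order
            question = qa["question"]
            answer = qa["answers"][0]        # one of the extractive reference answers
            start = answer["answer_start"]
            print(question, "->", passage[start:start + len(answer["text"])])
```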

Methodological Innovation

Student-Teacher Design

The asymmetric information design ensures:

  • Student questions naturally differ from passage content
  • Realistic information-seeking scenarios
  • Context-dependent questioning patterns
  • Challenging coreference resolution requirements

Data Collection Process

  • Platform: Amazon Mechanical Turk with careful quality control
  • Setup: Two-person dialogue with defined roles and access restrictions
  • Quality Assurance: Multiple termination conditions and extractive answer requirements

Related Datasets

QuAC complements other conversational QA datasets:

  • CoQA: Features abstractive answers and different domain coverage
  • SQuAD: Single-turn baseline for comparison
  • MS MARCO: Large-scale QA for general comparison

Research Impact

QuAC has advanced conversational AI research by:

  • Highlighting the importance of coreference resolution in dialogue
  • Establishing benchmarks for context-dependent question answering
  • Introducing novel evaluation metrics (HEQ) for conversational consistency
  • Demonstrating the complexity of natural information-seeking dialogues

The dataset continues to challenge researchers to develop systems capable of genuine conversational understanding rather than isolated fact retrieval.


Citation: Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W. T., Choi, Y., Liang, P., & Zettlemoyer, L. (2018). QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 2174-2184).