Introduction
The QuAC dataset (Question Answering in Context) frames conversational question answering as a student-teacher interaction. Published at EMNLP 2018 by Choi et al., the work addresses how systems can understand dialogue context, resolve references across conversation turns, and handle the ambiguity of natural conversation, unlike datasets that treat each question independently.
The dataset addresses limitations in prior question answering research by capturing the complexities of real-world information-seeking dialogue, where questions build on previous exchanges and context drives understanding.
For comparison with related work, see my analysis of CoQA. For detailed dataset specifications, see the QuAC dataset card.
The Student-Teacher Framework
QuAC models information-seeking dialogue through a student-teacher setup:
- Teacher: Has complete access to information (Wikipedia passage)
- Student: Seeks knowledge through questioning with limited initial context
- Interaction: Handles context-dependent questions, abstract inquiries, and unanswerable requests
This framework mirrors real-world scenarios where one party has expertise while another seeks to learn through dialogue. AI systems must act as effective teachers, using available information to provide helpful responses despite ambiguous or incomplete questions.
The dataset contains approximately 100,000 questions across roughly 14,000 dialogues, providing substantial scale for training and evaluation.
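For a quick sense of that scale, the dataset can be loaded and counted directly. The sketch below assumes the Hugging Face `datasets` library and the `allenai/quac` hub identifier, and the field names reflect one published version of the data, so verify both before relying on them.

```python
# Minimal sketch: count dialogues and questions in QuAC.
# Assumes the Hugging Face `datasets` library and the "allenai/quac" identifier;
# the "questions" field name follows one published version of the dataset.
from datasets import load_dataset

train = load_dataset("allenai/quac")["train"]

n_dialogues = len(train)
n_questions = sum(len(dialogue["questions"]) for dialogue in train)
print(f"train split: {n_dialogues} dialogues, {n_questions} questions")
```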

Dataset Construction
QuAC was built using Amazon Mechanical Turk with a two-person dialogue setup:
Teacher role: Has access to the complete Wikipedia passage and provides answers extracted directly from the text
Student role: Sees only the article title, introduction paragraph, and section heading, then asks questions to learn about the content
This asymmetric information design ensures student questions naturally differ from the passage content, creating realistic information-seeking scenarios. The extractive answer requirement maintains objective evaluation while simplifying scoring.
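Because answers must be spans of the teacher's passage, each annotation can be stored as a character offset plus the answer text and checked mechanically against the source. The snippet below is a schematic illustration in the style of SQuAD-like extractive annotation; the passage text and the `answer_start`/`text` field names are invented for the example and are not QuAC's exact schema.

```python
# Hypothetical extractive-answer annotation (invented passage and field names):
# the answer text must equal the passage slice beginning at the stored offset.
passage = "Daffy Duck is an animated cartoon character created in 1937."
answer = {"answer_start": passage.index("1937"), "text": "1937"}

start = answer["answer_start"]
span = passage[start:start + len(answer["text"])]
assert span == answer["text"], "annotation does not match the passage"
```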
Dialogue termination:
- 12 questions answered
- Manual termination by either participant
- Two consecutive unanswerable questions
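Restated in code, the termination rules above amount to a simple predicate. This is my own sketch of the logic as described, not code from the actual collection interface.

```python
# Sketch of the dialogue-termination rules described above
# (a restatement, not the crowdsourcing interface's actual code).
MAX_ANSWERED_QUESTIONS = 12
MAX_CONSECUTIVE_UNANSWERABLE = 2

def dialogue_should_end(num_answered: int,
                        consecutive_unanswerable: int,
                        participant_quit: bool) -> bool:
    return (num_answered >= MAX_ANSWERED_QUESTIONS
            or consecutive_unanswerable >= MAX_CONSECUTIVE_UNANSWERABLE
            or participant_quit)
```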

Content Selection
QuAC focuses on Wikipedia biographical articles for several practical reasons:
- Reduced complexity: People-focused content requires less specialized domain knowledge
- Natural question flow: Biographical information lends itself to sequential questioning
- Quality control: Articles filtered to include only subjects with 100+ incoming links, ensuring content depth
This focused scope enables consistent evaluation while maintaining broad coverage through diverse biographical subjects across fields and time periods.
Key Dataset Characteristics
QuAC introduces several features that distinguish it from existing question answering benchmarks:

Notable features:
- High contextual dependency: 86% of questions require coreference resolution
- Non-factoid focus: 54% of questions go beyond simple fact retrieval
- Extended answers: Longer answer spans than in earlier extractive datasets such as SQuAD
- Unanswerable questions: Realistic scenarios where information isn’t available
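One of these properties is easy to check empirically: in the released data, unanswerable questions are conventionally marked with a special answer string (commonly CANNOTANSWER). The sketch below assumes that convention and the same `allenai/quac` layout used earlier; both are assumptions worth verifying against the dataset card.

```python
# Rough sketch: estimate the share of unanswerable questions.
# Assumes the "allenai/quac" layout used earlier and the CANNOTANSWER
# convention for unanswerable gold answers (verify both on the dataset card).
from datasets import load_dataset

train = load_dataset("allenai/quac")["train"]

total = no_answer = 0
for dialogue in train:
    for answer_text in dialogue["orig_answers"]["texts"]:
        total += 1
        no_answer += answer_text == "CANNOTANSWER"

print(f"{no_answer / total:.1%} of questions are marked unanswerable")
```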

The Coreference Resolution Challenge
QuAC’s complexity stems from its heavy reliance on coreference resolution across multiple contexts:
Reference types:
- Passage references: Pronouns and references to entities in the source text
- Dialogue references: References to previously discussed topics
- Abstract references: Challenging cases like “what else?” that require inferring the scope of the inquiry

The prevalence of coreference resolution makes QuAC particularly challenging, as this remains an active research problem in NLP. Models must understand passage content, track dialogue history, and resolve complex referential expressions simultaneously.
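To make these categories concrete, here is an invented mini-dialogue (not drawn from the dataset) with each question annotated by the reference type it exercises:

```python
# Invented mini-dialogue (not from QuAC) illustrating the reference types above.
turns = [
    {"question": "What did she release in 1995?",         # "she": passage reference to the article's subject
     "answer": "her debut album"},
    {"question": "Was it commercially successful?",       # "it": dialogue reference to the album just mentioned
     "answer": "it reached number three on the charts"},
    {"question": "What else happened during that time?",  # abstract reference: the scope must be inferred
     "answer": "CANNOTANSWER"},                           # unanswerable if the passage says nothing more
]
```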
Performance Results
Models face substantial challenges on QuAC, with significant gaps between human and machine performance:

Performance summary:
- Human performance: 81.1% F1 score
- Best baseline: BiDAF++ with context achieves 60.2% F1
- Performance gap: 20+ point difference shows room for improvement
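The F1 here is the standard word-overlap F1 used for extractive QA. A minimal sketch of that metric follows; it omits QuAC's answer normalization and multi-reference handling.

```python
# Minimal word-overlap F1 between a predicted and a gold answer span
# (omits answer normalization and handling of multiple human references).
from collections import Counter

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the debut album in 1995", "her debut album"))  # -> 0.5
```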
Human Equivalence Metrics
QuAC introduces evaluation metrics beyond traditional F1 scores:
HEQ-Q (Human Equivalence Question-level): Percentage of questions where the model achieves human-level or better performance
HEQ-D (Human Equivalence Dialogue-level): Percentage of complete dialogues where the model matches human performance across all questions
Current results:
- Human baseline: 100% HEQ-Q, 100% HEQ-D (by definition)
- Best model: 55.1% HEQ-Q, 5.2% HEQ-D
These metrics show not just average performance but consistency across questions and conversations—important for practical dialogue systems.
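Given per-question F1 for a model and for the human reference, both HEQ metrics reduce to threshold checks. The sketch below assumes each dialogue is represented as a list of (model_f1, human_f1) pairs; that representation is mine, not part of the official evaluation script.

```python
# Sketch of HEQ-Q and HEQ-D. Assumes each dialogue is a list of
# (model_f1, human_f1) pairs, one per question (my representation,
# not the official evaluation script's).
def heq_scores(dialogues):
    total_questions = heq_questions = heq_dialogues = 0
    for dialogue in dialogues:
        whole_dialogue_ok = True
        for model_f1, human_f1 in dialogue:
            total_questions += 1
            if model_f1 >= human_f1:      # model matches or beats the human on this question
                heq_questions += 1
            else:
                whole_dialogue_ok = False
        heq_dialogues += whole_dialogue_ok
    return heq_questions / total_questions, heq_dialogues / len(dialogues)

print(heq_scores([[(0.9, 0.8), (0.4, 1.0)], [(1.0, 1.0)]]))  # -> roughly (0.667, 0.5)
```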
Research Impact
QuAC represents an important step in question answering research by introducing realistic conversational dynamics that existing datasets lack. The student-teacher framework captures natural information-seeking behavior while maintaining extractive evaluation for objective assessment.
Key contributions:
- Conversational realism: Context-dependent questions that mirror dialogue patterns
- Coreference complexity: Integration of challenging NLP problems into QA evaluation
- Evaluation metrics: HEQ scores that measure consistency alongside average performance
- Large-scale framework: Substantial dataset enabling robust model training and evaluation
The dataset’s leaderboard provides researchers with a challenging benchmark for developing conversational AI systems. As models improve on QuAC, we can expect progress in dialogue agents, virtual assistants, and educational AI systems that engage in more natural, context-aware conversations.
QuAC’s focus on dialogue context and reference resolution pushes the field toward AI systems that can engage in genuine conversation rather than simply retrieving isolated facts.
Citation
@inproceedings{choi-etal-2018-quac,
    title = "{Q}u{AC}: Question Answering in Context",
    author = "Choi, Eunsol and He, He and Iyyer, Mohit and Yatskar, Mark and Yih, Wen-tau and Choi, Yejin and Liang, Percy and Zettlemoyer, Luke",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D18-1241/",
    doi = "10.18653/v1/D18-1241",
    pages = "2174--2184"
}