QuAC
Basic Information
Full Name: Question Answering in Context
Domain: Natural Language Processing
Year: 2018
Dataset Scale
Conversations: 14,000+
Questions: 100,000+
Data Points: 100,000+ question-answer pairs
Research Context
Authors: Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, Luke Zettlemoyer
Institutions: University of Washington, University of Maryland, University of Massachusetts Amherst, Microsoft Research, Allen Institute for AI, Stanford University
Conference: EMNLP 2018
Technical Details
Format: JSON files with passages, dialogues, and extractive answers
Observables: Question answering in conversational context, coreference resolution, dialogue state tracking
Subsets: 1 subset
Publication: EMNLP 2018
Dataset: Official Website & Leaderboard

Dataset Summary

QuAC (Question Answering in Context) is a large-scale dataset for conversational question answering built around a student-teacher dialogue framework. Unlike traditional QA datasets that treat questions independently, QuAC requires models to understand dialogue context, resolve references across conversation turns, and handle the natural ambiguity of information-seeking conversations.

Key Features

  • Student-Teacher Framework: Asymmetric information access modeling real-world learning scenarios
  • Heavy Coreference Resolution: 86% of questions require resolving references across dialogue context
  • Context-Dependent Questions: Questions build upon previous exchanges and dialogue history
  • Extractive Answering: Answers extracted directly from Wikipedia passages for objective evaluation
  • Large Scale: 100,000+ questions across 14,000+ conversations on biographical content
  • Challenging Benchmark: 20+ F1 point gap between human and machine performance

Dataset Structure

QuAC conversations follow a structured student-teacher interaction:

  • Teacher: Has access to the complete Wikipedia biographical passage
  • Student: Sees only the title, introduction paragraph, and section heading
  • Dialogue: The student asks questions to learn about the hidden content while the teacher answers with spans extracted from the passage
  • Termination: The dialogue ends after 12 questions, manual termination, or two consecutive unanswerable questions
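As a minimal illustration of this structure, one turn and one dialogue could be modeled as below. Field names are assumptions made for the sketch, not the official release schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    """One student question and the teacher's extractive answer."""
    question: str          # student question, often with pronouns ("What did she do next?")
    answer_text: str       # span copied verbatim from the hidden Wikipedia passage
    answer_start: int      # character offset of the span in the passage
    unanswerable: bool     # True when the passage does not contain the answer

@dataclass
class Dialogue:
    title: str             # article title, visible to the student
    section: str           # section heading, visible to the student
    passage: str           # full section text, visible only to the teacher
    turns: List[Turn]

def is_terminated(dialogue: Dialogue, max_turns: int = 12) -> bool:
    """Check the termination conditions described above: 12 questions asked,
    or two consecutive unanswerable questions (manual termination is not
    modeled here)."""
    if len(dialogue.turns) >= max_turns:
        return True
    last_two = dialogue.turns[-2:]
    return len(last_two) == 2 and all(t.unanswerable for t in last_two)
```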

Content Selection

QuAC Content Distribution
  • Name: Wikipedia Biographical Articles
  • Description: Subjects with 100+ incoming links for quality control
  • Count: 100% of content

Question Analysis

QuAC Question Characteristics
  • Context-dependent questions (86%): Questions requiring coreference resolution
  • Non-factoid questions (54%): Questions beyond simple fact retrieval
  • Unanswerable questions: Realistic scenarios where the information isn't available in the passage
  • Extended answers: Longer responses than in other extractive datasets

Coreference Resolution Challenges

QuAC’s primary complexity comes from extensive coreference resolution requirements:

Reference Types

  • Passage References: Pronouns and references to entities mentioned in the source text
  • Dialogue References: References to previously discussed topics in the conversation
  • Abstract References: Challenging cases like “what else?” requiring scope inference

The prevalence of these reference types makes QuAC particularly challenging since coreference resolution remains an active research problem in NLP.

Use Cases

Primary Applications

  • Conversational AI systems requiring context understanding
  • Educational AI systems for information-seeking dialogues
  • Virtual assistants handling multi-turn questioning
  • Reading comprehension with dialogue state tracking

Research Applications

  • Coreference resolution in conversational contexts
  • Student-teacher interaction modeling
  • Context-dependent question answering
  • Dialogue state tracking and memory systems

Quality & Limitations

Strengths

  • Realistic Framework: Student-teacher setup mirrors real-world information-seeking scenarios
  • Challenging Benchmark: Substantial performance gap indicates research opportunities
  • Objective Evaluation: Extractive answers reduce annotation complexity while maintaining evaluation rigor
  • Context Complexity: Heavy coreference requirements test sophisticated NLP capabilities

Limitations

  • Domain Restriction: Limited to biographical content, which may not generalize to all conversational contexts
  • Extractive Constraint: Answers must come from passage text, limiting natural conversational responses
  • Performance Gap: Large human-machine gap (20+ F1) indicates dataset difficulty
  • English Only: Limited to English language conversations

Performance Benchmarks

QuAC establishes challenging benchmarks with novel evaluation metrics:

Traditional Metrics

  • Human Performance: 81.1% F1 score
  • Best Baseline (2018): BiDAF++ with context at 60.2% F1
  • Performance Gap: 20+ F1 point difference

Human Equivalence Metrics

  • HEQ-Q (Question-level): Percentage of questions achieving human-level performance
    • Human: 100%, Best Model: 55.1%
  • HEQ-D (Dialogue-level): Percentage of complete dialogues matching human performance
    • Human: 100%, Best Model: 5.2%

These metrics reveal that while models may perform reasonably on individual questions, maintaining consistency across entire conversations remains challenging.
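The sketch below illustrates how these metrics relate, assuming per-question system and human F1 scores are already available. The simplified token F1 and the field names are illustrative; the official evaluation script (which also normalizes text and handles multiple references) is the authoritative implementation.

```python
from collections import Counter
from typing import Dict, List

def token_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a predicted span and a reference span,
    in the spirit of SQuAD/QuAC-style evaluation (simplified)."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def heq_scores(dialogues: List[List[Dict]]) -> Dict[str, float]:
    """HEQ over dialogues, where each question dict carries a system F1 and a
    human F1 (field names are assumptions for this sketch).

    HEQ-Q: fraction of questions where the system F1 >= human F1.
    HEQ-D: fraction of dialogues where that holds for every question.
    """
    q_total = q_ok = d_ok = 0
    for dialogue in dialogues:
        all_ok = True
        for q in dialogue:
            q_total += 1
            if q["system_f1"] >= q["human_f1"]:
                q_ok += 1
            else:
                all_ok = False
        d_ok += all_ok
    return {"HEQ-Q": q_ok / q_total, "HEQ-D": d_ok / len(dialogues)}
```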

Getting Started

  1. Access: Download the dataset from the official QuAC website
  2. Format: JSON files containing Wikipedia passages, student-teacher dialogues, and extractive answers
  3. Evaluation: Use official evaluation scripts and HEQ metrics for comprehensive assessment
  4. Baseline Models: Start with reading comprehension models extended for dialogue context
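As a rough starting point, loading and iterating the dataset in Python might look like the sketch below. The field names follow the SQuAD-style layout the QuAC release uses, and the file name is an assumption; verify both against the files you actually download.

```python
import json

# Load one split of the dataset ("train_v0.2.json" is an assumed file name).
with open("train_v0.2.json") as f:
    quac = json.load(f)

for article in quac["data"]:
    for paragraph in article["paragraphs"]:
        passage = paragraph["context"]       # full section text (teacher-visible)
        for qa in paragraph["qas"]:          # questions in dialogue order
            question = qa["question"]
            answer = qa["answers"][0]        # one of the extractive reference answers
            start = answer["answer_start"]
            print(question, "->", passage[start:start + len(answer["text"])])
```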

Methodological Innovation

Student-Teacher Design

The asymmetric information design ensures:

  • Student questions naturally differ from passage content
  • Realistic information-seeking scenarios
  • Context-dependent questioning patterns
  • Challenging coreference resolution requirements

Data Collection Process

  • Platform: Amazon Mechanical Turk with careful quality control
  • Setup: Two-person dialogue with defined roles and access restrictions
  • Quality Assurance: Multiple termination conditions and extractive answer requirements

Related Datasets

QuAC complements other conversational QA datasets:

  • CoQA: Features abstractive answers and different domain coverage
  • SQuAD: Single-turn baseline for comparison
  • MS MARCO: Large-scale QA for general comparison

Research Impact

QuAC has advanced conversational AI research by:

  • Highlighting the importance of coreference resolution in dialogue
  • Establishing benchmarks for context-dependent question answering
  • Introducing novel evaluation metrics (HEQ) for conversational consistency
  • Demonstrating the complexity of natural information-seeking dialogues

The dataset continues to challenge researchers to develop systems capable of genuine conversational understanding rather than isolated fact retrieval.


Citation: Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W. T., Choi, Y., Liang, P., & Zettlemoyer, L. (2018). QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 2174-2184).