| QuAC | |
|---|---|
| Basic Information | |
| Full Name | Question Answering in Context |
| Domain | Natural Language Processing |
| Year | 2018 |
| Dataset Scale | |
| Conversations | 14,000+ |
| Questions | 100,000+ |
| Data Points | 100,000+ question-answer pairs |
| Research Context | |
| Authors | Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, Luke Zettlemoyer |
| Institutions | University of Washington, University of Maryland, University of Massachusetts Amherst, Microsoft Research, Allen Institute for AI, Stanford University |
| Conference | EMNLP 2018 |
| Technical Details | |
| Format | JSON with passages, dialogues, and extractive answers |
| Observables | Question answering in conversational context, coreference resolution, dialogue state tracking |
| Subsets | 1 subset |
| Publication | EMNLP 2018 |
| Dataset | Official Website & Leaderboard |
Dataset Summary
QuAC (Question Answering in Context) introduces a student-teacher framework for conversational question answering. Unlike traditional QA datasets that treat questions independently, QuAC requires models to understand dialogue context, resolve references across conversation turns, and handle the natural ambiguity of information-seeking conversations.
Key Features
- Student-Teacher Framework: Asymmetric information access modeling real-world learning scenarios
- Heavy Coreference Resolution: 86% of questions require resolving references across dialogue context
- Context-Dependent Questions: Questions build upon previous exchanges and dialogue history
- Extractive Answering: Answers extracted directly from Wikipedia passages for objective evaluation
- Large Scale: 100,000+ questions across 14,000+ conversations on biographical content
- Challenging Benchmark: 20+ F1 point gap between human and machine performance
Dataset Structure
QuAC conversations follow a structured student-teacher interaction (a schematic dialogue record is sketched after this list):
- Teacher: Has access to complete Wikipedia biographical passage
- Student: Sees only title, introduction paragraph, and section heading
- Dialogue: Student asks questions to learn about the content while teacher provides extractive answers
- Termination: A dialogue ends after 12 questions, when a participant manually ends it, or after two consecutive unanswerable questions
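To make this concrete, below is a minimal sketch of what a single dialogue record might look like, assuming a SQuAD-style JSON layout with per-turn extractive answer spans. The field names and values are illustrative, not the exact official schema.

```python
# Hypothetical QuAC-style dialogue record; field names and values are illustrative.
example_dialogue = {
    "title": "Example Person",            # article title (visible to the student)
    "section_title": "Early career",      # section heading (visible to the student)
    "background": "Introductory paragraph shown to the student ...",
    "context": "Full section text, visible only to the teacher ...",
    "qas": [
        {
            "question": "Where did she grow up?",   # student turn
            "answers": [{"text": "in a small town in Ohio", "answer_start": 57}],
            "followup": "y",                        # teacher dialogue act: encourage a follow-up
        },
        {
            "question": "What else happened there?",  # abstract reference ("what else?")
            "answers": [{"text": "CANNOTANSWER", "answer_start": -1}],  # unanswerable turn (marker illustrative)
        },
    ],
}
```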
Content Selection
| Name | Description | Count |
|---|---|---|
| Wikipedia biographical articles | Subjects with 100+ incoming links, for quality control | 100% of content |
Question Analysis
| Type | Percentage | Description |
|---|---|---|
| Context-dependent questions | 86% | Questions requiring coreference resolution |
| Non-factoid questions | 54% | Questions beyond simple fact retrieval |
| Unanswerable questions | | Realistic scenarios where the information isn't available |
| Extended answers | | Longer responses compared to extractive datasets |
Coreference Resolution Challenges
QuAC’s primary complexity comes from extensive coreference resolution requirements:
Reference Types
- Passage References: Pronouns and references to entities mentioned in the source text
- Dialogue References: References to previously discussed topics in the conversation
- Abstract References: Challenging cases like “what else?” requiring scope inference
The prevalence of these reference types makes QuAC particularly challenging since coreference resolution remains an active research problem in NLP.
Use Cases
Primary Applications
- Conversational AI systems requiring context understanding
- Educational AI systems for information-seeking dialogues
- Virtual assistants handling multi-turn questioning
- Reading comprehension with dialogue state tracking
Research Applications
- Coreference resolution in conversational contexts
- Student-teacher interaction modeling
- Context-dependent question answering
- Dialogue state tracking and memory systems
Quality & Limitations
Strengths
- Realistic Framework: Student-teacher setup mirrors real-world information-seeking scenarios
- Challenging Benchmark: Substantial performance gap indicates research opportunities
- Objective Evaluation: Extractive answers reduce annotation complexity while maintaining evaluation rigor
- Context Complexity: Heavy coreference requirements test sophisticated NLP capabilities
Limitations
- Domain Restriction: Limited to biographical content, which may not generalize to all conversational contexts
- Extractive Constraint: Answers must come from passage text, limiting natural conversational responses
- Performance Gap: Large human-machine gap (20+ F1) indicates dataset difficulty
- English Only: Limited to English language conversations
Performance Benchmarks
QuAC establishes challenging benchmarks with novel evaluation metrics:
Traditional Metrics
- Human Performance: 81.1% F1 score
- Best Baseline (2018): BiDAF++ with context at 60.2% F1
- Performance Gap: 20+ F1 point difference
Human Equivalence Metrics
- HEQ-Q (Question-level): Percentage of questions on which the model matches or exceeds human F1
  - Human: 100%, Best Model: 55.1%
- HEQ-D (Dialogue-level): Percentage of complete dialogues in which every question matches or exceeds human F1
  - Human: 100%, Best Model: 5.2%
These metrics reveal that while models may perform reasonably on individual questions, maintaining consistency across entire conversations remains challenging.
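As a rough illustration of how these metrics could be computed, here is a small sketch, assuming you already have word-overlap F1 scores for the model and for a human reference on each question. The helper name and data layout are assumptions, not the official evaluation script.

```python
from collections import defaultdict

def heq_metrics(per_question_scores):
    """Compute HEQ-Q and HEQ-D from per-question F1 scores.

    per_question_scores: list of dicts like
        {"dialogue_id": str, "model_f1": float, "human_f1": float}
    A question counts as "human-equivalent" when the model's F1
    matches or exceeds the human F1 on that question.
    """
    flags_by_dialogue = defaultdict(list)
    for q in per_question_scores:
        flags_by_dialogue[q["dialogue_id"]].append(q["model_f1"] >= q["human_f1"])

    all_flags = [f for flags in flags_by_dialogue.values() for f in flags]
    heq_q = 100.0 * sum(all_flags) / len(all_flags)            # question-level
    heq_d = 100.0 * sum(
        all(flags) for flags in flags_by_dialogue.values()
    ) / len(flags_by_dialogue)                                  # dialogue-level
    return heq_q, heq_d

# Example: two dialogues, three questions.
scores = [
    {"dialogue_id": "d1", "model_f1": 0.9, "human_f1": 0.8},
    {"dialogue_id": "d1", "model_f1": 0.4, "human_f1": 0.7},
    {"dialogue_id": "d2", "model_f1": 1.0, "human_f1": 1.0},
]
print(heq_metrics(scores))  # -> (66.67, 50.0)
```

Because HEQ-D requires every question in a dialogue to be human-equivalent, it is a much stricter test of conversational consistency than HEQ-Q.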
Getting Started
- Access: Download from QuAC official website
- Format: JSON files containing Wikipedia passages, student-teacher dialogues, and extractive answers (see the loading sketch after this list)
- Evaluation: Use official evaluation scripts and HEQ metrics for comprehensive assessment
- Baseline Models: Start with reading comprehension models extended for dialogue context
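A minimal loading sketch, assuming the downloaded file follows a SQuAD-style layout (data -> paragraphs -> qas) like the record sketched earlier; the file name and field names are assumptions, not the documented format.

```python
import json

# File name is an assumption; use whatever the official site provides.
with open("quac_train.json", encoding="utf-8") as f:
    dataset = json.load(f)

for article in dataset["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]            # teacher-visible section text
        for turn, qa in enumerate(paragraph["qas"]):
            question = qa["question"]
            answer = qa["answers"][0]["text"]     # first reference answer span
            print(f"Turn {turn}: {question!r} -> {answer!r}")
        break  # just the first paragraph, for a quick look
    break
```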
Methodological Innovation
Student-Teacher Design
The asymmetric information design ensures:
- Student questions naturally differ from passage content
- Realistic information-seeking scenarios
- Context-dependent questioning patterns
- Challenging coreference resolution requirements
Data Collection Process
- Platform: Amazon Mechanical Turk with careful quality control
- Setup: Two-person dialogue with defined roles and access restrictions
- Quality Assurance: Multiple termination conditions and extractive answer requirements
Related Work
QuAC complements other conversational and single-turn QA datasets:
- CoQA: Features abstractive answers and different domain coverage
- SQuAD: Single-turn baseline for comparison
- MS MARCO: Large-scale QA for general comparison
Research Impact
QuAC has advanced conversational AI research by:
- Highlighting the importance of coreference resolution in dialogue
- Establishing benchmarks for context-dependent question answering
- Introducing novel evaluation metrics (HEQ) for conversational consistency
- Demonstrating the complexity of natural information-seeking dialogues
The dataset continues to challenge researchers to develop systems capable of genuine conversational understanding rather than isolated fact retrieval.
Citation: Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W. T., Choi, Y., Liang, P., & Zettlemoyer, L. (2018). QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 2174-2184).